Content extraction from Web pages occurs in a variety of domains such as information retrieval, data mining, etc. This article introduces five power-packed Python libraries that make the process of content extraction from Web resources simple and quick.
The mammoth size of the World Wide Web and its explosive growth, both in terms of content and variety, pose important challenges to end users. Searching for a required piece of information in such large volumes is one of the key problems that the average Web user faces daily. We have moved from the era of information scarcity to the era of information overload. Almost any keyword fired into a search engine returns results ranging from a few hundred to many thousands, or even a few million. Fetching the data required to satisfy the user's information needs is a challenging research issue.
An important solution to this problem of information overload is to extract the contents from Web resources and present them in a form that is comparatively simpler for the user to understand. This article presents Python based solutions that handle Web content extraction. There is an array of libraries available in Python for content extraction, as illustrated in Figure 1.
The focus of this article is to introduce the following five power-packed Python libraries, each of which has some unique features:
- html2text
- Lassie
- Newspaper
- Python-Goose
- Textract
html2text
The major task performed by the html2text library is the conversion of HTML pages into an easy-to-read plain text format. This library was initiated by Aaron Swartz. The installation of html2text can be done easily, with the following command:
pip install html2text
After successfully completing the installation process, the html2text library can be used either from the terminal or within a Python program. The terminal usage of html2text is shown in the following code:
html2text [(filename|url) [encoding]]
For example, the following command will fetch the text version of the Open Source For You website's home page:
$ html2text https://www.opensourceforu.com
This can be achieved through a Python program also, as shown below:
import html2text
import requests

# Make the proxy settings, if your access is through a proxy server
proxies = {"http": "10.10.80.11:3128", "https": "10.10.80.11:3128"}

url = "https://www.opensourceforu.com"
res = requests.get(url, proxies=proxies, timeout=5.0)
html = res.text

# Extract the text from the HTML version
text = html2text.html2text(html)
print(text)
Lassie
The Lassie library enables the retrieval of component information from Web pages. For example, the key words, images, videos and description can be fetched with the Lassie library. The installation of Lassie can be carried out with the following commands:
$ pip install lassie
or,
$ easy_install lassie
The information from the Web page is gathered with the lassie.fetch() function as illustrated below:
import lassie
import pprint

url = "https://www.opensourceforu.com/2016/04/ubuntu-16-04-lts-xenial-xerus-official/"
pprint.pprint(lassie.fetch(url))
The execution of the above script results in the following output:
{'description': u'Ubuntu 16.04 LTS aka Xenial Xerus comes with new features like snap package and LXD container to deliver you an upgraded experience. Being as an LTS platform, the new version additionally comes with five-year support.',
 'images': [{'src': u'https://www.opensourceforu.com/wp-content/uploads/2016/04/Ubuntu-16-04-lts-150x150.jpg',
             'type': u'og:image'},
            {'src': u'https://www.opensourceforu.com/favicon1.ico',
             'type': u'favicon'}],
 'keywords': [u'Ubuntu', u'Ubuntu 16.04 LTS', u'Ubuntu 16.04', u'Ubuntu Xenial Xerus', u'Xenial Xerus', u'Canonical'],
 'locale': u'en_US',
 'title': u'Ubuntu 16.04 LTS Xenial Xerus goes official - Open Source For You',
 'url': u'https://www.opensourceforu.com/2016/04/ubuntu-16-04-lts-xenial-xerus-official/',
 'videos': []}
Lassie facilitates the fetching of all the images from the Web page when the all_images option is set to True. The following code will provide information regarding all the images present in the page:
pprint.pprint (lassie.fetch(url, all_images=True))
Newspaper
The Newspaper library, developed and maintained by Lucas Ou-Yang, is specially designed for extracting information from the websites of newspapers and magazines. The objective of this library is to extract and curate the articles from the newspapers and similar websites. To install Newspaper, use the following commands.
For lxml dependencies, type:
$ sudo apt-get install libxml2-dev libxslt-dev
PIL is needed for recognising images:
$ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev
For Newspaper library installation, type:
$ pip3 install newspaper3k
To download the NLP corpora, type:
$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
The Newspaper library is used to gather information associated with articles. This includes author names, publication dates, major images in the article, videos present in the article, key words describing the article and the summary of the article. A sample program with an article from the Open Source For You magazine is given below:
from newspaper import Article

url = "https://www.opensourceforu.com/2016/02/ionic-a-ui-framework-to-simplify-hybrid-mobile-app-development/"
article = Article(url)

# 1. Download the article
article.download()

# 2. Parse the article
article.parse()

# 3. Fetch author name(s)
print(article.authors)

# 4. Fetch publication date
print("Article Publication Date:")
print(article.publish_date)

# 5. The URL of the major image
print("Major Image in the article:")
print(article.top_image)

# 6. Natural language processing on the article to fetch keywords
article.nlp()
print("Keywords in the article")
print(article.keywords)

# 7. Generate a summary of the article
print("Article Summary")
print(article.summary)
The output of the program is as follows:
[u'Article Written', u'K S Kuppusamy']
Article Publication Date:
2016-02-14 12:25:28+00:00
Major Image in the article:
https://www.opensourceforu.com/wp-content/uploads/2016/01/ionic-html-and-Angular-js-visual2.jpg
Keywords in the article
[u'development', u'smartphone', u'simplify', u'mobile', u'apps', u'app', u'hybrid', u'framework', u'ui', u'browser', u'ionic', u'native']
Article Summary
To assist developers in increasing the speed of app development, there are various app development frameworks. Native app testing (device): Another method to test the app is as the native app in the device. Ionic is a promising open source mobile UI framework that makes the hybrid app development process swifter and smoother. This article presents Ionic, which is a framework for hybrid mobile app development. So native app development requires specialized development teams.
The major features of the Newspaper library are its multi-threaded article-downloading framework, summary extraction and support for around 20 languages.
Python-Goose
Goose is a popular library for article extraction, which was originally developed for the Java ecosystem. Python-Goose is a rewrite of Goose in Python. The primary objective of Goose is to extract information from news articles. It can be installed with the following commands:
$ mkvirtualenv --no-site-packages goose
$ git clone https://github.com/grangier/python-goose.git
$ cd python-goose
$ pip install -r requirements.txt
$ python setup.py install
The Python version of the Goose library was developed by Xavier Grangier. The features of Python-Goose are as illustrated in Figure 2.
The Newspaper library, which was covered earlier, uses the Goose library.
from goose import Goose

url = "https://www.opensourceforu.com/2016/02/ionic-a-ui-framework-to-simplify-hybrid-mobile-app-development/"
g = Goose()
article = g.extract(url=url)

print("Title:")
print(article.title)
print("Meta Description")
print(article.meta_description)
print("First n characters of the cleaned text")
print(article.cleaned_text[:20])
print("URL of Main image")
print(article.top_image.src)
The output is:

Title:
Ionic: A UI Framework to Simplify Hybrid Mobile App Development
Meta Description
Ionic makes the task of creating hybrid apps very easy, and these can be ported to various platforms.
First n characters of the cleaned text
This article present
URL of Main image
https://www.opensourceforu.com/wp-content/uploads/2016/01/Figur-1-350x390.jpg
Textract
The Web includes resources apart from HTML pages. Textract is a library to extract data from those resource file formats. In simple terms, it can be described as a library to extract text from any type of file, from resources such as Word documents, PowerPoint presentations, PDFs, etc. Textract attempts to extract text from gif, jpg, mp3, ogg, tiff, xls, etc, and has various dependencies to handle these file formats. To install it, use the following commands:
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox
pip install textract
The extraction of text is carried out with the textract.process() function. The text extraction from the image shown in Figure 3 can be carried out with the following simple program:
import textract

text = textract.process("oa.jpg")
print(text)

The output is:

OPEN aACCESS
The Textract extraction process depends upon the quality of the source document. Though it is not guaranteed to extract 100 per cent correct text from all the sources, it provides an insight into the type of content residing in the resource, which is definitely key information to have in many scenarios.
To summarise, this article aims to be an eye-opener to the domain of content extraction from the Web. Though the above-mentioned libraries provide powerful functions for content extraction, they still pose research-level challenges. Interested developers could take these up and provide better solutions.