The Complete Magazine on Open Source


, / 124 0

In this month’s column, we continue our discussion on natural language processing, focusing on automatic key phrase extraction.

Over the past few months, we have been discussing natural language processing algorithms. Let us continue our discussion with the focus on automatic key phrase extraction. First, let us understand what exactly automatic key phrase extraction is and why it is important in the context of natural language processing.

Automatic key phrase extraction refers to the automatic identification and selection of important and topical phrases contained in the document being examined. In layman’s terms, key phrases can be considered as the phrases from the document, which represent the topic or subject matter that is being discussed in the document. This definition of automatic key phrase extraction is quite subjective in that we are now left with the question of, “What is important and topical for a given document?” Well, the answer would depend on the consumer of that information. Key phrases have multiple uses in information retrieval and text mining. They can be used to categorise documents and to automatically generate indexes of the documents, so that they can be searched by user queries. Consider a large digital library, where documents get added in huge numbers. In order to index/categorise them, automatic key phrase extraction can be used. Key phrase extractions can also be used for summarisation of documents and to answer questions.

As human beings, we unconsciously perform key phrase detection when we read through documents or listen to texts. We pick out the relevant key phrases, which convey the important information or facts, and ignore the rest, automatically, without any explicit effort on our part. However, doing this with a computer is not an easy process. Humans use their knowledge of the world and the context of the document to identify what information is ‘key’ and what is not. Computers cannot mimic such human-like heuristics, because many of these filters are learnt and are based on our knowledge of the world. So how do computers detect key phrases automatically?

Automatic key phrase extraction involves three main steps. The first step is the candidate generation phase, during which a set of phrases are selected as ‘candidate key phrases’ using certain heuristics or rules. The second step is to determine which of the candidates from Step 1 need to be retained as key phrases and which need to be pruned. Typically, supervised learning algorithms or unsupervised learning algorithms are employed in this step. The third step is to rank these key phrases. Some of the approaches omit the third step and do not provide a ranking of key phrases.

Certain factors play a major role in key phrase extraction. For example, the length of the document determines the number of candidate key phrases generated in Step 1. The longer the document, the more the number of candidate key phrases that get generated and this makes Step 2 much more compute intensive. Typically, emails, news articles and scientific abstracts are much shorter in length and, hence, the number of key phrases identified in Step 1 is much smaller for them than for long scientific articles or technical reports.

The structure of the document also plays a key role in key phrase extraction. Most documents have a certain structure or at least a ‘semi-structure’ which can be used in key phrase extraction. It is well known in language processing literature that the key ideas in a document are typically at the beginning and end of most documents. For instance, in structured documents such as scientific articles or technical reports, the ‘introduction’ and the ‘conclusion’ sections generate most, if not all, of the key phrases associated with that document. However, it is difficult to identify the structure in unstructured documents such as emails, blogs or meeting transcripts, which make them less amenable to easy detection of key phrases. Key phrases are typically closely related to the topic of the document. When there are multiple topics being discussed in the document, each topic contributes to key phrase candidate generation. This implies that using the structure of the document, such as the beginning and end of the document, as the major source of key phrases can result in missing key phrases arising from intermediate topic changes occurring in the document.

Another important factor that can be used in pruning key phrases in Step 3 is that key phrases are typically related to each other through a common topic. Here, again, this fact holds true for structured documents such as scientific articles. However, informal communications such as blogs, emails, etc, contain key phrases belonging to multiple topics and, hence, topic correlation is poor among key phrase candidates in such documents.

Step 2 of the key phrase extraction technique employs a learning algorithm to prune the candidates. Supervised learning algorithms for key phrase extraction consist of two major items, namely, task reformulation and feature identification. One approach is to formulate the problem as a binary classification, wherein the learning function is a binary classifier trained on both positive and negative key phrases. One of the disadvantages of using a binary classifier is that the algorithm does not rank the candidate key phrases. If there are key phrases that are more representative than others, the less relevant ones are not filtered by the binary classifier, which simply provides a ‘Yes’ or ‘No’ as an answer. This issue can be overcome by training a ranking classifier, which ranks the two key phrase candidates and ranks one more representative than the other.

Features that are used to represent a sample key phrase instance in supervised key phrase extraction typically include the length of the document, the document’s structure such as where in the document the phrase appears, ‘parts of speech’ tagging for the candidate phrase such as whether a word is a noun or verb, external features such as whether the candidate phrase appears in the title of a Wikipedia entry or in the query log of a search engine, etc.
Unsupervised approaches to key phrase candidate pruning include graph based approaches, wherein key phrase candidates are represented as nodes in a graph. An edge between two nodes connects two related candidates. This idea is very similar to the popular page ranking algorithm adopted in search engines, where a node is considered important if it is related to a large number of other nodes and is connected to candidates that themselves are important. Each edge is treated as a vote for a candidate key phrase, and the edge weights are used to label the importance of the edges connecting the nodes. The score of a node is the summation of the edge weights, and the score is used to rank the candidate key phrases. One of the popular unsupervised automatic key phrase extraction techniques based on graph based ranking is the TextRank algorithm described in the paper A Java implementation of the same is available in GitHub at

Another unsupervised learning method for key phrase extraction is using topic modelling of the document. Topics of the document are identified first and the key phrase candidates are grouped under the topics. Since key phrases are supposed to be representatives of the topics covered in the document, this can be modelled as a set-cover problem, where the key phrases are selected to cover all the topics of the document. Unlike the learning based approaches, language modelling based key phrase extraction techniques are becoming popular. We will discuss these in our next column.

Here is a takeaway exercise for our readers. We are all familiar with tag clouds which are built for Web pages. Given a Wikipedia article entry, can you come up with an algorithm to create a tag cloud for the same using any of the automatic key phrase extraction techniques we have discussed in this column? You can use any of the popular NLP toolkits such as NLTK for this exercise.

If you have any favourite programming questions or software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!