The Complete Magazine on Open Source



In this month’s column, we feature a set of computer science interview questions.

Over the last couple of months, we have been discussing a set of questions on machine learning and data management. This month, we return to our discussion of topics in natural language processing (NLP). We focus on the important problem in NLP known as parts of speech (POS) tagging, covering why it is important and some of the techniques used in this area.

Parts of speech tags are used to tag words in a sentence with their syntactic category, such as whether a word is a noun, verb, adjective or adverb. For instance, given the sentence, ‘The dog barks,’ its POS tags would be as follows: ‘The (determiner) dog (noun) barks (verb).’ The tags in brackets indicate the parts of speech associated with the word in the sentence. Some readers may wonder why one needs to do POS tagging. Isn’t the POS tag of a word known automatically? Well, the answer is not so straightforward.
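The tagged sentence above can be represented programmatically as a list of (word, tag) pairs. The tag labels below (DT for determiner, NN for noun, VBZ for third-person singular verb) follow common conventions and are used here purely for illustration:

```python
# A tagged sentence as a list of (word, tag) pairs.
# DT = determiner, NN = noun, VBZ = verb (3rd person singular).
tagged = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")]

# The plain word sequence and the tag sequence can be recovered separately.
words = [w for w, t in tagged]
tags = [t for w, t in tagged]
```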

Most of the words in the English language (and in the case of most other languages as well) can take different tags based on the context in which they are used. For example, consider the two sentences:
(1) Please give me the book.
(2) Please book the evening flight to Delhi.
In Sentence 1, the word ‘book’ is associated with the noun POS tag. In Sentence 2, the same word is associated with the verb POS tag. Certain word classes, such as prepositions in English, are easier to tag in a sentence, since each member is associated with an unambiguous POS tag. On the other hand, most other words have an ambiguous association with POS tags, since they can take different tags based on the context. POS tags can be divided into two broad categories, namely, Closed Class Tags and Open Class Tags. In the first category, member words are typically fixed and form a closed set. Open Class Tags are those whose membership can increase over time due to the addition of new words. For example, a preposition is a Closed Class Tag since, typically, in English, new prepositions don’t get added. On the other hand, consider tag classes such as nouns and verbs. In English, new nouns can get added due to new entities such as ‘Google’ or ‘iPhone’, and new verbs, such as ‘fax’, can get added as well. Hence, verbs and nouns are Open Class Tags.
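The ‘book’ example suggests how context can resolve the ambiguity. The sketch below is a hypothetical, toy context rule (not a real tagger): if the word immediately before ‘book’ is a determiner, tag it as a noun, otherwise as a verb:

```python
# Toy disambiguation rule for the word 'book', for illustration only:
# a determiner before 'book' ("the book") signals a noun usage;
# anything else ("Please book ...") is treated as a verb usage here.
def tag_book(prev_word):
    determiners = {"the", "a", "an", "this", "that"}
    return "noun" if prev_word.lower() in determiners else "verb"
```

Real taggers generalise this idea, using the surrounding context of every word rather than a hand-written rule for a single word.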

Apart from POS tag classes such as nouns, verbs and prepositions, how many POS tag classes are typically considered for POS tagging of English sentences? While there are different tag sets available with differing numbers of tags, a popular one in the NLP community is the Penn Treebank tag set of 45 tags, available at:

To those still wondering why POS tagging is needed for text, it is a necessary pre-processing step in a number of related problems, such as information retrieval (IR), text-to-speech processing and word sense disambiguation. POS tagging is also useful for syntactic parsing, since the tag associated with a word can be used to infer the syntactic classes of surrounding words. For instance, nouns are preceded by determiners and adjectives, verbs are typically followed by nouns, and so on. There has been considerable prior work on using POS tagging to improve IR performance by cutting down the terms indexed, based on their POS tags. For instance, since indexing only the nouns in a document generally provides a good idea of what the document is about, it is possible to create a pruned index set using POS tags. Similarly, POS tags can be used to improve text-to-speech systems, in deciding how a word should be spoken with different intonations. For instance, consider the following two sentences:
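The pruned-index idea can be sketched in a few lines. Assuming a document has already been tagged with Penn-Treebank-style labels (where noun tags start with ‘NN’), we keep only the noun terms for indexing; the document and tags below are illustrative:

```python
def noun_index(tagged_doc):
    """Build a pruned index: keep only terms whose tag starts with 'NN'
    (the noun tags in Penn-Treebank-style tag sets)."""
    return sorted({word.lower() for word, tag in tagged_doc
                   if tag.startswith("NN")})

# A hypothetical pre-tagged document.
doc = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"),
       ("at", "IN"), ("strangers", "NNS")]
```

Indexing this document would then store only ‘dog’ and ‘strangers’, which together give a reasonable idea of what it is about.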
(1) I object to your insolent tone.
(2) What is the object of your shipping expedition?

In Sentence 1, ‘object’ is tagged with the verb POS tag and would be spoken with a different intonation from its usage in Sentence 2, where it is used as a noun.
POS tagging is also useful in the important but challenging problem of word sense disambiguation. Polysemy, which is common in many languages, means that a single word can have several different meanings associated with it. POS tagging of a sentence can aid in word sense disambiguation. Consider the following two sentences:
(1) I bank with HSBC.
(2) I sat on the river bank.

We can use the POS tags to disambiguate between the different meanings associated with the word ‘bank’.
Given the usefulness of POS tagging, let us now discuss the possible approaches to it. First, let us define the problem of POS tagging formally. Given a sentence S consisting of words {w1, w2, w3, … wn} and a tag set T = {t1, t2, … tm}, the task is to identify the best possible tag sequence that can be assigned to the sentence S.

Just as with various other tasks in NLP, POS tagging can be performed using both supervised and unsupervised learning techniques. Supervised POS tagging requires that you are given a text corpus C and a tag set T. The corpus C consists of a set of text documents, and each document is a collection of sentences. Each sentence S is a collection of words. Each word in S is tagged/labelled with the appropriate tag set element from T, resulting in a tag sequence <t1, t2, … tn> being associated with the sentence S consisting of <w1, w2, … wn>. Hence, we are given a set of <sentence, tag sequence corresponding to the sentence> pairs as training data. The task is to learn a function that can predict/assign the tag sequence corresponding to an input test sentence, given the labelled training data.
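A minimal supervised sketch (a baseline for illustration, not a full tagger) learns, for each word, its most frequent tag in the labelled corpus, and assigns that tag at test time; unknown words fall back to a default tag. The small training corpus and tags below are made up:

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Learn each word's most frequent tag from labelled sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_with_model(model, words, default="NN"):
    """Tag a test sentence; unseen words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

# A tiny hypothetical labelled corpus.
training = [
    [("Please", "UH"), ("book", "VB"), ("the", "DT"), ("flight", "NN")],
    [("Give", "VB"), ("me", "PRP"), ("the", "DT"), ("book", "NN")],
    [("The", "DT"), ("book", "NN"), ("fell", "VBD")],
]
model = train_most_frequent_tag(training)
```

This baseline ignores context entirely, so it always gives ‘book’ the same tag; context-sensitive approaches, discussed next, address exactly this weakness.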

While supervised POS tagging approaches perform well in terms of accuracy, the disadvantage is that they need the corpus to be labelled with the tags. Unsupervised POS tagging is based on finding equivalent word classes; however, without any labelled data, it cannot assign human-readable tags to words. Hence, a small amount of manually labelled data can be used with unsupervised POS taggers to induce a mapping between human-readable tags and the word equivalence classes induced by the tagger. We will discuss unsupervised POS tagging in detail in the next column.

Another way of categorising POS tagging approaches is as either rule-based POS tagging techniques or stochastic POS tagging techniques. Rules can either be provided manually by linguistic experts, or they can be induced automatically by learning from large tagged text corpora. The rules can then be used to tag an untagged sentence. Stochastic POS tagging techniques are based on finding the probabilities of the tag sequences associated with a given sentence. Assume that you have a black box, which can assign a probability ‘p[i]’ to each possible tag sequence ‘i’ for a sentence S. Given these probabilities, the stochastic approach chooses the tag sequence whose ‘p[i]’ is the maximum among all possible tag sequences.
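The black-box view can be sketched directly: enumerate every possible tag sequence and pick the one the scorer likes best. The scoring function below is a made-up illustration standing in for the black box, not a trained model:

```python
from itertools import product

def best_tag_sequence(words, tag_set, p):
    """Brute force: score every possible tag sequence with the
    black-box scorer p and return the highest-scoring one."""
    return max(product(tag_set, repeat=len(words)), key=p)

def toy_score(seq):
    # Hypothetical scorer that happens to prefer a
    # determiner-noun-verb pattern; a real p would come from a model.
    return 1.0 if seq == ("DT", "NN", "VBZ") else 0.01
```

Note that `product` generates every one of the exponentially many sequences; the next paragraphs look at exactly how many that is.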

Now let us consider a sentence S whose word sequence is <w1, w2, … wN>. Each word can be tagged with one of the possible tags from the tag set T. Let us assume that the tag set T contains M tags. Hence, each word can be assigned one of M possible tags. Can you calculate how many tag sequences are possible for the given sentence S?

The number of possible tag sequences associated with S becomes very large and is equal to M^N (M raised to the power N). Hence, brute force search among all possible tag sequences is not efficient. Even if the probabilities are known for each possible tag sequence associated with S, it is computationally expensive to find that particular tag sequence whose probability is the highest.
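The counting argument is simple to verify: each of the N word positions independently takes one of M tags, giving M^N sequences in total.

```python
def num_tag_sequences(n_words, n_tags):
    # Each of the N word positions independently takes one of M tags,
    # so the number of complete tag sequences is M ** N.
    return n_tags ** n_words
```

Even a modest 10-word sentence with a 45-tag set already yields 45^10 (over 34 quadrillion) candidate sequences, which is why brute force is hopeless.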

Now can you think of ways by which we can avoid this brute force search? The question to our readers is this: can you think of any solution from your algorithms course that you have used in similar problems? For instance, consider the example of finding the shortest path from a source vertex to a destination vertex in a graph. Since there are exponentially many possible paths, a brute force listing of each path and its associated cost is computationally expensive. In such cases, can you recall what kind of algorithmic techniques you used to overcome this issue? A clue I would like to give you is to think about the dynamic programming approach. We will continue the discussion of POS tagging approaches in our next column.
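To make the shortest-path clue concrete, here is a minimal dynamic-programming sketch for a layered (stage-by-stage) graph: the cheapest path to any node in a stage depends only on the cheapest paths to the previous stage, so we never enumerate all exponentially many complete paths. The graph and costs below are a hypothetical example:

```python
def shortest_layered_path(layers, cost):
    """Dynamic programming over a layered graph: 'best' holds the
    cheapest cost to reach each node of the current stage."""
    best = {node: 0.0 for node in layers[0]}
    for prev_layer, layer in zip(layers, layers[1:]):
        best = {v: min(best[u] + cost(u, v) for u in prev_layer)
                for v in layer}
    return min(best.values())

# Hypothetical three-stage graph: s -> {a, b} -> t.
edges = {("s", "a"): 1, ("s", "b"): 4, ("a", "t"): 5, ("b", "t"): 1}
layers = [["s"], ["a", "b"], ["t"]]
```

The same stage-by-stage idea carries over to tagging if you think of each word position as a stage and each candidate tag as a node in that stage.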

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!