The Complete Magazine on Open Source

Code Sport


In this month’s column, we discuss information extraction.

In the last couple of columns, we have been discussing computer science interview questions. This month, let’s return to the subject of natural language processing. In particular, let’s focus on information extraction.
Given a piece of text, the goal is to extract the information contained in it, using natural language processing techniques. We know that vast amounts of information are present on the World Wide Web. However, much of this information lies hidden in the form of unstructured and semi-structured text. Today, much of this information is presented to us through search engine results, as Web pages, and we have to spend considerable time reading through the individual Web pages to find the specific information we are interested in.

For example, consider the following query presented to a Web search engine: “Who were the presidents of the United States that were assassinated?” Google returns a boxed answer naming the four US presidents who were assassinated while in office. Questions of the type ‘What, where, when, who…’ are known as ‘Wh*’ factual questions. Search engines use knowledge bases such as Freebase, Wikipedia, etc., to find answers to ‘Wh*’ questions.
On the other hand, let us try another example query: “Why did Al Gore concede the election?” This time, Google returns thousands of search engine result pages (SERPs) and does not point us to an exact answer. Perhaps this was a tough question; so let us try another: “Why did Al Pacino refuse his Oscar award?” Again, Google is stumped when it comes to providing a succinct answer; instead, it showers us with thousands of SERPs and makes us hunt for the answer ourselves. These two examples may make us think that search engines are very good at answering questions of the ‘What, Who, Where, When’ category, but not so good when it comes to answering ‘Why’ questions. But that is not actually true. As a final example, consider the question: “Why is the sky blue?” Google returns the exact answer in a box right on top of the search results. So not all ‘Why’ questions are difficult.

Google could easily answer the last question probably because the information needed to answer it was extracted from the relevant Web pages with a confidence high enough for the search engine to say that this is probably the right answer. On the other hand, for questions such as “Why did Al Gore concede the election?”, it could not extract the relevant information with sufficient confidence to provide a succinct answer. However, if we peruse the search engine results, the first link, namely the Wikipedia article on Al Gore’s presidential campaign, does indeed contain the answer to the question: “Gore strongly disagreed with the Court’s decision, but said: ‘For the sake of our unity as a people and the strength of our democracy, I offer my concession.’” This information could not be effectively extracted by the search engine to offer the exact answer, though it was present in the very first search engine result. Hence, the question we want to consider in today’s column is: what makes information extraction difficult, and how can we effectively extract the information we need to answer our queries?
In earlier columns, we discussed closed and open information extraction techniques. Basically, given a set of text documents, information extraction systems are intended to analyse this unstructured text and extract information in a structured form. The structure supporting the informational needs of the particular domain is the domain ontology. In layman’s terms, we can think of the ontology as the structure or schema of the structured database. The ontology defines the entity types and their relationship types, and in certain cases, also the relevant entities of interest. For instance, a medical ontology could include DISEASE, SYMPTOM and DRUG as entity types; and ‘DISEASE causes SYMPTOM’, ‘DRUG treats DISEASE’, etc., as relationship types. It can optionally also include concrete entities of interest such as diabetes, asthma and heart disease.
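To make this concrete, the toy medical ontology described above can be sketched as a simple schema in Python. The names and structure here are purely illustrative, not drawn from any real ontology standard:

```python
# A toy rendering of the medical ontology above: entity types,
# relation types and optional seed entities of interest.
ONTOLOGY = {
    "entity_types": ["DISEASE", "SYMPTOM", "DRUG"],
    "relation_types": [
        ("DISEASE", "causes", "SYMPTOM"),
        ("DRUG", "treats", "DISEASE"),
    ],
    "seed_entities": {"DISEASE": ["diabetes", "asthma", "heart disease"]},
}

def valid_relation(subj_type, rel, obj_type):
    """A relation instance is well-formed only if its argument types
    match one of the declared relation types."""
    return (subj_type, rel, obj_type) in ONTOLOGY["relation_types"]

print(valid_relation("DRUG", "treats", "DISEASE"))   # True
print(valid_relation("DRUG", "causes", "SYMPTOM"))   # False
```

An extraction pipeline can use such a schema to reject candidate facts whose argument types do not match any declared relation.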

Let us assume that you are given the ontology which describes, in a structured form, the entity types and the relationships among them. You are also given the text corpus which contains all the documents of interest. This could be the entire World Wide Web (WWW) or a small complete set of documents, such as an enterprise document store. For our example, let us consider that you are given a medical ontology of the type we discussed earlier, and that the document corpus is the entire WWW. Given that we know the entity types and relation types in the ontology, we first need to identify the entity instances of the relevant types. For our example, we need to find the DISEASE, DRUG and SYMPTOM entities mentioned in the corpus. Standard NLP techniques such as Named Entity Recognition (NER) can be adapted to identify entity mentions, and the identified mentions are then analysed to determine which entity type they belong to.
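As a rough illustration of the identification and typing steps, here is a minimal dictionary (gazetteer) based recogniser. Real NER systems use statistical sequence models rather than plain lookup, and the word lists below are invented for the example:

```python
# A toy gazetteer mapping entity types to known surface forms.
GAZETTEER = {
    "DISEASE": ["diabetes mellitus", "diabetes", "asthma"],
    "DRUG": ["metformin", "insulin"],
    "SYMPTOM": ["fatigue", "blurred vision"],
}

def find_entities(text):
    """Return (mention, entity_type) pairs found in the text
    by simple case-insensitive dictionary lookup."""
    text_lower = text.lower()
    found = []
    for etype, names in GAZETTEER.items():
        for name in names:
            if name in text_lower:
                found.append((name, etype))
    return found

sentence = ("Metformin is the first line drug of choice "
            "in the treatment of diabetes.")
print(find_entities(sentence))
```

Running this on the example sentence identifies ‘metformin’ as a DRUG mention and ‘diabetes’ as a DISEASE mention; a real system would of course also have to handle mentions not present in any dictionary.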
Entity typing can be done by means of (a) supervised, (b) weakly supervised, and (c) distantly supervised techniques. Standard supervised techniques use annotated data: entity typing is cast as a multi-label classification problem in which each entity is classified into the right entity type label, using the labelled data to learn the prediction function. Weakly supervised techniques use pattern-based approaches to identify the entity types. Typically, a few seed instances and seed patterns are provided to bootstrap the extraction process. For instance, in a sentence such as “Metformin is the first line drug of choice in the treatment of diabetes,” ‘metformin’ is marked as the seed drug instance, and the seed pattern extracted is ‘drug … treatment of …’. This pattern can then be used as an extractor pattern on the remaining corpus to extract other, similar drug and disease mentions.
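The pattern-based bootstrapping step can be sketched as follows. The seed pattern here is hand-written from the example sentence; a real weakly supervised system would induce such patterns automatically from the seed instances, and the second corpus sentence is invented for illustration:

```python
import re

# A seed pattern derived from the example sentence; the named groups
# capture the drug and disease slots.
SEED_PATTERN = re.compile(
    r"(?P<drug>\w+) is the first line drug of choice "
    r"in the treatment of (?P<disease>\w+)")

corpus = [
    "Metformin is the first line drug of choice in the treatment of diabetes.",
    "Salbutamol is the first line drug of choice in the treatment of asthma.",
]

# Apply the extractor pattern to the corpus to harvest new pairs.
pairs = []
for sentence in corpus:
    m = SEED_PATTERN.search(sentence)
    if m:
        pairs.append((m.group("drug"), m.group("disease")))

print(pairs)  # [('Metformin', 'diabetes'), ('Salbutamol', 'asthma')]
```

Newly extracted pairs can in turn seed further patterns, which is what makes the process a bootstrap.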

Weakly supervised techniques require that seed instances should be specified for each of the different entity and relation types that need to be extracted. Distantly supervised techniques do not need explicit seed instances and seed patterns. Instead, they use available knowledge bases such as Freebase and Wikipedia to obtain the seed instances automatically.
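The distant supervision idea can be sketched like this. The toy dictionary below stands in for a real knowledge base such as Freebase or Wikipedia, and the heuristic shown (treat any sentence mentioning both members of a known pair as a positive example) is the classic, deliberately noisy one:

```python
# Known (drug, disease) pairs obtained from a knowledge base
# (illustrative stand-in for Freebase/Wikipedia).
KB_TREATS = {("metformin", "diabetes"), ("salbutamol", "asthma")}

def distant_labels(sentence):
    """Return the KB pairs co-occurring in the sentence; each such
    sentence becomes a noisy training example for 'treats'."""
    s = sentence.lower()
    return [(drug, disease) for drug, disease in KB_TREATS
            if drug in s and disease in s]

sent = "Doctors often prescribe metformin to patients with diabetes."
print(distant_labels(sent))  # [('metformin', 'diabetes')]
```

The labels are noisy because a sentence can mention a drug and a disease without asserting a treatment relation; distantly supervised systems must learn in spite of this noise.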

While entity typing is a hard problem, there is an associated challenge that is not often discussed. Consider our problem of extracting diseases. Let us assume that we have the following two sentences:
1. Diabetes has multiple causes, the primary one being that blood sugar levels are not being controlled adequately.

2. Diabetes mellitus is a serious disease affecting millions and Metformin is the first line drug of choice against it.
Assume that we have already identified the entity mentions ‘diabetes’ and ‘diabetes mellitus’ from the above two sentences. We have also been able to determine that ‘diabetes’ and ‘diabetes mellitus’ are of entity type ‘DISEASE’ and ‘Metformin’ is of entity type ‘DRUG’. Now the main question that confronts us is the following: are the two entity mentions ‘diabetes’ and ‘diabetes mellitus’ referring to the same entity instance or to different ones? And if they refer to the same entity instance, which knowledge base entity reference should be used to refer to both? This is the problem of canonicalisation of knowledge bases, wherein the entity mentions in the unstructured text need to be mapped to a unique structured database entity.

Before we discuss how to address knowledge base canonicalisation, you may be wondering why these two entity mentions need to be canonicalised at all. Why can’t we create two rows in the knowledge base table, one for ‘diabetes’ and one for ‘diabetes mellitus’, such that the facts present in sentences (1) and (2) above are stored in those two rows independently? The answer is that canonicalisation of knowledge bases allows us to unify the facts we have extracted, so that we can present a more complete answer to a query. For instance, if our knowledge base receives the query, “What is the primary cause of diabetes mellitus?”, we need to provide the answer, “Inadequate blood sugar control.” However, unless we unify the two entity mentions ‘diabetes’ and ‘diabetes mellitus’ under a single database instance ID, we cannot combine the two facts and provide the right answer. Therefore, knowledge base canonicalisation is needed to avoid storing ambiguous and redundant facts, and to provide complete answers to user queries.
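The benefit of unification is easy to see in code. In the sketch below, the alias table and the entity ID ‘E1’ are purely illustrative, standing in for the output of a canonicalisation step; once both mentions map to the same ID, the two facts land in one row and can be answered together:

```python
# Hypothetical alias table produced by a canonicalisation step.
CANONICAL = {"diabetes": "E1", "diabetes mellitus": "E1"}

# Facts extracted from sentences (1) and (2) under different
# surface forms of the same disease.
facts = [
    ("diabetes", "primary_cause", "inadequate blood sugar control"),
    ("diabetes mellitus", "first_line_drug", "metformin"),
]

# Store every fact under its canonical entity ID.
kb = {}
for subj, relation, obj in facts:
    kb.setdefault(CANONICAL[subj], {})[relation] = obj

# Both facts are now retrievable for a query about 'diabetes mellitus'.
print(kb[CANONICAL["diabetes mellitus"]])
```

Without the alias table, the query for ‘diabetes mellitus’ would only see the second fact and miss the cause stated under ‘diabetes’.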

As you may have realised, one of the simplest approaches to canonicalising these two entity mentions is to cluster similar words and phrases. Clustering using word similarity would result in ‘diabetes’ and ‘diabetes mellitus’ being assigned to the same cluster; then, by choosing a unique cluster representative to stand for all the cluster members, canonicalisation can be achieved. But this does not always work. For instance, consider the two entity mentions ‘Type 1 diabetes’ and ‘Type 2 diabetes’. Clustering based on similarity would end up merging these two instances and assigning them a single unique KB entity ID. However, these are two distinct diseases, with different drugs and treatment plans. By forcing these two entities into the same knowledge base instance ID, we actually lose precision when answering a user query of the form, “What drugs are used to treat Type 1 diabetes?” Here is a takeaway question for our readers: in the context of this over-clustering issue, how would you come up with an efficient technique for knowledge base canonicalisation of entity mentions? We will discuss this further in our next column.
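You can see the over-clustering dilemma directly in a toy experiment. The sketch below clusters mentions greedily by plain string similarity (Python's difflib ratio, used here only as a convenient stand-in for a real similarity measure): a loose threshold merges everything into one cluster, while a stricter one keeps ‘diabetes’ and ‘diabetes mellitus’ apart yet still treats ‘type 1 diabetes’ and ‘type 2 diabetes’ as near-identical.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold):
    """True if the string similarity ratio meets the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def cluster(mentions, threshold):
    """Greedy single-pass clustering: each mention joins the first
    cluster whose representative (first member) is similar enough."""
    clusters = []
    for m in mentions:
        for c in clusters:
            if similar(m, c[0], threshold):
                c.append(m)
                break
        else:
            clusters.append([m])
    return clusters

mentions = ["diabetes", "diabetes mellitus",
            "type 1 diabetes", "type 2 diabetes"]

# Loose threshold: the genuine aliases merge, but so does everything
# else -- all four mentions collapse into one cluster.
print(cluster(mentions, threshold=0.6))

# Stricter threshold: the aliases stay apart, yet 'type 1 diabetes'
# and 'type 2 diabetes' still get wrongly merged.
print(cluster(mentions, threshold=0.8))
```

No single similarity threshold resolves both cases, which is exactly why surface similarity alone is insufficient for canonicalisation.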
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!