The Complete Magazine on Open Source



In this month’s column, we continue our discussion on natural language processing.

For the past few months, we have been discussing information retrieval and natural language processing, as well as the algorithms associated with them. This month, we continue our discussion on natural language processing (NLP) and look at how NLP can be applied in the field of software engineering. Given one or more text documents, NLP techniques can be applied to extract information from them. The software engineering (SE) lifecycle gives rise to a number of textual documents, to which NLP can be applied.

So what are the software artifacts that arise in SE? During the requirements phase, a requirements document is an important textual artifact. This specifies the expected behaviour of the software product being designed, in terms of its functionality, user interface, performance, etc. It is important that the requirements being specified are clear and unambiguous, since during product delivery, customers would like to confirm that the delivered product meets all their specified requirements.

Vague or ambiguous requirements can hamper requirement verification. So text analysis techniques can be applied to the requirements document to determine whether it contains any ambiguous or vague statements. For instance, consider a statement like, “Servicing of user requests should be fast, and request waiting time should be low.” This statement is ambiguous since it is not clear what exactly the customer’s expectations of ‘fast service’ or ‘low waiting time’ may be. NLP tools can detect such ambiguous requirements. It is also important that there are no logical inconsistencies in the requirements. For instance, the requirements “Login names should allow a maximum of 16 characters,” and “The login database will have a field for login names which is 8 characters wide,” conflict with each other. While the user interface allows login names of up to 16 characters, the backend login database supports only 8, which is inconsistent with the earlier requirement. Though such inconsistent requirements are currently flagged by human inspection, it is possible to design text analysis tools to detect them.
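
As a rough illustration of how a tool might flag vague wording, here is a minimal Python sketch. The list of vague terms and the function name find_vague_terms are assumptions made for this example; a real tool would rely on a curated lexicon and deeper linguistic analysis.

```python
import re

# Hypothetical starter list of vague terms (an assumption for this sketch);
# a real tool would use a curated, domain-specific lexicon.
VAGUE_TERMS = {"fast", "slow", "low", "high", "quick", "quickly",
               "easy", "user-friendly", "efficient", "adequate", "soon"}

def find_vague_terms(requirement):
    """Return the vague terms found in a requirement, in order of appearance."""
    # Keep hyphenated words such as "user-friendly" together.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", requirement.lower())
    return [w for w in words if w in VAGUE_TERMS]

requirement = ("Servicing of user requests should be fast, "
               "and request waiting time should be low.")
print(find_vague_terms(requirement))  # ['fast', 'low']
```

A requirement that states a measurable limit, such as a maximum of 16 characters, passes through this check without being flagged.
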

The software design phase also produces a number of SE artifacts such as the design document, design models in the form of UML documents, etc, which can also be mined for information. Design documents can be analysed to automatically generate test cases for the final product. During the development and maintenance phases, a number of textual artifacts are generated. Source code itself can be considered a textual document. Apart from source code, source code control system logs such as SVN/Git logs, Bugzilla defect reports, developers’ mailing lists, field reports, crash reports, etc, are SE artifacts to which text mining can be applied.

Various types of text analysis techniques can be applied to SE artifacts. One popular method is duplicate or similar document detection. This technique can be applied to find duplicate bug reports in bug tracking systems. A variation of this technique can be applied to detect code clones and copy-and-paste snippets.
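
One simple way to approximate duplicate detection is to compare the sets of words used in two reports. The sketch below uses Jaccard similarity, which is only one of several possible measures; the sample reports and the 0.5 threshold are assumptions chosen for illustration.

```python
import re

def tokens(text):
    """Lower-case a report and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def find_duplicates(reports, threshold=0.5):
    """Return index pairs of reports whose similarity meets the threshold."""
    pairs = []
    for i in range(len(reports)):
        for j in range(i + 1, len(reports)):
            if jaccard(reports[i], reports[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical bug report titles, made up for this example.
reports = [
    "Application crashes when saving a file",
    "The application crashes when saving a large file",
    "Login page shows the wrong logo",
]
print(find_duplicates(reports))  # [(0, 1)]
```

Comparing every pair is quadratic in the number of reports, so a large bug tracker would need indexing or hashing on top of this idea.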

Automatic summarisation is another popular technique in NLP. These techniques try to generate a summary of a given document by looking for the key points contained in it. There are two approaches to automatic summarisation. One is known as ‘extractive summarisation’, in which key phrases and sentences in the given document are extracted and put back together to provide a summary of the document. The other is ‘abstractive summarisation’, which builds an internal semantic representation of the given document, from which key concepts are extracted and a summary is generated using natural language generation.

The abstractive summarisation technique is close to how humans would summarise a given document. Typically, we would proceed by building a knowledge representation of the document in our minds and then using our own words to provide a summary of the key concepts. Abstractive summarisation is obviously more complex than extractive summarisation, but yields better summaries.
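
The extractive approach described above can be sketched in a few lines of Python. This version scores each sentence by the frequency of its words across the whole document and keeps the top-scoring sentences. The stop-word list and the sample text are assumptions for illustration, and practical summarisers use far richer features.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in",
              "on", "and", "it", "when", "that", "this"}

def content_words(sentence):
    """Return the lower-cased words of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]

def summarise(text, num_sentences=1):
    """Extractive summary: keep the sentences with the highest score."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        # Sum of document-wide frequencies of the sentence's words.
        return sum(freq[w] for w in content_words(sentence))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)

# Hypothetical bug report text, made up for this example.
text = ("The editor crashes when saving large files. "
        "Saving large files makes the editor run out of memory "
        "and the editor crashes. Please fix this soon.")
print(summarise(text))
# Saving large files makes the editor run out of memory and the editor crashes.
```

Note that summing frequencies favours longer sentences; normalising the score by sentence length is a common refinement.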

Coming to SE artifacts, automatic summarisation techniques can be applied to summarise large bug reports. They can also be applied to generate high level comments for methods contained in source code. In this case, each method can be treated as an independent document, and the high level comment associated with that method or function is nothing but a short summary of the method.

Another popular text analysis technique involves the use of language models, which enable predicting what the next word would be in a particular sentence. This technique is typically used on documents generated by optical character recognition (OCR), where due to OCR errors, a word is illegible or gets lost and hence the tool needs to make a best estimate of the word that should appear there. A similar need also arises in the case of speech recognition systems. In case of poor speech quality, when a sentence is being transcribed by the speech recognition tool, a particular word may not be clear or could get lost in transmission. In such a case, the tool needs to predict what the missing word is and add it automatically.

Language modelling techniques can also be applied in integrated development environments (IDEs) to provide ‘auto-completion’ suggestions to developers. Note that in this case, the source code itself is being treated as text and is analysed.
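
To make the idea concrete, here is a minimal bigram language model in Python that suggests the next token based only on the previous one. The tiny tokenised corpus is an assumption for illustration; real auto-completion engines use much larger contexts and training sets.

```python
from collections import Counter, defaultdict

def train_bigrams(token_sequences):
    """Count, for each token, which tokens follow it in the training data."""
    following = defaultdict(Counter)
    for seq in token_sequences:
        for current, nxt in zip(seq, seq[1:]):
            following[current][nxt] += 1
    return following

def suggest(model, previous_token, k=2):
    """Return up to k of the most frequent tokens seen after previous_token."""
    counts = model.get(previous_token, Counter())
    return [tok for tok, _ in counts.most_common(k)]

# Hypothetical, already tokenised lines of source code.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "item", "in", "items", ":"],
    ["for", "i", "in", "range", "(", "10", ")", ":"],
]
model = train_bigrams(corpus)
print(suggest(model, "in", k=1))  # ['range']
print(suggest(model, "for"))      # ['i', 'item']
```

The same counting scheme works for words in a sentence, which is how it would apply to the OCR and speech recognition cases.
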

Classifying a set of documents into specific categories is another well-known text analysis technique. Consider a large number of news articles that need to be categorised based on their topic or genre, such as politics, business, sports, etc. A number of well-known text analysis techniques are available for document classification. Document classification techniques can also be applied to defect reports in SE to identify the category to which a defect belongs. For instance, security related bug reports need to be prioritised. While people currently inspect bug reports, or search for specific key words in a bug category field in Bugzilla reports in order to classify them, more robust and automated techniques are needed to classify defect reports in large scale open source projects. Text analysis techniques for document classification can be employed in such cases.
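
As one possible illustration, the sketch below classifies bug reports with a small multinomial Naive Bayes classifier using add-one smoothing. Naive Bayes is just one of the available classification techniques, and the training reports and the labels ‘security’ and ‘ui’ are made up for this example.

```python
import math
import re
from collections import Counter, defaultdict

def words(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of reports
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for w in words(text):
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words(text):
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training reports, made up for this example.
clf = NaiveBayes()
clf.train("Buffer overflow allows remote code execution", "security")
clf.train("Password is stored in plain text", "security")
clf.train("Button label is misaligned on the settings page", "ui")
clf.train("Dialog font is too small on the login page", "ui")

print(clf.classify("Remote attacker can read the stored password"))  # security
print(clf.classify("Font on the settings page is too small"))        # ui
```

With only four training reports the result is merely indicative; a usable classifier would be trained on thousands of labelled defect reports.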

Another important need in the SE lifecycle is to trace source code to its origin in the requirements document. If a feature ‘X’ is present in the source code, what is the requirement ‘Y’ in the requirements document which necessitated the development of this feature? This is known as traceability of source code to requirements. As source code evolves over time, maintaining traceability links automatically through tools is essential to scale out large software projects. Text analysis techniques can be employed to connect a particular requirement from the requirements document to a feature in the source code and hence automatically generate the traceability links.
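
A very simple way to suggest traceability links is to compare the words in each requirement with the words embedded in function names. The Python sketch below splits identifiers into terms and uses cosine similarity over term counts. The requirement IDs, requirement text and function names are hypothetical, and the approach is only one of many; it does no stemming, so ‘report’ and ‘reports’ are treated as different terms.

```python
import math
import re
from collections import Counter

def terms(text):
    """Split prose or identifiers (camelCase, snake_case) into lower-case terms."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return Counter(re.findall(r"[a-z]+", spaced.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def trace(requirements, functions):
    """Map each function name to the id of the most similar requirement."""
    links = {}
    for name in functions:
        links[name] = max(
            requirements,
            key=lambda rid: cosine(terms(requirements[rid]), terms(name)))
    return links

# Hypothetical requirements and function names, made up for this example.
requirements = {
    "R1": "The user can reset a forgotten password by email",
    "R2": "The system exports monthly reports as PDF files",
}
functions = ["resetUserPassword", "export_monthly_report"]
print(trace(requirements, functions))
# {'resetUserPassword': 'R1', 'export_monthly_report': 'R2'}
```

In practice, comments and string literals inside each function would also be included in the comparison, not just the function name.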

We have now covered automatic summarisation techniques for generating summaries of bug reports and generating header level comments for methods. Another possible use for such techniques in SE is the automatic generation of user documentation for a software project. A number of text mining techniques have been employed to mine Stack Overflow posts and mailing lists in order to automatically generate user documentation or FAQ documents for different software projects.

Inconsistency detection is not limited to the requirements document; the same techniques can also be applied to source code comments. Source code comments are generally expected to express the programmer’s intent. Hence, the code written by the developer and the comment associated with that piece of code should be consistent with each other. Consider the simple code sample shown below:

/* linux/drivers/scsi/in2000.c: */
/* caller must hold instance lock */
static int reset_hardware(…)
static int in2000_bus_reset(…)

In the above code snippet, the developer has stated in a code comment that the instance lock must be held before the function ‘reset_hardware’ is called. However, in the actual source code, the lock is not acquired before the call to ‘reset_hardware’ is made. This is a logical inconsistency, which can arise either because: (a) the comment is outdated with respect to the source code; or (b) the code is incorrect. Flagging such inconsistencies is useful to the developer, who can then fix either the comment or the code, depending on which is incorrect.
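
To give a feel for how such a check could be automated, here is a toy Python sketch. It extracts a ‘caller must hold … lock’ rule from a comment and reports callers that invoke the function without first taking the lock. The C-like source text, the function names and the lock(…) call convention are all hypothetical and are not taken from in2000.c. Real tools combine NLP on the comments with proper program analysis instead of regular expressions.

```python
import re

# Hypothetical C-like source used only for illustration; it is NOT the
# actual contents of in2000.c, and lock()/unlock() are assumed names.
SOURCE = '''
/* caller must hold instance lock */
static int reset_hw(void) {
    return 0;
}

static int bus_reset(void) {
    return reset_hw();
}

static int safe_reset(void) {
    lock(instance);
    int r = reset_hw();
    unlock(instance);
    return r;
}
'''

# A comment stating the rule, followed by the function it applies to.
RULE = re.compile(r"/\*\s*caller must hold (\w+) lock\s*\*/\s*static int (\w+)\(")
# A function definition and its body, up to the closing brace in column 0.
FUNC = re.compile(r"static int (\w+)\([^)]*\)\s*\{(.*?)\n\}", re.DOTALL)

def find_violations(source):
    """Return (caller, callee) pairs where callee is called without its lock."""
    rules = {callee: lock for lock, callee in RULE.findall(source)}
    violations = []
    for caller, body in FUNC.findall(source):
        for callee, lock in rules.items():
            if caller == callee:
                continue
            call = re.search(rf"\b{re.escape(callee)}\(", body)
            if call is None:
                continue
            # Look for lock(<name>) anywhere before the call site.
            before_call = body[:call.start()]
            if re.search(rf"\block\({re.escape(lock)}\)", before_call) is None:
                violations.append((caller, callee))
    return violations

print(find_violations(SOURCE))  # [('bus_reset', 'reset_hw')]
```

The sketch only flags the mismatch; deciding whether the comment or the code is wrong is still left to the developer.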

My ‘must-read book’ for this month
This month’s book suggestion comes from one of our readers, Sharada, and her recommendation is very appropriate to the current column. She recommends an excellent resource for natural language processing: a book called ‘Speech and Language Processing: An Introduction to Natural Language Processing’ by Jurafsky and Martin. The book describes different algorithms for NLP techniques and can be used as an introduction to the subject. Thank you, Sharada, for your valuable recommendation.

If you have a favourite programming book or article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful so I can mention it in the column. This would help many readers who want to improve their software skills.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!