NLP: Text Summarisation with Python

Here’s a simple Python method based on the Natural Language Toolkit for extractive text summarisation in natural language processing.

In natural language processing (NLP), frequency-based summarisation is a straightforward extractive text summarisation technique that selects sentences based on the frequency of important words in the text. The approach is based on the assumption that frequently occurring words represent the core themes of the text. Let’s discuss a simplified algorithm using this approach.

Steps in frequency-based summarisation

Preprocessing:

  • Tokenization: Split the text into sentences and words.
  • Stop word removal: Remove common words like ‘and’, ‘the’, or ‘is’ that do not contribute to meaning.
  • Stemming: Reduce words to their base forms.

Word frequency calculation:

  • Count the occurrences of each word in the text.
  • Normalise frequencies if needed, e.g., by dividing by the total number of words.

Sentence scoring:

  • Assign scores to sentences based on the cumulative frequency of the words they contain.
  • Sentences with more frequent words score higher.

Sentence selection:

  • Rank sentences by their scores.
  • Select the top n sentences (based on a predefined ratio or word count) to form the summary; a compact sketch of these steps appears right after this list.
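Before building this out step by step with NLTK, here is a minimal, self-contained sketch of the whole pipeline using only Python's standard library. It is illustrative rather than the NLTK-based implementation developed below: the helper name simple_summary is hypothetical, the sentence and word splitting is naive, and stop word removal and stemming are omitted for brevity.

import re
from collections import Counter

def simple_summary(text, top_n=2):
    # Naive sentence split on ., ! and ? (illustrative only)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    # Word frequencies over the whole text (lower-cased, alphabetic tokens only)
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    # Score each sentence by the total frequency of its words,
    # normalised by the number of words in the sentence
    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Keep the top_n highest-scoring sentences, preserving original order
    ranked = sorted(sentences, key=score, reverse=True)[:top_n]
    return ' '.join(s for s in sentences if s in ranked)

print(simple_summary("NLP is a subfield of AI. NLP helps computers process language. Text summarisation is one NLP task.", top_n=1))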

The Natural Language Toolkit (NLTK) package provides all the modules required for this text processing. The following modules are used here.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
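These modules rely on corpora and tokeniser models that NLTK downloads separately. Assuming a standard NLTK installation, a one-time setup along these lines is typically needed (resource names can vary slightly between NLTK versions):

import nltk
nltk.download('punkt')       # tokeniser models used by word_tokenize and sent_tokenize
nltk.download('stopwords')   # the English stop word list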

Tokenization

In natural language processing, tokenization divides a string into a list of tokens, the basic units in which valuable patterns are found. (The same term is used in data security, where tokenization replaces sensitive data with non-sensitive placeholders, but that is a different concept.)

Tokens can be thought of as the words in a sentence or paragraph. word_tokenize() is an NLTK function that splits a given sentence into its words.

# Import the word_tokenize method from NLTK
from nltk import word_tokenize

# Create a string input
sentence = "Python is a high-level programming language"

Now, using word_tokenize() one can split the sentence into its constituent words.

words = word_tokenize(sentence)
print(words)
['Python', 'is', 'a', 'high-level', 'programming', 'language']

In addition, SyllableTokenizer() in the NLTK library can be used to split a word into its tokens according to its syllables.
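As a quick illustration (a sketch assuming the SyllableTokenizer class shipped with recent NLTK releases), a word can be split into syllable tokens like this; the exact boundaries come from NLTK's sonority-based heuristic and may vary by version:

from nltk.tokenize import SyllableTokenizer

ssp = SyllableTokenizer()
print(ssp.tokenize("justification"))
# Typically: ['jus', 'ti', 'fi', 'ca', 'tion']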

Stemming

English often uses inflected forms of words, since many English words are derived from other words; for example, the inflected word 'generosity' is derived from the root form 'generous'. Inflected words share a common root form, but the degree of inflexion varies from language to language. The following sentence contains four inflected words (overwhelmed, generosity, friends and neighbours), all of which can be identified and reduced to their root forms with the NLTK PorterStemmer module.

sentence = "We were overwhelmed by the generosity of friends and neighbours"

from nltk.stem import PorterStemmer

words = word_tokenize(sentence)
ps = PorterStemmer()
[ps.stem(word) for word in words]
['we',
 'were',
 'overwhelm',
 'by',
 'the',
 'generos',
 'of',
 'friend',
 'and',
 'neighbour']

Frequency table

Extractive text summarisation requires a frequency table of word counts held in a Python dictionary, and building it involves both word tokenization and stemming. The function createFrequencyTable(textBody) below creates a dictionary with stemmed words as keys and their counts as values, filtering out stop words and punctuation to keep the table concise and meaningful. As a first step, combining the stemmer with a manual stop word filter on the example above looks like this:

[ps.stem(word) for word in words if word.lower() not in ['we', 'were', 'by', 'the', 'of', 'and']]

Out: ['overwhelm', 'generos', 'friend', 'neighbour']
 
def createFrequencyTable(textBody) -> dict:
    # Collect all stop words from NLTK
    stopWords = set(stopwords.words("english"))

    # Tokenization
    words = word_tokenize(textBody)

    # Filter out all stop words
    words = [w for w in words if w.lower() not in stopWords]

    # Filter out all punctuation
    words = [w for w in words if w.lower().isalpha()]

    # Create the stemmer object
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        word = ps.stem(word)
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    return freqTable
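For example, calling this function on a short sentence (assuming the stop word list and tokeniser data have been downloaded) produces a table along these lines; the exact stems are determined by the Porter algorithm:

freqTable = createFrequencyTable("NLP helps computers understand and process human language.")
print(freqTable)
# e.g. {'nlp': 1, 'help': 1, 'comput': 1, 'understand': 1, 'process': 1, 'human': 1, 'languag': 1}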

Stop words

One of NLP’s core aspects is handling ‘stop words’, which due to their frequent occurrence in the text often don’t offer substantial insights on their own.

Stop words like ‘the’, ‘and’, and ‘I’, although common, don’t usually provide meaningful information about a document’s specific topic. Removing these words from a corpus allows us to identify unique and relevant terms more easily.

It’s important to note that there is no universally accepted list of stop words in NLP. However, the Natural Language Toolkit offers a powerful list of ‘stop words’ for researchers and practitioners to utilise.

from nltk.corpus import stopwords

stopWords = set(stopwords.words("english"))

The stopwords corpus is exposed as a WordListCorpusReader object; stopwords.words("english") returns a widely accepted list of English stop words, which is converted into a set here for fast look-ups. The set can be used as follows:

if word not in stopWords:
    print(word)

Sentence categorisation

Sentence categorisation is a heuristic approach and depends on the application area. A very simplified approach is used here: each sentence is assigned a weightage based on the frequency, within the given text, of the words it contains.

j = 8   # length of the sentence prefix used as a dictionary key; can be chosen judiciously

def score(sentences, freqTable) -> dict:
    sentenceScore = dict()

    for sentence in sentences:
        word_count_in_sentence = len(word_tokenize(sentence))
        for wordValue in freqTable:
            if wordValue in sentence.lower():
                if sentence[:j] in sentenceScore:
                    sentenceScore[sentence[:j]] += freqTable[wordValue]
                else:
                    sentenceScore[sentence[:j]] = freqTable[wordValue]
        # Normalise the sentence score by the sentence length
        if sentence[:j] in sentenceScore:
            sentenceScore[sentence[:j]] = sentenceScore[sentence[:j]] // word_count_in_sentence

    return sentenceScore

The global variable j sets the length of the prefix extracted from each sentence; this prefix is used as the key in the sentenceScore dictionary, which holds the cumulative score of each sentence.
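To make the keying scheme concrete, with j = 8 each sentence is represented in the dictionary by its first eight characters (an illustrative snippet, not part of the summariser itself):

j = 8
sentence = "NLP helps computers understand and process human language."
print(sentence[:j])   # 'NLP help' - the key under which this sentence's score is stored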

Summarisation

To summarise a text body, we need a threshold to identify the most meaningful sentences. Here, the average of the sentence scores is taken as the threshold, and the summary is generated by selecting every sentence whose score exceeds that threshold.

def averageScore(sentenceScore) -> int:
    sumScores = 0
    for entry in sentenceScore:
        sumScores += sentenceScore[entry]
    # Average sentence score over the original text
    average = sumScores // len(sentenceScore)
    return average

def calculateSummary(sentences, sentenceScore, threshold):
    sentenceCount = 0
    summary = ''
    for sentence in sentences:
        if sentence[:j] in sentenceScore:
            if sentenceScore[sentence[:j]] > threshold:
                summary += " " + sentence
                sentenceCount += 1
    return [summary, sentenceCount]

Finally, all these can be used to summarise a given text.

textBody = """
Natural Language Processing (NLP) is a subfield of artificial intelligence. NLP helps computers understand and process human language. Tasks like text summarisation, sentiment analysis, and machine translation are part of NLP.
"""

freqTable = createFrequencyTable(textBody)
sentences = sent_tokenize(textBody)
j = 8   # an arbitrary value; can be adjusted as per need
sentenceScore = score(sentences, freqTable)

# Find the threshold
threshold = averageScore(sentenceScore)

# Generate the summary
[summary, sentenceCount] = calculateSummary(sentences, sentenceScore, threshold)
print(summary)
print(sentenceCount)

The output will be "Tasks like text summarisation, sentiment analysis, and machine translation are part of NLP."

Frequency-based summarisation provides a straightforward and efficient baseline for quick overviews or in settings with limited resources. Other summarising methods, such as abstractive and hybrid, may yield superior outcomes, but they necessitate a thorough understanding of AI and ML in addition to language concepts.

The author is a member of IEEE, IET, with more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network and different statistical tools. He has also jointly authored a textbook called ‘MATLAB for Engineering and Science’. He can be reached at dipankarray@ieee.org.