Here’s a simple Python approach to extractive text summarisation in natural language processing, based on the Natural Language Toolkit (NLTK).
In natural language processing (NLP), frequency-based summarisation is a straightforward extractive text summarisation technique that selects sentences based on the frequency of important words in the text. The approach is based on the assumption that frequently occurring words represent the core themes of the text. Let’s discuss a simplified algorithm using this approach.
Steps in frequency-based summarisation
Preprocessing:
- Tokenization: Split the text into sentences and words.
- Stop word removal: Remove common words like ‘and’, ‘the’, or ‘is’ that do not contribute to meaning.
- Stemming: Reduce words to their base forms.
Word frequency calculation:
- Count the occurrences of each word in the text.
- Normalise frequencies if needed, e.g., by dividing by the total number of words.
Sentence scoring:
- Assign scores to sentences based on the cumulative frequency of the words they contain.
- Sentences with more frequent words score higher.
Sentence selection:
- Rank sentences by their scores.
- Select the top n sentences (based on a predefined ratio or word count) to form the summary; a compact sketch of these steps follows this list.
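Before moving to NLTK, the four steps above can be condensed into a rough, library-free sketch. Everything in it (the naive_summary function, the regex-based tokenisation and the tiny stop word list) is an illustrative assumption rather than the NLTK-based code developed in the rest of this article, and stemming is omitted for brevity.

from collections import Counter
import re

# Tiny illustrative stop word list; the NLTK list used later is far more complete
STOP = {"a", "an", "and", "the", "is", "are", "of", "in", "to", "like"}

def naive_summary(text, top_n=1):
    # 1. Preprocessing: split into sentences and lower-cased words, drop stop words
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

    # 2. Word frequency calculation, normalised by the total word count
    freq = Counter(words)
    total = sum(freq.values())
    freq = {w: c / total for w, c in freq.items()}

    # 3. Sentence scoring: cumulative frequency of the words in each sentence
    scores = {s: sum(freq.get(w, 0) for w in re.findall(r"[a-z]+", s.lower()))
              for s in sentences}

    # 4. Sentence selection: keep the top_n highest-scoring sentences
    return " ".join(sorted(sentences, key=scores.get, reverse=True)[:top_n])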
The text processing in this article is based on the Natural Language Toolkit (NLTK) package, which provides all the required modules. The following modules are used here:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
Tokenization
In natural language processing, tokenization divides a string into a list of tokens, which are useful when looking for meaningful patterns in text. (The term ‘tokenization’ is also used in data security to mean replacing sensitive data with non-sensitive placeholders, but that is a different concept.)
Tokens can be thought of as the words in a sentence or a paragraph. word_tokenize is the NLTK function that splits a given sentence into its words.
# Import the word_tokenize method from NLTK
from nltk import word_tokenize

# Create a string input
sentence = "Python is a high-level programming language"
Now, using word_tokenize() one can split the sentence into its constituent words.
words = word_tokenize(sentence)
print(words)

['Python', 'is', 'a', 'high-level', 'programming', 'language']
In addition, SyllableTokenizer() in the NLTK library can be used to split a word into tokens corresponding to its syllables.
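For instance, a minimal sketch of syllable tokenization looks like the following; the word ‘justification’ is just an example, and the exact split depends on the tokenizer’s sonority-based rules.

from nltk.tokenize import SyllableTokenizer

# Create the syllable tokenizer and split one example word
SSP = SyllableTokenizer()
print(SSP.tokenize("justification"))

['jus', 'ti', 'fi', 'ca', 'tion']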
Stemming
The English language often uses inflected forms of words, since many English words are derived from other words; for example, the inflected word ‘generosity’ is derived from ‘generous’, which is its root form. Inflected words share a common root form, but the degree of inflexion varies from language to language. The following sentence contains four inflected words (overwhelmed, generosity, friends, neighbours), all of which can be identified and reduced to their root forms with the NLTK PorterStemmer module.
sentence = "We were overwhelmed by the generosity of friends and neighbours"

from nltk.stem import PorterStemmer

words = word_tokenize(sentence)
ps = PorterStemmer()
[ps.stem(word) for word in words]

['we', 'were', 'overwhelm', 'by', 'the', 'generos', 'of', 'friend', 'and', 'neighbour']
Frequency table
Extractive text summarisation requires a frequency table of word counts, held in a Python dictionary. Building it needs both word tokenization and stemming so that each word is counted in its root form. The function createFrequencyTable(textBody) creates a dictionary with each stemmed word as the key and its count as the value. It also filters out stop words and punctuation from the text body to make the frequency table more concise and meaningful.
[ps.stem(word) for word in words if word.lower() not in ['we', 'were', 'by', 'the', 'of', 'and']]

Out: ['overwhelm', 'generos', 'friend', 'neighbour']

def createFrequencyTable(textBody) -> dict:
    # Collect all stop words from NLTK
    stopWords = set(stopwords.words("english"))

    # Tokenization
    words = word_tokenize(textBody)

    # Filter out all stop words
    words = [w for w in words if w.lower() not in stopWords]

    # Filter out all punctuation
    words = [w for w in words if w.isalpha()]

    # Create the stemmer object
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        word = ps.stem(word)
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1
    return freqTable
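For example, calling the function on a short, made-up sentence (this input and its output are illustrative additions, not from the original text) yields counts keyed by the stemmed words:

print(createFrequencyTable("Friends help friends."))

{'friend': 2, 'help': 1}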
Stop words
One of NLP’s core aspects is handling ‘stop words’, which, due to their frequent occurrence in text, often don’t offer substantial insight on their own.
Stop words like ‘the’, ‘and’, and ‘I’, although common, don’t usually provide meaningful information about a document’s specific topic. Removing these words from a corpus allows us to identify unique and relevant terms more easily.
It’s important to note that there is no universally accepted list of stop words in NLP. However, the Natural Language Toolkit offers a ready-made list of stop words for researchers and practitioners to utilise.
from nltk.corpus import stopwords

stopWords = set(stopwords.words("english"))
Here, stopwords is a WordListCorpusReader object, and stopwords.words("english") returns NLTK's widely accepted English stop word list, which is converted to a set for fast membership checks. It can be used as follows:
if word not in stopWords:
    print(word)
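An entire token list can be filtered the same way; the sample sentence below is an illustrative addition.

# Tokenize a sample sentence and keep only the non-stop words
words = word_tokenize("This is an example of stop word removal")
print([w for w in words if w.lower() not in stopWords])

['example', 'stop', 'word', 'removal']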
Sentence categorisation
Sentence categorisation is a heuristic and depends on the application area. A very simplified approach is used here: each sentence is assigned a weightage based on the frequency of the words it contains within the given text.
# j is a global prefix length used as the sentence key; it is set before
# score() is called and can be selected judiciously

def score(sentences, freqTable) -> dict:
    sentenceScore = dict()
    for sentence in sentences:
        word_count_in_sentence = len(word_tokenize(sentence))
        for wordValue in freqTable:
            if wordValue in sentence.lower():
                if sentence[:j] in sentenceScore:
                    sentenceScore[sentence[:j]] += freqTable[wordValue]
                else:
                    sentenceScore[sentence[:j]] = freqTable[wordValue]

        # Normalise the sentence score by the number of words in the sentence
        if sentence[:j] in sentenceScore:
            sentenceScore[sentence[:j]] = sentenceScore[sentence[:j]] // word_count_in_sentence

    return sentenceScore
The global variable j determines the length of the prefix taken from each sentence; this prefix is used as the key under which the cumulative score is stored in sentenceScore, a dictionary that holds the score of each sentence.
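For example, with j = 8 the key stored for a sentence is simply its first eight characters (the sentence below is an illustrative addition):

j = 8
sentence = "NLP helps computers understand and process human language."
# The first j characters act as the dictionary key for this sentence
print(sentence[:j])

NLP help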
Summarisation
To summarise a text body, we need a threshold to identify the most meaningful sentences. Here, the average of the sentence scores is taken as the threshold, and the summary is generated by selecting every sentence whose score is greater than this threshold.
def averageScore(sentenceScore) -> int:
    sumScores = 0
    for entry in sentenceScore:
        sumScores += sentenceScore[entry]

    # Average value of a sentence score over the original text
    average = sumScores // len(sentenceScore)
    return average

def calculateSummary(sentences, sentenceScore, threshold):
    sentenceCount = 0
    summary = ''
    for sentence in sentences:
        if sentence[:j] in sentenceScore:
            if sentenceScore[sentence[:j]] > threshold:
                summary += " " + sentence
                sentenceCount += 1
    return [summary, sentenceCount]
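As a quick sanity check, averageScore() on a toy score dictionary (hypothetical keys and values, purely for illustration) returns the integer average of the scores:

print(averageScore({'s1': 2, 's2': 4}))

3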
Finally, all these can be used to summarise a given text.
textBody = """Natural Language Processing (NLP) is a subfield of artificial intelligence. NLP helps computers understand and process human language. Tasks like text summarisation, sentiment analysis, and machine translation are part of NLP."""

freqTable = createFrequencyTable(textBody)

sentences = sent_tokenize(textBody)

j = 8  # an arbitrary value; can be adjusted as per need

sentenceScore = score(sentences, freqTable)

# Find the threshold
threshold = averageScore(sentenceScore)

# Generate the summary
[summary, sentenceCount] = calculateSummary(sentences, sentenceScore, threshold)

print(summary)
print(sentenceCount)
The output will be “Tasks like text summarisation, sentiment analysis, and machine translation are part of NLP.”
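The sentence selection step described earlier also mentioned picking a fixed number of top-ranked sentences. As an alternative to the average-score threshold, a minimal sketch of that variant (topNSummary and its parameter n are assumptions, not part of the original code) could look like this, reusing the sentences, sentenceScore and j defined above:

import heapq

def topNSummary(sentences, sentenceScore, n=2):
    # Keys (j-character prefixes) of the n highest-scoring sentences
    topKeys = set(heapq.nlargest(n, sentenceScore, key=sentenceScore.get))
    # Keep the original sentence order in the summary
    return " ".join(s for s in sentences if s[:j] in topKeys)

print(topNSummary(sentences, sentenceScore, n=1))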
Frequency-based summarisation provides a straightforward and efficient baseline for quick overviews or for settings with limited resources. Other summarisation methods, such as abstractive and hybrid approaches, may yield superior results, but they require a thorough understanding of AI and ML in addition to language concepts.