Python: Text Processing With The Collections Package


The ‘collections’ package in Python is a powerful tool for text processing. It offers alternatives to Python’s general-purpose built-in containers: dict, list, set, and tuple.

A traditional method of analysing texts is to assess the connotation of each word so that analysts can ultimately deduce the overall tone of the text. The typical approach involves determining how many words in a text also appear in a predefined list of words associated with a given context. Text analysis can be executed in various ways; here, we will explore how it can be implemented using the ‘collections’ package. Python has several built-in collection classes, including lists, tuples, sets, and dictionaries, and the collections package provides specialised alternatives to these. It also offers convenient ways to count the occurrences of elements in a collection, notably the ‘Counter’ class and the ‘defaultdict’ class (the latter is not a Python built-in, but has been part of the collections package since Python 2.5).
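
As a quick preview, here is a minimal counting sketch using ‘defaultdict’ (the sample sentence is made up purely for illustration):

from collections import defaultdict

text = "open source for you open source"

# defaultdict(int) supplies 0 for missing keys, so counts can be
# accumulated without first checking whether the key exists
counts = defaultdict(int)
for word in text.split():
    counts[word] += 1

print(dict(counts))   # {'open': 2, 'source': 2, 'for': 1, 'you': 1}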

Collections module

This module implements specialised container data types, providing alternatives to Python’s general-purpose built-in containers: dict, list, set, and tuple. The classes are listed below; a short sketch of the first three, which are not covered in the later examples, follows the list.

  • namedtuple(): Factory function for creating tuple subclasses with named fields
  • deque: List-like container with fast appends and pops on either end
  • ChainMap: Dict-like class for creating a single view of multiple mappings
  • Counter: Dict subclass for counting hashable objects
  • OrderedDict: Dict subclass that remembers the order in which entries were added
  • defaultdict: Dict subclass that calls a factory function to supply missing values
  • UserDict: Wrapper around dictionary objects for easier dict subclassing
  • UserList: Wrapper around list objects for easier list subclassing
  • UserString: Wrapper around string objects for easier string subclassing
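
A minimal sketch of namedtuple, deque, and ChainMap (all values here are arbitrary examples):

from collections import namedtuple, deque, ChainMap

# namedtuple: a tuple subclass with named fields
Point = namedtuple('Point', ['x', 'y'])
p = Point(3, 4)
print(p.x, p.y)   # 3 4

# deque: fast appends and pops on either end
d = deque([1, 2, 3])
d.appendleft(0)   # O(1) insert at the left end
d.pop()           # O(1) removal from the right end
print(d)          # deque([0, 1, 2])

# ChainMap: a single view over multiple mappings
defaults = {'user': 'guest', 'colour': 'red'}
overrides = {'user': 'admin'}
print(ChainMap(overrides, defaults)['user'])   # admin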

Collections: List module

The collections package also provides a list-like data type, UserList, which supports the usual list methods like count, pop, index, and insert. A UserList object must first be created from an existing list before these methods can be applied.

from collections import UserList

l = [1, 2, 3, 4]
userL = UserList(l)
print(type(userL))

The output is:

<class 'collections.UserList'>

The content of the list collection is displayed using the ‘data’ attribute:

print(userL.data)

The output is:

[1, 2, 3, 4]

Items can be removed from the list collection using the ‘pop’ method, and data can be inserted using the ‘insert’ method:

# Pop
userL.pop()
print(userL.data)

# Insert
userL.insert(0, 5)
print(userL.data)

[1, 2, 3]     # pop() removes the last element
[5, 1, 2, 3]  # insert at the 0th position

Collections: String module

Like the list, the collections package also supports string data through the UserString class, which provides the familiar string methods. A few of these are shown here to give an idea of the functionality of the package.

from collections import UserString

As with UserList, all the methods of UserString can be used as shown below:

l = "open source for you 2023"
userS = UserString(l)
print(type(userS))
print(userS.data)

The output is:

<class 'collections.UserString'>
open source for you 2023

Some useful methods of UserString are ‘count()’, ‘find()’ and ‘index()’. The ‘count()’ method counts the number of occurrences of a substring, while ‘find()’ and ‘index()’ locate a substring’s position. The capitalize() method is handy for capitalising the first character of a given string. Methods like ‘isalpha()’, ‘isascii()’, and ‘isalnum()’ are Boolean checks for an alphabetic, ASCII, and alphanumeric string, respectively. All this can be done as follows.

Count the number of occurrences of ‘o’ within the given string:

print(userS.count("o"))
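
The output is:

4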

Similarly, to find the index position of a string ‘r’, one can use ‘find()’ or ‘index()’:

print(userS.find("r"))
print(userS.index("r"))

The outputs are:

8
8

To capitalise the first character of a string, the capitalize() method can be used as follows:

print(userS.capitalize())

The output is:

Open source for you 2023

Boolean functions ‘isalpha()’, ‘isascii()’, and ‘isalnum()’ can be used as shown below:

print(userS.isalpha())
print(userS.isascii())
print(userS.isalnum())
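
The outputs are:

False
True
False

The string is not purely alphabetic (it contains spaces and digits) or alphanumeric (it contains spaces), but it is entirely ASCII.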

Collections: Default dictionary

The default dictionary of the ‘collections’ package is an added advantage over the ordinary dictionary structure. Where an ordinary dictionary raises a KeyError for a non-existent key, the ‘defaultdict’ class of collections produces a valid customised response for a missing key.

With an ordinary dictionary:

d = {"a": 1, "b": 1}
print(d["a"])

print(d["c"])

…the outputs are:

1
Traceback (most recent call last):
  File "D:\SENTIMENT_ANALYSIS\tp_2.py", line 35, in <module>
    print(d["c"])
KeyError: 'c'

In the case of the default dictionary of the ‘collections’ package:

from collections import defaultdict

d = defaultdict(lambda: "Not Present")  # factory function supplying the default value
d["a"] = 1
d["b"] = 2
print(type(d))
print(d["a"])
print(d["b"])
print(d["c"])

…the output is:

<class 'collections.defaultdict'>
1
2
Not Present  # instead of a KeyError, the customised response is returned

Counter

The Counter class provides fast and convenient tallying of hashable objects. For example, a word frequency analysis of a sentence can be carried out with the ‘Counter’ class of the collections package.

from collections import Counter

text = r"""
       A traditional method of analysing texts is to
       find the contentions of each word so that analysts
       can finally conclude the overall contention of the text.
       """
cnt = Counter()
for word in text.split():
    cnt[word] += 1
print(type(cnt))
print(cnt)

The output is:

<class 'collections.Counter'>
Counter({'of': 3, 'the': 3, 'A': 1, 'traditional': 1, 'method': 1, 'analysing': 1, 'texts': 1, 'is': 1, 'to': 1, 'find': 1, 'contentions': 1, 'each': 1, 'word': 1, 'so': 1, 'that': 1, 'analysts': 1, 'can': 1, 'finally': 1, 'conclude': 1, 'overall': 1, 'contention': 1, 'text.': 1})

To use ‘Counter’, it must first be imported from the collections package. As shown here, Counter returns a dict subclass that maps each element to its count.
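
Because Counter is a dict subclass, individual counts can be read directly; unlike a plain dict, a missing key yields a zero count instead of a KeyError. Counters also support multiset arithmetic, as this short sketch (reusing ‘cnt’ from above) shows:

print(cnt['of'])       # 3
print(cnt['python'])   # 0: a missing key returns 0, not a KeyError

extra = Counter({'of': 2, 'python': 1})
print((cnt + extra)['of'])   # 5: counts are added element-wise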

The most common words

Subsequently, finding the most common word items in the collection is also possible. This can be implemented with the most_common(n) method of Counter().

import re

words = re.findall(r'\w+', text.lower())
x = Counter(words).most_common(5)
print(x)

The output is:

[('of', 3), ('the', 3), ('a', 1), ('traditional', 1), ('method', 1)]

First, call the ‘findall()’ function of ‘re’ to extract all words from the given string:

words = re.findall(r'\w+', text.lower())

The output is:

['a', 'traditional', 'method', 'of', 'analysing', 'texts', 'is', 'to', 'find', 'the', 'contentions', 'of', 'each', 'word', 'so', 'that', 'analysts', 'can', 'finally', 'conclude', 'the', 'overall', 'contention', 'of', 'the', 'text']

Converting all the text to lower case ensures that differently capitalised forms of the same word are not counted separately. Then, from the word list, it is easy to create a dictionary of unique words and their counts using ‘Counter()’.

x = Counter(words)
print(x)

The output is:

Counter({'of': 3, 'the': 3, 'a': 1, 'traditional': 1, 'method': 1, 'analysing': 1, 'texts': 1, 'is': 1, 'to': 1, 'find': 1, 'contentions': 1, 'each': 1, 'word': 1, 'so': 1, 'that': 1, 'analysts': 1, 'can': 1, 'finally': 1, 'conclude': 1, 'overall': 1, 'contention': 1, 'text': 1})
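
Putting the pieces together, the approach described at the outset, counting how many words of a text appear in a predefined list, can be implemented with Counter in a few lines. The word list here is hypothetical, chosen purely for illustration:

from collections import Counter
import re

text = """A traditional method of analysing texts is to
find the contentions of each word so that analysts
can finally conclude the overall contention of the text."""

# Hypothetical list of context words; a real analysis would use
# a curated lexicon for the chosen context
context_words = {'analysing', 'analysts', 'texts', 'word'}

counts = Counter(re.findall(r'\w+', text.lower()))

# Sum the counts of only those words that appear in the predefined list
matches = sum(counts[w] for w in context_words)
print(matches)   # 4: each of the four context words occurs once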

Text processing is the first step in natural language processing, and the ‘collections’ package is a powerful tool for getting started with it. It provides alternatives to iterable data types like lists, dictionaries, and strings, and offers additional functionality for these types. Some of this has been discussed here with examples; interested readers can explore the package further in the official Python documentation.

The author is a member of IEEE, IET, with more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network and different statistical tools. He has also jointly authored a textbook called ‘MATLAB for Engineering and Science’. He can be reached at dipankarray@ieee.org.
