We have been exploring various R packages for handling text for natural language processing. In this twenty-third article in the R, Statistics and Machine Learning series, we delve into the ‘stringr’ package, which provides a comprehensive set of functions to easily work with strings.
We will use R version 4.1.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.
    $ R --version
    R version 4.1.2 (2021-11-01) -- "Bird Hippie"
    Copyright (C) 2021 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)
    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.
You can install and load the ‘stringr’ package using the following commands:
    > install.packages("stringr")
    Installing package into ‘/usr/local/lib/R/site-library’
    (as ‘lib’ is unspecified)
    ...
    ** installing vignettes
    ** testing if installed package can be loaded from temporary location
    ** testing if installed package can be loaded from final location
    ** testing if installed package keeps a record of temporary installation path
    * DONE (stringr)
    > library(stringr)
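Some of the functions covered in this article, such as str_equal(), str_like(), str_split_1() and str_split_i(), are to our understanding only available from stringr 1.5.0 onwards. As a minimal sketch, assuming the package is already installed, you can check the installed version before trying the examples (the variable names are illustrative):

```r
# Check whether the installed stringr is recent enough for the newer
# functions used in this article (assumption: they need version >= 1.5.0).
installed <- packageVersion("stringr")
has_new_functions <- installed >= "1.5.0"

print(installed)
print(has_new_functions)
```

If the second value is FALSE, re-running install.packages("stringr") should fetch a newer release.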
str_to_
The str_to_*() family of functions converts strings to upper case, lower case, title case and sentence case. The general syntax is as follows:
    str_to_<function>(string, locale = "en")
A few examples are given below:
    > t <- "R, Statistics and Machine Learning"
    > str_to_upper(t)
    [1] "R, STATISTICS AND MACHINE LEARNING"
    > str_to_lower(t)
    [1] "r, statistics and machine learning"
    > str_to_title(t)
    [1] "R, Statistics And Machine Learning"
    > str_to_sentence(t)
    [1] "R, statistics and machine learning"
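The locale argument matters for languages whose case-mapping rules differ from English. The sketch below uses Turkish (locale code "tr") purely as an illustrative choice, since the upper-case form of "i" in Turkish is the dotted "İ":

```r
library(stringr)

# Default (English) rules: "i" becomes "I".
upper_en <- str_to_upper("i")

# Turkish rules: "i" becomes the dotted capital "İ" (U+0130).
upper_tr <- str_to_upper("i", locale = "tr")

print(c(upper_en, upper_tr))
```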
str_count
You can count the number of times a pattern matches in a string with the str_count() function. The syntax is as follows:
    str_count(string, pattern = "")
A couple of examples are given below for reference:
    > str_count(t, "a")
    [1] 4
    > str_count(t, c("a", "e"))
    [1] 4 2
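Since the pattern is a regular expression, str_count() can count more than single characters. The following sketch reuses the string t from above; the fruits vector is made up for illustration:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Count all lower-case vowels with a character class.
n_vowels <- str_count(t, "[aeiou]")

# Count words as runs of word characters.
n_words <- str_count(t, "\\w+")

# The function is vectorised over the input strings as well.
fruits <- c("apple", "banana", "pear")
a_per_fruit <- str_count(fruits, "a")

print(n_vowels)     # 10
print(n_words)      # 5
print(a_per_fruit)  # 1 3 1
```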
str_dup
You can duplicate a string with the str_dup() function, which accepts an input string and the number of times to repeat it. The repetition count can also be a vector, in which case one result is returned per count, as shown below:
    > str_dup(t, 1)
    [1] "R, Statistics and Machine Learning"
    > str_dup(t, 2)
    [1] "R, Statistics and Machine LearningR, Statistics and Machine Learning"
    > str_dup(t, 1:3)
    [1] "R, Statistics and Machine Learning"
    [2] "R, Statistics and Machine LearningR, Statistics and Machine Learning"
    [3] "R, Statistics and Machine LearningR, Statistics and Machine LearningR, Statistics and Machine Learning"
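Both arguments of str_dup() are vectorised, which can be handy for building separators or padding. A small sketch, with illustrative variable names:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Build an underline that is as long as the title.
underline <- str_dup("-", str_length(t))

# Each string is paired with its own repetition count.
pairs <- str_dup(c("ab", "cd"), c(1L, 3L))

cat(t, underline, sep = "\n")
print(pairs)  # "ab" "cdcdcd"
```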
str_detect
The str_detect() function returns TRUE if the pattern matches anywhere in the given input string, and FALSE otherwise. The pattern is treated as a regular expression by default; you can wrap it in regex() to set matching options such as ignore_case. If the negate argument is set to TRUE, the result is inverted, so that TRUE is returned for the elements that do not match. Examples to demonstrate this function are given below:
    > str_detect(t, "a")
    [1] TRUE
    > str_detect(t, "[ae]")
    [1] TRUE
    > str_detect(t, "^s")
    [1] FALSE
    > str_detect(t, "g$")
    [1] TRUE
str_conv
The str_conv() function declares the encoding of a string and converts it to UTF-8, which is useful when the input is not in the default encoding. The syntax is as follows:
    str_conv(string, encoding)
Examples that use the ISO-8859-1 encoding are given below for reference:
    > str_conv("\xa9", "ISO-8859-1")
    [1] "©"
    > str_conv("\xbc", "ISO-8859-1")
    [1] "¼"
    > str_conv("\xbd", "ISO-8859-1")
    [1] "½"
    > str_conv("\xbe", "ISO-8859-1")
    [1] "¾"
str_equal
You can test whether two strings are equivalent under Unicode rules with the str_equal() function. It accepts the following arguments:
Argument | Description |
x | A character vector |
y | Another character vector |
locale | Locale used for the comparison; the default is ‘en’ (English) |
ignore_case | Boolean value to ignore case |
A couple of examples are shown below:
    > str_equal("hello", "hi")
    [1] FALSE
    > str_equal("\u1342", "\u1342")
    [1] TRUE
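The difference from the base == operator shows up when the same text is encoded with different code points. The sketch below compares a precomposed "á" with "a" followed by a combining accent, and also tries the ignore_case argument (the variable names are illustrative):

```r
library(stringr)

# "á" as one precomposed code point versus "a" plus a combining accent.
a1 <- "\u00e1"
a2 <- "a\u0301"

print(a1 == a2)           # FALSE: the underlying code points differ
print(str_equal(a1, a2))  # TRUE: equivalent under Unicode rules

# Case-insensitive comparison.
same_ignoring_case <- str_equal("ohio", "OHIO", ignore_case = TRUE)
print(same_ignoring_case)  # TRUE
```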
str_like
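The str_like() function matches strings using the syntax of the SQL LIKE operator, in which % matches any number of characters and _ matches a single character. Unlike str_detect(), the pattern must match the entire string. The syntax is as follows:
    str_like(string, pattern, ignore_case)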
A few examples are given below:
    > vowels <- c("a", "e", "i", "o", "u")
    > str_like(vowels, "a")
    [1]  TRUE FALSE FALSE FALSE FALSE
    > str_like(t, "Mach")
    [1] FALSE
    > str_like(t, "%R%")
    [1] TRUE
str_match
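The str_match() function extracts the first match of a pattern from each string and returns a character matrix: the first column holds the complete match, and any further columns hold the capture groups. Patterns follow the regular expression syntax described in vignette("regular-expressions") and implemented by the stringi package. A couple of examples are given below:
    > str_match(t, "[a-z]+")
         [,1]
    [1,] "tatistics"
    > str_match(t, "[a-zA-Z]+")
         [,1]
    [1,] "R"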
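The matrix result becomes more useful once the pattern contains capture groups. As a sketch, the made-up date string below is split into year, month and day:

```r
library(stringr)

# Three capture groups: year, month and day.
date_parts <- str_match("2021-11-01", "(\\d{4})-(\\d{2})-(\\d{2})")

# One row per input string; column 1 is the full match and
# columns 2 to 4 hold the groups in order.
print(date_parts)

year <- date_parts[1, 2]
print(year)  # "2021"
```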
str_extract
The str_extract() function returns the first substring that matches a pattern, or NA if there is no match; the related str_extract_all() function returns every match. They accept the following arguments:
Argument | Description |
string | Input vector |
pattern | Regular expression |
group | Return specified matched group |
simplify | For str_extract_all() only: TRUE returns a character matrix; FALSE (the default) returns a list of character vectors |
A few examples are given below:
    > str_extract(t, "\\d")
    [1] NA
    > str_extract(t, "and")
    [1] "and"
    > str_extract(t, "[a-z]+")
    [1] "tatistics"
    > str_extract(t, "[a-zA-Z]+")
    [1] "R"
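The simplify and group arguments are easier to follow with an example. The sketch below reuses t; the "key=value" string is made up, and the group argument is assumed to be available in the installed stringr version:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# All capital letters, returned as a list with one element per input string.
caps_list <- str_extract_all(t, "[A-Z]")

# The same matches as a character matrix (one row per input string).
caps_matrix <- str_extract_all(t, "[A-Z]", simplify = TRUE)

# Extract only the second capture group of the first match.
value <- str_extract("key=value", "(\\w+)=(\\w+)", group = 2)

print(caps_list[[1]])    # "R" "S" "M" "L"
print(dim(caps_matrix))  # 1 4
print(value)             # "value"
```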
str_flatten
You can collapse a character vector into a single string using the str_flatten() function. It takes the following arguments:
Argument | Description |
string | Input vector |
collapse | String to insert between elements |
last | Optional string for the final separator |
na.rm | Boolean to handle missing values |
A few examples given below illustrate this function:
    > vowels <- c("a", "e", "i", "o", "u")
    > str_flatten(vowels)
    [1] "aeiou"
    > str_flatten(vowels[1:3], "-")
    [1] "a-e-i"
    > str_flatten_comma(vowels)
    [1] "a, e, i, o, u"
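The last and na.rm arguments listed in the table are not used in the examples above, so here is a short sketch of both (the with_na vector is made up for illustration):

```r
library(stringr)

vowels <- c("a", "e", "i", "o", "u")

# Use a different separator before the final element.
sentence <- str_flatten(vowels, collapse = ", ", last = " and ")

# A missing value makes the whole result NA unless na.rm = TRUE.
with_na <- c("a", NA, "b")
flat_na <- str_flatten(with_na)
flat_clean <- str_flatten(with_na, na.rm = TRUE)

print(sentence)    # "a, e, i, o and u"
print(flat_na)     # NA
print(flat_clean)  # "ab"
```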
str_locate
The str_locate() function returns the beginning and end position of the first pattern match for a given input string. The str_locate_all() function returns all matching occurrences. A few examples are given below:
    > str_locate(t, "a")
         start end
    [1,]     6   6
    > str_locate(t, "$")
         start end
    [1,]    35  34
    > str_locate_all(t, "a")
    [[1]]
         start end
    [1,]     6   6
    [2,]    15  15
    [3,]    20  20
    [4,]    29  29
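The positions returned by str_locate() can be fed to str_sub() to pull out the matched text. A minimal sketch, reusing t and choosing "Machine" as an arbitrary search term:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Locate a word; the result is a one-row matrix of start and end positions.
loc <- str_locate(t, "Machine")

# str_sub() accepts that two-column matrix directly as its start argument.
word <- str_sub(t, loc)

print(loc)   # start 19, end 25
print(word)  # "Machine"
```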
str_sort
You can sort a character vector using the str_sort() function; the related str_order() and str_rank() functions return the sorting indices and the ranks instead. str_sort() takes the following arguments:
Argument | Description |
x | Character vector |
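decreasing | TRUE sorts from highest to lowest; FALSE (the default) sorts from lowest to highest |
na_last | Where NA values are placed: TRUE (the default) puts them last, FALSE puts them first, and NA drops them |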
numeric | Sort digits numerically |
A couple of examples are given below:
    > str_sort(vowels)
    [1] "a" "e" "i" "o" "u"
    > f <- c("beta", "alpha", "gamma", "delta")
    > str_sort(f)
    [1] "alpha" "beta"  "delta" "gamma"
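The decreasing and numeric arguments, and the difference between sorting and ordering, can be sketched as follows (the files vector is made up for illustration):

```r
library(stringr)

files <- c("file10", "file2", "file1")
f <- c("beta", "alpha", "gamma", "delta")

# Default sort compares character by character, so "file10" precedes "file2".
sorted_text <- str_sort(files)

# numeric = TRUE compares runs of digits as numbers.
sorted_numeric <- str_sort(files, numeric = TRUE)

# Reverse alphabetical order.
sorted_desc <- str_sort(f, decreasing = TRUE)

# str_order() returns the indices that would sort the vector.
idx <- str_order(f)

print(sorted_text)     # "file1" "file10" "file2"
print(sorted_numeric)  # "file1" "file2" "file10"
print(sorted_desc)     # "gamma" "delta" "beta" "alpha"
print(idx)             # 2 1 4 3
```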
str_remove
The str_remove() function removes the first match of a pattern from each input string. It accepts two arguments: an input character vector and a pattern. A couple of examples are given below:
    > str_remove(t, "\\s")
    [1] "R,Statistics and Machine Learning"
    > str_remove(t, "[aeiou]")
    [1] "R, Sttistics and Machine Learning"
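To remove every match rather than only the first one, stringr also provides str_remove_all(). A brief sketch with the same patterns as above:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Remove every whitespace character, not only the first.
no_spaces <- str_remove_all(t, "\\s")

# Remove every lower-case vowel.
no_vowels <- str_remove_all(t, "[aeiou]")

print(no_spaces)  # "R,StatisticsandMachineLearning"
print(no_vowels)  # "R, Sttstcs nd Mchn Lrnng"
```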
str_replace
The str_replace() function replaces the first occurrence of the pattern with the replacement string. The syntax is as follows:
    str_replace(string, pattern, replacement)
A couple of examples are as follows:
    > str_replace(t, "\\s", "-")
    [1] "R,-Statistics and Machine Learning"
    > str_replace(t, "[aeiou]", " ")
    [1] "R, St tistics and Machine Learning"
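Two common variations are replacing every match with str_replace_all() and referring to capture groups in the replacement. The sketch below shows both; the abbreviations and the name "John Smith" are made up for illustration:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Replace every whitespace character, not only the first.
hyphenated <- str_replace_all(t, "\\s", "-")

# A named vector performs several pattern = replacement substitutions.
short <- str_replace_all(t, c("Statistics" = "Stats", "Machine Learning" = "ML"))

# \\1 and \\2 refer to the first and second capture groups.
swapped <- str_replace("John Smith", "(\\w+) (\\w+)", "\\2, \\1")

print(hyphenated)  # "R,-Statistics-and-Machine-Learning"
print(short)       # "R, Stats and ML"
print(swapped)     # "Smith, John"
```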
str_split
You can split a string into multiple segments using the str_split() function. It comes in several variants, which differ in the input they accept and the shape of the result they return.
The str_split() function accepts a character vector and returns a list with one element per input string, as shown below:
    > str_split(t, " ")
    [[1]]
    [1] "R,"         "Statistics" "and"        "Machine"    "Learning"
The str_split_1() function takes a single string and returns a character vector. For example:
    > str_split_1(t, "and")
    [1] "R, Statistics "    " Machine Learning"
The str_split_fixed() function accepts a character vector and the number of pieces to return, and produces a character matrix with one row per input string. A couple of examples are given below:
    > str_split_fixed(t, " ", 2)
         [,1] [,2]
    [1,] "R," "Statistics and Machine Learning"
    > str_split_fixed(t, " ", 3)
         [,1] [,2]         [,3]
    [1,] "R," "Statistics" "and Machine Learning"
The str_split_i() function takes a character vector and an index i, and returns the i-th piece of each string as a character vector. A few examples are given below to demonstrate this function:
    > str_split_i(t, " ", 1)
    [1] "R,"
    > str_split_i(t, " ", 2)
    [1] "Statistics"
    > str_split_i(t, " ", 3)
    [1] "and"
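A few further variations on splitting are sketched below: a negative index for str_split_i(), several input strings for str_split(), and a literal (non-regex) separator via fixed(). The hyphenated and dotted strings are made up for illustration:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# A negative index counts from the right-hand side.
last_word <- str_split_i(t, " ", -1)

# With several input strings, str_split() returns one list element per string.
parts <- str_split(c("a-b", "c-d-e"), "-")

# fixed() treats "." as a literal dot instead of a regex wildcard.
dotted <- str_split_1("a.b.c", fixed("."))

print(last_word)       # "Learning"
print(lengths(parts))  # 2 3
print(dotted)          # "a" "b" "c"
```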
You are encouraged to read the stringr reference manual to learn more about its functions, arguments, options, and usage.