We have been exploring various R packages for handling text for natural language processing. In this twenty-third article in the R, Statistics and Machine Learning series, we delve into the ‘stringr’ package, which provides a comprehensive set of functions to easily work with strings.
We will use R version 4.1.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.
$ R --version
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
You can install and load the ‘stringr’ package using the following commands:
> install.packages("stringr")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
...
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (stringr)
> library(stringr)
str_to_
The str_to_*() family of functions converts strings to upper case, lower case, title case and sentence case. The general syntax is as follows:
str_to_<function>(string, locale = "en")
A few examples are given below:
> t <- "R, Statistics and Machine Learning"
> str_to_upper(t)
[1] "R, STATISTICS AND MACHINE LEARNING"
> str_to_lower(t)
[1] "r, statistics and machine learning"
> str_to_title(t)
[1] "R, Statistics And Machine Learning"
> str_to_sentence(t)
[1] "R, statistics and machine learning"
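The locale argument matters for languages whose case rules differ from English. The following is a minimal sketch; the Turkish locale code "tr" and the variable names are chosen purely for illustration and do not come from the examples above.

```r
library(stringr)

# With the default English locale, lower-case "i" maps to a plain "I".
upper_en <- str_to_upper("i")

# With the Turkish locale, it maps to a dotted capital I (U+0130).
upper_tr <- str_to_upper("i", locale = "tr")

print(upper_en)
print(upper_tr)
```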
str_count
You can count the number of matches of a pattern in each string with the str_count() function. The pattern is treated as a regular expression by default. The syntax is as follows:
str_count(string, pattern = "")
A couple of examples are given below for reference:
> str_count(t, "a")
[1] 4
> str_count(t, c("a", "e"))
[1] 4 2
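Because the pattern is a regular expression, characters such as "." have a special meaning. A short sketch of the difference, using an illustrative input string and the fixed() helper from stringr to request a literal match:

```r
library(stringr)

version_string <- "4.1.2"  # illustrative input, not from the examples above

# As a regular expression, "." matches any single character.
n_any <- str_count(version_string, ".")

# Wrapped in fixed(), "." only matches a literal dot.
n_dots <- str_count(version_string, fixed("."))

print(n_any)   # 5
print(n_dots)  # 2
```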
str_dup
You can duplicate a string with the str_dup() function, which accepts an input string and the number of times to repeat it. The number of repetitions can also be a vector, as shown below:
> str_dup(t, 1)
[1] "R, Statistics and Machine Learning"
> str_dup(t, 2)
[1] "R, Statistics and Machine LearningR, Statistics and Machine Learning"
> str_dup(t, 1:3)
[1] "R, Statistics and Machine Learning"
[2] "R, Statistics and Machine LearningR, Statistics and Machine Learning"
[3] "R, Statistics and Machine LearningR, Statistics and Machine LearningR, Statistics and Machine Learning"
str_detect
The str_detect() function returns TRUE if the pattern matches the given input string, and FALSE otherwise. You can wrap the pattern in regex() to control how it is matched. The negate argument, if set to TRUE, inverts the result so that TRUE marks the elements that do not match. Examples to demonstrate this function are given below:
> str_detect(t, "a")
[1] TRUE
> str_detect(t, "[ae]")
[1] TRUE
> str_detect(t, "^s")
[1] FALSE
> str_detect(t, "g$")
[1] TRUE
str_conv
The str_conv() function reinterprets a string using the encoding you specify, which is useful when the input is not in the default encoding. The result is returned as UTF-8. The syntax is as follows:
str_conv(string, encoding)
Examples that use the ISO-8859-1 encoding are given below for reference:
> str_conv("\xa9", "ISO-8859-1")
[1] "©"
> str_conv("\xbc", "ISO-8859-1")
[1] "¼"
> str_conv("\xbd", "ISO-8859-1")
[1] "½"
> str_conv("\xbe", "ISO-8859-1")
[1] "¾"
str_equal
You can check whether two strings are equal according to Unicode rules with the str_equal() function, which treats different representations of the same character as equal. It accepts the following arguments:
Argument | Description |
x | A character vector |
y | Another character vector |
locale | ‘en’ for English |
ignore_case | Boolean value to ignore case |
A couple of examples are shown below:
> str_equal("hello", "hi")
[1] FALSE
> str_equal("\u1342", "\u1342")
[1] TRUE
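The difference from a plain == comparison shows up when the same character is written in two different ways. The sketch below assumes stringr 1.5.0 or later (where str_equal() is available); the variable names are illustrative.

```r
library(stringr)

# Two ways of writing "á": a single precomposed code point, and
# the letter "a" followed by a combining acute accent.
precomposed <- "\u00e1"
combined <- "a\u0301"

same_by_operator <- precomposed == combined           # plain comparison
same_by_str_equal <- str_equal(precomposed, combined)  # Unicode-aware comparison

# ignore_case = TRUE additionally disregards differences in case.
same_ignoring_case <- str_equal("Hello", "hello", ignore_case = TRUE)

print(same_by_operator)
print(same_by_str_equal)
print(same_ignoring_case)
```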
str_like
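The str_like() function matches strings against a pattern using the conventions of the SQL LIKE operator: % matches any number of characters, _ matches a single character, and the pattern must match the entire string. The syntax is as follows:
str_like(string, pattern, ignore_case)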
A few examples are given below:
> vowels <- c("a", "e", "i", "o", "u")
> str_like(vowels, "a")
[1]  TRUE FALSE FALSE FALSE FALSE
> str_like(t, "Mach")
[1] FALSE
> str_like(t, "%R%")
[1] TRUE
str_match
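The str_match() function returns the first match of a regular expression in each string as a character matrix, with the complete match in the first column. The regular expression syntax is described in vignette("regular-expressions") and is implemented by the stringi package. A couple of examples are given below: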
> str_match(t, "[a-z]+")
     [,1]
[1,] "tatistics"
> str_match(t, "[a-zA-Z]+")
     [,1]
[1,] "R"
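The matrix form becomes more useful when the pattern contains capture groups, since each group is returned in its own column after the complete match. A small sketch, using the release date from the R version banner as an illustrative input:

```r
library(stringr)

release_date <- "2021-11-01"  # illustrative input

# Three capture groups: year, month and day.
parts <- str_match(release_date, "(\\d{4})-(\\d{2})-(\\d{2})")

# Column 1 holds the complete match; columns 2 to 4 hold the groups.
year <- parts[1, 2]
month <- parts[1, 3]
day <- parts[1, 4]

print(parts)
```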
str_extract
The str_extract() function returns the first substring that matches a pattern, or NA if there is no match. The str_extract_all() variant returns every match. They accept the following arguments:
Argument | Description |
string | Input vector |
pattern | Regular expression |
group | Return specified matched group |
simplify | For str_extract_all() only: TRUE returns a character matrix, FALSE (default) returns a list of character vectors |
A few examples are given below:
> str_extract(t, "\\d")
[1] NA
> str_extract(t, "and")
[1] "and"
> str_extract(t, "[a-z]+")
[1] "tatistics"
> str_extract(t, "[a-zA-Z]+")
[1] "R"
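To retrieve every match rather than only the first, str_extract_all() can be used. The following sketch shows the effect of its simplify argument; the variable names are illustrative.

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Default: a list with one character vector per input string.
capitals_list <- str_extract_all(t, "[A-Z]")

# simplify = TRUE: a character matrix with one row per input string.
capitals_matrix <- str_extract_all(t, "[A-Z]", simplify = TRUE)

print(capitals_list)
print(capitals_matrix)
```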
str_flatten
You can collapse a character vector into a single string using the str_flatten() function. It takes the following arguments:
Argument | Description |
string | Input vector |
collapse | String to insert between elements |
last | Optional string for the final separator |
na.rm | Boolean to handle missing values |
A few examples given below illustrate this function:
> vowels <- c("a", "e", "i", "o", "u")
> str_flatten(vowels)
[1] "aeiou"
> str_flatten(vowels[1:3], "-")
[1] "a-e-i"
> str_flatten_comma(vowels)
[1] "a, e, i, o, u"
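The last and na.rm arguments listed in the table can be sketched as follows, assuming stringr 1.5.0 or later; the separator strings and variable names are illustrative choices.

```r
library(stringr)

vowels <- c("a", "e", "i", "o", "u")

# 'last' replaces the final separator only.
with_last <- str_flatten(vowels[1:3], collapse = ", ", last = " and ")

# A missing value makes the whole result NA unless na.rm = TRUE.
with_na <- str_flatten(c("a", NA, "e"))
without_na <- str_flatten(c("a", NA, "e"), na.rm = TRUE)

print(with_last)   # "a, e and i"
print(with_na)     # NA
print(without_na)  # "ae"
```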
str_locate
The str_locate() function returns the beginning and end position of the first pattern match for a given input string. The str_locate_all() function returns all matching occurrences. A few examples are given below:
> str_locate(t, "a")
     start end
[1,]     6   6
> str_locate(t, "$")
     start end
[1,]    35  34
> str_locate_all(t, "a")
[[1]]
     start end
[1,]     6   6
[2,]    15  15
[3,]    20  20
[4,]    29  29
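The positions returned by str_locate() can be passed on to other functions. As a minimal sketch, the start and end columns are used here with str_sub() from stringr to pull the matched word back out of the string:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Locate the first occurrence of the word.
loc <- str_locate(t, "Machine")

# Use the start and end positions to extract the same substring.
word <- str_sub(t, loc[1, "start"], loc[1, "end"])

print(loc)
print(word)  # "Machine"
```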
str_sort
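You can sort a character vector using the str_sort() function. The related str_order() and str_rank() functions return the ordering and the ranks instead of the sorted values. str_sort() takes the following arguments: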
Argument | Description |
x | Character vector |
decreasing | TRUE sorts from highest to lowest; FALSE (default) sorts from lowest to highest |
na_last | Boolean to handle NA values |
numeric | Sort digits numerically |
A couple of examples are given below:
> str_sort(vowels)
[1] "a" "e" "i" "o" "u"
> f <- c("beta", "alpha", "gamma", "delta")
> str_sort(f)
[1] "alpha" "beta"  "delta" "gamma"
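The decreasing and numeric arguments from the table, together with str_order(), can be sketched as follows. The file names are hypothetical values chosen to show the effect of numeric sorting.

```r
library(stringr)

f <- c("beta", "alpha", "gamma", "delta")

# Reverse alphabetical order.
sorted_desc <- str_sort(f, decreasing = TRUE)

# Positions of the elements of f in sorted order.
ordering <- str_order(f)

# Hypothetical file names with embedded numbers.
files <- c("file10", "file2", "file1")
sorted_text <- str_sort(files)                     # digits compared as characters
sorted_numeric <- str_sort(files, numeric = TRUE)  # digits compared as numbers

print(sorted_desc)
print(ordering)
print(sorted_text)
print(sorted_numeric)
```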
str_remove
The str_remove() function removes the first piece of text that matches a pattern in an input string. It accepts two arguments: an input character vector and a pattern. Examples of the use of this function are given below:
> str_remove(t, "\\s")
[1] "R,Statistics and Machine Learning"
> str_remove(t, "[aeiou]")
[1] "R, Sttistics and Machine Learning"
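Since str_remove() stops after the first match, the companion function str_remove_all() is the one to use when every match should go. A short sketch on the same input string:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Remove every whitespace character.
no_spaces <- str_remove_all(t, "\\s")

# Remove every lower-case vowel.
no_vowels <- str_remove_all(t, "[aeiou]")

print(no_spaces)  # "R,StatisticsandMachineLearning"
print(no_vowels)  # "R, Sttstcs nd Mchn Lrnng"
```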
str_replace
The str_replace() function replaces the first occurrence of the pattern with the replacement string. The syntax usage is as follows:
str_replace(string, pattern, replacement)
A couple of examples are as follows:
> str_replace(t, "\\s", "-")
[1] "R,-Statistics and Machine Learning"
> str_replace(t, "[aeiou]", " ")
[1] "R, St tistics and Machine Learning"
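Two common extensions are sketched below: str_replace_all(), which replaces every match rather than the first, and backreferences such as \\1 in the replacement, which refer to capture groups in the pattern. The second input string is an illustrative fragment of the title.

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Replace every whitespace character with a hyphen.
hyphenated <- str_replace_all(t, "\\s", "-")

# Swap two words by referring to the capture groups as \\1 and \\2.
swapped <- str_replace("Machine Learning", "(\\w+) (\\w+)", "\\2 \\1")

print(hyphenated)  # "R,-Statistics-and-Machine-Learning"
print(swapped)     # "Learning Machine"
```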
str_split
You can split a string into multiple segments using the str_split() function, which has several variants.
The str_split() function accepts a character vector and returns a list, as shown below:
> str_split(t, " ")
[[1]]
[1] "R,"         "Statistics" "and"        "Machine"    "Learning"
The str_split_1() function takes a single string and returns a character vector. For example:
> str_split_1(t, "and")
[1] "R, Statistics "    " Machine Learning"
The str_split_fixed() function accepts a character vector and the number of pieces to return, and produces a character matrix with that many columns. A couple of examples are given below:
> str_split_fixed(t, " ", 2)
     [,1] [,2]
[1,] "R," "Statistics and Machine Learning"
> str_split_fixed(t, " ", 3)
     [,1] [,2]         [,3]
[1,] "R," "Statistics" "and Machine Learning"
The str_split_i() function takes a character vector and an index i, and returns a character vector containing the i-th piece of each split string. A few examples are given below to demonstrate this function:
> str_split_i(t, " ", 1)
[1] "R,"
> str_split_i(t, " ", 2)
[1] "Statistics"
> str_split_i(t, " ", 3)
[1] "and"
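Two further options are sketched here, assuming stringr 1.5.0 or later: the n argument of str_split(), which limits the number of pieces, and a negative index in str_split_i(), which counts from the right-hand side.

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Split into at most two pieces; the remainder stays in the last piece.
two_pieces <- str_split(t, " ", n = 2)

# A negative index counts from the end, so -1 is the last piece.
last_word <- str_split_i(t, " ", -1)

print(two_pieces)
print(last_word)  # "Learning"
```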
You are encouraged to read the stringr reference manual to learn more about its functions, arguments, options, and usage.