In this twenty-second article in the R, Statistics and Machine Learning series, we will continue on the exploration of handling text using the ‘dplyr’ package. This package is a grammar for data manipulation, and provides simple and fast functions to handle data frame-like objects.
We will use R version 4.2.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.
$ R --version R version 4.2.2 (2022-10-31) -- “Innocent and Trusting” Copyright (C) 2022 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under the terms of the GNU General Public License versions 2 or 3. For more information about these matters see https://www.gnu.org/licenses/.
You can install and load the ‘dplyr’ package using the following commands:
> install.packages(“dplyr”) Installing package into ‘/home/shakthi/R/x86_64-pc-linux-gnu-library/4.1’ (as ‘lib’ is unspecified) --- Please select a CRAN mirror for use in this session --- ... * copying figures * building package indices * installing vignettes * testing if installed package can be loaded from temporary location * checking absolute paths in shared objects and dynamic libraries * testing if installed package can be loaded from final location * testing if installed package keeps a record of temporary installation path * DONE (dplyr) > library(dplyr) Attaching package: ‘dplyr’
Consider the ‘Bank Marketing Data Set’ available from the UCI Machine Learning Repository available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data set is from a Portuguese banking institution and is available freely for public research use. There are four data sets available, and we will use the read.csv() function to import the data from a bank.csv file into a data frame, as shown below:
> bank <- read.csv(file=”bank.csv”, sep=”;”) > bank[1:3,] age job marital education default balance housing loan contact day 1 30 unemployed married primary no 1787 no no cellular 19 2 33 services married secondary no 4789 yes yes cellular 11 3 35 management single tertiary no 1350 yes no cellular 16 month duration campaign pdays previous poutcome y 1 oct 79 1 -1 0 unknown no 2 may 220 1 339 4 failure no 3 apr 185 1 330 1 failure no
select
You can use the select() function to choose specific columns from the data frame. In the following example, we select the ‘age’, ‘job’ and ‘description’ fields from the data set:
> bank %>% select(age, job, education) age job education 1 30 unemployed primary 2 33 services secondary 3 35 management tertiary 4 30 management tertiary 5 59 blue-collar secondary 6 35 management tertiary.
filter
The filter() function is used to produce a subset of the data that matches the input conditions. The bank entries with ‘blue-collar’ jobs alone can be listed as follows:
> bank %>% filter(job == “blue-collar”) age job marital education default balance housing loan contact 1 59 blue-collar married secondary no 0 yes no unknown 2 31 blue-collar married secondary no 360 yes yes cellular 3 25 blue-collar single primary no -221 yes no unknown 4 55 blue-collar married primary no 627 yes no unknown 5 32 blue-collar married secondary no 2089 yes no cellular ...
arrange
The data from selected columns can be sorted using the arrange() function. In the following example, the bank data is first sorted by ‘age’ and then by the ‘balance’ column:
> bank %>% arrange(age, balance) age job marital education default balance housing loan 1 19 student single unknown no 0 no no 2 19 student single primary no 103 no no 3 19 student single secondary no 302 no no 4 19 student single unknown no 1169 no no 5 20 student single secondary no 291 no no 6 20 student single secondary no 502 no no 7 20 student single secondary no 1191 no no 8 21 student single secondary no 6 no no
relocate
The relocate() function is used to change the column positions in the output. You can use the ‘.before’ and ‘.after’ arguments to specify the location of the columns. For example:
> bank %>% relocate(age, .after = “job”) job age marital education default balance housing loan 1 unemployed 30 married primary no 1787 no no 2 services 33 married secondary no 4789 yes yes 3 management 35 single tertiary no 1350 yes no 4 management 30 married tertiary no 1476 yes yes 5 blue-collar 59 married secondary no 0 yes no 6 management 35 single tertiary no 747 no no ...
count
The unique values for a column can be computed using the count() function, as shown below:
> bank %>% count(age) age n 1 19 4 2 20 3 3 21 7 4 22 9 5 23 20 6 24 24 7 25 44 8 26 77 9 27 94
tally
The tally() method is a low-level function that can also be used to count unique values for a column. The total number of bank entries in the CSV file is 4521, as indicated below:
> bank %>% tally() n 1 4521
distinct
The unique rows in a data frame can be obtained using the distinct() function. The three possible values for marital status are shown below:
> bank %>% distinct(marital) marital 1 married 2 single 3 divorced
mutate
The mutate() function is used to create new columns from existing data. The balance field is in Euros, and a new USD column is computed based on the Euro-USD conversion rate as follows:
> bank %>% select(age, education, balance) %>% mutate(usd = balance * 1.08) age education balance usd 1 30 primary 1787 1929.96 2 33 secondary 4789 5172.12 3 35 tertiary 1350 1458.00 4 30 tertiary 1476 1594.08 5 59 secondary 0 0.00 6 35 tertiary 747 806.76 ...
pull
The pull() function accepts a numeric argument for the column number and returns its values. The last column in the data set is indexed at ‘-1’ and corresponds to whether the client had subscribed to a term deposit.
> bank %>% pull(-1) [1] “no” “no” “no” “no” “no” “no” “no” “no” “no” “no” “no” “no” [13] “no” “yes” “no” “no” “no” “no” “no” “no” “no” “no” “no” “no” [25] “no” “no” “no” “no” “no” “no” “yes” “no” “no” “yes” “yes” “no” [37] “yes” “yes” “yes” “no” “no” “no” “no” “no” “no” “no” “no” “no”
group_by
The data set can be categorised using the group_by() function, as demonstrated below:
> bank %>% group_by(age) # A tibble: 4,521 × 17 # Groups: age [67] age job marital education default balance housing loan contact day <int> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> 1 30 unemploy… married primary no 1787 no no cellul… 19 2 33 services married secondary no 4789 yes yes cellul… 11 3 35 manageme… single tertiary no 1350 yes no cellul… 16 4 30 manageme… married tertiary no 1476 yes yes unknown 3 5 59 blue-col… married secondary no 0 yes no unknown 5 6 35 manageme… single tertiary no 747 no no cellul… 23 7 36 self-emp… married tertiary no 307 yes no cellul… 14 8 39 technici… married secondary no 147 yes no cellul… 6 9 41 entrepre… married tertiary no 221 yes no unknown 14 10 43 services married primary no -88 yes yes cellul… 17 # ℹ 4,511 more rows
summarise
You can create a new data frame by also grouping variables using the summarise() function. The summarize() name can also be used instead.
> bank %>% summarise(age, balance) age balance 1 30 1787 2 33 4789 3 35 1350 4 30 1476 5 59 0
glimpse
The entire data frame can be transposed using the glimpse() function, which shows every column as a row. It tries to show as much of the data as possible in the output.
> bank %>% glimpse() Rows: 4,521 Columns: 17 $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, … $ job <chr> “unemployed”, “services”, “management”, “management”, “blue-… $ marital <chr> “married”, “married”, “single”, “married”, “married”, “singl… $ education <chr> “primary”, “secondary”, “tertiary”, “tertiary”, “secondary”,… $ default <chr> “no”, “no”, “no”, “no”, “no”, “no”, “no”, “no”, “no”, “no”, … $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
slice
You can select the first three rows by specifying them with the slice() function. It allows you to select, filter and show duplicate rows as well. For example:
> bank %>% slice(1:3) age job marital education default balance housing loan contact day 1 30 unemployed married primary no 1787 no no cellular 19 2 33 services married secondary no 4789 yes yes cellular 11 3 35 management single tertiary no 1350 yes no cellular 16 month duration campaign pdays previous poutcome y 1 oct 79 1 -1 0 unknown no 2 may 220 1 339 4 failure no 3 apr 185 1 330 1 failure no desc The desc() function is used to sort a column in the descending order. The bank balances can be displayed from the highest to the lowest values, as shown below: > bank %>% arrange(desc(balance)) age job marital education default balance housing loan 1 60 retired married primary no 71188 no no 2 42 entrepreneur married tertiary no 42045 no no 3 43 technician single tertiary no 27733 yes no 4 36 management married tertiary no 27359 yes no 5 57 technician married tertiary no 27069 no yes 6 31 housemaid single primary no 26965 no no ...
nth
A specific row from the data set can be retrieved using the nth() function. The 10th entry in the bank data frame is shown below:
> bank %>% nth(10) age job marital education default balance housing loan contact day 1 43 services married primary no -88 yes yes cellular 17 month duration campaign pdays previous poutcome y 1 apr 313 1 147 2 failure no
You are encouraged to read the dplyr package manual to learn more functions, arguments and usage.