In this fifth article in the ‘R, Statistics and Machine Learning’ series, we shall learn the various R functions that are available to combine, modify, select and apply functions on data.
We will be using R version 4.1.0 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets given here.
$ R --version R version 4.1.0 (2021-05-18) -- “Camp Pontanezen” Copyright (C) 2021 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) R is free software and comes with absolutely no warranty.
You are welcome to redistribute it under the terms of the GNU General Public License versions 2 or 3. For more information about these matters, see https://www.gnu.org/licenses/.
> a <- c(“1.”, “2.”, “3.”, “4.”) > b <- c(“Chapter: Introduction”, “Chapter: Trees”, “Chapter: Graphs”, “Chapter: Networks”) > paste(a, b) [1] “1. Chapter: Introduction” “2. Chapter: Trees” [3] “3. Chapter: Graphs” “4. Chapter: Networks”
You can also pass a separator string to combine the character vectors. In the following example, we use the ‘.’ (dot) to combine strings.
You can also pass a separator string to combine the character vectors. In the following example, we use the ‘.’ (dot) to combine strings. > a <- c(“1”, “2”, “3”, “4”) > b <- c(“Chapter: Introduction”, “Chapter: Trees”, “Chapter: Graphs”, “Chapter: Networks”) > paste(a, b, sep=”. “) [1] “1. Chapter: Introduction” “2. Chapter: Trees” [3] “3. Chapter: Graphs” “4. Chapter: Networks”
Consider the mtcars data set available in R. You can add a column to the data frame or matrix, using the cbind() function, as shown below:
> head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 . 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 > data <- head(mtcars) > year <- c(1970, 1970, 1973, 1974, 1977, 1962) > new_mtcars <- cbind(data, year) > new_mtcars mpg cyl disp hp drat wt qsec vs am gear carb year Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1970 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1970 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1973 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1974 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1977 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1962
If you would like to add a new row to the data, use the rbind() function, as follows:
> v <- data.frame(21.4, 4, 121.0, 109, 4.11, 2.780, 18.60, 1, 1, 4, 2, 1966) > names(v) <- c(“mpg”, “cyl”, “disp”, “hp”, “drat”, “wt”, “qsec”, “vs”, “am”, “gear”, “carb”, “year”) > rownames(v) <- c(“Volvo 142E”) > rbind(new_mtcars, v) mpg cyl disp hp drat wt qsec vs am gear carb year Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1970 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1970 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1973 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1974 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1977 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1962 Volvo 142E 21.4 4 121 109 4.11 2.780 18.60 1 1 4 2 1966
There also exists the merge() function that combines two data sets. In the following example, the ‘data’ and ‘new_mtcars’ data sets are merged and sorted on the mpg column in descending order. The merge function uses common variables for its operation.
> merge(data, new_mtcars) mpg cyl disp hp drat wt qsec vs am gear carb year 1 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1962 2 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1977 3 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1970 4 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1970 5 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1974 6 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1973
The merge() function allows the following arguments:
- x: A data frame to combine with.
- y: Another data frame to combine with.
- by: The vector of column names for merge.
- by.x: The column names in x to be used for combining the data.
- by.y: The column names in y to be used for the merge operation.
- sort: A Boolean value on whether to sort the results or not.
- incomparables: A list of variables that cannot be compared.
The intersect() function returns the common variables between two vectors. For example:
> intersect(data, new_mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
You can combine vectors or data frames into a single data frame using the make.groups() function, as shown below:
> make.groups(new_mtcars$gear, new_mtcars$cyl) data which new_mtcars$gear1 4 new_mtcars$gear new_mtcars$gear2 4 new_mtcars$gear new_mtcars$gear3 4 new_mtcars$gear new_mtcars$gear4 3 new_mtcars$gear new_mtcars$gear5 3 new_mtcars$gear new_mtcars$gear6 3 new_mtcars$gear new_mtcars$cyl1 6 new_mtcars$cyl new_mtcars$cyl2 6 new_mtcars$cyl new_mtcars$cyl3 4 new_mtcars$cyl new_mtcars$cyl4 6 new_mtcars$cyl new_mtcars$cyl5 8 new_mtcars$cyl new_mtcars$cyl6 6 new_mtcars$cyl
Transform
Computation can be performed to update existing values or add new data. For example, the qsec column in the mtcars data set represents the time in seconds to reach a quarter mile. You can convert the data into minutes using the following calculation:
> data$minsec <- (data$qsec / 60) > data mpg cyl disp hp drat wt qsec vs am gear carb minsec Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.2743333 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.2836667 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.3101667 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.3240000 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.2836667 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0.3370000
The transform() function also exists to change variables in a data frame. You can update multiple columns with the transform() function, as shown below:
> transform(new_mtcars, wt=wt*0.4545, year=2021-year) mpg cyl disp hp drat wt qsec vs am gear carb year Mazda RX4 21.0 6 160 110 3.90 1.190790 16.46 0 1 4 4 51 Mazda RX4 Wag 21.0 6 160 110 3.90 1.306688 17.02 0 1 4 4 51 Datsun 710 22.8 4 108 93 3.85 1.054440 18.61 1 1 4 1 48 Hornet 4 Drive 21.4 6 258 110 3.08 1.461218 19.44 1 0 3 1 47 Hornet Sportabout 18.7 8 360 175 3.15 1.563480 17.02 0 0 3 2 44 Valiant 18.1 6 225 105 2.76 1.572570 20.22 1 0 3 1 59
The apply() function can be used on an array or matrix. It takes three arguments — an array, dimensions, and a function. The function is applied to the specific dimensions of the array or matrix. In the following example, the rows (MARGIN=1) represent the first dimension while the columns (MARGIN=2) represent the second dimension. The max() and min() functions are used to return the maximum and minimum values respectively.
> a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9), dim=c(3, 3)) > a [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 > apply(X=a, MARGIN=1, FUN=max) [1] 7 8 9 > apply(X=a, MARGIN=2, FUN=max) [1] 3 6 9 > apply(X=a, MARGIN=1, FUN=min) [1] 1 2 3 > apply(X=a, MARGIN=2, FUN=min) [1] 1 4 7
If you would like to return a list, you can use the lapply() function to a vector or list, as shown below:
> l = list(1, 3, 5, 7) > lapply(l, function(x) x*x) [[1]] [1] 1 [[2]] [1] 9 [[3]] [1] 25 [[4]] [1] 49
On the other hand, if you would like to return a vector, matrix or array, you can use the sapply() function.
> sapply(l, FUN=function(x) x*x) [1] 1 9 25 49
The mapply() function is a multivariate version of the sapply() function. For example:
> mapply(paste, c(“1”, “2”, “3”, “4”), + c(“. “, “. “, “. “, “. “), + c(“Chapter: Introduction”, “Chapter: Trees”, “Chapter: Graphs”, “Chapter: Networks”)) 1 2 “1 . Chapter: Introduction” “2 . Chapter: Trees” 3 4 “3 . Chapter: Graphs” “4 . Chapter: Networks” >
The following arguments are supported by the mapply() function.
- FUN: The function to be applied to the data.
- …: A list of vectors on which the function should be applied.
- MoreArgs: Additional arguments to the function.
- SIMPLIFY: A Boolean value on whether to simplify the result.
- USE.NAMES: A Boolean value on whether to use names for the values.
Select
The shingle() function can group data into bins. You will need to load the lattice library in order to use the function. For example, the various cars in ‘new_mtcars’ are grouped based on 4, 6, or 8 cylinders, as indicated below:
> library(lattice) > shingle(new_mtcars$cyl) Data: [1] 6 6 4 6 8 6 Intervals: min max count 1 4 4 1 2 6 6 4 3 8 8 1 Overlap between adjacent intervals: [1] 0 0
A list of discrete factors is returned from a continuous numerical vector using the cut() function. The horsepower in new_mtcars data is broken into two groups, as follows:
> cut(new_mtcars$hp,breaks=2) [1] (92.9,134] (92.9,134] (92.9,134] (92.9,134] (134,175] (92.9,134] Levels: (92.9,134] (134,175]
The cut() function accepts the following arguments:
- x: A numeric vector.
- breaks: The number of points to split the data into.
- labels: The labels for the factor levels.
- include.lowest: A Boolean value on whether to include the smallest value in the bin.
- right: A Boolean value on whether the interval should be open on the left and closed on the right.
The bracket notation can be used on the data frame to filter the results. For example, the cars that have more than five cylinders are listed below:
mpg cyl disp hp drat wt qsec vs am gear carb year Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1970 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1970 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1974 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1977 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1962
You can also use the subset() function to select the above data:
> subset(new_mtcars, cyl>5) mpg cyl disp hp drat wt qsec vs am gear carb year Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1970 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1970 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1974 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 1977 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1962
A random sample from the data set can be obtained using the sample() function. For example:
> sample(new_mtcars, 2) qsec cyl Mazda RX4 16.46 6 Mazda RX4 Wag 17.02 6 Datsun 710 18.61 4 Hornet 4 Drive 19.44 6 Hornet Sportabout 17.02 8 Valiant 20.22 6 > sample(new_mtcars, 2) disp mpg Mazda RX4 160 21.0 Mazda RX4 Wag 160 21.0 Datsun 710 108 22.8 Hornet 4 Drive 258 21.4 Hornet Sportabout 360 18.7 Valiant 225 18.1
Summary
Consider the ‘Bank Marketing Data Set’ available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data is from a Portuguese banking institution and is available freely for public research use. There are four data sets available, and we will use the read.csv() function to import the data from a ‘bank.csv’ file into a data frame.
> bank <- read.csv(file=”bank.csv”, sep=”;”) > bank[1:3,] age job marital education default balance housing loan contact day 1 30 unemployed married primary no 1787 no no cellular 19 2 33 services married secondary no 4789 yes yes cellular 11 3 35 management single tertiary no 1350 yes no cellular 16 month duration campaign pdays previous poutcome y 1 oct 79 1 -1 0 unknown no 2 may 220 1 339 4 failure no 3 apr 185 1 330 1 failure no
The tapply() function can be used to provide the summary of the bank balances for the various job categories, as shown below:
> tapply(X=bank$balance, INDEX=list(bank$job), FUN=sum) admin. blue-collar entrepreneur housemaid management 586380 1026563 276381 233386 1712154 retired self-employed services student technician 533414 254811 460350 129681 1022205 unemployed unknown 139446 57065
You can also use the aggregate() function to produce the above result, and it is more suited for time-series data.
> aggregate(x=bank$balance, by=list(bank$job), FUN=sum) Group.1 x 1 admin. 586380 2 blue-collar 1026563 3 entrepreneur 276381 4 housemaid 233386 5 management 1712154 6 retired 533414 7 self-employed 254811 8 services 460350 9 student 129681 10 technician 1022205 11 unemployed 139446 12 unknown 57065
The aggregate() function accepts the following arguments:
- x: The object to apply the summary on.
- by: A list of elements to be categorised.
- FUN: The function to use to compute the statistic.
- nfrequency: The number of observations per unit time.
- …: Any arguments passed to the function.
The rowsum() function can sum the variables in a data set. For example, the sum of the ‘day’ column grouped by the job category is shown below:
> rowsum(x=bank$day, group=bank$job) [,1] admin. 7803 blue-collar 14646 entrepreneur 2563 housemaid 1713 management 15751 retired 3578 self-employed 2961 services 6470 student 1377 technician 12429 unemployed 2060 unknown 602
You can count the number of observations for each value using the tabulate() function, as follows:
> tabulate(bank$previous) [1] 286 193 113 78 47 25 22 18 10 4 3 5 1 2 1 0 1 1 1 [20] 1 0 1 1 1 1
The table() function can also provide a count of the observed data based on categories. For example, for the bank data set, we see the classification of people based on their marital status.
> table(bank$marital) divorced married single 528 2797 1196
The duplicated() and unique() functions show any repetitive and distinct values for the data. A couple of examples are shown below:
> duplicated(new_mtcars$cyl) [1] FALSE TRUE FALSE TRUE FALSE TRUE > unique(bank$job) [1] “unemployed” “services” “management” “blue-collar” [5] “self-employed” “technician” “entrepreneur” “admin.” [9] “student” “housemaid” “retired” “unknown” #+END_SRC
The summary results can be sorted using the sort() function.
> r <- rowsum(x=bank$day, group=bank$job) > r [,1] admin. 7803 blue-collar 14646 entrepreneur 2563 housemaid 1713 management 15751 retired 3578 self-employed 2961 services 6470 student 1377 technician 12429 unemployed 2060 unknown 602 > sort(r) [1] 602 1377 1713 2060 2563 2961 3578 6470 7803 12429 14646 15751
You can also specify the ordering required for a specific column in the data using the order() function:
> data[order(data$hp), ] mpg cyl disp hp drat wt qsec vs am gear carb minsec Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.3101667 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0.3370000 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.2743333 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.2836667 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.3240000 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.2836667
You are encouraged to read the R documentation for the above functions and try it out on your data sets.