In this seventeenth article in the R series, we shall learn about linear regression in R.
In statistics, regression is a way to model the relationship between two variables. We will use R version 4.2.1 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.
$ R --version
R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
The mathematical equation for a linear regression between two variables ‘x’ and ‘y’ is defined as ‘y = ax + b’, where ‘a’ and ‘b’ are constants.
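The two constants can also be computed directly: least squares gives the slope as cov(x, y)/var(x) and the intercept as mean(y) minus the slope times mean(x). A minimal sketch, using the same two vectors that are fitted below:

```r
# Hand-computed least-squares estimates for y = ax + b
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)

a <- cov(x, y) / var(x)      # slope: 1 for these data
b <- mean(y) - a * mean(x)   # intercept: 1 for these data

c(intercept = b, slope = a)
```

These values agree with the coefficients reported by lm() for the same vectors.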
The lm() function is used to fit linear models. Consider two vectors ‘x’ and ‘y’ that hold the first five odd and even numbers, respectively. We can fit a regression equation between them using the lm() function, as follows:
> x <- c(1, 3, 5, 7, 9)
> y <- c(2, 4, 6, 8, 10)
> r <- lm(y~x)
> print(r)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
          1            1
The lm() function accepts the following arguments:
Argument | Description
formula | symbolic description of the model
data | data frame, list or environment
subset | optional vector of observations to be used in the fit
weights | optional vector of weights
na.action | function specifying how missing values are handled
method | only 'qr' is supported
singular.ok | logical value; if FALSE, a singular fit is an error
contrasts | optional list of contrasts
offset | a priori known component of the predictor
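As an illustration, the subset and weights arguments can restrict or weight the observations entering the fit; a brief sketch using the same vectors (the weights here are arbitrary):

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)

# Fit using only the first four observations
r1 <- lm(y ~ x, subset = 1:4)

# Weighted least squares with user-supplied weights
w <- c(1, 1, 2, 2, 3)
r2 <- lm(y ~ x, weights = w)

coef(r1)
coef(r2)
```

Because these data lie exactly on a line, both fits recover the same coefficients.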
A more detailed output on the relation ‘r’ can be printed using the summary() function, as shown below:
> summary(r)

Call:
lm(formula = y ~ x)

Residuals:
         1          2          3          4          5
 4.367e-16 -8.276e-16  3.028e-16  1.303e-16 -4.222e-17

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)
(Intercept) 1.000e+00  5.207e-16 1.920e+15   <2e-16 ***
x           1.000e+00  9.065e-17 1.103e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.733e-16 on 3 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1
F-statistic: 1.217e+32 on 1 and 3 DF,  p-value: < 2.2e-16
We can use the predict() function to find the possible value of ‘y’ for a given ‘x’. For example:
> v <- data.frame(x = 11)
> p <- predict(r, v)
> p
 1
12
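The predict() function also accepts several new values at once, and its interval argument can request confidence bounds on the predictions. A short sketch:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

# Predictions for several new values of x at once
v <- data.frame(x = c(11, 13, 15))
p <- predict(r, v)
p    # 12, 14 and 16

# interval = "confidence" adds lower and upper bounds
# (degenerate here, since this fit is exact)
predict(r, v, interval = "confidence")
```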
The plot() function can be used to display the data, with the fitted regression line added by abline(), as illustrated below:

> plot(x, y, col = "red", main = "Odd and Even Numbers")
> abline(r)
The normal QQ plot for the residual values can be displayed using the qqnorm() and qqline() functions, as shown below:
> qqnorm(r$resid)
> qqline(r$resid)
The kernel density estimates on the residuals can also be graphed using the plot() function, as shown in the figure.
> plot(density(r$resid))
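In addition, calling plot() directly on the fitted object cycles through R's standard diagnostic plots: residuals versus fitted values, a normal Q-Q plot, a scale-location plot and residuals versus leverage. A sketch:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

# Arrange the four standard lm diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(r)
```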
Calling the ls() function on the fitted object lists the various components of ‘r’. For example:
> ls(r)
 [1] "assign"        "call"          "coefficients"  "df.residual"
 [5] "effects"       "fitted.values" "model"         "qr"
 [9] "rank"          "residuals"     "terms"         "xlevels"
The components are briefly described below:
Component | Description
coefficients | named vector of coefficients
residuals | response minus fitted values
fitted.values | the fitted mean values
rank | numeric rank of the fitted model
df.residual | residual degrees of freedom
call | the matched call
terms | the 'terms' object used
model | the model frame used
assign | integer vector relating coefficients to model terms
effects | named numeric vector of orthogonal effects
qr | QR decomposition of the model matrix
xlevels | levels of the factors used in fitting
> r$coefficients
(Intercept)           x
          1           1
> r$assign
[1] 0 1
> r$effects
  (Intercept)             x
-1.341641e+01  6.324555e+00  0.000000e+00 -4.440892e-16 -8.881784e-16
> r$rank
[1] 2
> r$call
lm(formula = y ~ x)
> r$fitted.values
 1  2  3  4  5
 2  4  6  8 10
> r$residuals
            1             2             3             4             5
 4.367205e-16 -8.275904e-16  3.027995e-16  1.302899e-16 -4.221962e-17
> r$model
   y x
1  2 1
2  4 3
3  6 5
4  8 7
5 10 9
> r$terms
y ~ x
attr(,"variables")
list(y, x)
attr(,"factors")
  x
y 0
x 1
attr(,"term.labels")
[1] "x"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(y, x)
attr(,"dataClasses")
        y         x
"numeric" "numeric"
> r$df.residual
[1] 3
> r$qr
$qr
  (Intercept)           x
1  -2.2360680 -11.1803399
2   0.4472136   6.3245553
3   0.4472136  -0.1954395
4   0.4472136  -0.5116673
5   0.4472136  -0.8278950
attr(,"assign")
[1] 0 1

$qraux
[1] 1.447214 1.120788

$pivot
[1] 1 2

$tol
[1] 1e-07

$rank
[1] 2

attr(,"class")
[1] "qr"

> r$xlevels
named list()
The confint() function computes the confidence intervals for the parameters in the fitted model. For example:
> confint(r)
            2.5 % 97.5 %
(Intercept)     1      1
x               1      1
The confint() function accepts the following arguments:
Argument | Description
object | the fitted model object
parm | vector of numbers or names specifying the parameters
level | the confidence level required
... | additional arguments
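For instance, the parm and level arguments restrict the output to chosen parameters at a different confidence level. A sketch:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

# 99% confidence interval for the slope coefficient only
ci <- confint(r, parm = "x", level = 0.99)
ci
```

Since this fit is exact, the interval collapses to the point estimate of 1.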
The analysis of variance tables for the fitted model can be obtained using the anova() function, as follows:
> anova(r)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq    F value    Pr(>F)
x          1     40      40 1.2169e+32 < 2.2e-16 ***
Residuals  3      0       0
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
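The F value in the ANOVA table is simply the regression mean square divided by the residual mean square, which can be verified by hand:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

a <- anova(r)

# F = (regression mean square) / (residual mean square)
f <- a["x", "Mean Sq"] / a["Residuals", "Mean Sq"]
f
a["x", "F value"]   # the same value, as reported by anova()
```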
Consider the mtcars data set from the built-in datasets package, which is loaded by default:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We can fit a regression between the number of cylinders and horsepower using the lm() function, as follows:
> m <- lm(cyl~hp, data=mtcars)
> print(m)

Call:
lm(formula = cyl ~ hp, data = mtcars)

Coefficients:
(Intercept)           hp
    3.00680      0.02168
A summary of the fitted model is shown below:
> summary(m)

Call:
lm(formula = cyl ~ hp, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max
-2.27078 -0.74879 -0.06417  0.63512  1.74067

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.006795   0.425485   7.067 7.41e-08 ***
hp          0.021684   0.002635   8.229 3.48e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.006 on 30 degrees of freedom
Multiple R-squared:  0.693,	Adjusted R-squared:  0.6827
F-statistic: 67.71 on 1 and 30 DF,  p-value: 3.478e-09
Some of the individual components from the fitted model are listed below for reference:
> m$residuals
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
      0.608014994       0.608014994      -1.023364771       0.608014994
Hornet Sportabout           Valiant        Duster 360         Merc 240D
      1.198584681       0.716432710      -0.319263348      -0.351174929
...
> m$model
                  cyl  hp
Mazda RX4           6 110
Mazda RX4 Wag       6 110
Datsun 710          4  93
Hornet 4 Drive      6 110
Hornet Sportabout   8 175
> m$terms
cyl ~ hp
attr(,"variables")
list(cyl, hp)
attr(,"factors")
    hp
cyl  0
hp   1
attr(,"term.labels")
[1] "hp"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(cyl, hp)
attr(,"dataClasses")
      cyl        hp
"numeric" "numeric"
> m$qr
$qr
                  (Intercept)            hp
Mazda RX4          -5.6568542 -829.78980772
Mazda RX4 Wag       0.1767767  381.74189579
Datsun 710          0.1767767    0.12620114
Hornet 4 Drive      0.1767767    0.08166844
Hornet Sportabout   0.1767767   -0.08860368
...
> confint(m)
                 2.5 %     97.5 %
(Intercept) 2.13783849 3.87575200
hp          0.01630186 0.02706522
> anova(m)
Analysis of Variance Table

Response: cyl
          Df Sum Sq Mean Sq F value    Pr(>F)
hp         1 68.517  68.517   67.71 3.478e-09 ***
Residuals 30 30.358   1.012
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can again use the predict() function to determine the number of cylinders required for a given horsepower. In the following example, the model predicts roughly 7.34 cylinders for 200 horsepower; since cylinder counts are whole numbers, at least 8 cylinders would be required in practice.
> hp <- data.frame(hp = 200)
> cyl <- predict(m, hp)
> cyl
       1
7.343504
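Since predict() returns fractional values, a practical whole-number answer can be obtained by rounding up with ceiling(). A sketch over a range of horsepower values:

```r
# Fit cylinders against horsepower on the built-in mtcars data
m <- lm(cyl ~ hp, data = mtcars)

# Predicted cylinder counts for several horsepower values
hp <- data.frame(hp = c(100, 150, 200))
p <- predict(m, hp)
p

# Cylinder counts are whole numbers, so round up
ceiling(p)   # 6, 7 and 8
```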
You are encouraged to read the manual pages of the above R functions to learn more about their arguments, options and usage.