In this seventeenth article in the R series, we shall learn about linear regression in R.
In statistics, regression is a way to model the relationship between two variables. We will use R version 4.2.1 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.
$ R --version
R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
The mathematical equation for a linear regression between two variables ‘x’ and ‘y’ is defined as ‘y = ax + b’, where ‘a’ and ‘b’ are constants.
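The two constants can also be computed directly: least squares gives the slope as cov(x, y)/var(x) and the intercept as mean(y) minus the slope times mean(x). A minimal sketch, using the same two vectors that are fitted below:

```r
# Hand-computed least-squares estimates for y = ax + b
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)

a <- cov(x, y) / var(x)      # slope: 1 for these data
b <- mean(y) - a * mean(x)   # intercept: 1 for these data

c(intercept = b, slope = a)
```

These values agree with the coefficients reported by lm() for the same vectors.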
The lm() function is used to fit linear models. Consider two vectors ‘x’ and ‘y’ that hold the first five odd and even numbers, respectively. We can fit a regression equation between them using the lm() function, as follows:
> x <- c(1, 3, 5, 7, 9)
> y <- c(2, 4, 6, 8, 10)
> r <- lm(y~x)
> print(r)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
          1            1
The lm() function accepts the following arguments:
Argument | Description
formula | symbolic description of the model
data | data frame, list or environment
subset | optional vector of observations to be used in the fit
weights | optional vector of weights
na.action | function specifying how missing values are handled
method | only 'qr' is supported
singular.ok | logical value; if FALSE, a singular fit is an error
contrasts | optional list of contrasts
offset | a priori known component of the predictor
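As an illustration, the subset and weights arguments can restrict or weight the observations entering the fit; a brief sketch using the same vectors (the weights here are arbitrary):

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)

# Fit using only the first four observations
r1 <- lm(y ~ x, subset = 1:4)

# Weighted least squares with user-supplied weights
w <- c(1, 1, 2, 2, 3)
r2 <- lm(y ~ x, weights = w)

coef(r1)
coef(r2)
```

Because these data lie exactly on a line, both fits recover the same coefficients.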
A more detailed output on the relation ‘r’ can be printed using the summary() function, as shown below:
> summary(r)

Call:
lm(formula = y ~ x)

Residuals:
         1          2          3          4          5
 4.367e-16 -8.276e-16  3.028e-16  1.303e-16 -4.222e-17

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)
(Intercept) 1.000e+00  5.207e-16 1.920e+15   <2e-16 ***
x           1.000e+00  9.065e-17 1.103e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.733e-16 on 3 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1
F-statistic: 1.217e+32 on 1 and 3 DF,  p-value: < 2.2e-16
We can use the predict() function to find the possible value of ‘y’ for a given ‘x’. For example:
> v <- data.frame(x = 11)
> p <- predict(r, v)
> p
 1
12
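The predict() function also accepts several new values at once, and its interval argument can request confidence bounds on the predictions. A short sketch:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

# Predictions for several new values of x at once
v <- data.frame(x = c(11, 13, 15))
p <- predict(r, v)
p    # 12, 14 and 16

# interval = "confidence" adds lower and upper bounds
# (degenerate here, since this fit is exact)
predict(r, v, interval = "confidence")
```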
The plot() function can be used to display the data, with the fitted regression line added by abline(), as illustrated below:

> plot(x, y, col = "red", main = "Odd and Even Numbers")
> abline(r)
The normal QQ plot for the residual values can be displayed using the qqnorm() and qqline() functions, as shown below:
> qqnorm(r$resid)
> qqline(r$resid)
The kernel density estimates on the residuals can also be graphed using the plot() function, as shown in the figure.
> plot(density(r$resid))
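In addition, calling plot() directly on the fitted object cycles through R's standard diagnostic plots: residuals versus fitted values, a normal Q-Q plot, a scale-location plot and residuals versus leverage. A sketch:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

# Arrange the four standard lm diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(r)
```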
Calling the ls() function on the fitted object lists the various components of ‘r’. For example:
> ls(r)
 [1] "assign"        "call"          "coefficients"  "df.residual"
 [5] "effects"       "fitted.values" "model"         "qr"
 [9] "rank"          "residuals"     "terms"         "xlevels"
The components are briefly described below:
Component | Description
coefficients | named vector of coefficients
residuals | response minus fitted values
fitted.values | the fitted mean values
rank | numeric rank of the fitted model
df.residual | residual degrees of freedom
call | the matched call
terms | the 'terms' object used
model | the model frame used
assign | integer vector relating coefficients to model terms
effects | named numeric vector of orthogonal effects
qr | QR decomposition of the model matrix
xlevels | levels of the factors used in fitting
> r$coefficients
(Intercept)           x
          1           1
> r$assign
[1] 0 1
> r$effects
  (Intercept)             x
-1.341641e+01  6.324555e+00  0.000000e+00 -4.440892e-16 -8.881784e-16
> r$rank
[1] 2
> r$call
lm(formula = y ~ x)
> r$fitted.values
 1  2  3  4  5
 2  4  6  8 10
> r$residuals
            1             2             3             4             5
 4.367205e-16 -8.275904e-16  3.027995e-16  1.302899e-16 -4.221962e-17
> r$model
   y x
1  2 1
2  4 3
3  6 5
4  8 7
5 10 9
> r$terms
y ~ x
attr(,"variables")
list(y, x)
attr(,"factors")
  x
y 0
x 1
attr(,"term.labels")
[1] "x"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(y, x)
attr(,"dataClasses")
        y         x
"numeric" "numeric"
> r$df.residual
[1] 3
> r$qr
$qr
  (Intercept)           x
1  -2.2360680 -11.1803399
2   0.4472136   6.3245553
3   0.4472136  -0.1954395
4   0.4472136  -0.5116673
5   0.4472136  -0.8278950
attr(,"assign")
[1] 0 1

$qraux
[1] 1.447214 1.120788

$pivot
[1] 1 2

$tol
[1] 1e-07

$rank
[1] 2

attr(,"class")
[1] "qr"

> r$xlevels
named list()
The confint() function computes the confidence intervals for the parameters in the fitted model. For example:
> confint(r)
            2.5 % 97.5 %
(Intercept)     1      1
x               1      1
The confint() function accepts the following arguments:
Argument | Description
object | the fitted model object
parm | vector of numbers or names specifying the parameters
level | the confidence level required
... | additional arguments
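For instance, the parm and level arguments restrict the output to chosen parameters at a different confidence level. A sketch:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

# 99% confidence interval for the slope coefficient only
ci <- confint(r, parm = "x", level = 0.99)
ci
```

Since this fit is exact, the interval collapses to the point estimate of 1.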
The analysis of variance tables for the fitted model can be obtained using the anova() function, as follows:
> anova(r)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq    F value    Pr(>F)
x          1     40      40 1.2169e+32 < 2.2e-16 ***
Residuals  3      0       0
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
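The F value in the ANOVA table is simply the regression mean square divided by the residual mean square, which can be verified by hand:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 6, 8, 10)
r <- lm(y ~ x)

a <- anova(r)

# F = (regression mean square) / (residual mean square)
f <- a["x", "Mean Sq"] / a["Residuals", "Mean Sq"]
f
a["x", "F value"]   # the same value, as reported by anova()
```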
Consider the mtcars data set from the built-in datasets package, which is loaded by default:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We can fit a regression between the number of cylinders and horsepower using the lm() function, as follows:
> m <- lm(cyl~hp, data=mtcars)
> print(m)

Call:
lm(formula = cyl ~ hp, data = mtcars)

Coefficients:
(Intercept)           hp
    3.00680      0.02168
A summary of the fitted model is shown below:
> summary(m)

Call:
lm(formula = cyl ~ hp, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max
-2.27078 -0.74879 -0.06417  0.63512  1.74067

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.006795   0.425485   7.067 7.41e-08 ***
hp          0.021684   0.002635   8.229 3.48e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.006 on 30 degrees of freedom
Multiple R-squared:  0.693,	Adjusted R-squared:  0.6827
F-statistic: 67.71 on 1 and 30 DF,  p-value: 3.478e-09
Some of the individual components from the fitted model are listed below for reference:
> m$residuals
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
      0.608014994       0.608014994      -1.023364771       0.608014994
Hornet Sportabout           Valiant        Duster 360         Merc 240D
      1.198584681       0.716432710      -0.319263348      -0.351174929
...
> m$model
                  cyl  hp
Mazda RX4           6 110
Mazda RX4 Wag       6 110
Datsun 710          4  93
Hornet 4 Drive      6 110
Hornet Sportabout   8 175
> m$terms
cyl ~ hp
attr(,"variables")
list(cyl, hp)
attr(,"factors")
    hp
cyl  0
hp   1
attr(,"term.labels")
[1] "hp"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(cyl, hp)
attr(,"dataClasses")
      cyl        hp
"numeric" "numeric"
> m$qr
$qr
                  (Intercept)            hp
Mazda RX4          -5.6568542 -829.78980772
Mazda RX4 Wag       0.1767767  381.74189579
Datsun 710          0.1767767    0.12620114
Hornet 4 Drive      0.1767767    0.08166844
Hornet Sportabout   0.1767767   -0.08860368
...
> confint(m)
                 2.5 %     97.5 %
(Intercept) 2.13783849 3.87575200
hp          0.01630186 0.02706522
> anova(m)
Analysis of Variance Table

Response: cyl
          Df Sum Sq Mean Sq F value    Pr(>F)
hp         1 68.517  68.517   67.71 3.478e-09 ***
Residuals 30 30.358   1.012
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can again use the predict() function to determine the number of cylinders required for a given horsepower. In the following example, the model predicts roughly 7.34 cylinders for 200 horsepower; since cylinder counts are whole numbers, at least 8 cylinders would be required in practice.
> hp <- data.frame(hp = 200)
> cyl <- predict(m, hp)
> cyl
       1
7.343504
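Since predict() returns fractional values, a practical whole-number answer can be obtained by rounding up with ceiling(). A sketch over a range of horsepower values:

```r
# Fit cylinders against horsepower on the built-in mtcars data
m <- lm(cyl ~ hp, data = mtcars)

# Predicted cylinder counts for several horsepower values
hp <- data.frame(hp = c(100, 150, 200))
p <- predict(m, hp)
p

# Cylinder counts are whole numbers, so round up
ceiling(p)   # 6, 7 and 8
```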
You are encouraged to read the manual pages of the above R functions to learn more about their arguments, options and usage.