Learning Objectives:

At the end of the session, the participants are expected to:

  • understand the basic concepts of simple and multiple linear regression
  • learn how to use R functions for regression analysis

Simple Linear Regression

What is a linear regression model?

Regression Analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables.

  • Employee efficiency may be related to years of training, educational background, and age of the employee.

  • The amount of time until a headache is gone after taking a painkiller may be related to the dosage, age, and gender.

  • The number of votes for a presidential candidate may be related to gender, income, and the state.

Simple regression is the study of a single response as a function of a single predictor.

\(X\): the predictor; \(x_1, x_2, ..., x_n\) are \(n\) observations of \(X\).

\(Y\): the response; \(y_1, y_2, ..., y_n\) are \(n\) observations of \(Y\).

In a simple linear regression model we assume this relationship is a linear function.

Linear Function

A linear function is determined by the intercept and the slope of the line.

The figure below shows a plot of the linear function

\[ y= f(x)=1+0.5x\]

The intercept is equal to 1 and the slope of the line is equal to 0.5.

  • The intercept represents the value of y when x = 0.
  • The slope represents the increase in y for a one-unit increase in x.

Figure 1
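A plot like Figure 1 can be reproduced in R with the curve() function; a minimal sketch (the range of x shown is an arbitrary choice for illustration):

# plot the linear function y = 1 + 0.5x over x from 0 to 10 (range chosen for illustration)
curve(1 + 0.5 * x, from = 0, to = 10, xlab = "x", ylab = "y")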

Linear regression as a statistical model

In a real-life situation we usually cannot explain \(Y\) as an exact function of \(X\). There is always some error or noise in the dependent variable that cannot be explained by the relationship between the independent and dependent variables. We call this the error or residual term of the model. We write the formula for a simple linear model as:

\[y_i = \beta_0+\beta_1 x_i+\epsilon_i\] where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the model, and \(\epsilon_i\) is the residual term.

  • The quantities \(\beta_0\) and \(\beta_1\) are called model parameters, and they are estimated from the data.
  • We assume the linear function predicts the mean value of the outcome for a given value of the predictor variable. In mathematical statistics this mean value is denoted \(E(Y|X=x)\), which reads as the expected value of \(Y\) given \(X=x\).
  • In a simple linear regression, the mean function of the outcome is given by

the mean of \(Y\) given \(X=x\): \(E(Y|X=x)=\beta_0+\beta_1 x\)

  • The goal of a linear regression model is to estimate the parameters \(\beta_0\) and \(\beta_1\) so that the line fits the data as well as possible.
  • The method used to obtain this best-fitting linear function is called the Ordinary Least Squares (OLS) method.
  • In OLS we estimate the parameters of the model (intercept and slope) by minimizing the sum of squared residuals (see the sketch after this list).
  • The estimated model parameters are denoted \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
  • The estimated model is \[\hat{y}=\hat{\beta_0}+\hat{\beta_1}x\]
  • \(\hat{y}\) represents the predicted mean of the outcome for a given value of \(x\). We also call it the fitted value.
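As a small illustration of the OLS idea, the estimates can be computed by hand from their closed-form solutions, \(\hat{\beta_1}=\widehat{Cov}(x,y)/\widehat{Var}(x)\) and \(\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}\); a minimal sketch with made-up data:

# toy data to illustrate OLS estimation by hand
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1)
b1 <- cov(x, y) / var(x)     # estimated slope
b0 <- mean(y) - b1 * mean(x) # estimated intercept
c(b0, b1)                    # same values that lm(y ~ x), introduced below, would return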

Note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example, age or height. Or it may be dichotomous, meaning that it may assume only one of two values, for example, 0 or 1, or it may be a categorical variable with more than two levels. There is only one response or dependent variable, and it is continuous.

Linear model in R

Now that we have reviewed the linear model, let's use R to run a simple regression model.

  • In our data example we are interested in studying the relationship between students’ academic performance and some characteristics of their school. For example, we can use the variable api00 as the outcome and the variable enroll, which is the number of students in the school, as the predictor. See the file “elemapi2v2.csv” for reference on the said dataset.
#Load the dataset
d <- read.csv("F:/ROEL/2020 USEP Docs/Lecture/Foundations of Computer Programming/R Programming/elemapi2v2.csv")
  • We expect that better academic performance would be associated with a lower number of students in the school.
  • In R we use the function lm() to run a linear regression model. Let’s look at the R help documentation for the function lm().
help(lm)   #shows R Documentation for function lm()

Now we run our model in R, with api00 as the dependent variable and enroll as the independent variable.

m1 <- lm(api00 ~ enroll, data = d) #specify the model
print(m1)    #Prints coefficients of the model
## 
## Call:
## lm(formula = api00 ~ enroll, data = d)
## 
## Coefficients:
## (Intercept)       enroll  
##    744.2514      -0.1999
  • The first argument of the function lm() is the formula of the model, which is itself a formula object. The response variable goes on the left side of the “~” and the predictor variable on the right side.
  • api00 ~ enroll is interpreted as: api00 is modelled by the linear predictor enroll.
  • data = d tells the lm() function to look for the variable names in the data frame d.
  • print() prints the estimated coefficients of the model.
  • The estimated regression line is: \[\widehat{api00}=744.2514 - 0.1999 \times enroll\]
  • The coefficient for enroll is -0.1999, or approximately -0.2, meaning that for a one-unit increase in enroll we would expect a 0.2-unit decrease in api00. In other words, a school with 1000 students would be expected (on average) to have an api score 20 units higher than a school with 1100 students, as illustrated in the sketch below.
  • The constant (intercept) is 744.2514, and this is the predicted value of api00 when enroll equals zero. In most cases, the constant is not very interesting.
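To see this interpretation in action, we can ask the model for the predicted mean api00 at two hypothetical enrollments (1000 and 1100 are chosen only for illustration); the two predictions differ by roughly 20 api00 points:

#predicted mean api00 for schools with 1000 and 1100 students
predict(m1, newdata = data.frame(enroll = c(1000, 1100)))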

Summary and components of the lm object

To examine the result of the lm object in more detail, we use the summary() function:

summary(m1)    #Prints summary of the model
## 
## Call:
## lm(formula = api00 ~ enroll, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285.50 -112.55   -6.70   95.06  389.15 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 744.25141   15.93308  46.711  < 2e-16 ***
## enroll       -0.19987    0.02985  -6.695 7.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared:  0.1012, Adjusted R-squared:  0.09898 
## F-statistic: 44.83 on 1 and 398 DF,  p-value: 7.339e-11
  • The first line shows the model specification.
  • The second part of the summary reports the five-number summary of the estimated residuals, \[e_i = y_i - \hat{y_i}\]
  • In the coefficients part we have a table of the estimated coefficients, their standard errors, t-values, and the p-values of the hypothesis tests that each coefficient equals zero.
  • If the p-value is smaller than 0.001, it is marked with three stars, indicating strong evidence that the coefficient is not zero. For a p-value between 0.001 and 0.01 we see two stars, between 0.01 and 0.05 one star “*”, and between 0.05 and 0.1 a dot “.”.
  • Here both the intercept and the slope are significant.
  • In the last portion of the output we see the residual standard error, which is 135, with 398 degrees of freedom.
  • Multiple R-squared is the R-squared of the model, equal to 0.1012, and the adjusted R-squared is 0.09898, which is adjusted for the number of predictors in the model.
  • In a simple linear regression model, R-squared is equal to the square of the correlation between the response and the predictor variable. We can run the function cor() to check that this is true.
r <- cor(d$api00, d$enroll) #correlation coefficient of api00 and enroll
r ^ 2                       #this is equal to r-squared in simple regression 
## [1] 0.1012335
  • The last line gives the overall significance of the model against the null model, which is the model with only the intercept. The F-statistic is equal to 44.83 with 1 and 398 degrees of freedom and a p-value of 7.339e-11. In a simple regression model this p-value is exactly equal to the p-value of the slope. Why?
  • As we said, the result of the lm() function is an object of class lm. We can use the function ls() to list the components included in the object m1.
ls(m1)        #list of components in object class lm
##  [1] "assign"        "call"          "coefficients"  "df.residual"  
##  [5] "effects"       "fitted.values" "model"         "qr"           
##  [9] "rank"          "residuals"     "terms"         "xlevels"
  • To access the above components we can use the $ operator on the lm object.
m1$coefficients        #returns coefficients of the model
## (Intercept)      enroll 
## 744.2514142  -0.1998674
m1$fitted.values[1:10] #a vector of fitted values here we show the first 10
##        1        2        3        4        5        6        7        8 
## 694.8842 651.7128 665.3038 660.7068 640.3203 675.6969 683.6916 441.8520 
##        9       10 
## 612.3389 671.8994

We can store residuals in a new object.

residuals <- m1$resid  #a vector of residuals

Some of the components can be extracted using a function.

coefficients(m1)
## (Intercept)      enroll 
## 744.2514142  -0.1998674

There are some R functions that can be applied to an lm object.

To get confidence intervals for the coefficients of the model we use confint():

confint(m1) # returns a matrix of Confidence Interval for coefficients
##                   2.5 %      97.5 %
## (Intercept) 712.9279024 775.5749259
## enroll       -0.2585532  -0.1411817
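These intervals can also be computed by hand from the coefficient table as estimate ± t × (standard error), where t is the 97.5th percentile of the t distribution with 398 degrees of freedom; a minimal check for the slope:

#95% confidence interval for the slope of enroll, computed by hand
est <- coef(m1)["enroll"]                               #point estimate
se  <- summary(m1)$coefficients["enroll", "Std. Error"] #standard error
est + c(-1, 1) * qt(0.975, df = 398) * se               #matches confint(m1)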

In addition to getting the regression table and statistics, it can be useful to see a scatterplot of the predictor and outcome variables with the regression line plotted.

plot(api00 ~ enroll, data = d) #scatter plot of api00 vs. enroll
abline(m1, col = "blue")       #add regression line to the scatter plot

As you can see, some of the points appear to be outliers. If you add text with labels = d$snum to the scatterplot, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.

#adding labels(school number) to the scatter plot
plot(api00 ~ enroll, data = d) 
text(d$enroll, d$api00+20, labels = d$snum, cex = .7)
abline(m1, col = "blue")

Breaking down the sum of squares and anova

If we use only the intercept to model the response variable, the regression line will be a horizontal line at the mean of the response variable. In our example this is the mean of api00, which gives the line \(y=647.6225\).

mean(d$api00)
## [1] 647.6225

The residuals from this line are \(y_i−\bar{y}\). We can break this error down into two parts using the predicted values from the regression of api00 on the predictor variable enroll.

This can be written as: \[y_i−\bar{y}=(y_i- \hat{y_i})+(\hat{y_i}-\bar{y})\] where \(\bar{y}\) is the mean of the response, \(\hat{y_i}\) is the fitted value from the regression model including the predictor variable, and \(y_i\) is the observed response.

The graph below shows these error components.

Figure 2

It can be shown that the following equation holds: \[ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i -\hat{y_i} )^2 +\sum_{i=1}^n (\hat{y_i}-\bar{y})^2\]

The left-hand side of this equation is called the Total Sum of Squares (SST). The first term on the right-hand side is called the Residual Sum of Squares (RSS) and the second term is called the Regression Sum of Squares (SSreg). We have: \[SST=RSS+SS_{reg}\]

The ratio of SSreg to SST is called R-squared; it is the proportion of the total variability in the response that can be explained by the model. \[R^2=\frac{SS_{reg}}{SST}=1-\frac{RSS}{SST}\] We can use the function anova() to see the sums of squares of the model.

anova(m1)  #anova table

In the anova table we have the sum of squares of the regression model, in the row named enroll, and the sum of squares of the residuals, each with its degrees of freedom. It also shows the F-statistic that we saw before in the summary of the model. The mean squares are the sums of squares divided by their degrees of freedom.
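We can also verify this decomposition directly from the fitted model; a minimal sketch computing RSS, SSreg, and SST for m1:

#sum of squares decomposition for model m1
rss   <- sum(residuals(m1)^2)                  #residual sum of squares
ssreg <- sum((fitted(m1) - mean(d$api00))^2)   #regression sum of squares
sst   <- sum((d$api00 - mean(d$api00))^2)      #total sum of squares
c(RSS = rss, SSreg = ssreg, SST = sst)         #RSS + SSreg equals SST
ssreg / sst                                    #equals the R-squared of m1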

Multiple Regression

Adding more predictors to a simple regression model

Now, let’s look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors. The percentage of variability explained by the variable enroll was only 10.12%. In order to improve the percentage of variability accounted for by the model, we can add more predictors. We add the percentage of students receiving free meals (meals), as an indicator of socioeconomic status, and the percentage of teachers with a full credential (full) to our model. In R we can do this by simply adding + and the variable name to the formula in our lm() function.

#multiple regression model of DV api00 and IVs enroll, meals, and full
m2 <- lm(api00 ~  enroll + meals + full, data=d)
summary(m2)   #summary of model m2
## 
## Call:
## lm(formula = api00 ~ enroll + meals + full, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -181.721  -40.802    1.129   39.983  158.774 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 801.82983   26.42660  30.342  < 2e-16 ***
## enroll       -0.05146    0.01384  -3.719 0.000229 ***
## meals        -3.65973    0.10880 -33.639  < 2e-16 ***
## full          1.08109    0.23945   4.515 8.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.73 on 396 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16
anova(m2)     #anova table of model m2 
  • Looking at the results of the multiple regression model, we see from the F-test that the overall model is significant. Also, all of the coefficients in the model are significantly different from zero.
  • The R-squared is 0.8308, meaning that approximately 83% of the variability of api00 is accounted for by the variables in the model. The adjusted R-squared shows that, after taking into account the number of predictors in the model, the R-squared is still about 0.83.
  • The coefficient for each variable indicates the amount of change one could expect in api00 given a one-unit change in the value of that variable, with all other variables in the model held constant. For example, consider the variable meals. We would expect a decrease of about 3.66 in the api00 score for every one-unit increase in percent free meals, assuming that all other variables in the model are held constant.
  • We see quite a difference in the coefficient of the variable enroll compared to the simple linear regression. Before, the coefficient for the variable enroll was -0.1999 and now it is -0.05146.
  • The anova table shows the sum of squares explained by adding each variable, one at a time, to the model; that is, the amount by which the residual sum of squares is reduced by each additional variable.

For example, the variable enroll reduces the total error by 817326. By adding the variable meals we reduce the residual sum of squares by an additional 5820066, and by adding the variable full we reduce the error by a further 70313. Finally, we have 1365967 left as unexplained error. The total sum of squares is all of those sums of squares added together. To get the total sum of squares of the variable api00 we can also multiply its variance by \((n−1)\).

sum(anova(m2)$Sum)          #sum of RSS and SSreg
## [1] 8073672
(400 - 1) * var(d$api00)    #Total sum of squares 
## [1] 8073672

Standardized regression coefficients

Some researchers are interested in comparing the relative strength of the various predictors within the model. Since each variable has a different unit, we cannot compare the coefficients to one another directly. To address this problem we use standardized regression coefficients, which can be obtained by transforming the outcome and predictor variables to their standard scores, also called z-scores, before running the regression.

In R we use the function scale() to do this for each variable.

#Standardized regression model
m2.sd <- lm(scale(api00) ~  scale(enroll) + scale(meals) + scale(full), data = d)
summary(m2.sd) #coefficients are standardized
## 
## Call:
## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full), 
##     data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27749 -0.28683  0.00793  0.28108  1.11617 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.454e-16  2.064e-02   0.000 1.000000    
## scale(enroll) -8.191e-02  2.203e-02  -3.719 0.000229 ***
## scale(meals)  -8.210e-01  2.441e-02 -33.639  < 2e-16 ***
## scale(full)    1.136e-01  2.517e-02   4.515 8.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4129 on 396 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16
  • In the summary of the standardized regression we see that the intercept is zero and all the t-statistics for the other coefficients are exactly the same as in the original model.
  • Because the coefficients are all in the same standardized units, you can compare them to assess the relative strength of each of the predictors. In this example, meals has the largest Beta coefficient, -0.821 (see the check after this list).
  • Thus, a one standard deviation increase in meals leads to a 0.821 standard deviation decrease in the predicted api00, with the other variables held constant.
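Equivalently, a standardized coefficient can be obtained from the unstandardized model by rescaling the raw coefficient by the ratio of the standard deviations of the predictor and the outcome; a minimal check for meals:

#standardized coefficient of meals from the unstandardized model m2
coef(m2)["meals"] * sd(d$meals) / sd(d$api00)  #approximately -0.821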

Summary

In this part we have discussed the basics of how to perform simple and multiple regression in R. In the next part we will go into a more thorough discussion of the assumptions of linear regression and how you can use R to assess these assumptions for your data. In particular, the next part will address the following issues.

  • Checking for points that exert undue influence on the coefficients
  • Checking for constant error variance (homoscedasticity)
  • Checking for linear relationships
  • Checking normality of residuals
  • Checking for multicollinearity