At the end of the session, the participants are expected to be able to perform simple and multiple linear regression in R.
Regression Analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables.
Employee efficiency may be related to years of training, educational background, and age of the employee.
The amount of time until a headache is gone, when taking a painkiller, may be related to the dosage, age, and gender.
The number of votes for a presidential candidate may be related to gender, income, and the state.
Simple regression is the study of a single response as a function of a single predictor.
\(X\): the predictor variable; \(x_1, x_2, \dots, x_n\) are the \(n\) observed values of \(X\).
\(Y\): the response variable; \(y_1, y_2, \dots, y_n\) are the \(n\) observed values of \(Y\).
In a simple linear regression model we assume that the relationship between \(X\) and \(Y\) is a linear function.
A linear function is determined by the intercept and the slope of the line.
The figure below shows a plot of the linear function
\[ y= f(x)=1+0.5x\]
The intercept is equal to 1 and the slope of the line is equal to 0.5.
Figure 1
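As a quick illustration (this snippet is a sketch and not part of the original example), a line like the one in Figure 1 can be drawn in R with the curve() function:
#sketch: plot the linear function y = 1 + 0.5x over the range 0 to 10
curve(1 + 0.5 * x, from = 0, to = 10, xlab = "x", ylab = "y")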
In a real-life situation we usually cannot explain \(Y\) as an exact function of \(X\). There is always some error or noise in the dependent variable that cannot be explained by the relationship between the independent and dependent variables. We call this the error or residual term of the model. We represent the formula for a simple linear model as:
\[y_i = \beta_0+\beta_1 x_i+\epsilon_i\] where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the model, and \(\epsilon_i\) is the residual term.
The mean of \(Y\) given a value of \(X\) equal to \(x\) is \(E(Y|X=x)=\beta_0+\beta_1 x\).
Note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example, age or height. It may also be dichotomous, meaning that it assumes only one of two values, for example, 0 or 1, or a categorical variable with more than two levels. There is only one response or dependent variable, and it is continuous.
Now that we have some review on the linear model, let’s use R and run a simple regression model.
#Load the dataset
d <- read.csv("F:/ROEL/2020 USEP Docs/Lecture/Foundations of Computer Programming/R Programming/elemapi2v2.csv")
Now we run our model in R, with api00 as the dependent variable and enroll as the independent variable.
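We fit the model with lm() and store it in an object named m1 (the same object is used later to add the regression line to a scatter plot); printing it shows the estimated coefficients:
m1 <- lm(api00 ~ enroll, data = d) #simple linear regression of api00 on enroll
m1 #print the estimated coefficients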
##
## Call:
## lm(formula = api00 ~ enroll, data = d)
##
## Coefficients:
## (Intercept) enroll
## 744.2514 -0.1999
To observe the result of the lm object in more detail we use the summary() function:
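For the fitted model m1 this is:
summary(m1) #detailed summary of model m1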
##
## Call:
## lm(formula = api00 ~ enroll, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.50 -112.55 -6.70 95.06 389.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
## enroll -0.19987 0.02985 -6.695 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
## F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
r <- cor(d$api00, d$enroll) #correlation coefficient of api00 and enroll
r ^ 2 #this is equal to r-squared in simple regression
## [1] 0.1012335
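The fitted lm object stores several components. One way to list them (matching the alphabetically sorted names below) is ls(); names(m1) returns the same components in their original order:
ls(m1) #list the components of the lm object m1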
## [1] "assign" "call" "coefficients" "df.residual"
## [5] "effects" "fitted.values" "model" "qr"
## [9] "rank" "residuals" "terms" "xlevels"
## (Intercept) enroll
## 744.2514142 -0.1998674
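The fitted (predicted) values are stored in the fitted.values component; for example, the first ten are:
m1$fitted.values[1:10] #first 10 fitted values of the model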
## 1 2 3 4 5 6 7 8
## 694.8842 651.7128 665.3038 660.7068 640.3203 675.6969 683.6916 441.8520
## 9 10
## 612.3389 671.8994
We can store residuals in a new object.
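For example (the object name res is only illustrative):
res <- resid(m1) #store the residuals of model m1 in a new object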
Some of the components can be extracted using a function.
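For example, the extractor function coef() returns the coefficients:
coef(m1) #extract the regression coefficients of m1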
## (Intercept) enroll
## 744.2514142 -0.1998674
There are several R functions that can be applied to an lm object.
To get the confidence intervals for the coefficients of the model we use confint().
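Applied to the fitted model m1, this gives the 95% intervals shown below:
confint(m1) #95% confidence intervals for the coefficients of m1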
## 2.5 % 97.5 %
## (Intercept) 712.9279024 775.5749259
## enroll -0.2585532 -0.1411817
In addition to getting the regression table and statistics, it can be useful to see a scatterplot of the outcome and predictor variables with the regression line plotted.
plot(api00 ~ enroll, data = d) #scatter plot of api00 vs. enroll
abline(m1, col = "blue") #add regression line to the scatter plot
As you see, some of the points appear to be outliers. If you add text with labels = d$snum on the scatter, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.
#adding labels(school number) to the scatter plot
plot(api00 ~ enroll, data = d)
text(d$enroll, d$api00+20, labels = d$snum, cex = .7)
abline(m1, col = "blue")
If we use only the intercept to model the response variable, the regression line will be a horizontal line at the mean of the response variable. In our example this is the mean of api00, which gives the line \(y=647.6225\).
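This value is simply the sample mean of the response:
mean(d$api00) #mean of the response variable api00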
## [1] 647.6225
The residuals for this line will be \(y_i-\bar{y}\). We can break this error down into two parts using the predicted values from the regression of api00 on the predictor enroll.
This can be written as: \[y_i-\bar{y}=(y_i- \hat{y_i})+(\hat{y_i}-\bar{y})\] where \(\bar{y}\) is the mean of the response, \(\hat{y_i}\) is the fitted value from the regression model including the predictor, and \(y_i\) is the observed response.
The graph below shows these error components.
Figure 2
It can be shown that the following equation holds: \[ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i -\hat{y_i} )^2 +\sum_{i=1}^n (\hat{y_i}-\bar{y})^2\]
The left-hand side of this equation is called the total sum of squares (SST). The first term on the right-hand side is called the residual sum of squares (RSS) and the second term is called the regression sum of squares (SSreg). We have: \[SST=RSS+SS_{reg}\]
The ratio of SSreg to SST is called R-squared; it is the proportion (or percentage) of the total variability in the response that is explained by the model. \[R^2=\frac{SS_{reg}}{SST}=1-\frac{RSS}{SST}\] We can use the anova() function to see the sums of squares of the model.
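Applied to the simple regression model m1:
anova(m1) #ANOVA table with regression and residual sums of squares for m1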
In the ANOVA table we have the sum of squares of the regression model in the row named enroll and the sum of squares of the residuals, together with their degrees of freedom. It also shows the F statistic that we have seen before in the summary of the model. Mean squares are sums of squares divided by their degrees of freedom.
Now, let’s look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors. The percentage of variability explained by variable enroll was only 10.12%. In order to improve the percentage of variability accounted for by the model, we can add more predictors. We add the percentage of students who receive free meals (as an indicator of socioeconomic status) and the percentage of teachers with full credentials to our model. In R we can do this by simply adding + and the variable name to the formula in lm().
#multiple regression model of DV api00 and IVs enroll, meals, and full
m2 <- lm(api00 ~ enroll + meals + full, data=d)
summary(m2) #summary of model m2
##
## Call:
## lm(formula = api00 ~ enroll + meals + full, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181.721 -40.802 1.129 39.983 158.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 801.82983 26.42660 30.342 < 2e-16 ***
## enroll -0.05146 0.01384 -3.719 0.000229 ***
## meals -3.65973 0.10880 -33.639 < 2e-16 ***
## full 1.08109 0.23945 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.73 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
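To see how each predictor sequentially reduces the residual sum of squares, we can apply anova() to the multiple regression model:
anova(m2) #sequential (Type I) sums of squares for enroll, meals, and full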
For example, variable enroll reduces the total error by 817326. By adding variable meals we reduce the residual error by an additional 5820066, and by adding variable full we reduce the error by a further 70313. Finally, we have 1365967 left as unexplained error. The total sum of squares is all of those sums of squares added together. To get the total sum of squares of variable api00 we can multiply its variance by \((n-1)\).
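One way to verify the total sum of squares (giving the two identical values printed below) is to add up all the sums of squares in the ANOVA table, or equivalently to multiply the variance of api00 by \(n-1\); the exact calls are not shown here, so this is a sketch:
sum(anova(m2)$"Sum Sq") #total SS: sum of all sums of squares in the ANOVA table
var(d$api00) * (length(d$api00) - 1) #total SS: variance of api00 times (n - 1)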
## [1] 8073672
## [1] 8073672
Some researchers are interested in comparing the relative strength of the various predictors within the model. Since each variable has a different unit, we cannot compare the coefficients to one another. To address this problem we use standardized regression coefficients, which can be obtained by transforming the outcome and predictor variables to their standard scores (also called z-scores) before running the regression.
In R we use the scale() function to do this for each variable.
#Standardized regression model
m2.sd <- lm(scale(api00) ~ scale(enroll) + scale(meals) + scale(full), data = d)
summary(m2.sd) #coefficients are standardized
##
## Call:
## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full),
## data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27749 -0.28683 0.00793 0.28108 1.11617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.454e-16 2.064e-02 0.000 1.000000
## scale(enroll) -8.191e-02 2.203e-02 -3.719 0.000229 ***
## scale(meals) -8.210e-01 2.441e-02 -33.639 < 2e-16 ***
## scale(full) 1.136e-01 2.517e-02 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4129 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
In this part we have discussed the basics of how to perform simple and multiple regression in R. In the next part we will go into a more thorough discussion of the assumptions of linear regression and how you can use R to assess these assumptions for your data. In particular, the next part will address how to check each of these assumptions.