DATA605_Discussion11_Simple Linear Regression

Simple linear regression is used to predict a quantitative outcome y from a single predictor variable x. The goal is to build a mathematical model (or formula) that defines y as a function of x.

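In a nutshell, this technique finds the line that best “fits” the data and takes the following form:

ŷ = b0 + b1 × x

where ŷ is the predicted value of the response, b0 is the intercept, and b1 is the slope (the average change in y for a one-unit increase in x).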
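As a minimal sketch of what “best fit” means here, assuming the usual least-squares criterion: the slope and intercept of the line y = b0 + b1 × x can be computed by hand from the sample means and compared with what lm() returns. The vectors x and y below are a small made-up illustration, not the exam data used in the example.

```r
# Toy data, purely illustrative (not the exam data from the example)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Least-squares estimates from the closed-form formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(intercept = b0, slope = b1)   # intercept 2.2, slope 0.6

# lm() should give the same two numbers
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```

The hand calculation and lm() agree because lm() minimizes the same sum of squared residuals.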

Example

For this example, we’ll create a hypothetical dataset containing the following two variables for 20 students: total hours studied for an exam and exam score.

We’ll attempt to fit a simple linear regression model using hours as the explanatory variable and exam score as the response variable.

data <- data.frame(hours=c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
                 score=c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89))

head(data)

##   hours score
## 1     1    64
## 2     2    66
## 3     4    76
## 4     5    73
## 5     5    74
## 6     6    81

model <- lm(score ~ hours, data = data)
summary(model)

## 
## Call:
## lm(formula = score ~ hours, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.549 -2.858  0.164  1.574 12.046 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.1673     2.2811  28.130  2.5e-16 ***
## hours         2.5955     0.3408   7.616  4.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.199 on 18 degrees of freedom
## Multiple R-squared:  0.7632, Adjusted R-squared:   0.75 
## F-statistic:    58 on 1 and 18 DF,  p-value: 4.904e-07

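From the output above, the estimated regression line can be written as follows:

score = 64.167 + 2.596 × hours

That is, each additional hour studied is associated with an average increase of about 2.6 points in exam score, and the expected score for a student who studies zero hours is about 64.2.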
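Once fitted, the line can be used for prediction. The sketch below refits the same model and predicts scores for a few study times; the new_students data frame and its hours values are hypothetical inputs chosen for illustration. It also checks that predict() matches plugging the hours into the equation by hand.

```r
# Same data and model as in the example
data <- data.frame(hours=c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
                 score=c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89))
model <- lm(score ~ hours, data = data)

# Hypothetical students (hours chosen inside the observed range of 1 to 11)
new_students <- data.frame(hours = c(2, 6, 10))
pred <- predict(model, newdata = new_students)
pred

# The same predictions from the equation: intercept + slope * hours
b <- unname(coef(model))
manual <- b[1] + b[2] * new_students$hours
all.equal(unname(pred), manual)
```

Predictions for hours far outside the observed range (1 to 11) would be extrapolation and should be treated with caution.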

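Residuals. The Residuals section of the summary provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value. Here the median (0.164) is close to zero, but the maximum (12.046) is almost twice the size of the minimum (-6.549). That large positive residual belongs to the student who studied 3 hours and scored 84, well above the fitted value of about 72.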
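These properties can be checked directly. The sketch below refits the same model, confirms that the residuals average to zero (up to floating-point error), and looks up the observation with the largest absolute residual; the variable name idx is just a helper introduced here.

```r
# Same data and model as in the example
data <- data.frame(hours=c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
                 score=c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89))
model <- lm(score ~ hours, data = data)
res <- resid(model)

mean(res)      # zero up to floating-point error
summary(res)   # min, quartiles, median, and max of the residuals

# Row with the largest residual in absolute value
idx <- which.max(abs(res))
data[idx, ]    # row 16: hours = 3, score = 84
res[idx]
```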

res <- resid(model)
qqnorm(res)
qqline(res)

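In the Q-Q plot, points that lie close to the reference line indicate residuals that are approximately normally distributed. Most of the residuals follow the line, with the large positive residual noted above as the main departure, so the normality assumption appears reasonable. Note that the Q-Q plot only addresses normality of the residuals; linearity and constant variance should be checked separately before concluding that the linear model is appropriate.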
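To supplement the visual check, one option is a formal normality test plus a residuals versus fitted values plot. This is a sketch of one possible approach, not part of the original analysis: shapiro.test() is the Shapiro-Wilk test from base R's stats package, and the choice of that particular test and the axis labels are assumptions made here.

```r
# Same data and model as in the example
data <- data.frame(hours=c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
                 score=c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89))
model <- lm(score ~ hours, data = data)
res <- resid(model)

# Shapiro-Wilk test: a small p-value would be evidence against normality
sw <- shapiro.test(res)
print(sw)

# Residuals vs fitted values: look for curvature (non-linearity)
# or a funnel shape (non-constant variance)
plot(fitted(model), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
```

With only 20 observations, both the test and the plots have limited power, so they are best read together rather than individually.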