library(tidyverse)
Simple linear regression is used to predict a quantitative outcome y from a single predictor variable x. The goal is to build a mathematical model (a formula) that expresses y as a function of x.
In a nutshell, this technique finds a line that best “fits” the data and takes on the following form:
\(\hat{y} = b_{0} + b_{1}x\)
where:
\(\hat{y}\): The estimated response value
\(b_{0}\): The intercept of the regression line
\(b_{1}\): The slope of the regression line
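To make the roles of \(b_{0}\) and \(b_{1}\) concrete, here is a minimal sketch of how the least-squares estimates can be computed by hand for one predictor: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept is chosen so that the line passes through the point of means. The vectors `x` and `y` below are hypothetical toy values used only for illustration; they are not the dataset in this example.

```r
# Hypothetical toy data, used only to illustrate the formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope: sample covariance of x and y over the sample variance of x
b1 <- cov(x, y) / var(x)

# Least-squares intercept: the fitted line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

# Compare the hand-computed values with the coefficients reported by lm()
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The two printed results should agree up to floating-point error, since `lm()` solves the same least-squares problem.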
For this example, we’ll create a hypothetical dataset that contains the following two variables for 20 students: total hours studied for an exam, and exam score.
We’ll attempt to fit a simple linear regression model using hours as the explanatory variable and exam score as the response variable.
data <- data.frame(hours=c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
score=c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89))
head(data)
##   hours score
## 1     1    64
## 2     2    66
## 3     4    76
## 4     5    73
## 5     5    74
## 6     6    81
model <- lm(score ~ hours, data = data)
summary(model)
##
## Call:
## lm(formula = score ~ hours, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
##  -6.549  -2.858   0.164   1.574  12.046
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  64.1673     2.2811  28.130  2.5e-16 ***
## hours         2.5955     0.3408   7.616  4.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.199 on 18 degrees of freedom
## Multiple R-squared: 0.7632, Adjusted R-squared: 0.75
## F-statistic: 58 on 1 and 18 DF, p-value: 4.904e-07
From the output above, the estimated regression line equation can be written as follows:
score = 64.167 + 2.596 × hours
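In this equation, the slope says that each additional hour studied is associated with an increase of about 2.6 points in the expected exam score, and the intercept is the expected score for zero hours of study. As a sketch of how the fitted line can be used, the block below refits the same model and predicts the score for a hypothetical student who studied 6 hours (the `new_student` data frame is an assumption made for illustration).

```r
# Same data and model as in this example
data <- data.frame(
  hours = c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
  score = c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89)
)
model <- lm(score ~ hours, data = data)

# Hypothetical new student who studied for 6 hours
new_student <- data.frame(hours = 6)

# Prediction from the fitted model
pred <- predict(model, newdata = new_student)

# The same prediction computed by hand from the estimated coefficients
b <- coef(model)
by_hand <- b[["(Intercept)"]] + b[["hours"]] * 6

print(pred)
print(by_hand)
```

With the rounded coefficients, this is 64.167 + 2.596 × 6 ≈ 79.7. Predictions are most trustworthy inside the observed range of hours (1 to 11 here).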
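Residuals. This part of the output provides a quick view of the distribution of the residuals, which by definition have a mean of zero. If the residuals are roughly symmetric, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value. Here the median (0.164) is close to zero, but the maximum (12.046) is noticeably larger in absolute value than the minimum (-6.549), which points to at least one student who scored well above the fitted line.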
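The residual summary can also be checked directly. The following is a minimal sketch that refits the same model, confirms that the residuals average to zero, reproduces the five-number summary, and looks up the observation with the largest residual in absolute value. The variable name `worst` is introduced here for illustration only.

```r
# Same data and model as in this example
data <- data.frame(
  hours = c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
  score = c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89)
)
model <- lm(score ~ hours, data = data)
res <- resid(model)

# The mean of the residuals is zero up to floating-point error
print(mean(res))

# Min, quartiles, and max of the residuals, as in the Residuals block of summary(model)
print(quantile(res))

# Row with the largest residual in absolute value
worst <- which.max(abs(res))
print(data[worst, ])
```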
res <- resid(model)
qqnorm(res)
qqline(res)
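In the Q-Q plot, points that fall close to the reference line indicate that the residuals are approximately normally distributed. When that is the case, the normality assumption of the linear model appears reasonable for these data. Note that the Q-Q plot checks only normality of the residuals; it does not by itself confirm the other assumptions of the model, such as linearity and constant variance.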
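As a complement to the Q-Q plot, one might also look at the residuals against the fitted values and run a formal normality test. The sketch below is one possible way to do this, not part of the original analysis: it uses a residuals-versus-fitted plot (no clear pattern around zero is what a well-specified linear model would show) and the Shapiro-Wilk test from base R's `stats` package.

```r
# Same data and model as in this example
data <- data.frame(
  hours = c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 7, 4, 3, 9, 10, 3, 5, 11, 7, 9),
  score = c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 83, 68, 71, 90, 92, 84, 82, 91, 83, 89)
)
model <- lm(score ~ hours, data = data)
res <- resid(model)

# Residuals vs fitted values: look for curvature or a funnel shape around the zero line
plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(res)
print(sw)
```

A small p-value from the test would be evidence against normality. With only 20 observations the test has limited power, so it is best read together with the plots.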