Linear regression models the relationship between two variables by fitting a linear equation to the observed data. Before applying regression, however, the data must meet certain assumptions so that linear regression is appropriate and we can confidently draw inferences from the fitted model.
#load the packages used in this analysis
library(tidyverse) #read_csv, dplyr, ggplot2, purrr
library(plotly)    #ggplotly
library(broom)     #augment
library(gganimate) #animated plots
headbrain = read_csv("headbrain.csv")
## Parsed with column specification:
## cols(
## Gender = col_double(),
## `Age Range` = col_double(),
## `Head Size(cm^3)` = col_double(),
## `Brain Weight(grams)` = col_double()
## )
headbrain2 = headbrain %>%
filter(Gender == 1 & `Age Range` == 1)
plot1 <- headbrain2 %>%
ggplot(aes(x = `Head Size(cm^3)`, y = `Brain Weight(grams)`)) +
geom_point(color = "darkorchid4") +
labs(title = "Head Size vs Brain Weight",
x = "Head Size (cm^3)",
y = "Brain Weight (grams)") +
theme_classic()
ggplotly(plot1)
#linearity
p1 <- plot1 +
geom_smooth(method = "lm", se = FALSE, color = "orange") +
labs(title = "Linearity")
p1
## `geom_smooth()` using formula 'y ~ x'
The first assumption that must be met is linearity: the relationship between the predictor and the response must be approximately linear. This relationship need not imply or involve causation; there simply needs to be a linear association between the two variables at hand. We can use a scatter plot with a fitted best-fit line to check for a linear relationship.
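As a supplementary check (not part of the original write-up), the strength of the linear association can also be quantified with the Pearson correlation, reusing the headbrain2 data frame defined above:

#correlation between head size and brain weight; values near +/-1 indicate a strong linear association
cor(headbrain2$`Head Size(cm^3)`, headbrain2$`Brain Weight(grams)`)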
#independence
lm_mod = lm(`Brain Weight(grams)` ~ `Head Size(cm^3)`, data = headbrain2)
lm_res = augment(lm_mod)
p2 <- ggplot(lm_res, aes(x = `Head Size(cm^3)`, y = .resid)) +
geom_point(color = "royalblue3") +
theme_classic() +
labs(title = "Independence")
p2
The next assumption is independence of the observations. All this means is that the value of one observation cannot depend on the value of another. An example of independent observations would be measuring the heights of different people at a single point in time. However, if the same person's height were measured repeatedly over time, the observations would become dependent, as the height at one time point would influence the heights at later points. One way to check this assumption is to plot the residuals against the predictor variable. As a reminder, a residual is the difference between the actual value (the data point) and the value predicted by the model (the line). To meet the assumption, there should be no discernible pattern in this residual plot; that is, the residuals should be randomly scattered around the center line of zero, as in the plot above.
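A more formal check than the residual plot is a Durbin-Watson test for autocorrelation in the residuals. The sketch below is an addition to the original analysis and assumes the lmtest package is installed:

#Durbin-Watson test on the fitted model; a p-value well above 0.05 gives no evidence against independence
library(lmtest)
dwtest(lm_mod)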
#normal residuals
p3 <- ggplot(lm_res, aes(x = .resid)) +
geom_histogram(aes(y = ..density..), bins = 20, col = "white", fill = "seagreen3") +
stat_function(fun = dnorm, args = list(mean = mean(lm_res$.resid), sd = sd(lm_res$.resid)), col = "red") +
theme_classic() +
labs(title = "Normal Residuals")
p3
The third assumption for the model is normally distributed residuals. A simple way of checking this assumption is to create a histogram of the residuals, which should look approximately normal: unimodal and roughly bell-shaped. We used a density histogram here instead of a frequency histogram; because the areas of the bars sum to 1, we can properly overlay a normal density curve and see that the data meet the normal-residuals assumption.
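A normal Q-Q plot is another common way to assess this assumption. The base-R sketch below is an addition to the original analysis and reuses the residuals stored in lm_res:

#points falling close to the reference line suggest approximately normal residuals
qqnorm(lm_res$.resid)
qqline(lm_res$.resid, col = "red")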
#equal variance
p4 <- ggplot(lm_res, aes(x = .fitted, y = .resid)) +
geom_point(color = "red") +
theme_classic() +
labs(title = "Equal Variance")
p4
The fourth and final assumption is equal variance of the residuals (homoscedasticity). Essentially, the spread of the residuals should be roughly constant across all fitted values. This assumption can be checked by examining a scatter plot of the residuals versus the fitted values: the points should show no pattern, such as a funnel shape in which the spread grows or shrinks as the fitted values increase.
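As with independence, a formal test could supplement the visual check. The sketch below is an addition to the original analysis and again assumes the lmtest package:

#Breusch-Pagan test on the fitted model; a large p-value gives no evidence against equal variance
library(lmtest)
bptest(lm_mod)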
#create a separate dataframe
newdata <- headbrain2
#fit a linear model
fit <- lm(`Brain Weight(grams)` ~ `Head Size(cm^3)`, data = newdata)
#create predicted variable and save the predicted values
newdata$predicted <- predict(fit)
#create residuals variable and save the residual values
newdata$residuals <- residuals(fit)
#store coefficient values from the model
coefs<-coef(lm(`Brain Weight(grams)` ~ `Head Size(cm^3)`, data = headbrain2))
coefs[1]
## (Intercept)
## 565.5813
coefs[2]
## `Head Size(cm^3)`
## 0.2064076
#frames for the animated residual lines
#predictor values (head size) used to compute predictions from the fitted coefficients
x<-newdata$`Head Size(cm^3)`
#intercept offsets that slide the black line up and back down across the frames
move_line<-c(seq(-50,50,.5),seq(50,-50,-.5))
#preallocate a vector to hold the error of each shifted line
total_error<-numeric(length(move_line))
count<-0
for(i in move_line){
count <- count + 1
#predicted brain weight from the fitted line shifted vertically by i
predicted_y <- coefs[2]*x + coefs[1]+i
#squared residuals for this shifted line
error_y <- (predicted_y-newdata$`Brain Weight(grams)`)^2
#root mean squared error over the 32 observations in the filtered data
total_error[count]<-sqrt(sum(error_y)/32)
}
#repeat each frame's offset and error once per observation (32 rows per frame)
move_line_sims<-rep(move_line,each=32)
total_error_sims<-rep(total_error,each=32)
#frame index: 402 frames, one per offset in move_line
sims<-rep(1:402,each=32)
#duplicate the data once per frame, then attach the frame-level columns
newdata <-newdata %>%
slice(rep(row_number(), 402))
newdata <-cbind(newdata,sims,move_line_sims,total_error_sims)
anim<-
ggplot(newdata, aes(x = `Head Size(cm^3)`, y = `Brain Weight(grams)`, frame=sims)) +
geom_smooth(method = "lm", se = FALSE, color = "springgreen2") +
geom_abline(intercept = coefs[1]+move_line_sims, slope = coefs[2])+
lims(x = c(3000,5000), y = c(1200,1600))+
geom_segment(aes(xend = `Head Size(cm^3)`, yend = predicted+move_line_sims, color="red3"), alpha = .5) +
geom_point() +
theme_classic()+
theme(legend.position="none")+
labs(title = "Head Size vs Brain Weight",
subtitle = "OLS Visualization") +
xlab("Head Size (cm^3)")+ylab("Brain Weight (grams)") +
transition_manual(frames=sims)+
enter_fade() +
exit_fade() +
ease_aes('sine-in-out')
animate(anim,fps=10)
## `geom_smooth()` using formula 'y ~ x'
As for the fit of the regression line itself, the equation of the line is most often obtained by the method of least squares. This method finds the best-fitting line for the observations by minimizing the sum of the squared residuals. Although this may sound complex, all that is happening behind the scenes is making the total of the squared vertical distances between the data points and the line as small as possible. When the assumptions above hold, ordinary least squares is a natural choice because, among linear unbiased estimators, it produces the slope and intercept estimates with the smallest variance.
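Concretely, writing \(\hat{y}_i = b_0 + b_1 x_i\) for the value the line predicts at \(x_i\), least squares chooses the intercept \(b_0\) and slope \(b_1\) that minimize

\[
\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2.
\]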
As you can see in the animation, the green line sits where the linear model has the least total error relative to the points (the residuals are drawn as vertical red lines). The black line slides up and down to show how much larger those residuals would become if the line were shifted above or below its least-squares position.
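As an added sanity check (not part of the original write-up), the closed-form least-squares solution can be computed directly and compared with the coefficients stored in coefs above; the helper names x_hs, y_bw, slope_hat, and intercept_hat are introduced here purely for illustration:

#closed-form least-squares estimates for the filtered data
x_hs <- headbrain2$`Head Size(cm^3)`
y_bw <- headbrain2$`Brain Weight(grams)`
slope_hat <- cov(x_hs, y_bw)/var(x_hs)             #slope = covariance over predictor variance
intercept_hat <- mean(y_bw) - slope_hat*mean(x_hs) #line passes through the point of averages
c(intercept = intercept_hat, slope = slope_hat)    #should match coefs[1] and coefs[2]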
set.seed(0)
x_values <- 1:30
r_values <- seq(0, 1, by=0.025)
x <- rep(x_values, length(r_values))
y <- unlist(map(r_values, ~runif(x_values, .)*x_values))
r <- rep(r_values, each=length(x_values))
data <- data.frame(x=x, y=y, r=r)
legend <- c("Linear Regression Line" = "blue", "SD Line" = "red")
anim_graph <- ggplot(data %>%
group_by(r) %>%
mutate(
sd_slope=sd(y)/sd(x),
sd_intercept = mean(y)- (sd(y)/sd(x))*mean(x),
r_slope = r*sd(y)/sd(x),
r_intercept = mean(y)- (r*sd(y)/sd(x))*mean(x))
, aes(x, y)) +
geom_point(color = "purple4", alpha = 0.7, size = 3) +
#sd line
geom_abline(aes(intercept=sd_intercept, slope=sd_slope), col="firebrick1") +
#corr. coefficient line
geom_abline(aes(intercept=r_intercept, slope=r_slope), col="royalblue2") +
labs(title = "Showing Change in the Coefficient Correlation r",
subtitle = 'r = {round(frame_time, 2)}',
x = 'x',
y = 'y') +
theme(panel.background = element_blank()) +
transition_time(r) +
ease_aes('linear')
anim_graph
Another crucial concept in regression is the correlation coefficient, which measures the degree of linear association between the variables in a regression. It is often denoted r and ranges from -1 to 1, with its sign matching the sign of the slope. Another useful way to gauge the degree of correlation is the standard deviation line. The standard deviation line passes through the point of averages (the point whose coordinates are the means of the two variables) and has a slope equal to the standard deviation of Y divided by the standard deviation of X. The closer the correlation coefficient is to plus or minus 1, the closer the regression line will be to the standard deviation line, so the relative positions of the two lines give a visual aid for discerning the correlation between the variables.
In the simulation above, we modeled a scatterplot in which X and Y have a positive relationship, so r starts at 0 and gradually increases to 1. The blue line represents the linear regression model and the red line denotes the standard deviation line. As r increases, you can see the blue line's slope getting steeper until it aligns with the red line. A correlation coefficient of 1 signifies a perfect linear relationship, meaning that all the points lie precisely on the line (though with real data this almost never happens).
Finally, it is important to interpret the correlation coefficient accurately. Remember, it is simply a measure of how strongly a pair of variables is linearly related, and it does not imply causation. Another common metric in regression is the coefficient of determination, or \(R^2\), which in simple linear regression is just what it appears to be: the correlation coefficient squared. The coefficient of determination measures how much of the variation in the response is explained by the regression model. For instance, an \(R^2\) of 0.9 means that the variation in X explains 90% of the variation in Y.
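As a quick illustration with the head-size data from earlier (an addition to the original write-up), r, the slope of the standard deviation line, and \(R^2\) can all be computed directly; the helper names r_hb and sd_slope_hb are introduced here purely for illustration:

#correlation coefficient for the filtered head size / brain weight data
r_hb <- cor(headbrain2$`Head Size(cm^3)`, headbrain2$`Brain Weight(grams)`)
#slope of the SD line; multiplying it by r gives the regression slope
sd_slope_hb <- sd(headbrain2$`Brain Weight(grams)`)/sd(headbrain2$`Head Size(cm^3)`)
r_hb*sd_slope_hb                                   #should match coefs[2], the fitted slope
#squaring r reproduces the R^2 reported by the fitted model
c(r_squared = r_hb^2, model_r_squared = summary(fit)$r.squared)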
We think we should get an Excellent for this project because we have fulfilled all the Satisfactory requirements and shown complexity in the backend development. Creating the two animations was quite difficult and demanded a significant time commitment for debugging. We also think we did a good job of explaining linear regression and touching on its details.