Intro
Build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Libraries
library(tidyverse)
library(openintro)
Data
data("nycflights")
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
Data Visualization
Scatter plot of arrival delay against departure delay for all flights in the dataset, colored by carrier.
ggplot(data = nycflights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
geom_point()
Model
model <- lm(arr_delay ~ dep_delay, data = nycflights)
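# Residuals from the fitted model
arr_delay.res <- resid(model)
# Plot residuals against fitted values. Plotting them against the observed
# response (arr_delay) builds in a trend, because residuals are correlated with it.
ggplot(data = nycflights, aes(x = fitted(model), y = arr_delay.res)) +
  geom_hline(yintercept = 0) +
  geom_point(color = "steelblue") +
  theme_minimal() +
  labs(x = "Fitted Arrival Delay", y = "Residuals")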
summary(model)
##
## Call:
## lm(formula = arr_delay ~ dep_delay, data = nycflights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.544 -11.138 -1.316 8.836 158.627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.770652 0.103834 -55.58 <2e-16 ***
## dep_delay 1.013090 0.002451 413.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.92 on 32733 degrees of freedom
## Multiple R-squared: 0.8392, Adjusted R-squared: 0.8392
## F-statistic: 1.708e+05 on 1 and 32733 DF, p-value: < 2.2e-16
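To make the coefficients concrete: the fitted line is roughly arr_delay = -5.77 + 1.013 × dep_delay, so each extra minute of departure delay is associated with about one extra minute of arrival delay. The sketch below refits the same model and computes a prediction both by hand from the coefficients and with `predict()`. The 60-minute departure delay and the names `example_delay`, `by_hand`, and `via_predict` are illustrative assumptions, not part of the original analysis. With the coefficients printed above, the prediction works out to about 55 minutes.

```r
library(openintro)

data("nycflights")
model <- lm(arr_delay ~ dep_delay, data = nycflights)

# Illustrative input (an assumption, not a value from the original analysis)
example_delay <- 60

# Prediction by hand: intercept + slope * departure delay
b <- coef(model)
by_hand <- unname(b["(Intercept)"] + b["dep_delay"] * example_delay)

# Same prediction through predict()
via_predict <- unname(
  predict(model, newdata = data.frame(dep_delay = example_delay))
)

by_hand
via_predict
```

Both values should agree, since `predict()` applies the same intercept and slope to the new data.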
Conclusion
Based on the residual plot, a linear regression model does not appear to be a good fit for these data. The residuals are not homoscedastic: their variability is not constant across the plot. In addition, the residuals contain large outliers (the summary reports a maximum of 158.6 against a median of -1.3), so they do not appear to be normally distributed.
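The two claims in the conclusion can be checked a little more directly. The following is a minimal sketch, not a definitive diagnostic: it compares the standard deviation of the residuals within bins of departure delay (a rough check of constant variance) and draws a normal Q-Q plot of the residuals (a rough check of normality). The bin edges and the names `res`, `delay_bin`, and `spread_by_bin` are illustrative assumptions, not part of the original analysis.

```r
library(openintro)

data("nycflights")
model <- lm(arr_delay ~ dep_delay, data = nycflights)
res <- resid(model)

# Constant variance check: residual spread within bins of departure delay.
# Bin edges (in minutes) are illustrative choices.
delay_bin <- cut(nycflights$dep_delay,
                 breaks = c(-Inf, 0, 15, 60, 120, Inf))
spread_by_bin <- tapply(res, delay_bin, sd)
spread_by_bin

# Normality check: points far from the reference line suggest
# skewness or heavy tails in the residuals.
qqnorm(res)
qqline(res)
```

If the standard deviations are similar across bins and the Q-Q points follow the line, the assumptions are plausible; clearly unequal spreads or strong curvature in the tails would support the conclusion above.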