Computational Mathematics - Regression Analysis I

Euclides Rodriguez

2022-04-07

Intro

Build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Libraries

library(tidyverse)
library(openintro)

Data

data("nycflights")
names(nycflights)
##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay" "arr_time" 
##  [7] "arr_delay" "carrier"   "tailnum"   "flight"    "origin"    "dest"     
## [13] "air_time"  "distance"  "hour"      "minute"

Data Visualization

Scatter plot for departure delays and arrival delays for all carriers in the dataset.

ggplot(data = nycflights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point()

Model

model <- lm(arr_delay ~ dep_delay, data = nycflights)

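# Residuals of the fitted model (observed minus fitted arrival delay)
arr_delay.res <- resid(model)

# Plot residuals against fitted values rather than the observed response:
# residuals are correlated with the response by construction, which would
# produce a misleading upward trend in the plot.
ggplot(data = nycflights, aes(x = fitted(model), y = arr_delay.res)) +
  geom_hline(yintercept = 0) +
  geom_point(color = "steelblue") +
  theme_minimal() +
  labs(x = "Fitted Arrival Delay", y = "Residuals")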

summary(model)
## 
## Call:
## lm(formula = arr_delay ~ dep_delay, data = nycflights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.544 -11.138  -1.316   8.836 158.627 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.770652   0.103834  -55.58   <2e-16 ***
## dep_delay    1.013090   0.002451  413.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.92 on 32733 degrees of freedom
## Multiple R-squared:  0.8392, Adjusted R-squared:  0.8392 
## F-statistic: 1.708e+05 on 1 and 32733 DF,  p-value: < 2.2e-16
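
To make the coefficient table easier to read, here is a minimal sketch that turns the estimates into a prediction. The intercept and slope are copied from the summary output above. The helper name predict_arr_delay and the 60-minute example are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(model) output above
b0 <- -5.770652  # (Intercept)
b1 <- 1.013090   # dep_delay slope

# Hypothetical helper: predicted arrival delay (minutes) for a departure delay
predict_arr_delay <- function(dep_delay) {
  b0 + b1 * dep_delay
}

# A flight leaving on time and a flight leaving 60 minutes late
predict_arr_delay(c(0, 60))
# the second value is about 55.0 minutes
```

With the fitted model object available, predict(model, newdata = data.frame(dep_delay = 60)) should give the same value up to rounding of the printed coefficients.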

Conclusion

Based on the residual plot, a linear regression model is not a good fit for these two variables. The residuals are not homoscedastic: their spread is not constant across the plot. Additionally, the data contain outliers, and the residuals do not appear to be normally distributed.
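
The two claims above (non-constant variance and non-normal residuals) can be checked more directly. The following is a rough sketch, not part of the original analysis: it refits the same model, draws a normal Q-Q plot of the residuals, and compares the residual standard deviation across quartile bins of the fitted values. The names diag_df, qq_plot, and sd_by_bin, and the choice of quartile bins, are assumptions made for illustration.

```r
library(ggplot2)
library(openintro)

data("nycflights")
model <- lm(arr_delay ~ dep_delay, data = nycflights)

# Collect fitted values and residuals in one data frame
diag_df <- data.frame(
  fitted = fitted(model),
  resid  = resid(model)
)

# Normal Q-Q plot: points far from the reference line suggest
# the residuals are not normally distributed
qq_plot <- ggplot(diag_df, aes(sample = resid)) +
  geom_qq(color = "steelblue") +
  geom_qq_line() +
  theme_minimal() +
  labs(x = "Theoretical Quantiles", y = "Residual Quantiles")
print(qq_plot)

# Residual standard deviation within quartile bins of the fitted values;
# unique() guards against tied quantiles producing duplicate breaks
breaks <- unique(quantile(diag_df$fitted, probs = seq(0, 1, by = 0.25)))
diag_df$bin <- cut(diag_df$fitted, breaks = breaks, include.lowest = TRUE)
sd_by_bin <- tapply(diag_df$resid, diag_df$bin, sd)
print(sd_by_bin)
```

If the standard deviations differ substantially between bins, that supports the conclusion that the variability is not constant. Similar standard deviations would suggest the opposite.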