Objective: Fit a linear model to US air traffic data.
Step 1 => Read in data
Step 2 => Find variables which you think have a linear relationship
In this case we think AIR_TIME and DISTANCE have a linear relationship.
# Syntax: lm([target variable] ~ [predictor variable], data = [data source])
# In R, to add another predictor to the model, join it to the formula with +.
# Ex: lm([target variable] ~ [predictor_variable1] + [predictor_variable2], data = [data source])
load(file = "flights_delay.RData")
simple.fit = lm(AIR_TIME ~ DISTANCE, data = flight_delay)
summary(simple.fit)
##
## Call:
## lm(formula = AIR_TIME ~ DISTANCE, data = flight_delay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -108.442 -6.138 -1.308 5.159 105.749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.710e+01 3.997e-02 427.7 <2e-16 ***
## DISTANCE 1.170e-01 3.905e-05 2996.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.34 on 269771 degrees of freedom
## (5191 observations deleted due to missingness)
## Multiple R-squared: 0.9708, Adjusted R-squared: 0.9708
## F-statistic: 8.979e+06 on 1 and 269771 DF, p-value: < 2.2e-16
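To read the coefficients concretely, the fitted line can be evaluated by hand. This is a minimal sketch that copies the two estimates from the summary output; the input of 1000 is an arbitrary illustrative distance, and the units are whatever the dataset uses for DISTANCE and AIR_TIME, which the output does not state.

```r
# Estimates copied from the printed summary of simple.fit
intercept <- 17.10    # (Intercept)
slope     <- 0.1170   # DISTANCE

# Hypothetical input: a flight covering 1000 distance units
distance <- 1000

# Fitted line: AIR_TIME = intercept + slope * DISTANCE
predicted_air_time <- intercept + slope * distance
predicted_air_time
```

This gives 134.1. With the fitted object available, predict(simple.fit, newdata = data.frame(DISTANCE = 1000)) should agree up to the rounding of the printed estimates.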
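A good way to check the quality of fit is to look at the residuals, the differences between the observed values and the values the model predicts. When the model includes an intercept, least-squares residuals always sum to zero, so the sum alone tells us nothing. What we want is residuals that are small and roughly symmetric around zero. Here the median is -1.308 and the quartiles are -6.138 and 5.159, which is close to symmetric, although the extremes (-108.442 and 105.749) show that some flights are predicted badly.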
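The residual check can be sketched on a small stand-in dataset. The example below uses R's built-in cars data rather than the flight data (an assumption made so the code runs on its own); the same calls should apply to simple.fit.

```r
# Stand-in example on R's built-in cars data (not the flight data)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# With an intercept, least-squares residuals sum to zero up to rounding error
sum(res)

# Spread and symmetry are what carry information about fit quality
summary(res)

# Residuals against fitted values: look for curvature or a funnel shape
plot(fitted(fit), res, pch = 16, col = "red")
abline(h = 0)
```

A visible pattern in the last plot suggests the straight line is missing structure, even when R-squared is high.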
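The p-value tells us whether we can reject the null hypothesis that a coefficient is zero, that is, that the predictor has no linear relationship with the target. If the p-value is small, the variable is a useful addition to our model.
The p-value for DISTANCE is below 2e-16: if DISTANCE were unrelated to AIR_TIME, a t value this extreme would occur with probability less than 2e-16, so we reject that hypothesis.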
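The p-values do not have to be read off the printed summary; they can be pulled out of the coefficient table. This is a small sketch on the built-in cars data (a stand-in, not the flight data), with 0.05 as an assumed significance level.

```r
# Stand-in example on R's built-in cars data (not the flight data)
fit <- lm(dist ~ speed, data = cars)

# coef(summary(fit)) is a matrix with one row per coefficient
coef_table <- coef(summary(fit))
coef_table

# Pull out the p-value for the speed coefficient by row and column name
p_speed <- coef_table["speed", "Pr(>|t|)"]

# Assumed significance level of 0.05
p_speed < 0.05
```

For the flight model the equivalent lookup would be coef(summary(simple.fit))["DISTANCE", "Pr(>|t|)"].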
Testing whether the linear model is a good fit
R^2 = Variation explained by the model / Total variation in the target
Multiple R-squared is 0.9708. Notice that the summary reports two R-squared values, multiple and adjusted.
One problem with multiple R-squared is that it never decreases when we add more variables, even variables that have no effect on the end result. Adjusted R-squared penalizes each extra predictor, so we use it to compare models with different numbers of variables.
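Both points can be checked numerically. The sketch below uses the built-in cars data as a stand-in and adds a made-up noise column (hypothetical, unrelated to the target). It computes R-squared as 1 - RSS/TSS, which for a least-squares fit with an intercept equals explained variation over total variation.

```r
set.seed(42)

# Stand-in example on R's built-in cars data (not the flight data)
fit1 <- lm(dist ~ speed, data = cars)

# R^2 by hand: 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(fit1)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r2_manual <- 1 - rss / tss

# Add a pure-noise predictor (hypothetical column, unrelated to dist)
cars2 <- cars
cars2$noise <- rnorm(nrow(cars2))
fit2 <- lm(dist ~ speed + noise, data = cars2)

# Manual value matches the one reported by summary()
c(manual = r2_manual, reported = summary(fit1)$r.squared)

# Multiple R-squared cannot go down when a predictor is added
c(simple = summary(fit1)$r.squared, with_noise = summary(fit2)$r.squared)

# Adjusted R-squared can go down, which makes it the fairer comparison
c(simple = summary(fit1)$adj.r.squared, with_noise = summary(fit2)$adj.r.squared)
```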
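Let's plot AIR_TIME against DISTANCE and look at the model line
# Scatter of the two model variables; abline() is a separate call, not chained with +
plot(flight_delay$DISTANCE, flight_delay$AIR_TIME, pch = 16, col = "red")
abline(simple.fit)
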
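Let's plot the observed AIR_TIME against the fitted values of simple.fit
# model.frame() keeps only the rows used in the fit (rows with missing values were dropped)
plot(fitted(simple.fit), model.frame(simple.fit)$AIR_TIME, pch = 16, col = "red")
abline(0, 1)  # points on this line are predicted exactly
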
The linear model does not capture everything (the largest residuals exceed 100 in absolute value), so we introduce a quadratic term.
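The call that created quadratic.fit is not shown here. One common way to add a quadratic term in R is to wrap the square in I(), sketched below on simulated stand-in data. The data frame toy, its coefficients, and the model names simple.toy and quadratic.toy are all hypothetical, and this is an assumption about how quadratic.fit was built, not the original call.

```r
set.seed(1)

# Simulated stand-in data; the real analysis would use flight_delay instead
toy <- data.frame(DISTANCE = runif(500, 100, 2500))
toy$AIR_TIME <- 17 + 0.12 * toy$DISTANCE - 5e-6 * toy$DISTANCE^2 +
  rnorm(500, sd = 12)

simple.toy <- lm(AIR_TIME ~ DISTANCE, data = toy)

# I() protects ^ so that it means arithmetic squaring inside the formula
quadratic.toy <- lm(AIR_TIME ~ DISTANCE + I(DISTANCE^2), data = toy)

# Compare adjusted R-squared of the two models
summary(simple.toy)$adj.r.squared
summary(quadratic.toy)$adj.r.squared

# F-test for whether the quadratic term improves on the straight line
anova(simple.toy, quadratic.toy)
```

On the flight data the analogous call would presumably be lm(AIR_TIME ~ DISTANCE + I(DISTANCE^2), data = flight_delay).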
Comparing the R-squared values of simple.fit and quadratic.fit, we see that R-squared has improved by 0.0004.
Let's plot AIR_TIME against DISTANCE and look at the model curve = quadratic.fit
plot(flight_delay$DISTANCE, flight_delay$AIR_TIME, pch = 16, col = "red")
# abline() draws only straight lines; assumes coefficients are ordered intercept, DISTANCE, DISTANCE^2
b = coef(quadratic.fit)
curve(b[1] + b[2] * x + b[3] * x^2, add = TRUE)

Let's plot the observed AIR_TIME against the fitted values of quadratic.fit
# model.frame() keeps only the rows used in the fit (rows with missing values were dropped)
plot(fitted(quadratic.fit), model.frame(quadratic.fit)$AIR_TIME, pch = 16, col = "red")
abline(0, 1)  # points on this line are predicted exactly

Let's plot the model itself = quadratic.fit
# quadratic.fit is a model object, not a column of flight_delay, so plot the object directly
# which = 1 selects the residuals versus fitted values diagnostic plot
plot(quadratic.fit, which = 1)
Detecting influential points of quadratic fit
plot(cooks.distance(quadratic.fit), pch = 16, col = "blue")

Detecting influential points of simple fit
plot(cooks.distance(simple.fit), pch = 16, col = "blue")
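To go from the plot to a list of points, the Cook's distances can be compared with a cutoff. The sketch below uses the built-in cars data as a stand-in and the common 4/n rule of thumb; the cutoff is an assumption, not something the analysis above specifies.

```r
# Stand-in example on R's built-in cars data (not the flight data)
fit <- lm(dist ~ speed, data = cars)
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption): 4 / number of observations
cutoff <- 4 / nrow(cars)

# Row indices of observations whose Cook's distance exceeds the cutoff
influential <- which(cd > cutoff)
influential

# Same plot as above, with the cutoff drawn as a dashed line
plot(cd, pch = 16, col = "blue")
abline(h = cutoff, lty = 2)
```

Flagged points are candidates for a closer look, not automatic removals.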

This concludes our discussion of the linear model.