Objective: Fit a linear model to US air traffic data.

Step 1 => Read in data

Step 2 => Find variables which you think have a linear relationship

In this case, we think AIR_TIME and DISTANCE have a linear relationship.
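Before fitting, it can help to check how strong the linear relationship is. The sketch below uses a small simulated data frame named toy (a hypothetical stand-in, not the real flight data) with the same column names, and computes the Pearson correlation with cor().

```r
# Simulated stand-in for the flight data (hypothetical values, not the real dataset)
set.seed(1)
toy <- data.frame(DISTANCE = runif(200, min = 100, max = 2500))
toy$AIR_TIME <- 17 + 0.117 * toy$DISTANCE + rnorm(200, sd = 12)

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship.
# use = "complete.obs" drops rows where either column is NA.
r <- cor(toy$DISTANCE, toy$AIR_TIME, use = "complete.obs")
print(r)

# A scatter plot shows whether the relationship really looks like a straight line
plot(AIR_TIME ~ DISTANCE, data = toy, pch = 16, col = "red")
```

A high correlation alone does not prove the relationship is a straight line, which is why the scatter plot is worth a look as well.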

# Syntax: lm([target variable] ~ [predictor variable], data = [data source])
# In R, to add another predictor to the model, join it to the formula with +.
# Ex: lm([target variable] ~ [predictor_variable1] + [predictor_variable2], data = [data source])
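As a concrete illustration of the formula syntax, here is a minimal sketch using R's built-in mtcars data set (chosen only as a stand-in; it is not part of this analysis). Each predictor joined with + gets its own coefficient.

```r
# One predictor: miles per gallon explained by weight
one_pred <- lm(mpg ~ wt, data = mtcars)

# Two predictors: add horsepower with +
two_pred <- lm(mpg ~ wt + hp, data = mtcars)

# Each predictor contributes one coefficient, plus the intercept
print(names(coef(one_pred)))
print(names(coef(two_pred)))
```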

load(file = "flights_delay.RData")

simple.fit = lm(AIR_TIME ~ DISTANCE, data = flight_delay)

summary(simple.fit)
## 
## Call:
## lm(formula = AIR_TIME ~ DISTANCE, data = flight_delay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.442   -6.138   -1.308    5.159  105.749 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.710e+01  3.997e-02   427.7   <2e-16 ***
## DISTANCE    1.170e-01  3.905e-05  2996.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.34 on 269771 degrees of freedom
##   (5191 observations deleted due to missingness)
## Multiple R-squared:  0.9708, Adjusted R-squared:  0.9708 
## F-statistic: 8.979e+06 on 1 and 269771 DF,  p-value: < 2.2e-16

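A good way to test the quality of a model's fit is to look at the residuals, the differences between the observed values and the predicted values. For a least-squares fit with an intercept, the residuals always sum to zero, so what matters is that they are small and spread evenly around zero. Here the median residual is -1.308 and the quartiles are -6.138 and 5.159, which is roughly symmetric, although the extremes are beyond 100 in both directions.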
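The residuals can be pulled out of any lm fit and inspected directly. The following is a minimal sketch on simulated data (x, y and fit are hypothetical names, not objects from this analysis).

```r
# Simulated data with a known linear relationship (hypothetical example)
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

# Residuals are observed values minus fitted values
res <- residuals(fit)
manual <- y - fitted(fit)

# With an intercept in the model, least-squares residuals sum to (numerically) zero
print(sum(res))

# The five-number summary matches the "Residuals" block of summary(fit)
print(quantile(res))

# Residuals vs fitted values: a patternless band around zero is what we hope for
plot(fitted(fit), res, pch = 16, col = "red")
abline(h = 0, lty = 2)
```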

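The p-value tells us whether we can reject the null hypothesis that a coefficient is zero. If the p-value is small (conventionally below 0.05), the variable is likely a useful addition to our model. Here both p-values are below 2e-16.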
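The p-values shown in the summary can also be extracted programmatically. This sketch uses simulated data in which x truly affects y and noise does not (all names here are hypothetical).

```r
# Simulated data: y depends on x but not on noise (hypothetical example)
set.seed(7)
n <- 200
d <- data.frame(x = rnorm(n), noise = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

fit <- lm(y ~ x + noise, data = d)

# coef(summary(fit)) is the coefficient table as a matrix;
# the "Pr(>|t|)" column holds the p-values
coefs <- coef(summary(fit))
p_values <- coefs[, "Pr(>|t|)"]
print(p_values)
```

The p-value for x should be tiny, while the one for noise will usually be large, since noise carries no information about y.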

Testing whether the linear model is a good fit

R^2 = Explained variation of the model / Total variation of the model

Multiple R-squared is 0.9708. Notice that there are two R-squared values in the summary.

One problem with multiple R-squared is that it never decreases when we add more variables, even ones that have no effect on the end result. Adjusted R-squared penalizes extra variables, so we use it to compare models with different numbers of variables.
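To make the two R-squared values concrete, the sketch below computes them by hand on simulated data and then adds a pure-noise predictor called junk (all names are hypothetical). Multiple R-squared cannot go down when junk is added; adjusted R-squared applies a penalty for the extra term.

```r
# Simulated data: y depends on x only; junk is unrelated noise (hypothetical example)
set.seed(123)
n <- 100
d <- data.frame(x = rnorm(n), junk = rnorm(n))
d$y <- 5 + 3 * d$x + rnorm(n)

fit1 <- lm(y ~ x, data = d)
fit2 <- lm(y ~ x + junk, data = d)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit1)^2)
tss <- sum((d$y - mean(d$y))^2)
r2_manual <- 1 - rss / tss

# Values reported by summary()
r2_1 <- summary(fit1)$r.squared
r2_2 <- summary(fit2)$r.squared
adj_1 <- summary(fit1)$adj.r.squared
adj_2 <- summary(fit2)$adj.r.squared

# Adjusted R^2 by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = 1 predictor
adj_manual <- 1 - (1 - r2_1) * (n - 1) / (n - 1 - 1)

print(c(r2_1 = r2_1, r2_2 = r2_2))
print(c(adj_1 = adj_1, adj_2 = adj_2))
```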

Let’s plot AIR_TIME against DISTANCE and look at the model line

plot(AIR_TIME ~ DISTANCE, data = flight_delay, pch = 16, col = "red")
abline(simple.fit)

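Let’s plot the observed AIR_TIME against the fitted values; points close to the diagonal line are predicted well

# Observed response (rows with missing values already dropped) vs fitted values
plot(fitted(simple.fit), model.response(model.frame(simple.fit)), pch = 16, col = "red",
     xlab = "Fitted AIR_TIME", ylab = "Observed AIR_TIME")
abline(0, 1)  # 45-degree reference line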

Let’s plot the model diagnostics (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)

plot(simple.fit, pch = 16, col = "green")

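We judge that the straight line does not fit the data well enough, so we introduce a quadratic term.

Adding a transformed copy of a predictor (here DISTANCE squared) is called a transformation of a variable. The result is a polynomial regression, which is still linear in its coefficients.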
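When writing a squared term in an R formula, the I() wrapper matters: inside a formula, ^ is a formula operator rather than arithmetic, so x^2 on its own collapses to x. The sketch below demonstrates this on simulated data (d, without_I and with_I are hypothetical names).

```r
# Simulated data with genuine curvature (hypothetical example)
set.seed(10)
d <- data.frame(x = runif(100, min = 0, max = 10))
d$y <- 1 + 2 * d$x - 0.3 * d$x^2 + rnorm(100)

# Without I(): x^2 is read as a formula term and reduces to x
without_I <- lm(y ~ x + x^2, data = d)

# With I(): x^2 is evaluated arithmetically and enters as its own predictor
with_I <- lm(y ~ x + I(x^2), data = d)

print(names(coef(without_I)))  # "(Intercept)" "x"
print(names(coef(with_I)))     # "(Intercept)" "x" "I(x^2)"
```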

quadratic.fit = lm(AIR_TIME ~ DISTANCE + I(DISTANCE^2), data = flight_delay)

# Summarize the new fit
summary(quadratic.fit)
## 
## Call:
## lm(formula = AIR_TIME ~ DISTANCE + I(DISTANCE^2), data = flight_delay)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.085   -5.926   -1.062    5.053  115.103 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.464e+01  5.931e-02  246.89   <2e-16 ***
## DISTANCE       1.231e-01  1.154e-04 1066.62   <2e-16 ***
## I(DISTANCE^2) -2.418e-06  4.337e-08  -55.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.27 on 269770 degrees of freedom
##   (5191 observations deleted due to missingness)
## Multiple R-squared:  0.9712, Adjusted R-squared:  0.9712 
## F-statistic: 4.543e+06 on 2 and 269770 DF,  p-value: < 2.2e-16

Comparing R-squared in simple.fit and quadratic.fit, we see it rises from 0.9708 to 0.9712, an improvement of only 0.0004. The residual standard error falls slightly, from 12.34 to 12.27.
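Besides comparing R-squared by eye, nested models can be compared with an F-test via anova() or with AIC(). This is a sketch on simulated data with real curvature (lin, quad and d are hypothetical names); it is one common way to make the comparison, not necessarily the one intended here.

```r
# Simulated data with a quadratic component (hypothetical example)
set.seed(2024)
n <- 300
d <- data.frame(x = runif(n, min = 0, max = 10))
d$y <- 2 + 1.5 * d$x - 0.1 * d$x^2 + rnorm(n)

lin <- lm(y ~ x, data = d)
quad <- lm(y ~ x + I(x^2), data = d)

# F-test for nested models: does the extra quadratic term reduce the
# residual sum of squares by more than chance would explain?
cmp <- anova(lin, quad)
print(cmp)

# AIC penalizes extra parameters; lower is better
print(AIC(lin, quad))
```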

Let’s plot AIR_TIME against DISTANCE and look at the model curve = quadratic.fit

plot(flight_delay$DISTANCE, flight_delay$AIR_TIME, pch = 16, col = "red")
# abline() can only draw straight lines, so draw the fitted parabola with curve()
b = coef(quadratic.fit)
curve(b[1] + b[2] * x + b[3] * x^2, add = TRUE)

Let’s plot the observed AIR_TIME against the fitted values = quadratic.fit

# Observed response (rows with missing values already dropped) vs fitted values
plot(fitted(quadratic.fit), model.response(model.frame(quadratic.fit)), pch = 16, col = "red",
     xlab = "Fitted AIR_TIME", ylab = "Observed AIR_TIME")
abline(0, 1)  # 45-degree reference line

Let’s plot the model diagnostics = quadratic.fit

# quadratic.fit is a standalone object, not a column of flight_delay
plot(quadratic.fit, pch = 16, col = "green")

Detecting influential points of the quadratic fit

plot(cooks.distance(quadratic.fit), pch = 16, col = "blue")

Detecting influential points of the simple fit

plot(cooks.distance(simple.fit), pch = 16, col = "blue")
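To go from the plot to a list of suspect observations, one common rule of thumb flags points whose Cook's distance exceeds 4/n. That cutoff is an assumption here, not something this analysis prescribes. The sketch uses simulated data with one deliberately planted outlier (all names are hypothetical).

```r
# Simulated data: 50 well-behaved points plus one planted outlier at position 51
set.seed(99)
x <- c(rnorm(50), 10)
y <- c(2 * x[1:50] + rnorm(50), -30)
fit <- lm(y ~ x)

cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption; other cutoffs such as 1 are also used)
threshold <- 4 / length(cd)
influential <- which(cd > threshold)
print(influential)

# Same plot as above, with the cutoff drawn as a dashed line
plot(cd, pch = 16, col = "blue")
abline(h = threshold, lty = 2)
```

Flagged points are candidates for a closer look, not automatic deletions; they may be data errors or genuine but unusual flights.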

This concludes our discussion of linear models.