Linear Regression for the Scooter Data

Reading in the data - let’s do starts first

First we will read in the data from “start.csv” and “trimet.csv”

library(ggplot2)

start <- read.csv("start.csv")
trimet <- read.csv("trimet.csv")
names(start)[2] <- "census_block"
names(trimet)[1] <- "census_block"

Joining the data together using the merge() function

df <- merge(start, trimet, by = "census_block")
df <- df[-c(2)]  # drop the leftover second column from the merge

# The column names are a little weird, so the code
# below just fixes up the column names 

names(df)[2] <- "scooter_starts"
names(df)[3] <- "trimet_stops"
summary(df)
##        census_block scooter_starts      trimet_stops   
##  410050208001:  1   Min.   :    1.00   Min.   : 1.000  
##  410050208002:  1   1st Qu.:   35.75   1st Qu.: 4.000  
##  410050208003:  1   Median :  114.50   Median : 7.000  
##  410050209001:  1   Mean   : 1458.42   Mean   : 8.436  
##  410050209002:  1   3rd Qu.:  608.00   3rd Qu.:10.000  
##  410050209003:  1   Max.   :53961.00   Max.   :79.000  
##  (Other)     :466

Plotting and making an initial model

First we will use R’s basic plot function

#Plot scooter starts (y axis) as a function of trimet stops (x axis)
plot(scooter_starts ~ trimet_stops, df)

Now I’ll use ggplot2 to make prettier graphs, and I’ll stick with it from here on - anywhere you see ggplot(), it’s just me plotting the graphs.

ggplot(df, aes(x = trimet_stops, y = scooter_starts)) + 
  geom_point() +
  ggtitle("E-Scooter Trip Starts vs. Trimet Stops per Census Block") +
  xlab("Trimet Stops") + 
  ylab("Scooter Trip Starts")

ggsave("start_vs_trips.png")
## Saving 7 x 5 in image

Making the linear model

We will make and plot our linear model using lm() and a plotting helper, ggplotRegression().

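ggplotRegression() is never defined in this document. Below is a minimal sketch of what such a helper might look like - this follows a common pattern (plot the model frame, overlay the fitted line, annotate the fit statistics) and is an assumption on my part, not necessarily the exact function used here.

# Sketch of a ggplotRegression() helper (assumed - not the author's
# original definition). It plots the model frame with the fitted
# line and annotates adjusted R-squared and the slope p-value.
ggplotRegression <- function(fit) {
  stopifnot(inherits(fit, "lm"))
  ggplot(fit$model, aes(x = .data[[names(fit$model)[2]]],
                        y = .data[[names(fit$model)[1]]])) +
    geom_point() +
    stat_smooth(method = "lm") +
    labs(subtitle = paste0("Adj. R-squared = ",
                           signif(summary(fit)$adj.r.squared, 3),
                           ", slope p = ",
                           signif(summary(fit)$coefficients[2, 4], 3)))
}
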
model <- lm(df$scooter_starts ~ df$trimet_stops)
ggplotRegression(model) + ggtitle("Scooter Trip Starts vs. Trimet Stops") +
  xlab("Trimet Stops") + 
  ylab("Scooter Trip Starts")
## `geom_smooth()` using formula 'y ~ x'

ggsave("start_regression_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
summary(model)
## 
## Call:
## lm(formula = df$scooter_starts ~ df$trimet_stops)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12480  -1347   -834   -280  49806 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        62.17     344.13   0.181    0.857    
## df$trimet_stops   165.50      30.09   5.501 6.21e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5049 on 470 degrees of freedom
## Multiple R-squared:  0.06049,    Adjusted R-squared:  0.05849 
## F-statistic: 30.26 on 1 and 470 DF,  p-value: 6.211e-08

Getting rid of outliers

As you can see from the summary, while the p-value for the trimet_stops coefficient was significant, the p-value for the intercept was not less than 0.05, and the R-squared of about 0.06 means the model explains very little of the variance. Even looking at the plot, you can tell that a handful of very large values have a lot of sway on the fitted line, so we can get rid of the outliers to see if that has any positive effect. To do this, we keep only the observations inside the usual 1.5 * IQR fences - that is, between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR - which drops only the extreme values, not the bottom and top 25% of the data.

Q <- quantile(df$scooter_starts, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$scooter_starts)
eliminated <- subset(df, df$scooter_starts > (Q[1] - 1.5*iqr) & df$scooter_starts < (Q[2]+1.5*iqr))

Q2 <- quantile(df$trimet_stops, probs=c(.25, .75), na.rm = FALSE)
iqr2 <- IQR(df$trimet_stops)
eliminated2 <- subset(df, df$trimet_stops > (Q2[1] - 1.5*iqr2) & df$trimet_stops < (Q2[2]+1.5*iqr2))
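As a quick sanity check (not in the original analysis), we can count how many census blocks each fence removes:

# How many rows each filter drops
nrow(df) - nrow(eliminated)
nrow(df) - nrow(eliminated2)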


# merge() suffixes the duplicated columns with .x and .y,
# so we use the .x versions below
test_df <- merge(eliminated, eliminated2, by="census_block")
model_outliers <- lm(test_df$scooter_starts.x ~ test_df$trimet_stops.x)
summary(model_outliers)
## 
## Call:
## lm(formula = test_df$scooter_starts.x ~ test_df$trimet_stops.x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272.69 -187.11 -136.32   45.59 1238.68 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             186.450     32.716   5.699 2.42e-08 ***
## test_df$trimet_stops.x    4.958      4.244   1.168    0.243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 313.7 on 379 degrees of freedom
## Multiple R-squared:  0.003588,   Adjusted R-squared:  0.0009586 
## F-statistic: 1.365 on 1 and 379 DF,  p-value: 0.2435
ggplotRegression(model_outliers) + ggtitle("Scooter trip starts vs. Trimet stops, outliers removed") +
  xlab("Trimet Stops") + 
  ylab("Scooter Trip Starts")
## `geom_smooth()` using formula 'y ~ x'

ggsave("start_regression_no_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'

In this case, you can see that our intercept is significant, but our variable is no longer significant (p = 0.243), and the R-squared has dropped to essentially zero.

Testing with trip ends

end <- read.csv("end.csv")
trimet <- read.csv("trimet.csv")
names(end)[2] <- "census_block"
names(trimet)[1] <- "census_block"

Joining the data together using the merge() function

df <- merge(end, trimet, by = "census_block")
df <- df[-c(2)]

# The column names are a little weird, so the code
# below just fixes up the column names 

names(df)[2] <- "scooter_ends"
names(df)[3] <- "trimet_stops"
summary(df)
##        census_block  scooter_ends    trimet_stops   
##  410050202002:  1   Min.   :    1   Min.   : 1.000  
##  410050204014:  1   1st Qu.:   31   1st Qu.: 4.000  
##  410050208001:  1   Median :  147   Median : 7.000  
##  410050208002:  1   Mean   : 1332   Mean   : 8.491  
##  410050208003:  1   3rd Qu.:  613   3rd Qu.:10.000  
##  410050209001:  1   Max.   :28029   Max.   :79.000  
##  (Other)     :509

Plotting and making an initial model

ggplot(df, aes(x = trimet_stops, y = scooter_ends)) + 
  geom_point() +
  ggtitle("E-Scooter Trip Ends vs. Trimet Stops per Census Block") +
  xlab("Trimet Stops") + 
  ylab("Scooter Trip Ends")

ggsave("ends_vs_trips.png")
## Saving 7 x 5 in image

Making the linear model

We will make and plot our linear model, again using lm() and ggplotRegression().

model <- lm(df$scooter_ends ~ df$trimet_stops)
ggplotRegression(model) + ggtitle("Scooter Trip Ends vs. Trimet Stops") +
  xlab("Trimet Stops") + 
  ylab("Scooter Trips Ends")
## `geom_smooth()` using formula 'y ~ x'

ggsave("ends_regression_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
summary(model)
## 
## Call:
## lm(formula = df$scooter_ends ~ df$trimet_stops)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8581.8 -1206.4  -824.6  -391.2 26301.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       376.36     243.82   1.544    0.123    
## df$trimet_stops   112.56      21.43   5.251 2.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3682 on 513 degrees of freedom
## Multiple R-squared:  0.05101,    Adjusted R-squared:  0.04916 
## F-statistic: 27.58 on 1 and 513 DF,  p-value: 2.215e-07

Getting rid of outliers

Q <- quantile(df$scooter_ends, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$scooter_ends)
eliminated <- subset(df, df$scooter_ends > (Q[1] - 1.5*iqr) & df$scooter_ends < (Q[2]+1.5*iqr))

Q2 <- quantile(df$trimet_stops, probs=c(.25, .75), na.rm = FALSE)
iqr2 <- IQR(df$trimet_stops)
eliminated2 <- subset(df, df$trimet_stops > (Q2[1] - 1.5*iqr2) & df$trimet_stops < (Q2[2]+1.5*iqr2))


test_df <- merge(eliminated, eliminated2, by="census_block")
model_outliers <- lm(test_df$scooter_ends.x ~ test_df$trimet_stops.x)
summary(model_outliers)
## 
## Call:
## lm(formula = test_df$scooter_ends.x ~ test_df$trimet_stops.x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -270.67 -195.34 -122.19   73.42 1125.74 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             193.057     30.144   6.404 4.14e-10 ***
## test_df$trimet_stops.x    4.534      3.799   1.194    0.233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 300.2 on 410 degrees of freedom
## Multiple R-squared:  0.003463,   Adjusted R-squared:  0.001032 
## F-statistic: 1.425 on 1 and 410 DF,  p-value: 0.2333
ggplotRegression(model_outliers) + ggtitle("Scooter trip ends vs. Trimet stops") +
  xlab("Trimet Stops") + 
  ylab("Scooter Trips Ends")
## `geom_smooth()` using formula 'y ~ x'

ggsave("ends_regression_no_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'

Why this doesn’t work

As you can see, the linear regression in either case doesn’t offer a significant prediction. This is largely due to the nature of the data: real-world count data is often heavily skewed away from a normal distribution.

For linear regression to work well, the residuals (and, loosely speaking, the data) should be approximately normally distributed. The histograms below show how skewed our variables are:

hist(test_df$scooter_ends.x)

hist(test_df$trimet_stops.x)
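
A Q-Q plot of the model residuals (an extra check, not in the original analysis) makes the departure from normality even clearer:

# Q-Q plot of the residuals: points far from the reference line
# indicate non-normal residuals
qqnorm(residuals(model_outliers))
qqline(residuals(model_outliers))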

The scooter counts instead follow what is known as a Poisson distribution, which is the shape count data from the real world often takes.
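
To get a feel for that shape, here is a purely illustrative comparison: simulated Poisson counts with the same mean as the trimet stop counts.

# Illustrative only: simulate Poisson counts with the same mean as
# the trimet stop counts and compare the histogram to the real data
set.seed(1)
sim <- rpois(nrow(test_df), lambda = mean(test_df$trimet_stops.x))
hist(sim, main = "Simulated Poisson counts")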

In order to build models for non-normal distributions like these, we use a tool called Generalized Linear Models, or GLMs.

require(ggiraph)
## Loading required package: ggiraph
require(ggiraphExtra)
## Loading required package: ggiraphExtra
require(plyr)
## Loading required package: plyr
poisson_df <- glm(scooter_ends.x ~ trimet_stops.x, family="poisson", data=test_df)
summary(poisson_df)
## 
## Call:
## glm(formula = scooter_ends.x ~ trimet_stops.x, family = "poisson", 
##     data = test_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -22.674  -16.986   -9.421    4.707   51.247  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    5.2749953  0.0067714  779.00   <2e-16 ***
## trimet_stops.x 0.0195964  0.0008197   23.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 133105  on 411  degrees of freedom
## Residual deviance: 132544  on 410  degrees of freedom
## AIC: 135077
## 
## Number of Fisher Scoring iterations: 6
ggPredict(poisson_df,se=TRUE,interactive=TRUE,digits=3)

Summary Table Explained:

  1. P-value: both coefficients are statistically significant (p < 0.05), so there is a statistically significant relationship between trimet stops and scooter trip ends.

  2. Null Deviance: a low null deviance means the data can be modeled well using just the intercept. Ours is high, so theoretically more parameters (i.e. more explanatory variables) should be included in the model.

  3. Residual Deviance: a low residual deviance means the model fits the data well. Ours (132544) is far larger than the residual degrees of freedom (410), so the data is overdispersed and a plain Poisson model is a poor fit - see the sketch after this list for a quick check and two common remedies.

  4. AIC: the Akaike information criterion (AIC) is an information-theoretic measure of model quality. Lower is better; ours is high, so the model is poor.
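
Given the overdispersion, a quick diagnostic plus two standard remedies are sketched below. These are common next steps rather than part of the original analysis; quasi-Poisson and negative binomial models both relax the Poisson assumption that the variance equals the mean.

# Rough overdispersion check: for a well-specified Poisson model the
# residual deviance should be close to the residual degrees of freedom.
# Here the ratio is roughly 132544 / 410, i.e. around 320.
deviance(poisson_df) / df.residual(poisson_df)

# Remedy 1: quasi-Poisson, which estimates the dispersion parameter
# rather than fixing it at 1
quasi_model <- glm(scooter_ends.x ~ trimet_stops.x,
                   family = "quasipoisson", data = test_df)
summary(quasi_model)

# Remedy 2: negative binomial regression via MASS::glm.nb
library(MASS)
nb_model <- glm.nb(scooter_ends.x ~ trimet_stops.x, data = test_df)
summary(nb_model)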