First we will read in the data from “start.csv” and “trimet.csv”, rename the shared census-block column in each, and merge the two on it.
library(ggplot2)
start <- read.csv("start.csv")
trimet <- read.csv("trimet.csv")
# Give the shared census-block column the same name in both
# data frames, then join on it
names(start)[2] <- "census_block"
names(trimet)[1] <- "census_block"
df <- merge(start, trimet, by = "census_block")
df <- df[-c(2)] # drop the second column, which we don't need
# The column names are a little weird, so the code
# below just fixes up the column names
names(df)[2] <- "scooter_starts"
names(df)[3] <- "trimet_stops"
summary(df)
## census_block scooter_starts trimet_stops
## 410050208001: 1 Min. : 1.00 Min. : 1.000
## 410050208002: 1 1st Qu.: 35.75 1st Qu.: 4.000
## 410050208003: 1 Median : 114.50 Median : 7.000
## 410050209001: 1 Mean : 1458.42 Mean : 8.436
## 410050209002: 1 3rd Qu.: 608.00 3rd Qu.:10.000
## 410050209003: 1 Max. :53961.00 Max. :79.000
## (Other) :466
To start, we will plot the data with R's base plot() function.
# Plot scooter starts (y axis) as a function of trimet stops (x axis)
plot(scooter_starts ~ trimet_stops, df)
From here on I'll use ggplot2 to make prettier graphs, so anywhere you see a ggplot() call, it's just me plotting.
ggplot(df, aes(x = trimet_stops, y = scooter_starts)) +
geom_point() +
ggtitle("E-Scooter Trip Starts vs. Trimet Stops per Census Block") +
xlab("Trimet Stops") +
ylab("Scooter Trip Starts")
ggsave("start_vs_trips.png")
## Saving 7 x 5 in image
Next we fit a linear model and plot it with a ggplotRegression() helper.
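Note that ggplotRegression() is never defined in this document. A minimal, hypothetical sketch of such a helper, assuming it simply scatterplots the model's data and overlays the fitted line with geom_smooth() (which would explain the `geom_smooth()` messages in the output below), might look like this; the author's actual helper may differ:
# Hypothetical sketch of the undefined ggplotRegression() helper:
# scatterplot of the model's data plus its fitted regression line
ggplotRegression <- function(fit) {
  ggplot(fit$model, aes(x = .data[[names(fit$model)[2]]],
                        y = .data[[names(fit$model)[1]]])) +
    geom_point() +
    geom_smooth(method = "lm")
}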
model <- lm(df$scooter_starts ~ df$trimet_stops)
ggplotRegression(model) + ggtitle("Scooter Trip Starts vs. Trimet Stops") +
xlab("Trimet Stops") +
ylab("Scooter Trip Starts")
## `geom_smooth()` using formula 'y ~ x'
ggsave("start_regression_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
summary(model)
##
## Call:
## lm(formula = df$scooter_starts ~ df$trimet_stops)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12480 -1347 -834 -280 49806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.17 344.13 0.181 0.857
## df$trimet_stops 165.50 30.09 5.501 6.21e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5049 on 470 degrees of freedom
## Multiple R-squared: 0.06049, Adjusted R-squared: 0.05849
## F-statistic: 30.26 on 1 and 470 DF, p-value: 6.211e-08
As you can see from the summary, while the p-value for the trimet_stops coefficient is significant, the p-value for the intercept (0.857) is not less than 0.05, and the R-squared of about 0.06 means the model explains very little of the variance, so it has little predictive power. Even looking at the plot, you can tell that the very large values have a large amount of sway over the linear fit, so we can remove the outliers to see if that has any positive effect. To do this, we keep only the observations inside the usual 1.5 × IQR fences, i.e. between Q1 − 1.5·IQR and Q3 + 1.5·IQR for each variable.
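As a worked example using the quartiles from the summary above: for scooter_starts, Q1 = 35.75 and Q3 = 608, so IQR = 572.25 and the fences are 35.75 − 1.5 × 572.25 = −822.6 and 608 + 1.5 × 572.25 = 1466.4. Since counts can't be negative, only the upper fence matters here: census blocks with more than about 1466 scooter starts get dropped.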
# Quartiles and IQR of scooter starts
Q <- quantile(df$scooter_starts, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$scooter_starts)
# Keep only rows inside the 1.5 * IQR fences for scooter starts
eliminated <- subset(df, df$scooter_starts > (Q[1] - 1.5*iqr) & df$scooter_starts < (Q[2]+1.5*iqr))
# Repeat for trimet stops
Q2 <- quantile(df$trimet_stops, probs=c(.25, .75), na.rm = FALSE)
iqr2 <- IQR(df$trimet_stops)
eliminated2 <- subset(df, df$trimet_stops > (Q2[1] - 1.5*iqr2) & df$trimet_stops < (Q2[2]+1.5*iqr2))
# Keep the census blocks that survive both filters; merging the two
# subsets duplicates the shared columns as .x and .y
test_df <- merge(eliminated, eliminated2, by="census_block")
model_outliers <- lm(test_df$scooter_starts.x ~ test_df$trimet_stops.x)
summary(model_outliers)
##
## Call:
## lm(formula = test_df$scooter_starts.x ~ test_df$trimet_stops.x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -272.69 -187.11 -136.32 45.59 1238.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 186.450 32.716 5.699 2.42e-08 ***
## test_df$trimet_stops.x 4.958 4.244 1.168 0.243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 313.7 on 379 degrees of freedom
## Multiple R-squared: 0.003588, Adjusted R-squared: 0.0009586
## F-statistic: 1.365 on 1 and 379 DF, p-value: 0.2435
ggplotRegression(model_outliers) + ggtitle("Scooter Trip Starts vs. Trimet Stops, Outliers Removed") +
xlab("Trimet Stops") +
ylab("Scooter Trip Starts")
## `geom_smooth()` using formula 'y ~ x'
ggsave("start_regression_no_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
In this case, you can see that our intercept is significant, but our predictor is no longer significant. Next we repeat the same analysis for scooter trip ends, reading in “end.csv”.
end <- read.csv("end.csv")
trimet <- read.csv("trimet.csv")
names(end)[2] <- "census_block"
names(trimet)[1] <- "census_block"
df <- merge(end, trimet, by = "census_block")
df <- df[-c(2)] # drop the second column, which we don't need
# The column names are a little weird, so the code
# below just fixes up the column names
names(df)[2] <- "scooter_ends"
names(df)[3] <- "trimet_stops"
summary(df)
## census_block scooter_ends trimet_stops
## 410050202002: 1 Min. : 1 Min. : 1.000
## 410050204014: 1 1st Qu.: 31 1st Qu.: 4.000
## 410050208001: 1 Median : 147 Median : 7.000
## 410050208002: 1 Mean : 1332 Mean : 8.491
## 410050208003: 1 3rd Qu.: 613 3rd Qu.:10.000
## 410050209001: 1 Max. :28029 Max. :79.000
## (Other) :509
ggplot(df, aes(x = trimet_stops, y = scooter_ends)) +
geom_point() +
ggtitle("E-Scooter Trip Ends vs. Trimet Stops per Census Block") +
xlab("Trimet Stops") +
ylab("Scooter Trip Ends")
ggsave("ends_vs_trips.png")
## Saving 7 x 5 in image
Again we fit a linear model, this time for trip ends, and plot it with the same ggplotRegression() helper.
model <- lm(df$scooter_ends ~ df$trimet_stops)
ggplotRegression(model) + ggtitle("Scooter Trip Ends vs. Trimet Stops") +
xlab("Trimet Stops") +
ylab("Scooter Trips Ends")
## `geom_smooth()` using formula 'y ~ x'
ggsave("ends_regression_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
summary(model)
##
## Call:
## lm(formula = df$scooter_ends ~ df$trimet_stops)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8581.8 -1206.4 -824.6 -391.2 26301.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 376.36 243.82 1.544 0.123
## df$trimet_stops 112.56 21.43 5.251 2.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3682 on 513 degrees of freedom
## Multiple R-squared: 0.05101, Adjusted R-squared: 0.04916
## F-statistic: 27.58 on 1 and 513 DF, p-value: 2.215e-07
# Same 1.5 * IQR outlier removal as before, now for scooter ends
Q <- quantile(df$scooter_ends, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$scooter_ends)
eliminated <- subset(df, df$scooter_ends > (Q[1] - 1.5*iqr) & df$scooter_ends < (Q[2]+1.5*iqr))
Q2 <- quantile(df$trimet_stops, probs=c(.25, .75), na.rm = FALSE)
iqr2 <- IQR(df$trimet_stops)
eliminated2 <- subset(df, df$trimet_stops > (Q2[1] - 1.5*iqr2) & df$trimet_stops < (Q2[2]+1.5*iqr2))
test_df <- merge(eliminated, eliminated2, by="census_block")
model_outliers <- lm(test_df$scooter_ends.x ~ test_df$trimet_stops.x)
summary(model_outliers)
##
## Call:
## lm(formula = test_df$scooter_ends.x ~ test_df$trimet_stops.x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -270.67 -195.34 -122.19 73.42 1125.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 193.057 30.144 6.404 4.14e-10 ***
## test_df$trimet_stops.x 4.534 3.799 1.194 0.233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 300.2 on 410 degrees of freedom
## Multiple R-squared: 0.003463, Adjusted R-squared: 0.001032
## F-statistic: 1.425 on 1 and 410 DF, p-value: 0.2333
ggplotRegression(model_outliers) + ggtitle("Scooter Trip Ends vs. Trimet Stops, Outliers Removed") +
xlab("Trimet Stops") +
ylab("Scooter Trip Ends")
## `geom_smooth()` using formula 'y ~ x'
ggsave("ends_regression_no_outliers.png")
## Saving 7 x 5 in image
## `geom_smooth()` using formula 'y ~ x'
As you can see, the linear regression in either case doesn't offer a significant prediction. This is largely due to the nature of the data: real-world count data is often heavily skewed away from a normal distribution. You can see this in the summaries above, where the means are far larger than the medians, and in the histograms below.
# Both variables are heavily right-skewed
hist(test_df$scooter_ends.x)
hist(test_df$trimet_stops.x)
To model non-normal response distributions like these, we can use generalized linear models (GLMs); for count data, a Poisson GLM is the natural starting point.
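Briefly, a Poisson GLM models the log of the expected count as a linear function of the predictor, log(E[scooter_ends]) = β0 + β1 × trimet_stops, so each additional Trimet stop multiplies the expected count by exp(β1).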
require(ggiraph)
## Loading required package: ggiraph
require(ggiraphExtra)
## Loading required package: ggiraphExtra
require(plyr)
## Loading required package: plyr
# Fit a Poisson GLM of scooter ends on Trimet stops
poisson_df <- glm(scooter_ends.x ~ trimet_stops.x, family="poisson", data=test_df)
summary(poisson_df)
##
## Call:
## glm(formula = scooter_ends.x ~ trimet_stops.x, family = "poisson",
## data = test_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -22.674 -16.986 -9.421 4.707 51.247
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.2749953 0.0067714 779.00 <2e-16 ***
## trimet_stops.x 0.0195964 0.0008197 23.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 133105 on 411 degrees of freedom
## Residual deviance: 132544 on 410 degrees of freedom
## AIC: 135077
##
## Number of Fisher Scoring iterations: 6
ggPredict(poisson_df,se=TRUE,interactive=TRUE,digits=3)
P-value: both coefficients are statistically significant (p < 0.05). The slope estimate implies that each additional Trimet stop multiplies the expected number of scooter ends by exp(0.0196) ≈ 1.02, i.e. about a 2% increase per stop.
Null Deviance: a low null deviance would mean the data can be modeled well using just the intercept. Since ours is high (133105), the model would theoretically benefit from more parameters (i.e., more variables).
Residual Deviance: if the Poisson assumption held, the residual deviance would be close to its degrees of freedom. Ours (132544 on 410 degrees of freedom) is vastly larger, so the data is overdispersed and the model is a poor fit, as the check below confirms.
AIC: the Akaike information criterion (AIC) is an information-theoretic measure of model quality; lower is better. Ours is high (135077), again pointing to a poor model.
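As a quick numeric check on the overdispersion claim, here is a small sketch (assuming the poisson_df fit from above): the ratio of the residual deviance, or of the Pearson chi-square statistic, to the residual degrees of freedom should be near 1 for a well-specified Poisson model, and values far above 1 indicate overdispersion.
# Dispersion checks: ratios near 1 are consistent with the Poisson
# assumption; values far above 1 indicate overdispersion
deviance(poisson_df) / df.residual(poisson_df)
sum(residuals(poisson_df, type = "pearson")^2) / df.residual(poisson_df)

# One common hedge is a quasi-Poisson refit, which inflates the
# standard errors by the estimated dispersion parameter
quasi_df <- glm(scooter_ends.x ~ trimet_stops.x, family = quasipoisson, data = test_df)
summary(quasi_df)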