In this project I will be analyzing MTA New York City Subway data from May 2011. The MTA data contains hourly entries and exits for turnstiles in the subway system. I have also joined in weather information gathered from Weather Underground, containing features such as temperature, barometric pressure, indicators for rain, fog, and thunder, and the total amount of precipitation, among others.
I worked with this data subset to draw an interesting conclusion about the dataset itself, with the key question being: do more people ride the subway when it is raining than when it is not raining?
Data Science Skills Applied:
library(dplyr)
library(ggplot2)
library(GGally)
library(scales)
library(gridExtra)
library(corrplot)
library(stargazer)
library(stringi)
library(caret)
library(caTools)
I have already done some data wrangling on the original dataset: I kept the columns ‘ENTRIESn_hourly’, ‘EXITSn_hourly’, and ‘hour’, altered the structure of the date to be more user friendly, and wrote the result back to a new CSV file.
I have also decided to transform some of the variables so they are easier to describe and more visually appealing.
turnstile_data <- read.csv('turnstile_weather_v2.csv')
# turn date into POSIX to make easier to work with
turnstile_data$datetime <- as.POSIXct(turnstile_data$datetime)
# Add full name of the day of the week instead of day number
turnstile_data$day_week <- factor(format(turnstile_data$datetime, format = '%A'),
levels = c("Monday","Tuesday","Wednesday","Thursday",
"Friday","Saturday","Sunday"))
# Also adding day number
turnstile_data$day_number <- as.numeric(turnstile_data$day_week)
# Adding formatted hour with levels
turnstile_data$hour_format <- factor(format(turnstile_data$datetime, format = '%I:00 %p'),
levels = c("04:00 AM", "08:00 AM","12:00 PM",
"04:00 PM", "08:00 PM", "12:00 AM"))
# Add a numeric version of hour_format, like day_number above
turnstile_data$hour_number <- as.numeric(turnstile_data$hour_format)
# Add strings to fog
turnstile_data$fog <- factor(turnstile_data$fog, levels=c(0,1), labels=c("No Fog", "Fog"))
# Add strings to rain
turnstile_data$rain <- factor(turnstile_data$rain, levels=c(0,1), labels=c("No Rain", "Rain"))
# Add strings to weekend / weekday
turnstile_data$weekday <- factor(turnstile_data$weekday, levels=c(0,1), labels=c("Weekend", "Weekday"))
# Combining date and day of week
turnstile_data$date_day_week <- factor(format(turnstile_data$datetime,format="%m-%d-%Y:%A"))
head(turnstile_data)
## UNIT DATEn TIMEn ENTRIESn EXITSn ENTRIESn_hourly EXITSn_hourly
## 1 R003 05-01-11 00:00:00 4388333 2911002 0 0
## 2 R003 05-01-11 04:00:00 4388333 2911002 0 0
## 3 R003 05-01-11 12:00:00 4388333 2911002 0 0
## 4 R003 05-01-11 16:00:00 4388333 2911002 0 0
## 5 R003 05-01-11 20:00:00 4388333 2911002 0 0
## 6 R003 05-02-11 00:00:00 4388348 2911036 15 34
## datetime hour day_week weekday station latitude
## 1 2011-05-01 00:00:00 0 Sunday Weekend CYPRESS HILLS 40.68995
## 2 2011-05-01 04:00:00 4 Sunday Weekend CYPRESS HILLS 40.68995
## 3 2011-05-01 12:00:00 12 Sunday Weekend CYPRESS HILLS 40.68995
## 4 2011-05-01 16:00:00 16 Sunday Weekend CYPRESS HILLS 40.68995
## 5 2011-05-01 20:00:00 20 Sunday Weekend CYPRESS HILLS 40.68995
## 6 2011-05-02 00:00:00 0 Monday Weekday CYPRESS HILLS 40.68995
## longitude conds fog precipi pressurei rain tempi wspdi
## 1 -73.87256 Clear No Fog 0 30.22 No Rain 55.9 3.5
## 2 -73.87256 Partly Cloudy No Fog 0 30.25 No Rain 52.0 3.5
## 3 -73.87256 Mostly Cloudy No Fog 0 30.28 No Rain 62.1 6.9
## 4 -73.87256 Mostly Cloudy No Fog 0 30.26 No Rain 57.9 15.0
## 5 -73.87256 Mostly Cloudy No Fog 0 30.28 No Rain 52.0 10.4
## 6 -73.87256 Mostly Cloudy No Fog 0 30.31 No Rain 50.0 6.9
## meanprecipi meanpressurei meantempi meanwspdi weather_lat weather_lon
## 1 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 2 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 3 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 4 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 5 0 30.25800 55.98000 7.86 40.70035 -73.88718
## 6 0 30.23833 54.16667 8.25 40.70035 -73.88718
## day_number hour_format hour_number date_day_week
## 1 7 12:00 AM 6 05-01-2011:Sunday
## 2 7 04:00 AM 1 05-01-2011:Sunday
## 3 7 12:00 PM 3 05-01-2011:Sunday
## 4 7 04:00 PM 4 05-01-2011:Sunday
## 5 7 08:00 PM 5 05-01-2011:Sunday
## 6 1 12:00 AM 6 05-02-2011:Monday
To further analyze this data and the relationships between variables, I decided to compute their correlations. Instead of just displaying a long table of numbers that is extremely hard to read, I created a more user-friendly corrplot.
# keep only the numeric columns
turn_numeric <- turnstile_data[,sapply(turnstile_data, is.numeric)]
# create correlation matrix
turnstile_data_corr_matrix <- cor(turn_numeric)
# plot: ellipses in the lower triangle, numbers in the upper triangle
corrplot(turnstile_data_corr_matrix, method = "ellipse", order="hclust", type="lower",
tl.cex=0.75, add = FALSE, tl.pos="lower")
corrplot(turnstile_data_corr_matrix, method = "number", order="hclust", type="upper",
tl.cex=0.75, add = TRUE, tl.pos="upper")
As expected, there aren’t too many correlated variables. Of course ‘meanprecipi’ is correlated with ‘precipi’, and ‘weather_lat’ with ‘latitude’, etc., but nothing else really pops out here.
Here, I group the total entries of the MTA subway by day of the week. Initially these figures make sense, as you would expect less ridership on the weekends than on the weekdays. But with a little more skepticism I looked into this, and with some common sense I realized that the data for this month of May is skewed: the days of the week do not all occur the same number of times. In fact, in this month Sunday, Monday, and Tuesday each occur 5 times, while the other days occur only 4 times, as the quick check below confirms. Another point worth making is that Monday the 30th was Memorial Day, so ridership ought to be lower that day. Tuesday having the highest total therefore makes sense, and the data fits these assumptions. The totals should be taken with a grain of salt, while the graph of the averages gives a better picture of the true distribution.
# Using dplyr's split-apply-combine functionality to total the entries for each day of the week
Day_of_Week <- turnstile_data %>%
group_by(day_week) %>%
summarise(entries = sum(ENTRIESn_hourly))
# however, the day-of-week totals are flawed since the weekdays do not occur an equal number of times in the month
Day_of_Week_Filtered <- turnstile_data %>%
group_by(date_day_week, day_week) %>%
summarise(entries = sum(ENTRIESn_hourly)) %>%
ungroup() %>%
group_by(day_week) %>%
mutate(total_days = n(),
avg_entries_per_day = sum(entries) / mean(total_days)) %>%
select(day_week, total_days, avg_entries_per_day) %>%
distinct(day_week, total_days, avg_entries_per_day)
day_total <- ggplot(Day_of_Week, aes(day_week, entries, fill = day_week)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text.x = element_text(angle = 45)) +
guides(fill=FALSE) +
scale_y_continuous(labels = comma) +
xlab("") + ylab("Total Entries") + ggtitle("Total Ridership - Day of the Week") +
geom_text(aes(day_week, 500000, label = paste0("Total \nDays: \n", total_days)), size=3 , data=Day_of_Week_Filtered)
day_avg <- ggplot(Day_of_Week_Filtered, aes(day_week, avg_entries_per_day, fill = day_week)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text.x = element_text(angle = 45)) +
guides(fill = FALSE) +
scale_y_continuous(labels=comma) +
xlab("") + ylab("Total Entries") + ggtitle("Average Ridership \nEach Day of the Week") +
geom_text(aes(x=day_week, y=100000, label = paste0("Total \nDays: \n", total_days)), size=3 , data = Day_of_Week_Filtered)
grid.arrange(day_total, day_avg, ncol=2)
### Hours of the Day

Plotting the hours of the day is a little different from what I expected. I expected the data to be bi-modal, with one peak before work and the other after work. According to the data it is bi-modal, but the peaks are at around noon and around 8 PM.
Day_of_Week_hour <- turnstile_data %>%
group_by(hour_format) %>%
summarise(total_entries = sum(ENTRIESn_hourly),
n = n(),
normalized = total_entries / n)
hour_total <- ggplot(Day_of_Week_hour, aes(x = hour_format, y = total_entries,
fill = hour_format)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle=45)) +
scale_y_continuous(labels = comma) +
scale_fill_brewer(palette = 'Set2') +
guides(fill = FALSE) +
xlab("") + ylab("Total Entries") +
ggtitle("Total Ridership for each 4-hour Time Period")
hour_normalized <- ggplot(Day_of_Week_hour, aes(x = hour_format, y = normalized,
fill = hour_format)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle=45)) +
scale_y_continuous(labels = comma) +
scale_fill_brewer(palette = 'Set2') +
guides(fill = FALSE) +
xlab("") + ylab("Total Entries") +
ggtitle("Average Ridership for each 4-hour Time Period")
grid.arrange(hour_total, hour_normalized, ncol = 2)
### Day and Hour

Combining the last two variables might show some more trends in the data. However, nothing extreme jumps out. Understandably, the normalized entries look just like the other plots, with ridership falling on the weekends relative to the weekdays.
Day_of_Week_And_Hour <- turnstile_data %>%
group_by(day_week, hour_format) %>%
summarise(total_entries = sum(ENTRIESn_hourly),
n = n(),
normalized = total_entries / n)
ggplot(Day_of_Week_And_Hour, aes(hour_format, normalized, fill = hour_format)) +
geom_bar(stat = 'identity') +
facet_grid(day_week ~ .) +
scale_y_continuous(labels = comma) +
scale_fill_brewer(palette = 'Set2') +
 theme(legend.position = 'none') +
 xlab("") + ylab("Average Entries")
### Units

I thought it would be interesting to look at the total number of entries per unit and per location. As expected, some terminals are much more popular than others. Later in this paper, a D3 plot shows this in a cooler fashion.
units <- turnstile_data %>%
group_by(UNIT) %>%
summarise(total_entries = sum(ENTRIESn_hourly))
unit_names <- as.character(unique(turnstile_data$UNIT))
# label every fourth unit so the x-axis stays readable
unit_vector <- unit_names[seq(1, length(unit_names), 4)]
ggplot(units, aes(UNIT, total_entries)) +
geom_bar(stat = 'identity', fill = 'yellow', color = 'black') +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(labels = comma) +
scale_x_discrete(breaks = unit_vector, labels = unit_vector) +
xlab("Units") + ylab("Total Entries per Unit")
### Entries Among Stations

A similar exploration for each station is shown below, displaying only the top 50 stations and their respective names.
stations <- turnstile_data %>%
group_by(station) %>%
summarise(total_entries = sum(ENTRIESn_hourly)) %>%
arrange(-total_entries)
ggplot(stations[1:50,], aes(x = reorder(station, -total_entries), total_entries)) +
geom_bar(stat = 'identity', fill = 'yellow', color = 'black') +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(labels = comma) +
xlab("Station Name") + ylab("Total Entries") +
ggtitle("Total Entries for each Station - Top 40 Stations")
### Rain Visualization

Later in the paper it is demonstrated that there is evidence of a statistically significant difference between rainy and non-rainy times. The histogram below displays the difference between ENTRIESn_hourly for rainy observations and ENTRIESn_hourly for non-rainy observations.
rain <- turnstile_data %>%
group_by(datetime, rain) %>%
summarise(total_entries = sum(ENTRIESn_hourly))
rain_by_hour <- ggplot(turnstile_data, aes(x=ENTRIESn_hourly, fill=rain, order=-as.numeric(rain))) +
geom_histogram(alpha=1, binwidth=100, color="black") +
coord_cartesian(xlim=c(0,5000), ylim=c(0, 5000)) +
guides(fill=guide_legend("Weather\nCondition")) +
scale_y_continuous(breaks=seq(0, 5000, by=1000)) +
xlab("Hourly Entries - Bins of Size 100 (Limited to 5000)") +
ylab("Hourly Entries - Frequency per Bin") +
ggtitle("Hourly Entries Histogram - Rain vs. No Rain")
rain_by_hour
### Rain Visualization Plot

Another plot for the rain. This plot is a little different, and it provides a visual cue that rain does seem to have an effect on whether people ride the subway. As shown, the blue bars indicate rain, and they are mostly high.
rain_time <- ggplot(rain, aes(x=datetime, y=total_entries, fill=rain)) +
geom_histogram(stat="identity", position = 'dodge') +
scale_x_datetime(expand=c(0,0), breaks = date_breaks("1 day"), labels = date_format("%m-%d-%y (%a)")) +
theme(axis.text.x = element_text(angle = 90)) +
coord_flip() +
guides(fill = guide_legend("Weather")) +
xlab("") + ylab("Total Entries") +
ggtitle("Cumulative Ridership per Hour\n Colored by Presence of Rain at Each Unit (Labeled by Day)")
rain_time
## Statistical Tests
To analyze this data I used the Mann-Whitney U test. A two-tailed p-value was used, as no prior assumption is made about the direction of the difference between the distributions of ridership on rainy and non-rainy days.
Hypotheses:

- Null: $\text{ENTRIESn\_hourly}_{\text{rainy}}$ and $\text{ENTRIESn\_hourly}_{\text{non-rainy}}$ have the same distribution.
- Alternative: $\text{ENTRIESn\_hourly}_{\text{rainy}}$ and $\text{ENTRIESn\_hourly}_{\text{non-rainy}}$ differ by a location shift $\mu$, with $\mu \neq 0$.

p-critical value: $\alpha = 0.05$
This test is applicable to the dataset because the Mann-Whitney U-Test tests the null hypothesis that the two samples being compared are derived from the same population. The Mann–Whitney U test is a non-parametric test that is good for testing whether a particular population tends to have larger values than the other. This null hypothesis allows us to test whether there is a statistically significant difference in ridership on rainy and non-rainy days. Furthermore, exploratory data analysis has shown that the data is not normally distributed. The Mann-Whitney U-Test does not assume normality of the data, making this test appropriate.
rainy_days <- turnstile_data$ENTRIESn_hourly[turnstile_data$rain == 'Rain']
without_rain_days <- turnstile_data$ENTRIESn_hourly[turnstile_data$rain == 'No Rain']
rainy_day_median <- median(rainy_days)
without_rain_day_median <- median(without_rain_days)
mann_whitney_test <- wilcox.test(rainy_days, without_rain_days, alternative = 'two.sided')
u_value <- mann_whitney_test$statistic
p_value <- mann_whitney_test$p.value
When reporting the results of a Mann–Whitney test, it is important to state the group medians, the U statistic, and the p-value:
df <- data.frame(rainy_day_median, without_rain_day_median, u_value, p_value)
print(df)
## rainy_day_median without_rain_day_median u_value p_value
## W 939 893 163283320 5.482139e-06
alpha = 0.05
ifelse(p_value < alpha, 'Reject Null Hypothesis', 'Fail to Reject Null Hypothesis')
## [1] "Reject Null Hypothesis"
So with $\alpha = 0.05$ and the p-value being so low, we reject the null hypothesis, meaning the distributions of entries for rainy and non-rainy observations are statistically different.

It turns out that a lot of the variance can be explained by the following factor/dummy variables.
unit_factor <- factor(turnstile_data$UNIT)
day_week_factor <- factor(turnstile_data$day_week)
hour_factor <- factor(turnstile_data$hour_format)
f1 <- model.matrix(~ unit_factor -1)
f2 <- model.matrix(~ day_week_factor -1)
f3 <- model.matrix(~ hour_factor -1)
Here we add interaction terms and continue to use model.matrix so the result can be fed to OLS, and possibly gradient descent, by constructing the predictor matrix X, with ENTRIESn_hourly as the response y (a short sketch of this follows the code below).
unit_day <- paste0(as.character(turnstile_data$UNIT), '_' , as.character(turnstile_data$day_week))
unit_hour <- paste0(as.character(turnstile_data$UNIT), '_' , as.character(turnstile_data$hour_format))
day_hour <- paste0(as.character(turnstile_data$day_week), '_' , as.character(turnstile_data$hour_format))
f4 <- model.matrix(~factor(unit_day) -1)
f5 <- model.matrix(~factor(unit_hour) -1)
f6 <- model.matrix(~factor(day_hour) -1)
dummy_features <- cbind(f1,f2,f3,f4,f5,f6)
features <- cbind(1, dummy_features)  # prepend an intercept column
$$R^2 = 1 - \frac{\sum_{i=1}^{m}(\text{observed}_i - \text{predicted}_i)^2}{\sum_{i=1}^{m}(\text{observed}_i - \overline{\text{observed}})^2}$$

Equivalently, in terms of a hypothesis $h$:

$$R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - h(x_i))^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2}$$
The R-Squared can be seen as the percent of variance explained by the model.
Adjusted $R^2$ is computed as:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$$
The adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. It increases only if a new term improves the model more than would be expected by chance. The adjusted R-squared is therefore a better measure of fit, since it penalizes adding more predictors to the model.
r_squared <- function(y, pred) {
  RSS <- sum((y - pred)^2)
  TSS <- sum((y - mean(y))^2)
  r_sq <- 1 - (RSS / TSS)
  return(r_sq)
}
adj_r_squared <- function(r_sq, num_examples, num_features) {
adj_r <- 1 - (1 - r_sq) * ((num_examples - 1) / (num_examples - num_features - 1))
return(adj_r)
}
Next I split the data into two sets, training on one and testing on the other, using 70% for training and 30% for testing (note the 0.7 split ratio below). I probably should have done this before the EDA, but it won't hurt too much.
set.seed(100)
split <- sample.split(turnstile_data$ENTRIESn_hourly, 0.7)
train_df <- subset(turnstile_data, split == TRUE)
test_df <- subset(turnstile_data, split == FALSE)
Fitting models can be complicated. I used somewhat of a personalized forward selection process and ended with model4, which has interaction terms between the dummy variables, plus a few more weather variables thrown in.
train_predictors <- train_df %>%
select(ENTRIESn_hourly, hour_format, day_week, UNIT, precipi, pressurei, tempi)
model1 <- lm(ENTRIESn_hourly ~ UNIT, data = train_predictors)
model2 <- lm(ENTRIESn_hourly ~ UNIT + hour_format, data = train_predictors)
model3 <- lm(ENTRIESn_hourly ~ UNIT + hour_format + day_week, data = train_predictors)
model4 <- lm(ENTRIESn_hourly ~ UNIT*hour_format + hour_format:day_week + UNIT:day_week + day_week + precipi + pressurei + tempi, data = train_predictors)
stargazer(model1, model2, model3, model4, title = 'Regression Results', header = FALSE, single.row = TRUE, type = "html", font.size = "tiny", digits.extra = 4, digits = 4, keep="NULL", column.labels = c("UNIT only", "add 'hour'", "add 'day of week'", "add interactions"))
Regression Results (dependent variable: ENTRIESn_hourly)

|                     | UNIT only (1)                | add 'hour' (2)                | add 'day of week' (3)         | add interactions (4)          |
|---------------------|------------------------------|-------------------------------|-------------------------------|-------------------------------|
| Observations        | 30,492                       | 30,492                        | 30,492                        | 30,492                        |
| R-squared           | 0.4090                       | 0.5420                        | 0.5681                        | 0.8874                        |
| Adjusted R-squared  | 0.4043                       | 0.5383                        | 0.5645                        | 0.8755                        |
| Residual Std. Error | 2,543.2760 (df = 30252)      | 2,239.1340 (df = 30247)       | 2,174.5210 (df = 30241)       | 1,162.8980 (df = 27579)       |
| F Statistic         | 87.5989\*\*\* (df = 239; 30252) | 146.6859\*\*\* (df = 244; 30247) | 159.1206\*\*\* (df = 250; 30241) | 74.6071\*\*\* (df = 2912; 27579) |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01
This model is definitely not very comprehensive; however, it does result in a decent R-squared and adjusted R-squared.
train_pred <- predict(model4, train_df)
train_y <- train_df$ENTRIESn_hourly
train_r_squared <- r_squared(train_y, train_pred)
train_r_squared
## [1] 0.8873567
train_adjusted_R_squared <- adj_r_squared(train_r_squared, dim(train_df)[1], length(coef(model4))-1)
train_adjusted_R_squared
## [1] 0.875463
test_pred <- predict(model4, newdata = test_df)
test_y <- test_df$ENTRIESn_hourly
test_r_squared <- r_squared(test_y, test_pred)
test_r_squared
## [1] 0.6608977
Mean Squared Error measures the average of the squares of the "errors", that is, the differences between the estimated values and the actual values:

$$MSE = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$

Taking the square root of the MSE (the RMSE) puts the error back into the units of the response.
MSE <- function(pred, actual){
return( mean( (pred-actual)^2 ) )
}
train_MSE <- MSE(train_pred, train_y)
# Root Mean Squared Error
sqrt(train_MSE)
## [1] 1105.956
test_MSE <- MSE(test_pred, test_y)
# Root Mean Squared Error
sqrt(test_MSE)
## [1] 1017.499
A useful way to further explore the model fit is to plot the residuals of the fitted model. This can be done by simply plotting the model in R, but doing it manually here is more effective. The residual plots show a roughly normal distribution, compressed close to 0 and fanning out at the upper end. They also show under-prediction at higher ridership levels.
pred <- predict(model4)
p1 <- ggplot() + geom_histogram(aes(model4$residuals), binwidth = 200) +
xlab('Residuals') + ylab('Frequencies') +
ggtitle('Histogram of Residuals - Training Set')
p2 <- ggplot() + geom_point(aes(pred, model4$residuals)) +
geom_hline(yintercept = 0, linetype = 'dotted', size = 1.2, color = 'red') +
xlab('Predictions') + ylab('Residuals') +
ggtitle('Residuals against Predictions - Training Set')
p3 <- ggplot() + geom_point(aes(pred, train_df$ENTRIESn_hourly)) +
geom_abline(slope = 1, linetype = 'dotted', color = 'red', size = 1.2) +
xlab("Predictions") + ylab("Actual Hourly Entries") +
ggtitle("Predictions against Hourly Entries - Training Set")
p4 <- ggplot() + geom_point(aes(1:nrow(train_df), model4$residuals), data = train_df) +
geom_hline(yintercept = 0, linetype = 'dotted', size = 1.2, color = 'red') +
xlab('Index') + ylab('Residuals') +
ggtitle('Residuals in Sequential Order\nPer Hour Per Unit - Training Set')
grid.arrange(p1, p2, p3, p4, ncol = 2)
### Test Residuals

You can also plot the residuals for the test data. Plotting the test residuals gives a truer picture of the predictive power of the model, since this data was never seen during training.
test_resid <- test_y - test_pred
pl1 <- ggplot() + geom_histogram(aes(test_resid), binwidth = 200) +
xlab('Residuals') + ylab('Frequencies') + xlim(c(-20000, 20000)) +
 ggtitle('Histogram of Residuals - Test Set')
pl2 <- ggplot() + geom_point(aes(test_pred, test_resid)) +
geom_hline(yintercept = 0, linetype = 'dotted', size = 1.2, color = 'red') +
xlab('Predictions') + ylab('Residuals') +
 ggtitle('Residuals against Predictions - Test Set')
pl3 <- ggplot() + geom_point(aes(test_pred, test_y)) +
geom_abline(slope = 1, linetype = 'dotted', color = 'red', size = 1.2) +
xlab("Predictions") + ylab("Actual Hourly Entries") +
ggtitle("Predictions against Hourly Entries - Training Set")
pl4 <- ggplot() + geom_point(aes(1:nrow(test_df), test_resid), data = test_df) +
geom_hline(yintercept = 0, linetype = 'dotted', size = 1.2, color = 'red') +
xlab('Index') + ylab('Residuals') +
 ggtitle('Residuals in Sequential Order\nPer Hour Per Unit - Test Set')
grid.arrange(pl1, pl2, pl3, pl4, ncol = 2)
## Summary

By fitting this model, we achieved a fairly decent R-squared and root mean squared error. However, this data should probably be examined further with different model types and machine learning algorithms, such as support vector machines or non-linear splines; decision trees could also yield a more comprehensive model (a quick sketch follows).
So the original question was: do more people ride the NYC subway when it is raining or when it is not raining?
From my analysis and interpretation of the data, I conclude from the result of the Mann–Whitney U test that the two-tailed p-value is 5.482139e-06, which indicates that the probability the distributions of $\text{ENTRIESn\_hourly}_{\text{rainy}}$ and $\text{ENTRIESn\_hourly}_{\text{non-rainy}}$ are the same is far below $\alpha$. There is a statistically significant difference between ridership on rainy and non-rainy days. Furthermore, the descriptive statistics (including 1st quartile, median, mean, and 3rd quartile, reproduced just below) for rainy days are greater than those for non-rainy days, and the same pattern shows in the histograms above. Therefore, I conclude that more people ride the NYC subway when it is raining than when it is not raining.
By using the leaflet library in R, we can use the OpenStreetMap API and JavaScript/D3 to create a fully awesome visual displaying the latitude/longitude of the area. To explain the map: weather stations are shown as purple circles and subway stations in green, where the size of a circle represents the total ridership for that station. You can also click on the markers to display the total entries and station names.
lat_lon <- turnstile_data %>%
group_by(longitude, latitude, station) %>%
summarise(entries = sum(ENTRIESn_hourly)) %>%
ungroup() %>%
arrange(-entries) %>%
mutate(station = paste0('Station-', station),
icon = paste0(station, "<br />Total Entries: ", comma(entries)))
weather_dim <- turnstile_data %>%
mutate(weather_station = factor(abs(weather_lon) + abs(weather_lat))) %>%
mutate(weather_station = as.factor(as.numeric(weather_station))) %>%
group_by(weather_lon, weather_lat, weather_station) %>%
summarise(entries = sum(ENTRIESn_hourly)) %>%
ungroup() %>%
arrange(-entries) %>%
mutate(weather_station = paste0('Weather Station-', weather_station),
icon = paste0(weather_station, "<br />Total Entries of Recorded Stations: ", comma(entries)))
library(leaflet)
col1 = colorQuantile(palette = 'YlGn', domain = NULL, n = 8)
col2 = colorQuantile(palette = 'RdPu', domain = NULL, n = 8)
m <- leaflet(lat_lon) %>%
addTiles() %>%
addCircleMarkers(data=weather_dim, lng= ~weather_lon, lat=~weather_lat,
radius = ~entries/sum(entries)*100,
color= ~col2(entries),
opacity= ~entries/sum(entries)*100,
popup= ~icon) %>%
addCircleMarkers(radius = ~entries/sum(entries)*200,
color = ~col1(entries),
opacity= ~entries/sum(entries)*200,
popup = ~icon) %>%
setView(lng= -73.92, lat=40.73, zoom = 12)
## Assuming 'longitude' and 'latitude' are longitude and latitude, respectively
m