#Include all libraries here
library(dplyr)
library(forecast)
library(splitstackshape)
library(lubridate)
library(lmtest)
library(kableExtra)
library(moderndive)
library(TTR)
#code that creates the time variables
#split Date variable on the commas
#this will create two new variables
#split Date_1 into two columns: Date_1_1, Date_1_2 (separates month and day from time)
data=cSplit(data, "Date", " ")
#Convert character variable to an R date
data$newdate=lubridate::mdy(data$Date_1)
#create time variable
data$date = as.Date(data$newdate, format="%m/%d/%Y")
data$month = as.numeric(format(data$date, "%m"))
data$year = as.numeric(format(data$date, "%Y"))
The data set is the mileage of cycling collected for 2017-2024 by Professor Pyott. The data includes the number of rides per year in addition to the mileage cycled per month and per year.
#Month totals for bar chart
month_totals = data %>%
select(Distance, month) %>%
group_by(month) %>%
summarise(total=sum(Distance),
sd=sd(Distance),
n=n())
dt= month_totals %>%
kbl(caption="Monthly Mileage Summary") %>%
kable_classic(full_width=F, html_font = "Cambria")
dt
| month | total | sd | n |
|---|---|---|---|
| 1 | 894 | 7.1 | 42 |
| 2 | 952 | 7.9 | 47 |
| 3 | 2223 | 10.4 | 81 |
| 4 | 2655 | 10.6 | 92 |
| 5 | 3504 | 16.5 | 102 |
| 6 | 3402 | 13.1 | 103 |
| 7 | 3353 | 12.8 | 109 |
| 8 | 2932 | 11.9 | 93 |
| 9 | 2194 | 13.5 | 66 |
| 10 | 1928 | 16.2 | 61 |
| 11 | 1176 | 9.2 | 50 |
| 12 | 832 | 9.6 | 37 |
#create annual totals
annual_totals = data %>%
select(Distance, year) %>%
group_by(year) %>%
summarise(total=sum(Distance),
sd=sd(Distance),
n=n())
dt= annual_totals %>%
kbl(caption="Annual Mileage Summary") %>%
kable_classic(full_width=F, html_font = "Cambria")
dt
| year | total | sd | n |
|---|---|---|---|
| 2017 | 2303 | 15 | 70 |
| 2018 | 1813 | 10 | 61 |
| 2019 | 3940 | 14 | 107 |
| 2020 | 4564 | 12 | 153 |
| 2021 | 5059 | 13 | 177 |
| 2022 | 3804 | 12 | 144 |
| 2023 | 3850 | 13 | 144 |
| 2024 | 712 | 10 | 27 |
#Bar plot, box plot, violin plot of miles per month
Month=c("Jan", "Feb", "Mar", "Apr", "May","Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
barplot(total~month, data=month_totals, names.arg=Month, col="pink")
title(main = "Monthly Average Ride Length", font.main = 4)
#graph totals data set
barplot(total~year, data=annual_totals, col="orange")
title(main = "Annual Total Miles", font.main = 4)
#create and plot time series
#create time series
timeseries_totals = data %>%
select(month, year, Distance) %>%
group_by(year, month) %>%
summarise(sum=sum(Distance), average=mean(Distance))
#Which variable to do want to model? Sum, average?
#Sum
timeseries1 = ts(timeseries_totals$sum, frequency=12, start=c(2017,4))
plot(timeseries1, xlab="Time", ylab="Sum", main="Time vs Average", col="green")
abline(reg=lm(timeseries1~time(timeseries1)),col="red")
tscomponents1=decompose(timeseries1)
plot(tscomponents1)
#Average
timeseries2 = ts(timeseries_totals$average, frequency=12, start=c(2017,4))
plot(timeseries2, xlab="Time", ylab="Average", main="Time vs Average", col="purple")
abline(reg=lm(timeseries2~time(timeseries2)),col="blue")
tscomponents2=decompose(timeseries2)
plot(tscomponents2)
Timeseries1 shows a secular trend that the sum is increasing. We see a
seasonal trend of mileage increasing in the warmer months then
decreasing during the colder months. The decompostion plot for
timeseries1 shows that there is a seasonal trend and the sum does go up
over time. There is a nonrandom increase for 2020 due to covid allowing
more time for cycling Timeseries2 shows a secular trend that average is
decreasing. We still see the same seasonal trend of mileage increasing
for warmer months and decreasing for cold months. The decomposition
shows a seasonal trend and that average decreases over time. We can see
the covid disruption allowing for more cycling.
This section will forecast the next five months for the cyclists, and include the smoothed time series plot with the forecasts on the plot.
#smooth TS
#forecast 5 months
#Holt Winters
hw2 =HoltWinters(timeseries1, optim.start=c(alpha=.1,
beta=.1, gamma=.1),
seasonal = c("additive"))
plot(hw2)
fc=forecast(hw2, h=5, level=c(95))
dt2= fc %>%
kbl(caption="Forecast for Next 5 Months") %>%
kable_classic(full_width=F, html_font = "Cambria")
dt2
| Point Forecast | Lo 95 | Hi 95 | |
|---|---|---|---|
| Mar 2024 | 302 | 53 | 550 |
| Apr 2024 | 285 | -32 | 602 |
| May 2024 | 326 | -47 | 698 |
| Jun 2024 | 336 | -85 | 757 |
| Jul 2024 | 220 | -244 | 685 |
#perform OLS on time series
##linear regression
timeseries_totals=as.data.frame(timeseries_totals)
#Add an index variable to represent time point
timeseries_totals$time= 1:nrow(timeseries_totals)
plot(timeseries_totals$time, timeseries_totals$sum, col="blue", xlab="Time", ylab="Sum", main="Sum vs Time")
abline(lm(timeseries_totals$sum~timeseries_totals$time), col="red")
model =lm(sum~time, data=timeseries_totals)
model_summary=summary(model)
model_summary
Call:
lm(formula = sum ~ time, data = timeseries_totals)
Residuals:
Min 1Q Median 3Q Max
-278.3 -149.6 -21.4 137.4 431.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 278.770 38.859 7.17 3.1e-10 ***
time 0.834 0.804 1.04 0.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 175 on 81 degrees of freedom
Multiple R-squared: 0.0131, Adjusted R-squared: 0.000928
F-statistic: 1.08 on 1 and 81 DF, p-value: 0.303
r2=model_summary$r.squared
#durbin-watson test
dwt=dwtest(model)
The linear regression model is mileage=278.77 + 0.83 * time. The coefficient of determination is 0.01. The F test is F(1, 81)=1.08`.
The DW test stat is 0.61 with a p-value of 2.13^{-14}.
Model Utility and DW test analysis: Annual cycling mileage is statistically significantly predicted by the model F(1, 81)=1.08`, p = 2.13^{-14}, with adjust R squared 0.01. The value for time, b = 0.83, significantly predicts mileage. Since the DW value is 0.61, linear regression is not appropriate since the residuals are positively correlated. A DW value of less than 2 means positive correlation and a value greater than 2 or close to 4 means a negative correlation for the residuals. A value centered at 2 means that the residuals are not correlated and linear regression will model the data well.