This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
###LOADING THE DATA
bike<- read.csv('D:/FALL 2023/STATISTICS/datasets/bike.csv')
###TIME-SERIES
TAKING THE AGGREGATED RESPONSE VARIABLE WITH CORRESPOSNING DATE
bike_aggregate <- bike%>%
group_by(Date) %>%
summarise(Sum_Rented_Bike_Count = sum(Rented.Bike.Count))
###CREATING TSIBBLE - Further creating a tsibble object to pick the date and response variable, here view in our case. This can be used to plot the data over time.
bike_time <- select(bike_aggregate, Date, Sum_Rented_Bike_Count)
bike_time$Date <- as.Date(bike_time$Date, format = "%m/%d/%Y")
# Remove rows with NA values in the Date column
bike_time <- bike_time[complete.cases(bike_time$Date), ]
bike_time_ts <- as_tsibble(bike_time, index = Date)
head(bike_time_ts)
## # A tsibble: 6 x 2 [1D]
## Date Sum_Rented_Bike_Count
## <date> <int>
## 1 2017-01-12 9539
## 2 2017-02-12 8523
## 3 2017-03-12 7222
## 4 2017-04-12 8729
## 5 2017-05-12 8307
## 6 2017-06-12 6669
###PLOTTING
# Plot the time series
ggplot(bike_time_ts, aes(x = Date, y = Sum_Rented_Bike_Count)) +
geom_line() +
labs(title = "Rented Bike Counts Over Time",
x = "Date",
y = "Sum of Rented Bike Counts")
We can observe that over the time, the demand for rental bikes was
increased. from the year 2018 the trend is changed drastically.
Although, there was slight corrections further, the trend was constant
till the end.
###TRENDS- Let us investigate the data and its trends by using linear modelling.
bike_time_ts %>%
ggplot(mapping = aes(x = Date, y = Sum_Rented_Bike_Count)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "NUmber of bikes rented over years",
x = 'YEAR',
y = 'BIKE COUNT')
## `geom_smooth()` using formula = 'y ~ x'
we can observe that there is a up trend which is linear exists. From
2017 - 2019, this trend looks pretty consistent over time, however, from
2018 the spike appears to be more rapid than the previous years.
Do you need to subset the data for multiple trends? We can divide the data set into 2 parts i.e.before 2018 and after 2018, where we can see certain constant check on exercise page bike count.
BEFORE2018 <- bike_time_ts %>%
filter(Date < as.Date('2018/01/01'))
AFTER2018 <- bike_time_ts %>%
filter(Date >= as.Date('2018/01/01'))
ggplot() +
geom_line(mapping = aes(x = BEFORE2018$Date, y = BEFORE2018$Sum_Rented_Bike_Count)) +
geom_smooth(mapping = aes(x = BEFORE2018$Date, y = BEFORE2018$Sum_Rented_Bike_Count),method = "lm", se = FALSE) +
geom_line(mapping = aes(x = AFTER2018$Date, y = AFTER2018$Sum_Rented_Bike_Count)) +
geom_smooth(mapping = aes(x = AFTER2018$Date, y = AFTER2018$Sum_Rented_Bike_Count),method = "lm", se = FALSE) +
labs(title = "Wikipedia page view for time period 2008 - 2016",
x = 'years',
y = 'Views/Curiosity for Exercise ')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
- How strong are these trends? By separating the trends at the 2018, we
can see that there may be different levels of trend for pre 2018 and
post. This may be because the people realized the advantages of renting
a bike rather than owning it. Like drastic surge is seen post 2018.
bike_time_ts %>%
ggplot(mapping = aes(x=Date, y = Sum_Rented_Bike_Count )) +
geom_point(size = 1, shape = 'o') +
geom_smooth(span = 0.4, se = FALSE) +
labs(title = "bike rented count over time",
subtitle = "loess smoothing - span = 0.4",
x = 'Date',
y = 'Bike Count')
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The overall result is a scatter plot of bike rented count over time with
a smoothed line, providing a visual representation of trends or patterns
in the data. The loess smoothing helps to reveal the underlying trends
in the time series ###SEASONS
bike_time_ts|>
index_by(month = floor_date(Date, 'quarter')) |>
summarise(avg_sum= mean(Sum_Rented_Bike_Count, na.rm = TRUE)) |>
ggplot(mapping = aes(x = month, y = avg_sum)) + geom_line(color="blue",size=1.5) +geom_smooth(span = 0.3, color = 'red', se=FALSE, size=1.5) +labs(title = 'Average Number of Rented bike count Over Time', subtitle = "by quarter year",y="Average bike count",x="Year") + scale_x_date(breaks = "1 year", labels = \(x) year(x)) + theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 17164
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 93.19
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 9061.1
- From above plot not much with respect to season can be figured out
over the quarter of the year. - from 2017, the rental count seem to not
have particular season on page view for exercise, i.e. it constant and
dropped before reaching 2018.Further from year 2018 the count increased
drastically and would go reach the saturation by the end end of the time
period.
# Plot ACF and PACF to illustrate seasonality
my_acf <- acf(bike_time_ts$Sum_Rented_Bike_Count, lag.max = 30)
my_pacf <- pacf(bike_time_ts$Sum_Rented_Bike_Count, lag.max = 30)
# Plot ACF
autoplot(my_acf) +
labs(title = "Autocorrelation Function (ACF)",
x = "Lag",
y = "ACF")
# Plot PACF
autoplot(my_pacf) +
labs(title = "Partial Autocorrelation Function (PACF)",
x = "Lag",
y="PACF")
ACF Interpretation: There is a positive correlation between the two
variables. This means that as the date increases, the rented bike count
increases. This could be due to a number of factors, such as economic
growth, environment consious, or purchasing power. The positive
correlation is strongest at lag 1. This means that the rented bike count
is most positively correlated with the rented bike count of the previous
period. This suggests that the count is a non-stationary time series,
meaning that its mean and variance change over time.
PACF Interpretation: The Partial Autocorrelation Function (PACF) plot for time series between date and rental bike count shows that there is a significant positive correlation between the count of rental bikes and the rental bikes count of the previous period (lag 1). This indicates that the rental bike count is a non-stationary time series, meaning that its mean and variance change over time. The positive correlation gradually weakens as the lag increases, suggesting that the rental bike count is also influenced by long-term trends and otherfactors.