WEEK12

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

###LOADING THE DATA

bike<- read.csv('D:/FALL 2023/STATISTICS/datasets/bike.csv')

###TIME-SERIES

TAKING THE AGGREGATED RESPONSE VARIABLE WITH CORRESPOSNING DATE

   bike_aggregate <- bike%>%
  group_by(Date) %>%
  summarise(Sum_Rented_Bike_Count = sum(Rented.Bike.Count))

###CREATING TSIBBLE - Further creating a tsibble object to pick the date and response variable, here view in our case. This can be used to plot the data over time.

bike_time <- select(bike_aggregate, Date, Sum_Rented_Bike_Count)
bike_time$Date <- as.Date(bike_time$Date, format = "%m/%d/%Y")

# Remove rows with NA values in the Date column
bike_time <- bike_time[complete.cases(bike_time$Date), ]

bike_time_ts <- as_tsibble(bike_time, index = Date)
head(bike_time_ts)

## # A tsibble: 6 x 2 [1D]
##   Date       Sum_Rented_Bike_Count
##   <date>                     <int>
## 1 2017-01-12                  9539
## 2 2017-02-12                  8523
## 3 2017-03-12                  7222
## 4 2017-04-12                  8729
## 5 2017-05-12                  8307
## 6 2017-06-12                  6669

###PLOTTING

# Plot the time series
ggplot(bike_time_ts, aes(x = Date, y = Sum_Rented_Bike_Count)) +
  geom_line() +
  labs(title = "Rented Bike Counts Over Time",
       x = "Date",
       y = "Sum of Rented Bike Counts")

We can observe that over the time, the demand for rental bikes was increased. from the year 2018 the trend is changed drastically. Although, there was slight corrections further, the trend was constant till the end.

###TRENDS- Let us investigate the data and its trends by using linear modelling.

bike_time_ts %>%
  ggplot(mapping = aes(x = Date, y = Sum_Rented_Bike_Count)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "NUmber of bikes rented over years",
       x = 'YEAR',
       y = 'BIKE COUNT')

## `geom_smooth()` using formula = 'y ~ x'

we can observe that there is a up trend which is linear exists. From 2017 - 2019, this trend looks pretty consistent over time, however, from 2018 the spike appears to be more rapid than the previous years.

Do you need to subset the data for multiple trends? We can divide the data set into 2 parts i.e.before 2018 and after 2018, where we can see certain constant check on exercise page bike count.

BEFORE2018 <- bike_time_ts %>%
          filter(Date < as.Date('2018/01/01'))
    AFTER2018 <- bike_time_ts %>%
  filter(Date >= as.Date('2018/01/01'))

ggplot() +
  geom_line(mapping = aes(x = BEFORE2018$Date, y = BEFORE2018$Sum_Rented_Bike_Count)) +
  geom_smooth(mapping = aes(x = BEFORE2018$Date, y = BEFORE2018$Sum_Rented_Bike_Count),method = "lm", se = FALSE) +
  geom_line(mapping = aes(x = AFTER2018$Date, y = AFTER2018$Sum_Rented_Bike_Count)) +
  geom_smooth(mapping = aes(x = AFTER2018$Date, y = AFTER2018$Sum_Rented_Bike_Count),method = "lm", se = FALSE) +
  labs(title = "Wikipedia page view for time period 2008 - 2016",
       x = 'years',
       y = 'Views/Curiosity for Exercise ')

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

- How strong are these trends? By separating the trends at the 2018, we can see that there may be different levels of trend for pre 2018 and post. This may be because the people realized the advantages of renting a bike rather than owning it. Like drastic surge is seen post 2018.

Smoothing

LO(W)ESS

bike_time_ts %>%
  ggplot(mapping = aes(x=Date, y = Sum_Rented_Bike_Count )) +
  geom_point(size = 1, shape = 'o') +
  geom_smooth(span = 0.4, se = FALSE) +
  labs(title = "bike rented count over time",
       subtitle = "loess smoothing - span = 0.4",
       x = 'Date',
       y = 'Bike Count')

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The overall result is a scatter plot of bike rented count over time with a smoothed line, providing a visual representation of trends or patterns in the data. The loess smoothing helps to reveal the underlying trends in the time series ###SEASONS

bike_time_ts|>
  index_by(month = floor_date(Date, 'quarter')) |>
  summarise(avg_sum= mean(Sum_Rented_Bike_Count, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = month, y = avg_sum)) + geom_line(color="blue",size=1.5) +geom_smooth(span = 0.3, color = 'red', se=FALSE, size=1.5) +labs(title = 'Average Number of Rented bike count Over Time', subtitle = "by quarter year",y="Average bike count",x="Year") + scale_x_date(breaks = "1 year", labels = \(x) year(x)) + theme_bw()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : span too small.  fewer data values than degrees of freedom.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 17164

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 93.19

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 9061.1

- From above plot not much with respect to season can be figured out over the quarter of the year. - from 2017, the rental count seem to not have particular season on page view for exercise, i.e. it constant and dropped before reaching 2018.Further from year 2018 the count increased drastically and would go reach the saturation by the end end of the time period.

# Plot ACF and PACF to illustrate seasonality
my_acf <- acf(bike_time_ts$Sum_Rented_Bike_Count, lag.max = 30)

my_pacf <- pacf(bike_time_ts$Sum_Rented_Bike_Count, lag.max = 30)

# Plot ACF
autoplot(my_acf) +
  labs(title = "Autocorrelation Function (ACF)",
       x = "Lag",
       y = "ACF")

# Plot PACF
autoplot(my_pacf) +
  labs(title = "Partial Autocorrelation Function (PACF)",
       x = "Lag",
  y="PACF")

ACF Interpretation: There is a positive correlation between the two variables. This means that as the date increases, the rented bike count increases. This could be due to a number of factors, such as economic growth, environment consious, or purchasing power. The positive correlation is strongest at lag 1. This means that the rented bike count is most positively correlated with the rented bike count of the previous period. This suggests that the count is a non-stationary time series, meaning that its mean and variance change over time.

PACF Interpretation: The Partial Autocorrelation Function (PACF) plot for time series between date and rental bike count shows that there is a significant positive correlation between the count of rental bikes and the rental bikes count of the previous period (lag 1). This indicates that the rental bike count is a non-stationary time series, meaning that its mean and variance change over time. The positive correlation gradually weakens as the lag increases, suggesting that the rental bike count is also influenced by long-term trends and otherfactors.

WEEK12

roshan

2023-11-13

R Markdown

Smoothing

LO(W)ESS