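# Load the packages used by the plots and data manipulation below
library(ggplot2)
library(dplyr)

# Load the UCI Bike Sharing Dataset (hourly file)
bike_sharing_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")

# Display the first few rows of the dataset
knitr::kable(head(bike_sharing_data))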
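| instant | dteday     | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp  | hum  | windspeed | casual | registered | cnt |
|--------:|:-----------|-------:|---:|-----:|---:|--------:|--------:|-----------:|-----------:|-----:|-------:|-----:|----------:|-------:|-----------:|----:|
|       1 | 2011-01-01 |      1 |  0 |    1 |  0 |       0 |       6 |          0 |          1 | 0.24 | 0.2879 | 0.81 |    0.0000 |      3 |         13 |  16 |
|       2 | 2011-01-01 |      1 |  0 |    1 |  1 |       0 |       6 |          0 |          1 | 0.22 | 0.2727 | 0.80 |    0.0000 |      8 |         32 |  40 |
|       3 | 2011-01-01 |      1 |  0 |    1 |  2 |       0 |       6 |          0 |          1 | 0.22 | 0.2727 | 0.80 |    0.0000 |      5 |         27 |  32 |
|       4 | 2011-01-01 |      1 |  0 |    1 |  3 |       0 |       6 |          0 |          1 | 0.24 | 0.2879 | 0.75 |    0.0000 |      3 |         10 |  13 |
|       5 | 2011-01-01 |      1 |  0 |    1 |  4 |       0 |       6 |          0 |          1 | 0.24 | 0.2879 | 0.75 |    0.0000 |      0 |          1 |   1 |
|       6 | 2011-01-01 |      1 |  0 |    1 |  5 |       0 |       6 |          0 |          2 | 0.24 | 0.2576 | 0.75 |    0.0896 |      0 |          1 |   1 |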

Summary

Dataset Description

The UCI Bike Sharing Dataset contains hourly and daily counts of rental bikes for 2011 and 2012 in the Capital Bikeshare system (Washington, D.C.), together with the corresponding weather and seasonal information. This analysis uses the hourly file (hour.csv).

Documentation Link: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset

Main Question and Goal

Main Question: How do various weather factors influence the number of bike rentals in the Capital Bikeshare system?

Goal: To analyze the relationship between weather conditions (such as temperature, humidity, and windspeed) and bike rental counts. This will help in understanding the patterns and factors that drive bike-sharing usage, enabling better resource allocation and system optimization.

Interesting Aspects for Further Investigation

1) Relationship between bike rentals, time of day, and temperature

Visualization: Heatmap of Bike Rentals by Hour and Temperature

# 'hr' is the hour of the day (0-23). Weighting each observation by 'cnt'
# makes the fill show total rentals rather than the number of rows per bin.
ggplot(bike_sharing_data, aes(x = temp, y = factor(hr), weight = cnt)) +
  geom_bin2d(bins = 30) +
  scale_fill_gradient(low = "lightyellow", high = "red") +
  labs(title = "Heatmap of Bike Rentals by Temperature and Hour",
       x = "Normalized Temperature",
       y = "Hour of Day",
       fill = "Total Rentals") +
  theme_minimal()

Explanation: Identifying peak hours and their corresponding temperatures can help in optimizing bike distribution and availability.

Insights: Bike rentals peak during the morning and evening commute hours (7-9 AM and 5-7 PM) and are highest at moderate normalized temperatures (0.4-0.7), with far fewer rentals at cold or very hot temperatures.
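
The commute-hour peak can also be checked numerically instead of being read off the heatmap. The following is a minimal sketch, assuming the hr and cnt columns of hour.csv; the helper name mean_rentals_by_hour is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: average rentals for each hour of the day,
# sorted so that the busiest hours come first.
mean_rentals_by_hour <- function(data) {
  hourly <- aggregate(cnt ~ hr, data = data, FUN = mean)
  names(hourly)[names(hourly) == "cnt"] <- "mean_cnt"
  hourly[order(-hourly$mean_cnt), ]
}

# Runs only if the dataset has been loaded as in the first code chunk
if (exists("bike_sharing_data")) {
  print(head(mean_rentals_by_hour(bike_sharing_data), 5))
}
```

If the insight holds, the top rows should be hours in the 7-9 AM and 5-7 PM ranges.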

2) Correlations among the numeric weather variables and rental counts

Visualization: Correlation Heat Matrix

# Select numeric variables
numeric_vars <- bike_sharing_data %>%
  select(temp, atemp, hum, windspeed, cnt)

# Compute correlation matrix
cor_matrix <- cor(numeric_vars)

# Convert to long format for ggplot
cor_long <- as.data.frame(as.table(cor_matrix))

# Heatmap
ggplot(cor_long, aes(Var1, Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1,1), space = "Lab",
                       name="Pearson\nCorrelation") +
  theme_minimal() +
  labs(title = "Correlation Matrix Heatmap") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 12, hjust = 1))

Explanation: Understanding correlations helps in feature selection for modeling and identifying potential multicollinearity issues.

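Insight: Temperature has a positive correlation with bike rentals and is the strongest weather-related driver in the matrix. The ~0.63 figure should be checked against cor_matrix["temp", "cnt"], because that value is typical of the daily file (day.csv), while the hourly data loaded here generally yields a weaker coefficient (around 0.4). Humidity shows a weak negative correlation with rentals and windspeed a negligible one.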
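
To make the multicollinearity check concrete, the strongly correlated pairs can be pulled out of the matrix directly. This is a minimal sketch, assuming a named correlation matrix such as cor_matrix from the code above; the helper name strong_pairs and the 0.9 threshold are illustrative assumptions, not values from the original analysis.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# exceeds a threshold. Only the upper triangle is scanned, so each pair
# appears once and the diagonal is skipped.
strong_pairs <- function(cor_matrix, threshold = 0.9) {
  idx <- which(upper.tri(cor_matrix) & abs(cor_matrix) > threshold,
               arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_matrix)[idx[, "row"]],
    var2 = colnames(cor_matrix)[idx[, "col"]],
    r    = cor_matrix[idx],
    stringsAsFactors = FALSE
  )
}

# Runs only if the correlation matrix has already been computed
if (exists("cor_matrix")) {
  print(strong_pairs(cor_matrix))
}
```

Any pair returned here (temp and atemp are likely candidates, since both measure temperature) would normally not be entered together as predictors in a linear model.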

Plan Moving Forward

  1. Data Cleaning and Preparation
     1. Handle missing values, if any.
     2. Create new relevant variables if necessary (e.g., temperature difference).
  2. Exploratory Data Analysis (EDA)
     1. Perform descriptive statistics.
     2. Identify and visualize relationships between variables.
  3. Hypothesis Formulation: Develop hypotheses based on initial observations.
  4. Statistical Testing: Test the formulated hypotheses using appropriate statistical methods.
  5. Model Building: Develop predictive models to forecast bike rentals based on weather factors.
  6. Documentation and Reporting: Document findings and prepare reports for stakeholders.
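
The data cleaning and preparation step of the plan could start along the following lines. This is a sketch only: the helper names missing_summary and add_temp_diff are hypothetical, and "temperature difference" is assumed here to mean the gap between temp and atemp (the second temperature column in the dataset), both of which are stored as normalized values.

```r
# Hypothetical helper: number of missing values in each column.
missing_summary <- function(data) {
  colSums(is.na(data))
}

# Hypothetical helper: add the difference between the two normalized
# temperature columns as a new variable (assumed meaning of
# "temperature difference" in the plan).
add_temp_diff <- function(data) {
  data$temp_diff <- data$temp - data$atemp
  data
}

# Runs only if the dataset has been loaded as in the first code chunk
if (exists("bike_sharing_data")) {
  print(missing_summary(bike_sharing_data))
  prepared <- add_temp_diff(bike_sharing_data)
  print(summary(prepared[, c("temp", "atemp", "temp_diff", "cnt")]))
}
```

Because temp and atemp are normalized separately, temp_diff is in normalized units rather than degrees and should be interpreted with that in mind.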

Initial Findings

Hypotheses

1) Hypothesis 1: Temperature is positively correlated with bike rentals.

Visualization for Hypothesis 1

Total Bike Rentals vs Temperature

# Scatter plot for cnt vs temp
ggplot(bike_sharing_data, aes(x = temp, y = cnt)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Hypothesis 1: Total Bike Rentals vs Temperature",
       subtitle = "Expecting a positive correlation",
       x = "Normalized Temperature",
       y = "Total Bike Rentals") +
  theme_minimal()

Insight: The scatter plot and its upward-sloping fitted line indicate a positive correlation between normalized temperature and total bike rentals, which is consistent with Hypothesis 1.
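
A scatter plot suggests the direction of the relationship, while a correlation test puts a number on it. The following is a minimal sketch using base R's cor.test with the Pearson method; the helper name test_correlation is hypothetical, and the choice of test is an assumption rather than the method fixed by the plan.

```r
# Hypothetical helper: Pearson correlation test between one predictor
# column and the rental count, returning the estimate and p-value.
test_correlation <- function(data, predictor, response = "cnt") {
  result <- cor.test(data[[predictor]], data[[response]], method = "pearson")
  c(estimate = unname(result$estimate), p_value = result$p.value)
}

# Runs only if the dataset has been loaded as in the first code chunk
if (exists("bike_sharing_data")) {
  print(test_correlation(bike_sharing_data, "temp"))
}
```

Consecutive hourly observations are not independent, so the p-value is likely optimistic; the estimate itself is the more reliable output here.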

2) Hypothesis 2: Humidity is negatively correlated with bike rentals.

Visualization for Hypothesis 2

Total Bike Rentals vs Humidity

# Scatter plot for cnt vs hum
ggplot(bike_sharing_data, aes(x = hum, y = cnt)) +
  geom_point(alpha = 0.6, color = "green") +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "Hypothesis 2: Total Bike Rentals vs Humidity",
       subtitle = "Expecting a negative correlation",
       x = "Humidity",
       y = "Total Bike Rentals") +
  theme_minimal()

Insight: The downward-sloping fitted line in the plot above indicates a negative correlation between humidity and total bike rentals, which is consistent with Hypothesis 2.
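
Each scatter plot looks at one weather variable at a time. One way to check whether the humidity effect persists once temperature and windspeed are accounted for is a multiple linear regression. This is a sketch under stated assumptions: the helper name fit_weather_model is hypothetical, and an ordinary least squares model is used purely for illustration. Hourly count data with strong time-of-day patterns would likely call for a richer model.

```r
# Hypothetical helper: linear model of rental count on the three
# weather variables named in the analysis goal.
fit_weather_model <- function(data) {
  lm(cnt ~ temp + hum + windspeed, data = data)
}

# Runs only if the dataset has been loaded as in the first code chunk
if (exists("bike_sharing_data")) {
  weather_fit <- fit_weather_model(bike_sharing_data)
  print(summary(weather_fit)$coefficients)
}
```

A negative hum coefficient alongside a positive temp coefficient would support both hypotheses jointly rather than one plot at a time.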