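# Load the packages used below (dplyr for %>% and mutate(), ggplot2 for the plots)
library(dplyr)
library(ggplot2)
# Load the dataset
bike_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")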
# Display the first few rows of the data
head(bike_data)
## instant dteday season yr mnth hr holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 0 6 0 1
## 2 2 2011-01-01 1 0 1 1 0 6 0 1
## 3 3 2011-01-01 1 0 1 2 0 6 0 1
## 4 4 2011-01-01 1 0 1 3 0 6 0 1
## 5 5 2011-01-01 1 0 1 4 0 6 0 1
## 6 6 2011-01-01 1 0 1 5 0 6 0 2
## temp atemp hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81 0.0000 3 13 16
## 2 0.22 0.2727 0.80 0.0000 8 32 40
## 3 0.22 0.2727 0.80 0.0000 5 27 32
## 4 0.24 0.2879 0.75 0.0000 3 10 13
## 5 0.24 0.2879 0.75 0.0000 0 1 1
## 6 0.24 0.2576 0.75 0.0896 0 1 1
I create a new numeric variable, temp_diff, defined as the normalized actual temperature (temp) minus the normalized feels-like temperature (atemp).
# Create temp_diff as a new variable (temp - atemp)
bike_data <- bike_data %>%
  mutate(temp_diff = temp - atemp)
# Display the structure of the dataset with new variable
str(bike_data)
## 'data.frame': 17379 obs. of 18 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
## $ temp_diff : num -0.0479 -0.0527 -0.0527 -0.0479 -0.0479 ...
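One caveat about temp_diff: temp and atemp are both normalized, and the dataset documentation describes them as normalized on different scales, so the raw difference is not a difference in degrees. The sketch below shows one way to put both back on a Celsius scale before subtracting. The divisors 41 and 50 are assumptions taken from one version of the dataset's README (other versions describe a min-max scaling), and to_celsius is a hypothetical helper, so check the documentation that ships with your copy before relying on it.

```r
# Hypothetical helper: undo a "divide by max" normalization.
# The max values used below (41 and 50) are ASSUMPTIONS, not verified
# against this copy of the dataset.
to_celsius <- function(x, max_value) x * max_value

# First three rows of temp and atemp, copied from the head() output
example <- data.frame(temp  = c(0.24, 0.22, 0.22),
                      atemp = c(0.2879, 0.2727, 0.2727))

example$temp_c      <- to_celsius(example$temp, 41)
example$atemp_c     <- to_celsius(example$atemp, 50)
example$temp_diff_c <- example$temp_c - example$atemp_c
example
```

The rest of this analysis keeps the normalized temp_diff as defined above; the rescaled version is only shown as an alternative.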
# Scatter plot for cnt vs temp
ggplot(bike_data, aes(x = temp, y = cnt)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Total Bike Rentals vs Temperature",
       x = "Normalized Temperature",
       y = "Total Bike Rentals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Correlation between cnt and temp
cor_cnt_temp <- cor(bike_data$cnt, bike_data$temp)
cor_cnt_temp
## [1] 0.4047723
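The correlation of about 0.40 indicates a moderate positive linear association between temperature and hourly rentals. Understanding this relationship can help in predicting rental demand from temperature forecasts, although a linear fit on temperature alone accounts for only about 16% of the variance in cnt (0.40² ≈ 0.16).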
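cor() returns only a point estimate. As a sketch of how one might also obtain a confidence interval and p-value for a Pearson correlation, the example below uses base R's cor.test() on simulated data; x and y are illustrative stand-ins, not the bike data. With the real data, the equivalent call would be cor.test(bike_data$cnt, bike_data$temp).

```r
# Simulated stand-in data (NOT the bike data): y depends on x plus noise
set.seed(42)
x <- runif(200)
y <- 100 * x + rnorm(200, sd = 50)

# Pearson correlation with a 95% confidence interval and p-value
result <- cor.test(x, y, method = "pearson", conf.level = 0.95)

result$estimate   # point estimate, same value as cor(x, y)
result$conf.int   # 95% confidence interval for the correlation
result$p.value    # p-value for H0: correlation = 0
```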
Next, I examine the relationship between total rentals and temp_diff, the difference between actual and feels-like temperature.
# Scatter plot for cnt vs temp_diff
ggplot(bike_data, aes(x = temp_diff, y = cnt)) +
  geom_point(alpha = 0.6, color = "purple") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Total Bike Rentals vs Temperature Difference",
       x = "Temperature Difference (temp - atemp)",
       y = "Total Bike Rentals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Next, I calculate the Pearson correlation coefficient for cnt and temp_diff.
# Correlation between cnt and temp_diff
cor_cnt_tempdiff <- cor(bike_data$cnt, bike_data$temp_diff)
cor_cnt_tempdiff
## [1] 0.2562878
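The correlation of about 0.26 is positive but weaker than the 0.40 observed for temp. The temperature difference is associated with rentals, though less strongly than actual temperature, and a correlation by itself does not establish a causal effect.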
Next, I calculate a 95% confidence interval for the mean number of total rentals per hour (each row of hour.csv is one hour).
# Summary statistics for cnt
cnt_mean <- mean(bike_data$cnt)
cnt_sd <- sd(bike_data$cnt)
n <- nrow(bike_data)
# Standard error
se_cnt <- cnt_sd / sqrt(n)
# Confidence interval (95%)
ci_lower <- cnt_mean - qt(0.975, df = n-1) * se_cnt
ci_upper <- cnt_mean + qt(0.975, df = n-1) * se_cnt
# Display confidence interval
ci_lower
## [1] 186.7661
ci_upper
## [1] 192.16
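The 95% confidence interval for mean hourly rentals is approximately [186.8, 192.2]. This gives a fairly precise estimate of the average for planning purposes, such as inventory management and resource allocation, although it describes the mean and not the range of individual hourly counts.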
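The manual calculation above is the standard one-sample t interval: mean ± qt(0.975, n - 1) × standard error. As a cross-check, base R's t.test() reports the same interval. The sketch below verifies this on simulated counts (the vector counts is a stand-in, not the bike data); with the real data, the call would be t.test(bike_data$cnt)$conf.int.

```r
# Simulated stand-in counts (NOT the bike data)
set.seed(1)
counts <- rpois(500, lambda = 190)

# Manual 95% t interval, following the same steps as above
n <- length(counts)
se <- sd(counts) / sqrt(n)
manual_ci <- mean(counts) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Built-in equivalent
builtin_ci <- as.numeric(t.test(counts, conf.level = 0.95)$conf.int)

# Compare the two intervals (up to floating-point tolerance)
all.equal(manual_ci, builtin_ci)
```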
In this analysis, I explored the relationships between total bike rentals (cnt) and temperature (temp), as well as the difference between actual and feels-like temperature (temp_diff) using the UCI Bike Sharing dataset.
Temperature (temp): A moderate positive correlation (0.40) indicates that higher temperatures are associated with increased bike rentals. The scatter plot corroborates this, showing an upward trend. However, the presence of outliers suggests that other factors may also influence rental counts in certain hours.
Temperature Difference (temp_diff): A weak positive correlation (0.26) suggests that the difference between actual and feels-like temperatures has only a limited linear association with bike rentals. The scatter plot shows more variability, indicating that temperature difference alone is not a strong predictor of rental behavior.
Confidence Interval for cnt: The 95% confidence interval [186.8, 192.2] provides a dependable estimate for the average number of bike rentals per hour, aiding in operational planning and decision-making.
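Because hour.csv has one row per hour, the confidence interval computed earlier is for mean rentals per hour. If a per-day figure is wanted, one option is to sum cnt within each dteday first and then build the interval on the daily totals. The sketch below illustrates this on a small made-up data frame that reuses the column names dteday and cnt; the values are invented for illustration.

```r
# Made-up hourly data for three days (values invented for illustration)
toy <- data.frame(
  dteday = rep(c("2011-01-01", "2011-01-02", "2011-01-03"), each = 4),
  cnt    = c(16, 40, 32, 13,  17, 17, 9, 6,  5, 2, 1, 3)
)

# Sum hourly counts within each day
daily <- aggregate(cnt ~ dteday, data = toy, FUN = sum)
daily

# 95% t interval for the mean daily total
daily_ci <- t.test(daily$cnt)$conf.int
daily_ci
```

With only three toy days the interval is very wide; on the full dataset the same two steps would be applied to bike_data.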