Data Dive — Confidence Intervals

Load the Dataset

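# Load the packages used below: dplyr for %>% and mutate(), ggplot2 for the plots
library(dplyr)
library(ggplot2)

# Load the dataset
bike_data <- read.csv("C:/Statistics for Data Science/Week 2/bike+sharing+dataset/hour.csv")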

# Display the first few rows of the data
head(bike_data)

##   instant     dteday season yr mnth hr holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1  0       0       6          0          1
## 2       2 2011-01-01      1  0    1  1       0       6          0          1
## 3       3 2011-01-01      1  0    1  2       0       6          0          1
## 4       4 2011-01-01      1  0    1  3       0       6          0          1
## 5       5 2011-01-01      1  0    1  4       0       6          0          1
## 6       6 2011-01-01      1  0    1  5       0       6          0          2
##   temp  atemp  hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81    0.0000      3         13  16
## 2 0.22 0.2727 0.80    0.0000      8         32  40
## 3 0.22 0.2727 0.80    0.0000      5         27  32
## 4 0.24 0.2879 0.75    0.0000      3         10  13
## 5 0.24 0.2879 0.75    0.0000      0          1   1
## 6 0.24 0.2576 0.75    0.0896      0          1   1

Data Preparation

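I create a new numeric variable, temp_diff, which is the difference between actual temperature (temp) and feels-like temperature (atemp). Both columns are stored on a normalized scale, so temp_diff is a difference of normalized values rather than of degrees.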

# Create temp_diff as a new variable (temp - atemp)
bike_data <- bike_data %>%
  mutate(temp_diff = temp - atemp)

# Display the structure of the dataset with new variable
str(bike_data)

## 'data.frame':    17379 obs. of  18 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
##  $ temp_diff : num  -0.0479 -0.0527 -0.0527 -0.0479 -0.0479 ...

Pair 1: cnt (Total Rentals) vs temp (Temperature)

Scatter Plot and Best Fit Line

# Scatter plot for cnt vs temp
ggplot(bike_data, aes(x = temp, y = cnt)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Total Bike Rentals vs Temperature",
       x = "Normalized Temperature",
       y = "Total Bike Rentals") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Insights:

The scatter plot shows a positive relationship between temperature and bike rentals.
As temperature increases, the number of bike rentals tends to increase.

Outliers:

Points that deviate significantly from the general trend may be considered outliers.
Discussion: Each row of hour.csv is one hour, not one day. In this plot, a few hours show high rentals despite lower temperatures, which could be due to special events or holidays.
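One way to make "deviates significantly from the trend" concrete is to flag points whose standardized residual from the fitted line is large. The sketch below is only an illustration on simulated data: the helper name flag_outliers and the cutoff k = 3 are my own assumptions, not part of the original analysis.

```r
# Hypothetical helper: indices of points whose standardized residual
# from a simple linear fit exceeds k in absolute value.
flag_outliers <- function(x, y, k = 3) {
  fit <- lm(y ~ x)
  z <- rstandard(fit)          # standardized residuals
  unname(which(abs(z) > k))
}

# Simulated stand-in for temp and cnt, with two planted outliers
set.seed(7)
x <- runif(100)
y <- 100 + 300 * x + rnorm(100, sd = 30)
y[c(10, 50)] <- y[c(10, 50)] + 400

flagged <- flag_outliers(x, y)
print(flagged)
```

On the real data the call would be flag_outliers(bike_data$temp, bike_data$cnt); the cutoff is a judgment call and should be tuned to the question at hand.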

Correlation Coefficient

Calculating the Pearson correlation coefficient to quantify the strength of the relationship.

# Correlation between cnt and temp
cor_cnt_temp <- cor(bike_data$cnt, bike_data$temp)
cor_cnt_temp

## [1] 0.4047723

Interpretation:

The correlation coefficient is approximately 0.405, indicating a moderate positive relationship.
This makes sense as warmer temperatures typically encourage more people to rent bikes.
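Since this report is about confidence intervals, the correlation itself can also be given an interval. The sketch below uses base R's cor.test() on simulated data; the helper name cor_with_ci and the toy variables are assumptions for illustration only.

```r
# Hypothetical helper: Pearson r together with its confidence interval.
cor_with_ci <- function(x, y, level = 0.95) {
  ct <- cor.test(x, y, method = "pearson", conf.level = level)
  c(r = unname(ct$estimate), lower = ct$conf.int[1], upper = ct$conf.int[2])
}

# Simulated stand-in for bike_data$temp and bike_data$cnt
set.seed(1)
toy_temp <- runif(200)
toy_cnt  <- 100 + 300 * toy_temp + rnorm(200, sd = 120)

res <- cor_with_ci(toy_temp, toy_cnt)
print(res)
```

With the real data this would be cor_with_ci(bike_data$temp, bike_data$cnt). The interval assumes independent observations, which hourly records only approximate.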

Significance:

Understanding this relationship helps in predicting bike rental demand based on temperature forecasts.

Further Questions:

How do other weather variables like humidity and windspeed interact with temperature to affect bike rentals?
Are there specific temperature ranges where bike rentals peak?
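A simple first pass at the temperature-range question is to bin temperature and compare mean rentals per bin. The following is a sketch on simulated data; the helper name mean_by_bin and the bin width of 0.2 are assumptions, not choices made in the analysis above.

```r
# Hypothetical helper: mean of y within equal-width bins of x.
mean_by_bin <- function(x, y, breaks = seq(0, 1, by = 0.2)) {
  bins <- cut(x, breaks = breaks, include.lowest = TRUE)
  tapply(y, bins, mean)
}

# Simulated stand-in: rentals rise with temperature, then level off
set.seed(2)
toy_temp <- runif(500)
toy_cnt  <- 400 * toy_temp * (1.2 - toy_temp) + rnorm(500, sd = 20)

bin_means <- mean_by_bin(toy_temp, toy_cnt)
print(round(bin_means, 1))
print(names(which.max(bin_means)))   # label of the bin with the highest mean
```

For the real data the call would be mean_by_bin(bike_data$temp, bike_data$cnt), since temp is already on a 0 to 1 scale.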

Pair 2: cnt (Total Rentals) vs temp_diff (Temperature Difference)

Scatter Plot and Best Fit Line

I am examining the relationship between total rentals and the difference between actual and feels-like temperature.

# Scatter plot for cnt vs temp_diff
ggplot(bike_data, aes(x = temp_diff, y = cnt)) +
  geom_point(alpha = 0.6, color = "purple") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Total Bike Rentals vs Temperature Difference",
       x = "Temperature Difference (temp - atemp)",
       y = "Total Bike Rentals") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Insights:

The scatter plot suggests a weaker relationship than cnt vs temp.
Rental counts vary widely at any given temperature difference, so temp_diff explains less of the variation in bike rentals.

Outliers:

A few points show high temperature differences with varying rental counts.
Discussion: Days with significant discrepancies between actual and feels-like temperatures may affect rider comfort differently.

Correlation Coefficient

Calculating the Pearson correlation coefficient for this pair.

# Correlation between cnt and temp_diff
cor_cnt_tempdiff <- cor(bike_data$cnt, bike_data$temp_diff)
cor_cnt_tempdiff

## [1] 0.2562878

Interpretation:

The correlation coefficient is approximately 0.26, indicating a weak positive relationship.
This suggests that temperature difference is less strongly associated with bike rentals than actual temperature.

Significance:

While temperature difference is associated with rentals, the association is weaker than for actual temperature. A correlation alone does not show that it affects rentals.

Further Questions for Investigation:

Does temperature difference interact with other variables (e.g., humidity) to influence bike rentals?
Are there non-linear relationships that Pearson’s correlation might not capture?
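One quick check for a monotonic but non-linear relationship is to compare Pearson's coefficient with Spearman's rank correlation, which base R's cor() supports through its method argument. The data below are simulated purely to illustrate the comparison and are not drawn from the bike dataset.

```r
# Simulated example: y increases with x, but along a strongly curved path
set.seed(3)
x <- runif(300)
y <- exp(5 * x) + rnorm(300, sd = 0.5)

pearson  <- cor(x, y, method = "pearson")    # measures linear association
spearman <- cor(x, y, method = "spearman")   # measures monotonic association

print(c(pearson = pearson, spearman = spearman))
```

A Spearman value noticeably above the Pearson value hints at curvature. For the real data the same two calls would take bike_data$temp_diff and bike_data$cnt.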

Confidence Intervals for Response Variable

Confidence Interval for cnt (Total Rentals)

Calculating a 95% confidence interval for the average total bike rentals.

# Summary statistics for cnt
cnt_mean <- mean(bike_data$cnt)
cnt_sd <- sd(bike_data$cnt)
n <- nrow(bike_data)

# Standard error
se_cnt <- cnt_sd / sqrt(n)

# Confidence interval (95%)
ci_lower <- cnt_mean - qt(0.975, df = n-1) * se_cnt
ci_upper <- cnt_mean + qt(0.975, df = n-1) * se_cnt

# Display confidence interval
ci_lower

## [1] 186.7661

ci_upper

## [1] 192.16

Interpretation:

The 95% confidence interval for the mean total bike rentals is approximately [186.8, 192.2].
We are 95% confident that the true average number of bike rentals per hour lies within this range, because each observation in hour.csv is one hour.
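The manual calculation above can be cross-checked against base R's t.test(), which reports the same t-based interval. This sketch wraps the manual formula in a helper (mean_ci, a name I am introducing for illustration) and compares the two on simulated counts.

```r
# Hypothetical helper: t-based confidence interval for a mean,
# using the same formula as the manual calculation above.
mean_ci <- function(x, level = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(lower = mean(x) - half, upper = mean(x) + half)
}

# Simulated stand-in for bike_data$cnt
set.seed(4)
toy_cnt <- rpois(1000, lambda = 190)

manual  <- mean_ci(toy_cnt)
builtin <- t.test(toy_cnt, conf.level = 0.95)$conf.int

print(manual)
print(as.numeric(builtin))
```

On the real data, t.test(bike_data$cnt)$conf.int should reproduce ci_lower and ci_upper.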

Significance:

This interval provides a reliable estimate for planning purposes, such as inventory management and resource allocation.

Further Questions:

How does the confidence interval change when stratifying the data by different seasons or weekdays vs. weekends?
Can we build confidence intervals for other response variables, such as rentals during peak hours?
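The stratified version of the interval could be sketched as below, computing one t-based interval per group. The helper name ci_by_group and the simulated data frame are assumptions for illustration; the group means in the toy data are arbitrary.

```r
# Hypothetical helper: mean and confidence interval of y within each level of g.
ci_by_group <- function(y, g, level = 0.95) {
  t(sapply(split(y, g), function(v) {
    ci <- t.test(v, conf.level = level)$conf.int
    c(mean = mean(v), lower = ci[1], upper = ci[2])
  }))
}

# Simulated stand-in with four seasons of 50 observations each
set.seed(5)
toy <- data.frame(
  season = rep(1:4, each = 50),
  cnt    = rpois(200, lambda = rep(c(110, 210, 240, 200), each = 50))
)

res <- ci_by_group(toy$cnt, toy$season)
print(round(res, 1))
```

For the real data the call would be ci_by_group(bike_data$cnt, bike_data$season), or with bike_data$workingday as the grouping variable.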

Detailed Conclusion

In this analysis, I explored the relationships between total bike rentals (cnt) and temperature (temp), as well as the difference between actual and feels-like temperature (temp_diff) using the UCI Bike Sharing dataset.

Temperature (temp): A moderate positive correlation (0.40) indicates that higher temperatures are associated with increased bike rentals. The scatter plot corroborates this, showing an upward trend. However, the presence of outliers suggests that other factors also influence rental counts in certain hours.
Temperature Difference (temp_diff): A weak positive correlation (0.26) suggests that the difference between actual and feels-like temperatures is only weakly associated with bike rentals. The scatter plot shows more variability, indicating that temperature difference alone is not a strong predictor of rental behavior.
Confidence Interval for cnt: The 95% confidence interval [186.8, 192.2] estimates the average number of bike rentals per hour, aiding in operational planning and decision-making.

Overall Insights:

Primary Driver: Of the two variables examined, actual temperature has the stronger association with bike rental demand.
Secondary Factors: While temperature difference plays a role, its impact is less pronounced, indicating the potential influence of other variables like humidity, windspeed, or socio-economic factors.
Operational Planning: Understanding these relationships helps in optimizing bike availability and maintenance schedules based on weather forecasts.

Further Investigations:

Multivariate Analysis: Incorporate additional weather variables to build a more comprehensive model predicting bike rentals.
Temporal Trends: Analyze how these relationships vary across different seasons, months, or days of the week.
Non-Linear Relationships: Explore non-linear models to capture more complex relationships between variables.
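As a starting point for the multivariate and non-linear directions, a linear model with several weather variables and a squared temperature term could look like the sketch below. The data are simulated and the coefficients are arbitrary; only the column names mirror the dataset, and the model form is an assumption rather than a recommendation.

```r
# Simulated stand-in with the same column names as the bike data
set.seed(6)
n <- 400
toy <- data.frame(
  temp      = runif(n),
  hum       = runif(n),
  windspeed = runif(n, 0, 0.6)
)
toy$cnt <- 50 + 600 * toy$temp - 350 * toy$temp^2 - 120 * toy$hum +
  rnorm(n, sd = 40)

# Multiple regression with a quadratic temperature term
fit <- lm(cnt ~ temp + I(temp^2) + hum + windspeed, data = toy)

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)
print(round(ci, 1))
```

Replacing data = toy with data = bike_data would fit the same formula to the real records, though hourly counts are skewed and autocorrelated, so the intervals should be read with caution.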

Aniket Shirsat

2024-10-01
