# Load packages (dplyr for mutate and %>%, ggplot2 for plots) and the dataset
library(dplyr)
library(ggplot2)
bike_data <- read.csv("/Users/roshannaidu/Desktop/IU Sem 2/Stats 1/bike+sharing+dataset/hour.csv")
# View structure and data types of variables
str(bike_data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
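Several integer columns in this structure (season, weathersit and similar) look like category codes, and dteday is stored as character. A minimal sketch of converting them follows. It uses a small made-up data frame named toy, which is an assumption for illustration only; the same calls would apply to bike_data.

```r
# Toy stand-in with the same column types as hour.csv (values are illustrative)
toy <- data.frame(
  dteday     = c("2011-01-01", "2011-01-01", "2011-01-02"),
  season     = c(1L, 1L, 2L),
  weathersit = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Parse the date string into a Date object
toy$dteday <- as.Date(toy$dteday, format = "%Y-%m-%d")

# Treat the integer codes as categories; both codes run from 1 to 4 in this dataset
toy$season     <- factor(toy$season, levels = 1:4)
toy$weathersit <- factor(toy$weathersit, levels = 1:4)

str(toy)
```

Whether to convert depends on the later analysis: correlations need numeric input, while group summaries and regression dummies are easier with factors.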
# View first few rows of the dataset
head(bike_data)
# View summary statistics for all variables
summary(bike_data)
## instant dteday season yr
## Min. : 1 Length:17379 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Mode :character Median :3.000 Median :1.0000
## Mean : 8690 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
# Check number of rows and columns
dim(bike_data)
## [1] 17379 17
# Display all variable names
names(bike_data)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"
# Check for missing values in each column
colSums(is.na(bike_data))
## instant dteday season yr mnth hr holiday
## 0 0 0 0 0 0 0
## weekday workingday weathersit temp atemp hum windspeed
## 0 0 0 0 0 0 0
## casual registered cnt
## 0 0 0
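Beyond missing values, a few internal consistency checks can catch silent data problems. The sketch below assumes that cnt is the sum of casual and registered (the first rows of the str() output are consistent with this) and that hr runs from 0 to 23. The helper name check_bike_data is hypothetical.

```r
# Hypothetical helper: returns a named logical vector of basic sanity checks
check_bike_data <- function(df) {
  c(
    no_missing  = !anyNA(df),
    cnt_adds_up = all(df$casual + df$registered == df$cnt),
    hr_in_range = all(df$hr >= 0 & df$hr <= 23)
  )
}

# Toy rows copied from the first three observations in the str() output
toy <- data.frame(
  hr         = c(0L, 1L, 2L),
  casual     = c(3L, 8L, 5L),
  registered = c(13L, 32L, 27L),
  cnt        = c(16L, 40L, 32L)
)

check_bike_data(toy)
```

With the real data the call would be check_bike_data(bike_data); any FALSE entry points to rows worth inspecting.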
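I am creating two new numeric variables: temp_diff, the difference between actual temperature (temp) and feels-like temperature (atemp), and log_cnt, the natural log of cnt + 1. Both temperature columns are stored on a normalized 0 to 1 scale, so temp_diff is in normalized units rather than degrees.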
bike_data <- bike_data %>%
mutate(
temp_diff = temp - atemp,
log_cnt = log(cnt + 1) # log transformation to stabilize variance
)
# Display the structure of the dataset with new variable
str(bike_data)
## 'data.frame': 17379 obs. of 19 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
## $ temp_diff : num -0.0479 -0.0527 -0.0527 -0.0479 -0.0479 ...
## $ log_cnt : num 2.833 3.714 3.497 2.639 0.693 ...
In this pair, cnt is the original variable and log_cnt is a transformed version created using a logarithmic transformation. Log transformations are commonly used for count data to reduce right skewness and stabilize variance.
ggplot(bike_data, aes(x = cnt, y = log_cnt)) +
geom_point(alpha = 0.5, color = "darkgreen") +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Log-Transformed Rentals vs Total Rentals",
x = "Total Rentals (cnt)",
y = "Log of Total Rentals") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Because log_cnt is a deterministic transformation of cnt, no true statistical outliers exist in this relationship. Every point follows the mathematical log function.
Any apparent deviation from the straight regression line is due to the nonlinear nature of the logarithmic transformation, not anomalous data points.
cor_cnt_logcnt <- cor(bike_data$cnt, bike_data$log_cnt)
cor_cnt_logcnt
## [1] 0.8205135
The Pearson correlation coefficient is approximately 0.82, indicating a strong positive linear association.
A high value is expected because log_cnt is directly derived from cnt. It falls short of 1 because Pearson's coefficient measures linear association, while the relationship here is logarithmic, which is visible in the curvature of the scatter plot.
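A rank-based (Spearman) correlation offers a useful contrast, since it depends only on ordering and a log transformation preserves order. The sketch below uses a few made-up counts rather than the real data, so the Pearson value it prints will not match the 0.82 reported above.

```r
# Toy counts (illustrative values, not taken from hour.csv)
cnt <- c(1, 3, 8, 16, 40, 120, 300, 977)
log_cnt <- log(cnt + 1)

# Pearson measures linear association; Spearman measures monotonic association
pearson  <- cor(cnt, log_cnt, method = "pearson")
spearman <- cor(cnt, log_cnt, method = "spearman")

pearson   # below 1, because the relationship is curved
spearman  # 1 up to rounding, because the ranks are identical
```

On the real data, cor(bike_data$cnt, bike_data$log_cnt, method = "spearman") would be the corresponding call.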
This transformation is useful for modeling purposes because it compresses large values, reduces right skewness, and stabilizes variance.
Such transformations are commonly used before fitting regression models to count data.
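The effect on skewness can be checked numerically. The sketch below is not run on hour.csv: it simulates right-skewed counts as a stand-in (an assumption), and the skewness helper is a hypothetical function defined here rather than one from a package.

```r
# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Simulated right-skewed counts standing in for cnt (an assumption)
set.seed(42)
cnt_sim <- rnbinom(5000, size = 1, mu = 190)

# log1p(x) computes log(1 + x), the same transformation as log(cnt + 1)
log_sim <- log1p(cnt_sim)
stopifnot(isTRUE(all.equal(log_sim, log(cnt_sim + 1))))

skewness(cnt_sim)  # positive: long right tail
skewness(log_sim)  # lower than the raw value; it can even turn negative
```

A log transformation can overshoot and leave the data left-skewed, so it is worth comparing skewness(bike_data$cnt) with skewness(bike_data$log_cnt) before relying on it.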
Scatter Plot and Best Fit Line
I am examining the relationship between total rentals and the difference between actual and feels-like temperature.
# Scatter plot for cnt vs temp_diff
ggplot(bike_data, aes(x = temp_diff, y = cnt)) +
geom_point(alpha = 0.6, color = "purple") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Total Bike Rentals vs Temperature Difference",
x = "Temperature Difference (temp - atemp)",
y = "Total Bike Rentals") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot suggests a weaker relationship than cnt vs temp. Rental counts vary widely at every value of temp_diff, so temperature difference on its own explains little of the variation in rentals.
A few points show large temperature differences with widely varying rental counts. Hours with a large gap between actual and feels-like temperature may affect rider comfort differently.
Calculating the Pearson correlation coefficient for the pair
# Correlation between cnt and temp_diff
cor_cnt_tempdiff <- cor(bike_data$cnt, bike_data$temp_diff)
cor_cnt_tempdiff
## [1] 0.2562878
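The correlation of approximately 0.26 indicates a weak positive linear association. Temperature difference is related to rentals, but less strongly than actual temperature, and a correlation alone does not show that it causes changes in rentals.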
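A point estimate of the correlation is easier to interpret alongside a confidence interval and p-value, which cor.test() provides. The sketch below wraps it in a hypothetical helper, cor_summary, and demonstrates it on simulated data (an assumption), not on hour.csv.

```r
# Hypothetical helper: Pearson r with its 95% confidence interval and p-value
cor_summary <- function(x, y) {
  ct <- cor.test(x, y, method = "pearson")
  c(
    r       = unname(ct$estimate),
    lower   = ct$conf.int[1],
    upper   = ct$conf.int[2],
    p_value = ct$p.value
  )
}

# Simulated weakly related variables standing in for temp_diff and cnt
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

cor_summary(x, y)
```

On the real data, cor_summary(bike_data$cnt, bike_data$temp_diff) and cor_summary(bike_data$cnt, bike_data$temp) would put the two temperature measures side by side.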
Confidence Interval for cnt (Total Rentals)
Calculating a 95% confidence interval for the average total bike rentals.
# Summary statistics for cnt
cnt_mean <- mean(bike_data$cnt)
cnt_sd <- sd(bike_data$cnt)
n <- nrow(bike_data)
# Standard error
se_cnt <- cnt_sd / sqrt(n)
# Confidence interval (95%)
ci_lower <- cnt_mean - qt(0.975, df = n-1) * se_cnt
ci_upper <- cnt_mean + qt(0.975, df = n-1) * se_cnt
# Display confidence interval
ci_lower
## [1] 186.7661
ci_upper
## [1] 192.16
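The 95% confidence interval for the mean total bike rentals is approximately [186.8, 192.2]. Because each row of hour.csv is one hour, we are 95% confident that the true average number of bike rentals per hour lies within this range.
This interval is a useful estimate for planning purposes, such as inventory management and resource allocation. Consecutive hourly counts are likely correlated, so the true uncertainty may be somewhat larger than the interval suggests.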
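The manual calculation can be cross-checked against R's built-in t.test(), which reports the same t-based interval. The sketch below uses the first ten cnt values shown in the str() output as a toy vector, so its numbers differ from the full-data interval.

```r
# First ten cnt values from the str() output (a toy sample, not the full data)
x <- c(16, 40, 32, 13, 1, 1, 2, 3, 8, 14)

# Manual interval, following the same steps as the calculation above
n  <- length(x)
se <- sd(x) / sqrt(n)
manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Built-in equivalent
builtin <- t.test(x, conf.level = 0.95)$conf.int

manual
as.numeric(builtin)
```

For the full dataset, t.test(bike_data$cnt)$conf.int should reproduce ci_lower and ci_upper.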
This analysis explored two relationships using the UCI Bike Sharing hourly dataset:
cnt vs log_cnt:
A strong positive correlation (≈ 0.82) was observed between total
rentals and its logarithmic transformation. This is expected because the
transformation is monotonic. The log transformation compresses higher
values and reduces skewness, making it useful for modeling count
data.
cnt vs temp_diff:
The relationship between total rentals and the gap between actual and
feels-like temperature was examined to see whether rentals change
systematically with that gap. The correlation (≈ 0.26) indicates a weak
positive linear association, so temperature difference alone explains
little of the variation in rentals.
Confidence Interval for cnt:
The 95% confidence interval, approximately [186.8, 192.2], provides a
precise estimate of the mean hourly rental demand. With over 17,000
observations, the Central Limit Theorem supports a normal approximation
for the sample mean, although correlation between consecutive hours
means the interval may understate the true uncertainty.