Roshan R Naidu (23/02/2026)

Loading and Exploring The Dataset

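# Load the packages used below (dplyr for %>% and mutate(), ggplot2 for plots)
library(dplyr)
library(ggplot2)

# Load the dataset
bike_data <- read.csv("/Users/roshannaidu/Desktop/IU Sem 2/Stats 1/bike+sharing+dataset/hour.csv")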

# View structure and data types of variables
str(bike_data)
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
# View first few rows of the dataset
head(bike_data)
# View summary statistics for all variables
summary(bike_data)
##     instant         dteday              season            yr        
##  Min.   :    1   Length:17379       Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   : 8690                      Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379                      Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0
# Check number of rows and columns
dim(bike_data)
## [1] 17379    17
# Display all variable names
names(bike_data)
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"
# Check for missing values in each column
colSums(is.na(bike_data))
##    instant     dteday     season         yr       mnth         hr    holiday 
##          0          0          0          0          0          0          0 
##    weekday workingday weathersit       temp      atemp        hum  windspeed 
##          0          0          0          0          0          0          0 
##     casual registered        cnt 
##          0          0          0

Data Preparation

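I am creating two new numeric variables: temp_diff, the difference between the actual temperature (temp) and the feels-like temperature (atemp), and log_cnt, the natural log of cnt + 1. Both temp and atemp are stored as normalized values (roughly 0 to 1 in the summary above), so temp_diff is a difference of normalized values rather than of degrees.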

bike_data <- bike_data %>%
  mutate(
    temp_diff = temp - atemp,
    log_cnt = log(cnt + 1)   # log transformation to stabilize variance
  )

# Display the structure of the dataset with new variable
str(bike_data)
## 'data.frame':    17379 obs. of  19 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
##  $ temp_diff : num  -0.0479 -0.0527 -0.0527 -0.0479 -0.0479 ...
##  $ log_cnt   : num  2.833 3.714 3.497 2.639 0.693 ...
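
The log(cnt + 1) transformation above can also be written with base R's log1p(), and undone with expm1(). The following is a minimal sketch, using the first ten cnt values from the str() output as a stand-in vector; the name cnt_example is illustrative and not part of the analysis.

```r
# First ten cnt values taken from the str() output above
cnt_example <- c(16, 40, 32, 13, 1, 1, 2, 3, 8, 14)

log_manual  <- log(cnt_example + 1)  # form used in the analysis
log_builtin <- log1p(cnt_example)    # equivalent base R helper

# The two forms agree up to floating-point tolerance
all.equal(log_manual, log_builtin)

# expm1() inverts the transformation, recovering the original counts
cnt_recovered <- expm1(log_builtin)
all.equal(cnt_recovered, cnt_example)
```

Being able to invert the transformation matters if a model is later fitted on log_cnt and its predictions need to be reported on the original rental scale.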

Pair 1: cnt (Original) vs log_cnt (Created)

In this pair, cnt is the original variable and log_cnt is a transformed version created using a logarithmic transformation. Log transformations are commonly used for count data to reduce right skewness and stabilize variance.

ggplot(bike_data, aes(x = cnt, y = log_cnt)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Log-Transformed Rentals vs Total Rentals",
       x = "Total Rentals (cnt)",
       y = "Log of Total Rentals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insights:

  1. The scatter plot shows a strong positive monotonic relationship between cnt and log_cnt.
  2. The curve flattens as cnt increases, reflecting the logarithmic compression of larger values.
  3. The relationship is nonlinear in shape but strictly increasing, which explains the high positive correlation.

Outliers:

Because log_cnt is a deterministic transformation of cnt, no true statistical outliers exist in this relationship. Every point follows the mathematical log function.

Any apparent deviation from the straight regression line is due to the nonlinear nature of the logarithmic transformation, not anomalous data points.

cor_cnt_logcnt <- cor(bike_data$cnt, bike_data$log_cnt)
cor_cnt_logcnt
## [1] 0.8205135

Correlation Interpretation:

The Pearson correlation coefficient is approximately 0.82, indicating a strong positive association.

A high value is expected because log_cnt is derived directly from cnt. It falls short of 1 only because Pearson's coefficient measures linear association and the relationship is logarithmic, which is visible in the curvature of the scatter plot.
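
Pearson's coefficient measures linear association, whereas a rank-based coefficient such as Spearman's depends only on the ordering of the values. The sketch below contrasts the two on an illustrative grid of counts spanning the observed range of cnt; the vectors cnt_example and log_cnt_example are assumptions for demonstration, not the real data.

```r
# Illustrative counts spanning the observed range of cnt (1 to 977)
cnt_example <- seq(1, 977, by = 8)
log_cnt_example <- log(cnt_example + 1)

# Pearson measures linear association, so the curvature pulls it below 1
pearson_r <- cor(cnt_example, log_cnt_example, method = "pearson")

# Spearman works on ranks; a strictly increasing transform preserves ranks
spearman_rho <- cor(cnt_example, log_cnt_example, method = "spearman")

pearson_r
spearman_rho
```

On the real data the same comparison would be cor(bike_data$cnt, bike_data$log_cnt, method = "spearman"). Because the transformation is strictly increasing, the rank correlation should equal 1 there as well.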

Significance:

This transformation is often useful for modeling purposes because, for right-skewed count data:

  • It tends to reduce right skewness.
  • It tends to stabilize variance.
  • It often makes the distribution more symmetric.

Such transformations are commonly used before fitting regression models to count data.
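
One way to check the skewness claim is to compare sample skewness before and after the transformation. The sketch below does this on simulated right-skewed counts, not on the bike-sharing data; the helper sample_skewness and the simulation parameters are assumptions chosen for illustration. Note that a log transformation can overshoot and leave the result left-skewed, so the sign of the transformed skewness is worth inspecting rather than assuming.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical right-skewed counts; NOT the bike-sharing data
sim_cnt <- rnbinom(5000, size = 1, mu = 190) + 1

# Moment-based sample skewness (helper defined here for illustration)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- sample_skewness(sim_cnt)
skew_log <- sample_skewness(log(sim_cnt + 1))

c(raw = skew_raw, log = skew_log)
```

Applying the same helper to bike_data$cnt and bike_data$log_cnt would show how far the transformation moves the real distribution toward symmetry.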

Further Questions:

  1. Would modeling log_cnt instead of cnt improve regression performance?
  2. Does the variance of rentals decrease after transformation?

Pair 2: cnt (Total Rentals) vs temp_diff (Temperature Difference)

Scatter Plot and Best Fit Line

I am examining the relationship between total rentals and the difference between actual and feels-like temperature.

# Scatter plot for cnt vs temp_diff
ggplot(bike_data, aes(x = temp_diff, y = cnt)) +
  geom_point(alpha = 0.6, color = "purple") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Total Bike Rentals vs Temperature Difference",
       x = "Temperature Difference (temp - atemp)",
       y = "Total Bike Rentals") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insights

The scatter plot suggests a weak positive relationship, weaker than the one between cnt and temp. Rental counts vary widely at any given temperature difference, so temp_diff on its own appears to explain little of the variation in rentals.

Outliers:

A few points show large temperature differences with widely varying rental counts. Hours with a large gap between actual and feels-like temperature may affect rider comfort differently.

Correlation Coefficient

Calculating the Pearson correlation coefficient for the pair.

# Correlation between cnt and temp_diff
cor_cnt_tempdiff <- cor(bike_data$cnt, bike_data$temp_diff)
cor_cnt_tempdiff
## [1] 0.2562878

Interpretation:

  1. The correlation coefficient is approximately 0.26, indicating a weak positive linear relationship.
  2. This suggests that temperature difference is only weakly associated with bike rentals, less so than actual temperature.
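
A point estimate of the correlation can be supplemented with a confidence interval and a test against zero using cor.test() from base R's stats package. The sketch below uses simulated stand-ins for temp_diff and cnt; x_sim, y_sim, and the slope of 0.25 are assumptions for illustration only.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical stand-ins for temp_diff and cnt; NOT the real data
x_sim <- rnorm(1000)
y_sim <- 0.25 * x_sim + rnorm(1000)

ct <- cor.test(x_sim, y_sim, method = "pearson", conf.level = 0.95)

ct$estimate   # sample correlation
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for the null hypothesis of zero correlation
```

On the real data the call would be cor.test(bike_data$cnt, bike_data$temp_diff). With more than 17,000 rows even a weak correlation is likely to be statistically significant, so the size of the estimate matters more than the p-value.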

Significance:

Temperature difference is positively associated with rentals, but the association is weaker than that of actual temperature. Correlation alone does not establish that temp_diff affects rentals.

Further Questions for Investigation:

  1. Does temperature difference interact with other variables (e.g., humidity) to influence bike rentals?
  2. Are there non-linear relationships that Pearson’s correlation might not capture?

Confidence Intervals for Response Variable

Confidence Interval for cnt (Total Rentals)

Calculating a 95% confidence interval for the average total bike rentals.

# Summary statistics for cnt
cnt_mean <- mean(bike_data$cnt)
cnt_sd <- sd(bike_data$cnt)
n <- nrow(bike_data)

# Standard error
se_cnt <- cnt_sd / sqrt(n)

# Confidence interval (95%)
ci_lower <- cnt_mean - qt(0.975, df = n-1) * se_cnt
ci_upper <- cnt_mean + qt(0.975, df = n-1) * se_cnt

# Display confidence interval
ci_lower
## [1] 186.7661
ci_upper
## [1] 192.16

Interpretation:

The 95% confidence interval for the mean total bike rentals is approximately [186.8, 192.2]. Because each row of the dataset is one hour, we are 95% confident that the true average number of bike rentals per hour lies within this range.
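
The manual calculation can be cross-checked against t.test(), which reports the same one-sample t-interval. The sketch below uses simulated counts rather than the bike-sharing data; x_sim and the other names are illustrative assumptions.

```r
set.seed(7)  # reproducible simulated data

# Hypothetical hourly counts; NOT the bike-sharing data
x_sim <- rpois(500, lambda = 190)

# Manual interval: mean +/- t critical value * standard error
n_sim  <- length(x_sim)
se_sim <- sd(x_sim) / sqrt(n_sim)
t_crit <- qt(0.975, df = n_sim - 1)
ci_manual <- mean(x_sim) + c(-1, 1) * t_crit * se_sim

# t.test() computes the same one-sample interval
ci_ttest <- as.numeric(t.test(x_sim, conf.level = 0.95)$conf.int)

all.equal(ci_manual, ci_ttest)
```

On the real data, t.test(bike_data$cnt)$conf.int should reproduce ci_lower and ci_upper.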

Significance:

This interval provides a reliable estimate for planning purposes, such as inventory management and resource allocation.

Further Questions:

  1. How does the confidence interval change when stratifying the data by different seasons or weekdays vs. weekends?
  2. Can we build confidence intervals for other response variables, such as rentals during peak hours?
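
For the first question, one possible approach is to split the response by group and compute the same t-interval within each group. The sketch below uses a simulated data frame with a season column; the data, the per-season means, and the helper mean_ci are assumptions for illustration, not results from the bike-sharing data.

```r
set.seed(3)  # reproducible simulated data

# Hypothetical data frame with a grouping column; NOT the real data
sim <- data.frame(
  season = rep(1:4, each = 200),
  cnt    = rpois(800, lambda = rep(c(110, 210, 240, 200), each = 200))
)

# t-interval for the mean of one numeric vector
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# One row per season, with columns mean, lower, upper
ci_by_season <- t(sapply(split(sim$cnt, sim$season), mean_ci))
ci_by_season
```

The same pattern would apply to split(bike_data$cnt, bike_data$season) or to a split on workingday. Each group has fewer observations than the full dataset, so the per-group intervals would generally be wider.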

Detailed Conclusion

This analysis explored two relationships and one confidence interval using the UCI Bike Sharing hourly dataset:

  1. cnt vs log_cnt:
    A strong positive correlation (≈ 0.82) was observed between total rentals and its logarithmic transformation. This is expected because the transformation is monotonic. The log transformation compresses higher values and reduces skewness, making it useful for modeling count data.

  2. cnt vs temp_diff:
    A weak positive correlation (≈ 0.26) was observed between total rentals and the difference between actual and feels-like temperature. Temperature difference on its own is only weakly associated with rental demand.

  3. Confidence Interval for cnt:
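    The 95% confidence interval, approximately [186.8, 192.2], gives a narrow estimate of the mean hourly rental demand. With over 17,000 observations, the Central Limit Theorem supports a normal approximation for the sample mean. The interval does treat hourly observations as independent, so serial correlation between consecutive hours could make it narrower than it should be.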

Overall Insights:

  • Transformations like log(cnt) are valuable for statistical modeling.
  • Temperature deviation may not strongly drive changes in rental behavior.
  • Large sample sizes yield narrow confidence intervals, increasing precision of estimates.

Further Investigation:

  • Examine how season, humidity, and windspeed jointly influence rentals.
  • Explore multiple regression models.
  • Investigate potential time-series effects and autocorrelation.