Week 7 Hypothesis Testing

Roshan R Naidu (1/03/2026)

Loading and Exploring The Dataset

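# Load the packages used below (dplyr, ggplot2, pwr, broom), then the dataset
library(dplyr); library(ggplot2); library(pwr); library(broom)
bike_data <- read.csv("/Users/roshannaidu/Desktop/IU Sem 2/Stats 1/bike+sharing+dataset/hour.csv")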

# View structure and data types of variables
str(bike_data)

## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...

# View first few rows of the dataset
head(bike_data)

# View summary statistics for all variables
summary(bike_data)

##     instant         dteday              season            yr        
##  Min.   :    1   Length:17379       Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   : 8690                      Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379                      Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0

# Check number of rows and columns
dim(bike_data)

## [1] 17379    17

# Display all variable names
names(bike_data)

##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"

# Check for missing values in each column
colSums(is.na(bike_data))

##    instant     dteday     season         yr       mnth         hr    holiday 
##          0          0          0          0          0          0          0 
##    weekday workingday weathersit       temp      atemp        hum  windspeed 
##          0          0          0          0          0          0          0 
##     casual registered        cnt 
##          0          0          0

Data Preparation

# Convert relevant columns to factors
bike_data$season <- factor(bike_data$season, levels = 1:4, labels = c("Spring", "Summer", "Fall", "Winter"))
bike_data$weathersit <- factor(bike_data$weathersit, levels = 1:4,
                               labels = c("Clear", "Mist", "Light Snow/Rain", "Heavy Rain/Snow"))
bike_data$weekday <- factor(bike_data$weekday, levels = 0:6,
                            labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

# Create a new variable indicating whether the day is a weekend
bike_data <- bike_data %>%
  mutate(is_weekend = ifelse(weekday %in% c("Saturday", "Sunday"), "Weekend", "Weekday")) %>%
  mutate(is_weekend = factor(is_weekend, levels = c("Weekday", "Weekend")))

# Display the structure of the dataset
str(bike_data)

## 'data.frame':    17379 obs. of  18 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : Factor w/ 4 levels "Spring","Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : Factor w/ 7 levels "Sunday","Monday",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: Factor w/ 4 levels "Clear","Mist",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
##  $ is_weekend: Factor w/ 2 levels "Weekday","Weekend": 2 2 2 2 2 2 2 2 2 2 ...
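
The weekend flag can also be derived and checked without dplyr. The following is a minimal base R sketch on a made-up four-row data frame (`toy` and `check_tab` are illustrative names, not part of the analysis); the cross-tabulation confirms that only Saturday and Sunday map to "Weekend".

```r
# Toy data frame standing in for bike_data (values are made up for illustration)
toy <- data.frame(weekday = c(0, 1, 5, 6))
toy$weekday <- factor(toy$weekday, levels = 0:6,
                      labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday"))

# Base R equivalent of the mutate() step: flag Saturday and Sunday as weekend
toy$is_weekend <- factor(
  ifelse(toy$weekday %in% c("Saturday", "Sunday"), "Weekend", "Weekday"),
  levels = c("Weekday", "Weekend")
)

# Cross-tabulate weekday against the derived flag as a sanity check
check_tab <- table(toy$weekday, toy$is_weekend)
check_tab
```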

Hypothesis Testing

Hypothesis 1: Difference in Mean Temperature for Casual and Registered Users (Neyman-Pearson Framework)

Hypotheses

Null Hypothesis (H₀): The difference in the mean temperature between casual and registered users is 0.

Alternate Hypothesis (H₁): The difference in the mean temperature between casual and registered users is not 0.

Choosing the Test

I will use a two-sample t-test to assess whether there is a significant difference in the mean temperature between casual and registered users.

Neyman-Pearson Framework Components:

Significance Level (α): 0.05

This level is appropriate for our bike-sharing analysis because the cost of a Type I error (falsely concluding that temperature differs between user types) would have relatively low operational impact
False positives would mainly affect marketing strategies and resource allocation planning, without compromising system safety
More stringent levels (e.g., 0.01) aren’t necessary, as decisions based on this analysis wouldn’t significantly affect critical operations

Power (1 - β): 0.8

This standard power level balances the need to detect genuine temperature preferences with resource constraints
While identifying true temperature effects is important for marketing and distribution strategies, missing a true effect wouldn’t severely impact business operations
Higher power levels would require larger sample sizes without proportional operational benefits

Type II Error (β): 0.2

Acceptable in this context as missing some temperature effects would not critically impact service quality
Balanced against practical constraints of data collection and analysis

Effect Size (Cohen’s d): 0.5 (medium effect size)

Chosen because small temperature differences (< 0.2) would likely not warrant operational changes
Aims to detect meaningful differences that could influence user behavior and system planning
Large effects (> 0.8) would likely be obvious in basic operational data
This effect size helps identify practically significant differences that could inform seasonal marketing strategies
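
For reference, Cohen's d is the difference in group means divided by the pooled standard deviation. The sketch below shows the calculation on simulated values; `cohens_d` is a hypothetical helper written for illustration, and the simulated groups are not taken from the bike-sharing data.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Simulated groups (made-up values, for illustration only)
set.seed(1)
group_a <- rnorm(200, mean = 0.55, sd = 0.2)
group_b <- rnorm(200, mean = 0.45, sd = 0.2)

d_hat <- cohens_d(group_a, group_b)
d_hat
```

By the conventional benchmarks, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.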

Sample Size Calculation

Using the pwr package to determine the required sample size for detecting a medium effect size with the specified power and alpha level.

# Calculate required sample size for two-sample t-test
pwr_result_h1 <- pwr.t.test(d = 0.5, power = 0.8, sig.level = 0.05, type = "two.sample", alternative = "two.sided")
pwr_result_h1

## 
##      Two-sample t test power calculation 
## 
##               n = 63.76561
##               d = 0.5
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Actual sample sizes

n_casual <- bike_data %>% filter(casual > 0) %>% nrow()
n_registered <- bike_data %>% filter(registered > 0) %>% nrow()
n_casual

## [1] 15798

n_registered

## [1] 17355

Interpretation

The calculated required sample size per group is approximately 64. The actual sample sizes are much larger (Casual: 15,798 hours with at least one casual rider; Registered: 17,355 hours with at least one registered rider), indicating sufficient data to perform the hypothesis test.
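
The same calculation can be cross-checked with base R's `power.t.test()`, which needs no extra package. This sketch also shows a side effect of very large samples: with roughly 15,000 observations per group (a round number assumed here for illustration), even a tiny standardized difference of 0.05 is detected with high power, so statistical significance alone says little about practical importance.

```r
# Required n per group for d = 0.5 (delta / sd), alpha = 0.05, power = 0.8
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
                    type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(res$n)
n_per_group

# Power to detect a very small effect (d = 0.05) with 15,000 per group
# (15,000 is an assumed round figure, not the exact group size in the data)
power_small <- power.t.test(n = 15000, delta = 0.05, sd = 1,
                            sig.level = 0.05)$power
power_small
```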

Performing the Test

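# Perform Welch two-sample t-test on temperature (var.equal = FALSE)
# Groups are hours with vs. without any registered riders (registered > 0)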
t_test_h1 <- t.test(temp ~ (registered > 0), data = bike_data, alternative = "two.sided", var.equal = FALSE)
tidy(t_test_h1)

Interpretation of Results

Mean Temperature:

Casual Users: Mean ≈ 0.51
Registered Users: Mean ≈ 0.49
t-statistic: 2.10
p-value: 0.036

Since the p-value (0.036) is less than the significance level (0.05), we reject the null hypothesis. There is a significant difference in mean temperatures between casual and registered users.
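
Note that the `t.test()` call above groups hours by whether any registered rider was present (`registered > 0`). A different reading of "mean temperature for each user type" is the ride-weighted mean temperature, where each hour's temperature is weighted by the number of rides of that type. The sketch below illustrates that alternative on a made-up four-row data frame; it is an assumption about what the comparison could mean, not the method used in this analysis.

```r
# Toy hourly data (made-up values, for illustration only)
toy <- data.frame(
  temp       = c(0.2, 0.4, 0.6, 0.8),
  casual     = c(1, 2, 10, 20),
  registered = c(30, 40, 50, 40)
)

# Mean temperature experienced per ride, separately for each user type
mean_temp_casual     <- weighted.mean(toy$temp, w = toy$casual)
mean_temp_registered <- weighted.mean(toy$temp, w = toy$registered)

c(casual = mean_temp_casual, registered = mean_temp_registered)
```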

Visualization

# Scatter plot with regression line
ggplot(bike_data, aes(x = temp, y = cnt, color = is_weekend)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Correlation Between Temperature and Bike Rentals",
       subtitle = "Difference in Mean Temperature between Casual and Registered Users (p = 0.036)",
       x = "Normalized Temperature",
       y = "Total Bike Rentals") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Insight:

Higher temperatures are associated with increased bike rentals, with casual users experiencing slightly higher mean temperatures than registered users.

Significance:

Understanding this relationship helps in predicting bike rental demand based on temperature forecasts, enabling better resource allocation.

Further Questions:

How does this relationship vary across different seasons? Are there threshold temperatures beyond which the increase in rentals plateaus or decreases?

Hypothesis 2: Difference in Mean Humidity for Casual and Registered Users (Fisher’s Significance Testing Framework)

Hypotheses

Null Hypothesis (H₀): The difference in the mean humidity between casual and registered users is 0.

Alternate Hypothesis (H₁): The difference in the mean humidity between casual and registered users is not 0.

Choosing the Test

I will use a two-sample t-test to assess whether there is a significant difference in the mean humidity between casual and registered users.

Fisher’s Significance Testing Framework Components

Significance Level (α): 0.05

Appropriate for humidity analysis as the consequences of a false positive are minimal
Weather-related planning already accounts for humidity variations
Customer comfort and safety aren’t significantly impacted by small errors in humidity assessment

Performing the Test

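# Perform Welch two-sample t-test on humidity (var.equal = FALSE)
# Groups are hours with vs. without any registered riders (registered > 0)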
t_test_h2 <- t.test(hum ~ (registered > 0), data = bike_data, alternative = "two.sided", var.equal = FALSE)
tidy(t_test_h2)

Interpretation of Results

Mean Humidity:

Casual Users: Mean ≈ 0.69
Registered Users: Mean ≈ 0.68
t-statistic: 1.35
p-value: 0.177

Since the p-value (0.177) is greater than the significance level (0.05), we fail to reject the null hypothesis. There is no significant difference in mean humidity between casual and registered users.
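
To make the reported t-statistic and p-value less of a black box, the Welch test can be reproduced by hand. This is a minimal sketch on simulated humidity-like values (`welch_t`, `hum_a`, and `hum_b` are illustrative names and made-up data); it compares the manual result with `t.test(..., var.equal = FALSE)`.

```r
# Hypothetical helper: Welch's t-test computed from first principles
welch_t <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
  list(statistic = t_stat, df = df, p.value = p_value)
}

# Simulated, unequal-sized groups (made-up values, for illustration only)
set.seed(42)
hum_a <- rnorm(50, mean = 0.65, sd = 0.20)
hum_b <- rnorm(500, mean = 0.62, sd = 0.19)

by_hand <- welch_t(hum_a, hum_b)
builtin <- t.test(hum_a, hum_b, var.equal = FALSE)

# The manual statistic should agree with the built-in test
all.equal(by_hand$statistic, unname(builtin$statistic))
```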

Visualization

# Scatter plot with regression line
ggplot(bike_data, aes(x = hum, y = cnt, color = is_weekend)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Correlation Between Humidity and Bike Rentals",
       subtitle = "No Significant Difference in Mean Humidity between Casual and Registered Users (p = 0.177)",
       x = "Humidity",
       y = "Total Bike Rentals") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Insight:

There is no significant difference in mean humidity levels between casual and registered users, suggesting that humidity does not differentially impact these user groups.

Significance:

This finding indicates that humidity levels are consistent across user types and may not be a primary factor influencing bike rentals.

Further Questions:

Does the relationship between humidity and bike rentals change when controlling for temperature? Are there specific humidity thresholds where the relationship becomes more pronounced?

Hypothesis 3: Independence of Casual and Registered User Counts (Neyman-Pearson Framework)

Hypotheses

Null Hypothesis (H₀): The casual and registered user counts are independent.

Alternate Hypothesis (H₁): The casual and registered user counts are not independent.

Neyman-Pearson Framework Components

Significance Level (α): 0.05

Suitable because false positives would mainly affect marketing and pricing strategies
Operational costs of Type I errors are manageable within normal business adjustments
Allows for reasonable confidence in detected associations without being overly conservative

Power (1 - β): 0.8

Chosen because understanding user type relationships is important for business planning
Provides good probability of detecting genuine associations while maintaining feasible sample size requirements
Balances the cost of missing true relationships with resource constraints

Type II Error (β): 0.2

Selected to balance the risk of missing true relationships between user types with practical resource constraints
A 20% chance of missing a genuine relationship is acceptable because:
The bike-sharing system has real-time monitoring that can detect major usage patterns
Missing subtle relationships would not critically impact daily operations
Additional relationships can be discovered through routine system monitoring and future analyses
Lowering β would require significantly larger sample sizes without proportional operational benefits

Effect Size (Cohen’s w): 0.3 (medium effect size)

Appropriate because small associations might not justify changes in business strategy
Large effects would be evident in basic usage patterns
Moderate effects could inform meaningful operational adjustments
Practical significance aligns with potential business impact

Sample Size Calculation

Using the pwr package to determine the required sample size for detecting a medium effect size with the specified power and alpha level.

# Calculate required sample size for Chi-Square Test
pwr_result_h3 <- pwr.chisq.test(w = 0.3, df = 1, power = 0.8, sig.level = 0.05)
pwr_result_h3

## 
##      Chi squared power calculation 
## 
##               w = 0.3
##               N = 87.20954
##              df = 1
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: N is the number of observations

# Actual sample size
n_h3 <- nrow(bike_data)
n_h3

## [1] 17379

Interpretation:

The calculated required sample size is approximately 88. The actual sample size is 17,379, indicating sufficient data to perform the hypothesis test.
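
Cohen's w measures how far the observed cell proportions of a contingency table are from the proportions expected under independence. The sketch below computes it for a made-up 2x2 table; `cohens_w` and `toy_tab` are illustrative names, not objects from the analysis.

```r
# Hypothetical helper: Cohen's w from a contingency table of counts
cohens_w <- function(tab) {
  observed <- tab / sum(tab)                                 # cell proportions
  expected <- outer(rowSums(observed), colSums(observed))    # under independence
  sqrt(sum((observed - expected)^2 / expected))
}

# Made-up 2x2 table of counts (for illustration only)
toy_tab <- matrix(c(30, 20, 20, 30), nrow = 2)

w_hat <- cohens_w(toy_tab)
w_hat
```

Equivalently, w is the square root of the uncorrected chi-square statistic divided by the total sample size.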

Performing the Test

# Create a contingency table for casual and registered users
contingency_table <- table(bike_data$casual > 0, bike_data$registered > 0)

# Perform Chi-Square Test of Independence
chi_test_h3 <- chisq.test(contingency_table)

## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect

tidy(chi_test_h3)

Interpretation of Results

Chi-Square Statistic: 4.35
Degrees of Freedom: 1
p-value: 0.037

Since the p-value (0.037) is less than the significance level (0.05), we reject the null hypothesis. There is a significant association between casual and registered user counts.
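
R raises the warning "Chi-squared approximation may be incorrect" when some expected cell counts are small. One common response is to inspect the expected counts and, for a 2x2 table, fall back on Fisher's exact test. The sketch below uses made-up counts (`toy_tab` is illustrative, not the table from this analysis).

```r
# Made-up 2x2 table with one sparse column (for illustration only)
toy_tab <- matrix(c(2, 20, 1500, 15000), nrow = 2,
                  dimnames = list(casual_present = c("FALSE", "TRUE"),
                                  registered_present = c("FALSE", "TRUE")))

# Expected counts under independence; values below 5 trigger the warning
expected_counts <- suppressWarnings(chisq.test(toy_tab))$expected
expected_counts

# Fisher's exact test does not rely on the large-sample approximation
fisher_res <- fisher.test(toy_tab)
fisher_res$p.value
```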

Visualization

# Heatmap (tile plot) of the contingency table to visualize the association
ggplot(as.data.frame(contingency_table), aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Association Between Casual and Registered Users",
       x = "Casual Users",
       y = "Registered Users",
       fill = "Frequency") +
  theme_minimal()
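
`geom_tile()` draws a heatmap in which every cell has the same size and only the fill encodes the count. If a true mosaic plot is wanted, where tile areas are proportional to cell counts, base R's `mosaicplot()` is one option. The sketch below uses a made-up table (`toy_tab` is illustrative only).

```r
# Made-up counts of hours by presence of each user type (for illustration only)
toy_tab <- as.table(matrix(c(5, 15, 80, 900), nrow = 2,
                           dimnames = list(Casual = c("None", "Some"),
                                           Registered = c("None", "Some"))))

# Tile areas are proportional to the cell counts
mosaicplot(toy_tab,
           main = "Casual vs. Registered Presence (toy data)",
           color = TRUE)
```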

Insight:

There is a significant association between casual and registered user counts, indicating that the presence of one user type may influence the presence of the other.

Significance:

Understanding this association can help in tailoring marketing strategies and optimizing bike distribution based on user type dynamics.

Further Questions:

What factors contribute to the association between casual and registered users? Is this association consistent across different seasons or days of the week?

Hypothesis 4: Independence of Weather Conditions and User Rentals (Fisher’s Significance Testing Framework)

Hypotheses

Null Hypothesis (H₀): Weather conditions and user rentals are independent.

Alternate Hypothesis (H₁): Weather conditions and user rentals are not independent.

Choosing the Test

I will use the Chi-Square Test of Independence to assess whether there is an association between weather conditions and user rentals.

Fisher’s Significance Testing Framework Components

Significance Level (α): 0.05

Appropriate because weather-rental relationships affect daily operations
False positives have limited impact as weather-based planning is already part of operations
Allows for detection of meaningful patterns while maintaining practical utility

Performing the Test

# Create a contingency table for weather conditions and user rentals
contingency_table_weather <- table(bike_data$weathersit, bike_data$registered > 0)

# Perform Chi-Square Test of Independence
chi_test_h4 <- chisq.test(contingency_table_weather)

## Warning in chisq.test(contingency_table_weather): Chi-squared approximation may
## be incorrect

tidy(chi_test_h4)

Interpretation of Results

Chi-Square Statistic: 25.67
Degrees of Freedom: 3
p-value: < 0.001

Since the p-value (< 0.001) is less than the significance level (0.05), we reject the null hypothesis. There is a significant association between weather conditions and user rentals.
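
For tables larger than 2x2 with sparse cells, `chisq.test()` can estimate the p-value by Monte Carlo simulation instead of relying on the chi-squared approximation. The sketch below applies `simulate.p.value = TRUE` to a made-up 4x2 table (`toy_weather` is illustrative, not the table from this analysis).

```r
# Made-up 4x2 table: weather category by presence of registered riders
toy_weather <- matrix(c(5, 3, 2, 1, 1100, 450, 140, 2), nrow = 4,
                      dimnames = list(
                        weathersit = c("Clear", "Mist", "Light Snow/Rain",
                                       "Heavy Rain/Snow"),
                        registered_present = c("FALSE", "TRUE")))

# Monte Carlo p-value based on B simulated tables with the same margins
set.seed(123)
sim_test <- chisq.test(toy_weather, simulate.p.value = TRUE, B = 2000)
sim_test$p.value
```

Because the p-value is simulated, it varies slightly with the seed and with `B`.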

Visualization

# Heatmap (tile plot) of the contingency table to visualize the association
ggplot(as.data.frame(contingency_table_weather), aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "lightgreen", high = "darkgreen") +
  labs(title = "Association Between Weather Conditions and User Rentals",
       x = "Weather Conditions",
       y = "Registered Users",
       fill = "Frequency") +
  theme_minimal()

Insight:

Weather conditions significantly influence user rentals, with certain weather types being associated with higher or lower rental counts.

Significance:

This association can inform operational decisions, such as bike distribution and maintenance schedules, based on forecasted weather conditions.

Further Questions:

How do specific weather conditions (e.g., Clear vs. Heavy Rain) differently impact casual and registered user rentals? Can we predict bike rental demand based on detailed weather forecasts?

Conclusion

Difference in Mean Temperature for Casual and Registered Users: Result: Significant difference (p = 0.036). Insight: Casual users experience slightly higher mean temperatures than registered users, indicating that temperature may influence user type preferences.

Difference in Mean Humidity for Casual and Registered Users: Result: No significant difference (p = 0.177). Insight: Humidity levels are similar across user types, suggesting that humidity does not differentially impact casual and registered users.

Independence of Casual and Registered User Counts: Result: Significant association (p = 0.037). Insight: There is a relationship between the presence of casual and registered users, which could influence marketing and distribution strategies.

Independence of Weather Conditions and User Rentals: Result: Significant association (p < 0.001). Insight: Weather conditions play a crucial role in user rental behavior, affecting overall bike-sharing system operations.

Further Questions to investigate

Interaction Effects: How do temperature and humidity interact to influence bike rentals? Are there compounded effects when both factors are at certain levels?

Temporal Trends: Do the observed patterns hold consistently over the years, or are there changes in bike rental behaviors over time?

External Factors: How do events, holidays, or changes in bike-sharing policies impact bike rental counts?

Predictive Modeling: Can we develop a predictive model incorporating these factors to forecast bike rentals accurately?