Week 7: Hypothesis testing

1. Reading Data

To begin, we load the data set and perform necessary pre-processing, such as converting the date column and categorizing days into weekdays and weekends.

library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
library(ggpubr)
library(pwr)
# Load the data (local path as used by the author)
week2 <- read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
# Drop implausible readings: temperatures of 0 K and extreme rainfall values
week2 <- week2[week2$temp > 0, ]
week2 <- week2[week2$rain_1h < 60, ]
# Convert temperature from Kelvin to Fahrenheit
week2 <- week2 |>
  mutate(temp = (temp - 273.15) * 9 / 5 + 32)
# Derive hour, month, year, day and weekday from date_time to get relevant insights
week2$hour <- as.integer(format(as.POSIXct(week2$date_time), "%H"))
# lubridate::month() with label = TRUE returns month labels (Jan, Feb, ...)
week2$month <- month(as.integer(format(as.POSIXct(week2$date_time), "%m")), label = TRUE)
week2$year <- as.integer(format(as.POSIXct(week2$date_time), "%Y")) # four-digit year
week2$day <- as.integer(format(as.POSIXct(week2$date_time), "%d"))
week2$weekday <- weekdays(as.Date(week2$date_time))
# Order the weekdays Monday to Sunday
week2$weekday <- factor(week2$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
week2$weekend <- ifelse(week2$weekday %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
# Keep January observations only
data <- week2
data <- data[data$month == "Jan", ]

We analyze January traffic volume across all years in the data set (2012-2018), using hypothesis tests and visualizations to answer two questions:

1. Whether weekday and weekend traffic volumes are significantly different.

2. Whether temperature and traffic volume are correlated.

2. Hypothesis 1 – Comparing Weekday vs. Weekend Traffic

We hypothesize that traffic volume on weekdays and weekends differs significantly.

2.1 Define the Hypothesis

Null Hypothesis (H₀): The average traffic volume is the same during weekdays and weekends.
Alternative Hypothesis (H₁): The average traffic volume is different between weekdays and weekends.
Column: traffic_volume

2.2 Choosing the test

Since we are comparing the means of two independent groups (weekdays vs. weekends), an independent two-sample t-test is the natural starting point. In its standard form it assumes that both groups are approximately normally distributed and have equal variances. We check both assumptions in Section 2.7 and adjust our selection of test if they do not hold.

2.3 Choosing Alpha (α)

A common choice for α is 0.05, meaning we accept a 5% chance of rejecting the null hypothesis when it is actually true (Type I error).

In the context of traffic planning, we want to limit false positives while still detecting a true difference if one exists. Here a false positive is relatively cheap: if weekends actually carry as much traffic as weekdays, the infrastructure already built to accommodate the heavy weekday load can sustain the same level on weekends. We can therefore tolerate a somewhat higher Type I error rate and choose α = 0.1 for our test.
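To make the meaning of α concrete, the sketch below simulates many experiments in which the null hypothesis is true and counts how often a t-test rejects it at α = 0.1. It uses synthetic normal data, not the traffic data; the group size, mean, standard deviation and number of simulations are arbitrary assumptions chosen for illustration.

```r
# Simulated check of the Type I error rate (synthetic data, not the traffic data).
set.seed(42)            # arbitrary seed, for reproducibility only
alpha <- 0.1            # significance level chosen in this section
n_sims <- 2000          # number of simulated experiments (assumption)
n_per_group <- 60       # hypothetical group size (assumption)

# Both groups come from the same distribution, so H0 is true in every run.
p_values <- replicate(n_sims, {
  g1 <- rnorm(n_per_group, mean = 3000, sd = 1900)
  g2 <- rnorm(n_per_group, mean = 3000, sd = 1900)
  t.test(g1, g2)$p.value
})

# Share of runs that wrongly reject H0; this should land near alpha.
type1_rate <- mean(p_values < alpha)
print(type1_rate)
```

The rejection rate should come out close to 0.1, which is what "accepting a 10% chance of a Type I error" means in practice.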

2.4 Choosing Power (1 - β)

We set the power at 0.85, meaning we have an 85% chance of detecting a difference of the chosen minimum size if one exists. High power also makes a non-significant result informative: if the test fails to reject H₀, we can be reasonably confident that weekday and weekend volumes really are similar. That outcome would matter operationally, because similar volumes all week mean continuous stress on the infrastructure and operations staff throughout the year, which calls for more frequent maintenance, diligent staff rotation and contingency planning. A power of 0.85 keeps the risk of missing a real difference (Type II error) at 15%, which is suitable for our use case.
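As a rough illustration of how power depends on the number of observations, the sketch below uses `power.t.test` from base R's stats package. The standardized effect of 0.5 and the group sizes are hypothetical values picked for the example, not quantities taken from the traffic data.

```r
# Power of a two-sided, two-sample t-test at alpha = 0.1 for several group sizes.
# With sd = 1, delta is the standardized effect (Cohen's d); 0.5 is an assumption.
sizes <- c(20, 40, 60, 100)   # hypothetical observations per group

powers <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.1,
               type = "two.sample", alternative = "two.sided")$power
})

print(data.frame(n_per_group = sizes, power = round(powers, 3)))
```

Power rises with the group size, so fixing a target such as 0.85 amounts to asking for a minimum number of observations per group.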

2.5 Choosing Minimum Effect Size (d)

A practical minimum effect size should be chosen with the data in mind. We can use Cohen’s d, with its conventional thresholds:

Small effect: 0.2
Medium effect: 0.5
Large effect: 0.8

Let’s select a value for d by calculating the observed effect size:

library(effsize)
cohen.d(data$traffic_volume[data$weekend == "Weekday"],
        data$traffic_volume[data$weekend == "Weekend"])

## 
## Cohen's d
## 
## d estimate: 0.4303336 (small)
## 95 percent confidence interval:
##     lower     upper 
## 0.3589976 0.5016696

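The observed effect size is about 0.43 (95% CI 0.36 to 0.50). effsize labels it small, but it sits just below the conventional medium threshold, so we round up and use d = 0.5 as the minimum effect size. This choice is slightly optimistic: a smaller d would require a larger sample.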
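For readers who want to see what goes into the number, here is a minimal sketch of Cohen’s d using the conventional pooled standard deviation. The helper `cohens_d` and the two tiny samples are hypothetical, and whether the result matches `effsize::cohen.d` exactly depends on the options that function is called with.

```r
# Cohen's d with the pooled standard deviation, written out in base R.
# cohens_d is a hypothetical helper for illustration.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Tiny made-up samples, chosen so the result is easy to check by hand
x <- c(2, 4, 6)   # mean 4, variance 4
y <- c(1, 3, 5)   # mean 3, variance 4
d <- cohens_d(x, y)
print(d)          # pooled SD is 2, so d = (4 - 3) / 2 = 0.5
```

In words, d expresses the difference between the group means in units of their common standard deviation.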

2.6 Sample Size Calculation

We now run a power analysis for the two-sample t-test to determine whether we have sufficient data. First, we calculate the sample size required for the test to reach the chosen power.

# Set parameters
alpha <- 0.1   # Significance level
power <- 0.85    # Power
effect_size <- 0.5  # Medium effect size

# Calculate required sample size
sample_size <- pwr.t.test(d = effect_size, sig.level = alpha, power = power, type = "two.sample")$n
print(paste("Required sample size per group:", round(sample_size)))

## [1] "Required sample size per group: 58"

At least 58 observations per group are needed to detect an effect of d = 0.5 at α = 0.1 with 85% power. In other words, we need at least 58 weekday observations and 58 weekend observations to perform our test.
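The required sample size can be cross-checked with the textbook normal approximation n ≈ 2(z₁₋α/₂ + z_power)² / d². The sketch below is only an approximation (`pwr.t.test` works with the t distribution), and `approx_n` is a hypothetical helper. It also shows how the requirement grows if we plug in the observed d of about 0.43 instead of 0.5.

```r
# Normal-approximation sample size per group for a two-sided, two-sample test:
#   n ~= 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2
# approx_n is a hypothetical helper for illustration.
approx_n <- function(d, alpha, power) {
  z_alpha <- qnorm(1 - alpha / 2)
  z_power <- qnorm(power)
  2 * (z_alpha + z_power)^2 / d^2
}

n_medium   <- approx_n(d = 0.5,  alpha = 0.1, power = 0.85)  # d used in the report
n_observed <- approx_n(d = 0.43, alpha = 0.1, power = 0.85)  # observed d, rounded

print(ceiling(c(d_0.50 = n_medium, d_0.43 = n_observed)))    # 58 and 78
```

Either way, the requirement is far below the number of January observations available in each group.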

Next, we check if our dataset contains enough data for a valid hypothesis test.

# Check available sample size
table(data$weekend)

## 
## Weekday Weekend 
##    2964    1038

Since we pooled January traffic data across all the years, both groups far exceed the required 58 observations, so we have plenty of data to perform the test.

2.7 Running the test

Checking for normality and homogeneity of variance will help us determine whether we need to adapt our two-sample t-test:

# Normality test
shapiro.test(data$traffic_volume[data$weekend == "Weekday"])

## 
##  Shapiro-Wilk normality test
## 
## data:  data$traffic_volume[data$weekend == "Weekday"]
## W = 0.92529, p-value < 2.2e-16

shapiro.test(data$traffic_volume[data$weekend == "Weekend"])

## 
##  Shapiro-Wilk normality test
## 
## data:  data$traffic_volume[data$weekend == "Weekend"]
## W = 0.93296, p-value < 2.2e-16

# Variance test
var.test(traffic_volume ~ weekend, data = data)

## 
##  F test to compare two variances
## 
## data:  traffic_volume by weekend
## F = 1.7443, num df = 2963, denom df = 1037, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.576316 1.925517
## sample estimates:
## ratio of variances 
##           1.744318

For the normality test: The p-values are extremely small (< 2.2e-16), indicating that traffic volume does not follow a normal distribution for either weekdays or weekends. This suggests that a parametric test such as the t-test is not ideal unless we rely on its robustness to non-normality in large samples.

For the variance test: The p-value is very small (< 0.05), indicating that the variances of traffic volume between weekdays and weekends are significantly different. This violates the assumption of equal variances required for a standard two-sample t-test.

Given the above conclusions and considering that we have a large sample size, we should use Welch’s t-test (which adjusts for unequal variances) instead of the standard two-sample t-test.

Let’s perform Welch’s t-test:

# Perform Welch’s t-test
t_test_result <- t.test(traffic_volume ~ weekend, data = data, var.equal = FALSE)
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  traffic_volume by weekend
## t = 13.606, df = 2380.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Weekday and group Weekend is not equal to 0
## 95 percent confidence interval:
##  690.5817 923.1624
## sample estimates:
## mean in group Weekday mean in group Weekend 
##              3261.210              2454.338
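To show what Welch’s test computes, the sketch below rebuilds the t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `t.test`. It runs on synthetic data with unequal variances and unequal group sizes; the means, standard deviations and sizes are made-up assumptions, not the traffic data.

```r
# Welch's t statistic and degrees of freedom by hand, on synthetic data.
set.seed(1)                                 # arbitrary seed
x <- rnorm(300, mean = 3200, sd = 2000)     # hypothetical "weekday" group
y <- rnorm(100, mean = 2500, sd = 1500)     # hypothetical "weekend" group

# Squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# No pooling: each group contributes its own variance
t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

welch <- t.test(x, y, var.equal = FALSE)
print(c(t_manual = t_manual, t_builtin = unname(welch$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(welch$parameter)))
```

Because the variances are not pooled, the degrees of freedom are generally fractional and smaller than n₁ + n₂ − 2, which is why the output above reports a non-integer df.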

2.8 Conclusion

p-value Interpretation:

The p-value is extremely small (< 2.2e-16), far below our chosen α of 0.1, so we reject the null hypothesis (H₀).
There is strong statistical evidence that the mean traffic volume differs between weekdays and weekends.

Mean Difference & Practical Significance:

The mean weekday traffic volume (3261.210) is significantly higher than the mean weekend traffic volume (2454.338).
The confidence interval (690.5817 to 923.1624) suggests that, on average, weekday traffic volume is between 690 and 923 units higher than weekend traffic volume.

Effect Size Consideration:

The calculated Cohen’s d of about 0.43 is a small-to-medium effect, which we rounded up to a medium effect (d = 0.5) for planning. It suggests that weekday traffic is noticeably higher, but not overwhelmingly so, meaning adjustments should be considered but may not require drastic policy changes.

2.9 Visualization: Traffic Volume by Day Type

To better understand the difference in traffic volumes, we visualize it using boxplots and histograms.

ggplot(data, aes(x=weekend, y=traffic_volume, fill=weekend)) +
  geom_boxplot(alpha=0.7) +
  labs(title="Traffic Volume: Weekdays vs. Weekends", x="Day Type", y="Traffic Volume") +
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) +
  scale_fill_manual(values=c("steelblue", "tomato"))

The boxplot shows that traffic_volume is clearly higher on weekdays than on weekends, consistent with the group means from the t-test (3261 vs. 2454).

ggplot(data, aes(x=traffic_volume, fill=weekend)) +
  geom_histogram(position="identity", alpha=0.6, bins=30) +
  labs(title="Distribution of Traffic Volume: Weekdays vs. Weekends", x="Traffic Volume", y="Count") +
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) +
  scale_fill_manual(values=c("steelblue", "tomato"))

3. Hypothesis 2 – Correlation Between Temperature and Traffic Volume

We hypothesize that temperature is associated with traffic volume. We will use Pearson’s correlation test to examine this hypothesis.

3.1 Why Pearson’s Correlation Test?

Pearson’s correlation is appropriate when:

Both variables are continuous (temp and traffic_volume).
We assume a linear relationship between them.
Normality assumption: While Pearson’s test assumes normality, it is often robust to slight deviations, especially with large sample sizes. Given the large sample size of our data, Pearson’s correlation remains valid.

3.2 Define Hypothesis:

Null Hypothesis (H₀): There is no correlation between temperature and traffic volume.
Alternative Hypothesis (H₁): There is a correlation between temperature and traffic volume.

We perform the correlation test below.

# Pearson correlation test
cor_test_result <- cor.test(data$temp, data$traffic_volume, method="pearson")
print(cor_test_result)

## 
##  Pearson's product-moment correlation
## 
## data:  data$temp and data$traffic_volume
## t = 4.9282, df = 4000, p-value = 8.636e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04681481 0.10840860
## sample estimates:
##        cor 
## 0.07768583

3.3 Conclusion

Statistical Significance:
- The p-value is extremely small (< 0.05), meaning we reject the null hypothesis (H₀), that there is no correlation between temperature (temp) and traffic volume (traffic_volume).
- This suggests that temperature and traffic volume are statistically correlated.
Strength & Direction of Correlation:
- The correlation coefficient r = 0.0777 is positive but very weak.
- A weak positive correlation means that as temperature increases, traffic volume slightly increases, but the effect is minimal.
- The 95% confidence interval (0.0468 to 0.1084) excludes 0 but lies entirely below 0.11, so even its upper bound corresponds to a weak correlation.
Practical Significance:
- While the correlation is statistically significant, the effect size is very small: r² ≈ 0.006, so temperature accounts for well under 1% of the variance in traffic volume.
- This suggests that temperature alone is not a strong predictor of traffic volume. Other factors (e.g., time of day, holidays, road conditions) likely play a much bigger role.
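
The reported test statistic and confidence interval can be recomputed from r and the degrees of freedom alone. The sketch below takes only the two values printed by `cor.test` above (r = 0.07768583, df = 4000) and applies the standard formulas: t = r√df / √(1 − r²) and a Fisher z interval. The variable names are our own.

```r
# Recompute the Pearson test quantities from the reported r and df.
r  <- 0.07768583   # correlation reported by cor.test
df <- 4000         # degrees of freedom reported by cor.test (n - 2)
n  <- df + 2

# t statistic and two-sided p-value
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

print(round(c(t = t_stat, ci_lower = ci[1], ci_upper = ci[2],
              r_squared = r_squared), 4))
```

The large t statistic comes from the sample size (√4000 ≈ 63), not from a strong relationship: r² stays at about 0.006.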

3.4 Visualization: Traffic Volume vs. Temperature

Let’s use a scatter plot and a heatmap to explore the relationship visually.

ggplot(data, aes(x=temp, y=traffic_volume)) +
  geom_point(alpha=0.4, color="blue") +
  geom_smooth(method="lm", color="red", se=TRUE) +
  labs(title="Traffic Volume vs. Temperature", x="Temperature (°F)", y="Traffic Volume") +
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

ggplot(data, aes(x=temp, y=traffic_volume)) +
  geom_bin2d(bins=50) +
  scale_fill_viridis_c() +
  labs(title="Density of Traffic Volume at Different Temperatures", x="Temperature (°F)", y="Traffic Volume") +
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

Both charts indicate that, although temperature and traffic volume are statistically related, the relationship is very weak. Neither the scatter plot nor the heatmap shows any clear pattern.