Reading the data and performing minor adjustments to remove inappropriate outliers and make the data easy to work with.
library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
week2=read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
week2=week2[week2$temp>0,]
week2=week2[week2$rain_1h< 60,]
week2<- week2|>
mutate(temp=(((temp-273)*9/5))+32)
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H")) #converting the date_time information into hours,month,year, weekdays to get relevant insights.
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE) #using lubridate library to get the month labels
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$weekday<-weekdays(as.Date(week2$date_time))
week2$weekday<-factor(week2$weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) #sorting the weekdays
We are grouping the dataset into three different groups:
G2<-week2|>
group_by(weekday,weather_main)|>
summarise(Avg_Temp=mean(temp),.groups = 'drop')
G3<-week2|>
group_by(weather_main)|>
summarise(Volume=sum(traffic_volume))
G1<-week2|>
group_by(hour)|>
summarise(avg_traffic=mean(traffic_volume),
median_traffic=median(traffic_volume),
count=n())
G1
## # A tibble: 24 × 4
## hour avg_traffic median_traffic count
## <int> <dbl> <dbl> <int>
## 1 0 835. 676 2037
## 2 1 516. 420 2049
## 3 2 388. 315 2019
## 4 3 371. 362 2023
## 5 4 703. 808 2089
## 6 5 2095. 2638 2061
## 7 6 4142. 5381 2085
## 8 7 4742. 5999 2078
## 9 8 4589. 5441 2079
## 10 9 4385. 4839 2018
## # ℹ 14 more rows
G2
## # A tibble: 69 × 3
## weekday weather_main Avg_Temp
## <fct> <chr> <dbl>
## 1 Monday Clear 50.2
## 2 Monday Clouds 48.1
## 3 Monday Drizzle 51.2
## 4 Monday Fog 44.1
## 5 Monday Haze 34.4
## 6 Monday Mist 42.2
## 7 Monday Rain 57.4
## 8 Monday Smoke 68.6
## 9 Monday Snow 22.8
## 10 Monday Thunderstorm 66.6
## # ℹ 59 more rows
G3
## # A tibble: 11 × 2
## weather_main Volume
## <chr> <dbl>
## 1 Clear 40908493
## 2 Clouds 54870172
## 3 Drizzle 5992414
## 4 Fog 2465793
## 5 Haze 4762858
## 6 Mist 17451092
## 7 Rain 18813625
## 8 Smoke 64753
## 9 Snow 8676444
## 10 Squall 8247
## 11 Thunderstorm 3103676
Lets order the selected dataframes to figure out the least represented groups
G1|>
slice_min(order_by = avg_traffic)
## # A tibble: 1 × 4
## hour avg_traffic median_traffic count
## <int> <dbl> <dbl> <int>
## 1 3 371. 362 2023
G2|>
slice_min(order_by = Avg_Temp)
## # A tibble: 1 × 3
## weekday weather_main Avg_Temp
## <fct> <chr> <dbl>
## 1 Sunday Snow 20.9
G3|>
slice_min(order_by = Volume)
## # A tibble: 1 × 2
## weather_main Volume
## <chr> <dbl>
## 1 Squall 8247
From the outputs above, we can conclude the following least represented values for each group:
tagged_df<-week2|>
mutate(tags=case_when(
hour==3 ~'G1',
weekday=='Sunday'& weather_main=='Snow'~'G2',
weather_main=='Squall'~'G3',
TRUE~'untagged'
))
Visualizing traffic volume across the day, we see that the peak hours are between 6:00 AM and 5:00 PM. We notice the least amount traffic is seen around 3:00 AM.
G1|>
ggplot()+
geom_line(mapping=aes(x=hour,y=avg_traffic), color='blue',linewidth=1.2)+
labs(title="Hourly Traffic Volume", x= "Hour",y="Traffic Volume")+
#theme_minimal()
theme(axis.text=element_text(size=12),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
We see that when the weather type is “Squall”, the average volume is the lowest. It is also interesting to note that when it was Squall, the traffic was higher on a Wednesday compared to the when it was over the weekend. This is likely due to the fact that Wednesday is a weekday and people were commuting to and from work.
tagged_df|>
group_by(weather_main,weekday)|>
summarise(Avg_traffic=mean(traffic_volume),.groups = 'drop')|>
ggplot()+
geom_bar(aes(x=weather_main,y=Avg_traffic,fill=weekday),stat='identity')+
labs(title="Traffic Volume by weather",x="Weather Type",y="Traffic Volume")+
theme(axis.text=element_text(size=12),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
Snow days have the least average temperature as expected. Interesting to note that there was never “Smoke” weather type on a “Sunday”.
G2|>
ggplot()+
geom_line(aes(x=weekday,y=Avg_Temp,group=weather_main,color=weather_main),size=1.2)+
geom_point(aes(x=weekday,y=Avg_Temp,group=weather_main,color=as.factor(weather_main),group=weather_main))+
facet_wrap(~weather_main)+
labs(title="Temperature trends",x="Weekday",y="Avg temperature(C)")+
theme(axis.text=element_text(size=12),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
3 AM has the fewest records, it means fewer vehicles are on the road at that time. This suggests that nighttime traffic volume is significantly lower due to reduced human activity. Looking at in terms of probability;
Our data set has 48205 rows. Out of the these rows only X rows correspond to 3 AM. The probability of selecting a 3AM data point is :
\[ P(3AM)=X/4805 \]
Meanwhile, 4 PM has Y rows whose probability of selection is:
\[ P(4PM)=Y/48205 \]
From the Line chart above it is evident that X<Y and hence
\[ P(4PM)>P(3AM) \]
Therefore, we can conclude that traffic data points for 3 AM are much rarer than 4PM. Considering the general sparsity of activity during early morning hours in a day, this conclusion is sound. It also helps us establish Peak and Non-peak hours.
Let us test the hypothesis that:
The Probability of observing low traffic is significantly higher at 3 AM than at other times
Lets Define:
\[ A= Low\;Traffic\;Volume\\ B=\;Time\;is\;3\;AM \]
Using Bayes’ Theorem:
\[ P(B|A)= \frac{P(A|B)P(B)}{P(A)} \]
where:
P(B|A)= Probability that the time is 3 AM given that traffic volume is low.
P(A|B)= Probability of low traffic volume given that the time is 3 AM.
P(B)= Prior probability of selecting 3 AM from all times.
P(A)= Overall probability of low traffic volume across all hours.
Using our data set, we can calculate the following probabilities:
\[ P(A|B) \approx90\%\;(90\%\;of\;instances\;at\;3AM\;have\;low\;traffic)\\ P(B)= 1/24=4.17\% (as\;3AM\;is\;one\;of\;24\;hours)\\ P(A)=30\%(Overall,\; 30\%\;of\;all\;data\;points\;show\;low\;traffic) \]
(Refer 3.4.1 for above calculations/assumptions)
Applying Bayes’ Theorem:
\[ P(B|A)=\frac{(0.90\;X\;0.0417)}{0.30}\\ =\frac{0.0375}{0.30}=0.125 \]
So, given that we observe low traffic, there is a 12.5% probability that it occurred at 3 AM.
Null Hypothesis= The probability of low traffic at 3 AM is not significantly different from other hours.
Alternative Hypothesis= The probability of low traffic is significantly higher at 3 AM compared to other hours.
(Using the chi-square test to verify if 3 AM is significantly different)
week2$hour<-as.numeric(week2$hour)
week2$low_traffic<-ifelse(week2$traffic_volume<2000,1,0)
table_data <- table(week2$low_traffic, week2$hour == 3)
# Perform chi-square test
chi_test <- chisq.test(table_data)
# Print results
print(chi_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_data
## X-squared = 4575.4, df = 1, p-value < 2.2e-16
Since p-value < 0.05, we reject the Null Hypothesis, meaning low traffic is significantly more common at 3 AM (Further proven by R calculation of P(B|A)=13.2% which is much higher than the uniform P(B)=4.17%).
3.4.1 Probability Calculations:
# Total number of rows in the dataset
total_rows <- nrow(week2)
# Number of rows where traffic_volume is low (< 2000)
low_traffic_rows <- sum(week2$traffic_volume < 2000)
# Compute P(A) as a proportion
P_A <- low_traffic_rows / total_rows
# Print result
print(P_A)
## [1] 0.3156268
# Count total instances where hour == 3
total_3am <- nrow(week2[week2$hour == 3, ])
# Count instances where hour == 3 and traffic_volume < 2000
low_traffic_3am <- sum(week2$traffic_volume[week2$hour == 3] < 2000)
# Compute conditional probability
P_A_given_B <- low_traffic_3am / total_3am
P_B <- nrow(week2[week2$hour == 3, ]) / nrow(week2)
P_A <- sum(week2$low_traffic) / nrow(week2)
# Bayes' Theorem Calculation
P_B_given_A <- (P_A_given_B * P_B) / P_A
# Print result
P_A_given_B
## [1] 1
P_B_given_A
## [1] 0.1329959
Lets find the missing combinations and all possible combinations
weather_types<-unique(week2$weather_main)
weekdays<-unique(week2$weekday)
#compare_df<-data.frame(Weather=weather_types,
#WeekDay=weekdays)
all_combinations <- expand.grid(weather_main = weather_types, weekday = weekdays)
print(all_combinations)
## weather_main weekday
## 1 Clouds Tuesday
## 2 Clear Tuesday
## 3 Rain Tuesday
## 4 Drizzle Tuesday
## 5 Mist Tuesday
## 6 Haze Tuesday
## 7 Fog Tuesday
## 8 Thunderstorm Tuesday
## 9 Snow Tuesday
## 10 Squall Tuesday
## 11 Smoke Tuesday
## 12 Clouds Wednesday
## 13 Clear Wednesday
## 14 Rain Wednesday
## 15 Drizzle Wednesday
## 16 Mist Wednesday
## 17 Haze Wednesday
## 18 Fog Wednesday
## 19 Thunderstorm Wednesday
## 20 Snow Wednesday
## 21 Squall Wednesday
## 22 Smoke Wednesday
## 23 Clouds Thursday
## 24 Clear Thursday
## 25 Rain Thursday
## 26 Drizzle Thursday
## 27 Mist Thursday
## 28 Haze Thursday
## 29 Fog Thursday
## 30 Thunderstorm Thursday
## 31 Snow Thursday
## 32 Squall Thursday
## 33 Smoke Thursday
## 34 Clouds Friday
## 35 Clear Friday
## 36 Rain Friday
## 37 Drizzle Friday
## 38 Mist Friday
## 39 Haze Friday
## 40 Fog Friday
## 41 Thunderstorm Friday
## 42 Snow Friday
## 43 Squall Friday
## 44 Smoke Friday
## 45 Clouds Saturday
## 46 Clear Saturday
## 47 Rain Saturday
## 48 Drizzle Saturday
## 49 Mist Saturday
## 50 Haze Saturday
## 51 Fog Saturday
## 52 Thunderstorm Saturday
## 53 Snow Saturday
## 54 Squall Saturday
## 55 Smoke Saturday
## 56 Clouds Sunday
## 57 Clear Sunday
## 58 Rain Sunday
## 59 Drizzle Sunday
## 60 Mist Sunday
## 61 Haze Sunday
## 62 Fog Sunday
## 63 Thunderstorm Sunday
## 64 Snow Sunday
## 65 Squall Sunday
## 66 Smoke Sunday
## 67 Clouds Monday
## 68 Clear Monday
## 69 Rain Monday
## 70 Drizzle Monday
## 71 Mist Monday
## 72 Haze Monday
## 73 Fog Monday
## 74 Thunderstorm Monday
## 75 Snow Monday
## 76 Squall Monday
## 77 Smoke Monday
actual_combinations <- week2 |>
select(weather_main, weekday)|>
distinct()
missing_combinations <- anti_join(all_combinations, actual_combinations, by = c("weather_main", "weekday"))
print(missing_combinations)
## weather_main weekday
## 1 Squall Tuesday
## 2 Smoke Tuesday
## 3 Smoke Wednesday
## 4 Squall Thursday
## 5 Smoke Thursday
## 6 Squall Friday
## 7 Smoke Sunday
## 8 Squall Monday
The missing combinations are:
There has been no Squall weather on Monday, Tuesday, Thursday and Friday. Same is the case for Smoke on Tuesday, Wednesday, Thursday and Sunday
Some rare weather types might not be common on all weekdays. There is no specific pattern (ex. Weekday vs Weekends) that we observe in the above combinations as such. So this might also be due to data collections issues (Faulty sensors, bad data, etc.)
# Count occurrences of each combination
combination_counts <- week2|>
group_by(weather_main, weekday)|>
summarise(count = n(), .groups = "drop")|>
arrange(desc(count)) # Sorting to see the most common first
# View the most and least common
print(combination_counts)
## # A tibble: 69 × 3
## weather_main weekday count
## <chr> <fct> <int>
## 1 Clouds Friday 2221
## 2 Clouds Monday 2213
## 3 Clouds Thursday 2213
## 4 Clouds Sunday 2186
## 5 Clouds Wednesday 2163
## 6 Clouds Tuesday 2110
## 7 Clear Saturday 2105
## 8 Clouds Saturday 2058
## 9 Clear Friday 2018
## 10 Clear Sunday 2005
## # ℹ 59 more rows
Heatmap to visualize missing combinations
The Heatmap below shows us clearly the missing combinations as discussed above. Note that the most common weather types are clear and cloudy days with thunderstorms and Squalls are the least common as they are mostly extreme conditions.
combination_counts|>
ggplot()+
geom_tile(aes(x=weekday,y=weather_main,fill=count))+
scale_fill_viridis_c(option="C")+
labs(title="Weather Frequency",
x="Weekday",
y="Weather Condition",
fill="Count")+
theme(axis.text=element_text(size=12),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))