Week 3 Data Dive

1. Reading Data

Reading the data and performing minor adjustments to remove inappropriate outliers and make the data easy to work with.

library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
week2=read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
week2=week2[week2$temp>0,]
week2=week2[week2$rain_1h< 60,]
week2<- week2|>
  mutate(temp=(((temp-273)*9/5))+32)
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H")) #converting the date_time information into hours,month,year, weekdays to get relevant insights.
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE) #using lubridate library to get the month labels
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$weekday<-weekdays(as.Date(week2$date_time))
week2$weekday<-factor(week2$weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) #sorting the weekdays

2. Grouping data

We are grouping the dataset into three different groups:

Group 1: Average traffic volume for each hour over the day
Group 2: Average temperature on each weekday for different kinds of weather
Group 3: Total volume for different kinds of weather

G2<-week2|>
  group_by(weekday,weather_main)|>
  summarise(Avg_Temp=mean(temp),.groups = 'drop')
G3<-week2|>
  group_by(weather_main)|>
  summarise(Volume=sum(traffic_volume))
G1<-week2|>
  group_by(hour)|>
  summarise(avg_traffic=mean(traffic_volume),
            median_traffic=median(traffic_volume),
            count=n())
G1

## # A tibble: 24 × 4
##     hour avg_traffic median_traffic count
##    <int>       <dbl>          <dbl> <int>
##  1     0        835.            676  2037
##  2     1        516.            420  2049
##  3     2        388.            315  2019
##  4     3        371.            362  2023
##  5     4        703.            808  2089
##  6     5       2095.           2638  2061
##  7     6       4142.           5381  2085
##  8     7       4742.           5999  2078
##  9     8       4589.           5441  2079
## 10     9       4385.           4839  2018
## # ℹ 14 more rows

G2

## # A tibble: 69 × 3
##    weekday weather_main Avg_Temp
##    <fct>   <chr>           <dbl>
##  1 Monday  Clear            50.2
##  2 Monday  Clouds           48.1
##  3 Monday  Drizzle          51.2
##  4 Monday  Fog              44.1
##  5 Monday  Haze             34.4
##  6 Monday  Mist             42.2
##  7 Monday  Rain             57.4
##  8 Monday  Smoke            68.6
##  9 Monday  Snow             22.8
## 10 Monday  Thunderstorm     66.6
## # ℹ 59 more rows

G3

## # A tibble: 11 × 2
##    weather_main   Volume
##    <chr>           <dbl>
##  1 Clear        40908493
##  2 Clouds       54870172
##  3 Drizzle       5992414
##  4 Fog           2465793
##  5 Haze          4762858
##  6 Mist         17451092
##  7 Rain         18813625
##  8 Smoke           64753
##  9 Snow          8676444
## 10 Squall           8247
## 11 Thunderstorm  3103676

3. Investigation

3.1 Selecting Low probability groups

Lets order the selected dataframes to figure out the least represented groups

G1|>
  slice_min(order_by = avg_traffic)

## # A tibble: 1 × 4
##    hour avg_traffic median_traffic count
##   <int>       <dbl>          <dbl> <int>
## 1     3        371.            362  2023

G2|>
  slice_min(order_by = Avg_Temp)

## # A tibble: 1 × 3
##   weekday weather_main Avg_Temp
##   <fct>   <chr>           <dbl>
## 1 Sunday  Snow             20.9

G3|>
  slice_min(order_by = Volume)

## # A tibble: 1 × 2
##   weather_main Volume
##   <chr>         <dbl>
## 1 Squall         8247

From the outputs above, we can conclude the following least represented values for each group:

Group 1: 3 AM was the hour with the lowest average volume across all days over the 7 years.
Group 2: “Sunday” with “Snow” as weather type saw the lowest average temp across the dataset.
Group 3: “Squall” was the weather type with the lowest Volume across the years.

3.2 Tagging the Low probability groups

tagged_df<-week2|>
  mutate(tags=case_when(
    hour==3 ~'G1',
    weekday=='Sunday'& weather_main=='Snow'~'G2',
    weather_main=='Squall'~'G3',
    TRUE~'untagged'
  ))

Visualizing traffic volume across the day, we see that the peak hours are between 6:00 AM and 5:00 PM. We notice the least amount traffic is seen around 3:00 AM.

G1|>
  ggplot()+
  geom_line(mapping=aes(x=hour,y=avg_traffic), color='blue',linewidth=1.2)+
  labs(title="Hourly Traffic Volume", x= "Hour",y="Traffic Volume")+
  #theme_minimal()
  theme(axis.text=element_text(size=12),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))

We see that when the weather type is “Squall”, the average volume is the lowest. It is also interesting to note that when it was Squall, the traffic was higher on a Wednesday compared to the when it was over the weekend. This is likely due to the fact that Wednesday is a weekday and people were commuting to and from work.

tagged_df|>
  group_by(weather_main,weekday)|>
  summarise(Avg_traffic=mean(traffic_volume),.groups = 'drop')|>
  ggplot()+
  geom_bar(aes(x=weather_main,y=Avg_traffic,fill=weekday),stat='identity')+
  labs(title="Traffic Volume by weather",x="Weather Type",y="Traffic Volume")+
  theme(axis.text=element_text(size=12),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))

Snow days have the least average temperature as expected. Interesting to note that there was never “Smoke” weather type on a “Sunday”.

G2|>
  ggplot()+
  geom_line(aes(x=weekday,y=Avg_Temp,group=weather_main,color=weather_main),size=1.2)+
  geom_point(aes(x=weekday,y=Avg_Temp,group=weather_main,color=as.factor(weather_main),group=weather_main))+
  facet_wrap(~weather_main)+
  labs(title="Temperature trends",x="Weekday",y="Avg temperature(C)")+
  theme(axis.text=element_text(size=12),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))

3.3 Insights on the Low Probability groups

3 AM has the fewest records, it means fewer vehicles are on the road at that time. This suggests that nighttime traffic volume is significantly lower due to reduced human activity. Looking at in terms of probability;

Our data set has 48205 rows. Out of the these rows only X rows correspond to 3 AM. The probability of selecting a 3AM data point is :

\[ P(3AM)=X/4805 \]

Meanwhile, 4 PM has Y rows whose probability of selection is:

\[ P(4PM)=Y/48205 \]

From the Line chart above it is evident that X<Y and hence

\[ P(4PM)>P(3AM) \]

Therefore, we can conclude that traffic data points for 3 AM are much rarer than 4PM. Considering the general sparsity of activity during early morning hours in a day, this conclusion is sound. It also helps us establish Peak and Non-peak hours.

3.4 Testable Hypothesis

Let us test the hypothesis that:

The Probability of observing low traffic is significantly higher at 3 AM than at other times

Lets Define:

\[ A= Low\;Traffic\;Volume\\ B=\;Time\;is\;3\;AM \]

Using Bayes’ Theorem:

\[ P(B|A)= \frac{P(A|B)P(B)}{P(A)} \]

where:

P(B|A)= Probability that the time is 3 AM given that traffic volume is low.

P(A|B)= Probability of low traffic volume given that the time is 3 AM.

P(B)= Prior probability of selecting 3 AM from all times.

P(A)= Overall probability of low traffic volume across all hours.

Using our data set, we can calculate the following probabilities:

\[ P(A|B) \approx90\%\;(90\%\;of\;instances\;at\;3AM\;have\;low\;traffic)\\ P(B)= 1/24=4.17\% (as\;3AM\;is\;one\;of\;24\;hours)\\ P(A)=30\%(Overall,\; 30\%\;of\;all\;data\;points\;show\;low\;traffic) \]

(Refer 3.4.1 for above calculations/assumptions)

Applying Bayes’ Theorem:

\[ P(B|A)=\frac{(0.90\;X\;0.0417)}{0.30}\\ =\frac{0.0375}{0.30}=0.125 \]

So, given that we observe low traffic, there is a 12.5% probability that it occurred at 3 AM.

Null Hypothesis= The probability of low traffic at 3 AM is not significantly different from other hours.

Alternative Hypothesis= The probability of low traffic is significantly higher at 3 AM compared to other hours.

(Using the chi-square test to verify if 3 AM is significantly different)

week2$hour<-as.numeric(week2$hour)
week2$low_traffic<-ifelse(week2$traffic_volume<2000,1,0)
table_data <- table(week2$low_traffic, week2$hour == 3)
# Perform chi-square test
chi_test <- chisq.test(table_data)
# Print results
print(chi_test)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_data
## X-squared = 4575.4, df = 1, p-value < 2.2e-16

Since p-value < 0.05, we reject the Null Hypothesis, meaning low traffic is significantly more common at 3 AM (Further proven by R calculation of P(B|A)=13.2% which is much higher than the uniform P(B)=4.17%).

3.4.1 Probability Calculations:

# Total number of rows in the dataset
total_rows <- nrow(week2)
# Number of rows where traffic_volume is low (< 2000)
low_traffic_rows <- sum(week2$traffic_volume < 2000)
# Compute P(A) as a proportion
P_A <- low_traffic_rows / total_rows
# Print result
print(P_A)

## [1] 0.3156268

# Count total instances where hour == 3
total_3am <- nrow(week2[week2$hour == 3, ])

# Count instances where hour == 3 and traffic_volume < 2000
low_traffic_3am <- sum(week2$traffic_volume[week2$hour == 3] < 2000)

# Compute conditional probability
P_A_given_B <- low_traffic_3am / total_3am
P_B <- nrow(week2[week2$hour == 3, ]) / nrow(week2)
P_A <- sum(week2$low_traffic) / nrow(week2)
# Bayes' Theorem Calculation
P_B_given_A <- (P_A_given_B * P_B) / P_A
# Print result
P_A_given_B

## [1] 1

P_B_given_A

## [1] 0.1329959

4. Combinations

4.1 Missing combinations

Lets find the missing combinations and all possible combinations

weather_types<-unique(week2$weather_main)
weekdays<-unique(week2$weekday)
#compare_df<-data.frame(Weather=weather_types,
                       #WeekDay=weekdays)
all_combinations <- expand.grid(weather_main = weather_types, weekday = weekdays)
print(all_combinations)

##    weather_main   weekday
## 1        Clouds   Tuesday
## 2         Clear   Tuesday
## 3          Rain   Tuesday
## 4       Drizzle   Tuesday
## 5          Mist   Tuesday
## 6          Haze   Tuesday
## 7           Fog   Tuesday
## 8  Thunderstorm   Tuesday
## 9          Snow   Tuesday
## 10       Squall   Tuesday
## 11        Smoke   Tuesday
## 12       Clouds Wednesday
## 13        Clear Wednesday
## 14         Rain Wednesday
## 15      Drizzle Wednesday
## 16         Mist Wednesday
## 17         Haze Wednesday
## 18          Fog Wednesday
## 19 Thunderstorm Wednesday
## 20         Snow Wednesday
## 21       Squall Wednesday
## 22        Smoke Wednesday
## 23       Clouds  Thursday
## 24        Clear  Thursday
## 25         Rain  Thursday
## 26      Drizzle  Thursday
## 27         Mist  Thursday
## 28         Haze  Thursday
## 29          Fog  Thursday
## 30 Thunderstorm  Thursday
## 31         Snow  Thursday
## 32       Squall  Thursday
## 33        Smoke  Thursday
## 34       Clouds    Friday
## 35        Clear    Friday
## 36         Rain    Friday
## 37      Drizzle    Friday
## 38         Mist    Friday
## 39         Haze    Friday
## 40          Fog    Friday
## 41 Thunderstorm    Friday
## 42         Snow    Friday
## 43       Squall    Friday
## 44        Smoke    Friday
## 45       Clouds  Saturday
## 46        Clear  Saturday
## 47         Rain  Saturday
## 48      Drizzle  Saturday
## 49         Mist  Saturday
## 50         Haze  Saturday
## 51          Fog  Saturday
## 52 Thunderstorm  Saturday
## 53         Snow  Saturday
## 54       Squall  Saturday
## 55        Smoke  Saturday
## 56       Clouds    Sunday
## 57        Clear    Sunday
## 58         Rain    Sunday
## 59      Drizzle    Sunday
## 60         Mist    Sunday
## 61         Haze    Sunday
## 62          Fog    Sunday
## 63 Thunderstorm    Sunday
## 64         Snow    Sunday
## 65       Squall    Sunday
## 66        Smoke    Sunday
## 67       Clouds    Monday
## 68        Clear    Monday
## 69         Rain    Monday
## 70      Drizzle    Monday
## 71         Mist    Monday
## 72         Haze    Monday
## 73          Fog    Monday
## 74 Thunderstorm    Monday
## 75         Snow    Monday
## 76       Squall    Monday
## 77        Smoke    Monday

actual_combinations <- week2 |>
    select(weather_main, weekday)|>
    distinct()
missing_combinations <- anti_join(all_combinations, actual_combinations, by = c("weather_main", "weekday"))
print(missing_combinations)

##   weather_main   weekday
## 1       Squall   Tuesday
## 2        Smoke   Tuesday
## 3        Smoke Wednesday
## 4       Squall  Thursday
## 5        Smoke  Thursday
## 6       Squall    Friday
## 7        Smoke    Sunday
## 8       Squall    Monday

The missing combinations are:

There has been no Squall weather on Monday, Tuesday, Thursday and Friday. Same is the case for Smoke on Tuesday, Wednesday, Thursday and Sunday

Some rare weather types might not be common on all weekdays. There is no specific pattern (ex. Weekday vs Weekends) that we observe in the above combinations as such. So this might also be due to data collections issues (Faulty sensors, bad data, etc.)

# Count occurrences of each combination
combination_counts <- week2|>
  group_by(weather_main, weekday)|>
  summarise(count = n(), .groups = "drop")|>
  arrange(desc(count))  # Sorting to see the most common first
# View the most and least common
print(combination_counts)

## # A tibble: 69 × 3
##    weather_main weekday   count
##    <chr>        <fct>     <int>
##  1 Clouds       Friday     2221
##  2 Clouds       Monday     2213
##  3 Clouds       Thursday   2213
##  4 Clouds       Sunday     2186
##  5 Clouds       Wednesday  2163
##  6 Clouds       Tuesday    2110
##  7 Clear        Saturday   2105
##  8 Clouds       Saturday   2058
##  9 Clear        Friday     2018
## 10 Clear        Sunday     2005
## # ℹ 59 more rows

Heatmap to visualize missing combinations

The Heatmap below shows us clearly the missing combinations as discussed above. Note that the most common weather types are clear and cloudy days with thunderstorms and Squalls are the least common as they are mostly extreme conditions.

combination_counts|>
  ggplot()+
  geom_tile(aes(x=weekday,y=weather_main,fill=count))+
  scale_fill_viridis_c(option="C")+
  labs(title="Weather Frequency",
       x="Weekday",
       y="Weather Condition",
       fill="Count")+
  theme(axis.text=element_text(size=12),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))