We first read the data and make minor adjustments: removing implausible outliers and deriving columns that make the data easier to work with.
library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
week2=read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
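week2=week2[week2$temp>0,] # drop physically impossible 0 K readings
week2=week2[week2$rain_1h< 60,] # drop extreme rain_1h values
week2<- week2|>
  mutate(temp=(temp-273.15)*9/5+32) # convert Kelvin to Fahrenheit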
# Derive hour, month, year, day and weekday from date_time to get relevant insights.
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H"))
# lubridate's month() turns the month number into a month label
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE)
# "%y" yields the two-digit year (e.g. 12 for 2012)
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$day<- as.integer(format(as.POSIXct(week2$date_time),"%d"))
week2$weekday<-weekdays(as.Date(week2$date_time))
# Order the weekdays from Monday to Sunday
week2$weekday<-factor(week2$weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
data_df<-week2
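As a sanity check on the Kelvin-to-Fahrenheit conversion, a small helper can be tested against known reference points. This is a minimal sketch using the standard offset of 273.15; the function name `kelvin_to_fahrenheit` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): Kelvin to Fahrenheit.
kelvin_to_fahrenheit <- function(kelvin) {
  (kelvin - 273.15) * 9 / 5 + 32
}

# Reference points: freezing and boiling points of water.
freezing_f <- kelvin_to_fahrenheit(273.15)
boiling_f  <- kelvin_to_fahrenheit(373.15)
print(c(freezing = freezing_f, boiling = boiling_f))
```

The two results should be 32 and 212 (up to floating-point rounding), which confirms that the formula is applied in the right order.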
The dataset records traffic volume on an interstate in Minnesota. Several factors affect traffic volume, including time of day, rain and snow. Let’s look at some of the columns and values to see whether any of them require a closer look at the dataset’s documentation:
a. “holiday”: This column lists the holidays that Minnesota officially observes, presumably recorded to track traffic patterns across different holidays. The column is easy to interpret, although one could have expected a binary column indicating whether a given day is a holiday or not.
b. “weather_main” and “weather_description”: Both columns describe the weather conditions, at different levels of granularity. It is reasonable to assume that weather affects traffic volume, and these columns capture that information. Without consulting the documentation, the weather_description column could look redundant.
c. “temp”: Shows the recorded temperature. Because the temperature is recorded in Kelvin, it may mislead users who do not convert it to a more familiar unit.
d. “rain_1h” (rain in the last hour, measured in mm): This column is unclear. The documentation states that it is the amount of rain recorded in millimeters per hour. However, it contains many zero values, so it is not obvious whether zero means “no rain” or a missing or erroneous entry. The histogram below shows that the majority of the values in “rain_1h” are zero.
ggplot(data_df, aes(x = rain_1h)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of Rain in Last 1 Hour",
       x = "Rain (mm)",
       y = "Count") +
  theme(axis.text = element_text(size = 25),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        plot.title = element_text(size = 20),
        legend.key.size = unit(2, "cm"),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 14),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))
ggplot(data_df, aes(x = weather_main, y = rain_1h)) +
  geom_boxplot(aes(fill = weather_main)) +
  labs(title = "Rainfall Amount by Weather Condition",
       x = "Weather Condition",
       y = "Rain (mm)") +
  theme(axis.text = element_text(size = 25),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        plot.title = element_text(size = 20),
        legend.key.size = unit(2, "cm"),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 14),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey")) +
  coord_flip()
We plotted rain volume against the corresponding weather conditions to check whether the zero-rain observations occur under the expected conditions, and hence whether the rain_1h data might be incorrectly recorded. As the plot shows, most observations with clear or merely cloudy skies show no rain, while storms and related conditions do show rain. We can therefore conclude that the column holds valid data for insights.
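The same check can be expressed numerically as the share of zero-rain observations within each weather condition. The sketch below uses a small made-up data frame (`toy`, not taken from the traffic dataset) so that it runs on its own; only the column names `weather_main` and `rain_1h` come from the data.

```r
# Toy data (made up for illustration; not taken from the traffic dataset).
toy <- data.frame(
  weather_main = c("Clear", "Clear", "Clouds", "Rain", "Rain", "Thunderstorm"),
  rain_1h      = c(0, 0, 0, 0.5, 2.1, 0)
)

# Share of zero-rain observations within each weather condition.
zero_share <- tapply(toy$rain_1h == 0, toy$weather_main, mean)
print(round(zero_share, 2))
```

Replacing `toy` with `data_df` would give the corresponding shares for the real data; a share close to 1 for dry conditions and clearly lower for rainy ones would support reading zero as “no rain”.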
a. Checking for Explicitly Missing Values
sum(is.na(data_df$holiday))
## [1] 0
sum(is.na(data_df$weather_main))
## [1] 0
The above output tells us that the columns “holiday” and “weather_main” do not have any explicit missing values.
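Rather than checking one column at a time, the explicit-missing-value count can be computed for every column at once. This is a minimal sketch on a made-up data frame (`toy`) with deliberately inserted NA values; the values are illustrative only.

```r
# Toy data frame with made-up values, including deliberate NAs.
toy <- data.frame(
  holiday      = c("None", "Labor Day", NA),
  weather_main = c("Clear", "Rain", "Snow"),
  temp         = c(280.1, NA, 265.4),
  stringsAsFactors = FALSE
)

# Number of explicit NA values in every column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Note that `is.na()` only detects NA. A placeholder string such as "None" in the holiday column counts as a regular value, so it has to be examined separately.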
b. Checking for Implicitly Missing Rows
Values can also be implicitly missing, for example if the holiday column lacks some expected holidays. We can compare the unique values of this column with the known U.S. holidays to check whether any are missing.
unique(data_df$holiday)
## [1] "None" "Columbus Day"
## [3] "Veterans Day" "Thanksgiving Day"
## [5] "Christmas Day" "New Years Day"
## [7] "Washingtons Birthday" "Memorial Day"
## [9] "Independence Day" "State Fair"
## [11] "Labor Day" "Martin Luther King Jr Day"
Comparing against the known list of U.S. holidays, we see that Minnesota observes most of them. Entries such as “State Fair” are specific to Minnesota. Juneteenth National Independence Day does not appear in the column. So there are some implicitly missing rows, but none that would distort the insights we can gain from the dataset.
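The comparison against a reference list can be made explicit with `setdiff()`. In the sketch below, `observed` is copied from the `unique()` output above, while `expected` is an assumed reference list that I spelled to match the dataset's labels; the Juneteenth label in particular is a hypothetical spelling.

```r
# Holiday labels observed in the data (copied from the unique() output above).
observed <- c("None", "Columbus Day", "Veterans Day", "Thanksgiving Day",
              "Christmas Day", "New Years Day", "Washingtons Birthday",
              "Memorial Day", "Independence Day", "State Fair",
              "Labor Day", "Martin Luther King Jr Day")

# Assumed reference list of U.S. holidays (labels are an assumption).
expected <- c("New Years Day", "Martin Luther King Jr Day",
              "Washingtons Birthday", "Memorial Day",
              "Juneteenth National Independence Day", "Independence Day",
              "Labor Day", "Columbus Day", "Veterans Day",
              "Thanksgiving Day", "Christmas Day")

# Holidays in the reference list that never occur in the data.
missing_from_data <- setdiff(expected, observed)
# Labels in the data that are not in the reference list.
extra_in_data <- setdiff(observed, expected)

print(missing_from_data)
print(extra_in_data)
```

The first result lists the implicitly missing holidays; the second lists the dataset-specific labels, here the "None" placeholder and "State Fair".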
c. Checking for Empty Groups
table(data_df$weather_main)
##
## Clear Clouds Drizzle Fog Haze Mist
## 13381 15164 1821 912 1360 5950
## Rain Smoke Snow Squall Thunderstorm
## 5671 20 2876 4 1034
combination_counts <- week2 |>
  group_by(weather_main, weekday) |>
  summarise(count = n(), .groups = "drop") |>
  arrange(desc(count))
combination_counts |>
  ggplot() +
  geom_tile(aes(x = weekday, y = weather_main, fill = count)) +
  scale_fill_viridis_c(option = "C") +
  labs(title = "Weather Frequency",
       x = "Weekday",
       y = "Weather Condition",
       fill = "Count") +
  theme(axis.text = element_text(size = 25),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        plot.title = element_text(size = 20),
        legend.key.size = unit(2, "cm"),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 14),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))
As noted in the previous assignment, weather events such as “Smoke” and “Squall” are missing on a few weekdays. “weather_main” by itself does not have any empty groups, but combining it with weekday reveals some empty combinations.
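The empty combinations can also be listed directly instead of being read off the heatmap. The sketch below cross-tabulates the two columns on a small made-up data frame (`toy`) and keeps the cells with a count of zero; a grouped summary would omit such cells, whereas `table()` keeps them.

```r
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# Toy data (made up for illustration; not taken from the traffic dataset).
toy <- data.frame(
  weather_main = c("Clear", "Clear", "Squall", "Smoke", "Smoke"),
  weekday      = factor(c("Monday", "Tuesday", "Monday", "Friday", "Sunday"),
                        levels = weekday_levels)
)

# Cross-tabulate; table() keeps zero-count cells for every level combination.
tab    <- table(weather_main = toy$weather_main, weekday = toy$weekday)
counts <- as.data.frame(tab, responseName = "count")

# Combinations that never occur in the data.
empty <- counts[counts$count == 0, ]
print(nrow(empty))
print(head(empty))
```

With three weather conditions and seven weekdays there are 21 possible cells, of which only five are populated in the toy data, so 16 are reported as empty.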
Let’s check for outliers in the temp column before any conversion is applied:
pre_data=read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
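# Quartiles and interquartile range of the raw temperature (Kelvin)
Q1 <- quantile(pre_data$temp, 0.25)
Q3 <- quantile(pre_data$temp, 0.75)
iqr_temp <- Q3 - Q1 # named iqr_temp to avoid masking stats::IQR()
# Fences at 1.5 * IQR beyond the quartiles
lower_bound <- Q1 - 1.5 * iqr_temp
upper_bound <- Q3 + 1.5 * iqr_temp
outliers <- pre_data[pre_data$temp < lower_bound | pre_data$temp > upper_bound, ]
nrow(outliers)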
## [1] 10
This shows 10 outliers. Let’s plot them to see what they are (the red dots indicate outliers):
ggplot(pre_data, aes(y = temp)) +
  geom_boxplot(fill = "orange", outlier.color = "red", outlier.size = 2) +
  labs(title = "Boxplot of Temperature",
       y = "Temperature") +
  theme(axis.text = element_text(size = 25),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        plot.title = element_text(size = 20),
        legend.key.size = unit(2, "cm"),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 14),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))
As we can see from the boxplot, the outliers are readings equal to 0 K. This is not plausible: 0 K is absolute zero, the theoretical lowest possible temperature, and it cannot occur under ordinary environmental conditions. Thus we only consider data points where temp is above 0 for insights.
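The 1.5 × IQR rule used above can be wrapped in a small reusable helper. This is a minimal sketch: the function name `flag_iqr_outliers` is hypothetical, and the Kelvin readings are made up to mimic a few impossible 0 K entries among otherwise plausible temperatures.

```r
# Hypothetical helper: flag values outside the k * IQR fences.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up Kelvin readings with two impossible 0 K entries.
temps_k <- c(0, 0, 268, 272, 275, 281, 285, 290, 294, 300)

is_outlier <- flag_iqr_outliers(temps_k)
print(temps_k[is_outlier])

# Keep only physically plausible readings, as done for the real data.
cleaned <- temps_k[temps_k > 0]
print(length(cleaned))
```

On the toy readings only the two 0 K entries fall outside the fences, and the `temp > 0` filter removes exactly those values.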