Data Dive 5: Documentation

2. Data assessment

The data centers on traffic volume on an interstate in Minnesota. Several factors affect traffic volume, including time of day, rain, and snow. Let's look at some of the columns and values to see whether any of them require a closer look at the dataset’s documentation:

2.1 Identifying Unclear columns

a. “holiday”: This column names the holidays that Minnesota officially observes. It was likely recorded to track traffic patterns across different holidays. The values themselves are unlikely to cause confusion, but without the documentation a reader might expect a binary column indicating whether a given day is a holiday.

b. “weather_main” and “weather_description”: Both columns describe the weather conditions, and we can reasonably assume that weather affects traffic volume. The two columns record the weather at different levels of granularity, so without referring to the documentation the weather_description column could appear redundant.

c. “temp”: Shows the recorded temperature. Because it is stored in Kelvin, the values can mislead a user who does not convert them to a more familiar unit.
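The two derived columns hinted at above (a familiar temperature unit and a binary holiday flag) can be sketched as follows. This is a minimal illustration on a hypothetical `sample_df` with made-up values, not the actual dataset; the names `temp_c` and `is_holiday` are assumptions introduced here.

```r
# Hypothetical sample rows; the values are illustrative, not taken from the dataset
sample_df <- data.frame(
  holiday = c("None", "Columbus Day", "None"),
  temp    = c(288.28, 273.15, 300.00)   # Kelvin
)

# Kelvin to Celsius: subtract 273.15
sample_df$temp_c <- sample_df$temp - 273.15

# Binary holiday indicator derived from the named-holiday column
sample_df$is_holiday <- sample_df$holiday != "None"

print(sample_df)
```

Keeping the original `temp` column alongside `temp_c` makes it easy to trace any converted value back to the raw record.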

2.2 Identifying an Unclear Element Even After Reading Documentation

rain_1h (rain in the last 1 hour, measured in mm) is an unclear column. The documentation states that it is the amount of rain recorded in millimeters per hour. However, it contains many zero values, which makes it unclear whether zero means “no rain” or a missing or erroneous entry. The histogram below shows that the majority of the values in “rain_1h” are zero.

ggplot(data_df, aes(x = rain_1h)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "black") +
  labs(title = "Distribution of Rain in Last 1 Hour",
       x = "Rain (mm)",
       y = "Count") +
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

ggplot(data_df, aes(x = weather_main, y = rain_1h)) +
  geom_boxplot(aes(fill = weather_main)) +
  labs(title = "Rainfall Amount by Weather Condition",
       x = "Weather Condition",
       y = "Rain (mm)") +
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) +
  coord_flip()

We plotted rain volume against the corresponding weather conditions to check whether the zero readings line up with the conditions we would expect, which helps decide whether rain_1h was recorded incorrectly. As the plot shows, observations labeled clear or cloudy show almost no rain, while storms and related conditions do show rain. The zeros are therefore consistent with “no rain”, and we can treat the column as reliable for insights.
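The same check can also be done numerically rather than visually. The sketch below computes the share of zero readings per weather condition on a hypothetical `sample_df` with made-up values; the grouping in `wet_conditions` is an assumption about which labels should normally come with rain.

```r
# Hypothetical sample; values are illustrative, not taken from the dataset
sample_df <- data.frame(
  weather_main = c("Clear", "Clear", "Clouds", "Rain", "Rain", "Thunderstorm"),
  rain_1h      = c(0, 0, 0, 0.25, 0, 1.5)
)

# Share of zero readings within each weather condition
zero_share <- tapply(sample_df$rain_1h == 0, sample_df$weather_main, mean)
print(zero_share)

# Wet-weather rows that still report zero rain deserve a closer look
wet_conditions <- c("Rain", "Drizzle", "Thunderstorm")  # assumed grouping
suspect <- sample_df[sample_df$weather_main %in% wet_conditions &
                       sample_df$rain_1h == 0, ]
print(suspect)
```

A high zero share under dry conditions supports reading zero as “no rain”, while a large `suspect` set would point toward missing or erroneous entries.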

2.3 Missing data identification

a. Checking for Explicitly Missing Values

sum(is.na(data_df$holiday))

## [1] 0

sum(is.na(data_df$weather_main))

## [1] 0

The above output tells us that the columns “holiday” and “weather_main” do not have any explicit missing values.

b. Checking for Implicitly Missing Rows

For example, the holiday column may be missing some expected values. We can compare the unique values of this column against the known U.S. holidays to check whether any are missing.

unique(data_df$holiday)

##  [1] "None"                      "Columbus Day"             
##  [3] "Veterans Day"              "Thanksgiving Day"         
##  [5] "Christmas Day"             "New Years Day"            
##  [7] "Washingtons Birthday"      "Memorial Day"             
##  [9] "Independence Day"          "State Fair"               
## [11] "Labor Day"                 "Martin Luther King Jr Day"

Comparing against the known list of U.S. holidays, we see that the dataset covers most of them. Some entries, such as “State Fair”, are specific to Minnesota. Juneteenth National Independence Day does not appear. So some holidays are implicitly missing, but none whose absence would corrupt the insights we can gain from the dataset.
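This comparison can be made explicit with `setdiff()`. In the sketch below, `observed` is copied from the `unique()` output above, while `expected` is an assumed reference list written to match the dataset's spelling; the label "Juneteenth" in particular is a hypothetical spelling chosen for illustration.

```r
# Holiday labels observed in the dataset (copied from the unique() output above)
observed <- c("None", "Columbus Day", "Veterans Day", "Thanksgiving Day",
              "Christmas Day", "New Years Day", "Washingtons Birthday",
              "Memorial Day", "Independence Day", "State Fair",
              "Labor Day", "Martin Luther King Jr Day")

# Assumed reference list, spelled to match the dataset's labels
expected <- c("New Years Day", "Martin Luther King Jr Day",
              "Washingtons Birthday", "Memorial Day", "Juneteenth",
              "Independence Day", "Labor Day", "Columbus Day",
              "Veterans Day", "Thanksgiving Day", "Christmas Day")

# Expected holidays that never occur in the data
missing_holidays <- setdiff(expected, observed)

# Labels in the data that are not in the reference list (ignoring "None")
extra_labels <- setdiff(observed, c(expected, "None"))

print(missing_holidays)
print(extra_labels)
```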

c. Checking for Empty Groups

table(data_df$weather_main)

## 
##        Clear       Clouds      Drizzle          Fog         Haze         Mist 
##        13381        15164         1821          912         1360         5950 
##         Rain        Smoke         Snow       Squall Thunderstorm 
##         5671           20         2876            4         1034

combination_counts <- week2|>
  group_by(weather_main, weekday)|>
  summarise(count = n(), .groups = "drop")|>
  arrange(desc(count))

combination_counts|>
  ggplot()+
  geom_tile(aes(x=weekday,y=weather_main,fill=count))+
  scale_fill_viridis_c(option="C")+
  labs(title="Weather Frequency",
       x="Weekday",
       y="Weather Condition",
       fill="Count")+
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

As we noted in the previous assignment, weather events such as “Smoke” and “Squall” are missing from a few weekdays. “weather_main” on its own has no empty groups, but combining it with weekday reveals some empty combinations.
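Empty combinations can also be listed directly instead of being read off the heatmap. The sketch below cross-tabulates two columns of a hypothetical `sample_df` (made-up values) and keeps the cells with a count of zero. Note that `table()` only knows about values that appear at least once, so a weekday that is entirely absent from the data would not show up here.

```r
# Hypothetical sample; values are illustrative, not taken from the dataset
sample_df <- data.frame(
  weather_main = c("Clear", "Clear", "Rain", "Squall"),
  weekday      = c("Mon", "Tue", "Mon", "Tue")
)

# Cross-tabulate the two columns; unobserved pairs get a count of 0
counts <- table(sample_df$weather_main, sample_df$weekday)

# Convert to long format and keep only the empty combinations
empty <- as.data.frame(counts, stringsAsFactors = FALSE)
names(empty) <- c("weather_main", "weekday", "count")
empty <- empty[empty$count == 0, ]

print(empty)
```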

2.4 Identifying outliers

Let's check the column temp for outliers before any conversion is made:

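# Load the raw data (before any temperature conversion)
pre_data <- read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")

# Quartiles and interquartile range of temp (in Kelvin)
Q1 <- quantile(pre_data$temp, 0.25)
Q3 <- quantile(pre_data$temp, 0.75)
iqr_temp <- Q3 - Q1  # named iqr_temp to avoid masking stats::IQR()

# Tukey fences: 1.5 * IQR beyond the quartiles
lower_bound <- Q1 - 1.5 * iqr_temp
upper_bound <- Q3 + 1.5 * iqr_temp

# Rows whose temp falls outside the fences
outliers <- pre_data[pre_data$temp < lower_bound | pre_data$temp > upper_bound, ]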
nrow(outliers)

## [1] 10

This shows 10 outliers. Let's plot them to see what they are (the red dots indicate outliers):

ggplot(pre_data, aes(y = temp)) +
  geom_boxplot(fill = "orange", outlier.color = "red", outlier.size = 2) +
  labs(title = "Boxplot of Temperature",
       y = "Temperature") +
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

As we can see from the boxplot, the outliers are readings equal to 0 K. This is not plausible: 0 K is absolute zero, the theoretical lowest possible temperature, and cannot occur under ordinary environmental conditions. These readings are most likely erroneous or placeholder entries, so we only consider data points where temp is above 0 for insights.
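The filtering step described above can be sketched as follows: drop the 0 K readings, then recompute the IQR fences on what remains to confirm that no further outliers appear. The vector `temp_k` holds hypothetical values for illustration only, and the variable names are assumptions introduced here.

```r
# Hypothetical temperatures in Kelvin, including two impossible 0 K readings
temp_k <- c(0, 0, 265.4, 272.1, 280.9, 288.3, 291.7, 295.2, 300.6)

# Keep only physically plausible readings
valid <- temp_k[temp_k > 0]

# Recompute the 1.5 * IQR fences on the filtered values
q <- quantile(valid, c(0.25, 0.75), names = FALSE)
iqr_val <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr_val
upper <- q[2] + 1.5 * iqr_val

# Any values still outside the fences after filtering
remaining <- valid[valid < lower | valid > upper]

print(length(valid))
print(remaining)
```

Recomputing the fences after filtering matters because the 0 K values can pull the quartiles and hide or create other apparent outliers.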

Rajashekar

2025-02-17
