Exploratory Data Analysis of HW1

1. Introduction

This report performs an exploratory data analysis on the dataset HW1_Data. The focus is on identifying key patterns with colorful visualizations and matrix comparisons.

2. Loading the Data

# Load the data
file_path <- "HW1_Data.xlsx"
df <- read_excel(file_path, sheet = "UnitL4")

# Display the first few rows of the dataset
head(df)
## # A tibble: 6 × 19
##   Wthr_Cond_ID Light_Cond_ID   Road_Type_ID Road_Algn_ID SurfDry Traffic_Cntl_ID
##   <chr>        <chr>           <chr>        <chr>          <dbl> <chr>          
## 1 Clear        Dark, not ligh… 2 lane, 2 w… Straight, l…       1 Marked lanes   
## 2 Clear        Dark, not ligh… 2 lane, 2 w… Straight, l…       1 Center stripe/…
## 3 Clear        Daylight        2 lane, 2 w… Straight, l…       1 Marked lanes   
## 4 Clear        Daylight        2 lane, 2 w… Straight, l…       1 Center stripe/…
## 5 Clear        Dark, not ligh… 2 lane, 2 w… Straight, g…       1 None           
## 6 Clear        Daylight        Unknown      Straight, l…       1 None           
## # ℹ 13 more variables: Harm_Evnt_ID <chr>, Intrsct_Relat_ID <chr>,
## #   FHE_Collsn_ID <chr>, Road_Part_Adj_ID <chr>, Road_Cls_ID <chr>,
## #   Pop_Group_ID <chr>, Crash_Speed_LimitCat <chr>, Veh_Body_Styl_ID <chr>,
## #   Prsn_Ethnicity_ID <chr>, GenMale <dbl>, TrafVol <dbl>, Prsn_Age <chr>,
## #   Prsn_Injry_Sev_ID <chr>

3. Summary Statistics

# Summary of the dataset
summary(df)
##  Wthr_Cond_ID       Light_Cond_ID      Road_Type_ID       Road_Algn_ID      
##  Length:1295        Length:1295        Length:1295        Length:1295       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     SurfDry       Traffic_Cntl_ID    Harm_Evnt_ID       Intrsct_Relat_ID  
##  Min.   :0.0000   Length:1295        Length:1295        Length:1295       
##  1st Qu.:1.0000   Class :character   Class :character   Class :character  
##  Median :1.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.9143                                                           
##  3rd Qu.:1.0000                                                           
##  Max.   :1.0000                                                           
##  FHE_Collsn_ID      Road_Part_Adj_ID   Road_Cls_ID        Pop_Group_ID      
##  Length:1295        Length:1295        Length:1295        Length:1295       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Crash_Speed_LimitCat Veh_Body_Styl_ID   Prsn_Ethnicity_ID     GenMale      
##  Length:1295          Length:1295        Length:1295        Min.   :0.0000  
##  Class :character     Class :character   Class :character   1st Qu.:1.0000  
##  Mode  :character     Mode  :character   Mode  :character   Median :1.0000  
##                                                             Mean   :0.8842  
##                                                             3rd Qu.:1.0000  
##                                                             Max.   :1.0000  
##     TrafVol        Prsn_Age         Prsn_Injry_Sev_ID 
##  Min.   :  215   Length:1295        Length:1295       
##  1st Qu.: 6765   Class :character   Class :character  
##  Median :14170   Mode  :character   Mode  :character  
##  Mean   :14417                                        
##  3rd Qu.:22010                                        
##  Max.   :28970
# Checking for missing values
colSums(is.na(df))
##         Wthr_Cond_ID        Light_Cond_ID         Road_Type_ID 
##                    0                    0                    0 
##         Road_Algn_ID              SurfDry      Traffic_Cntl_ID 
##                    0                    0                    0 
##         Harm_Evnt_ID     Intrsct_Relat_ID        FHE_Collsn_ID 
##                    0                    0                    0 
##     Road_Part_Adj_ID          Road_Cls_ID         Pop_Group_ID 
##                    0                    0                    0 
## Crash_Speed_LimitCat     Veh_Body_Styl_ID    Prsn_Ethnicity_ID 
##                    0                    0                    0 
##              GenMale              TrafVol             Prsn_Age 
##                    0                    0                    0 
##    Prsn_Injry_Sev_ID 
##                    0

4. Data Structure

# Checking the structure of the dataset
str(df)
## tibble [1,295 × 19] (S3: tbl_df/tbl/data.frame)
##  $ Wthr_Cond_ID        : chr [1:1295] "Clear" "Clear" "Clear" "Clear" ...
##  $ Light_Cond_ID       : chr [1:1295] "Dark, not lighted" "Dark, not lighted" "Daylight" "Daylight" ...
##  $ Road_Type_ID        : chr [1:1295] "2 lane, 2 way" "2 lane, 2 way" "2 lane, 2 way" "2 lane, 2 way" ...
##  $ Road_Algn_ID        : chr [1:1295] "Straight, level" "Straight, level" "Straight, level" "Straight, level" ...
##  $ SurfDry             : num [1:1295] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Traffic_Cntl_ID     : chr [1:1295] "Marked lanes" "Center stripe/divider" "Marked lanes" "Center stripe/divider" ...
##  $ Harm_Evnt_ID        : chr [1:1295] "Motor vehicle in transport" "Motor vehicle in transport" "Motor vehicle in transport" "Fixed object" ...
##  $ Intrsct_Relat_ID    : chr [1:1295] "Non intersection" "Non intersection" "Intersection" "Non intersection" ...
##  $ FHE_Collsn_ID       : chr [1:1295] "Sd both going straight-rear end" "Sd both going straight-rear end" "Other" "Omv vehicle going straight" ...
##  $ Road_Part_Adj_ID    : chr [1:1295] "Main/proper lane" "Main/proper lane" "Main/proper lane" "Main/proper lane" ...
##  $ Road_Cls_ID         : chr [1:1295] "Farm to market" "Us & state highways" "Farm to market" "Us & state highways" ...
##  $ Pop_Group_ID        : chr [1:1295] "10,000 - 24,999 pop" "Rural" "Other" "Rural" ...
##  $ Crash_Speed_LimitCat: chr [1:1295] "30-40 mph" "65-70 mph" "45-60 mph" "65-70 mph" ...
##  $ Veh_Body_Styl_ID    : chr [1:1295] "Farm equipment" "Farm equipment" "Farm equipment" "Farm equipment" ...
##  $ Prsn_Ethnicity_ID   : chr [1:1295] "White" "White" "White" "White" ...
##  $ GenMale             : num [1:1295] 1 1 1 1 1 1 1 1 1 1 ...
##  $ TrafVol             : num [1:1295] 7654 13770 11470 16972 413 ...
##  $ Prsn_Age            : chr [1:1295] "25-54 years" "25-54 years" "Other" "25-54 years" ...
##  $ Prsn_Injry_Sev_ID   : chr [1:1295] "O" "O" "O" "O" ...

5. Visualizations

5.1 Plots for Different Variables with Clear Axis Labels

# Set color palette
colors <- brewer.pal(8, "Set1")

# Plotting distribution of Weather Conditions with colors
ggplot(df, aes(x=Wthr_Cond_ID, fill=Wthr_Cond_ID)) + 
  geom_bar() + 
  scale_fill_manual(values=colors) +
  theme_minimal() + 
  labs(title="Distribution of Weather Conditions", 
       x="Weather Condition", 
       y="Number of Observations")

# Plotting distribution of Road Types with colors
ggplot(df, aes(x=Road_Type_ID, fill=Road_Type_ID)) + 
  geom_bar() + 
  scale_fill_manual(values=colors) +
  theme_minimal() + 
  theme_bw (base_size = 11) +
  labs(title="Distribution of Road Types", 
       x="Road Type", 
       y="Number of Observations")

# Pie Chart to show the proportion of traffic by road type
df %>% 
  count(Road_Type_ID) %>%
  ggplot(aes(x = "", y = n, fill = Road_Type_ID)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Share of Traffic by Road Type",
       x = NULL,  # Correct usage of NULL instead of null
       y = NULL) +
  theme_void()  # Removes unnecessary chart elements

# Plotting Traffic Volume with color
ggplot(df, aes(x=TrafVol)) + 
  geom_histogram(bins=30, fill=colors[4], color="white") + 
  theme_minimal() + 
  labs(title="Traffic Volume Distribution", 
       x="Traffic Volume", 
       y="Frequency")

# Plotting traffic volume across different road types
avg_traffic <- df %>%
  group_by(Road_Type_ID) %>%
  summarize(mean_traffic = mean(TrafVol))

ggplot(avg_traffic, aes(x = Road_Type_ID, y = mean_traffic, group = 1, color = Road_Type_ID)) + 
  geom_line(size = 1.2) + 
  geom_point(size = 3) + 
  scale_color_brewer(palette = "Dark2") + 
  labs(title = "Traffic Volume Across Road Types", 
       x = "Road Type", y = "Average Traffic Volume") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

### 3. Box Plot: Traffic Volume vs Speed Limit Category
ggplot(data = df, aes(x = Crash_Speed_LimitCat, y = TrafVol, fill = Crash_Speed_LimitCat)) + 
  geom_boxplot() + 
  scale_fill_viridis_d() + 
  labs(title = "Traffic Volume vs Speed Limit Category", 
       x = "Speed Limit Category", y = "Traffic Volume") + 
  theme_minimal() +
  theme(legend.position = "none")

### 4. Stacked Bar Plot: Gender vs Injury Severity
ggplot(data = df, aes(x = as.factor(GenMale), fill = Prsn_Injry_Sev_ID)) + 
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("#FF6347", "#4682B4", "#32CD32", "#FFD700")) + 
  labs(title = "Gender vs Injury Severity", 
       x = "Gender (1 = Male, 0 = Female)", y = "Proportion", fill = "Injury Severity") + 
  theme_minimal()

### 5. Bar Plot: Collision Types
# Display the count of each collision type
df %>% 
  group_by(FHE_Collsn_ID) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))
## # A tibble: 5 × 2
##   FHE_Collsn_ID                    Count
##   <chr>                            <int>
## 1 Other                              397
## 2 Sd both going straight-rear end    312
## 3 Omv vehicle going straight         259
## 4 Sd one straight-one left turn      168
## 5 Sd both going straight-sideswipe   159
# Bar plot for collision types
ggplot(data = df, aes(x = FHE_Collsn_ID, fill = FHE_Collsn_ID)) +
  geom_bar() +
  labs(title = "Different Collision Types Vs Collision Frequencies",
       x = "Collision Type",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

# Counting number of crashes by ethnicity
ethnicity_counts <- df %>%
  group_by(Prsn_Ethnicity_ID) %>%
  summarise(Crash_Count = n(), .groups = 'drop')

# View the modified summary
print(ethnicity_counts)
## # A tibble: 5 × 2
##   Prsn_Ethnicity_ID Crash_Count
##   <chr>                   <int>
## 1 Black                      63
## 2 Hispanic                  422
## 3 Other                      39
## 4 Unknown                    83
## 5 White                     688
# Bar plot for crash counts by ethnicity
ggplot(data = ethnicity_counts, aes(x = Prsn_Ethnicity_ID, y = Crash_Count, fill = Prsn_Ethnicity_ID)) +
  geom_bar(stat = "identity") +
  labs(title = "Crash Occurrences by Ethnicity",
       x = "Ethnicity",
       y = "Number of Crashes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Improve label readability

# Define a scoring system for injury severity
severity_scores <- c('O' = 1, 'BC' = 2, 'KA' = 3)
df$Severity_Score <- as.numeric(severity_scores[df$Prsn_Injry_Sev_ID])
# Group by ethnicity and calculate the average severity
ethnicity_severity <- df %>%
  group_by(Prsn_Ethnicity_ID) %>%
  summarise(Average_Severity = mean(Severity_Score, na.rm = TRUE), .groups = 'drop')

# View the results
ethnicity_severity
## # A tibble: 5 × 2
##   Prsn_Ethnicity_ID Average_Severity
##   <chr>                        <dbl>
## 1 Black                         1.21
## 2 Hispanic                      1.14
## 3 Other                         1.03
## 4 Unknown                       1   
## 5 White                         1.21
ggplot(ethnicity_severity, aes(x = Prsn_Ethnicity_ID, y = Average_Severity, fill = Prsn_Ethnicity_ID)) +
  geom_col() +
  labs(title = "Average Injury Severity by Ethnicity",
       x = "Ethnicity",
       y = "Average Severity Score") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")  # Use a different palette if preferred

# Bar plot for average injury severity by ethnicity
# Recalculate the counts for each severity category within each ethnicity
severity_counts <- df %>%
  group_by(Prsn_Ethnicity_ID, Prsn_Injry_Sev_ID) %>%
  summarise(Count = n(), .groups = 'drop')  # Adjusting for categorization

# Grouped bar plot for injury severity by ethnicity
ggplot(data = severity_counts, aes(x = Prsn_Ethnicity_ID, y = Count, fill = Prsn_Injry_Sev_ID)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Injury Severity Counts by Ethnicity",
       x = "Ethnicity",
       y = "Count of Cases") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Calculate counts for plotting
age_severity_counts <- df %>%
  group_by(Prsn_Age, Prsn_Injry_Sev_ID) %>%
  summarise(Count = n(), .groups = 'drop')

# Plotting the stacked bar chart
ggplot(data = age_severity_counts, aes(x = Prsn_Age, y = Count, fill = Prsn_Injry_Sev_ID)) +
  geom_bar(stat = "identity") +
  labs(title = "Severity of Incidents Across Age Groups",
       x = "Age Group",
       y = "Count of Incidents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plotting the heat maps
ggplot(data = age_severity_counts, aes(x = Prsn_Age, y = Prsn_Injry_Sev_ID, fill = Count)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Heatmap of Incident Severity Across Age Groups",
       x = "Age Group",
       y = "Severity ID") +
  theme_minimal()

# Boxplot of Traffic Volume by Weather Conditions with colors
ggplot(df, aes(x=Wthr_Cond_ID, y=TrafVol, fill=Wthr_Cond_ID)) + 
  geom_boxplot() + 
  scale_fill_manual(values=colors) +
  theme_minimal() + 
  labs(title="Traffic Volume by Weather Conditions", 
       x="Weather Condition", 
       y="Traffic Volume")

5.2 Matrix Comparison: Correlation and Heatmap with Clear Axis Labels

# Correlation matrix for numerical variables
numeric_df <- df %>%
  select_if(is.numeric)

corr_matrix <- cor(numeric_df, use="complete.obs")

# Visualize correlation matrix using circle method
corrplot(corr_matrix, method="circle", col=brewer.pal(n=8, name="RdYlBu"), 
         tl.cex = 0.8, mar=c(0,0,2,0), title = "Correlation Matrix")

# Visualize correlation matrix using heatmap
corrplot(corr_matrix, method="color", col=brewer.pal(n=8, name="RdYlBu"), type="upper", 
         tl.col="black", tl.srt=45, tl.cex = 0.8, mar=c(0,0,2,0), title = "Correlation Heatmap")

6. Insights and Findings

  • Weather Conditions: The majority of crashes happen during clear weather, as seen in the bar plots.

  • Road Types Bar and pie reveals that “2 lane, 2 way” roads are the most commonly observed road type, followed by “Unknown,” with “4 or more lanes, divided” and “4 or more lanes, undivided” showing notably fewer observations, and “Other” types being the least common.

  • Traffic Volume:

    1. There is a noticeable difference in traffic volume across various weather conditions, indicated by the boxplots.
    2. The histogram of traffic volume distribution also shows a multimodal pattern, with peaks around 10,000 and slightly above 20,000, indicating that certain traffic volumes are more commonly observed, with frequencies tapering off for the highest and lowest volumes recorded.
    3. The line plot illustrates the average traffic volume across different road types, showing that “4 or more lanes, undivided” roads experience the highest average traffic volume, closely followed by “Unknown” road types, with “2 lane, 2 way” and “4 or more lanes, divided” roads showing significantly lower average volumes.
    4. The box plot comparing traffic volume across different speed limit categories reveals that roads with speed limits over 70 mph and those in the “Other” category experience higher median traffic volumes, while roads with speed limits between 45-60 mph and 65-70 mph exhibit similar lower median volumes, albeit with varying distributions.
  • Injury Severity:

    1. Figures indicate that the most severe incidents (KA) predominantly occur in the 25-54 years age group, while the least severe type of incidents (O) are observed in the 65-74 years range.
    2. The stacked bar chart depicting the proportion of injury severity by gender shows that both males (1) and females (0) predominantly experience incidents classified as ‘O’ (No Injury), though males have a slightly higher proportion of more severe incidents (‘KA’ and ‘BC’) compared to females.
  • Ethnic Connection:

    1. The bar chart illustrates that White individuals experience significantly higher crash occurrences compared to other ethnic groups, with Hispanic individuals also showing a notably high number of crashes, while Black, Other, and Unknown categories have much lower frequencies.
    2. The bar chart shows that Black and White individuals have the highest average injury severity scores, significantly higher than those for Hispanic, Other, and Unknown ethnic categories, indicating a disparity in injury severity across different ethnic groups.
  • Correlation Matrix: A strong correlation exists between some variables such as traffic volume and other factors.

    1. TrafVol and Severity_Score shows a moderate negative correlation. This could imply that higher traffic volumes might correlate with incidents of lower severity, potentially due to lower speeds or more cautious driving in higher traffic conditions.
    2. A moderate negative correlation is visible between GenMale and Severity_Score. This might suggest that incidents involving males have a lower severity score, although this requires more contextual data to interpret accurately.
    3. GenMale and TrafVol: They exhibit a very weak positive correlation, almost negligible, as indicated by the very small and faint blue circle.
    4. SurfDry and Severity_Score appears to have a weak to moderate negative correlation, suggesting that more severe incidents might occur less frequently under dry surface conditions.
    5. SurfDry and TrafVol: This pair shows a moderate negative correlation, as indicated by the medium-sized red circle. A possible interpretation is that higher TrafVol might be associated with less dry surface conditions or vice versa.
    6. SurfDry and GenMale: There’s a relatively strong positive correlation between SurfDry and GenMale, indicated by the large blue circle. This suggests that these two variables tend to increase together; as one increases, the other tends to increase as well.

7. Conclusions

This exploratory data analysis reveals interesting insights into weather conditions, traffic volume, and correlations between different variables. Using colorful visualizations helps highlight key patterns and relationships.