Exploratory Data Analysis of HW1
1. Introduction
This report performs an exploratory data analysis on the dataset
HW1_Data. The focus is on identifying key patterns with
colorful visualizations and matrix comparisons.
2. Loading the Data
# Load the data
file_path <- "HW1_Data.xlsx"
df <- read_excel(file_path, sheet = "UnitL4")
# Display the first few rows of the dataset
head(df)## # A tibble: 6 × 19
## Wthr_Cond_ID Light_Cond_ID Road_Type_ID Road_Algn_ID SurfDry Traffic_Cntl_ID
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Clear Dark, not ligh… 2 lane, 2 w… Straight, l… 1 Marked lanes
## 2 Clear Dark, not ligh… 2 lane, 2 w… Straight, l… 1 Center stripe/…
## 3 Clear Daylight 2 lane, 2 w… Straight, l… 1 Marked lanes
## 4 Clear Daylight 2 lane, 2 w… Straight, l… 1 Center stripe/…
## 5 Clear Dark, not ligh… 2 lane, 2 w… Straight, g… 1 None
## 6 Clear Daylight Unknown Straight, l… 1 None
## # ℹ 13 more variables: Harm_Evnt_ID <chr>, Intrsct_Relat_ID <chr>,
## # FHE_Collsn_ID <chr>, Road_Part_Adj_ID <chr>, Road_Cls_ID <chr>,
## # Pop_Group_ID <chr>, Crash_Speed_LimitCat <chr>, Veh_Body_Styl_ID <chr>,
## # Prsn_Ethnicity_ID <chr>, GenMale <dbl>, TrafVol <dbl>, Prsn_Age <chr>,
## # Prsn_Injry_Sev_ID <chr>
3. Summary Statistics
## Wthr_Cond_ID Light_Cond_ID Road_Type_ID Road_Algn_ID
## Length:1295 Length:1295 Length:1295 Length:1295
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## SurfDry Traffic_Cntl_ID Harm_Evnt_ID Intrsct_Relat_ID
## Min. :0.0000 Length:1295 Length:1295 Length:1295
## 1st Qu.:1.0000 Class :character Class :character Class :character
## Median :1.0000 Mode :character Mode :character Mode :character
## Mean :0.9143
## 3rd Qu.:1.0000
## Max. :1.0000
## FHE_Collsn_ID Road_Part_Adj_ID Road_Cls_ID Pop_Group_ID
## Length:1295 Length:1295 Length:1295 Length:1295
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Crash_Speed_LimitCat Veh_Body_Styl_ID Prsn_Ethnicity_ID GenMale
## Length:1295 Length:1295 Length:1295 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:1.0000
## Mode :character Mode :character Mode :character Median :1.0000
## Mean :0.8842
## 3rd Qu.:1.0000
## Max. :1.0000
## TrafVol Prsn_Age Prsn_Injry_Sev_ID
## Min. : 215 Length:1295 Length:1295
## 1st Qu.: 6765 Class :character Class :character
## Median :14170 Mode :character Mode :character
## Mean :14417
## 3rd Qu.:22010
## Max. :28970
## Wthr_Cond_ID Light_Cond_ID Road_Type_ID
## 0 0 0
## Road_Algn_ID SurfDry Traffic_Cntl_ID
## 0 0 0
## Harm_Evnt_ID Intrsct_Relat_ID FHE_Collsn_ID
## 0 0 0
## Road_Part_Adj_ID Road_Cls_ID Pop_Group_ID
## 0 0 0
## Crash_Speed_LimitCat Veh_Body_Styl_ID Prsn_Ethnicity_ID
## 0 0 0
## GenMale TrafVol Prsn_Age
## 0 0 0
## Prsn_Injry_Sev_ID
## 0
4. Data Structure
## tibble [1,295 × 19] (S3: tbl_df/tbl/data.frame)
## $ Wthr_Cond_ID : chr [1:1295] "Clear" "Clear" "Clear" "Clear" ...
## $ Light_Cond_ID : chr [1:1295] "Dark, not lighted" "Dark, not lighted" "Daylight" "Daylight" ...
## $ Road_Type_ID : chr [1:1295] "2 lane, 2 way" "2 lane, 2 way" "2 lane, 2 way" "2 lane, 2 way" ...
## $ Road_Algn_ID : chr [1:1295] "Straight, level" "Straight, level" "Straight, level" "Straight, level" ...
## $ SurfDry : num [1:1295] 1 1 1 1 1 1 1 1 1 1 ...
## $ Traffic_Cntl_ID : chr [1:1295] "Marked lanes" "Center stripe/divider" "Marked lanes" "Center stripe/divider" ...
## $ Harm_Evnt_ID : chr [1:1295] "Motor vehicle in transport" "Motor vehicle in transport" "Motor vehicle in transport" "Fixed object" ...
## $ Intrsct_Relat_ID : chr [1:1295] "Non intersection" "Non intersection" "Intersection" "Non intersection" ...
## $ FHE_Collsn_ID : chr [1:1295] "Sd both going straight-rear end" "Sd both going straight-rear end" "Other" "Omv vehicle going straight" ...
## $ Road_Part_Adj_ID : chr [1:1295] "Main/proper lane" "Main/proper lane" "Main/proper lane" "Main/proper lane" ...
## $ Road_Cls_ID : chr [1:1295] "Farm to market" "Us & state highways" "Farm to market" "Us & state highways" ...
## $ Pop_Group_ID : chr [1:1295] "10,000 - 24,999 pop" "Rural" "Other" "Rural" ...
## $ Crash_Speed_LimitCat: chr [1:1295] "30-40 mph" "65-70 mph" "45-60 mph" "65-70 mph" ...
## $ Veh_Body_Styl_ID : chr [1:1295] "Farm equipment" "Farm equipment" "Farm equipment" "Farm equipment" ...
## $ Prsn_Ethnicity_ID : chr [1:1295] "White" "White" "White" "White" ...
## $ GenMale : num [1:1295] 1 1 1 1 1 1 1 1 1 1 ...
## $ TrafVol : num [1:1295] 7654 13770 11470 16972 413 ...
## $ Prsn_Age : chr [1:1295] "25-54 years" "25-54 years" "Other" "25-54 years" ...
## $ Prsn_Injry_Sev_ID : chr [1:1295] "O" "O" "O" "O" ...
5. Visualizations
5.1 Plots for Different Variables with Clear Axis Labels
# Set color palette
colors <- brewer.pal(8, "Set1")
# Plotting distribution of Weather Conditions with colors
ggplot(df, aes(x=Wthr_Cond_ID, fill=Wthr_Cond_ID)) +
geom_bar() +
scale_fill_manual(values=colors) +
theme_minimal() +
labs(title="Distribution of Weather Conditions",
x="Weather Condition",
y="Number of Observations")# Plotting distribution of Road Types with colors
ggplot(df, aes(x=Road_Type_ID, fill=Road_Type_ID)) +
geom_bar() +
scale_fill_manual(values=colors) +
theme_minimal() +
theme_bw (base_size = 11) +
labs(title="Distribution of Road Types",
x="Road Type",
y="Number of Observations")# Pie Chart to show the proportion of traffic by road type
df %>%
count(Road_Type_ID) %>%
ggplot(aes(x = "", y = n, fill = Road_Type_ID)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(title = "Share of Traffic by Road Type",
x = NULL, # Correct usage of NULL instead of null
y = NULL) +
theme_void() # Removes unnecessary chart elements# Plotting Traffic Volume with color
ggplot(df, aes(x=TrafVol)) +
geom_histogram(bins=30, fill=colors[4], color="white") +
theme_minimal() +
labs(title="Traffic Volume Distribution",
x="Traffic Volume",
y="Frequency")# Plotting traffic volume across different road types
avg_traffic <- df %>%
group_by(Road_Type_ID) %>%
summarize(mean_traffic = mean(TrafVol))
ggplot(avg_traffic, aes(x = Road_Type_ID, y = mean_traffic, group = 1, color = Road_Type_ID)) +
geom_line(size = 1.2) +
geom_point(size = 3) +
scale_color_brewer(palette = "Dark2") +
labs(title = "Traffic Volume Across Road Types",
x = "Road Type", y = "Average Traffic Volume") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))### 3. Box Plot: Traffic Volume vs Speed Limit Category
ggplot(data = df, aes(x = Crash_Speed_LimitCat, y = TrafVol, fill = Crash_Speed_LimitCat)) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(title = "Traffic Volume vs Speed Limit Category",
x = "Speed Limit Category", y = "Traffic Volume") +
theme_minimal() +
theme(legend.position = "none")### 4. Stacked Bar Plot: Gender vs Injury Severity
ggplot(data = df, aes(x = as.factor(GenMale), fill = Prsn_Injry_Sev_ID)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("#FF6347", "#4682B4", "#32CD32", "#FFD700")) +
labs(title = "Gender vs Injury Severity",
x = "Gender (1 = Male, 0 = Female)", y = "Proportion", fill = "Injury Severity") +
theme_minimal()### 5. Bar Plot: Collision Types
# Display the count of each collision type
df %>%
group_by(FHE_Collsn_ID) %>%
summarise(Count = n()) %>%
arrange(desc(Count))## # A tibble: 5 × 2
## FHE_Collsn_ID Count
## <chr> <int>
## 1 Other 397
## 2 Sd both going straight-rear end 312
## 3 Omv vehicle going straight 259
## 4 Sd one straight-one left turn 168
## 5 Sd both going straight-sideswipe 159
# Bar plot for collision types
ggplot(data = df, aes(x = FHE_Collsn_ID, fill = FHE_Collsn_ID)) +
geom_bar() +
labs(title = "Different Collision Types Vs Collision Frequencies",
x = "Collision Type",
y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability# Counting number of crashes by ethnicity
ethnicity_counts <- df %>%
group_by(Prsn_Ethnicity_ID) %>%
summarise(Crash_Count = n(), .groups = 'drop')
# View the modified summary
print(ethnicity_counts)## # A tibble: 5 × 2
## Prsn_Ethnicity_ID Crash_Count
## <chr> <int>
## 1 Black 63
## 2 Hispanic 422
## 3 Other 39
## 4 Unknown 83
## 5 White 688
# Bar plot for crash counts by ethnicity
ggplot(data = ethnicity_counts, aes(x = Prsn_Ethnicity_ID, y = Crash_Count, fill = Prsn_Ethnicity_ID)) +
geom_bar(stat = "identity") +
labs(title = "Crash Occurrences by Ethnicity",
x = "Ethnicity",
y = "Number of Crashes") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Improve label readability# Define a scoring system for injury severity
severity_scores <- c('O' = 1, 'BC' = 2, 'KA' = 3)
df$Severity_Score <- as.numeric(severity_scores[df$Prsn_Injry_Sev_ID])
# Group by ethnicity and calculate the average severity
ethnicity_severity <- df %>%
group_by(Prsn_Ethnicity_ID) %>%
summarise(Average_Severity = mean(Severity_Score, na.rm = TRUE), .groups = 'drop')
# View the results
ethnicity_severity## # A tibble: 5 × 2
## Prsn_Ethnicity_ID Average_Severity
## <chr> <dbl>
## 1 Black 1.21
## 2 Hispanic 1.14
## 3 Other 1.03
## 4 Unknown 1
## 5 White 1.21
ggplot(ethnicity_severity, aes(x = Prsn_Ethnicity_ID, y = Average_Severity, fill = Prsn_Ethnicity_ID)) +
geom_col() +
labs(title = "Average Injury Severity by Ethnicity",
x = "Ethnicity",
y = "Average Severity Score") +
theme_minimal() +
scale_fill_brewer(palette = "Set2") # Use a different palette if preferred# Bar plot for average injury severity by ethnicity
# Recalculate the counts for each severity category within each ethnicity
severity_counts <- df %>%
group_by(Prsn_Ethnicity_ID, Prsn_Injry_Sev_ID) %>%
summarise(Count = n(), .groups = 'drop') # Adjusting for categorization
# Grouped bar plot for injury severity by ethnicity
ggplot(data = severity_counts, aes(x = Prsn_Ethnicity_ID, y = Count, fill = Prsn_Injry_Sev_ID)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Injury Severity Counts by Ethnicity",
x = "Ethnicity",
y = "Count of Cases") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))# Calculate counts for plotting
age_severity_counts <- df %>%
group_by(Prsn_Age, Prsn_Injry_Sev_ID) %>%
summarise(Count = n(), .groups = 'drop')
# Plotting the stacked bar chart
ggplot(data = age_severity_counts, aes(x = Prsn_Age, y = Count, fill = Prsn_Injry_Sev_ID)) +
geom_bar(stat = "identity") +
labs(title = "Severity of Incidents Across Age Groups",
x = "Age Group",
y = "Count of Incidents") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))# Plotting the heat maps
ggplot(data = age_severity_counts, aes(x = Prsn_Age, y = Prsn_Injry_Sev_ID, fill = Count)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") +
labs(title = "Heatmap of Incident Severity Across Age Groups",
x = "Age Group",
y = "Severity ID") +
theme_minimal()# Boxplot of Traffic Volume by Weather Conditions with colors
ggplot(df, aes(x=Wthr_Cond_ID, y=TrafVol, fill=Wthr_Cond_ID)) +
geom_boxplot() +
scale_fill_manual(values=colors) +
theme_minimal() +
labs(title="Traffic Volume by Weather Conditions",
x="Weather Condition",
y="Traffic Volume")5.2 Matrix Comparison: Correlation and Heatmap with Clear Axis Labels
# Correlation matrix for numerical variables
numeric_df <- df %>%
select_if(is.numeric)
corr_matrix <- cor(numeric_df, use="complete.obs")
# Visualize correlation matrix using circle method
corrplot(corr_matrix, method="circle", col=brewer.pal(n=8, name="RdYlBu"),
tl.cex = 0.8, mar=c(0,0,2,0), title = "Correlation Matrix")# Visualize correlation matrix using heatmap
corrplot(corr_matrix, method="color", col=brewer.pal(n=8, name="RdYlBu"), type="upper",
tl.col="black", tl.srt=45, tl.cex = 0.8, mar=c(0,0,2,0), title = "Correlation Heatmap")6. Insights and Findings
Weather Conditions: The majority of crashes happen during clear weather, as seen in the bar plots.
Road Types Bar and pie reveals that “2 lane, 2 way” roads are the most commonly observed road type, followed by “Unknown,” with “4 or more lanes, divided” and “4 or more lanes, undivided” showing notably fewer observations, and “Other” types being the least common.
Traffic Volume:
- There is a noticeable difference in traffic volume across various weather conditions, indicated by the boxplots.
- The histogram of traffic volume distribution also shows a multimodal pattern, with peaks around 10,000 and slightly above 20,000, indicating that certain traffic volumes are more commonly observed, with frequencies tapering off for the highest and lowest volumes recorded.
- The line plot illustrates the average traffic volume across different road types, showing that “4 or more lanes, undivided” roads experience the highest average traffic volume, closely followed by “Unknown” road types, with “2 lane, 2 way” and “4 or more lanes, divided” roads showing significantly lower average volumes.
- The box plot comparing traffic volume across different speed limit categories reveals that roads with speed limits over 70 mph and those in the “Other” category experience higher median traffic volumes, while roads with speed limits between 45-60 mph and 65-70 mph exhibit similar lower median volumes, albeit with varying distributions.
Injury Severity:
- Figures indicate that the most severe incidents (KA) predominantly occur in the 25-54 years age group, while the least severe type of incidents (O) are observed in the 65-74 years range.
- The stacked bar chart depicting the proportion of injury severity by gender shows that both males (1) and females (0) predominantly experience incidents classified as ‘O’ (No Injury), though males have a slightly higher proportion of more severe incidents (‘KA’ and ‘BC’) compared to females.
Ethnic Connection:
- The bar chart illustrates that White individuals experience significantly higher crash occurrences compared to other ethnic groups, with Hispanic individuals also showing a notably high number of crashes, while Black, Other, and Unknown categories have much lower frequencies.
- The bar chart shows that Black and White individuals have the highest average injury severity scores, significantly higher than those for Hispanic, Other, and Unknown ethnic categories, indicating a disparity in injury severity across different ethnic groups.
Correlation Matrix: A strong correlation exists between some variables such as traffic volume and other factors.
- TrafVol and Severity_Score shows a moderate negative correlation. This could imply that higher traffic volumes might correlate with incidents of lower severity, potentially due to lower speeds or more cautious driving in higher traffic conditions.
- A moderate negative correlation is visible between GenMale and Severity_Score. This might suggest that incidents involving males have a lower severity score, although this requires more contextual data to interpret accurately.
- GenMale and TrafVol: They exhibit a very weak positive correlation, almost negligible, as indicated by the very small and faint blue circle.
- SurfDry and Severity_Score appears to have a weak to moderate negative correlation, suggesting that more severe incidents might occur less frequently under dry surface conditions.
- SurfDry and TrafVol: This pair shows a moderate negative correlation, as indicated by the medium-sized red circle. A possible interpretation is that higher TrafVol might be associated with less dry surface conditions or vice versa.
- SurfDry and GenMale: There’s a relatively strong positive correlation between SurfDry and GenMale, indicated by the large blue circle. This suggests that these two variables tend to increase together; as one increases, the other tends to increase as well.
7. Conclusions
This exploratory data analysis reveals interesting insights into weather conditions, traffic volume, and correlations between different variables. Using colorful visualizations helps highlight key patterns and relationships.