Exploratory Data Analysis

In this document, I will be exploring my three chosen datasets on Income Data, Public School Data, and Crash Reporting Data all in Montgomery County with the following information.

Variable Statistics
Missing Information

Strong visualizations

Accompanied by explanations and insights.

Lets First Load in the Datasets

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)

school_data <- read.csv("/Users/xutongzhang/Documents/DATASETS SPRING 2024/Public_Schools_20240211.csv") #Load in School Data

income_data <- read.csv("/Users/xutongzhang/Documents/DATASETS SPRING 2024/Income Data Final.csv") #Load in Income Data

crash_data <- read.csv("/Users/xutongzhang/Documents/DATASETS SPRING 2024/Crash_Reporting_-_Drivers_Data_20240211.csv") #Load in Crash Data

Descriptive Variable Statistics

This section will show the frequency counts, contingency tables, five number summary, and mean of variables in each of the three datasets.

Public School Data (Montgomery County Maryland)

Categorical Variables: CATEGORY

# Descriptive statistics for ZIP CODE
table(school_data$CATEGORY)


ELEMENTARY SCHOOLS       HIGH SCHOOLS     MIDDLE SCHOOLS 
               136                 25                 39

We can see that there are 136 Elementary Schools, 25 High Schools, and 39 Middle schools. Showing a majority of the schools in Montgomery County are Elementary Schools.

Numerical Variables: ZIP.CODE

summary(school_data$ZIP.CODE)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20814   20853   20876   20871   20901   20912

We notice how zip codes range from 20814 to 20912, a valuable piece of info for connecting our other datasets.

Contingency Table: Elementary Schools IN ZIP.CODE

table(school_data$CATEGORY, school_data$`ZIP.CODE`)

                    
                     20814 20815 20816 20817 20832 20833 20837 20842 20850
  ELEMENTARY SCHOOLS     2     4     2     8     3     1     1     1     4
  HIGH SCHOOLS           2     0     0     1     0     0     1     0     1
  MIDDLE SCHOOLS         0     0     1     2     2     0     1     0     2
                    
                     20851 20852 20853 20854 20855 20860 20866 20871 20872
  ELEMENTARY SCHOOLS     2     2     8     6     3     1     1     2     3
  HIGH SCHOOLS           1     1     0     1     1     1     1     1     1
  MIDDLE SCHOOLS         0     3     1     2     1     0     1     1     1
                    
                     20874 20876 20877 20878 20879 20882 20886 20895 20896
  ELEMENTARY SCHOOLS     8     5     5    10     5     2     2     2     1
  HIGH SCHOOLS           2     0     1     1     1     0     0     1     0
  MIDDLE SCHOOLS         3     1     3     2     0     0     1     1     0
                    
                     20901 20902 20903 20904 20905 20906 20910 20912
  ELEMENTARY SCHOOLS     6     7     4     9     2     7     4     3
  HIGH SCHOOLS           2     1     0     1     1     1     0     0
  MIDDLE SCHOOLS         1     2     1     1     1     2     2     0

This contingency table shows us howmany elementary, middle, and high schools are in each zip code.

Car Crash Data (Montgomery County Maryland)

Categorical Variables: ACRS Report Type

# Frequency counts for ACRS Report Type
table(crash_data$`ACRS.Report.Type`)


          Fatal Crash          Injury Crash Property Damage Crash 
                  425                 62228                109452

This allows me to count of each type of crash report, such as “Property Damage Crash”, “Injury Crash”, etc. This information will help me understand the distribution of crash severities. We can see how there are 425 fatal crashes, 62,228 crashes resulting in injury, and 10,9452 crashes that resulting in substantial property damage.

Numerical Variables: Speed Limit

# Summary statistics for Speed Limit
summary(crash_data$`Speed.Limit`)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   25.00   35.00   32.55   40.00   75.00

This shows us the five-number summary (minimum, first quartile, median, third quartile, maximum) and mean of the speed limits where crashes occurred, which can help us know whether or not higher speed limits correlate with more severe or frequent crashes.

Contingency Table: Elementary Schools IN ZIP.CODE

# Contingency table for ACRS Report Type and Speed Limit
table(crash_data$`ACRS.Report.Type`, crash_data$`Speed.Limit`)

                       
                            0     5    10    15    20    25    30    35    40
  Fatal Crash               3     0     0     1     1    31    51    85   139
  Injury Crash            879   685   546  1163   278  7086  8582 19590 14304
  Property Damage Crash  3867  3528  2548  4961   913 16928 14744 30804 19146
                       
                           45    50    55    60    65    70    75
  Fatal Crash              56    51     5     2     0     0     0
  Injury Crash           5514  2046  1512    24    17     2     0
  Property Damage Crash  6884  2572  2461    50    41     4     1

This table will help us see if certain types of crashes are more common at specific speed limits, helping us to see areas where speed limit adjustments might improve safety.

Income Data (Montgomery County Maryland)

Categorical Variables: ZIP Code

# Unique ZIP codes
length(unique(income_data$ZIP_Code))

[1] 478

We see from this is that there are 478 unique Zip codes in the dataset, helping us show the geographic coverage of my income data.

Numerical Variables: Median Household Income and Total Households

# Summary statistics for Median Household Income
summary(income_data$Median_Household_Income)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  19722   77758  100386  106476  129495  246618      56

# Summary statistics for Total Households
summary(income_data$Total_Households)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0     331    1836    4870    8264   27598

The data on top shows us the five-number summary (minimum, first quartile, median, third quartile, maximum) and mean of the speed limits where crashes occurred, which can help us know whether or not higher speed limits correlate with more severe or frequent crashes.

The data on the bottom provides us with how many households are in each ZIp code from least densely populated to most.

Contingency Table: NA

This dataset does not have sufficient rows and columns to prepare a contingency table. Please stay tuned for the joining with other datasets in the following sections.

Report on Missing Values

In this section, I will be individually be looking at each dataset and reporting the missing values, how many of each, and stating how it will affect my analysis if at all.

School Data Report

sapply(school_data, function(x) sum(is.na(x)))

   CATEGORY SCHOOL.NAME     ADDRESS        CITY    ZIP.CODE       PHONE 
          0           0           0           0           0           0 
        URL   LONGITUDE    LATITUDE    LOCATION 
          0           0           0           0

The return above shows that there are in fact no missing values across all variables in the Public School dataset. This completeness ensures that our analysis will be as comprehensive as possible.

Crash Data Report

sapply(crash_data, function(x) sum(is.na(x)))

                 Report.Number              Local.Case.Number 
                             0                              0 
                   Agency.Name               ACRS.Report.Type 
                             0                              0 
               Crash.Date.Time                     Route.Type 
                             0                              0 
                     Road.Name              Cross.Street.Type 
                             0                              0 
             Cross.Street.Name           Off.Road.Description 
                             8                              0 
                  Municipality           Related.Non.Motorist 
                             0                              0 
                Collision.Type                        Weather 
                             0                              0 
             Surface.Condition                          Light 
                             0                              0 
               Traffic.Control         Driver.Substance.Abuse 
                             0                              0 
  Non.Motorist.Substance.Abuse                      Person.ID 
                             0                              0 
               Driver.At.Fault                Injury.Severity 
                             0                              0 
                  Circumstance           Driver.Distracted.By 
                             0                              0 
         Drivers.License.State                     Vehicle.ID 
                             0                              0 
         Vehicle.Damage.Extent  Vehicle.First.Impact.Location 
                             0                              0 
Vehicle.Second.Impact.Location              Vehicle.Body.Type 
                             0                              0 
              Vehicle.Movement         Vehicle.Continuing.Dir 
                             0                              0 
             Vehicle.Going.Dir                    Speed.Limit 
                             0                              0 
            Driverless.Vehicle                 Parked.Vehicle 
                             0                              0 
                  Vehicle.Year                   Vehicle.Make 
                             0                             11 
                 Vehicle.Model             Equipment.Problems 
                            25                              0 
                      Latitude                      Longitude 
                             0                              0 
                      Location 
                             0

The return above shows that there is some missing values, specifically:

Cross Street Name: 8
Vehicle Model: 25
Vehicle Make: 11

Considering the number of values in this Dataset, these numbers are extremley small, and only for 3 out of 43 total variables. Having said this, I believe that there will be little to no affect on my analysis, even more so that these are less significant values for my research.

School Data Report

sapply(income_data, function(x) sum(is.na(x)))

               ZIP_Code        Total_Households Median_Household_Income 
                      0                       0                      56

The return above shows that there is only 1/3 variables that are missing data, specifically:

Median Household Income: 56

This is significant as it is the most important variable in my dataset. This effectively removes 56 zipcodes that I could possibly use, making it harder to do my research if it requires those variables. To counteract this I will need to ensure that I focus on zipcodes that do not rely on this income data.

Compelling Visualizations

I will create visualizations in this section for each dataset that show:

Distribution of variables
Relationships inside of their own datasets
Outliers inside their own datasets
Descriptive Statistics where needed

to create comprehensive and understandable visualizations that aid in answering my research direction.

Public School Visualizations

Bar Graph of School Types

school_counts <- table(school_data$CATEGORY) #Count the number of each type

# create a bar plot
ggplot(data = as.data.frame(school_counts), aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = c("skyblue", "orange", "lightgreen")) +
  labs(title = "Distribution of School Types in Montgomery County",
       x = "School Type",
       y = "Number of Schools") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The chart shows a distribution of school types across Montgomery County providing a clear picture of the number of Elementary, Middle and High schools.

Map of Schools

library(leaflet) #Use leaflet to make interactive maps

colors <- c("ELEMENTARY SCHOOLS" = "skyblue", "MIDDLE SCHOOLS" = "orange", "HIGH SCHOOLS" = "lightgreen") #Establish pretty colors

# Add a color column based on school type
school_data$color <- colors[school_data$CATEGORY]

# Create the map
leaflet(school_data) %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addCircleMarkers(~LONGITUDE, ~LATITUDE, color = ~color,
                   popup = ~as.character(CATEGORY), radius = 5)

This plot has all schools on a map, color-coded by school type. This visualization will help get a sense of the layout of schools across the county in certain zip codes.

Car Crash Visualizations

Distribution of Crashes by Severity

# Count occurrences each crash type
crash_severity_counts <- table(crash_data$ACRS.Report.Type)

# Convert table to dataframe 
crash_severity_df <- as.data.frame(crash_severity_counts)

names(crash_severity_df) <- c("Crash_Severity", "Number_of_Crashes")

# Create the bar chart
ggplot(crash_severity_df, aes(x = Crash_Severity, y = Number_of_Crashes, fill = Crash_Severity)) +
  geom_bar(stat = "identity", color = "black", show.legend = FALSE) +
  scale_fill_manual(values = c("Property Damage Crash" = "tomato", "Injury Crash" = "lightblue", "Fatal Crash" = "red")) +
  theme_minimal() +
  labs(title = "Distribution of Crashes by Severity",
       x = "Crash Severity", y = "Number of Crashes") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This plot shows the most common type of crashes and their severity. From the looks of it, not that many of the crashes are fatal, but more than half result in serious injury.

Geographical Map of Crashes (Suspected Injury - Fatal)

# Filter out "NO APPARENT INJURY" records because too many slows down map
filtered_crash_data <- crash_data %>% 
  filter(Injury.Severity != "NO APPARENT INJURY")

# colors for different injury level, excluding "NO APPARENT INJURY"
severity_colors <- c("POSSIBLE INJURY" = "yellow", 
                     "SUSPECTED MINOR INJURY" = "orange", 
                     "SUSPECTED SERIOUS INJURY" = "red", 
                     "FATAL INJURY" = "darkred")

# Add color column based on injury severity
filtered_crash_data$color <- severity_colors[filtered_crash_data$`Injury.Severity`]

# Create map without "NO APPARENT INJURY"
map <- leaflet(filtered_crash_data) %>%
  addTiles() %>%  #  OpenStreetMap map tiles
  addCircleMarkers(~Longitude, ~Latitude, color = ~color,
                   popup = ~paste('Injury Severity:', as.character(`Injury.Severity`),
                                  '<br>Weather:', as.character(Weather),
                                  '<br>Light:', as.character(Light),
                                  '<br>Date:', as.Date(Crash.Date.Time),
                                  '<br>Car Condition:',as.character(Vehicle.Damage.Extent),
                                  '<br>Final Action:', as.character(Vehicle.Movement)),
                   radius = 5, opacity = 1, fillOpacity = 1)

# Add legend
map <- map %>% addLegend("bottomright", 
                         colors = severity_colors, 
                         labels = names(severity_colors),
                         title = "Injury Severity")

# Print the map
map

This plot has all car crashes with a suspected injury to fatal, showing the date, last action of the driver, light condition, and all color coded based on crash severity.

Income Data Visualization

Distribution of Median Household Income by ZIP Code

ggplot(income_data, aes(x = Median_Household_Income)) +
  geom_histogram(binwidth = 10000, fill = "cornflowerblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Median Household Income by ZIP Code",
       x = "Median Household Income",
       y = "Frequency")

Warning: Removed 56 rows containing non-finite outside the scale range
(`stat_bin()`).

This histogram shows the distribution of medium household income across different zipcodes.