Lab 3: Exploratory Data Analysis

Author

Amanda Rose Knudsen

Overview

This is a two-part lab where each part focuses on a different dataset. The first part uses a dataset of diagnostic measurements taken on members of the Akimel O’odham people (an indigenous group living in the Southwestern United States, also called the Pima) to understand diabetes risk (click here to download diabetes.csv). The second dataset contains information on traffic accidents in New York City in the months of July and August of this year, compiled by NYC Open Data (click here to download crashes.csv).

For this problem set you will need to install the skimr and GGally packages; in particular, you will use the functions skim and ggpairs.
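
If you don’t already have them, both packages can be installed once with install.packages (a minimal sketch; skip if already installed):

install.packages(c("skimr", "GGally"))  # one-time installation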

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

We will also explore the concept of an inlier: an erroneous value that occurs in the interior of a variable’s distribution, rather than in the tails. The US Census Bureau published an article on the problem of inliers here.

Part 1: Health Diagnostics and Diabetes Incidence

Problem 1: Data Description and Outliers.

Load diabetes.csv into R and take a look at the data using the skimr package (make sure to install it if you don’t have it). Skimr provides a tidy summary function called skim. Use skim on the data frame that you loaded from diabetes.csv.

diabetes = read_csv("Lab03_Data/diabetes.csv")
skim(diabetes)
Data summary
Name diabetes
Number of rows 768
Number of columns 9
_______________________
Column type frequency:
numeric 9
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pregnancies 0 1 3.85 3.37 0.00 1.00 3.00 6.00 17.00 ▇▃▂▁▁
Glucose 0 1 120.89 31.97 0.00 99.00 117.00 140.25 199.00 ▁▁▇▆▂
BloodPressure 0 1 69.11 19.36 0.00 62.00 72.00 80.00 122.00 ▁▁▇▇▁
SkinThickness 0 1 20.54 15.95 0.00 0.00 23.00 32.00 99.00 ▇▇▂▁▁
Insulin 0 1 79.80 115.24 0.00 0.00 30.50 127.25 846.00 ▇▁▁▁▁
BMI 0 1 31.99 7.88 0.00 27.30 32.00 36.60 67.10 ▁▃▇▂▁
DiabetesPedigreeFunction 0 1 0.47 0.33 0.08 0.24 0.37 0.63 2.42 ▇▃▁▁▁
Age 0 1 33.24 11.76 21.00 24.00 29.00 41.00 81.00 ▇▃▁▁▁
Outcome 0 1 0.35 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅

Skim will list several variables. Pregnancies is the number of past pregnancies (the dataset includes only women 21 years or older). Glucose is the blood glucose concentration after an oral glucose tolerance test (drinking a sugary drink and measuring two hours later). SkinThickness is the result of a skinfold thickness test taken at the triceps (upper arm). Insulin is the blood insulin concentration taken at the same time as the glucose measurement (insulin is a hormone that transports glucose into cells). BMI is “Body Mass Index”. DiabetesPedigreeFunction is a measure of diabetes risk based on each patient’s family history of diabetes (an engineered feature). Outcome is equal to 1 if the patient was diagnosed with diabetes within 5 years and 0 otherwise.

  1. Skim should show no missing data, but should indicate potential data issues. Do any of the percentile ranges (p0, p25, p50, p75, or p100) for the reported variables suggest a potential problem? The SkinThickness variable shows 0.00 at both p0 and p25, as does Insulin. This suggests a problem with the data: a skinfold thickness of 0 cannot be a legitimate measurement (it would mean the patient has no skin at the triceps), so 0 almost certainly stands in for NULL (no data). The same logic applies to Insulin and Glucose, since a living patient cannot have a blood glucose or insulin concentration of exactly 0; note also that Insulin jumps from 0.00 at p25 to 30.5 at p50. Likewise, the p0 values of 0.00 for BloodPressure and BMI are physically impossible and likely represent missing data coded as 0. By contrast, a p0 of 0.00 is legitimate for Pregnancies, since the number of past pregnancies can be 0, and for Outcome, which is a 0-or-1 indicator.
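
A quick way to quantify this suspicion is to count the zeros in each suspect column directly; a sketch using across (the column list is my assumption from the discussion above):

diabetes |>
  summarize(across(c(Glucose, BloodPressure, SkinThickness, Insulin, BMI),
                   \(x) sum(x == 0)))  # number of zeros per suspect variable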

  2. Further investigate the dataset to find potentially problematic variables using a qq-plot (geom_qq) or group_by combined with count and arrange. For which variables do you find repeated values and what are those values? Do you believe these values represent real measurements or could they correspond to missing data? Do the repeated values occur in the same rows or different rows?

Below I use qq-plots for the variables that deserve further investigation. This excludes Pregnancies, Age, and the engineered DiabetesPedigreeFunction. After looking at each variable individually and at groups of them together, all 5 of the variables I was skeptical about have 0.000 values that correspond to missing data: BloodPressure, SkinThickness, Insulin, Glucose, and BMI. Glucose = 0 occurs rarely (only 5 occurrences) and mostly not in the same rows as the other repeated values (although 4 of those 5 rows also have Insulin = 0). Insulin = 0 occurs in the same 33 rows as SkinThickness = 0 and BloodPressure = 0 (including 7 records where BloodPressure, SkinThickness, BMI, and Insulin all equal 0), but it most often co-occurs with missing SkinThickness: 227 rows have SkinThickness ‘missing’ at the same time as Insulin is ‘missing’.

q1b_v1 <- diabetes |> 
  group_by(BloodPressure, SkinThickness, BMI, Insulin) |>
  summarize(
    countgroup = n(),
  ) |> 
  arrange(desc(countgroup))
`summarise()` has grouped output by 'BloodPressure', 'SkinThickness', 'BMI'.
You can override using the `.groups` argument.
q1b_v2 <- diabetes |> 
  group_by(BloodPressure, SkinThickness, Insulin) |>
  summarize(
    countgroup = n(),
  ) |> 
  arrange(desc(countgroup))
`summarise()` has grouped output by 'BloodPressure', 'SkinThickness'. You can
override using the `.groups` argument.
q1b_v3 <-diabetes |> 
  group_by(SkinThickness, Insulin) |>
  summarize(
    countgroup = n(),
  ) |> 
  arrange(desc(countgroup))
`summarise()` has grouped output by 'SkinThickness'. You can override using the
`.groups` argument.
print(q1b_v1)
# A tibble: 756 × 5
# Groups:   BloodPressure, SkinThickness, BMI [755]
   BloodPressure SkinThickness   BMI Insulin countgroup
           <dbl>         <dbl> <dbl>   <dbl>      <int>
 1             0             0   0         0          7
 2             0             0  30         0          3
 3            64             0  27.4       0          2
 4            70             0  34.2       0          2
 5            76             0  45.3       0          2
 6            78            31  27.6       0          2
 7             0             0  19.6       0          1
 8             0             0  21.1       0          1
 9             0             0  22.2       0          1
10             0             0  23.5       0          1
# ℹ 746 more rows
print(q1b_v2)
# A tibble: 569 × 4
# Groups:   BloodPressure, SkinThickness [430]
   BloodPressure SkinThickness Insulin countgroup
           <dbl>         <dbl>   <dbl>      <int>
 1             0             0       0         33
 2            80             0       0         15
 3            74             0       0         14
 4            76             0       0         14
 5            78             0       0         14
 6            70             0       0         13
 7            72             0       0         11
 8            62             0       0         10
 9            60             0       0          8
10            64             0       0          8
# ℹ 559 more rows
print(q1b_v3)
# A tibble: 419 × 3
# Groups:   SkinThickness [51]
   SkinThickness Insulin countgroup
           <dbl>   <dbl>      <int>
 1             0       0        227
 2            31       0         11
 3            32       0         11
 4            27       0          9
 5            30       0          9
 6            22       0          7
 7            28       0          7
 8            19       0          6
 9            39       0          6
10            40       0          6
# ℹ 409 more rows

Next we look at each variable using geom_qq followed by a group_by count.

ggplot(diabetes, aes(sample = Glucose)) +
  geom_qq() +
  labs(title = "QQ Plot of Glucose in Diabetes Dataset")

diabetes |> 
  group_by(Glucose) |>
  summarize(
    count = n(),
  ) 
# A tibble: 136 × 2
   Glucose count
     <dbl> <int>
 1       0     5
 2      44     1
 3      56     1
 4      57     2
 5      61     1
 6      62     1
 7      65     1
 8      67     1
 9      68     3
10      71     4
# ℹ 126 more rows
ggplot(diabetes, aes(sample = BloodPressure)) +
  geom_qq() +
  labs(title = "QQ Plot of Blood Pressure in Diabetes Dataset")

diabetes |> 
  group_by(BloodPressure) |>
  summarize(
    count = n(),
  ) 
# A tibble: 47 × 2
   BloodPressure count
           <dbl> <int>
 1             0    35
 2            24     1
 3            30     2
 4            38     1
 5            40     1
 6            44     4
 7            46     2
 8            48     5
 9            50    13
10            52    11
# ℹ 37 more rows
ggplot(diabetes, aes(sample = SkinThickness)) +
  geom_qq() +
  labs(title = "QQ Plot of Skin Thickness in Diabetes Dataset")

diabetes |> 
  group_by(SkinThickness) |>
  summarize(
    count = n(),
  ) 
# A tibble: 51 × 2
   SkinThickness count
           <dbl> <int>
 1             0   227
 2             7     2
 3             8     2
 4            10     5
 5            11     6
 6            12     7
 7            13    11
 8            14     6
 9            15    14
10            16     6
# ℹ 41 more rows
ggplot(diabetes, aes(sample = Insulin)) +
  geom_qq() +
  labs(title = "QQ Plot of Insulin in Diabetes Dataset")

diabetes |> 
  group_by(Insulin) |>
  summarize(
    count = n(),
  ) 
# A tibble: 186 × 2
   Insulin count
     <dbl> <int>
 1       0   374
 2      14     1
 3      15     1
 4      16     1
 5      18     2
 6      22     1
 7      23     2
 8      25     1
 9      29     1
10      32     1
# ℹ 176 more rows
ggplot(diabetes, aes(sample = BMI)) +
  geom_qq() +
  labs(title = "QQ Plot of BMI in Diabetes Dataset")

diabetes |> 
  group_by(BMI) |>
  summarize(
    count = n(),
  ) 
# A tibble: 248 × 2
     BMI count
   <dbl> <int>
 1   0      11
 2  18.2     3
 3  18.4     1
 4  19.1     1
 5  19.3     1
 6  19.4     1
 7  19.5     2
 8  19.6     3
 9  19.9     1
10  20       1
# ℹ 238 more rows

Write an overview of which values are missing and replace all missing values with NA for the next stage of analysis.

As stated above, the variables BloodPressure, SkinThickness, Insulin, Glucose, and BMI use 0 to represent missing values, so we recode those zeros as NA.

diabetes2 <- diabetes |> 
  mutate(
    BloodPressure = if_else(BloodPressure == 0, NA, BloodPressure),
    SkinThickness = if_else(SkinThickness == 0, NA, SkinThickness),
    Insulin = if_else(Insulin == 0, NA, Insulin),
    Glucose = if_else(Glucose == 0, NA, Glucose),
    BMI = if_else(BMI == 0, NA, BMI)
  )
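
An equivalent, more compact recoding uses na_if with across; this sketch should produce the same diabetes2:

diabetes2 <- diabetes |>
  mutate(across(c(BloodPressure, SkinThickness, Insulin, Glucose, BMI),
                \(x) na_if(x, 0)))  # recode sentinel zeros as NA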
  1. Perform Tukey Box plots on each variable to identify potential outliers. Which variables have the most outliers? Are there any outliers that you think come from measurement error? If so remove them.
ggplot(diabetes2, aes(y = Pregnancies)) +
  geom_boxplot()

ggplot(diabetes2, aes(y = Glucose)) +
  geom_boxplot()
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot(diabetes2, aes(y = BloodPressure)) +
  geom_boxplot()
Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot(diabetes2, aes(y = SkinThickness)) +
  geom_boxplot()
Warning: Removed 227 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot(diabetes2, aes(y = Insulin)) +
  geom_boxplot()
Warning: Removed 374 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot(diabetes2, aes(y = BMI)) +
  geom_boxplot()
Warning: Removed 11 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot(diabetes2, aes(y = DiabetesPedigreeFunction)) +
  geom_boxplot()

ggplot(diabetes2, aes(y = Age)) +
  geom_boxplot()

ggplot(diabetes2, aes(y = Outcome)) +
  geom_boxplot()
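
The Tukey fences can also be computed numerically to rank variables by outlier count; a sketch (tukey_outliers is a helper defined here for this sketch, not part of the assignment):

# count Tukey outliers (beyond 1.5 * IQR outside the quartiles) in a vector
tukey_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

diabetes2 |>
  summarize(across(where(is.numeric), tukey_outliers))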

Some variables, like DiabetesPedigreeFunction, SkinThickness, and Insulin, appear to have many outliers, but without knowing the normal expected range for each measurement it is not clear that these reflect measurement error. Other variables, like BloodPressure, have few outliers overall, but those few are plausibly measurement errors: a blood pressure as low as 27 is considered dangerously low. Because we lack expertise in how the data were gathered and how DiabetesPedigreeFunction was computed (it also has many outliers), we cannot confidently remove or replace any outliers at this time; doing so might unintentionally degrade the dataset. Until we know more about the source and context of the data and the sample population (for instance, whether it included many people with existing insulin or diabetes issues, or other comorbidities that might explain or at least contextualize some of the apparent outliers), it is reasonable to leave the outliers unchanged.

Problem 2: Pair Plots

Use the GGally package and its function ggpairs on both the original dataset and the cleaned dataset. Which correlations change the most? What are the strongest correlations between variables overall and with the Outcome?

diabetes |> 
  ggpairs(
    upper = list(continuous = wrap("cor", size = 3)),      # smaller correlation text
    lower = list(continuous = wrap("points", size = 0.5))  # smaller points
  ) +
  labs(title = "Original Diabetes Dataset") +
  theme(
    strip.text = element_text(size = 4, face = "bold",
                              margin = margin(t = 2, r = 0, b = 1, l = 0)
                              ),
    axis.text.y = element_text(size = 6, face = "bold"),
    axis.text.x = element_text(size = 6, face = "bold")
    )

diabetes2 |> 
  ggpairs(
    upper = list(continuous = wrap("cor", size = 3)),      # smaller correlation text
    lower = list(continuous = wrap("points", size = 0.5))  # smaller points
  ) +
  labs(title = "Modified Diabetes Dataset") +
  theme(
    strip.text = element_text(size = 4, face = "bold",
                              margin = margin(t = 2, r = 0, b = 1, l = 0)
                              ),
    axis.text.y = element_text(size = 6, face = "bold"),
    axis.text.x = element_text(size = 6, face = "bold")
    )

Some of the correlations that changed the most:

  • Insulin and SkinThickness (0.437 in original, 0.185 in modified)
  • SkinThickness and Outcome (0.075 in original, 0.259 in modified)
  • Insulin and Age (-0.042 in original, 0.220 in modified)
  • BMI and SkinThickness (0.393 in original, 0.648 in modified)

The strongest correlations between variables overall:

  • Pregnancies and Age (0.544 in both original and modified)
  • Insulin and Glucose (0.581 in modified, but just 0.331 in original)
  • BMI and SkinThickness (0.648 in modified, but just 0.393 in original)

The strongest correlations with the Outcome:

  • Outcome and Glucose (0.467 in original, 0.495 in modified)
  • Outcome and BMI (0.314 in modified, slightly lower at 0.293 in original)
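
Rather than reading values off the plots, the two correlation matrices can also be compared numerically; a sketch (using pairwise-complete observations for the cleaned data is my assumption about the comparable computation):

cor_orig  <- cor(diabetes)
cor_clean <- cor(diabetes2, use = "pairwise.complete.obs")
round(cor_clean - cor_orig, 3)  # largest entries mark the biggest changes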

  • Remark: This dataset has been used as a model dataset for the construction of binary classifiers using machine learning, and there are a large number of published studies showing these analyses. However, many of these analyses did not exclude the missing values erroneously coded as zero, as discussed in this interesting paper by Breault, leading to highly degraded accuracy.

Part 2: Car Crashes in NYC

Problem 3: Finding Inliers and Missing Data

Load the NYC car crash dataset using read_csv. You can download the data from the course website by clicking here.

crashes = read_csv("Lab03_Data/crashes.csv")
  1. Which variables have missing data (use skim or another tool of your choosing)? Some missing values have a different interpretation than others: what does it mean when VEHICLE TYPE CODE 2 is missing compared to LATITUDE?
skim(crashes)
Data summary
Name crashes
Number of rows 14720
Number of columns 29
_______________________
Column type frequency:
character 16
difftime 1
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
CRASH DATE 0 1.00 10 10 0 61 0
BOROUGH 4559 0.69 5 13 0 5 0
LOCATION 9120 0.38 10 23 0 4754 0
ON STREET NAME 4179 0.72 6 32 0 2167 0
CROSS STREET NAME 7440 0.49 6 32 0 2342 0
OFF STREET NAME 10541 0.28 10 36 0 4037 0
CONTRIBUTING FACTOR VEHICLE 1 90 0.99 5 53 0 51 0
CONTRIBUTING FACTOR VEHICLE 2 3305 0.78 5 53 0 35 0
CONTRIBUTING FACTOR VEHICLE 3 13349 0.09 11 30 0 15 0
CONTRIBUTING FACTOR VEHICLE 4 14381 0.02 11 30 0 9 0
CONTRIBUTING FACTOR VEHICLE 5 14610 0.01 11 30 0 4 0
VEHICLE TYPE CODE 1 214 0.99 2 38 0 117 0
VEHICLE TYPE CODE 2 4769 0.68 2 35 0 140 0
VEHICLE TYPE CODE 3 13471 0.08 2 35 0 34 0
VEHICLE TYPE CODE 4 14401 0.02 4 35 0 12 0
VEHICLE TYPE CODE 5 14613 0.01 5 35 0 4 0

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
CRASH TIME 0 1 0 secs 86340 secs 50820 secs 1405

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ZIP CODE 4560 0.69 10889.33 533.32 10001.00 10456.00 11208.00 11238.00 11697.0 ▅▃▁▇▅
LATITUDE 9120 0.38 39.99 5.42 0.00 40.66 40.72 40.77 41.4 ▁▁▁▁▇
LONGITUDE 9120 0.38 -72.58 9.84 -74.33 -73.96 -73.92 -73.86 0.0 ▇▁▁▁▁
NUMBER OF PERSONS INJURED 0 1.00 0.63 0.91 0.00 0.00 0.00 1.00 12.0 ▇▁▁▁▁
NUMBER OF PERSONS KILLED 0 1.00 0.00 0.06 0.00 0.00 0.00 0.00 4.0 ▇▁▁▁▁
NUMBER OF PEDESTRIANS INJURED 0 1.00 0.09 0.30 0.00 0.00 0.00 0.00 7.0 ▇▁▁▁▁
NUMBER OF PEDESTRIANS KILLED 0 1.00 0.00 0.04 0.00 0.00 0.00 0.00 4.0 ▇▁▁▁▁
NUMBER OF CYCLIST INJURED 0 1.00 0.07 0.26 0.00 0.00 0.00 0.00 2.0 ▇▁▁▁▁
NUMBER OF CYCLIST KILLED 0 1.00 0.00 0.01 0.00 0.00 0.00 0.00 1.0 ▇▁▁▁▁
NUMBER OF MOTORIST INJURED 0 1.00 0.45 0.90 0.00 0.00 0.00 1.00 12.0 ▇▁▁▁▁
NUMBER OF MOTORIST KILLED 0 1.00 0.00 0.05 0.00 0.00 0.00 0.00 2.0 ▇▁▁▁▁
COLLISION_ID 0 1.00 4745479.90 4398.30 4737173.00 4741762.75 4745506.50 4749205.25 4755325.0 ▆▇▇▇▂

The variables that have missing data are:

  • BOROUGH (4559 missing)
  • LOCATION (9120 missing)
  • ON STREET NAME (4179 missing)
  • CROSS STREET NAME (7440 missing)
  • OFF STREET NAME (10541 missing)
  • CONTRIBUTING FACTOR VEHICLE 1 (90 missing)
  • CONTRIBUTING FACTOR VEHICLE 2 (3305 missing)
  • CONTRIBUTING FACTOR VEHICLE 3 (13349 missing)
  • CONTRIBUTING FACTOR VEHICLE 4 (14381 missing)
  • CONTRIBUTING FACTOR VEHICLE 5 (14610 missing)
  • VEHICLE TYPE CODE 1 (214 missing)
  • VEHICLE TYPE CODE 2 (4769 missing)
  • VEHICLE TYPE CODE 3 (13471 missing)
  • VEHICLE TYPE CODE 4 (14401 missing)
  • VEHICLE TYPE CODE 5 (14613 missing)
  • ZIP CODE (4560 missing)
  • LATITUDE (9120 missing)
  • LONGITUDE (9120 missing)

Missing values in a field like VEHICLE TYPE CODE 2 have a different interpretation than missing LATITUDE. If a crash involved only one vehicle, there is no second vehicle type to record, so a missing VEHICLE TYPE CODE 2 is structural rather than an error. A missing LATITUDE, by contrast, is a genuine lack of information, similar to a missing LOCATION (and without a location we cannot identify a latitude). Since exactly as many LOCATION values are missing as LATITUDE/LONGITUDE values, we can presume that LATITUDE/LONGITUDE are filled in automatically from LOCATION; they are missing because their underlying source is missing. Why a LOCATION would be missing is less clear: poor data quality, possible legal reasons for withholding a location (e.g. a private residence?), or general lack of information about the crash.
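
One way to probe the single-vehicle interpretation is to check whether the two second-vehicle fields tend to be missing together; a cross-tabulation sketch:

crashes |>
  count(
    type2_missing   = is.na(`VEHICLE TYPE CODE 2`),
    factor2_missing = is.na(`CONTRIBUTING FACTOR VEHICLE 2`)
  )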

  1. Latitude and Longitude have the same number of missing values. Verify that they always occur in the same row. Check the counts of latitude and longitude values: do you find any hidden missing values? If so recode them as NA.
crashes |> 
  group_by(LATITUDE, LONGITUDE) |> 
  summarize(
    count = n()
  ) |> 
  arrange(desc(count))
`summarise()` has grouped output by 'LATITUDE'. You can override using the
`.groups` argument.
# A tibble: 4,755 × 3
# Groups:   LATITUDE [4,591]
   LATITUDE LONGITUDE count
      <dbl>     <dbl> <int>
 1     NA        NA    9120
 2      0         0     101
 3     40.7     -73.9     7
 4     40.8     -73.9     7
 5     40.6     -74.1     6
 6     40.6     -74.1     6
 7     40.7     -73.9     6
 8     40.7     -73.7     6
 9     40.7     -73.8     6
10     40.8     -74.0     6
# ℹ 4,745 more rows

We can see that yes, all 9120 of the NA values for LATITUDE occur in the same rows as the 9120 NA values for LONGITUDE. We also observe there are 101 records where LATITUDE and LONGITUDE are 0 – we will replace these with NA.
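
A more direct verification, as a one-line sketch: if the two missingness indicators always agree, this count returns a single row with the value TRUE.

crashes |>
  count(lat_lng_missing_together = is.na(LATITUDE) == is.na(LONGITUDE))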

crashes2 <- crashes |> 
  mutate(
    LATITUDE = if_else(LATITUDE == 0, NA, LATITUDE),
    LONGITUDE = if_else(LONGITUDE == 0, NA, LONGITUDE)
  ) 
  1. Many of the geographic values are missing, but geographic information is redundant in multiple variables in the dataset. For example, with effort you could determine the borough of an accident from the zip code, the latitude and longitude, or the streets (not part of the assignment for this week). Consider the borough variable: what percentage of the missing values of borough have a value present for at least one of zip code or latitude? What about if we include all the street name variables? What fraction of rows don’t have any detailed location information (latitude, zip code, or street names)?
nullboro <- crashes2 |> 
  filter(is.na(BOROUGH)) 

nullboro <- nullboro |> 
  mutate(
    # TRUE when both zip code and latitude are missing
    ZipAndLatNull = is.na(`ZIP CODE`) & is.na(LATITUDE),
    # TRUE when all three street name fields are missing
    StreetNamesNull = is.na(`ON STREET NAME`) & 
      is.na(`CROSS STREET NAME`) & 
      is.na(`OFF STREET NAME`)
  ) |> 
  mutate(
    # TRUE when no location information of any kind is present
    NoLocInfo = ZipAndLatNull & StreetNamesNull
  )


# glimpse(nullboro)

# nullboro |> 
#  filter(is.na(`ON STREET NAME`))

Above we created a separate dataframe of just those observations with an NA borough, and used mutate to add indicator variables: one marking when both zip code and latitude are missing, one marking when all three street names are missing, and one combining the two.

nullboro |> 
  group_by(ZipAndLatNull) |> 
  summarize(
    countZipAndLatNull = n()
    )|> 
  mutate(
    pctZipAndLatNull = countZipAndLatNull / sum(countZipAndLatNull) *100
  )
# A tibble: 2 × 3
  ZipAndLatNull countZipAndLatNull pctZipAndLatNull
  <lgl>                      <int>            <dbl>
1 FALSE                       2062             45.2
2 TRUE                        2497             54.8

We can see that 45.2% of the observations with an NA borough have at least one of zip code or latitude present.

nullboro |> 
  group_by(StreetNamesNull) |> 
  summarize(
    countStreetNamesNull = n()
    )|> 
  mutate(
    pctStreetNamesNull = countStreetNamesNull / sum(countStreetNamesNull) *100
  )
# A tibble: 1 × 3
  StreetNamesNull countStreetNamesNull pctStreetNamesNull
  <lgl>                          <int>              <dbl>
1 FALSE                           4559                100

It appears that all (100%) of the observations with an NA borough have at least one street name recorded.

nullboro |> 
  group_by(NoLocInfo) |> 
  summarize(
    countNoLocInfo = n()
    )|> 
  mutate(
    pctNoLocInfo = countNoLocInfo / sum(countNoLocInfo) *100
  )
# A tibble: 1 × 3
  NoLocInfo countNoLocInfo pctNoLocInfo
  <lgl>              <int>        <dbl>
1 FALSE               4559          100

Logically, this follows from the previous result: among the 4559 observations with no borough recorded, zero observations lack all street-related information, so zero observations lack location information entirely (no zip code, no latitude, and none of the street names).
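
The same indicators can also be computed over the full dataset to get the fraction of all rows with no detailed location information; a sketch:

crashes2 |>
  summarize(
    frac_no_location = mean(
      is.na(`ZIP CODE`) & is.na(LATITUDE) &
        is.na(`ON STREET NAME`) & is.na(`CROSS STREET NAME`) &
        is.na(`OFF STREET NAME`)
    )
  )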

  1. The CRASH TIME variable has no missing values. Compute the count of how many times each individual time occurs in the crash data set. This will suggest that there are some inliers in the data. Compute summary statistics on the count data, and determine how many inliers there are (define an inlier as a data value where the count is an outlier, i.e. the count of that value is greater than 1.5*IQR + P75, i.e. 1.5 times the interquartile range past the 75th percentile for the distribution of counts for values of that variable.) For which inliers do you believe the time is most likely to be accurate? For which is it least likely to be accurate and why do you think so?
crashes2 |> 
  group_by(`CRASH TIME`) |> 
  summarize(
    number_of_occurences = n()
  ) |> 
  arrange(desc(number_of_occurences))
# A tibble: 1,405 × 2
   `CRASH TIME` number_of_occurences
   <time>                      <int>
 1 00:00                         306
 2 16:00                         152
 3 19:00                         149
 4 12:00                         134
 5 13:00                         129
 6 14:00                         128
 7 17:00                         125
 8 15:00                         120
 9 17:30                         114
10 10:00                         112
# ℹ 1,395 more rows
# crashes2 |> 
#   summary()

CrashTimeSummaryStats <- crashes2 |> 
  select(`CRASH TIME`) |> 
  summarize(
    number_of_occurences = n(),
    mean_crashtime = mean(`CRASH TIME`),
    median_crashtime = median(`CRASH TIME`),
    sd_crashtime = sd(`CRASH TIME`),
    min_crashtime = min(`CRASH TIME`),
    max_crashtime = max(`CRASH TIME`),
    iqr_crashtime = IQR(`CRASH TIME`),
    iqrTimes1point5 = (1.5 * IQR(`CRASH TIME`)),
    pctile25 = quantile(`CRASH TIME`, 0.25),
    pctile75 = quantile(`CRASH TIME`, 0.75)
  ) 

CrashTimeSummaryStats
# A tibble: 1 × 10
  number_of_occurences mean_crashtime median_crashtime sd_crashtime
                 <int> <drtn>         <drtn>                  <dbl>
1                14720 47840.54 secs  50820 secs             23405.
# ℹ 6 more variables: min_crashtime <drtn>, max_crashtime <drtn>,
#   iqr_crashtime <dbl>, iqrTimes1point5 <dbl>, pctile25 <time>,
#   pctile75 <time>

We can see that the median is 50820 secs and 1.5 * IQR = 52290, so the lower and upper ends of our range are:

lower <- 50820 - 52290
upper <- 50820 + 52290

print(lower)
[1] -1470
print(upper)
[1] 103110
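
Since the prompt defines an inlier through the distribution of counts (a value whose count exceeds P75 + 1.5 * IQR of the counts), the threshold can also be computed on the counts themselves; a sketch (time_counts and count_threshold are names introduced here):

# distribution of counts per CRASH TIME value
time_counts <- crashes2 |>
  count(`CRASH TIME`, name = "n_crashes")

# count-based threshold: P75 of the counts plus 1.5 * IQR of the counts
count_threshold <- quantile(time_counts$n_crashes, 0.75) +
  1.5 * IQR(time_counts$n_crashes)

# CRASH TIME values whose counts exceed the threshold are the inliers
time_counts |>
  filter(n_crashes > count_threshold) |>
  arrange(desc(n_crashes))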

I am skeptical that the 0:00 crash times are all accurate. There are 306 occurrences of 0:00 in the dataset (the most common time, i.e. the mode), and the next-most-common crash times are all exactly on the hour. It is highly unlikely that so many crashes occurred precisely at the stroke of midnight, or exactly on any hour; far more likely, these are approximate times entered as a convenience of documentation/data-entry. The on-the-hour inliers therefore seem least likely to be accurate. The inliers that seem most plausible are less-round times such as 16:15:00 (49 occurrences), although even these were probably estimated to the nearest quarter-hour; it is hard to know a crash time with precision unless there was video footage or something similar.
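
A quick check of this rounding hypothesis is to tabulate the minutes past the hour; spikes at 0, 15, 30, and 45 would support it. A sketch (CRASH TIME is stored as seconds since midnight):

crashes2 |>
  mutate(minute = (as.numeric(`CRASH TIME`) %/% 60) %% 60) |>
  count(minute) |>
  arrange(desc(n))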

Problem 4: Finding Patterns in the Data

Formulate a question about crash data in NYC and make visualizations to explore your question. It could be related to the geographical distribution of accidents, the timing of accidents, which types of vehicles lead to more or less dangerous accidents, or anything else you want. Write comments/notes describing your observations for each visualization you create and mention how these observations impact your initial hypotheses.

Useful questions to consider when you observe a pattern:

  • Could this pattern be due to coincidence (i.e. random chance)?
  • How can you describe the relationship implied by the pattern?
  • How strong is the relationship implied by the pattern?
  • What other variables might affect the relationship?
  • Does the relationship change if you look at individual subgroups of the data?

I’m going to look into vehicle types, contributing factors, and the ‘dangerousness’ of accidents, as defined by the outcome of injury or death. Can we see patterns in the involvement of certain vehicle types or contributing factors, such as trucks being more likely to produce injuries, or alcohol being a more likely contributing factor in deaths? Since most crashes result in neither injury nor death, I expect some potentially misleading correlations: the overall rarity of deaths in the data does not mean car crashes are unlikely to cause death; in fact, they are among the leading causes of death in the U.S. I also expect relationships to change across subgroups (for instance, pedestrian versus motorist injuries, or the ‘primary’ vehicle versus the secondary one, where the data distinguishes them).

problem4 <- crashes2 |> 
  rename(
    FactorV1 = `CONTRIBUTING FACTOR VEHICLE 1`,
    FactorV2 = `CONTRIBUTING FACTOR VEHICLE 2`,
    FactorV3 = `CONTRIBUTING FACTOR VEHICLE 3`,
    FactorV4 = `CONTRIBUTING FACTOR VEHICLE 4`,
    FactorV5 = `CONTRIBUTING FACTOR VEHICLE 5`,
    TypeV1 = `VEHICLE TYPE CODE 1`,
    TypeV2 = `VEHICLE TYPE CODE 2`,
    TypeV3 = `VEHICLE TYPE CODE 3`,
    TypeV4 = `VEHICLE TYPE CODE 4`,
    TypeV5 = `VEHICLE TYPE CODE 5`,
    Deaths = `NUMBER OF PERSONS KILLED`,
    Injuries = `NUMBER OF PERSONS INJURED`,
    Deaths_Peds = `NUMBER OF PEDESTRIANS KILLED`,
    Injuries_Peds = `NUMBER OF PEDESTRIANS INJURED`,
    Deaths_Cyclist = `NUMBER OF CYCLIST KILLED`,
    Injuries_Cyclist = `NUMBER OF CYCLIST INJURED`,
    Deaths_Motorist = `NUMBER OF MOTORIST KILLED`,
    Injuries_Motorist = `NUMBER OF MOTORIST INJURED`
  ) 
    
     
glimpse(problem4) 
Rows: 14,720
Columns: 29
$ `CRASH DATE`        <chr> "07/02/2024", "07/02/2024", "07/02/2024", "07/02/2…
$ `CRASH TIME`        <time> 05:31:00, 16:15:00, 10:48:00, 18:37:00, 16:47:00,…
$ BOROUGH             <chr> NA, "QUEENS", NA, "QUEENS", "BRONX", NA, NA, NA, "…
$ `ZIP CODE`          <dbl> NA, 11414, NA, 11105, 10451, NA, NA, NA, 10307, 11…
$ LATITUDE            <dbl> 40.64655, 40.65958, NA, 40.77659, NA, NA, NA, NA, …
$ LONGITUDE           <dbl> -74.01575, -73.83705, NA, -73.90415, NA, NA, NA, N…
$ LOCATION            <chr> "(40.646553, -74.01575)", "(40.65958, -73.83705)",…
$ `ON STREET NAME`    <chr> "52 STREET", "96 STREET", "JACKIE ROBINSON PKWY", …
$ `CROSS STREET NAME` <chr> NA, "159 AVENUE", NA, NA, "CANAL ST WEST", NA, "VA…
$ `OFF STREET NAME`   <chr> NA, NA, NA, "20-39     37 STREET", NA, NA, NA, NA,…
$ Injuries            <dbl> 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0,…
$ Deaths              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Injuries_Peds       <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ Deaths_Peds         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Injuries_Cyclist    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Deaths_Cyclist      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Injuries_Motorist   <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,…
$ Deaths_Motorist     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FactorV1            <chr> "Unspecified", "Other Vehicular", "Passing or Lane…
$ FactorV2            <chr> "Unspecified", NA, "Unspecified", "Unspecified", "…
$ FactorV3            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FactorV4            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ FactorV5            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ COLLISION_ID        <dbl> 4739948, 4738204, 4737448, 4743612, 4746614, 47478…
$ TypeV1              <chr> "Sedan", NA, "PK", "Sedan", "Station Wagon/Sport U…
$ TypeV2              <chr> "Tractor Truck Diesel", NA, "Station Wagon/Sport U…
$ TypeV3              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ TypeV4              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ TypeV5              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
problem4 <- problem4 |> 
  select(starts_with("Factor") | starts_with("Type") | starts_with("Deaths") | starts_with("Injuries"))

problem4 |> 
  group_by(FactorV1, Deaths) |> 
  summarise(
    n = n()
  ) |> 
  arrange(desc(Deaths))
`summarise()` has grouped output by 'FactorV1'. You can override using the
`.groups` argument.
# A tibble: 64 × 3
# Groups:   FactorV1 [52]
   FactorV1                       Deaths     n
   <chr>                           <dbl> <int>
 1 Unspecified                         4     1
 2 Passing or Lane Usage Improper      2     1
 3 Unsafe Speed                        2     1
 4 Unspecified                         2     1
 5 Alcohol Involvement                 1     2
 6 Driver Inattention/Distraction      1     2
 7 Driver Inexperience                 1     1
 8 Failure to Yield Right-of-Way       1     3
 9 Passing or Lane Usage Improper      1     1
10 Traffic Control Disregarded         1     5
# ℹ 54 more rows

It appears that the deadliest contributing factor is ‘Unspecified’, which is unsatisfying for our inquiry into the relationship between these variables. What about the contributing factors that cause the most injuries?

problem4 |> 
  group_by(FactorV1, Injuries) |> 
  summarise(
    n = n()
  ) |> 
  arrange(desc(Injuries))
`summarise()` has grouped output by 'FactorV1'. You can override using the
`.groups` argument.
# A tibble: 225 × 3
# Groups:   FactorV1 [52]
   FactorV1                       Injuries     n
   <chr>                             <dbl> <int>
 1 Driver Inattention/Distraction       12     1
 2 Traffic Control Disregarded          11     1
 3 Driver Inattention/Distraction       10     1
 4 Driver Inattention/Distraction        9     1
 5 Following Too Closely                 9     1
 6 Traffic Control Disregarded           9     1
 7 Unspecified                           9     1
 8 Pavement Slippery                     8     1
 9 Reaction to Uninvolved Vehicle        8     1
10 Unsafe Speed                          8     1
# ℹ 215 more rows

We see that Driver Inattention/Distraction produced the single most injurious crash, with 12 injuries, the highest in the dataset. Next, let’s look at what the most common contributing factors and vehicle types are…

ggplot(problem4, aes(y = FactorV1)) +
  geom_bar()

We can see very clearly that the most common contributing factors of crashes are “Unspecified” and “Driver Inattention/Distraction”. What about instances where there’s a second contributing factor?

ggplot(problem4, aes(y = FactorV2)) +
  geom_bar()

Once again, we see unhelpful information: ‘Unspecified’ is at the top. So, let’s look more closely at the relationship between the first vehicle’s type and its contributing factor.

problem4b <-
  problem4 |> 
  select(TypeV1, FactorV1, Deaths, Injuries)


problem4b |> 
  group_by(TypeV1) |> 
  summarize(count_observations = n()) |> 
  arrange(desc(count_observations))
# A tibble: 118 × 2
   TypeV1                              count_observations
   <chr>                                            <int>
 1 Sedan                                             6537
 2 Station Wagon/Sport Utility Vehicle               4982
 3 Taxi                                               440
 4 Bike                                               377
 5 Pick-up Truck                                      340
 6 Box Truck                                          251
 7 Motorcycle                                         214
 8 <NA>                                               214
 9 Bus                                                210
10 Moped                                              172
# ℹ 108 more rows

Unfortunately, there are over 100 vehicle types for Vehicle 1 and over 50 contributing factors, which makes visualizations of these categorical variables extremely unwieldy. So, instead, we look first at the most common vehicle type in the recorded crashes: a Sedan, followed by a Station Wagon/Sport Utility Vehicle. Let’s see whether these two differ in “dangerousness”, as indicated by the number of injuries they’re associated with.

problem4c <- 
  problem4b |> 
  filter(TypeV1 == "Sedan" | TypeV1 == "Station Wagon/Sport Utility Vehicle")

ggplot(problem4c, aes(x = Injuries, fill = TypeV1)) +
  geom_bar()

Interestingly, most crashes don’t result in any injuries at all (0), and crashes with exactly 1 injury occur roughly half as often as crashes with 0 injuries for this subset of sedans and station wagons/SUVs.

Let’s look more closely at the most common reason (other than ‘Unspecified’) for crashes in the dataset: Driver Inattention / Distraction. How many injuries and deaths do we see with that value?

problem4d <-
  problem4b |> 
    filter(str_starts(FactorV1, "Driver Inattention")
  )

ggplot(problem4d, aes(x = Injuries)) +
  geom_bar()

ggplot(problem4d, aes(x = Deaths)) +
  geom_bar()

While most crashes result in 0 deaths, some do occur: at least one crash attributed to Driver Inattention/Distraction was fatal. Injuries are far more common by comparison, although the most frequent outcome is still 0 injuries. What are the vehicle types most associated with this contributing factor?

problem4d |> 
  group_by(TypeV1) |> 
  summarize(
    number_of_observations = n()
  ) |> 
  arrange(desc(number_of_observations))
# A tibble: 56 × 2
   TypeV1                              number_of_observations
   <chr>                                                <int>
 1 Sedan                                                 1607
 2 Station Wagon/Sport Utility Vehicle                   1237
 3 Bike                                                   117
 4 Taxi                                                   110
 5 Pick-up Truck                                           79
 6 Box Truck                                               64
 7 Motorcycle                                              53
 8 Bus                                                     43
 9 Moped                                                   41
10 <NA>                                                    38
# ℹ 46 more rows

Once again, the Sedan and the Station Wagon/SUV are at the top of the list. At this point in the EDA, we can conclude that this is most likely because these are the most common types of cars on the road.
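
One way to move past raw counts is to compare injury rates per crash across vehicle types, restricted to reasonably common types; a sketch (the n_crashes >= 100 cutoff is an arbitrary choice of mine):

problem4b |>
  filter(!is.na(TypeV1)) |>
  group_by(TypeV1) |>
  summarize(
    n_crashes = n(),
    injuries_per_crash = mean(Injuries)
  ) |>
  filter(n_crashes >= 100) |>
  arrange(desc(injuries_per_crash))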

Collectively, these observations do answer our initial question about commonality and dangerousness, but in a relatively unsatisfying way: we see more crashes involving the most common types of cars on the road. We did find out, however, that driver inattention/distraction is a major factor in crashes – and that there’s still some mystery in the dataset regarding what an “unspecified” contributing factor really is. There’s always more to look into! At this stage, I might go back to the beginning and pose a different question, one that could yield more certainty – or at least more specificity, and fewer ‘unspecified’ reasons.