Data merging

You’re looking at an R Markdown file (file>new file>r markdown) They’re great for creating documents that incorporate text and examples of code/code output (plots, etc.) - aka great for tutorials.

So, we’re going to play around with some crash data to learn how we might clean our data (and minimize human error!). R is a fantastic tool for data analysis. It is also a stellar tool for data cleaning and management. Rather than play around in Google sheets and accidentally delete something, change a value, or do some mind-numbingly repetitive action (there are shortcuts in sheets too…), you can code the management process out in R!

Merging crash data

First, you’ll want to download the data I’ve shared in the classroom stream for this example. You’ll find two .csv files: boise.csv & nampa.csv

Make sure to download these and get them into a spot you can find them.

These two datasets focus on the contributing circumstances to crashes in Boise and Nampa

OK, so let’s get them into R:

boiseData <- read.csv("boise.csv")
nampaData <- read.csv("nampa.csv")

Let’s see what’s going on…

Let’s get a summary of the structure of our dataframes using the str() function

str(boiseData)
## 'data.frame':    38 obs. of  2 variables:
##  $ Contributing.Circumstances..All.: chr  "Following Too Close" "Failed to Yield" "Inattention" "Failed to Obey Signal" ...
##  $ Total.Crashes                   : int  805 680 489 280 278 276 232 210 191 113 ...
str(nampaData)
## 'data.frame':    35 obs. of  2 variables:
##  $ Contributing.Circumstances..All.: chr  "Failed to Yield" "Following Too Close" "Inattention" "Improper Lane Change" ...
##  $ Total.Crashes                   : int  603 580 333 169 138 121 118 100 99 80 ...

So we can see we have two $s, or variables, in each dataframe: $ Contributing.Circumstances..All. and $ Total.Crashes

We could also get variable names using names()

names(boiseData)
## [1] "Contributing.Circumstances..All." "Total.Crashes"
names(nampaData)
## [1] "Contributing.Circumstances..All." "Total.Crashes"

Those are pretty gross names, so we’ll use this same function to reset them. In this case, we’ll feed the variable names attribute of the dataframes (accessed by the names() function) vectors (created with c()) made up of preferable names

names(boiseData) <- c("circumstances","total")
names(nampaData) <- c("circumstances","total")

Let’s see whether we have the same number of rows in these two dataframes

all.equal(boiseData, nampaData)
## [1] "Attributes: < Component \"row.names\": Numeric: lengths (38, 35) differ >"        
## [2] "Component \"circumstances\": Lengths (38, 35) differ (string compare on first 35)"
## [3] "Component \"circumstances\": 32 string mismatches"                                
## [4] "Component \"total\": Numeric: lengths (38, 35) differ"

So we don’t have the same # of rows… so, if we try to join these two dataframes we’re going to bump into some problems

Which categories are different (missing from the shorter Nampa dataframe)? To find out we can use the %in% operator The %in% operator essentially means “includes.” It allows us to see whether a vector (in this case a variable in a dataframe) includes all the things another vector does.

In this case we want to know which of the boiseData circumstances are NOT (!) included in the nampaData circumstances. So, in this next line of code we’re indexing the boiseData dataframe and asking R to go row by row and identifying whether there is not (!) a boiseData circumstance included in the nampaData circumstances

boiseData[!boiseData$circumstances %in% nampaData$circumstances,]
##                                   circumstances total
## 33                                 Light Defect     7
## 35 Truck Coupling, Trailer Hitch, Safety Chains     5
## 38                         Too Slow for Traffic     2

Let’s use the cbind() function anyway to see what happens when we try “bind” or combine the dataframes. If cbind() works we should have a new dataframe with all columns from both dataframes

cbind(boiseData,nampaData)
## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 38, 35

DOESN’T WORK! cbind() needs even dataframes with the SAME variables in the SAME order! Thankfully we have the merge() function! merge() takes a few arguments. Here we’re feeding it two dataframes, are specifying that we want to merge by the variable circumstances, and tell it to include all the values, not only those that are included in both dataframes

merge(boiseData, nampaData, by = "circumstances", all = T)
##                                   circumstances total.x total.y
## 1                              Alcohol Impaired     210     118
## 2                          Animal(s) in Roadway      48      16
## 3                      Asleep, Drowsy, Fatigued      54      32
## 4                                        Brakes      19      21
## 5                   Distracted IN or ON Vehicle     113      99
## 6                          Drove Left of Center      66      26
## 7                                 Drug Impaired      31      18
## 8       Emotional – Depressed, Angry, Disturbed      22       5
## 9                         Exceeded Posted Speed      36      22
## 10                      Failed to Maintain Lane     276     121
## 11                        Failed to Obey Signal     280     100
## 12                     Failed to Obey Stop Sign      68      80
## 13                             Failed to Signal       8       3
## 14                              Failed to Yield     680     603
## 15                          Following Too Close     805     580
## 16          Foot Slipped Off or Caught On Pedal      26      15
## 17                             Improper Backing      37      35
## 18                         Improper Lane Change     278     169
## 19                          Improper Overtaking      26      15
## 20                                Improper Turn     191      73
## 21                    Improper Use of Turn Lane       8       6
## 22                            Improperly Parked       9       2
## 23                                  Inattention     489     333
## 24                                 Light Defect       7      NA
## 25                                        Other      91      50
## 26                         Other Vehicle Defect      20       7
## 27                                Overcorrected      63      17
## 28                          Physical Impairment      15       8
## 29                            Previous Accident       4       2
## 30                                         Sick      24       8
## 31                Speed Too Fast For Conditions     232     138
## 32                                     Steering       5       5
## 33                                  Tire Defect      12      14
## 34                         Too Slow for Traffic       2      NA
## 35 Truck Coupling, Trailer Hitch, Safety Chains       5      NA
## 36                           Vision Obstruction      70      66
## 37                                 Wheel Defect       3       2
## 38                      Wrong Side or Wrong Way      17       6

Cool! So let’s create a new object and take a look!

tvCrashes <- merge(boiseData, nampaData, by = "circumstances", all = T)

Now we can see we have ugly variable names for Boise and Nampa, so let’s fix those

names(tvCrashes) <- c("circumstances","Boise", "Nampa")

And, we want to make sure we can use all our data, so let’s replace those NAs with 0s

First, how many NAs do we have?

tvCrashes[is.na(tvCrashes)]
## [1] NA NA NA

Now let’s turn those into 0s

tvCrashes[is.na(tvCrashes)] <- 0

And check again

tvCrashes[is.na(tvCrashes)]
## character(0)

And let’s look at the range of Nampa crash incidents to triple-check we not have zeroes

range(tvCrashes$Nampa)
## [1]   0 603

Is there another way to clean data?

There’s always another way!

Now we’ll explore what some data tidying looks like using tidyverse methods

The tidyverse is a group of packages developed by a group of folks. They have set out to improve on/simplify some base R functions and standard practices.

First you’ll want to install and load in the tidyverse install.packages(“tidyverse”) - had to put this here, wouldn’t work in my code snippet below

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Then you pull your data in. What about this function looks different?

read_csv("boise.csv")
## Rows: 38 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 38 × 2
##    `Contributing Circumstances (All)` `Total Crashes`
##    <chr>                                        <dbl>
##  1 Following Too Close                            805
##  2 Failed to Yield                                680
##  3 Inattention                                    489
##  4 Failed to Obey Signal                          280
##  5 Improper Lane Change                           278
##  6 Failed to Maintain Lane                        276
##  7 Speed Too Fast For Conditions                  232
##  8 Alcohol Impaired                               210
##  9 Improper Turn                                  191
## 10 Distracted IN or ON Vehicle                    113
## # ℹ 28 more rows
read_csv("nampa.csv")
## Rows: 35 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 35 × 2
##    `Contributing Circumstances (All)` `Total Crashes`
##    <chr>                                        <dbl>
##  1 Failed to Yield                                603
##  2 Following Too Close                            580
##  3 Inattention                                    333
##  4 Improper Lane Change                           169
##  5 Speed Too Fast For Conditions                  138
##  6 Failed to Maintain Lane                        121
##  7 Alcohol Impaired                               118
##  8 Failed to Obey Signal                          100
##  9 Distracted IN or ON Vehicle                     99
## 10 Failed to Obey Stop Sign                        80
## # ℹ 25 more rows

You’ll note that the tidy version of this function gives you some nice output with rows, columns, and nice details about your variables (columns)

Save them as objects and check out the tidy variable names!

boiseTidyData <- read_csv("boise.csv")
## Rows: 38 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nampaTidyData <- read_csv("nampa.csv")
## Rows: 35 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

So much cleaner!

… more to come