You’re looking at an R Markdown file (file>new file>r markdown) They’re great for creating documents that incorporate text and examples of code/code output (plots, etc.) - aka great for tutorials.
So, we’re going to play around with some crash data to learn how we might clean our data (and minimize human error!). R is a fantastic tool for data analysis. It is also a stellar tool for data cleaning and management. Rather than play around in Google sheets and accidentally delete something, change a value, or do some mind-numbingly repetitive action (there are shortcuts in sheets too…), you can code the management process out in R!
First, you’ll want to download the data I’ve shared in the classroom stream for this example. You’ll find two .csv files: boise.csv & nampa.csv
Make sure to download these and get them into a spot you can find them.
These two datasets focus on the contributing circumstances to crashes in Boise and Nampa
OK, so let’s get them into R:
boiseData <- read.csv("boise.csv")
nampaData <- read.csv("nampa.csv")
Let’s get a summary of the structure of our dataframes using the str() function
str(boiseData)
## 'data.frame': 38 obs. of 2 variables:
## $ Contributing.Circumstances..All.: chr "Following Too Close" "Failed to Yield" "Inattention" "Failed to Obey Signal" ...
## $ Total.Crashes : int 805 680 489 280 278 276 232 210 191 113 ...
str(nampaData)
## 'data.frame': 35 obs. of 2 variables:
## $ Contributing.Circumstances..All.: chr "Failed to Yield" "Following Too Close" "Inattention" "Improper Lane Change" ...
## $ Total.Crashes : int 603 580 333 169 138 121 118 100 99 80 ...
So we can see we have two $s, or variables, in each dataframe: $ Contributing.Circumstances..All. and $ Total.Crashes
We could also get variable names using names()
names(boiseData)
## [1] "Contributing.Circumstances..All." "Total.Crashes"
names(nampaData)
## [1] "Contributing.Circumstances..All." "Total.Crashes"
Those are pretty gross names, so we’ll use this same function to reset them. In this case, we’ll feed the variable names attribute of the dataframes (accessed by the names() function) vectors (created with c()) made up of preferable names
names(boiseData) <- c("circumstances","total")
names(nampaData) <- c("circumstances","total")
Let’s see whether we have the same number of rows in these two dataframes
all.equal(boiseData, nampaData)
## [1] "Attributes: < Component \"row.names\": Numeric: lengths (38, 35) differ >"
## [2] "Component \"circumstances\": Lengths (38, 35) differ (string compare on first 35)"
## [3] "Component \"circumstances\": 32 string mismatches"
## [4] "Component \"total\": Numeric: lengths (38, 35) differ"
So we don’t have the same # of rows… so, if we try to join these two dataframes we’re going to bump into some problems
Which categories are different (missing from the shorter Nampa dataframe)? To find out we can use the %in% operator The %in% operator essentially means “includes.” It allows us to see whether a vector (in this case a variable in a dataframe) includes all the things another vector does.
In this case we want to know which of the boiseData circumstances are NOT (!) included in the nampaData circumstances. So, in this next line of code we’re indexing the boiseData dataframe and asking R to go row by row and identifying whether there is not (!) a boiseData circumstance included in the nampaData circumstances
boiseData[!boiseData$circumstances %in% nampaData$circumstances,]
## circumstances total
## 33 Light Defect 7
## 35 Truck Coupling, Trailer Hitch, Safety Chains 5
## 38 Too Slow for Traffic 2
Let’s use the cbind() function anyway to see what happens when we try “bind” or combine the dataframes. If cbind() works we should have a new dataframe with all columns from both dataframes
cbind(boiseData,nampaData)
## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 38, 35
DOESN’T WORK! cbind() needs even dataframes with the SAME variables in the SAME order! Thankfully we have the merge() function! merge() takes a few arguments. Here we’re feeding it two dataframes, are specifying that we want to merge by the variable circumstances, and tell it to include all the values, not only those that are included in both dataframes
merge(boiseData, nampaData, by = "circumstances", all = T)
## circumstances total.x total.y
## 1 Alcohol Impaired 210 118
## 2 Animal(s) in Roadway 48 16
## 3 Asleep, Drowsy, Fatigued 54 32
## 4 Brakes 19 21
## 5 Distracted IN or ON Vehicle 113 99
## 6 Drove Left of Center 66 26
## 7 Drug Impaired 31 18
## 8 Emotional – Depressed, Angry, Disturbed 22 5
## 9 Exceeded Posted Speed 36 22
## 10 Failed to Maintain Lane 276 121
## 11 Failed to Obey Signal 280 100
## 12 Failed to Obey Stop Sign 68 80
## 13 Failed to Signal 8 3
## 14 Failed to Yield 680 603
## 15 Following Too Close 805 580
## 16 Foot Slipped Off or Caught On Pedal 26 15
## 17 Improper Backing 37 35
## 18 Improper Lane Change 278 169
## 19 Improper Overtaking 26 15
## 20 Improper Turn 191 73
## 21 Improper Use of Turn Lane 8 6
## 22 Improperly Parked 9 2
## 23 Inattention 489 333
## 24 Light Defect 7 NA
## 25 Other 91 50
## 26 Other Vehicle Defect 20 7
## 27 Overcorrected 63 17
## 28 Physical Impairment 15 8
## 29 Previous Accident 4 2
## 30 Sick 24 8
## 31 Speed Too Fast For Conditions 232 138
## 32 Steering 5 5
## 33 Tire Defect 12 14
## 34 Too Slow for Traffic 2 NA
## 35 Truck Coupling, Trailer Hitch, Safety Chains 5 NA
## 36 Vision Obstruction 70 66
## 37 Wheel Defect 3 2
## 38 Wrong Side or Wrong Way 17 6
Cool! So let’s create a new object and take a look!
tvCrashes <- merge(boiseData, nampaData, by = "circumstances", all = T)
Now we can see we have ugly variable names for Boise and Nampa, so let’s fix those
names(tvCrashes) <- c("circumstances","Boise", "Nampa")
And, we want to make sure we can use all our data, so let’s replace those NAs with 0s
First, how many NAs do we have?
tvCrashes[is.na(tvCrashes)]
## [1] NA NA NA
Now let’s turn those into 0s
tvCrashes[is.na(tvCrashes)] <- 0
And check again
tvCrashes[is.na(tvCrashes)]
## character(0)
And let’s look at the range of Nampa crash incidents to triple-check we not have zeroes
range(tvCrashes$Nampa)
## [1] 0 603
There’s always another way!
Now we’ll explore what some data tidying looks like using tidyverse methods
The tidyverse is a group of packages developed by a group of folks. They have set out to improve on/simplify some base R functions and standard practices.
First you’ll want to install and load in the tidyverse install.packages(“tidyverse”) - had to put this here, wouldn’t work in my code snippet below
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Then you pull your data in. What about this function looks different?
read_csv("boise.csv")
## Rows: 38 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 38 × 2
## `Contributing Circumstances (All)` `Total Crashes`
## <chr> <dbl>
## 1 Following Too Close 805
## 2 Failed to Yield 680
## 3 Inattention 489
## 4 Failed to Obey Signal 280
## 5 Improper Lane Change 278
## 6 Failed to Maintain Lane 276
## 7 Speed Too Fast For Conditions 232
## 8 Alcohol Impaired 210
## 9 Improper Turn 191
## 10 Distracted IN or ON Vehicle 113
## # ℹ 28 more rows
read_csv("nampa.csv")
## Rows: 35 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 35 × 2
## `Contributing Circumstances (All)` `Total Crashes`
## <chr> <dbl>
## 1 Failed to Yield 603
## 2 Following Too Close 580
## 3 Inattention 333
## 4 Improper Lane Change 169
## 5 Speed Too Fast For Conditions 138
## 6 Failed to Maintain Lane 121
## 7 Alcohol Impaired 118
## 8 Failed to Obey Signal 100
## 9 Distracted IN or ON Vehicle 99
## 10 Failed to Obey Stop Sign 80
## # ℹ 25 more rows
You’ll note that the tidy version of this function gives you some nice output with rows, columns, and nice details about your variables (columns)
Save them as objects and check out the tidy variable names!
boiseTidyData <- read_csv("boise.csv")
## Rows: 38 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nampaTidyData <- read_csv("nampa.csv")
## Rows: 35 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Contributing Circumstances (All)
## dbl (1): Total Crashes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
So much cleaner!
… more to come