This project involves further exercises in cleaning up messy data using files submitted by colleagues. The goal is to transform each of three wide or disorganized dataframes into a tidy table in which each row is a single observation and each column a single variable.
The datasets I’ve chosen are:
After cleaning the data, we created some tables, graphs and an interactive map to display some of the results. To view the underlying R code, select the toggle buttons on the right.
Almost everything we need is in Hadley Wickham’s tidyverse package of packages. But for the plots and interactive maps we need to add packages that transform R and markdown into javascript.
# load libraries
library(dplyr)
library(tidyr)
library(DT)
library(stringr)
library(ggplot2)
library(ggthemes)
library(scales)
library(pander)
library(readr)
library(htmlwidgets)
library(htmltools)
library(highcharter)
library(RColorBrewer)
library(leaflet)
library(rgdal)
library(dygraphs)
library(quantmod)
This table comes to us from Cesar Espitia, who called it “a sample analysis that is typically requested when determining operation improvements.” The data track shipping fees paid to or charged by several southeast Asia countries.
As shown in the raw data below, the date variable is not populated for every shipping fee. (See the empty cells in rows 2, 4, 6.) Fixhing this requires writing a bespoke function and transforming the date variable to a character from a factor, the default in read.csv().
# import the data
fees <- read.csv("Data/Discussion5.csv", header=T)
# look at it, though we won't show this step
# str(fees)
# summary(fees)
head(fees[1:6, 1:5])
## Date Values China Canada Tawian
## 1 12/29/2016 Price of Carrier 518.0 NA NA
## 2 Shipping Fees Collected 260.0 NA NA
## 3 12/30/2016 Price of Carrier 232.0 NA NA
## 4 Shipping Fees Collected 132.0 NA NA
## 5 1/3/2017 Price of Carrier 1143.5 NA NA
## 6 Shipping Fees Collected 752.0 NA NA
# fix date
fees$Date <- as.character(fees$Date)
# function to copy dates into appropriate rows
# if a cell is empty, copies the date from the
# cell above
fixdate <- function(fees){
for (i in 1:nrow(fees)){
if(fees[i,1]==""){
fees[i,1] <- fees[(i-1),1]
} else {
next
}
}
return(fees)
}
# populate the missing dates using the function
fees <- fixdate(fees)
# reformat the date
fees$Date <- as.Date(fees$Date, "%m/%d/%Y")
## pander(head(fees[1:5, c(1:5),]), caption = "Table 1: Fees data, initial clean-up (5 of 10 columns shown)")
datatable(fees, caption = "Table 1: Missing dates populated to empty cells")
Country is a variable, but the values are spread across seven columns. We use gather() to put them into one. Price and Fees are in the same column, but they should be variables in their own columns. We use spread() to fix that. First, we make our dataframe a tbl_df() so it’s recognized in the tidyverse.
# make it a table dataframe
fees <- tbl_df(fees)
# gather countries into a single variable 'Country'
fees2 <- ungroup(fees) %>%
gather(Country, Amount, 3:10)
# spread Price and Shipping Fee into their own columns
fees3 <- ungroup(fees2) %>%
spread(Values, Amount)
# Simplify the column names
colnames(fees3)[c(3:4)] <- c("Fees", "Price")
# the resulting table has one row for each date and country showing
# Shipping Fees Collected and Price of Carrier
# Here are some statistics by Country
stats <- ungroup(fees3) %>%
group_by(Country) %>%
summarise('Total Fees' = round(sum(Fees, na.rm=T), 2),
'Average Price' = round(mean(Price, na.rm=T), 2)) %>%
arrange(Country)
datatable(stats, caption = "Table 2: SAggregate shipping fee stats")
ggplot(stats, aes(x=Country, y=stats[,3])) +
geom_bar(stat="identity", position=position_dodge(),
fill='lightblue') +
xlab(" ") + scale_y_continuous(labels=scales::dollar) +
ylab("Average Price") +
theme_tufte()
Figure 1: Average Price by Country
These data come from Jaideep, who says: “The dataset is untidy as there are a lot of missing values for a number of columns.”
Yep. And other things, as raw data below shows.
There are 32 columns in the table. Of those, 13 identify whether specific drugs were involved in a death and can be gathered into a single variable. One other column combines lat-lon values and should be spread into separate columns for mapping.
There’s also dirty data: Entries aren’t standard in the columns that identify whether a drug was present. We have to use regular expresions and stringr to do clean-ups.
# read in the data
deaths <- read.csv('Data/Drug_Deaths.csv', header=T, colClasses="character")
# inspect; drug columns aren't standardized
# str(deaths)
print(deaths[c(557, 84, 1042, 2109, 1508, 918), c(2:3, 18,32)], row.names = FALSE)
## Date Sex Fentanyl DeathLoc
## 07/03/2012 Male Y POPS Clinton, CT\n(41.278677, -72.528067)
## 04/17/2013 Female Y (PTCH) Danbury, CT\n(41.393666, -73.451539)
## 11/17/2015 Male Y-A Hartford, CT\n(41.765775, -72.673356)
## 11/16/2015 Female Y-A STRATFORD, CT\n(41.200888, -73.131323)
## 07/02/2015 Male Y New Milford, CT\n(41.576633, -73.408713)
## 09/26/2015 Male Y Hartford, CT\n(41.765775, -72.673356)
As a first step, we fix the variable types and use regular expressions to standardize the “Y” values indicating whether a specific drug was present at death.
# transform strings to factors
deaths[, c(1,3:12,14,30,31)] <- lapply(deaths[, c(1,3:12,14,30,31)], as.factor)
# summary(deaths[, c(16:27)])
# trim spaces on drug-present columns
deaths[, c(16:27)] <- lapply(deaths[, c(16:27)], str_trim)
# standardize 'Y' values using regexpr and stringr
deaths[, c(16:27,29)] <- lapply(deaths[, c(16:27,29)], str_replace, pattern = "[yY].*", replacement = "Y")
# make them factors
deaths[, c(16:27,29)] <- lapply(deaths[, c(16:27,29)], as.factor)
# glimpse the result
pander(deaths[c(557, 84, 1042, 2109, 1508, 918), c(2:4, 16:19)], caption = "Table 3: Cleaned-up Deaths Data")
| Date | Sex | Race | Heroin | Cocaine | Fentanyl | Oxycodone | |
|---|---|---|---|---|---|---|---|
| 557 | 07/03/2012 | Male | White | Y | Y | ||
| 84 | 04/17/2013 | Female | White | Y | |||
| 1042 | 11/17/2015 | Male | Hispanic, White | Y | Y | Y | |
| 2109 | 11/16/2015 | Female | White | Y | Y | ||
| 1508 | 07/02/2015 | Male | Asian, Other | Y | Y | Y | |
| 918 | 09/26/2015 | Male | White | Y | Y | Y |
First we gather the drug variable … See last two columns (new).
deaths <- tbl_df(deaths)
deaths2 <- ungroup(deaths) %>%
gather(Drug, Present, c(16:27,29))
pander(head(deaths2[, c(1:6,20,21)]), caption = "Table 4: Gathered Deaths Data")
| CaseNumber | Date | Sex | Race | Age | Residence.City | Drug | Present |
|---|---|---|---|---|---|---|---|
| 12-4443 | 03/23/2012 | Male | White | 22 | ONECO | Heroin | Y |
| 12-2808 | 02/21/2012 | Male | White | 28 | WINCHESTER | Heroin | |
| 05/20/2016 | Male | White | 36 | AVON | Heroin | ||
| 15-13536 | 08/20/2015 | Male | Black | 63 | NEW HAVEN | Heroin | Y |
| 12-12217 | 08/29/2012 | Male | White | 39 | MIDDLETOWN | Heroin | |
| 15-6342 | 04/11/2015 | Male | White | 46 | NEWTOWN | Heroin |
… then separate latitude and longitude.
# make a first split to shear off city and state
deaths3 <- separate(deaths2, DeathLoc, into = c("Place", "LatLon"), sep="[(]")
# separate a second time on comma
deaths3 <- separate(deaths3, LatLon, into = c("Lat", "Lon"), sep="[,]")
# get rid of trailing parens
deaths3$Lon <- lapply(deaths3$Lon, gsub, pattern="\\)", replacement="")
deaths3$Lon <- round(as.numeric(unlist(deaths3$Lon)),7)
deaths3$Lat <- round(as.numeric(deaths3$Lat),7)
# str(deaths3)
pander(head(deaths3[, c(2:4, 20:23)]), caption = "Table 5: New Lon-Lat columns")
| Date | Sex | Race | Lat | Lon | Drug | Present |
|---|---|---|---|---|---|---|
| 03/23/2012 | Male | White | 41.71287 | -71.88121 | Heroin | Y |
| 02/21/2012 | Male | White | 41.90197 | -73.13385 | Heroin | |
| 05/20/2016 | Male | White | 41.80964 | -72.83055 | Heroin | |
| 08/20/2015 | Male | Black | 41.30825 | -72.92416 | Heroin | Y |
| 08/29/2012 | Male | White | 41.54465 | -72.65171 | Heroin | |
| 04/11/2015 | Male | White | 41.41352 | -73.30884 | Heroin |
Among the questions Jaideep asked about the data: “Which drug is most common cause of fatality? Is it available by Rx?”
Now that our data is reshaped, we can answer that using dplyr and the group_by() and summarise() functions. Heroin is present in 54 percent of the drug fatalities. The two top drugs are illegal; fentanyl and benzodiazepine and some of the other common drugs are available by prescription.
topdrugs <- ungroup(deaths3) %>%
filter(Present=='Y') %>%
group_by(Drug) %>%
summarise(Count=n(),
'Percent Present'=round(Count/3045, 3)*100) %>%
arrange(desc(Count))
pander(topdrugs[1:8,], caption = "Table 6: Most common drugs in CT fatalities")
| Drug | Count | Percent Present |
|---|---|---|
| Heroin | 1675 | 55.0 |
| Cocaine | 827 | 27.2 |
| Fentanyl | 791 | 26.0 |
| Benzodiazepine | 746 | 24.5 |
| EtOH | 697 | 22.9 |
| Any.Opioid | 675 | 22.2 |
| Oxycodone | 450 | 14.8 |
| Methadone | 286 | 9.4 |
heroin <- ungroup(deaths3) %>%
filter(Drug=='Heroin' & Present=='Y' & CaseNumber !="") %>%
group_by(CaseNumber) %>%
select(CaseNumber, ImmediateCauseA, Sex, Location, Lat, Lon)
gender <- ungroup(deaths) %>%
group_by(Sex) %>%
summarise(Count=n(),
Percent=round(Count/3045, 3)*100) %>%
arrange(desc(Percent))
We can also map the fatalities now that we have clean coordinates. Most – 72.8 percent – are male (red dots).
pal <- colorFactor(c("blue", "red"), domain = c("Male", "Female"))
leaflet(heroin) %>%
setView(lng = -71.881207,
lat = 41.712872,
zoom = 8) %>%
addProviderTiles("CartoDB.Positron") %>%
addCircleMarkers(heroin$Lon, heroin$Lat,
radius = ~ifelse(Sex == "Female", 6, 8),
color = ~pal(Sex),
stroke = FALSE, fillOpacity = 0.35,
label = ~htmlEscape(ImmediateCauseA))
According to Kyle, “It would be a good exercise to tidy this table by putting all of the years in a one column and the metrics in their own columns. Then you could use R to visualize some of the development trends by country and year in a line graph.”
OK. We’re game. Here is growth in real GDP for the European Union states over time.
# get the data
euro <- read.csv('Data/EUData.csv', header=T, colClasses="character")
# fix the column names
colnames(euro)[5:61] <- 1960:2016
# gather the years
euro2 <- ungroup(euro) %>%
gather(Year, Value, 5:61)
# set the data type
euro2$Value <- as.numeric(euro2$Value)
# select gdp data for the chart
gdp <- ungroup(euro2) %>%
filter(is.null(Value)==FALSE) %>%
filter(Indicator.Name=="GDP (current US$)") %>%
select(Year, Value)
# plot it
ggplot(gdp, aes(x=Year, y=Value/100000)) +
geom_bar(stat="identity", position=position_dodge(),
fill='blue') +
xlab(" ") + scale_y_continuous(labels=scales::dollar) +
ylab("GDP (current US$, 100K)") +
scale_x_discrete(breaks=seq(1960,2020,10)) +
theme_tufte()