Intro

This project involves further exercises in cleaning up messy data using files submitted by colleagues. The goal is to transform each of three wide or disorganized dataframes into a tidy table in which each row is a single observation and each column a single variable.

The datasets I’ve chosen are:

After cleaning the data, we created some tables, graphs and an interactive map to display some of the results. To view the underlying R code, select the toggle buttons on the right.

Environment Setup

Almost everything we need is in Hadley Wickham’s tidyverse package of packages. But for the plots and interactive maps we need to add packages that transform R and markdown into javascript.

# load libraries
library(dplyr)
library(tidyr)
library(DT)
library(stringr)
library(ggplot2)
library(ggthemes)
library(scales)
library(pander)
library(readr)
library(htmlwidgets)
library(htmltools)
library(highcharter)
library(RColorBrewer)
library(leaflet)
library(rgdal)
library(dygraphs)
library(quantmod)

1. Shipping Fees Data

This table comes to us from Cesar Espitia, who called it “a sample analysis that is typically requested when determining operation improvements.” The data track shipping fees paid to or charged by several southeast Asia countries.

1.1 Import and Examine Data

As shown in the raw data below, the date variable is not populated for every shipping fee. (See the empty cells in rows 2, 4, 6.) Fixhing this requires writing a bespoke function and transforming the date variable to a character from a factor, the default in read.csv().

# import the data
fees <- read.csv("Data/Discussion5.csv", header=T)

# look at it, though we won't show this step
# str(fees)
# summary(fees)

head(fees[1:6, 1:5])
##         Date                  Values  China Canada Tawian
## 1 12/29/2016        Price of Carrier  518.0     NA     NA
## 2            Shipping Fees Collected  260.0     NA     NA
## 3 12/30/2016        Price of Carrier  232.0     NA     NA
## 4            Shipping Fees Collected  132.0     NA     NA
## 5   1/3/2017        Price of Carrier 1143.5     NA     NA
## 6            Shipping Fees Collected  752.0     NA     NA
# fix date
fees$Date <- as.character(fees$Date)

# function to copy dates into appropriate rows
# if a cell is empty, copies the date from the 
# cell above

fixdate <- function(fees){
  for (i in 1:nrow(fees)){
    if(fees[i,1]==""){
     fees[i,1] <- fees[(i-1),1]
    } else {
      next
    }
  }
  return(fees)
}

# populate the missing dates using the function
fees <- fixdate(fees)

# reformat the date
fees$Date <- as.Date(fees$Date, "%m/%d/%Y")

## pander(head(fees[1:5, c(1:5),]), caption = "Table 1: Fees data, initial clean-up (5 of 10 columns shown)")

datatable(fees, caption = "Table 1: Missing dates populated to empty cells")

1.2 Gather and Spread

Country is a variable, but the values are spread across seven columns. We use gather() to put them into one. Price and Fees are in the same column, but they should be variables in their own columns. We use spread() to fix that. First, we make our dataframe a tbl_df() so it’s recognized in the tidyverse.

# make it a table dataframe 
fees <- tbl_df(fees)

# gather countries into a single variable 'Country'
fees2 <- ungroup(fees) %>%
    gather(Country, Amount, 3:10)

# spread Price and Shipping Fee into their own columns

fees3 <- ungroup(fees2) %>%
    spread(Values, Amount)

# Simplify the column names

colnames(fees3)[c(3:4)] <- c("Fees", "Price")

# the resulting table has one row for each date and country showing 
# Shipping Fees Collected and Price of Carrier
# Here are some statistics by Country

stats <- ungroup(fees3) %>% 
            group_by(Country) %>% 
            summarise('Total Fees' = round(sum(Fees, na.rm=T), 2),
                      'Average Price' = round(mean(Price, na.rm=T), 2)) %>%
            arrange(Country)

datatable(stats, caption = "Table 2: SAggregate shipping fee stats")

1.3 Sample plot

ggplot(stats, aes(x=Country, y=stats[,3])) +
        geom_bar(stat="identity", position=position_dodge(),
                 fill='lightblue') +
        xlab(" ") + scale_y_continuous(labels=scales::dollar) +
        ylab("Average Price") +
        theme_tufte()
Figure 1: Average Price by Country

Figure 1: Average Price by Country


2. Drug Deaths Data

2.1 Import, Inspect, Clean Up

These data come from Jaideep, who says: “The dataset is untidy as there are a lot of missing values for a number of columns.”

Yep. And other things, as raw data below shows.

There are 32 columns in the table. Of those, 13 identify whether specific drugs were involved in a death and can be gathered into a single variable. One other column combines lat-lon values and should be spread into separate columns for mapping.

There’s also dirty data: Entries aren’t standard in the columns that identify whether a drug was present. We have to use regular expresions and stringr to do clean-ups.

# read in the data
deaths <- read.csv('Data/Drug_Deaths.csv', header=T, colClasses="character")

# inspect; drug columns aren't standardized
# str(deaths)

print(deaths[c(557, 84, 1042, 2109, 1508, 918), c(2:3, 18,32)], row.names = FALSE)
##        Date    Sex Fentanyl                                 DeathLoc
##  07/03/2012   Male   Y POPS     Clinton, CT\n(41.278677, -72.528067)
##  04/17/2013 Female Y (PTCH)     Danbury, CT\n(41.393666, -73.451539)
##  11/17/2015   Male      Y-A    Hartford, CT\n(41.765775, -72.673356)
##  11/16/2015 Female      Y-A   STRATFORD, CT\n(41.200888, -73.131323)
##  07/02/2015   Male        Y New Milford, CT\n(41.576633, -73.408713)
##  09/26/2015   Male        Y    Hartford, CT\n(41.765775, -72.673356)

As a first step, we fix the variable types and use regular expressions to standardize the “Y” values indicating whether a specific drug was present at death.

# transform strings to factors
deaths[, c(1,3:12,14,30,31)] <- lapply(deaths[, c(1,3:12,14,30,31)], as.factor)

# summary(deaths[, c(16:27)])

# trim spaces on drug-present columns
deaths[, c(16:27)] <- lapply(deaths[, c(16:27)], str_trim)

# standardize 'Y' values using regexpr and stringr
deaths[, c(16:27,29)] <- lapply(deaths[, c(16:27,29)], str_replace, pattern = "[yY].*", replacement = "Y")

# make them factors
deaths[, c(16:27,29)] <- lapply(deaths[, c(16:27,29)], as.factor)

# glimpse the result
pander(deaths[c(557, 84, 1042, 2109, 1508, 918), c(2:4, 16:19)], caption = "Table 3: Cleaned-up Deaths Data")
Table 3: Cleaned-up Deaths Data
  Date Sex Race Heroin Cocaine Fentanyl Oxycodone
557 07/03/2012 Male White Y Y
84 04/17/2013 Female White Y
1042 11/17/2015 Male Hispanic, White Y Y Y
2109 11/16/2015 Female White Y Y
1508 07/02/2015 Male Asian, Other Y Y Y
918 09/26/2015 Male White Y Y Y

2.2 Gather Drug Variable

First we gather the drug variable … See last two columns (new).

deaths <- tbl_df(deaths)

deaths2 <- ungroup(deaths) %>% 
              gather(Drug, Present, c(16:27,29))

pander(head(deaths2[, c(1:6,20,21)]), caption = "Table 4: Gathered Deaths Data")
Table 4: Gathered Deaths Data
CaseNumber Date Sex Race Age Residence.City Drug Present
12-4443 03/23/2012 Male White 22 ONECO Heroin Y
12-2808 02/21/2012 Male White 28 WINCHESTER Heroin
05/20/2016 Male White 36 AVON Heroin
15-13536 08/20/2015 Male Black 63 NEW HAVEN Heroin Y
12-12217 08/29/2012 Male White 39 MIDDLETOWN Heroin
15-6342 04/11/2015 Male White 46 NEWTOWN Heroin

2.3 Separate Lat-Lon

… then separate latitude and longitude.

# make a first split to shear off city and state
deaths3 <- separate(deaths2, DeathLoc, into = c("Place", "LatLon"), sep="[(]")

# separate a second time on comma
deaths3 <- separate(deaths3, LatLon, into = c("Lat", "Lon"), sep="[,]")

# get rid of trailing parens
deaths3$Lon <- lapply(deaths3$Lon, gsub, pattern="\\)", replacement="")

deaths3$Lon <- round(as.numeric(unlist(deaths3$Lon)),7)
deaths3$Lat <- round(as.numeric(deaths3$Lat),7)
# str(deaths3)

pander(head(deaths3[, c(2:4, 20:23)]), caption = "Table 5: New Lon-Lat columns")
Table 5: New Lon-Lat columns
Date Sex Race Lat Lon Drug Present
03/23/2012 Male White 41.71287 -71.88121 Heroin Y
02/21/2012 Male White 41.90197 -73.13385 Heroin
05/20/2016 Male White 41.80964 -72.83055 Heroin
08/20/2015 Male Black 41.30825 -72.92416 Heroin Y
08/29/2012 Male White 41.54465 -72.65171 Heroin
04/11/2015 Male White 41.41352 -73.30884 Heroin

2.4 Some summary statistics on drug deaths

Among the questions Jaideep asked about the data: “Which drug is most common cause of fatality? Is it available by Rx?”

Now that our data is reshaped, we can answer that using dplyr and the group_by() and summarise() functions. Heroin is present in 54 percent of the drug fatalities. The two top drugs are illegal; fentanyl and benzodiazepine and some of the other common drugs are available by prescription.

topdrugs <- ungroup(deaths3) %>% 
                filter(Present=='Y') %>% 
                group_by(Drug) %>% 
                summarise(Count=n(),
                          'Percent Present'=round(Count/3045, 3)*100) %>% 
                arrange(desc(Count))

pander(topdrugs[1:8,], caption = "Table 6: Most common drugs in CT  fatalities")
Table 6: Most common drugs in CT fatalities
Drug Count Percent Present
Heroin 1675 55.0
Cocaine 827 27.2
Fentanyl 791 26.0
Benzodiazepine 746 24.5
EtOH 697 22.9
Any.Opioid 675 22.2
Oxycodone 450 14.8
Methadone 286 9.4
heroin <- ungroup(deaths3) %>% 
            filter(Drug=='Heroin' & Present=='Y' & CaseNumber !="") %>%  
            group_by(CaseNumber) %>% 
            select(CaseNumber, ImmediateCauseA, Sex, Location, Lat, Lon)


gender <- ungroup(deaths) %>% 
            group_by(Sex) %>% 
            summarise(Count=n(),
                      Percent=round(Count/3045, 3)*100) %>%
            arrange(desc(Percent))

2.5 Map of drug deaths

We can also map the fatalities now that we have clean coordinates. Most – 72.8 percent – are male (red dots).

pal <- colorFactor(c("blue", "red"), domain = c("Male", "Female"))

leaflet(heroin) %>%
  setView(lng = -71.881207,
          lat = 41.712872,
          zoom = 8) %>%
  addProviderTiles("CartoDB.Positron") %>% 
  addCircleMarkers(heroin$Lon, heroin$Lat,
          radius = ~ifelse(Sex == "Female", 6, 8),
          color = ~pal(Sex),
          stroke = FALSE, fillOpacity = 0.35,
          label = ~htmlEscape(ImmediateCauseA))

3. World Bank Data, European Union

According to Kyle, “It would be a good exercise to tidy this table by putting all of the years in a one column and the metrics in their own columns. Then you could use R to visualize some of the development trends by country and year in a line graph.”

OK. We’re game. Here is growth in real GDP for the European Union states over time.

# get the data
euro <- read.csv('Data/EUData.csv', header=T, colClasses="character")

# fix the column names
colnames(euro)[5:61] <- 1960:2016

# gather the years
euro2 <- ungroup(euro) %>% 
          gather(Year, Value, 5:61)

# set the data type
euro2$Value <- as.numeric(euro2$Value)

# select gdp data for the chart 
gdp <- ungroup(euro2) %>% 
          filter(is.null(Value)==FALSE) %>% 
          filter(Indicator.Name=="GDP (current US$)") %>% 
          select(Year, Value)

# plot it
ggplot(gdp, aes(x=Year, y=Value/100000)) +
        geom_bar(stat="identity", position=position_dodge(),
                 fill='blue') +
        xlab(" ") + scale_y_continuous(labels=scales::dollar) +
        ylab("GDP (current US$, 100K)") +
        scale_x_discrete(breaks=seq(1960,2020,10)) +
        theme_tufte()