Daniel Lefevre

Tidying Untidy Datasets

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.1.0     ✓ dplyr   1.0.4
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Untidy dataset 1

Source: https://github.com/kwstat/untidydata2/tree/master/inst/messydata

data <- read.csv("/Users/daniellefevre/Documents/DATA101/untidydata2-master/inst/messydata/agcensus-chapter1-table1-US.csv")
head(data)

This dataset has a lot of information, I am going to pare it down to some of the produce items in order to practice the pivot commands more effectively. If we wanted to work with the dataset, it would be easy enough to repeat the similar steps for the rest of the data in this file.

produce <- data %>%
  filter(commodity %in% c("OATS", "BARLEY",
                          "SOYBEANS", "BEANS", "COTTON", "TOBACCO", 
                          "HAY & HAYLAGE", "RICE", "SUNFLOWER", 
                          "SUGARBEETS", "SUGARCANE", "PEANUTS",
                          "POTATOES", "SWEET POTATOES"))
produce <- produce %>% select(c("commodity", "data.item", "X2017", "X2012",
                                "X2007", "X2002", "X1997"))
head(produce)

We’ll call this the ‘original’ data.

View(produce)

Rename the year columns:

names(produce)
[1] "commodity" "data.item" "X2017"     "X2012"     "X2007"     "X2002"     "X1997"    
names(produce) <- c("commodity", "data.item", "2017", "2012", "2007", "2002",
                    "1997")

Now we need to clean up the redundant information in the data.item column

produce[grepl("OPERATIONS", produce$data.item), ]$data.item = 
  "OPERATIONS WITH AREA HARVESTED"
produce[grepl("ACRES", produce$data.item), ]$data.item = 
  "ACRES HARVESTED"
# We could turn the production unit into a factor column, but we'll just assume
# a common unit for now
produce[grepl("PRODUCTION", produce$data.item), ]$data.item = 
  "PRODUCTION"
head(produce)

In order to “tidy” the data, “Year”, “OPERATIONS WITH AREA HARVESTED”, “ACRES HARVESTED”, and “PRODUCTION” should be columns.

view(produce)
produce %>%
  pivot_wider(names_from = data.item, values_from = value)

We now have the produce data arranged in a “tidy” format, where each row corresponds to a specific measurement and each column corresponds to a single variable.

Untidy dataset 2

Source: https://github.com/kwstat/untidydata2/tree/master/inst/messydata

data = read_csv("/Users/daniellefevre/Documents/DATA101/untidydata2-master/inst/messydata/gdp_by_county.csv")
view(data)
str(data)
names(data) = c("FIPS", "Countyname", "Postal", "LineCode", "IndustryName", "2012", "2013", "2014", "2015")
names(data)
data <- slice(data, 4:12459)
view(data)
data <- pivot_longer(data, cols=c("2012", "2013", "2014", "2015"), names_to = "year")
head(data)

By moving the data for years into a single column, we now have the data in a tidy format where each row represents a single measurement and each variable is represented by one column

Tidy datasets

Tidy dataset 1

The mpg dataset

head(mpg)

The mpg dataset is tidy; each variable is represented by a single column, and each row represents a specific data item (a car)

Tidy dataset 2

The Kaggle Netflix Movie dataset

Source: https://www.kaggle.com/shivamb/netflix-shows

movies <- read.csv("/Users/daniellefevre/Documents/DATA101/netflix_titles.csv")

This dataset is already tidy; each row represents a single media item (Movie or TV show). Each column represents a specific variable.

Tidy dataset 3

The US Food Assistance Dataset

Source: https://www.kaggle.com/jpmiller/publicassistance

snap <- read.csv("/Users/daniellefevre/Documents/DATA101/snap/SNAP_history_1969_2019.csv")

This dataset is already tidy; each row represents collected statistics for the SNAP program for a single year.

LS0tCnRpdGxlOiAiUiBOb3RlYm9vayIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKRGFuaWVsIExlZmV2cmUKClRpZHlpbmcgVW50aWR5IERhdGFzZXRzCgpgYGB7cn0KbGlicmFyeSh0aWR5dmVyc2UpCmBgYAoKIyBVbnRpZHkgZGF0YXNldCAxCgpTb3VyY2U6IGh0dHBzOi8vZ2l0aHViLmNvbS9rd3N0YXQvdW50aWR5ZGF0YTIvdHJlZS9tYXN0ZXIvaW5zdC9tZXNzeWRhdGEKCmBgYHtyfQpkYXRhIDwtIHJlYWQuY3N2KCIvVXNlcnMvZGFuaWVsbGVmZXZyZS9Eb2N1bWVudHMvREFUQTEwMS91bnRpZHlkYXRhMi1tYXN0ZXIvaW5zdC9tZXNzeWRhdGEvYWdjZW5zdXMtY2hhcHRlcjEtdGFibGUxLVVTLmNzdiIpCmBgYAoKYGBge3J9CmhlYWQoZGF0YSkKYGBgCgpUaGlzIGRhdGFzZXQgaGFzIGEgbG90IG9mIGluZm9ybWF0aW9uLCBJIGFtIGdvaW5nIHRvIHBhcmUgaXQgZG93biB0byBzb21lIG9mIHRoZQpwcm9kdWNlIGl0ZW1zIGluIG9yZGVyIHRvIHByYWN0aWNlIHRoZSBwaXZvdCBjb21tYW5kcyBtb3JlIGVmZmVjdGl2ZWx5LiBJZgp3ZSB3YW50ZWQgdG8gd29yayB3aXRoIHRoZSBkYXRhc2V0LCBpdCB3b3VsZCBiZSBlYXN5IGVub3VnaCB0byByZXBlYXQgdGhlIApzaW1pbGFyIHN0ZXBzIGZvciB0aGUgcmVzdCBvZiB0aGUgZGF0YSBpbiB0aGlzIGZpbGUuCgpgYGB7cn0KcHJvZHVjZSA8LSBkYXRhICU+JQogIGZpbHRlcihjb21tb2RpdHkgJWluJSBjKCJPQVRTIiwgIkJBUkxFWSIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgIlNPWUJFQU5TIiwgIkJFQU5TIiwgIkNPVFRPTiIsICJUT0JBQ0NPIiwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgIkhBWSAmIEhBWUxBR0UiLCAiUklDRSIsICJTVU5GTE9XRVIiLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAiU1VHQVJCRUVUUyIsICJTVUdBUkNBTkUiLCAiUEVBTlVUUyIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgIlBPVEFUT0VTIiwgIlNXRUVUIFBPVEFUT0VTIikpCmBgYAoKYGBge3J9CnByb2R1Y2UgPC0gcHJvZHVjZSAlPiUgc2VsZWN0KGMoImNvbW1vZGl0eSIsICJkYXRhLml0ZW0iLCAiWDIwMTciLCAiWDIwMTIiLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICJYMjAwNyIsICJYMjAwMiIsICJYMTk5NyIpKQpgYGAKCmBgYHtyfQpoZWFkKHByb2R1Y2UpCmBgYAoKV2UnbGwgY2FsbCB0aGlzIHRoZSAnb3JpZ2luYWwnIGRhdGEuCgpgYGB7cn0KVmlldyhwcm9kdWNlKQpgYGAKClJlbmFtZSB0aGUgeWVhciBjb2x1bW5zOgoKYGBge3J9Cm5hbWVzKHByb2R1Y2UpCmBgYAoKYGBge3J9Cm5hbWVzKHByb2R1Y2UpIDwtIGMoImNvbW1vZGl0eSIsICJkYXRhLml0ZW0iLCAiMjAxNyIsICIyMDEyIiwgIjIwMDciLCAiMjAwMiIsCiAgICAgICAgICAgICAgICAgICAgIjE5OTciKQpgYGAKCmBgYHtyfQpoZWFkKHByb2R1Y2UpCmBgYAoKTm93IHdlIG5lZWQgdG8gY2xlYW4gdXAgdGhlIHJlZHVuZGFudCBpbmZvcm1hdGlvbiBpbiB0aGUgZGF0YS5pdGVtIGNvbHVtbgoKYGBge3J9CnByb2R1Y2VbZ3JlcGwoIk9QRVJBVElPTlMiLCBwcm9kdWNlJGRhdGEuaXRlbSksIF0kZGF0YS5pdGVtID0gCiAgIk9QRVJBVElPTlMgV0lUSCBBUkVBIEhBUlZFU1RFRCIKcHJvZHVjZVtncmVwbCgiQUNSRVMiLCBwcm9kdWNlJGRhdGEuaXRlbSksIF0kZGF0YS5pdGVtID0gCiAgIkFDUkVTIEhBUlZFU1RFRCIKIyBXZSBjb3VsZCB0dXJuIHRoZSBwcm9kdWN0aW9uIHVuaXQgaW50byBhIGZhY3RvciBjb2x1bW4sIGJ1dCB3ZSdsbCBqdXN0IGFzc3VtZQojIGEgY29tbW9uIHVuaXQgZm9yIG5vdwpwcm9kdWNlW2dyZXBsKCJQUk9EVUNUSU9OIiwgcHJvZHVjZSRkYXRhLml0ZW0pLCBdJGRhdGEuaXRlbSA9IAogICJQUk9EVUNUSU9OIgpgYGAKCmBgYHtyfQpoZWFkKHByb2R1Y2UpCmBgYAoKPGI+CkluIG9yZGVyIHRvICJ0aWR5IiB0aGUgZGF0YSwgIlllYXIiLCAiT1BFUkFUSU9OUyBXSVRIIEFSRUEgSEFSVkVTVEVEIiwKIkFDUkVTIEhBUlZFU1RFRCIsIGFuZCAiUFJPRFVDVElPTiIgc2hvdWxkIGJlIGNvbHVtbnMuCjwvYj4KCmBgYHtyfQojIHBpdm90IHRoZSB5ZWFyIGNvbHVtbgpwcm9kdWNlIDwtIHByb2R1Y2UgJT4lCiAgcGl2b3RfbG9uZ2VyKGNvbHMgPSAtYygiY29tbW9kaXR5IiwgImRhdGEuaXRlbSIpLCBuYW1lc190byA9ICJ5ZWFyIikKYGBgCgpgYGB7cn0Kdmlldyhwcm9kdWNlKQpgYGAKCmBgYHtyfQpwcm9kdWNlICU+JQogIHBpdm90X3dpZGVyKG5hbWVzX2Zyb20gPSBkYXRhLml0ZW0sIHZhbHVlc19mcm9tID0gdmFsdWUpCmBgYAoKPGI+CldlIG5vdyBoYXZlIHRoZSBwcm9kdWNlIGRhdGEgYXJyYW5nZWQgaW4gYSAidGlkeSIgZm9ybWF0LCB3aGVyZSBlYWNoIHJvdyAKY29ycmVzcG9uZHMgdG8gYSBzcGVjaWZpYyBtZWFzdXJlbWVudCBhbmQgZWFjaCBjb2x1bW4gY29ycmVzcG9uZHMgdG8gYSBzaW5nbGUgCnZhcmlhYmxlLgo8L2I+CgojIFVudGlkeSBkYXRhc2V0IDIKClNvdXJjZTogaHR0cHM6Ly9naXRodWIuY29tL2t3c3RhdC91bnRpZHlkYXRhMi90cmVlL21hc3Rlci9pbnN0L21lc3N5ZGF0YQoKYGBge3J9CmRhdGEgPSByZWFkX2NzdigiL1VzZXJzL2RhbmllbGxlZmV2cmUvRG9jdW1lbnRzL0RBVEExMDEvdW50aWR5ZGF0YTItbWFzdGVyL2luc3QvbWVzc3lkYXRhL2dkcF9ieV9jb3VudHkuY3N2IikKYGBgCgpgYGB7cn0KdmlldyhkYXRhKQpgYGAKCmBgYHtyfQpzdHIoZGF0YSkKYGBgCgpgYGB7cn0KbmFtZXMoZGF0YSkgPSBjKCJGSVBTIiwgIkNvdW50eW5hbWUiLCAiUG9zdGFsIiwgIkxpbmVDb2RlIiwgIkluZHVzdHJ5TmFtZSIsICIyMDEyIiwgIjIwMTMiLCAiMjAxNCIsICIyMDE1IikKYGBgCgpgYGB7cn0KbmFtZXMoZGF0YSkKYGBgCgpgYGB7cn0KZGF0YSA8LSBzbGljZShkYXRhLCA0OjEyNDU5KQpgYGAKCmBgYHtyfQp2aWV3KGRhdGEpCmBgYAoKYGBge3J9CmRhdGEgPC0gcGl2b3RfbG9uZ2VyKGRhdGEsIGNvbHM9YygiMjAxMiIsICIyMDEzIiwgIjIwMTQiLCAiMjAxNSIpLCBuYW1lc190byA9ICJ5ZWFyIikKYGBgCgpgYGB7cn0KaGVhZChkYXRhKQpgYGAKCjxiPgpCeSBtb3ZpbmcgdGhlIGRhdGEgZm9yIHllYXJzIGludG8gYSBzaW5nbGUgY29sdW1uLCB3ZSBub3cgaGF2ZSB0aGUgZGF0YSBpbiBhIAp0aWR5IGZvcm1hdCB3aGVyZSBlYWNoIHJvdyByZXByZXNlbnRzIGEgc2luZ2xlIG1lYXN1cmVtZW50IGFuZCBlYWNoIHZhcmlhYmxlIGlzCnJlcHJlc2VudGVkIGJ5IG9uZSBjb2x1bW4KPC9iPgoKIyBUaWR5IGRhdGFzZXRzCgojIFRpZHkgZGF0YXNldCAxCgo8Yj4KVGhlIG1wZyBkYXRhc2V0CjwvYj4KCmBgYHtyfQpoZWFkKG1wZykKYGBgCjxiPgpUaGUgbXBnIGRhdGFzZXQgaXMgdGlkeTsgZWFjaCB2YXJpYWJsZSBpcyByZXByZXNlbnRlZCBieSBhIHNpbmdsZSBjb2x1bW4sCmFuZCBlYWNoIHJvdyByZXByZXNlbnRzIGEgc3BlY2lmaWMgZGF0YSBpdGVtIChhIGNhcikKPC9iPgoKIyBUaWR5IGRhdGFzZXQgMgoKVGhlIEthZ2dsZSBOZXRmbGl4IE1vdmllIGRhdGFzZXQKClNvdXJjZTogaHR0cHM6Ly93d3cua2FnZ2xlLmNvbS9zaGl2YW1iL25ldGZsaXgtc2hvd3MKCmBgYHtyfQptb3ZpZXMgPC0gcmVhZC5jc3YoIi9Vc2Vycy9kYW5pZWxsZWZldnJlL0RvY3VtZW50cy9EQVRBMTAxL25ldGZsaXhfdGl0bGVzLmNzdiIpCmBgYAoKYGBge3J9CmhlYWQobW92aWVzKQpgYGAKPC9iPgpUaGlzIGRhdGFzZXQgaXMgYWxyZWFkeSB0aWR5OyBlYWNoIHJvdyByZXByZXNlbnRzIGEgc2luZ2xlIG1lZGlhIGl0ZW0gKE1vdmllIG9yIApUViBzaG93KS4gRWFjaCBjb2x1bW4gcmVwcmVzZW50cyBhIHNwZWNpZmljIHZhcmlhYmxlLgo8L2I+CgojIFRpZHkgZGF0YXNldCAzCgo8Yj4KVGhlIFVTIEZvb2QgQXNzaXN0YW5jZSBEYXRhc2V0CgpTb3VyY2U6IGh0dHBzOi8vd3d3LmthZ2dsZS5jb20vanBtaWxsZXIvcHVibGljYXNzaXN0YW5jZQo8L2I+CgpgYGB7cn0Kc25hcCA8LSByZWFkLmNzdigiL1VzZXJzL2RhbmllbGxlZmV2cmUvRG9jdW1lbnRzL0RBVEExMDEvc25hcC9TTkFQX2hpc3RvcnlfMTk2OV8yMDE5LmNzdiIpCmBgYAoKYGBge3J9CmhlYWQoc25hcCkKYGBgCgo8Yj4KVGhpcyBkYXRhc2V0IGlzIGFscmVhZHkgdGlkeTsgZWFjaCByb3cgcmVwcmVzZW50cyBjb2xsZWN0ZWQgc3RhdGlzdGljcyBmb3IgdGhlIApTTkFQIHByb2dyYW0gZm9yIGEgc2luZ2xlIHllYXIuCjwvYj4=