Daniel Lefevre
Tidying Untidy Datasets
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.1.0 ✓ dplyr 1.0.4
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Untidy dataset 1
Source: https://github.com/kwstat/untidydata2/tree/master/inst/messydata
data <- read.csv("/Users/daniellefevre/Documents/DATA101/untidydata2-master/inst/messydata/agcensus-chapter1-table1-US.csv")
head(data)
This dataset has a lot of information, I am going to pare it down to some of the produce items in order to practice the pivot commands more effectively. If we wanted to work with the dataset, it would be easy enough to repeat the similar steps for the rest of the data in this file.
produce <- data %>%
filter(commodity %in% c("OATS", "BARLEY",
"SOYBEANS", "BEANS", "COTTON", "TOBACCO",
"HAY & HAYLAGE", "RICE", "SUNFLOWER",
"SUGARBEETS", "SUGARCANE", "PEANUTS",
"POTATOES", "SWEET POTATOES"))
produce <- produce %>% select(c("commodity", "data.item", "X2017", "X2012",
"X2007", "X2002", "X1997"))
head(produce)
We’ll call this the ‘original’ data.
View(produce)
Rename the year columns:
names(produce)
[1] "commodity" "data.item" "X2017" "X2012" "X2007" "X2002" "X1997"
names(produce) <- c("commodity", "data.item", "2017", "2012", "2007", "2002",
"1997")
Now we need to clean up the redundant information in the data.item column
produce[grepl("OPERATIONS", produce$data.item), ]$data.item =
"OPERATIONS WITH AREA HARVESTED"
produce[grepl("ACRES", produce$data.item), ]$data.item =
"ACRES HARVESTED"
# We could turn the production unit into a factor column, but we'll just assume
# a common unit for now
produce[grepl("PRODUCTION", produce$data.item), ]$data.item =
"PRODUCTION"
head(produce)
In order to “tidy” the data, “Year”, “OPERATIONS WITH AREA HARVESTED”, “ACRES HARVESTED”, and “PRODUCTION” should be columns.
view(produce)
produce %>%
pivot_wider(names_from = data.item, values_from = value)
We now have the produce data arranged in a “tidy” format, where each row corresponds to a specific measurement and each column corresponds to a single variable.
Untidy dataset 2
Source: https://github.com/kwstat/untidydata2/tree/master/inst/messydata
data = read_csv("/Users/daniellefevre/Documents/DATA101/untidydata2-master/inst/messydata/gdp_by_county.csv")
view(data)
str(data)
names(data) = c("FIPS", "Countyname", "Postal", "LineCode", "IndustryName", "2012", "2013", "2014", "2015")
names(data)
data <- slice(data, 4:12459)
view(data)
data <- pivot_longer(data, cols=c("2012", "2013", "2014", "2015"), names_to = "year")
head(data)
By moving the data for years into a single column, we now have the data in a tidy format where each row represents a single measurement and each variable is represented by one column
Tidy datasets
Tidy dataset 1
The mpg dataset
head(mpg)
The mpg dataset is tidy; each variable is represented by a single column, and each row represents a specific data item (a car)
Tidy dataset 2
The Kaggle Netflix Movie dataset
Source: https://www.kaggle.com/shivamb/netflix-shows
movies <- read.csv("/Users/daniellefevre/Documents/DATA101/netflix_titles.csv")
This dataset is already tidy; each row represents a single media item (Movie or TV show). Each column represents a specific variable.
Tidy dataset 3
The US Food Assistance Dataset
Source: https://www.kaggle.com/jpmiller/publicassistance
snap <- read.csv("/Users/daniellefevre/Documents/DATA101/snap/SNAP_history_1969_2019.csv")
This dataset is already tidy; each row represents collected statistics for the SNAP program for a single year.
LS0tCnRpdGxlOiAiUiBOb3RlYm9vayIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKRGFuaWVsIExlZmV2cmUKClRpZHlpbmcgVW50aWR5IERhdGFzZXRzCgpgYGB7cn0KbGlicmFyeSh0aWR5dmVyc2UpCmBgYAoKIyBVbnRpZHkgZGF0YXNldCAxCgpTb3VyY2U6IGh0dHBzOi8vZ2l0aHViLmNvbS9rd3N0YXQvdW50aWR5ZGF0YTIvdHJlZS9tYXN0ZXIvaW5zdC9tZXNzeWRhdGEKCmBgYHtyfQpkYXRhIDwtIHJlYWQuY3N2KCIvVXNlcnMvZGFuaWVsbGVmZXZyZS9Eb2N1bWVudHMvREFUQTEwMS91bnRpZHlkYXRhMi1tYXN0ZXIvaW5zdC9tZXNzeWRhdGEvYWdjZW5zdXMtY2hhcHRlcjEtdGFibGUxLVVTLmNzdiIpCmBgYAoKYGBge3J9CmhlYWQoZGF0YSkKYGBgCgpUaGlzIGRhdGFzZXQgaGFzIGEgbG90IG9mIGluZm9ybWF0aW9uLCBJIGFtIGdvaW5nIHRvIHBhcmUgaXQgZG93biB0byBzb21lIG9mIHRoZQpwcm9kdWNlIGl0ZW1zIGluIG9yZGVyIHRvIHByYWN0aWNlIHRoZSBwaXZvdCBjb21tYW5kcyBtb3JlIGVmZmVjdGl2ZWx5LiBJZgp3ZSB3YW50ZWQgdG8gd29yayB3aXRoIHRoZSBkYXRhc2V0LCBpdCB3b3VsZCBiZSBlYXN5IGVub3VnaCB0byByZXBlYXQgdGhlIApzaW1pbGFyIHN0ZXBzIGZvciB0aGUgcmVzdCBvZiB0aGUgZGF0YSBpbiB0aGlzIGZpbGUuCgpgYGB7cn0KcHJvZHVjZSA8LSBkYXRhICU+JQogIGZpbHRlcihjb21tb2RpdHkgJWluJSBjKCJPQVRTIiwgIkJBUkxFWSIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgIlNPWUJFQU5TIiwgIkJFQU5TIiwgIkNPVFRPTiIsICJUT0JBQ0NPIiwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgIkhBWSAmIEhBWUxBR0UiLCAiUklDRSIsICJTVU5GTE9XRVIiLCAKICAgICAgICAgICAgICAgICAgICAgICAgICAiU1VHQVJCRUVUUyIsICJTVUdBUkNBTkUiLCAiUEVBTlVUUyIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgIlBPVEFUT0VTIiwgIlNXRUVUIFBPVEFUT0VTIikpCmBgYAoKYGBge3J9CnByb2R1Y2UgPC0gcHJvZHVjZSAlPiUgc2VsZWN0KGMoImNvbW1vZGl0eSIsICJkYXRhLml0ZW0iLCAiWDIwMTciLCAiWDIwMTIiLAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICJYMjAwNyIsICJYMjAwMiIsICJYMTk5NyIpKQpgYGAKCmBgYHtyfQpoZWFkKHByb2R1Y2UpCmBgYAoKV2UnbGwgY2FsbCB0aGlzIHRoZSAnb3JpZ2luYWwnIGRhdGEuCgpgYGB7cn0KVmlldyhwcm9kdWNlKQpgYGAKClJlbmFtZSB0aGUgeWVhciBjb2x1bW5zOgoKYGBge3J9Cm5hbWVzKHByb2R1Y2UpCmBgYAoKYGBge3J9Cm5hbWVzKHByb2R1Y2UpIDwtIGMoImNvbW1vZGl0eSIsICJkYXRhLml0ZW0iLCAiMjAxNyIsICIyMDEyIiwgIjIwMDciLCAiMjAwMiIsCiAgICAgICAgICAgICAgICAgICAgIjE5OTciKQpgYGAKCmBgYHtyfQpoZWFkKHByb2R1Y2UpCmBgYAoKTm93IHdlIG5lZWQgdG8gY2xlYW4gdXAgdGhlIHJlZHVuZGFudCBpbmZvcm1hdGlvbiBpbiB0aGUgZGF0YS5pdGVtIGNvbHVtbgoKYGBge3J9CnByb2R1Y2VbZ3JlcGwoIk9QRVJBVElPTlMiLCBwcm9kdWNlJGRhdGEuaXRlbSksIF0kZGF0YS5pdGVtID0gCiAgIk9QRVJBVElPTlMgV0lUSCBBUkVBIEhBUlZFU1RFRCIKcHJvZHVjZVtncmVwbCgiQUNSRVMiLCBwcm9kdWNlJGRhdGEuaXRlbSksIF0kZGF0YS5pdGVtID0gCiAgIkFDUkVTIEhBUlZFU1RFRCIKIyBXZSBjb3VsZCB0dXJuIHRoZSBwcm9kdWN0aW9uIHVuaXQgaW50byBhIGZhY3RvciBjb2x1bW4sIGJ1dCB3ZSdsbCBqdXN0IGFzc3VtZQojIGEgY29tbW9uIHVuaXQgZm9yIG5vdwpwcm9kdWNlW2dyZXBsKCJQUk9EVUNUSU9OIiwgcHJvZHVjZSRkYXRhLml0ZW0pLCBdJGRhdGEuaXRlbSA9IAogICJQUk9EVUNUSU9OIgpgYGAKCmBgYHtyfQpoZWFkKHByb2R1Y2UpCmBgYAoKPGI+CkluIG9yZGVyIHRvICJ0aWR5IiB0aGUgZGF0YSwgIlllYXIiLCAiT1BFUkFUSU9OUyBXSVRIIEFSRUEgSEFSVkVTVEVEIiwKIkFDUkVTIEhBUlZFU1RFRCIsIGFuZCAiUFJPRFVDVElPTiIgc2hvdWxkIGJlIGNvbHVtbnMuCjwvYj4KCmBgYHtyfQojIHBpdm90IHRoZSB5ZWFyIGNvbHVtbgpwcm9kdWNlIDwtIHByb2R1Y2UgJT4lCiAgcGl2b3RfbG9uZ2VyKGNvbHMgPSAtYygiY29tbW9kaXR5IiwgImRhdGEuaXRlbSIpLCBuYW1lc190byA9ICJ5ZWFyIikKYGBgCgpgYGB7cn0Kdmlldyhwcm9kdWNlKQpgYGAKCmBgYHtyfQpwcm9kdWNlICU+JQogIHBpdm90X3dpZGVyKG5hbWVzX2Zyb20gPSBkYXRhLml0ZW0sIHZhbHVlc19mcm9tID0gdmFsdWUpCmBgYAoKPGI+CldlIG5vdyBoYXZlIHRoZSBwcm9kdWNlIGRhdGEgYXJyYW5nZWQgaW4gYSAidGlkeSIgZm9ybWF0LCB3aGVyZSBlYWNoIHJvdyAKY29ycmVzcG9uZHMgdG8gYSBzcGVjaWZpYyBtZWFzdXJlbWVudCBhbmQgZWFjaCBjb2x1bW4gY29ycmVzcG9uZHMgdG8gYSBzaW5nbGUgCnZhcmlhYmxlLgo8L2I+CgojIFVudGlkeSBkYXRhc2V0IDIKClNvdXJjZTogaHR0cHM6Ly9naXRodWIuY29tL2t3c3RhdC91bnRpZHlkYXRhMi90cmVlL21hc3Rlci9pbnN0L21lc3N5ZGF0YQoKYGBge3J9CmRhdGEgPSByZWFkX2NzdigiL1VzZXJzL2RhbmllbGxlZmV2cmUvRG9jdW1lbnRzL0RBVEExMDEvdW50aWR5ZGF0YTItbWFzdGVyL2luc3QvbWVzc3lkYXRhL2dkcF9ieV9jb3VudHkuY3N2IikKYGBgCgpgYGB7cn0KdmlldyhkYXRhKQpgYGAKCmBgYHtyfQpzdHIoZGF0YSkKYGBgCgpgYGB7cn0KbmFtZXMoZGF0YSkgPSBjKCJGSVBTIiwgIkNvdW50eW5hbWUiLCAiUG9zdGFsIiwgIkxpbmVDb2RlIiwgIkluZHVzdHJ5TmFtZSIsICIyMDEyIiwgIjIwMTMiLCAiMjAxNCIsICIyMDE1IikKYGBgCgpgYGB7cn0KbmFtZXMoZGF0YSkKYGBgCgpgYGB7cn0KZGF0YSA8LSBzbGljZShkYXRhLCA0OjEyNDU5KQpgYGAKCmBgYHtyfQp2aWV3KGRhdGEpCmBgYAoKYGBge3J9CmRhdGEgPC0gcGl2b3RfbG9uZ2VyKGRhdGEsIGNvbHM9YygiMjAxMiIsICIyMDEzIiwgIjIwMTQiLCAiMjAxNSIpLCBuYW1lc190byA9ICJ5ZWFyIikKYGBgCgpgYGB7cn0KaGVhZChkYXRhKQpgYGAKCjxiPgpCeSBtb3ZpbmcgdGhlIGRhdGEgZm9yIHllYXJzIGludG8gYSBzaW5nbGUgY29sdW1uLCB3ZSBub3cgaGF2ZSB0aGUgZGF0YSBpbiBhIAp0aWR5IGZvcm1hdCB3aGVyZSBlYWNoIHJvdyByZXByZXNlbnRzIGEgc2luZ2xlIG1lYXN1cmVtZW50IGFuZCBlYWNoIHZhcmlhYmxlIGlzCnJlcHJlc2VudGVkIGJ5IG9uZSBjb2x1bW4KPC9iPgoKIyBUaWR5IGRhdGFzZXRzCgojIFRpZHkgZGF0YXNldCAxCgo8Yj4KVGhlIG1wZyBkYXRhc2V0CjwvYj4KCmBgYHtyfQpoZWFkKG1wZykKYGBgCjxiPgpUaGUgbXBnIGRhdGFzZXQgaXMgdGlkeTsgZWFjaCB2YXJpYWJsZSBpcyByZXByZXNlbnRlZCBieSBhIHNpbmdsZSBjb2x1bW4sCmFuZCBlYWNoIHJvdyByZXByZXNlbnRzIGEgc3BlY2lmaWMgZGF0YSBpdGVtIChhIGNhcikKPC9iPgoKIyBUaWR5IGRhdGFzZXQgMgoKVGhlIEthZ2dsZSBOZXRmbGl4IE1vdmllIGRhdGFzZXQKClNvdXJjZTogaHR0cHM6Ly93d3cua2FnZ2xlLmNvbS9zaGl2YW1iL25ldGZsaXgtc2hvd3MKCmBgYHtyfQptb3ZpZXMgPC0gcmVhZC5jc3YoIi9Vc2Vycy9kYW5pZWxsZWZldnJlL0RvY3VtZW50cy9EQVRBMTAxL25ldGZsaXhfdGl0bGVzLmNzdiIpCmBgYAoKYGBge3J9CmhlYWQobW92aWVzKQpgYGAKPC9iPgpUaGlzIGRhdGFzZXQgaXMgYWxyZWFkeSB0aWR5OyBlYWNoIHJvdyByZXByZXNlbnRzIGEgc2luZ2xlIG1lZGlhIGl0ZW0gKE1vdmllIG9yIApUViBzaG93KS4gRWFjaCBjb2x1bW4gcmVwcmVzZW50cyBhIHNwZWNpZmljIHZhcmlhYmxlLgo8L2I+CgojIFRpZHkgZGF0YXNldCAzCgo8Yj4KVGhlIFVTIEZvb2QgQXNzaXN0YW5jZSBEYXRhc2V0CgpTb3VyY2U6IGh0dHBzOi8vd3d3LmthZ2dsZS5jb20vanBtaWxsZXIvcHVibGljYXNzaXN0YW5jZQo8L2I+CgpgYGB7cn0Kc25hcCA8LSByZWFkLmNzdigiL1VzZXJzL2RhbmllbGxlZmV2cmUvRG9jdW1lbnRzL0RBVEExMDEvc25hcC9TTkFQX2hpc3RvcnlfMTk2OV8yMDE5LmNzdiIpCmBgYAoKYGBge3J9CmhlYWQoc25hcCkKYGBgCgo8Yj4KVGhpcyBkYXRhc2V0IGlzIGFscmVhZHkgdGlkeTsgZWFjaCByb3cgcmVwcmVzZW50cyBjb2xsZWN0ZWQgc3RhdGlzdGljcyBmb3IgdGhlIApTTkFQIHByb2dyYW0gZm9yIGEgc2luZ2xlIHllYXIuCjwvYj4=