Data transformation using dplyr & tidyr.
The first dataset is from the UCI Machine Learning Repository. [https://archive.ics.uci.edu/ml/datasets/wiki4HE]. The CSV contains a survey of faculty members from two Spanish universities on teaching uses of Wikipedi, with 43 Likert scale variables in “wide” format. I’m interested in five of the variables, columns Qu1 to Qu5. I want to see what the average rating is for these five variables for users 55 and over.
library(stringr)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
sdata <- read.csv("https://raw.githubusercontent.com/fdsps/IS607/master/wiki4HE.csv", sep=";", stringsAsFactors=FALSE, na.strings="c(3)")
## Warning in file(file, "rt"): DLL attempted to change FPU control word from
## 8001f to 9001f
# filter age condition, select target variables, gather to "long" format, and summarize.
sdata %>% filter(AGE >=55) %>% select(Qu1:Qu5) %>% gather(Q, A) %>% group_by(Q) %>% summarise(mean=mean(A))
## Source: local data frame [5 x 2]
##
## Q mean
## (fctr) (dbl)
## 1 Qu1 3.030303
## 2 Qu2 3.151515
## 3 Qu3 2.924242
## 4 Qu4 3.242424
## 5 Qu5 2.969697
The same results are obtained with ddply:
sdata %>% filter(AGE >=55) %>% select(Qu1:Qu5) %>% gather(Q, A) %>% group_by(Q) %>% ddply(.(Q), summarize, mean = mean(A))
## Q mean
## 1 Qu1 3.030303
## 2 Qu2 3.151515
## 3 Qu3 2.924242
## 4 Qu4 3.242424
## 5 Qu5 2.969697
Or we simply use plyr::colwise on the original table:
sdata %>% filter(AGE >=55) %>% select(Qu1:Qu5) %>% plyr::colwise(mean)()
## Qu1 Qu2 Qu3 Qu4 Qu5
## 1 3.030303 3.151515 2.924242 3.242424 2.969697
The next dataset is the 1970 Expected Survival Table from the National Cancer Institute. [http://seer.cancer.gov/expsurvival/1970.html]. This wide table has variables as columns with duplicate variable columns tucked underneath.
library(XML)
## Warning: package 'XML' was built under R version 3.2.3
ht <- readHTMLTable("http://seer.cancer.gov/expsurvival/1970.html",as.data.frame = TRUE, which=1, stringsAsFactors = FALSE)
head(ht)
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 Male Female Male Female Male Female Male Female <NA>
## 2 0 0.97994 0.98468 0.96394 0.97076 0.98225 0.98585 0.97755 0.98254
## 3 1 0.99884 0.99899 0.99776 0.99804 0.99852 0.99876 0.99867 0.99884
## 4 2 0.99917 0.99933 0.99841 0.99865 0.99893 0.99919 0.99906 0.99923
## 5 3 0.99928 0.99946 0.99889 0.99908 0.99921 0.99933 0.99922 0.9994
## 6 4 0.99941 0.99953 0.99903 0.99929 0.99944 0.99963 0.99936 0.99949
The header we want came in as row 1 and must be adjusted.
hed <- ht[1,1:8] %>% unlist() %>% as.vector()
colnames(ht)<- c("age", hed)
ht <- ht[-1,1:9]
# ht[-1,] and slice(ht,-1) jumbles the cell values, I suspect due to duplicate col names
head(ht)
## age Male Female Male.1 Female.1 Male.2 Female.2 Male.3 Female.3
## 2 0 0.97994 0.98468 0.96394 0.97076 0.98225 0.98585 0.97755 0.98254
## 3 1 0.99884 0.99899 0.99776 0.99804 0.99852 0.99876 0.99867 0.99884
## 4 2 0.99917 0.99933 0.99841 0.99865 0.99893 0.99919 0.99906 0.99923
## 5 3 0.99928 0.99946 0.99889 0.99908 0.99921 0.99933 0.99922 0.9994
## 6 4 0.99941 0.99953 0.99903 0.99929 0.99944 0.99963 0.99936 0.99949
## 7 5 0.99946 0.9996 0.99916 0.99937 0.99945 0.99957 0.99942 0.99957
Looks good, but the male/female values must be associated with the “race” header from the html table. I will alter the labels and split the columns later.
hed2 <- str_c( c('male.','female.'),c('w','w','b','b','o','o','u','u'))
# w - white, b - black, o - other, u - unknown
colnames(ht)<- c("age", hed2)
longht <- gather(ht, sex, rate, -age) %>% arrange(as.numeric(age))
head(longht)
## age sex rate
## 1 0 male.w 0.97994
## 2 0 female.w 0.98468
## 3 0 male.b 0.96394
## 4 0 female.b 0.97076
## 5 0 male.o 0.98225
## 6 0 female.o 0.98585
longht <- separate(longht, sex, c('sex','race'), sep = "[.]")
head(longht)
## age sex race rate
## 1 0 male w 0.97994
## 2 0 female w 0.98468
## 3 0 male b 0.96394
## 4 0 female b 0.97076
## 5 0 male o 0.98225
## 6 0 female o 0.98585
Now we might as the question: what was the mean expected survival rate for black females ages 55 to 65 in 1970?
filter(longht,sex == 'female', race =='b', age %in% 55:65) %>% select(rate) %>% unlist %>% as.numeric %>% mean
## [1] 0.9792682
The next dataset is a table of infectious disease stats from the Rhode Island Department of Health [http://www.health.ri.gov/data/diseases/].
ht <- readHTMLTable("http://www.health.ri.gov/data/diseases/",as.data.frame = TRUE, which=1, trim = TRUE, stringsAsFactors = FALSE)
# Again, the column names came in as row 1.
hed <- ht[1,1:7] %>% unlist() %>% as.vector()
colnames(ht)<- c("Disease", hed)
ht <- ht[-1,1:8]
# Clean up extra spaces in the Disease column. stringr doesn't handle the second gsub pass properly.
ht[,1] <- gsub(".data", "", ht[,1])
ht[,1] <- gsub("\\s+", " ", ht[,1])
# Gather and drop the computed mean/median columns.
longht <- select(ht, -c(Mean,Median)) %>% gather(year,cases,-Disease)
head(longht)
## Disease year cases
## 1 Babesiosis 2009 89
## 2 Campylobacteriosis 2009 111
## 3 Chlamydia 2009 3615
## 4 Cryptosporidosis 2009 22
## 5 Shiga-toxin positive E. coli 2009 38
## 6 Eastern Equine Encephalitis (EEE) 2009 0
Now we can ask for a breakdown of diseases as % of total reported from 2009 to 2013.
longht[,3]<-as.numeric(longht[,3])
tot<-sum(longht$cases)
sumt <- longht %>% group_by(Disease) %>% summarise(Sum_Of_Cases=sum(cases)) %>% as.data.frame
cbind(sumt, '%_of_total'= round(sumt$Sum_Of_Cases/tot,4))
## Disease Sum_Of_Cases %_of_total
## 1 Babesiosis 436 0.0149
## 2 Campylobacteriosis 608 0.0207
## 3 Chlamydia 19866 0.6776
## 4 Cryptosporidosis 80 0.0027
## 5 Eastern Equine Encephalitis (EEE) 1 0.0000
## 6 Ehrlichiosis / Anaplasmosis 385 0.0131
## 7 Giardiasis 336 0.0115
## 8 Gonorrhea 1934 0.0660
## 9 Haemophilus Influenzae, invasive 75 0.0026
## 10 Hepatitis A 33 0.0011
## 11 HIV (Human Immunodeficiency Virus) 480 0.0164
## 12 Legionnellosis 170 0.0058
## 13 Listeriosis 18 0.0006
## 14 Lyme Disease 1516 0.0517
## 15 Malaria 57 0.0019
## 16 Measles 2 0.0001
## 17 Meningococcal Disease 7 0.0002
## 18 Mumps 13 0.0004
## 19 Pertussis 424 0.0145
## 20 Pneumococcal Disease 467 0.0159
## 21 Rocky Mountain Spotted Fever 19 0.0006
## 22 Rubella (including congenital) 0 0.0000
## 23 Salmonellosis 758 0.0259
## 24 Shiga-toxin positive E. coli 54 0.0018
## 25 Shigellosis 335 0.0114
## 26 Streptococcus, Group A 120 0.0041
## 27 Streptococcus, Group B 280 0.0096
## 28 Syphilis (congenital) 1 0.0000
## 29 Syphilis (early latent) 100 0.0034
## 30 Syphilis (late) 111 0.0038
## 31 Syphilis (primary and secondary) 196 0.0067
## 32 Tuberculosis | demographics 127 0.0043
## 33 Varicella 264 0.0090
## 34 Vibriosis 38 0.0013
## 35 West Nile Virus (WNV) 6 0.0002