Project 2

Data transformation using dplyr & tidyr.

The first dataset is from the UCI Machine Learning Repository. [https://archive.ics.uci.edu/ml/datasets/wiki4HE]. The CSV contains a survey of faculty members from two Spanish universities on teaching uses of Wikipedi, with 43 Likert scale variables in “wide” format. I’m interested in five of the variables, columns Qu1 to Qu5. I want to see what the average rating is for these five variables for users 55 and over.

library(stringr)
library(plyr)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

sdata <- read.csv("https://raw.githubusercontent.com/fdsps/IS607/master/wiki4HE.csv", sep=";", stringsAsFactors=FALSE, na.strings="c(3)")

## Warning in file(file, "rt"): DLL attempted to change FPU control word from
## 8001f to 9001f

# filter age condition, select target variables, gather to "long" format, and summarize.

sdata %>% filter(AGE >=55) %>% select(Qu1:Qu5) %>% gather(Q, A) %>% group_by(Q) %>% summarise(mean=mean(A))

## Source: local data frame [5 x 2]
## 
##        Q     mean
##   (fctr)    (dbl)
## 1    Qu1 3.030303
## 2    Qu2 3.151515
## 3    Qu3 2.924242
## 4    Qu4 3.242424
## 5    Qu5 2.969697

The same results are obtained with ddply:

sdata %>% filter(AGE >=55) %>% select(Qu1:Qu5) %>% gather(Q, A) %>% group_by(Q) %>% ddply(.(Q), summarize, mean = mean(A))

##     Q     mean
## 1 Qu1 3.030303
## 2 Qu2 3.151515
## 3 Qu3 2.924242
## 4 Qu4 3.242424
## 5 Qu5 2.969697

Or we simply use plyr::colwise on the original table:

sdata %>% filter(AGE >=55) %>% select(Qu1:Qu5) %>% plyr::colwise(mean)()

##        Qu1      Qu2      Qu3      Qu4      Qu5
## 1 3.030303 3.151515 2.924242 3.242424 2.969697

The next dataset is the 1970 Expected Survival Table from the National Cancer Institute. [http://seer.cancer.gov/expsurvival/1970.html]. This wide table has variables as columns with duplicate variable columns tucked underneath.

library(XML)

## Warning: package 'XML' was built under R version 3.2.3

ht <- readHTMLTable("http://seer.cancer.gov/expsurvival/1970.html",as.data.frame = TRUE, which=1, stringsAsFactors = FALSE)

head(ht)

##     V1      V2      V3      V4      V5      V6      V7      V8      V9
## 1 Male  Female    Male  Female    Male  Female    Male  Female    <NA>
## 2    0 0.97994 0.98468 0.96394 0.97076 0.98225 0.98585 0.97755 0.98254
## 3    1 0.99884 0.99899 0.99776 0.99804 0.99852 0.99876 0.99867 0.99884
## 4    2 0.99917 0.99933 0.99841 0.99865 0.99893 0.99919 0.99906 0.99923
## 5    3 0.99928 0.99946 0.99889 0.99908 0.99921 0.99933 0.99922  0.9994
## 6    4 0.99941 0.99953 0.99903 0.99929 0.99944 0.99963 0.99936 0.99949

The header we want came in as row 1 and must be adjusted.

hed <- ht[1,1:8] %>% unlist() %>% as.vector()

colnames(ht)<- c("age", hed)

ht <- ht[-1,1:9]

#  ht[-1,] and slice(ht,-1) jumbles the cell values, I suspect due to duplicate col names

head(ht)

##   age    Male  Female  Male.1 Female.1  Male.2 Female.2  Male.3 Female.3
## 2   0 0.97994 0.98468 0.96394  0.97076 0.98225  0.98585 0.97755  0.98254
## 3   1 0.99884 0.99899 0.99776  0.99804 0.99852  0.99876 0.99867  0.99884
## 4   2 0.99917 0.99933 0.99841  0.99865 0.99893  0.99919 0.99906  0.99923
## 5   3 0.99928 0.99946 0.99889  0.99908 0.99921  0.99933 0.99922   0.9994
## 6   4 0.99941 0.99953 0.99903  0.99929 0.99944  0.99963 0.99936  0.99949
## 7   5 0.99946  0.9996 0.99916  0.99937 0.99945  0.99957 0.99942  0.99957

Looks good, but the male/female values must be associated with the “race” header from the html table. I will alter the labels and split the columns later.

hed2 <- str_c( c('male.','female.'),c('w','w','b','b','o','o','u','u'))
# w - white, b - black, o - other, u - unknown

colnames(ht)<- c("age", hed2)

longht <- gather(ht, sex, rate, -age) %>% arrange(as.numeric(age))
head(longht)

##   age      sex    rate
## 1   0   male.w 0.97994
## 2   0 female.w 0.98468
## 3   0   male.b 0.96394
## 4   0 female.b 0.97076
## 5   0   male.o 0.98225
## 6   0 female.o 0.98585

longht <- separate(longht, sex, c('sex','race'), sep = "[.]")
head(longht)

##   age    sex race    rate
## 1   0   male    w 0.97994
## 2   0 female    w 0.98468
## 3   0   male    b 0.96394
## 4   0 female    b 0.97076
## 5   0   male    o 0.98225
## 6   0 female    o 0.98585

Now we might as the question: what was the mean expected survival rate for black females ages 55 to 65 in 1970?

filter(longht,sex == 'female', race =='b', age %in% 55:65) %>% select(rate) %>% unlist %>% as.numeric %>% mean

## [1] 0.9792682

The next dataset is a table of infectious disease stats from the Rhode Island Department of Health [http://www.health.ri.gov/data/diseases/].

ht <- readHTMLTable("http://www.health.ri.gov/data/diseases/",as.data.frame = TRUE, which=1, trim = TRUE, stringsAsFactors = FALSE)

# Again, the column names came in as row 1.

hed <- ht[1,1:7] %>% unlist() %>% as.vector()

colnames(ht)<- c("Disease", hed)

ht <- ht[-1,1:8]


# Clean up extra spaces in the Disease column. stringr doesn't handle the second gsub pass properly.

ht[,1] <- gsub(".data", "", ht[,1])
ht[,1] <- gsub("\\s+", " ", ht[,1])

# Gather and drop the computed mean/median columns.
longht <- select(ht, -c(Mean,Median)) %>% gather(year,cases,-Disease)

head(longht)

##                             Disease year cases
## 1                       Babesiosis  2009    89
## 2                Campylobacteriosis 2009   111
## 3                         Chlamydia 2009  3615
## 4                 Cryptosporidosis  2009    22
## 5      Shiga-toxin positive E. coli 2009    38
## 6 Eastern Equine Encephalitis (EEE) 2009     0

Now we can ask for a breakdown of diseases as % of total reported from 2009 to 2013.

longht[,3]<-as.numeric(longht[,3])

tot<-sum(longht$cases)
sumt <- longht %>% group_by(Disease) %>% summarise(Sum_Of_Cases=sum(cases)) %>% as.data.frame
cbind(sumt, '%_of_total'= round(sumt$Sum_Of_Cases/tot,4))

##                               Disease Sum_Of_Cases %_of_total
## 1                         Babesiosis           436     0.0149
## 2                  Campylobacteriosis          608     0.0207
## 3                           Chlamydia        19866     0.6776
## 4                   Cryptosporidosis            80     0.0027
## 5   Eastern Equine Encephalitis (EEE)            1     0.0000
## 6         Ehrlichiosis / Anaplasmosis          385     0.0131
## 7                         Giardiasis           336     0.0115
## 8                          Gonorrhea          1934     0.0660
## 9    Haemophilus Influenzae, invasive           75     0.0026
## 10                        Hepatitis A           33     0.0011
## 11 HIV (Human Immunodeficiency Virus)          480     0.0164
## 12                    Legionnellosis           170     0.0058
## 13                       Listeriosis            18     0.0006
## 14                       Lyme Disease         1516     0.0517
## 15                           Malaria            57     0.0019
## 16                            Measles            2     0.0001
## 17              Meningococcal Disease            7     0.0002
## 18                             Mumps            13     0.0004
## 19                         Pertussis           424     0.0145
## 20               Pneumococcal Disease          467     0.0159
## 21       Rocky Mountain Spotted Fever           19     0.0006
## 22     Rubella (including congenital)            0     0.0000
## 23                     Salmonellosis           758     0.0259
## 24       Shiga-toxin positive E. coli           54     0.0018
## 25                       Shigellosis           335     0.0114
## 26             Streptococcus, Group A          120     0.0041
## 27             Streptococcus, Group B          280     0.0096
## 28              Syphilis (congenital)            1     0.0000
## 29            Syphilis (early latent)          100     0.0034
## 30                    Syphilis (late)          111     0.0038
## 31   Syphilis (primary and secondary)          196     0.0067
## 32        Tuberculosis | demographics          127     0.0043
## 33                          Varicella          264     0.0090
## 34                          Vibriosis           38     0.0013
## 35              West Nile Virus (WNV)            6     0.0002

Project 2

FD