Data Aggregation

Heike Hofmann
Stat 579, Fall 2013

Homework

Homework #4 now online

Please submit Homework #3 online to the Blackboard system.

Grading Scheme C/S/O - each area gives 3.5 points, you could potentially over-score

Best answers for homework #1 and #2 are now online

Homework #1

overall: fantastic job!

biggest reasons for missing points:

plots lacking in variety (for instance, only scatterplots when boxplots or histograms would have been more appropriate)
no use of any color, sizing, faceting, etc to highlight differences.
very little analysis included … plots need explanations, they don't speak for themselves.

Homework #2

overall: again, great work!

biggest reasons for missing points:

missing interpretation/description for plots, DRY principle, …

Outline

Aggregation with reshape
Aggregation with plyr
- summarise, transform
- ddply

Economical Status

The file economical-status-cleaned.csv is a cleaned extract of the General Social Survey, with variables
- region, income, happy, age, finrela, marital, degree, health, wrkstat, partyid, polviews, sex, year, and coninc
- coninc is 'constant dollar income', i.e. it the dollar amount adjusted for inflation

gss <- read.csv("http://www.hofroe.net/stat579/economical-status-cleaned-full.csv")

library(ggplot2)

The reshape package

cast is a function of the reshape package.
cast consists of row variables, column variables, a summary function (sum, mean, max, etc), and a value variable
- cast(molten, row ~ col, summary, value)
- cast(molten, row1 + row2 ~ col, summary, value)
- cast(molten, row ~ . , summary, value)
- cast(molten, . ~ col, summary, value)

# average  income
cast(gss, .~., mean, value="coninc", na.rm=T)

  value (all)
1 (all) 44503

# average  income by degree for years 2012 and 1972
cast(subset(gss, year %in% c(1972, 2012)), year~degree, mean, value="coninc", na.rm=T)

  year BACHELOR GRADUATE HIGH SCHOOL JUNIOR COLLEGE
1 1972    50458    76769       42550          47839
2 2012    75174    89988       37453          46362
  LT HIGH SCHOOL    NA
1          27613 39276
2          20933   NaN

# income by gender
cast(gss, sex ~., mean, value="coninc", na.rm=T)

     sex (all)
1 FEMALE 41020
2   MALE 48764


# income by region
cast(gss, . ~ region, mean, value="coninc", na.rm=T)

  value E. NOR. CENTRAL E. SOU. CENTRAL MIDDLE ATLANTIC
1 (all)           44616           35403           48530
  MOUNTAIN NEW ENGLAND PACIFIC SOUTH ATLANTIC
1    42793       53586   49198          42937
  W. NOR. CENTRAL W. SOU. CENTRAL
1           41507           39867

Bigger Examples

summary of income by degree, work status and region:

library(reshape)
dframe <- cast(gss, degree + wrkstat+region~., value="coninc", fun=mean, na.rm=T)
require(ggplot2)
dframe$degree <- factor(dframe$degree, levels=c("LT HIGH SCHOOL", "HIGH SCHOOL", "JUNIOR COLLEGE", "BACHELOR", "GRADUATE"))
qplot(degree, `(all)`, data=dframe, facets=~wrkstat, colour=region) + coord_flip()

plot of chunk unnamed-chunk-4

Example (2)

median income over time by region and degree:

dframe <- cast(gss, degree + year+region~., value="coninc", fun=median, na.rm=T)
dframe$degree <- factor(dframe$degree, levels=c("LT HIGH SCHOOL", "HIGH SCHOOL", "JUNIOR COLLEGE", "BACHELOR", "GRADUATE"))
qplot(year, `(all)`, data=dframe, facets=~degree, colour=region) + geom_smooth(aes(group=1))

plot of chunk unnamed-chunk-5

Your Turn

Use reshape to find answers to the following questions. Plotting results might help with the interpretation.

use cast to find average income (coninc) by year, region, and degree
how many values are in each of the cells? (Hint: answer that with a similar statement … think about the aggregation function)
how many of the values in each cell are not missing values?
what is the standard deviation of income in each of the cells?

library(reshape)
dframe <- cast(gss, year+degree+region~., value="coninc", fun=mean, na.rm=T)
library(ggplot2)
qplot(year, `(all)`, data=dframe, colour=region, facets=~degree, size=I(4)) + geom_smooth(aes(group=1))

plot of chunk unnamed-chunk-6

Income inreases steadily over the years for bachelor degree and graduate levels. For all other degrees the curves look pretty flat.

dframe2 <- cast(gss, year+degree+region~., value="coninc", fun=length)
qplot(year, `(all)`, data=dframe2, colour=region, facets=~degree, size=I(4))

plot of chunk unnamed-chunk-7

The largest number of records are for people with high school degrees - that seems strange. We also see that the number of records increases in the mid nineties, when the survey is done less often as before.

number of non-missing values? - We don't know a function that gives us the number of valid values in one go (but a combination of sum(na.omit(x)) would do it). We can't solve that quite yet within cast.

dframe3 <- cast(gss, year+degree+region~., value="coninc", fun=sd, na.rm=T)
qplot(year, `(all)`, data=dframe3, colour=region, facets=~degree, size=I(4))

plot of chunk unnamed-chunk-8

standard deviation of income is quite large - it's about the same as mean income.

… we would like to compare mean income and standard deviation, but they are in different data frames, and the order of the cells might not be quite the same (at least we cannot rely on it being the same)

We want to do different aggregations in the same data set next.

Package plyr

library(plyr)

Functions summarise, transform
transform extends the old data set, summarise makes a new one

alt text

x <- rep(c("a", "b", "c"), each=2)
y <- c(2,4,0,5,5,10)
dframe <- data.frame(x,y)

ddply(dframe, "x", summarise, y=mean(y))

  x   y
1 a 3.0
2 b 2.5
3 c 7.5

summarise

summarise(gss, 
  avgincome=mean(coninc, na.rm=T),
  avgage=mean(age, na.rm=T))

  avgincome avgage
1     44503   45.7

ddply(gss, .(year), summarise, 
  avgincome=mean(coninc, na.rm=T),
  avgage=mean(age, na.rm=T))

   year avgincome avgage
1  1972     38418  44.95
2  1973     41493  44.18
3  1974     42054  44.59
4  1975     38698  44.31
5  1976     37282  45.29
6  1977     44303  44.66
7  1978     41059  44.01
8  1980     43950  44.97
9  1982     35771  44.86
10 1983     42244  44.30
11 1984     39713  44.00
12 1985     41554  45.71
13 1986     40488  45.43
14 1987     40366  44.92
15 1988     41203  45.37
16 1989     43787  45.44
17 1990     45090  45.96
18 1991     39912  45.63
19 1993     45076  46.05
20 1994     44554  45.97
21 1996     46273  44.78
22 1998     45761  45.56
23 2000     47897  46.02
24 2002     50074  46.28
25 2004     53019  45.96
26 2006     49842  47.14
27 2008     50738  47.71
28 2010     45697  47.97
29 2012     48385  48.19

dframe <- ddply(gss, .(year), summarise, 
  avgincome=mean(coninc, na.rm=T),
  avgage=mean(age, na.rm=T))

qplot(year, avgincome, data=dframe, size=I(3)) + geom_smooth()

plot of chunk unnamed-chunk-12

ddply

ddply part of the plyr package
usage: ddply(.data, .variables, .fun = NULL, ...)
- .data is data frame
- .variables (set of) categorical variable(s), whose combinations determine subsets of interest, for which aggregations should be done *.fun aggregation function *....fun
functions summarise and transform are the ones we'll use most often for aggregations

Examples

# summarise over multiple variables
happy <- ddply(gss, .(sex, year), summarise, 
  happy=sum(happy %in%  c("VERY HAPPY", "PRETTY HAPPY"))/length(happy)
 )

# what’s the difference to the above code?
happy <- ddply(gss, .(sex, year), summarise, 
  happy=sum(happy %in%  c("VERY HAPPY", "PRETTY HAPPY"))/sum(!is.na(happy))
 )

# include size of subset:
happy <- ddply(gss, .(sex, year), summarise, 
  happy=sum(happy %in%  c("VERY HAPPY", "PRETTY HAPPY"))/sum(!is.na(happy)),
  n = length(happy)
)

qplot(year, happy, colour=sex, data=happy, size=I(3))

plot of chunk unnamed-chunk-14

Your Turn

What is the ratio of happy and very happy people by age? Is there a difference between the genders? How reliable are the results (make sure to keep track of the number of respondents in each cell)
Are median income and happiness related? Calculate percentage of pretty ahppy and very happy respondents and relate to median income. Draw a scatterplot. What is the correlation?
Find mean income and standard deviation by degree, region and work status (wrkstat). Plot mean versus standard deviation. What is your conclusion for the distribution of income?

agehappy <- ddply(gss, .(age), summarise,
                  happy= sum(happy %in% c("PRETTY HAPPY", "VERY HAPPY"))/sum(!is.na(happy)),
                  n = length(age))
qplot(age, happy, data=agehappy) + geom_smooth()

plot of chunk unnamed-chunk-15

summary(agehappy)

      age           happy             n       
 Min.   :18.0   Min.   :0.793   Min.   :  99  
 1st Qu.:35.8   1st Qu.:0.863   1st Qu.: 515  
 Median :53.5   Median :0.872   Median : 810  
 Mean   :53.5   Mean   :0.870   Mean   : 782  
 3rd Qu.:71.2   3rd Qu.:0.881   3rd Qu.:1092  
 Max.   :89.0   Max.   :0.911   Max.   :1314  
 NA's   :1

agesexhappy <- ddply(gss, .(age, sex), summarise,
                  happy= sum(happy %in% c("PRETTY HAPPY", "VERY HAPPY"))/sum(!is.na(happy)),
                  n = length(age))
qplot(age, happy, data=agesexhappy, colour=sex) + geom_smooth()

plot of chunk unnamed-chunk-16

summary(agesexhappy)

      age           sex         happy             n      
 Min.   :18.0   FEMALE:73   Min.   :0.770   Min.   : 34  
 1st Qu.:35.8   MALE  :73   1st Qu.:0.856   1st Qu.:256  
 Median :53.5               Median :0.874   Median :412  
 Mean   :53.5               Mean   :0.871   Mean   :391  
 3rd Qu.:71.2               3rd Qu.:0.886   3rd Qu.:541  
 Max.   :89.0               Max.   :0.920   Max.   :737  
 NA's   :2

agehappy <- ddply(gss, .(age), summarise,
                  happy= sum(happy %in% c("PRETTY HAPPY", "VERY HAPPY"))/sum(!is.na(happy)),
                  income = median(coninc, na.rm=TRUE),
                  n = length(age))
qplot(income, happy, data=agehappy) + geom_smooth()

plot of chunk unnamed-chunk-17

with(agehappy, cor(income, happy))

[1] 0.3305

income <- ddply(gss, .(region, degree, wrkstat), summarise,
                  income = median(coninc, na.rm=TRUE),
                  stincome = sd(coninc, na.rm=TRUE),
                  n = length(age))
qplot(income, stincome, data=income) + geom_smooth()

plot of chunk unnamed-chunk-18

with(income, cor(income, stincome, use="complete"))

[1] 0.5623

Your Turn

For the gss data use a ddply statement to find

the percent of respondents with $25,000 or more for each year
the percent of respondents with $25,000 or more by gender and year
the percent of respondents with $25,000 or more by party affiliation and year