Heike Hofmann
Stat 579, Fall 2013
Homework #4 now online
Please submit Homework #3 online to the Blackboard system.
Grading Scheme C/S/O - each area gives 3.5 points, you could potentially over-score
Best answers for homework #1 and #2 are now online
overall: fantastic job!
biggest reasons for missing points:
overall: again, great work!
biggest reasons for missing points:
reshapeplyr
summarise, transformddplygss <- read.csv("http://www.hofroe.net/stat579/economical-status-cleaned-full.csv")
library(ggplot2)
cast is a function of the reshape package.cast consists of row variables, column variables, a summary function (sum, mean, max, etc), and a value variable
# average income
cast(gss, .~., mean, value="coninc", na.rm=T)
value (all)
1 (all) 44503
# average income by degree for years 2012 and 1972
cast(subset(gss, year %in% c(1972, 2012)), year~degree, mean, value="coninc", na.rm=T)
year BACHELOR GRADUATE HIGH SCHOOL JUNIOR COLLEGE
1 1972 50458 76769 42550 47839
2 2012 75174 89988 37453 46362
LT HIGH SCHOOL NA
1 27613 39276
2 20933 NaN
# income by gender
cast(gss, sex ~., mean, value="coninc", na.rm=T)
sex (all)
1 FEMALE 41020
2 MALE 48764
# income by region
cast(gss, . ~ region, mean, value="coninc", na.rm=T)
value E. NOR. CENTRAL E. SOU. CENTRAL MIDDLE ATLANTIC
1 (all) 44616 35403 48530
MOUNTAIN NEW ENGLAND PACIFIC SOUTH ATLANTIC
1 42793 53586 49198 42937
W. NOR. CENTRAL W. SOU. CENTRAL
1 41507 39867
summary of income by degree, work status and region:
library(reshape)
dframe <- cast(gss, degree + wrkstat+region~., value="coninc", fun=mean, na.rm=T)
require(ggplot2)
dframe$degree <- factor(dframe$degree, levels=c("LT HIGH SCHOOL", "HIGH SCHOOL", "JUNIOR COLLEGE", "BACHELOR", "GRADUATE"))
qplot(degree, `(all)`, data=dframe, facets=~wrkstat, colour=region) + coord_flip()
median income over time by region and degree:
dframe <- cast(gss, degree + year+region~., value="coninc", fun=median, na.rm=T)
dframe$degree <- factor(dframe$degree, levels=c("LT HIGH SCHOOL", "HIGH SCHOOL", "JUNIOR COLLEGE", "BACHELOR", "GRADUATE"))
qplot(year, `(all)`, data=dframe, facets=~degree, colour=region) + geom_smooth(aes(group=1))
Use reshape to find answers to the following questions. Plotting results might help with the interpretation.
use cast to find coninc) by year, region, and degree
how many values are in each of the cells? (Hint: answer that with a similar statement … think about the aggregation function)
how many of the values in each cell are not missing values?
what is the standard deviation of income in each of the cells?
library(reshape)
dframe <- cast(gss, year+degree+region~., value="coninc", fun=mean, na.rm=T)
library(ggplot2)
qplot(year, `(all)`, data=dframe, colour=region, facets=~degree, size=I(4)) + geom_smooth(aes(group=1))
Income inreases steadily over the years for bachelor degree and graduate levels. For all other degrees the curves look pretty flat.
dframe2 <- cast(gss, year+degree+region~., value="coninc", fun=length)
qplot(year, `(all)`, data=dframe2, colour=region, facets=~degree, size=I(4))
The largest number of records are for people with high school degrees - that seems strange. We also see that the number of records increases in the mid nineties, when the survey is done less often as before.
number of non-missing values? - We don't know a function that gives us the number of valid values in one go (but a combination of sum(na.omit(x)) would do it).
We can't solve that quite yet within cast.
dframe3 <- cast(gss, year+degree+region~., value="coninc", fun=sd, na.rm=T)
qplot(year, `(all)`, data=dframe3, colour=region, facets=~degree, size=I(4))
standard deviation of income is quite large - it's about the same as mean income.
… we would like to compare mean income and standard deviation, but they are in different data frames, and the order of the cells might not be quite the same (at least we cannot rely on it being the same)
We want to do different aggregations in the same data set next.
library(plyr)
summarise, transformtransform extends the old data set,
summarise makes a new one
x <- rep(c("a", "b", "c"), each=2)
y <- c(2,4,0,5,5,10)
dframe <- data.frame(x,y)
ddply(dframe, "x", summarise, y=mean(y))
x y
1 a 3.0
2 b 2.5
3 c 7.5
summarise(gss,
avgincome=mean(coninc, na.rm=T),
avgage=mean(age, na.rm=T))
avgincome avgage
1 44503 45.7
ddply(gss, .(year), summarise,
avgincome=mean(coninc, na.rm=T),
avgage=mean(age, na.rm=T))
year avgincome avgage
1 1972 38418 44.95
2 1973 41493 44.18
3 1974 42054 44.59
4 1975 38698 44.31
5 1976 37282 45.29
6 1977 44303 44.66
7 1978 41059 44.01
8 1980 43950 44.97
9 1982 35771 44.86
10 1983 42244 44.30
11 1984 39713 44.00
12 1985 41554 45.71
13 1986 40488 45.43
14 1987 40366 44.92
15 1988 41203 45.37
16 1989 43787 45.44
17 1990 45090 45.96
18 1991 39912 45.63
19 1993 45076 46.05
20 1994 44554 45.97
21 1996 46273 44.78
22 1998 45761 45.56
23 2000 47897 46.02
24 2002 50074 46.28
25 2004 53019 45.96
26 2006 49842 47.14
27 2008 50738 47.71
28 2010 45697 47.97
29 2012 48385 48.19
dframe <- ddply(gss, .(year), summarise,
avgincome=mean(coninc, na.rm=T),
avgage=mean(age, na.rm=T))
qplot(year, avgincome, data=dframe, size=I(3)) + geom_smooth()
ddply part of the plyr packageddply(.data, .variables, .fun = NULL, ...)
.data is data frame.variables (set of) categorical variable(s), whose combinations determine subsets of interest, for which aggregations should be done
*.fun aggregation function
*....funsummarise and transform are the ones we'll use most often for aggregations# summarise over multiple variables
happy <- ddply(gss, .(sex, year), summarise,
happy=sum(happy %in% c("VERY HAPPY", "PRETTY HAPPY"))/length(happy)
)
# what’s the difference to the above code?
happy <- ddply(gss, .(sex, year), summarise,
happy=sum(happy %in% c("VERY HAPPY", "PRETTY HAPPY"))/sum(!is.na(happy))
)
# include size of subset:
happy <- ddply(gss, .(sex, year), summarise,
happy=sum(happy %in% c("VERY HAPPY", "PRETTY HAPPY"))/sum(!is.na(happy)),
n = length(happy)
)
qplot(year, happy, colour=sex, data=happy, size=I(3))
What is the ratio of happy and very happy people by age? Is there a difference between the genders? How reliable are the results (make sure to keep track of the number of respondents in each cell)
Are median income and happiness related? Calculate percentage of pretty ahppy and very happy respondents and relate to median income. Draw a scatterplot. What is the correlation?
Find mean income and standard deviation by degree, region and work status (wrkstat). Plot mean versus standard deviation. What is your conclusion for the distribution of income?
agehappy <- ddply(gss, .(age), summarise,
happy= sum(happy %in% c("PRETTY HAPPY", "VERY HAPPY"))/sum(!is.na(happy)),
n = length(age))
qplot(age, happy, data=agehappy) + geom_smooth()
summary(agehappy)
age happy n
Min. :18.0 Min. :0.793 Min. : 99
1st Qu.:35.8 1st Qu.:0.863 1st Qu.: 515
Median :53.5 Median :0.872 Median : 810
Mean :53.5 Mean :0.870 Mean : 782
3rd Qu.:71.2 3rd Qu.:0.881 3rd Qu.:1092
Max. :89.0 Max. :0.911 Max. :1314
NA's :1
agesexhappy <- ddply(gss, .(age, sex), summarise,
happy= sum(happy %in% c("PRETTY HAPPY", "VERY HAPPY"))/sum(!is.na(happy)),
n = length(age))
qplot(age, happy, data=agesexhappy, colour=sex) + geom_smooth()
summary(agesexhappy)
age sex happy n
Min. :18.0 FEMALE:73 Min. :0.770 Min. : 34
1st Qu.:35.8 MALE :73 1st Qu.:0.856 1st Qu.:256
Median :53.5 Median :0.874 Median :412
Mean :53.5 Mean :0.871 Mean :391
3rd Qu.:71.2 3rd Qu.:0.886 3rd Qu.:541
Max. :89.0 Max. :0.920 Max. :737
NA's :2
agehappy <- ddply(gss, .(age), summarise,
happy= sum(happy %in% c("PRETTY HAPPY", "VERY HAPPY"))/sum(!is.na(happy)),
income = median(coninc, na.rm=TRUE),
n = length(age))
qplot(income, happy, data=agehappy) + geom_smooth()
with(agehappy, cor(income, happy))
[1] 0.3305
income <- ddply(gss, .(region, degree, wrkstat), summarise,
income = median(coninc, na.rm=TRUE),
stincome = sd(coninc, na.rm=TRUE),
n = length(age))
qplot(income, stincome, data=income) + geom_smooth()
with(income, cor(income, stincome, use="complete"))
[1] 0.5623
For the gss data use a ddply statement to find
the percent of respondents with $25,000 or more for each year
the percent of respondents with $25,000 or more by gender and year
the percent of respondents with $25,000 or more by party affiliation and year