36-721 Homework 1

Due Thursday, September 5, 2013

Philipp Burckhardt

Introduction

This R Markdown file uses the slidify package by Ramnath Vaidyanathan, which is not on CRAN yet and has to be installed manually. I also make use of the interactive plotting capabilities provided by the rCharts package. Since these packages are still in rapid development and this feature was added only recently, one has to install the current development versions from GitHub to reproduce the results. Luckily, this is as easy as typing three lines of R code, using the devtools package by Hadley Wickham:

require(devtools)
# Recent devtools versions expect the repository in the form "user/repo"
install_github(c("ramnathv/slidify", "ramnathv/slidifyLibraries"), ref = "dev")
install_github("ramnathv/rCharts")

To rebuild this presentation in knitr, the R Markdown source can also be accessed under the following address: Click me to open it in browser. To get the same results, it is also important to replace the slidify.css document in the folder ~/assets/css/slidify.css with the following custom file: Download [Notice: the folder will be created after invoking knit("index.Rmd") for the first time]

Load Required Packages

require(knitr)
require(slidify)
require(rCharts)
require(xtable)
require(gridExtra)
require(ggplot2)
theme_set(new = theme_gray(base_size = 14))
opts_chunk$set(dev = "svg")
require(lattice)

Question 1: survey Data Set

For the first question, we analyze a survey of 237 students at the University of Adelaide. All data sets analyzed in this presentation come with the MASS package, which can be loaded as follows:

require(MASS)
data(survey)

The first four observations are displayed in the following table:

df <- head(survey, n = 4)
print(xtable(df), type = "html")
  Sex    Wr.Hnd NW.Hnd W.Hnd Fold   Pulse Clap    Exer Smoke Height M.I      Age
1 Female 18.50  18.00  Right R on L 92    Left    Some Never 173.00 Metric   18.25
2 Male   19.50  20.50  Left  R on L 104   Left    None Regul 177.80 Imperial 17.58
3 Male   18.00  13.30  Right L on R 87    Neither None Occas                 16.92
4 Male   18.80  18.90  Right R on L       Neither None Never 160.00 Metric   20.33

Smoking Behaviour of Students

par(mfrow=c(1,2),oma=c(0,0,2,0));
plot(survey$Smoke,ylab="count"); title("Without Reordering");
survey$Smoke <- factor(survey$Smoke,levels=c("Never","Occas","Regul","Heavy"))
plot(survey$Smoke,ylab="count"); title("With Reordering");
title("Barplots of Smoke Status",outer=TRUE)

[Figure: bar plots of smoke status, without and with reordering]

How Often Do the Students Exercise?

[Figure: spine charts of exercise level, without and with reordering]

Code for previous plot

exer.tab <- xtabs(~survey$Exer)
par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
barplot(rep(1, length(exer.tab)), exer.tab, space = 0, names.arg = names(exer.tab), 
    col = c("lightblue", "mistyrose", "lavender"), axes = FALSE)
title(main = "Without Reordering", xlab = "Exercise Level", ylab = "")

barplot(rep(1, length(exer.tab)), exer.tab[c(2, 3, 1)], space = 0,
    names.arg = names(exer.tab[c(2, 3, 1)]),
    col = c("mistyrose", "lavender", "lightblue"), axes = FALSE)
title(main = "With Reordering", xlab = "Exercise Level", ylab = "")

title("Spine Charts of Exercise Level", outer = TRUE)

Interpretation

Concerning the amount of time each student devotes to exercising, the spine charts reveal that only a minority of students do not exercise at all, while the remaining students fall into two almost equally sized groups who exercise either sometimes or frequently. However, with only three data points to display, the same information could also be conveyed by a simple table. Of the two plots, the ordered one is preferable, as the exercise level is an ordinal variable and should be treated as such. Ignoring the inherent ordering is likely to cause misreadings that viewers will not notice unless they look carefully at the labels.

Similarly, the bar plots on Page 4 show that the overwhelming majority of students do not smoke at all, with fewer than a quarter of students smoking either occasionally, regularly or heavily. Among these three sub-categories, the number of cases decreases with increasing smoking level. It is again advisable to use the ordered chart, as the initial ordering is arbitrary, while there is an inherent ordering of the categories which should be reflected.
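As noted above, a simple table conveys the same information as the two charts. The following is a minimal sketch of such tables, assuming only that MASS is installed; the variable names exer, smoke, exer.counts and smoke.counts are chosen here for illustration and are not part of the original analysis.

```r
library(MASS)  # provides the survey data set

# Treat both variables as ordered factors so that tables (and plots)
# follow the natural ordering of the categories
exer <- factor(survey$Exer, levels = c("None", "Some", "Freq"), ordered = TRUE)
smoke <- factor(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"),
                ordered = TRUE)

exer.counts <- table(exer)    # counts per exercise level (NAs are dropped)
smoke.counts <- table(smoke)  # counts per smoking level (NAs are dropped)

print(exer.counts)
print(round(prop.table(smoke.counts), 2))  # share of each smoking level
```

Because the factor levels are set explicitly, any later call to plot() or barplot() on these objects would use the same ordering.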

Interactive Highcharts.js Plot

As can be seen, there does not seem to be any significant relationship between the two categorical variables, exercise level and smoking status.

Code for previous plot

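# Counts by exercise level (rows) and smoking status (columns); the vector
# lists one exercise level after another, hence byrow = TRUE
df <- as.data.frame(matrix(c(87, 12, 9, 7, 18, 3, 1, 1, 84, 4, 7, 3), ncol = 4, 
    nrow = 3, byrow = TRUE))
colnames(df) <- c("No Smoking", "Occasional Smoking", "Regular Smoking", "Heavy Smoking")
rownames(df) <- c("Freq", "None", "Some")
# Reorder the rows so that they match the x-axis categories set below
df <- df[c("None", "Some", "Freq"), ]

a <- rCharts:::Highcharts$new()
a$chart(type = "column")
a$title(text = "Exercise / Smoking")
a$plotOptions(column = list(stacking = "normal"))
a$xAxis(categories = c("No Exercising", "Some Exercising", "Frequent Exercising"))
a$data(df)
a$print("hchart1", include_assets = TRUE, cdn = TRUE)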
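The counts behind the chart, and the impression that exercise and smoking are unrelated, can be checked directly from the data. The sketch below is not part of the original slides; it assumes only the MASS package, and the names exer.smoke and res are illustrative.

```r
library(MASS)  # provides the survey data set

smoke <- factor(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))
exer <- factor(survey$Exer, levels = c("None", "Some", "Freq"))

# Contingency table of exercise level by smoking status
exer.smoke <- table(Exercise = exer, Smoking = smoke)
print(exer.smoke)

# Chi-squared test of independence. Several expected counts are small, so
# chisq.test() warns that the approximation may be poor; the warning is
# suppressed here, and fisher.test() would be an exact alternative.
res <- suppressWarnings(chisq.test(exer.smoke))
print(res$p.value)
```

A large p-value is consistent with the visual impression that the two variables are not associated.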

Question 2: birthwt Data Set

In this part of the presentation, we study the birthwt data set, which also comes with MASS. It comprises 189 observations of 10 variables, collected to identify risk factors associated with low infant birth weight (see the help page of the data set by typing ?birthwt in the console).

We limit our attention to the following three variables:

  • race: race of the mother, one of white, black or other
  • smoke: smoking status during pregnancy
  • ptl: number of previous premature labours

Empirical Distributions

Displayed are bar plots for the univariate distributions of the three variables considered.

[Figure: bar plots of race, smoke and ptl]
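The code for these bar plots is not shown on the slide. A minimal sketch of how such plots could be produced with base graphics follows; the factor labels are taken from the help page of birthwt, and the code is an assumption about the approach, not the author's original chunk.

```r
library(MASS)  # provides the birthwt data set

# The variables are stored as integers; convert them to labelled factors
race <- factor(birthwt$race, levels = 1:3, labels = c("white", "black", "other"))
smoke <- factor(birthwt$smoke, levels = 0:1, labels = c("no", "yes"))
ptl <- factor(birthwt$ptl)

# One bar plot per variable, side by side
par(mfrow = c(1, 3))
barplot(table(race), main = "Race", ylab = "count")
barplot(table(smoke), main = "Smoking", ylab = "count")
barplot(table(ptl), main = "# Premature Labours", ylab = "count")
```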

Mosaic Plots to Display Contingency Tables

par(mfrow = c(1, 3))
xlab <- "# Premature Labours"
cortable1 <- table(birthwt[, c("smoke", "race")])
mosaicplot(cortable1, cex.axis = 1.5, main = "Smoke / Race",
           xlab = "Smoking", ylab = "Mother's Race", color = TRUE)
cortable2 <- table(birthwt[, c("ptl", "race")])
mosaicplot(cortable2, cex.axis = 1.5, main = "Premature Labours / Race",
           xlab = xlab, ylab = "Mother's Race", color = TRUE)
cortable3 <- table(birthwt[, c("ptl", "smoke")])
mosaicplot(cortable3, cex.axis = 1.5, main = "Premature Labours / Smoke",
           xlab = xlab, ylab = "Smoking", color = TRUE)

[Figure: mosaic plots of Smoke / Race, Premature Labours / Race, and Premature Labours / Smoke]

Results: Smoke and Race

[Figure: mosaic plot of Smoke / Race]

From the first plot, one can deduce that the proportion of mothers who smoked during pregnancy is much larger among white mothers than in the two other groups. For Black mothers, the situation is not as clear-cut, with roughly 40% of the group smoking during pregnancy. Mothers from the category comprising all other races, however, smoked much less during pregnancy, with more than three quarters refraining from doing so. A chi-squared test rejects the null hypothesis of independence (the p-value rounds to 0, i.e. p < 0.001), suggesting that this association is unlikely to be the result of pure chance.
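The slide reports only the rounded p-value. As a hedged sketch of how the test could be reproduced (the object names smoke.race and race.test are illustrative, and the factor labels follow the help page of birthwt):

```r
library(MASS)  # provides the birthwt data set

race <- factor(birthwt$race, levels = 1:3, labels = c("white", "black", "other"))
smoke <- factor(birthwt$smoke, levels = 0:1, labels = c("no", "yes"))

# Contingency table and the share of smokers within each race group
smoke.race <- table(race, smoke)
print(round(prop.table(smoke.race, margin = 1), 2))

# Chi-squared test of independence between race and smoking status
race.test <- chisq.test(smoke.race)
print(race.test$p.value)
```

Printing the unrounded p-value avoids the misleading impression that it is exactly zero.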

Results: Smoke and # Premature Labours

[Figure: mosaic plot of Premature Labours / Smoke]

While there does not appear to be any association between race and the number of premature labours, a look at the mosaic plot on the left suggests that there is an association between smoking behaviour and premature labours. Mothers who smoked during pregnancy had more premature labours in the past, an observation in line with current research suggesting that smoking during pregnancy increases the risk of a premature birth.
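To put numbers on the pattern visible in the mosaic plot, one could compare the share of mothers with at least one previous premature labour between smokers and non-smokers. The following sketch is an illustration only; collapsing ptl into "none" versus "one or more" is a simplification introduced here, not something done in the original analysis.

```r
library(MASS)  # provides the birthwt data set

smoke <- factor(birthwt$smoke, levels = 0:1, labels = c("no", "yes"))

# Collapse the number of previous premature labours into two groups
had.ptl <- factor(birthwt$ptl > 0, levels = c(FALSE, TRUE),
                  labels = c("none", "one or more"))

ptl.smoke <- table(smoke, had.ptl)
ptl.shares <- prop.table(ptl.smoke, margin = 1)  # shares within each smoking group
print(ptl.smoke)
print(round(ptl.shares, 2))
```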

Question 3: minn38 Data Set

Finally, we look at the minn38 data set, which contains characteristics of Minnesota high school graduates of 1938 (type ?minn38 in R).

data(minn38); minn38.tab <- xtabs(formula=f~phs+fol+sex,minn38)
minn38.tab.M <-  xtabs(formula=f~phs+fol,minn38[minn38$sex=="M",])
minn38.tab.F <-  xtabs(formula=f~phs+fol,minn38[minn38$sex=="F",])

Code for the plot on the next page:

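X <- as.matrix(minn38.tab.F) # female bar chart
# Take the row labels from the table itself: xtabs() orders rows by factor level
mat <- as.data.frame(matrix(X, nrow = 4, ncol = 7), row.names = rownames(X))
colnames(mat) <- paste("F", 1:7, sep = "")
phs.labels <- c(C = "College", N = "Non-Collegiate School",
                E = "Full-Time Employment", O = "Other")
n1 <- rCharts:::Highcharts$new(); n1$chart(type = "column")
# Highcharts expects axis titles in the form list(text = ...)
n1$xAxis(title = list(text = "Post High School Status"),
         categories = unname(phs.labels[rownames(X)]))
n1$yAxis(title = list(text = "count"))
n1$data(mat)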

Female High School Graduates

Interactive Bar Chart:

Male High School Graduates

Interactive Bar Chart:

Results

One interesting and perhaps surprising result concerning the relationship between gender and post high school status is that women were more than twice as likely to be employed full-time as their male counterparts. Comparing the two plots from the previous pages, one can see that the bars indeed differ greatly in height: roughly three times as many females as males were employed full-time (keep in mind that the two plots use different scales on the y-axis).

Second, college enrollment is highest for pupils from a household in which the father's occupational level is "F1". Compared to all other categories of occupational level, this is the only group in which children who go on to enroll in college make up the largest group. To see this, it is advisable to deselect all other groups in the interactive plots and then make one-to-one comparisons between the individual groups.
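Both observations can also be checked numerically rather than by reading the interactive charts. The sketch below is an illustration under the assumption that only MASS is available; the names phs.sex, counts and modal.phs are introduced here and do not appear in the original code.

```r
library(MASS)  # provides the minn38 data set

# Total counts by post high school status (rows) and sex (columns)
phs.sex <- xtabs(f ~ phs + sex, data = minn38)
print(phs.sex)

# For every combination of father's occupational level and sex,
# find the post high school status with the largest count
counts <- xtabs(f ~ phs + fol + sex, data = minn38)
modal.phs <- apply(counts, c(2, 3),
                   function(x) dimnames(counts)$phs[which.max(x)])
print(modal.phs)
```

The first table allows a direct comparison of the full-time employment counts between the sexes, and the second shows for which occupational levels college (code "C") is the most frequent outcome.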

Non-Interactive Percentage Bar Plot Using Lattice

[Figure: stacked bar chart of post high school status by father's occupational level, one panel per sex]

R Code of Previous Plot

# Proportions within each combination of father's occupational level and sex
minn38_prop_tab <- prop.table(minn38.tab,margin=c(2,3))
dimnames(minn38_prop_tab)$sex <- c("Female","Male")
dimnames(minn38_prop_tab)$phs <- c("College","Full-Time Employment","Non-Collegiate School","Other")
# The x-axis shows fol, the legend shows the post high school status
barchart(Freq~fol|sex,groups=phs,data=as.data.frame(minn38_prop_tab),stack=TRUE,
         xlab="Father's Occupational Level",ylab="Proportion",
         auto.key=list(space="right", columns=1, title="Post High School Status", cex.title=1))

Results 2

To make the distribution within the individual groups visible, the previous bar chart displays only the proportions inside the respective groups and not the total counts. This allows better insight into the composition of the groups. One last observation is that there seems to be a trend: with increasing occupational level, the proportion of graduates enrolled in college diminishes, which is offset by an increase in graduates of the category "other". The proportion of graduates belonging to the remaining categories (full-time employment and non-collegiate school) does not vary as much. One notable exception is category "F3", in which fewer graduates are college students than this trend would suggest.
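The proportions discussed above can be read off directly instead of being estimated from the chart. The following is a short sketch, assuming only MASS; it recomputes the proportions from the raw data so that it runs on its own, and the name shares is illustrative.

```r
library(MASS)  # provides the minn38 data set

# Proportions of each post high school status within every
# combination of father's occupational level (fol) and sex
counts <- xtabs(f ~ phs + fol + sex, data = minn38)
shares <- prop.table(counts, margin = c(2, 3))

# Percentage enrolled in college ("C") and in the category other ("O"),
# by father's occupational level (rows) and sex (columns)
print(round(100 * shares["C", , ], 1))
print(round(100 * shares["O", , ], 1))
```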