Homework #1

Jerry Zhang
36-721 Statistical Graphics and Visualization
Dr. A. Thomas
Due September 5, 2013

setwd("//andrew/users/users22/jerryzha/Desktop/PS1")

## Error: cannot change working directory

Problem 1

library(MASS)
p1_data = data.frame(smoke = survey$Smoke, exercise = survey$Exer)
attach(p1_data)

par(mfrow = c(1, 2))
count = table(smoke)
barplot(count, main = "Smoking at University of Adelaide", ylab = "Counts")
count2 = count[c("Never", "Occas", "Regul", "Heavy")]
barplot(count2, main = "Smoking at University of Adelaide in Natural Order", 
    ylab = "Counts")

[Figure: bar plots of smoking counts, in default order (left) and natural order (right)]

par(mfrow = c(1, 2))
count3 = table(exercise)
count4 = count3[c("None", "Some", "Freq")]
barplot(rep(1, length(count3)), count3, space = 0, names.arg = names(count3), 
    main = "Exercise at University of Adelaide", ylab = "Proportions", col = c("grey25", 
        "grey90", "grey30"))
barplot(rep(1, length(count4)), count4, space = 0, names.arg = names(count4), 
    main = "Exercise at University of Adelaide in Natural Order", ylab = "Proportions", 
    col = c("grey90", "grey30", "grey25"))

[Figure: spine charts of exercise frequency, in default order (left) and natural order (right)]

Bar plots and spine charts present the same data in different ways. In a bar plot, the height of each bar is proportional to the count. In a spine chart, all spines share the same height, so the width (and therefore the area) of each spine is proportional to the count. By default, R plots factor levels in alphabetical order. Manual ordering lets us sort the levels into a logical order (Never/Occasional/Regular/Heavy for smoking and None/Some/Frequent for exercise), which can reveal more information. For example, in the natural-order bar plot we can clearly see that as the frequency of smoking increases from “never” to “heavy”, the count in each category decreases. This pattern is harder to see in the default plot.
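
As a minimal sketch of the manual ordering described above, the levels can also be set once on the factor itself, so that every later table and plot follows them. The name smoke_ordered is introduced here for illustration only, and the level order is an assumption about what counts as "natural".

```r
library(MASS)  # provides the survey data set

# Assumption: the natural order of the Smoke levels is Never < Occas < Regul < Heavy.
smoke_ordered <- factor(survey$Smoke, levels = c("Never", "Occas", "Regul", "Heavy"))

# table() follows the level order, so no manual reindexing of the counts is needed.
smoke_counts <- table(smoke_ordered)
barplot(smoke_counts, main = "Smoking (natural order)", ylab = "Counts")
```

Setting the levels on the factor avoids repeating the reordering step for each plot.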

I prefer the spine chart for displaying this kind of data because it gives a good sense of relative percentages. It is much harder to estimate the percentage of non-smokers from the bar plot than to estimate the percentage of frequent exercisers from the spine chart. If I were interested in the exact counts, I could easily check the data set. Although the bar plots show counts on the vertical axis, it is still difficult to read off exact values.
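
The relative percentages that the spine chart conveys visually can also be computed directly. The following is a small sketch using prop.table; the names exercise_counts and exercise_props are introduced here for illustration.

```r
library(MASS)  # provides the survey data set

# Counts of each exercise level, reindexed into the natural order.
exercise_counts <- table(survey$Exer)[c("None", "Some", "Freq")]

# Convert counts to proportions of all non-missing responses.
exercise_props <- prop.table(exercise_counts)
round(100 * exercise_props, 1)  # percentages, one decimal place
```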

detach(p1_data)

Problem 2

attach(birthwt)
par(mfrow = c(1, 3))
barplot(table(race), main = "Race", ylab = "Count", names.arg = c("White", "Black", 
    "Other"))
barplot(table(smoke), main = "Smoke", ylab = "Count", names.arg = c("No", "Yes"))
barplot(table(ptl), main = "Number of Previous Premature Labors", ylab = "Count", 
    space = 0)


Distribution of Race, Smoke, and Number of Ptl (Previous Premature Labors).

cor_racesmoke = table(birthwt[c("race", "smoke")])
cor_raceptl = table(birthwt[c("race", "ptl")])
cor_smokeptl = table(birthwt[c("smoke", "ptl")])
par(mfrow = c(1, 2))
mosaicplot(cor_racesmoke, shade = TRUE, main = "Smoke versus Race", xlab = "Race (1 = white, 2 = black, 3 = other)", 
    ylab = "Smoke (0 = no, 1 = yes)")
mosaicplot(t(cor_racesmoke), shade = TRUE, main = "Race versus Smoke", ylab = "Race (1 = white, 2 = black, 3 = other)", 
    xlab = "Smoke (0 = no, 1 = yes)")

[Figure: mosaic plots of smoking status and race, in both orientations]

Smoking is associated with race. Within the data set, there are more white smokers and more non-smokers of other races than we would expect if the two variables were independent. There are also fewer smokers of other races than expected. The proportions of smokers and non-smokers within the black group are about what we expect.
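
The shading in these mosaic plots reflects residuals from an independence model. As a rough numerical companion, the expected counts and Pearson residuals can be pulled from chisq.test. This is only a sketch; the names race_smoke and fit are introduced here for illustration.

```r
library(MASS)  # provides the birthwt data set

# Two-way table of race (rows) by smoking status (columns).
race_smoke <- table(birthwt[c("race", "smoke")])

# Chi-squared test of independence; we only use its by-products here.
fit <- chisq.test(race_smoke)

round(fit$expected, 1)   # counts expected under independence
round(fit$residuals, 2)  # Pearson residuals: (observed - expected) / sqrt(expected)
```

Cells with large positive residuals hold more observations than independence predicts, and cells with large negative residuals hold fewer.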

par(mfrow = c(1, 2))
mosaicplot(cor_raceptl, shade = TRUE, main = "Previous Premature Labors versus Race", 
    xlab = "Race (1 = white, 2 = black, 3 = other)", ylab = "Number of Previous Premature Labors (Count)")
mosaicplot(t(cor_raceptl), shade = TRUE, main = "Race versus Previous Premature Labors", 
    ylab = "Race (1 = white, 2 = black, 3 = other)", xlab = "Number of Previous Premature Labors (Count)")

[Figure: mosaic plots of number of previous premature labors and race, in both orientations]

There does not seem to be any strong association between race and number of previous premature labors.

par(mfrow = c(1, 2))
mosaicplot(cor_smokeptl, shade = TRUE, main = "Previous Premature Labors versus Smoke", 
    xlab = "Smoke (0 = no, 1 = yes)", ylab = "Number of Previous Premature Labors (Count)")
mosaicplot(t(cor_smokeptl), shade = TRUE, main = "Smoke versus Previous Premature Labors", 
    ylab = "Smoke (0 = no, 1 = yes)", xlab = "Number of Previous Premature Labors (Count)")

[Figure: mosaic plots of number of previous premature labors and smoking status, in both orientations]

There does not seem to be any strong association between smoking status and number of previous premature labors.
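
Tiles for rare categories can be thin and hard to compare by eye. One way to read the same information numerically is to look at row proportions. This is a sketch only; the names smoke_ptl and smoke_ptl_props are introduced here for illustration.

```r
library(MASS)  # provides the birthwt data set

# Two-way table of smoking status (rows) by number of previous premature labors (columns).
smoke_ptl <- table(birthwt[c("smoke", "ptl")])

# Share of each ptl value within non-smokers (row "0") and smokers (row "1").
smoke_ptl_props <- prop.table(smoke_ptl, margin = 1)
round(smoke_ptl_props, 3)
```

Comparing the two rows side by side shows how the distribution of previous premature labors differs between the groups.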

Problem 3

We expand the frequency table into one row per observation, using the plyr package.

library(plyr)

## Warning: package 'plyr' was built under R version 2.15.3


dataminn38 = NULL
rep.row <- function(r, n) {
    colwise(function(x) rep(x, n))(r)
}

for (i in 1:dim(minn38)[1]) {
    dataminn38 = rbind(dataminn38, rep.row(minn38[i, 2:4], minn38[i, 5]))
}
names(dataminn38) = names(minn38[2:4])
attach(dataminn38)
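
The loop above grows the data frame one block at a time. A possible base R alternative, sketched here without plyr, repeats each row index by its frequency. The names row_index and expanded are introduced for illustration and are not part of the original solution.

```r
library(MASS)  # provides the minn38 data set

# Repeat each row number as many times as its frequency column f indicates.
row_index <- rep(seq_len(nrow(minn38)), times = minn38$f)

# Keep only the three variables of interest, one row per observation.
expanded <- minn38[row_index, c("phs", "fol", "sex")]
rownames(expanded) <- NULL  # drop the repeated row names

# Sanity check: the expanded data has as many rows as the total frequency.
nrow(expanded) == sum(minn38$f)
```

This avoids calling rbind inside a loop, which copies the accumulated data frame on every iteration.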

Quick bar plots for the 3 variables

par(mfrow = c(1, 3))
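# Index the counts by level name so the bars match the labels regardless of the factor's level order.
barplot(table(phs)[c("C", "N", "E", "O")], ylab = "Count", names.arg = c("Enrolled in College", 
    "Enrolled in Non-collegiate School", "Employed Full-time", "Other"), main = "Post High School Status")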
barplot(table(fol), ylab = "Count", main = "Father's Occupational Level")
barplot(table(sex), ylab = "Count", names.arg = c("Female", "Male"), main = "Sex")

[Figure: bar plots of post high school status, father's occupational level, and sex]

The bar plots look as expected, with nothing unusual in them. The count of females finishing high school is higher than that of males.

Construct two-way tables for each pair of variables for plotting.

cor_phsfol = table(dataminn38[c("phs", "fol")])
cor_phssex = table(dataminn38[c("phs", "sex")])
cor_folsex = table(dataminn38[c("fol", "sex")])
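
The same two-way tables can likely be built straight from the frequency column, without expanding the data first. The following sketch uses xtabs with f as the count variable; the names phs_fol, phs_sex, and fol_sex are introduced here for illustration.

```r
library(MASS)  # provides the minn38 data set

# The left-hand side of the formula (f) is summed within each combination of the factors.
phs_fol <- xtabs(f ~ phs + fol, data = minn38)
phs_sex <- xtabs(f ~ phs + sex, data = minn38)
fol_sex <- xtabs(f ~ fol + sex, data = minn38)

# Each table should account for every observation in the data set.
sum(phs_fol) == sum(minn38$f)
```

The resulting objects can be passed to mosaicplot in the same way as tables built with table().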

par(mfrow = c(1, 2))
mosaicplot(cor_phsfol, xlab = "Post High School Status (C = College, N = Non-collegiate School, E = Employed Full-time, O = Other)", 
    ylab = "Father's Occupational Level", shade = TRUE, main = "Father's Occupational Level versus Post High School Status")
mosaicplot(t(cor_phsfol), ylab = "Post High School Status (C = College, N = Non-collegiate School, E = Employed Full-time, O = Other)", 
    xlab = "Father's Occupational Level", shade = TRUE, main = "Post High School Status versus Father's Occupational Level")

[Figure: mosaic plots of father's occupational level and post high school status, in both orientations]

Students whose fathers have lower occupational levels attended college at a higher than expected rate. Those whose fathers have higher occupational levels gravitated toward the “Other” category.

par(mfrow = c(1, 2))
mosaicplot(cor_phssex, xlab = "Post High School Status (C = College, N = Non-collegiate School, E = Employed Full-time, O = Other)", 
    ylab = "Sex", shade = TRUE, main = "Sex versus Post High School Status")
mosaicplot(t(cor_phssex), ylab = "Post High School Status (C = College, N = Non-collegiate School, E = Employed Full-time, O = Other)", 
    xlab = "Sex", shade = TRUE, main = "Post High School Status versus Sex")

[Figure: mosaic plots of sex and post high school status, in both orientations]

Females attended non-collegiate schools and went to work full-time at a much higher rate than males. Males attended college and pursued other plans at higher than expected rates.

par(mfrow = c(1, 2))
mosaicplot(cor_folsex, xlab = "Father's Occupational Level", ylab = "Sex", shade = TRUE, 
    main = "Sex versus Father's Occupational Level")
mosaicplot(t(cor_folsex), ylab = "Father's Occupational Level", xlab = "Sex", 
    shade = TRUE, main = "Father's Occupational Level versus Sex")

[Figure: mosaic plots of sex and father's occupational level, in both orientations]

Within families with lower occupational level fathers, a higher than expected proportion of males and a lower than expected proportion of females finish high school. Among families with F3 occupational level fathers, the trend is reversed; females finish high school at a much higher rate.