Homework 2

Before starting, load the packages "plyr", "dplyr", "ggplot2", "WWGbook", "MASS", and "mlmRev".

Use the minn38{MASS} data set for the problems. Load the package into the current session using library(MASS) and use help(minn38) to view the data description. How many female high school graduates were there in 1938? How many female high school graduates enrolled in college in 1938?

dta<- minn38
head(dta)

    hs phs fol sex  f
  1  L   C  F1   M 87
  2  L   C  F2   M 72
  3  L   C  F3   M 52
  4  L   C  F4   M 88
  5  L   C  F5   M 32
  6  L   C  F6   M 14

str(dta)

  'data.frame': 168 obs. of  5 variables:
   $ hs : Factor w/ 3 levels "L","M","U": 1 1 1 1 1 1 1 1 1 1 ...
   $ phs: Factor w/ 4 levels "C","E","N","O": 1 1 1 1 1 1 1 3 3 3 ...
   $ fol: Factor w/ 7 levels "F1","F2","F3",..: 1 2 3 4 5 6 7 1 2 3 ...
   $ sex: Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
   $ f  : int  87 72 52 88 32 14 20 3 6 17 ...

help(minn38)

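# Rows (frequency-table cells) for females; the graduate counts themselves are in column f
dta1 <- filter(dta, sex == "F")
length(dta1$sex)

  [1] 84

# Rows (frequency-table cells) for females with post-high-school status "C" (enrolled in college)
dta2 <- filter(dta, sex == "F" & phs == "C")
length(dta2$sex)

  [1] 21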
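Because minn38 is a frequency table, the row counts above are numbers of table cells (combinations of hs, phs, fol, and sex), not numbers of students. A minimal sketch of how the two questions could be answered by summing the frequency column f, assuming the coding described in help(minn38) (f is the cell frequency and phs == "C" means enrolled in college). The object names female and female_college are illustrative only, and base R's subset() stands in for dplyr's filter():

```r
library(MASS)  # provides the minn38 data set

# Each row of minn38 is one cell of a frequency table; f is the cell count.
female <- subset(minn38, sex == "F")
female_college <- subset(minn38, sex == "F" & phs == "C")

nrow(female)           # number of table cells for females, not graduates
sum(female$f)          # female graduates, summed over all cells
sum(female_college$f)  # female graduates who enrolled in college
```

The same totals could also be obtained in one step with xtabs(f ~ sex + phs, data = minn38), which cross-tabulates the frequencies by sex and post-high-school status.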

Produce the following histogram of the correlation coefficients between the written and course variables, computed separately for each school in the data set Gcsemv{mlmRev}. The two vertical lines indicate the correlation averaged over schools and the correlation computed over all individuals, ignoring school labels. Can you tell which is which?

dta<-Gcsemv
head(dta)

    school student gender written course
  1  20920      16      M      23     NA
  2  20920      25      F      NA   71.2
  3  20920      27      F      39   76.8
  4  20920      31      F      36   87.9
  5  20920      42      M      16   44.4
  6  20920      62      F      36     NA

str(dta)

  'data.frame': 1905 obs. of  5 variables:
   $ school : Factor w/ 73 levels "20920","22520",..: 1 1 1 1 1 1 1 1 1 2 ...
   $ student: Factor w/ 649 levels "1","2","3","4",..: 16 25 27 31 42 62 101 113 146 1 ...
   $ gender : Factor w/ 2 levels "F","M": 2 1 1 1 2 1 1 2 2 1 ...
   $ written: num  23 NA 39 36 16 36 49 25 NA 48 ...
   $ course : num  NA 71.2 76.8 87.9 44.4 NA 89.8 17.5 32.4 84.2 ...

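# Keep only students with both a written and a course score
dta <- na.omit(dta)

# Correlation over all students, ignoring school
corall <- cor(dta$written, dta$course)
# Correlation within each school
schcor <- ddply(dta, .(school), summarize, corr = cor(course, written))
# Drop row 72 by position; filtering with !is.na(schcor$corr) would be a more
# robust way to exclude schools whose correlation is undefined
schcor <- schcor[-72, ]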

ggplot(schcor, aes(x = corr)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue") +
  # Dotted red line: mean of the per-school correlations
  geom_vline(xintercept = mean(schcor$corr), linetype = "dotted", color = "red") +
  # annotate() draws each label once instead of once per row of schcor
  annotate("text", x = mean(schcor$corr), y = 20,
           label = "averaged correlations over schools",
           colour = "red", angle = 90, vjust = -0.5, hjust = 1) +
  # Solid black line: correlation over all students, ignoring school
  geom_vline(xintercept = corall) +
  annotate("text", x = corall, y = 20,
           label = "correlation computed over individuals",
           angle = 90, vjust = 1, hjust = 1) +
  labs(x = "Correlation coefficient") +
  theme_bw()
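The two reference lines need not coincide: averaging within-school correlations ignores differences between school means, while the correlation over all individuals mixes within-school and between-school variation. A small sketch with hypothetical toy data (the data frame toy and its values are invented for illustration and are not taken from Gcsemv) shows how far apart the two can be:

```r
# Hypothetical toy data: two schools, three students each.
toy <- data.frame(
  school  = factor(rep(c("A", "B"), each = 3)),
  written = c(1, 2, 3, 1, 2, 3),
  course  = c(1, 2, 3, 11, 12, 13)
)

# Correlation within each school, then averaged over schools.
by_school <- sapply(split(toy, toy$school),
                    function(d) cor(d$written, d$course))
mean_school_cor <- mean(by_school)

# Correlation over all students, ignoring the school label.
pooled_cor <- cor(toy$written, toy$course)

mean_school_cor  # 1: the relation is perfectly linear inside each school
pooled_cor       # about 0.16: the gap between school means adds scatter
```

In this toy case the pooled value is lower than the averaged one, but the ordering depends on the data, so the labels in the plot remain the reliable way to tell the lines apart.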

Produce the following plot using the data set autism{WWGbook}.

dta<-autism
head(dta)

    age vsae sicdegp childid
  1   2    6       3       1
  2   3    7       3       1
  3   5   18       3       1
  4   9   25       3       1
  5  13   27       3       1
  6   2   17       3       3

str(dta)

  'data.frame': 612 obs. of  4 variables:
   $ age    : int  2 3 5 9 13 2 3 5 9 13 ...
   $ vsae   : int  6 7 18 25 27 17 18 12 18 24 ...
   $ sicdegp: int  3 3 3 3 3 3 3 3 3 3 ...
   $ childid: int  1 1 1 1 1 3 3 3 3 3 ...

ggplot(dta, aes(age, vsae, group = childid)) +
  geom_point(size = 1.5) +
  geom_line() +
  facet_grid(. ~ sicdegp) +
  labs(x = "Age(years)", y = "VSAE score") +
  theme_bw()

Homework 2_ML

Ching Wen, Su

2016/9/19