Exercise : Introduction

1. Use the women data set in the package ‘datasets’ for the problems.

Compute mean height.

library(datasets)
mean(women$height)

## [1] 65

Make a box plot of women’s weights.

library(ggplot2)
ggplot(women,aes(x=factor(0),y=weight))+
        geom_boxplot()+
        labs(x = "Women")

Draw a scatter diagram of height by weight.

ggplot(women,aes(x=height,y=weight))+
        geom_point()

Add a line to indicate the mean height on the scatter plot.

ggplot(women,aes(x=height,y=weight))+
        geom_point()+
        geom_vline(xintercept  = mean(women$height))

2. Repeat the simple regression analysis in the math attainment example using curriculum coverage as a predictor of math attainment at Year 2.

data <- read.table("data/mathAttainment.txt", header = TRUE)
head(data)

##   math2 math1     cc
## 1    28    18 328.20
## 2    56    22 406.03
## 3    51    44 386.94
## 4    13     8 166.91
## 5    39    20 328.20
## 6    41    12 328.20

model1 <- lm(math2~cc,data=data)
summary(model1)

## 
## Call:
## lm(formula = math2 ~ cc, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9490  -6.2744  -0.9887   6.1920  18.3475 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.09304    3.23490   4.047 0.000254 ***
## cc           0.08301    0.01566   5.301 5.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.19 on 37 degrees of freedom
## Multiple R-squared:  0.4317, Adjusted R-squared:  0.4163 
## F-statistic:  28.1 on 1 and 37 DF,  p-value: 5.545e-06
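
The figures in this summary are tied together by simple arithmetic, which allows a quick sanity check. The sketch below recomputes the t value and the F statistic from the printed, rounded estimates; the variable names are illustrative only, and small rounding differences are to be expected.

```r
# Values copied from the printed summary(model1) output (rounded).
b_cc     <- 0.08301   # slope for cc
se_cc    <- 0.01566   # standard error of the slope
r2       <- 0.4317    # multiple R-squared
df_resid <- 37        # residual degrees of freedom

# t value of the slope: estimate divided by its standard error.
t_cc <- b_cc / se_cc

# With a single predictor, F = (R^2 / 1) / ((1 - R^2) / df_resid),
# and F equals the squared t value of the slope.
f_stat <- (r2 / 1) / ((1 - r2) / df_resid)

round(c(t = t_cc, t_squared = t_cc^2, F = f_stat), 2)
```

Both routes should land close to the reported t = 5.301 and F = 28.1.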

3. Use descriptive analysis to provide tentative answers to the question posed in the IQ and language score data set.

ANS: A multiple linear regression was fitted to predict language score from verbal IQ (viq) and SES. The regression equation was significant (F(2, 2284) = 764.8, p < .001), with an R-squared of .40. In standardized terms, language score increased by 0.55 standard deviations for each standard deviation of verbal IQ and by 0.18 standard deviations for each standard deviation of SES.

data <- read.table("data/verbalIQ.txt", header = TRUE)
model2 <- lm(language~viq+ses,data=data)
library(lm.beta)
summary(lm.beta(model2))

## 
## Call:
## lm(formula = language ~ viq + ses, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.8208  -4.5375   0.4499   4.9173  25.6117 
## 
## Coefficients:
##             Estimate Standardized Std. Error t value Pr(>|t|)    
## (Intercept)  8.33128      0.00000    0.85416   9.754   <2e-16 ***
## viq          2.40542      0.55272    0.07430  32.376   <2e-16 ***
## ses          0.14877      0.18024    0.01409  10.557   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.971 on 2284 degrees of freedom
## Multiple R-squared:  0.4011, Adjusted R-squared:  0.4006 
## F-statistic: 764.8 on 2 and 2284 DF,  p-value: < 2.2e-16
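
The "Standardized" column can be read as the slope obtained after converting every variable to z-scores. The sketch below illustrates that relationship on the built-in mtcars data, since the verbal IQ file is not bundled with R. The model and variable names are assumptions chosen for illustration, not part of the exercise.

```r
library(datasets)

# Illustrative model on a built-in data set (not the verbal IQ data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
b   <- coef(fit)[c("wt", "hp")]

# Standardized slope = raw slope * sd(predictor) / sd(outcome).
beta_manual <- b * sapply(mtcars[c("wt", "hp")], sd) / sd(mtcars$mpg)

# The same quantity from a model fitted to z-scored variables.
z     <- as.data.frame(scale(mtcars[c("mpg", "wt", "hp")]))
fit_z <- lm(mpg ~ wt + hp, data = z)
beta_scaled <- coef(fit_z)[c("wt", "hp")]

cbind(beta_manual, beta_scaled)
```

The two columns should agree, which is why a standardized coefficient is interpreted per standard deviation rather than per raw unit.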

4. Use the minn38 data set in the package ‘MASS’ for the problems. Use library(MASS) to load the package and help(minn38) to view the data description.

library(MASS)
help(minn38)
head(minn38)

##   hs phs fol sex  f
## 1  L   C  F1   M 87
## 2  L   C  F2   M 72
## 3  L   C  F3   M 52
## 4  L   C  F4   M 88
## 5  L   C  F5   M 32
## 6  L   C  F6   M 14

How many female high school graduates were there in 1938?

sum(subset(minn38,sex=="F")$f)

## [1] 7861

How many female high school graduates enrolled in college in 1938?

sum(subset(minn38,sex=="F"&phs=="C")$f)

## [1] 2027
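
Because minn38 stores counts in the column f, both totals can also be read off a single cross-tabulation. The following is a minimal sketch using xtabs() from base R; the object names by_sex and by_sex_phs are illustrative.

```r
library(MASS)  # provides minn38

# Sum the frequency column f within each sex.
by_sex <- xtabs(f ~ sex, data = minn38)

# Cross-tabulate sex by post high school status (phs).
by_sex_phs <- xtabs(f ~ sex + phs, data = minn38)

by_sex[["F"]]          # all female graduates
by_sex_phs["F", "C"]   # female graduates who enrolled in college
```

This gives the same numbers as the subset() calls while also showing the remaining cells of the table.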

Examine the distributions of high school graduates (frequencies or counts) in 1938 by father’s occupational status.

minn38_2 <- data.frame(F1=sum(subset(minn38,fol=="F1")$f))
minn38_2$F2 <- sum(subset(minn38,fol=="F2")$f)
minn38_2$F3 <- sum(subset(minn38,fol=="F3")$f)
minn38_2$F4 <- sum(subset(minn38,fol=="F4")$f)
minn38_2$F5 <- sum(subset(minn38,fol=="F5")$f)
minn38_2$F6 <- sum(subset(minn38,fol=="F6")$f)
minn38_2$F7 <- sum(subset(minn38,fol=="F7")$f)
library(reshape)
minn38_3 <- melt (minn38_2)

## Using  as id variables

ggplot(minn38_3,aes(x=variable,y=value))+
        geom_bar(stat = "identity")+
        labs(x="fol",y="Frequency")

Examine the distributions of male high school graduates (frequencies or counts) in 1938 by post high school status.

minn38_4 <- data.frame(C=sum(subset(minn38,phs=="C"&sex=="M")$f))
minn38_4$N <- sum(subset(minn38,phs=="N"&sex=="M")$f)
minn38_4$E <- sum(subset(minn38,phs=="E"&sex=="M")$f)
minn38_4$O <- sum(subset(minn38,phs=="O"&sex=="M")$f)
minn38_5 <- melt (minn38_4)

## Using  as id variables

ggplot(minn38_5,aes(x=variable,y=value))+
        geom_bar(stat = "identity")+
        labs(x="phs",y="Frequency")
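
The per-level totals built column by column above can also be produced in one step. The sketch below uses aggregate() from base R as one possible shortcut; the object names by_fol and male_by_phs are illustrative.

```r
library(MASS)  # provides minn38

# Total graduates per father's occupational level, one row per level.
by_fol <- aggregate(f ~ fol, data = minn38, FUN = sum)

# Male graduates only, totalled per post high school status.
male_by_phs <- aggregate(f ~ phs, data = subset(minn38, sex == "M"), FUN = sum)

by_fol
male_by_phs
```

Either result is already in long format, so it could be passed to ggplot() with aes(x = fol, y = f) or aes(x = phs, y = f) and geom_bar(stat = "identity") without a melt() step.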

5. Use the sleep data set in the package ‘datasets’ for the problems.

What does ‘?sleep’ do?

ANS: The command ‘?sleep’ opens the help page for the sleep data set; it is shorthand for help(sleep).

?sleep

Plot the data set in some meaningful way.

Show the data as a box plot and a density plot.

library(ggplot2)
ggplot(sleep,aes(x=group,y=extra,fill=group))+
        geom_boxplot()+
        coord_flip()

ggplot(sleep,aes(extra,fill=group))+
        geom_density(alpha=0.5)+
        xlim(min(sleep$extra), max(sleep$extra))+ # optional: limits the x-axis to the observed range
        scale_fill_brewer(palette="Spectral")

Was the effect of two drugs statistically different?

ANS: There is a significant difference between the two drugs (t(9) = -4.06, p = .003). The effect of drug 2 was significantly larger than that of drug 1.

model3 <- t.test(extra~group,paired=T,data=sleep)
model3

## 
##  Paired t-test
## 
## data:  extra by group
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4598858 -0.7001142
## sample estimates:
## mean of the differences 
##                   -1.58
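
To the best of my knowledge, recent R releases (4.4.0 and later) no longer accept the paired argument in the formula interface of t.test(). The following is a minimal sketch of an equivalent call on two vectors. It assumes the rows of sleep are ordered by ID within each group, as they are in the built-in data set; the names drug1, drug2 and res are illustrative.

```r
library(datasets)

# Extra sleep for the same ten subjects under each drug.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired t-test on the within-subject differences (drug1 - drug2).
res <- t.test(drug1, drug2, paired = TRUE)
res

# The estimate is simply the mean of the paired differences.
mean(drug1 - drug2)
```

The test statistic and degrees of freedom should match the output above (t = -4.0621, df = 9); only the "data:" line of the printout differs.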

Show control group subjects whose weight is greater than or equal to 5.

ANS: This question needs further explanation before it can be answered; the sleep data set contains no weight variable.

List all subjects in the sleep data frame whose extra sleep time is more than 1 hour.

subset(sleep, extra > 1)

##    extra group ID
## 6    3.4     1  6
## 7    3.7     1  7
## 10   2.0     1 10
## 11   1.9     2  1
## 13   1.1     2  3
## 16   4.4     2  6
## 17   5.5     2  7
## 18   1.6     2  8
## 19   4.6     2  9
## 20   3.4     2 10

List all Group 1 subjects in the sleep data frame whose extra sleep time is reduced by 1 hour or more.

subset(sleep,group==1&extra<=-1)

##   extra group ID
## 2  -1.6     1  2
## 4  -1.2     1  4

6. Find out what heatmap(Harman74.cor$cov) accomplishes.

ANS: The function heatmap() draws the matrix as a grid of colored cells; here the matrix holds the correlations among the 24 psychological tests in Harman74.cor. It also runs a hierarchical cluster analysis on the rows and on the columns, reorders both according to the resulting dendrograms, and draws those dendrograms along the top and left margins of the plot.

heatmap(Harman74.cor$cov)
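
The reordering can be inspected directly, because heatmap() invisibly returns the row and column order it used. The sketch below assumes the default settings (Euclidean distance and complete-linkage clustering); the names m, hm and hc are illustrative.

```r
library(datasets)

m <- Harman74.cor$cov   # 24 x 24 correlation matrix

# heatmap() returns its row and column ordering invisibly.
hm <- heatmap(m)

# By default the ordering starts from hclust(dist(x)); the dendrogram
# is then reordered by row means, so hc$order may differ from hm$rowInd.
hc <- hclust(dist(m))

# Test names in the order in which heatmap() arranges the rows.
rownames(m)[hm$rowInd]
```

Tests that end up next to each other in this ordering have similar correlation profiles, which is what the blocks of similar color in the plot reflect.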

Chi-Lin Yu