In-class exercise

Q1.

The math attainment page has a dataset and a script of R code chunks. Generate a markdown file from the script to push the output in HTML for posting to course Moodle site. first R session using math attainment data set

input data

# read in a plain text file with variable names and assign a name to it
dta <- read.table("math_attainment.txt", header = T)

checking data

# structure of data
str(dta)
## 'data.frame':    39 obs. of  3 variables:
##  $ math2: int  28 56 51 13 39 41 30 13 17 32 ...
##  $ math1: int  18 22 44 8 20 12 16 5 9 18 ...
##  $ cc   : num  328 406 387 167 328 ...
# first 6 rows
head(dta)

descriptive statistics

# variable mean
colMeans(dta)
##     math2     math1        cc 
##  28.76923  15.35897 188.83667
# variable sd
apply(dta, 2, sd) #2=column
##     math2     math1        cc 
## 10.720029  7.744224 84.842513
# correlation matrix
cor(dta)
##           math2     math1        cc
## math2 1.0000000 0.7443604 0.6570098
## math1 0.7443604 1.0000000 0.5956771
## cc    0.6570098 0.5956771 1.0000000

regression analysis

# regress math2 by math1
dta.lm <- lm(math2 ~ math1, data=dta)

# show results
summary(dta.lm)
## 
## Call:
## lm(formula = math2 ~ math1, data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.430  -5.521  -0.369   4.253  20.388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.944      2.607   4.965 1.57e-05 ***
## math1          1.030      0.152   6.780 5.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.255 on 37 degrees of freedom
## Multiple R-squared:  0.5541, Adjusted R-squared:  0.542 
## F-statistic: 45.97 on 1 and 37 DF,  p-value: 5.571e-08
# show anova table
anova(dta.lm)

plot data

# specify square plot region
par(pty="s")

# scatter plot of math2 by math1
plot(math2 ~ math1, data=dta, xlim=c(0, 60), ylim=c(0, 60),
     xlab="Math score at Year 1", ylab="Math score at Year 2")

# add grid lines
grid()

# add regression line
abline(dta.lm, lty=2)

# add plot title
title("Mathematics Attainment")

diagnostics

# specify maximum plot region
par(pty="m")

#
plot(scale(resid(dta.lm)) ~ fitted(dta.lm), 
     ylim=c(-3.5, 3.5), type="n",
     xlab="Fitted values", ylab="Standardized residuals") 

#
text(fitted(dta.lm), scale(resid(dta.lm)), labels=rownames(dta), cex=0.5)  

#
grid()

# add a horizontal red dash line
abline(h=0, lty=2, col="red")

normality check

qqnorm(scale(resid(dta.lm)))

qqline(scale(resid(dta.lm)))

grid()

Q2.

The notation, women{datasets}, indicates that a data object by the name women is in the datasets package. This package is preloaded when R is invoked. Explain the difference between c(women) and c(as.matrix(women)) using the women{datasets}.

data(women)
d1 <- c(women)
d2 <- c(as.matrix(women))
d1
## $height
##  [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 
## $weight
##  [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
d2
##  [1]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 115 117
## [18] 120 123 126 129 132 135 139 142 146 150 154 159 164
typeof(d1)
## [1] "list"
typeof(d2)
## [1] "double"
class(d1)
## [1] "list"
class(d2)
## [1] "numeric"

Q3.

Use help to examine the coding scheme for the mother’s race variable in the birthwt{MASS} dataset. The MASS comes with the base R installation but is not automatically loaded when R is invoked.

??MASS::birthwt
library(MASS)
data(birthwt)
summary(factor(c("White", "Black", "Other")[birthwt$race]))
## Black Other White 
##    26    67    96
#summary(factor(ifelse(birthwt$race == 1, "White", ifelse(birthwt$race == 2,"Black", "Other"))))

How many black mothers are there in this data frame? Ans. 26 What does the following R command do? ’> c(“White”, “Black”, “Other”)[birthwt$race] birthwt\(race == 1 -> "White" birthwt\)race == 2 -> “Black” birthwt$race == 3 -> “Other”

Q4.

Regarding UCBAdmissions{datasets} data object, what does the output ‘> UCBAdmissions[,1,]’> UCBAdmissions[,1,1] ’> UCBAdmissions[1,1,] of each of the above R statements mean, respectively?

UCBAdmissions
## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317

[Admit,Gender,Dept]

UCBAdmissions[,1,]
##           Dept
## Admit        A   B   C   D   E   F
##   Admitted 512 353 120 138  53  22
##   Rejected 313 207 205 279 138 351

在 Gender = Male 時 Dept 與 Admit 的列表。

UCBAdmissions[,1,1]
## Admitted Rejected 
##      512      313

在 Gender = Male, Dept = A 時 Admit 的值。

UCBAdmissions[1,1,]
##   A   B   C   D   E   F 
## 512 353 120 138  53  22

每一 Dept 在 Admit = Admitted, Gender = Male 的值。

Q5.

Concerning the chickwts{datasets}, explain the difference between ‘> chickwts[,2]’> chickwts[“feed”]

chickwts[,2]
##  [1] horsebean horsebean horsebean horsebean horsebean horsebean horsebean
##  [8] horsebean horsebean horsebean linseed   linseed   linseed   linseed  
## [15] linseed   linseed   linseed   linseed   linseed   linseed   linseed  
## [22] linseed   soybean   soybean   soybean   soybean   soybean   soybean  
## [29] soybean   soybean   soybean   soybean   soybean   soybean   soybean  
## [36] soybean   sunflower sunflower sunflower sunflower sunflower sunflower
## [43] sunflower sunflower sunflower sunflower sunflower sunflower meatmeal 
## [50] meatmeal  meatmeal  meatmeal  meatmeal  meatmeal  meatmeal  meatmeal 
## [57] meatmeal  meatmeal  meatmeal  casein    casein    casein    casein   
## [64] casein    casein    casein    casein    casein    casein    casein   
## [71] casein   
## Levels: casein horsebean linseed meatmeal soybean sunflower
class(chickwts[,2])
## [1] "factor"
typeof(chickwts[,2])
## [1] "integer"
chickwts["feed"]
class(chickwts["feed"])
## [1] "data.frame"
typeof(chickwts["feed"])
## [1] "list"

Q6.

What happens when the following command is entered? ’> help(ls(“package:MASS”)[92]) Use the observation to find out how many items there are in the package MASS.

help(ls("package:MASS")[92]) 

infomation about menarche {MASS} pop up

length(ls("package:MASS"))
## [1] 165

165 items in the package MASS.

Exercises

Q1.

A built-in demo for the data object ToothGrowth{datasets} is run by issuing the following command: ’> example(ToothGrowth)

Replicate the analysis in the style of the headache study to investigate if type of vitamin supplement has an effect on teeth growth in guinea pigs. For this exercise, we shall ignore the dose variable.

# example(ToothGrowth)
require(graphics)
plot(len ~ supp, data = ToothGrowth)

Q2.

Find all possible sums from rolling three dice using R. If possible, construct a histogram for the sum of three dice. Is this a probability histogram or an empirical histogram?

d1 <- sample(1:6, size=1000, replace = TRUE)
d2 <- sample(1:6, size=1000, replace = TRUE)
d3 <- sample(1:6, size=1000, replace = TRUE)
hist(d1+d2+d3,xlab = "total",main = "three dices summary")

Q3.

The IQ and behavior problem page has a dataset and a script of R code chunks. Generate a markdown file from the script to push the output in HTML for posting to course Moodle site. Explain what the code chunks do with your comments in the markdown file. https://rpubs.com/PL_W/c3e3

Q4.

The usBirths2015.txt is a dataset of monthly births in the US in 2015. Summarize the number of births by season.

dta <- read.table("usBirths2015.txt",header = T)
dta$season <- factor(
       ifelse(dta$month %in% c("March","April","May"), "Spring", 
       ifelse(dta$month %in% c("June","July","August"), "Summer", 
       ifelse(dta$month %in% c("September","October","November"), "Autumn", "Winter"))))
aggregate(birth ~ season, data=dta, FUN=sum)

Q5.

Ten subjects read a paragraph consisting of seven sentences. The reading time (in seconds) for each sentence was the outcome measure. The predictors are the serial position of the sentence (Sp), the number of words in the sentences (Wrds), and the number of new arguments in the sentence (New). (a) Rank subjects by their reading speeed (b) Estimate, on average, how long does it take to read a word.

dta <- read.table("readingtimes.txt",header = T)
dta
order(colSums(dta[,5:14]))
##  [1]  9  3  4  2  1 10  8  7  6  5
colSums(dta[,5:14])/sum(dta[,3])
##       S01       S02       S03       S04       S05       S06       S07 
## 0.3569259 0.3324568 0.3261728 0.3266049 0.4856790 0.4641852 0.4518395 
##       S08       S09       S10 
## 0.4330864 0.2368765 0.3883457