A sample of pupils whose mathematics attainment is tested at the end of Year 2 (seven years old). The coverage of mathematics curriculum each pupil received during Year 2 is measured; so is the mathematics attainment at the beginning of the year. There are 39 pupils in the study.
Column 1: Mathematics attainment, end of Year 2
Column 2: Mathematics attainment, end of Year 1
Column 3: Curriculum coverage
# first R session using math attainment data set
# read in a plain text file with variable names and assign a name to it
<- read.table("math_attainment.txt", header = T) dta
# structure of data
str(dta)
## 'data.frame': 39 obs. of 3 variables:
## $ math2: int 28 56 51 13 39 41 30 13 17 32 ...
## $ math1: int 18 22 44 8 20 12 16 5 9 18 ...
## $ cc : num 328 406 387 167 328 ...
# first 6 rows
head(dta)
## math2 math1 cc
## 1 28 18 328.20
## 2 56 22 406.03
## 3 51 44 386.94
## 4 13 8 166.91
## 5 39 20 328.20
## 6 41 12 328.20
# variable mean
colMeans(dta)
## math2 math1 cc
## 28.76923 15.35897 188.83667
Mean math score of second year is higher than first year.
# variable sd
apply(dta, 2, sd)
## math2 math1 cc
## 10.720029 7.744224 84.842513
# correlation matrix
cor(dta)
## math2 math1 cc
## math2 1.0000000 0.7443604 0.6570098
## math1 0.7443604 1.0000000 0.5956771
## cc 0.6570098 0.5956771 1.0000000
# regress math2 by math1
<- lm(math2 ~ math1, data=dta)
dta.lm
# show results
summary(dta.lm)
##
## Call:
## lm(formula = math2 ~ math1, data = dta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.430 -5.521 -0.369 4.253 20.388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.944 2.607 4.965 1.57e-05 ***
## math1 1.030 0.152 6.780 5.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.255 on 37 degrees of freedom
## Multiple R-squared: 0.5541, Adjusted R-squared: 0.542
## F-statistic: 45.97 on 1 and 37 DF, p-value: 5.571e-08
# show anova table
anova(dta.lm)
## Analysis of Variance Table
##
## Response: math2
## Df Sum Sq Mean Sq F value Pr(>F)
## math1 1 2419.6 2419.59 45.973 5.571e-08 ***
## Residuals 37 1947.3 52.63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# specify square plot region
par(pty="s")
# scatter plot of math2 by math1
plot(math2 ~ math1, data=dta, xlim=c(0, 60), ylim=c(0, 60),
xlab="Math score at Year 1", ylab="Math score at Year 2")
# add grid lines
grid()
# add regression line
abline(dta.lm, lty=2)
# add plot title
title("Mathematics Attainment")
The math score of first year may be a proper index to predict the math score of second year.
# specify maximum plot region
par(pty="m")
# plot centering residuals value ~ fitted value
# ??? resid(dta.lm) versus scale(resid(dta.lm)) ???
plot(scale(resid(dta.lm)) ~ fitted(dta.lm),
ylim=c(-3.5, 3.5), type="n",
xlab="Fitted values", ylab="Standardized residuals")
# add id label
text(fitted(dta.lm), scale(resid(dta.lm)), labels=rownames(dta), cex=0.5)
# add auxiliary line
grid()
# add a horizontal red dash line
abline(h=0, lty=2, col="red")
qqnorm(scale(resid(dta.lm)))
qqline(scale(resid(dta.lm)))
grid()
Question: The notation, women{datasets}, indicates that a data object by the name women is in the datasets package. This package is preloaded when R is invoked. Explain the difference between c(women) and c(as.matrix(women)) using the women{datasets}.
c() is a generic function which combines values into a vector or list.
The default method combines its arguments to form a vector. All arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed.
women
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## 7 64 132
## 8 65 135
## 9 66 139
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
# women is a list type object
typeof(women)
## [1] "list"
# therefore c(women) is a list class
class(c(women))
## [1] "list"
# c(women) is a list with 2 vector components
c(women)
## $height
## [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
##
## $weight
## [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
# as.matrix() attempts to turn its argument into a matrix.
# list transform to double type array/matrix
as.matrix(women)
## height weight
## [1,] 58 115
## [2,] 59 117
## [3,] 60 120
## [4,] 61 123
## [5,] 62 126
## [6,] 63 129
## [7,] 64 132
## [8,] 65 135
## [9,] 66 139
## [10,] 67 142
## [11,] 68 146
## [12,] 69 150
## [13,] 70 154
## [14,] 71 159
## [15,] 72 164
# c(as.matrix(women)) is a double numeric vector
c(as.matrix(women))
## [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 115 117 120 123
## [20] 126 129 132 135 139 142 146 150 154 159 164
typeof(c(as.matrix(women)))
## [1] "double"
class(c(as.matrix(women)))
## [1] "numeric"
Use help to examine the coding scheme for the mother’s race variable in the birthwt{MASS} dataset. The MASS comes with the base R installation but is not automatically loaded when R is invoked.
The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986.
race: mother’s race (1 = white, 2 = black, 3 = other).
1.How many black mothers are there in this data frame?
library(MASS)
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
ftable(birthwt$race)
## 1 2 3
##
## 96 26 67
2.What does the following R command do?
\[ c("White", "Black", "Other")[birthwt$race] \]
將race這個欄位裡原本的數字改成文字呈現 1=White, 2=Black, 3=Other
c("White", "Black", "Other")[birthwt$race]
## [1] "Black" "Other" "White" "White" "White" "Other" "White" "Other" "White"
## [10] "White" "Other" "Other" "Other" "Other" "White" "White" "Black" "White"
## [19] "Other" "White" "Other" "White" "White" "Other" "Other" "White" "White"
## [28] "White" "Black" "Black" "Black" "White" "Black" "White" "Black" "White"
## [37] "White" "White" "White" "White" "Black" "White" "Black" "White" "White"
## [46] "White" "White" "Other" "White" "Other" "White" "Other" "White" "White"
## [55] "Other" "Other" "Other" "Other" "Other" "Other" "Other" "Other" "Other"
## [64] "White" "Other" "Other" "Other" "Other" "White" "Black" "White" "Other"
## [73] "Other" "Black" "White" "Black" "White" "White" "Black" "White" "White"
## [82] "White" "Other" "Other" "Other" "Other" "Other" "White" "White" "White"
## [91] "White" "Other" "White" "White" "White" "White" "White" "White" "White"
## [100] "White" "White" "White" "Other" "White" "Other" "Black" "White" "White"
## [109] "White" "Black" "White" "Other" "White" "White" "White" "Other" "White"
## [118] "Other" "White" "Other" "White" "Other" "White" "White" "White" "White"
## [127] "White" "White" "White" "White" "Other" "White" "Black" "Other" "Other"
## [136] "Other" "Other" "Black" "Other" "White" "White" "White" "Other" "Other"
## [145] "White" "White" "Black" "White" "Other" "Other" "Other" "White" "White"
## [154] "White" "White" "Other" "Black" "White" "Black" "Other" "White" "Other"
## [163] "Other" "Other" "Black" "White" "Other" "Other" "White" "White" "Black"
## [172] "Black" "Black" "Other" "Other" "White" "White" "White" "White" "Black"
## [181] "Other" "Other" "White" "Other" "White" "Other" "Other" "Black" "White"
Question: Regarding UCBAdmissions{datasets} data object, what does the output of each of the R statements mean, respectively?
* UCBAdmissions[,1,]
* UCBAdmissions[,1,1]
* UCBAdmissions[1,1,]
UCBAdmissions {datasets} Student Admissions at UC Berkeley
Aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.
A 3-dimensional array resulting from cross-tabulating 4526 observations on 3 variables. The variables and their levels are as follows:
No | Name | Levels |
---|---|---|
1 | Admit | Admitted, Rejected |
2 | Gender | Male, Female |
3 | Dept | A, B, C, D, E, F |
UCBAdmissions
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
1,] UCBAdmissions[,
## Dept
## Admit A B C D E F
## Admitted 512 353 120 138 53 22
## Rejected 313 207 205 279 138 351
The admission status for each department in male students.
1,1] UCBAdmissions[,
## Admitted Rejected
## 512 313
The admission status for “A” department in male students.
1,1,] UCBAdmissions[
## A B C D E F
## 512 353 120 138 53 22
The admitted number for each department admitted in male students.
What happens when the following command is entered?
#No run
help(ls("package:MASS")[92])
Use the observation to find out how many items there are in the package MASS.
ls("package:MASS")
## [1] "abbey" "accdeaths" "addterm"
## [4] "Aids2" "Animals" "anorexia"
## [7] "area" "as.fractions" "bacteria"
## [10] "bandwidth.nrd" "bcv" "beav1"
## [13] "beav2" "biopsy" "birthwt"
## [16] "Boston" "boxcox" "cabbages"
## [19] "caith" "Cars93" "cats"
## [22] "cement" "chem" "con2tr"
## [25] "contr.sdif" "coop" "corresp"
## [28] "cov.mcd" "cov.mve" "cov.rob"
## [31] "cov.trob" "cpus" "crabs"
## [34] "Cushings" "DDT" "deaths"
## [37] "denumerate" "dose.p" "drivers"
## [40] "dropterm" "eagles" "enlist"
## [43] "epil" "eqscplot" "farms"
## [46] "fbeta" "fgl" "fitdistr"
## [49] "forbes" "fractions" "frequency.polygon"
## [52] "GAGurine" "galaxies" "gamma.dispersion"
## [55] "gamma.shape" "gehan" "genotype"
## [58] "geyser" "gilgais" "ginv"
## [61] "glm.convert" "glm.nb" "glmmPQL"
## [64] "hills" "hist.FD" "hist.scott"
## [67] "housing" "huber" "hubers"
## [70] "immer" "Insurance" "is.fractions"
## [73] "isoMDS" "kde2d" "lda"
## [76] "ldahist" "leuk" "lm.gls"
## [79] "lm.ridge" "lmsreg" "lmwork"
## [82] "loglm" "loglm1" "logtrans"
## [85] "lqs" "lqs.formula" "ltsreg"
## [88] "mammals" "mca" "mcycle"
## [91] "Melanoma" "menarche" "michelson"
## [94] "minn38" "motors" "muscle"
## [97] "mvrnorm" "nclass.freq" "neg.bin"
## [100] "negative.binomial" "negexp.SSival" "newcomb"
## [103] "nlschools" "npk" "npr1"
## [106] "Null" "oats" "OME"
## [109] "painters" "parcoord" "petrol"
## [112] "phones" "Pima.te" "Pima.tr"
## [115] "Pima.tr2" "polr" "psi.bisquare"
## [118] "psi.hampel" "psi.huber" "qda"
## [121] "quine" "Rabbit" "rational"
## [124] "renumerate" "rlm" "rms.curv"
## [127] "rnegbin" "road" "rotifer"
## [130] "Rubber" "sammon" "select"
## [133] "Shepard" "ships" "shoes"
## [136] "shrimp" "shuttle" "Sitka"
## [139] "Sitka89" "Skye" "snails"
## [142] "SP500" "stdres" "steam"
## [145] "stepAIC" "stormer" "studres"
## [148] "survey" "synth.te" "synth.tr"
## [151] "theta.md" "theta.ml" "theta.mm"
## [154] "topo" "Traffic" "truehist"
## [157] "ucv" "UScereal" "UScrime"
## [160] "VA" "waders" "whiteside"
## [163] "width.SJ" "write.matrix" "wtloss"