DataM: Homework Exercise 0413 - Trellis
HW exercise 1.
Use trellis graphics to explore various ways to display the sample data from the National Longitudinal Survey of Youth.
The data are drawn from the National Longitudinal Survey of Youth (NLSY). The sample observations are from the 1986, 1988, 1990, and 1992 assessment periods. Children were selected to be in kindergarten, first, and second grade and to be of age 5, 6, or 7 at the first assessment (1986). Both reading and mathematical achievement scores are recorded. The former is a recognition subscore of the Peabody Individual Achievement Test (PIAT). This was scaled as the percentage of 84 items that were answered correctly. The same 84 items were administered at all four time points, providing a consistent scale over time. The data set is a subsample of 166 subjects with complete observations.
Source: Bollen, K.A. & Curran, P.J. (2006). Latent curve models. A structural equation perspective. p.59.
- Column 1: Student ID
- Column 2: Gender, male or female
- Column 3: Race, minority or majority
- Column 4: Measurement occasions
- Column 5: Grade at which measurements were made, Kindergarten = 0, First grade = 1, Second grade = 2
- Column 6: Age in years
- Column 7: Age in months
- Column 8: Math score
- Column 9: Reading score
Load in the package lattice and the data set
'data.frame': 664 obs. of 9 variables:
$ id : int 2390 2560 3740 4020 6350 7030 7200 7610 7680 7700 ...
$ sex : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 2 1 2 ...
$ race : Factor w/ 2 levels "Majority","Minority": 1 1 1 1 1 1 1 1 1 1 ...
$ time : int 1 1 1 1 1 1 1 1 1 1 ...
$ grade: int 0 0 0 0 1 0 0 0 0 0 ...
$ year : int 6 6 6 5 7 5 6 7 6 6 ...
$ month: int 67 66 67 60 78 62 66 79 76 67 ...
$ math : num 14.29 20.24 17.86 7.14 29.76 ...
$ read : num 19.05 21.43 21.43 7.14 30.95 ...
Gender difference
Draw the density plot of math socore and that of reading score with grouping of gender.
lst_gender <- lapply(split(NLSY, NLSY$sex), function(df) {
df_long <- df %>% select(math, read) %>% stack()
df_long$gender <- as.character(df$sex)[1]
return(df_long)
})
df_gender <- rbind(lst_gender[[1]], lst_gender[[2]])
densityplot(~ values | ind, groups = gender, data = df_gender,
layout = c(1, 2), auto.key=list(column=2), xlab='Score')Draw boxplots of math score of each grade with grouping of gender.
Draw boxplots of reading score of each grade with grouping of gender.
Draw scatter plots of math score and reading score with the regreesion line for each grade with grouping of gender.
xyplot(math ~ read | factor(grade), groups = sex, data = NLSY,
layout=c(4, 2), type=c('p', 'r', 'g'), auto.key=list(column=2),
xlab = 'Mathematics score', ylab = 'Reading score')Race
Draw the density plot of math socore and that of reading score with grouping of race.
lst_race <- lapply(split(NLSY, NLSY$race), function(df) {
df_long <- df %>% select(math, read) %>% stack()
df_long$race <- as.character(df$race)[1]
return(df_long)
})
df_race <- rbind(lst_race[[1]], lst_race[[2]])
densityplot(~ values | ind, groups = race, data = df_race,
layout = c(1, 2), auto.key=list(column=2), xlab='Score')Draw boxplots of math score of each grade with grouping of race.
Draw boxplots of reading score of each grade with grouping of race.
Draw scatter plots of math score and reading score with the regreesion line for each grade with grouping of race.
xyplot(math ~ read | factor(grade), groups = race, data = NLSY,
layout=c(4, 2), type=c('p', 'r', 'g'), auto.key=list(column=2),
xlab = 'Mathematics score', ylab = 'Reading score')Repeated factorial ANOVA
Y: Math score
model_math <- aov(math ~ (sex * race * grade) + Error(id / (sex * race * grade) + time), data = NLSY)
summary(model_math)
Error: id
Df Sum Sq Mean Sq
sex 1 1651 1651
Error: time
Df Sum Sq Mean Sq
grade 1 170290 170290
Error: id:sex
Df Sum Sq Mean Sq
sex 1 127.2 127.2
Error: id:race
Df Sum Sq Mean Sq
sex 1 1065 1065
Error: id:grade
Df Sum Sq Mean Sq
sex 1 1228 1228
Error: id:sex:race
Df Sum Sq Mean Sq
sex 1 112.8 112.8
Error: id:sex:grade
Df Sum Sq Mean Sq
sex 1 357.7 357.7
Error: id:race:grade
Df Sum Sq Mean Sq
sex 1 452.9 452.9
Error: id:sex:race:grade
Df Sum Sq Mean Sq
sex 1 2.812 2.812
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 146 146 1.992 0.15866
race 1 776 776 10.612 0.00118 **
grade 1 5375 5375 73.506 < 2e-16 ***
sex:race 1 98 98 1.340 0.24747
sex:grade 1 15 15 0.199 0.65551
race:grade 1 7 7 0.095 0.75807
sex:race:grade 1 97 97 1.322 0.25057
Residuals 647 47314 73
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both of race effect and grade effect is significant in math score. There is a significant difference in math score between minority race and majority race. All interaction terms are not significant.
Y: Reading score
model_read <- aov(read ~ (sex * race * grade) + Error(id / (sex * race * grade) + time), data = NLSY)
summary(model_read)
Error: id
Df Sum Sq Mean Sq
sex 1 437.5 437.5
Error: time
Df Sum Sq Mean Sq
grade 1 196976 196976
Error: id:sex
Df Sum Sq Mean Sq
sex 1 122.1 122.1
Error: id:race
Df Sum Sq Mean Sq
sex 1 621.2 621.2
Error: id:grade
Df Sum Sq Mean Sq
sex 1 1114 1114
Error: id:sex:race
Df Sum Sq Mean Sq
sex 1 188.9 188.9
Error: id:sex:grade
Df Sum Sq Mean Sq
sex 1 6.778 6.778
Error: id:race:grade
Df Sum Sq Mean Sq
sex 1 2038 2038
Error: id:sex:race:grade
Df Sum Sq Mean Sq
sex 1 6.317 6.317
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 12 12 0.090 0.76397
race 1 1103 1103 8.580 0.00352 **
grade 1 4188 4188 32.591 1.73e-08 ***
sex:race 1 487 487 3.787 0.05209 .
sex:grade 1 77 77 0.600 0.43873
race:grade 1 120 120 0.936 0.33360
sex:race:grade 1 103 103 0.803 0.37065
Residuals 647 83150 129
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both of race effect and grade effect is significant in reading score. There is a significant difference in reading score between minority race and majority race. All interaction terms are not significant.
HW exercise 2.
Eight different physical measurements of 30 French girls were recorded from 4 to 15 years old. Explore various ways to display the data using trellis graphics.
Source: Sempe, M., et al. (1987). Multivariate and longitudinal data on growing children: Presentation of the French auxiological survey. In J. Janssen, et al. (1987). Data analysis. The Ins and Outs of solving real problems (pp. 3-6). New York: Plenum Press.
- Column 1: Weight in grams
- Column 2: Height in mms
- Column 3: Head to butt length in mms
- Column 4: Head circumference in mms
- Column 5: Chest circumference in mms
- Column 6: Arm length in mms
- Column 7: Calf length in mms
- Column 8: Pelvis circumference in mms
- Column 9: Age in years
- Column 10: Girl ID
[Solution and Answer]
Load in the data set and check the data structure
Wt Ht Hb Hc Cc
Min. :1214 Min. : 932 Min. :546.0 Min. :465.0 Min. :476.0
1st Qu.:1981 1st Qu.:1145 1st Qu.:631.0 1st Qu.:508.0 1st Qu.:575.0
Median :2778 Median :1312 Median :694.5 Median :522.0 Median :630.0
Mean :3041 Mean :1318 Mean :707.2 Mean :522.8 Mean :656.4
3rd Qu.:4025 3rd Qu.:1504 3rd Qu.:786.0 3rd Qu.:537.0 3rd Qu.:740.0
Max. :6285 Max. :1682 Max. :905.0 Max. :585.0 Max. :935.0
Arm Calf Pelvis age id
Min. :144.0 Min. :198.0 Min. :150.0 Min. : 4.00 S1 : 12
1st Qu.:170.0 1st Qu.:234.8 1st Qu.:190.0 1st Qu.: 6.75 S10 : 12
Median :186.5 Median :266.0 Median :209.0 Median : 9.50 S11 : 12
Mean :192.8 Mean :271.5 Mean :214.8 Mean : 9.50 S12 : 12
3rd Qu.:216.0 3rd Qu.:309.2 3rd Qu.:241.0 3rd Qu.:12.25 S13 : 12
Max. :285.0 Max. :380.0 Max. :305.0 Max. :15.00 S14 : 12
(Other):288
[1] 12
Compute average measurements for each girl
Wt Ht Hb Hc Cc
Min. :1214 Min. : 932 Min. :546.0 Min. :465.0 Min. :476.0
1st Qu.:1981 1st Qu.:1145 1st Qu.:631.0 1st Qu.:508.0 1st Qu.:575.0
Median :2778 Median :1312 Median :694.5 Median :522.0 Median :630.0
Mean :3041 Mean :1318 Mean :707.2 Mean :522.8 Mean :656.4
3rd Qu.:4025 3rd Qu.:1504 3rd Qu.:786.0 3rd Qu.:537.0 3rd Qu.:740.0
Max. :6285 Max. :1682 Max. :905.0 Max. :585.0 Max. :935.0
Arm Calf Pelvis age id
Min. :144.0 Min. :198.0 Min. :150.0 Min. : 4.00 S1 : 12
1st Qu.:170.0 1st Qu.:234.8 1st Qu.:190.0 1st Qu.: 6.75 S10 : 12
Median :186.5 Median :266.0 Median :209.0 Median : 9.50 S11 : 12
Mean :192.8 Mean :271.5 Mean :214.8 Mean : 9.50 S12 : 12
3rd Qu.:216.0 3rd Qu.:309.2 3rd Qu.:241.0 3rd Qu.:12.25 S13 : 12
Max. :285.0 Max. :380.0 Max. :305.0 Max. :15.00 S14 : 12
(Other):288
[1] 12
Present relationship between weight and height for each year and generally
xyplot(Wt ~ Ht, groups = factor(age), data = dta2,
auto.key = list(column=4), pch=16,
type = c('p', 'r', 'g'), xlab= 'Height', ylab = 'Weight')[1] 0.9491352
4 5 6 7 8 9 10 11
0.6589876 0.7131878 0.7563261 0.7550721 0.7381013 0.6127379 0.6706406 0.6738371
12 13 14 15
0.7103265 0.6313972 0.4659650 0.3489307
# Intecept and slope in linear model (weight ~ height)
split(dta2, dta2$age) %>% sapply(., function(df) {coefficients(lm(Wt ~ Ht, data = df))[1]}) 4.(Intercept) 5.(Intercept) 6.(Intercept) 7.(Intercept) 8.(Intercept)
-1492.131 -2071.391 -2397.710 -2525.049 -3362.959
9.(Intercept) 10.(Intercept) 11.(Intercept) 12.(Intercept) 13.(Intercept)
-3230.680 -4717.608 -4661.448 -5253.717 -6183.765
14.(Intercept) 15.(Intercept)
-4622.857 -1026.453
4.Ht 5.Ht 6.Ht 7.Ht 8.Ht 9.Ht 10.Ht 11.Ht
3.019906 3.558213 3.827905 3.936904 4.638774 4.579293 5.745730 5.712182
12.Ht 13.Ht 14.Ht 15.Ht
6.131886 6.841161 5.939274 3.772245
- Generally, height highly positively correlated with weight. But when grouping with age, the correaltion are not so high.
- In the linear models of
weight ~ height, all slopes are positive. This indicates that from age 4-15, height keep positively correlatied with weight in girls. - While the intercept of model with age 15 is quite different from others, indicating that girls get heavier in age 15 than other age.
Present the relationship between weight (Wt) and other measurments for each age
Standardize each measurement
Warning in mean.default(x): argument is not numeric or logical: returning NA
Warning in Ops.factor(x, center): '-' not meaningful for factors
Wt Ht Hb Hc
Min. :-1.4813 Min. :-1.89954 Min. :-1.7435 Min. :-2.78039
1st Qu.:-0.8593 1st Qu.:-0.85086 1st Qu.:-0.8241 1st Qu.:-0.71183
Median :-0.2137 Median :-0.03113 Median :-0.1372 Median :-0.03835
Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.7977 3rd Qu.: 0.91785 3rd Qu.: 0.8526 3rd Qu.: 0.68324
Max. : 2.6301 Max. : 1.79298 Max. : 2.1399 Max. : 2.99233
Cc Arm Calf Pelvis
Min. :-1.7838 Min. :-1.6862 Min. :-1.7252 Min. :-1.9485
1st Qu.:-0.8049 1st Qu.:-0.7872 1st Qu.:-0.8628 1st Qu.:-0.7464
Median :-0.2610 Median :-0.2167 Median :-0.1295 Median :-0.1755
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.8267 3rd Qu.: 0.8034 3rd Qu.: 0.8855 3rd Qu.: 0.7862
Max. : 2.7548 Max. : 3.1892 Max. : 2.5457 Max. : 2.7095
age id
Min. : 4.00 S1 : 12
1st Qu.: 6.75 S10 : 12
Median : 9.50 S11 : 12
Mean : 9.50 S12 : 12
3rd Qu.:12.25 S13 : 12
Max. :15.00 S14 : 12
(Other):288
Turn data set into long form
dta2_Wt <- dta2_std %>% select(-Wt, -id, -age) %>% stack
dta2_Wt$Wt <- rep(dta2_std$Wt, ncol(dta2_std)-3)
dta2_Wt$age <- rep(dta2_std$age, ncol(dta2_std)-3)
dta2_Wt$id <- rep(dta2_std$id, ncol(dta2_std)-3)
str(dta2_Wt)'data.frame': 2520 obs. of 5 variables:
$ values: num -1.44 -1.57 -1.76 -1.54 -1.51 ...
$ ind : Factor w/ 7 levels "Ht","Hb","Hc",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Wt : num -1.29 -1.31 -1.38 -1.16 -1.1 ...
$ age : int 4 4 4 4 4 4 4 4 4 4 ...
$ id : Factor w/ 30 levels "S1","S10","S11",..: 1 12 23 25 26 27 28 29 30 2 ...
Plot
xyplot(Wt ~ values | ind, groups = paste0('Age ', age), data = dta2_Wt,
auto.key = list(column=4), pch=16, cex=.4, type = c('p', 'r', 'g'),
xlab= 'Standardized measurement', ylab = 'Standardized weight')The slopes of Ht in predicting Wt are relatively similar in different ages. While slopes of other measurement in predicting Wt differ in different ages. Take head circumference (Hc) as an example, the slope of Hc in predicting Wt get higher when girls get older.
Try another variables: Present the relationship between one measuremeant and other measurments for each age
Put codes into a function
xyplots_dta2 <- function(var_name, var_label) {
library(lattice)
# Turn data set into long form
dta2_var <- dta2_std %>% select(-var_name, -id, -age) %>% stack
dta2_var$var_name <- rep(dta2_std[,var_name], ncol(dta2_std)-3)
dta2_var$age <- rep(dta2_std$age, ncol(dta2_std)-3)
dta2_var$id <- rep(dta2_std$id, ncol(dta2_std)-3)
# Plot
xyplot(var_name ~ values | ind, groups = paste0('Age ', age), data = dta2_var,
auto.key = list(column=4), pch=16, cex=.4, type = c('p', 'r', 'g'),
xlab = 'Standardized measurement',
ylab = paste0('Standardized ', var_label))
}Head to butt length
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(var_name)` instead of `var_name` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
- In linear models of
Hb ~ Htof each age, both of intecepts and slopes are similar. - In linear models of
Hb ~ Hcof each age,slopes are similar while intecepts differ.
HW exercise 3.
Your manager gave you a sales data on sevral products in a SAS format. Your task is to summarize and report the data in tables and graphs using the R lattice package.
Source: Gupta, S. K. (2006). Data Management and Reporting Made Easy with SAS Learning Edition 2.0
- Recode the region variable (1 to 4) by “Nothern”, “Southern”, “Eastern” and “Western”;
- the district variable (1-5) by “North East”, “South East”, “South West”, “North West”, “Central West”;
- the quarter variable (1-4) by “1st”, “2nd”, “3rd”, “4th”;
- and the month variable (1-12) by “Jan”, “Feb”, etc. Set negative sales values to zero.
[Solution and Answer]
Load in the data
product category customer year month
Boots :24 Shoes :48 Acme :60 Min. :2001 Min. : 1.00
Shoes :24 Slippers:24 BigX : 6 1st Qu.:2001 1st Qu.: 3.75
Slippers:24 TwoFeet: 6 Median :2002 Median : 6.50
Mean :2002 Mean : 6.50
3rd Qu.:2002 3rd Qu.: 9.25
Max. :2002 Max. :12.00
quarter market sales expense region
Min. :1.00 Min. :1.000 Min. :-1400 Min. :-980 Min. :1.000
1st Qu.:1.75 1st Qu.:1.000 1st Qu.: 1000 1st Qu.: 660 1st Qu.:1.000
Median :2.50 Median :2.000 Median : 1550 Median :1065 Median :1.000
Mean :2.50 Mean :1.667 Mean : 1686 Mean :1172 Mean :1.333
3rd Qu.:3.25 3rd Qu.:2.000 3rd Qu.: 2525 3rd Qu.:1860 3rd Qu.:1.000
Max. :4.00 Max. :2.000 Max. : 4700 Max. :2960 Max. :4.000
district return constantv quantity
Min. :1.0 Min. :0.0000 Min. :1 Min. : 0.0
1st Qu.:1.0 1st Qu.:0.0000 1st Qu.:1 1st Qu.:135.2
Median :1.0 Median :0.0000 Median :1 Median :220.0
Mean :1.5 Mean :0.6667 Mean :1 Mean :248.9
3rd Qu.:1.0 3rd Qu.:0.0000 3rd Qu.:1 3rd Qu.:287.8
Max. :5.0 Max. :5.0000 Max. :1 Max. :940.0
See the distribution of scales for each region.
There is no sale in Eastern region (region=3).
Compare sales of each quarter
Generally, the sale increase as the quarter goes by.
Compare sales of each market
There is a larger sale in market 2 than makret 1. And the variation of sales of market 1 is smaller than that of market 2.
Compare sales of each product
Slippers have a better sale.
Compare sales of each product in 2 markets
The sale of slippers is not that good in market 1.
Compare sales change of months for each product
dotplot(quantity ~ month, groups = product, data = dta3,
xlab="Month", ylab="Quantity",
type=c('p', 'g',"r"), auto.key=list(space="top", columns=3))Compare sales of each customer in 2 markets
We have only one customer, Acme, in market 1.
Compare distribution of sales of each product in four quaters.
densityplot(~ quantity | factor(quarter), groups = product, data = dta3,
xlab = "Quarter", auto.key = list(column=3), layout = c(1, 4))Finding
- We should develop markets in Eastern region
- Sale of TwoFeet and BigX are lower than Acme. We should put more efforts on attracting these two customers, especially in market 1.
- Keep a good relationship with our biggest customer, Acme.
- Put more resources on marketing boots and shoes.
- Try to develop products for spring to increase sales in month
HW exercise 4.
Use the Lattice package to graphically explore the age and gender effects on reaction time reported in the Bassin data example.
Each year the U.S. Naval Postgraduate School sets aside a “Discovery Day” during which the general public is invited into their laboratories. The data come from October 21 1995, when visitors could test their reaction times and hand-eye coordination in the Human Systems Integration Laboratory. The variable of interest, “anticipatory timing”, was measured by a Bassin timer, which measures the ability to estimate the speed of a moving light and its arrival at a designated point. The Timer consists of a 10 foot row of lights which is controlled by a variable speed potentiometer. The lights are switched on sequentially from one end to the other so that light ‘travels’ at 5 miles per hour down the Timer. Each visitor was instructed to anticipate the ‘arrival’ of the light at one end of the Timer and at that time to swing a plastic bat across a light beam at the same end of the Timer. An automatic timing device measured the difference between the breaking of the beam and the actual arrival of the light. A negative value of a trial variable indicates the bat broke the beam before the light actually arrived. Each of 113 visitors completed the trial five times. Age and gender were also recorded. Visitors tended to come in family groups, but that information was not recorded. It may be that subject #35, who is a two year old with much slower reaction times, should be deleted.
Source: OzData
- Column 1: Gender ID
- Column 2: Age (year)
- Column 3: Response time Trial 1
- Column 3: Response time Trial 2
- Column 3: Response time Trial 3
- Column 3: Response time Trial 4
- Column 3: Response time Trial 5
[Solution and Answer]
Load in the data set
'data.frame': 113 obs. of 7 variables:
$ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
$ Age : int 31 30 30 27 30 28 34 28 28 33 ...
$ Trial1: num 0.051 0.074 0.051 0.182 0.077 0.103 -0.066 0.204 -0.231 -0.052 ...
$ Trial2: num 0.023 0.006 0.094 0.166 0.001 0.065 0.031 -0.106 -0.124 -0.011 ...
$ Trial3: num 0.106 0.003 0.084 -0.073 0 0.063 0.036 -0.09 -0.065 -0.025 ...
$ Trial4: num 0.076 0.02 0.176 -0.044 -0.027 0.059 0.11 -0.04 -0.19 -0.014 ...
$ Trial5: num 0.013 0.022 0.103 0.029 -0.2 0.059 0.045 -0.03 -0.211 -0.059 ...
Create new variables
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.6360 -0.0880 0.2390 0.2676 0.5110 5.7950
Age effect
Scatter plot
xyplot(RT_av ~ Age, data = OzData, type = c('p', 'r', 'g'),
xlab= 'Age', ylab = 'Average reaction time')Group age to draw boxplot
OzData$Age_group <- cut(OzData$Age,
breaks=c(0, median(OzData$Age), 100),
labels=c('Young', 'Old'))
bwplot(RT_av ~ Age_group, data = OzData,
xlab= 'Age group', ylab = 'Average reaction time')The variation of average reaction time of younger participants is larger than that of older participants.
Gender effect
The variation of average reaction time of female is larger than that of male.
ANOVA
OzData_long <- OzData %>% select(contains('Trial')) %>% stack()
OzData_long$ID <- rep(rownames(OzData), each=5)
OzData_long$Sex <- rep(OzData$Sex, each=5)
OzData_long$Age <- rep(OzData$AGE, each=5)
OzData_long$Age_group <- rep(OzData$Age_group, each=5)
model_OzData <- aov(values ~ (Sex*Age_group) + Error(ID / (Sex*Age_group) + ind), data = OzData_long)Warning in aov(values ~ (Sex * Age_group) + Error(ID/(Sex * Age_group) + :
Error() model is singular
Error: ID
Df Sum Sq Mean Sq F value Pr(>F)
Sex 1 0.104 0.10428 3.176 0.0775 .
Age_group 1 0.010 0.01014 0.309 0.5795
Sex:Age_group 1 0.017 0.01734 0.528 0.4689
Residuals 109 3.579 0.03284
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Error: ind
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 4 0.03211 0.008026
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 448 17.25 0.03849
Neither gender effect nor age effect is significant. The interaction effect is not significant, either.