DataM: Homework Exercise 0413

HW exercise 1.

Use trellis graphics to explore various ways to display the sample data from the National Longitudinal Survey of Youth.

The data are drawn from the National Longitudinal Survey of Youth (NLSY). The sample observations are from the 1986, 1988, 1990, and 1992 assessment periods. Children were selected to be in kindergarten, first, and second grade and to be of age 5, 6, or 7 at the first assessment (1986). Both reading and mathematical achievement scores are recorded. The former is a recognition subscore of the Peabody Individual Achievement Test (PIAT). This was scaled as the percentage of 84 items that were answered correctly. The same 84 items were administered at all four time points, providing a consistent scale over time. The data set is a subsample of 166 subjects with complete observations.

Source: Bollen, K.A. & Curran, P.J. (2006). Latent curve models. A structural equation perspective. p.59.

Column 1: Student ID
Column 2: Gender, male or female
Column 3: Race, minority or majority
Column 4: Measurement occasions
Column 5: Grade at which measurements were made, Kindergarten = 0, First grade = 1, Second grade = 2
Column 6: Age in years
Column 7: Age in months
Column 8: Math score
Column 9: Reading score

library(dplyr)
library(lattice)

Load the `lattice` package and the data set

NLSY <- read.table('../data/NLSY.txt', header = TRUE, sep = ',')
head(NLSY)

str(NLSY)

'data.frame':   664 obs. of  9 variables:
 $ id   : int  2390 2560 3740 4020 6350 7030 7200 7610 7680 7700 ...
 $ sex  : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 2 1 2 ...
 $ race : Factor w/ 2 levels "Majority","Minority": 1 1 1 1 1 1 1 1 1 1 ...
 $ time : int  1 1 1 1 1 1 1 1 1 1 ...
 $ grade: int  0 0 0 0 1 0 0 0 0 0 ...
 $ year : int  6 6 6 5 7 5 6 7 6 6 ...
 $ month: int  67 66 67 60 78 62 66 79 76 67 ...
 $ math : num  14.29 20.24 17.86 7.14 29.76 ...
 $ read : num  19.05 21.43 21.43 7.14 30.95 ...

Gender difference

Draw density plots of the math score and the reading score, grouped by gender.

lst_gender <- lapply(split(NLSY, NLSY$sex), function(df) {
  df_long <- df %>% select(math, read) %>% stack()
  df_long$gender <- as.character(df$sex)[1]
  return(df_long)
})
df_gender <- rbind(lst_gender[[1]], lst_gender[[2]])

densityplot(~ values | ind, groups = gender, data = df_gender,
            layout = c(1, 2), auto.key=list(columns=2), xlab='Score')
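
The reshaping step above can also be avoided, because lattice's extended formula interface accepts several variables on one side of the formula. The following is a minimal sketch on made-up data: the `toy` data frame and its values are assumptions for illustration, not the NLSY sample. `outer = TRUE` requests one panel per variable.

```r
library(lattice)

set.seed(413)
# Hypothetical stand-in for NLSY that reuses its column names
toy <- data.frame(
  sex  = factor(rep(c("Female", "Male"), each = 50)),
  math = rnorm(100, mean = 45, sd = 12),
  read = rnorm(100, mean = 50, sd = 14)
)

# `~ math + read` with outer = TRUE draws one panel per score,
# while `groups = sex` overlays the two gender densities in each panel
p <- densityplot(~ math + read, groups = sex, data = toy, outer = TRUE,
                 layout = c(1, 2), auto.key = list(columns = 2),
                 xlab = "Score")
print(p)
```

The panels are labelled with the variable names, so the result should look much like the stacked version while keeping the data in wide form.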

Draw boxplots of math score of each grade with grouping of gender.

bwplot(math ~ factor(grade) | sex, data = NLSY,
       xlab = 'Grade', ylab = 'Mathematics score')

Draw boxplots of reading score of each grade with grouping of gender.

bwplot(read ~ factor(grade) | sex, data = NLSY,
       xlab = 'Grade', ylab = 'Reading score')

Draw scatter plots of math score against reading score, with a regression line, for each grade, grouped by gender.

xyplot(math ~ read | factor(grade), groups = sex, data = NLSY,
       layout=c(4, 2), type=c('p', 'r', 'g'), auto.key=list(columns=2),
       xlab = 'Reading score', ylab = 'Mathematics score')

Race

Draw density plots of the math score and the reading score, grouped by race.

lst_race <- lapply(split(NLSY, NLSY$race), function(df) {
  df_long <- df %>% select(math, read) %>% stack()
  df_long$race <- as.character(df$race)[1]
  return(df_long)
})
df_race <- rbind(lst_race[[1]], lst_race[[2]])

densityplot(~ values | ind, groups = race, data = df_race,
            layout = c(1, 2), auto.key=list(columns=2), xlab='Score')

Draw boxplots of math score of each grade with grouping of race.

bwplot(math ~ factor(grade) | race, data = NLSY,
       xlab = 'Grade', ylab = 'Mathematics score')

Draw boxplots of reading score of each grade with grouping of race.

bwplot(read ~ factor(grade) | race, data = NLSY,
       xlab = 'Grade', ylab = 'Reading score')

Draw scatter plots of math score against reading score, with a regression line, for each grade, grouped by race.

xyplot(math ~ read | factor(grade), groups = race, data = NLSY,
       layout=c(4, 2), type=c('p', 'r', 'g'), auto.key=list(columns=2),
       xlab = 'Reading score', ylab = 'Mathematics score')

Repeated factorial ANOVA

Y: Math score

model_math <- aov(math ~ (sex * race * grade) + Error(id / (sex * race * grade) + time), data = NLSY)
summary(model_math)


Error: id
    Df Sum Sq Mean Sq
sex  1   1651    1651

Error: time
      Df Sum Sq Mean Sq
grade  1 170290  170290

Error: id:sex
    Df Sum Sq Mean Sq
sex  1  127.2   127.2

Error: id:race
    Df Sum Sq Mean Sq
sex  1   1065    1065

Error: id:grade
    Df Sum Sq Mean Sq
sex  1   1228    1228

Error: id:sex:race
    Df Sum Sq Mean Sq
sex  1  112.8   112.8

Error: id:sex:grade
    Df Sum Sq Mean Sq
sex  1  357.7   357.7

Error: id:race:grade
    Df Sum Sq Mean Sq
sex  1  452.9   452.9

Error: id:sex:race:grade
    Df Sum Sq Mean Sq
sex  1  2.812   2.812

Error: Within
                Df Sum Sq Mean Sq F value  Pr(>F)    
sex              1    146     146   1.992 0.15866    
race             1    776     776  10.612 0.00118 ** 
grade            1   5375    5375  73.506 < 2e-16 ***
sex:race         1     98      98   1.340 0.24747    
sex:grade        1     15      15   0.199 0.65551    
race:grade       1      7       7   0.095 0.75807    
sex:race:grade   1     97      97   1.322 0.25057    
Residuals      647  47314      73                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the Within stratum, both the race effect and the grade effect on math score are significant. There is no significant gender difference in math score, and none of the interaction terms is significant.

Y: Reading score

model_read <- aov(read ~ (sex * race * grade) + Error(id / (sex * race * grade) + time), data = NLSY)
summary(model_read)


Error: id
    Df Sum Sq Mean Sq
sex  1  437.5   437.5

Error: time
      Df Sum Sq Mean Sq
grade  1 196976  196976

Error: id:sex
    Df Sum Sq Mean Sq
sex  1  122.1   122.1

Error: id:race
    Df Sum Sq Mean Sq
sex  1  621.2   621.2

Error: id:grade
    Df Sum Sq Mean Sq
sex  1   1114    1114

Error: id:sex:race
    Df Sum Sq Mean Sq
sex  1  188.9   188.9

Error: id:sex:grade
    Df Sum Sq Mean Sq
sex  1  6.778   6.778

Error: id:race:grade
    Df Sum Sq Mean Sq
sex  1   2038    2038

Error: id:sex:race:grade
    Df Sum Sq Mean Sq
sex  1  6.317   6.317

Error: Within
                Df Sum Sq Mean Sq F value   Pr(>F)    
sex              1     12      12   0.090  0.76397    
race             1   1103    1103   8.580  0.00352 ** 
grade            1   4188    4188  32.591 1.73e-08 ***
sex:race         1    487     487   3.787  0.05209 .  
sex:grade        1     77      77   0.600  0.43873    
race:grade       1    120     120   0.936  0.33360    
sex:race:grade   1    103     103   0.803  0.37065    
Residuals      647  83150     129                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the Within stratum, both the race effect and the grade effect on reading score are significant. There is no significant gender difference in reading score, and none of the interaction terms is significant at the 0.05 level (sex:race is marginal, p = 0.052).
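
In the `str()` output, `id` is stored as an integer, so `Error(id / ...)` treats it as a numeric term. That may be why each `id` stratum in the output has a single degree of freedom. The sketch below shows, on made-up data, how a subject identifier can be converted to a factor before it is used in `Error()`. The `toy` data frame, the `id_f` column, and the simplified model formula are assumptions for illustration, not the model fitted in this exercise.

```r
set.seed(59)
# Hypothetical toy data: 20 subjects, each measured on 4 occasions
toy <- expand.grid(id = 1:20, time = 1:4)
toy$sex   <- factor(ifelse(toy$id %% 2 == 0, "Female", "Male"))
toy$score <- 10 + 5 * toy$time + rnorm(nrow(toy), sd = 3)

# Convert the integer id to a factor so that Error() defines a
# between-subject stratum rather than a one-df numeric term
toy$id_f <- factor(toy$id)

fit <- aov(score ~ sex * time + Error(id_f), data = toy)
summary(fit)
```

With a factor identifier, the between-subject effect (`sex`) is tested in the subject stratum and the within-subject terms in the `Within` stratum.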

HW exercise 2.

Eight different physical measurements of 30 French girls were recorded from 4 to 15 years old. Explore various ways to display the data using trellis graphics.

Source: Sempe, M., et al. (1987). Multivariate and longitudinal data on growing children: Presentation of the French auxiological survey. In J. Janssen, et al. (1987). Data analysis. The Ins and Outs of solving real problems (pp. 3-6). New York: Plenum Press.

Column 1: Weight in grams
Column 2: Height in mms
Column 3: Head to butt length in mms
Column 4: Head circumference in mms
Column 5: Chest circumference in mms
Column 6: Arm length in mms
Column 7: Calf length in mms
Column 8: Pelvis circumference in mms
Column 9: Age in years
Column 10: Girl ID

[Solution and Answer]

Load in the data set and check the data structure

dta2 <- read.table('../data/data_hw0413_trellis_2.txt', header = TRUE)
head(dta2)

str(dta2)

'data.frame':   360 obs. of  10 variables:
 $ Wt    : int  1456 1426 1335 1607 1684 1374 1570 1450 1214 1456 ...
 $ Ht    : int  1025 998 961 1006 1012 1012 1040 990 968 983 ...
 $ Hb    : int  602 572 560 595 584 580 586 561 571 563 ...
 $ Hc    : int  486 501 494 497 490 492 511 488 481 485 ...
 $ Cc    : int  520 520 495 560 553 525 540 520 476 532 ...
 $ Arm   : int  157 150 145 178 165 158 153 159 145 158 ...
 $ Calf  : int  205 215 214 218 220 202 220 210 198 219 ...
 $ Pelvis: int  170 169 158 172 158 167 180 158 150 154 ...
 $ age   : int  4 4 4 4 4 4 4 4 4 4 ...
 $ id    : Factor w/ 30 levels "S1","S10","S11",..: 1 12 23 25 26 27 28 29 30 2 ...
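
One possible trellis display for data of this shape is a growth curve per girl. The sketch below uses made-up data rather than `dta2`: the `toy` data frame and the simulated heights are assumptions, and only the column names `Ht`, `age`, and `id` follow the structure shown above.

```r
library(lattice)

set.seed(1987)
# Hypothetical data: 6 girls measured yearly from age 4 to 15 (heights in mm)
toy <- expand.grid(age = 4:15, id = paste0("S", 1:6))
toy$Ht <- 1000 + 55 * (toy$age - 4) + rnorm(nrow(toy), sd = 15)

# One panel per girl: height against age with points, a line, and a grid
p <- xyplot(Ht ~ age | id, data = toy, type = c("p", "l", "g"),
            xlab = "Age (years)", ylab = "Height (mm)")
print(p)
```

The same call with another measurement on the left-hand side (for example `Wt ~ age | id`) would give the corresponding display for that variable.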

HW exercise 3.

Your manager gave you sales data on several products in SAS format. Your task is to summarize and report the data in tables and graphs using the R lattice package.

Source: Gupta, S. K. (2006). Data Management and Reporting Made Easy with SAS Learning Edition 2.0

Recode the region variable (1 to 4) by “Northern”, “Southern”, “Eastern” and “Western”;
the district variable (1 - 5) by “North East”, “South East”, “South West”, “North West”, “Central West”;
the quarter variable (1-4) by “1st”, “2nd”, “3rd”, “4th”;
and the month variable (1-12) by “Jan”, “Feb”, etc. Set negative sales values to zero.
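
The recoding described above can be sketched with `factor()` and `pmax()`. This is a minimal illustration on a made-up data frame (`toy` and its values are assumptions), not the solution applied to `sales.sas7bdat`.

```r
# Hypothetical rows using the numeric coding described above
toy <- data.frame(region = c(1, 2, 4), district = c(1, 5, 3),
                  quarter = c(1, 2, 4), month = c(1, 6, 12),
                  sales = c(1200, -1400, 300))

# Map each numeric code to its label; levels fixes the code order
toy$region <- factor(toy$region, levels = 1:4,
                     labels = c("Northern", "Southern", "Eastern", "Western"))
toy$district <- factor(toy$district, levels = 1:5,
                       labels = c("North East", "South East", "South West",
                                  "North West", "Central West"))
toy$quarter <- factor(toy$quarter, levels = 1:4,
                      labels = c("1st", "2nd", "3rd", "4th"))
toy$month <- factor(toy$month, levels = 1:12, labels = month.abb)

# Negative sales become zero; non-negative values are unchanged
toy$sales <- pmax(toy$sales, 0)
toy
```

Giving `levels` explicitly keeps labels such as "Eastern" as factor levels even when no row uses them, so empty categories remain visible in tables and plots.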

[Solution and Answer]

Load in the data

pacman::p_load('sas7bdat')
dta3 <- read.sas7bdat('../data/sales.sas7bdat')
head(dta3)

summary(dta3)

     product       category     customer       year          month      
 Boots   :24   Shoes   :48   Acme   :60   Min.   :2001   Min.   : 1.00  
 Shoes   :24   Slippers:24   BigX   : 6   1st Qu.:2001   1st Qu.: 3.75  
 Slippers:24                 TwoFeet: 6   Median :2002   Median : 6.50  
                                          Mean   :2002   Mean   : 6.50  
                                          3rd Qu.:2002   3rd Qu.: 9.25  
                                          Max.   :2002   Max.   :12.00  
    quarter         market          sales          expense    
 Min.   :1.00   Min.   :1.000   Min.   :-1400   Min.   :-980  
 1st Qu.:1.75   1st Qu.:1.000   1st Qu.: 1000   1st Qu.: 660  
 Median :2.50   Median :2.000   Median : 1550   Median :1065  
 Mean   :2.50   Mean   :1.667   Mean   : 1686   Mean   :1172  
 3rd Qu.:3.25   3rd Qu.:2.000   3rd Qu.: 2525   3rd Qu.:1860  
 Max.   :4.00   Max.   :2.000   Max.   : 4700   Max.   :2960  
     region         district       return         constantv
 Min.   :1.000   Min.   :1.0   Min.   :0.0000   Min.   :1  
 1st Qu.:1.000   1st Qu.:1.0   1st Qu.:0.0000   1st Qu.:1  
 Median :1.000   Median :1.0   Median :0.0000   Median :1  
 Mean   :1.333   Mean   :1.5   Mean   :0.6667   Mean   :1  
 3rd Qu.:1.000   3rd Qu.:1.0   3rd Qu.:0.0000   3rd Qu.:1  
 Max.   :4.000   Max.   :5.0   Max.   :5.0000   Max.   :1  
    quantity    
 Min.   :  0.0  
 1st Qu.:135.2  
 Median :220.0  
 Mean   :248.9  
 3rd Qu.:287.8  
 Max.   :940.0
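
Since the task also asks for tables, a cross-tabulation of quantity by product and quarter is one option. The sketch below uses a small made-up data frame (`toy` and its values are assumptions) with column names borrowed from the summary above.

```r
# Hypothetical sales records
toy <- data.frame(product  = c("Boots", "Boots", "Shoes", "Slippers", "Slippers"),
                  quarter  = c(1, 2, 1, 1, 2),
                  quantity = c(100, 150, 200, 50, 300))

# Sum quantity within each product-by-quarter cell;
# combinations with no rows get a zero
tab <- xtabs(quantity ~ product + quarter, data = toy)
tab
```

Functions such as `addmargins(tab)` can then add row and column totals to the table.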

See the distribution of sales quantity for each region.

densityplot(~ quantity | as.character(region), data = dta3, layout=c(1, 3))

There are no sales in the Eastern region (region = 3).

Compare sales of each quarter

bwplot(quantity ~ factor(quarter), data = dta3, xlab="Quarter")

In general, sales increase from one quarter to the next.

Compare sales of each market

bwplot(quantity ~ factor(market), data = dta3, xlab="Market")

Sales are larger in market 2 than in market 1, and the variation in sales is smaller in market 1 than in market 2.

Compare sales of each product

bwplot(quantity ~ factor(product), data = dta3, xlab="Product")

Slippers sell better than boots and shoes.

Compare sales of each product in 2 markets

bwplot(quantity ~ factor(product) | factor(market), data = dta3, xlab="Product")

Slippers do not sell as well in market 1.

Compare sales change of months for each product

dotplot(quantity ~ month, groups = product, data = dta3, 
        xlab="Month", ylab="Quantity", 
        type=c('p', 'g',"r"), auto.key=list(space="top", columns=3))

Compare sales of each customer in 2 markets

bwplot(quantity ~ factor(customer) | factor(market), data = dta3, xlab="Customer")

We have only one customer, Acme, in market 1.

Compare the distribution of sales of each product across the four quarters.

densityplot(~ quantity | factor(quarter), groups = product, data = dta3, 
            xlab = "Quantity", auto.key = list(columns=3), layout = c(1, 4))

Findings

We should develop markets in the Eastern region.
Sales to TwoFeet and BigX are lower than sales to Acme. We should put more effort into attracting these two customers, especially in market 1.
Keep a good relationship with our biggest customer, Acme.
Put more resources into marketing boots and shoes.
Try to develop products for spring to increase sales in month

HW exercise 4.

Use the Lattice package to graphically explore the age and gender effects on reaction time reported in the Bassin data example.

Each year the U.S. Naval Postgraduate School sets aside a “Discovery Day” during which the general public is invited into their laboratories. The data come from October 21 1995, when visitors could test their reaction times and hand-eye coordination in the Human Systems Integration Laboratory. The variable of interest, “anticipatory timing”, was measured by a Bassin timer, which measures the ability to estimate the speed of a moving light and its arrival at a designated point. The Timer consists of a 10 foot row of lights which is controlled by a variable speed potentiometer. The lights are switched on sequentially from one end to the other so that light ‘travels’ at 5 miles per hour down the Timer. Each visitor was instructed to anticipate the ‘arrival’ of the light at one end of the Timer and at that time to swing a plastic bat across a light beam at the same end of the Timer. An automatic timing device measured the difference between the breaking of the beam and the actual arrival of the light. A negative value of a trial variable indicates the bat broke the beam before the light actually arrived. Each of 113 visitors completed the trial five times. Age and gender were also recorded. Visitors tended to come in family groups, but that information was not recorded. It may be that subject #35, who is a two year old with much slower reaction times, should be deleted.

Source: OzData

Column 1: Gender
Column 2: Age (year)
Column 3: Response time Trial 1
Column 4: Response time Trial 2
Column 5: Response time Trial 3
Column 6: Response time Trial 4
Column 7: Response time Trial 5

[Solution and Answer]

Load in the data set

OzData <- read.table('../data/OzData.txt', header = TRUE)
head(OzData)

str(OzData)

'data.frame':   113 obs. of  7 variables:
 $ Sex   : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age   : int  31 30 30 27 30 28 34 28 28 33 ...
 $ Trial1: num  0.051 0.074 0.051 0.182 0.077 0.103 -0.066 0.204 -0.231 -0.052 ...
 $ Trial2: num  0.023 0.006 0.094 0.166 0.001 0.065 0.031 -0.106 -0.124 -0.011 ...
 $ Trial3: num  0.106 0.003 0.084 -0.073 0 0.063 0.036 -0.09 -0.065 -0.025 ...
 $ Trial4: num  0.076 0.02 0.176 -0.044 -0.027 0.059 0.11 -0.04 -0.19 -0.014 ...
 $ Trial5: num  0.013 0.022 0.103 0.029 -0.2 0.059 0.045 -0.03 -0.211 -0.059 ...

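Create a new variable. `RT_av` is computed with `rowSums()`, so it is the total over the five trials (five times the per-subject mean); `rowMeans()` would give the average itself.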

OzData$RT_av <- OzData %>% select(contains('Trial')) %>% rowSums()
summary(OzData$RT_av)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.6360 -0.0880  0.2390  0.2676  0.5110  5.7950

Age effect

Scatter plot

xyplot(RT_av ~ Age, data = OzData, type = c('p', 'r', 'g'),
       xlab= 'Age', ylab = 'Average reaction time')

Group age to draw boxplot

OzData$Age_group <- cut(OzData$Age, 
                        breaks=c(0, median(OzData$Age), 100), 
                        labels=c('Young', 'Old'))

bwplot(RT_av ~ Age_group, data = OzData,
       xlab= 'Age group', ylab = 'Average reaction time')

The variation of average reaction time of younger participants is larger than that of older participants.

Gender effect

bwplot(RT_av ~ Sex, data = OzData, xlab= 'Gender', ylab = 'Average reaction time')

The variation of average reaction time is larger among females than among males.

ANOVA

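# Reshape the five trial columns into long format (values, ind)
OzData_long <- OzData %>% select(contains('Trial')) %>% stack()
# Note: stack() appends whole columns (all Trial1 rows, then Trial2, ...),
# so rep(..., times = 5) is what lines rows up with subjects;
# each = 5, used for the output below, does not.
OzData_long$ID <- rep(rownames(OzData), each=5)
OzData_long$Sex <- rep(OzData$Sex, each=5)
OzData_long$Age <- rep(OzData$Age, each=5)  # the column is `Age`, not `AGE`
OzData_long$Age_group <- rep(OzData$Age_group, each=5)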
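
When subject-level columns are attached to stacked data, the row order produced by `stack()` matters. The sketch below, on a made-up three-subject data frame (`wide` and its values are assumptions), shows that `stack()` appends whole columns, so identifiers need to be repeated with `times` rather than `each` to stay aligned with the values.

```r
# Hypothetical wide data: 3 subjects, 2 trials
wide <- data.frame(Sex = c("F", "M", "M"),
                   Trial1 = c(0.1, 0.2, 0.3),
                   Trial2 = c(1.1, 1.2, 1.3))

long <- stack(wide[c("Trial1", "Trial2")])

# stack() returns all of Trial1 first, then all of Trial2,
# so the subject sequence 1, 2, 3 repeats once per trial
long$ID  <- rep(rownames(wide), times = 2)
long$Sex <- rep(wide$Sex, times = 2)
long
```

Here subject 2 ends up with the values 0.2 and 1.2, which are exactly its two trial scores in `wide`.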

model_OzData <- aov(values ~ (Sex*Age_group) + Error(ID / (Sex*Age_group) + ind), data = OzData_long)

Warning in aov(values ~ (Sex * Age_group) + Error(ID/(Sex * Age_group) + :
Error() model is singular

summary(model_OzData)


Error: ID
               Df Sum Sq Mean Sq F value Pr(>F)  
Sex             1  0.104 0.10428   3.176 0.0775 .
Age_group       1  0.010 0.01014   0.309 0.5795  
Sex:Age_group   1  0.017 0.01734   0.528 0.4689  
Residuals     109  3.579 0.03284                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: ind
          Df  Sum Sq  Mean Sq F value Pr(>F)
Residuals  4 0.03211 0.008026               

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)
Residuals 448  17.25 0.03849

Neither the gender effect nor the age effect is significant, and their interaction is not significant either.
