Stat 422: Homework 3: Wes McClintick

5.6 A school desires to estimate the average score that may be obtained on a reading comprehesnion exam for students in the sixth grade. The school's students are grouped into three tracks, with the fast learners in track I, the slow learners in track III, and the rest in track II. The school decides to stratify on tracks because this method should reduce the variability of test scores. The sixth grade contains 55 students in track I, 80 in track II, and 65 in track III. A stratified random sample of 50 students is proportionally allocated and yields simple random samples of n1 = 14, n2 = 20, and n3=16 from tracks I, II, and III. The test is administered to the sample of students; the results are available via a link from electronic section 5.0.

a. Estimate the average score for the sixth grade, and place a bound on the error of estimation.


source("H:\\STAT422\\MyBox.txt")
rm(list = ls(all = TRUE))

dat = read.csv("H:\\STAT422\\exercise5_6.csv")
StratMean = function(wts, means) {
    return(sum(wts * means))
}

n1 = length(dat$Track.I[!is.na(dat$Track.I)])
n2 = length(dat$TrackII[!is.na(dat$TrackII)])
n3 = length(dat$TrackIII[!is.na(dat$TrackIII)])
ntot = n1 + n2 + n3

mu1 = mean(dat$Track.I[!is.na(dat$Track.I)])
mu2 = mean(dat$TrackII[!is.na(dat$TrackII)])
mu3 = mean(dat$TrackIII[!is.na(dat$TrackIII)])

si1 = var(dat$Track.I[!is.na(dat$Track.I)])
si2 = var(dat$TrackII[!is.na(dat$TrackII)])
si3 = var(dat$TrackIII[!is.na(dat$TrackIII)])


# our point estimate of the average score for 6th graders.
s.mean = StratMean(wts = c(n1/ntot, n2/ntot, n3/ntot), means = c(mu1, 
    mu2, mu3))

VStratMean = function(Ni, ni, si) {
    N = sum(Ni)
    Ni2 = Ni^2
    sampfrac = ni/Ni
    sampadj = 1 - sampfrac
    sin = si/ni
    return((1/N^2) * sum(Ni2 * sampadj * sin))
}

vstratmean = VStratMean(Ni = c(55, 80, 65), ni = c(n1, n2, n3), si = c(si1, 
    si2, si3))

they = c(dat$Track.I[!is.na(dat$Track.I)], dat$TrackII[!is.na(dat$TrackII)], 
    dat$TrackIII[!is.na(dat$TrackIII)])
fact = as.factor(c(rep(1, n1), rep(2, n2), rep(3, n3)))

paste("A. The point estimate of the stratified mean is", s.mean)

## [1] "A. The point estimate of the stratified mean is 60.2"

paste("A. The bound for the stratified mean is", 2 * sqrt(vstratmean))

## [1] "A. The bound for the stratified mean is 3.03239535457891"

paste("A. compare to nonstratified mean and variance", mean(they), 
    var(they))

## [1] "A. compare to nonstratified mean and variance 60.2 432.65306122449"

b. construct parallel boxplots for these data and comment on teh patterns you see. Do you think there could be a problem in placing students in tracks?

dev.off

## function (which = dev.cur()) 
## {
##     if (which == 1) 
##         stop("cannot shut down device 1 (the null device)")
##     .Internal(dev.off(as.integer(which)))
##     dev.cur()
## }
## <bytecode: 0x00000000050235e8>
## <environment: namespace:grDevices>

MyBox = function(y, x, ...) {
    boxplot(y ~ x, main = "Y~X", col = "wheat", ylim = c(min(y), max(y)))
    a = tapply(y, x, mean)
    points(as.numeric(a), pch = 17)
    for (i in 1:(length(levels(x)))) {
        points(jitter(rep(i, length(y[x == levels(x)[i]]))), y[x == i])
    }
}
they = c(dat$Track.I[!is.na(dat$Track.I)], dat$TrackII[!is.na(dat$TrackII)], 
    dat$TrackIII[!is.na(dat$TrackIII)])

fact = as.factor(c(rep(1, n1), rep(2, n2), rep(3, n3)))

MyBox(they, fact)

plot of chunk unnamed-chunk-2

B. Each strata is supposed to be a non-overlapping segment of the population total. These scores show considerable overlap. So the student with a score of 65 could belong to either Tier I, II, or III!.

c. Estimate the difference in average scores between track I and track II students. Are track I students significantly better, on the average, than track II students?

difference = abs(mu2 - mu1)
paste("Since the samples are independent I can add variances..")

## [1] "Since the samples are independent I can add variances.."

paste("I'm guessing they have some roundoff error in their answer for **bound** since they often do.")

## [1] "I'm guessing they have some roundoff error in their answer for **bound** since they often do."


VarOMean = si1/(n1 - 1) + si2/(n2 - 1)
Bound = 2 * sqrt(VarOMean)

paste("The estimated difference is", difference, "with a bound of", 
    Bound)

## [1] "The estimated difference is 14.9642857142857 with a bound of 8.10286106834633"


# in fact an anova might be interesting here...
library(car)

## Loading required package: MASS

## Loading required package: nnet

Anova(lm(they ~ fact))

## Anova Table (Type II tests)
## 
## Response: they
##           Sum Sq Df F value  Pr(>F)    
## fact       14035  2      46 8.5e-12 ***
## Residuals   7165 47                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

paste("Guess there is a difference here")

## [1] "Guess there is a difference here"

TukeyHSD(aov(they ~ fact))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = they ~ fact)
## 
## $fact
##       diff    lwr     upr  p adj
## 2-1 -14.96 -25.38  -4.552 0.0031
## 3-1 -42.28 -53.21 -31.342 0.0000
## 3-2 -27.31 -37.33 -17.290 0.0000
##

paste("But that difference is exaggerated in the Tukey pairwise comparison")

## [1] "But that difference is exaggerated in the Tukey pairwise comparison"

paste("Maybe because unbalanced design but function relies on Type I SS?")

## [1] "Maybe because unbalanced design but function relies on Type I SS?"

So far I is not significantly different than II by the book method. The questionable TukeyHSD does see a difference though, interestingly enough. Prefer the book estimate because

it uses more information: the population totals of each strata.
the Tukey relies on type I SS but this is an unbalanced design.

5.8 Using the data in Exercise 5.6, find the sample size required to estimate the average score, with a bound of four points on the error of estimation. Use proportional allocation.

## for the mean###for a given set of var's, ai's and a B.
THEn = function(varit, B, ai, Ni) {
    D = B^2/4
    N = sum(Ni)
    sum(((Ni^2) * varit)/(ai))/((N^2) * D + sum(Ni * varit))
}

samplesize = THEn(B = 4, ai = c(1/3, 1/3, 1/3), varit = c(si1, si2, 
    si3), Ni = c(55, 80, 65))

paste("So roughly", ceiling(samplesize)/3, "per strata")

## [1] "So roughly 11.3333333333333 per strata"

paste("(Which we have rounded up in class)")

## [1] "(Which we have rounded up in class)"

5.9 Repeat 5.8 using Neyman allocation. Compare the results with the answer to Exercise 5.8.

“Neyman allocation” is appropriate here since costs are unknown. “If costs are unknown, we may be willing to assume the costs per observation are equal” across strata.

Neymanpi = function(Ni, si) {
    Ni * si/sum(Ni * si)
}
thepi = Neymanpi(Ni = c(55, 80, 65), c(sqrt(si1), sqrt(si2), sqrt(si3)))

paste("The new allocation of the sample is", samplesize * thepi)

## [1] "The new allocation of the sample is 7.65065555203759"
## [2] "The new allocation of the sample is 13.6500954215016"
## [3] "The new allocation of the sample is 12.0300282182621"

paste("Which you could round accordingly...")

## [1] "Which you could round accordingly..."

paste("The Neyman took account the different sizes of the strata and the different variability of the strata.")

## [1] "The Neyman took account the different sizes of the strata and the different variability of the strata."

5.10 A forester wants to estimate the total number of farm acres planted with trees for a state. Because the number of acres of trees varies considerably with the size of the farm, he decides to stratify on farm sizes. The 240 farms in the state are placed in one of four categories according to size. A stratified random sample of 40 farms, selected by using proportional allocation, yields the results shown in the accompanying table on number of acres planted in trees. Estimate the total number of acres of trees on farms in the state, and place a bound on the error of estimation. Graph the data on an appropriate plot and comment on the variation as we move from I to IV.

dat2 = read.csv("H:\\STAT422\\farmacres.csv")
N1 = 86
N2 = 72
N3 = 52
N4 = 30
Ni = c(N1, N2, N3, N4)
N = sum(Ni)

n1 = length(dat2$acres[dat2$stratum == "I"])
n2 = length(dat2$acres[dat2$stratum == "II"])
n3 = length(dat2$acres[dat2$stratum == "III"])
n4 = length(dat2$acres[dat2$stratum == "IV"])
ni = c(n1, n2, n3, n4)
ntot = sum(ni)

mu1 = mean(dat2$acres[dat2$stratum == "I"])
mu2 = mean(dat2$acres[dat2$stratum == "II"])
mu3 = mean(dat2$acres[dat2$stratum == "III"])
mu4 = mean(dat2$acres[dat2$stratum == "IV"])

si1 = var(dat2$acres[dat2$stratum == "I"])
si2 = var(dat2$acres[dat2$stratum == "II"])
si3 = var(dat2$acres[dat2$stratum == "III"])
si4 = var(dat2$acres[dat2$stratum == "IV"])

MapFunc = function(x) {
    y = NULL
    if (x %in% c("I")) {
        y = 1
    }
    if (x %in% c("II")) {
        y = 2
    }
    if (x %in% c("III")) {
        y = 3
    }
    if (x %in% c("IV")) {
        y = 4
    }

    return(y)
}
thefactor = as.factor(mapply(MapFunc, dat2$stratum))

MyBox(y = dat2$acres, x = thefactor)

plot of chunk unnamed-chunk-6

The boxplot shows the variance is increasing as we move from 1 to 4.
Proportional allocation: Check

ni/Ni

## [1] 0.1628 0.1667 0.1731 0.1667

Estimate the stratified mean:

stratmean = StratMean(wts = c(n1/ntot, n2/ntot, n3/ntot, n4/ntot), 
    means = c(mu1, mu2, mu3, mu4))
paste("Compare stratified mean to normal mean", stratmean, mean(dat2$acres))

## [1] "Compare stratified mean to normal mean 212.75 212.75"


vstratmean = VStratMean(Ni = Ni, ni = ni, si = c(si1, si2, si3, si4))

paste("compare the var(stratmean) to var(total)", vstratmean, var(dat2$acres))

## [1] "compare the var(stratmean) to var(total) 325.736648907903 34475.9871794872"


bound = 2 * sqrt(N^2 * vstratmean)
paste("The point estimate of the total is", N * stratmean)

## [1] "The point estimate of the total is 51060"

paste("with a bound of", bound)

## [1] "with a bound of 8663.12437336443"

5.21 A quality control inspector must estimate the proportion of defective microcomputer chips coming from two different assembly operations. She knows that, among the chips in the lot to be inspected, 60% are from assembly operation A and 40% are from assembly operation B. In a random sample of 100 chips, 38 turn out to be from operation A and 62 from operation B. Among the sampled chips from operation A, six are defective. Among the sampled chips from operation B, ten are defective.

a. Considering only the simple random sample of 100 chips, estimate the proportion of defectives in the lot, and place a bound on the error of estimation.

paste("Apparently they're asking to do the problem without stratification")

## [1] "Apparently they're asking to do the problem without stratification"

paste("The estimate of the proportion is", 16/100)

## [1] "The estimate of the proportion is 0.16"


p = 16/100
vp = (p * (1 - p)/(100 - 1))
paste("The sample variance of a proportion is", (p * (1 - p)/(100 - 
    1)))

## [1] "The sample variance of a proportion is 0.00135757575757576"

paste("The bound of the proportion is", 2 * sqrt(vp))

## [1] "The bound of the proportion is 0.0736905898354941"

b. Stratifying the sample, after selection, into chips from operation A and B, estimate the proportion of defectives in the population, and place a bound on the error of estimation. Ignore the fpc in both cases. Which answers do you find more acceptable?

phat1 = 6/38
phat1

## [1] 0.1579

phat2 = 10/62
phat2

## [1] 0.1613


phattotal = StratMean(wts = c(0.6, 0.4), means = c(phat1, phat2))
paste("The overall estimate of phat is", phattotal)

## [1] "The overall estimate of phat is 0.159252971137521"


var.phat1 = (phat1 * (1 - phat1))/(38 - 1)
var.phat2 = (phat2 * (1 - phat2))/(62 - 1)
V.tot = (0.6^2) * var.phat1 + (0.4^2) * var.phat2
Bound = 2 * sqrt(V.tot)
paste("The Bound for the estimate of phatstratified is", Bound)

## [1] "The Bound for the estimate of phatstratified is 0.0812040719895091"



paste("So which is better? With stratifying our CI is", phattotal - 
    Bound, phattotal + Bound)

## [1] "So which is better? With stratifying our CI is 0.0780488991480121 0.24045704312703"

paste("So which is better? Without stratifying our CI is", p - (2 * 
    sqrt(vp)), p + (2 * sqrt(vp)))

## [1] "So which is better? Without stratifying our CI is 0.0863094101645059 0.233690589835494"

I prefer the one with stratification because it uses more of the available information. In reality it would depend on how confident we are in our given weights (60% and 40%) as to whether we could use them or not.

5.33 You want tto take a sample of students in your school to estimate the average amount they spent on their last haircut. Which sampling method do you think would work best–a simple random sanple with two strata, male and female; or a stratified random sample with class levels as strata? Give your reasoning.

The answer depends on two factors as I see it that can be simply described as center and spread. Let's imagine two classes, rich and poor, to make the discussion easier.

What's the mean difference between rich and poor for dollars paid per haircut? What's the mean difference between men and women for dollars paid per haircut?
What's the variability in how much the rich(poor) pay ? vs. What's the variability in how much men(women) pay?
Whichever difference in means is larger will most likely be the proper variable to stratify on whether it is gender or wealth. We can quantify those differences with the associated measures of variability.
I'd guess stratifying on gender would be better because I think there is a more systematic difference in how much men and women pay for haircuts and we could reduce the variance most by stratifying on gender.