Homework 5, Data Analysis

Resources :

  1. The W5 folder in our Dropbox folder now contains some supplementary lecture readings re-emphasizing the core concepts of the CLT, standard error, hypothesis testing, confidence intervals, and p-values. Please skim through them (~15 minutes) before attempting the assignment to refresh your memory.

  2. Please find the Open Intro Statistics textbook in our Dropbox folder - skimming Chapters 5, 6, and 7 may help you find the standard error formulas needed for some of the questions. I will explicitly redirect you to the textbook for some questions.

  3. I have 4 user-defined functions below - you do not have to use them, but you may find them useful for graphically drawing out what is happening in each question.

# Clear the workspace
#  rm(list = ls()) # Clear environment - clashes with 
  gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 533546 28.5    1189778 63.6         NA   669277 35.8
## Vcells 990261  7.6    8388608 64.0      16384  1840356 14.1
  cat("\f")       # Clear the console

Set-up: 4 functions to better answer the questions.

I. Function to Reject or Not

We write a function that takes in two arguments (numbers here), runs a computation (a basic inequality) on them, and prints an output based on the result -

myp=function(p, alpha){
  if(p<alpha){print('REJECT Ho')}else{print('FAIL 2 REJECT')}
}

Test our function to make sure it is performing as intended -

myp(.01, .05) # p is less than alpha
## [1] "REJECT Ho"
myp(.1,  .05) # p is greater than alpha
## [1] "FAIL 2 REJECT"

Now, let's write a slightly more complex function (it takes in many arguments) that by default shades the standard normal distribution for a 5% double sided hypothesis test, and that can be adapted for other purposes too. You can change the arguments mu, sig, pcts, color, …

II. Function for Shading Normal

shadenorm = function(below=NULL, above=NULL, pcts = c(0.025,0.975), mu=0, sig=1, numpts = 500, color = "gray", dens = 40,                    justabove= FALSE, justbelow = FALSE, lines=FALSE,between=NULL,outside=NULL){

    if(is.null(between)){
         below = ifelse(is.null(below), qnorm(pcts[1],mu,sig), below)
         above = ifelse(is.null(above), qnorm(pcts[2],mu,sig), above)
    }
    if(is.null(outside)==FALSE){
         below = min(outside)
         above = max(outside)
    }
  
    lowlim = mu - 4*sig                         # min point plotted on x axis
    uplim  = mu + 4*sig                         # max point plotted on x axis
    x.grid = seq(lowlim,uplim, length= numpts)
    dens.all = dnorm(x.grid,mean=mu, sd = sig)
    
    if(lines==FALSE){
          plot(x.grid, dens.all, type="l", xlab="X", ylab="Density")    # label y and x axis
    }

    if(lines==TRUE){
          lines(x.grid,dens.all)
    }
    
    if(justabove==FALSE){
        x.below    = x.grid[x.grid<below]
        dens.below = dens.all[x.grid<below]
        polygon(c(x.below,rev(x.below)),c(rep(0,length(x.below)),rev(dens.below)),col=color,density=dens)
    }
    if(justbelow==FALSE){
        x.above    = x.grid[x.grid>above]
        dens.above = dens.all[x.grid>above]
        polygon(c(x.above,rev(x.above)),c(rep(0,length(x.above)),rev(dens.above)),col=color,density=dens)
    }
    
    if(is.null(between)==FALSE){
         from = min(between)
         to   = max(between)
         x.between    = x.grid[x.grid>from&x.grid<to]
         dens.between = dens.all[x.grid>from&x.grid<to]
         polygon(c(x.between,rev(x.between)),c(rep(0,length(x.between)),rev(dens.between)),col=color,density=dens)
    }
}

# TEST THE FUNCTION
shadenorm(mu = 0, sig = 1, pcts = c(0.025,0.975))

# shadenorm(mu = 20, sig = 6, pcts = c(0.025,0.975))

III. Function for Shading t

shadet = function(below=NULL, above=NULL, pcts = c(0.025,0.975), df=1, numpts = 500, color = "gray", dens = 40,   justabove= FALSE, justbelow = FALSE, lines=FALSE,between=NULL,outside=NULL){

    if(is.null(between)){
         below = ifelse(is.null(below), qt(pcts[1],df), below)
         above = ifelse(is.null(above), qt(pcts[2],df), above)
    }
    if(is.null(outside)==FALSE){
         below = min(outside)
         above = max(outside)
    }
  
    lowlim = -4
    uplim  = 4
    x.grid = seq(lowlim,uplim, length= numpts)
    dens.all = dt(x.grid,df)
    
    if(lines==FALSE){
          plot(x.grid, dens.all, type="l", xlab="X", ylab="Density")
    }

    if(lines==TRUE){
          lines(x.grid,dens.all)
    }
    
    if(justabove==FALSE){
        x.below    = x.grid[x.grid<below]
        dens.below = dens.all[x.grid<below]
        polygon(c(x.below,rev(x.below)),c(rep(0,length(x.below)),rev(dens.below)),col=color,density=dens)
    }
    if(justbelow==FALSE){
        x.above    = x.grid[x.grid>above]
        dens.above = dens.all[x.grid>above]
        polygon(c(x.above,rev(x.above)),c(rep(0,length(x.above)),rev(dens.above)),col=color,density=dens)
    }
    
    if(is.null(between)==FALSE){
         from = min(between)
         to   = max(between)
         x.between    = x.grid[x.grid>from&x.grid<to]
         dens.between = dens.all[x.grid>from&x.grid<to]
         polygon(c(x.between,rev(x.between)),c(rep(0,length(x.between)),rev(dens.between)),col=color,density=dens)
    }
}

# TEST THE FUNCTION
shadet(df = 4, pcts = c(0.025,0.975))     # see that the shaded tail areas are further away from the mean of 0..

# shadet(df = 120, pcts = c(0.025,0.975))   # t dist converges to normal when we have high degrees of freedom..

IV. Function for Shading Chi Square

shadechi = function(below=NULL, above=NULL, pcts = c(0.025,0.975), df=1, numpts = 500, color = "gray", dens = 40,   justabove= FALSE, justbelow = FALSE, lines=FALSE,between=NULL,outside=NULL){

    if(is.null(between)){
         below = ifelse(is.null(below), qchisq(pcts[1],df), below)
         above = ifelse(is.null(above), qchisq(pcts[2],df), above)
    }
    if(is.null(outside)==FALSE){
         below = min(outside)
         above = max(outside)
    }
  
    lowlim = 0
    uplim  = qchisq(.99,df)
    x.grid = seq(lowlim,uplim, length= numpts)
    dens.all = dchisq(x.grid,df)
    
    if(lines==FALSE){
          plot(x.grid, dens.all, type="l", xlab="X", ylab="Density")
    }
    if(lines==TRUE){
          lines(x.grid,dens.all)
    }
    
    if(justabove==FALSE){
        x.below    = x.grid[x.grid<below]
        dens.below = dens.all[x.grid<below]
        polygon(c(x.below,rev(x.below)),c(rep(0,length(x.below)),rev(dens.below)),col=color,density=dens)
    }
    if(justbelow==FALSE){
        x.above    = x.grid[x.grid>above]
        dens.above = dens.all[x.grid>above]
        polygon(c(x.above,rev(x.above)),c(rep(0,length(x.above)),rev(dens.above)),col=color,density=dens)
    }
    
    if(is.null(between)==FALSE){
         from = min(between)
         to   = max(between)
         x.between    = x.grid[x.grid>from&x.grid<to]
         dens.between = dens.all[x.grid>from&x.grid<to]
         polygon(c(x.between,rev(x.between)),c(rep(0,length(x.between)),rev(dens.between)),col=color,density=dens)
    }
}

# TEST THE FUNCTION
shadechi(df = 2, pcts=c(.05))   # change pcts and see what happens

shadechi(df = 18, pcts=c(.05))  # change df and see what happens   

Question 1 (In class Lecture notes)

Using traditional methods, it takes 109 hours to receive a basic driving license. A new license training method using Computer Aided Instruction (CAI) has been proposed. A researcher used the technique with 190 students and observed that they had a mean of 110 hours. Assume the (population) standard deviation is known to be 6. A level of significance of 0.05 will be used to determine if the technique performs differently than the traditional method. Make a decision to reject or fail to reject the null hypothesis. Show all work in R.

Given: \(\mu=109, n=190, \bar{x}=110, \sigma=6, \alpha=.05\).

To Do: Determine if the technique performs differently than the traditional method. Burden of proof falls on alternative hypothesis -

i. Null and Alternative Hypothesis

Ho: \(\mu=109\), Ha: \(\mu \neq 109\)

Ho: Mean number of hours to obtain the driving license with the CAI is equal to the mean number of hours with the traditional method (109 hours).

Ha: Mean number of hours to obtain the driving license with the CAI is NOT equal to the mean number of hours with the traditional method (109 hours).

Two sided test (look at alternative hypothesis).

ii. Choose level of significance

\(\alpha = .05\)

iii. Test Statistic

Distribution: Z (population SD is known)

iv. Decision Rule

`myp()`

v. Take sample and decide

  # compute test statistic
Z = ( 110 - 109 ) / ( 6/sqrt(190) )                    
test_statistic <- Z
test_statistic
## [1] 2.297341
alpha = .05       

  # compute critical value (split alpha since we have double sided hypothesis)
?pnorm # for cdf/finding area to the left of a point q
?qnorm # for quantile/finding the point to the left of which the area is p
critical_value  <- qnorm(p = .975, 
                          mean = 0,
                          sd = 1
                )
critical_value
## [1] 1.959964

METHOD 1: Test statistic vs Critical value. If the test statistic is larger than the critical value in absolute terms/“more extreme”, reject the null.

Since the test statistic is more extreme than the critical value here, or alternatively the test statistic lies in the rejection region, we will reject the null.
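As a quick sketch (not required), Method 1 can be operationalized in one line using the objects computed above, in the same spirit as `myp()` -

  # Method 1 as a one-line check: reject Ho if the test statistic is more extreme than the critical value
abs(test_statistic) > critical_value   # TRUE here, so REJECT Ho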

Visually, you could have drawn by hand -

 # shades significance level gates 
shadenorm( mu = 109, 
           sig = 6/sqrt(190), 
           pcts = c(0.025,0.975), 
           color = "red")   

  # mark point estimate from sample
lines(x   = rep(x = 110,10), 
      y   = seq(from  = 0, 
                to = 1,
                length.out=10), 
      col ='blue')                       

METHOD 2: Compare the p value with alpha (significance level). If the p value is smaller than alpha, reject the null.

The p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. Read more about p-value

  # p value associated with the two sided hypothesis test
p_value = 2 * ( 1 - pnorm(q = test_statistic) )      # no need to explicitly specify the defaults in pnorm i.e. mean = 0, sd = 1    
p_value
## [1] 0.0215993
# given to us in question 
alpha
## [1] 0.05

Since the p value is less than alpha, we reject the null hypothesis - unlikely to see the sample mean we saw if the null is true.

We could have used our function (which operationalizes the reject/do-not-reject rule by comparing the p value with alpha/the level of significance) to reach the same conclusion -

myp(p_value, alpha)
## [1] "REJECT Ho"
?qnorm
  # NOTE: qnorm() expects a probability in [0,1]; passing the test statistic (about 2.3) produces NaN -
  # do not confuse quantiles with probabilities.
qnorm(p = test_statistic)
## Warning in qnorm(p = test_statistic): NaNs produced
## [1] NaN
 # shades p-values here 
shadenorm( mu = 109, 
           sig = 6/sqrt(190), 
           pcts = c(1-pnorm(test_statistic),pnorm(test_statistic)), 
           color = "green")   

  # mark point estimate from sample
lines(x   = rep(x = 110,10), 
      y   = seq(from  = 0, 
                to = 1,
                length.out=10), 
      col ='blue') 

A simulated p value would have led to the same conclusion. NOT REQUIRED/WILL NOT BE TESTED.

Simulated P values

  # NOT REQUIRED - Simulated p value gives us pretty much the same result...

simulated_distribution <- rnorm(n = 100000,
                                mean = 109,
                                sd = 6/sqrt(190)
                                )   # 100000 obs generated under the null: mean 109 and sd 6/sqrt(190)

p_value_sim <- 2 * length(simulated_distribution[simulated_distribution  >= 110]) / length(simulated_distribution)

p_value_sim 
## [1] 0.0213

METHOD 3: Does the confidence interval constructed from the sample point estimate contain the hypothesized value? If not, reject the null.

The CI gives us the same conclusion, as the hypothesized population parameter is not within the 95% CI constructed from the sample.

  # NOT REQUIRED - 

# interval = c( xbar - z * Se, xbar + z * Se)
xbar  <- 110
Se    <- 6/sqrt(190) 

z_critical_value     <- qnorm(p    = .975, 
                              mean = 0, 
                              sd   = 1)

interval <- c( xbar - z_critical_value * Se, 
               xbar + z_critical_value * Se)

interval
## [1] 109.1469 110.8531

Question 2 (Lecture notes)

Our environment is very sensitive to the amount of ozone in the upper atmosphere. The level of ozone normally found is 5.3 parts/million (ppm). A researcher believes that the current (mean) ozone level is at an insufficient level. The mean of 5 observations in a sample is 5.0 parts per million (ppm) with a standard deviation of 1.1. Does the data support the claim at the 0.05 level?

Assume the population distribution is approximately normal.

Given: \(\mu=5.3, n=5, \bar{x}=5, \sigma_x=1.1, \alpha=.05\).

To Do: The researcher believes that the current ozone level is not 5.3 parts/million (ppm) - does the data support the claim at the 0.05 level?

i. Null and Alternative Hypothesis

Ho: \(\mu=5.3\), Ha: \(\mu \neq 5.3\)

Ho: Mean level of ozone is 5.3 ppm.

Ha: Mean level of ozone is not 5.3 ppm.

ii. Choose level of significance

\(\alpha = .05\)

iii. Test Statistic

Distribution: z (we are told to assume normality; otherwise you could have modeled it as t)

iv. Decision Rule

`myp()`

v. Take sample and decide

#Ho:  Mu==5.3, Ha:  Mu!=5.3

alpha <- .05

xbar  <- 5
Mu    <- 5.3

n     <- 5
df    <- n-1

sd    <- 1.1
Se    <- sd/sqrt(n)

z     <-    (xbar - Mu) / Se
z
## [1] -0.6098367
test_statistic <- z
test_statistic
## [1] -0.6098367
?qt
critical_value  <- qnorm(p = alpha/2)  # compute critical value (split alpha since we have double sided hypothesis)
critical_value
## [1] -1.959964

METHOD 1: Test statistic vs Critical value. If the test statistic is larger than the critical value in absolute terms/“more extreme”, reject the null.

Since the test statistic is less extreme than the critical value in absolute terms (it does not fall in the rejection region), we fail to reject the null hypothesis - the sample mean we observed is likely if the null is true.

shadenorm(mu = 0, sig = 1, pcts = c(0.025 , 0.975))

lines(x = rep(x = test_statistic, 
              times = 10
              ), 
      y = seq(from = 0,
              to = 1,
              length.out=10
              ), 
      col = 'red')

METHOD 2: Compare the p value with alpha (significance level). If the p value is smaller than alpha, reject the null.

You can read How to Calculate the P-Value of a T-Score in R.

?pt # distribution function for the t distribution with df degrees of freedom

p_value <- 2 * pnorm(q  = test_statistic
                  ) ## [1] 0.54197
p_value
## [1] 0.54197

Since the p value is (much) greater than alpha, we fail to reject the null hypothesis - likely to see the sample mean we got if the null is true.

METHOD 3: Does the confidence interval constructed from the sample point estimate contain the hypothesized value? If not, reject the null.

The CI gives us the same conclusion, as the hypothesized population parameter (5.3) is within the 95% CI constructed from the sample.

# interval = c( xbar - t * Se, xbar + t * Se)   

xbar  # calculated above
## [1] 5
Se    # calculated above
## [1] 0.491935
z_crit     <- qnorm(p = 1 - alpha/2 )
z_crit
## [1] 1.959964
interval = c( xbar - z_crit * Se,
              xbar + z_crit * Se)

interval
## [1] 4.035825 5.964175

Question 2: Extension (Lecture notes)

Our environment is very sensitive to the amount of ozone in the upper atmosphere. The level of ozone normally found is 5.3 parts/million (ppm). A researcher believes that the current (mean) ozone level is at an insufficient level. The mean of 5 observations in a sample is 5.0 parts per million (ppm) with a standard deviation of 1.1. Does the data support the claim at the 0.05 level?

Assume the population distribution is approximately normal.

i. Null and Alternative Hypothesis

Ho: \(\mu \geq 5.3\), Ha: \(\mu < 5.3\)

Ho: Mean level of ozone is greater than or equal to 5.3 ppm.

Ha: Mean level of ozone is less than 5.3 ppm.

ii. Choose level of significance

\(\alpha = .05\)

iii. Test Statistic

Distribution: z (we are told to assume normality; otherwise you could have modeled it as t)

iv. Decision Rule

`myp()`

v. Take sample and decide

#Ho:  Mu>=5.3, Ha:  Mu<5.3

alpha <- .05

xbar  <- 5
Mu    <- 5.3

n     <- 5
df    <- n-1

sd    <- 1.1
Se    <- sd/sqrt(n)

z     <-    (xbar - Mu) / Se
z
## [1] -0.6098367
test_statistic <- z
test_statistic
## [1] -0.6098367
  # compute critical value (single sided hypothesis, so use alpha directly - no splitting)
?qt
critical_value  <- qnorm(p = alpha)
critical_value
## [1] -1.644854
p_value <- pnorm(q = test_statistic)

FAIL TO REJECT THE NULL.
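To make the Method 2 decision explicit, note that the one sided p value computed above is just half of the two sided p value from Question 2 (roughly 0.271) -

p_value               # lower tail area below the test statistic, about 0.54197 / 2 = 0.271
myp(p_value, alpha)   # prints "FAIL 2 REJECT" since 0.271 > 0.05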

Visually, only the shaded rejection region changes - it now sits entirely in one tail!

shadenorm(mu = 0, sig = 1, pcts = c(alpha))

lines(x = rep(x = test_statistic, 
              times = 10
              ), 
      y = seq(from = 0,
              to = 1,
              length.out=10
              ), 
      col = 'red')

Terminology

- Sometimes, people may say -

  • “The mean of 5 samples is 5.0 ppm with a standard deviation of 1.1”, instead of

  • “The mean of 5 observations in a sample is 5.0 parts per million (ppm) with a standard deviation of 1.1”

We solved using the second case here: \(n=5\), \(\bar x=5\) and \(\sigma=1.1\). To apply the CLT (for inference), we find the standard error \(\sigma_{\bar x} =\dfrac{1.1}{\sqrt n}\).

For the first case, we can still apply the CLT - think about 5 distributions with the same variance but different means. The mean of the 5 sample distributions is 5 (which will be our best estimate of the true population mean), and the standard deviation of these 5 distributions is 1.1 - thus, we can again calculate the standard error \(\sigma_{\bar x} =\dfrac{1.1}{\sqrt n}\).

We know the population sd in the second interpretation, so we can use the normal or z distribution (instead of a t distribution, which is used when we do not know the population sd). The short sketch after the list below shows the t critical values converging to the z value as the degrees of freedom grow.

  • t-Distribution:

    • Small sample size.

    • Unknown population standard deviation.

    • Estimating population parameters.

  • z-Distribution:

    • Large sample size.

    • Known population standard deviation (rare in practice).
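As a quick sketch of that convergence (values approximate) -

  # t critical values shrink toward the z critical value as the degrees of freedom grow
qnorm(p = .975)                    # z cutoff, about 1.96
qt(p = .975, df = c(4, 30, 120))   # about 2.78, 2.04, 1.98 - approaching 1.96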

Question 3 (Lecture notes)

Our environment is very sensitive to the amount of ozone in the upper atmosphere. The level of ozone normally found is 7.3 parts/million (ppm). A researcher believes that the current (mean) ozone level is not at a normal level. The mean of 51 observations in a sample is 7.1 ppm with a variance of 0.49. Assume the population is normally distributed. A level of significance of 0.01 will be used.

Given: \(\mu=7.3, n=51, \bar{x}=7.1, s^2=0.49, \alpha=.01\).

To Do: The researcher believes that the current ozone level is not at the normal level. Thus, set a double sided hypothesis.

i. Null and Alternative Hypothesis

Ho: \(\mu=7.3\), Ha: \(\mu \neq 7.3\)

Ho: Mean level of ozone is 7.3 ppm.

Ha: Mean level of ozone is not 7.3 ppm.

ii. Choose level of significance

\(\alpha = .01\)

iii. Test Statistic

Distribution: z (population assumed to be normally distributed)

iv. Decision Rule

`myp()`

v. Take sample and decide

Mu      <-   7.3
alpha   <-   .01

n       <-   51
df      <-   n-1

sd      <-   sqrt(.49)
Se      <-   sd/sqrt(n)

z <- (7.1 - 7.3) / Se
z
## [1] -2.040408
critical_value <- qt(p = .005, df = df)
critical_value
## [1] -2.677793
critical_value <- qnorm(p = alpha/2)
critical_value
## [1] -2.575829
shadenorm(mu = 7.3, sig = Se, pcts = c(.005 , .995))
lines(x = rep(x = 7.1, 
              times = 10), 
      y = seq(from = 0, 
              to   = 1, 
              length.out = 10), 
      col="red")  # plot the z statistic

?pnorm
p_value <-  2 * pnorm( q = z)  # calculate the two sided p value
p_value
## [1] 0.04130969
alpha
## [1] 0.01
myp(p = p_value, alpha = alpha)
## [1] "FAIL 2 REJECT"

Method 1: Fail to reject the null, as the test statistic falls in the acceptance region / is not as extreme as the critical value.

Method 2: Since the p-value is greater than alpha, we fail to reject the null hypothesis - based on the evidence from our sample, the current mean ozone level is plausibly still 7.3 ppm.

Method 3: CI should give you the same conclusion. Try it.

# interval = c( xbar - t * Se, xbar + t * Se)   

xbar  <- 7.1    # sample estimate
Se              # calculated above
## [1] 0.09801961
critical_value <- qt(p = 1-alpha/2, df = df)
critical_value
## [1] 2.677793
interval = c( xbar - critical_value * Se,
              xbar + critical_value * Se)

interval
## [1] 6.837524 7.362476

Question 4 (See Open Stats Textbook - Chapter 5 Section 5.2: Confidence intervals for a proportion)

A publisher reports that 36% of their readers own a laptop. A marketing executive wants to test the claim that the percentage is actually less than the reported percentage. A random sample of 100 found that 29% of the readers owned a laptop. Is there sufficient evidence at the 0.02 level to support the executive’s claim? Show all work and hypothesis testing steps.

Given: \(\pi=.36, n=100, \hat{p}=.29,\alpha=.02\).

To Do: Executive wants to test the claim that the percentage is actually less than the reported percentage. Thus, set a single sided hypothesis.

i. Null and Alternative Hypothesis

Ho: \(\pi \geq .36\), Ha: \(\pi < .36\)

Ho: Proportion of readers that own a laptop is greater than or equal to .36

Ha: Proportion of readers that own a laptop is less than .36

Single sided.

ii. Choose level of significance

\(\alpha = .02\)

iii. Test Statistic

NOTE this is a proportion (discrete variable) question, not a mean (continuous variable) question - so the standard error formula will be a bit different.
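In symbols, the standard error under the null is \(SE = \sqrt{\dfrac{p_0\,(1-p_0)}{n}}\), where \(p_0 = .36\) is the hypothesized proportion - this is exactly what `sqrt(p * q / n)` computes in the code below.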

Distribution: t (we are not told the underlying distribution; the sample proportion is approximately normally distributed, and using t just adds slightly fatter tails). Since n is 100, you could have modeled it as z too (the two distributions are practically the same once the degrees of freedom reach about 120).

iv. Decision Rule

`myp()`

v. Take sample and decide

# Ho:  pi>=.36, Ha: pi<.36

phat  <-  .29    # sample proportion

p     <-  .36    # population hypothesized value
q     <-  1 - p

alpha <- .02
n     <-  100
df    <- n-1


# distribution: PROPORTION

Se <- sqrt( p * q / n)  # NOTE DIFFERENT FORMULA for the standard error now 


t <- ( phat - p ) / Se  # t <- (sample estimate - population ie null hypothesis value ) / Se
test_statistic <- t
test_statistic
## [1] -1.458333
critical_value <- qnorm(p = .02 )         # if you modeled as Z/standard normal distribution
critical_value
## [1] -2.053749
critical_value <- qt(p = .02,df = n-1 )   # if you modeled as t distribution 
critical_value
## [1] -2.081162
shadet(df = n-1, pcts = c(.02))
?lines
lines(x = rep(x = test_statistic, times = 10), y = seq(from = 0, to = 20,length.out=10), col='red')

#shadenorm(mu = .36, sig = Se, pcts = c(.02) )
#lines(x = rep(.29,10), y = seq(from = 0, to = 20,length.out=10), col='red')



p_value <- pnorm(t)   # single sided lower tail test, from the alternative hypothesis 
p_value
## [1] 0.07237434
alpha 
## [1] 0.02
myp(p = p_value, alpha = alpha)
## [1] "FAIL 2 REJECT"

Since the p value is greater than alpha, we fail to reject the null hypothesis - a sample proportion of .29 is reasonably likely if the null is true (36% of readers own laptops).

Fail to reject at 5% level of significance too.

critical_value <- qt(p = .05, df = n-1 )
critical_value
## [1] -1.660391
test_statistic # stays the same, only critical value changes as we change alpha
## [1] -1.458333

However, we reject at the 10% level of significance. The critical value changes but the test statistic stays the same.

critical_value <- qt(p = .10,
                     df = n-1
                     )
critical_value
## [1] -1.290161
test_statistic # stays the same, only the critical value changes as we change alpha
## [1] -1.458333

Again, as an extra, here is a fun way to simulate the p value.

temp <- rt(n = 1000000, 
           df = n-1)

p_value_sim <- length( temp[temp <= t] ) / length(temp)
p_value_sim
## [1] 0.074017
shadet(df = n-1, pcts = c(.05))
lines(x = rep(test_statistic,10), y=seq(0,20,length.out=10), col='red')

shadet(df = n-1, pcts = c(.10))
lines(x = rep(x = test_statistic,
              times=10),
      y = seq(0,20,length.out=10), col='red')

prop.test formula

We could have applied prop.test also, which gives a very similar result (including the p value; prop.test applies a continuity correction).

  • The CI will be (0.29, 0.395) as it is single sided (keep the point estimate fixed as one part of the interval).

  • The command's output does not work very well for single sided CIs (it prints 0 as one bound, so you have to do some thinking), but it works well for double sided intervals.

    • This is usually an issue with many user written commands when you extend them to corner cases, and thus you should be able to derive everything from scratch.

prop_hat <- .29    # sample proportion
prop     <- 0.36   # population proportion
n        <- 100    # sample size

#test statistic and p value
?prop.test # can be used for testing the null that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values.
prop.test( x = 29,               # a vector of counts of successes ; from sample
           n = 100,              # a vector of counts of trials ; from sample
           p = .36,              # a vector of probabilities of success ; from population
           alternative = "less", # a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less"
           conf.level = 1-alpha  # confidence level of the returned confidence interval. 
           )
## 
##  1-sample proportions test with continuity correction
## 
## data:  29 out of 100, null probability 0.36
## X-squared = 1.8338, df = 1, p-value = 0.08784
## alternative hypothesis: true p is less than 0.36
## 98 percent confidence interval:
##  0.000000 0.395416
## sample estimates:
##    p 
## 0.29
# point estimate plus Se times the critical value (add it to the right side as we have a lower tail)
prop_hat + Se * -qt(p = .02,df = n-1 ) 
## [1] 0.3898958

Question 5 (See Open Stats Textbook - Chapter 5)

A hospital director is told that 31% of the treated patients are uninsured. The director wants to test the claim that the percentage of uninsured patients is less than the expected percentage. A sample of 380 patients found that 95 were uninsured. Make the decision to reject or fail to reject the null hypothesis at the 0.05 level. Show all work and hypothesis testing steps.

Given: \(\pi=.31, n=380, \hat{p}=\dfrac{95}{380}=.25,\alpha=.05\).

To Do: The director wants to test the claim that the percentage of uninsured patients is less than the reported percentage. Thus, set a single sided hypothesis.

i. Null and Alternative Hypothesis

Ho: \(\pi \geq .31\), Ha: \(\pi < .31\)

Ho: Proportion of uninsured patients is greater than or equal to .31

Ha: Proportion of uninsured patients is less than .31

Single sided

ii. Choose level of significance

\(\alpha = .05\)

iii. Test Statistic

Distribution: Z (proportion)

iv. Decision Rule

`myp()`

v. Take sample and decide

# Ho:  pi>=.31, Ha: pi<.31
p <- .31     # success
q <- 1 - p   # failure

n <- 380   
  x <- 95  
phat <- x/n

phat
## [1] 0.25
alpha <- .05
Se <- sqrt( p * q / n)   # NOTE DIFFERENT FORMULA for the standard error now 
Se
## [1] 0.0237254
Z <- ( phat - p ) / Se   # sample estimate minus hypothesized value divided by Se
test_statistic <- Z
test_statistic
## [1] -2.528935
critical_value <-  qnorm(p = alpha,  # single sided, so no alpha/2
                         mean = 0,
                         sd = 1)

# METHOD 1: Reject the null as test statistic is bigger than critical value in absolute terms
test_statistic
## [1] -2.528935
critical_value
## [1] -1.644854
# METHOD 2: Reject the null as p-value is smaller than alpha
p_value <- pnorm(Z) ## [1] 0.005720462
p_value 
## [1] 0.005720462
alpha
## [1] 0.05
myp(p = p_value, alpha = alpha)
## [1] "REJECT Ho"
shadenorm(mu = .31, sig = Se, pcts = c(alpha)) # single sided, so alpha is not split into alpha/2
lines(x = rep(phat,10), y = seq(from = 0, to = 20,length.out=10), col = 'red')

Strongly reject the null hypothesis that 31% or more of the treated patients are uninsured.

Single sided hypothesis testing CI adjustment

# METHOD 3: Lower one-sided confidence interval for proportion should be less than population estimate

z <- qnorm(p = alpha)
interval <- c(phat, phat - z * Se )

interval

The adjusted one-sided CI is (0.250, 0.289).

We could have applied prop.test also, which gives a very similar result (including the p value). The CI will be (0.250, 0.289) as it is single sided - the command output is not presented very well for that case, but it works well for double sided intervals.

prop_hat <- 95/380 # sample proportion (25%)
prop     <- 0.31   # population proportion
n        <- 380    # sample size

#test statistic and p value
?prop.test # can be used for testing the null that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values.
prop.test( x = 95,               # a vector of counts of successes,
           n = 380,              # a vector of counts of trials
           p = .31,              # a vector of probabilities of success. 
           alternative = "less", # a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less"
           conf.level = 1 - .05  # confidence level of the returned confidence interval. 
           )
## 
##  1-sample proportions test with continuity correction
## 
## data:  95 out of 380, null probability 0.31
## X-squared = 6.1181, df = 1, p-value = 0.00669
## alternative hypothesis: true p is less than 0.31
## 95 percent confidence interval:
##  0.0000000 0.2895878
## sample estimates:
##    p 
## 0.25
temp <- rnorm(n = 1000000, mean = p, sd = Se)
p_value_sim <- length(temp[temp<=x/n])/length(temp) ## [1] 0.005643
p_value_sim
## [1] 0.005685

Sample Size Analysis

Determines the minimum sample size required to achieve a desired level of precision/MOE.

  • calculate n given MOE, critical value and standard deviation

  • Reading the first two sections may be helpful.

\(MOE = \dfrac{Z_{crit} \times SD}{\sqrt{n}}\)

which rearranges to

\(n = \left( \dfrac{Z_{crit} \times SD}{MOE} \right)^2\)
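As a minimal sketch (the numbers below are illustrative assumptions, not taken from any homework question) -

  # Hedged sketch: minimum n for a desired margin of error, using the formula above
z_crit <- qnorm(p = .975)            # 95% confidence
sd_pop <- 6                          # assumed population standard deviation
MOE    <- 1                          # desired margin of error
n_min  <- (z_crit * sd_pop / MOE)^2
ceiling(n_min)                       # round up to the next whole observation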

Question 6 (See W5 Dropbox “Lecture 21” file for similar question on Chi-Squared Distribution)

A standardized test is given to a sixth-grade class. Historically the mean score has been 112 with a standard deviation of 24. The superintendent believes that the standard deviation of performance may have recently decreased. She randomly sampled 22 students and found a mean of 102 with a standard deviation of 15.4387. Is there evidence that the standard deviation has decreased at the \(\alpha= 0.1\) level? Show all work and hypothesis testing steps.

Given: \(\sigma=24, n=22, \hat{\sigma}=15.4387, \alpha=.1\).

To Do: Is there evidence that the standard deviation has decreased?

i. Null and Alternative Hypothesis

Ho: \(\sigma \geq 24\)

Ha: \(\sigma < 24\)

Ho: Variation (SD) of the sixth grade class scores is 24 or higher/excessive

Ha: Variation (SD) of the sixth grade class scores is acceptable/less than 24

Single Sided

ii. Choose level of significance

\(\alpha = .1\)

iii. Test Statistic

Distribution: Chi Squared (this is a test for a population variance / sd, not a mean. The chi-squared random variable is the distribution of a sum of squared standard normal variables, so it arises naturally from squaring a normal distribution.)

Chi Squared Distribution
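In symbols, the test statistic computed below is \(\chi^2 = \dfrac{(n-1)\,s^2}{\sigma_0^2}\), which follows a chi-squared distribution with \(n-1\) degrees of freedom when the population is normally distributed.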

iv. Decision Rule

`myp()`

v. Take sample and decide

#Ho:  sigma>=24, Ha: sigma<24

sigma   = 24          # Population parameter
sigma_2 = sigma^2

alpha = .1

n = 22
s = 15.4387          # Sample parameter
s_2 = s^2

#Chi
df  = n-1
chi = df * s_2 / sigma_2   # Test statistic formula - df * ratio of sample to population variance

test_statistic <- chi
test_statistic
## [1] 8.68997
?qchisq # just requires the df argument
critical_value <- qchisq(p = alpha, df = n-1 )
critical_value  
## [1] 13.2396
shadechi(df = n-1, pcts = c(.1))
lines(rep(chi,10), seq(0,20,length.out=10),col='red')

?pchisq  # distribution function for the chi-squared distribution with df degrees of freedom

p_value <- pchisq(q = chi, df = df, lower.tail = T) ## [1] 0.008549436
p_value
## [1] 0.008549436
myp(p_value,alpha) ## [1] "REJECT Ho"
## [1] "REJECT Ho"

Strongly reject the null hypothesis.

The sample variance is so much smaller than the hypothesized population variance that it is very likely the sample came from a different population (of course, we could be wrong alpha = 10% of the time).

# Reject Null, p<alpha
temp=rchisq(100000,df)
p_value_sim <-length(temp[temp<=chi])/length(temp)  ## [1] 0.00896
p_value_sim  # Reach the same conclusion by this approximation.  
## [1] 0.00854

For Questions 7, 8, and 9, I use the Satterthwaite approximation1.

The Satterthwaite formula for the 2-sample t-test degrees of freedom is \(\dfrac{(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2})^2}{\dfrac{1}{n_1-1}(\dfrac{s_1^2}{n_1})^2+\dfrac{1}{n_2-1}(\dfrac{s_2^2}{n_2})^2}\), where \(s_1\) and \(n_1\) are the standard deviation and sample size of group 1, and \(s_2\) and \(n_2\) are those of group 2.

The official Satterthwaite formula for the degrees of freedom is quite complex and is generally computed using software, so instead you may use the smaller of n1 − 1 and n2 − 1 for the degrees of freedom if software isn’t readily available.

Thus, for Questions 7, 8, and 9 you may select the smaller of the two degrees of freedom corresponding to each group.
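As a reusable sketch of the formula above (a small helper of my own, not required for the assignment) -

satterthwaite_df = function(s1, n1, s2, n2){
  # Satterthwaite degrees of freedom for a two sample t test, per the formula above
  num <- (s1^2/n1 + s2^2/n2)^2
  den <- (s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1)
  num / den
}
satterthwaite_df(s1 = 9, n1 = 32, s2 = 10, n2 = 31)   # Question 7 inputs; roughly 60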

See 7.3.3.R - the R script for Case Study 7.3.3 in the Open Stats textbook, now available in the Dropbox W5 folder - to see how to solve similar questions.

Question 7 (See Open Stats Section 7.3, Example 7.25 in particular)

A medical researcher wants to compare the pulse rates of smokers and non-smokers. He believes that the pulse rate for smokers and non-smokers is different and wants to test this claim at the 0.1 level of significance. The researcher checks 32 smokers and finds that they have a mean pulse rate of 87, and 31 non-smokers have a mean pulse rate of 84. The standard deviation of the pulse rates is found to be 9 for smokers and 10 for non-smokers. Let \(\mu_1\) be the true mean pulse rate for smokers and \(\mu_2\) be the true mean pulse rate for non-smokers. Show all work and hypothesis testing steps.

Let smoker group be indexed by 1, non-smoker group by 2.
Given: \(n_1 = 32, \bar{x}_1 = 87, n_2 = 31, \bar{x}_2 = 84, s_1 = 9, s_2 = 10 , \alpha = 10\%\).

To Do: Test if the pulse rate for smokers and non-smokers is different at the 0.1 level of significance. Thus, double sided test.

i. Null and Alternative Hypothesis

Ho: \(\mu_1 - \mu_2 = 0\), Ha: \(\mu_1 - \mu_2 \neq 0\)

Ho: Difference of two means (smoker and non-smoker group) is zero.

Ha: Difference of two means (smoker and non-smoker group) is not zero. In other words, the difference of the point estimates is not zero.

Double sided

ii. Choose level of significance

\(\alpha = .1\)

iii. Test Statistic

Distribution: t

iv. Decision Rule

`myp()`

v. Take sample and decide

# Ho: Mu1-mu2=0, Ha:  Mu1-Mu2<>0
mu1 <- 87
mu2 <- 84

alpha   =   .1

# dist = t
n1    =   32
n2    =   31

df1   =   n1 - 1
df2   =   n2 - 1

sd1   = 9
sd2   = 10

var1  = sd1^2
var2  = sd2^2

num_point_estimate_diff = (mu1 - mu2 )  # point estimate difference - the numerator term
den_Se = sqrt( var1/n1 + var2/n2 )  # Se formula - Standard Error using sample standard deviations rather than population standard deviations - the denominator term
t   = (num_point_estimate_diff - 0 ) / den_Se       # in our null, the hypothesized value is 0 - so we subtract it from our sample estimate.  Divide it by standard error as usual.    
test_statistic <- t
test_statistic
## [1] 1.25032
# Satterthwaite degrees of freedom
numdf = (var1/n1 + var2/n2)^2                       # Satterthwaite
dendf = (var1/n1)^2 / df1 + (var2/n2)^2 / df2       # Satterthwaite
df = numdf / dendf                                  # Satterthwaite - can be replaced with smaller of df1 or df2

critical_value_satterwaite <- qt(p = (1-alpha/2), df = numdf / dendf ) # symmetric; p = .95 instead of .05 because the test statistic is positive, so I need the upper critical value to decide whether to reject the null
critical_value_satterwaite
## [1] 1.670703
?min
critical_value_conservative <- qt(p = (1-alpha/2), df = min(df1, df2) ) # symmetric, p = .95 instead of .05
critical_value_conservative  # very close to the critical value from the Satterthwaite method.  Usually they give the same result.  
## [1] 1.697261
critical_value_conservative_manual <- qt(p = (1-alpha/2), df = 30 )     # if you want to apply the conservative method manually 
critical_value_conservative_manual
## [1] 1.697261

Fail to reject the null by Method 1: the test statistic (about 1.25) is less extreme than the critical value (about 1.67).

shadet(df = df, pcts = c(.05,.95))
?seq
lines(x = rep(t,10), 
      y = seq(from = 0, 
              to = 1, 
              length.out = 10),
      col='red'
        )

?pt # distribution function for the t distribution with df degrees of freedom
p_value = 2 * ( 1 - pt(q = t, 
                       df = numdf / dendf)
                )    # Satterthwaite           ## [1] 0.2160473
p_value
## [1] 0.2160473
myp(p = p_value,alpha = alpha)
## [1] "FAIL 2 REJECT"
p_value_conservative = 2 * ( 1 - pt(q = t, 
                              df = min(df1, df2)    # smaller of the numerator and denominator degree of freedom
                              )
                       )  # double sided hypothesis

p_value_conservative    # a bit different p value (similar), but the same end decision rule !!!   ## [1] 0.220848
## [1] 0.220848
myp(p = p_value_conservative,alpha = alpha)
## [1] "FAIL 2 REJECT"
CI <- c(num_point_estimate_diff - (critical_value_conservative * den_Se), 
        num_point_estimate_diff + (critical_value_conservative * den_Se))
CI
## [1] -1.072385  7.072385

FAIL TO REJECT the null - we do not find that the difference of the two means (smoker and non-smoker groups) is significantly different from zero.

Question 8 (See Open Stats Section 7.3, Example 7.22 in particular)

Given two independent random samples with the following results: \(n_1=11, \bar{x}_1=127, \sigma_1=33, n_2=18, \bar{x}_2=157, \sigma_2=27\)

Use this data to find the 95% confidence interval for the true difference between the population means. Assume that the population variances are not equal and that the two populations are normally distributed.

To Do: Create a 95% confidence interval for true difference between the population means.

i. Null and Alternative Hypothesis

Ho: \(\mu_1 - \mu_2 = 0\)

Ha: \(\mu_1 - \mu_2 \neq 0\)

Double Sided

ii. Choose level of significance

\(\alpha = .05\)

iii. Test Statistic

Distribution: t

iv. Decision Rule

`myp()`

v. Take sample difference and construct 95% CI around it

alpha = .05

xbar1   =   127   # mean
xbar2   =   157   # mean

n1    =   11      # sample size
n2    =   18      # sample size

df1   =   n1-1    # degrees of freedom
df2   =   n2-1    # degrees of freedom

s1    =   33      # sd
s2    =   27      # sd

var1    =   s1^2    # variance
var2    =   s2^2    # variance

# Satterthwaite DF - can be replaced with smaller of df1 or df2
numdf =   ( var1 / n1   +   var2 / n2 )^2
dendf =   ( var1 / n1 )^2 / df1  +  (var2 / n2 )^2 / df2
df    =   numdf / dendf

delta   = xbar1   -   xbar2               # point estimate difference 
delta
## [1] -30
# SEE SYMMETRY in the two t values - 
t   =   qt(p = .025, df = df)             # two sided hypothesis test at 5% level of significance, p = vector of probabilities
t
## [1] -2.10029
t   =   qt(p = .975, df = df)             # two sided hypothesis test at 5% level of significance, p = vector of probabilities
t
## [1] 2.10029
Se = sqrt( var1/n1 + var2/n2 )            # Se formula - Standard Error using sample standard deviations rather than population standard deviations
Se
## [1] 11.81101
interval = c( delta - t * Se ,
              delta + t * Se )
interval
## [1] -54.806548  -5.193452

The CI is (-54.806548, -5.193452), and since the hypothesized difference of means, 0, does not lie in the CI, we should reject the null. The p-value comparison via `myp()` (Method 2) suggests the same conclusion.

Method 1 gives the same conclusion under both formulas for the t statistic's degrees of freedom - Satterthwaite and the smaller of the two dfs.

test_statistic <- delta/Se
test_statistic
## [1] -2.540003
critical_value <- qt(p = alpha/2, df = numdf / dendf)
critical_value
## [1] -2.10029
critical_value <- qt(p = alpha/2, df = min(df1, df2))   # conservative approach: smaller of the two dfs
critical_value
## [1] -2.228139

Visually (or if you want the p-values {Method 2} under both df formulas) -

# Satterthwaite degrees of freedom (usually bigger than smaller of two df)
shadet(df = df, pcts = c(.025,.975) )  
lines(rep(delta/Se,10), seq(0,1,length.out=10),col='red')

?pt # distribution function for the t distribution with df degrees of freedom
p_value = 2 * ( pt(q = delta/Se, df = numdf/dendf))    # Satterthwaite
p_value
## [1] 0.02048034
myp(p_value,alpha)
## [1] "REJECT Ho"
# smaller of two degrees of freedom, df1 and df2
shadet(df = df1, pcts = c(.025,.975) ) 
lines(rep(delta/Se,10), seq(0,1,length.out=10),col='red')

p_value_robust = 2 * ( pt(q = delta/Se, df = min(df1, df2))) # smaller of the numerator and denominator degree of freedom
p_value_robust # a slightly different p value, but the same end decision rule!!!
## [1] 0.02936303
myp(p_value_robust,alpha)
## [1] "REJECT Ho"

Question 9 (See Open Stats Section 7.3, Example 7.22 in particular).

Two men, A and B, who usually commute to work together decide to conduct an experiment to see whether one route is faster than the other. The men feel that their driving habits are approximately the same, so each morning for two weeks one driver is assigned to route I and the other to route II. The times, recorded to the nearest minute, are shown in the following table. Using this data, find the 98% confidence interval for the true mean difference between the average travel time for route I and the average travel time for route II.

Let \(d1=\) (route I travel time) − (route II travel time).

Assume that the populations of travel times are normally distributed for both routes. Show all work and hypothesis testing steps.

To Do: Find the 98% confidence interval for the true mean difference between the average travel time for route I and the average travel time for route II.

Notation: Let \(d1 =\) (route I travel time) − (route II travel time).

Small sample -> t distribution (like the standard normal but with fatter tails).

See Open Stats 7.3.4 Pooled standard deviation estimate (special topic). This is different from Open Stats Section 6.2 Difference of two proportions, Example 6.2.2 in particular.

I avoid using the pooled SD here: even though the two people have similar driving skills, they are driving on (slightly) different routes. You can definitely disagree, but justify your assumption and use the appropriate SE formula. For simplicity, you can ignore the pairing and use the unpaired formulas (though you could also model this as paired data - that formula is not emphasized).

  • A pooled standard deviation is only appropriate when background research indicates the population standard deviations are nearly equal.

  • The pooled standard deviation of two groups is a way to use data from both samples to better estimate the standard deviation and standard error. If s1 and s2 are the standard deviations of groups 1 and 2 and there are very good reasons to believe that the population standard deviations are equal, then we can obtain an improved estimate of the group variances by pooling their data.

  • When the sample size is large and the condition may be adequately checked with data, the benefit of pooling the standard deviations greatly diminishes. (For comparison, a pooled-SD sketch appears after the main calculation below.)

r1 <- c (32, 27, 34, 24, 31, 25, 30, 23, 27, 35)
r2 <- c (28, 28, 33, 25, 26, 29, 33, 27, 25, 33)

alpha <- .02

xbar1   <-   mean(r1)   # mean
xbar2   <-   mean(r2)   # mean

n1    <-   10      # sample size
n2    <-   10      # sample size

df1   <-   n1 - 1    # degrees of freedom
df2   <-   n2 - 1    # degrees of freedom

?sd # computes the standard deviation of the values in x.
s1    <-   sd(r1)      # sd
s2    <-   sd(r2)      # sd

var1    <-   s1^2    # variance
var2    <-   s2^2    # variance

# pooled_sd <- sqrt( (df1*var1+df2*var2) / (n1+n2-2) )
# Se = pooled_sd * sqrt( 1/n1 + 1/n2 )  
Se = sqrt( var1/n1 + var2/n2 )            # Se formula - Standard Error using sample standard deviations rather than population standard deviations
Se
## [1] 1.678955
delta   <- xbar1   -   xbar2               # point estimate difference 
delta
## [1] 0.1
# SEE SYMMETRY in the two t values - 
t   =   qt(p = alpha/2, df = min(df1,df2))        # two sided hypothesis test at 2% level of significance, p = vector of probabilities
t
## [1] -2.821438
t   =   qt(p = 1-alpha/2, df = min(df1,df2))        # two sided hypothesis test at 2% level of significance, p = vector of probabilities
t
## [1] 2.821438
interval = c( delta - t * Se , delta + t * Se )
interval    # -4.637066  4.837066
## [1] -4.637066  4.837066
# Satterthwaite degrees of freedom
numdf = (var1/n1 + var2/n2)^2                       # Satterthwaite
dendf = (var1/n1)^2 / df1 + (var2/n2)^2 / df2       # Satterthwaite
df = numdf / dendf                                  # Satterthwaite - can be replaced with smaller of df1 or df2

critical_value_satterwaite <- qt(p = (1-alpha/2), df = df ) # symmetric, p = .98 
critical_value_satterwaite
## [1] 2.568883
interval = c( delta - critical_value_satterwaite * Se , 
              delta + critical_value_satterwaite * Se )
interval   # -4.213039  4.413039
## [1] -4.213039  4.413039
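For comparison only, a hedged sketch of the pooled approach described in the bullets above (appropriate only if the two population SDs are believed equal; it reuses the objects defined above) -

  # Hedged sketch: pooled-SD version of the 98% CI
pooled_sd <- sqrt( (df1*var1 + df2*var2) / (n1 + n2 - 2) )   # df-weighted combination of the two sample variances
Se_pooled <- pooled_sd * sqrt( 1/n1 + 1/n2 )
t_pooled  <- qt(p = 1 - alpha/2, df = n1 + n2 - 2)           # pooled degrees of freedom
c( delta - t_pooled * Se_pooled, delta + t_pooled * Se_pooled )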

Question 10 (See Open Stats Textbook - 6.2.1 Sampling distribution of the difference of two proportions and 6.2.2 Confidence intervals for p1 - p2)

The U.S. Census Bureau conducts annual surveys to obtain information on the percentage of the voting-age population that is registered to vote. Suppose that 391 employed persons and 510 unemployed persons are independently and randomly selected, and that 195 of the employed persons and 193 of the unemployed persons have registered to vote. Can we conclude that the percentage of employed workers (p1) who have registered to vote, exceeds the percentage of unemployed workers (p2) who have registered to vote? Use a significance level of 0.05 for the test. Show all work and hypothesis testing steps.

Q: Can we conclude that the percentage of employed workers (p1) who have registered to vote, exceeds the percentage of unemployed workers (p2) who have registered to vote?

Notation: Employed group is group 1 ; Unemployed group is group 2.

i. Null and Alternative Hypothesis

\(Ho: \pi_1 - \pi_2 \leq 0\)

\(Ha: \pi_1 - \pi_2 > 0\)

Alternative - Percentage of employed workers p1 who have registered to vote, exceeds the percentage of unemployed workers p2 who have registered to vote.

Given the counts and sample sizes of both groups and alpha; it is a single sided test; use the proportion formula (for the Se), based on the framing of the question.

Again, I am assuming unpooled SE (as the two groups are likely to be different in terms of their voting pattern).

alpha=.05
# Z

n1  = 391
n2  = 510

x1  = 195
x2  = 193

p1  = x1/n1   # Estimate 
p2  = x2/n2   # Estimate 
p1-p2
## [1] 0.1202899
Se = sqrt (p1 * (1-p1) / n1 + p2 * (1-p2) / n2)  # textbook formula
Se
## [1] 0.03317529
Z  = ( p1 - p2) / Se                            # textbook formula 
test_statistic <- Z
test_statistic
## [1] 3.625887
critical_value <- qnorm(p = 1 - alpha)                   # single sided, right gate, or just use qnorm(p = alpha) and remove the negative sign (symmetry)
critical_value
## [1] 1.644854
shadenorm(mu = 0, sig = Se, pcts = c(0.0,0.95))
lines(rep(p1-p2,10), seq(0,20,length.out=10),col='red')

pnorm(q = test_statistic, lower.tail = FALSE)
## [1] 0.0001439855
1 - pnorm(q = test_statistic)
## [1] 0.0001439855

Yes, reject the null. The p value and the CI give the same result.
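As a hedged sketch of the CI approach (Method 3), the lower one sided 95% confidence bound for \(p_1 - p_2\) reuses the Se computed above -

  # Hedged sketch: one sided lower confidence bound for p1 - p2
lower_bound <- (p1 - p2) - qnorm(p = 1 - alpha) * Se
lower_bound   # roughly 0.066; since 0 lies below this bound, we reject the null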


  1. If we assume equal variances of the two groups, then we use the "pooled" method. If we do not assume equal variances of the groups, then we use the "Satterthwaite" method.↩︎