1. Setting

System under test

The dataset used for this experiment gives the performance of various minorities on a CS exam.We test the effect (effects model) of the percentage of female students passing the exam on the total pass percentage of students, therefore the null hypothesis for this experiment can be stated as:

The variability in the total pass percentage of students cannot be explained by the variability in the percentage of female students passing the exam.In short there is no effect of female students passing the exam on the total number of students passing the exam.

We conduct a single factor-6 level experiment to analyze the effect of the variability in the percentage of female students passing the exam on the variability in the percentage of total students (minorities) passing the CS-A exam in the given years (2008-2013).We conduct an exploratory data analysis on the original data and also perform relevant diagnostics tests for model adequacy checking.So the factor in this experiment is the percentage of female students passing the exam and the response variable is the total percentage of passing students each year.

Based on our initial analysis, we select our resampling strategy out of the following three:

Permutations
Bootstrapping
Monte Carlo

Data Summary,Exploratory data analysis and ANOVA on the original dataset

Here is a summary of the original dataset.There are 161 obs. of 45 variables in the original data set.

Data Type: multivariate

#Read the data from file
datam<-(read.csv(file.choose(), header=T)) 

#Summary of the original dataset
summary(datam)

##       year      schools_offering      Total        yield_per_teacher
##  Min.   :1999   Min.   :    4.0   Min.   :    20   Min.   : 3.00    
##  1st Qu.:2007   1st Qu.:   26.0   1st Qu.:   213   1st Qu.: 8.00    
##  Median :2009   Median :   81.5   Median :   696   Median :11.00    
##  Mean   :2009   Mean   :  742.1   Mean   : 13174   Mean   :10.99    
##  3rd Qu.:2011   3rd Qu.:  127.0   3rd Qu.:  1460   3rd Qu.:14.00    
##  Max.   :2013   Max.   :11963.0   Max.   :270721   Max.   :24.00    
##                 NA's   :41                         NA's   :41       
##  diff.from.prev.year percentage_increase number.passed   
##  Min.   : -144.0     Min.   :-41.4634    Min.   :    14  
##  1st Qu.:    2.0     1st Qu.:  0.4726    1st Qu.:   131  
##  Median :   58.0     Median :  9.2511    Median :   402  
##  Mean   :  760.8     Mean   : 14.8554    Mean   :  7747  
##  3rd Qu.:  211.0     3rd Qu.: 26.2911    3rd Qu.:   874  
##  Max.   :14558.0     Max.   :333.3333    Max.   :159524  
##  NA's   :20          NA's   :20                          
##  percentage_passed     female       female_passed percentage_female_passed
##  Min.   :26.51     Min.   :     1   4      :  6   66.66666667:  4         
##  1st Qu.:55.00     1st Qu.:    37   21     :  5   *          :  3         
##  Median :60.89     Median :   110   17     :  4   15.38461538:  3         
##  Mean   :59.81     Mean   :  5880   67     :  4   14.28571429:  2         
##  3rd Qu.:66.86     3rd Qu.:   273   *      :  3   45.45454545:  2         
##  Max.   :90.00     Max.   :130268   10     :  3   56         :  2         
##                                     (Other):136   (Other)    :145         
##  percentage_female Number.of.Black   Number.Black.Passed
##  Min.   : 5.00     Min.   :    0.0   *      :23         
##  1st Qu.:15.00     1st Qu.:    8.0   3      :10         
##  Median :18.00     Median :   33.0   6      :10         
##  Mean   :19.63     Mean   :  660.6   0      : 7         
##  3rd Qu.:21.00     3rd Qu.:   77.0   2      : 7         
##  Max.   :49.00     Max.   :14908.0   13     : 6         
##                                      (Other):98         
##  percentage_Black_passed Percentage.Black Number.Black.Males
##  *          : 23         Min.   : 0.000   Min.   :   0.0    
##  0          :  7         1st Qu.: 3.030   1st Qu.:   6.0    
##  33.33333333:  4         Median : 4.573   Median :  25.0    
##  40         :  3         Mean   : 5.679   Mean   : 306.6    
##  42.85714286:  3         3rd Qu.: 7.733   3rd Qu.:  54.0    
##  10.71428571:  2         Max.   :28.546   Max.   :6708.0    
##  (Other)    :119                                            
##  Number.Black.Males.passed Percentage.Black.Male.passed
##  *      :24                *      : 24                 
##  0      :11                0      : 11                 
##  5      :11                25     :  6                 
##  3      : 8                50     :  6                 
##  1      : 7                20     :  4                 
##  12     : 7                60     :  4                 
##  (Other):93                (Other):106                 
##  Number.Black.Females Number.Black.Females.passed
##  Min.   :   0.0       *      :35                 
##  1st Qu.:   2.0       0      :28                 
##  Median :   8.0       1      :22                 
##  Mean   : 354.3       2      :21                 
##  3rd Qu.:  22.0       3      :13                 
##  Max.   :8200.0       4      : 7                 
##                       (Other):35                 
##  Percentage_Black_Females_passed Number.Hispanic Number.Hispanic.passed
##  *          :35                  Min.   :    0   *      : 39           
##  0          :28                  1st Qu.:    8   0      :  7           
##  14.28571429: 5                  Median :   32   3*     :  6           
##  16.66666667: 4                  Mean   : 1320   2      :  3           
##  33.33333333: 4                  3rd Qu.:  150   24     :  3           
##  11.11111111: 3                  Max.   :34726   25     :  3           
##  (Other)    :82                                  (Other):100           
##  Percentage.Hispanic.passed Number.Hispanic.Females
##  *      : 41                Min.   :    0.0        
##  0      :  7                1st Qu.:    1.0        
##  25*    :  3                Median :    6.0        
##  15.38* :  2                Mean   :  612.4        
##  21.43* :  2                3rd Qu.:   23.0        
##  23.08* :  2                Max.   :16847.0        
##  (Other):104                                       
##  Number.Hispanic.Females.passed Percentage.H.females.passed
##  *      :51                     *      :52                 
##  0      :35                     0      :35                 
##  2*     : 8                     12.5*  : 3                 
##  3*     : 6                     14.29* : 2                 
##  5*     : 6                     16.67* : 2                 
##  1*     : 5                     20*    : 2                 
##  (Other):50                     (Other):65                 
##  Percentage.Hispanic Number.Hispanic.males Number.Hispanic.males.passed
##  Min.   : 0.000             :145           Min.   : 319.0              
##  1st Qu.: 3.081      1168   :  1           1st Qu.: 407.0              
##  Median : 5.248      1388   :  1           Median : 649.0              
##  Mean   : 6.300      1488   :  1           Mean   : 721.1              
##  3rd Qu.: 8.073      1602   :  1           3rd Qu.:1017.0              
##  Max.   :21.387      1618   :  1           Max.   :1187.0              
##                      (Other): 11           NA's   :148                 
##        X          Number.white       Number.white.passed
##  Min.   :37.87   Min.   :    46.47   Min.   :   104     
##  1st Qu.:40.42   1st Qu.:   308.00   1st Qu.:   397     
##  Median :44.35   Median :  8516.00   Median :  5224     
##  Mean   :49.92   Mean   : 42888.36   Mean   : 27159     
##  3rd Qu.:62.05   3rd Qu.:124190.00   3rd Qu.: 80194     
##  Max.   :64.20   Max.   :162844.00   Max.   :103588     
##  NA's   :148     NA's   :132         NA's   :132        
##  Percentage.white.pass Number.white.males Number.white.females
##  Min.   : 45.22        Min.   :   60.47   Min.   :  26.29     
##  1st Qu.: 60.14        1st Qu.:   70.27   1st Qu.:  27.67     
##  Median : 63.40        Median : 7333.00   Median :1183.00     
##  Mean   :162.96        Mean   : 5655.88   Mean   : 977.72     
##  3rd Qu.: 66.79        3rd Qu.: 8938.00   3rd Qu.:1618.00     
##  Max.   :794.00        Max.   :13711.00   Max.   :2348.00     
##  NA's   :132           NA's   :148        NA's   :148         
##   Number.asian   Number.asian.passed Percentage.asian.passed
##  Min.   :   76   Min.   :   41.0     Min.   :50.00          
##  1st Qu.:  236   1st Qu.:  119.8     1st Qu.:55.54          
##  Median : 4555   Median : 3135.0     Median :63.12          
##  Mean   :14070   Mean   : 9206.8     Mean   :61.80          
##  3rd Qu.:32228   3rd Qu.:20712.0     3rd Qu.:66.51          
##  Max.   :46536   Max.   :30993.0     Max.   :70.65          
##  NA's   :137     NA's   :137         NA's   :137            
##  Number.Asian.Male Number.Asian.Female  Number.male     Number.male.pass
##  Min.   :2178      Min.   : 668.0      Min.   : 24070   Min.   :16355   
##  1st Qu.:2641      1st Qu.: 855.5      1st Qu.: 53166   1st Qu.:33932   
##  Median :3402      Median :1152.5      Median : 82262   Median :51508   
##  Mean   :3776      Mean   :1235.4      Mean   : 82262   Mean   :51508   
##  3rd Qu.:4627      3rd Qu.:1525.5      3rd Qu.:111357   3rd Qu.:69085   
##  Max.   :6403      Max.   :2072.0      Max.   :140453   Max.   :86661   
##  NA's   :153       NA's   :153         NA's   :159      NA's   :159     
##  male.pass.rate 
##  Min.   :61.70  
##  1st Qu.:63.26  
##  Median :64.82  
##  Mean   :64.82  
##  3rd Qu.:66.39  
##  Max.   :67.95  
##  NA's   :159

#Factors 
str(datam)

## 'data.frame':    161 obs. of  45 variables:
##  $ year                           : int  2006 2007 2008 2009 2010 2011 2012 2013 2006 2007 ...
##  $ schools_offering               : int  NA NA 1778 1879 2048 1972 2103 2253 NA NA ...
##  $ Total                          : int  14108 14529 15014 16061 19390 21139 24782 29555 190954 204546 ...
##  $ yield_per_teacher              : int  NA NA 8 9 9 11 12 13 NA NA ...
##  $ diff.from.prev.year            : int  NA 421 485 1047 3329 1749 3643 4773 NA 13592 ...
##  $ percentage_increase            : num  NA 2.98 3.34 6.97 20.73 ...
##  $ number.passed                  : int  8227 8184 8537 9925 12550 13463 15678 19760 116498 119388 ...
##  $ percentage_passed              : num  58.3 56.3 56.9 61.8 64.7 ...
##  $ female                         : int  2495 2665 2789 3096 3726 4000 4635 5485 92469 99538 ...
##  $ female_passed                  : Factor w/ 111 levels "*","0","1","10",..: 13 14 19 27 37 39 45 56 78 79 ...
##  $ percentage_female_passed       : Factor w/ 148 levels "*","0","100",..: 57 46 66 82 103 94 87 113 95 74 ...
##  $ percentage_female              : int  18 18 19 19 19 19 19 19 48 49 ...
##  $ Number.of.Black                : int  668 640 673 778 825 893 1014 1090 8453 9329 ...
##  $ Number.Black.Passed            : Factor w/ 54 levels "*","0","1","10",..: 11 7 13 15 21 30 28 38 26 24 ...
##  $ percentage_Black_passed        : Factor w/ 117 levels "*","0","10","10.14492754",..: 38 28 48 37 54 71 56 83 70 59 ...
##  $ Percentage.Black               : num  4.73 4.4 4.48 4.84 4.25 ...
##  $ Number.Black.Males             : int  468 437 454 538 577 671 762 826 3516 3771 ...
##  $ Number.Black.Males.passed      : Factor w/ 48 levels "*","0","1","10",..: 11 5 9 14 22 33 31 37 8 10 ...
##  $ Percentage.Black.Male.passed   : Factor w/ 98 levels "*","0","11.36363636",..: 35 28 36 34 48 64 50 71 60 57 ...
##  $ Number.Black.Females           : int  200 203 219 240 248 222 252 264 4937 5558 ...
##  $ Number.Black.Females.passed    : Factor w/ 31 levels "*","0","1","10",..: 18 17 23 20 26 22 24 29 8 6 ...
##  $ Percentage_Black_Females_passed: Factor w/ 69 levels "*","0","1.369863014",..: 14 10 26 15 27 22 20 37 44 32 ...
##  $ Number.Hispanic                : int  1079 1077 1142 1208 1466 1752 1919 2408 15040 17329 ...
##  $ Number.Hispanic.passed         : Factor w/ 85 levels "*","0","1","1*",..: 51 50 53 58 70 72 77 9 67 71 ...
##  $ Percentage.Hispanic.passed     : Factor w/ 105 levels "*","0","100",..: 45 40 39 49 71 48 64 89 57 43 ...
##  $ Number.Hispanic.Females        : int  220 250 223 281 298 364 431 488 7535 8743 ...
##  $ Number.Hispanic.Females.passed : Factor w/ 42 levels "*","0","1","1*",..: 35 35 33 29 42 41 7 9 15 17 ...
##  $ Percentage.H.females.passed    : Factor w/ 67 levels "*","0","12.07*",..: 32 25 22 7 45 30 33 47 53 44 ...
##  $ Percentage.Hispanic            : num  7.65 7.41 7.61 7.52 7.56 ...
##  $ Number.Hispanic.males          : Factor w/ 17 levels "","1168","1388",..: 12 11 13 14 2 3 4 10 1 1 ...
##  $ Number.Hispanic.males.passed   : int  331 319 348 407 518 561 649 932 NA NA ...
##  $ X                              : num  38.5 38.6 37.9 43.9 44.3 ...
##  $ Number.white                   : num  8240 8516 8757 9064 10556 ...
##  $ Number.white.passed            : int  5224 5191 5359 6054 7237 7633 8846 11119 80194 83540 ...
##  $ Percentage.white.pass          : num  63.4 61 61.2 66.8 68.6 ...
##  $ Number.white.males             : num  7081 7333 7453 7725 8938 ...
##  $ Number.white.females           : num  1159 1183 1304 1339 1618 ...
##  $ Number.asian                   : int  2846 3319 3556 3979 5131 5914 6869 8475 29049 31922 ...
##  $ Number.asian.passed            : int  1733 1984 2169 2645 3625 4109 4802 6178 19281 20302 ...
##  $ Percentage.asian.passed        : num  60.9 59.8 61 66.5 70.6 ...
##  $ Number.Asian.Male              : int  2178 2474 2697 2963 3842 4430 5219 6403 NA NA ...
##  $ Number.Asian.Female            : int  668 845 859 1016 1289 1484 1650 2072 NA NA ...
##  $ Number.male                    : int  NA NA NA NA NA NA NA 24070 NA NA ...
##  $ Number.male.pass               : int  NA NA NA NA NA NA NA 16355 NA NA ...
##  $ male.pass.rate                 : num  NA NA NA NA NA ...

# Plots to analyze the data and the overall location statistics

boxplot(datam$percentage_passed~datam$percentage_female,data=datam, xlab="Percentage of female students passing", ylab="Total percentage of passing students")
title("Boxplot")

hist(datam$percentage_female, xlim=c(0,50), xlab="Percentage of all students",ylab = "Percentage of female students passing CS exam")

The boxplots do indicate the variability in the response variables for various levels of the factor under study. From the histogram no reasonable conclusion about the undelying distribution of the population can be made.

Initial Analysis - Testing and Diagnostics check

We perform ANOVA on the original dataset and also estimate parameters to test for normality.

# Perform ANOVA

datam$percentage_female<-as.factor(datam$percentage_female)

model=lm(percentage_passed~percentage_female,data=datam)
anova(model)

## Analysis of Variance Table
## 
## Response: percentage_passed
##                    Df  Sum Sq Mean Sq F value Pr(>F)
## percentage_female  24  3869.4  161.22  1.2061 0.2475
## Residuals         136 18178.9  133.67

# Parameter estimation and diagnostics check

shapiro.test(datam$percentage_passed)

## 
##  Shapiro-Wilk normality test
## 
## data:  datam$percentage_passed
## W = 0.9792, p-value = 0.0159

#Plots for checking normality

qqnorm(residuals(model)) 
qqline(residuals(model))

plot(fitted(model),residuals(model))

The results from the ANOVA indicate that the p-value is much higher than .05. So we fail to reject the null hypothesis therefore there is no effect of the percentage of female students passing the CS exam on the total percentage of passing students.

The results of ANOVA are based on the assumption that the data is normally distributed. From the Shapiro-Wilk test we get a p-value of 0.0159 which indicates that the data can be normally distributed. From the normal q-q plots we can infer that the data is normal for most observations but not for all observations. There is a marked deviation from normality for many values.In conclusion the data does not confirm to the assumption of normality.

There is no trend in the scatter plot within a certain range (values 55-65). However, there is a cluster at the lower level of the dynamic range.Therefore our model cannot be considered a good fit.

Resampling

When the resampling is done from a known theoretical distribution it is known as a “Monte Carlo” simulation.From our initial analysis we are unable to establish any valid confirmation on the underlying distribution of the given data set. Therefore we discard this strategy. Also, the permutaion strategy is more effective if the sample size is exceptionally small given the possibility of forming all possible permutations.So the only effective and valid strategy for our case is ‘bootstrapping’.Essentially we will be sampling from a larger parent distribution, but we know nothing about it for certain. The sample that we have from the original dataset is the only information we have about the parent distribution. Therefore, we will calculate a sampling distribution for the test statistic by resampling from our mini-population WITH replacement.

dataf<-(read.csv(file.choose(), header=T))
#dataf<-subset(datam, percentage_female >= 14 | percentage_female < 21)
with(dataf,tapply(percentage_passed,percentage_female,mean))

##       15       16       17       18       19       20 
## 61.97876 57.62949 58.66538 54.72223 62.32898 63.61162

with(dataf,tapply(percentage_passed,percentage_female,var))

##        15        16        17        18        19        20 
## 188.55096  43.32771 105.98120 161.20022  86.00826  87.97422

with(dataf,tapply(percentage_passed,percentage_female,length))

## 15 16 17 18 19 20 
## 16 10 22 15 15  9

dataf$percentage_female <- as.factor(dataf$percentage_female)

summary(aov(dataf$percentage_passed~dataf$percentage_female,data=dataf))

##                         Df Sum Sq Mean Sq F value Pr(>F)
## dataf$percentage_female  5    762   152.4   1.285  0.279
## Residuals               81   9609   118.6

meanstar = with(dataf,tapply(percentage_passed,percentage_female,mean))
# now the bootstrap version of ANOVA
grpA = dataf$percentage_passed[dataf$percentage_female=='15'] - meanstar[1]
grpB = dataf$percentage_passed[dataf$percentage_female=='16'] - meanstar[2]
grpC = dataf$percentage_passed[dataf$percentage_female=='17'] - meanstar[3]
grpD = dataf$percentage_passed[dataf$percentage_female=='18'] - meanstar[4]
grpE = dataf$percentage_passed[dataf$percentage_female=='19'] - meanstar[5]
grpF = dataf$percentage_passed[dataf$percentage_female=='20'] - meanstar[6]
simfem = dataf$percentage_female
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=16, replace=T)
  groupB = sample(grpB, size=10, replace=T)
  groupC = sample(grpC, size=22, replace=T)
  groupD = sample(grpD, size=15, replace=T)
  groupE = sample(grpE, size=15, replace=T)
  groupF = sample(grpF, size=9, replace=T)
  simpass = c(groupA,groupB,groupC,groupD,groupE,groupF)
  simdata = data.frame(simpass,simfem)
  Fstar[i] = oneway.test(simpass~simfem, var.equal=T, data=simdata)$statistic
}
# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution
hist(Fstar, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.05,6,.25)
points(x,y=df(x,5,81),type="b",col="red")  # The F distribution

# Here is the alpha level from the bootstrapped distribution
print(realFstar<-oneway.test(percentage_passed~percentage_female, var.equal=T, data=dataf)$statistic)

##        F 
## 1.284714

mean(Fstar>=realFstar)

## [1] 0.2829

par(mfrow=c(1,2))

hist(Fstar,prob=TRUE,main="PDF of Theoretical F-distribution")
x = seq(.25,6,.25)
points(x,y=df(x,5,81), type = "b", col = "red")


# Generate quantiles of the analytic F distribution

qf(.05,5,81)

## [1] 0.2265328

qf(.95,5,81)

## [1] 2.327269

# Confidence interval estimation

quantile(Fstar,c(0.05,0.95))

##        5%       95% 
## 0.2156916 2.3522846

#Estimate standard error -(The bootstrap standard error of the statistic is obtained as the standard deviation of the bootstrap replicate estimates)

sd(Fstar)

## [1] 0.6890754

#Bias -(Bootstrap is often used to determine the bias of an estimate)

mean(Fstar)-mean(realFstar)

## [1] -0.2571779

Analysis: As can be seen from the two plots above, the theoretical distribution and the actual distribution are not significantly different from each other.Further, we perform parameter estimation like finding confidence intervals, standard error and bias in the bootstrapped version of the sample.A large number of bootstrap replicate estimates is required for an accurate confidence interval which might be one of the reasons behind the few differences we see.

Anova on the new model

We perform analysis of variance on the data generated through bootstrap resampling and again test the null Hypothesis.

newmodel=lm(simpass~simfem,data=simdata)
anova(newmodel)

## Analysis of Variance Table
## 
## Response: simpass
##           Df  Sum Sq Mean Sq F value Pr(>F)
## simfem     5   568.4  113.68   0.845 0.5219
## Residuals 81 10897.3  134.53

As is evident from the p-value (much higher than 0.05), we fail to reject the null hypothesis i.e. the variability in the percentage of female students passing the CS exam has no effect on the variability in the total percentage passing the exam (i.e. it is due to randomization only). However, to establish the validity of this result we again perform diagnostics check on this data.

#Diagnostics and Model Adequacy check

shapiro.test(simdata$simpass)

## 
##  Shapiro-Wilk normality test
## 
## data:  simdata$simpass
## W = 0.9593, p-value = 0.007882

par(mfrow=c(2,2))
plot(newmodel)

The Shapiro test results in a p-value which is much smaller than 0.1, therefore this data can be assumed to be normal. The normal Q-Q plot also depicts that majority of the values are normal. However, the fitted values show trend over the dynamic range.The scatter plots of the residuals and the fitted model will check the distribution of the model over the entire range.It is difficult to conclude anything substantive from the scatter plots since we have equal number of points on both sides of the mean (point zero) but these appear only for specific values in the overall range.

References

Data was compiled by Barbara Ericson (ericson@cc.gatech.edu) from data from the College Board for each year- http://research.collegeboard.org/programs/ap/data/participation/2013
State population taken from 2011 data at http://www.infoplease.com/ipa/A0004986.html
% black population see http://en.wikipedia.org/wiki/List_of_U.S._states_by_African-American_population (Georgia is 30% and Maryland is 29.44%) - based on 2010 census data
% hispanic/latino from http://www.hacu.net/images/hacu/OPAI/2012_Virtual_Binder/2010%20census%20brief%20-%20hispanic%20population.pdf

Recipe 7:ANOVA with Bootstrapping

Uzma Mushtaque

November 6, 2014