The dataset used for this experiment gives the performance of various minorities on a CS exam.We test the effect (effects model) of the percentage of female students passing the exam on the total pass percentage of students, therefore the null hypothesis for this experiment can be stated as:
The variability in the total pass percentage of students cannot be explained by the variability in the percentage of female students passing the exam.In short there is no effect of female students passing the exam on the total number of students passing the exam.
We conduct a single factor-6 level experiment to analyze the effect of the variability in the percentage of female students passing the exam on the variability in the percentage of total students (minorities) passing the CS-A exam in the given years (2008-2013).We conduct an exploratory data analysis on the original data and also perform relevant diagnostics tests for model adequacy checking.So the factor in this experiment is the percentage of female students passing the exam and the response variable is the total percentage of passing students each year.
Based on our initial analysis, we select our resampling strategy out of the following three:
Here is a summary of the original dataset.There are 161 obs. of 45 variables in the original data set.
Data Type: multivariate
#Read the data from file
datam<-(read.csv(file.choose(), header=T))
#Summary of the original dataset
summary(datam)
## year schools_offering Total yield_per_teacher
## Min. :1999 Min. : 4.0 Min. : 20 Min. : 3.00
## 1st Qu.:2007 1st Qu.: 26.0 1st Qu.: 213 1st Qu.: 8.00
## Median :2009 Median : 81.5 Median : 696 Median :11.00
## Mean :2009 Mean : 742.1 Mean : 13174 Mean :10.99
## 3rd Qu.:2011 3rd Qu.: 127.0 3rd Qu.: 1460 3rd Qu.:14.00
## Max. :2013 Max. :11963.0 Max. :270721 Max. :24.00
## NA's :41 NA's :41
## diff.from.prev.year percentage_increase number.passed
## Min. : -144.0 Min. :-41.4634 Min. : 14
## 1st Qu.: 2.0 1st Qu.: 0.4726 1st Qu.: 131
## Median : 58.0 Median : 9.2511 Median : 402
## Mean : 760.8 Mean : 14.8554 Mean : 7747
## 3rd Qu.: 211.0 3rd Qu.: 26.2911 3rd Qu.: 874
## Max. :14558.0 Max. :333.3333 Max. :159524
## NA's :20 NA's :20
## percentage_passed female female_passed percentage_female_passed
## Min. :26.51 Min. : 1 4 : 6 66.66666667: 4
## 1st Qu.:55.00 1st Qu.: 37 21 : 5 * : 3
## Median :60.89 Median : 110 17 : 4 15.38461538: 3
## Mean :59.81 Mean : 5880 67 : 4 14.28571429: 2
## 3rd Qu.:66.86 3rd Qu.: 273 * : 3 45.45454545: 2
## Max. :90.00 Max. :130268 10 : 3 56 : 2
## (Other):136 (Other) :145
## percentage_female Number.of.Black Number.Black.Passed
## Min. : 5.00 Min. : 0.0 * :23
## 1st Qu.:15.00 1st Qu.: 8.0 3 :10
## Median :18.00 Median : 33.0 6 :10
## Mean :19.63 Mean : 660.6 0 : 7
## 3rd Qu.:21.00 3rd Qu.: 77.0 2 : 7
## Max. :49.00 Max. :14908.0 13 : 6
## (Other):98
## percentage_Black_passed Percentage.Black Number.Black.Males
## * : 23 Min. : 0.000 Min. : 0.0
## 0 : 7 1st Qu.: 3.030 1st Qu.: 6.0
## 33.33333333: 4 Median : 4.573 Median : 25.0
## 40 : 3 Mean : 5.679 Mean : 306.6
## 42.85714286: 3 3rd Qu.: 7.733 3rd Qu.: 54.0
## 10.71428571: 2 Max. :28.546 Max. :6708.0
## (Other) :119
## Number.Black.Males.passed Percentage.Black.Male.passed
## * :24 * : 24
## 0 :11 0 : 11
## 5 :11 25 : 6
## 3 : 8 50 : 6
## 1 : 7 20 : 4
## 12 : 7 60 : 4
## (Other):93 (Other):106
## Number.Black.Females Number.Black.Females.passed
## Min. : 0.0 * :35
## 1st Qu.: 2.0 0 :28
## Median : 8.0 1 :22
## Mean : 354.3 2 :21
## 3rd Qu.: 22.0 3 :13
## Max. :8200.0 4 : 7
## (Other):35
## Percentage_Black_Females_passed Number.Hispanic Number.Hispanic.passed
## * :35 Min. : 0 * : 39
## 0 :28 1st Qu.: 8 0 : 7
## 14.28571429: 5 Median : 32 3* : 6
## 16.66666667: 4 Mean : 1320 2 : 3
## 33.33333333: 4 3rd Qu.: 150 24 : 3
## 11.11111111: 3 Max. :34726 25 : 3
## (Other) :82 (Other):100
## Percentage.Hispanic.passed Number.Hispanic.Females
## * : 41 Min. : 0.0
## 0 : 7 1st Qu.: 1.0
## 25* : 3 Median : 6.0
## 15.38* : 2 Mean : 612.4
## 21.43* : 2 3rd Qu.: 23.0
## 23.08* : 2 Max. :16847.0
## (Other):104
## Number.Hispanic.Females.passed Percentage.H.females.passed
## * :51 * :52
## 0 :35 0 :35
## 2* : 8 12.5* : 3
## 3* : 6 14.29* : 2
## 5* : 6 16.67* : 2
## 1* : 5 20* : 2
## (Other):50 (Other):65
## Percentage.Hispanic Number.Hispanic.males Number.Hispanic.males.passed
## Min. : 0.000 :145 Min. : 319.0
## 1st Qu.: 3.081 1168 : 1 1st Qu.: 407.0
## Median : 5.248 1388 : 1 Median : 649.0
## Mean : 6.300 1488 : 1 Mean : 721.1
## 3rd Qu.: 8.073 1602 : 1 3rd Qu.:1017.0
## Max. :21.387 1618 : 1 Max. :1187.0
## (Other): 11 NA's :148
## X Number.white Number.white.passed
## Min. :37.87 Min. : 46.47 Min. : 104
## 1st Qu.:40.42 1st Qu.: 308.00 1st Qu.: 397
## Median :44.35 Median : 8516.00 Median : 5224
## Mean :49.92 Mean : 42888.36 Mean : 27159
## 3rd Qu.:62.05 3rd Qu.:124190.00 3rd Qu.: 80194
## Max. :64.20 Max. :162844.00 Max. :103588
## NA's :148 NA's :132 NA's :132
## Percentage.white.pass Number.white.males Number.white.females
## Min. : 45.22 Min. : 60.47 Min. : 26.29
## 1st Qu.: 60.14 1st Qu.: 70.27 1st Qu.: 27.67
## Median : 63.40 Median : 7333.00 Median :1183.00
## Mean :162.96 Mean : 5655.88 Mean : 977.72
## 3rd Qu.: 66.79 3rd Qu.: 8938.00 3rd Qu.:1618.00
## Max. :794.00 Max. :13711.00 Max. :2348.00
## NA's :132 NA's :148 NA's :148
## Number.asian Number.asian.passed Percentage.asian.passed
## Min. : 76 Min. : 41.0 Min. :50.00
## 1st Qu.: 236 1st Qu.: 119.8 1st Qu.:55.54
## Median : 4555 Median : 3135.0 Median :63.12
## Mean :14070 Mean : 9206.8 Mean :61.80
## 3rd Qu.:32228 3rd Qu.:20712.0 3rd Qu.:66.51
## Max. :46536 Max. :30993.0 Max. :70.65
## NA's :137 NA's :137 NA's :137
## Number.Asian.Male Number.Asian.Female Number.male Number.male.pass
## Min. :2178 Min. : 668.0 Min. : 24070 Min. :16355
## 1st Qu.:2641 1st Qu.: 855.5 1st Qu.: 53166 1st Qu.:33932
## Median :3402 Median :1152.5 Median : 82262 Median :51508
## Mean :3776 Mean :1235.4 Mean : 82262 Mean :51508
## 3rd Qu.:4627 3rd Qu.:1525.5 3rd Qu.:111357 3rd Qu.:69085
## Max. :6403 Max. :2072.0 Max. :140453 Max. :86661
## NA's :153 NA's :153 NA's :159 NA's :159
## male.pass.rate
## Min. :61.70
## 1st Qu.:63.26
## Median :64.82
## Mean :64.82
## 3rd Qu.:66.39
## Max. :67.95
## NA's :159
#Factors
str(datam)
## 'data.frame': 161 obs. of 45 variables:
## $ year : int 2006 2007 2008 2009 2010 2011 2012 2013 2006 2007 ...
## $ schools_offering : int NA NA 1778 1879 2048 1972 2103 2253 NA NA ...
## $ Total : int 14108 14529 15014 16061 19390 21139 24782 29555 190954 204546 ...
## $ yield_per_teacher : int NA NA 8 9 9 11 12 13 NA NA ...
## $ diff.from.prev.year : int NA 421 485 1047 3329 1749 3643 4773 NA 13592 ...
## $ percentage_increase : num NA 2.98 3.34 6.97 20.73 ...
## $ number.passed : int 8227 8184 8537 9925 12550 13463 15678 19760 116498 119388 ...
## $ percentage_passed : num 58.3 56.3 56.9 61.8 64.7 ...
## $ female : int 2495 2665 2789 3096 3726 4000 4635 5485 92469 99538 ...
## $ female_passed : Factor w/ 111 levels "*","0","1","10",..: 13 14 19 27 37 39 45 56 78 79 ...
## $ percentage_female_passed : Factor w/ 148 levels "*","0","100",..: 57 46 66 82 103 94 87 113 95 74 ...
## $ percentage_female : int 18 18 19 19 19 19 19 19 48 49 ...
## $ Number.of.Black : int 668 640 673 778 825 893 1014 1090 8453 9329 ...
## $ Number.Black.Passed : Factor w/ 54 levels "*","0","1","10",..: 11 7 13 15 21 30 28 38 26 24 ...
## $ percentage_Black_passed : Factor w/ 117 levels "*","0","10","10.14492754",..: 38 28 48 37 54 71 56 83 70 59 ...
## $ Percentage.Black : num 4.73 4.4 4.48 4.84 4.25 ...
## $ Number.Black.Males : int 468 437 454 538 577 671 762 826 3516 3771 ...
## $ Number.Black.Males.passed : Factor w/ 48 levels "*","0","1","10",..: 11 5 9 14 22 33 31 37 8 10 ...
## $ Percentage.Black.Male.passed : Factor w/ 98 levels "*","0","11.36363636",..: 35 28 36 34 48 64 50 71 60 57 ...
## $ Number.Black.Females : int 200 203 219 240 248 222 252 264 4937 5558 ...
## $ Number.Black.Females.passed : Factor w/ 31 levels "*","0","1","10",..: 18 17 23 20 26 22 24 29 8 6 ...
## $ Percentage_Black_Females_passed: Factor w/ 69 levels "*","0","1.369863014",..: 14 10 26 15 27 22 20 37 44 32 ...
## $ Number.Hispanic : int 1079 1077 1142 1208 1466 1752 1919 2408 15040 17329 ...
## $ Number.Hispanic.passed : Factor w/ 85 levels "*","0","1","1*",..: 51 50 53 58 70 72 77 9 67 71 ...
## $ Percentage.Hispanic.passed : Factor w/ 105 levels "*","0","100",..: 45 40 39 49 71 48 64 89 57 43 ...
## $ Number.Hispanic.Females : int 220 250 223 281 298 364 431 488 7535 8743 ...
## $ Number.Hispanic.Females.passed : Factor w/ 42 levels "*","0","1","1*",..: 35 35 33 29 42 41 7 9 15 17 ...
## $ Percentage.H.females.passed : Factor w/ 67 levels "*","0","12.07*",..: 32 25 22 7 45 30 33 47 53 44 ...
## $ Percentage.Hispanic : num 7.65 7.41 7.61 7.52 7.56 ...
## $ Number.Hispanic.males : Factor w/ 17 levels "","1168","1388",..: 12 11 13 14 2 3 4 10 1 1 ...
## $ Number.Hispanic.males.passed : int 331 319 348 407 518 561 649 932 NA NA ...
## $ X : num 38.5 38.6 37.9 43.9 44.3 ...
## $ Number.white : num 8240 8516 8757 9064 10556 ...
## $ Number.white.passed : int 5224 5191 5359 6054 7237 7633 8846 11119 80194 83540 ...
## $ Percentage.white.pass : num 63.4 61 61.2 66.8 68.6 ...
## $ Number.white.males : num 7081 7333 7453 7725 8938 ...
## $ Number.white.females : num 1159 1183 1304 1339 1618 ...
## $ Number.asian : int 2846 3319 3556 3979 5131 5914 6869 8475 29049 31922 ...
## $ Number.asian.passed : int 1733 1984 2169 2645 3625 4109 4802 6178 19281 20302 ...
## $ Percentage.asian.passed : num 60.9 59.8 61 66.5 70.6 ...
## $ Number.Asian.Male : int 2178 2474 2697 2963 3842 4430 5219 6403 NA NA ...
## $ Number.Asian.Female : int 668 845 859 1016 1289 1484 1650 2072 NA NA ...
## $ Number.male : int NA NA NA NA NA NA NA 24070 NA NA ...
## $ Number.male.pass : int NA NA NA NA NA NA NA 16355 NA NA ...
## $ male.pass.rate : num NA NA NA NA NA ...
# Plots to analyze the data and the overall location statistics
boxplot(datam$percentage_passed~datam$percentage_female,data=datam, xlab="Percentage of female students passing", ylab="Total percentage of passing students")
title("Boxplot")
hist(datam$percentage_female, xlim=c(0,50), xlab="Percentage of all students",ylab = "Percentage of female students passing CS exam")
The boxplots do indicate the variability in the response variables for various levels of the factor under study. From the histogram no reasonable conclusion about the undelying distribution of the population can be made.
We perform ANOVA on the original dataset and also estimate parameters to test for normality.
# Perform ANOVA
datam$percentage_female<-as.factor(datam$percentage_female)
model=lm(percentage_passed~percentage_female,data=datam)
anova(model)
## Analysis of Variance Table
##
## Response: percentage_passed
## Df Sum Sq Mean Sq F value Pr(>F)
## percentage_female 24 3869.4 161.22 1.2061 0.2475
## Residuals 136 18178.9 133.67
# Parameter estimation and diagnostics check
shapiro.test(datam$percentage_passed)
##
## Shapiro-Wilk normality test
##
## data: datam$percentage_passed
## W = 0.9792, p-value = 0.0159
#Plots for checking normality
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))
The results from the ANOVA indicate that the p-value is much higher than .05. So we fail to reject the null hypothesis therefore there is no effect of the percentage of female students passing the CS exam on the total percentage of passing students.
The results of ANOVA are based on the assumption that the data is normally distributed. From the Shapiro-Wilk test we get a p-value of 0.0159 which indicates that the data can be normally distributed. From the normal q-q plots we can infer that the data is normal for most observations but not for all observations. There is a marked deviation from normality for many values.In conclusion the data does not confirm to the assumption of normality.
There is no trend in the scatter plot within a certain range (values 55-65). However, there is a cluster at the lower level of the dynamic range.Therefore our model cannot be considered a good fit.
When the resampling is done from a known theoretical distribution it is known as a “Monte Carlo” simulation.From our initial analysis we are unable to establish any valid confirmation on the underlying distribution of the given data set. Therefore we discard this strategy. Also, the permutaion strategy is more effective if the sample size is exceptionally small given the possibility of forming all possible permutations.So the only effective and valid strategy for our case is ‘bootstrapping’.Essentially we will be sampling from a larger parent distribution, but we know nothing about it for certain. The sample that we have from the original dataset is the only information we have about the parent distribution. Therefore, we will calculate a sampling distribution for the test statistic by resampling from our mini-population WITH replacement.
dataf<-(read.csv(file.choose(), header=T))
#dataf<-subset(datam, percentage_female >= 14 | percentage_female < 21)
with(dataf,tapply(percentage_passed,percentage_female,mean))
## 15 16 17 18 19 20
## 61.97876 57.62949 58.66538 54.72223 62.32898 63.61162
with(dataf,tapply(percentage_passed,percentage_female,var))
## 15 16 17 18 19 20
## 188.55096 43.32771 105.98120 161.20022 86.00826 87.97422
with(dataf,tapply(percentage_passed,percentage_female,length))
## 15 16 17 18 19 20
## 16 10 22 15 15 9
dataf$percentage_female <- as.factor(dataf$percentage_female)
summary(aov(dataf$percentage_passed~dataf$percentage_female,data=dataf))
## Df Sum Sq Mean Sq F value Pr(>F)
## dataf$percentage_female 5 762 152.4 1.285 0.279
## Residuals 81 9609 118.6
meanstar = with(dataf,tapply(percentage_passed,percentage_female,mean))
# now the bootstrap version of ANOVA
grpA = dataf$percentage_passed[dataf$percentage_female=='15'] - meanstar[1]
grpB = dataf$percentage_passed[dataf$percentage_female=='16'] - meanstar[2]
grpC = dataf$percentage_passed[dataf$percentage_female=='17'] - meanstar[3]
grpD = dataf$percentage_passed[dataf$percentage_female=='18'] - meanstar[4]
grpE = dataf$percentage_passed[dataf$percentage_female=='19'] - meanstar[5]
grpF = dataf$percentage_passed[dataf$percentage_female=='20'] - meanstar[6]
simfem = dataf$percentage_female
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=16, replace=T)
groupB = sample(grpB, size=10, replace=T)
groupC = sample(grpC, size=22, replace=T)
groupD = sample(grpD, size=15, replace=T)
groupE = sample(grpE, size=15, replace=T)
groupF = sample(grpF, size=9, replace=T)
simpass = c(groupA,groupB,groupC,groupD,groupE,groupF)
simdata = data.frame(simpass,simfem)
Fstar[i] = oneway.test(simpass~simfem, var.equal=T, data=simdata)$statistic
}
# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution
hist(Fstar, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.05,6,.25)
points(x,y=df(x,5,81),type="b",col="red") # The F distribution
# Here is the alpha level from the bootstrapped distribution
print(realFstar<-oneway.test(percentage_passed~percentage_female, var.equal=T, data=dataf)$statistic)
## F
## 1.284714
mean(Fstar>=realFstar)
## [1] 0.2829
par(mfrow=c(1,2))
hist(Fstar,prob=TRUE,main="PDF of Theoretical F-distribution")
x = seq(.25,6,.25)
points(x,y=df(x,5,81), type = "b", col = "red")
# Generate quantiles of the analytic F distribution
qf(.05,5,81)
## [1] 0.2265328
qf(.95,5,81)
## [1] 2.327269
# Confidence interval estimation
quantile(Fstar,c(0.05,0.95))
## 5% 95%
## 0.2156916 2.3522846
#Estimate standard error -(The bootstrap standard error of the statistic is obtained as the standard deviation of the bootstrap replicate estimates)
sd(Fstar)
## [1] 0.6890754
#Bias -(Bootstrap is often used to determine the bias of an estimate)
mean(Fstar)-mean(realFstar)
## [1] -0.2571779
Analysis: As can be seen from the two plots above, the theoretical distribution and the actual distribution are not significantly different from each other.Further, we perform parameter estimation like finding confidence intervals, standard error and bias in the bootstrapped version of the sample.A large number of bootstrap replicate estimates is required for an accurate confidence interval which might be one of the reasons behind the few differences we see.
We perform analysis of variance on the data generated through bootstrap resampling and again test the null Hypothesis.
newmodel=lm(simpass~simfem,data=simdata)
anova(newmodel)
## Analysis of Variance Table
##
## Response: simpass
## Df Sum Sq Mean Sq F value Pr(>F)
## simfem 5 568.4 113.68 0.845 0.5219
## Residuals 81 10897.3 134.53
As is evident from the p-value (much higher than 0.05), we fail to reject the null hypothesis i.e. the variability in the percentage of female students passing the CS exam has no effect on the variability in the total percentage passing the exam (i.e. it is due to randomization only). However, to establish the validity of this result we again perform diagnostics check on this data.
#Diagnostics and Model Adequacy check
shapiro.test(simdata$simpass)
##
## Shapiro-Wilk normality test
##
## data: simdata$simpass
## W = 0.9593, p-value = 0.007882
par(mfrow=c(2,2))
plot(newmodel)
The Shapiro test results in a p-value which is much smaller than 0.1, therefore this data can be assumed to be normal. The normal Q-Q plot also depicts that majority of the values are normal. However, the fitted values show trend over the dynamic range.The scatter plots of the residuals and the fitted model will check the distribution of the model over the entire range.It is difficult to conclude anything substantive from the scatter plots since we have equal number of points on both sides of the mean (point zero) but these appear only for specific values in the overall range.
Data was compiled by Barbara Ericson (ericson@cc.gatech.edu) from data from the College Board for each year- http://research.collegeboard.org/programs/ap/data/participation/2013
State population taken from 2011 data at http://www.infoplease.com/ipa/A0004986.html
% black population see http://en.wikipedia.org/wiki/List_of_U.S._states_by_African-American_population (Georgia is 30% and Maryland is 29.44%) - based on 2010 census data
% hispanic/latino from http://www.hacu.net/images/hacu/OPAI/2012_Virtual_Binder/2010%20census%20brief%20-%20hispanic%20population.pdf