Version of August 28, 2014, superseding the version of August 24. Always use the most recent version.
Dataset under analysis includes detailed data on pass rates, race, and gender for 2013 advanced placement exams.
remove(list=ls())
test = read.csv("C:/Users/Trevor/Documents/DetailedStateInfoAP-CS-A-2006-2013-with-PercentBlackAndHIspanicByState-fixed.csv", header=TRUE)
#view first few lines
head(test)
## state numberschools X totalnumber yield.per.teacher numberpassed
## 1 California 211 NA 4964 23.52606635 3761
## 2 California 211 NA 4964 3786
## 3 Texas 271 NA 3979 14.68265683 2454
## 4 Texas 271 NA 3979 2479
## 5 New York 124 NA 1858 14.98387097 1278
## 6 New York 124 NA 1858 1303
## X..passed X..female X..female.passed X..female.passed.1 X.female
## 1 75.76551168 1074 776 72.25325885 21.64
## 2 NA NA
## 3 61.67378738 910 520 57.14285714 22.87
## 4 NA NA
## 5 68.78363832 377 216 57.29442971 20.29
## 6 NA NA
## X..Black X..Black.Passed X..Black.passed X..black.taking.exam
## 1 74 42 56.75675676 1.491
## 2 NA NA
## 3 132 64 48.48484848 3.317
## 4 NA NA
## 5 68 23 33.82352941 3.660
## 6 NA NA
## X..black.in.state X..taking.....state..100 X..Black.Females
## 1 6.7 22.25 16
## 2 NA NA NA
## 3 11.9 27.88 30
## 4 NA NA NA
## 5 15.2 24.08 19
## 6 NA NA NA
## X..Black.Females.passed X..Black.Females.passed.1 X..Hispanic
## 1 8 50 392
## 2 NA
## 3 14 46.66666667 751
## 4 NA
## 5 2 10.52631579 150
## 6 NA
## X..Hispanic.passed X..Hispanic.passed.1 X..Hispanic.Females
## 1 186 47.44897959 82
## 2 NA
## 3 334 44.47403462 178
## 4 NA
## 5 53 35.33333333 45
## 6 NA
## X..Hispanic.Females.passed X..Hispanic.females.passed
## 1 24* 29.27*
## 2
## 3 56* 31.46*
## 4
## 5 10 22.22222222
## 6
## X..hispanic.taking.exam X..Hispanic.in.state X..taking.....state...100
## 1 7.897 37.6 21.00
## 2 NA NA NA
## 3 18.874 37.6 50.20
## 4 NA NA NA
## 5 8.073 17.6 45.87
## 6 NA NA NA
#observe the structure of the data, ie. how many variables
str(test)
## 'data.frame': 103 obs. of 29 variables:
## $ state : Factor w/ 53 levels "","Alabama","Alaska",..: 6 6 44 44 33 33 49 49 21 21 ...
## $ numberschools : int 211 211 271 271 124 124 110 110 112 112 ...
## $ X : logi NA NA NA NA NA NA ...
## $ totalnumber : int 4964 4964 3979 3979 1858 1858 1655 1655 1629 1629 ...
## $ yield.per.teacher : Factor w/ 51 levels "","*","0","0.5",..: 27 1 17 1 19 1 20 1 16 1 ...
## $ numberpassed : int 3761 3786 2454 2479 1278 1303 1074 1099 1068 1093 ...
## $ X..passed : Factor w/ 50 levels "","*","0","100",..: 42 1 20 1 32 1 16 1 24 1 ...
## $ X..female : int 1074 NA 910 NA 377 NA 308 NA 323 NA ...
## $ X..female.passed : Factor w/ 36 levels "","*","0","107",..: 31 1 27 1 17 1 15 1 13 1 ...
## $ X..female.passed.1 : Factor w/ 40 levels "","*","0","100",..: 31 1 16 1 17 1 27 1 19 1 ...
## $ X.female : num 21.6 NA 22.9 NA 20.3 ...
## $ X..Black : int 74 NA 132 NA 68 NA 78 NA 170 NA ...
## $ X..Black.Passed : Factor w/ 21 levels "","*","0","1",..: 15 1 19 1 11 1 10 1 16 1 ...
## $ X..Black.passed : Factor w/ 28 levels "","*","0","19.27710843",..: 25 1 22 1 14 1 10 1 8 1 ...
## $ X..black.taking.exam : num 1.49 NA 3.32 NA 3.66 ...
## $ X..black.in.state : num 6.7 NA 11.9 NA 15.2 NA 19.9 NA 29.4 NA ...
## $ X..taking.....state..100 : num 22.2 NA 27.9 NA 24.1 ...
## $ X..Black.Females : int 16 NA 30 NA 19 NA 16 NA 51 NA ...
## $ X..Black.Females.passed : Factor w/ 11 levels "","*","0","1",..: 11 1 6 1 7 1 10 1 5 1 ...
## $ X..Black.Females.passed.1 : Factor w/ 13 levels "","*","0","10.52631579",..: 13 1 12 1 4 1 9 1 7 1 ...
## $ X..Hispanic : int 392 NA 751 NA 150 NA 90 NA 88 NA ...
## $ X..Hispanic.passed : Factor w/ 25 levels "","*","0","1*",..: 11 1 17 1 22 1 19 1 18 1 ...
## $ X..Hispanic.passed.1 : Factor w/ 26 levels "","*","0","11.11*",..: 20 1 17 1 11 1 19 1 16 1 ...
## $ X..Hispanic.Females : int 82 NA 178 NA 45 NA 9 NA 18 NA ...
## $ X..Hispanic.Females.passed: Factor w/ 12 levels "","*","0","10",..: 7 1 10 1 4 1 6 1 12 1 ...
## $ X..Hispanic.females.passed: Factor w/ 12 levels "","*","0","18.18*",..: 8 1 11 1 6 1 2 1 2 1 ...
## $ X..hispanic.taking.exam : num 7.9 NA 18.87 NA 8.07 ...
## $ X..Hispanic.in.state : num 37.6 NA 37.6 NA 17.6 NA 7.9 NA 8.2 NA ...
## $ X..taking.....state...100 : num 21 NA 50.2 NA 45.9 ...
#103 observations of 29 variables
attach(test)
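numberschools and totalnumber are treated as factors. The degrees of freedom reported in the ANOVA tables below (36 and 46) imply 37 levels for numberschools and 47 levels for totalnumber. numberschools describes the number of schools in each state where the advanced placement exam was administered in 2013, and totalnumber describes the total number of advanced placement exams administered.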
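The level counts can be checked directly: in a one-way analysis of variance, a factor's degrees of freedom equal its number of levels minus one. The following is a minimal sketch on made-up values (the vectors `schools` and `passed` are hypothetical and are not taken from the AP dataset).

```r
# Hypothetical toy values, not taken from the AP dataset.
schools <- factor(c(2, 2, 7, 7, 11, 11, 12, 12))
passed <- c(6, 9, 40, 44, 90, 85, 130, 128)

# One-way ANOVA with the factor as the only term.
fit <- aov(passed ~ schools)

n_levels <- nlevels(schools)       # number of distinct factor levels
factor_df <- anova(fit)$Df[1]      # Df reported for the factor row

print(c(levels = n_levels, Df = factor_df))
```

Running `nlevels()` on the converted columns of `test` would give the corresponding counts for the real data.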
#typecast variables of interest as factors that are not already factors
test$numberschools = as.factor(test$numberschools)
test$totalnumber = as.factor(test$totalnumber)
test$numberpassed = as.integer(test$numberpassed)
#view summary statistics
summary(test)
## state numberschools X totalnumber
## Alabama : 2 2 : 6 Mode:logical 47 : 6
## Alaska : 2 7 : 6 NA's:103 173 : 4
## Arizona : 2 11 : 6 0 : 2
## Arkansas : 2 12 : 6 1 : 2
## California: 2 8 : 4 9 : 2
## Colorado : 2 (Other):72 (Other):84
## (Other) :91 NA's : 3 NA's : 3
## yield.per.teacher numberpassed X..passed X..female
## :53 Min. : 6 :54 Min. : 0.0
## * : 1 1st Qu.: 65 * : 1 1st Qu.: 8.2
## 0 : 1 Median : 130 0 : 1 Median : 29.0
## 0.5 : 1 Mean : 411 100 : 1 Mean : 109.9
## 1.904761905: 1 3rd Qu.: 539 43.58974359: 1 3rd Qu.: 98.8
## 10.27777778: 1 Max. :3786 48.19277108: 1 Max. :1074.0
## (Other) :45 NA's :3 (Other) :44 NA's :53
## X..female.passed X..female.passed.1 X.female X..Black
## :53 :53 Min. : 0.0 Min. : 0.0
## * : 7 * : 7 1st Qu.:11.6 1st Qu.: 1.0
## 0 : 3 0 : 3 Median :15.2 Median : 8.0
## 17 : 3 72.72727273: 3 Mean :14.7 Mean : 21.5
## 6 : 3 77.27272727: 2 3rd Qu.:19.2 3rd Qu.: 21.5
## 4 : 2 100 : 1 Max. :29.1 Max. :170.0
## (Other):32 (Other) :34 NA's :53 NA's :52
## X..Black.Passed X..Black.passed X..black.taking.exam X..black.in.state
## :53 :53 Min. : 0.00 Min. : 0.67
## * :11 * :11 1st Qu.: 1.26 1st Qu.: 3.23
## 0 :11 0 :11 Median : 2.39 Median : 7.35
## 6 : 6 40 : 2 Mean : 3.19 Mean :10.49
## 4 : 3 50 : 2 3rd Qu.: 4.93 3rd Qu.:15.12
## 10 : 2 66.66666667: 2 Max. :11.58 Max. :37.30
## (Other):17 (Other) :22 NA's :53 NA's :53
## X..taking.....state..100 X..Black.Females X..Black.Females.passed
## Min. : 0.0 Min. : 0.0 :53
## 1st Qu.: 10.9 1st Qu.: 0.0 * :19
## Median : 24.2 Median : 1.0 0 :19
## Mean : 44.9 Mean : 4.9 2 : 3
## 3rd Qu.: 50.9 3rd Qu.: 4.0 1 : 2
## Max. :275.7 Max. :51.0 4 : 2
## NA's :52 NA's :53 (Other): 5
## X..Black.Females.passed.1 X..Hispanic X..Hispanic.passed
## :53 Min. : 0.0 :53
## * :19 1st Qu.: 1.0 * :16
## 0 :19 Median : 7.0 0 : 8
## 33.33333333: 3 Mean : 47.3 1* : 2
## 10.52631579: 1 3rd Qu.: 23.5 10* : 2
## 12.12121212: 1 Max. :751.0 11 : 2
## (Other) : 7 NA's :52 (Other):20
## X..Hispanic.passed.1 X..Hispanic.Females X..Hispanic.Females.passed
## :53 Min. : 0.00 :53
## * :18 1st Qu.: 0.00 0 :22
## 0 : 8 Median : 1.00 * :17
## 42.86* : 2 Mean : 9.76 2* : 3
## 11.11* : 1 3rd Qu.: 3.00 10 : 1
## 16.67* : 1 Max. :178.00 14 : 1
## (Other):20 NA's :53 (Other): 6
## X..Hispanic.females.passed X..hispanic.taking.exam X..Hispanic.in.state
## :53 Min. : 0.00 Min. : 1.20
## 0 :22 1st Qu.: 1.30 1st Qu.: 4.25
## * :19 Median : 3.05 Median : 8.20
## 18.18* : 1 Mean : 4.58 Mean :10.61
## 21.73913043: 1 3rd Qu.: 5.97 3rd Qu.:12.22
## 22.22222222: 1 Max. :18.87 Max. :46.30
## (Other) : 6 NA's :53 NA's :53
## X..taking.....state...100
## Min. : 0.0
## 1st Qu.: 21.4
## Median : 46.6
## Mean : 48.4
## 3rd Qu.: 62.8
## Max. :178.6
## NA's :53
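Several columns in the summary above were read as factors because some entries are blank or carry an asterisk (for example "24*"). The source does not say what the asterisk denotes, so the sketch below only separates the flag from the number. It uses a hypothetical vector `raw` that imitates those entries and shows one possible way to recover numeric values while keeping track of which entries were flagged.

```r
# Hypothetical values imitating the flagged entries seen in the output above.
raw <- factor(c("24*", "56*", "10", "", "*", "29.27*"))

as_text <- as.character(raw)          # convert via character, not as.numeric(factor)
flagged <- grepl("\\*$", as_text)     # TRUE where the entry ends in an asterisk
cleaned <- sub("\\*$", "", as_text)   # drop the trailing asterisk
cleaned[cleaned == ""] <- NA          # blanks and bare "*" become missing
values <- as.numeric(cleaned)

print(data.frame(raw = as_text, value = values, flagged = flagged))
```

Converting through `as.character()` matters here: calling `as.numeric()` directly on a factor returns the internal level codes rather than the printed values.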
The response variable, numberpassed, describes the number of advanced placement exams passed in 2013 in each state.
### The Data: How is it organized and what does it look like?
The data are tabulated into 29 columns, with each state having detailed information about the overall number of AP exams taken and the number passed. The data are further broken down by gender and ethnicity regarding pass rates, but for the purposes of this experiment, only the overall attempt/pass numbers will be analyzed.
### Randomization
The data were initially collected by College Board regarding all Advanced Placement exams.
Analysis of variance will be used to determine if the factors numberschools and totalnumber have an effect on numberpassed, the number of advanced placement exams passed. Factor interactions and blocking will also be considered.
### What is the rationale for this design?
It is possible that numberschools by itself may have an effect on the number of AP exams taken, and it is also possible that totalnumber may have an effect on numberpassed in each state in each year.
### Randomize: What is the Randomization Scheme?
The following file is a spreadsheet with all the data for the number of AP CS A exams taken in each state from 1998 to 2013. The data were compiled from the data available at http://research.collegeboard.org/programs/ap/data. These data were originally gathered by the CSTA board, but Barb Ericson of Georgia Tech keeps adding to them each year.
### Replicate: Are there replicates and/or repeated measures?
There are no replicates or repeated measures.
### Block: Did you use blocking in the design?
Yes, one model was fitted with blocking to determine if each of the factors had an effect by itself.
par(mfrow=c(1,1))
plot(numberpassed~state,las=3)
boxplot(totalnumber~state,las=3)
### Testing
#run analysis of variance for individual factor effects
model1=(aov(numberpassed~numberschools,data=test))
anova(model1)
## Analysis of Variance Table
##
## Response: numberpassed
## Df Sum Sq Mean Sq F value Pr(>F)
## numberschools 36 43891612 1219211 1305 <2e-16 ***
## Residuals 63 58853 934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#based on numberschools alone, we reject the H0 that numberschools has no effect on numberpassed.
model2=(aov(numberpassed~totalnumber, data=test))
anova(model2)
## Analysis of Variance Table
##
## Response: numberpassed
## Df Sum Sq Mean Sq F value Pr(>F)
## totalnumber 46 43932847 955062 2873 <2e-16 ***
## Residuals 53 17618 332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#based on totalnumber alone, we reject the H0 that totalnumber has no effect on numberpassed.
#run analysis of variance using interaction
interaction=aov(numberpassed~numberschools*totalnumber, data=test)
anova(interaction)
## Analysis of Variance Table
##
## Response: numberpassed
## Df Sum Sq Mean Sq F value Pr(>F)
## numberschools 36 43891612 1219211 3901.5 < 2e-16 ***
## totalnumber 13 43228 3325 10.6 2.6e-10 ***
## Residuals 50 15625 312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#the table has no numberschools:totalnumber row, so the interaction term could not be estimated from these data. For the main effects, we reject the H0 that numberschools and totalnumber have no effect on numberpassed.
#run analysis of variance with blocking
blocking=aov(numberpassed~numberschools+totalnumber, data=test)
anova(blocking)
## Analysis of Variance Table
##
## Response: numberpassed
## Df Sum Sq Mean Sq F value Pr(>F)
## numberschools 36 43891612 1219211 3901.5 < 2e-16 ***
## totalnumber 13 43228 3325 10.6 2.6e-10 ***
## Residuals 50 15625 312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#with numberschools entered first as the blocking factor, totalnumber still has a significant effect on numberpassed. The table is identical to the one from the interaction model.
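The interaction and blocking tables above are identical, and neither contains an interaction row. One situation that produces this is when every level of one factor occurs with only a single level of the other, so no degrees of freedom remain for the interaction. Whether that is exactly what happens in the AP data is an assumption. The sketch below reproduces the behaviour on made-up data (the data frame `toy` and its columns `A`, `B`, `y` are hypothetical).

```r
# Made-up data: every level of B occurs within exactly one level of A,
# so the A:B interaction has no degrees of freedom left to estimate.
set.seed(1)
toy <- data.frame(
  A = factor(rep(c("a1", "a2"), each = 4)),
  B = factor(rep(c("b1", "b2", "b3", "b4"), each = 2)),
  y = rnorm(8)
)

# Same response and factors, with and without the interaction term.
with_interaction <- anova(aov(y ~ A * B, data = toy))
additive_only <- anova(aov(y ~ A + B, data = toy))

# The row names show which terms were actually estimated.
print(rownames(with_interaction))
print(all.equal(with_interaction$Df, additive_only$Df))
```

A quick check on the real data would be `table(test$numberschools, test$totalnumber)`: if each column has a single non-zero cell, the interaction cannot be separated from the main effects.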
Based on these results from the multiple analyses of variance, we reject the H0 that the number of schools and total number of advanced placement exams have no effect on the number of AP exams passed. Said differently, variation in the number of AP exams passed can be explained by something other than randomization. Finally, we must check the normality to ensure these results are valid.
# Shapiro-Wilk test of normality. Adequate if p > 0.1 (a small p-value rejects normality)
shapiro.test(numberschools)
##
## Shapiro-Wilk normality test
##
## data: numberschools
## W = 0.7324, p-value = 3.281e-12
qqnorm(numberschools)
qqline(numberschools)
shapiro.test(totalnumber)
##
## Shapiro-Wilk normality test
##
## data: totalnumber
## W = 0.6085, p-value = 6.252e-15
qqnorm(totalnumber)
qqline(totalnumber)
One of the primary assumptions of the analysis of variance is that the data are normally distributed, and since they are not, the results from the analysis of variance are essentially invalidated.
### Diagnostics/Model Adequacy Checking
qqnorm(residuals(interaction))
qqline(residuals(interaction))
plot(fitted(interaction),residuals(interaction))
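The normality assumption in analysis of variance applies to the model residuals, so the Shapiro-Wilk test can also be run on them as a numeric complement to the Q-Q plot. The following is a minimal sketch on simulated data (the objects `group`, `y`, and `fit` are hypothetical); it illustrates the check and is not a result for the AP data.

```r
# Simulated data: three groups with different means and normal noise.
set.seed(1)
group <- factor(rep(c("a", "b", "c"), each = 10))
y <- rnorm(30, mean = as.integer(group))

# Fit the one-way model and extract its residuals.
fit <- aov(y ~ group)
res <- residuals(fit)

# Shapiro-Wilk on the residuals; a small p-value is evidence against normality.
sw <- shapiro.test(res)
print(sw)
```

For the models in this report, the equivalent call would be `shapiro.test(residuals(blocking))`.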
par(mfrow=c(1,1))
tukey1 = TukeyHSD(blocking, which="numberschools", ordered = FALSE)
tukey2 = TukeyHSD(blocking, which="totalnumber")
plot(tukey1)
plot(tukey2)
These data were compiled from the College Board under AP data - Archived Data.
The data include detailed data on pass rates, race, and gender for 2013. They were gathered from the Georgia Tech website along with other advanced placement testing information.