Recipe 3: Example of Descriptive Statistics

Recipes for the Design of Experiments: Recipe Outline

As of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Design of Experiments

Trevor Manzanares

Rensselaer Polytechnic Institute

10/3/14

1. Setting

System under test

The dataset under analysis includes detailed data on pass rates, race, and gender for the 2013 advanced placement exams.

remove(list=ls())
test = read.csv("C:/Users/Trevor/Documents/DetailedStateInfoAP-CS-A-2006-2013-with-PercentBlackAndHIspanicByState-fixed.csv", header=TRUE) 


#view first few lines
head(test)
##        state numberschools  X totalnumber yield.per.teacher numberpassed
## 1 California           211 NA        4964       23.52606635         3761
## 2 California           211 NA        4964                           3786
## 3      Texas           271 NA        3979       14.68265683         2454
## 4      Texas           271 NA        3979                           2479
## 5   New York           124 NA        1858       14.98387097         1278
## 6   New York           124 NA        1858                           1303
##     X..passed X..female X..female.passed X..female.passed.1 X.female
## 1 75.76551168      1074              776        72.25325885    21.64
## 2                    NA                                           NA
## 3 61.67378738       910              520        57.14285714    22.87
## 4                    NA                                           NA
## 5 68.78363832       377              216        57.29442971    20.29
## 6                    NA                                           NA
##   X..Black X..Black.Passed X..Black.passed X..black.taking.exam
## 1       74              42     56.75675676                1.491
## 2       NA                                                   NA
## 3      132              64     48.48484848                3.317
## 4       NA                                                   NA
## 5       68              23     33.82352941                3.660
## 6       NA                                                   NA
##   X..black.in.state X..taking.....state..100 X..Black.Females
## 1               6.7                    22.25               16
## 2                NA                       NA               NA
## 3              11.9                    27.88               30
## 4                NA                       NA               NA
## 5              15.2                    24.08               19
## 6                NA                       NA               NA
##   X..Black.Females.passed X..Black.Females.passed.1 X..Hispanic
## 1                       8                        50         392
## 2                                                            NA
## 3                      14               46.66666667         751
## 4                                                            NA
## 5                       2               10.52631579         150
## 6                                                            NA
##   X..Hispanic.passed X..Hispanic.passed.1 X..Hispanic.Females
## 1                186          47.44897959                  82
## 2                                                          NA
## 3                334          44.47403462                 178
## 4                                                          NA
## 5                 53          35.33333333                  45
## 6                                                          NA
##   X..Hispanic.Females.passed X..Hispanic.females.passed
## 1                        24*                     29.27*
## 2                                                      
## 3                        56*                     31.46*
## 4                                                      
## 5                         10                22.22222222
## 6                                                      
##   X..hispanic.taking.exam X..Hispanic.in.state X..taking.....state...100
## 1                   7.897                 37.6                     21.00
## 2                      NA                   NA                        NA
## 3                  18.874                 37.6                     50.20
## 4                      NA                   NA                        NA
## 5                   8.073                 17.6                     45.87
## 6                      NA                   NA                        NA
#observe the structure of the data, i.e. how many observations and variables
str(test)
## 'data.frame':    103 obs. of  29 variables:
##  $ state                     : Factor w/ 53 levels "","Alabama","Alaska",..: 6 6 44 44 33 33 49 49 21 21 ...
##  $ numberschools             : int  211 211 271 271 124 124 110 110 112 112 ...
##  $ X                         : logi  NA NA NA NA NA NA ...
##  $ totalnumber               : int  4964 4964 3979 3979 1858 1858 1655 1655 1629 1629 ...
##  $ yield.per.teacher         : Factor w/ 51 levels "","*","0","0.5",..: 27 1 17 1 19 1 20 1 16 1 ...
##  $ numberpassed              : int  3761 3786 2454 2479 1278 1303 1074 1099 1068 1093 ...
##  $ X..passed                 : Factor w/ 50 levels "","*","0","100",..: 42 1 20 1 32 1 16 1 24 1 ...
##  $ X..female                 : int  1074 NA 910 NA 377 NA 308 NA 323 NA ...
##  $ X..female.passed          : Factor w/ 36 levels "","*","0","107",..: 31 1 27 1 17 1 15 1 13 1 ...
##  $ X..female.passed.1        : Factor w/ 40 levels "","*","0","100",..: 31 1 16 1 17 1 27 1 19 1 ...
##  $ X.female                  : num  21.6 NA 22.9 NA 20.3 ...
##  $ X..Black                  : int  74 NA 132 NA 68 NA 78 NA 170 NA ...
##  $ X..Black.Passed           : Factor w/ 21 levels "","*","0","1",..: 15 1 19 1 11 1 10 1 16 1 ...
##  $ X..Black.passed           : Factor w/ 28 levels "","*","0","19.27710843",..: 25 1 22 1 14 1 10 1 8 1 ...
##  $ X..black.taking.exam      : num  1.49 NA 3.32 NA 3.66 ...
##  $ X..black.in.state         : num  6.7 NA 11.9 NA 15.2 NA 19.9 NA 29.4 NA ...
##  $ X..taking.....state..100  : num  22.2 NA 27.9 NA 24.1 ...
##  $ X..Black.Females          : int  16 NA 30 NA 19 NA 16 NA 51 NA ...
##  $ X..Black.Females.passed   : Factor w/ 11 levels "","*","0","1",..: 11 1 6 1 7 1 10 1 5 1 ...
##  $ X..Black.Females.passed.1 : Factor w/ 13 levels "","*","0","10.52631579",..: 13 1 12 1 4 1 9 1 7 1 ...
##  $ X..Hispanic               : int  392 NA 751 NA 150 NA 90 NA 88 NA ...
##  $ X..Hispanic.passed        : Factor w/ 25 levels "","*","0","1*",..: 11 1 17 1 22 1 19 1 18 1 ...
##  $ X..Hispanic.passed.1      : Factor w/ 26 levels "","*","0","11.11*",..: 20 1 17 1 11 1 19 1 16 1 ...
##  $ X..Hispanic.Females       : int  82 NA 178 NA 45 NA 9 NA 18 NA ...
##  $ X..Hispanic.Females.passed: Factor w/ 12 levels "","*","0","10",..: 7 1 10 1 4 1 6 1 12 1 ...
##  $ X..Hispanic.females.passed: Factor w/ 12 levels "","*","0","18.18*",..: 8 1 11 1 6 1 2 1 2 1 ...
##  $ X..hispanic.taking.exam   : num  7.9 NA 18.87 NA 8.07 ...
##  $ X..Hispanic.in.state      : num  37.6 NA 37.6 NA 17.6 NA 7.9 NA 8.2 NA ...
##  $ X..taking.....state...100 : num  21 NA 50.2 NA 45.9 ...
#103 observations of 29 variables
attach(test)
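
The str() output above shows many count columns read as factors whose levels include "" and "*", which suggests that blank cells and "*" markers are mixed in with the numbers. A minimal sketch of one way to handle this at read time, using a small made-up extract (the column name femalepassed and the Wyoming row are hypothetical, and treating "*" as missing is an assumption about what the marker means):

```r
# Hypothetical miniature of the CSV: each state has a full row followed by a
# mostly blank row, and "*" marks a count that is not reported (assumed).
csv_text <- "state,numberschools,totalnumber,numberpassed,femalepassed
California,211,4964,3761,776
California,211,4964,3786,
Texas,271,3979,2454,520
Texas,271,3979,2479,
Wyoming,1,3,2,*"

# Treat empty cells and "*" as missing so count columns are read as integers
mini <- read.csv(text = csv_text, header = TRUE,
                 na.strings = c("", "*", "NA"),
                 stringsAsFactors = FALSE)

# Keep only the first row for each state
mini_first <- mini[!duplicated(mini$state), ]

str(mini_first)
```

Whether the second row per state should be dropped depends on what it represents in the original spreadsheet, which is not documented here.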

Factors of interest and their levels

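numberschools and totalnumber are treated as factors. In the rows used by the models, totalnumber has 47 levels and numberschools has 37, consistent with the 46 and 36 degrees of freedom in the ANOVA tables in Section 3. numberschools describes the number of schools with advanced placement exams administered in each state in 2013, and totalnumber describes the total number of advanced placement exams administered.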
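
To illustrate how a count column turns into factor levels, here is a small sketch with made-up values (the vector schools is hypothetical, not taken from the dataset). Each distinct value becomes one level, and a one-way ANOVA on such a factor uses one fewer degree of freedom than it has levels.

```r
# Hypothetical school counts for six states; two states share the value 12
schools <- c(211, 271, 124, 12, 12, 7)

# Converting to a factor creates one level per distinct value
schools_f <- as.factor(schools)

nlevels(schools_f)        # 5 distinct values, so 5 levels
levels(schools_f)         # the levels are the values, stored as text
nlevels(schools_f) - 1    # degrees of freedom in a one-way ANOVA: 4
```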

#typecast variables of interest as factors that are not already factors
test$numberschools = as.factor(test$numberschools)
test$totalnumber = as.factor(test$totalnumber)
test$numberpassed = as.integer(test$numberpassed)

#view summary statistics
summary(test)
##         state    numberschools    X            totalnumber
##  Alabama   : 2   2      : 6    Mode:logical   47     : 6  
##  Alaska    : 2   7      : 6    NA's:103       173    : 4  
##  Arizona   : 2   11     : 6                   0      : 2  
##  Arkansas  : 2   12     : 6                   1      : 2  
##  California: 2   8      : 4                   9      : 2  
##  Colorado  : 2   (Other):72                   (Other):84  
##  (Other)   :91   NA's   : 3                   NA's   : 3  
##    yield.per.teacher  numberpassed        X..passed    X..female     
##             :53      Min.   :   6              :54   Min.   :   0.0  
##  *          : 1      1st Qu.:  65   *          : 1   1st Qu.:   8.2  
##  0          : 1      Median : 130   0          : 1   Median :  29.0  
##  0.5        : 1      Mean   : 411   100        : 1   Mean   : 109.9  
##  1.904761905: 1      3rd Qu.: 539   43.58974359: 1   3rd Qu.:  98.8  
##  10.27777778: 1      Max.   :3786   48.19277108: 1   Max.   :1074.0  
##  (Other)    :45      NA's   :3      (Other)    :44   NA's   :53      
##  X..female.passed   X..female.passed.1    X.female       X..Black    
##         :53                  :53       Min.   : 0.0   Min.   :  0.0  
##  *      : 7       *          : 7       1st Qu.:11.6   1st Qu.:  1.0  
##  0      : 3       0          : 3       Median :15.2   Median :  8.0  
##  17     : 3       72.72727273: 3       Mean   :14.7   Mean   : 21.5  
##  6      : 3       77.27272727: 2       3rd Qu.:19.2   3rd Qu.: 21.5  
##  4      : 2       100        : 1       Max.   :29.1   Max.   :170.0  
##  (Other):32       (Other)    :34       NA's   :53     NA's   :52     
##  X..Black.Passed    X..Black.passed X..black.taking.exam X..black.in.state
##         :53                 :53     Min.   : 0.00        Min.   : 0.67    
##  *      :11      *          :11     1st Qu.: 1.26        1st Qu.: 3.23    
##  0      :11      0          :11     Median : 2.39        Median : 7.35    
##  6      : 6      40         : 2     Mean   : 3.19        Mean   :10.49    
##  4      : 3      50         : 2     3rd Qu.: 4.93        3rd Qu.:15.12    
##  10     : 2      66.66666667: 2     Max.   :11.58        Max.   :37.30    
##  (Other):17      (Other)    :22     NA's   :53           NA's   :53       
##  X..taking.....state..100 X..Black.Females X..Black.Females.passed
##  Min.   :  0.0            Min.   : 0.0            :53             
##  1st Qu.: 10.9            1st Qu.: 0.0     *      :19             
##  Median : 24.2            Median : 1.0     0      :19             
##  Mean   : 44.9            Mean   : 4.9     2      : 3             
##  3rd Qu.: 50.9            3rd Qu.: 4.0     1      : 2             
##  Max.   :275.7            Max.   :51.0     4      : 2             
##  NA's   :52               NA's   :53       (Other): 5             
##  X..Black.Females.passed.1  X..Hispanic    X..Hispanic.passed
##             :53            Min.   :  0.0          :53        
##  *          :19            1st Qu.:  1.0   *      :16        
##  0          :19            Median :  7.0   0      : 8        
##  33.33333333: 3            Mean   : 47.3   1*     : 2        
##  10.52631579: 1            3rd Qu.: 23.5   10*    : 2        
##  12.12121212: 1            Max.   :751.0   11     : 2        
##  (Other)    : 7            NA's   :52      (Other):20        
##  X..Hispanic.passed.1 X..Hispanic.Females X..Hispanic.Females.passed
##         :53           Min.   :  0.00             :53                
##  *      :18           1st Qu.:  0.00      0      :22                
##  0      : 8           Median :  1.00      *      :17                
##  42.86* : 2           Mean   :  9.76      2*     : 3                
##  11.11* : 1           3rd Qu.:  3.00      10     : 1                
##  16.67* : 1           Max.   :178.00      14     : 1                
##  (Other):20           NA's   :53          (Other): 6                
##  X..Hispanic.females.passed X..hispanic.taking.exam X..Hispanic.in.state
##             :53             Min.   : 0.00           Min.   : 1.20       
##  0          :22             1st Qu.: 1.30           1st Qu.: 4.25       
##  *          :19             Median : 3.05           Median : 8.20       
##  18.18*     : 1             Mean   : 4.58           Mean   :10.61       
##  21.73913043: 1             3rd Qu.: 5.97           3rd Qu.:12.22       
##  22.22222222: 1             Max.   :18.87           Max.   :46.30       
##  (Other)    : 6             NA's   :53              NA's   :53          
##  X..taking.....state...100
##  Min.   :  0.0            
##  1st Qu.: 21.4            
##  Median : 46.6            
##  Mean   : 48.4            
##  3rd Qu.: 62.8            
##  Max.   :178.6            
##  NA's   :53

Continuous variables (if any)

The response variable, numberpassed, describes the number of advanced placement exams passed in 2013 in each state.

### The Data: How is it organized and what does it look like?

The data are tabulated into 29 columns, with each state having detailed information about the overall number of AP exams taken and the number passed. The data are further broken down by gender and ethnicity regarding pass rates, but for the purposes of this experiment, only the overall attempt/pass numbers will be analyzed.

### Randomization

The data were initially collected by the College Board for all Advanced Placement exams.
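
The overall attempt and pass counts that this analysis focuses on are enough to reproduce the pass percentages shown in the head() output. A short sketch using the first row for each of the three states displayed there (the vector names passed, taken, and rate are chosen here for illustration):

```r
# Counts copied from the first row per state in the head() output
passed <- c(California = 3761, Texas = 2454, `New York` = 1278)
taken  <- c(California = 4964, Texas = 3979, `New York` = 1858)

# Pass rate in percent = 100 * numberpassed / totalnumber
rate <- 100 * passed / taken

# Rounds to 75.77, 61.67, 68.78, matching the X..passed column
round(rate, 2)
```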

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Analysis of variance will be used to determine whether the factors numberschools and totalnumber have an effect on numberpassed, the number of advanced placement exams passed. Factor interactions and blocking will also be considered.

### What is the rationale for this design?

It is possible that numberschools by itself may have an effect on the number of AP exams taken, and it is also possible that totalnumber may have an effect on numberpassed in each state in each year.

### Randomize: What is the Randomization Scheme?

The source file is a spreadsheet with all the data for the number of CS AP A exams taken in each state from 1998 to 2013. The data were compiled from the data available at http://research.collegeboard.org/programs/ap/data. They were originally gathered by the CSTA board, but Barb Ericson of Georgia Tech keeps adding to them each year.

### Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures.

### Block: Did you use blocking in the design?

Yes, one model was performed with blocking to determine if each of the factors had an effect by themselves.
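
The difference between an additive (blocking-style) model and an interaction model comes down to the model formula. The sketch below uses simulated data on a made-up balanced layout (factors a and b, response y, four replicates per cell are all hypothetical) so that the interaction term can actually be estimated.

```r
set.seed(1)

# Hypothetical balanced layout: 2 x 3 factor levels, 4 replicates per cell
toy <- expand.grid(a   = factor(c("low", "high")),
                   b   = factor(c("x", "y", "z")),
                   rep = 1:4)
toy$y <- rnorm(nrow(toy), mean = 10 + 2 * (toy$a == "high"))

additive    <- aov(y ~ a + b, data = toy)   # main effects only
interacting <- aov(y ~ a * b, data = toy)   # expands to a + b + a:b

# Terms fitted by each model
attr(terms(additive), "term.labels")
attr(terms(interacting), "term.labels")

anova(interacting)
```

With no replicates, or with factor levels that never occur together in more than one combination, the a:b term has no degrees of freedom left and drops out of the ANOVA table.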

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

par(mfrow=c(1,1))
plot(numberpassed~state,las=3)

plot of chunk unnamed-chunk-3

boxplot(totalnumber~state,las=3)

plot of chunk unnamed-chunk-3

### Testing

#run analysis of variance for individual factor effects
model1=(aov(numberpassed~numberschools,data=test))
anova(model1)
## Analysis of Variance Table
## 
## Response: numberpassed
##               Df   Sum Sq Mean Sq F value Pr(>F)    
## numberschools 36 43891612 1219211    1305 <2e-16 ***
## Residuals     63    58853     934                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#based on numberschools alone, we reject the H0 that numberschools has no effect on numberpassed.

model2=(aov(numberpassed~totalnumber, data=test))
anova(model2)
## Analysis of Variance Table
## 
## Response: numberpassed
##             Df   Sum Sq Mean Sq F value Pr(>F)    
## totalnumber 46 43932847  955062    2873 <2e-16 ***
## Residuals   53    17618     332                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#based on totalnumber alone, we reject the H0 that totalnumber has no effect on numberpassed.

#run analysis of variance using interaction
interaction=aov(numberpassed~numberschools*totalnumber, data=test)
anova(interaction)
## Analysis of Variance Table
## 
## Response: numberpassed
##               Df   Sum Sq Mean Sq F value  Pr(>F)    
## numberschools 36 43891612 1219211  3901.5 < 2e-16 ***
## totalnumber   13    43228    3325    10.6 2.6e-10 ***
## Residuals     50    15625     312                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#the table has no numberschools:totalnumber row, so no interaction effect could be estimated from these data. We reject the H0 that each factor has no effect on numberpassed: both numberschools and totalnumber have an effect on the numberpassed for each state.

#run analysis of variance with blocking
blocking=aov(numberpassed~numberschools+totalnumber, data=test)
anova(blocking)
## Analysis of Variance Table
## 
## Response: numberpassed
##               Df   Sum Sq Mean Sq F value  Pr(>F)    
## numberschools 36 43891612 1219211  3901.5 < 2e-16 ***
## totalnumber   13    43228    3325    10.6 2.6e-10 ***
## Residuals     50    15625     312                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#the additive (blocking) model gives the same table as the interaction model above: both factors have a significant effect on numberpassed.

Based on these results from the multiple analyses of variance, we reject the H0 that the number of schools and total number of advanced placement exams have no effect on the number of AP exams passed. Said differently, variation in the number of AP exams passed can be explained by something other than randomization. Finally, we must check the normality to ensure these results are valid.

Estimation (of Parameters)

# Shapiro-Wilk test of normality. H0: the data are normal, so normality is adequate only if p > 0.1
shapiro.test(numberschools)
## 
##  Shapiro-Wilk normality test
## 
## data:  numberschools
## W = 0.7324, p-value = 3.281e-12
qqnorm(numberschools)
qqline(numberschools)

plot of chunk unnamed-chunk-5

shapiro.test(totalnumber)
## 
##  Shapiro-Wilk normality test
## 
## data:  totalnumber
## W = 0.6085, p-value = 6.252e-15
qqnorm(totalnumber)
qqline(totalnumber)

plot of chunk unnamed-chunk-5

One of the primary assumptions of the analysis of variance is normality (strictly, of the model residuals). Both Shapiro-Wilk tests reject normality (p < 0.001), so the results from the analysis of variance are essentially invalidated.

### Diagnostics/Model Adequacy Checking

qqnorm(residuals(interaction))
qqline(residuals(interaction))

plot of chunk unnamed-chunk-6

plot(fitted(interaction),residuals(interaction))

plot of chunk unnamed-chunk-6
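
The residual plots above can be complemented by a formal test on the residuals themselves, since the ANOVA normality assumption concerns the residuals rather than the raw variables. A minimal sketch on simulated data (the groups grp and response y are hypothetical, not from the AP dataset):

```r
set.seed(42)

# Hypothetical one-way layout: 3 groups, 20 observations each
grp <- factor(rep(c("A", "B", "C"), each = 20))
y   <- rnorm(60, mean = c(5, 7, 9)[as.integer(grp)], sd = 1)

fit <- aov(y ~ grp)
res <- residuals(fit)

# H0 of the Shapiro-Wilk test: the sample comes from a normal distribution,
# so a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value

# Visual check of the same residuals
qqnorm(res)
qqline(res)
```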

par(mfrow=c(1,1))
tukey1 = TukeyHSD(blocking, which="numberschools", ordered = FALSE)
tukey2 = TukeyHSD(blocking, which="totalnumber")
plot(tukey1)

plot of chunk unnamed-chunk-6

plot(tukey2)

plot of chunk unnamed-chunk-6
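
With dozens of factor levels, Tukey plots like the ones above contain hundreds of pairwise intervals and are hard to read. The sketch below shows the structure of a TukeyHSD result on a small simulated example (the groups grp and response y are hypothetical), where each row is one pairwise comparison.

```r
set.seed(7)

# Hypothetical one-way layout: groups A and B share a mean, C is shifted by 5
grp <- factor(rep(c("A", "B", "C"), each = 10))
y   <- rnorm(30, mean = c(0, 0, 5)[as.integer(grp)])

fit <- aov(y ~ grp)
hsd <- TukeyHSD(fit, which = "grp")

# One row per pair of levels; columns are diff, lwr, upr, p adj
hsd$grp

# Pairs whose adjusted p-value falls below 0.05
rownames(hsd$grp)[hsd$grp[, "p adj"] < 0.05]
```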

4. References to the literature

These data were compiled from the College Board under AP data - Archived Data.

http://home.cc.gatech.edu/ice-gt/556

5. Appendices

A summary of, or pointer to, the raw data

complete and documented R code

The data include detailed data on pass rates, race, and gender for 2013. They were gathered from the Georgia Tech website along with other advanced placement testing information.