Rensselaer Polytechnic Institute

Date: November 7, 2016

Version 1.3

1. Setting

The California public school system conducts standardized tests and assembles the reading and math scores, along with other descriptive data, from its school districts. These data are important because they can impact the levels of public funding that a school district may receive. In this report, the effect of grade span of a school district (K-6 or K-8) on the expenditure per student is analyzed.

System under test

The California Test Score Data Set is contained within the Ecdat R package in the Caschool dataframe. Using the command ?Caschool, we learn that the dataframe contains a cross-section from 1998-1999 with the following information for each school district:

  • distcod district code
  • county name of county
  • district name of district
  • grspan grade span of district
  • enrltot total enrollment
  • teachers number of teachers
  • calwpct percent qualifying for CalWorks
  • mealpct percent qualifying for reduced-price lunch
  • computer number of computers
  • testscr average test score (reading + math)/2
  • compstu number of computers per student
  • expnstu expenditure per student
  • str student-to-teacher ratio
  • avginc district average income
  • elpct percent of English learners
  • readscr average reading score
  • mathscr average math score

To get a sense of what the data look like, we can inspect the head and tail of the dataframe.

# View the beginning and end of the Caschool dataframe
head(Caschool, n=10)
##    distcod      county                        district grspan enrltot
## 1    75119     Alameda              Sunol Glen Unified  KK-08     195
## 2    61499       Butte            Manzanita Elementary  KK-08     240
## 3    61549       Butte     Thermalito Union Elementary  KK-08    1550
## 4    61457       Butte Golden Feather Union Elementary  KK-08     243
## 5    61523       Butte        Palermo Union Elementary  KK-08    1335
## 6    62042      Fresno         Burrel Union Elementary  KK-08     137
## 7    68536 San Joaquin           Holt Union Elementary  KK-08     195
## 8    63834        Kern             Vineland Elementary  KK-08     888
## 9    62331      Fresno        Orange Center Elementary  KK-08     379
## 10   67306  Sacramento     Del Paso Heights Elementary  KK-06    2247
##    teachers calwpct  mealpct computer testscr    compstu  expnstu      str
## 1     10.90  0.5102   2.0408       67  690.80 0.34358975 6384.911 17.88991
## 2     11.15 15.4167  47.9167      101  661.20 0.42083332 5099.381 21.52466
## 3     82.90 55.0323  76.3226      169  643.60 0.10903226 5501.955 18.69723
## 4     14.00 36.4754  77.0492       85  647.70 0.34979424 7101.831 17.35714
## 5     71.50 33.1086  78.4270      171  640.85 0.12808989 5235.988 18.67133
## 6      6.40 12.3188  86.9565       25  605.55 0.18248175 5580.147 21.40625
## 7     10.00 12.9032  94.6237       28  606.75 0.14358975 5253.331 19.50000
## 8     42.50 18.8063 100.0000       66  609.00 0.07432432 4565.746 20.89412
## 9     19.00 32.1900  93.1398       35  612.50 0.09234829 5355.548 19.94737
## 10   108.00 78.9942  87.3164        0  612.65 0.00000000 5036.211 20.80556
##       avginc     elpct readscr mathscr
## 1  22.690001  0.000000   691.6   690.0
## 2   9.824000  4.583333   660.5   661.9
## 3   8.978000 30.000002   636.3   650.9
## 4   8.978000  0.000000   651.9   643.5
## 5   9.080333 13.857677   641.8   639.9
## 6  10.415000 12.408759   605.7   605.4
## 7   6.577000 68.717949   604.5   609.0
## 8   8.174000 46.959461   605.5   612.5
## 9   7.385000 30.079157   608.9   616.1
## 10 11.613333 40.275921   611.9   613.4
tail(Caschool, n=10)
##     distcod        county                     district grspan enrltot
## 411   61770  Contra Costa      Orinda Union Elementary  KK-08    2422
## 412   68908     San Mateo Hillsborough City Elementary  KK-08    1318
## 413   69161 Santa Barbara       Cold Spring Elementary  KK-06     220
## 414   68981     San Mateo    Portola Valley Elementary  KK-08     687
## 415   69682   Santa Clara    Saratoga Union Elementary  KK-08    2341
## 416   68957     San Mateo       Las Lomitas Elementary  KK-08     984
## 417   69518   Santa Clara         Los Altos Elementary  KK-08    3724
## 418   72611       Ventura       Somis Union Elementary  KK-08     441
## 419   72744          Yuba            Plumas Elementary  KK-08     101
## 420   72751          Yuba         Wheatland Elementary  KK-08    1778
##     teachers calwpct mealpct computer testscr   compstu  expnstu      str
## 411   139.47  0.0000  0.1239      466  698.20 0.1924030 5933.154 17.36574
## 412    87.06  0.1517  0.0000      412  698.25 0.3125948 7593.406 15.13898
## 413    12.33  0.4545  0.0000       22  698.45 0.1000000 6500.450 17.84266
## 414    44.59  0.3049  0.0000      209  699.10 0.3042212 7217.263 15.40704
## 415   124.09  0.1709  0.5980      286  700.30 0.1221700 5392.639 18.86534
## 416    59.73  0.1016  3.5569      195  704.30 0.1981707 7290.339 16.47413
## 417   208.48  1.0741  1.5038      721  706.75 0.1936090 5741.463 17.86263
## 418    20.15  3.5635 37.1938       45  645.00 0.1020408 4402.832 21.88586
## 419     5.00 11.8812 59.4059       14  672.20 0.1386139 4776.336 20.20000
## 420    93.40  6.9235 47.5712      313  655.75 0.1760405 5993.393 19.03640
##       avginc      elpct readscr mathscr
## 411 40.26400  0.5780346   699.1   697.3
## 412 35.81000  2.8072839   695.4   701.1
## 413 43.23000  1.3636364   693.3   703.6
## 414 50.67700  1.1644833   698.3   699.9
## 415 40.40200  2.0504057   698.9   701.7
## 416 28.71700  5.9959350   700.9   707.7
## 417 41.73411  4.7261009   704.0   709.5
## 418 23.73300 24.2630386   648.3   641.7
## 419  9.95200  2.9702971   667.9   676.5
## 420 12.50200  5.0056243   660.5   651.0

Factors and levels

The pair of categorical independent variables (IVs) selected was grspan and county.

Grade span is a factor with 2 levels: K-6 (KK-06) or K-8 (KK-08).

County is a factor with 45 levels. There are 420 districts included in the dataframe, or just over nine districts per county on average.
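
As a quick check (a small sketch, not part of the original analysis), the per-county district counts can be tabulated directly:

# Tabulate districts per county and compute the average count (420/45, about 9.3)
county_counts <- table(Caschool$county)
mean(county_counts)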

These assumptions about the number of levels for each factor are confirmed by examining the structure of the dataframe below.

# Examine the structure of the dataframe
str(Caschool)
## 'data.frame':    420 obs. of  17 variables:
##  $ distcod : int  75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
##  $ county  : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
##  $ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
##  $ grspan  : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
##  $ enrltot : int  195 240 1550 243 1335 137 195 888 379 2247 ...
##  $ teachers: num  10.9 11.1 82.9 14 71.5 ...
##  $ calwpct : num  0.51 15.42 55.03 36.48 33.11 ...
##  $ mealpct : num  2.04 47.92 76.32 77.05 78.43 ...
##  $ computer: int  67 101 169 85 171 25 28 66 35 0 ...
##  $ testscr : num  691 661 644 648 641 ...
##  $ compstu : num  0.344 0.421 0.109 0.35 0.128 ...
##  $ expnstu : num  6385 5099 5502 7102 5236 ...
##  $ str     : num  17.9 21.5 18.7 17.4 18.7 ...
##  $ avginc  : num  22.69 9.82 8.98 8.98 9.08 ...
##  $ elpct   : num  0 4.58 30 0 13.86 ...
##  $ readscr : num  692 660 636 652 642 ...
##  $ mathscr : num  690 662 651 644 640 ...

Continuous variables

While the first few columns in the Caschool dataframe are categorical (county, district, grspan), the remaining columns may be considered continuous variables. Three of these columns (calwpct, mealpct, elpct) are percentages.

To perform ANOVA, a continuous dependent variable (DV) must be selected as the response variable.

Deciding upon a response variable

Initially, I wanted to use compstu, the number of computers per student, as the response variable. The rationale was that the number of computers per student might differ based on the grade span of a particular district.

A user-defined function MyCohensd was created to compute Cohen’s d across the two levels of the non-blocking IV, grade span. Because grspan is a factor with only two levels, the calculation of Cohen’s d was straightforward and followed the formula $d = \frac{|\mu_{K8} - \mu_{K6}|}{\sigma_{pooled}}$, where $\sigma_{pooled} = \frac{\sigma_{K6} + \sigma_{K8}}{2}$.

# User-defined function to calculate the effect size (Cohen's d) for any num or int variable in the Caschool dataframe, grouped by grade span (KK-06, KK-08)
MyCohensd <- function(variable){
  # Calculate the difference between the means of the variable in the KK-06 and KK-08 groups
  mean1 <- mean(subset(variable, Caschool$grspan=="KK-06"))
  mean2 <- mean(subset(variable, Caschool$grspan=="KK-08"))
  meandiff <- mean2 - mean1
  
  # Calculate the pooled standard deviation of the variable in the KK-06 and KK-08 groups
  SD1 <- sd(subset(variable, Caschool$grspan=="KK-06"))
  SD2 <- sd(subset(variable, Caschool$grspan=="KK-08"))
  SDpooled <- (SD1 + SD2)/2
  
  # Calculate Cohen's d, which is the mean difference divided by the pooled standard deviation
  result <- abs(meandiff/SDpooled)
  return(result)
}

The effect size determined by Cohen’s d is arbitrarily categorized as:

  • trivial effect when d < 0.1
  • small effect when 0.1 < d < 0.3
  • moderate effect when 0.3 < d < 0.5
  • large effect when d > 0.5

The user-defined function, called MyCohensd, was applied to all columns of interest (in a list called ColumnsofInterest) using the apply() function.

# Create a list of all columns of interest, then run the MyCohensd function on each column
# The apply() function creates a table with the value of Cohen's d for each column of interest
#   The second argument, 2, indicates that the function is applied to each column; if this argument were 1, the function would be applied to each row
ColumnsofInterest <- c('enrltot', 'teachers', 'calwpct', 'mealpct', 'computer', 'testscr', 'compstu', 'expnstu', 'str', 'avginc', 'elpct', 'readscr', 'mathscr')
Cohensd_Data <- apply(Caschool[,ColumnsofInterest], 2, function(x) MyCohensd(x))

# Show the Cohen's d results
Cohensd_Data
##    enrltot   teachers    calwpct    mealpct   computer    testscr 
## 0.15750676 0.17015810 0.17856325 0.30656704 0.08976346 0.45251294 
##    compstu    expnstu        str     avginc      elpct    readscr 
## 0.06624935 0.48402902 0.26219572 0.45018749 0.15778971 0.41222605 
##    mathscr 
## 0.47949551
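
These values can be mapped onto the categories listed above (a small sketch, not part of the original analysis, using base R's cut() function):

# Label each Cohen's d value using the effect-size cutoffs listed above
EffectSizes <- cut(Cohensd_Data, breaks = c(0, 0.1, 0.3, 0.5, Inf),
                   labels = c("trivial", "small", "moderate", "large"))
data.frame(d = round(Cohensd_Data, 3), effect = EffectSizes)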

Surprisingly, it turns out that the effect size for compstu is trivial (d < 0.1). Therefore, I chose to study the expenditure per student (expnstu) as the response variable instead, which has a moderate-to-large effect size (d = 0.48).

2. (Experimental) Design

The first test of these data is a Null Hypothesis Significance Test (NHST), performed via ANOVA. Later, two alternatives to NHST are explored: Resampling Statistics and the Plot Plus Error Bar (PPE) Approach.

Organization of experiment to test the hypothesis

The factors and response variable were selected according to the selection rules for ANOVA,1 which state that ANOVA is appropriate when the IVs are categorical and the DV (response) is continuous. Given the limited categorical data provided in the dataframe, grspan and county seemed like the most appropriate IVs. After exploring the effect sizes of various potential DVs, expnstu was selected as the response variable of interest.

In this experiment, the null hypothesis states that there is no difference between the two grspan groups (K-6 and K-8) in terms of expnstu, the expenditure per student. The research hypothesis is that expenditure per student may differ between the two grade spans.

A Type I error rate α of 0.05 was used for this analysis, as it is the most common cutoff used to determine statistical significance.
A Type II error rate β of 0.1 was used, giving a power (1 − β) of 0.9, above the conventional minimum of 0.8. Increasing the power increases the sample size required to draw a statistically significant conclusion.

Rationale for design

The rationale for the ANOVA is that it is a straightforward analysis for a blocked experiment, where we may perform a 2-way ANOVA and ignore the F-value associated with the blocking IV.

Randomization

A random sample was selected, and the sample size was determined using the program G*Power. Selection of the sample is performed in a later section following the sample size determination. Note that, for the randomization, a seed was set to ensure reproducibility. The sample() function was used to pull a random sample of given size from the dataframe.

Replication

Due to the nature of the dataset, which is a cross-section of data from 1998-1999, we do not have multiple data points for each district and are therefore unable to perform replication.

Blocking

The data were blocked by county in order to reduce the noise in the expenditure data caused by the locations of the school districts within the state of California (where regions may have different socioeconomic status). A 2-way ANOVA was performed, but we are simply ignoring the F-statistic associated with the blocking variable (county).

3. (Statistical) Analysis

Before exploring alternatives to NHST, we first explore the data, check for the validity of model assumptions, and perform an ANOVA. G*Power was used for setting the sample size.

Exploratory data analysis

The histogram of expenditure per student for California school districts shows some positive skew in the data (Fig. 1). Skewness violates the assumption of population normality for conducting ANOVA, but the skew is slight, so we will proceed with the analysis while making note of it in the Contingencies section below.

Fig. 1 Histogram of the expenditure per student for California school districts in 1998-1999.
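
To quantify the skew (a minimal sketch, not part of the original analysis; base R has no built-in skewness function, so the moment-based formula is computed manually):

# Moment-based sample skewness of expenditure per student
# A positive value indicates right (positive) skew
x <- Caschool$expnstu
mean((x - mean(x))^3) / sd(x)^3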

Computing main effect

Calculation of the main effect of the non-blocking IV grspan on the response DV expnstu is straightforward given that the grspan factor has two levels. Therefore, the main effect is the difference in the means, as calculated below.

# Calculating main effect
#   Take the difference between the means of the grspan in the KK-06 and KK-08 groups
m1 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
m2 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))
me <- m2 - m1

# Also calculate the group standard deviations, to be used later
sd1 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
sd2 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))

# Print the result
me
## [1] -310.1282

While there is clearly a difference between the means, the spread of the data is not captured. To elucidate the main effect further, a boxplot was generated for each level of the grspan factor (Fig. 2).

Fig. 2 Boxplots to show the main effect of grade span on expenditure per student.

While the median expenditure of the K-8 group is less than the median expenditure of the K-6 group, a number of high outliers appear in the K-8 group.
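
Those outliers can be listed explicitly (a sketch, not part of the original analysis, using the standard 1.5 × IQR boxplot rule via boxplot.stats()):

# Extract the values flagged as outliers (beyond 1.5 * IQR from the hinges) for the K-8 group
k8_expnstu <- subset(Caschool$expnstu, Caschool$grspan == "KK-08")
sort(boxplot.stats(k8_expnstu)$out)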

Determining sample size using G*Power

While the original dataframe is not prohibitively large to analyze, it can still be useful practice to minimize the sample size. Using the program G*Power 3.1.9.2 for Windows 10 with the following parameters:

  • Effect size f = 0.48
  • α err prob = 0.05
  • Power (1-β err prob) = 0.9
  • Number of groups = 2,

The following output was generated, which indicates a critical F-statistic of Fcrit = 4.05 and a sample size of N = 48 (Fig. 3).

Fig. 3 G*Power output indicating the total sample size required to achieve the desired power and α level.
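
As a cross-check of the G*Power result (a sketch, not part of the original analysis; assumes the pwr package is installed and enters the same effect size f used in G*Power):

# Sample size for a one-way ANOVA with k = 2 groups; n is reported per group,
# so roughly n = 24 per group corresponds to the total N = 48 from G*Power
library(pwr)
pwr.anova.test(k = 2, f = 0.48, sig.level = 0.05, power = 0.9)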

As shown below, a random sample of 48 districts was selected from the Caschool dataframe.

# Extract a sample of appropriate size from the dataframe
sample_size <- 48
# Set RNG seed for reproducibility of data for report
set.seed(2)
Casample <- Caschool[sample(nrow(Caschool), sample_size),]

ANOVA

A 2-way ANOVA was performed, and since we chose to block by county we will ignore the F-statistic associated with that factor.

# Perform 2-way ANOVA on the sample and show the summary
MyAOV <- aov(Casample$expnstu ~ Casample$grspan+Casample$county)
summary(MyAOV)
##                 Df  Sum Sq Mean Sq F value  Pr(>F)   
## Casample$grspan  1 1436319 1436319   9.768 0.00616 **
## Casample$county 29 5376570  185399   1.261 0.31312   
## Residuals       17 2499668  147039                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model adequacy checking

To form a robust conclusion from the ANOVA, it is important to confirm that the dataset meets the assumptions of the model. While the histogram of the population seemed to indicate some skewness in the data, a Normal Q-Q plot was generated to more clearly ascertain the degree to which the sample deviates from a normal distribution (Fig. 4).
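
The Q-Q plot was generated from the model residuals (the same code appears in the appendix):

# Create a Normal Q-Q plot of the ANOVA residuals and add a reference line
qqnorm(residuals(MyAOV))
qqline(residuals(MyAOV))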

Fig. 4 A normal Q-Q plot indicates that the sample deviates from the normal distribution assumption.

However, a residuals plot reveals no systematic variation from the model (Fig. 5). Therefore, the quality of fit of the model is good.
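
The residuals plot was generated as follows (also listed in the appendix):

# Plot residuals against fitted values to check for systematic patterns
plot(fitted(MyAOV), residuals(MyAOV))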

Fig. 5 Residuals plot shows no systematic variations from the model.

The model appears to be appropriate; however, the skewness of the data will affect the probabilities associated with the F-statistic. To better assess the probability distribution function of the data, Resampling was considered as an alternative to the NHST.

Contingencies to the experimental design

The Q-Q plot shows a strong deviation from normality. In addition, as mentioned above, the histogram of expnstu indicates a positive skewness to the distribution of expenditure per student across the districts. This means that some districts spend more money per student than would be expected from a normal distribution.

A random number generator seed was set to ensure reproducibility of the conclusions drawn in this report. However, it should be noted that when the seed was removed and repeated random sample selections were performed, in about half of the samples the ANOVA results returned an F-statistic below the critical threshold of 4.05. Therefore, a sample size of 48 might be too small to draw conclusions about the entire population. This behavior could be a result of the skewness of the distribution or the occurrence of outliers within the sample.

4. Alternatives to NHST

Null Hypothesis Significance Testing (e.g., ANOVA, t-test) is common practice, but not all distributions meet the assumptions of ANOVA, such as normality of residuals or homogeneity of variances.2

Here, we explore two alternatives to NHST, namely Resampling Statistics and the Plot Plus Error Bar (PPE) Approach.

Resampling Statistics

If the population distribution does not follow the assumptions set forth in NHST, then the test statistics can be false and misleading. Resampling Statistics is a computational approach that finds an empirical estimate of the probability distribution function (PDF) of the data.

If you perform ANOVA and obtain the critical F-statistic associated with a p-value of 0.05, that means that, on repetition of the experiment, you would expect to witness an F-statistic greater than this critical value only 5% of the time. Resampling tests that assumption by performing many iterations of an experiment and logging the F-statistic each time. Afterwards, the accuracy of the initial F-statistic may be judged.

Resampling was performed using a for loop. The number of iterations was chosen to be 10,000. Inside the loop, a random sample is generated, ANOVA is performed, and the corresponding F-value is pulled from the ANOVA summary into a global array. A histogram of the array shows the shape of the PDF, which at first glance appears similar to the standard F-distribution where df = 1 (Fig. 6).

# Resampling Technique
# Repeatedly take new sample, perform aov, store F-value in an array
num_iter <- 10000 # number of iterations in the for loop
Fvals <- numeric(num_iter) # preallocate a vector to store the F-value from each repeated test
for (i in 1:num_iter) {
  # Take a new random sample (of the same sample size used in the NHST)
  Casample <- Caschool[sample(nrow(Caschool), sample_size),]
  # Extract the F-value of interest from the aov summary output
  Fvals[i] <- summary(aov(Casample$expnstu ~ Casample$grspan+Casample$county))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")

Fig. 6 Probability distribution function (PDF) of the empirically-determined F-distribution using the Resampling Technique.

While the PDF has the same general shape as the F-distribution (df = 1), we can calculate the actual probability of F-values falling above the previously determined critical value (F = 4.05) and compare it with our α level of 0.05.
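
For a visual comparison (a sketch, not part of the original analysis; the denominator degrees of freedom, 17, are taken from the residual row of the ANOVA table above), the theoretical F density can be overlaid on the resampled histogram:

# Overlay the theoretical F(1, 17) density on the resampled PDF
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Resampled PDF vs. Theoretical F-Distribution")
curve(df(x, df1 = 1, df2 = 17), from = 0, to = max(Fvals), add = TRUE, lwd = 2)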

# Determine the fraction of the PDF that ACTUALLY falls above the critical F-value
Fcrit <- 4.05
length(which(Fvals>Fcrit))/length(Fvals)
## [1] 0.2895

As we find, the actual probability of observing an F-value greater than 4.05 is much larger than 0.05. Therefore, we must ensure that our assumptions of normality and homogeneity are met before fully relying on G*Power to reduce our sample size.

Plot Plus Error Bar (PPE) Approach

Oftentimes, when a conclusion must be drawn from a plot, error bars help portray the confidence with which that conclusion may be drawn. Using a simple function from James Jones for adding error bars to a bar plot,3 code was written to compare the mean expenditures using a bar plot with error bars representing a 95% confidence interval (Fig. 7). Given that there is no overlap in the error bars, we can say with confidence at the 95% level that the mean expenditure differs between K-6 and K-8 school districts.

# Plot Plus Error Bar (PPE) Approach
# User-defined function for creating error bars by James Jones (Ref. 3)
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
  if(length(x) != length(y) | length(y) !=length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
  arrows(x,y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}

expnstu_means <- c(m1, m2)
expnstu_sds <- c(sd1, sd2)
mybarplot <- barplot(expnstu_means, names.arg = c("K-6", "K-8"), ylim = c(5000,5800), xpd = FALSE, col="gray", axis.lty=1, xlab="Grade Span", ylab="Expenditure per student", main = "Bar Plot of Mean Expenditure with Error Bars")
# Error bars: 95% CI half-widths, computed with the total district count (n = 420)
error.bar(mybarplot, expnstu_means, 1.96*expnstu_sds/sqrt(length(Caschool$expnstu)))

Fig. 7 Bar plot comparing the means of expenditure data for both grade span groups, with the aid of error bars representing a 95% confidence interval.

The length of the error bars is based on a 95% confidence interval where z* = 1.96, and this assumes a normal distribution. However, while Resampling Statistics may give a better indication of the actual distribution of the data, the PPE Approach is a useful tool for quickly communicating the robustness of a given difference between two means.
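
One refinement worth noting (a sketch, not part of the original analysis): the standard errors above divide by the full count of 420 districts, whereas per-group error bars would divide each group's standard deviation by that group's own district count:

# Per-group 95% CI half-widths: divide each group's SD by the square root of its own n
group_n <- as.numeric(table(Caschool$grspan))  # counts for KK-06 and KK-08, in level order
error.bar(mybarplot, expnstu_means, 1.96 * expnstu_sds / sqrt(group_n))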

5. References

  1. Ludford, P.J. “An Overview: Choosing the Correct Statistical Test.” University of Minnesota. Accessed November 2, 2016. http://www-users.cs.umn.edu/~ludford/stat_overview.htm.
  2. Cochran, William G., and Gertrude M. Cox. Experimental Designs. 1957.
  3. Jones, J. “Plotting Error Bars in R.” August 24, 2009. Accessed November 2, 2016. http://monkeysuncle.stanford.edu/?p=485.

6. Appendices

Raw data

The California Test Score Data Set can be found by installing and loading the Ecdat R package and using the Caschool dataframe within.

Complete and documented R code

# Project 2: Null Hypothesis Statistical Testing
# ISYE 6020
# Mike Deagen

# load the Ecdat R package (install it first if necessary)
require(Ecdat)

# explore the head and tail of the Caschool dataframe
head(Caschool, n=10)
tail(Caschool, n=10)

# Examine the structure of the dataframe
str(Caschool)

# user-defined function to calculate the effect size (Cohen's d) for any num or int variable in the Caschool dataframe, grouped by grade span (KK-06, KK-08)
MyCohensd <- function(variable){
  # calculate the difference between the means of the variable in the KK-06 and KK-08 groups
  mean1 <- mean(subset(variable, Caschool$grspan=="KK-06"))
  mean2 <- mean(subset(variable, Caschool$grspan=="KK-08"))
  meandiff <- mean2 - mean1
  
  # calculate the pooled standard deviation of the variable in the KK-06 and KK-08 groups
  SD1 <- sd(subset(variable, Caschool$grspan=="KK-06"))
  SD2 <- sd(subset(variable, Caschool$grspan=="KK-08"))
  SDpooled <- (SD1 + SD2)/2
  
  # calculate Cohen's d, which is the mean difference divided by the pooled standard deviation
  result <- abs(meandiff/SDpooled)
  return(result)
}

# exploring the different variables to find one with a moderate-to-large effect size
MyCohensd(Caschool$enrltot)   # total enrollment
MyCohensd(Caschool$teachers)  # number of teachers
MyCohensd(Caschool$calwpct)   # percent qualifying for CalWorks (welfare program)
MyCohensd(Caschool$mealpct)   # percent qualifying for reduced-price lunch
MyCohensd(Caschool$computer)  # number of computers
MyCohensd(Caschool$testscr)   # average test score (reading + math)/2
MyCohensd(Caschool$compstu)   # computers per student
MyCohensd(Caschool$expnstu)   # expenditure per student
MyCohensd(Caschool$str)       # student teacher ratio
MyCohensd(Caschool$avginc)    # district average income
MyCohensd(Caschool$elpct)     # percent of English learners
MyCohensd(Caschool$readscr)   # average reading score
MyCohensd(Caschool$mathscr)   # average math score

# the above is bulky and the output is lengthy, so a better way is to use the apply() function on selected columns of interest (ColumnsofInterest)

# Create a list of all columns of interest, then run the MyCohensd function on each column
# The output is a table with the value of Cohen's d for each column of interest
ColumnsofInterest <- c('enrltot', 'teachers', 'calwpct', 'mealpct', 'computer', 'testscr', 'compstu', 'expnstu', 'str', 'avginc', 'elpct', 'readscr', 'mathscr')
Cohensd_Data <- apply(Caschool[,ColumnsofInterest], 2, function(x) MyCohensd(x))



# Generate a histogram of the expnstu variable
hist(Caschool$expnstu, main = "Expenditures per Student in California School Districts", xlab = "Expenditures per student (dollars)", ylab = "Number of districts")

# Calculating main effect
#   Take the difference between the means of the grspan in the KK-06 and KK-08 groups
m1 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
m2 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))
me <- m2 - m1

# Also calculate the group standard deviations, to be used later
sd1 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
sd2 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))

# Print the result
me

# Generate boxplot of expnstu versus grspan group
boxplot(expnstu~grspan, data=Caschool, vertical=TRUE, las=1, ylab="Expenditure per student", main="Main Effect of Grade Span on Expenditure")

# Extract a sample of appropriate size from the dataframe
sample_size <- 48
# Set RNG seed for reproducibility of data for report
set.seed(2)
Casample <- Caschool[sample(nrow(Caschool), sample_size),]


# Perform 2-way ANOVA on the sample and show the summary
MyAOV <- aov(Casample$expnstu ~ Casample$grspan+Casample$county)
summary(MyAOV)

# Create a Normal Q-Q plot to see how well the data fit a normal distribution
qqnorm(residuals(MyAOV))
# Add a reference line to the Q-Q plot
qqline(residuals(MyAOV))

# Create a residuals plot for the model
plot(fitted(MyAOV), residuals(MyAOV))


# Resampling Technique
# Repeatedly take new sample, perform aov, store F-value in an array
num_iter <- 10000 # number of iterations in the for loop
Fvals <- numeric(num_iter) # preallocate a vector to store the F-value from each repeated test
for (i in 1:num_iter) {
  Casample <- Caschool[sample(nrow(Caschool), sample_size),]
  Fvals[i] <- summary(aov(Casample$expnstu ~ Casample$grspan+Casample$county))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")

# Determine the fraction of the PDF that ACTUALLY falls above the critical F-value
Fcrit <- 4.05
length(which(Fvals>Fcrit))/length(Fvals)


# Plot Plus Error Bar (PPE) Approach
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
  if(length(x) != length(y) | length(y) !=length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
  arrows(x,y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}

# Generate a bar plot with the mean expenditures, and include error bars
expnstu_means <- c(m1, m2)
expnstu_sds <- c(sd1, sd2)
barp <- barplot(expnstu_means, names.arg = c("K-6", "K-8"), ylim = c(5000,5800), xpd = FALSE, col="gray", axis.lty=1, xlab="Grade Span", ylab="Expenditure per student", main = "Bar Plot of Mean Expenditure with Error Bars")
error.bar(barp, expnstu_means, 1.96*expnstu_sds/sqrt(length(Caschool$expnstu)))