The California public school system conducts standardized tests and assembles the reading and math scores, along with other descriptive data, from its school districts. These data are important because they can impact the level of public funding a school district receives. In this report, the effect of a school district's grade span (K-6 or K-8) on its expenditure per student is analyzed.
The California Test Score Data Set is contained within the Ecdat R package in the Caschool dataframe. Using the command ?Caschool, we learn that the dataframe includes a cross-section from 1998-1999 and contains the following information for each school district:
distcod: district code
county: name of county
district: name of district
grspan: grade span of district
enrltot: total enrollment
teachers: number of teachers
calwpct: percent qualifying for CalWorks
mealpct: percent qualifying for reduced-price lunch
computer: number of computers
testscr: average test score, (reading + math)/2
compstu: number of computers per student
expnstu: expenditure per student
str: student-to-teacher ratio
avginc: district average income
elpct: percent of English learners
readscr: average reading score
mathscr: average math score

To get a sense of what the data look like, we can inspect the head and tail of the dataframe.
# View the beginning and end of the Caschool dataframe
head(Caschool, n=10)
## distcod county district grspan enrltot
## 1 75119 Alameda Sunol Glen Unified KK-08 195
## 2 61499 Butte Manzanita Elementary KK-08 240
## 3 61549 Butte Thermalito Union Elementary KK-08 1550
## 4 61457 Butte Golden Feather Union Elementary KK-08 243
## 5 61523 Butte Palermo Union Elementary KK-08 1335
## 6 62042 Fresno Burrel Union Elementary KK-08 137
## 7 68536 San Joaquin Holt Union Elementary KK-08 195
## 8 63834 Kern Vineland Elementary KK-08 888
## 9 62331 Fresno Orange Center Elementary KK-08 379
## 10 67306 Sacramento Del Paso Heights Elementary KK-06 2247
## teachers calwpct mealpct computer testscr compstu expnstu str
## 1 10.90 0.5102 2.0408 67 690.80 0.34358975 6384.911 17.88991
## 2 11.15 15.4167 47.9167 101 661.20 0.42083332 5099.381 21.52466
## 3 82.90 55.0323 76.3226 169 643.60 0.10903226 5501.955 18.69723
## 4 14.00 36.4754 77.0492 85 647.70 0.34979424 7101.831 17.35714
## 5 71.50 33.1086 78.4270 171 640.85 0.12808989 5235.988 18.67133
## 6 6.40 12.3188 86.9565 25 605.55 0.18248175 5580.147 21.40625
## 7 10.00 12.9032 94.6237 28 606.75 0.14358975 5253.331 19.50000
## 8 42.50 18.8063 100.0000 66 609.00 0.07432432 4565.746 20.89412
## 9 19.00 32.1900 93.1398 35 612.50 0.09234829 5355.548 19.94737
## 10 108.00 78.9942 87.3164 0 612.65 0.00000000 5036.211 20.80556
## avginc elpct readscr mathscr
## 1 22.690001 0.000000 691.6 690.0
## 2 9.824000 4.583333 660.5 661.9
## 3 8.978000 30.000002 636.3 650.9
## 4 8.978000 0.000000 651.9 643.5
## 5 9.080333 13.857677 641.8 639.9
## 6 10.415000 12.408759 605.7 605.4
## 7 6.577000 68.717949 604.5 609.0
## 8 8.174000 46.959461 605.5 612.5
## 9 7.385000 30.079157 608.9 616.1
## 10 11.613333 40.275921 611.9 613.4
tail(Caschool, n=10)
## distcod county district grspan enrltot
## 411 61770 Contra Costa Orinda Union Elementary KK-08 2422
## 412 68908 San Mateo Hillsborough City Elementary KK-08 1318
## 413 69161 Santa Barbara Cold Spring Elementary KK-06 220
## 414 68981 San Mateo Portola Valley Elementary KK-08 687
## 415 69682 Santa Clara Saratoga Union Elementary KK-08 2341
## 416 68957 San Mateo Las Lomitas Elementary KK-08 984
## 417 69518 Santa Clara Los Altos Elementary KK-08 3724
## 418 72611 Ventura Somis Union Elementary KK-08 441
## 419 72744 Yuba Plumas Elementary KK-08 101
## 420 72751 Yuba Wheatland Elementary KK-08 1778
## teachers calwpct mealpct computer testscr compstu expnstu str
## 411 139.47 0.0000 0.1239 466 698.20 0.1924030 5933.154 17.36574
## 412 87.06 0.1517 0.0000 412 698.25 0.3125948 7593.406 15.13898
## 413 12.33 0.4545 0.0000 22 698.45 0.1000000 6500.450 17.84266
## 414 44.59 0.3049 0.0000 209 699.10 0.3042212 7217.263 15.40704
## 415 124.09 0.1709 0.5980 286 700.30 0.1221700 5392.639 18.86534
## 416 59.73 0.1016 3.5569 195 704.30 0.1981707 7290.339 16.47413
## 417 208.48 1.0741 1.5038 721 706.75 0.1936090 5741.463 17.86263
## 418 20.15 3.5635 37.1938 45 645.00 0.1020408 4402.832 21.88586
## 419 5.00 11.8812 59.4059 14 672.20 0.1386139 4776.336 20.20000
## 420 93.40 6.9235 47.5712 313 655.75 0.1760405 5993.393 19.03640
## avginc elpct readscr mathscr
## 411 40.26400 0.5780346 699.1 697.3
## 412 35.81000 2.8072839 695.4 701.1
## 413 43.23000 1.3636364 693.3 703.6
## 414 50.67700 1.1644833 698.3 699.9
## 415 40.40200 2.0504057 698.9 701.7
## 416 28.71700 5.9959350 700.9 707.7
## 417 41.73411 4.7261009 704.0 709.5
## 418 23.73300 24.2630386 648.3 641.7
## 419 9.95200 2.9702971 667.9 676.5
## 420 12.50200 5.0056243 660.5 651.0
The pair of categorical independent variables (IVs) selected was grspan and county.
Grade span is a factor with two levels: K-6 (KK-06) or K-8 (KK-08).
County is a factor with 45 levels. There are 420 districts included in the dataframe, so each county contains, on average, just over nine districts.
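These counts can also be tallied directly with table(); for example:

# Tally the number of districts in each grade-span group and in each county
table(Caschool$grspan)
table(Caschool$county)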
These assumptions about the number of factor levels are confirmed by examining the structure of the dataframe below.
# Examine the structure of the dataframe
str(Caschool)
## 'data.frame': 420 obs. of 17 variables:
## $ distcod : int 75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
## $ grspan : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ enrltot : int 195 240 1550 243 1335 137 195 888 379 2247 ...
## $ teachers: num 10.9 11.1 82.9 14 71.5 ...
## $ calwpct : num 0.51 15.42 55.03 36.48 33.11 ...
## $ mealpct : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer: int 67 101 169 85 171 25 28 66 35 0 ...
## $ testscr : num 691 661 644 648 641 ...
## $ compstu : num 0.344 0.421 0.109 0.35 0.128 ...
## $ expnstu : num 6385 5099 5502 7102 5236 ...
## $ str : num 17.9 21.5 18.7 17.4 18.7 ...
## $ avginc : num 22.69 9.82 8.98 8.98 9.08 ...
## $ elpct : num 0 4.58 30 0 13.86 ...
## $ readscr : num 692 660 636 652 642 ...
## $ mathscr : num 690 662 651 644 640 ...
While the first few columns in the Caschool dataframe are categorical (county, district, grspan), the remaining columns may be considered continuous variables. Three of these columns (calwpct, mealpct, elpct) are percentages.
To perform ANOVA, a continuous dependent variable (DV) must be selected as the response variable.
Initially, I wanted to use compstu, the number of computers per student, as the response variable. The rationale was that the number of computers per student might differ based on the grade span of a particular district.
A user-defined function MyCohensd was created to compute Cohen's d across the two levels of the non-blocking IV, grade span. Because grspan is a factor with only two levels, the calculation of Cohen's d is straightforward and follows the formula:

$$d = \frac{\mu_{K\text{-}8} - \mu_{K\text{-}6}}{\sigma_{pooled}}, \qquad \text{where} \qquad \sigma_{pooled} = \frac{\sigma_{K\text{-}6} + \sigma_{K\text{-}8}}{2}$$
# User-defined function to calculate the effect size (Cohen's d) for any num or int variable in the Caschool dataframe, grouped by grade span (KK-06, KK-08)
MyCohensd <- function(variable){
  # Calculate the difference between the means of the variable in the KK-06 and KK-08 groups
  mean1 <- mean(subset(variable, Caschool$grspan=="KK-06"))
  mean2 <- mean(subset(variable, Caschool$grspan=="KK-08"))
  meandiff <- mean2 - mean1
  # Calculate the pooled standard deviation of the variable in the KK-06 and KK-08 groups
  SD1 <- sd(subset(variable, Caschool$grspan=="KK-06"))
  SD2 <- sd(subset(variable, Caschool$grspan=="KK-08"))
  SDpooled <- (SD1 + SD2)/2
  # Calculate Cohen's d, which is the mean difference divided by the pooled standard deviation
  result <- abs(meandiff/SDpooled)
  return(result)
}
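For example, the function can be applied to a single column of the dataframe (this call also appears in the code appendix):

# Effect size of grade span on the number of computers per student
MyCohensd(Caschool$compstu)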
The effect size determined by Cohen's d is arbitrarily categorized as small (d ≈ 0.2), medium (d ≈ 0.5), or large (d ≈ 0.8).
The user-defined function MyCohensd was applied to all columns of interest (collected in a character vector called ColumnsofInterest) using the apply() function.
# Create a vector of all columns of interest, then run the MyCohensd function on each column
# The apply() call returns a named vector with the value of Cohen's d for each column of interest
# The second argument, 2, indicates that the function is applied to each column; if this argument were 1, the function would be applied to each row
ColumnsofInterest <- c('enrltot', 'teachers', 'calwpct', 'mealpct', 'computer', 'testscr', 'compstu', 'expnstu', 'str', 'avginc', 'elpct', 'readscr', 'mathscr')
Cohensd_Data <- apply(Caschool[,ColumnsofInterest], 2, function(x) MyCohensd(x))
# Show the Cohen's d results
Cohensd_Data
## enrltot teachers calwpct mealpct computer testscr
## 0.15750676 0.17015810 0.17856325 0.30656704 0.08976346 0.45251294
## compstu expnstu str avginc elpct readscr
## 0.06624935 0.48402902 0.26219572 0.45018749 0.15778971 0.41222605
## mathscr
## 0.47949551
Surprisingly, it turns out that the effect size for compstu is trivial (d < 0.1). Therefore, I chose to study the expenditure per student (expnstu) as the response variable instead, which has a moderate-to-large effect size (d = 0.48).
The first test of these data is a Null Hypothesis Significance Test (NHST), in which an ANOVA was performed. Later, two alternatives to NHST are explored: Resampling Statistics and the Plot Plus Error Bar (PPE) Approach.
The factors and response variable were selected according to the selection rules for ANOVA,1 which state that ANOVA is appropriate when the IVs are categorical and the DV (response) is continuous. Given the limited categorical data provided in the dataframe, grspan and county seemed like the most appropriate IVs. After exploring the effect sizes of various potential DVs, expnstu was selected as the response variable of interest.
In this experiment, the null hypothesis states that there is no difference between the two grspan groups (K-6 and K-8) in terms of expnstu, the expenditure per student. The research hypothesis is that there is a difference in the amount of money spent per student between the two grade span groups.
A Type I error α of 0.05 was used for this analysis, as it is the most common cutoff used to determine statistical significance.
A Type II error β of 0.1 was used, corresponding to a power (1 - β) of 0.9, which exceeds the commonly preferred minimum power of 0.8. Increasing the power increases the sample size required to draw a statistically significant conclusion.
The rationale for the ANOVA is that it is a straightforward analysis for a blocked experiment: we may perform a 2-way ANOVA and simply ignore the F-value associated with the blocking IV.
A random sample was selected, and the sample size was determined using the program G*Power. Selection of the sample is performed in a later section, following the sample size determination. Note that, for the randomization, a seed was set to ensure reproducibility. The sample() function was used to pull a random sample of the given size from the dataframe.
Due to the nature of the dataset, which is a cross-section of data from 1998-1999, we do not have multiple data points for each district and are therefore unable to perform replication of the measurements.
The data were blocked by county in order to reduce the noise in the expenditure data caused by the locations of the school districts within the state of California (where regions may differ in socioeconomic status). A 2-way ANOVA was performed, but the F-statistic associated with the blocking variable (county) is simply ignored.
Before exploring alternatives to NHST, we first explore the data, check the validity of the model assumptions, and perform an ANOVA. G*Power was used to set the sample size.
When we take a look at the histogram of expenditure per student for California school districts, we see some positive skew in the data (Fig. 1). Skewness violates the assumption of population normality for conducting ANOVA, but the skew is slight, so we proceed with the analysis while making note of it in the Contingencies section below.
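For reference, the histogram in Fig. 1 was produced with the following base-R call (reproduced from the code appendix):

# Generate a histogram of the expnstu variable
hist(Caschool$expnstu, main = "Expenditures per Student in California School Districts", xlab = "Expenditures per student (dollars)", ylab = "Number of districts")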
Fig. 1 Histogram of the expenditure per student for California school districts in 1998-1999.
Calculation of the main effect of the non-blocking IV grspan on the response DV expnstu is straightforward given that the grspan factor has two levels. Therefore, the main effect is the difference in the means, as calculated below.
# Calculating main effect
# Take the difference between the means of expnstu in the KK-06 and KK-08 groups
m1 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
m2 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))
me <- m2 - m1
# Also calculate the group standard deviations, to be used later
sd1 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
sd2 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))
# Print the result
me
## [1] -310.1282
While there is clearly a difference between the means, the spread of the data is not captured. To elucidate the main effect further, a boxplot was generated for each level of the grspan factor (Fig. 2).
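The boxplot in Fig. 2 was generated with the following call (reproduced from the code appendix):

# Generate boxplot of expnstu versus grspan group
boxplot(expnstu~grspan, data=Caschool, vertical=TRUE, las=1, ylab="Expenditure per student", main="Main Effect of Grade Span on Expenditure")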
Fig. 2 Boxplots to show the main effect of grade span on expenditure per student.
While the median expenditure of the K-8 group is less than the median expenditure of the K-6 group, a number of high outliers occur in the K-8 group.
While the original dataframe is not prohibitively large to analyze, it can still be a useful practice to minimize the sample size. Using the program G*Power 3.1.9.2 for Windows 10, with the α level and power specified above, the following output was generated, indicating a critical F-statistic of Fcrit = 4.05 and a total sample size of N = 48 (Fig. 3).
Fig. 3 G*Power output indicating the total sample size required to achieve the desired power and α level.
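As a rough cross-check, a comparable sample size can be computed in R with the pwr package. This is a sketch only: it assumes pwr is installed and that the effect size was entered into G*Power as Cohen's f = 0.48; the actual G*Power inputs are not reproduced in this report.

# Hypothetical cross-check of the G*Power sample size with the pwr package
# ASSUMPTION: effect size entered as Cohen's f = 0.48 (not confirmed above)
library(pwr)
pwr.anova.test(k = 2, f = 0.48, sig.level = 0.05, power = 0.9)
# The returned n is the required sample size per group; total N = 2 * n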
Seen below, a sample size of 48 was randomly selected from the Caschool dataframe.
# Extract a sample of appropriate size from the dataframe
sample_size <- 48
# Set RNG seed for reproducibility of data for report
set.seed(2)
Casample <- Caschool[sample(nrow(Caschool), sample_size),]
A 2-way ANOVA was performed, and since we chose to block by county, we will ignore the F-statistic associated with that factor.
# Perform 2-way ANOVA on the sample and show the summary
MyAOV <- aov(Casample$expnstu ~ Casample$grspan+Casample$county)
summary(MyAOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## Casample$grspan 1 1436319 1436319 9.768 0.00616 **
## Casample$county 29 5376570 185399 1.261 0.31312
## Residuals 17 2499668 147039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To form a robust conclusion from the ANOVA, it is important to confirm that the dataset meets the assumptions of the model. While the histogram of the population seemed to indicate some skewness in the data, a Normal Q-Q plot was generated to more clearly ascertain the degree to which the sample deviates from a normal distribution (Fig. 4).
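The Q-Q plot in Fig. 4 is built from the model residuals (code reproduced from the appendix):

# Create a Normal Q-Q plot to see how well the data fit a normal distribution
qqnorm(residuals(MyAOV))
# Add a reference line to the Q-Q plot
qqline(residuals(MyAOV))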
Fig. 4 A normal Q-Q plot indicates that the sample deviates from the normal distribution assumption.
However, a residuals plot reveals no systematic variation from the model (Fig. 5). Therefore, the quality of the model fit is good.
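The residuals plot in Fig. 5 (code reproduced from the appendix):

# Create a residuals plot for the model
plot(fitted(MyAOV), residuals(MyAOV))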
Fig. 5 Residuals plot shows no systematic variations from the model.
The model appears to be appropriate; however, the skewness of the data will affect the probabilities associated with the F-statistic. To better assess the probability distribution function of the data, Resampling was considered as an alternative to the NHST.
The Q-Q plot shows a strong deviation from normality. In addition, as mentioned above, the histogram of expnstu indicates a positive skewness in the distribution of expenditure per student across the districts. This means that some districts spend more money per student than would be expected under a normal distribution.
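The degree of skew can also be quantified numerically. A minimal sketch, assuming the e1071 package (not used elsewhere in this report) is available:

# ASSUMPTION: the e1071 package supplies the sample skewness function
# A positive value confirms the right skew seen in the histogram (Fig. 1)
library(e1071)
skewness(Caschool$expnstu)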
A random number generator seed was set to ensure reproducibility of the conclusions drawn in this report. However, it should be noted that when the seed was removed and repeated random sample selections were performed, about half of the samples produced an ANOVA F-statistic below the critical threshold of 4.05. Therefore, it seems that a sample size of 48 might be too small to draw conclusions about the entire population. This behavior could be a result of the skewness of the distribution or the occurrence of outliers within the sample.
Null Hypothesis Statistical Testing (e.g. ANOVA, t-test) is common practice, but not all distributions meet the assumptions of ANOVA, such as normality of residuals or homogeneity of variances.2
Here, we explore two alternatives to NHST, namely Resampling Statistics and the Plot Plus Error Bar (PPE) Approach.
If the population distribution does not follow the assumptions set forth in NHST, then the test statistics can be misleading. Resampling Statistics is a computational approach that finds an empirical estimate of the probability distribution function (PDF) of the data.
If you perform ANOVA and achieve a critical F-statistic associated with a p-value of 0.05, that means that, upon repetition of the experiment, you would expect to witness an F-statistic greater than this critical value only 5% of the time. Resampling tests that assumption by performing many iterations of the experiment and logging the F-statistic each time. Afterwards, the accuracy of the initial critical F-statistic may be judged.
Resampling was performed using a for loop. The number of iterations was chosen to be 10,000. Inside the loop, a random sample is generated, ANOVA is performed, and the corresponding F-value is pulled from the ANOVA summary into a global array. A histogram of the array shows the shape of the PDF, which at first glance appears similar to the standard F-distribution with numerator df = 1 (Fig. 6).
# Resampling Technique
# Repeatedly take new sample, perform aov, store F-value in an array
num_iter <- 10000 # number of iterations in the for loop
Fvals <- numeric(num_iter) # preallocate the vector that stores the F-value from each repeated test
for (i in 1:num_iter) {
  # Take a new random sample (of the same sample size used in the NHST)
  Casample <- Caschool[sample(nrow(Caschool), sample_size),]
  # Extract the F-value of interest from the aov summary output
  Fvals[i] <- summary(aov(Casample$expnstu ~ Casample$grspan+Casample$county))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")
Fig. 6 Probability distribution function (PDF) of the empirically-determined F-distribution using the Resampling Technique.
While the PDF has the same general shape as the F-distribution (numerator df = 1), we can calculate the actual probability of F-values that fall above our previously determined critical value (F = 4.05) and compare it with our α level of 0.05.
# Determine the fraction of the PDF that ACTUALLY falls above the critical F-value
Fcrit <- 4.05
length(which(Fvals>Fcrit))/length(Fvals)
## [1] 0.2895
As we find, the actual probability of finding an F-value greater than 4.05 is much larger than 0.05. Therefore, we must be careful that our assumptions of normality and homogeneity have been met before fully relying on G*Power to reduce our sample size.
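For comparison, the nominal tail probability above the same cutoff under the theoretical F-distribution can be computed with pf(). This sketch assumes the degrees of freedom implied by the G*Power calculation (numerator df = 1 and denominator df = 46 for N = 48 in two groups):

# Nominal tail probability above Fcrit under the theoretical F-distribution
# (assumed df of 1 and 46, consistent with the critical value of about 4.05)
pf(Fcrit, df1 = 1, df2 = 46, lower.tail = FALSE)
# By construction this is about 0.05, versus the ~0.29 found empirically above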
Oftentimes, when a plot is created from which one must draw a conclusion, error bars help to portray the confidence with which a conclusion may be drawn from the plot. Using a simple function from James Jones for adding error bars to a bar plot,3 code was written to compare the mean expenditures using a bar plot with error bars representing a 95% confidence interval (Fig. 7). Given that there is no overlap in the error bars, we can say with confidence at the 95% level that the mean expenditure differs between K-6 and K-8 school districts.
# Plot Plus Error Bar (PPE) Approach
# User-defined function for creating error bars by James Jones (Ref. 3)
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
  if(length(x) != length(y) | length(y) != length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
  arrows(x, y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
expnstu_means <- c(m1, m2)
expnstu_sds <- c(sd1, sd2)
mybarplot <- barplot(expnstu_means, names.arg = c("K-6", "K-8"), ylim = c(5000,5800), xpd = FALSE, col="gray", axis.lty=1, xlab="Grade Span", ylab="Expenditure per student", main = "Bar Plot of Mean Expenditure with Error Bars")
error.bar(mybarplot, expnstu_means, 1.96*expnstu_sds/sqrt(length(Caschool$expnstu)))
Fig. 7 Bar plot comparing the means of expenditure data for both grade span groups, with the aid of error bars representing a 95% confidence interval.
The length of the error bars is based on a 95% confidence interval where z* = 1.96, which assumes a normal distribution. While Resampling Statistics may give a better indication of the actual distribution of the data, the PPE Approach is a useful tool for quickly communicating the robustness of a given difference between two means.
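Note that the error.bar() call above divides each group's standard deviation by the square root of the full sample size of 420 districts. A hypothetical refinement, not part of the original analysis, would use each grade-span group's own sample size when computing the standard errors:

# Hypothetical refinement: per-group standard errors for the error bars
n_groups <- as.numeric(table(Caschool$grspan)) # district counts for KK-06, KK-08
se_groups <- expnstu_sds / sqrt(n_groups)      # group-specific standard errors
error.bar(mybarplot, expnstu_means, 1.96 * se_groups)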
The California Test Score Data Set can be found by installing and loading the Ecdat R package and using the Caschool dataframe within.
# Project 2: Null Hypothesis Statistical Testing
# ISYE 6020
# Mike Deagen
# Load the Ecdat R package (it must be installed first)
require(Ecdat)
# explore the head and tail of the Caschool dataframe
head(Caschool, n=10)
tail(Caschool, n=10)
# Examine the structure of the dataframe
str(Caschool)
# user-defined function to calculate the effect size (Cohen's d) for any num or int variable in the Caschool dataframe, grouped by grade span (KK-06, KK-08)
MyCohensd <- function(variable){
  # calculate the difference between the means of the variable in the KK-06 and KK-08 groups
  mean1 <- mean(subset(variable, Caschool$grspan=="KK-06"))
  mean2 <- mean(subset(variable, Caschool$grspan=="KK-08"))
  meandiff <- mean2 - mean1
  # calculate the pooled standard deviation of the variable in the KK-06 and KK-08 groups
  SD1 <- sd(subset(variable, Caschool$grspan=="KK-06"))
  SD2 <- sd(subset(variable, Caschool$grspan=="KK-08"))
  SDpooled <- (SD1 + SD2)/2
  # calculate Cohen's d, which is the mean difference divided by the pooled standard deviation
  result <- abs(meandiff/SDpooled)
  return(result)
}
# exploring the different variables to find one with a moderate-to-large effect size
MyCohensd(Caschool$enrltot) # total enrollment
MyCohensd(Caschool$teachers) # number of teachers
MyCohensd(Caschool$calwpct) # percent qualifying for CalWorks (welfare program)
MyCohensd(Caschool$mealpct) # percent qualifying for reduced-price lunch
MyCohensd(Caschool$computer) # number of computers
MyCohensd(Caschool$testscr) # average test score (reading + math)/2
MyCohensd(Caschool$compstu) # computers per student
MyCohensd(Caschool$expnstu) # expenditure per student
MyCohensd(Caschool$str) # student teacher ratio
MyCohensd(Caschool$avginc) # district average income
MyCohensd(Caschool$elpct) # percent of English learners
MyCohensd(Caschool$readscr) # average reading score
MyCohensd(Caschool$mathscr) # average math score
# The above is bulky and the output is lengthy, so a better way is to use the apply() function on selected columns of interest (ColumnsofInterest)
# Create a list of all columns of interest, then run the MyCohensd function on each column
# The output is a table with the value of Cohen's d for each column of interest
ColumnsofInterest <- c('enrltot', 'teachers', 'calwpct', 'mealpct', 'computer', 'testscr', 'compstu', 'expnstu', 'str', 'avginc', 'elpct', 'readscr', 'mathscr')
Cohensd_Data <- apply(Caschool[,ColumnsofInterest], 2, function(x) MyCohensd(x))
# Generate a histogram of the expnstu variable
hist(Caschool$expnstu, main = "Expenditures per Student in California School Districts", xlab = "Expenditures per student (dollars)", ylab = "Number of districts")
# Calculating main effect
# Take the difference between the means of expnstu in the KK-06 and KK-08 groups
m1 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
m2 <- mean(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))
me <- m2 - m1
# Also calculate the group standard deviations, to be used later
sd1 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-06"))
sd2 <- sd(subset(Caschool$expnstu, Caschool$grspan=="KK-08"))
# Print the result
me
# Generate boxplot of expnstu versus grspan group
boxplot(expnstu~grspan, data=Caschool, vertical=TRUE, las=1, ylab="Expenditure per student", main="Main Effect of Grade Span on Expenditure")
# Extract a sample of appropriate size from the dataframe
sample_size <- 48
# Set RNG seed for reproducibility of data for report
set.seed(2)
Casample <- Caschool[sample(nrow(Caschool), sample_size),]
# Perform 2-way ANOVA on the sample and show the summary
MyAOV <- aov(Casample$expnstu ~ Casample$grspan+Casample$county)
summary(MyAOV)
# Create a Normal Q-Q plot to see how well the data fit a normal distribution
qqnorm(residuals(MyAOV))
# Add a reference line to the Q-Q plot
qqline(residuals(MyAOV))
# Create a residuals plot for the model
plot(fitted(MyAOV), residuals(MyAOV))
# Resampling Technique
# Repeatedly take new sample, perform aov, store F-value in an array
num_iter <- 10000 # number of iterations in the for loop
Fvals <- numeric(num_iter) # preallocate the vector that stores the F-value from each repeated test
for (i in 1:num_iter) {
  Casample <- Caschool[sample(nrow(Caschool), sample_size),]
  Fvals[i] <- summary(aov(Casample$expnstu ~ Casample$grspan+Casample$county))[[1]][["F value"]][1]
}
hist(Fvals, freq = FALSE, xlab = "F-value", main = "Probability Density Function (PDF) from Resampling")
# Determine the fraction of the PDF that ACTUALLY falls above the critical F-value
Fcrit <- 4.05
length(which(Fvals>Fcrit))/length(Fvals)
# Plot Plus Error Bar (PPE) Approach
error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
  if(length(x) != length(y) | length(y) != length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
  arrows(x, y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
# Generate a bar plot with the mean expenditures, and include error bars
expnstu_means <- c(m1, m2)
expnstu_sds <- c(sd1, sd2)
barp <- barplot(expnstu_means, names.arg = c("K-6", "K-8"), ylim = c(5000,5800), xpd = FALSE, col="gray", axis.lty=1, xlab="Grade Span", ylab="Expenditure per student", main = "Bar Plot of Mean Expenditure with Error Bars")
error.bar(barp, expnstu_means, 1.96*expnstu_sds/sqrt(length(Caschool$expnstu)))