This paper examines the claim that higher socioeconomic status correlates with higher standardized test scores.
Socioeconomic status is an integral part of our society, shaping diet, housing, entertainment, and many other parts of life. I intend to investigate whether or not this privilege extends to education. The analysis uses the High School and Beyond 2 (hsb2) dataset.
# Load in libraries
library(openintro, quietly = TRUE) # The package which contains the dataset
library(ggplot2, quietly = TRUE) # A package to assist graphing
library(reshape2, quietly = TRUE) # A package to assist graphing
# Put the dataset into our environment. This allows addition of columns to the frame.
hsb2 <- hsb2
Let us examine the structure of the hsb2 dataset.
# Output structure of our dataset
str(hsb2)
## 'data.frame': 200 obs. of 11 variables:
## $ id : int 70 121 86 141 172 113 50 11 84 48 ...
## $ gender : chr "male" "female" "male" "male" ...
## $ race : chr "white" "white" "white" "white" ...
## $ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 3 2 2 2 2 2 2 ...
## $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
## $ prog : Factor w/ 3 levels "general","academic",..: 1 3 1 3 2 2 1 2 1 2 ...
## $ read : int 57 68 44 63 47 44 50 34 63 57 ...
## $ write : int 52 59 33 44 52 52 59 46 57 55 ...
## $ math : int 41 53 54 47 57 51 42 45 54 52 ...
## $ science: int 47 63 58 53 53 63 53 39 58 50 ...
## $ socst : int 57 61 31 56 61 61 61 36 51 51 ...
The data frame contains several fields that are not needed. I will be ignoring race and gender, as well as the type of program and school the student was enrolled in. The important fields for this analysis are socio-economic status (ses) and the five test scores (read, write, math, science, socst).
## This graph strongly suggests that a higher socioeconomic status is associated with increased performance on tests,
## however, the groups overlap considerably, so the size of that improvement may be small.
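As a numerical companion to the comparison above, the mean score of each socioeconomic group can be tabulated per subject. This is a minimal sketch in base R, not the code behind the graph; the helper name `mean_by_group` is hypothetical, and the column names are taken from the `str(hsb2)` output.

```r
# Mean score per socioeconomic group for each subject.
# `scores` is any data frame with one grouping column and numeric score columns.
mean_by_group <- function(scores,
                          group = "ses",
                          subjects = c("read", "write", "math", "science", "socst")) {
  # aggregate() splits the score columns by group and applies mean() to each
  means <- aggregate(scores[subjects], by = scores[group], FUN = mean)
  # Attach group sizes so that small groups are visible next to their means
  counts <- table(scores[[group]])
  means$n <- as.vector(counts[as.character(means[[group]])])
  means
}

# Usage with the dataset from this paper (skipped if openintro is not installed)
if (requireNamespace("openintro", quietly = TRUE)) {
  print(mean_by_group(openintro::hsb2))
}
```

Reporting the group size alongside each mean helps judge how much weight a difference between groups can bear.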
I will begin by performing Analysis of Variance to determine which subjects require further investigation.
# Calculate Analysis of Variance for each test
aov_reading <- aov(read ~ ses, data = hsb2)
aov_writing <- aov(write ~ ses, data = hsb2)
aov_mathematics <- aov(math ~ ses, data = hsb2)
aov_science <- aov(science ~ ses, data = hsb2)
aov_socialStudies <- aov(socst ~ ses, data = hsb2)
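Each fitted model stores its ANOVA table, from which the F statistic and P-Value can be read. The following is a minimal sketch of that extraction; the helper name `anova_result` is hypothetical and not part of the original analysis.

```r
# Pull the F statistic and p-value for the grouping term out of a fitted aov model.
anova_result <- function(fit) {
  tab <- summary(fit)[[1]]     # ANOVA table: one row per term, then residuals
  c(F = tab[["F value"]][1],   # first row is the grouping term (here, ses)
    p = tab[["Pr(>F)"]][1])
}

# Usage with the dataset from this paper (skipped if openintro is not installed)
if (requireNamespace("openintro", quietly = TRUE)) {
  fit <- aov(science ~ ses, data = openintro::hsb2)
  print(anova_result(fit))
}
```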
Next, a check must be made to ensure the data are normally distributed, since non-normal data may distort the results. I will use the Shapiro-Wilk test. Its null hypothesis is that the data are normally distributed, so a P-Value below 0.05 is evidence against normality, while a P-Value above 0.05 means the data are consistent with a normal distribution.
## [1] "Reading Shapiro-Wilk P-Value: 0.0401"
## [1] "Writing Shapiro-Wilk P-Value: 0"
## [1] "Mathematics Shapiro-Wilk P-Value: 0.0368"
## [1] "Science Shapiro-Wilk P-Value: 0.3629"
## [1] "Social Studies Shapiro-Wilk P-Value: 2e-04"
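The code behind these P-Values is not shown, so the following is only a sketch of how such a check could be run. It assumes the test is applied to the residuals of each ANOVA model, which is what the normality requirement of Analysis of Variance concerns; the original may have tested the raw scores instead. The helper name `shapiro_p` is hypothetical.

```r
# Shapiro-Wilk p-value for the residuals of a one-way ANOVA of `subject` on `group`.
# Assumption: residuals are tested; the original analysis may have used raw scores.
shapiro_p <- function(scores, subject, group = "ses") {
  # reformulate("ses", response = "read") builds the formula read ~ ses
  fit <- aov(reformulate(group, response = subject), data = scores)
  shapiro.test(residuals(fit))$p.value
}

# Usage with the dataset from this paper (skipped if openintro is not installed)
if (requireNamespace("openintro", quietly = TRUE)) {
  subjects <- c("read", "write", "math", "science", "socst")
  print(round(sapply(subjects, shapiro_p, scores = openintro::hsb2), 4))
}
```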
Only the Science scores are consistent with a normal distribution; the other four subjects are not, so the requirements for Analysis of Variance are not met for them. I will use a Kruskal-Wallis rank sum test as a fallback for the subjects which are not normal.
## [1] "Reading Kruskal P-Value: 4e-04"
## [1] "Writing Kruskal P-Value: 0.0037"
## [1] "Mathematics Kruskal P-Value: 2e-04"
## [1] "Social Studies Kruskal P-Value: 0"
## [1] "Science AOV P-Value: 0.00027"
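A Kruskal-Wallis P-Value of this kind can be obtained with `kruskal.test` from base R. This is a minimal sketch under the assumption that each score is compared across the three ses groups; the helper name `kruskal_p` is hypothetical.

```r
# Kruskal-Wallis rank sum test of `subject` across the levels of `group`.
# The test compares ranks rather than raw values, so it does not require normality.
kruskal_p <- function(scores, subject, group = "ses") {
  f <- reformulate(group, response = subject)   # e.g. read ~ ses
  kruskal.test(f, data = scores)$p.value
}

# Usage with the dataset from this paper (skipped if openintro is not installed)
if (requireNamespace("openintro", quietly = TRUE)) {
  non_normal <- c("read", "write", "math", "socst")
  print(round(sapply(non_normal, kruskal_p, scores = openintro::hsb2), 4))
}
```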
All subjects have a P-Value well below 0.05, strongly suggesting there is a difference in performance between social classes. I will now test whether this relationship with socio-economic class is positive.
For a more detailed look at the data, I will be using the Tukey Honestly Significant Difference test.
# Tukey HSD test for reading score
TukeyHSD(aov_reading)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = read ~ ses, data = hsb2)
##
## $ses
## diff lwr upr p adj
## middle-low 3.302352 -0.8430804 7.447784 0.1468044
## high-low 8.223404 3.6612698 12.785539 0.0000948
## high-middle 4.921053 1.0475284 8.794577 0.0085237
For the reading test, there is only sufficient evidence to support the claim that high class students score higher than both middle and low class students; the difference between middle and low class is not significant. Both significant confidence intervals lie entirely above zero, so this difference appears to be a positive relationship.
# Tukey HSD test for writing score
TukeyHSD(aov_writing)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = write ~ ses, data = hsb2)
##
## $ses
## diff lwr upr p adj
## middle-low 1.309295 -2.6052575 5.223847 0.7096950
## high-low 5.296772 0.9887256 9.604818 0.0114079
## high-middle 3.987477 0.3296892 7.645265 0.0289035
Once again, for the writing test, there is only sufficient evidence to suggest that high class students score higher than both middle and low class students; the difference between middle and low class is not significant. Both significant confidence intervals lie entirely above zero, so this difference appears to be a positive relationship.
# Tukey HSD test for mathematics score
TukeyHSD(aov_mathematics)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = math ~ ses, data = hsb2)
##
## $ses
## diff lwr upr p adj
## middle-low 3.040314 -0.7738869 6.854514 0.1464711
## high-low 7.002201 2.8045938 11.199808 0.0003325
## high-middle 3.961887 0.3978687 7.525906 0.0252035
For the mathematics test, there is only sufficient evidence to suggest that high class students score higher than both middle and low class students; the difference between middle and low class is not significant. Both significant confidence intervals lie entirely above zero, so this difference appears to be a positive relationship.
# Tukey HSD test for science score
TukeyHSD(aov_science)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = science ~ ses, data = hsb2)
##
## $ses
## diff lwr upr p adj
## middle-low 4.003135 -0.01646732 8.022738 0.0512092
## high-low 7.746148 3.32249141 12.169805 0.0001547
## high-middle 3.743013 -0.01293570 7.498961 0.0510149
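There is only sufficient evidence to suggest that high class students have better average science scores than low class students; the middle-low and high-middle comparisons narrowly miss significance at the 0.05 level. The high-low confidence interval lies entirely above zero, so this difference appears to be a positive relationship.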
# Tukey HSD test for Social Studies score
TukeyHSD(aov_socialStudies)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = socst ~ ses, data = hsb2)
##
## $ses
## diff lwr upr p adj
## middle-low 4.712430 0.4259517 8.998908 0.0272596
## high-low 9.818782 5.1014232 14.536141 0.0000055
## high-middle 5.106352 1.1010331 9.111671 0.0082532
There is sufficient evidence to suggest that, for social studies, average test scores are different between all socio-economic classes. All three confidence intervals lie entirely above zero, so this difference appears to be a positive relationship.
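The five pairwise comparisons above can also be collected into a single table, which makes it easier to see which differences are significant in which subjects. This is a minimal sketch; the helper name `tukey_table` is hypothetical, and the 0.05 cutoff matches the 95% family-wise confidence level shown in the output.

```r
# Run Tukey HSD for each subject and stack the pairwise results into one data frame.
tukey_table <- function(scores, subjects, group = "ses", alpha = 0.05) {
  rows <- lapply(subjects, function(subject) {
    fit <- aov(reformulate(group, response = subject), data = scores)
    hsd <- TukeyHSD(fit)[[group]]   # matrix with columns diff, lwr, upr, p adj
    data.frame(subject = subject,
               comparison = rownames(hsd),
               diff = unname(hsd[, "diff"]),
               p_adj = unname(hsd[, "p adj"]),
               significant = unname(hsd[, "p adj"]) < alpha,
               row.names = NULL)
  })
  do.call(rbind, rows)
}

# Usage with the dataset from this paper (skipped if openintro is not installed)
if (requireNamespace("openintro", quietly = TRUE)) {
  subjects <- c("read", "write", "math", "science", "socst")
  print(tukey_table(openintro::hsb2, subjects))
}
```

A positive `diff` means the first group named in `comparison` has the higher mean.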
Based on all of these tests I can, with a fair level of confidence, reject the hypothesis that high and low socio-economic classes perform equivalently on tests. It is much more likely that high class students perform better than low class students. In subjects other than science, high class students also do better than middle class students.
There is sufficient evidence to conclude that high class students perform better than low class students in all subjects and better than middle class students in all but Science.
The High School and Beyond 2 dataset is quite small, with only 200 observations. Middle class students make up almost half of these. A sample this small is not ideal, especially when the data is subset into smaller groups. This may lead to under- or overestimating the difference between socio-economic classes. In addition, when performing this many tests on a dataset, it is quite likely that at least one result misrepresents the population. This paper also did not address the difference between public and private schools. More high class students may possess the funds to attend private schools, which may explain why high class students do, on average, better than both middle and low class students.
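The concern about performing many tests can be made concrete with a short calculation. This is a sketch that assumes the tests are independent, which is not strictly true here because all five scores come from the same students; the helper name `familywise_error` is hypothetical.

```r
# Probability of at least one false positive among m independent tests,
# each run at significance level alpha.
familywise_error <- function(m, alpha = 0.05) {
  1 - (1 - alpha)^m
}

# Five overall tests, one per subject, each at the 0.05 level
print(familywise_error(5))   # about 0.226
```

Under that assumption, five tests at the 0.05 level carry roughly a 23% chance of at least one false positive. Base R's `p.adjust`, for example with `method = "holm"`, is one standard way to correct a set of P-Values for this.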
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Benjamin Alexander
Semester: Fall 2018