For almost a decade, I was employed in an administrative capacity within higher education institutions (both public and private). I should preface this by stating that I worked only for institutions in New York State, where SUNY and CUNY tuition rates are among the lowest in the nation. While working in Admissions & Recruitment, one of the questions I frequently encountered was whether private (not-for-profit) institutions were better than public ones.
In general, the assumption was that you were paying for a better education. The disparity in cost may have skewed some people’s thinking, but the real point is that there is more to the quality of an education than the price tag or the football team’s record. This is a subject that people should care about because choosing a college (should one choose to attend) is one of the biggest decisions a young person will make, and in my personal belief the general public could be better informed about the college application and selection process. This is especially true for prospective students who are the first in their family to attend college.
Using statistical tools, I will attempt to accept or reject the hypothesis that private schools are more competitive than public schools.
The data were collected using the Integrated Postsecondary Education Data System (IPEDS), which contains information from surveys conducted annually by the U.S. Department of Education’s National Center for Education Statistics (NCES). This is a reliable source for all types of information about institutions of higher education, as the federal government requires that institutions participating in federal aid programs (Financial Aid) report data.
Using the IPEDS Data Center, you can look up one or several institutions, and dozens of different variables to compare against. For this project, I am only looking at data for the most recent year (2014-15), and selected institution name, location, sector/level, tuition and fees, acceptance rate, and standardized test score information. After choosing variables, the information is then output in a .csv file, which is hosted in the GitHub repository for this project.
The cases found in the data set are accredited, degree-granting higher education institutions in the United States which require standardized test scores for their admissions considerations, and have supplied them as part of the NCES survey. The data do not include any open-enrollment institutions (Community Colleges, mainly), or institutions that did not provide test score data. In total there are 1186 cases, which is a good sample size relative to the entire population of accredited institutions in the United States.
For this project, the variables I will be studying are the standardized test scores (SAT & ACT, 25\(^{th}\) percentile), the acceptance rate (as a percent of applicants) and the “sector” of the institution, which classifies the college or university as public or private.
The standardized test scores and admission rate are response variables, which are discrete and numeric, while the sector classification of the institution is the explanatory variable, which is a nominal categorical variable.
The population of interest is all accredited, degree-granting institutions in the United States. Since the sample we are using includes a good number of the institutions in the population, the findings from this analysis can be generalized to the population. The sample may be broken down further and randomized to be more representative. It is important to note that there are more private institutions than public institutions in the data set. One potential source of bias in the data set that would limit generalizability is the set of institutions that did not report their test scores (if any). The data set we are utilizing only includes institutions that reported and/or require standardized test scores for admission. An institution that did not report may have done so because of poor results, and not having these cases in our study would affect our ability to generalize to the population.
The data for this analysis were collected using an observational study - that is, the data collection process did not directly interfere with how the data arose. Because of this, a causal link between the variables cannot be established, but we can make a statement about the association between the two variables which can be generalized to the population.
## Admit_Pct
## Min. : 8.00
## 1st Qu.: 56.00
## Median : 69.00
## Mean : 66.95
## 3rd Qu.: 80.00
## Max. :100.00
## [1] 18.01072
Using R’s summary function, the mean admission rate for public colleges is just under 67%, and the median is only 2 points higher. There is some slight skew, but the admission rate is nearly normal, with a standard deviation of 18.01.
## Admit_Pct
## Min. : 5.00
## 1st Qu.: 52.00
## Median : 66.00
## Mean : 62.75
## 3rd Qu.: 76.00
## Max. :100.00
## [1] 20.37907
The summary for admission rates for private colleges shows a mean of just under 63%. The median is slightly higher than this, indicating that the admission rates are skewed slightly to the left. Interestingly, the IQR for both public and private college admission rates is 24. The mean admission rate is slightly lower than that of public colleges in our data set, and we will look at a few more parameters before deciding how meaningful this is.
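As a rough sketch of how per-sector summaries like the two above could be produced in one step, the example below uses `aggregate()` on a small simulated data frame. The `colleges` data frame and its values are made up for illustration; only the column names `Sector` and `Admit_Pct` follow the project data.

```r
# Simulated stand-in for the project data; these are NOT the IPEDS values.
set.seed(7)
colleges <- data.frame(
  Sector    = rep(c("Public", "Private"), times = c(50, 80)),
  Admit_Pct = round(pmin(100, pmax(5, c(rnorm(50, 67, 18), rnorm(80, 63, 20)))))
)

# Mean, median, standard deviation and IQR of the admission rate by sector.
stats_by_sector <- aggregate(
  Admit_Pct ~ Sector,
  data = colleges,
  FUN  = function(x) c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
)
stats_by_sector
```

Computing both groups in one call makes it harder to mislabel which summary belongs to which sector.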
We can verify the shape of our data by looking at a histogram of both sets:
Both histograms are unimodal, with some slight skew to the left. Since these are percentages, the range for both histograms is 0 to 100; no negative values are present, as a negative admissions rate is not possible.
We can further evaluate our data by looking at the boxplot for both public and private schools:
The box plots also confirm that there are more outliers in the lower range of acceptance rates for the private colleges and universities. Without examining the specific institutions, one could guess that some of the most selective schools in the country (i.e. Ivy League or similar) are private schools, accounting for the appearance of more outliers in this range.
Finally, we can look at a scatterplot of the acceptance rates for public and private colleges with a regression line drawn between them. Since the Sector variable is a two-level categorical variable, the values will be stacked upon each other, with the regression line drawn between them.
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Looking at the regression line, there is a slight upward slope. This is a positive slope, but we need to keep in mind that lower admission rates are better than higher ones. For emphasis, another plot with the regression line is drawn over the box plots from earlier:
As above, we will look at some of our other variables in the same way. Since SAT scores are given in the two widely used sections, critical reading and math (no writing), as well as a combined score of the two, we will look at all three together:
## SAT_CR SAT_MAT SAT_COMP
## Min. :290.0 Min. :250.0 Min. : 653.0
## 1st Qu.:420.0 1st Qu.:430.0 1st Qu.: 850.0
## Median :450.0 Median :460.0 Median : 910.0
## Mean :460.4 Mean :473.3 Mean : 933.6
## 3rd Qu.:490.0 3rd Qu.:510.0 3rd Qu.:1000.0
## Max. :640.0 Max. :700.0 Max. :1310.0
## [1] 55.42
## [1] 64.17498
## [1] 116.8723
The scores for all three sections appear to have the same shape: nearly normal, unimodal, and skewed slightly to the right, as indicated by the median being less than the mean.
## SAT_CR SAT_MAT SAT_COMP
## Min. :265.0 Min. :200.0 Min. : 520
## 1st Qu.:430.0 1st Qu.:430.0 1st Qu.: 859
## Median :460.0 Median :460.0 Median : 920
## Mean :476.2 Mean :480.9 Mean : 957
## 3rd Qu.:510.0 3rd Qu.:510.0 3rd Qu.:1020
## Max. :730.0 Max. :770.0 Max. :1500
## NA's :1 NA's :1
## [1] 76.14863
## [1] 81.94305
## [1] 155.7852
Looking at the SAT scores for private colleges, the same characteristics that appeared in the public colleges (nearly normal, unimodal) are apparent. The means for the critical reading and math sections are about 16 and 8 points higher, respectively, for the private colleges, and the mean combined score (957) is about 23 points higher than that of the public colleges (933.6).
Visualizing the data with a histogram, we can confirm the findings from the summary function:
## Warning: Removed 2 rows containing non-finite values (stat_bin).
The histograms for private and public colleges for all three sections appear very similar. As with the acceptance rates, the private colleges appear to have more high-scoring students in their reported cohort.
Again, we’ll use side-by-side boxplots to further evaluate these data:
All three variables have similar boxplots; the medians are relatively close, with a few outliers on the lower range of the scores, and many on the upper end. The larger number of private colleges in the data set could account for this, but even taking this into consideration, the typical SAT scores for each section are very similar.
As with our previous admission acceptance rate variable, below is a scatterplot with our regression line drawn between the two “stacks” of SAT scores. As with the SAT data visualizations above, the plot is faceted for the three different SAT scores in our data.
Another version of the same plot is below, this time using the linear model we stored in our S2 object, just to check that our plot above is correct.
Lastly, we’ll look at another standardized test score, one that in recent years has gained in popularity - some would argue that the ACT is a more balanced test, evaluating students on other subject areas besides math and English. Unlike the SAT, which is evaluated on a scale of 200 to 800 points per section, the ACT has four main sections, each with a minimum of 1, and a maximum of 36. The four sections are averaged together to give the composite score.
## ACT_COMP
## Min. :15.00
## 1st Qu.:18.00
## Median :19.00
## Mean :20.02
## 3rd Qu.:22.00
## Max. :30.00
## NA's :28
## [1] 2.990267
The mean ACT score for public institutions is 20, with the median (19) being 1 point less. The IQR is 4, while the standard deviation is just slightly under 3.
## ACT_COMP
## Min. :11.00
## 1st Qu.:18.00
## Median :20.00
## Mean :20.98
## 3rd Qu.:23.00
## Max. :34.00
## NA's :16
## [1] 3.931565
Following a similar trend, the mean for the private schools in our data set is almost 1 full point higher (20.98), and the median (20) is again about 1 point below the mean. The IQR of 5 is slightly larger, and the standard deviation is about 3.9. Looking at side-by-side histograms, we can confirm the shape of the data:
Both histograms show nearly normal, unimodal data, with a slight skew to the right. Checking the boxplot for the ACT will confirm our findings from above as well:
Examining the boxplots, the IQR of the ACT scores for private schools is greater than that of the public schools, showing greater variability of the scores. The whiskers for the private school ACT scores also extend out further. Both this and the IQR can be attributed to the greater range and larger number of the private schools in our sample.
Finally, we look at the last scatterplot, for the ACT composite score variable. As with the SAT variables, the private schools have a slightly higher mean than the public schools. The regression line is displayed between the two levels of the categorical variable, and the same is drawn on the box plot below.
After conducting some exploratory data analysis, the summary statistics and visualizations for each numerical variable, split by sector, show very similar trends. Overall, the statistics for the private colleges show slightly higher means for the standardized test scores, and a slightly lower mean for the acceptance rate (being more selective is better).
However, the difference is slight. In the acceptance rate, the public colleges had a mean of about 67%, while the private colleges had a mean of slightly under 63%. The IQR was exactly the same, but the standard deviation was a bit higher for the private colleges, showing more variability in the data. In the standardized test scores, public colleges had a mean combined SAT score of 933 and ACT composite of 20, while private colleges had a mean combined SAT of 957 and ACT composite of 20.98. Like the selectivity, the variability of the test scores for private colleges is greater; this could be due to the larger number in the data set, but also the greater range of scores.
Before investigating further, I am surprised that the data were so similar, and I would hypothesize that the higher means for the standardized test scores (reminder, these are the 25th percentile scores) are being skewed just enough by the more competitive and highly selective colleges. Of the colleges with less than a 10 percent acceptance rate, only two were not private, and those are both military academies.
A least-squares regression line was fit to our two-level categorical scatterplot in the previous section for each numerical variable. The categorical variable for Sector was converted in the transformation of our data at the beginning of this analysis - a new column was added to the data set, with values of 0 added for private colleges, and 1 for public, making this the indicator variable.
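To illustrate what the indicator coding does, here is a minimal sketch on simulated data. The `colleges` data frame and its scores are invented for the example; only the column names (`Sector`, `SAT_COMP`, `pub_priv`) and the 0 = private, 1 = public coding follow the description above. With a 0/1 predictor, the fitted intercept is the mean of the private group and the slope is the public mean minus the private mean.

```r
# Simulated stand-in for the project data; the scores are made up.
set.seed(1)
colleges <- data.frame(
  Sector   = rep(c("Private", "Public"), times = c(60, 40)),
  SAT_COMP = round(c(rnorm(60, mean = 957, sd = 150),
                     rnorm(40, mean = 934, sd = 115)))
)

# Indicator variable coded as in the text: 0 = private, 1 = public.
colleges$pub_priv <- ifelse(colleges$Sector == "Public", 1, 0)

# Least-squares fit of the score on the indicator.
fit <- lm(SAT_COMP ~ pub_priv, data = colleges)

# Group means for comparison with the fitted coefficients.
group_means <- tapply(colleges$SAT_COMP, colleges$Sector, mean)

coef(fit)                                       # intercept and slope
group_means["Public"] - group_means["Private"]  # equals the slope
```

This is why the regression line in the stacked scatterplots simply connects the two group means.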
Unlike the conditions for fitting a least squares regression line to data sets with numerical data on both axes, the conditions for categorical predictors with two levels are slightly different, as the linearity assumption will always be satisfied. We still need to look at the other conditions before moving forward:
Examining the histograms and normal probability plots, the residuals do not appear to be nearly normal, as there is quite a bit of skew for all four variables.
As above, since this is a two-level categorical predictor, we can check for constant variability by simply looking at the original scatterplots for all four variables. Since all of the points fall along the line, the variability appears to be fairly constant.
The only link between the observations is the students who are applying and being accepted to the institutions. Many institutions will have similar applicants, especially institutions in the same geographic region. However, the chance of the exact same mix of students applying and being admitted to two colleges is virtually zero, so we can presume the observations are independent.
Because the residuals do not appear to be nearly normal, and have a good amount of skew in the tails, we cannot rely solely on the linear model for any of the four variables for inference on the relationship with an institution being public or private.
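For readers who want to reproduce this kind of residual check, the sketch below draws a histogram and a normal probability plot of the residuals from a two-level model. The data are simulated with deliberate right skew and are not the IPEDS values; the `pub_priv` and `SAT_COMP` names mirror the project's columns.

```r
# Simulated, right-skewed scores; purely illustrative.
set.seed(11)
colleges <- data.frame(pub_priv = rep(c(0, 1), times = c(60, 40)))
colleges$SAT_COMP <- 800 + rexp(100, rate = 1 / 150) - 20 * colleges$pub_priv

fit <- lm(SAT_COMP ~ pub_priv, data = colleges)
res <- resid(fit)

# Histogram of residuals: a long right tail suggests skew.
hist(res, main = "Residuals", xlab = "Residual")

# Normal probability plot: points bending away from the line
# indicate a departure from normality.
qqnorm(res)
qqline(res)
```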
-Simulation based inference - hypothesis test and confidence interval
Because of the skew in each variable for both the public and private college data, we will do a simulation by sampling 100 colleges from our divided data sets (used earlier to create the pub_priv column). To make things a little simpler, we’ll just use the combined SAT scores to make our inference.
What we will do is take the sample and examine the difference in the means of each sample, and construct a 95% confidence interval for the average difference. We’ll take a sample of 100 public, and 100 private colleges, as that will be equal to less than 10% of the population of all institutions in the appropriate sector. The sample will be random, so we will meet our conditions as long as the distributions of the data look symmetric without too much skew.
# sample of public colleges
pub_sat <- pub_coll$SAT_COMP
pub_samp <- sample(pub_sat, 100)
# sample of private colleges
priv_sat <- priv_coll$SAT_COMP
priv_samp <- sample(priv_sat, 100)
hist(pub_samp)
hist(priv_samp)
Looking at the histograms for both the public and private school data, both are relatively normal, with some slight skew. Because the sample sizes are larger than 30, we can disregard this.
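One way to make a random draw like the one above repeatable is to call `set.seed()` before `sample()`. The sketch below is only an illustration: `pub_sat` here is a simulated stand-in for `pub_coll$SAT_COMP`, and the seed values are arbitrary.

```r
# Simulated stand-in for pub_coll$SAT_COMP; not the real scores.
set.seed(2015)
pub_sat <- round(rnorm(400, mean = 934, sd = 117))

# Drop any missing scores before sampling so the sample has no NA values.
pub_sat <- pub_sat[!is.na(pub_sat)]

# Fixing the seed immediately before sampling makes the draw repeatable.
set.seed(42)
first_draw <- sample(pub_sat, 100)

set.seed(42)
second_draw <- sample(pub_sat, 100)

# The two draws contain exactly the same colleges in the same order.
identical(first_draw, second_draw)
```

With a fixed seed, the summary output and the figures quoted in the discussion would stay in agreement each time the document is re-run.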
Using our summary and standard deviation functions in R, we can calculate the difference between two sample means:
summary(pub_samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 653.0 840.0 907.0 916.6 1010.0 1180.0
sd(pub_samp, na.rm=TRUE)
## [1] 105.0532
summary(priv_samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520.0 850.0 930.0 952.7 1022.0 1390.0
sd(priv_samp, na.rm=TRUE)
## [1] 151.7806
Because R draws a new random sample each time the code is run, we will simply look at one particular instance, whose figures differ from the output printed above. In that instance, the mean combined SAT score for the public colleges is 944.5, and the mean for the same variable in our private colleges is 960.4. The point estimate is the difference between these two, which is 15.9 points (private colleges will be our \(\bar{x}_1\)). Now we will calculate the standard error for the difference between the two means:
Sector | mean SAT | s | n |
---|---|---|---|
Private | 960.4 | 163.83 | 100 |
Public | 944.5 | 123.79 | 100 |
\(SE_{(\bar{x}_{priv}-\bar{x}_{pub})} = \sqrt{\frac{163.83^2}{100} + \frac{123.79^2}{100}} = 20.53393\)
Now that we have our Standard Error, we can calculate the confidence interval using the following:
\(15.9 \pm z^* \times SE\) = \(15.9 \pm 1.96 \times 20.53393\) = \(15.9 \pm 40.2465\)
The 95% confidence interval is (-24.35, 56.15)
We are 95% confident that the difference in mean combined SAT score (critical reading plus math) between private and public colleges is between -24.35 and 56.15 points.
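The standard error and interval above can be recomputed directly from the summary figures in the table. This is a small sketch using only those reported numbers; the variable names are chosen here for readability and do not come from the original code.

```r
# Summary figures copied from the table in the text.
mean_priv <- 960.4; sd_priv <- 163.83; n_priv <- 100
mean_pub  <- 944.5; sd_pub  <- 123.79; n_pub  <- 100

# Point estimate: private mean minus public mean.
point_est <- mean_priv - mean_pub

# Standard error of the difference between two independent sample means.
se_diff <- sqrt(sd_priv^2 / n_priv + sd_pub^2 / n_pub)

# 95% confidence interval using the normal critical value (about 1.96).
z_star <- qnorm(0.975)
ci <- point_est + c(-1, 1) * z_star * se_diff

round(se_diff, 5)
round(ci, 2)
```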
Our formal hypothesis for testing the difference in the mean combined CR+Math SAT score is
\(H_{0}\): The mean combined SAT score for private colleges is the same as the mean combined SAT score for public colleges. Any difference observed would simply be due to chance.
\(H_{A}\): There is a difference in the mean combined SAT score between private and public colleges.
Test statistic and p-value: \(Z = \frac{(\bar{x}_{priv} - \bar{x}_{pub}) - 0}{SE} = \frac{15.9 - 0}{20.53393} = 0.7743281\)
Since the sample sizes were sufficiently large, we’ll look up the Z value in the normal probability table, which gives us an area of 0.7794. Subtracting this from 1, we get the area of the upper tail:
\(\text{upper tail} = 1 - 0.7794 = 0.2206\)
Multiplying by 2, we get our p-value: 0.2206 x 2 = 0.4412. Since the p-value is so high, we cannot reject our null hypothesis. The data do not provide convincing evidence that the mean SAT score for private colleges is different from the mean SAT score for public colleges.
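The table lookup can be cross-checked in R with `pnorm()`. The sketch below plugs in the point estimate and standard error reported above; because the printed table rounds Z to two decimals, the result should be close to, but not exactly, 0.4412.

```r
# Values taken from the calculation in the text.
point_est <- 15.9
se_diff   <- 20.53393

# Test statistic under the null hypothesis of no difference.
z <- (point_est - 0) / se_diff

# Two-sided p-value: twice the upper-tail area beyond |z|.
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

z
p_value
```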
Originally, we set out to find any difference in the metrics used for college admission (standardized test scores) between public and private colleges. Using linear models on the different variables did not seem possible given our current knowledge of statistics and the skew on both ends of each data set, which did not meet one of the conditions required for fitting a linear model.
However, from reviewing the summary statistics, we could already see there was not much difference in the means of all of the variables. Without using a fancier model, we have evaluated the difference in the two means to measure how significant the differences are. With 95% confidence, we estimated the difference in the means to lie between about -24 and just over 56 points; because zero falls inside that range, the difference is not statistically significant. Differences of this size amount to getting just a few more questions correct on the test, and based on this particular sample we cannot conclude that private colleges are overall better academically (or rather, attract better-performing high school students) than their public counterparts.
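As a complement to the normal-theory calculation, a simulation-based check is possible with a permutation test. This is not the method used in this analysis; it is a sketch of an alternative, run on simulated stand-ins for `priv_samp` and `pub_samp` rather than the real samples. The number of shuffles (5000) is an arbitrary choice.

```r
# Simulated stand-ins for the two samples; NOT the IPEDS values.
set.seed(123)
priv_samp <- rnorm(100, mean = 960, sd = 160)
pub_samp  <- rnorm(100, mean = 945, sd = 125)

observed <- mean(priv_samp) - mean(pub_samp)
pooled   <- c(priv_samp, pub_samp)
n_priv   <- length(priv_samp)

# Shuffle the sector labels many times and recompute the difference in means.
perm_diffs <- replicate(5000, {
  idx <- sample(length(pooled), n_priv)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: share of shuffles at least as extreme as the observed gap.
p_perm <- mean(abs(perm_diffs) >= abs(observed))
p_perm
```

Because the shuffling makes no normality assumption, this approach sidesteps the concern about skew in the scores.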
The goal of this analysis was to draw a conclusion on whether private colleges are “better”, for lack of a better term, than public colleges or universities. This is a fairly high-level (as in 40,000 foot view) comparison, as there are many variables to consider when comparing colleges. For example, the SAT scores used to draw our conclusion only represent the 25th percentile scores, and while this is a standardized test, some studies have shown bias in standardized test results stemming from areas with different demographics.
Whether examining all of the cases in the sample set, or just the random sample of 100 institutions, we have concluded that there is not a significant difference in the average SAT score of private and public colleges. A more in-depth look would be to apply the difference in means to the other variables. The SAT seemed to be the best variable to go with, as not every school provided ACT composite scores, and acceptance rates can be affected by many factors; mainly that this is simply the ratio of applicants admitted for one particular year, with the “bar” being set by administration. Essentially, this could be raised or lowered at any time in the admission cycle, so a low acceptance rate for some colleges is not all that it seems (though this can have an effect on the yield and resulting enrollment).
For further analysis, it would be useful to compare institutions with similar characteristics (size, location), or even to compare test scores to actual cost, rather than just the two-level sector variable. It does not make too much sense to try to compare the University of Chicago to SUNY Plattsburgh. In conclusion, just as a prospective student needs to consider more than just the acceptance rate or average SAT score of the previous Freshman cohort, there are many other variables (some of which could be considered as two-level categorical) that should be considered in a comparison like this. I hope to revisit this analysis after advancing my understanding of statistics.