DATA606 Data Project Final

Part 1 - Introduction:

Do private, not-for-profit Colleges and Universities attract academically better performing students than their public counterparts?

For almost a decade, I was employed in an administrative capacity within Higher Education institutions (both public and private). I should preface this by stating that I worked only for institutions in New York State, where SUNY and CUNY tuition rates are among the lowest in the nation. While working in Admissions & Recruitment, one of the questions I frequently encountered was whether private (not-for-profit) institutions were better than public ones.

In general, the assumption was that you were paying for a better education. The disparity in cost may have skewed some people’s thinking, but the real point is that there is more to the quality of an education than the price tag, or the Football team’s record. This is a subject that people should care about because choosing a college (should one choose to attend) is one of the biggest decisions a young person will make, and in my personal belief the general public could be better informed about the college application and selection process. This is especially true for prospective students who are the first in their family to attend college.

Using statistical tools, I will test the hypothesis that private schools are more competitive than public schools.

In conclusion - talk about different factors that play into choosing a college/university

Part 2 - Data:

Data collection: Describe how the data were collected.

The data were collected using the Integrated Postsecondary Education Data System (IPEDS), which contains information from surveys conducted annually by the U.S. Department of Education’s National Center for Education Statistics (NCES). This is a reliable source for all types of information about institutions of higher education, as the federal government requires that institutions participating in federal aid programs (Financial Aid) report data.

Using the IPEDS Data Center, you can look up one or several institutions, and dozens of different variables to compare against. For this project, I am only looking at data for the most recent year (2014-15), and selected institution name, location, sector/level, tuition and fees, acceptance rate, and standardized test score information. After choosing variables, the information is then output in a .csv file, which is hosted in the GitHub repository for this project.

Cases:

The cases found in the data set are accredited, degree-granting higher education institutions in the United States which require standardized test scores for their admissions considerations, and have supplied them as part of the NCES survey. The data do not include any open-enrollment institutions (Community Colleges, mainly), or institutions that did not provide test score data. In total there are 1186 cases, which is a good-sized sample of the entire population of accredited institutions in the United States.

Variables: What are the two variables you will be studying? State the type of each variable.

For this project, the variables I will be studying are the standardized test scores (SAT & ACT, 25$^{th}$ percentile), the acceptance rate (as a percent of applicants) and the “sector” of the institution, which classifies the college or university as public or private.

The standardized test scores and admission rate are response variables, which are discrete and numeric, while the sector classification of the institution is the explanatory variable, which is a nominal categorical variable.

Type of study: What is the type of study, observational or an experiment? Explain how you’ve arrived at your conclusion using information on the sampling and/or experimental design.

Scope of inference:

Generalizability:

The population of interest is all accredited, degree-granting institutions in the United States. Since the sample we are using includes a good number of the institutions in the population, the findings from this analysis can be generalized to the population. The sample may be broken down further and randomized to be more representative. It is important to note that there are more private institutions than public institutions in the data set. One potential source of bias that would limit generalizability is the set of institutions that did not report their test scores (if any). The data set we are using only includes institutions that reported and/or require standardized test scores for admission. An institution may have declined to report because of poor results, and not having these cases in our study would affect our ability to generalize to the population.

Causality:

The data for this analysis were collected through an observational study - that is, the data collection process did not interfere with how the data arose. Because of this, a causal link between the variables cannot be established, but we can describe an association between the variables that can be generalized to the population.

Part 3 - Exploratory data analysis:

Perform relevant descriptive statistics, including summary statistics and visualization of the data.

Descriptive statistics and visualization

Admission Rate

Summary and Standard Deviation: Public College Admission Rate

##    Admit_Pct     
##  Min.   :  8.00  
##  1st Qu.: 56.00  
##  Median : 69.00  
##  Mean   : 66.95  
##  3rd Qu.: 80.00  
##  Max.   :100.00

## [1] 18.01072

Using R’s summary function, the mean admission rate for public colleges is just under 67%, and the median is only 2 points higher. There is some slight skew, but the admission rate is nearly normal, with a standard deviation of 18.01.

Summary and Standard Deviation: Private College Admission Rate

##    Admit_Pct     
##  Min.   :  5.00  
##  1st Qu.: 52.00  
##  Median : 66.00  
##  Mean   : 62.75  
##  3rd Qu.: 76.00  
##  Max.   :100.00

## [1] 20.37907

The summary of admission rates for private colleges shows a mean of just under 63%. The median is slightly higher than this, indicating that the admission rates are skewed slightly to the left. Interestingly, the IQR for both public and private college admission rates is 24. The mean admission rate is slightly lower than that of public colleges in our data set, and we will look at a few more statistics before deciding how meaningful this is.
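The IQR and skew statements above can be checked directly from the printed summaries. The sketch below simply re-enters the quartiles, means, and medians from the output; the variable names are made up for illustration and are not objects from the original analysis.

```r
# Quartiles copied from the summary() output above (admission rate, %)
pub_q  <- c(q1 = 56, q3 = 80)   # public colleges
priv_q <- c(q1 = 52, q3 = 76)   # private colleges

# IQR = third quartile minus first quartile
iqr_pub  <- unname(pub_q["q3"] - pub_q["q1"])
iqr_priv <- unname(priv_q["q3"] - priv_q["q1"])

# Mean minus median: a negative value hints at left skew
skew_pub  <- 66.95 - 69
skew_priv <- 62.75 - 66

print(c(public = iqr_pub, private = iqr_priv))
print(c(public = skew_pub, private = skew_priv))
```

Both IQRs come out to 24, and both mean-minus-median gaps are negative, which agrees with the slight left skew described in the text.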

We can verify the shape of our data by looking at a histogram of both sets:

Histogram: Admission Rate, Public vs. Private

Both histograms are unimodal, with some slight skew to the left. Since these are percentages, the range for both histograms is 0 to 100; no negative values are present, as a negative admission rate is not possible.

We can further evaluate our data by looking at the boxplot for both public and private schools:

Box Plot: Admission Rate, Public vs. Private

The box plots also confirm that there are more outliers in the lower range of acceptance rates for the private colleges and universities. Without examining the specific institutions, one could guess that some of the most selective schools in the country (e.g. Ivy League or similar) are private schools, accounting for the appearance of more outliers in this range.

Finally, we can look at a scatterplot of the acceptance rates for public and private colleges with a regression line drawn between them. Since the Sector variable is a two-level categorical variable, the values will be stacked upon each other, with the regression line drawn between the two stacks.

Scatterplot: Admission Rate, Public vs. Private

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Looking at the regression line, there is a slight upward slope. This is a positive slope, but we need to keep in mind that lower admission rates are better than higher ones. For emphasis, another plot with the regression line is drawn over the box plots from earlier:

SAT Scores

As above, we will look at some of our other variables in the same way. Since SAT scores are given in the two widely used sections, critical reading and math (no writing), as well as a combined score of the two, we will look at all three together:

Summary and Standard Deviation: Public College SAT Scores

##      SAT_CR         SAT_MAT         SAT_COMP     
##  Min.   :290.0   Min.   :250.0   Min.   : 653.0  
##  1st Qu.:420.0   1st Qu.:430.0   1st Qu.: 850.0  
##  Median :450.0   Median :460.0   Median : 910.0  
##  Mean   :460.4   Mean   :473.3   Mean   : 933.6  
##  3rd Qu.:490.0   3rd Qu.:510.0   3rd Qu.:1000.0  
##  Max.   :640.0   Max.   :700.0   Max.   :1310.0

## [1] 55.42

## [1] 64.17498

## [1] 116.8723

The scores for all three sections appear to have the same shape, that is being nearly normal, unimodal and skewed slightly to the right, as indicated by the median being less than the mean.

Summary and Standard Deviation: Private College SAT Scores

##      SAT_CR         SAT_MAT         SAT_COMP   
##  Min.   :265.0   Min.   :200.0   Min.   : 520  
##  1st Qu.:430.0   1st Qu.:430.0   1st Qu.: 859  
##  Median :460.0   Median :460.0   Median : 920  
##  Mean   :476.2   Mean   :480.9   Mean   : 957  
##  3rd Qu.:510.0   3rd Qu.:510.0   3rd Qu.:1020  
##  Max.   :730.0   Max.   :770.0   Max.   :1500  
##                  NA's   :1       NA's   :1

## [1] 76.14863

## [1] 81.94305

## [1] 155.7852

Looking at the SAT scores for private colleges, the same characteristics that appeared in the public colleges (nearly normal, unimodal) are apparent. The means for the critical reading and math sections are about 16 and 8 points higher, respectively, for the private colleges, and the mean combined score (957) is about 23 points higher than that of the public colleges (933.6).

Visualizing the data with a histogram, we can confirm the findings from the summary function:

Histogram: SAT Scores, Public vs. Private

## Warning: Removed 2 rows containing non-finite values (stat_bin).

The histograms for private and public colleges for all three sections appear very similar. As with the acceptance rates, the private colleges appear to have more high-scoring students in their reported cohort.

Again, we’ll use side-by-side boxplots to further evaluate these data:

Box Plot: SAT Scores, Public vs. Private

All three variables have similar boxplots; the medians are relatively close, with a few outliers on the lower range of the scores, and many on the upper end. The larger number of private colleges in the data set could account for this, but even taking this into consideration, the typical SAT scores for each section are quite close.

As with our previous admission acceptance rate variable, below is a scatterplot with our regression line drawn between the two “stacks” of SAT scores. As with the SAT data visualizations above, the plot is faceted for the three different SAT scores in our data.

Scatterplot: SAT Scores, Public vs. Private

Another version of the same plot is below, this time using the linear model we stored in our S2 object, just to check that our plot above is correct.

Box Plot with Regression Line: SAT Critical Reading

Box Plot with Regression Line: SAT Math

Box Plot with Regression Line: SAT CR + Math

ACT Scores

Lastly, we’ll look at another standardized test score, one that in recent years has gained in popularity - some would argue that the ACT is a more balanced test, evaluating students on other subject areas besides math and English. Unlike the SAT, which is evaluated on a scale of 200 to 800 points per section, the ACT has four main sections, each with a minimum of 1, and a maximum of 36. The four sections are averaged together to give the composite score.

Summary and Standard Deviation: ACT Composite Score, Public Colleges

##     ACT_COMP    
##  Min.   :15.00  
##  1st Qu.:18.00  
##  Median :19.00  
##  Mean   :20.02  
##  3rd Qu.:22.00  
##  Max.   :30.00  
##  NA's   :28

## [1] 2.990267

The mean ACT score for public institutions is 20, with the median (19) being 1 point less. The IQR is 4, while the standard deviation is just slightly under 3.

Summary and Standard Deviation: ACT Composite Score, Private Colleges

##     ACT_COMP    
##  Min.   :11.00  
##  1st Qu.:18.00  
##  Median :20.00  
##  Mean   :20.98  
##  3rd Qu.:23.00  
##  Max.   :34.00  
##  NA's   :16

## [1] 3.931565

Following a similar trend, the mean for the private schools in our data set is almost 1 full point higher (20.98), and the median (20) is again about 1 point less than the mean. The IQR of 5 is slightly larger, and the standard deviation is about 3.9. Looking at side-by-side histograms, we can confirm the shape of the data:

Histogram: ACT Composite Score, Public vs. Private Colleges

Both histograms show nearly normal, unimodal data, with a slight skew to the right. Checking the boxplot for the ACT will confirm our findings from above as well:

Box Plots: ACT Composite Score, Public vs. Private Colleges

Examining the boxplots, the IQR of the ACT scores for private schools is greater than that of the public schools, showing greater variability of the scores. The whiskers for the private school ACT scores also extend out further. Both this and the larger IQR can be attributed to the greater range and larger number of the private schools in our sample.

Finally, the last scatterplot is of the ACT composite score variable. As with the SAT scores, the private schools have a slightly higher mean than the public schools. The regression line is displayed between the two levels of the categorical variable, and the same line is drawn on the box plot below.

Scatterplot: ACT Composite Score, Public vs. Private Colleges

Box Plots with Regression Line: ACT Composite Score, Public vs. Private Colleges

After conducting some exploratory data analysis, the summary statistics and visualizations for each numerical variable, split by sector, show very similar trends. Overall, the statistics for the private colleges show slightly higher means for the standardized test scores, and a slightly lower mean for the acceptance rate (being more selective is better).

However, the difference is slight. In the acceptance rate, the public colleges had a mean of about 67%, while the private colleges had a mean of slightly under 63%. The IQR was exactly the same, but the standard deviation was a bit higher for the private colleges, showing more variability in the data. In the standardized test scores, public colleges had a mean combined SAT score of 933 and ACT composite of 20, while private colleges had a mean combined SAT of 957 and ACT composite of 20.98. Like the selectivity, the variability of the test scores for private colleges is greater; this could be due to the larger number in the data set, but also the greater range of scores.

Before investigating further, I am surprised that the data were so similar, and I would hypothesize that the higher means for the standardized test scores (reminder, these are the 25th percentile scores reported by each institution) are being skewed just enough by the more competitive and highly selective colleges. Of the colleges with less than a 10 percent acceptance rate, only two were not private, and those are both military academies.

Part 4 - Inference:

If your data fails some conditions and you can’t use a theoretical method, then you should use simulation. If you can use both methods, then you should use both methods. It is your responsibility to figure out the appropriate methodology.

Check conditions for Theoretical Inference:

A least-squares regression line was fit to our two-factor categorical scatterplot in the previous section for each numerical variable. The categorical variable for Sector was converted in the transformation of our data at the beginning of this analysis - a new column was added to the data set, with values of 0 added for private colleges, and 1 for public, making this the indicator variable.
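To make the role of the indicator variable concrete, here is a minimal sketch using a small made-up data frame (these are not values from the IPEDS file). It assumes the column is named pub_priv, as in the text, and shows that with a 0/1 predictor the fitted intercept is the mean of the group coded 0 and the slope is the difference between the two group means.

```r
# Toy data for illustration only; these are NOT values from the IPEDS file
toy <- data.frame(
  Sector   = c("Private", "Private", "Private", "Public", "Public", "Public"),
  SAT_COMP = c(900, 1000, 1100, 880, 940, 1000)
)

# Indicator variable: 0 for private colleges, 1 for public colleges
toy$pub_priv <- ifelse(toy$Sector == "Public", 1, 0)

# Least-squares line with the indicator as the only predictor
fit <- lm(SAT_COMP ~ pub_priv, data = toy)

# Intercept = mean of the 0 group (private);
# slope = public mean minus private mean
group_means <- tapply(toy$SAT_COMP, toy$Sector, mean)
print(coef(fit))
print(group_means)
```

Under this coding, a negative slope means the public group has the lower mean, so the sign of the slope depends entirely on which sector is coded as 1.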

Unlike the conditions for fitting the least squares regression line to data sets with numerical data on both axes, the conditions for categorical predictors with two levels are slightly different, as the linearity assumption will always be satisfied. We still need to look at the other conditions before moving forward:

Nearly normal residuals

Histogram and Q-Q Plot: Admission Rate Linear Model Residuals

Histogram and Q-Q Plot: SAT Critical Reading Linear Model Residuals

Histogram and Q-Q Plot: SAT Math Linear Model Residuals

Histogram and Q-Q Plot: SAT Combined Linear Model Residuals

Histogram: ACT Composite Linear Model Residuals

Examining each histogram and normal probability plot, the residuals do not appear to be nearly normal, as there is quite a bit of skew for every variable.

Constant Variability

As above, since this is a two-level categorical predictor, we can check whether the points have constant variability by simply looking at the original scatterplots for each variable. Since the spread of the points around the line is similar in both groups, the variability appears to be fairly constant.
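A numerical complement to the visual check is to compare the group standard deviations reported in Part 3. The sketch below re-enters those values by hand; it assumes the three SAT standard deviations were printed in the order critical reading, math, combined. A common rule of thumb (not a formal test) treats variability as roughly constant when the larger standard deviation is less than about twice the smaller one.

```r
# Standard deviations copied from the output in Part 3
# (assumed order for SAT: critical reading, math, combined)
sds <- data.frame(
  variable = c("Admit_Pct", "SAT_CR", "SAT_MAT", "SAT_COMP", "ACT_COMP"),
  public   = c(18.01072, 55.42, 64.17498, 116.8723, 2.990267),
  private  = c(20.37907, 76.14863, 81.94305, 155.7852, 3.931565)
)

# Ratio of private to public spread for each variable
sds$ratio <- sds$private / sds$public
print(sds)
```

Every ratio is above 1 but well below 2, so the private colleges are consistently more variable, though not by enough to contradict the reading of the plots.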

Independent Observations

The only link between the observations are the students who are applying and being accepted to the institutions. For many institutions, they will have similar applicants, especially for institutions in the same geographic regions. However, the chances of the exact same mix of students applying and being admitted to two colleges is virtually zero, so we can presume the observations are independent.

Because the residuals do not appear to be nearly normal, and have a good amount of skew in the tails, we cannot rely solely on the linear model for any of the variables for inference on the relationship with an institution being public or private.

-Simulation based inference - hypothesis test and confidence interval

Differences of Two Means, Hypothesis Test and Confidence Interval

Because of the skew in each variable for both the public and private college data, we will do a simulation by sampling 100 colleges from our divided data sets (used earlier to create the pub_priv column). To make things a little simpler, we’ll just use the combined SAT scores to make our inference.

What we will do is take the sample and examine the difference in the means of each sample, and construct a 95% confidence interval for the average difference. We’ll take a sample of 100 public, and 100 private colleges, as that will be equal to less than 10% of the population of all institutions in the appropriate sector. The sample will be random, so we will meet our conditions as long as the distributions of the data look symmetric without too much skew.

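# sample of public colleges (missing scores dropped before sampling)

pub_sat <- na.omit(pub_coll$SAT_COMP)
pub_samp <- sample(pub_sat, 100)

# sample of private colleges (SAT_COMP has one NA in the private data)

priv_sat <- na.omit(priv_coll$SAT_COMP)
priv_samp <- sample(priv_sat, 100)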

hist(pub_samp)

hist(priv_samp)

Looking at the histograms for both the public and private school data, both are relatively normal, with some slight skew. Because the sample sizes are larger than 30, we can disregard this.
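The samples drawn above change on every run. One way to keep a particular draw fixed across runs is to call set.seed() immediately before sample(). The sketch below uses a made-up score vector in place of pub_coll$SAT_COMP, and the seed value 606 is arbitrary; neither is taken from the original analysis.

```r
# Hypothetical score vector standing in for pub_coll$SAT_COMP
scores <- seq(700, 1300, by = 5)

set.seed(606)                  # arbitrary seed, an assumption for this sketch
samp_a <- sample(scores, 100)

set.seed(606)                  # same seed, so the same 100 values are drawn
samp_b <- sample(scores, 100)

# TRUE when the two draws are exactly the same
print(identical(samp_a, samp_b))
```

With a fixed seed, the summary output and the numbers quoted in the text would stay in agreement each time the document is knit.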

Using our summary and standard deviation functions in R, we can calculate the difference between two sample means:

summary(pub_samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   653.0   840.0   907.0   916.6  1010.0  1180.0

sd(pub_samp, na.rm=TRUE)

## [1] 105.0532

summary(priv_samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   520.0   850.0   930.0   952.7  1022.0  1390.0

sd(priv_samp, na.rm=TRUE)

## [1] 151.7806

Because the sample is drawn at random without a fixed seed, it changes every time the code is run, so the summaries printed above will not match the figures below exactly; we will simply work with one particular draw. In that draw, the mean combined SAT score for the public colleges is 944.5, and the mean for the private colleges is 960.4. The point estimate is the difference between these two, 15.9 points (private colleges will be our $\bar{x}_1$). Now we will calculate the standard error for the difference between the two means:

Sector	mean SAT	s	n
Private	960.4	163.83	100
Public	944.5	123.79	100

$SE_{(\bar{x}_{priv}-\bar{x}_{pub})} = \sqrt{\frac{163.83^2}{100} + \frac{123.79^2}{100}} = 20.53393$
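The standard error can be reproduced in R from the table values. This is a minimal sketch; the variable names are chosen for illustration and do not come from the original analysis.

```r
# Sample standard deviations and sizes from the table above
s_priv <- 163.83; n_priv <- 100
s_pub  <- 123.79; n_pub  <- 100

# Standard error of the difference between two independent sample means
se_diff <- sqrt(s_priv^2 / n_priv + s_pub^2 / n_pub)
print(se_diff)
```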

Now that we have our Standard Error, we can calculate the confidence interval using the following:

$15.9 \pm z^* \times SE$ = $15.9 \pm 1.96 \times 20.53393$ = $15.9 \pm 40.2465$

The 95% confidence interval is (-24.35, 56.15)
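The interval can also be computed in R. The sketch below uses qnorm(0.975) for the critical value rather than the rounded 1.96, so the bounds may differ from the hand calculation in the third decimal place; the variable names are illustrative.

```r
# Point estimate and standard error from the calculation above
point_est <- 960.4 - 944.5
se_diff   <- 20.53393

# Critical value for a 95% interval (approximately 1.96)
z_star <- qnorm(0.975)

# Lower and upper bounds of the confidence interval
ci <- point_est + c(-1, 1) * z_star * se_diff
print(round(ci, 2))
```

Because the lower bound is negative and the upper bound is positive, the interval contains zero.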

We are 95% confident that the difference in mean combined SAT score (critical reading plus math) between private and public colleges is between -24.35 and 56.15 points.

Our formal hypothesis for testing the difference in the mean combined CR+Math SAT score is

$H_{0}$: The mean combined SAT score for private colleges is the same as the mean combined SAT score for public colleges. Any difference observed would simply be due to chance.

$H_{A}$: There is a difference in the mean combined SAT score between private and public colleges.

Test Statistic and P-Value: Z = (difference in means - 0) / SE = (15.9 - 0) / 20.53393 = 0.7743281

Since the sample sizes were sufficiently large, we’ll look up the Z value in the normal probability table, which gives us an area of 0.7794. Subtracting this from 1, we will get the area of the upper tail:

$\text{upper tail} = 1 - 0.7794 = 0.2206$

Multiplying by 2, we get our p-value: 0.2206 x 2 = 0.4412. Since the p-value is well above 0.05, we cannot reject our null hypothesis. The data do not provide convincing evidence that the mean SAT score for private colleges is different than the mean SAT score for public colleges.
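The table lookup can be replaced by pnorm() in R. This is a minimal sketch with illustrative variable names; because it uses the unrounded Z value instead of the table entry for 0.77, the p-value comes out at roughly 0.44 rather than exactly 0.4412.

```r
# Point estimate and standard error from the calculation above
point_est <- 15.9
se_diff   <- 20.53393

# Test statistic under the null hypothesis of no difference
z <- (point_est - 0) / se_diff

# Two-sided p-value: twice the upper-tail area beyond |z|
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
print(c(z = z, p_value = p_value))
```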

Brief Description of Methodology:

Originally, we set out to find any difference in the metrics used for college admission (standardized test scores) between public and private colleges. Using linear models on the different variables did not seem possible given our current knowledge of statistics, and the skew on both ends of each data set, which did not meet one of the conditions required for fitting a linear model.

However, from reviewing the summary statistics, we could already see there was not much difference in the means of all of the variables. Without using a fancier model, we have evaluated the difference in the two means to measure how significant the differences are. With 95% confidence, we estimated the difference in the means to lie between about -24 and just over 56 points; because this interval contains zero, the difference is not statistically significant. Differences of this size amount to getting just a few more questions correct on the test, and based on this particular sample we cannot conclude that private colleges are overall better academically (or rather, attract better performing high school students) than their public counterparts.

Part 5 - Conclusion:

Write a brief summary of your findings without repeating your statements from earlier. Also include a discussion of what you have learned about your research question and the data you collected. You may also want to include ideas for possible future research.

The goal of this analysis was to draw a conclusion about whether private colleges are “better”, for lack of a better term, than public colleges or universities. This is a fairly high-level (as in 40,000 foot view) comparison, as there are many variables to consider when comparing colleges. For example, the SAT scores used to draw our conclusion only represent the 25th percentile score at each institution, and while the SAT is a standardized test, some studies have shown bias in standardized test results stemming from areas with different demographics.

Whether examining all of the cases in the sample set, or just the random sample of 100 institutions, we have concluded that there is not a significant difference in the average SAT score of private and public colleges. A more in-depth look would be to apply the difference in means to the other variables. The SAT seemed to be the best variable to go with, as not every school provided ACT composite scores, and acceptance rates can be affected by many factors; mainly that this is simply the ratio of applicants admitted for one particular year, with the “bar” being set by administration. Essentially, this could be raised or lowered at any time in the admission cycle, so a low acceptance rate for some colleges is not all that it seems (though this can have an effect on the yield and resulting enrollment).

For further analysis, it would be useful to compare institutions with similar characteristics (size, location), or even to compare test scores to actual cost, rather than just the two-level sector variable. It does not make much sense to try to compare the University of Chicago to SUNY Plattsburgh. In conclusion, just as a prospective student needs to consider more than just the acceptance rate or average SAT score of the previous Freshman cohort, there are many other variables (some of which could be treated as two-level categorical variables) that should be considered in a comparison like this. I hope to revisit this analysis after advancing my understanding of statistics.