1 Preface

This is a writing sample derived from a working paper of the same name. Modifications from the original working paper include presenting well-developed sections while omitting sections still under development. Intended for use as a writing sample, this version of the working paper includes additional documentation of the working process to demonstrate the intent behind certain methods.

This working paper lists a co-author (Heather Epstein-Diaz) in its single-paper submission to the Association for Public Policy Analysis and Management annual conference. The co-author was added to aid in further work to publish and present this paper, and to provide guidance on future statistical methods. At this time, the co-author’s contribution is limited to advising on and guiding more advanced statistical work for the conference proposal. The co-author is acknowledged to maintain full transparency, but all work in this writing sample is the sole and unaided work of the applicant.

The following analysis demonstrates experience in the following steps: identifying sources of data, gathering and cleaning data from multiple sources, merging data using common identifiers, finding a strong initial model after many iterations of models, identifying heteroscedastic error in the initial model, employing weights to construct a second model, evaluating the second model using statistical and diagnostic methods, interpreting regression results and identifying anomalies, and drawing conclusions from the final model.

Software used for this working paper is RStudio, with output to LaTeX through the knitr library.

2 Abstract

The COVID-19 pandemic prompted U.S. universities to offer testing services for their students and faculty. Across the wide variety of U.S. institutions’ characteristics and approaches to responding to the pandemic, diverse outcomes of positive test reporting are expected. Data from the “Tracking Covid-19 at U.S. Colleges and Universities” Survey conducted by the New York Times includes over 1,900 universities and their positive case count as of February 2021. We matched this to institutional characteristic data acquired from the Integrated Postsecondary Education Data System, as well as county-level population density and state-level data. Our final dataset included a nationally representative sample of 1,597 colleges and universities’ total positive case counts and their institutional, local, and state characteristics.

Little empirical evidence exists on the institutional factors associated with reported COVID-19 cases. The goal of this study is to provide early evidence of factors associated with the total positive case count reported by colleges and universities. Using least squares regression, our findings suggest that higher minority enrollment, admission rates, county density, and rural and suburban urbanization are negatively associated with total reported positive case count (P<.01), all else held constant. The natural logarithm of total student enrollment, having a hospital on campus, and Republican state governor control are associated with higher total reported positive case count (P<.01), all else held constant. We employ robustness testing and weighted least squares regression to correct for heteroscedasticity in these models.

In addition to the results obtained by our primary model, we also explore other regression models that use average cost of attendance, HBCU status, full-time equivalent enrollment, and percent of students receiving financial aid. Future iterations of our model also seek to include additional controls on proportions of underrepresented minorities, gender, and Pell Grant recipients to uncover the ways in which these institutions may be disproportionately impacted by COVID-19. Specific policy conditions within colleges and universities may also impact COVID-19 rates. To account for these policy concerns, we will also include average cost of attendance and federal assistance rates.

As a primer of early evidence, this paper serves as a basis for future study of these factors and their associations with university positive case reporting. By researching these factors’ isolated associations, further evidence of causal factors may be derived from these early findings.

3 Introduction

In this paper, we engage in exploratory regression analysis to uncover relationships between institutional characteristics of U.S. colleges and their total count of positive COVID-19 cases. The COVID-19 pandemic prompted most U.S. universities to adopt plans for the health and safety of their students. This included moving classes online, shutting down facilities and classrooms, offering or mandating testing, and conducting contact tracing, among other methods. The diverse population of over 1,500 universities thus saw a diverse range of results from their practices, which are still in the beginning stages of being measured.

One early measurement of universities’ results is their total reported positive case count. Using this cross-sectional data, we choose relevant institutional characteristics and use least-squares regression to find relationships between those characteristics and total positive case count.

Because measurement of the consequences of COVID-19 is still in its early stages, little research presently exists on the topic of university characteristics relating to case counts. This is also a result of the complicated nature of analysis of testing and case counts, and the difficulty in drawing strong causal conclusions between case counts and external factors. For this reason, this paper does not attempt to draw causal conclusions, and instead explores statistically significant relationships between institutional characteristics (cost of attendance, student enrollment, campus urbanization type) and positive case count.

Understanding the relationships between institutional characteristics and positive case counts is important for at least two reasons. First, it may lend statistical strength to reasoning that attempts to explain why institutions had low or high case counts, whether they fit the model well or are outliers. For example, did a university predicted to have a high case count have a very low one? Or, can we find a relationship between a certain institutional characteristic and unexpectedly high case counts? Second, because investigation into institutional characteristics’ relationship with positive case counts remains scarce, an exploration of these factors may have merit.

We use the total positive case count as of February 2021 provided by the New York Times, and current institutional characteristics of over 1,500 U.S. colleges and universities to explore these relationships and find statistical significance in select factors.

4 Data

Using the National Center for Education Statistics’ Integrated Postsecondary Education Data System Data Search tool, we selected “All U.S. Institutions” and retrieved the following variables:

  • Sector of Institution
  • Degree of urbanization (Urban-centric locale)
  • State FIPS code
  • County FIPS code
  • HBCU Status
  • Whether the institution has an on-campus hospital
  • Average cost of attendance for an in-state student
  • Percent of students admitted to the university
  • Percent of students receiving any federal financial aid
  • Percent of students receiving Pell grants
  • Total number of students
  • Number of students of each race as identified by the Department of Education
  • Fall full-time equivalent (FTE) enrollment
  • Revenues derived from state, local, federal, and private sources
  • Students identifying as Male and Female

The governor of each state in 2020 was collected from Ballotpedia and assigned to each institution. The “governor” of Washington, D.C. was coded as Democratic because its mayor, Muriel Bowser, is a Democrat.

The New York Times publishes a dataset of over 1,900 U.S. universities’ positive COVID-19 case counts as of February 2021. We match this dataset to that of IPEDS using each institution’s IPEDS identification number, and merge it with county density and state governorship data to arrive at our final dataset.
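The merge on a common identifier can be sketched as follows. The paper's analysis was run in R; this is only an illustrative Python sketch using the standard library. The field names (`unitid`, `cases`, `sector`, `county_fips`), the sample values, and the choice of an inner join (dropping unmatched institutions) are all assumptions, not the actual columns or procedure used in the paper.

```python
# Hypothetical records: field names and values are illustrative assumptions,
# not the actual columns of the New York Times or IPEDS datasets.
nyt_cases = [
    {"unitid": 1001, "college": "College A", "cases": 312},
    {"unitid": 1002, "college": "College B", "cases": 45},
    {"unitid": 1003, "college": "College C", "cases": 7},  # no IPEDS match below
]
ipeds = {
    1001: {"sector": "Public", "county_fips": "12001"},
    1002: {"sector": "Private", "county_fips": "36061"},
}
county_density = {"12001": 310.5, "36061": 27000.0}  # made-up densities


def merge_records(cases, ipeds_by_id, density_by_fips):
    """Inner-join case rows to IPEDS on the institution ID, then attach county density."""
    merged = []
    for row in cases:
        inst = ipeds_by_id.get(row["unitid"])
        if inst is None:
            continue  # drop institutions without an IPEDS match
        density = density_by_fips.get(inst["county_fips"])
        if density is None:
            continue  # drop institutions without county-level data
        merged.append({**row, **inst, "county_density": density})
    return merged


merged = merge_records(nyt_cases, ipeds, county_density)
print(len(merged), "institutions retained after merging")
```

An inner join is one plausible reason a merged dataset ends up smaller than its largest source: any institution missing from one of the sources drops out.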

We filter the dataset to include only campuses reporting more than two positive COVID-19 cases. This threshold better fits the available data to the statistical methods used here, and we explore alternative methods for handling low case counts in the expanded methodology. The target of the full analysis is to include all campuses reporting more than zero positive cases, which requires advanced diagnostic examination of outliers and is not included in this sample.

We create a “Minority Percent” variable by adding together counts of all non-white races and dividing by the total number of students. All percentages are multiplied by 100 so that they are expressed in XX.XX% units rather than as 0.XX proportions.
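The filter and the derived variable described above might look like the following minimal sketch. The race category names, counts, and dictionary layout are hypothetical and are used only to make the arithmetic concrete.

```python
def minority_percent(race_counts, total_students):
    """Percent of students in any non-white category, on a 0-100 scale."""
    non_white = sum(n for race, n in race_counts.items() if race != "white")
    return 100.0 * non_white / total_students


# Hypothetical campuses; category names and counts are made up for illustration.
campuses = [
    {"name": "A", "cases": 120, "total": 1000,
     "races": {"white": 600, "black": 150, "hispanic": 150, "asian": 100}},
    {"name": "B", "cases": 2, "total": 500,
     "races": {"white": 400, "black": 50, "hispanic": 50}},
]

# Keep only campuses reporting more than two positive cases.
kept = [c for c in campuses if c["cases"] > 2]
for c in kept:
    c["minority_percent"] = minority_percent(c["races"], c["total"])
    print(c["name"], c["minority_percent"])
```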

4.1 Descriptives

Table 1 shows the descriptive statistics of all variables in the dataset. In the second row we see that, on average, each university reported 312 cases. (In the unfiltered dataset, this average is 281.) Case counts range from a minimum of 3 to a maximum of 8,894 (reported by the University of Florida; this outlier was confirmed by local news).

To visually observe correlations between variables, we use a correlation matrix. We use variables considered for inclusion in the initial model, and some not used in the model. Correlations with a P value of < 0.01 are marked with an X.

This simple correlation can inspire many directions for further investigation, but for the purposes of constructing our initial model we observe the correlations between cases (or per_100, a calculated variable finding cases per 100 students using fall enrollment) and other variables.
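As a rough illustration of the two quantities mentioned here, the sketch below computes a cases-per-100-students rate and a Pearson correlation coefficient by hand. The numbers are invented, and the significance marking (P < 0.01) used in the correlation matrix is not reproduced.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical case counts and fall enrollments for five campuses.
cases = [30, 120, 45, 800, 10]
enrollment = [1500, 9000, 2500, 40000, 900]

# per_100: positive cases per 100 enrolled students.
per_100 = [100.0 * c / e for c, e in zip(cases, enrollment)]
r = pearson_r(cases, enrollment)
print(per_100)
print(round(r, 3))
```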

5 Methodology

We use multivariate least squares regression to estimate the associations between the collected characteristics and the positive COVID-19 case count on each college campus. To correct for the error bias demonstrated by a linear regression explored in the original expanded methodology, we employ OLS regression using \(\log(cases)\) as the dependent variable. The natural logarithm of county_density is also generated and used.

The table below displays the variables used in the model. For categorical variables, we italicize the category held as the reference.

Variables Used in Model

| Variable | Description |
| --- | --- |
| log(Cases) (Dependent) | Total positive cases as of February 2021 |
| log(Total_Students) | Total student enrollment |
| Sector | Sector of university (Public or Private) |
| Urbanization | Urbanization type (City, Rural, Town, Suburb) |
| Hospital | Whether the university has a hospital |
| Governor | Republican or Democrat governor |
| log(County_Density) | County population density of college location |
| Priceperstudent | Average cost of attendance for an in-state undergraduate student |
| Minority_percent | Percent of students identifying as a minority (any other than white) |
| HBCU | Whether the university is an HBCU |
| Percent_admitted | Percent of undergraduate applicants admitted |

5.1 Model

Through testing of many variables detailed in the original expanded methodology, we arrive at an initial model:

\[\begin{array}{l} \log(cases) = \beta_1 + \beta_2 \log(TotalStudents) + \beta_3 Sector + \beta_4 Urbanization \\ \quad + \beta_5 Hospital + \beta_6 Governor + \beta_7 \log(CountyDensity) + \beta_8 Priceperstudent \\ \quad + \beta_9 Minoritypercent + \beta_{10} HBCU + \beta_{11} Percentadmitted + e \end{array}\]

We use dummy coding, with one category held as the reference, to assign values to the categorical variables Sector and Urbanization. Hospital, Governor, and HBCU are binary variables.
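The coding of a categorical variable against a reference category can be sketched as below. The helper `dummy_code` and the column prefix are hypothetical, and treating "City" as the reference level is an assumption drawn from the later comparisons against urban campuses.

```python
# Categories as listed in the variable table; "City" as reference is an assumption.
URBAN_LEVELS = ["City", "Rural", "Town", "Suburb"]


def dummy_code(value, levels, reference, prefix="Urbanization"):
    """Return 0/1 indicator columns for every level except the reference."""
    if reference not in levels or value not in levels:
        raise ValueError(f"unknown category: {value!r} or {reference!r}")
    return {f"{prefix}_{lvl}": int(value == lvl)
            for lvl in levels if lvl != reference}


rural_row = dummy_code("Rural", URBAN_LEVELS, reference="City")
city_row = dummy_code("City", URBAN_LEVELS, reference="City")
print(rural_row)  # one indicator set to 1
print(city_row)   # all indicators 0: the reference category
```

With this coding, each Urbanization coefficient is read as a difference from the reference category, which is why four categories contribute three regressors.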

We run this model, report its results and ANOVA, and test for heteroscedasticity using the Breusch-Pagan test and a non-constant variance score test.

This regression equation is found to be significant (Table 3, \(F(12,1115) = 132.669, \ P < 0.01\)), with an \(R^2\) of \(0.584\). The natural logarithm of total student enrollment, rural and suburban urbanization (compared to urban), hospital status, governor party affiliation, the log of county density, the price per student, and percent minority students \((P < 0.01)\), as well as percent admitted \((P < 0.05)\), were all found to be significant. University sector (public or private) was not found to be statistically significant in this model.

Notably, the presence of a hospital had a practically significant positive effect on positive COVID-19 case reporting \((P < 0.01)\). Republican governorship also was found to be significant in association with more positive case reporting \((P < 0.01)\). Higher cost of attendance was found to be positively associated \((P < 0.01)\), while a higher percent of the student body being of a minority identity was found to be negatively associated with positive case reporting \((P < 0.01)\).

Breusch-Pagan Heteroscedasticity Test of Unweighted Regression

| Statistic | Parameter | P value |
| --- | --- | --- |
| 18.38117 | 12 | 0.1046 |

Non-constant Variance Score Test of Unweighted Regression

| Chisquare | Degrees of Freedom | P value |
| --- | --- | --- |
| 8.204674 | 1 | 0.0041783 |

The ANOVA results suggest that Urbanization is significant as a whole, even though one type of urbanization in the model (“Town”) is not individually significant \((P > 0.01)\). Sector remains insignificant, but based on further testing it is retained in the model to maintain good estimates of heteroscedasticity.

We fail to reject the null hypothesis of the Breusch-Pagan test, indicating a lack of evidence of heteroscedasticity. However, the non-constant variance score test strongly rejects the null hypothesis, indicating evidence that heteroscedasticity exists in the error term. This leads us to use weights in a second regression model.
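Both tests refer their statistic to a chi-square distribution, so the reported P values can be checked by hand. The sketch below is not the R routine used to produce the tables; it is a standard-library approximation that relies on closed forms of the chi-square upper tail, which exist for 1 degree of freedom and for even degrees of freedom. The function name `chi2_sf` is hypothetical.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square statistic (df = 1 or even df only)."""
    if df == 1:
        # A chi-square with 1 df is a squared standard normal.
        return math.erfc(math.sqrt(stat / 2.0))
    if df > 0 and df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        half = stat / 2.0
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("this sketch only handles df = 1 or even df")


bp_p = chi2_sf(18.38117, 12)   # Breusch-Pagan statistic, 12 parameters
ncv_p = chi2_sf(8.204674, 1)   # non-constant variance score statistic, 1 df
print(round(bp_p, 4), round(ncv_p, 5))  # compare with the reported 0.1046 and 0.0041783
```

The two tests disagree here partly because they test against different alternatives: a 12-parameter alternative spreads the evidence across many directions, while the 1-degree-of-freedom score test concentrates on variance changing with a single quantity.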

5.2 Weighted Model

To correct for the heteroscedasticity presented by the non-constant variance score in the first regression, we use a weighted least squares regression model.

We calculate fitted values from a regression of absolute residuals versus the explanatory variables, and fit a WLS model using \(\text{weights} = 1/(\text{fitted values})^2\).
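A minimal sketch of this weighting procedure follows, reduced to a single explanatory variable so that it needs only the standard library. The paper's model uses all of its regressors and was fit in R; the function names, the synthetic data, and the guard against non-positive fitted values are assumptions made for illustration.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
    mean_y = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def wls_two_step(xs, ys):
    """OLS fit, |residual| regression, then WLS with weights 1 / fitted^2."""
    ones = [1.0] * len(xs)
    a0, b0 = weighted_line_fit(xs, ys, ones)              # step 1: ordinary least squares
    abs_res = [abs(y - (a0 + b0 * x)) for x, y in zip(xs, ys)]
    a1, b1 = weighted_line_fit(xs, abs_res, ones)         # step 2: |residuals| on x
    spread = [a1 + b1 * x for x in xs]                    # fitted values of step 2
    if min(spread) <= 0:
        raise ValueError("non-positive fitted spread; weights are undefined")
    weights = [1.0 / s ** 2 for s in spread]              # step 3: 1 / fitted^2
    return weighted_line_fit(xs, ys, weights), weights    # step 4: weighted fit


# Synthetic data whose noise grows with x (true intercept 2.0, true slope 0.5).
xs = [float(x) for x in range(1, 11)]
ys = [2.0 + 0.5 * x + (1 if i % 2 == 0 else -1) * (0.2 + 0.1 * x)
      for i, x in enumerate(xs)]

(intercept, slope), weights = wls_two_step(xs, ys)
print(round(intercept, 3), round(slope, 3))
```

Observations with a larger estimated spread receive smaller weights, so the noisier part of the data pulls less on the fitted line.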

Breusch-Pagan Heteroscedasticity Test of Weighted Regression

| Statistic | Parameter | P value |
| --- | --- | --- |
| 18.3811 | 12 | 0.1046 |

Non-constant Variance Score Test of Weighted Regression

| Chisquare | Degrees of Freedom | P value |
| --- | --- | --- |
| 0.4988 | 1 | 0.48001 |

This regression equation is found to be significant (Table 7, \(F(12,1115) = 155.044, \ P < 0.01\)), with an \(R^2\) of \(0.621\), higher than the unweighted model's \(0.584\).
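The F statistic and \(R^2\) of a regression with an intercept are linked by \(F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\), which gives a quick consistency check on the reported figures. The sketch below inverts that identity for both models. Under the assumption of 12 regressors and 1,115 residual degrees of freedom, the reported 0.584 and 0.621 appear to match the adjusted \(R^2\) rather than the unadjusted value; the paper does not state which variant is reported, so this is an inference only.

```python
def r2_from_f(f_stat, k, df_resid):
    """Unadjusted R^2 implied by an overall F statistic with (k, df_resid) df."""
    return f_stat * k / (f_stat * k + df_resid)


def adjusted_r2(r2, k, df_resid):
    """Adjusted R^2 for a model with an intercept, where n - 1 = k + df_resid."""
    return 1.0 - (1.0 - r2) * (k + df_resid) / df_resid


for label, f_stat in (("unweighted", 132.669), ("weighted", 155.044)):
    r2 = r2_from_f(f_stat, 12, 1115)
    print(label, round(r2, 3), round(adjusted_r2(r2, 12, 1115), 3))
```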

When using weights in this regression, the non-constant variance score test now fails to reject the null hypothesis that the variance of the error term is constant. We interpret this to mean that when using weights, testing fails to find evidence of heteroscedasticity in the error term. With more confidence in a stronger model, we reference the diagnostic plots to visually observe the strength of the model.

A blue reference line is overlaid on the residual density plot to compare the residuals against a hypothetical normal distribution.

The diagnostic plots provide some support for the strength of this model. With relatively normally distributed residuals, we observe a visual representation of the Breusch-Pagan test and Non-constant Variance Score test’s results.

We observe a trend in the residuals that is consistent with the Breusch-Pagan test's borderline result, which falls only narrowly short of rejection \((P = 0.1046)\). In the Residuals vs. Fitted and the Normal Q-Q plots, we observe the low and high ends of the residuals skewing in the same direction. This indicates that our error term is not perfectly normal, which is supported by the residual density plot’s slight asymmetry.

In earlier iterations not included in this sample, linear regressions (without taking the natural log of Cases and Total Students) and models using cases greater than 0 instead of greater than 2 produced greatly skewed Residuals vs. Fitted and Normal Q-Q plots, along with strong evidence of heteroscedasticity from both the non-constant variance score and Breusch-Pagan tests. Using the natural logarithm of case counts in our regression, we observed more nearly normal distributions of the residuals.

5.2.1 Interpretation

All variables that were significant in the unweighted model remain significant \((P < 0.01)\), with percent admitted strengthening from \(P < 0.05\) to \(P < 0.01\). University sector remains statistically insignificant in this model.

This model predicts that when total student enrollment increases by 1%, the total positive case count reported by the university is expected to increase by 1.12%. This is consistent with the assumption that a larger student body would provide more opportunities for virus transmission, or that universities with larger student bodies would execute more rigorous testing programs due to their size.

Compared to campuses categorized as urban, rural universities are expected to report about \((e^{-0.599}-1=-0.45)\) 45% fewer cases. Suburban universities are expected to report about \((e^{-0.280}-1=-0.24)\) 24% fewer cases. Because this model controls for county density, the urbanization coefficients more nearly isolate the classifications’ remaining characteristic, distance to principal cities. Rural universities’ lower estimated case count could be associated with fewer real cases given farther distances to the nearest cities, or with rural universities not pursuing aggressive testing programs. The same speculations stand for suburban universities.

A 1% increase in county density is expected to associate with an approximate 0.14% decrease in positive case reporting. This may be associated with a number of factors: highly dense counties may have aggressive testing programs established by the county, state, or other non-university entities, so that students may choose to get tested at those locations and not have their cases reported in this dataset. In contrast, students in less-dense counties may only be able to receive testing at their universities, which would increase positive case reporting by the universities.
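For the two log-log terms (enrollment and county density), the coefficient acts as an elasticity, and the "1% change in x gives roughly β% change in y" reading is an approximation that loosens for larger changes. The sketch below computes the exact change, taking the reported 1.12 and -0.14 as the coefficients; that reading of the reported effects, and the function name, are assumptions.

```python
def pct_change_loglog(elasticity, pct_increase_x):
    """Exact percent change in y when x rises by pct_increase_x percent (log-log model)."""
    return 100.0 * ((1.0 + pct_increase_x / 100.0) ** elasticity - 1.0)


enroll_1pct = pct_change_loglog(1.12, 1.0)     # enrollment up 1%
density_1pct = pct_change_loglog(-0.14, 1.0)   # county density up 1%
enroll_10pct = pct_change_loglog(1.12, 10.0)   # enrollment up 10%
print(round(enroll_1pct, 3), round(density_1pct, 3), round(enroll_10pct, 2))
```

For a 1% change the exact figures stay very close to the coefficients themselves, while for a 10% change in enrollment the exact effect is slightly above ten times the coefficient.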

Universities coded as having a hospital on-campus are estimated to report about \(((e^{0.614}-1)=0.85)\) 85% more cases than universities without on-campus hospitals. We can speculate that this may be due to universities with on-campus hospitals having more robust testing programs for students, or that these universities had more aggressive testing requirements.

Universities in states with Republican governorship are expected to report about \(((e^{0.231}-1)=0.260)\) 26% more cases than universities in states with Democratic governorship. We speculate this to be associated with possible higher rates of cases in Republican-governed states, or more aggressive testing programs established in universities in Republican-governed states in response to less state-pursued testing.
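The percentages quoted for binary and categorical regressors come from the transformation \(100 \times (e^{\beta} - 1)\), which applies when the dependent variable is logged. A small sketch, using the hospital and governor coefficients quoted above; the function name is hypothetical, and the negative coefficient is included only to illustrate that the transformation is not symmetric in sign.

```python
import math


def pct_change_from_log_coef(beta):
    """Percent difference in the outcome implied by a coefficient on a 0/1 regressor."""
    return 100.0 * (math.exp(beta) - 1.0)


hospital = pct_change_from_log_coef(0.614)    # on-campus hospital
governor = pct_change_from_log_coef(0.231)    # Republican governor
mirrored = pct_change_from_log_coef(-0.231)   # hypothetical coefficient of opposite sign
print(round(hospital, 1), round(governor, 1), round(mirrored, 1))
```

A negative coefficient always maps to a decrease smaller in magnitude than the increase implied by the same coefficient with a positive sign, so the sign must be carried into the exponent before converting to a percentage.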

Each one-unit increase in the ratio of average cost to total students, priceperstudent, is expected to associate with a 0.01% higher positive case count reported by the university. As this is a practically small effect, we note the statistical significance of the relationship between these two variables without speculating on its causes.

A one percentage point increase in the minority share of the student body is expected to associate with 2.6% fewer positive cases reported by the university. While this is a practically large effect, we refrain from speculating on the motivations of minority-identifying student behavior in this report.

A one percentage point increase in the percent of students admitted is expected to associate with 0.4% fewer positive cases reported by the university. Informally, this suggests a relationship between the “selectivity” of a university and its positive case reporting: a marginally more selective university is expected to have a slightly higher positive case count than a marginally less selective university, all else held constant. We may interpret this with respect to the speculation that more selective universities may be likely to have more robust testing protocols, capturing more positive cases on campus.

The interpretation of HBCU status prompts further investigation. Controlling for other factors, this model predicts universities with HBCU status report \(((e^{1.107}-1)=2.03)\) 203% more cases than non-HBCU status universities. This is the subject of further evaluation in later iterations of this working paper, and for the purposes of this writing sample, is identified as an anomaly that may have explanations within the data or limitations in existing methodology.

6 Conclusion

It is important not to draw ethical or causal conclusions from these results. Higher case reporting may be either a “good” or “bad” thing: given the complicated nature of testing, a higher positive case count may reflect more diligent testing (either voluntary or mandatory), while a lower count may reflect symptomatic people not receiving testing due to lack of resources. Further, positive cases identified by a city or federally run testing center would not appear in this dataset, which presents a challenging data collection limitation. Any statistical analysis must be rooted in the origin of the dataset: positive cases reported to the university, and only among universities included in the New York Times dataset with all requested IPEDS variables available.

The goal of this report is to offer preliminary evidence of factors associated with positive COVID-19 case reporting on university campuses. After using a weighted least squares multivariate regression model, we provide support for the claim that observing these factors with relatively unbiased errors is possible. Most notably, percent minority students, HBCU status, cost of attendance, and campus urbanization are all significant explanatory variables of case reporting.

Observing statistically significant effects of cost of attendance, admission rates and minority student makeup when controlling for other factors prompts further research in isolating socioeconomic and racial disparities in health outcomes pertaining to the COVID-19 pandemic.