Private School Trends in Large Institutions: Enrollment, Staffing, and School Levels Across the U.S. with a Focus on the DMV
1. Dataset: Private Schools in the USA
2. Image
Private Schools
3. Introduction:
For my second project, I selected a dataset focused on private schools in the United States because I have always been interested in the education field and the distribution of schools across the country. I was also curious to explore the demographics and geographic patterns through data analysis. The data was obtained from Public Opendatasoft, sourced from the PSS Private School Universe Survey for the 2021–2022 school year, conducted by the National Center for Education Statistics (NCES). The dataset includes 35 variables that describe each school’s identity, location, staffing, enrollment, and classification. Some variables are categorical but stored as numeric codes, such as TYPE, STATUS, and LEVEL_. The variable TYPE ranges from 1 to 7 and indicates the school’s program emphasis: (1) Regular Elementary or Secondary, (2) Montessori, (3) Special Program Emphasis, (4) Special Education, (5) Career/Technical/Vocational, (6) Alternative/Other, and (7) Early Childhood Program or Child Care Center. STATUS is coded as 1 for currently operating schools and 0 for inactive ones. Similarly, LEVEL_ indicates the school level, where 1 is elementary, 2 is middle school, and 3 is high school.
Other variables in the dataset include NCESID (a unique national school ID), NAME, ADDRESS, CITY, STATE, ZIP, and ZIP4 (school location), and TELEPHONE (contact information). ENROLLMENT (the number of students) and FT_TEACHERS (the number of full-time teachers) are numerical variables. POPULATION represents the total number of individuals associated with the school (the sum of the number of teachers and the number of students), and ST_GRADE and END_GRADE define the range of grades offered. The dataset also includes geographic coordinates like LATITUDE, LONGITUDE, GEO POINT, and GEO SHAPE, along with regional identifiers such as COUNTY, COUNTYFIPS (a federal county code), and COUNTRY. Additional classification fields include NAICS_CODE and NAICS_DESC, which describe the type of educational service. Finally, the dataset contains metadata fields such as SOURCE, SOURCE_DATE (when the data was collected), VAL_METHOD (how it was verified), and VAL_DATE (when it was last reviewed).
To clean the dataset for my project, I first checked for missing values and found only one NA in the END_GRADE column, which I removed since it was not relevant to my analysis. I then reduced the number of variables by removing columns that were not useful, such as Geo Point, Geo Shape, TELEPHONE, NCESID, ZIP, date-related metadata, and other codes that didn’t contribute to the analysis. This helped me focus on the most important variables like ENROLLMENT, FT_TEACHERS, POPULATION, and LEVEL_. After that, I created a new version of the dataset that only included larger schools with more than 500 students and more than 50 full-time teachers. I also added a new column to clearly label the school level as “Elementary School,” “Middle School,” or “High School” based on the original numeric codes. When preparing the dataset for the bubble plot, I joined it with a state name reference to replace the state abbreviations with full state names, making the tooltip and plot labels more readable. Separately, for my map visualization focused on the DMV area (DC, Maryland, and Virginia), I created another filtered dataset that included only schools with more than 500 students and over 80 full-time teachers.
I found this dataset meaningful not only because of my personal interest in education, but also due to its potential for broader analysis. For instance, the data could support studies on the relationship between access to private education and socioeconomic indicators such as family income, local GDP, or university enrollment rates. Mapping this data provides a powerful way to visualize patterns and disparities in educational opportunities across different areas.
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
data(usprivateschools)
Warning in data(usprivateschools): data set 'usprivateschools' not found
5. Cleaning
Check if there is any NA
# Sum to check if there are any NA values in the entire dataset
sum(is.na(usprivateschools))
[1] 1
# Find the position of the NA
usprivateschools |> is.na() |> which(arr.ind = TRUE)
row col
[1,] 18325 30
Based on this, there is only one NA value in the dataset, located in row 18,325 and column 30 (END_GRADE). However, I decided to remove this variable because it is not relevant to my analysis.
Removing variables that will not be used in the project.
I removed the variables that were not relevant or reliable for this project, such as Geo Point (since Latitude and Longitude are already provided in separate columns), as well as other non-essential information like Telephone numbers, ZIP codes, and NCES codes. I also excluded date-related variables because the dataset is based on the 2021–2022 Private School Universe Survey, and all the data reflects that specific time period.
# A tibble: 6 × 13
NAME ADDRESS CITY STATE TYPE POPULATION COUNTY COUNTRY LATITUDE LONGITUDE
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 ST PAU… 510 3R… CULL… AL 1 81 CULLM… USA 34.2 -86.8
2 BIG CO… 6354 H… OWEN… AL 3 36 MADIS… USA 34.7 -86.5
3 RIVERV… 123 RI… MORR… AR 1 8 CONWAY USA 35.1 -92.8
4 ST BRI… 7120 W… LAKE… CA 1 93 LOS A… USA 34.2 -118.
5 ST FIN… 2120 W… BURB… CA 1 248 LOS A… USA 34.2 -118.
6 MARY S… 2500 N… SAN … CA 1 432 LOS A… USA 33.8 -118.
# ℹ 3 more variables: LEVEL_ <dbl>, ENROLLMENT <dbl>, FT_TEACHERS <dbl>
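The column reduction described above could be sketched roughly as follows. This is a minimal illustration and not necessarily the code used for the project: the single row in `raw` is a made-up placeholder, `raw` and `keep_cols` are hypothetical names, and only the column names come from the write-up and the tibble output.

```r
library(dplyr)

# Made-up placeholder row; only the column names follow the write-up.
raw <- data.frame(
  NAME = "EXAMPLE SCHOOL", ADDRESS = "1 EXAMPLE ST", CITY = "EXAMPLE CITY",
  STATE = "AL", ZIP = "00000", TELEPHONE = "000-000-0000", NCESID = "X0000000",
  TYPE = 1, POPULATION = 110, COUNTY = "EXAMPLE", COUNTRY = "USA",
  LATITUDE = 34.2, LONGITUDE = -86.8, `Geo Point` = "34.2, -86.8",
  LEVEL_ = 1, ENROLLMENT = 100, FT_TEACHERS = 10, END_GRADE = NA,
  check.names = FALSE
)

# The 13 columns that appear in the cleaned tibble.
keep_cols <- c(
  "NAME", "ADDRESS", "CITY", "STATE", "TYPE", "POPULATION", "COUNTY",
  "COUNTRY", "LATITUDE", "LONGITUDE", "LEVEL_", "ENROLLMENT", "FT_TEACHERS"
)

# Keep only those columns; END_GRADE (the column holding the NA) is dropped.
privateschools <- raw |> select(all_of(keep_cols))

# Confirm that no missing values remain after the reduction.
sum(is.na(privateschools))
```

Listing the columns to keep, rather than the ones to drop, makes the final structure explicit and fails loudly if an expected column is missing.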
6. Correlation and Linear Regression:
For my linear regression model, I decided to use the complete dataset instead of applying the filter for large private schools. The reason behind this choice is that when I ran the model using only the filtered data (schools with more than 500 students and over 50 full-time teachers), the adjusted R-squared value dropped to around 30%. This indicated that the model was not statistically strong or reliable. However, when I used the full sample without the size filter, the adjusted R-squared increased significantly to around 70%, showing a much better model fit and stronger explanatory power. For this reason, I kept the full dataset for the regression analysis, although I continued using the filtered data for other visualizations focused specifically on larger schools.
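A comparison like the one described above can be sketched with simulated data. Everything in this block is an assumption made for illustration: the data are randomly generated, `sim` and `large` are hypothetical names, and the numbers it produces are not the project's results. It only shows how the adjusted R-squared of the full and the filtered model can be put side by side. Restricting the sample to large schools narrows the range of the predictor, which by itself tends to lower R-squared.

```r
set.seed(1)
n <- 5000

# Simulated schools: enrollment grows roughly linearly with teachers, plus noise.
sim <- data.frame(FT_TEACHERS = round(runif(n, 0, 200)))
sim$ENROLLMENT <- 19 + 8.9 * sim$FT_TEACHERS + rnorm(n, sd = 128)

# Same size filter as in the write-up: more than 500 students and 50 teachers.
large <- subset(sim, ENROLLMENT > 500 & FT_TEACHERS > 50)

# Fit the same one-predictor model on both samples and compare the fit.
adj_full  <- summary(lm(ENROLLMENT ~ FT_TEACHERS, data = sim))$adj.r.squared
adj_large <- summary(lm(ENROLLMENT ~ FT_TEACHERS, data = large))$adj.r.squared

c(full = adj_full, large = adj_large)
```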
In my first regression model, I used both the number of full-time teachers (FT_TEACHERS) and the total school population (POPULATION) to predict student enrollment (ENROLLMENT). However, the results showed a perfect mathematical relationship among the variables, with an Adjusted R-squared value of 1. Because POPULATION is the sum of teachers and students, enrollment can be calculated directly from the other two variables, which makes the model uninformative for real-world analysis. Because of this, I decided to run a second model using only one predictor: the number of full-time teachers. This final model showed a strong and meaningful relationship, with an Adjusted R-squared of 0.703, meaning about 70% of the variation in enrollment can be explained by the number of full-time teachers. The coefficient suggests that for each additional full-time teacher, enrollment increases by approximately 8.88 students. This model is more appropriate because it avoids that built-in dependency and offers a clear interpretation of how staffing levels relate to school size. Additionally, to visualize the relationships among the numerical variables, I used a scatterplot matrix (with the ggpairs() function), which displays scatterplots along with correlation coefficients. This also helped me identify the strongest correlation, which is between full-time teachers and student enrollment.
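The perfect fit of the two-predictor model can be reproduced on a small simulated example. This is only a sketch under stated assumptions: the data in `toy` are randomly generated, and the one thing taken from the write-up is the definition of POPULATION as students plus teachers.

```r
set.seed(42)

# Simulated schools; the numbers are made up for illustration.
toy <- data.frame(FT_TEACHERS = sample(5:120, 200, replace = TRUE))
toy$ENROLLMENT <- round(9 * toy$FT_TEACHERS + rnorm(200, sd = 60))

# POPULATION is defined as students plus teachers, as in the dataset.
toy$POPULATION <- toy$ENROLLMENT + toy$FT_TEACHERS

# With both predictors, ENROLLMENT = POPULATION - FT_TEACHERS exactly,
# so the fitted coefficients are 0, -1, and 1 up to rounding error.
fit_both <- lm(ENROLLMENT ~ FT_TEACHERS + POPULATION, data = toy)
coef(fit_both)

# R-squared computed by hand; it equals 1 up to floating-point error.
r2_both <- 1 - sum(residuals(fit_both)^2) /
  sum((toy$ENROLLMENT - mean(toy$ENROLLMENT))^2)
r2_both
```

The coefficients simply recover the identity ENROLLMENT = POPULATION - FT_TEACHERS, so the model says nothing about how staffing relates to school size.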
model <- lm(ENROLLMENT ~ FT_TEACHERS, data = privateschools)
summary(model)
Call:
lm(formula = ENROLLMENT ~ FT_TEACHERS, data = privateschools)
Residuals:
Min 1Q Median 3Q Max
-2845.9 -35.1 -16.0 21.5 4901.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.38375 1.05796 18.32 <2e-16 ***
FT_TEACHERS 8.87687 0.03868 229.52 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 127.9 on 22237 degrees of freedom
Multiple R-squared: 0.7032, Adjusted R-squared: 0.7032
F-statistic: 5.268e+04 on 1 and 22237 DF, p-value: < 2.2e-16
This means that for each additional full-time teacher, enrollment is expected to increase by approximately 8.88 students, on average. The intercept of 19.38 represents the expected enrollment when there are zero full-time teachers, although this situation may not occur in real life.
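As a worked example of this interpretation, the fitted line can be applied to a few staffing levels. The coefficients below are copied from the model summary; the helper name `predict_enrollment` and the example teacher counts are hypothetical.

```r
# Coefficients taken from the summary output above.
intercept <- 19.38375
slope <- 8.87687

# Expected enrollment for a given number of full-time teachers.
predict_enrollment <- function(ft_teachers) {
  intercept + slope * ft_teachers
}

predict_enrollment(c(10, 50, 100))
```

Under this model, schools with 10, 50, and 100 full-time teachers would be expected to enroll about 108, 463, and 907 students respectively.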
6.2 P-Values and Coefficients
Both the intercept and the full-time teachers (FT_TEACHERS) variable have very small p-values (<2e-16, well below the conventional 0.05 threshold), which indicates that they are statistically significant. This means that the number of full-time teachers is a statistically significant predictor of student enrollment in private schools.
Adjusted R-squared Value
The model has an Adjusted R-squared of 0.703, which means that approximately 70.3% of the variation in enrollment can be explained by the number of full-time teachers alone. For a single-variable model, this is a strong level of explanatory power and shows that staffing levels are closely related to school size.
Diagnostic Plots
The Residuals vs Fitted plot shows a relatively even distribution of points around the horizontal axis, with no curve or changing spread. This suggests that the assumptions of linearity and constant variance are reasonably met, supporting the use of a linear regression model for this analysis. I can also see some points in the plot that appear far from the zero line, suggesting possible outliers where the predicted enrollment differs significantly from the actual value. These cases may represent unique schools or anomalies in the data. While most residuals are small and randomly distributed, a few large residuals may indicate observations that the model struggles to fit accurately. The Normal Q-Q plot indicates that residuals are approximately normally distributed. Based on this, the model appears appropriate for interpreting general trends between these variables.
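The two diagnostic plots discussed above can be produced with base R's `plot()` method for `lm` objects. The sketch below uses simulated data so that it runs on its own; `sim` and `fit` are hypothetical names, and the cutoff of 3 standardized residuals for flagging outliers is an assumption, not a rule taken from the analysis.

```r
set.seed(7)

# Simulated data standing in for the real dataset.
sim <- data.frame(FT_TEACHERS = round(runif(300, 1, 150)))
sim$ENROLLMENT <- 19 + 8.9 * sim$FT_TEACHERS + rnorm(300, sd = 128)

fit <- lm(ENROLLMENT ~ FT_TEACHERS, data = sim)

# Draw the two plots side by side, then restore the plotting settings.
op <- par(mfrow = c(1, 2))
plot(fit, which = 1)  # Residuals vs Fitted
plot(fit, which = 2)  # Normal Q-Q
par(op)

# Row indices of observations whose standardized residual exceeds 3 in
# absolute value (assumed cutoff), as candidates for closer inspection.
outliers <- which(abs(rstandard(fit)) > 3)
outliers
```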
Conclusion
This analysis confirms that the number of full-time teachers (FT_TEACHERS) is a strong and meaningful predictor of student enrollment in private schools. The model suggests that schools tend to grow in student population as they increase their full-time teaching staff. While some variation in enrollment is influenced by other factors not included in this model, the number of teachers alone provides a reliable and interpretable estimate of school size.
7. Visualization: Bubble Plot of Large Private Schools in the U.S.
In this section, I use the highcharter package to create a bubble plot that displays the relationship between the number of full-time teachers and student enrollment in large private schools (with over 500 students and 50 full-time teachers). The size of each bubble represents the total school population, and the color indicates the school level (elementary, middle, or high school). The data was filtered and prepared using common dplyr functions such as select(), filter(), mutate(), and arrange().
Filtering and labeling large private schools by level
I created a small table that matches each state abbreviation (like “CA” or “NY”) with its full name (like “California” or “New York”). This makes the chart labels and tooltips easier to understand when viewing the data.
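The filtering, labeling, and state-name steps might look roughly like this. It is a sketch under assumptions: the three rows in `schools` are made-up placeholders, `state_lookup` and `State_Name` are hypothetical names, and the lookup is built from R's built-in `state.abb` and `state.name` vectors, which cover the 50 states but not D.C.

```r
library(dplyr)

# Made-up placeholder rows; only the column names follow the dataset.
schools <- data.frame(
  NAME = c("EXAMPLE ACADEMY A", "EXAMPLE ACADEMY B", "EXAMPLE ACADEMY C"),
  STATE = c("CA", "NY", "CA"),
  LEVEL_ = c(1, 3, 2),
  ENROLLMENT = c(650, 1200, 300),
  FT_TEACHERS = c(55, 110, 20),
  POPULATION = c(705, 1310, 320)
)

# Lookup table from abbreviation to full state name (50 states, no D.C.).
state_lookup <- data.frame(STATE = state.abb, State_Name = state.name)

bubble_data <- schools |>
  # Keep only large schools, using the thresholds from the write-up.
  filter(ENROLLMENT > 500, FT_TEACHERS > 50) |>
  # Turn the numeric level code into a readable label.
  mutate(School_Level = case_when(
    LEVEL_ == 1 ~ "Elementary School",
    LEVEL_ == 2 ~ "Middle School",
    LEVEL_ == 3 ~ "High School"
  )) |>
  # Attach the full state name for tooltips and labels.
  left_join(state_lookup, by = "STATE") |>
  arrange(desc(ENROLLMENT))

bubble_data
```

A left join keeps every school even when its abbreviation has no match in the lookup, so unmatched rows show up as NA in `State_Name` instead of silently disappearing.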
Creating a bubble chart with tooltips and school-level legend
In this step, I used the highcharter package to create an interactive bubble chart. I set the axes to show full-time teachers and student enrollment, and used bubble size to represent the total population. I added one series for each school level using a loop and assigned specific colors to each level. I also included a tooltip so I can see more details when I move the mouse over each bubble, and I added a legend and a data source credit.
hc <- highchart() |>
  hc_chart(type = "bubble") |>
  hc_title(text = "Full-Time Teachers vs. Enrollment in U.S. Private Schools (Top 100), 2021–2022") |>
  hc_subtitle(text = "Bubble size = School Population | Color = School Level") |>
  hc_xAxis(title = list(text = "Full-Time Teachers")) |>
  hc_yAxis(title = list(text = "Student Enrollment")) |>
  hc_legend(
    layout = "vertical",
    align = "right",
    verticalAlign = "middle",
    title = list(text = "School Level")
  )

# Add a series for each school level
for (level in names(school_colors)) {
  level_data <- bubble_data |>
    filter(School_Level == level) |>
    mutate(color = school_colors[[level]])

  hc <- hc |>
    hc_add_series(
      data = level_data |> select(x, y, z, name, color) |> list_parse(),
      name = level,
      color = school_colors[[level]],
      marker = list(fillColor = school_colors[[level]])
    )
}

## Add Tooltip
hc |>
  hc_tooltip(
    useHTML = TRUE,
    pointFormat = paste0(
      "<b>{point.name}</b><br>",
      "Students Enrollment: {point.y}<br>",
      "Full-Time Teachers: {point.x}<br>",
      "Population: {point.z}"
    )
  ) |>
  hc_credits(
    enabled = TRUE,
    text = "Source: PSS Private School Universe Survey, 2021–2022",
    href = "https://nces.ed.gov/surveys/pss/",
    style = list(fontSize = "10", color = "black", fontStyle = "italic")
  ) |>
  hc_add_theme(hc_theme_flat())
8. Mapping private schools in the DMV area
I created an interactive map to show where the largest private schools are located in the DMV area (D.C., Maryland, and Virginia). I filtered the data to include only schools with more than 500 students and over 80 full-time teachers, so the map focuses on the most significant institutions. I used the leaflet package to display each school as a colored circle based on its level—elementary, middle, or high school—and the size of the circle shows how many students are enrolled. When I click on a school, I can see more details like its name, city, number of students, and full-time teachers.
Setting school level order and color palette for the legend map
I organized the school levels in a specific order and assigned custom colors to each level for the map legend. By using factor(), I was able to control the display order of the school levels when visualizing the legend. I set the order as “High School,” “Middle School,” and “Elementary School” so these categories appeared in a logical top-down sequence.
I used the leaflet package to create an interactive map that shows the location of large private schools in the DMV area. I centered the map around Washington, D.C., and used colored circles to represent each school, where the size of the circle reflects the number of enrolled students. When I click on a school, a popup shows key information like the name, city, state, enrollment, and number of teachers. I also added a color legend to indicate the school level.
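The ordering, palette, and map steps described above could be sketched as follows. The three schools in `dmv` are made-up placeholders with approximate coordinates, the hex colors are arbitrary, and the square-root scaling of the circle radius is an assumption chosen so that large schools do not cover the map. None of these details are taken from the project's actual code.

```r
library(leaflet)

# Made-up placeholder schools; only the column names follow the dataset.
dmv <- data.frame(
  NAME = c("EXAMPLE SCHOOL A", "EXAMPLE SCHOOL B", "EXAMPLE SCHOOL C"),
  CITY = c("WASHINGTON", "BALTIMORE", "ARLINGTON"),
  STATE = c("DC", "MD", "VA"),
  LATITUDE = c(38.93, 39.33, 38.88),
  LONGITUDE = c(-77.07, -76.63, -77.10),
  ENROLLMENT = c(1100, 800, 600),
  FT_TEACHERS = c(120, 95, 85),
  School_Level = c("High School", "Middle School", "Elementary School")
)

# Fix the display order of the levels so the legend reads top-down.
level_order <- c("High School", "Middle School", "Elementary School")
dmv$School_Level <- factor(dmv$School_Level, levels = level_order)

# One color per level (arbitrary colors), following the factor order.
pal <- colorFactor(
  palette = c("#1b9e77", "#d95f02", "#7570b3"),
  domain = dmv$School_Level
)

map <- leaflet(dmv) |>
  addTiles() |>
  # Center the view on Washington, D.C.
  setView(lng = -77.0369, lat = 38.9072, zoom = 8) |>
  addCircleMarkers(
    lng = ~LONGITUDE, lat = ~LATITUDE,
    radius = ~sqrt(ENROLLMENT) / 3,  # assumed scaling by enrollment
    color = ~pal(School_Level),
    stroke = FALSE, fillOpacity = 0.7,
    popup = ~paste0(
      "<b>", NAME, "</b><br>", CITY, ", ", STATE,
      "<br>Enrollment: ", ENROLLMENT,
      "<br>Full-time teachers: ", FT_TEACHERS
    )
  ) |>
  addLegend(
    position = "bottomright", pal = pal,
    values = ~School_Level, title = "School Level"
  )

map
```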
According to the National Center for Education Statistics (NCES), private schools serve about 10% of the U.S. K–12 student population and tend to be concentrated in urban and suburban areas, often with distinct demographic and resource patterns compared to public schools. These institutions frequently have smaller student-teacher ratios and different curricular emphases, which can affect student outcomes (NCES, 2023). In many regions, especially the DMV, private schools also reflect neighborhood wealth and segregation patterns. My goal was to better understand how private education is distributed geographically and how school size and staffing levels vary by type.
The visualizations I created include an interactive bubble chart and an interactive map. The bubble chart visualizes the relationship between full-time teachers and enrollment for the 100 largest private schools in the U.S., with bubble size representing total school population and color representing school level. One interesting pattern that emerged is that high schools tend to have both higher enrollment and more teachers, while elementary schools tend to be smaller in both respects. One challenge I faced was customizing the color legend in highcharter to match each school level correctly; I solved this by using multiple series and matching the palette with a custom factor order.
The map provides a spatial distribution of schools in the DMV area. I used color-coded circle markers to represent each school level, scaled by enrollment size, and included interactive popups for detailed school-level information. By mapping large private schools in the DMV area, I was able to clearly see how school level and size relate to geographic location. Most of the biggest schools are concentrated around Washington, D.C., and Baltimore, which makes sense given the higher population and income levels in those areas. I noticed that high schools are especially common in city centers, while elementary and middle schools are more spread out. This map helped me understand how private education is distributed across the region and gave me a better sense of how location and school size are connected.
To conclude, in future analyses I am interested in exploring the TYPE variable in the dataset, which categorizes private schools by educational model, such as Montessori, Special Education, Vocational, or Regular Elementary/Secondary schools. Analyzing how these types are distributed across different regions and how they correlate with socioeconomic indicators like median household income could offer important insights into educational access and equity.