Choosing a public school for a child is a stressful process for most parents. It usually starts by thinking about what you want the school to teach your kid, then a visit to the school and perhaps relocate to that school district before submitting the application. Some parents even go a step further and look for information about test performance.
Fortunately, lots of state government have made such information readily available on their website for the public. Doing so, promotes transparency and makes the process of choosing a new school for kids a little less stressful. After all, students test results are most meaningful for us parents when they are discussed in relation to other students scores.
For this project, we will be looking at a dataset from the California State school system that was published in 1999. The dataset includes summary data on test performance, some school characteristics, and some student demographic background for the school districts. I got the dataset on Github, but the original dataset was downloaded from the state of California public website.
Our dataset is clean to begin with, except for one useless column which we will remove. The dataset has 420 observations and 14 variables. Here’s a list of the variables and what they represent:
district: categorical. District code.
school: categorical. School name.
county: categorical. Factor indicating county.
grades: categorical. Factor indicating grade span of district.
students: numeric. Total enrollment.
teachers: numeric. Number of teachers.
calworks: numeric. Percent qualifying for CalWorks (income assistance).
lunch: numeric. percent qualifying for reduced-price lunch.
computer: numeric. Number of computers.
expenditure: numeric. Expenditure per student.
income: numeric. District average income (in USD 1,000).
english: numeric. Percent of English learners.
read: numeric. Average reading score.
math: numeric. Average math score.
Load Libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
Read the dataset into Rstudio and show first few rows
#Set working directory
setwd("~/Data110")
#Read data into Rstudio
caschools <- read.csv("caschools.csv")
# Show the first few lines
head(caschools)
## X district school county grades students teachers
## 1 1 75119 Sunol Glen Unified Alameda KK-08 195 10.90
## 2 2 61499 Manzanita Elementary Butte KK-08 240 11.15
## 3 3 61549 Thermalito Union Elementary Butte KK-08 1550 82.90
## 4 4 61457 Golden Feather Union Elementary Butte KK-08 243 14.00
## 5 5 61523 Palermo Union Elementary Butte KK-08 1335 71.50
## 6 6 62042 Burrel Union Elementary Fresno KK-08 137 6.40
## calworks lunch computer expenditure income english read math
## 1 0.5102 2.0408 67 6384.911 22.690001 0.000000 691.6 690.0
## 2 15.4167 47.9167 101 5099.381 9.824000 4.583333 660.5 661.9
## 3 55.0323 76.3226 169 5501.955 8.978000 30.000002 636.3 650.9
## 4 36.4754 77.0492 85 7101.831 8.978000 0.000000 651.9 643.5
## 5 33.1086 78.4270 171 5235.988 9.080333 13.857677 641.8 639.9
## 6 12.3188 86.9565 25 5580.147 10.415000 12.408759 605.7 605.4
Let’s create a new data frame for our statistical analysis
This dataset has an extra column that we won’t be working with.
We will use the select function to get rid of it.
caschool1<- caschools%>%
select(district, school, county, grades, students, teachers, calworks, lunch, computer, expenditure, income, english, read, math)
# caschool1
Create a simple plot - Let’s map students against teachers
firstplot <- caschool1 %>%
ggplot(aes(students, teachers, color=grades))+
labs(title="State of California - Ratio of Students per Teacher (1998 -1999)")+
xlab("Number of Students")+
ylab("Number of Teachers")+
geom_point()+
geom_smooth(method = 'lm', formula = y~x)+
theme_bw()+
facet_wrap(~grades)+
scale_color_discrete(name=" ")
firstplot <- ggplotly(firstplot)
firstplot
Let’s calculate the correlation coefficient and find the linear regression model
cor(caschool1$students, caschool1$teachers)
## [1] 0.9971161
analysis <- lm (teachers ~ students, data = caschool1)
summary(analysis)
##
## Call:
## lm(formula = teachers ~ students, data = caschool1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.837 -3.171 -1.015 2.912 124.542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1931940 0.8396254 3.803 0.000164 ***
## students 0.0478829 0.0001783 268.620 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.28 on 418 degrees of freedom
## Multiple R-squared: 0.9942, Adjusted R-squared: 0.9942
## F-statistic: 7.216e+04 on 1 and 418 DF, p-value: < 2.2e-16
The correlation r=0.9971161 indicates a strong, positive association between the variables students and teachers.
Base on our calculations, we can conclude that our model has the following equation: teachers = 0.048(students)+3.19
We also have a very small p-value of 2e-16 and an Adjusted R-Square value suggesting that about 99.4% of the variation in the observations maybe explained by our model.
New Dataframe
hcdata <- caschool1 %>%
filter(math>=690)%>%
filter(read>=690)%>%
arrange(-read)%>%
head(10)
New Plot - Income VS Read
highchart()%>%
hc_add_series(name="Reading Score", data = hcdata, type="scatter",hcaes(x=income, y=read, color = hcdata$district))%>%
hc_title(text="Income Impact",style = list(color = "#2b908f", fontWeight = "bold"))%>%
hc_subtitle(text="California KK-6 and KK-8 Test Score Data (1998-1999)",style = list(color = "#2b908f", fontWeight = "bold"))%>%
hc_xAxis(title=list(text="Income", style = list(color = "#980000",fontWeight = "bold")))%>%
hc_yAxis(title=list(text="Reading Test Score", style = list(color = "#980000",fontWeight = "bold")))%>%
hc_legend(align = "right", verticalAlign="top")%>%
hc_tooltip(shared=TRUE, borderColor = "red",pointFormat= "{point.school} <br> Income: {point.x:.2f} <br> <br> Score: {point.y:.2f}<br>")%>%
hc_add_theme(hc_theme_alone())%>%
hc_caption(text="Source:https://vincentarelbundock.github.io/Rdatasets/datasets.html")
cor(hcdata$income, hcdata$read)
## [1] -0.144603
cor(hcdata$income, hcdata$math)
## [1] -0.2436599
The correlation coefficient between those variables suggest a weak, negative association.
A parent’s income does not seem to lower or improve a child test performance. Note that some studies indicate the opposite, however.
For example, a report by Thomas, concluded that “children from lower income households score significantly lower on measures of vocabulary and communication skills, knowledge of numbers, copying and symbol use, ability to concentrate and cooperative play with other children than children from higher income households.”
A complete copy of that report can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528798/#:~:text=A%20report%20by%20Thomas%20(10,children%20from%20higher%20income%20households.
This dataset is about average, nonetheless. Perhaps, if we were looking at individual cases, i.e., a dataset of students from randowm class rooms, the results would have been different.