Choosing a public school for a child is a stressful process for most parents. It usually starts by thinking about what you want the school to teach your kid, then a visit to the school and perhaps relocate to that school district before submitting the application. Some parents even go a step further and look for information about test performance.

Fortunately, lots of state government have made such information readily available on their website for the public. Doing so, promotes transparency and makes the process of choosing a new school for kids a little less stressful. After all, students test results are most meaningful for us parents when they are discussed in relation to other students scores.

For this project, we will be looking at a dataset from the California State school system that was published in 1999. The dataset includes summary data on test performance, some school characteristics, and some student demographic background for the school districts. I got the dataset on Github, but the original dataset was downloaded from the state of California public website.

Data source: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Our dataset is clean to begin with, except for one useless column which we will remove. The dataset has 420 observations and 14 variables. Here’s a list of the variables and what they represent:

district: categorical. District code.

school: categorical. School name.

county: categorical. Factor indicating county.

grades: categorical. Factor indicating grade span of district.

students: numeric. Total enrollment.

teachers: numeric. Number of teachers.

calworks: numeric. Percent qualifying for CalWorks (income assistance).

lunch: numeric. percent qualifying for reduced-price lunch.

computer: numeric. Number of computers.

expenditure: numeric. Expenditure per student.

income: numeric. District average income (in USD 1,000).

english: numeric. Percent of English learners.

read: numeric. Average reading score.

math: numeric. Average math score.

Load Libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(highcharter)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

Read the dataset into Rstudio and show first few rows

#Set working directory
setwd("~/Data110")

#Read data into Rstudio
caschools <- read.csv("caschools.csv")

# Show the first few lines
head(caschools)

##   X district                          school  county grades students teachers
## 1 1    75119              Sunol Glen Unified Alameda  KK-08      195    10.90
## 2 2    61499            Manzanita Elementary   Butte  KK-08      240    11.15
## 3 3    61549     Thermalito Union Elementary   Butte  KK-08     1550    82.90
## 4 4    61457 Golden Feather Union Elementary   Butte  KK-08      243    14.00
## 5 5    61523        Palermo Union Elementary   Butte  KK-08     1335    71.50
## 6 6    62042         Burrel Union Elementary  Fresno  KK-08      137     6.40
##   calworks   lunch computer expenditure    income   english  read  math
## 1   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6 690.0
## 2  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5 661.9
## 3  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3 650.9
## 4  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9 643.5
## 5  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8 639.9
## 6  12.3188 86.9565       25    5580.147 10.415000 12.408759 605.7 605.4

Let’s create a new data frame for our statistical analysis

This dataset has an extra column that we won’t be working with.

We will use the select function to get rid of it.

caschool1<- caschools%>%
  select(district, school, county, grades, students, teachers, calworks, lunch, computer, expenditure, income, english, read, math)
  
# caschool1

Create a simple plot - Let’s map students against teachers

firstplot <- caschool1 %>%
  ggplot(aes(students, teachers, color=grades))+
  labs(title="State of California - Ratio of Students per Teacher (1998 -1999)")+
  xlab("Number of Students")+
  ylab("Number of Teachers")+
  geom_point()+
  geom_smooth(method = 'lm', formula = y~x)+
  theme_bw()+
  facet_wrap(~grades)+
  scale_color_discrete(name=" ")
firstplot <- ggplotly(firstplot)
firstplot

Let’s calculate the correlation coefficient and find the linear regression model

cor(caschool1$students, caschool1$teachers)

## [1] 0.9971161

analysis <- lm (teachers ~ students, data = caschool1)
summary(analysis)

## 
## Call:
## lm(formula = teachers ~ students, data = caschool1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.837  -3.171  -1.015   2.912 124.542 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.1931940  0.8396254   3.803 0.000164 ***
## students    0.0478829  0.0001783 268.620  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.28 on 418 degrees of freedom
## Multiple R-squared:  0.9942, Adjusted R-squared:  0.9942 
## F-statistic: 7.216e+04 on 1 and 418 DF,  p-value: < 2.2e-16

The correlation r=0.9971161 indicates a strong, positive association between the variables students and teachers.

Base on our calculations, we can conclude that our model has the following equation: teachers = 0.048(students)+3.19

We also have a very small p-value of 2e-16 and an Adjusted R-Square value suggesting that about 99.4% of the variation in the observations maybe explained by our model.

Does a district average income have any impact on students reading performance?

To answer this question we will create a new dataframe.

This time we will focus on the top 10 performers for math and reading.

New Dataframe

hcdata <- caschool1 %>%
  filter(math>=690)%>%
  filter(read>=690)%>%
  arrange(-read)%>%
head(10)

New Plot - Income VS Read

highchart()%>%
  
  
 
  hc_add_series(name="Reading Score", data = hcdata, type="scatter",hcaes(x=income, y=read, color = hcdata$district))%>%
   
  hc_title(text="Income Impact",style = list(color = "#2b908f", fontWeight = "bold"))%>%
  hc_subtitle(text="California KK-6 and KK-8 Test Score Data (1998-1999)",style = list(color = "#2b908f", fontWeight = "bold"))%>%
  hc_xAxis(title=list(text="Income", style = list(color = "#980000",fontWeight = "bold")))%>%
  hc_yAxis(title=list(text="Reading Test Score", style = list(color = "#980000",fontWeight = "bold")))%>%
  
  hc_legend(align = "right", verticalAlign="top")%>%
  
  hc_tooltip(shared=TRUE, borderColor = "red",pointFormat= "{point.school} <br> Income: {point.x:.2f} <br> <br> Score: {point.y:.2f}<br>")%>%
  hc_add_theme(hc_theme_alone())%>%
  hc_caption(text="Source:https://vincentarelbundock.github.io/Rdatasets/datasets.html")

cor(hcdata$income, hcdata$read)

## [1] -0.144603

cor(hcdata$income, hcdata$math)

## [1] -0.2436599

The correlation coefficient between those variables suggest a weak, negative association.

A parent’s income does not seem to lower or improve a child test performance. Note that some studies indicate the opposite, however.

For example, a report by Thomas, concluded that “children from lower income households score significantly lower on measures of vocabulary and communication skills, knowledge of numbers, copying and symbol use, ability to concentrate and cooperative play with other children than children from higher income households.”

A complete copy of that report can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528798/#:~:text=A%20report%20by%20Thomas%20(10,children%20from%20higher%20income%20households.

This dataset is about average, nonetheless. Perhaps, if we were looking at individual cases, i.e., a dataset of students from randowm class rooms, the results would have been different.

For our final plot, we will use the last datatframe we created to show the average test performance, the school characteristics and student demographic backgrounds for the top 10 school districts in California, i.e., the ones with the highest average score.

final<- highchart()%>%
  hc_chart(type="bar")%>%
  hc_title(text="CA Test performance, School characteristics and Student demographic backgrounds")%>%
  hc_subtitle(text="Grades KK-6 And KK-8 - Top 10 (1988-1999)",style = list(color = "#080482", fontWeight = "normal"))%>%
  hc_add_series(name="Percentage of English Learners", data = hcdata$english, color = "green")%>%
  hc_add_series(name="Income", data = hcdata$income, color = "red")%>%
  hc_add_series(name="Math Score", data = hcdata$math, color = "blue")%>%
  hc_add_series(name="Read Score", data = hcdata$read, color = "orange")%>%
  hc_xAxis(categories = hcdata$school, title =list(text="School Name", style = list(fontSize= "14px",fontWeight="bold", color="blue")))  %>%
    
   hc_xAxis(title=list(text="School", style = list(color = "#9b1003",fontWeight = "bold")))%>%
  hc_yAxis(title=list(text="Values", style = list(color = "#9b1003",fontWeight = "bold")))%>%
  hc_caption(text="Source:https://vincentarelbundock.github.io/Rdatasets/datasets.html")%>%
  #hc_legend(align = "right", verticalAlign="top")
  hc_add_theme(hc_theme_bloom())
  
  #hc_yAxis_multiples(create_yaxis(naxis = 3, lineWidth = 2, title = list(text = NULL))) %>%
  #hc_tooltip(shared=TRUE, borderColor = "black",pointFormat= "{point.school} <br> Income: {point.x:.2f} <br> <br> Score: {point.y:.2f}<br>")
final

Final Project

Daniel Lachaud

12/5/2021

Data source: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Our dataset is clean to begin with, except for one useless column which we will remove. The dataset has 420 observations and 14 variables. Here’s a list of the variables and what they represent:

district: categorical. District code.

school: categorical. School name.

county: categorical. Factor indicating county.

grades: categorical. Factor indicating grade span of district.

students: numeric. Total enrollment.

teachers: numeric. Number of teachers.

calworks: numeric. Percent qualifying for CalWorks (income assistance).

lunch: numeric. percent qualifying for reduced-price lunch.

computer: numeric. Number of computers.

expenditure: numeric. Expenditure per student.

income: numeric. District average income (in USD 1,000).

english: numeric. Percent of English learners.

read: numeric. Average reading score.

math: numeric. Average math score.

Load Libraries

Read the dataset into Rstudio and show first few rows

Let’s create a new data frame for our statistical analysis

This dataset has an extra column that we won’t be working with.

We will use the select function to get rid of it.

Create a simple plot - Let’s map students against teachers

Let’s calculate the correlation coefficient and find the linear regression model

The correlation r=0.9971161 indicates a strong, positive association between the variables students and teachers.

Base on our calculations, we can conclude that our model has the following equation: teachers = 0.048(students)+3.19

We also have a very small p-value of 2e-16 and an Adjusted R-Square value suggesting that about 99.4% of the variation in the observations maybe explained by our model.

Does a district average income have any impact on students reading performance?

To answer this question we will create a new dataframe.

This time we will focus on the top 10 performers for math and reading.

New Dataframe

New Plot - Income VS Read

The correlation coefficient between those variables suggest a weak, negative association.

A parent’s income does not seem to lower or improve a child test performance. Note that some studies indicate the opposite, however.

A complete copy of that report can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528798/#:~:text=A%20report%20by%20Thomas%20(10,children%20from%20higher%20income%20households.

This dataset is about average, nonetheless. Perhaps, if we were looking at individual cases, i.e., a dataset of students from randowm class rooms, the results would have been different.

For our final plot, we will use the last datatframe we created to show the average test performance, the school characteristics and student demographic backgrounds for the top 10 school districts in California, i.e., the ones with the highest average score.