Overview

The societal impact of income inequality is well known. Can school clubs, sports and other after school programs shape a childs future who otherwise would not be exposed to the positive influences additional school activities can bring?

Objective

In this data anaylsis we try to answer the question What is the impact of children who are exposed to extra school programs vs children are not exposed to additional programs. This markup will be used to analyze the data and a shiny app will be built to display additional details at this link:

https://apag101.shinyapps.io/final_project/

Approach

We will use data from City of New York government site from 2011-2012. One dataset will list schools in NYC with their sports programs and the second dataset will show school grades

https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2013-2014/u553-m549/data

Education is one of biggest keys for a child’s success in life. In this project we will try to measure additional school programs impact on a childs life by camparing it with schools graduation and college enrollment rates. The theory is if schools that have more programs have a higher rate of grads and college enrollment then schools that do not, schools with more programs give a child better chance to succeed as an adult.

Data Exploration

Import Data

During the import we replace missing values with N/A by setting na.strings=c("“,”N/A").

The below visualizaitons show the columns that are missing data.

Data Prepartion

In data prepartion we are renaming DBN columns that were changed during the import. We get a subset of data of interest, rename some columns to shorter names then merge the sports dataset with the grade dataset. Below is a glimpse of the 2 datasets

## Observations: 422
## Variables: 5
## $ DBN            <fct> 01M292, 01M448, 01M450, 01M509, 01M539, 01M696, 02M0...
## $ grad2011       <fct> 56.30%, 70.70%, 71.60%, 56.40%, 95.30%, 97.60%, 69.6...
## $ collenroll2011 <fct> 51.90%, 36.30%, 69.20%, 47.70%, 87.00%, 95.70%, 47.1...
## $ grad2012       <fct> 50.70%, 74.40%, 72.70%, 59.50%, 97.50%, 96.60%, 54.8...
## $ collenroll2012 <fct> 40.00%, 67.30%, 55.70%, 47.90%, 82.80%, 96.80%, 30.4...
## Observations: 422
## Variables: 22
## $ DBN             <fct> 01M292, 01M448, 01M450, 01M509, 01M539, 01M696, 02M...
## $ BN              <fct> M292, M448, M450, M509, M539, M696, M047, M288, M29...
## $ Printed_Name    <fct> "Henry Street School for International Studies", "U...
## $ TotStudnts      <fct> "388", "331", "636", "367", "1,672", "601", "189", ...
## $ School_Type     <fct> NA, NA, NA, NA, NA, NA, NA, CTE School, NA, NA, NA,...
## $ LangClass       <fct> "Chinese (Mandarin), Spanish", "Chinese, Spanish", ...
## $ APCourses       <fct> "Psychology", "Calculus AB, Chinese Language and Cu...
## $ LeaderSupport   <fct> "Math through Card Play, Poetry Club, Drama Club, B...
## $ Sports          <fct> "Boxing, CHAMPS, Double Dutch, Table tennis, Track ...
## $ OnlineAP        <fct> "Chinese Language and Culture, Spanish Literature",...
## $ OnlineLang      <fct> "Chinese (Mandarin), Spanish", NA, "American Sign L...
## $ School_Type2    <fct> NA, NA, NA, NA, NA, NA, NA, CTE School, NA, NA, NA,...
## $ neighborhood    <fct> Lower East Side, Lower East Side, East Village, Low...
## $ postalCode      <fct> "10,002", "10,002", "10,009", "10,002", "10,002", "...
## $ precinct        <int> 7, 7, 9, 7, 7, 7, 13, 18, 7, 18, 5, 18, 18, 7, 7, 1...
## $ school_district <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ lat             <dbl> 40.71348, 40.71223, 40.72985, 40.72055, 40.71949, 4...
## $ lon             <dbl> -73.98532, -73.98382, -73.98315, -73.98568, -73.979...
## $ Location        <fct> "220 Henry Street\nNew York, NY 10002\n(40.7134809,...
## $ Boro            <fct> MANHATTAN, MANHATTAN, MANHATTAN, MANHATTAN, MANHATT...
## $ CouncilDistrict <int> 1, 1, 2, 1, 2, 2, 2, 3, 1, 3, 1, 3, 3, 1, 1, 1, 2, ...
## $ CensusTract     <fct> "201", "202", "34", "3,001", "2,201", "1,002", "64"...

To prepare our model we remove % columns from grad and college columns and merge the 2 datasets. We then create sprtscode,apcode,langcode,onlinelangcode and onlineapcode columns to show a 1 for schools with the programs and 0 with schools wihtout the programs. Lastly, we write the new dataframe to a csv for later use in our shiny app dashboard. We end with a glimpse of the newly combined dataset.

## Observations: 422
## Variables: 32
## $ DBN             <fct> 01M292, 01M448, 01M450, 01M509, 01M539, 01M696, 02M...
## $ BN              <fct> M292, M448, M450, M509, M539, M696, M047, M135, M13...
## $ Printed_Name    <fct> "Henry Street School for International Studies", "U...
## $ TotStudnts      <fct> "388", "331", "636", "367", "1,672", "601", "189", ...
## $ School_Type     <fct> NA, NA, NA, NA, NA, NA, NA, "CTE, New School", "CTE...
## $ LangClass       <fct> "Chinese (Mandarin), Spanish", "Chinese, Spanish", ...
## $ APCourses       <fct> "Psychology", "Calculus AB, Chinese Language and Cu...
## $ LeaderSupport   <fct> "Math through Card Play, Poetry Club, Drama Club, B...
## $ Sports          <fct> "Boxing, CHAMPS, Double Dutch, Table tennis, Track ...
## $ OnlineAP        <fct> "Chinese Language and Culture, Spanish Literature",...
## $ OnlineLang      <fct> "Chinese (Mandarin), Spanish", NA, "American Sign L...
## $ School_Type2    <fct> NA, NA, NA, NA, NA, NA, NA, "CTE, New School", "CTE...
## $ neighborhood    <fct> Lower East Side, Lower East Side, East Village, Low...
## $ postalCode      <fct> "10,002", "10,002", "10,009", "10,002", "10,002", "...
## $ precinct        <int> 7, 7, 9, 7, 7, 7, 13, 18, 5, 18, 7, 18, 5, 18, 18, ...
## $ school_district <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ lat             <dbl> 40.71348, 40.71223, 40.72985, 40.72055, 40.71949, 4...
## $ lon             <dbl> -73.98532, -73.98382, -73.98315, -73.98568, -73.979...
## $ Location        <fct> "220 Henry Street\nNew York, NY 10002\n(40.7134809,...
## $ Boro            <fct> MANHATTAN, MANHATTAN, MANHATTAN, MANHATTAN, MANHATT...
## $ CouncilDistrict <int> 1, 1, 2, 1, 2, 2, 2, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, ...
## $ CensusTract     <fct> "201", "202", "34", "3,001", "2,201", "1,002", "64"...
## $ grad2011        <dbl> 56.3, 70.7, 71.6, 56.4, 95.3, 97.6, 69.6, NA, NA, 8...
## $ collenroll2011  <dbl> 51.9, 36.3, 69.2, 47.7, 87.0, 95.7, 47.1, NA, NA, 6...
## $ grad2012        <dbl> 50.7, 74.4, 72.7, 59.5, 97.5, 96.6, 54.8, NA, NA, 7...
## $ collenroll2012  <dbl> 40.0, 67.3, 55.7, 47.9, 82.8, 96.8, 30.4, NA, NA, 5...
## $ sprtcode        <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ apcode          <fct> 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, ...
## $ langcode        <fct> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ leadcode        <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ onlineapcode    <fct> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, ...
## $ onlinelangcode  <fct> 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Data Summary

The summary below gives summary statistics of each column of the merged dataset as well as a subset summary that splits the sports code of 1 and 0

##                 vars   n   mean     sd median trimmed    mad    min    max
## DBN*               1 422 211.50 121.97 211.50  211.50 156.41   1.00 422.00
## BN*                2 422 211.50 121.97 211.50  211.50 156.41   1.00 422.00
## Printed_Name*      3 422 211.50 121.97 211.50  211.50 156.41   1.00 422.00
## TotStudnts*        4 406 162.40  85.40 167.50  163.37 100.82   1.00 316.00
## School_Type*       5  84   4.63   2.30   3.00    4.47   2.97   1.00   9.00
## LangClass*         6 392  58.46  17.38  70.00   62.26   0.00   1.00  72.00
## APCourses*         7 304 138.90  74.70 143.50  141.23  97.85   1.00 252.00
## LeaderSupport*     8 420 210.50 121.39 210.50  210.50 155.67   1.00 420.00
## Sports*            9 299 140.70  79.58 137.00  140.63 100.82   1.00 281.00
## OnlineAP*         10  46  21.57  11.72  22.50   21.76  15.57   1.00  40.00
## OnlineLang*       11  52  21.44   9.74  26.00   22.48   7.41   1.00  32.00
## School_Type2*     12  84   4.63   2.30   3.00    4.47   2.97   1.00   9.00
## neighborhood*     13 422  52.75  30.40  52.00   51.93  34.10   1.00 109.00
## postalCode*       14 422  53.07  31.61  51.00   52.11  31.13   1.00 117.00
## precinct          15 422  59.90  33.85  52.00   59.65  39.29   1.00 123.00
## school_district   16 422  13.91   9.22  12.00   13.40  10.38   1.00  32.00
## lat               17 422  40.75   0.08  40.74   40.75   0.11  40.53  40.89
## lon               18 422 -73.92   0.07 -73.93  -73.93   0.07 -74.19 -73.73
## Location*         19 422 137.29  75.07 136.50  138.32  97.85   1.00 258.00
## Boro*             20 420   2.39   1.14   2.00    2.34   1.48   1.00   5.00
## CouncilDistrict   21 420  21.78  14.40  18.00   21.17  17.79   1.00  51.00
## CensusTract*      22 420 101.46  57.15 106.50  101.25  74.87   1.00 198.00
## grad2011          23 310  73.80  14.34  72.60   73.61  16.38  41.20 100.00
## collenroll2011    24 291  53.12  19.31  48.90   51.70  18.24  14.10 100.00
## grad2012          25 337  73.70  15.12  73.50   73.99  15.57  25.50 100.00
## collenroll2012    26 314  53.93  19.09  50.35   52.73  17.94   8.90 100.00
## sprtcode*         27 422   1.71   0.45   2.00    1.76   0.00   1.00   2.00
## apcode*           28 422   1.72   0.45   2.00    1.78   0.00   1.00   2.00
## langcode*         29 422   1.93   0.26   2.00    2.00   0.00   1.00   2.00
## leadcode*         30 422   2.00   0.07   2.00    2.00   0.00   1.00   2.00
## onlineapcode*     31 422   1.11   0.31   1.00    1.01   0.00   1.00   2.00
## onlinelangcode*   32 422   1.12   0.33   1.00    1.03   0.00   1.00   2.00
##                  range   skew kurtosis   se
## DBN*            421.00   0.00    -1.21 5.94
## BN*             421.00   0.00    -1.21 5.94
## Printed_Name*   421.00   0.00    -1.21 5.94
## TotStudnts*     315.00  -0.08    -1.00 4.24
## School_Type*      8.00   0.50    -0.92 0.25
## LangClass*       71.00  -1.62     1.82 0.88
## APCourses*      251.00  -0.20    -1.23 4.28
## LeaderSupport*  419.00   0.00    -1.21 5.92
## Sports*         280.00   0.00    -1.15 4.60
## OnlineAP*        39.00  -0.11    -1.30 1.73
## OnlineLang*      31.00  -0.68    -0.99 1.35
## School_Type2*     8.00   0.50    -0.92 0.25
## neighborhood*   108.00   0.23    -0.95 1.48
## postalCode*     116.00   0.19    -0.79 1.54
## precinct        122.00   0.13    -1.08 1.65
## school_district  31.00   0.38    -1.04 0.45
## lat               0.36  -0.11    -0.85 0.00
## lon               0.46   0.05     1.06 0.00
## Location*       257.00  -0.08    -1.19 3.65
## Boro*             4.00   0.31    -0.99 0.06
## CouncilDistrict  50.00   0.28    -1.12 0.70
## CensusTract*    197.00  -0.01    -1.24 2.79
## grad2011         58.80   0.12    -0.88 0.81
## collenroll2011   85.90   0.59    -0.35 1.13
## grad2012         74.50  -0.15    -0.46 0.82
## collenroll2012   91.10   0.53    -0.46 1.08
## sprtcode*         1.00  -0.91    -1.17 0.02
## apcode*           1.00  -0.98    -1.04 0.02
## langcode*         1.00  -3.33     9.09 0.01
## leadcode*         1.00 -14.37   205.02 0.00
## onlineapcode*     1.00   2.50     4.26 0.02
## onlinelangcode*   1.00   2.28     3.23 0.02
## mdata$sprtcode: 0
##     grad2011      collenroll2011     grad2012      collenroll2012 
##  Min.   : 45.70   Min.   :14.10   Min.   : 25.50   Min.   : 8.90  
##  1st Qu.: 66.00   1st Qu.:43.10   1st Qu.: 64.97   1st Qu.:40.95  
##  Median : 73.80   Median :51.70   Median : 73.20   Median :51.80  
##  Mean   : 75.05   Mean   :54.13   Mean   : 74.71   Mean   :53.06  
##  3rd Qu.: 84.95   3rd Qu.:65.30   3rd Qu.: 85.30   3rd Qu.:63.20  
##  Max.   :100.00   Max.   :99.00   Max.   :100.00   Max.   :98.70  
##  NA's   :28       NA's   :32      NA's   :21       NA's   :27     
## ------------------------------------------------------------ 
## mdata$sprtcode: 1
##     grad2011      collenroll2011      grad2012      collenroll2012  
##  Min.   : 41.20   Min.   : 18.30   Min.   : 35.40   Min.   : 19.50  
##  1st Qu.: 61.50   1st Qu.: 37.75   1st Qu.: 62.45   1st Qu.: 39.83  
##  Median : 71.60   Median : 47.80   Median : 73.50   Median : 50.00  
##  Mean   : 73.24   Mean   : 52.66   Mean   : 73.27   Mean   : 54.31  
##  3rd Qu.: 85.10   3rd Qu.: 65.75   3rd Qu.: 84.60   3rd Qu.: 67.05  
##  Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :100.00  
##  NA's   :84       NA's   :99       NA's   :64       NA's   :81

The plot summaries displays a boxplot of graduation and college enrollment split by sports code. The linear plot compares graduation vs college enrollments and plots the sports codes of 1 and 0. The plot shows the grad and college enrollment have a positive correlation.

Modeling

We begin with a GLM model of sports code vs all of the college grad columns. The linearl model shows that only college enroll 2012 is significant when sports is the response variable.

## # A tibble: 5 x 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)      2.33      0.865      2.70  0.00700
## 2 grad2011        -0.0158    0.0220    -0.716 0.474  
## 3 collenroll2011  -0.0101    0.0141    -0.718 0.473  
## 4 grad2012        -0.0226    0.0194    -1.17  0.243  
## 5 collenroll2012   0.0343    0.0174     1.97  0.0483
## # A tibble: 1 x 7
##   null.deviance df.null logLik   AIC   BIC deviance df.residual
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1          359.     289  -177.  364.  382.     354.         285

We now use a GLM model with a combination of grad and college enrollment with sports code. In this model sports code is slightly significant.

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   258.        6.61    39.0   5.38e-117
## 2 sprtcode1      -5.04      7.96    -0.633 5.27e-  1
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1   0.00139      -0.00208  62.7     0.400   0.527     2 -1611. 3228. 3239.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

In this next section we switch to a linear model and try to correlate college 2012 enrollment with language, leader, onlineAP, onlineLanguage and sports code. This model says leader code is significant and language code is slightly significant.

## # A tibble: 7 x 5
##   term            estimate std.error statistic    p.value
##   <chr>              <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)       88.7       18.9      4.68  0.00000425
## 2 sprtcode1          2.18       2.35     0.929 0.354     
## 3 apcode1            4.64       2.89     1.60  0.110     
## 4 langcode1         11.4        6.17     1.84  0.0664    
## 5 leadcode1        -51.1       20.1     -2.55  0.0114    
## 6 onlineapcode1     -2.24       3.90    -0.575 0.566     
## 7 onlinelangcode1    0.574      3.96     0.145 0.885
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1    0.0344        0.0156  18.9      1.82  0.0938     7 -1366. 2747. 2777.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

This last model removes all other insignificant predictors and compare college 2012 enroll to leader and language code. This model shows that both leader and language are significant.

## # A tibble: 3 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)     88.7     18.9       4.69 0.00000415
## 2 langcode1       12.3      6.08      2.03 0.0437    
## 3 leadcode1      -46.8     19.8      -2.36 0.0190
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1    0.0235        0.0172  18.9      3.74  0.0247     3 -1367. 2743. 2758.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

Visualizations

The comparison of graduate to college enroll for 2011 and 2012 shows there is an positive correlation. Both the xyplot and ggplots separte sports , language and leader code. From the visualizaiton you can see sports show a similar trend whether there is sports or no sports. In the language plot, the trend is clear in schools with language class than schools with no language classes. The leader plot show the majority of schools with a positive trend has some type of leader program.

The ggplot confirms the analysis in the xyplot. The sports code plot has a mix of sports and no sports, the language plots show low number of schools with no language classes and leader plot have nearly all schools with a leader program.

These qq density plots hint that in schools where >80% graduate and enroll in college, there is a sports or language or leader program.

With the disovery of the density plot we narrow the histogram plot to schools with >80% students graduates and enroll in college, and split 3 programs. This is one more look at the split of the 3 codes where language and lead code has a higher significance.

Map

Below we setup a subset of schools that have > 80% graduation or college enroll rate and include the location, number of students and program information on the drill down of the map.

Conclusion

This basic modelling showed in NYC in 2011 and 2012 leadership and language courses may have had a significant impact on a child sucess in graduation and applying to college.

As an additional part of this project I created a shiny app to anaylyze more variables in the dataset.

https://apag101.shinyapps.io/final_project/

The shiny app does the following;

  • Drop down to select predictors and response variables
  • Display linear, density and histogram plots
  • Display basic statistical data
  • Give table of schools where 80% or more students graduated and enrolled in college
  • Give map of all schools with details of AP, Language, Leadership and sports programs

Appendix

Code used in analysis

knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE
)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(pROC)
library(plotly)
library(effects)
library(leaflet)
library(tidyr)
library(tibble)
library(broom)
library(grid)
library(gridExtra)
library(shiny)
schlsprts <- read.csv("DOE_High_School_Directory_2013-2014.csv",na.strings=c("","N/A"),header=TRUE)
schlgrds <- read.csv("2013_-_2014_DOE_High_School_Performance-Directory.csv",na.strings=c("","N/A"),header=TRUE)

vis_miss(schlsprts)
gg_miss_upset(schlsprts)

vis_miss(schlgrds)
gg_miss_upset(schlgrds)
#Set N/A rows for school sports column to NA
#nrow(subset(schlsprts,is.na(schlsprts$School.Sports)))
#is.na(schlsprts$School.Sports)<-schlsprts$School.Sports == "N/A"
#nrow(subset(schlsprts,is.na(schlsprts$School.Sports)))

#Rename DBA Columns
names(schlsprts)[names(schlsprts) == "ï..DBN"] <- "DBN"
names(schlgrds)[names(schlgrds) == "ï..DBN"] <- "DBN"

#Select subset columns and rename
sub_schlsprts<-subset(schlsprts, select = c('DBN','BN','Printed_Name','Total.Student.10.26','School_Type','Language.Classes','Advanced.Placement.Courses','Extra_Leadership.Support','School.Sports','Online.AP.Courses','Online.Language.Courses','School_Type2','neighborhood','postalCode','precinct','school_district','latitude','longitude','Location.1','Borough','Council.District','Census.Tract'))

names(sub_schlsprts)<-c('DBN','BN','Printed_Name','TotStudnts','School_Type','LangClass','APCourses','LeaderSupport','Sports','OnlineAP','OnlineLang','School_Type2','neighborhood','postalCode','precinct','school_district','lat','lon','Location','Boro','CouncilDistrict','CensusTract')

sub_schlgrds<-subset(schlgrds, select = c('DBN','graduation.2010.11','college.enroll.2010.11','graduation.2011.12','college.enroll.2011.12'))

names(sub_schlgrds)<-c('DBN','grad2011','collenroll2011','grad2012','collenroll2012')

glimpse(sub_schlgrds)
glimpse(sub_schlsprts)
#get subset of rows with School sports NA or N/A values
#subset(schlsprts,is.na(schlsprts$School.Sports)|schlsprts$School.Sports == "N/A")
#set sumarry columns from characters to numeric
sub_schlgrds[,-1]<-lapply(sub_schlgrds[,-1], function(x) as.numeric(sub("%","",x)))

#Merge School and sports details with summary data from performance details
mdata<-merge(sub_schlsprts, sub_schlgrds, by="DBN")

#Add column to show if a school as sports programs
mdata$sprtcode<- as.factor(ifelse(is.na(mdata$Sports), 0, 1))
mdata$apcode<- as.factor(ifelse(is.na(mdata$APCourses), 0, 1))
mdata$langcode<- as.factor(ifelse(is.na(mdata$LangClass), 0, 1))
mdata$leadcode<- as.factor(ifelse(is.na(mdata$LeaderSupport), 0, 1))
mdata$onlineapcode<- as.factor(ifelse(is.na(mdata$OnlineAP), 0, 1))
mdata$onlinelangcode<- as.factor(ifelse(is.na(mdata$OnlineLang), 0, 1))

write.csv(mdata,"schooldataupdated.csv")

glimpse(mdata)

describe(mdata)
by(subset(mdata, select =colnames(sub_schlgrds[-1])),mdata$sprtcode,summary)

par(mfrow = c(2,2))
plot(grad2011~sprtcode,mdata)
plot(collenroll2011~grad2011, pch=as.character(sprtcode), mdata)
abline(lm(collenroll2011~grad2011, data = mdata))
plot(grad2012~sprtcode,mdata)
plot(collenroll2011~grad2011, pch=as.character(sprtcode), mdata)
abline(lm(collenroll2011~grad2011, data = mdata))
glmod<-glm(sprtcode~grad2011+collenroll2011+grad2012+collenroll2012,data=mdata,family="binomial"(link="logit"))
tidy(glmod)
glance(glmod)
par(mfrow = c(2,2))
plot(glmod)
glmod<-lm(grad2011+collenroll2011+grad2012+collenroll2012~sprtcode,data=mdata,family="binomial"(link="logit"))
tidy(glmod)
glance(glmod)
par(mfrow = c(2,2))
plot(glmod)
lmod1<-lm(collenroll2012~sprtcode+apcode+langcode+leadcode+onlineapcode+onlinelangcode,data=mdata)
tidy(lmod1)
glance(lmod1)
par(mfrow = c(2,2))
plot(lmod1)
lmod2<-lm(collenroll2012~langcode+leadcode,data=mdata)
tidy(lmod2)
glance(lmod2)
par(mfrow = c(2,2))
plot(lmod2)
xyplot(grad2011 ~ collenroll2011 | sprtcode,mdata, type="l")
xyplot(grad2011 ~ collenroll2011 | langcode,mdata, type="l")
xyplot(grad2011 ~ collenroll2011 | leadcode,mdata, type="l")
g1<-ggplot(mdata, aes(x=grad2011, y=collenroll2011, color = sprtcode)) +geom_point() +stat_smooth(method="lm", se=TRUE)
g2<-ggplot(mdata, aes(x=grad2012, y=collenroll2012, color = sprtcode)) +geom_point() +stat_smooth(method="lm", se=TRUE)

g3<-ggplot(mdata, aes(x=grad2011, y=collenroll2011, color = langcode)) +geom_point() +stat_smooth(method="lm", se=TRUE)
g4<-ggplot(mdata, aes(x=grad2012, y=collenroll2012, color = langcode)) +geom_point() +stat_smooth(method="lm", se=TRUE)

g5<-ggplot(mdata, aes(x=grad2011, y=collenroll2011, color = leadcode)) +geom_point() +stat_smooth(method="lm", se=TRUE)
g6<-ggplot(mdata, aes(x=grad2012, y=collenroll2012, color = leadcode)) +geom_point() +stat_smooth(method="lm", se=TRUE)

grid.arrange(g1, g2, g3, g4, g5, g6, ncol = 2)


q1<-qplot(grad2011, data = mdata, geom = "density",
      fill = sprtcode)
q2<-qplot(grad2012, data = mdata, geom = "density",
      fill = sprtcode)
q3<-qplot(collenroll2011, data = mdata, geom = "density",
      fill = sprtcode)
q4<-qplot(collenroll2012, data = mdata, geom = "density",
      fill = sprtcode)
grid.arrange(q1, q2, q3, q4, ncol = 2)

q1<-qplot(grad2011, data = mdata, geom = "density",
      fill = langcode)
q2<-qplot(grad2012, data = mdata, geom = "density",
      fill = langcode)
q3<-qplot(collenroll2011, data = mdata, geom = "density",
      fill = langcode)
q4<-qplot(collenroll2012, data = mdata, geom = "density",
      fill = langcode)
grid.arrange(q1, q2, q3, q4, ncol = 2)

q1<-qplot(grad2011, data = mdata, geom = "density",
      fill = leadcode)
q2<-qplot(grad2012, data = mdata, geom = "density",
      fill = leadcode)
q3<-qplot(collenroll2011, data = mdata, geom = "density",
      fill = leadcode)
q4<-qplot(collenroll2012, data = mdata, geom = "density",
      fill = leadcode)
grid.arrange(q1, q2, q3, q4, ncol = 2)

q1<-qplot(grad2011, data = subset(mdata,grad2011 > 80), geom = "histogram",fill = sprtcode)+facet_wrap(~ sprtcode) 
q2<-qplot(grad2012, data = subset(mdata,grad2012 > 80), geom = "histogram",
      fill = sprtcode)+facet_wrap(~ sprtcode) 
q3<-qplot(collenroll2011, data = subset(mdata,collenroll2011 > 80), geom = "histogram",
      fill = sprtcode)+facet_wrap(~ sprtcode) 
q4<-qplot(collenroll2012, data = subset(mdata,collenroll2012 > 80), geom = "histogram",
      fill = sprtcode)+facet_wrap(~ sprtcode) 
grid.arrange(q1, q2, q3, q4, ncol = 2)

q1<-qplot(grad2011, data = subset(mdata,grad2011 > 80), geom = "histogram",fill =  langcode)+facet_wrap(~ langcode) 
q2<-qplot(grad2012, data = subset(mdata,grad2012 > 80), geom = "histogram",
      fill = langcode)+facet_wrap(~ langcode) 
q3<-qplot(collenroll2011, data = subset(mdata,collenroll2011 > 80), geom = "histogram",
      fill = langcode)+facet_wrap(~ langcode) 
q4<-qplot(collenroll2012, data = subset(mdata,collenroll2012 > 80), geom = "histogram",
      fill = langcode)+facet_wrap(~ langcode) 
grid.arrange(q1, q2, q3, q4, ncol = 2)


q1<-qplot(grad2011, data = subset(mdata,grad2011 > 80), geom = "histogram",fill = leadcode)+facet_wrap(~ leadcode) 
q2<-qplot(grad2012, data = subset(mdata,grad2012 > 80), geom = "histogram",
      fill = leadcode)+facet_wrap(~ leadcode) 
q3<-qplot(collenroll2011, data = subset(mdata,collenroll2011 > 80), geom = "histogram",
      fill = leadcode)+facet_wrap(~ leadcode) 
q4<-qplot(collenroll2012, data = subset(mdata,collenroll2012 > 80), geom = "histogram",
      fill = leadcode)+facet_wrap(~ leadcode) 
grid.arrange(q1, q2, q3, q4, ncol = 2)

#install.packages("leaflet")
mdata80<-subset(mdata,grad2011 > 80|grad2012>80|collenroll2012>80|collenroll2011>80)

#Set Details of data
content <- paste(sep = "<br/>",
 "------------------------------",
 '<b>School Information:</>', 
 mdata80$Printed_Name,
 mdata80$Boro,
 mdata80$Location,
  "------------------------------",
 '<b>Total Students:</>',
 mdata80$TotStudnts,
 "------------------------------",
 '<b>Sports:</>',
 mdata80$Sports,
 "------------------------------",
 '<b>Language Classes</>',
 mdata80$LangClass,
 "------------------------------",
 '<b>AP Courses</>',
 mdata80$APCourses,
 "------------------------------",
 '<b>Leader Support</>',
 mdata80$LeaderSupport)
 
df = data.frame(Lat = mdata80$lat, Long = mdata80$lon)

m <- leaflet(df, width = 900, height = 900) %>% 
    setView(lng = mdata80$lon[nrow(mdata80)], lat = mdata80$lat[nrow(mdata80)], zoom = 12)
m %>% 
    addTiles() %>% 
    #addCircles()%>%
    addMarkers(clusterOptions = markerClusterOptions(),mdata80$lon, mdata80$lat, popup = paste(sep = "<br/>", content))
shinyAppDir(
  system.file('..//Final Project/', package="shiny"),
  options = list(width = "100%", height = 700)
)