DATA606 Project Proposal

Data Preparation

# load data
lead <- read.csv("https://raw.githubusercontent.com/swigodsky/DATA-606/master/Lead_Testing_in_School.csv", stringsAsFactors = FALSE)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. Is there a relationship between the New York State county a school is located in and the proportion of its drinking water outlets with lead levels above 15 ppb?

Cases

str(lead)

## 'data.frame':    4646 obs. of  27 variables:
##  $ School.District                       : chr  "BEACON CITY SCHOOLS" "CARMEL CENTRAL SCHOOL" "CLYDE-SAVANNAH CENTRAL SCHOOL" "GENEVA CITY SCHOOLS" ...
##  $ School                                : chr  "GLENHAM SCHOOL" "CARMEL H S" "CS ELEMENTARY" "WEST ST ELEMENTARY" ...
##  $ County                                : chr  "Dutchess" "Putnam" "Wayne" "Ontario" ...
##  $ Type.of.Organization                  : chr  "Public School" "Public School" "Public School" "Public School" ...
##  $ Number.of.Outlets                     : int  NA 387 85 105 27 117 28 133 138 126 ...
##  $ Any.Buildings.with.Lead.Free.Plumbing.: chr  "No" "No" "No" "No" ...
##  $ Previously.Sampled.Outlets            : int  NA 0 0 0 26 98 28 123 0 0 ...
##  $ Outlets.Waiver.Requested              : int  NA 0 0 0 0 0 0 0 0 0 ...
##  $ Waivers.Granted                       : int  NA 0 0 0 0 0 0 0 0 0 ...
##  $ Outlets.Sampled.After.Regulation      : int  NA 387 85 105 0 19 0 0 138 126 ...
##  $ Sampling.Complete                     : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Sampling.Completion.Date              : chr  "" "10/8/2016" "11/4/2016" "9/24/2016" ...
##  $ Number.of.Outlets..Result.â...15.ppb  : int  NA NA 68 91 26 117 28 123 135 114 ...
##  $ Number.of.Outlets..Result...15.ppb    : int  NA NA NA 14 1 2 0 10 3 12 ...
##  $ Out.of.Service                        : chr  "No" "No" "No" "Yes" ...
##  $ All.Results.Received.                 : chr  "No" "No" "No" "Yes" ...
##  $ Date.All.Results.Received             : chr  "" "" "" "10/12/2016" ...
##  $ School.Website.                       : chr  "" "http://www.carmelschools.org" "http://www.clydesavannah.org" "http://www.genevacsd.org" ...
##  $ BEDS.Code                             : chr  "1.302E+11" "4.80102E+11" "6.50301E+11" "4.307E+11" ...
##  $ School.Street                         : chr  "20 CHASE DR" "30 FAIR ST" "EAST DEZENG ST" "WEST ST" ...
##  $ School.City                           : chr  "FISHKILL" "CARMEL" "CLYDE" "GENEVA" ...
##  $ School.State                          : chr  "NY" "NY" "NY" "NY" ...
##  $ School.Zip.Code                       : int  12524 10512 14433 14456 11751 12733 11735 14512 10451 10451 ...
##  $ Date.Sampling.Updated                 : chr  "" "10/25/2016" "11/10/2016" "11/10/2016" ...
##  $ Date.Results.Updated                  : chr  "" "" "11/10/2016" "11/10/2016" ...
##  $ County.Location                       : chr  "(41.686216, -73.840468)" "(41.41131, -73.717443)" "(43.144336, -77.117995)" "(42.894571, -77.252045)" ...
##  $ Location                              : chr  "20 CHASE DR\nFISHKILL, NY 12524\n(41.51854672700006, -73.92878661499998)" "30 FAIR ST\nCARMEL, NY 10512\n(41.427951036000024, -73.67671849899995)" "EAST DEZENG ST\nCLYDE, NY 14433\n(43.084543531000065, -76.86076832299995)" "WEST ST\nGENEVA, NY 14456\n(42.86457864000005, -76.99516574599994)" ...

What are the cases, and how many are there? Each case represents a public school or board of cooperative education services (BOCES) in New York State. There are 4646 cases.
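The case count and the split between organization types can be checked directly. The sketch below runs on a small made-up data frame (`lead_toy`, hypothetical) that only mimics the `Type.of.Organization` column; with the real data, `lead` would take its place. The "BOCES" label is an assumption, since the `str()` output above only shows "Public School".

```r
# Hypothetical stand-in for `lead`; the values are illustrative only.
lead_toy <- data.frame(
  School = c("A ELEM", "B H S", "C CENTER"),
  Type.of.Organization = c("Public School", "Public School", "BOCES"),
  stringsAsFactors = FALSE
)

# Each row is one case, so the row count is the number of cases.
n_cases <- nrow(lead_toy)

# Number of cases per organization type.
type_counts <- table(lead_toy$Type.of.Organization)

n_cases
type_counts
```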

Data collection

Describe the method of data collection. School districts collect the water samples. Every potable water fixture that could be used for drinking or cooking must be tested. A laboratory approved by the Department of Health’s Environmental Laboratory Approval Program analyzes the samples to measure the level of lead in the water. When a school receives its results, it is responsible for reporting them to the New York State Department of Health, the local health department, and the State Education Department through an electronic survey. The reported data are then transferred to Health Data NY.

Type of study

What type of study is this (observational/experiment)? This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data are published by the State of New York and can be found on the HealthData.gov website. The link to the data is https://www.healthdata.gov/dataset/lead-testing-school-drinking-water-sampling-and-results-most-recently-reported-beginning. I saved the data as a CSV file and uploaded it to GitHub. The GitHub link is https://raw.githubusercontent.com/swigodsky/DATA-606/master/Lead_Testing_in_School.csv. The dataset was last modified on 10/8/17, which is also the day I accessed it. The data are public and licensed under the Open Data Commons Open Database License.

Response

What is the response variable, and what type is it (numerical/categorical)? The response variable is the proportion of outlets (water sources) in each county with lead levels above 15 ppb, that is, the number of outlets with results above 15 ppb divided by the total number of outlets in the county. This is a numerical variable.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)? The explanatory variable is the county in New York State. This is a categorical variable.

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Calculating the Proportion of School Water Outlets with High Lead By County in New York State

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(psych)

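# Keep only the columns needed for the analysis and give them shorter names.
# Number.of.Outlets..Result...15.ppb is the second of the two result-count columns in str(lead).
leaddb <- lead %>%
  select(School,
         County,
         OutletNum = Number.of.Outlets,
         NumHighLead = Number.of.Outlets..Result...15.ppb)

head(leaddb)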

##                       School   County OutletNum NumHighLead
## 1             GLENHAM SCHOOL Dutchess        NA          NA
## 2                 CARMEL H S   Putnam       387          NA
## 3              CS ELEMENTARY    Wayne        85          NA
## 4         WEST ST ELEMENTARY  Ontario       105          14
## 5                ISLIP J H S  Suffolk        27           1
## 6 BENJAMIN COSOR ELEM SCHOOL Sullivan       117           2

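# Pool outlets within each county: outlets with high lead divided by total outlets.
# na.rm = TRUE drops missing values from each sum separately.
leadratiocounty <- leaddb %>%
  group_by(County) %>%
  summarise(Proportion_LeadInWater = sum(NumHighLead, na.rm = TRUE) / sum(OutletNum, na.rm = TRUE))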

leadratiocounty

## # A tibble: 62 x 2
##         County Proportion_LeadInWater
##          <chr>                  <dbl>
##  1      Albany             0.12842589
##  2    Allegany             0.07089552
##  3       Bronx             0.07027912
##  4      Broome             0.11265530
##  5 Cattaraugus             0.10121887
##  6      Cayuga             0.08837209
##  7  Chautauqua             0.09702233
##  8     Chemung             0.04375531
##  9    Chenango             0.04741379
## 10     Clinton             0.16329966
## # ... with 52 more rows
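One caveat with `na.rm = TRUE` applied to each sum separately: a school with a reported outlet count but a missing high-lead count (such as row 2 in the `head(leaddb)` output) still adds to the denominator, which pulls the county proportion down. A possible alternative, sketched here on a hypothetical toy data frame (`leaddb_toy`) with the same column names, is to keep only schools where both values are present before summing. This is an illustration under that assumption, not the calculation used for the table above.

```r
library(dplyr)

# Hypothetical toy data with the same columns as `leaddb`.
leaddb_toy <- data.frame(
  School      = c("S1", "S2", "S3", "S4"),
  County      = c("A", "A", "B", "B"),
  OutletNum   = c(100, 50, 80, NA),
  NumHighLead = c(10, NA, 4, NA)
)

ratio_complete <- leaddb_toy %>%
  # Drop schools that are missing either count so numerator and
  # denominator are built from the same set of schools.
  filter(!is.na(OutletNum), !is.na(NumHighLead)) %>%
  group_by(County) %>%
  summarise(
    Schools = n(),
    Proportion_LeadInWater = sum(NumHighLead) / sum(OutletNum)
  )

ratio_complete
```

In this toy example county A comes out as 10/100 = 0.10, whereas summing each column with `na.rm = TRUE` would give 10/150, because school S2 contributes 50 outlets but no high-lead count.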

describe(leadratiocounty$Proportion_LeadInWater)

##    vars  n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 62 0.13 0.05   0.12    0.13 0.05   0 0.24  0.24    0    -0.44 0.01
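The template prompt asks for means, SDs, and sample sizes of each group. One possible way to get these per county is to compute a proportion for each school first and then summarise by county. The sketch below uses a hypothetical toy data frame (`school_toy`) whose column names follow `leaddb`; the new names `SchoolProp`, `n_schools`, `mean_prop`, and `sd_prop` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; column names follow `leaddb`.
school_toy <- data.frame(
  County      = c("A", "A", "A", "B", "B"),
  OutletNum   = c(100, 50, 200, 80, 20),
  NumHighLead = c(10, 5, 10, 4, 3)
)

county_stats <- school_toy %>%
  # Keep schools with both counts present and at least one outlet.
  filter(!is.na(OutletNum), !is.na(NumHighLead), OutletNum > 0) %>%
  # Proportion of high-lead outlets within each school.
  mutate(SchoolProp = NumHighLead / OutletNum) %>%
  group_by(County) %>%
  summarise(
    n_schools = n(),              # sample size per county
    mean_prop = mean(SchoolProp), # mean of school-level proportions
    sd_prop   = sd(SchoolProp)    # spread of school-level proportions
  )

county_stats
```

Note that the mean of school-level proportions weights every school equally, so it generally differs from the pooled county proportion computed above, which weights schools by their number of outlets.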

Graphing the Proportion of School Water Outlets with High Lead By County in New York State

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

# Histogram of the 62 county-level proportions
ggplot(leadratiocounty, aes(x = Proportion_LeadInWater)) +
  geom_histogram(binwidth = 0.015, fill = "blue")
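The histogram shows the overall distribution but not which county has which value. A complementary view, sketched here as one option, is a bar chart ordered by proportion. To keep the example self-contained it uses only the first five rows of the county table printed earlier (values copied from that output) in a hypothetical data frame `county_top`; with the full data, `leadratiocounty` would take its place. The axis label assumes that the high-lead column counts results above 15 ppb.

```r
library(ggplot2)

# First five rows of the county table shown earlier in this document.
county_top <- data.frame(
  County = c("Albany", "Allegany", "Bronx", "Broome", "Cattaraugus"),
  Proportion_LeadInWater = c(0.12842589, 0.07089552, 0.07027912,
                             0.11265530, 0.10121887)
)

# Order counties by their proportion and flip the axes so names are readable.
p <- ggplot(county_top,
            aes(x = reorder(County, Proportion_LeadInWater),
                y = Proportion_LeadInWater)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(x = "County", y = "Proportion of outlets above 15 ppb")

p
```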