# load data
lead <- read.csv("https://raw.githubusercontent.com/swigodsky/DATA-606/master/Lead_Testing_in_School.csv", stringsAsFactors = FALSE)
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. Is there a relationship between the county in New York State and the likelihood that its schools’ drinking water outlets test above 15 ppb for lead?
str(lead)
## 'data.frame': 4646 obs. of 27 variables:
## $ School.District : chr "BEACON CITY SCHOOLS" "CARMEL CENTRAL SCHOOL" "CLYDE-SAVANNAH CENTRAL SCHOOL" "GENEVA CITY SCHOOLS" ...
## $ School : chr "GLENHAM SCHOOL" "CARMEL H S" "CS ELEMENTARY" "WEST ST ELEMENTARY" ...
## $ County : chr "Dutchess" "Putnam" "Wayne" "Ontario" ...
## $ Type.of.Organization : chr "Public School" "Public School" "Public School" "Public School" ...
## $ Number.of.Outlets : int NA 387 85 105 27 117 28 133 138 126 ...
## $ Any.Buildings.with.Lead.Free.Plumbing.: chr "No" "No" "No" "No" ...
## $ Previously.Sampled.Outlets : int NA 0 0 0 26 98 28 123 0 0 ...
## $ Outlets.Waiver.Requested : int NA 0 0 0 0 0 0 0 0 0 ...
## $ Waivers.Granted : int NA 0 0 0 0 0 0 0 0 0 ...
## $ Outlets.Sampled.After.Regulation : int NA 387 85 105 0 19 0 0 138 126 ...
## $ Sampling.Complete : chr "No" "Yes" "Yes" "Yes" ...
## $ Sampling.Completion.Date : chr "" "10/8/2016" "11/4/2016" "9/24/2016" ...
## $ Number.of.Outlets..Result.â...15.ppb : int NA NA 68 91 26 117 28 123 135 114 ...
## $ Number.of.Outlets..Result...15.ppb : int NA NA NA 14 1 2 0 10 3 12 ...
## $ Out.of.Service : chr "No" "No" "No" "Yes" ...
## $ All.Results.Received. : chr "No" "No" "No" "Yes" ...
## $ Date.All.Results.Received : chr "" "" "" "10/12/2016" ...
## $ School.Website. : chr "" "http://www.carmelschools.org" "http://www.clydesavannah.org" "http://www.genevacsd.org" ...
## $ BEDS.Code : chr "1.302E+11" "4.80102E+11" "6.50301E+11" "4.307E+11" ...
## $ School.Street : chr "20 CHASE DR" "30 FAIR ST" "EAST DEZENG ST" "WEST ST" ...
## $ School.City : chr "FISHKILL" "CARMEL" "CLYDE" "GENEVA" ...
## $ School.State : chr "NY" "NY" "NY" "NY" ...
## $ School.Zip.Code : int 12524 10512 14433 14456 11751 12733 11735 14512 10451 10451 ...
## $ Date.Sampling.Updated : chr "" "10/25/2016" "11/10/2016" "11/10/2016" ...
## $ Date.Results.Updated : chr "" "" "11/10/2016" "11/10/2016" ...
## $ County.Location : chr "(41.686216, -73.840468)" "(41.41131, -73.717443)" "(43.144336, -77.117995)" "(42.894571, -77.252045)" ...
## $ Location : chr "20 CHASE DR\nFISHKILL, NY 12524\n(41.51854672700006, -73.92878661499998)" "30 FAIR ST\nCARMEL, NY 10512\n(41.427951036000024, -73.67671849899995)" "EAST DEZENG ST\nCLYDE, NY 14433\n(43.084543531000065, -76.86076832299995)" "WEST ST\nGENEVA, NY 14456\n(42.86457864000005, -76.99516574599994)" ...
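The two `Number.of.Outlets..Result` columns in the listing above differ only by a few dots and a stray character, because `read.csv` rewrites header text into syntactic names. The sketch below shows one way to keep such columns distinguishable by reading the headers unchanged and renaming them explicitly. It uses a small made-up CSV string, and the header text `Result <= 15 ppb` / `Result > 15 ppb` is an assumption about what the original headers looked like, not something confirmed by the data file.

```r
# Hypothetical two-row CSV; headers and counts are illustrative only.
csv_text <- "School,Result <= 15 ppb,Result > 15 ppb\nX,68,2\nY,91,14"

# check.names = FALSE keeps the header text exactly as written in the file.
raw <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

# Rename by matching the original header text, so column order does not matter.
names(raw)[names(raw) == "Result <= 15 ppb"] <- "NumLowLead"
names(raw)[names(raw) == "Result > 15 ppb"]  <- "NumHighLead"

print(names(raw))
```

Renaming by header text rather than by position makes it harder to swap the "at or below 15 ppb" and "above 15 ppb" counts by accident.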
What are the cases, and how many are there? Each case is a public school or Board of Cooperative Educational Services (BOCES) in New York State. There are 4646 cases.
Describe the method of data collection. The school districts collect the water samples. Every potable water fixture that could be used for drinking or cooking must be tested. A laboratory approved by the Department of Health’s Environmental Laboratory Approval Program analyzes the samples to measure the level of lead in the water. When a school receives its results, it is responsible for reporting them to the New York State Department of Health, the local health department and the State Education Department through an electronic survey. The reported data are then transferred to Health Data NY.
What type of study is this (observational/experiment)? This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The data are published by the State of New York and can be found on the HealthData.gov website. The link to the data is https://www.healthdata.gov/dataset/lead-testing-school-drinking-water-sampling-and-results-most-recently-reported-beginning. I saved the data as a CSV file and uploaded it to GitHub. The GitHub link is https://raw.githubusercontent.com/swigodsky/DATA-606/master/Lead_Testing_in_School.csv. The dataset was last modified on 10/8/17, which is the day I accessed it. The data are public and licensed under the Open Data Commons Open Database License.
What is the response variable, and what type is it (numerical/categorical)? The response variable is the proportion of outlets (water sources) in each county with lead levels above 15 ppb, computed as the number of such outlets divided by the total number of outlets in the county. This is a numerical variable.
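As a rough illustration of this response variable, the sketch below computes a pooled county-level proportion on made-up counts and compares it with the mean of school-level proportions, which is a different quantity. The data frame `toy` and its values are hypothetical, and base R is used so the example runs without extra packages.

```r
# Hypothetical toy data: two counties, made-up counts (not from the real dataset).
toy <- data.frame(
  County      = c("A", "A", "B", "B"),
  OutletNum   = c(100, 20, 50, 50),
  NumHighLead = c(10, 10, 5, 15)
)

# Pooled county-level proportion: total high-lead outlets / total outlets.
high_total   <- tapply(toy$NumHighLead, toy$County, sum)
outlet_total <- tapply(toy$OutletNum, toy$County, sum)
pooled <- high_total / outlet_total

# Mean of school-level proportions, shown only for comparison.
school_prop   <- toy$NumHighLead / toy$OutletNum
mean_of_props <- tapply(school_prop, toy$County, mean)

print(pooled)
print(mean_of_props)
```

In county A the pooled proportion is 20/120 while the mean of the two school proportions is 0.3, so the pooled ratio gives more weight to schools with more outlets.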
What is the explanatory variable, and what type is it (numerical/categorical)? The explanatory variable is the county in New York State. This is a categorical variable.
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
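# Keep only the columns needed for the analysis and give them shorter names.
# Number.of.Outlets..Result...15.ppb is the count of outlets above 15 ppb.
leaddb <- lead %>%
  select(School, County,
         OutletNum = Number.of.Outlets,
         NumHighLead = Number.of.Outlets..Result...15.ppb)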
head(leaddb)
## School County OutletNum NumHighLead
## 1 GLENHAM SCHOOL Dutchess NA NA
## 2 CARMEL H S Putnam 387 NA
## 3 CS ELEMENTARY Wayne 85 NA
## 4 WEST ST ELEMENTARY Ontario 105 14
## 5 ISLIP J H S Suffolk 27 1
## 6 BENJAMIN COSOR ELEM SCHOOL Sullivan 117 2
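# County-level proportion of outlets above 15 ppb: total high-lead outlets / total outlets.
# NAs are dropped separately in each sum, so a school with a missing NumHighLead
# still contributes its outlets to the denominator.
leadratiocounty <- leaddb %>%
  group_by(County) %>%
  summarise(Proportion_LeadInWater = sum(NumHighLead, na.rm = TRUE) / sum(OutletNum, na.rm = TRUE))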
leadratiocounty
## # A tibble: 62 x 2
## County Proportion_LeadInWater
## <chr> <dbl>
## 1 Albany 0.12842589
## 2 Allegany 0.07089552
## 3 Bronx 0.07027912
## 4 Broome 0.11265530
## 5 Cattaraugus 0.10121887
## 6 Cayuga 0.08837209
## 7 Chautauqua 0.09702233
## 8 Chemung 0.04375531
## 9 Chenango 0.04741379
## 10 Clinton 0.16329966
## # ... with 52 more rows
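In the `summarise()` call above, `na.rm = TRUE` is applied to each sum separately, so a school that reports an outlet count but no result count (such as row 2 of `head(leaddb)`) adds to the denominator only. The sketch below, on hypothetical toy counts and in base R, contrasts that calculation with an alternative that keeps only schools where both counts are present. It is one possible treatment of the missing values, not a correction of the figures reported here.

```r
# Hypothetical toy data; the NA mimics a school with no results reported yet.
toy <- data.frame(
  County      = c("A", "A", "B"),
  OutletNum   = c(100, 300, 50),
  NumHighLead = c(10, NA, 5)
)

# Same logic as the summarise() call: NAs dropped independently in each sum.
prop_all <- tapply(toy$NumHighLead, toy$County, sum, na.rm = TRUE) /
  tapply(toy$OutletNum, toy$County, sum, na.rm = TRUE)

# Alternative: use only schools where both counts are reported.
complete <- toy[complete.cases(toy[, c("OutletNum", "NumHighLead")]), ]
prop_complete <- tapply(complete$NumHighLead, complete$County, sum) /
  tapply(complete$OutletNum, complete$County, sum)

print(prop_all)
print(prop_complete)
```

For county A the first calculation gives 10/400 and the second gives 10/100, which shows how unreported results can pull a county's proportion downward.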
describe(leadratiocounty$Proportion_LeadInWater)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 62 0.13 0.05 0.12 0.13 0.05 0 0.24 0.24 0 -0.44 0.01
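The `describe()` output summarises the 62 county proportions but does not show how many schools stand behind each one. A minimal sketch of a per-county sample-size table follows, assuming a data frame shaped like `leaddb`. The toy values are hypothetical, and base R is used so the example runs without extra packages.

```r
# Hypothetical toy data shaped like leaddb (made-up counts).
toy <- data.frame(
  County      = c("A", "A", "A", "B", "B"),
  OutletNum   = c(100, 40, NA, 50, 30),
  NumHighLead = c(10, NA, NA, 5, 3)
)

# A school counts as "reported" when both counts are present.
reported <- !is.na(toy$OutletNum) & !is.na(toy$NumHighLead)

n_schools  <- tapply(reported, toy$County, length)  # schools listed per county
n_reported <- tapply(reported, toy$County, sum)     # schools with both counts

county_summary <- data.frame(
  County          = names(n_schools),
  Schools         = as.vector(n_schools),
  SchoolsReported = as.vector(n_reported)
)
print(county_summary)
```

Counties with few reporting schools would be expected to have noisier proportions, which is worth keeping in mind when reading the spread of the county values.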
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(leadratiocounty, aes(x=Proportion_LeadInWater)) + geom_histogram(binwidth = .015, fill="blue")