Task:
Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Datasets:
In this project, I’ve looked at three different datasets:
1.) Marriage and Divorce Rates, found by classmate Jiadi Li.
2.) 2014-15 To 2016-17 School- Level NYC Regents, found by Nicholas Schettini
3.) CDC Chronic Health Diseases - found by Niteeen Kumar
Dataset 1 - Marriage and Divorce rate
Libraries:
library(tidyverse)
library(readr)
Read data: Data uploaded to github, then run through rawgit.
The data has issues with column names, formatting, types (int, char). Showing an enlongated head to diplay issues with the data.
data <- read.csv("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/national_marriage_divorce_rates_00-16.csv")
head(data, 30)
## ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016
## 1
## 2 Year
## 3 2016
## 4 2015
## 5 2014/1
## 6 2013/1
## 7 2012
## 8 2011
## 9 2010
## 10 2009
## 11 2008
## 12 2007
## 13 2006/2
## 14 2005
## 15 2004
## 16 2003
## 17 2002
## 18 2001
## 19 2000
## 20
## 21 1/Excludes data for Georgia.
## 22 2/Excludes data for Louisiana.
## 23
## 24 Note: Rates for 2001-2009 have been revised and are based on intercensal population estimates from the 2000
## 25 and 2010 censuses. Populations for 2010 rates are based on the 2010 census.
## 26 Source: CDC/NCHS National Vital Statistics System.
## 27
## 28
## 29 Provisional number of divorces and annulments and rate: United States, 2000-2016
## 30
## X X.1 X.2 X.3 X.4 X.5 X.6
## 1 NA NA NA NA
## 2 Marriages Population Rate per 1,000 total population NA NA NA NA
## 3 2,245,404 323,127,513 6.9 NA NA NA NA
## 4 2,221,579 321,418,820 6.9 NA NA NA NA
## 5 2,140,272 308,759,713 6.9 NA NA NA NA
## 6 2,081,301 306,136,672 6.8 NA NA NA NA
## 7 2,131,000 313,914,040 6.8 NA NA NA NA
## 8 2,118,000 311,591,917 6.8 NA NA NA NA
## 9 2,096,000 308,745,538 6.8 NA NA NA NA
## 10 2,080,000 306,771,529 6.8 NA NA NA NA
## 11 2,157,000 304,093,966 7.1 NA NA NA NA
## 12 2,197,000 301,231,207 7.3 NA NA NA NA
## 13 2,193,000 294,077,247 7.5 NA NA NA NA
## 14 2,249,000 295,516,599 7.6 NA NA NA NA
## 15 2,279,000 292,805,298 7.8 NA NA NA NA
## 16 2,245,000 290,107,933 7.7 NA NA NA NA
## 17 2,290,000 287,625,193 8.0 NA NA NA NA
## 18 2,326,000 284,968,955 8.2 NA NA NA NA
## 19 2,315,000 281,421,906 8.2 NA NA NA NA
## 20 NA NA NA NA
## 21 NA NA NA NA
## 22 NA NA NA NA
## 23 NA NA NA NA
## 24 NA NA NA NA
## 25 NA NA NA NA
## 26 NA NA NA NA
## 27 NA NA NA NA
## 28 NA NA NA NA
## 29 NA NA NA NA
## 30 NA NA NA NA
## X.7 X.8
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## 7 NA NA
## 8 NA NA
## 9 NA NA
## 10 NA NA
## 11 NA NA
## 12 NA NA
## 13 NA NA
## 14 NA NA
## 15 NA NA
## 16 NA NA
## 17 NA NA
## 18 NA NA
## 19 NA NA
## 20 NA NA
## 21 NA NA
## 22 NA NA
## 23 NA NA
## 24 NA NA
## 25 NA NA
## 26 NA NA
## 27 NA NA
## 28 NA NA
## 29 NA NA
## 30 NA NA
Converted data into a tibble. Tibbles make it easier to display data; which makes it easier to understand the data.
as.tbl(data)
## # A tibble: 60 x 10
## ï..Provisional.n~ X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8
## <fct> <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 "" "" "" "" NA NA NA NA NA NA
## 2 Year Marr~ Popu~ Rate~ NA NA NA NA NA NA
## 3 2016 2,24~ 323,~ 6.9 NA NA NA NA NA NA
## 4 2015 2,22~ 321,~ 6.9 NA NA NA NA NA NA
## 5 2014/1 2,14~ 308,~ 6.9 NA NA NA NA NA NA
## 6 2013/1 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## 7 2012 2,13~ 313,~ 6.8 NA NA NA NA NA NA
## 8 2011 2,11~ 311,~ 6.8 NA NA NA NA NA NA
## 9 2010 2,09~ 308,~ 6.8 NA NA NA NA NA NA
## 10 2009 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## # ... with 50 more rows
tbl_df(data)
## # A tibble: 60 x 10
## ï..Provisional.n~ X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8
## <fct> <fct> <fct> <fct> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 "" "" "" "" NA NA NA NA NA NA
## 2 Year Marr~ Popu~ Rate~ NA NA NA NA NA NA
## 3 2016 2,24~ 323,~ 6.9 NA NA NA NA NA NA
## 4 2015 2,22~ 321,~ 6.9 NA NA NA NA NA NA
## 5 2014/1 2,14~ 308,~ 6.9 NA NA NA NA NA NA
## 6 2013/1 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## 7 2012 2,13~ 313,~ 6.8 NA NA NA NA NA NA
## 8 2011 2,11~ 311,~ 6.8 NA NA NA NA NA NA
## 9 2010 2,09~ 308,~ 6.8 NA NA NA NA NA NA
## 10 2009 2,08~ 306,~ 6.8 NA NA NA NA NA NA
## # ... with 50 more rows
Select columns from the data and rename the first column something that is understandable.
data <- data %>%
as.tbl() %>%
select(ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016,
X,
X.1,
X.2) %>%
rename(num_marrage_rate = ï..Provisional.number.of.marriages.and.marriage.rate..United.States..2000.2016)
data
## # A tibble: 60 x 4
## num_marrage_rate X X.1 X.2
## <fct> <fct> <fct> <fct>
## 1 "" "" "" ""
## 2 Year Marriages Population Rate per 1,000 total population
## 3 2016 2,245,404 323,127,513 6.9
## 4 2015 2,221,579 321,418,820 6.9
## 5 2014/1 2,140,272 308,759,713 6.9
## 6 2013/1 2,081,301 306,136,672 6.8
## 7 2012 2,131,000 313,914,040 6.8
## 8 2011 2,118,000 311,591,917 6.8
## 9 2010 2,096,000 308,745,538 6.8
## 10 2009 2,080,000 306,771,529 6.8
## # ... with 50 more rows
Remove the “/” which references footnotes in the data. Using reg. expression to search through the column for what needds to be removed.
data1 <- data
data1$num_marrage_rate <- gsub("/\\d", "", data$num_marrage_rate)
data1
## # A tibble: 60 x 4
## num_marrage_rate X X.1 X.2
## <chr> <fct> <fct> <fct>
## 1 "" "" "" ""
## 2 Year Marriages Population Rate per 1,000 total population
## 3 2016 2,245,404 323,127,513 6.9
## 4 2015 2,221,579 321,418,820 6.9
## 5 2014 2,140,272 308,759,713 6.9
## 6 2013 2,081,301 306,136,672 6.8
## 7 2012 2,131,000 313,914,040 6.8
## 8 2011 2,118,000 311,591,917 6.8
## 9 2010 2,096,000 308,745,538 6.8
## 10 2009 2,080,000 306,771,529 6.8
## # ... with 50 more rows
Remove first two rows. The first two rows have: 1. a blink row, and 2. a row that has the column names. We’ll rename these later.
data1 <- data1[-c(1:2), ]
Rename columns. Renaming the columns to something that is understandable. X becomes Marriages, and so on.
data1 <- rename(data1, marriages = X, population = X.1, rate_per_1000 = X.2)
data1 <- rename(data1, year = num_marrage_rate)
data1
## # A tibble: 58 x 4
## year marriages population rate_per_1000
## <chr> <fct> <fct> <fct>
## 1 2016 2,245,404 323,127,513 6.9
## 2 2015 2,221,579 321,418,820 6.9
## 3 2014 2,140,272 308,759,713 6.9
## 4 2013 2,081,301 306,136,672 6.8
## 5 2012 2,131,000 313,914,040 6.8
## 6 2011 2,118,000 311,591,917 6.8
## 7 2010 2,096,000 308,745,538 6.8
## 8 2009 2,080,000 306,771,529 6.8
## 9 2008 2,157,000 304,093,966 7.1
## 10 2007 2,197,000 301,231,207 7.3
## # ... with 48 more rows
Remove ’,’s. When trying to use calculations on the data, the ,’s were interfering. Turns out the numbers are actually characters and not numbers.
data1$marriages <- gsub(",", "", data1$marriages)
data1$population <- gsub(",", "", data1$population)
data1
## # A tibble: 58 x 4
## year marriages population rate_per_1000
## <chr> <chr> <chr> <fct>
## 1 2016 2245404 323127513 6.9
## 2 2015 2221579 321418820 6.9
## 3 2014 2140272 308759713 6.9
## 4 2013 2081301 306136672 6.8
## 5 2012 2131000 313914040 6.8
## 6 2011 2118000 311591917 6.8
## 7 2010 2096000 308745538 6.8
## 8 2009 2080000 306771529 6.8
## 9 2008 2157000 304093966 7.1
## 10 2007 2197000 301231207 7.3
## # ... with 48 more rows
Convert columns into numeric instead of char using as.interger.
data1$marriages <- as.integer(data1$marriages)
data1$population <- as.integer(data1$population)
head(data1)
## # A tibble: 6 x 4
## year marriages population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 2245404 323127513 6.9
## 2 2015 2221579 321418820 6.9
## 3 2014 2140272 308759713 6.9
## 4 2013 2081301 306136672 6.8
## 5 2012 2131000 313914040 6.8
## 6 2011 2118000 311591917 6.8
Create a dataframe for just marrages. It’s easier to manipulate the data for marriages and divorce when they’re not in the same columns.
df_marrages <- data1[1:17,]
df_marrages
## # A tibble: 17 x 4
## year marriages population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 2245404 323127513 6.9
## 2 2015 2221579 321418820 6.9
## 3 2014 2140272 308759713 6.9
## 4 2013 2081301 306136672 6.8
## 5 2012 2131000 313914040 6.8
## 6 2011 2118000 311591917 6.8
## 7 2010 2096000 308745538 6.8
## 8 2009 2080000 306771529 6.8
## 9 2008 2157000 304093966 7.1
## 10 2007 2197000 301231207 7.3
## 11 2006 2193000 294077247 7.5
## 12 2005 2249000 295516599 7.6
## 13 2004 2279000 292805298 7.8
## 14 2003 2245000 290107933 7.7
## 15 2002 2290000 287625193 8.0
## 16 2001 2326000 284968955 8.2
## 17 2000 2315000 281421906 8.2
Create datafrane for divorce
df_divorce <- data1[30:46,]
df_divorce
## # A tibble: 17 x 4
## year marriages population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 827261 257904548 3.2
## 2 2015 800909 258518265 3.1
## 3 2014 813862 256483624 3.2
## 4 2013 832157 254408815 3.3
## 5 2012 851000 248041986 3.4
## 6 2011 877000 246273366 3.6
## 7 2010 872000 244122529 3.6
## 8 2009 840000 242610561 3.5
## 9 2008 844000 240545163 3.5
## 10 2007 856000 238352850 3.6
## 11 2006 872000 236094277 3.7
## 12 2005 847000 233495163 3.6
## 13 2004 879000 236402656 3.7
## 14 2003 927000 243902090 3.8
## 15 2002 955000 243108303 3.9
## 16 2001 940000 236416762 4.0
## 17 2000 944000 233550143 4.0
Rename marriage column to divorce.
df_divorce <- rename(df_divorce, divorce = marriages)
df_divorce
## # A tibble: 17 x 4
## year divorce population rate_per_1000
## <chr> <int> <int> <fct>
## 1 2016 827261 257904548 3.2
## 2 2015 800909 258518265 3.1
## 3 2014 813862 256483624 3.2
## 4 2013 832157 254408815 3.3
## 5 2012 851000 248041986 3.4
## 6 2011 877000 246273366 3.6
## 7 2010 872000 244122529 3.6
## 8 2009 840000 242610561 3.5
## 9 2008 844000 240545163 3.5
## 10 2007 856000 238352850 3.6
## 11 2006 872000 236094277 3.7
## 12 2005 847000 233495163 3.6
## 13 2004 879000 236402656 3.7
## 14 2003 927000 243902090 3.8
## 15 2002 955000 243108303 3.9
## 16 2001 940000 236416762 4.0
## 17 2000 944000 233550143 4.0
Crude Divorce Rate - The number of divorces per 1000 in the population.
This does not take into account the num of people who can’t marry (kids, etc.), as such it isn’t that accurate. We’ll see a better way down below looking at the divorce to marriage ratio.
df_marrages <- df_marrages %>%
mutate(marrages_rate = marriages/population *1000)
df_marrages
## # A tibble: 17 x 5
## year marriages population rate_per_1000 marrages_rate
## <chr> <int> <int> <fct> <dbl>
## 1 2016 2245404 323127513 6.9 6.95
## 2 2015 2221579 321418820 6.9 6.91
## 3 2014 2140272 308759713 6.9 6.93
## 4 2013 2081301 306136672 6.8 6.80
## 5 2012 2131000 313914040 6.8 6.79
## 6 2011 2118000 311591917 6.8 6.80
## 7 2010 2096000 308745538 6.8 6.79
## 8 2009 2080000 306771529 6.8 6.78
## 9 2008 2157000 304093966 7.1 7.09
## 10 2007 2197000 301231207 7.3 7.29
## 11 2006 2193000 294077247 7.5 7.46
## 12 2005 2249000 295516599 7.6 7.61
## 13 2004 2279000 292805298 7.8 7.78
## 14 2003 2245000 290107933 7.7 7.74
## 15 2002 2290000 287625193 8.0 7.96
## 16 2001 2326000 284968955 8.2 8.16
## 17 2000 2315000 281421906 8.2 8.23
df_divorce <- df_divorce %>%
mutate(divorse_rate = divorce/population*1000)
df_divorce
## # A tibble: 17 x 5
## year divorce population rate_per_1000 divorse_rate
## <chr> <int> <int> <fct> <dbl>
## 1 2016 827261 257904548 3.2 3.21
## 2 2015 800909 258518265 3.1 3.10
## 3 2014 813862 256483624 3.2 3.17
## 4 2013 832157 254408815 3.3 3.27
## 5 2012 851000 248041986 3.4 3.43
## 6 2011 877000 246273366 3.6 3.56
## 7 2010 872000 244122529 3.6 3.57
## 8 2009 840000 242610561 3.5 3.46
## 9 2008 844000 240545163 3.5 3.51
## 10 2007 856000 238352850 3.6 3.59
## 11 2006 872000 236094277 3.7 3.69
## 12 2005 847000 233495163 3.6 3.63
## 13 2004 879000 236402656 3.7 3.72
## 14 2003 927000 243902090 3.8 3.80
## 15 2002 955000 243108303 3.9 3.93
## 16 2001 940000 236416762 4.0 3.98
## 17 2000 944000 233550143 4.0 4.04
Combine divorce and marriage dataframes into a single variable: df_combine
df_combine <- data.frame(c(df_marrages, df_divorce))
df_combine
## year marriages population rate_per_1000 marrages_rate year.1 divorce
## 1 2016 2245404 323127513 6.9 6.948972 2016 827261
## 2 2015 2221579 321418820 6.9 6.911789 2015 800909
## 3 2014 2140272 308759713 6.9 6.931837 2014 813862
## 4 2013 2081301 306136672 6.8 6.798601 2013 832157
## 5 2012 2131000 313914040 6.8 6.788483 2012 851000
## 6 2011 2118000 311591917 6.8 6.797352 2011 877000
## 7 2010 2096000 308745538 6.8 6.788762 2010 872000
## 8 2009 2080000 306771529 6.8 6.780290 2009 840000
## 9 2008 2157000 304093966 7.1 7.093202 2008 844000
## 10 2007 2197000 301231207 7.3 7.293401 2007 856000
## 11 2006 2193000 294077247 7.5 7.457224 2006 872000
## 12 2005 2249000 295516599 7.6 7.610402 2005 847000
## 13 2004 2279000 292805298 7.8 7.783329 2004 879000
## 14 2003 2245000 290107933 7.7 7.738499 2003 927000
## 15 2002 2290000 287625193 8.0 7.961750 2002 955000
## 16 2001 2326000 284968955 8.2 8.162293 2001 940000
## 17 2000 2315000 281421906 8.2 8.226083 2000 944000
## population.1 rate_per_1000.1 divorse_rate
## 1 257904548 3.2 3.207625
## 2 258518265 3.1 3.098075
## 3 256483624 3.2 3.173154
## 4 254408815 3.3 3.270944
## 5 248041986 3.4 3.430871
## 6 246273366 3.6 3.561083
## 7 244122529 3.6 3.571977
## 8 242610561 3.5 3.462339
## 9 240545163 3.5 3.508697
## 10 238352850 3.6 3.591314
## 11 236094277 3.7 3.693440
## 12 233495163 3.6 3.627484
## 13 236402656 3.7 3.718232
## 14 243902090 3.8 3.800705
## 15 243108303 3.9 3.928290
## 16 236416762 4.0 3.976029
## 17 233550143 4.0 4.041959
Select columns from the combined dataset, and rename some columns that were ‘duplicate’ names.
df_combine <- df_combine %>%
select(year, marriages, population, marrages_rate, divorce, population.1, divorse_rate) %>%
rename(population_m = population, population_d = population.1)
Visualization of marriages and divorse over the 16 years in the data set.
Looking at the data, it seems overall theres a relationship. As the # of marriages increase, so does the divorce rate. But what does this actually tell us? It could just be the total population over 16 years has increased…
ggplot(df_combine) +
geom_point(aes(year, population_m), color = 'blue') +
geom_point(aes(year, population_d), color = 'red') +
facet_grid(~year) +
theme_minimal()
Divorse to Marriage ratio.
Number of divorces to the number of marriages in a given year. This takes into account how many people were actually married!
d_2_m_ratio <- df_combine$divorce/df_combine$marriages
d_2_m_ratio
## [1] 0.3684241 0.3605134 0.3802610 0.3998254 0.3993430 0.4140699 0.4160305
## [8] 0.4038462 0.3912842 0.3896222 0.3976288 0.3766118 0.3856955 0.4129176
## [15] 0.4170306 0.4041273 0.4077754
Add ratio column to dataframe for our visualization.
df_combine <- df_combine %>%
mutate(dm_ratio = df_combine$divorce/df_combine$marriages)
ggplot(df_combine, aes(year, dm_ratio)) +
geom_point(aes(color = dm_ratio)) +
theme_dark()
It looks like the divorse rate has actually decreased over the 16 years in the dataset. From ~41% to ~37%. It would be interesting if the data had more information about the individuals surveyed. For example: the Age that they became married, their education status, and their income level. All of these factors could have an impact on if a couple gets divorsed.
Doing some research online, it seems that couples who have a college degree have a much lower chance of being divorced.
Dataset 2 - 2014-15 To 2016-17 School- Level NYC Regents
Libraries:
library(tidyverse)
library(readr)
library(knitr)
library(kableExtra)
Read Data:
school_data <- read_csv("2014-15_To_2016-17_School-_Level_NYC_Regents_Report_For_All_Variables.csv")
## Parsed with column specification:
## cols(
## `School DBN` = col_character(),
## `School Name` = col_character(),
## `School Type` = col_character(),
## `School Level` = col_character(),
## `Regents Exam` = col_character(),
## Year = col_integer(),
## `Demographic Category` = col_character(),
## `Demographic Variable` = col_character(),
## `Total Tested` = col_integer(),
## `Mean Score` = col_character(),
## `Number Scoring Below 65` = col_character(),
## `Percent Scoring Below 65` = col_character(),
## `Number Scoring 65 or Above` = col_character(),
## `Percent Scoring 65 or Above` = col_character(),
## `Number Scoring 80 or Above` = col_character(),
## `Percent Scoring 80 or Above` = col_character(),
## `Number Scoring CR` = col_character(),
## `Percent Scoring CR` = col_character()
## )
Viewing the data to see what variables we have. Using head function since there are over 212k rows.
head(as.tbl(school_data))
## # A tibble: 6 x 18
## `School DBN` `School Name` `School Type` `School Level` `Regents Exam`
## <chr> <chr> <chr> <chr> <chr>
## 1 01M034 P.S. 034 Frank~ General Acad~ K-8 Common Core A~
## 2 01M034 P.S. 034 Frank~ General Acad~ K-8 Living Enviro~
## 3 01M034 P.S. 034 Frank~ General Acad~ K-8 Living Enviro~
## 4 01M140 P.S. 140 Natha~ General Acad~ K-8 Common Core A~
## 5 01M140 P.S. 140 Natha~ General Acad~ K-8 Common Core A~
## 6 01M140 P.S. 140 Natha~ General Acad~ K-8 Living Enviro~
## # ... with 13 more variables: Year <int>, `Demographic Category` <chr>,
## # `Demographic Variable` <chr>, `Total Tested` <int>, `Mean
## # Score` <chr>, `Number Scoring Below 65` <chr>, `Percent Scoring Below
## # 65` <chr>, `Number Scoring 65 or Above` <chr>, `Percent Scoring 65 or
## # Above` <chr>, `Number Scoring 80 or Above` <chr>, `Percent Scoring 80
## # or Above` <chr>, `Number Scoring CR` <chr>, `Percent Scoring CR` <chr>
Looking at some of the variables, I notice that there are different schools, regents exams (living environment, common core, etc), demographics, and test score data. I use to be a science teacher in a public middle school, it would be interesting to look at science data.
unique(school_data$`Regents Exam`)
## [1] "Common Core Algebra" "Living Environment"
## [3] "Common Core English" "Algebra2/Trigonometry"
## [5] "Common Core Algebra2" "Common Core Geometry"
## [7] "English" "Geometry"
## [9] "Global History and Geography" "Integrated Algebra"
## [11] "Physical Settings/Chemistry" "Physical Settings/Earth Science"
## [13] "U.S. History and Government" "Physical Settings/Physics"
## [15] NA
Living environment was a subject I taught in middle school, so I’m going to break down the data by only living environment exams.
This dataset has various grade levels - I’m going to explore junior high/middle school.
unique(school_data$`School Level`)
## [1] "K-8" "High school"
## [3] "Junior High-Intermediate-Middle" "Secondary School"
## [5] "K-12 all grades" "Elementary"
This data has different student demographics, it might be interesting to compare their overall scores.
unique(school_data$`Demographic Variable`)
## [1] "All Students"
## [2] "SWD"
## [3] "Non-SWD"
## [4] "ELL"
## [5] "English Proficient"
## [6] "Former ELL"
## [7] "Male"
## [8] "Female"
## [9] "White"
## [10] "Black"
## [11] "Hispanic"
## [12] "Asian"
## [13] "Multiple Race Categories Not Represented"
Looking through the different schools listed… there are too many to display, but it would be interesting if my old school is listed…
head(unique(school_data$`School Name`))
## [1] "P.S. 034 Franklin D. Roosevelt"
## [2] "P.S. 140 Nathan Straus"
## [3] "P.S. 184m Shuang Wen"
## [4] "P.S. 188 The Island School"
## [5] "Orchard Collegiate Academy"
## [6] "Technology, Arts, and Sciences Studio"
Lets see …
school_data %>%
filter(`School Name` == "Marsh Avenue School for Expeditionary Learning")
## # A tibble: 83 x 18
## `School DBN` `School Name` `School Type` `School Level` `Regents Exam`
## <chr> <chr> <chr> <chr> <chr>
## 1 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
## 2 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
## 3 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
## 4 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ Living Enviro~
## 5 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ Living Enviro~
## 6 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ U.S. History ~
## 7 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ U.S. History ~
## 8 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ U.S. History ~
## 9 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
## 10 31R063 Marsh Avenue ~ General Acad~ Junior High-I~ Common Core A~
## # ... with 73 more rows, and 13 more variables: Year <int>, `Demographic
## # Category` <chr>, `Demographic Variable` <chr>, `Total Tested` <int>,
## # `Mean Score` <chr>, `Number Scoring Below 65` <chr>, `Percent Scoring
## # Below 65` <chr>, `Number Scoring 65 or Above` <chr>, `Percent Scoring
## # 65 or Above` <chr>, `Number Scoring 80 or Above` <chr>, `Percent
## # Scoring 80 or Above` <chr>, `Number Scoring CR` <chr>, `Percent
## # Scoring CR` <chr>
Ok, now that I have some information on my dataset, lets start taking a deeper look.
I noticed that the numeric data is enetered at character type, so I need to fix that:
school_data1 <- school_data
school_data1$`Mean Score` <- as.double(school_data1$`Mean Score`)
## Warning: NAs introduced by coercion
school_data1$`Number Scoring Below 65` <- as.double(school_data1$`Number Scoring Below 65`)
## Warning: NAs introduced by coercion
school_data1$`Percent Scoring Below 65` <- as.double(school_data1$`Percent Scoring Below 65`)
## Warning: NAs introduced by coercion
school_data1$`Number Scoring 65 or Above` <- as.double(school_data1$`Number Scoring 65 or Above`)
## Warning: NAs introduced by coercion
school_data1$`Percent Scoring 65 or Above` <- as.double(school_data1$`Percent Scoring 65 or Above`)
## Warning: NAs introduced by coercion
school_data1$`Number Scoring 80 or Above` <- as.double(school_data1$`Number Scoring 80 or Above`)
## Warning: NAs introduced by coercion
school_data1$`Percent Scoring 80 or Above` <- as.double(school_data1$`Percent Scoring 80 or Above`)
## Warning: NAs introduced by coercion
head(school_data1)
## # A tibble: 6 x 18
## `School DBN` `School Name` `School Type` `School Level` `Regents Exam`
## <chr> <chr> <chr> <chr> <chr>
## 1 01M034 P.S. 034 Frank~ General Acad~ K-8 Common Core A~
## 2 01M034 P.S. 034 Frank~ General Acad~ K-8 Living Enviro~
## 3 01M034 P.S. 034 Frank~ General Acad~ K-8 Living Enviro~
## 4 01M140 P.S. 140 Natha~ General Acad~ K-8 Common Core A~
## 5 01M140 P.S. 140 Natha~ General Acad~ K-8 Common Core A~
## 6 01M140 P.S. 140 Natha~ General Acad~ K-8 Living Enviro~
## # ... with 13 more variables: Year <int>, `Demographic Category` <chr>,
## # `Demographic Variable` <chr>, `Total Tested` <int>, `Mean
## # Score` <dbl>, `Number Scoring Below 65` <dbl>, `Percent Scoring Below
## # 65` <dbl>, `Number Scoring 65 or Above` <dbl>, `Percent Scoring 65 or
## # Above` <dbl>, `Number Scoring 80 or Above` <dbl>, `Percent Scoring 80
## # or Above` <dbl>, `Number Scoring CR` <chr>, `Percent Scoring CR` <chr>
So the data has a lot of NA’s. This makes things … frustrating because some columns have more NA’s than others. So I either need to remove everything that has an NA or remove the NA’s when I want to work with those variables. I also notice that Number Scoring CR and Percent Scoring CR have much more NAs than the rest, I think I’m going to remove that completely.
science_data <- school_data1 %>%
filter(`Regents Exam` == "Living Environment", `School Level` == c("Junior High-Intermediate-Middle", "Secondary School")) %>%
select(`School Name`, `School Level`, `Regents Exam`, Year, `Demographic Variable`, `Total Tested`, `Mean Score`,`Number Scoring 65 or Above`, `Number Scoring 65 or Above`,
`Number Scoring Below 65`, `Percent Scoring 80 or Above`, `Percent Scoring 80 or Above`, `Percent Scoring Below 65`)
science_data <- na.omit(science_data)
Lets take a look at mean test scores for the living environment exam for middle school students relating to demographic variables.
science_data %>%
group_by(`Demographic Variable`, `Mean Score`) %>%
ggplot(aes(`Mean Score`)) +
geom_histogram(aes(fill = `Demographic Variable`)) +
ggtitle("Living Environment - Middle School")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
science_data %>%
group_by(`Demographic Variable`, `Mean Score`) %>%
ggplot(aes(`Mean Score`, `Demographic Variable`)) +
geom_point(aes(fill = `Demographic Variable`), alpha = 0.5) +
ggtitle("Living Environment - Middle School")
science_data1 <- science_data %>%
group_by(`Demographic Variable`) %>%
summarize(mean(`Mean Score`))
kable(science_data1, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
row_spec(2, bold = T, color = "white", background = "green") %>%
row_spec(13, bold = T, color = "white", background = "green") %>%
row_spec(4, bold = T, color = "white", background = "red")
Demographic Variable |
mean(Mean Score )
|
---|---|
All Students | 73.64864 |
Asian | 81.71721 |
Black | 71.81100 |
ELL | 55.63714 |
English Proficient | 74.56829 |
Female | 73.21759 |
Former ELL | 76.12270 |
Hispanic | 72.59544 |
Male | 75.53846 |
Multiple Race Categories Not Represented | 75.24500 |
Non-SWD | 74.38908 |
SWD | 60.45306 |
White | 81.65682 |
Conclusion:
Looking at this data, it seems that the demographics: White and Asian have the highest mean living environment scores for middle school students. The lowest mean score are English Language Learners, with a 55.6 mean score. They are, on average, about 15 points lower than all of the others. It’s interesting that former ELL students are on average with other demographic groups.
The Living Environment exam is broken up into different sections, with a lot of reading comphrension. One seciton has a hands-on lab, where students must follow a set of instructions withouth any guidance from the teacher. This could be one reason why ELL students strugle with this exam.
I’m interested in knowing how their scores compare to a non-reading heavy exam - like Math. (Though now with the common core it is very reading heavy.)
Lets take a look!
So, math has many different types of exams, and the Living Environment is technically a 9th grade exam, however students can take it in 8th grade. Lets break down the intermediate algebra exam, which I believe is also a 9th grade exam that can be taken in middle school.
math_data <- school_data1 %>%
filter(`Regents Exam` == "Integrated Algebra", `School Level` == c("Junior High-Intermediate-Middle", "Secondary School")) %>%
select(`School Name`, `School Level`, `Regents Exam`, Year, `Demographic Variable`, `Total Tested`, `Mean Score`,`Number Scoring 65 or Above`, `Number Scoring 65 or Above`,
`Number Scoring Below 65`, `Percent Scoring 80 or Above`, `Percent Scoring 80 or Above`, `Percent Scoring Below 65`)
math_data <- na.omit(math_data)
math_data %>%
group_by(`Demographic Variable`, `Mean Score`) %>%
ggplot(aes(`Mean Score`)) +
geom_histogram(aes(fill = `Demographic Variable`)) +
ggtitle("Intermediate Algebra - Middle School")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
math_data %>%
group_by(`Demographic Variable`, `Mean Score`) %>%
ggplot(aes(`Mean Score`, `Demographic Variable`)) +
geom_point(aes(fill = `Demographic Variable`), alpha = 0.5) +
ggtitle("Intermediate Algebra - Middle School")
math_data1 <- math_data %>%
group_by(`Demographic Variable`) %>%
summarize(mean(`Mean Score`))
kable(math_data1, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
row_spec(2, bold = T, color = "white", background = "green") %>%
row_spec(11, bold = T, color = "white", background = "red") %>%
row_spec(4, bold = T, color = "white", background = "lightblue")
Demographic Variable |
mean(Mean Score )
|
---|---|
All Students | 67.71231 |
Asian | 76.13333 |
Black | 65.85000 |
ELL | 61.24375 |
English Proficient | 67.97083 |
Female | 67.71091 |
Former ELL | 70.15714 |
Hispanic | 67.10541 |
Male | 67.82000 |
Non-SWD | 68.57347 |
SWD | 58.88000 |
White | 73.11429 |
Conculsions:
It looks like ELL’s mean score has increased by about ~5. The mean score for Asian and White has also decreased by about ~5 and ~8, respectively. The lowest scoring group for the intermediate algbera is now the SWD - students with disabilities.
It’s also interesting to look at the spread of data in the point graphs. The math range of scores are much more spread when compared to the science scores.
Lets compare my old school to the whole NYC dataset.
former_school <- school_data1 %>%
filter(`Regents Exam` == "Living Environment", `School Name` == "Marsh Avenue School for Expeditionary Learning" ) %>%
select(`School Name`, `School Level`, `Regents Exam`, Year, `Demographic Variable`, `Total Tested`, `Mean Score`,`Number Scoring 65 or Above`, `Number Scoring 65 or Above`,
`Number Scoring Below 65`, `Percent Scoring 80 or Above`, `Percent Scoring 80 or Above`, `Percent Scoring Below 65`)
former_school <- na.omit(former_school)
former_school
## # A tibble: 11 x 11
## `School Name` `School Level` `Regents Exam` Year `Demographic Va~
## <chr> <chr> <chr> <int> <chr>
## 1 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2016 All Students
## 2 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2017 All Students
## 3 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2017 Non-SWD
## 4 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2016 Female
## 5 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2016 Male
## 6 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2017 Female
## 7 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2017 Male
## 8 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2016 Asian
## 9 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2016 White
## 10 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2017 Asian
## 11 Marsh Avenue Sc~ Junior High-Int~ Living Enviro~ 2017 White
## # ... with 6 more variables: `Total Tested` <int>, `Mean Score` <dbl>,
## # `Number Scoring 65 or Above` <dbl>, `Number Scoring Below 65` <dbl>,
## # `Percent Scoring 80 or Above` <dbl>, `Percent Scoring Below 65` <dbl>
former_school %>%
group_by(`Demographic Variable`, `Mean Score`) %>%
ggplot(aes(`Mean Score`)) +
geom_histogram(aes(fill = `Demographic Variable`)) +
ggtitle("Living Environment - Marsh Avenue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
former_school %>%
group_by(`Demographic Variable`, `Mean Score`) %>%
ggplot(aes(`Mean Score`, `Demographic Variable`)) +
geom_point( alpha = 0.5) +
ggtitle("Living Environment - Marsh Avenue")
former_school %>%
group_by(`Mean Score`) %>%
ggplot(aes(`Mean Score`, `Demographic Variable`)) +
geom_density(aes(fill = `Demographic Variable`), alpha = 0.5) +
ggtitle("Living Environment - Marsh Avenue")
former_school1 <- former_school %>%
group_by(`Demographic Variable`) %>%
summarize(mean(`Mean Score`))
kable(former_school1, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T)
Demographic Variable |
mean(Mean Score )
|
---|---|
All Students | 83.30 |
Asian | 87.30 |
Female | 82.35 |
Male | 84.45 |
Non-SWD | 81.70 |
White | 81.65 |
Based on the students that did take the Living environment regents, it seems that for ‘all students’, this school had a higher mean than the city average (81.57 vs. 83.30).
Lets take a look on how students regents scores have progressed over the years in our dataset:
science_data %>%
ggplot(aes(Year, `Mean Score`)) +
geom_point() +
aes(colour = factor(Year)) +
stat_summary()
## No summary function supplied, defaulting to `mean_se()
It looks like as a whole, students have increased their living environment exams over the 3 years in the data set. As a former science teacher, I know the science standards have only increased in complexity, and teachers are continually held to a high standard.
math_data %>%
ggplot(aes(Year, `Mean Score`)) +
geom_point() +
aes(colour = factor(Year)) +
stat_summary()
## No summary function supplied, defaulting to `mean_se()
Dataset3 - CDC Chronic Disease Data
Libraries:
library(geofacet)
library(tidyverse)
Read in data:
CDC <- read.csv("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA607/Week5/dataset/U.S._Chronic_Disease_Indicators__CDI.csv", stringsAsFactors = F)
The data consists of many variables and over 50k rows of data. I want to break the data down by creating a filter using the tidyverse for only the year 2013.
colnames(CDC)
## [1] "ï..Year" "LocationAbbr"
## [3] "LocationDesc" "Category"
## [5] "Indicator" "Datasource"
## [7] "DataValueUnit" "DataValueType"
## [9] "DataValue" "DataValueAlt"
## [11] "DataValueFootnoteSymbol" "DataValueFootnote"
## [13] "Gender" "StratificationID1"
## [15] "IndicatorID" "LocationID"
## [17] "LowConfidenceInterval" "HighConfidenceInterval"
## [19] "GeoLocation"
unique(CDC$ï..Year)
## [1] "2013" "2011" "2012" "2010" "2007-2011"
## [6] "2006-2010" "2009" "2009-2011" "2008" "2011-2012"
## [11] "2001" "2007"
unique(CDC$Category)
## [1] "Alcohol"
## [2] "Arthritis"
## [3] "Asthma"
## [4] "Cancer"
## [5] "Chronic Kidney Disease"
## [6] "Chronic Obstructive Pulmonary Disease"
## [7] "Cardiovascular Disease"
## [8] "Diabetes"
## [9] "Disability"
## [10] "Immunization"
## [11] "Mental Health"
## [12] "Nutrition, physical activity, and weight status"
## [13] "Nutrition, Physical Activity, and Weight Status"
## [14] "Older Adults"
## [15] "Oral Health"
## [16] "Overarching Conditions"
## [17] "Reproductive Health"
## [18] "Tobacco"
In this chunk of code, I’m renaming the CDC variable to CDC1, and selecting certain variables I want to work with. I’m also filtering for the year 2013, and renaming some columns to more user-friendly names.
CDC1 <- CDC %>%
select(ï..Year, LocationAbbr, Category, Indicator, IndicatorID, LowConfidenceInterval, HighConfidenceInterval) %>%
filter(ï..Year == 2013) %>%
rename(year = ï..Year, LocationID = LocationAbbr, LowCI = LowConfidenceInterval,
HighCI = HighConfidenceInterval)
head(CDC1)
## year LocationID Category Indicator IndicatorID LowCI
## 1 2013 AL Alcohol Alcohol use among youth ALC1_1 30.1
## 2 2013 AK Alcohol Alcohol use among youth ALC1_1 19.3
## 3 2013 AZ Alcohol Alcohol use among youth ALC1_1 31.4
## 4 2013 AR Alcohol Alcohol use among youth ALC1_1 32.3
## 5 2013 CA Alcohol Alcohol use among youth ALC1_1 NA
## 6 2013 CO Alcohol Alcohol use among youth ALC1_1 NA
## HighCI
## 1 40.3
## 2 26.1
## 3 40.9
## 4 40.4
## 5 NA
## 6 NA
For the year 2013 - there are still over 24k rows of data.
This next chunk of code breaks down all of the categories in the dataset. As you can see below, there are many factors included that lead to disease in individuals.
#Entire 2013
ggplot(CDC1, aes(Category)) + geom_bar(aes(fill = Category)) +
coord_flip()
According to this data, for the year 2013 diabetes is the top leading cause for disease in this dataset.
I wonder how this relates to each state in the USA. Do certain states have higher percentages of certain types of diseases?
ggplot(CDC1, aes(Category, LocationID)) +
geom_col(aes(fill=Category)) +
facet_geo(~LocationID, grid = "us_state_grid2") +
coord_flip() +
theme(axis.text.y=element_blank()) +
theme(legend.position = "none")
## Some values in the specified facet_geo column 'LocationID' do not
## match the 'code' column of the specified grid and will be
## removed: GU, PR, US, VI
I removed the legend so the graph is easier to read (but it’s the same as the one above in relation to colors).
CDC1 %>%
group_by(Category) %>%
filter(Category == c("Alcohol", "Tobacco", "Cardiovascular Disease")) %>%
summarise(min(LocationID), max(LocationID))
## # A tibble: 3 x 3
## Category `min(LocationID)` `max(LocationID)`
## <chr> <chr> <chr>
## 1 Alcohol AK WY
## 2 Cardiovascular Disease AK WY
## 3 Tobacco AK WY
Looking at the data, it seems WY has the highest reported chronic diseases relating to Alcohol, TObacco, and Card Disease.
The dataset has lots of variables that affect ones health. There are a few I’m most interested in looking at, including: Tobacco
This code chunk breaks down 2013 into only Tobacco relalted vairables.
#2013 Alcohol
CDC_Tobacco <- CDC1 %>%
group_by(LocationID, Category) %>%
filter(Category %in% c("Tobacco"))
CDC_Tobacco
## # A tibble: 2,090 x 7
## # Groups: LocationID, Category [55]
## year LocationID Category Indicator IndicatorID LowCI HighCI
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013 AL Tobacco Current cigarette s~ TOB1_1 15.5 20.8
## 2 2013 AK Tobacco Current cigarette s~ TOB1_1 8.20 13.4
## 3 2013 AZ Tobacco Current cigarette s~ TOB1_1 11.8 16.6
## 4 2013 AR Tobacco Current cigarette s~ TOB1_1 16.4 22.2
## 5 2013 CA Tobacco Current cigarette s~ TOB1_1 NA NA
## 6 2013 CO Tobacco Current cigarette s~ TOB1_1 NA NA
## 7 2013 CT Tobacco Current cigarette s~ TOB1_1 11.1 16.3
## 8 2013 DE Tobacco Current cigarette s~ TOB1_1 12.5 16.0
## 9 2013 DC Tobacco Current cigarette s~ TOB1_1 NA NA
## 10 2013 FL Tobacco Current cigarette s~ TOB1_1 9.70 12.0
## # ... with 2,080 more rows
How does the reported disease data on Alcohol compare to that of tobacco and cardio health?
#Alcohol comparison to tobacco
CDC_Tobaccocompare <- CDC1 %>%
group_by(LocationID, Category) %>%
filter(Category %in% c("Tobacco"))
CDC_Tobaccocompare %>%
group_by(Category) %>%
filter(Category == c("Tobacco")) %>%
summarise(n=n())
## # A tibble: 1 x 2
## Category n
## <chr> <int>
## 1 Tobacco 2090
In our data, we have a total 2090 cases that suffer from Tobacco issues for our filtered dataset. Looking into the data, it doesn’t seem that the recorded data mentions if an individual that reported chronic tobacco use also had cardiovascular disease. It seems the data just reported on a case-by-case basis, so that makes it hard to draw relationships between the two - e.g.: if more users that suffered from tobacco also had cardio vascular issues.
This following code chunk attempts to break down the tobacco number (2090) into those who suffered from tobacco and cardiovascular disease.
CDC_Tobaccocompare <- CDC1 %>%
group_by(LocationID, Category) %>%
filter(Category %in% c("Tobacco", "Cardiovascular Disease" ))
CDC_Tobaccocompare %>%
group_by(Category) %>%
filter(Category == c("Tobacco", "Cardiovascular Disease")) %>%
summarise(n=n())
## Warning in Category == c("Tobacco", "Cardiovascular Disease"): longer
## object length is not a multiple of shorter object length
## # A tibble: 2 x 2
## Category n
## <chr> <int>
## 1 Cardiovascular Disease 1347
## 2 Tobacco 1045
What indicator most closely relates to Cardiovascular issues?
WIthin the dataset for 2013, it shows that for pure number of reported incidents, Cardiovascular Disease and Nutrition are the most compareable. In today’s society, we know there is an issue with being overweight and having heart problems. Most reported indicidents for 2013 were infact those two, with tobacco following closely behind. It would have been nice to know if an individual also suffered from more than one disease. The way the data is broken down is by reported cases by state, in it’s own category. Based on that, it’s hard to say that Nutirion issues can “cause” cardiovascular disease, but intuitively one can say that it does.
CDC_heartdisease <- CDC1 %>%
group_by(LocationID, Category) %>%
filter(Category %in% c("Nutrition, physical activity, and weight status", "Tobacco", "Alcohol", "Cardiovascular Disease"))
ggplot(CDC_heartdisease, aes(Category)) +
geom_bar(aes(fill=Category))