The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
- Choose any three of the “wide” datasets identified in the Week 5 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 5 assignment!)
For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
- Please include in your homework submission, for each of the three chosen datasets:
The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.
PROCEDURE
Library definitions
# Need to employ kable table reporting functionality
library(knitr)
# Need to employ stringr for Regular Expressions
library(stringr)
# Need to employ to use tidy data functions
library(tidyr)
library(dplyr)
Annual Arts in School Report Data 2013 - 2014.
Dataset url location: https://data.cityofnewyork.us/Education/Annual-Arts-in-School-Reports-Raw-Data/vdgp-ddvg
The Annual Arts in Schools Report includes data about arts teachers, arts budgeting, space for the arts, partnerships with arts and cultural organizations, and parent involvement for elementary, middle, and high schools. These reports help school administrators, parents, and students understand how their schools are progressing towards offering universal arts education to all students.
Last Updated:
September 1, 2015
Data Provided by:
Department of Education (DOE)
Dataset Owner:
NYC OpenData
Dictionary
This dataset does not seem to have a dictionary. The download link is for a .zip file containing three .csv files as follows:
Arts Survey Data_ 2011-2012_ODP.csv
Arts Survey Data_ 2012-2013_ODP.csv
Arts Survey Data_ 2013-2014_ODP.csv
For simplicity reasons, I will read the raw data directly from the source.
URL Raw data location:
url <- "https://data.cityofnewyork.us/api/file_data/bXGEFYYP6F8SYozNljEvaG4BmTKdDtilNHRWeDEyEY4?filename=Arts_Survey_Data.zip"
(1) Read information from .CSV file into R.
From the above file, I will choose “Arts Survey Data_ 2013-2014_ODP.csv” for the latest information contained on that .zip file, I am just keeping in mind that the other two files can be worked and compared in a yearly fashion since each file represents the information for a single school year.
Read .csv from url by employing read.csv()
For this project I will try something new. I will load the data directly from the original location from a zip file. For this, I will be reading such url for the .zip file into R without unzipping the file from the web in advance; then I will unzip the file that I want to work with and read it into a data frame.
# Procedure to download .zip file containing various files including the one I want to work with.
temp <- tempfile()
download.file(url, temp)
my.file <- unz(temp, "Arts Survey Data_ 2013-2014_ODP.csv")
my.data <- read.csv(my.file, header=TRUE, sep=",", stringsAsFactors=FALSE)
# Deleting downloaded file
unlink(temp)
Imported file structure display
## 'data.frame': 1415 obs. of 1659 variables:
## $ School.Year : chr "2013-14" "2013-14" "2013-14" "2013-14" ...
## $ DBN : chr "01M015" "01M019" "01M020" "01M034" ...
## $ BN : chr "M015" "M019" "M020" "M034" ...
## $ District : int 1 1 1 1 1 1 1 1 1 1 ...
## $ School_Name : chr "P.S. 015 Roberto Clemente" "P.S. 019 Asher Levy" "P.S. 020 Anna Silver" "P.S. 034 Franklin D. Roosevelt" ...
## $ School_Level : chr "Elementary" "Elementary" "Elementary" "K-8" ...
## $ PreK : int 26 36 54 18 18 36 33 18 12 31 ...
## $ K : int 39 39 114 33 40 63 41 24 30 69 ...
## $ FirstGr : int 39 38 111 32 33 85 38 28 31 48 ...
## $ SecondGr : int 21 36 98 35 26 56 53 30 31 61 ...
## $ ThirdGr : int 16 45 109 34 23 74 46 43 34 53 ...
## $ FourthGr : int 26 47 71 45 20 56 49 33 34 63 ...
## $ FifthGr : int 23 44 74 43 19 46 58 26 28 59 ...
## $ SixthGr : int NA NA NA 43 NA NA NA NA 61 NA ...
## $ SeventhGr : int NA NA NA 57 NA NA NA NA 63 NA ...
## $ EighthGr : int NA NA NA 53 NA NA NA NA 69 NA ...
## $ NinthGr : int NA NA NA NA NA NA NA NA NA NA ...
## $ TenthGr : int NA NA NA NA NA NA NA NA NA NA ...
## $ EleventhGr : int NA NA NA NA NA NA NA NA NA NA ...
## $ TwelfthGr : int NA NA NA NA NA NA NA NA NA NA ...
## $ G01_CLS_01 : chr "1-012" "1-101" "1-015" "1-101" ...
## $ G01_CLS_02 : chr "1-102" "1-102" "1-101" "1-102" ...
## $ G01_CLS_03 : chr "" "1-103" "1-102" "" ...
## $ G01_CLS_04 : chr "" "1-901" "1-103" "" ...
## $ G01_CLS_05 : chr "" "" "1-104" "" ...
## $ G01_CLS_06 : chr "" "" "1-105" "" ...
## $ G01_CLS_07 : chr "" "" "" "" ...
## $ G01_CLS_08 : chr "" "" "" "" ...
## $ G01_CLS_09 : chr "" "" "" "" ...
## $ G01_CLS_10 : chr "" "" "" "" ...
## $ G01_CLS_11 : chr "" "" "" "" ...
## $ G01_CLS_12 : chr "" "" "" "" ...
## $ G01_CLS_13 : chr "" "" "" "" ...
## $ G01_CLS_14 : chr "" "" "" "" ...
## $ G01_CLS_15 : chr "" "" "" "" ...
## $ G01_CLS_16 : chr "" "" "" "" ...
## $ G02_CLS_01 : chr "2-202" "2-201" "2-103" "2-201" ...
## $ G02_CLS_02 : chr "" "2-202" "2-201" "2-202" ...
## $ G02_CLS_03 : chr "" "2-901" "2-202" "" ...
## $ G02_CLS_04 : chr "" "" "2-203" "" ...
## $ G02_CLS_05 : chr "" "" "2-204" "" ...
## $ G02_CLS_06 : chr "" "" "2-205" "" ...
## $ G02_CLS_07 : chr "" "" "" "" ...
## $ G02_CLS_08 : chr "" "" "" "" ...
## $ G02_CLS_09 : chr "" "" "" "" ...
## $ G02_CLS_10 : chr "" "" "" "" ...
## $ G02_CLS_11 : chr "" "" "" "" ...
## $ G02_CLS_12 : chr "" "" "" "" ...
## $ G02_CLS_13 : chr "" "" "" "" ...
## $ G03_CLS_01 : chr "3-302" "3-301" "3-203" "3-301" ...
## $ G03_CLS_02 : chr "" "3-302" "3-301" "3-302" ...
## $ G03_CLS_03 : chr "" "3-303" "3-302" "3-951" ...
## $ G03_CLS_04 : chr "" "3-902" "3-303" "" ...
## $ G03_CLS_05 : chr "" "" "3-304" "" ...
## $ G03_CLS_06 : chr "" "" "" "" ...
## $ G03_CLS_07 : chr "" "" "" "" ...
## $ G03_CLS_08 : chr "" "" "" "" ...
## $ G03_CLS_09 : chr "" "" "" "" ...
## $ G03_CLS_10 : chr "" "" "" "" ...
## $ G03_CLS_11 : chr "" "" "" "" ...
## $ G03_CLS_12 : chr "" "" "" "" ...
## $ G03_CLS_13 : chr "" "" "" "" ...
## $ G03_CLS_14 : chr "" "" "" "" ...
## $ G04_CLS_01 : chr "4-402" "4-401" "4-203" "4-401" ...
## $ G04_CLS_02 : chr "" "4-402" "4-401" "4-402" ...
## $ G04_CLS_03 : chr "" "4-403" "4-402" "4-951" ...
## $ G04_CLS_04 : chr "" "4-902" "4-403" "" ...
## $ G04_CLS_05 : chr "" "" "4-404" "" ...
## $ G04_CLS_06 : chr "" "" "" "" ...
## $ G04_CLS_07 : chr "" "" "" "" ...
## $ G04_CLS_08 : chr "" "" "" "" ...
## $ G04_CLS_09 : chr "" "" "" "" ...
## $ G04_CLS_10 : chr "" "" "" "" ...
## $ G04_CLS_11 : chr "" "" "" "" ...
## $ G04_CLS_12 : chr "" "" "" "" ...
## $ G04_CLS_13 : chr "" "" "" "" ...
## $ G04_CLS_14 : chr "" "" "" "" ...
## $ G04_CLS_15 : chr "" "" "" "" ...
## $ G05_CLS_01 : chr "5-502" "5-501" "5-402" "5-501" ...
## $ G05_CLS_02 : chr "" "5-502" "5-403" "5-502" ...
## $ G05_CLS_03 : chr "" "5-902" "5-501" "5-951" ...
## $ G05_CLS_04 : chr "" "" "5-502" "" ...
## $ G05_CLS_05 : chr "" "" "" "" ...
## $ G05_CLS_06 : chr "" "" "" "" ...
## $ G05_CLS_07 : chr "" "" "" "" ...
## $ G05_CLS_08 : chr "" "" "" "" ...
## $ G05_CLS_09 : chr "" "" "" "" ...
## $ G05_CLS_10 : chr "" "" "" "" ...
## $ G05_CLS_11 : chr "" "" "" "" ...
## $ G05_CLS_12 : chr "" "" "" "" ...
## $ G05_CLS_13 : chr "" "" "" "" ...
## $ G05_CLS_14 : chr "" "" "" "" ...
## $ G05_CLS_15 : chr "" "" "" "" ...
## $ G05_CLS_16 : logi NA NA NA NA NA NA ...
## $ REG01_CLS_01 : int 9 10 7 14 15 22 22 3 26 3 ...
## $ REG01_CLS_02 : int 30 10 24 18 18 24 16 25 5 21 ...
## $ REG01_CLS_03 : int NA 14 27 NA NA 16 NA NA NA 24 ...
## $ REG01_CLS_04 : int NA 4 14 NA NA 18 NA NA NA NA ...
## $ REG01_CLS_05 : int NA NA 19 NA NA 5 NA NA NA NA ...
## [list output truncated]
In summary, this data frame contains 1416 independent observations of 1659 recognizable possible values.
Data transformation
Now that I have the data frame I will subset a small portion (first 20 columns) in order to create some possible outcomes from the given information.
# New subset of the main data (first 20 columns) from School.Year to TwelfthGr
my.new.data <- my.data %>% subset(select=(School.Year:TwelfthGr))
New subset table.
School.Year | DBN | BN | District | School_Name | School_Level | PreK | K | FirstGr | SecondGr | ThirdGr | FourthGr | FifthGr | SixthGr | SeventhGr | EighthGr | NinthGr | TenthGr | EleventhGr | TwelfthGr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013-14 | 01M015 | M015 | 1 | P.S. 015 Roberto Clemente | Elementary | 26 | 39 | 39 | 21 | 16 | 26 | 23 | NA | NA | NA | NA | NA | NA | NA |
2013-14 | 01M019 | M019 | 1 | P.S. 019 Asher Levy | Elementary | 36 | 39 | 38 | 36 | 45 | 47 | 44 | NA | NA | NA | NA | NA | NA | NA |
2013-14 | 01M020 | M020 | 1 | P.S. 020 Anna Silver | Elementary | 54 | 114 | 111 | 98 | 109 | 71 | 74 | NA | NA | NA | NA | NA | NA | NA |
2013-14 | 01M034 | M034 | 1 | P.S. 034 Franklin D. Roosevelt | K-8 | 18 | 33 | 32 | 35 | 34 | 45 | 43 | 43 | 57 | 53 | NA | NA | NA | NA |
2013-14 | 01M063 | M063 | 1 | The STAR Academy - P.S.63 | Elementary | 18 | 40 | 33 | 26 | 23 | 20 | 19 | NA | NA | NA | NA | NA | NA | NA |
2013-14 | 01M110 | M110 | 1 | P.S. 110 Florence Nightingale | Elementary | 36 | 63 | 85 | 56 | 74 | 56 | 46 | NA | NA | NA | NA | NA | NA | NA |
Creating a Tidy Table
From the given table, we can modify it by taking the grade as one variable, that is from PreK to TwelfthGr will be put into a column named Grade. This table transformation will done by employing the gather() function from the tidyr library.
my.tidy.data <- my.new.data %>% gather("Grade","n.Students", PreK:TwelfthGr)
Short tidy table example after transformation.
School.Year | DBN | BN | District | School_Name | School_Level | Grade | n.Students |
---|---|---|---|---|---|---|---|
2013-14 | 01M015 | M015 | 1 | P.S. 015 Roberto Clemente | Elementary | PreK | 26 |
2013-14 | 01M019 | M019 | 1 | P.S. 019 Asher Levy | Elementary | PreK | 36 |
2013-14 | 01M020 | M020 | 1 | P.S. 020 Anna Silver | Elementary | PreK | 54 |
2013-14 | 01M034 | M034 | 1 | P.S. 034 Franklin D. Roosevelt | K-8 | PreK | 18 |
2013-14 | 01M063 | M063 | 1 | The STAR Academy - P.S.63 | Elementary | PreK | 18 |
2013-14 | 01M110 | M110 | 1 | P.S. 110 Florence Nightingale | Elementary | PreK | 36 |
From the above table we can explore a few things as follows:
Total number of art students:
The grand total of art students for all districts, school levels and grades is 883813 art students.
Top 5 districts with higest art student population and percentages.
District | Total Students | Percentage |
---|---|---|
2 | 55030 | 6.23 % |
31 | 54080 | 6.12 % |
24 | 52333 | 5.92 % |
10 | 49000 | 5.54 % |
20 | 46263 | 5.23 % |
Bottom 5 districts with lowest art student population and percentages.
District | Total Students | Percentage |
---|---|---|
5 | 11621 | 1.31 % |
1 | 10594 | 1.2 % |
32 | 10514 | 1.19 % |
23 | 8654 | 0.98 % |
16 | 6329 | 0.72 % |
Bar plot: Total of art students in a district sorted by district.
Bar plot: Total of art students in a district sorted by the number of art students in a district.
School level: art student population and percentages.
School_Level | Total Students | Percentage |
---|---|---|
Elementary | 362416 | 41.01 % |
High school | 232480 | 26.3 % |
Junior High-Intermediate-Middle | 139233 | 15.75 % |
K-8 | 85567 | 9.68 % |
Secondary School | 36000 | 4.07 % |
K-12 all grades | 16644 | 1.88 % |
Early Childhood | 11473 | 1.3 % |
Bar plot: Total of art students by school level sorted by the number art of students.
School Grades: art student population and percentages.
Grade | Total Students | Percentage |
---|---|---|
PreK | 21157 | 2.39 % |
K | 70095 | 7.93 % |
FirstGr | 74529 | 8.43 % |
SecondGr | 71157 | 8.05 % |
ThirdGr | 69241 | 7.83 % |
FourthGr | 67154 | 7.6 % |
FifthGr | 65302 | 7.39 % |
SixthGr | 59266 | 6.71 % |
SeventhGr | 60318 | 6.82 % |
EighthGr | 62700 | 7.09 % |
NinthGr | 74568 | 8.44 % |
TenthGr | 71619 | 8.1 % |
EleventhGr | 58074 | 6.57 % |
TwelfthGr | 58633 | 6.63 % |
Bar plot: Total of art students in a grade sorted by grade.
Conclusions
Districts:
Based on data analysis, we can appreciate major art student population differences in between the top 5 districts and bottom 5 districts with as much as 48701 art students. This will represent a good starting point to find out the multiple reasons that could affect as much; could it be population density playing a role on this?
School level:
Based on simple observation, we can observe how higher school levels tend to have more students enrolled in art classes while basic school levels tend to have less students enrolled. This, I think is an excellent topic to discuss in regards of early child development and communication trough art expression.
Grade:
If we look at the chart and results, we noticed that over all the NYC department of education in the 2013-2014 school year, presented very similar populations of all grade levels taking art classes, once again on this comparison comes to show how early aged students tend to have less populations taking art classes.
Final conclucion:
Since there’s more data available, it will be interesting to perform school year to year comparisons in regards of the enrolled art students; this, to find out if some programs have been put in place in order to increase or decrease student participation in art classes. Also, it will be interesting to see how the populations transform or behave from year to year.