R Markdown
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
NYC_Zoning_Tax_Lot_Database <- read_csv("NYC_Zoning_Tax_Lot_Database.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## `Borough Code` = col_double(),
## `Tax Block` = col_double(),
## `Tax Lot` = col_double(),
## BBL = col_double(),
## `Zoning District 1` = col_character(),
## `Zoning District 2` = col_character(),
## `Zoning District 3` = col_logical(),
## `Zoning District 4` = col_logical(),
## `Commercial Overlay 1` = col_character(),
## `Commercial Overlay 2` = col_logical(),
## `Special District 1` = col_character(),
## `Special District 2` = col_logical(),
## `Special District 3` = col_logical(),
## `Limited Height District` = col_character(),
## `Zoning Map Number` = col_character(),
## `Zoning Map Code` = col_logical(),
## Notes = col_logical()
## )
## Warning: 16142 parsing failures.
## row col expected actual file
## 2855 Zoning District 3 1/0/T/F/TRUE/FALSE C1-9A 'NYC_Zoning_Tax_Lot_Database.csv'
## 3201 Commercial Overlay 2 1/0/T/F/TRUE/FALSE C2-5 'NYC_Zoning_Tax_Lot_Database.csv'
## 3475 Commercial Overlay 2 1/0/T/F/TRUE/FALSE C1-2 'NYC_Zoning_Tax_Lot_Database.csv'
## 4586 Commercial Overlay 2 1/0/T/F/TRUE/FALSE C1-2 'NYC_Zoning_Tax_Lot_Database.csv'
## 5157 Zoning District 3 1/0/T/F/TRUE/FALSE R8B 'NYC_Zoning_Tax_Lot_Database.csv'
## .... .................... .................. ...... .................................
## See problems(...) for more details.
NYC_Zoning_Section = NYC_Zoning_Tax_Lot_Database[1:1000,]
Tidy_NYC_Zoning_Section=NYC_Zoning_Section%>%
pivot_longer(cols = 5:14,names_to = "District",values_to = "Code",values_drop_na = TRUE)
Final_NYC_Zoning=Tidy_NYC_Zoning_Section[,-c(6,7)]
## The data comes from the website data.gov and is representative of NYC zoning tax codes to different lots.
## The data set contained over 858459 observations and 17 variables, which is a large amount that does not even run on rstudio cloud. I decided to subset the data to 1000 observations, a more manageable figure.
NYC_Zoning_Section
## # A tibble: 1,000 x 17
## `Borough Code` `Tax Block` `Tax Lot` BBL `Zoning Distric… `Zoning Distric…
## <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1926 54 1.02e9 R7A <NA>
## 2 1 1322 143 1.01e9 R8B C6-6
## 3 1 1317 9 1.01e9 C5-2 <NA>
## 4 2 2686 38 2.03e9 R7-1 <NA>
## 5 2 2691 183 2.03e9 R7-1 <NA>
## 6 2 2692 36 2.03e9 R7-1 <NA>
## 7 1 1917 56 1.02e9 R7-2 <NA>
## 8 1 520 40 1.01e9 R7-2 <NA>
## 9 1 528 80 1.01e9 R6 <NA>
## 10 2 3820 68 2.04e9 R5 <NA>
## # … with 990 more rows, and 11 more variables: Zoning District 3 <lgl>,
## # Zoning District 4 <lgl>, Commercial Overlay 1 <chr>,
## # Commercial Overlay 2 <lgl>, Special District 1 <chr>,
## # Special District 2 <lgl>, Special District 3 <lgl>,
## # Limited Height District <chr>, Zoning Map Number <chr>,
## # Zoning Map Code <lgl>, Notes <lgl>
## The subset data contains redundant zoning districts and commercial overlays that include purely unknown figures. I used pivot_longer to combine these variables into a single one named District and attach their values to a "code," meaning tax code. Once transferring all the variables into a single column (District), I removed the present unkowns by using values_drop_na. Lastly, I removed the zoning map code and notes variables as they had all unknown values.
Final_NYC_Zoning
## # A tibble: 1,230 x 7
## `Borough Code` `Tax Block` `Tax Lot` BBL `Zoning Map Num… District Code
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 1926 54 1.02e9 6A Zoning D… R7A
## 2 1 1322 143 1.01e9 8D Zoning D… R8B
## 3 1 1322 143 1.01e9 8D Zoning D… C6-6
## 4 1 1322 143 1.01e9 8D Special … MiD
## 5 1 1317 9 1.01e9 8D Zoning D… C5-2
## 6 2 2686 38 2.03e9 6C Zoning D… R7-1
## 7 2 2691 183 2.03e9 6C Zoning D… R7-1
## 8 2 2692 36 2.03e9 3D Zoning D… R7-1
## 9 1 1917 56 1.02e9 6A Zoning D… R7-2
## 10 1 520 40 1.01e9 12A Zoning D… R7-2
## # … with 1,220 more rows
## After all this work the data frame ended up with a more refined variable list and less redudancy as all the unkown values were removed.
library(tidyverse)
Math_Test_Results <- read_csv("Math_Test_Results.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Grade = col_character(),
## Year = col_double(),
## Category = col_character(),
## `Number Tested` = col_double(),
## `Mean Scale Score` = col_double(),
## `Level 1 #` = col_double(),
## `Level 1 %` = col_double(),
## `Level 2 #` = col_double(),
## `Level 2 %` = col_double(),
## `Level 3 #` = col_double(),
## `Level 3 %` = col_double(),
## `Level 4 #` = col_double(),
## `Level 4 %` = col_double(),
## `Level 3+4 #` = col_double(),
## `Level 3+4 %` = col_double()
## )
Math_Test_Results
## # A tibble: 168 x 15
## Grade Year Category `Number Tested` `Mean Scale Sco… `Level 1 #` `Level 1 %`
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 3 2006 Asian 9768 700 243 2.5
## 2 4 2006 Asian 9973 699 294 2.9
## 3 5 2006 Asian 9852 691 369 3.7
## 4 6 2006 Asian 9606 682 452 4.7
## 5 7 2006 Asian 9433 671 521 5.5
## 6 8 2006 Asian 9593 675 671 7
## 7 All … 2006 Asian 58225 687 2550 4.4
## 8 3 2007 Asian 9750 706 156 1.6
## 9 4 2007 Asian 9881 704 209 2.1
## 10 5 2007 Asian 10111 700 211 2.1
## # … with 158 more rows, and 8 more variables: Level 2 # <dbl>, Level 2 % <dbl>,
## # Level 3 # <dbl>, Level 3 % <dbl>, Level 4 # <dbl>, Level 4 % <dbl>,
## # Level 3+4 # <dbl>, Level 3+4 % <dbl>
## The data was gathered from the data.gov website and it represents the math test scores of all students in New York from 2006-2012. The data is split by race and different grade score levels depending on the mean scale score.
## The data has a number and percentage representation of all the different levels.
New_Math_Test_Results= Math_Test_Results[,-c(6,8,10,12,14)]
New_Math_Test_Results%>%
pivot_longer(cols=6:10,names_to="Level",values_to="percent")
## # A tibble: 840 x 7
## Grade Year Category `Number Tested` `Mean Scale Score` Level percent
## <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 3 2006 Asian 9768 700 Level 1 % 2.5
## 2 3 2006 Asian 9768 700 Level 2 % 5.6
## 3 3 2006 Asian 9768 700 Level 3 % 42.3
## 4 3 2006 Asian 9768 700 Level 4 % 49.7
## 5 3 2006 Asian 9768 700 Level 3+4 % 92
## 6 4 2006 Asian 9973 699 Level 1 % 2.9
## 7 4 2006 Asian 9973 699 Level 2 % 6
## 8 4 2006 Asian 9973 699 Level 3 % 42.6
## 9 4 2006 Asian 9973 699 Level 4 % 48.5
## 10 4 2006 Asian 9973 699 Level 3+4 % 91
## # … with 830 more rows
## I decided to only focus on the level's percentages and removed the variables that included the levels with a number figure. Then, I used pivot_longer to group the different levels under one column named "level" and attached the values as percentages.