Tidying Data

R Markdown

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

NYC_Zoning_Tax_Lot_Database <- read_csv("NYC_Zoning_Tax_Lot_Database.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `Borough Code` = col_double(),
##   `Tax Block` = col_double(),
##   `Tax Lot` = col_double(),
##   BBL = col_double(),
##   `Zoning District 1` = col_character(),
##   `Zoning District 2` = col_character(),
##   `Zoning District 3` = col_logical(),
##   `Zoning District 4` = col_logical(),
##   `Commercial Overlay 1` = col_character(),
##   `Commercial Overlay 2` = col_logical(),
##   `Special District 1` = col_character(),
##   `Special District 2` = col_logical(),
##   `Special District 3` = col_logical(),
##   `Limited Height District` = col_character(),
##   `Zoning Map Number` = col_character(),
##   `Zoning Map Code` = col_logical(),
##   Notes = col_logical()
## )

## Warning: 16142 parsing failures.
##  row                  col           expected actual                              file
## 2855 Zoning District 3    1/0/T/F/TRUE/FALSE  C1-9A 'NYC_Zoning_Tax_Lot_Database.csv'
## 3201 Commercial Overlay 2 1/0/T/F/TRUE/FALSE  C2-5  'NYC_Zoning_Tax_Lot_Database.csv'
## 3475 Commercial Overlay 2 1/0/T/F/TRUE/FALSE  C1-2  'NYC_Zoning_Tax_Lot_Database.csv'
## 4586 Commercial Overlay 2 1/0/T/F/TRUE/FALSE  C1-2  'NYC_Zoning_Tax_Lot_Database.csv'
## 5157 Zoning District 3    1/0/T/F/TRUE/FALSE  R8B   'NYC_Zoning_Tax_Lot_Database.csv'
## .... .................... .................. ...... .................................
## See problems(...) for more details.

NYC_Zoning_Section = NYC_Zoning_Tax_Lot_Database[1:1000,]

Tidy_NYC_Zoning_Section=NYC_Zoning_Section%>%
 pivot_longer(cols = 5:14,names_to = "District",values_to = "Code",values_drop_na = TRUE)
Final_NYC_Zoning=Tidy_NYC_Zoning_Section[,-c(6,7)]

## The data comes from the website data.gov and is representative of NYC zoning tax codes to different lots. 
## The data set contained over 858459 observations and 17 variables, which is a large amount that does not even run on rstudio cloud. I decided to subset the data to 1000 observations, a more manageable figure.

NYC_Zoning_Section

## # A tibble: 1,000 x 17
##    `Borough Code` `Tax Block` `Tax Lot`    BBL `Zoning Distric… `Zoning Distric…
##             <dbl>       <dbl>     <dbl>  <dbl> <chr>            <chr>           
##  1              1        1926        54 1.02e9 R7A              <NA>            
##  2              1        1322       143 1.01e9 R8B              C6-6            
##  3              1        1317         9 1.01e9 C5-2             <NA>            
##  4              2        2686        38 2.03e9 R7-1             <NA>            
##  5              2        2691       183 2.03e9 R7-1             <NA>            
##  6              2        2692        36 2.03e9 R7-1             <NA>            
##  7              1        1917        56 1.02e9 R7-2             <NA>            
##  8              1         520        40 1.01e9 R7-2             <NA>            
##  9              1         528        80 1.01e9 R6               <NA>            
## 10              2        3820        68 2.04e9 R5               <NA>            
## # … with 990 more rows, and 11 more variables: Zoning District 3 <lgl>,
## #   Zoning District 4 <lgl>, Commercial Overlay 1 <chr>,
## #   Commercial Overlay 2 <lgl>, Special District 1 <chr>,
## #   Special District 2 <lgl>, Special District 3 <lgl>,
## #   Limited Height District <chr>, Zoning Map Number <chr>,
## #   Zoning Map Code <lgl>, Notes <lgl>

## The subset data contains redundant zoning districts and commercial overlays that include purely unknown figures. I used pivot_longer to combine these variables into a single one named District and attach their values to a "code," meaning tax code. Once transferring all the variables into a single column (District), I removed the present unkowns by using values_drop_na. Lastly, I removed the zoning map code and notes variables as they had all unknown values.

Final_NYC_Zoning

## # A tibble: 1,230 x 7
##    `Borough Code` `Tax Block` `Tax Lot`     BBL `Zoning Map Num… District  Code 
##             <dbl>       <dbl>     <dbl>   <dbl> <chr>            <chr>     <chr>
##  1              1        1926        54  1.02e9 6A               Zoning D… R7A  
##  2              1        1322       143  1.01e9 8D               Zoning D… R8B  
##  3              1        1322       143  1.01e9 8D               Zoning D… C6-6 
##  4              1        1322       143  1.01e9 8D               Special … MiD  
##  5              1        1317         9  1.01e9 8D               Zoning D… C5-2 
##  6              2        2686        38  2.03e9 6C               Zoning D… R7-1 
##  7              2        2691       183  2.03e9 6C               Zoning D… R7-1 
##  8              2        2692        36  2.03e9 3D               Zoning D… R7-1 
##  9              1        1917        56  1.02e9 6A               Zoning D… R7-2 
## 10              1         520        40  1.01e9 12A              Zoning D… R7-2 
## # … with 1,220 more rows

## After all this work the data frame ended up with a more refined variable list and less redudancy as all the unkown values were removed.

library(tidyverse)
Math_Test_Results <- read_csv("Math_Test_Results.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Grade = col_character(),
##   Year = col_double(),
##   Category = col_character(),
##   `Number Tested` = col_double(),
##   `Mean Scale Score` = col_double(),
##   `Level 1 #` = col_double(),
##   `Level 1 %` = col_double(),
##   `Level 2 #` = col_double(),
##   `Level 2 %` = col_double(),
##   `Level 3 #` = col_double(),
##   `Level 3 %` = col_double(),
##   `Level 4 #` = col_double(),
##   `Level 4 %` = col_double(),
##   `Level 3+4 #` = col_double(),
##   `Level 3+4 %` = col_double()
## )

Math_Test_Results

## # A tibble: 168 x 15
##    Grade  Year Category `Number Tested` `Mean Scale Sco… `Level 1 #` `Level 1 %`
##    <chr> <dbl> <chr>              <dbl>            <dbl>       <dbl>       <dbl>
##  1 3      2006 Asian               9768              700         243         2.5
##  2 4      2006 Asian               9973              699         294         2.9
##  3 5      2006 Asian               9852              691         369         3.7
##  4 6      2006 Asian               9606              682         452         4.7
##  5 7      2006 Asian               9433              671         521         5.5
##  6 8      2006 Asian               9593              675         671         7  
##  7 All …  2006 Asian              58225              687        2550         4.4
##  8 3      2007 Asian               9750              706         156         1.6
##  9 4      2007 Asian               9881              704         209         2.1
## 10 5      2007 Asian              10111              700         211         2.1
## # … with 158 more rows, and 8 more variables: Level 2 # <dbl>, Level 2 % <dbl>,
## #   Level 3 # <dbl>, Level 3 % <dbl>, Level 4 # <dbl>, Level 4 % <dbl>,
## #   Level 3+4 # <dbl>, Level 3+4 % <dbl>

## The data was gathered from the data.gov website and it represents the math test scores of all students in New York from 2006-2012. The data is split by race and different grade score levels depending on the mean scale score. 
## The data has a number and percentage representation of all the different levels.

New_Math_Test_Results= Math_Test_Results[,-c(6,8,10,12,14)]
New_Math_Test_Results%>%
    pivot_longer(cols=6:10,names_to="Level",values_to="percent")

## # A tibble: 840 x 7
##    Grade  Year Category `Number Tested` `Mean Scale Score` Level       percent
##    <chr> <dbl> <chr>              <dbl>              <dbl> <chr>         <dbl>
##  1 3      2006 Asian               9768                700 Level 1 %       2.5
##  2 3      2006 Asian               9768                700 Level 2 %       5.6
##  3 3      2006 Asian               9768                700 Level 3 %      42.3
##  4 3      2006 Asian               9768                700 Level 4 %      49.7
##  5 3      2006 Asian               9768                700 Level 3+4 %    92  
##  6 4      2006 Asian               9973                699 Level 1 %       2.9
##  7 4      2006 Asian               9973                699 Level 2 %       6  
##  8 4      2006 Asian               9973                699 Level 3 %      42.6
##  9 4      2006 Asian               9973                699 Level 4 %      48.5
## 10 4      2006 Asian               9973                699 Level 3+4 %    91  
## # … with 830 more rows

## I decided to only focus on the level's percentages and removed the variables that included the levels with a number figure. Then, I used pivot_longer to group the different levels under one column named "level" and attached the values as percentages.