DATA 607 TidyVerse CREATE Assignment

NYC Income Vs Population

The original data set was downloaded from https://www.kaggle.com/datasets/muonneutrino/new-york-city-census-data. It contains census-tract-level information such as total population, racial/ethnic demographics, employment, and community characteristics. The data frame has 2167 observations and 36 variables. Approximately 1.6% of all values are missing.

#Loading required packages and dataset
library(tidyverse) #general data analysis environment functions

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(mice) #multiple imputation

## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(dplyr)
library(ggplot2)

nyc <- read_csv("https://raw.githubusercontent.com/LwinShwe/DATA607TidyverseCREATE/main/nyc_census_tracts.csv")

## Rows: 2167 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): County, Borough
## dbl (34): CensusTract, TotalPop, Men, Women, Hispanic, White, Black, Native,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Get the dimensions of the Data Frame
dim(nyc)

## [1] 2167   36

# Calculate the percentage of missing data
missed_data <- sum(is.na(nyc))
print(paste(round(100 * missed_data / (nrow(nyc) * ncol(nyc)), 5), '% of the total.'))

## [1] "1.62667 % of the total."

# Display the column names, then the first few rows of the data frame
print(names(nyc))

##  [1] "CensusTract"     "County"          "Borough"         "TotalPop"       
##  [5] "Men"             "Women"           "Hispanic"        "White"          
##  [9] "Black"           "Native"          "Asian"           "Citizen"        
## [13] "Income"          "IncomeErr"       "IncomePerCap"    "IncomePerCapErr"
## [17] "Poverty"         "ChildPoverty"    "Professional"    "Service"        
## [21] "Office"          "Construction"    "Production"      "Drive"          
## [25] "Carpool"         "Transit"         "Walk"            "OtherTransp"    
## [29] "WorkAtHome"      "MeanCommute"     "Employed"        "PrivateWork"    
## [33] "PublicWork"      "SelfEmployed"    "FamilyWork"      "Unemployment"

head(nyc)

## # A tibble: 6 × 36
##   CensusTract County Borough TotalPop   Men Women Hispanic White Black Native
##         <dbl> <chr>  <chr>      <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
## 1 36005000100 Bronx  Bronx       7703  7133   570     29.9   6.1  60.9    0.2
## 2 36005000200 Bronx  Bronx       5403  2659  2744     75.8   2.3  16      0  
## 3 36005000400 Bronx  Bronx       5915  2896  3019     62.7   3.6  30.7    0  
## 4 36005001600 Bronx  Bronx       5879  2558  3321     65.1   1.6  32.4    0  
## 5 36005001900 Bronx  Bronx       2591  1206  1385     55.4   9    29      0  
## 6 36005002000 Bronx  Bronx       8516  3301  5215     61.1   1.6  31.1    0.3
## # ℹ 26 more variables: Asian <dbl>, Citizen <dbl>, Income <dbl>,
## #   IncomeErr <dbl>, IncomePerCap <dbl>, IncomePerCapErr <dbl>, Poverty <dbl>,
## #   ChildPoverty <dbl>, Professional <dbl>, Service <dbl>, Office <dbl>,
## #   Construction <dbl>, Production <dbl>, Drive <dbl>, Carpool <dbl>,
## #   Transit <dbl>, Walk <dbl>, OtherTransp <dbl>, WorkAtHome <dbl>,
## #   MeanCommute <dbl>, Employed <dbl>, PrivateWork <dbl>, PublicWork <dbl>,
## #   SelfEmployed <dbl>, FamilyWork <dbl>, Unemployment <dbl>

Handle Missing and Incomplete Data

A missing value signals an absence of information in a dataset, the equivalent of a blank cell in an Excel spreadsheet. In R, missing values typically appear as NA in a variable, a vector, or a data frame. However, missing values are not always encoded as blank cells: sometimes the creators of a dataset use a numeric value or a string of characters to indicate missing data.
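As a minimal sketch of the sentinel-value case, the toy data below uses -999 and "unknown" as stand-ins for missing data. These codes and the toy columns are assumptions for illustration only; they do not come from the NYC census file. dplyr's na_if() converts each sentinel to NA so the usual is.na() checks work.

```r
library(dplyr)

# Hypothetical toy data: -999 and "unknown" stand in for missing values
toy <- tibble(
  tract  = c("A", "B", "C", "D"),
  income = c(52000, -999, 71000, -999),
  county = c("Bronx", "unknown", "Kings", "Queens")
)

# na_if() replaces every occurrence of the given sentinel with NA
toy_clean <- toy %>%
  mutate(
    income = na_if(income, -999),
    county = na_if(county, "unknown")
  )

# Count the NA values per column after recoding
na_per_column <- colSums(is.na(toy_clean))
na_per_column
```

When the sentinel codes are known before loading, passing them to the na argument of read_csv() is an alternative that recodes them at import time.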

# Count the NA values in each column of the data frame.
# With trim = TRUE, columns that have no NA values are dropped from the result.
na_col_sums <- function(df, trim = TRUE) {
  na_counts <- colSums(is.na(df))
  nacols.df <- data.frame(NAs = na_counts)
  
  nacols.df$Percent_of_Column <- round(100 * nacols.df$NAs / nrow(df), 2)
  nacols.df$Percent_of_Incomplete_Rows <- round(100 * nacols.df$NAs / sum(!complete.cases(df)), 2)
  
  if (trim) {
    nacols.df <- nacols.df[nacols.df$NAs != 0, ]
  }
  
  return(nacols.df)
}
result <- na_col_sums(nyc)
result

##                 NAs Percent_of_Column Percent_of_Incomplete_Rows
## Hispanic         39              1.80                      54.17
## White            39              1.80                      54.17
## Black            39              1.80                      54.17
## Native           39              1.80                      54.17
## Asian            39              1.80                      54.17
## Income           66              3.05                      91.67
## IncomeErr        66              3.05                      91.67
## IncomePerCap     46              2.12                      63.89
## IncomePerCapErr  46              2.12                      63.89
## Poverty          42              1.94                      58.33
## ChildPoverty     60              2.77                      83.33
## Professional     43              1.98                      59.72
## Service          43              1.98                      59.72
## Office           43              1.98                      59.72
## Construction     43              1.98                      59.72
## Production       43              1.98                      59.72
## Drive            43              1.98                      59.72
## Carpool          43              1.98                      59.72
## Transit          43              1.98                      59.72
## Walk             43              1.98                      59.72
## OtherTransp      43              1.98                      59.72
## WorkAtHome       43              1.98                      59.72
## MeanCommute      61              2.81                      84.72
## PrivateWork      43              1.98                      59.72
## PublicWork       43              1.98                      59.72
## SelfEmployed     43              1.98                      59.72
## FamilyWork       43              1.98                      59.72
## Unemployment     42              1.94                      58.33
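The per-column NA count can also be written with tidyverse verbs. The sketch below is one possible equivalent on a hypothetical toy data frame (the columns a, b, c are made up for illustration); it reproduces the NAs and Percent_of_Column columns and drops columns without missing values, as trim = TRUE does.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values in two of three columns
toy <- tibble(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40)
)

na_summary <- toy %>%
  # one row holding the NA count of every column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # reshape to one row per column
  pivot_longer(everything(), names_to = "column", values_to = "NAs") %>%
  mutate(Percent_of_Column = round(100 * NAs / nrow(toy), 2)) %>%
  # keep only columns that have at least one NA
  filter(NAs > 0)

na_summary
```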

# Filter rows by their share of NA values.
# lim is the largest allowed fraction of NA cells per row: lim = 0.2 keeps rows
# where at most 20% of the columns are NA. The default lim = 0 keeps complete rows only.
# keep = TRUE returns the removed rows instead of the cleaned data frame.
row_nafilter <- function(df, keep = FALSE, lim = 0) {
    row_v <- apply(df, 1, function(x) sum(is.na(x)) / length(x) <= lim)
    if (keep) {
        df[!row_v, ]
    } else {
        df[row_v, ]
    }
}
nyc_na.df <- row_nafilter(nyc, keep = TRUE)
nyc <- nyc[!is.na(nyc$Income), ]
nyc <- nyc[!is.na(nyc$TotalPop), ]
nyc <- row_nafilter(nyc, lim=0.2)
print(paste('Count of NA values in nyc_data: ', sum(is.na(nyc))))

## [1] "Count of NA values in nyc_data:  7"
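To make the threshold logic concrete, here is a small sketch on a hypothetical four-column toy data frame. It uses rowMeans(is.na(...)) as a vectorized way to get each row's NA fraction; this is an alternative formulation for illustration, not the function used above.

```r
# Hypothetical toy data frame with 4 columns
toy <- data.frame(
  w = c(1, NA, NA, 4),
  x = c(1, 2, NA, 4),
  y = c(1, 2, NA, 4),
  z = c(1, 2, 3, NA)
)

# Fraction of NA cells in each row: 0, 0.25, 0.75, 0.25
na_fraction <- rowMeans(is.na(toy))

kept_strict  <- toy[na_fraction <= 0, ]     # complete rows only (lim = 0)
kept_lenient <- toy[na_fraction <= 0.25, ]  # rows with at most 25% NA
removed      <- toy[na_fraction > 0.25, ]   # rows the lenient filter drops
```

With the lenient threshold, rows can still contain a few NA values, which is why a small number of NAs can remain after filtering with lim = 0.2.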

Population Distribution of NYC

Tract population has a right-skewed, Poisson-like distribution with a long tail of outliers, which can complicate the interpretation of population plots. One census tract, located in the northeastern Bronx, has an exceptionally high population of nearly 30,000, roughly eight times the median of 3,622.

Pop_hist.plot <- ggplot(nyc, aes(x = TotalPop)) +
  # linewidth replaces the size argument for lines, which is deprecated since ggplot2 3.4.0
  geom_histogram(colour = 'black', fill = 'blue4', bins = 50, linewidth = 0.2) +
  theme_bw() +
  # red vertical line marks the median tract population
  geom_vline(xintercept = median(nyc$TotalPop), colour = 'red') +
  labs(title = 'New York City Population Distribution',
       y = 'Number of Tracts',
       x = 'Tract Population',
       caption = 'Source: ACS 5-Year Estimates, 2015') +
  theme(plot.caption = element_text(size = 8)) +
  scale_x_continuous(breaks = c(0, 3622, 10000, 20000, 30000),
                     labels = c('0', '3622\nNYC Median', '10000', '20000', '30000'))

Pop_hist.plot

EDA (Income and Total Population in NYC)

# Income and Total Population have right-skewed distributions
library(moments)

# Descriptive statistics
income_mean <- mean(nyc$Income)
income_median <- median(nyc$Income)
income_sd <- sd(nyc$Income)
income_skew <- moments::skewness(nyc$Income)

# Histograms
par(mfrow=c(1,2))
hist(nyc$Income, main="Income Histogram")
hist(nyc$TotalPop, main="Population Histogram")

# Scatterplot
plot(nyc$TotalPop, nyc$Income, main="Income vs. Total Population Plot")

# Correlation analysis
correlation <- cor(nyc$Income, nyc$TotalPop)
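The statistics above are computed but not displayed. The sketch below shows one way to collect and inspect them, using hypothetical toy vectors in place of nyc$Income and nyc$TotalPop. The skew() helper is an illustrative base-R implementation of the moment-based skewness (third central moment over the second central moment to the power 3/2), assumed here to match the definition used by moments::skewness. Because both variables are right-skewed, a rank-based Spearman correlation is shown next to Pearson for comparison.

```r
# Hypothetical right-skewed toy values standing in for tract income and population
income <- c(30, 35, 40, 45, 50, 60, 200)
pop    <- c(2, 3, 3, 4, 5, 6, 30)

# Moment-based skewness; positive values indicate a right skew
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Collect the descriptive statistics in one data frame for display
income_stats <- data.frame(
  mean     = mean(income),
  median   = median(income),
  sd       = sd(income),
  skewness = skew(income)
)
income_stats

# Pearson is sensitive to outliers; Spearman uses ranks instead
pearson  <- cor(income, pop)
spearman <- cor(income, pop, method = "spearman")
c(pearson = pearson, spearman = spearman)
```

For a right-skewed variable the mean sits above the median, which is a quick check alongside the skewness value.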

Find out Landowner per Income & Professional

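# Add a flag for tracts that likely have many landowners: Income of at least
# 100,000 and Professional above 35. This is a heuristic proxy; the data set
# does not record land ownership, and each row is a census tract, not a household.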
nyc <- nyc %>%
    mutate(landowner = (Income >= 100000 & Professional > 35)) 
nyc

## # A tibble: 2,101 × 37
##    CensusTract County Borough TotalPop   Men Women Hispanic White Black Native
##          <dbl> <chr>  <chr>      <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>  <dbl>
##  1 36005000200 Bronx  Bronx       5403  2659  2744     75.8   2.3  16      0  
##  2 36005000400 Bronx  Bronx       5915  2896  3019     62.7   3.6  30.7    0  
##  3 36005001600 Bronx  Bronx       5879  2558  3321     65.1   1.6  32.4    0  
##  4 36005001900 Bronx  Bronx       2591  1206  1385     55.4   9    29      0  
##  5 36005002000 Bronx  Bronx       8516  3301  5215     61.1   1.6  31.1    0.3
##  6 36005002300 Bronx  Bronx       4774  2130  2644     62.3   0.2  36.5    1  
##  7 36005002500 Bronx  Bronx       5355  2338  3017     76.5   1.5  18.9    0  
##  8 36005002701 Bronx  Bronx       3016  1375  1641     68     0    31.2    0  
##  9 36005002702 Bronx  Bronx       4778  2427  2351     71.3   1.6  26.2    0  
## 10 36005002800 Bronx  Bronx       5299  2292  3007     23     0.2  71.4    0  
## # ℹ 2,091 more rows
## # ℹ 27 more variables: Asian <dbl>, Citizen <dbl>, Income <dbl>,
## #   IncomeErr <dbl>, IncomePerCap <dbl>, IncomePerCapErr <dbl>, Poverty <dbl>,
## #   ChildPoverty <dbl>, Professional <dbl>, Service <dbl>, Office <dbl>,
## #   Construction <dbl>, Production <dbl>, Drive <dbl>, Carpool <dbl>,
## #   Transit <dbl>, Walk <dbl>, OtherTransp <dbl>, WorkAtHome <dbl>,
## #   MeanCommute <dbl>, Employed <dbl>, PrivateWork <dbl>, PublicWork <dbl>, …

# summarize just our new variable
summary(nyc$landowner)

##    Mode   FALSE    TRUE 
## logical    1925     176
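A natural follow-up is to break the flag down by borough. The sketch below applies the same rule to a hypothetical five-row toy table (the values are invented for illustration) and summarises the count and share of flagged tracts per borough.

```r
library(dplyr)

# Hypothetical toy tracts; values are made up for illustration
toy <- tibble(
  Borough      = c("Bronx", "Bronx", "Manhattan", "Manhattan", "Queens"),
  Income       = c(45000, 120000, 150000, 98000, 101000),
  Professional = c(20, 30, 60, 55, 40)
)

# Same rule as above: both conditions must hold for a tract to be flagged
toy_flagged <- toy %>%
  mutate(landowner = Income >= 100000 & Professional > 35)

by_borough <- toy_flagged %>%
  group_by(Borough) %>%
  summarise(
    tracts    = n(),
    n_flagged = sum(landowner),   # TRUE counts as 1
    share     = mean(landowner),  # fraction of flagged tracts
    .groups   = "drop"
  )

by_borough
```

Note that the second Bronx tract is not flagged despite its high income, because its Professional value is not above 35.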

boxplot(Income~County,data = nyc,
        main="Boxplot of Income in Different Counties",
        xlab="County",ylab="Income")

boxplot(TotalPop~County,data = nyc,
        main="Boxplot of Total Population in Different Counties",
        xlab="County",ylab="Population")

# Scatterplots of the outcome variable IncomePerCap against each selected variable
nyc %>%
  # pivot_longer() supersedes gather() for reshaping wide data to long format
  pivot_longer(cols = c(Men, Women, Hispanic, Black, White, Native, Asian, Citizen, Office,
                        Construction, Production, Employed, PublicWork, SelfEmployed,
                        FamilyWork, Unemployment),
               names_to = "var", values_to = "value") %>%
  ggplot(aes(x = value, y = IncomePerCap)) +
  geom_point(colour = "blue", alpha = 0.2, size = 0.1) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
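The faceted plot depends on first reshaping the data from wide to long, so that every selected column becomes a (var, value) pair next to the outcome. The sketch below shows that reshape on a hypothetical three-row toy data frame using tidyr's pivot_longer(), which performs the same kind of reshape as gather().

```r
library(tidyr)

# Hypothetical toy data: one outcome column and two predictor columns
toy <- data.frame(
  IncomePerCap = c(20000, 35000, 50000),
  Men          = c(1000, 1200, 900),
  Women        = c(1100, 1300, 950)
)

# Each predictor column becomes rows of (var, value); IncomePerCap is repeated
long <- pivot_longer(
  toy,
  cols      = c(Men, Women),
  names_to  = "var",
  values_to = "value"
)

long
```

Three rows and two predictor columns give six long rows, one per tract and variable, which is the shape facet_wrap(~ var) expects.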

Conclusions

The graphs above suggest that the estimated average household income in New York was about 75,000 U.S. dollars. The population of the city was over 8 million in 2020, a record high and an increase of 629,057 people since the 2010 Census. The city is characterized by a constant ebb and flow of people that results in a unique level of population “churn” and diversity.