The original data set was downloaded from https://www.kaggle.com/datasets/muonneutrino/new-york-city-census-data. It contains variables such as total population, racial/ethnic demographic information, employment, and community characteristics. The data frame has 2167 observations and 36 variables. Missing values account for approximately 1.6% of all cells in the data.
#Loading required packages and dataset
library(tidyverse) #general data analysis environment functions
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mice) #multiple imputation
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(dplyr)
library(ggplot2)
nyc <- read_csv("https://raw.githubusercontent.com/LwinShwe/DATA607TidyverseCREATE/main/nyc_census_tracts.csv")
## Rows: 2167 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): County, Borough
## dbl (34): CensusTract, TotalPop, Men, Women, Hispanic, White, Black, Native,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Get the dimensions of the Data Frame
dim(nyc)
## [1] 2167 36
# Calculate the percentage of missing data
missed_data <- sum(is.na(nyc))
print(paste(round(100 * missed_data / (nrow(nyc) * ncol(nyc)), 5), '% of the total.'))
## [1] "1.62667 % of the total."
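The percentage above is the number of NA cells divided by the total number of cells (rows times columns). A minimal worked version of that arithmetic, assuming 1,269 NA cells (the sum of the per-column NA counts reported in the NA summary table of this analysis); the variable names are illustrative only:

```r
# Worked check of the missing-data percentage (illustrative variable names)
n_rows <- 2167
n_cols <- 36
n_cells <- n_rows * n_cols   # total number of cells in the data frame

# Assumption: 1269 is the sum of the per-column NA counts in the NA summary table
n_missing <- 1269

pct_missing <- round(100 * n_missing / n_cells, 5)
print(pct_missing)
```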
# Display the column names, then the first few rows of the data frame
print(names(nyc))
## [1] "CensusTract" "County" "Borough" "TotalPop"
## [5] "Men" "Women" "Hispanic" "White"
## [9] "Black" "Native" "Asian" "Citizen"
## [13] "Income" "IncomeErr" "IncomePerCap" "IncomePerCapErr"
## [17] "Poverty" "ChildPoverty" "Professional" "Service"
## [21] "Office" "Construction" "Production" "Drive"
## [25] "Carpool" "Transit" "Walk" "OtherTransp"
## [29] "WorkAtHome" "MeanCommute" "Employed" "PrivateWork"
## [33] "PublicWork" "SelfEmployed" "FamilyWork" "Unemployment"
head(nyc)
## # A tibble: 6 × 36
## CensusTract County Borough TotalPop Men Women Hispanic White Black Native
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 36005000100 Bronx Bronx 7703 7133 570 29.9 6.1 60.9 0.2
## 2 36005000200 Bronx Bronx 5403 2659 2744 75.8 2.3 16 0
## 3 36005000400 Bronx Bronx 5915 2896 3019 62.7 3.6 30.7 0
## 4 36005001600 Bronx Bronx 5879 2558 3321 65.1 1.6 32.4 0
## 5 36005001900 Bronx Bronx 2591 1206 1385 55.4 9 29 0
## 6 36005002000 Bronx Bronx 8516 3301 5215 61.1 1.6 31.1 0.3
## # ℹ 26 more variables: Asian <dbl>, Citizen <dbl>, Income <dbl>,
## # IncomeErr <dbl>, IncomePerCap <dbl>, IncomePerCapErr <dbl>, Poverty <dbl>,
## # ChildPoverty <dbl>, Professional <dbl>, Service <dbl>, Office <dbl>,
## # Construction <dbl>, Production <dbl>, Drive <dbl>, Carpool <dbl>,
## # Transit <dbl>, Walk <dbl>, OtherTransp <dbl>, WorkAtHome <dbl>,
## # MeanCommute <dbl>, Employed <dbl>, PrivateWork <dbl>, PublicWork <dbl>,
## # SelfEmployed <dbl>, FamilyWork <dbl>, Unemployment <dbl>
A missing value signals an absence of information in a dataset; it is the equivalent of a blank cell in an Excel spreadsheet. In R, missing values typically appear as NA in a variable, a vector, or a data frame. However, you may also encounter missing values that are not represented as blank cells: sometimes the creators of a dataset use a numeric value or a string of characters to indicate missing data.
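When a dataset encodes missing data with a sentinel value, it can be converted to NA before counting. The sketch below uses `dplyr::na_if()` on a made-up data frame; the sentinel codes `-999` and `"unknown"` are assumptions for illustration and do not come from the NYC census file.

```r
library(dplyr)

# Hypothetical toy data: -999 and "unknown" are assumed sentinel codes
toy <- tibble(
  income  = c(52000, -999, 61000),
  borough = c("Bronx", "unknown", "Queens")
)

# Replace each sentinel with a proper NA so is.na() can detect it
toy_clean <- toy %>%
  mutate(
    income  = na_if(income, -999),
    borough = na_if(borough, "unknown")
  )

# Count NA values per column after the conversion
colSums(is.na(toy_clean))
```

If the sentinel codes are known before loading, `readr::read_csv()` also accepts an `na` argument listing the strings to treat as missing.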
# Count the NA values in each column of the data frame.
# The 'trim' argument controls whether columns with zero NA values are dropped from the result.
na_col_sums <- function(df, trim = TRUE) {
  na_counts <- colSums(is.na(df))
  nacols.df <- data.frame(NAs = na_counts)
  # NA count as a percentage of all rows
  nacols.df$Percent_of_Column <- round(100 * nacols.df$NAs / nrow(df), 2)
  # NA count as a percentage of the rows that have at least one NA
  nacols.df$Percent_of_Incomplete_Rows <- round(100 * nacols.df$NAs / sum(!complete.cases(df)), 2)
  if (trim) {
    nacols.df <- nacols.df[nacols.df$NAs != 0, ]
  }
  return(nacols.df)
}
result <- na_col_sums(nyc)
result
## NAs Percent_of_Column Percent_of_Incomplete_Rows
## Hispanic 39 1.80 54.17
## White 39 1.80 54.17
## Black 39 1.80 54.17
## Native 39 1.80 54.17
## Asian 39 1.80 54.17
## Income 66 3.05 91.67
## IncomeErr 66 3.05 91.67
## IncomePerCap 46 2.12 63.89
## IncomePerCapErr 46 2.12 63.89
## Poverty 42 1.94 58.33
## ChildPoverty 60 2.77 83.33
## Professional 43 1.98 59.72
## Service 43 1.98 59.72
## Office 43 1.98 59.72
## Construction 43 1.98 59.72
## Production 43 1.98 59.72
## Drive 43 1.98 59.72
## Carpool 43 1.98 59.72
## Transit 43 1.98 59.72
## Walk 43 1.98 59.72
## OtherTransp 43 1.98 59.72
## WorkAtHome 43 1.98 59.72
## MeanCommute 61 2.81 84.72
## PrivateWork 43 1.98 59.72
## PublicWork 43 1.98 59.72
## SelfEmployed 43 1.98 59.72
## FamilyWork 43 1.98 59.72
## Unemployment 42 1.94 58.33
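The same per-column NA count can also be written as a tidyverse pipeline. This is only an alternative sketch on a small made-up tibble (the `toy` data and the `na_summary` name are illustrative), not a replacement for `na_col_sums()`:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a few missing values
toy <- tibble(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)

na_summary <- toy %>%
  # one row holding the NA count of every column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # reshape to one row per column
  pivot_longer(everything(), names_to = "column", values_to = "NAs") %>%
  mutate(Percent_of_Column = round(100 * NAs / nrow(toy), 2)) %>%
  # drop columns without missing values, like trim = TRUE
  dplyr::filter(NAs > 0)

na_summary
```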
# Filter rows by their share of NA values.
# lim is the largest fraction of NA columns a row may have and still be kept:
# lim = 0.2 keeps rows where at most 20% of columns are NA. The default lim = 0 keeps only complete rows.
# keep = TRUE returns the removed rows instead of the cleaned data frame.
row_nafilter <- function(df, keep = FALSE, lim = 0) {
  row_v <- apply(df, 1, function(x) sum(is.na(x)) / length(x) <= lim)
  if (keep) {
    df[!row_v, ]
  } else {
    df[row_v, ]
  }
}
nyc_na.df <- row_nafilter(nyc, keep=T)
nyc <- nyc[!is.na(nyc$Income), ]
nyc <- nyc[!is.na(nyc$TotalPop), ]
nyc <- row_nafilter(nyc, lim=0.2)
print(paste('Count of NA values in nyc_data: ', sum(is.na(nyc))))
## [1] "Count of NA values in nyc_data: 7"
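A few NA values remain after filtering. The mice package loaded earlier is one way to fill them in. The sketch below is a minimal illustration of multiple imputation on R's built-in `airquality` data, used here as a stand-in; the method (`"pmm"`), the number of imputations and the seed are assumptions for the example, not settings taken from this analysis.

```r
library(mice)

# Stand-in data: the built-in airquality data set contains NA values
data(airquality)
sum(is.na(airquality))

# Predictive mean matching with 5 imputed data sets (illustrative settings)
imp <- mice(airquality, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first completed data set; mice:: avoids a clash with tidyr::complete
air_complete <- mice::complete(imp, 1)
sum(is.na(air_complete))
```

For inference, the usual practice is to fit the model on every imputed data set and pool the results rather than rely on a single completed copy.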
Tract population follows a right-skewed, Poisson-like distribution with outliers that could complicate the interpretation of population plots for different groups. One census tract, located in the northeastern Bronx, has an exceptionally high population of nearly 30,000, roughly eight times the NYC median of 3,622.
Pop_hist.plot <- ggplot(nyc, aes(x = TotalPop)) +
  # linewidth replaces the size argument for lines, which was deprecated in ggplot2 3.4.0
  geom_histogram(colour = 'black', fill = 'blue4', bins = 50, linewidth = 0.2) +
  theme_bw() +
  # red vertical line marks the median tract population
  geom_vline(xintercept = median(nyc$TotalPop), colour = 'red') +
  labs(title = 'New York City Population Distribution',
       y = 'Number of Tracts',
       x = 'Tract Population',
       caption = 'Source: ACS 5-Year Estimates, 2015') +
  theme(plot.caption = element_text(size = 8)) +
  scale_x_continuous(breaks = c(0, 3622, 10000, 20000, 30000),
                     labels = c('0', '3622\nNYC Median', '10000', '20000', '30000'))
Pop_hist.plot
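How far an outlier sits from the bulk of the distribution can be quantified with a ratio to the median and a simple fence rule. The sketch below uses made-up tract populations (not the real data) and Tukey's 1.5 × IQR rule, which is one common convention and an assumption here:

```r
# Hypothetical tract populations with one extreme value (not the real data)
pop <- c(2500, 3100, 3600, 3700, 4200, 5100, 29000)

# How many times larger than the median is the largest tract?
ratio_to_median <- max(pop) / median(pop)

# Tukey's rule: flag values above Q3 + 1.5 * IQR
upper_fence <- unname(quantile(pop, 0.75)) + 1.5 * IQR(pop)
outliers <- pop[pop > upper_fence]

print(round(ratio_to_median, 2))
print(outliers)
```

Applied to the real data, the same two lines would use `nyc$TotalPop` in place of `pop`.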
# Income and Total Population have right-skewed distributions
library(moments)
# Descriptive statistics
income_mean <- mean(nyc$Income)
income_median <- median(nyc$Income)
income_sd <- sd(nyc$Income)
income_skew <- moments::skewness(nyc$Income)
# Histograms
par(mfrow=c(1,2))
hist(nyc$Income, main="Income Histogram")
hist(nyc$TotalPop, main="Population Histogram")
# Scatterplot
plot(nyc$TotalPop, nyc$Income, main="Income vs. Total Population Plot")
# Correlation analysis
correlation <- cor(nyc$Income, nyc$TotalPop)
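Right skew can be checked numerically as well as visually: the mean exceeds the median, and the skewness coefficient is positive. The sketch below uses simulated log-normal values as a stand-in for income (an assumption, not the census data) and a hand-written moment-based skewness so it runs without extra packages. It also shows that a log transform largely removes the skew.

```r
set.seed(42)

# Simulated right-skewed values standing in for income (not the real data)
income_sim <- rlnorm(1000, meanlog = 11, sdlog = 0.6)

# Moment-based skewness: third central moment over variance^(3/2)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skew(income_sim)        # clearly positive for right-skewed data
skew_log <- skew(log(income_sim))   # close to zero after the log transform

print(c(mean = mean(income_sim), median = median(income_sim)))
print(c(raw = skew_raw, log = skew_log))
```

For skewed variables, a rank-based measure such as `cor(x, y, method = "spearman")` is a common complement to the default Pearson correlation.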
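# Add a logical flag, stored as 'landowner', for tracts with Income of at least 100000 and Professional above 35.
# This is a rough proxy built from tract-level variables, not observed land ownership.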
nyc <- nyc %>%
mutate(landowner = (Income >= 100000 & Professional > 35))
nyc
## # A tibble: 2,101 × 37
## CensusTract County Borough TotalPop Men Women Hispanic White Black Native
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 36005000200 Bronx Bronx 5403 2659 2744 75.8 2.3 16 0
## 2 36005000400 Bronx Bronx 5915 2896 3019 62.7 3.6 30.7 0
## 3 36005001600 Bronx Bronx 5879 2558 3321 65.1 1.6 32.4 0
## 4 36005001900 Bronx Bronx 2591 1206 1385 55.4 9 29 0
## 5 36005002000 Bronx Bronx 8516 3301 5215 61.1 1.6 31.1 0.3
## 6 36005002300 Bronx Bronx 4774 2130 2644 62.3 0.2 36.5 1
## 7 36005002500 Bronx Bronx 5355 2338 3017 76.5 1.5 18.9 0
## 8 36005002701 Bronx Bronx 3016 1375 1641 68 0 31.2 0
## 9 36005002702 Bronx Bronx 4778 2427 2351 71.3 1.6 26.2 0
## 10 36005002800 Bronx Bronx 5299 2292 3007 23 0.2 71.4 0
## # ℹ 2,091 more rows
## # ℹ 27 more variables: Asian <dbl>, Citizen <dbl>, Income <dbl>,
## # IncomeErr <dbl>, IncomePerCap <dbl>, IncomePerCapErr <dbl>, Poverty <dbl>,
## # ChildPoverty <dbl>, Professional <dbl>, Service <dbl>, Office <dbl>,
## # Construction <dbl>, Production <dbl>, Drive <dbl>, Carpool <dbl>,
## # Transit <dbl>, Walk <dbl>, OtherTransp <dbl>, WorkAtHome <dbl>,
## # MeanCommute <dbl>, Employed <dbl>, PrivateWork <dbl>, PublicWork <dbl>, …
# Summarize just our new variable
summary(nyc$landowner)
## Mode FALSE TRUE
## logical 1925 176
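Because the flag is logical, its mean is the share of flagged tracts: from the summary above, 176 of 2,101 tracts, or about 8.4%. The same idea extends to groups. The sketch below computes the share per borough on a small made-up tibble (the values and the `flag` and `shares` names are illustrative only):

```r
library(dplyr)

# Hypothetical toy tracts (not the real data)
toy <- tibble(
  Borough      = c("Bronx", "Bronx", "Manhattan", "Manhattan", "Manhattan"),
  Income       = c(40000, 120000, 150000, 90000, 110000),
  Professional = c(20, 40, 60, 50, 30)
)

shares <- toy %>%
  # same rule as the flag defined in this analysis
  mutate(flag = Income >= 100000 & Professional > 35) %>%
  group_by(Borough) %>%
  # the mean of a logical vector is the proportion of TRUE values
  summarise(n_tracts = n(), share_flagged = mean(flag))

shares
```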
boxplot(Income~County,data = nyc,
main="Boxplot of Income in Different Counties",
xlab="County",ylab="Income")
boxplot(TotalPop~County,data = nyc,
main="Boxplot of Total Population in Different Counties",
xlab="County",ylab="Population")
# Scatterplots of the outcome variable IncomePerCap against each selected predictor
nyc %>%
  # pivot_longer() is the current tidyr replacement for gather()
  pivot_longer(cols = c(Men, Women, Hispanic, Black, White, Native, Asian, Citizen, Office, Construction,
                        Production, Employed, PublicWork, SelfEmployed, FamilyWork, Unemployment),
               names_to = "var", values_to = "value") %>%
  ggplot(aes(x = value, y = IncomePerCap)) +
  geom_point(colour = "blue", alpha = 0.2, size = 0.1) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()
Alongside the graphs above, citywide figures give useful context: the estimated average household income in New York was about 75,000 U.S. dollars, and the population of the city was over 8 million in 2020, a record high. This is an increase of 629,057 people since the 2010 Census. The city is characterized by a constant ebb and flow of people that results in a unique level of population “churn” and diversity.