1 Introduction

This week’s assignment is to explore the distributional and spatial characteristics of a subset of attributes measured using the Detroit Census tract database. You will perform several tasks. First, you will graphically summarize the selected attributes by creating histograms and boxplots using the R programming software. In doing so, you will also determine several descriptive statistics (e.g., mean, median, standard deviation, CV). Second, you will explore potential statistical relationships among the selected attributes by creating scatterplots. Your overall objective is to use these quantitative measures and graphical displays to summarize the main characteristics of each selected variable. In other words, how would you describe these selected attributes and their associated spatial patterns? There is no central question; this is an exploratory analysis.

2 Overview and Steps:

You will use a new geographic data analysis program to complete this analysis: R, a free, open-source software environment. You will use RStudio, an integrated development environment for R, to execute code in R. R contains a variety of exploratory spatial data analysis (ESDA) tools for use with point and areal data. You will use R to:

describe the statistical distribution of each variable using graphics and summary statistics.
identify any outlying observations. Outliers might be defined as “surprisingly high maximums or surprisingly low minimums”.
describe the key features of the spatial pattern of each variable, and
identify potential statistical relationships among the selected variables.

You will focus your ESDA on five variables: WPOP, BPOP, HMEDINC, VHU, and PER_POV. These attributes describe the total number of persons identifying as white, the total number of persons identifying as black, median household income in past 12 months, the number of vacant houses, and the percent of the population living below the poverty line for each Census tract; see Attribute_Descrtiptions.pdf.

3 Setting up R

3.1 Packages

To run the code in this session, we need to install and load the packages that R will need to execute its tasks. If you have these packages installed already, you can go straight to loading them. The code below will install and load the packages for you.

#install multiple packages. You do this only the first time. 
#install.packages(c('tidyverse', 'dplyr', 'tmap', 'ggplot2', 'sf', 'EnvStats', 'cowplot'))

#Load the libraries. You do this during every R session. 
library(tidyverse) #for processing dataframes (tables, like CSV files)
library(dplyr) #for processing dataframes (tables, like CSV files)
library(tmap) #for plotting shapefiles
library(ggplot2) #for plotting graphics in R
library(sf) #for processing shapefiles
library(EnvStats) #to display some stats on histograms and boxplots
library(cowplot) #To combine multiple graphs into one

3.2 Important Functions and Parameters

Below I set important parameters that we will use in the rest of the session.

#Some important parameters
name1 <- 'Jack Bienvenue'

#Here is an important function to help us get descriptive stats for each histogram bin.
#The function will require that you provide the ggplot histogram object and the name of the variable/column you want to summarize.
histo_stats <- function(hist_graph, column_name){
  xmins <- ggplot_build(hist_graph)$data[[1]]$xmin
  xmaxs <- ggplot_build(hist_graph)$data[[1]]$xmax
  bin_size1 <- length(xmaxs)
  
  detroit12 <- detroit1 %>% 
    dplyr::select(all_of(column_name)) %>% 
    rename('column1' = 1)
  
  stats1 <- detroit12 %>% 
    group_by(gr=cut(column1, breaks= c(xmins, xmaxs[bin_size1]))) %>%
    summarize(count = n(),
              mean = mean(column1, na.rm = TRUE),
              median = median(column1, na.rm = TRUE),
              sd = sd(column1, na.rm = TRUE),
              cv= sd / mean * 100) %>% 
    as.data.frame() %>%
    dplyr::mutate(xmin = xmins,
                  xmax = xmaxs) %>% 
    dplyr::select(c(xmin, xmax, count, mean, median, sd, cv))
  return(stats1)
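}

#Below is a function to identify outliers from a variable of interest.
#It flags values more than 1.5 x IQR below the first quartile or above the third quartile.
outlier_finder <- function(x) {
  return(x < quantile(x, .25, na.rm = TRUE) - 1.5*IQR(x, na.rm = TRUE) | x > quantile(x, .75, na.rm = TRUE) + 1.5*IQR(x, na.rm = TRUE))
}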
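To make the 1.5 x IQR rule concrete, here is a minimal sketch in base R on a toy vector. The values are hypothetical and are not taken from the Detroit data; the variable names (lower_fence, upper_fence, is_outlier) are only for illustration.

```r
# Toy vector (hypothetical values, not from the Detroit data)
x <- c(12, 15, 14, 13, 16, 15, 14, 90)

q1  <- quantile(x, 0.25)   # first quartile
q3  <- quantile(x, 0.75)   # third quartile
iqr <- IQR(x)              # interquartile range, q3 - q1

# Fences: anything outside them is flagged as a potential outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

is_outlier <- x < lower_fence | x > upper_fence
x[is_outlier]
```

With these toy values only 90 falls outside the fences, so it is the only value flagged.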

4 Read in Your Files

Read your Detroit2015_CTracts shapefile and explore it.

detroit1 <- st_read('/Users/jackbienvenuejr/Desktop/CLASSES_S24/GEOG3500/Lab2/Data/Detroit2015_CTracts.shp') ##CHANGED THIS

#print it to view some details
detroit1

You can map a few of the variables to explore patterns. For example, let’s explore the distribution of POP15, the total population per census tract in 2015.

tmap_mode('view')
Column <- 'POP15'
title1 <- paste0('Total Pop. : ', Column)
tm_shape(detroit1) +
  tm_fill(col=Column, palette = "RdBu", title=title1) +
  tm_borders(alpha = 0.5)

Let’s now explore the distribution of WPOP, the total number of persons identifying as white

tmap_mode('view')
Column <- 'WPOP'
title1 <- paste0('Total White Pop. : ', Column)
tm_shape(detroit1) +
  tm_fill(col=Column, palette = "RdBu", title=title1) +
  tm_borders(alpha = 0.5)

We need to explore the spatial patterns of WPOP, BPOP, HMEDINC, VHU, and PER_POV. Here, we will do the same thing as above, but we will go an extra step to specify the classification scheme used.

Some of the available classification schemes in tmap are;

quantile - Each class contains an equal number of features that are predetermined (quintiles, etc.). Well suited for linearly distributed data. Can be misleading if observations are clustered around break points—similar features can be placed in adjacent classes.
equal - Divides attributes values into equal sized ranges (1-5, 6-10, 11-15, etc). Unlike quantile classification, the number of records that fall into each category will differ. Good if data ranges are familiar to users, e.g., temperature bands.
jenks - Called Natural breaks (or Jenks). The classification is data driven and classes are based on natural groupings inherent in the data. Breakpoints identified by simultaneously picking class breaks that best group similar values and maximize the differences between classes.
std - Shows how a feature’s attribute values differ from the mean. Great for displaying extreme highs/lows and outliers. Often makes use of a ± two-color ramp.
pretty - This is the default tmap classification scheme (used above). This scheme rounds each break-point up or down so you have ‘pretty breaks’.
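The difference between the equal and quantile schemes can be sketched in base R without tmap. This is only an illustration on hypothetical, right-skewed values; tmap computes its own breaks internally, and the names used here (values, equal_breaks, quantile_breaks) are assumptions for the example.

```r
# Hypothetical tract values, skewed to the right
values <- c(2, 3, 5, 8, 13, 21, 34, 55, 89, 400)
n_classes <- 5

# 'equal': equal-width ranges between the minimum and maximum
equal_breaks <- seq(min(values), max(values), length.out = n_classes + 1)

# 'quantile': breaks chosen so each class holds about the same number of values
quantile_breaks <- quantile(values, probs = seq(0, 1, length.out = n_classes + 1))

# Count how many values fall in each class under each scheme
equal_counts    <- table(cut(values, breaks = equal_breaks, include.lowest = TRUE))
quantile_counts <- table(cut(values, breaks = quantile_breaks, include.lowest = TRUE))

equal_counts
quantile_counts
```

On skewed data like this, the equal scheme puts most values in the first class and leaves some classes empty, while the quantile scheme spreads them evenly. This is one reason the choice of scheme changes how a map reads.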

You can read more about R classification schemes in the link here.

Below we map WPOP using the natural jenks classification scheme with 5 classes.

I ask that you try out a few options. For example, use the same number of classes for the same variable while changing the style to ‘jenks’, then to ‘equal’, then to ‘quantile’. Finally, choose one combination of parameters that you think best represents each variable, and include this as your final map for that variable.

4.1 WPOP Map

tm_wpop <- tm_shape(detroit1) +
  tm_fill("WPOP", n=5, style = "jenks", title=title1) +
  tm_borders(alpha = 0.5)

tm_wpop

4.2 BPOP Map

tmap_mode('view')
Column <- 'BPOP'
title1 <- paste0('Total Black Pop. : ', Column)

tm_bpop <- tm_shape(detroit1) +
  tm_fill("BPOP", n=5, style = "jenks", title=title1) +
  tm_borders(alpha = 0.5)

tm_bpop

4.3 HMEDINC Map

Provide your HMEDINC map here

tmap_mode('view')
Column <- 'HMEDINC'
title1 <- paste0('Median Household Income. : ', Column)

tm_hmedinc <- tm_shape(detroit1) +
  tm_fill("HMEDINC", n=5, style = "jenks", title=title1) +
  tm_borders(alpha = 0.5)

tm_hmedinc

4.4 VHU Map

Provide your VHU map here

tmap_mode('view')
Column <- 'VHU'
title1 <- paste0('Vacant Housing Units. : ', Column)

tm_vhu <- tm_shape(detroit1) +
  tm_fill("VHU", n=5, style = "jenks", title=title1) +
  tm_borders(alpha = 0.5)

tm_vhu

4.5 PER_POV Map

Provide your PER_POV map here

tmap_mode('view')
Column <- 'PER_POV'
title1 <- paste0('% Under Poverty Line : ', Column)

tm_per_pov <- tm_shape(detroit1) +
  tm_fill("PER_POV", n=5, style = "jenks", title=title1) +
  tm_borders(alpha = 0.5)

tm_per_pov

5 Histograms

A good way to assess the most appropriate mapping display is to examine the distribution of each variable, which you will do next.

You are required to report the mean, median, IQR, standard deviation, and coefficient of variation for each variable. Not all required descriptive statistics are obtained from the histogram; for example the IQR will be reported within the boxplot graphs. In addition, you need to calculate the CV for each attribute using the formula provided in class.
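As a rough sketch, the required statistics can be computed with base R functions. The vector below is hypothetical, and the CV is computed as the standard deviation expressed as a percentage of the mean, which is the formula used inside histo_stats; it is an assumption that this matches the formula given in class.

```r
# Hypothetical values standing in for one attribute column
x <- c(10, 20, 30, 40, 100)

x_mean   <- mean(x)
x_median <- median(x)
x_iqr    <- IQR(x)
x_sd     <- sd(x)
x_cv     <- x_sd / x_mean * 100  # assumed CV formula: SD as a percentage of the mean

# Collect the results in one named vector, rounded to one decimal place
round(c(mean = x_mean, median = x_median, IQR = x_iqr, SD = x_sd, CV = x_cv), 1)
```

For a real attribute you would replace x with a column such as detroit1$WPOP, adding na.rm = TRUE to each call if the column contains missing values.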

5.1 WPOP Histogram

Here, we are using 10 bins to display the WPOP variable. Please note that the default is 30 bins. You can alter the number of bins used to classify the data. Select the number of bins you feel best represents the distributional characteristics of each variable, and include this as your final histogram for that variable.

#Set important parameters
number_bins <- 10
title1 <- 'WPOP: Total White Population'
wpop_histo1 <- ggplot(detroit1, aes(x=WPOP)) + #WPOP is the column name, change it when you want to graph a different variable. 
    geom_histogram( bins = number_bins, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle(paste0(title1, ' using ', number_bins, ' bins', '\n', name1)) +
    theme_minimal() +
    theme(
      plot.title = element_text(size=15)
    )
#Display the histogram
wpop_histo1

You need to explore the characteristics of each histogram bin. Unfortunately, R does not provide descriptive statistics for each histogram bin, so we have to compute them ourselves. I have provided the histo_stats function at the beginning of this document to do exactly that. The only things you need to provide are the histogram object and the name of the column/variable that the histogram was based on. For example, below I calculate statistics for each bin created for the WPOP variable above. Here, you should observe the distribution of each bin, especially the number of data points in each bin. These statistics are for you to understand the data better; you will not need to report them in your submission.

#Get the descriptive stats for each histogram bin for your variable. 
histo_stats(wpop_histo1, 'WPOP')

Next, use R subsetting functions to select low or high values and determine the locations of these Census tracts within the city of Detroit. Think about whether high or low values are concentrated in particular parts of the city for a particular variable and whether several variables are similar with regard to where low and high values occur. Below we will subset values in the highest 3 bins and map their locations. We will use the xmin and xmax values provided in the descriptive summary above.
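The subsetting step can be sketched on toy data. The bin edges and values below are hypothetical; in practice the lower edges come from the xmin column returned by histo_stats.

```r
# Hypothetical lower bin edges, as histo_stats() would report in its xmin column
xmins  <- c(0, 100, 200, 300, 400, 500, 600, 700, 800, 900)
values <- c(50, 120, 480, 710, 705, 830, 999)

# Lower edge of the third-highest bin
xmin1 <- tail(xmins, 3)[1]

# Keep only the values that fall in the highest three bins
high_values <- values[values >= xmin1]
high_values
```

Taking the lower edge from the bin table, rather than typing it by hand, keeps the cutoff in step with the histogram if you later change the number of bins.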

#The xmin for the seventh bin is 2279.4444, we will use this as the lowest limit for WPOP. 
xmin1 <- 2279.4444
high_values <- detroit1 %>% 
  dplyr::filter(WPOP >= xmin1) #get WPOP values that are greater than or equal to the xmin value. 
Column <- 'WPOP'

#Now plot the new shapefile
title1 <- paste0('High Values for Total White Pop. : ', Column)
tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(high_values) + 
  tm_fill(col=Column, palette = "RdBu", title=title1)

#Now, remove the high_values, title1 variables. 
rm(high_values, title1, xmin1)

5.2 BPOP Histogram

number_bins <- 10
title2 <- 'BPOP: Total Black Population'
bpop_histo1 <- ggplot(detroit1, aes(x=BPOP)) + #BPOP is the column name, change it when you want to graph a different variable. 
    geom_histogram( bins = number_bins, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle(paste0(title2, ' using ', number_bins, ' bins', '\n', name1)) +
    theme_minimal() +
    theme(
      plot.title = element_text(size=15)
    )

#Histogram
bpop_histo1

#Descriptive stats
histo_stats(bpop_histo1, 'BPOP')

#Define your lowest limit for BPOP 
xmin1 <- 3367.8333
high_values <- detroit1 %>% 
  dplyr::filter(BPOP >= xmin1) #get BPOP values that are greater than or equal to the xmin value. 
Column <- 'BPOP'

#Now plot the new shapefile
title2 <- paste0('High Values for Total Black Pop. : ', Column)
tm_shape(detroit1) +
  tm_borders(alpha = 1) + 
  tm_shape(high_values) + 
  tm_fill(col=Column, palette = "RdBu", title=title2)

#Now, remove the high_values, title2, and xmin1 variables. 
rm(high_values, title2, xmin1)

5.3 HMEDINC Histogram

Using the WPOP and BPOP examples above, create a histogram for HMEDINC. The code for mapping the locations of census tracts in the highest 3 bins is provided for you, but you will need to define your xmin1 variable.

number_bins <- 7 #Note: use 7 classes for this variable. 
title3 <- 'HMEDINC: Median Household Income'
hmedinc_histo1 <- ggplot(detroit1, aes(x=HMEDINC)) + #HMEDINC is the column name, change it when you want to graph a different variable. 
    geom_histogram( bins = number_bins, fill="#69c8a2", color="#a9ecef", alpha=0.9) +
    ggtitle(paste0(title3, ' using ', number_bins, ' bins', '\n', name1)) +
    theme_minimal() +
    theme(
      plot.title = element_text(size=15)
    )

#Histogram
hmedinc_histo1

#Descriptive stats
histo_stats(hmedinc_histo1, 'HMEDINC')

#Define your lowest limit for HMEDINC: the xmin of the third-highest histogram bin
xmin1 <- tail(ggplot_build(hmedinc_histo1)$data[[1]]$xmin, 3)[1]
high_values <- detroit1 %>% 
  dplyr::filter(HMEDINC >= xmin1) #get HMEDINC values that are greater than or equal to the xmin value. 
Column <- 'HMEDINC'

#Now plot the new shapefile
title3 <- paste0('High Values for Median Household Income : ', Column)
tm_shape(detroit1) +
  tm_borders(alpha = 1) + 
  tm_shape(high_values) + 
  tm_fill(col=Column, palette = "RdBu", title=title3)

#Now, remove the high_values, title3, and xmin1 variables. 
rm(high_values, title3, xmin1)

5.4 VHU Histogram

Using the examples above, create a histogram for VHU. The code for mapping the locations of census tracts in the highest 3 bins is provided for you, but you will need to define your xmin1 variable.

number_bins <- 10
title4 <- 'VHU: Vacant Housing Units'
vhu_histo1 <- ggplot(detroit1, aes(x=VHU)) + #VHU is the column name, change it when you want to graph a different variable. 
    geom_histogram( bins = number_bins, fill="#69c8a2", color="#a9ecef", alpha=0.9) +
    ggtitle(paste0(title4, ' using ', number_bins, ' bins', '\n', name1)) +
    theme_minimal() +
    theme(
      plot.title = element_text(size=15)
    )

#Histogram
vhu_histo1

#Descriptive stats
histo_stats(vhu_histo1, 'VHU')

#Define your lowest limit for VHU: the xmin of the third-highest histogram bin
xmin1 <- tail(ggplot_build(vhu_histo1)$data[[1]]$xmin, 3)[1]
high_values <- detroit1 %>% 
  dplyr::filter(VHU >= xmin1) #get VHU values that are greater than or equal to the xmin value. 
Column <- 'VHU'

#Now plot the new shapefile
title4 <- paste0('High Values for Vacant Housing Units : ', Column)
tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(high_values) + 
  tm_fill(col=Column, palette = "RdBu", title=title4)

#Now, remove the high_values, title4, and xmin1 variables. 
rm(high_values, title4, xmin1)

5.5 PER_POV Histogram

Using the WPOP example above, create a histogram for PER_POV. The code for mapping the locations of census tracts in the highest 3 bins is provided for you, but you will need to define your xmin1 variable.

number_bins <- 10
title5 <- 'PER_POV: Percent Living Under Poverty Line'
per_pov_histo1 <- ggplot(detroit1, aes(x=PER_POV)) + #PER_POV is the column name, change it when you want to graph a different variable. 
    geom_histogram( bins = number_bins, fill="#69c8a2", color="#a9ecef", alpha=0.9) +
    ggtitle(paste0(title5, ' using ', number_bins, ' bins', '\n', name1)) +
    theme_minimal() +
    theme(
      plot.title = element_text(size=15)
    )

#Histogram
per_pov_histo1

#Descriptive stats
histo_stats(per_pov_histo1, 'PER_POV')

#Define your lowest limit for PER_POV: the xmin of the third-highest histogram bin
xmin1 <- tail(ggplot_build(per_pov_histo1)$data[[1]]$xmin, 3)[1]
high_values <- detroit1 %>% 
  dplyr::filter(PER_POV >= xmin1) #get PER_POV values that are greater than or equal to the xmin value. 
Column <- 'PER_POV'

#Now plot the new shapefile
title5 <- paste0('High Values for Percent Living Under Poverty Line : ', Column)
tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(high_values) + 
  tm_fill(col=Column, palette = "RdBu", title=title5)

#Now, remove the high_values, title5, and xmin1 variables. 
rm(high_values, title5, xmin1)

6 Boxplots

Boxplots are great tools for identifying potential outliers.

As a reminder, the box spans the interquartile range (IQR), and the lines extending from the box (“whiskers”) reach the most extreme observations that lie within 1.5 x IQR of the box edges (some conventions use 3.0 x IQR). Potential outliers are plotted as points falling beyond these whiskers.
When describing each boxplot, use terms like positively skewed, negatively skewed or symmetric; compare each boxplot to its corresponding histogram (do they both suggest similar distributional patterns?); and identify whether outlying observations exist.
Again, you must include a boxplot for each variable in your submission.
I have provided some important statistics and information within each boxplot. The blue dot represents the mean of the variable which is also printed, along with the standard deviation, median, and the IQR.
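The numbers behind a boxplot can be inspected with the base R function boxplot.stats. This is a minimal sketch on a hypothetical vector. Note that boxplot.stats uses Tukey's hinges, which can differ slightly from the quartiles returned by quantile(), so its fences may not match outlier_finder exactly on real data.

```r
# Hypothetical values with one unusually large observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x, coef = 1.5)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # observations lying beyond the whiskers
```

Here the upper whisker stops at 9, the largest value within 1.5 times the box length of the upper hinge, and 50 is reported as a potential outlier.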

6.1 WPOP Boxplot

#Set important parameters
title6 <- 'WPOP: Total White Population'
wpop_box <- ggplot(detroit1) + 
  geom_boxplot(aes(y = WPOP)) + 
  scale_x_discrete( ) +
  stat_summary(fun=mean, aes(x=0, y=WPOP), geom="point", shape=20, size=10, color="blue", fill="blue") +
  stat_median_iqr_text(aes(x=0.5, y=WPOP), color = "brown") +
  stat_mean_sd_text(aes(x=0.1, y=WPOP), color = "brown") +
  theme_bw() +
  labs(title = title6,
       y = "Total Population")

wpop_box

Examine the boxplot for WPOP above.

Looking at the boxplot, you probably get a good idea which values are outliers. We can extract the data points that are flagged as outliers using the outlier_finder function provided at the beginning of this document. Then we can map these census tracts to see where they are located. Below I am identifying outliers in the WPOP variable and mapping them.

Consider the spatial distribution of the outliers; are these census tracts clustered in specific locations in Detroit? Are they dispersed?

#Set important parameters
outliers <- detroit1 %>%
  mutate(outlier = ifelse(outlier_finder(WPOP), WPOP, NA)) %>% 
  dplyr::filter(!is.na(outlier))
  
#Now, map the outliers
column1 <- 'WPOP'
#Now plot the new shapefile
title1 <- 'Outliers for Total White Pop.'
out_wpop <- tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(outliers) + 
  tm_fill(col=column1, palette = "RdBu", title=title1)

#Now, remove the variables you don't need to keep. 
rm(outliers, column1, title1)

wpop_box

6.2 BPOP Boxplot

Here I create a boxplot for the BPOP variable. I also identify the outliers and map them.

#Set important parameters
title1 <- 'BPOP: Total Black Population'
bpop_box <- ggplot(detroit1) + 
  geom_boxplot(aes(y = BPOP)) + 
  scale_x_discrete( ) +
  stat_summary(fun=mean, aes(x=0, y=BPOP), geom="point", shape=20, size=10, color="blue", fill="blue") +
  stat_median_iqr_text(aes(x=0.6, y=BPOP), color = "brown") +
  stat_mean_sd_text(aes(x=0.1, y=BPOP), color = "brown") +
  theme_bw() +
  labs(title = title1,
       y = "Total Population")

bpop_box

#Set important parameters
outliers <- detroit1 %>%
  mutate(outlier = ifelse(outlier_finder(BPOP), BPOP, NA)) %>% 
  dplyr::filter(!is.na(outlier))
  
#Now, map the outliers
column1 <- 'BPOP'
#Now plot the new shapefile
title1 <- 'Outliers for Total Black Pop.'
out_bpop <- tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(outliers) + 
  tm_fill(col=column1, palette = "RdBu", title=title1)

#Now, remove the variables you don't need to keep.  
rm(outliers, column1, title1)

Examine the boxplot for BPOP above.

6.3 HMEDINC Boxplot

Using the WPOP and BPOP examples above, create a boxplot for the HMEDINC variable. The code for identifying and mapping the outliers is provided for you.

#Set important parameters
title1 <- 'HMEDINC: Median Household Income'
hmedinc_box <- ggplot(detroit1) + 
  geom_boxplot(aes(y = HMEDINC)) + 
  scale_x_discrete( ) +
  stat_summary(fun=mean, aes(x=0, y=HMEDINC), geom="point", shape=20, size=10, color="blue", fill="blue") +
  stat_median_iqr_text(aes(x=0.6, y=HMEDINC), color = "brown") +
  stat_mean_sd_text(aes(x=0.1, y=HMEDINC), color = "brown") +
  theme_bw() +
  labs(title = title1,
       y = "Median Household Income")

hmedinc_box

#Find outliers
outliers <- detroit1 %>%
  mutate(outlier = ifelse(outlier_finder(HMEDINC), HMEDINC, NA)) %>% 
  dplyr::filter(!is.na(outlier))
  
#Now, map the outliers
column1 <- 'HMEDINC'
#Now plot the new shapefile
title1 <- 'Outliers for Med. House Income.'
out_hmedinc <- tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(outliers) + 
  tm_fill(col=column1, palette = "RdBu", title=title1)

#Now, remove the variables you don't need to keep.  
rm(outliers, column1, title1)

Examine the boxplot for HMEDINC above.

6.4 VHU Boxplot

Using the examples above, create a boxplot for the VHU variable. The code for identifying and mapping the outliers is provided for you.

#Set important parameters
title1 <- 'VHU: Vacant Housing Units'
vhu_box <- ggplot(detroit1) + 
  geom_boxplot(aes(y = VHU)) + 
  scale_x_discrete( ) +
  stat_summary(fun=mean, aes(x=0, y=VHU), geom="point", shape=20, size=10, color="blue", fill="blue") +
  stat_median_iqr_text(aes(x=0.6, y=VHU), color = "brown") +
  stat_mean_sd_text(aes(x=0.1, y=VHU), color = "brown") +
  theme_bw() +
  labs(title = title1,
       y = "Vacant Housing Units")

vhu_box

#Set important parameters
outliers <- detroit1 %>%
  mutate(outlier = ifelse(outlier_finder(VHU), VHU, NA)) %>% 
  dplyr::filter(!is.na(outlier))
  
#Now, map the outliers
column1 <- 'VHU'
#Now plot the new shapefile
title1 <- 'Outliers for Vacant Housing Units'
out_vhu <- tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(outliers) + 
  tm_fill(col=column1, palette = "RdBu", title=title1)

#Now, remove the variables you don't need to keep.  
rm(outliers, column1, title1)

Examine the boxplot for VHU above.

6.5 PER_POV Boxplot

Using the examples above, create a boxplot for the PER_POV variable. The code for identifying and mapping the outliers is provided for you.

#Set important parameters
title1 <- 'PER_POV: % Under Poverty Line'
per_pov_box <- ggplot(detroit1) + 
  geom_boxplot(aes(y = PER_POV)) + 
  scale_x_discrete( ) +
  stat_summary(fun=mean, aes(x=0, y=PER_POV), geom="point", shape=20, size=10, color="blue", fill="blue") +
  stat_median_iqr_text(aes(x=0.6, y=PER_POV), color = "brown") +
  stat_mean_sd_text(aes(x=0.1, y=PER_POV), color = "brown") +
  theme_bw() +
  labs(title = title1,
       y = "% Under Poverty Line")

per_pov_box

#Set important parameters
outliers <- detroit1 %>%
  mutate(outlier = ifelse(outlier_finder(PER_POV), PER_POV, NA)) %>% 
  dplyr::filter(!is.na(outlier))
  
#Now, map the outliers
column1 <- 'PER_POV'
#Now plot the new shapefile
title1 <- 'Outliers for % Under Poverty Line'
out_per_pov <- tm_shape(detroit1) +
  tm_borders(alpha = 0.5) + 
  tm_shape(outliers) + 
  tm_fill(col=column1, palette = "RdBu", title=title1)

#Now, remove the variables you don't need to keep. 
rm(outliers, column1, title1)

7 Scatterplots

Finally, create scatterplots to determine the relationship between HMEDINC and the remaining four attributes (e.g., HMEDINC – WPOP, HMEDINC – PER_POV). We will be using HMEDINC as the dependent variable in each graph.

Consider the following: Are the observed relationships weak or strong based on your visual assessment? What is the direction of each relationship (i.e., positive or negative)?
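A visual assessment can be backed up with a correlation coefficient and a fitted slope. The sketch below uses hypothetical x and y values, not the Detroit attributes. cor() returns the Pearson correlation by default, and lm() fits the same kind of straight line that geom_smooth(method=lm) draws.

```r
# Hypothetical data in which y falls as x rises
x <- c(10, 20, 30, 40, 50)
y <- c(48, 41, 29, 22, 10)

r     <- cor(x, y)          # Pearson correlation: sign gives direction, size gives strength
fit   <- lm(y ~ x)          # simple linear regression of y on x
slope <- coef(fit)[["x"]]   # change in y for a one-unit increase in x

r
slope
```

A value of r close to -1 or 1 suggests a strong linear relationship and a value near 0 a weak one; the sign of r and of the slope gives the direction.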

wpop1 <- ggplot(detroit1, aes(y = HMEDINC, x=WPOP)) +
  geom_point()+
  geom_smooth(method=lm, se=T) +
  theme_classic() +
  labs(title = 'HMEDINC vs WPOP')

bpop1 <- ggplot(detroit1, aes(y = HMEDINC, x=BPOP)) +
  geom_point()+
  geom_smooth(method=lm, se=T) +
  theme_classic() +
  labs(title = 'HMEDINC vs BPOP')

#Using examples above, provide codes to create scatterplots for;
#HMEDINC and VHU
vhu1 <- ggplot(detroit1, aes(y = HMEDINC, x=VHU)) +
  geom_point()+
  geom_smooth(method=lm, se=T) +
  theme_classic() +
  labs(title = 'HMEDINC vs VHU')

#HMEDINC and PER_POV
per_pov1 <- ggplot(detroit1, aes(y = HMEDINC, x=PER_POV)) +
  geom_point()+
  geom_smooth(method=lm, se=T) +
  theme_classic() +
  labs(title = 'HMEDINC vs PER_POV')

#Print each variable to show the graph

vhu1

per_pov1

8 Summarizing Your Findings

Based on your calculated descriptive statistics, summary graphics, and exploratory spatial analysis, provide a thoughtful summary of the key characteristics of each attribute. Focus on the data characteristics you believe are of greatest interest. In writing your response, think “what are the most interesting characteristics of these variables and what are the most important insights that should be conveyed?” Be sure to comment on the spatial characteristics of each variable as well – where are outlying observations located? Do high or low values concentrate in certain regions of the city? Include your responses in the following subsections.

9 Submissions

Your submission will consist of a single knitted HTML file containing your written description of the key spatial and distributional characteristics of each attribute, along with the table of descriptive statistics, histogram for each attribute, boxplot for each variable, and scatterplots for each HMEDINC – attribute pair. Please name this file using the convention: LastName_Assignment2.html.

Submit all files to the Assignment 2 folder on HuskyCT. Your submission is due Wednesday, February 9th by 11:59pm.

9.1 Submission 1

Provide maps with your chosen classification scheme.

tmap_arrange(tm_wpop, tm_bpop, tm_hmedinc, tm_vhu, tm_per_pov, ncol = 3)

Describe the spatial patterns of each mapped variable.

Your answer:

White populations:

The map of white populations shows concentration in two main areas: one slightly off the center of the city and one on the south side. The map shows no covariates, so it offers no clear explanation for this spatial trend. Across the rest of the city, the white population in each census tract is low.

Black populations:

The map of black populations shows a pattern of concentration that cannot be described easily. To the naked eye it looks somewhat random, which is not necessarily the case. Neither this variable nor WPOP was normalized against the total population of each census tract, so visible trends might be deceiving.

Median household income:

Median household income is very low in census tracts throughout the city. There are some clusters of higher values, chiefly close to the river and on the west side of the city. Visually, there appear to be some spatial gradients in the median household income data.

Vacant Housing Units:

There are a number of census tracts with several hundred vacant housing units. There appears to be some clustering, both in areas with very high numbers of vacant housing units and in areas with lower numbers.

% of Population under poverty line:

Across the city, the percent of impoverished citizens is quite high. There are some areas of especially high concentration, which include census tracts near the center of the city. Overall, the center to northeast part of the city has many high values.

You must include a table that summarizes the descriptive statistics for each variable and includes: a) variable name, b) mean, c) median, d) IQR, e) standard deviation, and f) coefficient of variation. Provide the descriptive stats for each variable below. Note: The descriptive stats for the PER_POV variable are provided for you as an example. You will need to calculate CV for each variable using the equation from class.

wpop_stats <- c('339.9', '93.0', '260.0', '624.2', '183.6') #Replace mean, median, IQR, SD, and CV with values from your analyses.
bpop_stats <- c('1826.2', '1724.0', '1566.0', '1080.0', '59.1') #Replace mean, median, IQR, SD, and CV with values from your analyses.
hmedinc_stats <- c('25473.5', '23750.0', '11030.2', '11462.1', '45.0') #Replace mean, median, IQR, SD, and CV with values from your analyses.
vhu_stats <- c('368.0', '345.0', '249.0', '197.7', '53.7') #Replace mean, median, IQR, SD, and CV with values from your analyses.
per_pov_stats <- c(41.1, 41.8, 17.6, 13.7, 33.3) #mean, median, IQR, SD, and CV

#Create a table for all variables
desc_stats <- as.data.frame(rbind(wpop_stats, bpop_stats, hmedinc_stats, vhu_stats, per_pov_stats))
#Rename columns
colnames(desc_stats) <- c('Mean', 'Median', 'IQR', 'SD', 'CV (%)')
rownames(desc_stats) <- c('WPOP', 'BPOP', 'HMEDINC', 'VHU', 'PER_POV')

knitr::kable(desc_stats)

	Mean	Median	IQR	SD	CV (%)
WPOP	339.9	93.0	260.0	624.2	183.6
BPOP	1826.2	1724.0	1566.0	1080.0	59.1
HMEDINC	25473.5	23750.0	11030.2	11462.1	45.0
VHU	368.0	345.0	249.0	197.7	53.7
PER_POV	41.1	41.8	17.6	13.7	33.3

Describe the statistics for each variable.

Your answer:

White populations:

The mean white population per census tract is 339.9 people, with a standard deviation of 624.2. This standard deviation is extremely large in comparison to the mean. This is largely because the distribution is heavily skewed to the right, with a median (93.0) well below the mean. The IQR for the number of white residents in a census tract is 260.0.

Black populations:

The mean black population per census tract is 1826.2 people, with a standard deviation of 1080.0. This standard deviation is fairly large in comparison to the mean, but the distribution is much closer to normal than that of the white population. The IQR is quite large, 1566.0, and the median black population per census tract is 1724 people.

Median household income:

The mean of median household income across all census tracts is $25,473, with a median of $23,750. The standard deviation is $11,462 and the IQR is $11,030. The data are skewed to the right, and Q2 and Q3 are roughly the same size in their ranges.

Vacant Housing Units:

The mean number of vacant housing units per census tract is 368, with a standard deviation of 197.7. The median number of vacant housing units per census tract is 345, with an IQR of 249. The data are slightly skewed to the right, and the ranges of Q2 and Q3 are approximately the same size.

% of Population under poverty line:

The mean percentage of residents in poverty across census tracts in Detroit is 41.1%. The median, 41.8%, is close to the mean. The data appear to follow an approximately normal distribution. The ranges of Q2 and Q3 look to be about the same in length. The standard deviation is 13.7% and the IQR is 17.6%.

9.2 Submission 2: Distributional statistics for HMEDINC

Based on your calculated descriptive statistics, summary graphics, and exploratory spatial analysis, provide a thoughtful summary of the key characteristics of the HMEDINC variable. Focus on the data characteristics you believe are of greatest interest. In writing your response, think “what are the most interesting characteristics of these variables and what are the most important insights that should be conveyed?” Be sure to comment on the spatial characteristics of the HMEDINC variable as well – where are outlying observations located? do high or low values concentrate in certain regions of the city?

out_hmedinc #show the map of outliers
plot_grid(hmedinc_histo1, hmedinc_box, scale = c(1,1)) #show boxplot and histogram

Your answer:

Overall, the median income of Detroit residents by census tract is quite low, especially compared to the national median income of $74,850 in the United States at large. The vast majority of income values are concentrated in the $10,000-$42,500 range. It should not be lost on the viewer of a map that very few census tracts have a median income above that national figure. Spatially, we can see that there are small pockets of high median incomes.

9.3 Submission 3: Distributional statistics for WPOP

Similar to HMEDINC above, provide a thoughtful summary of the key characteristics of the WPOP variable.

out_wpop #show outliers
plot_grid(wpop_histo1, wpop_box, wpop1, scale = c(1,1,1)) #show histogram, boxplot, and scatterplot

Your answer:

The histogram for white population within census tracts in Detroit shows a high concentration of values close to zero, with the frequency of data in successive bins declining rapidly. The scatterplot does not show any clear linear correlation between white population and median household income in a census tract. Spatially, there appear to be a few areas with particularly high white populations, but across the city these tracts are few.

9.4 Submission 4: Distributional statistics for BPOP

Provide a thoughtful summary of the key characteristics of the BPOP variable.

out_bpop
plot_grid(bpop_histo1, bpop_box, bpop1, scale = c(1,1,1))

Your answer:

The histogram for black population by census tract appears to follow an approximately normal distribution. The scatterplot suggests that the number of black residents in a census tract may be positively correlated with median household income for that tract. Spatially, there appear to be areas of the city with some form of clustering for total black populations.

9.5 Submission 5: Distributional statistics for VHU

Provide a thoughtful summary of the key characteristics of the VHU variable.

out_vhu
plot_grid(vhu_histo1, vhu_box, vhu1, scale = c(1,1,1))

Your answer:

The number of vacant housing units in census tracts across the city appears to follow a roughly bell-shaped distribution with a slight skew to the right. This may indicate that housing vacancy is a very widespread pattern, and not a highly localized one. The scatterplot indicates that the number of vacant housing units in a census tract could be negatively correlated with the median household income in that tract.

9.6 Submission 6: Distributional statistics for PER_POV

Provide a thoughtful summary of the key characteristics of the PER_POV variable.

out_per_pov
plot_grid(per_pov_histo1, per_pov_box, per_pov1, scale = c(1,1,1))

Your answer:

The histogram for the percentage of people living under the poverty line among Detroit census tracts follows an approximately normal distribution. This is interesting because this is a parameter that might not necessarily be expected to follow a normal distribution. The scatterplot also shows a trend in which a higher percentage of impoverished people in a census tract is associated with a lower median income. This intuitively makes sense. As with the median income, a visual display of this data should not hide the fact that poverty rates are extremely high across the board in the city.

…………………………….End of Lab Assignment 2…………………………….

GEOG 3500: Lab Assignment 2 - Exploratory Spatial Data Analysis

Jack Bienvenue

2024-02-01
