Introduction

The data chosen for this exercise consist of one ESRI shapefile (Shelby.shp) and one CSV file (Data.csv). The shapefile contains the 2018 census tracts of Shelby County. Data.csv contains the population and median income of each census tract in the county, along with the White, African American, and Asian population of each tract.

Before loading the data, it is good practice to clear the global environment and the console so that objects left over from earlier work do not interfere with the analysis. The following code clears both.

Cleaning the Global environment and the console

rm(list = ls())      # Clears the global environment for a fresh start
cat('\f')            # Cleans the console

Loading libraries

This code loads multiple libraries at once.

my_library <- c("sf", "dplyr", "dbplyr", "tibble", "ggplot2")   # Packages used in this exercise
lapply(my_library, library, character.only = TRUE)              # Attach each package by name

## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
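
If one of these packages is not installed, library() stops with an error partway through the list. A minimal sketch of a check that could run before the loading step is given here. The helper name find_missing_packages is hypothetical and not part of the original assignment.

```r
# Hypothetical helper: report which packages from a vector are not installed.
find_missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE instead of raising an error when a package
  # is absent (it does load the namespace of packages that are present)
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

pkgs <- c("sf", "dplyr", "dbplyr", "tibble", "ggplot2")
missing_pkgs <- find_missing_packages(pkgs)

if (length(missing_pkgs) > 0) {
  message("Install these packages first: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Any package reported as missing can then be installed with install.packages() before the loading code is run.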

Loading the data

The shapefile will be loaded with the sf package, and the data file will be loaded into R as a tibble. The following code loads each file into its own object.

Shelby <- st_read("Data/Assignment_5/Shelby.shp")

## Reading layer `Shelby' from data source `C:\Rizwan\Education\UofM PhD Work\Seminar in Earth Sciences\R-Spatial\Data\Assignment_5\Shelby.shp' using driver `ESRI Shapefile'
## Simple feature collection with 221 features and 9 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -90.3103 ymin: 34.99419 xmax: -89.63278 ymax: 35.40948
## geographic CRS: NAD83

Data <- as_tibble(read.csv("Data/Assignment_5/Data.csv"))
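
read.csv() guesses column types, so identifier columns such as STATEFP and COUNTYFP come in as integers, while st_read() keeps the same codes as character in the shapefile. The sketch below shows one possible way to keep the codes as character with the colClasses argument. It uses a two-row temporary file as a stand-in for Data.csv, with only four of the columns, so the file and the object name demo are assumptions made for illustration.

```r
# Write a two-row stand-in for Data.csv to a temporary file
# (values taken from the first two tracts; only four columns are kept)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "GEO_ID,STATEFP,COUNTYFP,Tot_Pop",
  "1400000US47157021312,47,157,2108",
  "1400000US47157021530,47,157,5396"
), tmp_csv)

# Stop early if the file path is wrong
stopifnot(file.exists(tmp_csv))

# Named colClasses keeps the FIPS codes as character; other columns are still guessed
demo <- read.csv(tmp_csv, colClasses = c(STATEFP = "character", COUNTYFP = "character"))
str(demo)
```

Reading codes as character matters most when they can carry leading zeros, as the TRACTCE values in the shapefile do.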

Basic exploratory data analysis

Shelby is an object of class sf (simple features), which is built on a data.frame. str() shows its structure, and summary() shows basic statistics for the numeric variables in the Shelby data frame.

str(Shelby)

## Classes 'sf' and 'data.frame':   221 obs. of  10 variables:
##  $ STATEFP : chr  "47" "47" "47" "47" ...
##  $ COUNTYFP: chr  "157" "157" "157" "157" ...
##  $ TRACTCE : chr  "021312" "021530" "021726" "022024" ...
##  $ AFFGEOID: chr  "1400000US47157021312" "1400000US47157021530" "1400000US47157021726" "1400000US47157022024" ...
##  $ GEOID   : chr  "47157021312" "47157021530" "47157021726" "47157022024" ...
##  $ NAME    : chr  "213.12" "215.30" "217.26" "220.24" ...
##  $ LSAD    : chr  "CT" "CT" "CT" "CT" ...
##  $ ALAND   : num  1362074 10068590 1433803 2581685 1056567 ...
##  $ AWATER  : num  0 0 2132 0 45527 ...
##  $ geometry:sfc_POLYGON of length 221; first list element: List of 1
##   ..$ : num [1:26, 1:2] -89.8 -89.8 -89.8 -89.8 -89.8 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:9] "STATEFP" "COUNTYFP" "TRACTCE" "AFFGEOID" ...

cat("Number of observations in the data frame is: ", nrow(Shelby), sep = '')

## Number of observations in the data frame is: 221

cat("Number of variables in the data frame is: ", ncol(Shelby), sep = '')

## Number of variables in the data frame is: 10

summary(Shelby)

##    STATEFP            COUNTYFP           TRACTCE            AFFGEOID        
##  Length:221         Length:221         Length:221         Length:221        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     GEOID               NAME               LSAD               ALAND          
##  Length:221         Length:221         Length:221         Min.   :   555457  
##  Class :character   Class :character   Class :character   1st Qu.:  1940071  
##  Mode  :character   Mode  :character   Mode  :character   Median :  3843270  
##                                                           Mean   :  8948641  
##                                                           3rd Qu.:  6099276  
##                                                           Max.   :138912108  
##      AWATER                  geometry  
##  Min.   :       0   POLYGON      :221  
##  1st Qu.:       0   epsg:4269    :  0  
##  Median :       0   +proj=long...:  0  
##  Mean   :  251458                      
##  3rd Qu.:    8845                      
##  Max.   :32514243

Data.csv is loaded into R as a tibble. The full variable names are given below.

FID: Feature ID
STATEFP : Federal Information Processing Standards (FIPS) code for the state
COUNTYFP : Federal Information Processing Standards (FIPS) code for the county
GEO_ID : Geographical ID
OBJECTID : Object ID
ALAND : Area of land in sq. meters
AWATER : Area of water body in sq. meters
Area_sqmil : Area of the census tract in sq. miles
Tot_Pop : Total population
Med_Inc : Median Income
Tot_Wht : Total White population
Tot_AA : Total African American population
Tot_Asian : Total Asian population
Pop_Den : Population density per sq. mile
PopDen_Wht : Population density of White population per sq. mile
PopDen_AA : Population density of African American population per sq. mile
PopDen_Asi : Population density of Asian population per sq. mile
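
The density variables appear to be the corresponding population counts divided by Area_sqmil. This is an assumption inferred from the values printed by str(Data), not a definition given with the data. The sketch below recomputes Pop_Den for the first five tracts. Area_sqmil is rounded to three decimals in the printout, so the recomputed values only match approximately.

```r
# First five tracts, copied from the str(Data) printout (Area_sqmil is rounded there)
first5 <- data.frame(
  Area_sqmil = c(0.528, 3.886, 0.554, 0.994, 0.421),
  Tot_Pop    = c(2108, 5396, 3486, 3429, 790),
  Pop_Den    = c(3990, 1388, 6294, 3449, 1877)
)

# Assumed definition: people per square mile
first5$Pop_Den_check <- first5$Tot_Pop / first5$Area_sqmil

# Relative difference between the recomputed and the reported density
first5$rel_diff <- abs(first5$Pop_Den_check - first5$Pop_Den) / first5$Pop_Den
print(first5)
```

The same check could be repeated for PopDen_Wht, PopDen_AA, and PopDen_Asi with their respective population counts.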

str(Data)

## tibble [221 x 17] (S3: tbl_df/tbl/data.frame)
##  $ FID       : int [1:221] 0 1 2 3 4 5 6 7 8 9 ...
##  $ STATEFP   : int [1:221] 47 47 47 47 47 47 47 47 47 47 ...
##  $ COUNTYFP  : int [1:221] 157 157 157 157 157 157 157 157 157 157 ...
##  $ GEO_ID    : chr [1:221] "1400000US47157021312" "1400000US47157021530" "1400000US47157021726" "1400000US47157022024" ...
##  $ OBJECTID  : int [1:221] 1161 1176 1186 1201 1001 1006 1021 1029 1044 1059 ...
##  $ ALAND     : int [1:221] 1362074 10068590 1433803 2581685 1056567 2037629 1356159 1413184 1418277 1726745 ...
##  $ AWATER    : int [1:221] 0 0 2132 0 45527 154343 0 0 0 0 ...
##  $ Area_sqmil: num [1:221] 0.528 3.886 0.554 0.994 0.421 ...
##  $ Tot_Pop   : int [1:221] 2108 5396 3486 3429 790 2286 2030 3138 2253 1402 ...
##  $ Med_Inc   : int [1:221] 54769 129375 26731 49667 14345 14299 45729 56111 20724 27344 ...
##  $ Tot_Wht   : int [1:221] 1902 4530 1012 167 0 42 882 2417 18 21 ...
##  $ Tot_AA    : int [1:221] 99 308 2423 3250 762 2244 857 595 2229 1379 ...
##  $ Tot_Asian : int [1:221] 97 479 0 0 0 0 110 80 0 1 ...
##  $ Pop_Den   : num [1:221] 3990 1388 6294 3449 1877 ...
##  $ PopDen_Wht: num [1:221] 3600 1166 1827 168 0 ...
##  $ PopDen_AA : num [1:221] 187.4 79.3 4375.1 3269 1810.7 ...
##  $ PopDen_Asi: num [1:221] 184 123 0 0 0 ...

cat("Number of variables in the data table is: ", ncol(Data), sep = '')

## Number of variables in the data table is: 17

cat("Number of observations in the data table is: ", nrow(Data), sep = '')

## Number of observations in the data table is: 221

cat("Number of NA values in the data table is: ", sum(is.na(Data)), sep = '')

## Number of NA values in the data table is: 0

summary(Data)

##       FID         STATEFP      COUNTYFP      GEO_ID             OBJECTID   
##  Min.   :  0   Min.   :47   Min.   :157   Length:221         Min.   :1000  
##  1st Qu.: 55   1st Qu.:47   1st Qu.:157   Class :character   1st Qu.:1055  
##  Median :110   Median :47   Median :157   Mode  :character   Median :1110  
##  Mean   :110   Mean   :47   Mean   :157                      Mean   :1110  
##  3rd Qu.:165   3rd Qu.:47   3rd Qu.:157                      3rd Qu.:1165  
##  Max.   :220   Max.   :47   Max.   :157                      Max.   :1220  
##      ALAND               AWATER           Area_sqmil         Tot_Pop     
##  Min.   :   555457   Min.   :       0   Min.   : 0.2163   Min.   :    0  
##  1st Qu.:  1940071   1st Qu.:       0   1st Qu.: 0.7730   1st Qu.: 2625  
##  Median :  3843270   Median :       0   Median : 1.4855   Median : 3958  
##  Mean   :  8948641   Mean   :  251458   Mean   : 3.5522   Mean   : 4240  
##  3rd Qu.:  6099276   3rd Qu.:    8845   3rd Qu.: 2.4992   3rd Qu.: 5426  
##  Max.   :138912108   Max.   :32514243   Max.   :66.1540   Max.   :16377  
##     Med_Inc          Tot_Wht          Tot_AA       Tot_Asian     
##  Min.   :     0   Min.   :    0   Min.   :   0   Min.   :   0.0  
##  1st Qu.: 25296   1st Qu.:  167   1st Qu.: 787   1st Qu.:   0.0  
##  Median : 39375   Median :  845   Median :1800   Median :  23.0  
##  Mean   : 48137   Mean   : 1667   Mean   :2270   Mean   : 108.3  
##  3rd Qu.: 65382   3rd Qu.: 2535   3rd Qu.:3425   3rd Qu.:  97.0  
##  Max.   :165321   Max.   :10300   Max.   :8736   Max.   :2219.0  
##     Pop_Den        PopDen_Wht       PopDen_AA        PopDen_Asi    
##  Min.   :    0   Min.   :   0.0   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.: 1722   1st Qu.: 122.9   1st Qu.: 456.9   1st Qu.:  0.00  
##  Median : 2993   Median : 432.5   Median :1473.2   Median : 16.44  
##  Mean   : 3039   Mean   :1008.1   Mean   :1822.5   Mean   : 52.86  
##  3rd Qu.: 4197   3rd Qu.:1754.7   3rd Qu.:2922.1   3rd Qu.: 72.23  
##  Max.   :11705   Max.   :5276.3   Max.   :9842.2   Max.   :622.92
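
The summary shows a minimum of 0 for both Tot_Pop and Med_Inc, so at least one tract has no recorded population and at least one has no recorded median income, even though the table has no NA values. The sketch below shows one way such rows might be located. It runs on a hypothetical four-row data frame named demo whose values are invented for illustration.

```r
# Hypothetical stand-in for Data; these values are invented for illustration
demo <- data.frame(
  GEO_ID  = c("A", "B", "C", "D"),
  Tot_Pop = c(2108, 0, 3486, 790),
  Med_Inc = c(54769, 0, 26731, 0)
)

# Count the zeros in each numeric column
num_cols <- vapply(demo, is.numeric, logical(1))
zero_counts <- colSums(demo[num_cols] == 0)
print(zero_counts)

# List the rows where either variable is zero so they can be inspected
zero_rows <- demo[demo$Tot_Pop == 0 | demo$Med_Inc == 0, ]
print(zero_rows)
```

Whether a zero is a true value or a placeholder for missing data has to be decided from the data source, not from the code.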

Data Visualization

The shapefile will be visualized using the sf package.

plot(st_geometry(Shelby), main = "Shelby County Census Tract", cex.main = 2)

The data can be visualized using histograms. For example, histograms of the population densities of the Shelby County census tracts can be drawn with the following code.

par(mfrow=c(2,2))
hist(Data$Pop_Den, main = "Histogram of Pop_Den", xlab = "Population Density", col = "deepskyblue",
     cex.main = 1.5, cex.lab = 1.5)
hist(Data$PopDen_Wht, main = "Histogram of PopDen_Wht", xlab = "Population Density", col = "deepskyblue",
     cex.main = 1.5, cex.lab = 1.5)
hist(Data$PopDen_AA, main = "Histogram of PopDen_AA", xlab = "Population Density", col = "deepskyblue",
     cex.main = 1.5, cex.lab = 1.5)
hist(Data$PopDen_Asi, main = "Histogram of PopDen_Asi", xlab = "Population Density", col = "deepskyblue",
     cex.main = 1.5, cex.lab = 1.5)
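
The four hist() calls differ only in the column name, so they could also be written as a loop. The sketch below is one possible version. It uses simulated values in a data frame named demo, not the Shelby County data, and it restores the plotting layout afterwards so that later plots are not drawn in a 2 x 2 grid.

```r
set.seed(42)  # Reproducible simulated values; these are not the Shelby County data

# Simulated stand-in for the four density columns of Data
demo <- data.frame(
  Pop_Den    = rexp(221, rate = 1 / 3000),
  PopDen_Wht = rexp(221, rate = 1 / 1000),
  PopDen_AA  = rexp(221, rate = 1 / 1800),
  PopDen_Asi = rexp(221, rate = 1 / 50)
)

dens_cols <- c("Pop_Den", "PopDen_Wht", "PopDen_AA", "PopDen_Asi")

old_par <- par(mfrow = c(2, 2))   # Save the previous layout while setting a 2 x 2 grid
hists <- lapply(dens_cols, function(col) {
  hist(demo[[col]], main = paste("Histogram of", col), xlab = "Population Density",
       col = "deepskyblue", cex.main = 1.5, cex.lab = 1.5)
})
par(old_par)                      # Restore the previous layout
```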

The data can also be visualized using boxplots to check for potential outliers. For example, outliers in the population densities can be visualized with the following code.

par(mfrow=c(2,2))
boxplot(Data$Pop_Den, main = "Boxplot of Pop_Den", col = "orange", cex.main = 1.5)
boxplot(Data$PopDen_Wht, main = "Boxplot of PopDen_Wht", col = "orange", cex.main = 1.5)
boxplot(Data$PopDen_AA, main = "Boxplot of PopDen_AA", col = "orange", cex.main = 1.5)
boxplot(Data$PopDen_Asi, main = "Boxplot of PopDen_Asi", col = "orange", cex.main = 1.5)
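
A boxplot only draws the outliers. To list their values, boxplot.stats() applies the same rule as the plot. The sketch below uses the first five Pop_Den values from the str(Data) printout plus one invented extreme value (25000), which is added only so that there is something to flag.

```r
# First five Pop_Den values from the str(Data) printout, plus one invented extreme value
pop_den <- c(3990, 1388, 6294, 3449, 1877, 25000)

# boxplot.stats() returns the points lying more than 1.5 times the
# interquartile range beyond the hinges in its $out component
stats <- boxplot.stats(pop_den)
outliers <- stats$out
print(outliers)

# Positions of the flagged values in the original vector
which(pop_den %in% outliers)
```

With the real data, the positions could be used to look up the GEO_ID of each flagged tract.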

Conclusion

The correlation among the variables in this data set will be explored later. The data will also be joined with the shapefile later for the final spatial analysis.
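
As a rough sketch of these two later steps, the code below joins two small stand-in tables and computes one correlation. It assumes that AFFGEOID in the shapefile and GEO_ID in the CSV are the matching keys, which the str() printouts suggest but the assignment does not state. The objects shelby_attr and data_tbl are hypothetical plain data frames, not the real Shelby and Data objects.

```r
# Stand-ins for the attribute table of Shelby and for Data (first three tracts from the printouts)
shelby_attr <- data.frame(
  AFFGEOID = c("1400000US47157021312", "1400000US47157021530", "1400000US47157021726"),
  NAME     = c("213.12", "215.30", "217.26")
)
data_tbl <- data.frame(
  GEO_ID  = c("1400000US47157021726", "1400000US47157021312", "1400000US47157021530"),
  Tot_Pop = c(3486, 2108, 5396),
  Med_Inc = c(26731, 54769, 129375)
)

# Assumed key: AFFGEOID in the shapefile corresponds to GEO_ID in the CSV
joined <- merge(shelby_attr, data_tbl, by.x = "AFFGEOID", by.y = "GEO_ID", all.x = TRUE)
print(joined)

# Pearson correlation between two numeric variables (three rows only, so purely illustrative)
cor(joined$Tot_Pop, joined$Med_Inc)
```

The row order of the two tables differs on purpose, to show that the join matches on the key and not on position.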

Assignment 5: Importing Own Data and Basic Exploratory Data Analysis with Visualization

Md Rizwanul Hasan