Function        Task
full_join()     Returns all rows and all columns from both x and y. Where there are no matching values, returns NA for the missing side.
left_join()     Returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join()    Returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
inner_join()    Returns all rows from x where there are matching values in y, and all columns from x and y.
cut()           Divides the range of the input vector into intervals and codes the values according to which interval they fall into.
table()         Builds a contingency table of counts. Note: we discussed the user-written cro() command from the expss package in class, but I will use the base R command table() in this lab.
quantile()      Produces sample quantiles corresponding to the given probabilities.
chisq.test()    Performs chi-squared contingency table tests and goodness-of-fit tests.
cor.test()      Tests for association between paired samples.
Lambda()        Calculates Goodman and Kruskal's lambda and its confidence interval.
CramerV()       Calculates Cramer's V.
cov()           Calculates covariance from two vectors.

0. Before We Get Started: Downloading data, setting the working directory, and reading data

In this lab, we will download two variables from the American Fact Finder, join the two datasets using a key variable, clean the data, and conduct correlation and bivariate regression. The question we will examine is “Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?”

# Install required packages ----------- these lines only install a package if it is not already installed.
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
if (!require("DescTools")) install.packages("DescTools", dependencies = TRUE)
if (!require("rcompanion")) install.packages("rcompanion", dependencies = TRUE)
# Call the packages using library()
library(tidyverse)
library(DescTools)
library(rcompanion)

The data we will use is “2013-2017 American Community Survey 5-Year Estimates of MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS)” (table id: B19013) and “2013-2017 American Community Survey 5-Year Estimates of PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN THE UNITED STATES” (table id: B06009). You don’t need to download them yourself - I provided them in the Modules tab on Canvas.

# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 6\\Lab5_2020") # Use your own pathname
edu <- read.csv("Week6_edu.csv") # Education data
income <- read.csv("Week6_hinc.csv") # Income data

# Check the data
head(edu)
##                 GEO_ID                                        NAME B06009_001E
## 1 1400000US13001950100  Census Tract 9501, Appling County, Georgia        1914
## 2 1400000US13001950200  Census Tract 9502, Appling County, Georgia        2896
## 3 1400000US13001950300  Census Tract 9503, Appling County, Georgia        3970
## 4 1400000US13001950400  Census Tract 9504, Appling County, Georgia        1121
## 5 1400000US13001950500  Census Tract 9505, Appling County, Georgia        2465
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia        1349
##   B06009_001M B06009_002E B06009_002M B06009_003E B06009_003M B06009_004E
## 1         205         475         137         726         163         457
## 2         317         732         181        1089         188         684
## 3         385        1149         319        1671         323         786
## 4         144         255          86         430         100         329
## 5         275         465         174        1045         159         648
## 6         171         533         125         450         108         288
##   B06009_004M B06009_005E B06009_005M B06009_006E B06009_006M B06009_007E
## 1         136         131          66         125          55        1606
## 2         181         209         103         182          80        2107
## 3         192         194          86         170          82        2845
## 4          79          32          17          75          49         888
## 5         147         253          90          54          41        1715
## 6          89          32          24          46          37         826
##   B06009_007M B06009_008E B06009_008M B06009_009E B06009_009M B06009_010E
## 1         193         408         135         631         143         357
## 2         284         595         161         722         156         529
## 3         397         865         268         995         241         719
## 4         126         181          66         359          96         271
## 5         224         243          96         802         153         444
## 6         141         266         106         338          79         160
##   B06009_010M B06009_011E B06009_011M B06009_012E B06009_012M B06009_013E
## 1         118         109          56         101          49         290
## 2         153         103          67         158          79         634
## 3         185         145          70         121          72         673
## 4          71          26          15          51          38         208
## 5         122         191          86          35          34         659
## 6          53          21          18          41          36         245
##   B06009_013M B06009_014E B06009_014M B06009_015E B06009_015M B06009_016E
## 1         109          49          37          95          59         100
## 2         191         100          68         301         132         115
## 3         194         125          96         383         144          67
## 4          86          64          54          71          43          58
## 5         158         159         121         229         110         190
## 6         105          45          30         109          56          79
##   B06009_016M B06009_017E B06009_017M B06009_018E B06009_018M B06009_019E
## 1          54          22          26          24          30          18
## 2          62          94          80          24          22           0
## 3          48          49          46          49          50           0
## 4          43           6           5           9          13          10
## 5          83          62          51          19          22          20
## 6          68           7          16           5           7           0
##   B06009_019M B06009_020E B06009_020M B06009_021E B06009_021M B06009_022E
## 1          27          18          27           0          13           0
## 2          13           0          13           0          13           0
## 3          18           0          18           0          18           0
## 4          16          10          16           0          13           0
## 5          25           0          13           6          11          14
## 6          13           0          13           0          13           0
##   B06009_022M B06009_023E B06009_023M B06009_024E B06009_024M B06009_025E
## 1          13           0          13           0          13           0
## 2          13           0          13           0          13         155
## 3          18           0          18           0          18         452
## 4          13           0          13           0          13          15
## 5          22           0          13           0          13          71
## 6          13           0          13           0          13         278
##   B06009_025M B06009_026E B06009_026M B06009_027E B06009_027M B06009_028E
## 1          13           0          13           0          13           0
## 2         141          37          59          66          72          40
## 3         185         159         157         293         192           0
## 4          25           0          13           0          13           0
## 5          61          63          60           8          14           0
## 6         110         222          81           3           7          49
##   B06009_028M B06009_029E B06009_029M B06009_030E B06009_030M
## 1          13           0          13           0          13
## 2          46          12          18           0          13
## 3          18           0          18           0          18
## 4          13           0          13          15          25
## 5          13           0          13           0          13
## 6          45           4          11           0          13
head(income)
##                 GEO_ID                                        NAME B19013_001E
## 1 1400000US13001950100  Census Tract 9501, Appling County, Georgia       53242
## 2 1400000US13001950200  Census Tract 9502, Appling County, Georgia       31642
## 3 1400000US13001950300  Census Tract 9503, Appling County, Georgia       34534
## 4 1400000US13001950400  Census Tract 9504, Appling County, Georgia       40667
## 5 1400000US13001950500  Census Tract 9505, Appling County, Georgia       42540
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia       29688
##   B19013_001M
## 1        9830
## 2        5124
## 3        3618
## 4       12627
## 5        6428
## 6        4775

1. Joining the two tables (and changing variable names)

Now we have two separate data.frames, one containing education data and the other income data. In order to conduct statistical analysis using two or more variables (e.g., Kendall’s Tau, Goodman and Kruskal’s Lambda, Cramer’s V, correlation analysis, etc.), it is VERY DANGEROUS to use variables from different data.frames when they have not been joined properly. You MUST join them first.

But before we join the two tables, let’s first change their variable names. We know from the previous labs that the variables have confusing, non-human-readable code names, and joining the tables with these code names would only amplify the confusion on our side.

Note that the edu data has many columns (see the image below); we will keep only the overall estimates and margins of error for the total population and five attainment levels: less than high school, high school, some college, bachelor’s degree, and graduate degree.

Note that in the code cell below, I first assign the new column names for income and edu into separate vectors (i.e., new_names_for_income and new_names_for_edu), and then use those vectors to actually change the column names. This can be handy because you can reuse the vectors for selecting columns in the following code.

# Working with income data.
new_names_for_income <- c("hinc", "hinc.error")
colnames(income)[3:4] <- new_names_for_income

new_names_for_edu <- c("total", "total.error",
                         "lessHS", "lessHS.error",
                         "HS", "HS.error",
                         "someCol", "someCol.error",
                         "bachelor", "bachelor.error",
                         "graduate", "graduate.error")

# Working with edu data. 
colnames(edu)[3:14] <- new_names_for_edu

# To clean things up, let's drop irrelevant columns. We can reuse the new_names_for_income vector here again.
income <- income[, c("GEO_ID", "NAME", new_names_for_income)]
edu <- edu[, c("GEO_ID", "NAME", new_names_for_edu)]

   

Data from the Census Bureau usually contain a unique ID for each geographic unit (in this case, a census tract). For example, there are one or two variables (e.g., “GEO.id”, “GEO.id2”, “GEO_ID”, “GEOID”, etc.) that represent the unique ID of each Census Block Group, Census Tract, County, etc. The two tables we have must be joined based on this unique ID. Otherwise, we may be running statistical analysis on the wrong pairs of Census Tracts. See the two example tables below.

Notice that while the number of rows is the same for df.x and df.y, only three names (i.e., Andrew, Olivia, and Emily) are on both tables. If we simply run, for example, some measure of association test using hinc from df.x and house from df.y, R will blindly give us an answer, BUT THIS ANSWER IS WRONG because the 1st row of df.x is Celine’s income while the 1st row of df.y is Tony’s housing value. To avoid using incorrect pairs of data, we must JOIN THE TABLES PROPERLY before we run any analysis. The ‘tidyverse’ package provides a series of very convenient functions for joining two (or more) tables. You can choose from the following functions.

Function          Task
full_join(x, y)   Returns all rows and all columns from both x and y. Where there are no matching values, returns NA for the missing side.
left_join(x, y)   Returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join(x, y)  Returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
inner_join(x, y)  Returns all rows from x where there are matching values in y, and all columns from x and y.
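
To make the danger concrete, here is a minimal sketch of the df.x / df.y situation described above. The values below are hypothetical (made up to mirror the example tables), but they show how joining on the key variable keeps the right pairs together:

# Two toy tables with the same number of rows but different row orders (hypothetical values)
df.x <- data.frame(name = c("Celine", "Andrew", "Olivia", "Emily"),
                   hinc = c(52000, 61000, 48000, 75000))
df.y <- data.frame(name = c("Tony", "Olivia", "Emily", "Andrew"),
                   house = c(310000, 240000, 450000, 280000))

# WRONG: pairing by row position puts Celine's income next to Tony's housing value
cbind(df.x$hinc, df.y$house)

# RIGHT: joining on the key variable lines up each person's own values;
# Celine and Tony get NA for the table they are missing from
full_join(df.x, df.y, by = "name")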

For this lab, we know that the two data.frames, edu and income, contain the same list of Census Tracts because they both cover all census tracts in Georgia. We will use full_join() to join the two tables. You can specify the column on which the two tables should be joined using the by argument; we will use GEO_ID.

The str() function shows that all columns from edu and income are in join.df and that the hinc variables are in character format. Also note that because edu and income each have their own NAME column, the resulting join.df has duplicate NAME variables with the suffixes .x and .y.

# Join edu and income together to create join.df
join.df <- full_join(edu, income, by = "GEO_ID")

# Check the result
str(join.df)
## 'data.frame':    1969 obs. of  17 variables:
##  $ GEO_ID        : chr  "1400000US13001950100" "1400000US13001950200" "1400000US13001950300" "1400000US13001950400" ...
##  $ NAME.x        : chr  "Census Tract 9501, Appling County, Georgia" "Census Tract 9502, Appling County, Georgia" "Census Tract 9503, Appling County, Georgia" "Census Tract 9504, Appling County, Georgia" ...
##  $ total         : int  1914 2896 3970 1121 2465 1349 2974 919 1938 2781 ...
##  $ total.error   : int  205 317 385 144 275 171 203 138 209 264 ...
##  $ lessHS        : int  475 732 1149 255 465 533 798 249 340 369 ...
##  $ lessHS.error  : int  137 181 319 86 174 125 168 79 160 98 ...
##  $ HS            : int  726 1089 1671 430 1045 450 1328 325 748 1268 ...
##  $ HS.error      : int  163 188 323 100 159 108 277 114 199 207 ...
##  $ someCol       : int  457 684 786 329 648 288 608 235 575 784 ...
##  $ someCol.error : int  136 181 192 79 147 89 155 83 143 168 ...
##  $ bachelor      : int  131 209 194 32 253 32 156 64 129 247 ...
##  $ bachelor.error: int  66 103 86 17 90 24 80 42 74 111 ...
##  $ graduate      : int  125 182 170 75 54 46 84 46 146 113 ...
##  $ graduate.error: int  55 80 82 49 41 37 54 33 64 74 ...
##  $ NAME.y        : chr  "Census Tract 9501, Appling County, Georgia" "Census Tract 9502, Appling County, Georgia" "Census Tract 9503, Appling County, Georgia" "Census Tract 9504, Appling County, Georgia" ...
##  $ hinc          : chr  "53242" "31642" "34534" "40667" ...
##  $ hinc.error    : chr  "9830" "5124" "3618" "12627" ...
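
As a side note, because both tables spell the tract names identically, you could avoid the duplicated NAME.x/NAME.y columns by joining on both keys at once. This is just an optional variation (it assumes the NAME strings match exactly in the two tables):

# Joining on both keys keeps a single NAME column
join.df2 <- full_join(edu, income, by = c("GEO_ID", "NAME"))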

 

2. Finishing up the rest of the data cleaning

The remaining data cleaning steps include (1) parsing the county names out of the NAME column and using them to retain only Fulton, DeKalb, Clayton, and Cobb counties, (2) calculating the proportion of residents with graduate degrees, (3) converting the hinc variable from character to numeric, and (4) creating some ordinal variables.

2.1. Extracting the four counties

We are familiar with this process, right? After we extract the four counties, let’s use the table() function to see if we did it correctly. The table() function counts how many occurrences there are of each unique value stored in a vector.

# Separate NAME column (note that we need to supply NAME.x or NAME.y here)
join.df.sep <- separate(join.df, "NAME.x", c("tract", 'county', 'state'), sep = ", ")
join.df.county <- join.df.sep[join.df.sep$county %in% c("Fulton County", "DeKalb County", "Clayton County", "Cobb County"),]

# Check the number of occurrences of each county in the 'county' column
table(join.df.county$county)
## 
## Clayton County    Cobb County  DeKalb County  Fulton County 
##             50            120            145            204

Looks like we have the four counties and nothing else. Job well done.

2.2. Calculating the proportion of residents with graduate degrees

If we examine the number of people with graduate degrees rather than their proportion, we may get a misleading picture. Imagine two Census Tracts where 50% of the residents have graduate degrees. If one Census Tract is more populous than the other, the more populous Census Tract will have a greater number of residents with graduate degrees even though the proportion is 50% in both. In other words, the number of graduate degree holders is heavily affected by the total population, while the median household income is less so. Therefore, we will convert some of the variables in the edu data into proportions and use those for the analysis.

# Convert the counts for each attainment level into proportions and assign them to new variables
join.df.county$p.lessHS <- join.df.county$lessHS / join.df.county$total
join.df.county$p.HS <- join.df.county$HS / join.df.county$total
join.df.county$p.someCol <- join.df.county$someCol / join.df.county$total
join.df.county$p.bachelor <- join.df.county$bachelor / join.df.county$total
join.df.county$p.graduate <- join.df.county$graduate / join.df.county$total

# Check the result
summary(join.df.county)
##     GEO_ID             tract              county             state          
##  Length:519         Length:519         Length:519         Length:519        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      total        total.error         lessHS        lessHS.error  
##  Min.   :    0   Min.   :  13.0   Min.   :   0.0   Min.   :  2.0  
##  1st Qu.: 2364   1st Qu.: 228.5   1st Qu.: 114.0   1st Qu.: 73.5  
##  Median : 3396   Median : 316.0   Median : 261.0   Median :128.0  
##  Mean   : 3529   Mean   : 330.2   Mean   : 350.3   Mean   :145.1  
##  3rd Qu.: 4434   3rd Qu.: 410.0   3rd Qu.: 470.5   3rd Qu.:190.0  
##  Max.   :11620   Max.   :1119.0   Max.   :1826.0   Max.   :577.0  
##                                                                   
##        HS            HS.error        someCol     someCol.error  
##  Min.   :   0.0   Min.   : 13.0   Min.   :   0   Min.   : 13.0  
##  1st Qu.: 336.0   1st Qu.:128.5   1st Qu.: 487   1st Qu.:139.0  
##  Median : 628.0   Median :192.0   Median : 773   Median :189.0  
##  Mean   : 727.2   Mean   :209.3   Mean   : 903   Mean   :212.7  
##  3rd Qu.:1023.0   3rd Qu.:274.5   3rd Qu.:1192   3rd Qu.:268.5  
##  Max.   :3204.0   Max.   :781.0   Max.   :4489   Max.   :906.0  
##                                                                 
##     bachelor      bachelor.error     graduate      graduate.error
##  Min.   :   0.0   Min.   :  6.0   Min.   :   0.0   Min.   :  9   
##  1st Qu.: 394.5   1st Qu.:130.0   1st Qu.: 211.0   1st Qu.: 91   
##  Median : 783.0   Median :185.0   Median : 478.0   Median :139   
##  Mean   : 943.7   Mean   :204.9   Mean   : 605.1   Mean   :156   
##  3rd Qu.:1355.5   3rd Qu.:269.5   3rd Qu.: 888.0   3rd Qu.:210   
##  Max.   :3938.0   Max.   :763.0   Max.   :2664.0   Max.   :597   
##                                                                  
##     NAME.y              hinc            hinc.error           p.lessHS      
##  Length:519         Length:519         Length:519         Min.   :0.00000  
##  Class :character   Class :character   Class :character   1st Qu.:0.03390  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.08674  
##                                                           Mean   :0.11045  
##                                                           3rd Qu.:0.15313  
##                                                           Max.   :0.73690  
##                                                           NA's   :3        
##       p.HS          p.someCol         p.bachelor         p.graduate     
##  Min.   :0.0000   Min.   :0.04769   Min.   :0.005122   Min.   :0.00000  
##  1st Qu.:0.1146   1st Qu.:0.18789   1st Qu.:0.151414   1st Qu.:0.07205  
##  Median :0.2113   Median :0.25941   Median :0.236147   Median :0.13672  
##  Mean   :0.2121   Mean   :0.25229   Mean   :0.257702   Mean   :0.16744  
##  3rd Qu.:0.3016   3rd Qu.:0.31610   3rd Qu.:0.369951   3rd Qu.:0.24777  
##  Max.   :0.5070   Max.   :0.48070   Max.   :0.558525   Max.   :0.56511  
##  NA's   :3        NA's   :3         NA's   :3          NA's   :3

Notice that the newly created variables contain NAs due to Census Tracts with a total of 0 (dividing 0 by 0 yields NaN, which summary() reports as NA). Let’s get rid of those Census Tracts using the is.na() function.

# Filtering NAs out
na.filter <- !is.na(join.df.county$p.lessHS) # Here, I am creating na.filter using just p.lessHS because I know
                                             # the rows that have NA in all of the newly created columns are the same.
                                             # If that were not the case, you'd want to do this for every column with NAs individually.
join.df.na <- join.df.county[na.filter, ]

# Check the result
summary(join.df.na)
##     GEO_ID             tract              county             state          
##  Length:516         Length:516         Length:516         Length:516        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      total        total.error         lessHS        lessHS.error  
##  Min.   :  136   Min.   :  63.0   Min.   :   0.0   Min.   :  2.0  
##  1st Qu.: 2389   1st Qu.: 229.0   1st Qu.: 114.8   1st Qu.: 74.0  
##  Median : 3426   Median : 316.5   Median : 263.0   Median :128.0  
##  Mean   : 3550   Mean   : 332.1   Mean   : 352.3   Mean   :145.8  
##  3rd Qu.: 4445   3rd Qu.: 410.0   3rd Qu.: 473.2   3rd Qu.:190.2  
##  Max.   :11620   Max.   :1119.0   Max.   :1826.0   Max.   :577.0  
##        HS            HS.error        someCol       someCol.error  
##  Min.   :   0.0   Min.   : 13.0   Min.   :  26.0   Min.   : 30.0  
##  1st Qu.: 340.5   1st Qu.:129.8   1st Qu.: 489.8   1st Qu.:141.8  
##  Median : 629.0   Median :192.5   Median : 773.0   Median :189.5  
##  Mean   : 731.4   Mean   :210.4   Mean   : 908.2   Mean   :213.9  
##  3rd Qu.:1025.8   3rd Qu.:275.2   3rd Qu.:1193.2   3rd Qu.:269.5  
##  Max.   :3204.0   Max.   :781.0   Max.   :4489.0   Max.   :906.0  
##     bachelor      bachelor.error     graduate      graduate.error  
##  Min.   :   4.0   Min.   :  6.0   Min.   :   0.0   Min.   :  9.00  
##  1st Qu.: 397.5   1st Qu.:131.0   1st Qu.: 214.0   1st Qu.: 91.75  
##  Median : 787.5   Median :185.5   Median : 481.5   Median :140.50  
##  Mean   : 949.1   Mean   :206.1   Mean   : 608.6   Mean   :156.80  
##  3rd Qu.:1356.5   3rd Qu.:270.2   3rd Qu.: 890.2   3rd Qu.:211.25  
##  Max.   :3938.0   Max.   :763.0   Max.   :2664.0   Max.   :597.00  
##     NAME.y              hinc            hinc.error           p.lessHS      
##  Length:516         Length:516         Length:516         Min.   :0.00000  
##  Class :character   Class :character   Class :character   1st Qu.:0.03390  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.08674  
##                                                           Mean   :0.11045  
##                                                           3rd Qu.:0.15313  
##                                                           Max.   :0.73690  
##       p.HS          p.someCol         p.bachelor         p.graduate     
##  Min.   :0.0000   Min.   :0.04769   Min.   :0.005122   Min.   :0.00000  
##  1st Qu.:0.1146   1st Qu.:0.18789   1st Qu.:0.151414   1st Qu.:0.07205  
##  Median :0.2113   Median :0.25941   Median :0.236147   Median :0.13672  
##  Mean   :0.2121   Mean   :0.25229   Mean   :0.257702   Mean   :0.16744  
##  3rd Qu.:0.3016   3rd Qu.:0.31610   3rd Qu.:0.369951   3rd Qu.:0.24777  
##  Max.   :0.5070   Max.   :0.48070   Max.   :0.558525   Max.   :0.56511

The summary() shows that the newly created variables are bounded between 0 and 1, confirming that they are in proportion format.

 

2.3. Converting hinc variable from character to numeric

Notice that hinc in the summary output above is character. That’s because when the data was first imported into R, there was at least one value in the hinc column that was not a number, which forces R to treat the entire column as character (or factor). Let’s convert the median household income variable into numeric. Note that when you are converting a non-numeric column into a numeric column, it is strongly recommended that you first convert it to character, and then convert it to numeric (this matters especially when the column is a factor).
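
To see why the intermediate as.character() step matters for factors, here is a minimal demonstration with a toy factor (the values are made up):

# A toy factor storing numbers as text
f <- factor(c("300", "100", "200"))

# WRONG: as.numeric() on a factor returns the internal level codes, not the values
as.numeric(f)                # 3 1 2, because the sorted levels are "100" "200" "300"

# RIGHT: convert to character first, then to numeric
as.numeric(as.character(f))  # 300 100 200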

join.df.na$hinc.num <- as.numeric(as.character(join.df.na$hinc)) # To character first, and then to numeric
## Warning: NAs introduced by coercion
str(join.df.na)
## 'data.frame':    516 obs. of  25 variables:
##  $ GEO_ID        : chr  "1400000US13063040202" "1400000US13063040203" "1400000US13063040204" "1400000US13063040302" ...
##  $ tract         : chr  "Census Tract 402.02" "Census Tract 402.03" "Census Tract 402.04" "Census Tract 403.02" ...
##  $ county        : chr  "Clayton County" "Clayton County" "Clayton County" "Clayton County" ...
##  $ state         : chr  "Georgia" "Georgia" "Georgia" "Georgia" ...
##  $ total         : int  1393 2134 2452 3520 4072 1959 3098 3089 2421 4892 ...
##  $ total.error   : int  202 321 241 398 389 322 372 333 412 443 ...
##  $ lessHS        : int  318 306 190 1034 1059 781 979 1001 618 989 ...
##  $ lessHS.error  : int  93 199 90 222 300 243 255 263 164 229 ...
##  $ HS            : int  581 688 877 1347 1319 430 1087 1127 914 1879 ...
##  $ HS.error      : int  134 236 185 325 365 149 257 205 370 349 ...
##  $ someCol       : int  338 753 825 855 1075 567 787 703 582 1426 ...
##  $ someCol.error : int  103 183 184 213 266 221 182 184 145 264 ...
##  $ bachelor      : int  123 265 371 235 530 169 214 245 209 431 ...
##  $ bachelor.error: int  53 99 133 136 226 105 98 135 94 165 ...
##  $ graduate      : int  33 122 189 49 89 12 31 13 98 167 ...
##  $ graduate.error: int  28 59 70 49 85 18 40 17 53 107 ...
##  $ NAME.y        : chr  "Census Tract 402.02, Clayton County, Georgia" "Census Tract 402.03, Clayton County, Georgia" "Census Tract 402.04, Clayton County, Georgia" "Census Tract 403.02, Clayton County, Georgia" ...
##  $ hinc          : chr  "31524" "36786" "39194" "33190" ...
##  $ hinc.error    : chr  "6665" "5722" "5231" "1973" ...
##  $ p.lessHS      : num  0.2283 0.1434 0.0775 0.2938 0.2601 ...
##  $ p.HS          : num  0.417 0.322 0.358 0.383 0.324 ...
##  $ p.someCol     : num  0.243 0.353 0.336 0.243 0.264 ...
##  $ p.bachelor    : num  0.0883 0.1242 0.1513 0.0668 0.1302 ...
##  $ p.graduate    : num  0.0237 0.0572 0.0771 0.0139 0.0219 ...
##  $ hinc.num      : num  31524 36786 39194 33190 37236 ...
# Check the result
summary(join.df.na$hinc.num)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    9815   40153   58418   66750   84288  200179       2

Although we have successfully converted the hinc variable from character to numeric, the summary function shows that we still have some NAs in the variable (introduced by the coercion, as the warning above indicates). Those need to be dropped as well.

# Drop rows that contain NA in hinc.num
na.filter.hinc <- !is.na(join.df.na$hinc.num)
df.ready <- join.df.na[na.filter.hinc, ]

# Check the result
summary(df.ready)
##     GEO_ID             tract              county             state          
##  Length:514         Length:514         Length:514         Length:514        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      total        total.error         lessHS        lessHS.error  
##  Min.   :  136   Min.   :  63.0   Min.   :   0.0   Min.   :  2.0  
##  1st Qu.: 2392   1st Qu.: 229.2   1st Qu.: 114.2   1st Qu.: 74.0  
##  Median : 3430   Median : 317.0   Median : 262.0   Median :128.0  
##  Mean   : 3556   Mean   : 332.5   Mean   : 351.8   Mean   :145.9  
##  3rd Qu.: 4455   3rd Qu.: 410.0   3rd Qu.: 471.8   3rd Qu.:190.8  
##  Max.   :11620   Max.   :1119.0   Max.   :1826.0   Max.   :577.0  
##        HS            HS.error        someCol       someCol.error  
##  Min.   :   0.0   Min.   : 13.0   Min.   :  26.0   Min.   : 30.0  
##  1st Qu.: 339.5   1st Qu.:130.0   1st Qu.: 490.8   1st Qu.:142.0  
##  Median : 629.0   Median :192.5   Median : 773.0   Median :190.5  
##  Mean   : 730.9   Mean   :210.4   Mean   : 909.8   Mean   :214.2  
##  3rd Qu.:1023.5   3rd Qu.:274.8   3rd Qu.:1193.8   3rd Qu.:270.5  
##  Max.   :3204.0   Max.   :781.0   Max.   :4489.0   Max.   :906.0  
##     bachelor      bachelor.error     graduate      graduate.error  
##  Min.   :   4.0   Min.   :  6.0   Min.   :   0.0   Min.   :  9.00  
##  1st Qu.: 409.8   1st Qu.:132.0   1st Qu.: 216.0   1st Qu.: 92.25  
##  Median : 797.5   Median :186.0   Median : 483.5   Median :141.00  
##  Mean   : 952.5   Mean   :206.6   Mean   : 610.8   Mean   :157.28  
##  3rd Qu.:1357.5   3rd Qu.:270.8   3rd Qu.: 894.8   3rd Qu.:211.75  
##  Max.   :3938.0   Max.   :763.0   Max.   :2664.0   Max.   :597.00  
##     NAME.y              hinc            hinc.error           p.lessHS      
##  Length:514         Length:514         Length:514         Min.   :0.00000  
##  Class :character   Class :character   Class :character   1st Qu.:0.03386  
##  Mode  :character   Mode  :character   Mode  :character   Median :0.08644  
##                                                           Mean   :0.10985  
##                                                           3rd Qu.:0.15296  
##                                                           Max.   :0.73690  
##       p.HS          p.someCol         p.bachelor         p.graduate     
##  Min.   :0.0000   Min.   :0.04769   Min.   :0.005122   Min.   :0.00000  
##  1st Qu.:0.1145   1st Qu.:0.18712   1st Qu.:0.152710   1st Qu.:0.07227  
##  Median :0.2111   Median :0.25941   Median :0.236405   Median :0.13731  
##  Mean   :0.2113   Mean   :0.25229   Mean   :0.258531   Mean   :0.16801  
##  3rd Qu.:0.2980   3rd Qu.:0.31635   3rd Qu.:0.369975   3rd Qu.:0.24796  
##  Max.   :0.5070   Max.   :0.48070   Max.   :0.558525   Max.   :0.56511  
##     hinc.num     
##  Min.   :  9815  
##  1st Qu.: 40153  
##  Median : 58418  
##  Mean   : 66750  
##  3rd Qu.: 84288  
##  Max.   :200179

   

2.4. Create some ordinal variables

The statistical tests we will practice require nominal or ordinal variables, and today we will use three variables from the data we have prepared so far: hinc.num, p.graduate, and county (other variables will be recycled in future labs). Since county is already a nominal variable, let’s convert df.ready$hinc.num and df.ready$p.graduate into ordinal variables with 4 levels each.

Note that converting a continuous (e.g., interval or ratio) variable into an ordinal variable is usually not desirable because ratio variables carry more information than ordinal variables; by converting, we are essentially throwing away some of that information. We are doing it here just for the purpose of demonstration. We are going to use the cut() function to convert a continuous variable into an ordinal one. As shown below, the output of the cut() function is of type ‘ordered factor’. Perhaps the most confusing part of using cut() is the ‘breaks’ argument. I am using the quantile() function to generate break points that determine where one class ends and another begins. See the blue box below for more information.

# Cut a continuous variable into an ordinal variable
df.ready$hinc.num.ord <- cut(df.ready$hinc.num, # The name of the data.frame and variable that you want to cut
                             breaks = quantile(df.ready$hinc.num), # Break points
                             labels = c("poor", "lower-middle class", "upper-middle class", "wealthy"), # Labels for each level
                             ordered_result = TRUE, # Whether to make the result an ordered factor
                             include.lowest = TRUE) # Whether to include the lowest value in the first category

# Cut a continuous variable into an ordinal variable
df.ready$p.graduate.ord <- cut(df.ready$p.graduate, # The name of the data.frame and variable that you want to cut
                               breaks = quantile(df.ready$p.graduate), # Break points
                               labels = c("very low", "low", "medium", "high"), # Labels for each level
                               ordered_result = TRUE, # Whether to make the result an ordered factor
                               include.lowest = TRUE) # Whether to include the lowest value in the first category

# Check the result
table(df.ready$hinc.num.ord)
## 
##               poor lower-middle class upper-middle class            wealthy 
##                129                128                128                129
table(df.ready$p.graduate.ord)
## 
## very low      low   medium     high 
##      129      128      128      129

[ If you want to take a look at the quantile function.. ]

To better understand the quantile() function, print out quantile(df.ready$hinc.num) and see what output it gives you.

# quantile function
quantile(df.ready$hinc.num)
##        0%       25%       50%       75%      100% 
##   9815.00  40152.75  58417.50  84287.50 200179.00

The output contains two rows of information. The first row shows the percentiles at which quantile() evaluates the data by default (the 0th, 25th, 50th, 75th, and 100th), and the second row shows the values in the column that fall at those percentiles.
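
By default, quantile() returns quartiles, but you can request any break points through the probs argument. A small sketch in case you ever want a different number of classes (the five-class split here is just an illustration):

# Quintile break points instead of the default quartiles
quantile(df.ready$hinc.num, probs = seq(0, 1, by = 0.2))

# Passing these breaks to cut() would create an ordinal variable with five levels
# (you would then need to supply five labels instead of four)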

   

3. Chi-square test

Let’s briefly take a look at the chi-square test. Because all the theoretical discussion was covered in the lecture, we are simply going to run the test. Also, most of the coding required to run statistical tests is neatly condensed into a few functions (if not just one), so, as I mentioned in a previous lab, the majority of the work is getting the data ready for the statistical tests.

One thing to note is that you can supply either (1) two vectors separately as two arguments or (2) the output of table() as a single argument. The results do not differ.

# Building a contingency table
ctable <- table(df.ready$hinc.num.ord, df.ready$county)
ctable
##                     
##                      Clayton County Cobb County DeKalb County Fulton County
##   poor                           20          11            28            70
##   lower-middle class             21          22            51            34
##   upper-middle class              8          43            38            39
##   wealthy                         0          44            26            59
# Run a chi-square test
chisq.test(df.ready$hinc.num.ord, df.ready$county) # you can supply the two vectors separately or
## 
##  Pearson's Chi-squared test
## 
## data:  df.ready$hinc.num.ord and df.ready$county
## X-squared = 79.087, df = 9, p-value = 2.453e-13
chisq.test(ctable) # you can supply the output of the table() function.
## 
##  Pearson's Chi-squared test
## 
## data:  ctable
## X-squared = 79.087, df = 9, p-value = 2.453e-13

The output is very simple. First, X-squared = 79.087 is the test statistic. This is a very large value: with df = 9, the critical value needed to get a p-value of 0.05 is roughly 16.92 (try qchisq(p = 0.95, df = 9) to get the value at which the area under the chi-square distribution with df = 9 adds up to 0.95). As you would expect from the large X-squared statistic, the p-value is essentially 0.
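
You can verify that critical value yourself:

# Critical value for alpha = 0.05 under the chi-square distribution with df = 9
qchisq(p = 0.95, df = 9)
## [1] 16.91898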

The result from the test shows that there exists a statistically significant relationship between the ordinal variable of median household income and county.

Try running the same test using p.graduate.ord before you move on!

   

4. Measures of Association

Note: everything under section 4 will be covered in the following lecture. However, you must read through the section and practice all the scripts; this will give you an overview before the next class.

4.1. Kendall’s Tau

Before we run Kendall’s Tau, there is one important step that you must take - converting the factor variables into numeric. A factor in R is a type of data that is both character AND numeric. For example, when you print out a factor-type variable, it shows “hello”, “bye”, “hi” (i.e., characters), but under the hood those character values are attached to numeric codes such as 1, 2, 3. So, by setting hinc.num.ord and p.graduate.ord as ordered factors, they can be printed as characters that are meaningful to humans (which is useful for us) AND they automatically carry numeric codes in the order of labels = c("poor", "lower-middle class", "upper-middle class", "wealthy") (which is needed for R to run tests such as Kendall’s Tau).
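
You can see those hidden codes directly. A quick illustration using the income variable we created earlier:

# The printed values are human-readable labels...
head(df.ready$hinc.num.ord)

# ...but underneath, each label carries a numeric code: poor = 1, ..., wealthy = 4
head(as.numeric(df.ready$hinc.num.ord))
levels(df.ready$hinc.num.ord)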

# Tau-b
cor.test(as.numeric(df.ready$hinc.num.ord),     # First vector; a factor variable must be converted to numeric like this.
         as.numeric(df.ready$p.graduate.ord),   # Second vector. Note: we did not need as.numeric() in lecture because that data was not in factor format.
         method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  as.numeric(df.ready$hinc.num.ord) and as.numeric(df.ready$p.graduate.ord)
## z = 18.152, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.6679014

By now, I guess we have become somewhat familiar with the output formats in R. The low p-value (note that the default alpha = 0.05) indicates that we can reject the null hypothesis. The Kendall’s tau correlation coefficient is 0.668, indicating a strong positive relationship.

4.2. Goodman & Kruskals’ Lambda

I am going to use the same two ordinal variables for a demonstration of Lambda. Note that when you have two ordinal variables, you don’t actually need Lambda, as there are other tests (e.g., Kendall’s tau) that make use of the ordered nature of the data. As with other examples in this lab, I am using these variables just for demonstration purposes.

The Lambda() function comes from the DescTools package.

# Lambda - row
Lambda(df.ready$hinc.num.ord,
       df.ready$p.graduate.ord, 
       direction = "row")
## [1] 0.3818182

Also, when using Lambda(), we can input a cross tab instead of two vectors. The script above uses vectors as input for Lambda(); the script below is another way of doing the same thing, this time using a cross tab as input. Both give the same result.

Rc <- table(df.ready$hinc.num.ord, df.ready$p.graduate.ord)

Lambda(Rc, direction = "row")
## [1] 0.3818182
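
The direction argument controls which variable is treated as the one being predicted. Besides "row", Lambda() in DescTools also accepts "column" and "symmetric"; a quick sketch (the numeric results will generally differ from the row version above):

# Lambda treating the column variable (p.graduate.ord) as the one being predicted
Lambda(Rc, direction = "column")

# The symmetric version, which does not single out a dependent variable
Lambda(Rc, direction = "symmetric")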

4.3. Cramer’s V

The DescTools package we installed for Lambda() also has a function for Cramer’s V, CramerV(). The rcompanion package introduced in the lecture also has a function called cramerV() (note that one function name has a capital C and the other a lowercase c).

# Cramer's V
DescTools::CramerV(df.ready$hinc.num.ord,
                   df.ready$p.graduate.ord)
## [1] 0.4819146
rcompanion::cramerV(df.ready$hinc.num.ord,
                   df.ready$p.graduate.ord)
## Cramer V 
##   0.4819
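
As a side note, DescTools::CramerV() can also report a confidence interval through its conf.level argument (per the DescTools documentation); a brief sketch:

# Cramer's V with a 95% confidence interval
DescTools::CramerV(df.ready$hinc.num.ord,
                   df.ready$p.graduate.ord,
                   conf.level = 0.95)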

5. Measures of Association for Continuous Variables

In this section, we will look at two measures of association for continuous variables, most importantly the correlation coefficient, and at how to plot two continuous variables to visually assess a relationship (or the lack thereof). These topics will be covered in class next Tuesday and appear on Assignment 4.

5.1. Covariance and Correlation

Before we examine our data, let us briefly go through what correlation is and what it isn’t. The correlation coefficient tells us how tightly two variables align along a line (see the image below). Notice that the correlation coefficients on the second row of the image are all 1, regardless of how steep or gentle the slope of the line is. In other words, the correlation coefficient is NOT about the slope of the line; it is about the spread of the data points around the line.

(image source: Wikipedia)
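
A quick numeric illustration of this point, using made-up data (the vectors here are hypothetical, purely for intuition):

# Correlation is insensitive to the slope of the line...
x <- 1:100
cor(x, 5 * x)    # steep upward line: correlation is exactly 1
cor(x, 0.2 * x)  # gentle upward line: correlation is still exactly 1
cor(x, -3 * x)   # downward line: correlation is exactly -1

# ...but it is sensitive to the spread of the points around the line
set.seed(1)                     # arbitrary seed, for reproducibility
cor(x, x + rnorm(100, sd = 40)) # noisy upward line: correlation drops well below 1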

First, we will look at the relationship between the percentage of people with graduate degrees in a neighborhood and the neighborhood’s median household income. We anecdotally know that neighborhoods with a more educated population tend to be wealthier. In other words, as a variable for the education level increases, we expect to see an increase in a variable for income. Let’s visualize the data to see if this is true for Georgia too. We can plot one variable on the x-axis and the other variable on the y-axis to create a scatterplot.

# Creating a Scatterplot between hinc and p.graduate
plot(df.ready$p.graduate,
     df.ready$hinc.num,
     main = "Scatterplot between % graduate degree \n and median household income",
     xlab = "Proportion of residents with graduate degrees",
     ylab = "Median household income",
     col = "blue")

The scatterplot shows that there is a clear positive relationship between the two variables. But without running the correlation analysis, we cannot quantify how strong the relationship is and whether the relationship is statistically significant.

The code below calculates covariance and correlation with which we can quantify the relationship between the two variables.

# Covariance
cov(df.ready$p.graduate, df.ready$hinc.num) # This value by itself is not really useful because it is affected by the unit of the data.
## [1] 3105.944
# Correlation
cor.test(df.ready$p.graduate, df.ready$hinc.num,
         alternative = "two.sided", # or "less" "greater" for one-tailed test
         method = "pearson",        # or "kendall" "spearman"
         conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  df.ready$p.graduate and df.ready$hinc.num
## t = 24.844, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6974296 0.7761786
## sample estimates:
##       cor 
## 0.7393218
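
To connect the two quantities above: the correlation coefficient is simply the covariance rescaled by the two standard deviations, which turns the unit-dependent covariance into a unitless number between -1 and 1. A quick check:

# Correlation = covariance / (sd of x * sd of y)
cov(df.ready$p.graduate, df.ready$hinc.num) /
  (sd(df.ready$p.graduate) * sd(df.ready$hinc.num))
# This reproduces the correlation of about 0.739 reported by cor.test() above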

The correlation coefficient is 0.7393218. The t-value is huge, with large degrees of freedom (and, of course, a very small p-value). These numbers indicate a statistically significant positive correlation between the two variables.

 

Compare the scatterplot of p.graduate and hinc.num against the large correlation figure above, with its multiple correlation coefficients and corresponding scatterplots. The correlation coefficient we got from the data is 0.7393218, which is close to 0.8. Can you see the similarity in terms of how spread out the points are around the line?

If the correlation coefficient we got from the data had been, for example, 0.4, the points in the scatterplot would have been more spread out. Again, the correlation coefficient does not tell us anything about the slope! It only tells us whether the slope is upward or downward and how spread out the points are around the line.

A few more examples of correlation tests

We have seen that there is a positive correlation between p.graduate and hinc.num. Let’s examine whether there is the opposite relationship between having low educational attainment and the median household income of neighborhoods in the four counties. This time, let’s use p.HS, the proportion of residents who are 25 or older and whose highest attainment is a high school education.

# Creating a Scatterplot between hinc and p.HS
plot(df.ready$p.HS,
     df.ready$hinc.num,
     main = "Scatterplot between % high school graduate \n and median household income",
     xlab = "Proportion of residents with high school diploma",
     ylab = "Median household income",
     col = "blue")

The plot shows that there is a very clear negative relationship between the two variables. We can easily predict from this plot that the correlation coefficient will be large (to minimize potential confusion, I should say that the absolute value, or magnitude, of the coefficient will be large, i.e., the coefficient will be large and negative). Let’s run the analysis.

cor.test(df.ready$p.HS, df.ready$hinc.num,
         alternative = "two.sided", # or "less" "greater" for one-tailed test
         method = "pearson",        # or "kendall" "spearman"
         conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  df.ready$p.HS and df.ready$hinc.num
## t = -23.285, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7567151 -0.6723812
## sample estimates:
##        cor 
## -0.7171636

Just as we predicted, the correlation coefficient is -0.7171636 and the hypothesis test result is highly statistically significant, indicating a strong negative correlation between p.HS and hinc.num.

Try repeating what we’ve done so far with other educational attainment levels, for example, df.ready$p.someCol. Examine whether there is a statistically significant correlation, what the scatterplot looks like, and how the plot compares with the large correlation chart (i.e., the first figure of this document).