| Functions | Tasks |
|---|---|
| full_join() | Returns all rows and all columns from both x and y. Where there are no matching values, returns NA for the missing side. |
| left_join() | Returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. |
| right_join() | Returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. |
| inner_join() | Returns all rows from x where there are matching values in y, and all columns from x and y. |
| cut() | Divides the range of the input vector into intervals and codes the values according to the interval they fall into. |
| table() | Builds a contingency table of counts. Note: we discussed the user-contributed command cro() from the expss package in class, but I will use the base R command table() in this lab. |
| quantile() | Produces sample quantiles corresponding to the given probabilities. |
| chisq.test() | Performs chi-squared contingency table tests and goodness-of-fit tests. |
| cor.test() | Tests for association between paired samples. |
| Lambda() | Calculates Goodman and Kruskal's lambda and its confidence intervals. |
| CramerV() | Calculates Cramer's V. |
| cov() | Calculates covariance using two vectors. |
In this lab, we will download two variables from American FactFinder, join the two datasets using a key variable, clean the data, and conduct correlation and bivariate regression. The question we will examine is: "Do neighborhoods with a higher proportion of people with graduate degrees tend to be wealthier?"
# Install required packages ----------- install.packages() runs only if the package is not already installed.
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
if (!require("DescTools")) install.packages("DescTools", dependencies = TRUE)
if (!require("rcompanion")) install.packages("rcompanion", dependencies = TRUE)
# Call the packages using library()
library(tidyverse)
library(DescTools)
library(rcompanion)
The data we will use are the "2013-2017 American Community Survey 5-Year Estimates of MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2017 INFLATION-ADJUSTED DOLLARS)" (table ID: B19013) and the "2013-2017 American Community Survey 5-Year Estimates of PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN THE UNITED STATES" (table ID: B06009). You don't need to download them yourself; I have provided them in the Modules tab on Canvas.
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 6\\Lab5_2020") # Use your own pathname
edu <- read.csv("Week6_edu.csv") # Education data
income <- read.csv("Week6_hinc.csv") # Income data
# Check the data
head(edu)
## GEO_ID NAME B06009_001E
## 1 1400000US13001950100 Census Tract 9501, Appling County, Georgia 1914
## 2 1400000US13001950200 Census Tract 9502, Appling County, Georgia 2896
## 3 1400000US13001950300 Census Tract 9503, Appling County, Georgia 3970
## 4 1400000US13001950400 Census Tract 9504, Appling County, Georgia 1121
## 5 1400000US13001950500 Census Tract 9505, Appling County, Georgia 2465
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia 1349
## B06009_001M B06009_002E B06009_002M B06009_003E B06009_003M B06009_004E
## 1 205 475 137 726 163 457
## 2 317 732 181 1089 188 684
## 3 385 1149 319 1671 323 786
## 4 144 255 86 430 100 329
## 5 275 465 174 1045 159 648
## 6 171 533 125 450 108 288
## B06009_004M B06009_005E B06009_005M B06009_006E B06009_006M B06009_007E
## 1 136 131 66 125 55 1606
## 2 181 209 103 182 80 2107
## 3 192 194 86 170 82 2845
## 4 79 32 17 75 49 888
## 5 147 253 90 54 41 1715
## 6 89 32 24 46 37 826
## B06009_007M B06009_008E B06009_008M B06009_009E B06009_009M B06009_010E
## 1 193 408 135 631 143 357
## 2 284 595 161 722 156 529
## 3 397 865 268 995 241 719
## 4 126 181 66 359 96 271
## 5 224 243 96 802 153 444
## 6 141 266 106 338 79 160
## B06009_010M B06009_011E B06009_011M B06009_012E B06009_012M B06009_013E
## 1 118 109 56 101 49 290
## 2 153 103 67 158 79 634
## 3 185 145 70 121 72 673
## 4 71 26 15 51 38 208
## 5 122 191 86 35 34 659
## 6 53 21 18 41 36 245
## B06009_013M B06009_014E B06009_014M B06009_015E B06009_015M B06009_016E
## 1 109 49 37 95 59 100
## 2 191 100 68 301 132 115
## 3 194 125 96 383 144 67
## 4 86 64 54 71 43 58
## 5 158 159 121 229 110 190
## 6 105 45 30 109 56 79
## B06009_016M B06009_017E B06009_017M B06009_018E B06009_018M B06009_019E
## 1 54 22 26 24 30 18
## 2 62 94 80 24 22 0
## 3 48 49 46 49 50 0
## 4 43 6 5 9 13 10
## 5 83 62 51 19 22 20
## 6 68 7 16 5 7 0
## B06009_019M B06009_020E B06009_020M B06009_021E B06009_021M B06009_022E
## 1 27 18 27 0 13 0
## 2 13 0 13 0 13 0
## 3 18 0 18 0 18 0
## 4 16 10 16 0 13 0
## 5 25 0 13 6 11 14
## 6 13 0 13 0 13 0
## B06009_022M B06009_023E B06009_023M B06009_024E B06009_024M B06009_025E
## 1 13 0 13 0 13 0
## 2 13 0 13 0 13 155
## 3 18 0 18 0 18 452
## 4 13 0 13 0 13 15
## 5 22 0 13 0 13 71
## 6 13 0 13 0 13 278
## B06009_025M B06009_026E B06009_026M B06009_027E B06009_027M B06009_028E
## 1 13 0 13 0 13 0
## 2 141 37 59 66 72 40
## 3 185 159 157 293 192 0
## 4 25 0 13 0 13 0
## 5 61 63 60 8 14 0
## 6 110 222 81 3 7 49
## B06009_028M B06009_029E B06009_029M B06009_030E B06009_030M
## 1 13 0 13 0 13
## 2 46 12 18 0 13
## 3 18 0 18 0 18
## 4 13 0 13 15 25
## 5 13 0 13 0 13
## 6 45 4 11 0 13
head(income)
## GEO_ID NAME B19013_001E
## 1 1400000US13001950100 Census Tract 9501, Appling County, Georgia 53242
## 2 1400000US13001950200 Census Tract 9502, Appling County, Georgia 31642
## 3 1400000US13001950300 Census Tract 9503, Appling County, Georgia 34534
## 4 1400000US13001950400 Census Tract 9504, Appling County, Georgia 40667
## 5 1400000US13001950500 Census Tract 9505, Appling County, Georgia 42540
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia 29688
## B19013_001M
## 1 9830
## 2 5124
## 3 3618
## 4 12627
## 5 6428
## 6 4775
Now we have two separate data.frames, one containing education data and the other income data. In order to conduct statistical analysis using two or more variables (e.g., Kendall's Tau, Goodman and Kruskal's Lambda, Cramer's V, correlation analysis, etc.), it is VERY DANGEROUS to use variables from different data.frames when they are not joined properly. You MUST join them first.
But before we join the two tables, let's first change their variable names. We know from the previous labs that the variables have confusing, non-human-readable code names, and joining the tables with these code names would only amplify the confusion.
Note that the edu data has many columns. We will keep only the first twelve data columns, which consist of a total plus five attainment levels, (1) less than high school, (2) high school graduate, (3) some college, (4) bachelor's degree, and (5) graduate degree, each reported as an estimate and a margin of error.
Note that in the code cell below, I first assign the new column names for income and edu to separate vectors (i.e., new_names_for_income and new_names_for_edu), and then use those vectors to actually change the column names. This can be handy because you can reuse the vectors for selecting columns later in the code.
# Working with income data.
new_names_for_income <- c("hinc", "hinc.error")
colnames(income)[3:4] <- new_names_for_income
new_names_for_edu <- c("total", "total.error",
"lessHS", "lessHS.error",
"HS", "HS.error",
"someCol", "someCol.error",
"bachelor", "bachelor.error",
"graduate", "graduate.error")
# Working with edu data.
colnames(edu)[3:14] <- new_names_for_edu
# To clean things up, let's drop irrelevant columns. We can reuse the new_names_for_income vector here again.
income <- income[, c("GEO_ID", "NAME", new_names_for_income)]
edu <- edu[, c("GEO_ID", "NAME", new_names_for_edu)]
Data from the Census Bureau usually contain a unique ID for each geographic unit (in this case, a census tract). For example, there are one or two variables (e.g., "GEO.id", "GEO.id2", "GEO_ID", "GEOID", etc.) that represent the unique ID of each Census Block Group, Census Tract, County, etc. The two tables we have must be joined based on this unique ID. Otherwise, we may be running statistical analysis on the wrong pairs of Census Tracts. See the two example tables below.
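Here is a minimal sketch that builds two such example tables in R; the names follow the example discussed next, and the hinc and house values are made up for illustration.
# Two toy tables; each has four rows, but only three names appear in both
df.x <- data.frame(name = c("Celine", "Andrew", "Olivia", "Emily"),
                   hinc = c(52000, 61000, 48000, 73000)) # made-up incomes
df.y <- data.frame(name = c("Tony", "Olivia", "Emily", "Andrew"),
                   house = c(210000, 185000, 260000, 199000)) # made-up housing values
df.x
df.y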
Notice that while the number of rows is the same for df.x and df.y, only three names (i.e., Andrew, Olivia, and Emily) appear in both tables. If we simply run, for example, some measure of association using hinc from df.x and house from df.y, R will blindly give us an answer, BUT THIS ANSWER IS WRONG because the 1st row of df.x is Celine's income while the 1st row of df.y is Tony's housing value. To avoid using incorrect pairs of data, we must JOIN THE TABLES PROPERLY before we run any analysis. The package 'tidyverse' provides a series of very convenient functions for joining two (or more) tables. You can choose from the following functions.
| Functions | Tasks |
|---|---|
| full_join(x, y) | Returns all rows and all columns from both x and y. Where there are no matching values, returns NA for the missing side. |
| left_join(x, y) | Returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. |
| right_join(x, y) | Returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. |
| inner_join(x, y) | Returns all rows from x where there are matching values in y, and all columns from x and y. |
For this lab, we know that the two data.frames, edu and income, contain the same list of Census Tracts because they both cover all census tracts in Georgia. We will use full_join() to join the two tables. You can specify the name of the column on which you want the two tables to be joined; we will use GEO_ID.
The str() function shows that all columns from edu and income are now in join.df and that the hinc and hinc.error variables are in character format. Also note that because edu and income each have their own NAME column, the resulting join.df has duplicate NAME variables with the suffixes .x and .y.
# Join edu and income together to create join.df
join.df <- full_join(edu, income, by = "GEO_ID")
# Check the result
str(join.df)
## 'data.frame': 1969 obs. of 17 variables:
## $ GEO_ID : chr "1400000US13001950100" "1400000US13001950200" "1400000US13001950300" "1400000US13001950400" ...
## $ NAME.x : chr "Census Tract 9501, Appling County, Georgia" "Census Tract 9502, Appling County, Georgia" "Census Tract 9503, Appling County, Georgia" "Census Tract 9504, Appling County, Georgia" ...
## $ total : int 1914 2896 3970 1121 2465 1349 2974 919 1938 2781 ...
## $ total.error : int 205 317 385 144 275 171 203 138 209 264 ...
## $ lessHS : int 475 732 1149 255 465 533 798 249 340 369 ...
## $ lessHS.error : int 137 181 319 86 174 125 168 79 160 98 ...
## $ HS : int 726 1089 1671 430 1045 450 1328 325 748 1268 ...
## $ HS.error : int 163 188 323 100 159 108 277 114 199 207 ...
## $ someCol : int 457 684 786 329 648 288 608 235 575 784 ...
## $ someCol.error : int 136 181 192 79 147 89 155 83 143 168 ...
## $ bachelor : int 131 209 194 32 253 32 156 64 129 247 ...
## $ bachelor.error: int 66 103 86 17 90 24 80 42 74 111 ...
## $ graduate : int 125 182 170 75 54 46 84 46 146 113 ...
## $ graduate.error: int 55 80 82 49 41 37 54 33 64 74 ...
## $ NAME.y : chr "Census Tract 9501, Appling County, Georgia" "Census Tract 9502, Appling County, Georgia" "Census Tract 9503, Appling County, Georgia" "Census Tract 9504, Appling County, Georgia" ...
## $ hinc : chr "53242" "31642" "34534" "40667" ...
## $ hinc.error : chr "9830" "5124" "3618" "12627" ...
The remaining data-cleaning steps include (1) parsing county names out of the NAME column and using them to retain only Fulton, DeKalb, Clayton, and Cobb counties, (2) calculating the proportion of residents at each educational attainment level, (3) converting the hinc variable from character to numeric, and (4) creating some ordinal variables.
We are familiar with this process, right? After we extract the four counties, let's use the table() function to see if we did it correctly. table() counts how many occurrences there are of each unique value stored in a vector.
# Separate NAME column (note that we need to supply NAME.x or NAME.y here)
join.df.sep <- separate(join.df, "NAME.x", c("tract", 'county', 'state'), sep = ", ")
join.df.county <- join.df.sep[join.df.sep$county %in% c("Fulton County", "DeKalb County", "Clayton County", "Cobb County"),]
# Check the number of occurrences of each county in the 'county' column
table(join.df.county$county)
##
## Clayton County Cobb County DeKalb County Fulton County
## 50 120 145 204
Looks like we have the four counties and nothing else. Job well done.
If we examine the number of people with graduate degrees rather than their proportion, it may give us a misleading picture. Imagine we have two Census Tracts where 50% of the residents have graduate degrees. If one Census Tract is more populous than the other, the more populous tract will have a greater number of residents with graduate degrees even though the proportion is equally 50%. In other words, the number of graduate degree holders is heavily affected by the total population, while the median household income is less so. Therefore, we will convert some of the variables in the edu data into proportions and use those for the analysis.
# Convert the counts at each educational attainment level to proportions and assign them to new variables
join.df.county$p.lessHS <- join.df.county$lessHS / join.df.county$total
join.df.county$p.HS <- join.df.county$HS / join.df.county$total
join.df.county$p.someCol <- join.df.county$someCol / join.df.county$total
join.df.county$p.bachelor <- join.df.county$bachelor / join.df.county$total
join.df.county$p.graduate <- join.df.county$graduate / join.df.county$total
# Check the result
summary(join.df.county)
## GEO_ID tract county state
## Length:519 Length:519 Length:519 Length:519
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## total total.error lessHS lessHS.error
## Min. : 0 Min. : 13.0 Min. : 0.0 Min. : 2.0
## 1st Qu.: 2364 1st Qu.: 228.5 1st Qu.: 114.0 1st Qu.: 73.5
## Median : 3396 Median : 316.0 Median : 261.0 Median :128.0
## Mean : 3529 Mean : 330.2 Mean : 350.3 Mean :145.1
## 3rd Qu.: 4434 3rd Qu.: 410.0 3rd Qu.: 470.5 3rd Qu.:190.0
## Max. :11620 Max. :1119.0 Max. :1826.0 Max. :577.0
##
## HS HS.error someCol someCol.error
## Min. : 0.0 Min. : 13.0 Min. : 0 Min. : 13.0
## 1st Qu.: 336.0 1st Qu.:128.5 1st Qu.: 487 1st Qu.:139.0
## Median : 628.0 Median :192.0 Median : 773 Median :189.0
## Mean : 727.2 Mean :209.3 Mean : 903 Mean :212.7
## 3rd Qu.:1023.0 3rd Qu.:274.5 3rd Qu.:1192 3rd Qu.:268.5
## Max. :3204.0 Max. :781.0 Max. :4489 Max. :906.0
##
## bachelor bachelor.error graduate graduate.error
## Min. : 0.0 Min. : 6.0 Min. : 0.0 Min. : 9
## 1st Qu.: 394.5 1st Qu.:130.0 1st Qu.: 211.0 1st Qu.: 91
## Median : 783.0 Median :185.0 Median : 478.0 Median :139
## Mean : 943.7 Mean :204.9 Mean : 605.1 Mean :156
## 3rd Qu.:1355.5 3rd Qu.:269.5 3rd Qu.: 888.0 3rd Qu.:210
## Max. :3938.0 Max. :763.0 Max. :2664.0 Max. :597
##
## NAME.y hinc hinc.error p.lessHS
## Length:519 Length:519 Length:519 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.03390
## Mode :character Mode :character Mode :character Median :0.08674
## Mean :0.11045
## 3rd Qu.:0.15313
## Max. :0.73690
## NA's :3
## p.HS p.someCol p.bachelor p.graduate
## Min. :0.0000 Min. :0.04769 Min. :0.005122 Min. :0.00000
## 1st Qu.:0.1146 1st Qu.:0.18789 1st Qu.:0.151414 1st Qu.:0.07205
## Median :0.2113 Median :0.25941 Median :0.236147 Median :0.13672
## Mean :0.2121 Mean :0.25229 Mean :0.257702 Mean :0.16744
## 3rd Qu.:0.3016 3rd Qu.:0.31610 3rd Qu.:0.369951 3rd Qu.:0.24777
## Max. :0.5070 Max. :0.48070 Max. :0.558525 Max. :0.56511
## NA's :3 NA's :3 NA's :3 NA's :3
Notice that the newly created variables contain NAs due to Census Tracts with 0 population (0 divided by 0 yields NaN, which summary() reports as NA). Let's get rid of those Census Tracts using the is.na() function.
# Filtering NAs out
na.filter <- !is.na(join.df.county$p.lessHS) # Here, I am creating na.filter using just p.lessHS. I can do this because I know
# that the rows with NAs are the same across all the newly created columns.
# If that were not the case, you'd want to filter on every column with NAs individually.
join.df.na <- join.df.county[na.filter, ]
# Check the result
summary(join.df.na)
## GEO_ID tract county state
## Length:516 Length:516 Length:516 Length:516
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total total.error lessHS lessHS.error
## Min. : 136 Min. : 63.0 Min. : 0.0 Min. : 2.0
## 1st Qu.: 2389 1st Qu.: 229.0 1st Qu.: 114.8 1st Qu.: 74.0
## Median : 3426 Median : 316.5 Median : 263.0 Median :128.0
## Mean : 3550 Mean : 332.1 Mean : 352.3 Mean :145.8
## 3rd Qu.: 4445 3rd Qu.: 410.0 3rd Qu.: 473.2 3rd Qu.:190.2
## Max. :11620 Max. :1119.0 Max. :1826.0 Max. :577.0
## HS HS.error someCol someCol.error
## Min. : 0.0 Min. : 13.0 Min. : 26.0 Min. : 30.0
## 1st Qu.: 340.5 1st Qu.:129.8 1st Qu.: 489.8 1st Qu.:141.8
## Median : 629.0 Median :192.5 Median : 773.0 Median :189.5
## Mean : 731.4 Mean :210.4 Mean : 908.2 Mean :213.9
## 3rd Qu.:1025.8 3rd Qu.:275.2 3rd Qu.:1193.2 3rd Qu.:269.5
## Max. :3204.0 Max. :781.0 Max. :4489.0 Max. :906.0
## bachelor bachelor.error graduate graduate.error
## Min. : 4.0 Min. : 6.0 Min. : 0.0 Min. : 9.00
## 1st Qu.: 397.5 1st Qu.:131.0 1st Qu.: 214.0 1st Qu.: 91.75
## Median : 787.5 Median :185.5 Median : 481.5 Median :140.50
## Mean : 949.1 Mean :206.1 Mean : 608.6 Mean :156.80
## 3rd Qu.:1356.5 3rd Qu.:270.2 3rd Qu.: 890.2 3rd Qu.:211.25
## Max. :3938.0 Max. :763.0 Max. :2664.0 Max. :597.00
## NAME.y hinc hinc.error p.lessHS
## Length:516 Length:516 Length:516 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.03390
## Mode :character Mode :character Mode :character Median :0.08674
## Mean :0.11045
## 3rd Qu.:0.15313
## Max. :0.73690
## p.HS p.someCol p.bachelor p.graduate
## Min. :0.0000 Min. :0.04769 Min. :0.005122 Min. :0.00000
## 1st Qu.:0.1146 1st Qu.:0.18789 1st Qu.:0.151414 1st Qu.:0.07205
## Median :0.2113 Median :0.25941 Median :0.236147 Median :0.13672
## Mean :0.2121 Mean :0.25229 Mean :0.257702 Mean :0.16744
## 3rd Qu.:0.3016 3rd Qu.:0.31610 3rd Qu.:0.369951 3rd Qu.:0.24777
## Max. :0.5070 Max. :0.48070 Max. :0.558525 Max. :0.56511
The summary() shows that the newly created variables are bounded between 0 and 1, confirming that they are proportions.
Converting the hinc variable from character to numeric
Notice that hinc in the summary output above is character. That's because when the data were first imported into R, there was at least one row in the hinc column that was not a number, which forces R to treat the entire column as character (or factor). Let's convert the median household income variable into numeric. Note that when you are converting a non-numeric column into a numeric one, it is strongly recommended that you first convert it to character and then convert it to numeric.
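To see why the intermediate as.character() step matters, here is a tiny illustration with a hypothetical factor f whose labels happen to look like numbers: as.numeric() applied directly to a factor returns the internal level codes, not the printed values.
# A hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "30"))
as.numeric(f) # returns 1 2 3 -- the internal level codes, NOT the values!
as.numeric(as.character(f)) # returns 10 20 30 -- what we actually want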
join.df.na$hinc.num <- as.numeric(as.character(join.df.na$hinc)) # To character first, and then to numeric
## Warning: NAs introduced by coercion
str(join.df.na)
## 'data.frame': 516 obs. of 25 variables:
## $ GEO_ID : chr "1400000US13063040202" "1400000US13063040203" "1400000US13063040204" "1400000US13063040302" ...
## $ tract : chr "Census Tract 402.02" "Census Tract 402.03" "Census Tract 402.04" "Census Tract 403.02" ...
## $ county : chr "Clayton County" "Clayton County" "Clayton County" "Clayton County" ...
## $ state : chr "Georgia" "Georgia" "Georgia" "Georgia" ...
## $ total : int 1393 2134 2452 3520 4072 1959 3098 3089 2421 4892 ...
## $ total.error : int 202 321 241 398 389 322 372 333 412 443 ...
## $ lessHS : int 318 306 190 1034 1059 781 979 1001 618 989 ...
## $ lessHS.error : int 93 199 90 222 300 243 255 263 164 229 ...
## $ HS : int 581 688 877 1347 1319 430 1087 1127 914 1879 ...
## $ HS.error : int 134 236 185 325 365 149 257 205 370 349 ...
## $ someCol : int 338 753 825 855 1075 567 787 703 582 1426 ...
## $ someCol.error : int 103 183 184 213 266 221 182 184 145 264 ...
## $ bachelor : int 123 265 371 235 530 169 214 245 209 431 ...
## $ bachelor.error: int 53 99 133 136 226 105 98 135 94 165 ...
## $ graduate : int 33 122 189 49 89 12 31 13 98 167 ...
## $ graduate.error: int 28 59 70 49 85 18 40 17 53 107 ...
## $ NAME.y : chr "Census Tract 402.02, Clayton County, Georgia" "Census Tract 402.03, Clayton County, Georgia" "Census Tract 402.04, Clayton County, Georgia" "Census Tract 403.02, Clayton County, Georgia" ...
## $ hinc : chr "31524" "36786" "39194" "33190" ...
## $ hinc.error : chr "6665" "5722" "5231" "1973" ...
## $ p.lessHS : num 0.2283 0.1434 0.0775 0.2938 0.2601 ...
## $ p.HS : num 0.417 0.322 0.358 0.383 0.324 ...
## $ p.someCol : num 0.243 0.353 0.336 0.243 0.264 ...
## $ p.bachelor : num 0.0883 0.1242 0.1513 0.0668 0.1302 ...
## $ p.graduate : num 0.0237 0.0572 0.0771 0.0139 0.0219 ...
## $ hinc.num : num 31524 36786 39194 33190 37236 ...
# Check the result
summary(join.df.na$hinc.num)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9815 40153 58418 66750 84288 200179 2
Although we have successfully converted the hinc variable from character to numeric, the summary() function shows that we still have some NAs in the variable (the rows where coercion failed). Those need to be dropped as well.
# Drop rows that contain NA in hinc.num
na.filter.hinc <- !is.na(join.df.na$hinc.num)
df.ready <- join.df.na[na.filter.hinc, ]
# Check the result
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total total.error lessHS lessHS.error
## Min. : 136 Min. : 63.0 Min. : 0.0 Min. : 2.0
## 1st Qu.: 2392 1st Qu.: 229.2 1st Qu.: 114.2 1st Qu.: 74.0
## Median : 3430 Median : 317.0 Median : 262.0 Median :128.0
## Mean : 3556 Mean : 332.5 Mean : 351.8 Mean :145.9
## 3rd Qu.: 4455 3rd Qu.: 410.0 3rd Qu.: 471.8 3rd Qu.:190.8
## Max. :11620 Max. :1119.0 Max. :1826.0 Max. :577.0
## HS HS.error someCol someCol.error
## Min. : 0.0 Min. : 13.0 Min. : 26.0 Min. : 30.0
## 1st Qu.: 339.5 1st Qu.:130.0 1st Qu.: 490.8 1st Qu.:142.0
## Median : 629.0 Median :192.5 Median : 773.0 Median :190.5
## Mean : 730.9 Mean :210.4 Mean : 909.8 Mean :214.2
## 3rd Qu.:1023.5 3rd Qu.:274.8 3rd Qu.:1193.8 3rd Qu.:270.5
## Max. :3204.0 Max. :781.0 Max. :4489.0 Max. :906.0
## bachelor bachelor.error graduate graduate.error
## Min. : 4.0 Min. : 6.0 Min. : 0.0 Min. : 9.00
## 1st Qu.: 409.8 1st Qu.:132.0 1st Qu.: 216.0 1st Qu.: 92.25
## Median : 797.5 Median :186.0 Median : 483.5 Median :141.00
## Mean : 952.5 Mean :206.6 Mean : 610.8 Mean :157.28
## 3rd Qu.:1357.5 3rd Qu.:270.8 3rd Qu.: 894.8 3rd Qu.:211.75
## Max. :3938.0 Max. :763.0 Max. :2664.0 Max. :597.00
## NAME.y hinc hinc.error p.lessHS
## Length:514 Length:514 Length:514 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.03386
## Mode :character Mode :character Mode :character Median :0.08644
## Mean :0.10985
## 3rd Qu.:0.15296
## Max. :0.73690
## p.HS p.someCol p.bachelor p.graduate
## Min. :0.0000 Min. :0.04769 Min. :0.005122 Min. :0.00000
## 1st Qu.:0.1145 1st Qu.:0.18712 1st Qu.:0.152710 1st Qu.:0.07227
## Median :0.2111 Median :0.25941 Median :0.236405 Median :0.13731
## Mean :0.2113 Mean :0.25229 Mean :0.258531 Mean :0.16801
## 3rd Qu.:0.2980 3rd Qu.:0.31635 3rd Qu.:0.369975 3rd Qu.:0.24796
## Max. :0.5070 Max. :0.48070 Max. :0.558525 Max. :0.56511
## hinc.num
## Min. : 9815
## 1st Qu.: 40153
## Median : 58418
## Mean : 66750
## 3rd Qu.: 84288
## Max. :200179
The statistical tests we will practice require nominal or ordinal variables. Today we will use three variables from the data we have prepared so far: hinc.num, p.graduate, and county (other variables will be recycled in future labs). Since county is already a nominal variable, let's convert df.ready$hinc.num and df.ready$p.graduate into ordinal variables with 4 levels each.
Note that converting a continuous (e.g., interval or ratio) variable into an ordinal variable is usually not desirable because ratio variables carry more information than ordinal variables; by converting, we are essentially throwing away some of that information. We are doing it here just for the purpose of demonstration. We are going to use the cut() function to convert a continuous variable into an ordinal one. The output of cut(), as shown below, is of type 'ordered factor'. Perhaps the most confusing part of using cut() is the breaks argument. I am using the quantile() function to generate the break points that determine where one class ends and another begins. See the blue box below for more information.
# Cut a continuous variable into an ordinal variable
df.ready$hinc.num.ord <- cut(df.ready$hinc.num, # The name of the data.frame and variable that you want to cut
breaks = quantile(df.ready$hinc.num), # Break points
labels = c("poor", "lower-middle class", "upper-middle class", "wealthy"), # Labels for each level
ordered_result = TRUE, # Whether to make the result an ordered factor
include.lowest = TRUE) # Whether to include the lowest value in the first category
# Cut a continuous variable into an ordinal variable
df.ready$p.graduate.ord <- cut(df.ready$p.graduate, # The name of the data.frame and variable that you want to cut
breaks = quantile(df.ready$p.graduate), # Break points
labels = c("very low", "low", "medium", "high"), # Labels for each level
ordered_result = TRUE, # Whether to make the result an ordered factor
include.lowest = TRUE) # Whether to include the lowest value in the first category
# Check the result
table(df.ready$hinc.num.ord)
##
## poor lower-middle class upper-middle class wealthy
## 129 128 128 129
table(df.ready$p.graduate.ord)
##
## very low low medium high
## 129 128 128 129
[ If you want to take a closer look at the quantile() function... ]
To better understand the quantile() function, print out quantile(df.ready$hinc.num) and see what output it gives you.
# quantile function
quantile(df.ready$hinc.num)
## 0% 25% 50% 75% 100%
## 9815.00 40152.75 58417.50 84287.50 200179.00
The output contains two rows of information. The first row shows the percentiles at which quantile() reports values (the 0, 25, 50, 75, and 100th percentiles); the second row shows the values of the variable at those percentiles.
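By default, quantile() uses probs = seq(0, 1, 0.25), which is why we got four classes from cut() above. As a sketch (not part of the lab), you could pass your own probabilities to get, say, five income classes instead:
# Quintile break points instead of the default quartile break points
quantile(df.ready$hinc.num, probs = seq(0, 1, by = 0.2))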
Let's briefly take a look at the chi-square test. Because all the theoretical discussion was covered in the lecture, we are simply going to run the test. Also, most of the coding required to run statistical tests is neatly condensed into a few functions (if not just one), so, as I mentioned in a previous lab, the majority of the work goes into getting the data ready for the tests.
One thing to note is that you can supply either (1) the two vectors separately as two arguments or (2) the output of table() as a single argument. The results do not differ.
# Building a contingency table
ctable <- table(df.ready$hinc.num.ord, df.ready$county)
ctable
##
## Clayton County Cobb County DeKalb County Fulton County
## poor 20 11 28 70
## lower-middle class 21 22 51 34
## upper-middle class 8 43 38 39
## wealthy 0 44 26 59
# Run a chi-square test
chisq.test(df.ready$hinc.num.ord, df.ready$county) # you can supply the two vectors separately or
##
## Pearson's Chi-squared test
##
## data: df.ready$hinc.num.ord and df.ready$county
## X-squared = 79.087, df = 9, p-value = 2.453e-13
chisq.test(ctable) # you can supply the output of the table() function.
##
## Pearson's Chi-squared test
##
## data: ctable
## X-squared = 79.087, df = 9, p-value = 2.453e-13
The output is very simple. First, X-squared = 79.087 is the test statistic. This is a very large value: with df = 9, the critical value needed for a p-value of 0.05 is roughly 16.92 (try qchisq(p = 0.95, df = 9), which returns the value at which the area under the chi-square distribution with df = 9 adds up to 0.95). As you would expect from the large X-squared statistic, the p-value is essentially 0.
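If you run that check, you should see the critical value for alpha = 0.05 with df = 9:
# Critical value of the chi-square distribution with df = 9 at alpha = 0.05
qchisq(p = 0.95, df = 9)
## [1] 16.91898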
The result from the test shows that there exists a statistically significant relationship between the ordinal variable of median household income and county.
Try running the same test using p.graduate.ord before you move on!
Note: everything under section 4 will be covered in the following lecture. However, you must read through the section and practice all the scripts; this will give you an overview before the next class.
Before we run Kendall's Tau, there is one important step that you must do: converting the factor variables into numeric. A factor in R is a type of data that is both character AND numeric. For example, when you print out a factor variable, it shows "hello", "bye", "hi" (i.e., characters), but under the hood those character values are backed by numeric codes such as 1, 2, 3. So, by setting hinc.num.ord and p.graduate.ord as ordered factors, they can be printed as characters that are meaningful to humans (which is useful for us) AND automatically carry numeric codes in the order of labels = c("poor", "lower-middle class", "upper-middle class", "wealthy") (which is what R needs to run tests such as Kendall's Tau).
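Here is a quick sketch of what that looks like under the hood, using a toy vector rather than the lab data:
# An ordered factor prints human-readable labels...
inc <- factor(c("poor", "wealthy", "poor"),
              levels = c("poor", "lower-middle class", "upper-middle class", "wealthy"),
              ordered = TRUE)
inc
# ...but as.numeric() reveals the codes assigned in the order of the levels
as.numeric(inc) # returns 1 4 1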
# Tau-b
cor.test(as.numeric(df.ready$hinc.num.ord), # First vector. A factor variable must be converted to numeric first.
as.numeric(df.ready$p.graduate.ord), # Second vector. Note: we did not need as.numeric() in lecture because the data were not in factor format.
method = "kendall")
##
## Kendall's rank correlation tau
##
## data: as.numeric(df.ready$hinc.num.ord) and as.numeric(df.ready$p.graduate.ord)
## z = 18.152, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6679014
Now, I guess we have become somewhat familiar with the output formats in R. The low p-value (note that the default alpha = 0.05) indicates that we can reject the null hypothesis. The Kendall's tau correlation coefficient is 0.668, indicating a strong relationship.
I am going to use the same two ordinal variables for a demonstration of Lambda. Note that when you have two ordinal variables, you don't usually need Lambda, as there are other tests (e.g., Kendall's tau) that make use of the ordered nature of the data. As with other examples in this lab, I am using them just for demonstration purposes.
The Lambda() function comes from the DescTools package.
# Lambda - row
Lambda(df.ready$hinc.num.ord,
df.ready$p.graduate.ord,
direction = "row")
## [1] 0.3818182
Also, Lambda() accepts a cross tab instead of two vectors. The script above uses vectors as inputs; the script below is another way of doing the same thing, this time using a cross tab as the input. Both give you the same result.
Rc <- table(df.ready$hinc.num.ord, df.ready$p.graduate.ord)
Lambda(Rc, direction = "row")
## [1] 0.3818182
The DescTools package we installed for Lambda() also has a function for Cramer's V, CramerV(). The rcompanion package introduced in the lecture also has a function called cramerV() (note that one has a capital C and the other a lowercase c).
# Cramer's V
DescTools::CramerV(df.ready$hinc.num.ord,
df.ready$p.graduate.ord)
## [1] 0.4819146
rcompanion::cramerV(df.ready$hinc.num.ord,
df.ready$p.graduate.ord)
## Cramer V
## 0.4819
In this section we will be looking at two measures of association for continuous variables, most importantly the correlation coefficient, and how to plot two continuous variables to visually assess a relationship or lack thereof. These topics will be covered in class next Tuesday and appear on Assignment 4.
Before we examine our data, let's briefly go through what correlation is and what it isn't. Correlation tells us how tightly two variables align along a line (see the image below). Notice that the correlation coefficients in the second row of the image are all 1 regardless of how steep or gentle the slope of the line is. In other words, the correlation coefficient is NOT about the slope of the line; it is about the spread of the data points around the line.
(image source: Wikipedia)
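You can verify this property with a quick simulation (a sketch with made-up data): rescaling y changes the slope of the underlying line but leaves the correlation coefficient untouched.
# Correlation is unaffected by rescaling a variable (i.e., by the slope of the line)
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.3) # y follows x closely, with some noise
cor(x, y) # a strong positive correlation
cor(x, 10 * y) # exactly the same coefficient; only the slope changed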
First, we will look at the relationship between the percentage of people with graduate degrees in a neighborhood and the neighborhood's median household income. We anecdotally know that neighborhoods with a more educated population tend to be wealthier. In other words, as a variable for education level increases, we expect to see an increase in a variable for income. Let's visualize p.graduate against hinc.num to see if this is true for Georgia too. We can plot one variable on the x-axis and the other variable on the y-axis to create a scatterplot.
# Creating a Scatterplot between hinc and p.graduate
plot(df.ready$p.graduate,
df.ready$hinc.num,
main = "Scatterplot between % gradute degree \n and median household income",
xlab = "Proportion of residents with graduate degrees",
ylab = "Median household income",
col = "blue")
The scatterplot shows that there is a clear positive relationship between the two variables. But without running the correlation analysis, we cannot quantify how strong the relationship is and whether the relationship is statistically significant.
The code below calculates covariance and correlation with which we can quantify the relationship between the two variables.
# Covariance
cov(df.ready$p.graduate, df.ready$hinc.num) # This value by itself is not really useful because it is affected by the unit of the data.
## [1] 3105.944
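To see how the two measures relate, note that the Pearson correlation is simply the covariance standardized by the two standard deviations, which removes the dependence on units. The line below reproduces the coefficient that cor.test() reports next.
# Pearson correlation = covariance / (sd of x * sd of y)
cov(df.ready$p.graduate, df.ready$hinc.num) /
(sd(df.ready$p.graduate) * sd(df.ready$hinc.num))
## [1] 0.7393218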
# Correlation
cor.test(df.ready$p.graduate, df.ready$hinc.num,
alternative = "two.sided", # or "less" "greater" for one-tailed test
method = "pearson", # or "kendall" "spearman"
conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: df.ready$p.graduate and df.ready$hinc.num
## t = 24.844, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6974296 0.7761786
## sample estimates:
## cor
## 0.7393218
The correlation coefficient is 0.7393218. The t-value is huge with large degrees of freedom (and, of course, a very small p-value). These numbers indicate a statistically significant positive correlation between the two variables.
Compare the scatterplot of p.graduate and hinc.num against the large correlation image above with multiple correlation coefficients and corresponding scatterplots. The correlation coefficient we got from the data is 0.7393218, which is close to 0.8. Can you see the similarity in terms of how spread out the points are around the line?
If the correlation coefficient we got from the data were, for example, 0.4, the points in the scatterplot would have been more spread out. Again, the correlation coefficient does not tell us anything about the slope! It only tells us whether the slope is upward or downward and how spread out the points are around the line.
We have seen that there is a positive correlation between p.graduate and hinc.num. Let's examine whether there is the opposite relationship between low educational attainment and the median household income of neighborhoods in the four counties. This time, let's use p.HS, the proportion of residents aged 25 and over whose highest attainment is a high school diploma.
# Creating a Scatterplot between hinc and p.HS
plot(df.ready$p.HS,
df.ready$hinc.num,
main = "Scatterplot between % high school graduate \n and median household income",
xlab = "Proportion of residents with high school diploma",
ylab = "Median household income",
col = "blue")
The plot shows that there is a very clear negative relationship between the two variables. We can easily predict from this plot that the correlation coefficient will be large in magnitude (to minimize potential confusion, I'd say "the absolute value (or magnitude) of the coefficient will be large" or "the coefficient is large and negative", or something similar). Let's run the analysis.
cor.test(df.ready$p.HS, df.ready$hinc.num,
alternative = "two.sided", # or "less" "greater" for one-tailed test
method = "pearson", # or "kendall" "spearman"
conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: df.ready$p.HS and df.ready$hinc.num
## t = -23.285, df = 512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7567151 -0.6723812
## sample estimates:
## cor
## -0.7171636
Just as we predicted, the correlation coefficient is -0.7171636 and the hypothesis test result is highly statistically significant. Together they indicate a strong negative correlation between p.HS and hinc.num.
Try repeating what we've done so far with the other educational attainment levels, for example, df.ready$p.someCol. Examine whether there is a statistically significant correlation, what the scatterplot looks like, and how the plot compares with the large correlation chart (i.e., the first figure of this document).
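As a starting point, here is a sketch for p.someCol (using the count variable someCol would reintroduce the population-size problem we discussed earlier):
# Scatterplot and correlation test for % some college vs. median household income
plot(df.ready$p.someCol,
df.ready$hinc.num,
main = "Scatterplot between % some college \n and median household income",
xlab = "Proportion of residents with some college education",
ylab = "Median household income",
col = "blue")
cor.test(df.ready$p.someCol, df.ready$hinc.num, method = "pearson")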