Continuing from the slides for Lab 3, this document will guide you through (1) reading the downloaded data into R, (2) cleaning the data, (3) calculating z-scores using the cleaned data, and (4) using some basic probability-related functions in R.
| Functions | Tasks |
|---|---|
| data.frame() | Creates a data.frame object |
| separate() | Turns a single character column into multiple columns |
| colnames() | Retrieves or sets the column names of a matrix-like object |
| is.na() | Identifies which elements are missing (i.e., NA or NaN) |
| scale() | Calculates z-scores from a vector |
| pnorm() | Takes a quantile value and outputs the probability using a normal distribution |
| qnorm() | Takes a probability value and outputs the quantile using a normal distribution |
Let’s first install the required packages in R. In addition to the functions in base R, we will use the separate function from the “tidyverse” package. Then we load the packages and read the data in. This is the data on poverty that we downloaded from the Census Bureau.
# Install required packages
install.packages("tidyverse")
# Call the packages using library()
library(tidyverse)
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 3\\Lab3\\ACSDT5Y2017.C17002_2020-08-29T152002") # Use your own pathname
pov.data <- read.csv("ACSDT5Y2017.C17002_data_with_overlays_2020-08-29T151945.csv")
# Check the data
head(pov.data)
## GEO_ID NAME C17002_001E
## 1 1400000US13001950100 Census Tract 9501, Appling County, Georgia 2807
## 2 1400000US13001950200 Census Tract 9502, Appling County, Georgia 4158
## 3 1400000US13001950300 Census Tract 9503, Appling County, Georgia 5673
## 4 1400000US13001950400 Census Tract 9504, Appling County, Georgia 1577
## 5 1400000US13001950500 Census Tract 9505, Appling County, Georgia 3722
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia 2106
## C17002_001M C17002_002E C17002_002M C17002_003E C17002_003M C17002_004E
## 1 356 158 122 225 179 119
## 2 515 375 240 777 353 698
## 3 491 807 441 925 410 510
## 4 226 224 137 86 58 123
## 5 463 434 273 424 207 299
## 6 255 147 113 418 234 369
## C17002_004M C17002_005E C17002_005M C17002_006E C17002_006M C17002_007E
## 1 105 250 190 181 137 99
## 2 402 246 176 255 161 65
## 3 300 496 322 270 220 190
## 4 76 38 28 113 51 26
## 5 262 218 145 229 164 105
## 6 224 168 97 198 115 23
## C17002_007M C17002_008E C17002_008M
## 1 91 1775 320
## 2 85 1742 360
## 3 218 2475 459
## 4 24 967 203
## 5 91 2013 328
## 6 26 783 183
In the real world, data you need to analyze almost always comes in messy or incomplete forms. Even data from highly systematized sources, such as the U.S. Census Bureau, often requires some amount of work to ‘get the data ready’. This process of getting the data ready is often called data cleaning.
The data we just read in has a few issues that need to be addressed before we can use it for any statistical analysis:

1. There are many columns other than the ones we are interested in. We want to exclude the variables that are not of interest to us.
2. We need to extract the names of the counties from the variable named ‘NAME’. We need to separate this variable into three pieces.
3. The names of the variables are not intuitive. We need to replace these codes with something more meaningful.
4. The data covers all counties in Georgia, while we are interested in only four counties around Atlanta (Fulton, DeKalb, Clayton, and Cobb). We need to exclude the other counties from the data.
5. The data we downloaded contains the total population and the number of people under poverty. We need to convert these counts into a percentage.
6. Finally, some Census Tracts have zero population (to see this, try min(pov.data$C17002_001E)), which will create an issue when we divide the number of people in poverty by the population to calculate the percent under poverty (e.g., 0/0 is NaN in R, which stands for ‘Not a Number’; see the quick demonstration below). These values need to be excluded from the data.
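As a quick demonstration of the last point (using toy values, not the lab data), here is how NaN shows up in R:
# Toy examples (not the lab data): how NaN arises and how R flags it
0 / 0        # dividing zero by zero produces NaN ("Not a Number")
## [1] NaN
is.na(NaN)   # is.na() treats NaN as a missing value, too
## [1] TRUE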
We are interested in the proportion of people whose income is lower than the poverty line and whether that differs across counties, but there are many other columns containing data that we do not need. To find out which variables we need, we must know what “C17002_001E”, “C17002_002E”, etc. mean. We need to consult the data dictionary that came together with the data file. The variables we need are (1) the two ID variables for each Census Tract (i.e., GEO_ID and NAME), (2) the total population, (3) the number of people whose income is under 50% of the poverty line, and (4) the number of people whose income is between 50% and 99% of the poverty line. The names of the variables we are interested in are highlighted in yellow in the image below.
Before we exclude variables and retain only those that we need, let’s recap how to do subsetting in R.
Subsetting and indexing a vector or a dataframe: Subsetting a vector or a dataframe in R can be done using square brackets []. When you put [] after a vector or a dataframe, it means you are about to extract some parts of it. If it is a vector (which is one-dimensional), you only need to specify one index to subset it. See the examples below.
my.vec <- c(1,3,5,7,9,11)
my.vec[3] # This returns 5, which is in 3rd position of the vector
## [1] 5
my.vec[c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)] # This returns 1, 7, 9: the values in the positions where the TRUEs are. Think of the TRUEs and FALSEs as filters: TRUE lets a value through and FALSE blocks it. As a result, you get the values of the vector that sit in the same positions as the TRUEs inside the square brackets.
## [1] 1 7 9
If it is a dataframe, which is 2-dimensional with rows and columns, you need to supply two indices to parse it, one for rows and one for columns. The format is like this: dataframe[ index for row, index for column ]. See examples below and read the comments (hashtags) carefully.
my.df <- data.frame(a = c("a", "b", "c", "d"),
b = c(1, 2, 3, 4),
Avengers = c("Peter", "Natasha", "Hulk", "Thor"),
stringsAsFactors = FALSE)
my.df # See what the data.frame looks like before subsetting
## a b Avengers
## 1 a 1 Peter
## 2 b 2 Natasha
## 3 c 3 Hulk
## 4 d 4 Thor
my.df[1 , 3] # This returns the 1st row of the 3rd column, which is "Peter"
## [1] "Peter"
my.df[2:3 , 1] # This returns 2nd and 3rd rows of the first column, which are "b", "c"
## [1] "b" "c"
my.df[ , "Avengers"] # This returns all rows in the column named "Avengers". If the row-part in the square bracket is empty, it means all rows. If columns-part in the square bracket is empty, it mean all columns.
## [1] "Peter" "Natasha" "Hulk" "Thor"
my.df[c(FALSE, TRUE, FALSE, TRUE) , "Avengers"] # This will return the 2nd and 4th rows (which are where TRUEs are located) of the column "Avengers", which are Natasha and Thor.
## [1] "Natasha" "Thor"
Now we know which variables we need to retain and how to subset a data frame. The code below is the R way of saying “Give me all rows (note the empty part before the comma in the square bracket!) and the variables named GEO_ID, NAME, C17002_001E, C17002_002E, and C17002_003E from a data.frame called pov.data, and put the result in pov.data.var.” Note that if you don’t assign the subsetted data.frame to an R object, R will simply print the subsetted data.frame in the console window and it will disappear.
pov.data.var <- pov.data[ , c("GEO_ID", "NAME", "C17002_001E", "C17002_002E", "C17002_003E") ]
head(pov.data.var)
## GEO_ID NAME C17002_001E
## 1 1400000US13001950100 Census Tract 9501, Appling County, Georgia 2807
## 2 1400000US13001950200 Census Tract 9502, Appling County, Georgia 4158
## 3 1400000US13001950300 Census Tract 9503, Appling County, Georgia 5673
## 4 1400000US13001950400 Census Tract 9504, Appling County, Georgia 1577
## 5 1400000US13001950500 Census Tract 9505, Appling County, Georgia 3722
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia 2106
## C17002_002E C17002_003E
## 1 158 225
## 2 375 777
## 3 807 925
## 4 224 86
## 5 434 424
## 6 147 418
We want to compare the poverty rates of different counties, and to do that we need a variable that tells us which county a given Census Tract falls into. The name of the county is embedded in the variable called ‘NAME’. Let’s extract the county names from this variable. The ‘NAME’ variable is formatted in the following way: “Census Tract 9501, Appling County, Georgia”. If we break it at every comma, we will have “Census Tract 9501”, “Appling County”, and “Georgia” all separated out, which can then be stored in three different variables. The schematic below illustrates the operation we want to do.
The separate function takes four arguments (arguments are what you supply to functions as inputs): (1) the name of the data, (2) the name of the variable to be parsed, (3) a list of columns to be created as the result, and (4) the character string that will be used as the separator. Notice that there is a space after the comma in sep = ", ". If we forget this space, all values in the newly created variables will be preceded by a space, because R treats a space as a character too if it is inside a character string (see the toy illustration below).
Note that the separate function is part of the tidyverse package. You need to load the package in order to use this function.
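Before applying it to our data, here is a small toy illustration (a made-up one-row data.frame, not the lab data) of why that space in sep matters:
# Toy example: splitting on "," versus ", "
toy <- data.frame(NAME = "Census Tract 9501, Appling County, Georgia",
                  stringsAsFactors = FALSE)
separate(data = toy, col = NAME, into = c("tract", "county", "state"), sep = ",")
# 'county' and 'state' keep a leading space because only the comma was removed
separate(data = toy, col = NAME, into = c("tract", "county", "state"), sep = ", ")
# splitting on ", " removes the comma and the space together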
# Separate the 'NAME' variable into three variables at a comma and a space
pov.data.sep <- separate(
data = pov.data.var, # 1st: specify the data.frame
col = NAME, # 2nd: specify the name of the variable you want to parse
into = c("tract", "county", "state"), # 3rd: a list of variable names that will be created as the result of parsing
sep = ", ") # 4th: character string that will be used as the separator
As before, you need to assign the output of the separate function to an R object. Otherwise, the function will just print its output in the console window and the output will disappear. Let’s examine what the separate function has produced.
head(pov.data.sep)
## GEO_ID tract county state C17002_001E
## 1 1400000US13001950100 Census Tract 9501 Appling County Georgia 2807
## 2 1400000US13001950200 Census Tract 9502 Appling County Georgia 4158
## 3 1400000US13001950300 Census Tract 9503 Appling County Georgia 5673
## 4 1400000US13001950400 Census Tract 9504 Appling County Georgia 1577
## 5 1400000US13001950500 Census Tract 9505 Appling County Georgia 3722
## 6 1400000US13003960100 Census Tract 9601 Atkinson County Georgia 2106
## C17002_002E C17002_003E
## 1 158 225
## 2 375 777
## 3 807 925
## 4 224 86
## 5 434 424
## 6 147 418
The code worked perfectly! Now the ‘tract’ variable contains only the census tract ID, the ‘county’ variable contains only the county names, and so on.
The names of the variables are still in the “C17002_001E” form, which is fairly meaningless and a bit long. Let’s give them shorter, interpretable names. We can access the names of the variables in a dataframe using the colnames function.
colnames(pov.data.sep)
## [1] "GEO_ID" "tract" "county" "state" "C17002_001E"
## [6] "C17002_002E" "C17002_003E"
We can see that the first 4 variables have acceptable names. It is the 5th through 7th variables that are the problem. Using the same subsetting and indexing technique, let’s give those problematic variables new names. The following code means “Replace the 5th, 6th, and 7th names in the list of variable names in the data.frame pov.data.sep with new names: total, under00_50, and under50_99.”
# Rename variables
colnames(pov.data.sep)[5:7] <- c("total", "under00_50", "under50_99")
head(pov.data.sep)
## GEO_ID tract county state total
## 1 1400000US13001950100 Census Tract 9501 Appling County Georgia 2807
## 2 1400000US13001950200 Census Tract 9502 Appling County Georgia 4158
## 3 1400000US13001950300 Census Tract 9503 Appling County Georgia 5673
## 4 1400000US13001950400 Census Tract 9504 Appling County Georgia 1577
## 5 1400000US13001950500 Census Tract 9505 Appling County Georgia 3722
## 6 1400000US13003960100 Census Tract 9601 Atkinson County Georgia 2106
## under00_50 under50_99
## 1 158 225
## 2 375 777
## 3 807 925
## 4 224 86
## 5 434 424
## 6 147 418
See that the new names are now applied to the data.frame, and we can immediately understand what each variable represents.
Since we are interested in only Fulton, DeKalb, Clayton, and Cobb Counties, let’s drop the rows that contain data for other counties but keep all the columns. Schematically, what we want to do is as follows:
We can use the %in% operator to do the job. The %in% operator checks whether elements belong to a vector. It returns TRUE if the element we are checking (i.e., the thing on the left of %in%) is in the vector we are checking it against (i.e., the thing on the right of %in%). See the example below:
# Create a toy vector
my.vec <- c(1,2,3,10,20,30)
1 %in% my.vec # This returns one TRUE because there is 1 in my.vec
## [1] TRUE
15 %in% my.vec # This returns FALSE because my.vec does not contain 15
## [1] FALSE
c(2,3,40) %in% my.vec # This gives TRUE, TRUE, FALSE because the first two elements are in my.vec while the last isn't
## [1] TRUE TRUE FALSE
The following code tests whether each element of pov.data.sep$county is in the vector c("Fulton County", "DeKalb County", "Cobb County", "Clayton County"), returns a vector of TRUEs and FALSEs, and stores it in an object called county.filter.
If the first row of pov.data.sep$county is one of the four counties, county.filter gets TRUE as its first element. If the second row of pov.data.sep$county is NOT one of the four counties, the second element of county.filter gets FALSE, and so on.
# Create a logical vector in which rows that have the four counties in 'county' column gets TRUE and otherwise FALSE
county.filter <- pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County")
county.filter # In this county.filter vector, there are as many elements as the number of rows in pov.data.sep
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [349] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [373] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [397] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [409] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [421] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [433] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [445] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [457] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [469] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [481] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [493] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [505] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [517] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [541] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [553] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [565] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [577] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [589] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [601] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [613] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [625] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [637] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [649] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [661] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [673] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [685] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [697] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [709] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [721] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [733] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [745] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [757] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [769] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [781] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [793] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [805] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [817] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [829] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [841] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [853] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [877] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [889] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [901] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [913] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [925] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [937] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [949] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [961] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [973] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [985] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [997] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1009] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1021] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1033] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1045] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1057] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1069] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1081] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1093] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1105] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1117] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1129] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1141] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1153] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1165] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1189] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1201] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1213] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1225] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1237] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1249] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1261] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1273] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1285] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1297] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1321] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1333] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1345] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1357] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1369] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1381] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1393] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1405] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1417] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1429] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1441] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1453] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1465] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1477] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1489] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1501] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1513] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1525] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1537] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1549] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1561] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1573] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1585] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1597] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1609] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1621] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1633] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1645] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1657] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1669] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1681] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1693] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1705] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1717] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1729] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1741] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1753] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1765] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1777] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1789] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1801] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1813] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1825] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1837] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1849] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1861] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1873] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1885] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1897] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1909] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1921] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1933] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1945] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1957] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1969] FALSE
We can insert this logical vector county.filter into the “row-part” of the square bracket [ , ]. Remember that the left side of the comma in [,] represents rows and the right side represents columns. If we write pov.data.sep[county.filter , ], it is identical to writing pov.data.sep[c(FALSE, TRUE, FALSE, ... , FALSE) , ], which means “give me all columns from pov.data.sep and the rows in the same positions as the TRUEs in county.filter”. As a result, you get the data.frame shown below.
# Filter 'pov.data.sep' using 'county.filter'
pov.data.4county <- pov.data.sep[county.filter, ]
# Check the data after filtering
head(pov.data.4county)
## GEO_ID tract county state total
## 340 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 2641
## 341 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 3573
## 342 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 4087
## 343 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 5962
## 344 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 6778
## 345 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 4090
## under00_50 under50_99
## 340 288 261
## 341 498 381
## 342 477 397
## 343 939 1097
## 344 443 1345
## 345 982 762
We have reduced the number of rows from 1969 (before the filtering) to 519 (after the filtering) while maintaining the number of columns.
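If you want to verify those counts yourself, nrow() reports the number of rows of a data.frame (the numbers in the comments are from this lab’s data; yours should match if you used the same file):
# Compare the number of rows before and after filtering
nrow(pov.data.sep)     # 1969 Census Tracts statewide
nrow(pov.data.4county) # 519 Census Tracts in the four counties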
Once you get more familiar with R, you don’t need to create county.filter first and then use it inside the square brackets; you can simply put it all together in one line of code. See below.
# This code is ...
county.filter <- pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County")
pov.data.4county <- pov.data.sep[county.filter, ]
# Identical to ...
pov.data.4county <- pov.data.sep[pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County"), ]
You can use whichever works better for you.
The variable “total” is the total population of each Census Tract. The variables “under00_50” and “under50_99” are the number of people whose income is between 0 and 50% of the poverty line and between 50% and 99% of the poverty line, respectively. Because what we want is the proportion of people whose income is lower than the poverty line, we first need to add “under00_50” and “under50_99” to get the total number of people under the poverty line and, second, divide that number by the total population.
Note that we are assigning the outcome of the calculation to a variable called p.pov, which is currently NOT present in the data.frame. This is how we create a new variable in a data.frame.
# Calculate percentages
pov.data.4county$p.pov <- (pov.data.4county$under00_50 + pov.data.4county$under50_99) / pov.data.4county$total
summary(pov.data.4county$p.pov)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.004575 0.069188 0.145023 0.174992 0.253455 0.805505 5
We are almost there! Notice that the summary function above shows that there are 5 NA’s in the newly created p.pov variable (to be specific, they are NaNs, not NAs, but the summary function doesn’t distinguish the two). This is due to the Census Tracts with 0 population. They are of no value to us because (1) there is no point in analyzing poverty where no one lives and (2) NAs and NaNs can cause some functions, such as mean, to misbehave (to see this, try mean(c(1,2,3,NA))).
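For example, a single NA makes mean() return NA unless you explicitly tell it to drop missing values (a quick toy demonstration, not the lab data):
mean(c(1, 2, 3, NA))               # returns NA because of the missing value
## [1] NA
mean(c(1, 2, 3, NA), na.rm = TRUE) # na.rm = TRUE drops the NA before averaging
## [1] 2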
We can filter out the NAs in the same way we filtered out counties in 1.5. This time, we will use the is.na() function. This function takes a vector and returns TRUE for elements that are either NA or NaN and FALSE otherwise. Because the function returns TRUE for NAs and NaNs, which are what we want to drop, we need to FLIP IT using the negation operator !. For example, !TRUE is FALSE and !FALSE is TRUE.
!is.na(pov.data.4county$p.pov) is the R way of saying "Give me a logical vector that has TRUEs where there are NAs in pov.data.4county$p.pov and FALSEs otherwise, and FLIP IT."
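Here is a short toy example (made-up values) of is.na() and the ! operator before we apply them to the lab data:
toy.vec <- c(1, NA, 3, NaN)
is.na(toy.vec)  # TRUE for both NA and NaN
## [1] FALSE  TRUE FALSE  TRUE
!is.na(toy.vec) # flipped: TRUE now marks the values we want to KEEP
## [1]  TRUE FALSE  TRUE FALSE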
# Filtering out NAs
na.filter <- !is.na(pov.data.4county$p.pov)
df.ready <- pov.data.4county[na.filter, ]
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
Can you see that the 5 NAs in the p.pov variable are no longer there? We have completed the data cleaning!
Using the cleaned data, we will calculate the z-score for the p.pov variable. The usefulness of the z-score may not be apparent at this moment, but z-scores are used extensively, and it is important to get some experience calculating them in R. If you forgot the equation for the z-score, see the lecture slides for Week 3 - Probability Distributions 1.
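For reference, the z-score of an individual value x is z = (x - mean) / sd, computed with the mean and standard deviation of p.pov.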
To calculate z-score, we first need to get the mean and the standard deviation of p.pov. Then, we need to subtract the mean from each value of p.pov and divide it by the standard deviation. We will store the z-score of p.pov back into df.ready by creating a new variable in df.ready named p.pov.z.
# Calculating mean
p.pov.mean <- mean(df.ready$p.pov)
# Calculating standard deviation
p.pov.sd <- sd(df.ready$p.pov)
# Calculate z-score and assign it into p.pov.z
df.ready$p.pov.z <- (df.ready$p.pov - p.pov.mean) / p.pov.sd
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
## p.pov.z
## Min. :-1.2919
## 1st Qu.:-0.8021
## Median :-0.2272
## Mean : 0.0000
## 3rd Qu.: 0.5948
## Max. : 4.7799
The summary function at the end shows that we successfully created a new variable called p.pov.z and that it has a mean of 0 (which is expected). Go ahead and check whether p.pov.z has a standard deviation of 1 by using the sd() function.
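If the calculation went as intended, the call below should return 1 (the standard deviation of a set of z-scores):
# The standard deviation of the z-scores should be 1
sd(df.ready$p.pov.z)
## [1] 1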
In real life, we usually use a function specifically designed to calculate z-scores: the scale() function.
# Calculating z-score using scale()
df.ready$p.pov.scale <- scale(df.ready$p.pov, center = TRUE, scale = TRUE)
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
## p.pov.z p.pov.scale.V1
## Min. :-1.2919 Min. :-1.291931
## 1st Qu.:-0.8021 1st Qu.:-0.802100
## Median :-0.2272 Median :-0.227197
## Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.5948 3rd Qu.: 0.594832
## Max. : 4.7799 Max. : 4.779929
In R, probability-related functions usually come in a group of four. For example, try running ?pnorm in your console to bring up a help page. You will see that, instead of bringing up a page just for pnorm(), R shows you a page that explains four related functions: dnorm(), pnorm(), qnorm(), and rnorm().
These four functions are related to the normal distribution, hence the suffix norm. If the distribution you need is the t-distribution, you can use dt, pt, qt, and rt. See the pattern?
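For example, the normal-distribution functions and their t-distribution counterparts are called the same way (a small sketch; the df value of 30 is an arbitrary choice for illustration):
pnorm(2)           # probability below quantile 2, standard normal distribution
qnorm(0.975)       # quantile at probability 0.975, standard normal distribution
pt(2, df = 30)     # the same idea for a t-distribution with 30 degrees of freedom
qt(0.975, df = 30) # and the corresponding quantile for that t-distribution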
Of the four prefixes (i.e., d, p, q, and r), the two most relevant to this course are p and q. Functions that start with p take a quantile and output the corresponding probability. Functions that start with q take a probability and output the corresponding quantile. It is okay if you don’t fully understand how these operations are useful at this moment; we will come back to this later in the semester. For now, let’s focus on learning these functions in R.
For the demonstration, I will use a normal distribution, specifically the standard normal distribution (i.e., the normal distribution with a mean of 0 and a standard deviation of 1). Let’s go back to the help page for pnorm() (i.e., ?pnorm). The help page states that:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
Notice that mean, sd, lower.tail, and log.p are each followed by an equals sign and a value (for example, mean = 0), while q is not followed by anything. This means that, when using the pnorm() function, you do not have to supply a value for mean (you can if you want to shift the mean of the distribution), because R will automatically use the default value, which is 0. Similarly, the default value for sd is 1, and so on. But q, which stands for quantile, has no default value specified. This means you must supply it yourself; otherwise the function will not run. So, unless you intentionally change the mean and sd settings, the pnorm() function uses the standard normal distribution (which, by definition, has a mean of 0 and a standard deviation of 1). I will use it that way.
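To make the idea of default arguments concrete, here is a small sketch (the mean and sd values in the last line are arbitrary, chosen only for illustration):
pnorm(2)                   # uses the defaults: mean = 0, sd = 1
pnorm(2, mean = 0, sd = 1) # exactly the same call with the defaults written out
pnorm(2, mean = 1, sd = 2) # a different, non-standard normal distribution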
Let’s try some examples. pnorm() takes a quantile and outputs a probability. I’d like to try supplying 2 as the input to pnorm(). See the image below to understand what this means.
# pnorm() takes quantile and outputs probability.
pnorm(2)
## [1] 0.9772499
pnorm(2) returned 0.9772499. This means that the blue area in the image is 0.9772499 (Note that the entire area under the curve in the image is 1). That is what we wanted to know!
Let’s try the counterpart function, qnorm(). Since we know that the probability at quantile 2 is 0.9772499, let’s see if qnorm(0.9772499) returns 2.
# qnorm() takes the probability and outputs the corresponding quantile.
qnorm(0.9772499)
## [1] 2.000001
We got 2.000001, which is very close to 2! Notice that the slight misalignment is caused by the rounding of the probability.