Continuing from the slides for Lab 3, this document will guide you through (1) reading the downloaded data into R, (2) cleaning the data, (3) calculating z-scores using the cleaned data, and (4) using some basic probability-related functions in R.
| Functions | Tasks |
|---|---|
| data.frame() | Creates a data.frame object |
| separate() | Turns a single character column into multiple columns |
| colnames() | Retrieves or sets the column names of a matrix-like object |
| is.na() | Identifies which elements are missing (i.e., NA or NaN) |
| scale() | Calculates z-scores from a vector |
| pnorm() | Takes a quantile value and outputs the probability using a normal distribution |
| qnorm() | Takes a probability value and outputs the quantile using a normal distribution |
Let’s first install the required packages in R. In addition to the functions in base R, we will use the separate function from the “tidyverse” package. Then we load the packages and read the data in. This is the data on poverty that we downloaded from the Census Bureau.
# Install required packages
install.packages("tidyverse")
# Call the packages using library()
library(tidyverse)
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 3\\Lab3\\ACSDT5Y2017.C17002_2020-08-29T152002") # Use your own pathname
pov.data <- read.csv("ACSDT5Y2017.C17002_data_with_overlays_2020-08-29T151945.csv")
# Check the data
head(pov.data)
## GEO_ID NAME C17002_001E
## 1 1400000US13001950100 Census Tract 9501, Appling County, Georgia 2807
## 2 1400000US13001950200 Census Tract 9502, Appling County, Georgia 4158
## 3 1400000US13001950300 Census Tract 9503, Appling County, Georgia 5673
## 4 1400000US13001950400 Census Tract 9504, Appling County, Georgia 1577
## 5 1400000US13001950500 Census Tract 9505, Appling County, Georgia 3722
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia 2106
## C17002_001M C17002_002E C17002_002M C17002_003E C17002_003M C17002_004E
## 1 356 158 122 225 179 119
## 2 515 375 240 777 353 698
## 3 491 807 441 925 410 510
## 4 226 224 137 86 58 123
## 5 463 434 273 424 207 299
## 6 255 147 113 418 234 369
## C17002_004M C17002_005E C17002_005M C17002_006E C17002_006M C17002_007E
## 1 105 250 190 181 137 99
## 2 402 246 176 255 161 65
## 3 300 496 322 270 220 190
## 4 76 38 28 113 51 26
## 5 262 218 145 229 164 105
## 6 224 168 97 198 115 23
## C17002_007M C17002_008E C17002_008M
## 1 91 1775 320
## 2 85 1742 360
## 3 218 2475 459
## 4 24 967 203
## 5 91 2013 328
## 6 26 783 183
In the real world, data you need to analyze almost always comes in messy or incomplete forms. Even data from highly systematized sources, such as the U.S. Census Bureau, often requires some amount of work to ‘get the data ready’. This process of getting the data ready is often called data cleaning.
The data we just read in has a few issues that need to be addressed before we can use it for any statistical analysis:

1. There are many columns other than the ones we are interested in. We want to exclude the variables that are not of interest to us.
2. We need to extract the names of the counties from the variable named ‘NAME’. We need to separate this variable into three pieces.
3. The names of the variables are not intuitive. We need to replace these codes with something more meaningful.
4. The data covers all counties in Georgia, while we are interested in only four counties around Atlanta (Fulton, DeKalb, Clayton, and Cobb). We need to exclude the other counties from the data.
5. The data we downloaded contains the total population and the number of people under poverty. We need to convert these counts into a percentage.
6. Finally, some Census Tracts have zero population (to see this, try min(pov.data$C17002_001E)), which will create an issue when we divide the number of people in poverty by the population to calculate the percent under poverty (e.g., 0/0 is NaN in R, which stands for ‘Not a Number’; see the quick demonstration below). These values need to be excluded from the data.
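As a quick demonstration of the last point (using toy values, not the lab data), here is how NaN shows up in R:
# Toy examples (not the lab data): how NaN arises and how R flags it
0 / 0        # dividing zero by zero produces NaN ("Not a Number")
## [1] NaN
is.na(NaN)   # is.na() treats NaN as a missing value, too
## [1] TRUE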
We are interested in the proportion of people whose income is lower than the poverty line and whether that differs across counties, but there are many other columns containing data that we do not need. To find out which variables we need, we must know what “C17002_001E”, “C17002_002E”, etc. mean. We need to consult the data dictionary that came together with the data file. The variables we need are (1) the two ID variables for each Census Tract (i.e., GEO_ID and NAME), (2) the total population, (3) the number of people whose income is under 50% of the poverty line, and (4) the number of people whose income is between 50% and 99% of the poverty line. The names of the variables we are interested in are highlighted in yellow in the image below.
Before we exclude variables and retain only those that we need, let’s recap how to do subsetting in R.
Subsetting and indexing a vector or a dataframe: Subsetting a vector or a dataframe in R can be done using square brackets []. When you put [] after a vector or a dataframe, it means you are about to extract some parts of it. If it is a vector (which is one-dimensional), you only need to specify one index to subset it. See the examples below.
my.vec <- c(1,3,5,7,9,11)
my.vec[3] # This returns 5, which is in 3rd position of the vector
## [1] 5
my.vec[c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)] # This returns 1, 7, 9: the values in the positions where the TRUEs are. Think of the TRUEs and FALSEs as filters: TRUE lets a value through and FALSE blocks it. As a result, you get the values of the vector that sit in the same positions as the TRUEs inside the square brackets.
## [1] 1 7 9
If it is a dataframe, which is 2-dimensional with rows and columns, you need to supply two indices to parse it, one for rows and one for columns. The format is like this: dataframe[ index for row, index for column ]. See examples below and read the comments (hashtags) carefully.
my.df <- data.frame(a = c("a", "b", "c", "d"),
b = c(1, 2, 3, 4),
Avengers = c("Peter", "Natasha", "Hulk", "Thor"),
stringsAsFactors = FALSE)
my.df # See what the data.frame looks like before subsetting
## a b Avengers
## 1 a 1 Peter
## 2 b 2 Natasha
## 3 c 3 Hulk
## 4 d 4 Thor
my.df[1 , 3] # This returns the 1st row of the 3rd column, which is "Peter"
## [1] "Peter"
my.df[2:3 , 1] # This returns 2nd and 3rd rows of the first column, which are "b", "c"
## [1] "b" "c"
my.df[ , "Avengers"] # This returns all rows in the column named "Avengers". If the row-part in the square bracket is empty, it means all rows. If columns-part in the square bracket is empty, it mean all columns.
## [1] "Peter" "Natasha" "Hulk" "Thor"
my.df[c(FALSE, TRUE, FALSE, TRUE) , "Avengers"] # This will return the 2nd and 4th rows (which are where TRUEs are located) of the column "Avengers", which are Natasha and Thor.
## [1] "Natasha" "Thor"
Now we know which variables we need to retain and how to subset a data frame. The code below is the R way of saying “Give me all rows (note the empty part before the comma in the square bracket!) and the variables named GEO_ID, NAME, C17002_001E, C17002_002E, and C17002_003E from a data.frame called pov.data, and put the result in pov.data.var.” Note that if you don’t assign the subsetted data.frame to an R object, R will simply print the subsetted data.frame in the console window and it will disappear.
pov.data.var <- pov.data[ , c("GEO_ID", "NAME", "C17002_001E", "C17002_002E", "C17002_003E") ]
head(pov.data.var)
## GEO_ID NAME C17002_001E
## 1 1400000US13001950100 Census Tract 9501, Appling County, Georgia 2807
## 2 1400000US13001950200 Census Tract 9502, Appling County, Georgia 4158
## 3 1400000US13001950300 Census Tract 9503, Appling County, Georgia 5673
## 4 1400000US13001950400 Census Tract 9504, Appling County, Georgia 1577
## 5 1400000US13001950500 Census Tract 9505, Appling County, Georgia 3722
## 6 1400000US13003960100 Census Tract 9601, Atkinson County, Georgia 2106
## C17002_002E C17002_003E
## 1 158 225
## 2 375 777
## 3 807 925
## 4 224 86
## 5 434 424
## 6 147 418
We want to compare the poverty rates of different counties, and to do that we need a variable that tells us which county a given Census Tract falls into. The name of the county is embedded in the variable called ‘NAME’. Let’s extract the county names from this variable. The ‘NAME’ variable is formatted in the following way: “Census Tract 9501, Appling County, Georgia”. If we break it at every comma, we will have “Census Tract 9501”, “Appling County”, and “Georgia” all separated out, which can then be stored in three different variables. The schematic below illustrates the operation we want to do.
The separate function takes four arguments (arguments are what you supply to functions as inputs): (1) the name of the data, (2) the name of the variable to be parsed, (3) a list of columns to be created as the result, and (4) the character string that will be used as the separator. Notice that there is a space after the comma in sep = ", ". If we forget this space, all values in the newly created variables will be preceded by a space, because R treats a space as a character too if it is inside a character string (see the toy illustration below).
Note that the separate function is part of the tidyverse package. You need to load the package in order to use this function.
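Before applying it to our data, here is a small toy illustration (a made-up one-row data.frame, not the lab data) of why that space in sep matters:
# Toy example: splitting on "," versus ", "
toy <- data.frame(NAME = "Census Tract 9501, Appling County, Georgia",
                  stringsAsFactors = FALSE)
separate(data = toy, col = NAME, into = c("tract", "county", "state"), sep = ",")
# 'county' and 'state' keep a leading space because only the comma was removed
separate(data = toy, col = NAME, into = c("tract", "county", "state"), sep = ", ")
# splitting on ", " removes the comma and the space together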
# Separate the 'NAME' variable into three variables at a comma and a space
pov.data.sep <- separate(
data = pov.data.var, # 1st: specify the data.frame
col = NAME, # 2nd: specify the name of the variable you want to parse
into = c("tract", "county", "state"), # 3rd: a list of variable names that will be created as the result of parsing
sep = ", ") # 4th: character string that will be used as the separator
As before, you need to assign the output of the separate function to an R object. Otherwise, the function will just print its output in the console window and the output will disappear. Let’s examine what the separate function has produced.
head(pov.data.sep)
## GEO_ID tract county state C17002_001E
## 1 1400000US13001950100 Census Tract 9501 Appling County Georgia 2807
## 2 1400000US13001950200 Census Tract 9502 Appling County Georgia 4158
## 3 1400000US13001950300 Census Tract 9503 Appling County Georgia 5673
## 4 1400000US13001950400 Census Tract 9504 Appling County Georgia 1577
## 5 1400000US13001950500 Census Tract 9505 Appling County Georgia 3722
## 6 1400000US13003960100 Census Tract 9601 Atkinson County Georgia 2106
## C17002_002E C17002_003E
## 1 158 225
## 2 375 777
## 3 807 925
## 4 224 86
## 5 434 424
## 6 147 418
The code worked perfectly! Now the ‘tract’ variable contains only the census tract ID, the ‘county’ variable contains only the county names, and so on.
The names of the variables are still in the “C17002_001E” form, which is fairly meaningless and a bit long. Let’s give them shorter, interpretable names. We can access the names of the variables in a dataframe using the colnames function.
colnames(pov.data.sep)
## [1] "GEO_ID" "tract" "county" "state" "C17002_001E"
## [6] "C17002_002E" "C17002_003E"
We can see that the first 4 variables have acceptable names. It is the 5th through 7th variables that are the problem. Using the same subsetting and indexing technique, let’s give those problematic variables new names. The following code means “Replace the 5th, 6th, and 7th names in the list of variable names in the data.frame pov.data.sep with new names: total, under00_50, and under50_99.”
# Rename variables
colnames(pov.data.sep)[5:7] <- c("total", "under00_50", "under50_99")
head(pov.data.sep)
## GEO_ID tract county state total
## 1 1400000US13001950100 Census Tract 9501 Appling County Georgia 2807
## 2 1400000US13001950200 Census Tract 9502 Appling County Georgia 4158
## 3 1400000US13001950300 Census Tract 9503 Appling County Georgia 5673
## 4 1400000US13001950400 Census Tract 9504 Appling County Georgia 1577
## 5 1400000US13001950500 Census Tract 9505 Appling County Georgia 3722
## 6 1400000US13003960100 Census Tract 9601 Atkinson County Georgia 2106
## under00_50 under50_99
## 1 158 225
## 2 375 777
## 3 807 925
## 4 224 86
## 5 434 424
## 6 147 418
See that the new names are now applied to the data.frame, and we can immediately understand what each variable represents.
Since we are interested in only Fulton, DeKalb, Clayton, and Cobb Counties, let’s drop the rows that contain data for other counties but keep all the columns. Schematically, what we want to do is as follows:
We can use the %in% operator to do the job. The %in% operator checks whether elements belong to a vector. It returns TRUE if the element we are checking (i.e., the thing on the left of %in%) is in the vector we are checking it against (i.e., the thing on the right of %in%). See the example below:
# Create a toy vector
my.vec <- c(1,2,3,10,20,30)
1 %in% my.vec # This returns one TRUE because there is 1 in my.vec
## [1] TRUE
15 %in% my.vec # This returns FALSE because my.vec does not contain 15
## [1] FALSE
c(2,3,40) %in% my.vec # This gives TRUE, TRUE, FALSE because the first two elements are in my.vec while the last isn't
## [1] TRUE TRUE FALSE
The following code tests whether each element of pov.data.sep$county is in the vector c("Fulton County", "DeKalb County", "Cobb County", "Clayton County"), returns a vector of TRUEs and FALSEs, and stores it in an object called county.filter.
If the first row of pov.data.sep$county is one of the four counties, county.filter gets TRUE as its first element. If the second row of pov.data.sep$county is NOT one of the four counties, the second element of county.filter gets FALSE, and so on.
# Create a logical vector in which rows that have the four counties in 'county' column gets TRUE and otherwise FALSE
county.filter <- pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County")
county.filter # In this county.filter vector, there are as many elements as the number of rows in pov.data.sep
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [349] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [373] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [397] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [409] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [421] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [433] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [445] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [457] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [469] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [481] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [493] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [505] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## [517] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [541] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [553] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [565] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [577] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [589] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [601] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [613] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [625] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [637] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [649] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [661] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [673] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [685] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [697] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [709] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [721] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [733] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [745] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [757] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [769] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [781] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [793] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [805] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [817] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [829] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [841] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [853] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [877] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [889] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [901] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [913] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [925] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [937] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [949] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [961] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [973] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [985] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [997] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1009] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1021] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1033] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1045] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1057] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1069] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1081] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1093] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1105] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1117] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1129] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1141] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1153] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1165] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1189] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1201] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1213] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1225] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1237] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1249] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1261] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1273] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1285] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1297] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1321] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1333] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1345] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1357] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1369] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1381] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1393] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1405] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1417] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1429] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1441] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1453] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1465] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1477] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1489] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1501] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1513] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1525] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1537] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1549] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1561] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1573] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1585] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1597] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1609] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1621] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1633] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1645] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1657] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1669] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1681] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1693] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1705] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1717] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1729] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1741] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1753] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1765] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1777] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1789] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1801] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1813] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1825] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1837] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1849] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1861] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1873] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1885] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1897] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1909] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1921] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1933] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1945] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1957] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1969] FALSE
We can insert this logical vector county.filter into the “row-part” of the square bracket [ , ]. Remember that the left side of the comma in [,] represents rows and the right side represents columns. If we write pov.data.sep[county.filter , ], it is identical to writing pov.data.sep[c(FALSE, TRUE, FALSE, ... , FALSE) , ], which means “give me all columns from pov.data.sep and the rows in the same positions as the TRUEs in county.filter”. As a result, you get the data.frame shown below.
# Filter 'pov.data.sep' using 'county.filter'
pov.data.4county <- pov.data.sep[county.filter, ]
# Check the data after filtering
head(pov.data.4county)
## GEO_ID tract county state total
## 340 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 2641
## 341 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 3573
## 342 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 4087
## 343 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 5962
## 344 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 6778
## 345 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 4090
## under00_50 under50_99
## 340 288 261
## 341 498 381
## 342 477 397
## 343 939 1097
## 344 443 1345
## 345 982 762
We have reduced the number of rows from 1969 (before the filtering) to 519 (after the filtering) while maintaining the number of columns.
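If you want to verify those counts yourself, nrow() reports the number of rows of a data.frame (the numbers in the comments are from this lab’s data; yours should match if you used the same file):
# Compare the number of rows before and after filtering
nrow(pov.data.sep)     # 1969 Census Tracts statewide
nrow(pov.data.4county) # 519 Census Tracts in the four counties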
Once you get more familiar with R, you don’t need to create county.filter first and then use it inside the square brackets; you can simply put it all together in one line of code. See below.
# This code is ...
county.filter <- pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County")
pov.data.4county <- pov.data.sep[county.filter, ]
# Identical to ...
pov.data.4county <- pov.data.sep[pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County"), ]
You can use whichever works better for you.
The variable “total” is the total population of each Census Tract. The variables “under00_50” and “under50_99” are the number of people whose income is between 0 and 50% of the poverty line and between 50% and 99% of the poverty line, respectively. Because what we want is the proportion of people whose income is lower than the poverty line, we first need to add “under00_50” and “under50_99” to get the total number of people under the poverty line and, second, divide that number by the total population.
Note that we are assigning the outcome of the calculation to a variable called p.pov, which is currently NOT present in the data.frame. This is how we create a new variable in a data.frame.
# Calculate percentages
pov.data.4county$p.pov <- (pov.data.4county$under00_50 + pov.data.4county$under50_99) / pov.data.4county$total
summary(pov.data.4county$p.pov)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.004575 0.069188 0.145023 0.174992 0.253455 0.805505 5
We are almost there! Notice that the summary function above shows that there are 5 NA’s in the newly created p.pov variable (to be specific, they are NaNs, not NAs, but the summary function doesn’t distinguish the two). This is due to the Census Tracts with 0 population. They are of no value to us because (1) there is no point in analyzing poverty where no one lives and (2) NAs and NaNs can cause some functions, such as mean, to misbehave (to see this, try mean(c(1,2,3,NA))).
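For example, a single NA makes mean() return NA unless you explicitly tell it to drop missing values (a quick toy demonstration, not the lab data):
mean(c(1, 2, 3, NA))               # returns NA because of the missing value
## [1] NA
mean(c(1, 2, 3, NA), na.rm = TRUE) # na.rm = TRUE drops the NA before averaging
## [1] 2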
We can filter out the NAs in the same way we filtered out counties in 1.5. This time, we will use the is.na() function. This function takes a vector and returns TRUE for elements that are either NA or NaN and FALSE otherwise. Because the function returns TRUE for NAs and NaNs, which are what we want to drop, we need to FLIP IT using the negation operator !. For example, !TRUE is FALSE and !FALSE is TRUE.
!is.na(pov.data.4county$p.pov) is the R way of saying "Give me a logical vector that has TRUEs where there are NAs in pov.data.4county$p.pov and FALSEs otherwise, and FLIP IT."
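Here is a short toy example (made-up values) of is.na() and the ! operator before we apply them to the lab data:
toy.vec <- c(1, NA, 3, NaN)
is.na(toy.vec)  # TRUE for both NA and NaN
## [1] FALSE  TRUE FALSE  TRUE
!is.na(toy.vec) # flipped: TRUE now marks the values we want to KEEP
## [1]  TRUE FALSE  TRUE FALSE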
# Filtering out NAs
na.filter <- !is.na(pov.data.4county$p.pov)
df.ready <- pov.data.4county[na.filter, ]
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
Can you see that the 5 NAs in the p.pov variable are no longer there? We have completed the data cleaning!
Using the cleaned data, we will calculate the z-score for the p.pov variable. The usefulness of the z-score may not be apparent at this moment, but z-scores are used extensively, and it is important to get some experience calculating them in R. If you forgot the equation for the z-score, see the lecture slides for Week 3 - Probability Distributions 1.
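For reference, the z-score of an individual value x is z = (x - mean) / sd, computed with the mean and standard deviation of p.pov.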
To calculate z-score, we first need to get the mean and the standard deviation of p.pov. Then, we need to subtract the mean from each value of p.pov and divide it by the standard deviation. We will store the z-score of p.pov back into df.ready by creating a new variable in df.ready named p.pov.z.
# Calculating mean
p.pov.mean <- mean(df.ready$p.pov)
# Calculating standard deviation
p.pov.sd <- sd(df.ready$p.pov)
# Calculate z-score and assign it into p.pov.z
df.ready$p.pov.z <- (df.ready$p.pov - p.pov.mean) / p.pov.sd
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
## p.pov.z
## Min. :-1.2919
## 1st Qu.:-0.8021
## Median :-0.2272
## Mean : 0.0000
## 3rd Qu.: 0.5948
## Max. : 4.7799
The summary function at the end shows that we successfully created a new variable called p.pov.z and that it has a mean of 0 (which is expected). Go ahead and check whether p.pov.z has a standard deviation of 1 by using the sd() function.
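If the calculation went as intended, the call below should return 1 (the standard deviation of a set of z-scores):
# The standard deviation of the z-scores should be 1
sd(df.ready$p.pov.z)
## [1] 1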
In real life, we usually use a function specifically designed to calculate z-scores: the scale() function.
# Calculating z-score using scale()
df.ready$p.pov.scale <- scale(df.ready$p.pov, center = TRUE, scale = TRUE)
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
## p.pov.z p.pov.scale.V1
## Min. :-1.2919 Min. :-1.291931
## 1st Qu.:-0.8021 1st Qu.:-0.802100
## Median :-0.2272 Median :-0.227197
## Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.5948 3rd Qu.: 0.594832
## Max. : 4.7799 Max. : 4.779929
In R, probability-related functions usually come in a group of four. For example, try running ?pnorm in your console to bring up a help page. You will see that, instead of bringing up a page just for pnorm(), R shows you a page that explains four related functions: dnorm(), pnorm(), qnorm(), and rnorm().
These four functions are related to the normal distribution, hence the suffix norm. If the distribution you need is the t-distribution, you can use dt, pt, qt, and rt. See the pattern?
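For example, the normal-distribution functions and their t-distribution counterparts are called the same way (a small sketch; the df value of 30 is an arbitrary choice for illustration):
pnorm(2)           # probability below quantile 2, standard normal distribution
qnorm(0.975)       # quantile at probability 0.975, standard normal distribution
pt(2, df = 30)     # the same idea for a t-distribution with 30 degrees of freedom
qt(0.975, df = 30) # and the corresponding quantile for that t-distribution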
Of the four prefixes (i.e., d, p, q, and r), the two most relevant to this course are p and q. Functions that start with p take a quantile and output the corresponding probability. Functions that start with q take a probability and output the corresponding quantile. It is okay if you don’t fully understand how these operations are useful at this moment; we will come back to this later in the semester. For now, let’s focus on learning these functions in R.
For the demonstration, I will use a normal distribution, specifically the standard normal distribution (i.e., the normal distribution with a mean of 0 and a standard deviation of 1). Let’s go back to the help page for pnorm() (i.e., ?pnorm). The help page states that:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
Notice that mean, sd, lower.tail, and log.p are each followed by an equals sign and a value (for example, mean = 0), while q is not followed by anything. This means that, when using the pnorm() function, you do not have to supply a value for mean (you can if you want to shift the mean of the distribution), because R will automatically use the default value, which is 0. Similarly, the default value for sd is 1, and so on. But q, which stands for quantile, has no default value specified. This means you must supply it yourself; otherwise the function will not run. So, unless you intentionally change the mean and sd settings, the pnorm() function uses the standard normal distribution (which, by definition, has a mean of 0 and a standard deviation of 1). I will use it that way.
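To make the idea of default arguments concrete, here is a small sketch (the mean and sd values in the last line are arbitrary, chosen only for illustration):
pnorm(2)                   # uses the defaults: mean = 0, sd = 1
pnorm(2, mean = 0, sd = 1) # exactly the same call with the defaults written out
pnorm(2, mean = 1, sd = 2) # a different, non-standard normal distribution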
Let’s try some examples. pnorm() takes a quantile and outputs a probability. I’d like to try supplying 2 as the input to pnorm(). See the image below to understand what this means.
# pnorm() takes quantile and outputs probability.
pnorm(2)
## [1] 0.9772499
pnorm(2) returned 0.9772499. This means that the blue area in the image is 0.9772499 (Note that the entire area under the curve in the image is 1). That is what we wanted to know!
Let’s try the counterpart function, qnorm(). Since we know that the probability at quantile 2 is 0.9772499, let’s see if qnorm(0.9772499) returns 2.
# qnorm() takes the probability and outputs the corresponding quantile.
qnorm(0.9772499)
## [1] 2.000001
We got 2.000001, which is very close to 2! Notice that the slight misalignment is caused by the rounding of the probability.