Continuing from the slide of Lab 3 and the cleaned poverty data from Lab2, this document will guide you through (1) saving the cleaned poverty data to local folder, (2) examining summary statistics including mean, median, variance, and range, (3) calculate z-score, (4) some basic probability-related functions in R, and (5) drawing some basic charts and graphs to visualize our data.

Functions Tasks
write.csv() Save the csv data to your local folder
sum() Return the sum of all elements
length() Returns the count of elements. Works for vectors and lists
summary() Returns central tendency statistics for the provided R object
var() Returns variance of a vector, matrix, or data frame
sd() Returns standard deviation of a vector
table() Returns a table with labels and associated occurences of each label
hist() Generates a histogram of the data in the given vector
boxplot() Generates a boxplot of the data in the given data frame or a list
scale() Calculates z-score using a vector
pnorm() Takes the quantile value and outputs the probability using a normal distribution
qnorm() Takes the probability value and outputs the quantile using a normal distribution

1. Review data cleaning and save csv data

We will be using the same poverty data from Census in Lab2. To continue using the same data, it is smart to save the cleaned data in to a local folder, so that you don’t have to go through the cleaning process again if you want to use the data for future analyses. Similar to read.csv(), the write.csv() function allows you to export your data to a local path.

write.csv("PATH TO A LOCAL FOLDER")

To read the data again, you just need to input the same path to the read.csv() function.

read.csv("PATH TO A LOCAL FOLDER")

To recreate the cleaned data in base R (except the separate function from tidyverse), you can rerun the following codes to get the cleaned data.

# load necessary library
library(tidyverse)

# set the path (replace the path to your own path)
setwd("/Users/xiaofanliang/Dropbox (GaTech)/GT_Academics/CP6025_TA/Lab/Lab3/") 

# read the data 
pov.data <- read.csv("ACSDT5Y2017.C17002-Data.csv")

# subset dataframe to selected columns 
pov.data.var <- pov.data[ , c("GEO_ID", "NAME", "C17002_001E", "C17002_002E", "C17002_003E") ] 

# separate 'NAME' variable into three variables at a comma and a space
pov.data.sep <- separate(
  data = pov.data.var,                  # 1st: specify the data.frame
  col = NAME,                           # 2nd: specify the name of the variable you want to parse
  into = c("tract", "county", "state"), # 3rd: a list of variable names that will be created as the result of parsing
  sep = ", ")                           # 4th: character string that will be used as the separator

# rename column 5 to 7 
colnames(pov.data.sep)[5:7] <- c("total", "under00_50", "under50_99")

# filter census tracts to those in the Atlanta four counties 
pov.data.4county <- pov.data.sep[pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County"), ]

# create a new variable called p.pov that represent **the proportion of people whose income is lower than the poverty line**
pov.data.4county$p.pov <- (pov.data.4county$under00_50 + pov.data.4county$under50_99) / pov.data.4county$total

# remove NAs in the p.pov variable 
na.filter <- !is.na(pov.data.4county$p.pov)
df.ready <- pov.data.4county[na.filter, ]

head(df.ready)
##                   GEO_ID               tract         county   state total
## 340 1400000US13063040202 Census Tract 402.02 Clayton County Georgia  2641
## 341 1400000US13063040203 Census Tract 402.03 Clayton County Georgia  3573
## 342 1400000US13063040204 Census Tract 402.04 Clayton County Georgia  4087
## 343 1400000US13063040302 Census Tract 403.02 Clayton County Georgia  5962
## 344 1400000US13063040303 Census Tract 403.03 Clayton County Georgia  6778
## 345 1400000US13063040306 Census Tract 403.06 Clayton County Georgia  4090
##     under00_50 under50_99     p.pov
## 340        288        261 0.2078758
## 341        498        381 0.2460118
## 342        477        397 0.2138488
## 343        939       1097 0.3414961
## 344        443       1345 0.2637946
## 345        982        762 0.4264059

The tidyverse syntax for the same cleaning process is the following:

# load necessary library
library(tidyverse)

# set the path (replace the path to your own path)
setwd("/Users/xiaofanliang/Dropbox (GaTech)/GT_Academics/CP6025_TA/Lab/Lab3/") 

# read the data 
pov.data <- read.csv("ACSDT5Y2017.C17002-Data.csv")

df.ready <- pov.data %>% 
  # subset dataframe to selected columns 
  select(c(GEO_ID, NAME, C17002_001E, C17002_002E, C17002_003E)) %>%
  # separate 'NAME' variable into three variables at a comma and a space
  separate(col=NAME, into = c("tract", "county", "state"), sep = ", ") %>% 
  # rename column C17002_001E, C17002_002E, C17002_003E
  rename(total = C17002_001E, under00_50 = C17002_002E, under50_99 = C17002_003E) %>% 
  # filter census tracts to those in the Atlanta four counties 
  filter(county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County")) %>% 
  # create a new variable called p.pov that represent **the proportion of people whose income is lower than the poverty line**
  mutate(p.pov = (under00_50 + under50_99) / total) %>% 
  # remove NAs in the p.pov variable 
  drop_na(p.pov)

head(df.ready)
##                 GEO_ID               tract         county   state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia  2641
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia  3573
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia  4087
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia  5962
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia  6778
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia  4090
##   under00_50 under50_99     p.pov
## 1        288        261 0.2078758
## 2        498        381 0.2460118
## 3        477        397 0.2138488
## 4        939       1097 0.3414961
## 5        443       1345 0.2637946
## 6        982        762 0.4264059

To save df.ready so that we don’t have to run data cleaning again, we will use the write.csv function. The first argument of write.csv() takes in a dataframe, and the second argument file takes in a path that includes the name of the exported dataframe and the .csv extension. It is recommended to not have dot (i.e., .) in the file name to avoid confusion with the extension .csv. Without setting row.names=FALSE, the saved csv will automatically create a column that saves the row index (i.e., row name) of the dataframe. We do not want that additional column. The file will be saved at the path saved in setwd(). You can also specify a different path in the file argument.

NOTE If you already exported a dfReady.csv data file in your Lab3 folder, the write.csv() will overwrite the file under the same name automatically.

# replace the path to your own path
write.csv(df.ready, file='dfReady.csv', row.names = FALSE)

To read the data we just saved, run the following line and you will see that our df.ready dataframe is back!

# read data
df.ready <- read.csv('dfReady.csv')
head(df.ready)
##                 GEO_ID               tract         county   state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia  2641
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia  3573
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia  4087
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia  5962
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia  6778
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia  4090
##   under00_50 under50_99     p.pov
## 1        288        261 0.2078758
## 2        498        381 0.2460118
## 3        477        397 0.2138488
## 4        939       1097 0.3414961
## 5        443       1345 0.2637946
## 6        982        762 0.4264059

2. Calculate mean, variance, and standard deviation

We have already started to use the short-cut methods (e.g., mean()) in R for calculating central tendencies in Lab2. Let’s try to doing it manually so you have a better insight as to what is happening behind the scene.

2.1 Mean

First, let’s calculate the mean of df.ready$p.pov. Here is the equation for calculating mean: \(\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right )\). The first part of the equation we need to solve is \(\left (\sum_{i=1}^n{x_i}\right )\), which is mathematical way of saying “sum every element in x”. In R, this equation can be written as:

sum.p.pov <- sum(df.ready$p.pov)
sum.p.pov
## [1] 89.9458

Second, the equation \(\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right )\) suggests that sum.hinc needs to be multiplied by \(\frac{1}{n}\), where n is the number of elements in df.ready$p.pov. You can get the number of elements using length().

1 / length(df.ready$p.pov) * sum.p.pov
## [1] 0.1749918

There you have it! Let’s also use R function mean() to check if the manually calculated mean is correct.

mean(df.ready$p.pov)
## [1] 0.1749918

Matches perfectly! It shows that, on average across all census tracts, the proportion of people whose income is lower than the poverty line is 18.5%.

2.2 Variance

First, let’s calculate variance using the same df.ready and the mean we just calculated. Like we did for mean, let’s break down the equation for variance and tackle it piece by piece. The equation is: \(\sigma_Y^2 = \frac {1}{n-1} \sum_{i=1}^n \left(Y_i - \overline{Y} \right)^2\)

let’s focus on \(\left(Y_i - \overline{Y} \right)^2\) part. The following code calculates it and stores it in an object called parenthesis.

parenthesis <- (df.ready$p.pov - mean(df.ready$p.pov))^2 # ^2 means to square it
head(parenthesis) # Prints the first six elements in parenthesis
## [1] 0.001081356 0.005043830 0.001509863 0.027723686 0.007885937 0.063209019

Second, we are done with \(\left(Y_i - \overline{Y} \right)^2\) part. Now, we need to sum all numbers in this vector to finish \(\sum_{i=1}^n \left(Y_i - \overline{Y} \right)^2\) part.

sum.parenthesis <- sum(parenthesis)
sum.parenthesis
## [1] 8.92611

Third, the last thing we need to do is to divide sum.parenthesis by \(n-1\), where \(n\) is the number of elements in the vector. You can get the number of elements using length(). Let’s also use R function var() to see if the numbers match.

sum.parenthesis / (length(df.ready$p.pov) - 1)
## [1] 0.01739983
var(df.ready$p.pov)
## [1] 0.01739983

2.3 Summary statistics in base R

The following codes show the base R functions that calculate the summary statistics of a vector of values.

mean(df.ready$p.pov) # This function outputs mean of the given vector
## [1] 0.1749918
var(df.ready$p.pov) # This function outputs variance of the given vector
## [1] 0.01739983
sd(df.ready$p.pov) # This function outputs standard deviation of the given vector
## [1] 0.1319084
summary(df.ready$p.pov) # This function outputs various descriptive statistics. Note that summary() sometimes rounds up the numbers!
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.004575 0.069188 0.145023 0.174992 0.253455 0.805505

NOTE using summary() to view the descriptive statistics can sometimes be misleading. This is because summary() rounds up the results (see the difference between mean(df.ready$p.pov) and the mean from summary(df.ready$p.pov)). If you need to generate precise statistics (e.g., for your option paper or thesis), I recommend using mean(), min(), max(), median(), etc. instead of summary().

3. Z-Scores

Using the cleaned data, we will calculate z-score for p.pov variable. Perhaps the usefulness of z-score may be not be apparent at this moment. But z-score is used extensively, and it is important that we have an experience of calculating z-scores in R. If you forgot the equation for z-score, see the lab 3 lecture slide.

To calculate z-score, we first need to get the mean and the standard deviation of p.pov. Then, we need to subtract the mean from each value of p.pov and divide it by the standard deviation. We will store the z-score of p.pov back into df.ready by creating a new variable in df.ready named p.pov.z.

# Calculating mean
p.pov.mean <- mean(df.ready$p.pov)

# Calculating standard deviation
p.pov.sd <- sd(df.ready$p.pov)

# Calculate z-score and assign it into p.pov.z
df.ready$p.pov.z <- (df.ready$p.pov - p.pov.mean) / p.pov.sd
summary(df.ready)
##     GEO_ID             tract              county             state          
##  Length:514         Length:514         Length:514         Length:514        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      total         under00_50       under50_99         p.pov         
##  Min.   :  136   Min.   :   0.0   Min.   :   0.0   Min.   :0.004575  
##  1st Qu.: 3539   1st Qu.: 147.2   1st Qu.: 139.5   1st Qu.:0.069188  
##  Median : 4874   Median : 284.0   Median : 338.5   Median :0.145023  
##  Mean   : 5248   Mean   : 377.6   Mean   : 442.7   Mean   :0.174992  
##  3rd Qu.: 6497   3rd Qu.: 520.2   3rd Qu.: 623.5   3rd Qu.:0.253455  
##  Max.   :17857   Max.   :2040.0   Max.   :2234.0   Max.   :0.805505  
##     p.pov.z       
##  Min.   :-1.2919  
##  1st Qu.:-0.8021  
##  Median :-0.2272  
##  Mean   : 0.0000  
##  3rd Qu.: 0.5948  
##  Max.   : 4.7799

The summary function at the end shows that we successfully created a new variable called p.pov.z and that it has the mean of 0 (which is expected). Go ahead and check if p.pov.z has the standard deviation of 1 by using sd() function.

In the real life, we usually use a function specifically designed to calculate z-score, scale() function.

# Calculating z-score using scale()
df.ready$p.pov.scale <- scale(df.ready$p.pov, center = TRUE, scale = TRUE)
summary(df.ready)
##     GEO_ID             tract              county             state          
##  Length:514         Length:514         Length:514         Length:514        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      total         under00_50       under50_99         p.pov         
##  Min.   :  136   Min.   :   0.0   Min.   :   0.0   Min.   :0.004575  
##  1st Qu.: 3539   1st Qu.: 147.2   1st Qu.: 139.5   1st Qu.:0.069188  
##  Median : 4874   Median : 284.0   Median : 338.5   Median :0.145023  
##  Mean   : 5248   Mean   : 377.6   Mean   : 442.7   Mean   :0.174992  
##  3rd Qu.: 6497   3rd Qu.: 520.2   3rd Qu.: 623.5   3rd Qu.:0.253455  
##  Max.   :17857   Max.   :2040.0   Max.   :2234.0   Max.   :0.805505  
##     p.pov.z          p.pov.scale.V1   
##  Min.   :-1.2919   Min.   :-1.291931  
##  1st Qu.:-0.8021   1st Qu.:-0.802100  
##  Median :-0.2272   Median :-0.227197  
##  Mean   : 0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.5948   3rd Qu.: 0.594832  
##  Max.   : 4.7799   Max.   : 4.779929

4. Probability Distributions

In R, probability-related functions usually come in a group of four. For example, try running ?pnorm in your console to bring up a help page. You will see that, instead of bringing up a page just for pnorm(), R will show you a page that explains four similar functions, including dnorm(), pnorm(), qnorm(), and rnorm().

These four functions are related with the normal distribution, hence the suffix norm. If the distribution you need is t-distribution, you can use dt, pt, qt, and rt. See the pattern?

Of the four prefixes (i.e., d, p, q, and r), two that are most relevant to this course are the ones that start with p and q. Ones that start with p takes the quantile (quantile is a word used to define a particular part of a data set or distribution) and outputs the corresponding probability. Ones that start with q takes the probability and outputs the corresponding quantile. It is okay if we don’t fully understand how these operations can be useful at this moment, that is okay. We will come back to this later in the semester. For now, let’s focus on learning these functions in R.

For the demonstration, I will use a normal distribution, particularly the standard normal distribution (i.e., the normal distribution with the mean of 0 and the standard deviation of 1). Let’s go back to the help page for pnorm() (i.e., ?pnorm). The help page shows states that:

pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)

Notice that mean, sd, lower.tail, and log.p are followed by an equal sign and some values. For example, mean = 0. Also notice that q is not followed by anything. This means that, when you are using pnorm() function, you do not have to supply arguments for mean (you can if you want to shift the mean of the distribution), because R will automatically supply the default value, which is 0. Similarly, the default value for sd is 1, and so on. But q, which stands for quantile does not have the default value specified. This means you must supply that value; otherwise the function will not run. So, if you do not intentionally change the mean and sd setting, pnorm() function will use the standard normal distribution (which by definition has the mean of 0 and the standard deviation of 1). I will use it that way.

Let’s try some examples. pnorm() takes quantile and outputs probability. I’d like to try supplying 2 as the input to pnorm(). See the image below to understand what it all means.

# pnorm() takes quantile and outputs probability. 
pnorm(2)
## [1] 0.9772499

pnorm(2) returned 0.9772499. This means that the blue area in the image is 0.9772499 (Note that the entire area under the curve in the image is 1). That is what we wanted to know!

Let’s try the counterpart function, qnorm(). Since we know that the probability is 0.9772499 at quantile 2, Let’s see if qnorm(0.9772499) will return 2.

# qnorm() takes the probability and outputs the corresponding quantile.
qnorm(0.9772499)
## [1] 2.000001

We got 2.000001, which is very close to 2! Notice that the slight misalignment is cause by the rounding of the probability.

5. Visualizing the variable

Visualizing data you are about to crunch can give you invaluable insights about your data. The most popular and useful way to visualize variables include histogram and boxplot. Visualizing your data is a very important first step of any data analysis.

First, let’s visualize the distribution of the variable using hist(). This draws a histogram.

hist(df.ready$p.pov)

This histogram suggests that df.ready$p.pov is heavily skewed with a long tail to the right (=right-skewed). There are many Census tracts of which the median household income is around $50,000. There are a very small number of Census tracts that are super rich.

Also note that by including breaks = 5, you can make the bars wider, as shown below. But notice how having too wide bars obscure some details. Try different breaks to find out what number makes the histogram most informative.

hist(df.ready$p.pov, breaks = 5)

An interesting side note: Although a histogram and a barplot may look similar, they are very different. A histogram uses a continuous variable and requires only one variable. For example, if you have df.ready$p.pov, that is all you need for a histogram. However, creating a barplot requires a categorical variable (either nominal or ordinal) AND an associated quantity. For example, you can’t draw a barplot with df.ready$p.pov; you need another variable with which we can group df.ready$p.pov. See the barplot below for an illustration where I create a barplot of the number of Census tracts for each County.

We first need to count how many Census tracts there are in each county. table() function returns the unique category names (in this case, the names of counties stored in df.ready$county) and the number of their occurences.

number.tract <- table(df.ready$county)
number.tract
## 
## Clayton County    Cobb County  DeKalb County  Fulton County 
##             49            120            143            202

In number.tract, We now have a categorical variable (Name of counties) and an associated quantity (the number of their occurences). The barplot can created in R by:

barplot(number.tract)

Second, boxplots are another way of visualizing the distribution. The image below shows how to interpret a boxplot.

Now let’s look at a boxplot of df.ready$p.pov.

boxplot(df.ready$p.pov)

It shows that (1) the median (the thick horizontal bar inside the box) is skewed downwards and (2) there are many outliers on the higher side of the distribution. Compare the shape of the boxplot with the histogram above and see how they correspond to each other.

Boxplot is very useful especially when your continuous variable (in our case, df.ready$p.pov) can be grouped by a categorical variable. Remember that df.ready$p.pov is the median household income of Census tracts in the 4 counties around Atlanta, including Clayton, Cobb, Dekalb, and Fulton County. That means, we can group the Census tracts into four groups based on what county they fall into.

In our df.ready, we can draw a boxplot of p.pov for different county.

boxplot(df.ready$p.pov ~ df.ready$county)

Notice the df.ready$p.pov ~ df.ready$count part in the code. This is R way of saying “draw a boxplot of df.ready$p.pov and group it by df.ready$count”. A few notable things from the plot include (1) different counties have different median values, (2) Census tracts in Clayton County has relatively less dispersed distribution of median income, (3) Fulton County has the most widely dispersed distribution with many outliers, and (4) Dekalb County has the median similar to Fulton County but has distinctively many outliers.

Finally, the labels on the histrogram and boxplot are either missing or not so intuitive. Let’s make them more readable by adding some axis labels.

hist(df.ready$p.pov, 
     main = "Histrogram of Proportion of People below Poverty Line",
     xlab = "Proportion of People (Income) below Poverty Line",
     ylab = "Frequency",
     col = "skyblue") 

boxplot(df.ready$p.pov ~ df.ready$county,
        main = "Boxplot of Proportion below Poverty Grouped by County",
        xlab = "County",
        ylab = "Proportion of People (Income) below Poverty Line",
        col = c("yellow", "Grey", "orange", "pink"))

6. Graph visualization with package ggplot2 (optional extension)

hist() and boxplot() are base R functions to visualize variables. If you demand more visualization functions and detailed controls of graph elements, package ggplot2 is the state-of-the-art graph visualization package in R. It can seamlessly integrate with tidyverse packages. ggplot2 generates plots with layers of commands, connected by + sign. The first layer with ggplot() function specifies the input data and aesthetics that will be applied to the whole graph, such as which variables go to X and Y axis, whether the color, shape, and other aesthetics are applied by groups. The second layer often specifies what types of graph you want to represent the data. The additional layers specify other graph elements, such as X and Y axis labels.

ggplot(data=df.ready, aes(x=p.pov)) + 
  geom_histogram(fill="skyblue", color="black", bins=10) + 
  labs(x="Proportion of People (Income) below Poverty Line", y="Frequency")+
  ggtitle("Histrogram of Proportion of People below Poverty Line")

ggplot(data=df.ready, aes(x=county, y=p.pov, color=county)) + 
  geom_boxplot() + 
  labs(x="County", y="Proportion of People (Income) below Poverty Line")+
  ggtitle("Boxplot of Proportion below Poverty Grouped by County")

You might have noticed that the histogram created by the geom_boxplot is slightly different from the histogram created by the base R hist() function. This is normal because geom_boxplot use different defaults to create the intervals. Both histograms are correct as they capture the same distribution of values. This stackoverflow post shows how you can create a histogram that is exactly the same between hist() and geom_boxplot.

Short Exercises

  1. The height of students are normally distributed with mean \(\mu = 200\) and standard deviation \(\sigma = 30\). What is the probability that someone has a height of \(185\) \(cm\) \(or\) \(less\)? \(Answer:\) \(0.309\)
  • What is the probability that someone has a height of \(220\) \(cm\) \(or\) \(more\)? \(Answer:\) \(0.252\)

  • What is the probability that someone has a height between \(185\) \(cm\) and \(220\) \(cm\)? \(Answer:\) \(0.439\)

  1. The military service sets the minimum acceptable IQ score for enlistees. For example, if the Navy decides that each enlisted man or woman should have an IQ score above the lowest 10% of the population, what IQ score must all volunteers equal or exceed, giving mean \(\mu = 100\) and standard deviation \(\sigma = 15\)? \(Answer:\) \(80.78\) \(or\) \(81,\) \(rounding\) \(up\)