Continuing from the slide of Lab 3 and the cleaned poverty data from Lab2, this document will guide you through (1) saving the cleaned poverty data to local folder, (2) examining summary statistics including mean, median, variance, and range, (3) calculate z-score, (4) some basic probability-related functions in R, and (5) drawing some basic charts and graphs to visualize our data.
| Functions | Tasks |
|---|---|
write.csv() |
Save the csv data to your local folder |
sum() |
Return the sum of all elements |
length() |
Returns the count of elements. Works for vectors and lists |
summary() |
Returns central tendency statistics for the provided R object |
var() |
Returns variance of a vector, matrix, or data frame |
sd() |
Returns standard deviation of a vector |
table() |
Returns a table with labels and associated occurences of each label |
hist() |
Generates a histogram of the data in the given vector |
boxplot() |
Generates a boxplot of the data in the given data frame or a list |
scale() |
Calculates z-score using a vector |
pnorm() |
Takes the quantile value and outputs the probability using a normal distribution |
qnorm() |
Takes the probability value and outputs the quantile using a normal distribution |
We will be using the same poverty data from Census in Lab2. To
continue using the same data, it is smart to save the cleaned data in to
a local folder, so that you don’t have to go through the cleaning
process again if you want to use the data for future analyses. Similar
to read.csv(), the write.csv() function allows
you to export your data to a local path.
write.csv("PATH TO A LOCAL FOLDER")
To read the data again, you just need to input the same path to the
read.csv() function.
read.csv("PATH TO A LOCAL FOLDER")
To recreate the cleaned data in base R (except the
separate function from tidyverse), you can rerun the
following codes to get the cleaned data.
# load necessary library
library(tidyverse)
# set the path (replace the path to your own path)
setwd("/Users/xiaofanliang/Dropbox (GaTech)/GT_Academics/CP6025_TA/Lab/Lab3/")
# read the data
pov.data <- read.csv("ACSDT5Y2017.C17002-Data.csv")
# subset dataframe to selected columns
pov.data.var <- pov.data[ , c("GEO_ID", "NAME", "C17002_001E", "C17002_002E", "C17002_003E") ]
# separate 'NAME' variable into three variables at a comma and a space
pov.data.sep <- separate(
data = pov.data.var, # 1st: specify the data.frame
col = NAME, # 2nd: specify the name of the variable you want to parse
into = c("tract", "county", "state"), # 3rd: a list of variable names that will be created as the result of parsing
sep = ", ") # 4th: character string that will be used as the separator
# rename column 5 to 7
colnames(pov.data.sep)[5:7] <- c("total", "under00_50", "under50_99")
# filter census tracts to those in the Atlanta four counties
pov.data.4county <- pov.data.sep[pov.data.sep$county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County"), ]
# create a new variable called p.pov that represent **the proportion of people whose income is lower than the poverty line**
pov.data.4county$p.pov <- (pov.data.4county$under00_50 + pov.data.4county$under50_99) / pov.data.4county$total
# remove NAs in the p.pov variable
na.filter <- !is.na(pov.data.4county$p.pov)
df.ready <- pov.data.4county[na.filter, ]
head(df.ready)
## GEO_ID tract county state total
## 340 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 2641
## 341 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 3573
## 342 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 4087
## 343 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 5962
## 344 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 6778
## 345 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 4090
## under00_50 under50_99 p.pov
## 340 288 261 0.2078758
## 341 498 381 0.2460118
## 342 477 397 0.2138488
## 343 939 1097 0.3414961
## 344 443 1345 0.2637946
## 345 982 762 0.4264059
The tidyverse syntax for the same cleaning process is the following:
# load necessary library
library(tidyverse)
# set the path (replace the path to your own path)
setwd("/Users/xiaofanliang/Dropbox (GaTech)/GT_Academics/CP6025_TA/Lab/Lab3/")
# read the data
pov.data <- read.csv("ACSDT5Y2017.C17002-Data.csv")
df.ready <- pov.data %>%
# subset dataframe to selected columns
select(c(GEO_ID, NAME, C17002_001E, C17002_002E, C17002_003E)) %>%
# separate 'NAME' variable into three variables at a comma and a space
separate(col=NAME, into = c("tract", "county", "state"), sep = ", ") %>%
# rename column C17002_001E, C17002_002E, C17002_003E
rename(total = C17002_001E, under00_50 = C17002_002E, under50_99 = C17002_003E) %>%
# filter census tracts to those in the Atlanta four counties
filter(county %in% c("Fulton County", "DeKalb County", "Cobb County", "Clayton County")) %>%
# create a new variable called p.pov that represent **the proportion of people whose income is lower than the poverty line**
mutate(p.pov = (under00_50 + under50_99) / total) %>%
# remove NAs in the p.pov variable
drop_na(p.pov)
head(df.ready)
## GEO_ID tract county state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 2641
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 3573
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 4087
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 5962
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 6778
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 4090
## under00_50 under50_99 p.pov
## 1 288 261 0.2078758
## 2 498 381 0.2460118
## 3 477 397 0.2138488
## 4 939 1097 0.3414961
## 5 443 1345 0.2637946
## 6 982 762 0.4264059
To save df.ready so that we don’t have to run data
cleaning again, we will use the write.csv function. The
first argument of write.csv() takes in a dataframe, and the
second argument file takes in a path that includes the name
of the exported dataframe and the .csv extension. It is
recommended to not have dot (i.e., .) in the file name to
avoid confusion with the extension .csv. Without setting
row.names=FALSE, the saved csv will automatically create a
column that saves the row index (i.e., row name) of the dataframe. We do
not want that additional column. The file will be saved at the path
saved in setwd(). You can also specify a different path in
the file argument.
NOTE If you already exported a
dfReady.csv data file in your Lab3 folder, the
write.csv() will overwrite the file under the same name
automatically.
# replace the path to your own path
write.csv(df.ready, file='dfReady.csv', row.names = FALSE)
To read the data we just saved, run the following line and you will
see that our df.ready dataframe is back!
# read data
df.ready <- read.csv('dfReady.csv')
head(df.ready)
## GEO_ID tract county state total
## 1 1400000US13063040202 Census Tract 402.02 Clayton County Georgia 2641
## 2 1400000US13063040203 Census Tract 402.03 Clayton County Georgia 3573
## 3 1400000US13063040204 Census Tract 402.04 Clayton County Georgia 4087
## 4 1400000US13063040302 Census Tract 403.02 Clayton County Georgia 5962
## 5 1400000US13063040303 Census Tract 403.03 Clayton County Georgia 6778
## 6 1400000US13063040306 Census Tract 403.06 Clayton County Georgia 4090
## under00_50 under50_99 p.pov
## 1 288 261 0.2078758
## 2 498 381 0.2460118
## 3 477 397 0.2138488
## 4 939 1097 0.3414961
## 5 443 1345 0.2637946
## 6 982 762 0.4264059
We have already started to use the short-cut methods (e.g., mean()) in R for calculating central tendencies in Lab2. Let’s try to doing it manually so you have a better insight as to what is happening behind the scene.
First, let’s calculate the mean of
df.ready$p.pov. Here is the equation for calculating mean:
\(\bar{x} = \frac{1}{n}\left
(\sum_{i=1}^n{x_i}\right )\). The first part of the equation we
need to solve is \(\left
(\sum_{i=1}^n{x_i}\right )\), which is mathematical way of saying
“sum every element in x”. In R, this equation can be written as:
sum.p.pov <- sum(df.ready$p.pov)
sum.p.pov
## [1] 89.9458
Second, the equation \(\bar{x} = \frac{1}{n}\left
(\sum_{i=1}^n{x_i}\right )\) suggests that sum.hinc
needs to be multiplied by \(\frac{1}{n}\), where n is the
number of elements in df.ready$p.pov. You can get the
number of elements using length().
1 / length(df.ready$p.pov) * sum.p.pov
## [1] 0.1749918
There you have it! Let’s also use R function mean() to
check if the manually calculated mean is correct.
mean(df.ready$p.pov)
## [1] 0.1749918
Matches perfectly! It shows that, on average across all census tracts, the proportion of people whose income is lower than the poverty line is 18.5%.
First, let’s calculate variance using the same df.ready and the mean we just calculated. Like we did for mean, let’s break down the equation for variance and tackle it piece by piece. The equation is: \(\sigma_Y^2 = \frac {1}{n-1} \sum_{i=1}^n \left(Y_i - \overline{Y} \right)^2\)
let’s focus on \(\left(Y_i - \overline{Y}
\right)^2\) part. The following code calculates it and stores it
in an object called parenthesis.
parenthesis <- (df.ready$p.pov - mean(df.ready$p.pov))^2 # ^2 means to square it
head(parenthesis) # Prints the first six elements in parenthesis
## [1] 0.001081356 0.005043830 0.001509863 0.027723686 0.007885937 0.063209019
Second, we are done with \(\left(Y_i - \overline{Y} \right)^2\) part. Now, we need to sum all numbers in this vector to finish \(\sum_{i=1}^n \left(Y_i - \overline{Y} \right)^2\) part.
sum.parenthesis <- sum(parenthesis)
sum.parenthesis
## [1] 8.92611
Third, the last thing we need to do is to divide
sum.parenthesis by \(n-1\), where \(n\) is the number of elements in the
vector. You can get the number of elements using length().
Let’s also use R function var() to see if the numbers
match.
sum.parenthesis / (length(df.ready$p.pov) - 1)
## [1] 0.01739983
var(df.ready$p.pov)
## [1] 0.01739983
The following codes show the base R functions that calculate the summary statistics of a vector of values.
mean(df.ready$p.pov) # This function outputs mean of the given vector
## [1] 0.1749918
var(df.ready$p.pov) # This function outputs variance of the given vector
## [1] 0.01739983
sd(df.ready$p.pov) # This function outputs standard deviation of the given vector
## [1] 0.1319084
summary(df.ready$p.pov) # This function outputs various descriptive statistics. Note that summary() sometimes rounds up the numbers!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.004575 0.069188 0.145023 0.174992 0.253455 0.805505
NOTE using summary() to view the
descriptive statistics can sometimes be misleading. This is because
summary() rounds up the results (see the difference between
mean(df.ready$p.pov) and the mean from
summary(df.ready$p.pov)). If you need to generate precise
statistics (e.g., for your option paper or thesis), I recommend using
mean(), min(), max(),
median(), etc. instead of summary().
Using the cleaned data, we will calculate z-score for
p.povvariable. Perhaps the usefulness of z-score may be not be apparent at this moment. But z-score is used extensively, and it is important that we have an experience of calculating z-scores in R. If you forgot the equation for z-score, see the lab 3 lecture slide.
To calculate z-score, we first need to get the mean and the standard
deviation of p.pov. Then, we need to subtract the mean from
each value of p.pov and divide it by the standard
deviation. We will store the z-score of p.pov back into
df.ready by creating a new variable in
df.ready named p.pov.z.
# Calculating mean
p.pov.mean <- mean(df.ready$p.pov)
# Calculating standard deviation
p.pov.sd <- sd(df.ready$p.pov)
# Calculate z-score and assign it into p.pov.z
df.ready$p.pov.z <- (df.ready$p.pov - p.pov.mean) / p.pov.sd
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
## p.pov.z
## Min. :-1.2919
## 1st Qu.:-0.8021
## Median :-0.2272
## Mean : 0.0000
## 3rd Qu.: 0.5948
## Max. : 4.7799
The summary function at the end shows that we successfully created a
new variable called p.pov.z and that it has the mean of 0
(which is expected). Go ahead and check if p.pov.z has the
standard deviation of 1 by using sd() function.
In the real life, we usually use a function specifically designed to
calculate z-score, scale() function.
# Calculating z-score using scale()
df.ready$p.pov.scale <- scale(df.ready$p.pov, center = TRUE, scale = TRUE)
summary(df.ready)
## GEO_ID tract county state
## Length:514 Length:514 Length:514 Length:514
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total under00_50 under50_99 p.pov
## Min. : 136 Min. : 0.0 Min. : 0.0 Min. :0.004575
## 1st Qu.: 3539 1st Qu.: 147.2 1st Qu.: 139.5 1st Qu.:0.069188
## Median : 4874 Median : 284.0 Median : 338.5 Median :0.145023
## Mean : 5248 Mean : 377.6 Mean : 442.7 Mean :0.174992
## 3rd Qu.: 6497 3rd Qu.: 520.2 3rd Qu.: 623.5 3rd Qu.:0.253455
## Max. :17857 Max. :2040.0 Max. :2234.0 Max. :0.805505
## p.pov.z p.pov.scale.V1
## Min. :-1.2919 Min. :-1.291931
## 1st Qu.:-0.8021 1st Qu.:-0.802100
## Median :-0.2272 Median :-0.227197
## Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.5948 3rd Qu.: 0.594832
## Max. : 4.7799 Max. : 4.779929
In R, probability-related functions usually come in a group of four.
For example, try running ?pnorm in your console to bring up
a help page. You will see that, instead of bringing up a page just for
pnorm(), R will show you a page that explains four similar
functions, including dnorm(), pnorm(),
qnorm(), and rnorm().
These four functions are related with the normal distribution, hence
the suffix norm. If the distribution you need is
t-distribution, you can use dt, pt,
qt, and rt. See the pattern?
Of the four prefixes (i.e., d, p, q, and r), two that are most relevant to this course are the ones that start with p and q. Ones that start with p takes the quantile (quantile is a word used to define a particular part of a data set or distribution) and outputs the corresponding probability. Ones that start with q takes the probability and outputs the corresponding quantile. It is okay if we don’t fully understand how these operations can be useful at this moment, that is okay. We will come back to this later in the semester. For now, let’s focus on learning these functions in R.
For the demonstration, I will use a normal distribution, particularly
the standard normal distribution (i.e., the normal distribution with the
mean of 0 and the standard deviation of 1). Let’s go back to the help
page for pnorm() (i.e., ?pnorm). The help page
shows states that:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
Notice that mean, sd, lower.tail, and log.p are followed by an equal
sign and some values. For example, mean = 0. Also notice that q is
not followed by anything. This means that, when you are
using pnorm() function, you do not have to supply arguments
for mean (you can if you want to shift the mean of the distribution),
because R will automatically supply the default value, which is 0.
Similarly, the default value for sd is 1, and so on. But q,
which stands for quantile does not have the default value
specified. This means you must supply that value; otherwise the
function will not run. So, if you do not intentionally change the
mean and sd setting, pnorm()
function will use the standard normal distribution (which by definition
has the mean of 0 and the standard deviation of 1). I will use it that
way.
Let’s try some examples. pnorm() takes quantile and
outputs probability. I’d like to try supplying 2 as the input to
pnorm(). See the image below to understand what it all
means.
# pnorm() takes quantile and outputs probability.
pnorm(2)
## [1] 0.9772499
pnorm(2) returned 0.9772499. This means that the blue
area in the image is 0.9772499 (Note that the entire area under the
curve in the image is 1). That is what we wanted to know!
Let’s try the counterpart function, qnorm(). Since we
know that the probability is 0.9772499 at quantile 2, Let’s see if
qnorm(0.9772499) will return 2.
# qnorm() takes the probability and outputs the corresponding quantile.
qnorm(0.9772499)
## [1] 2.000001
We got 2.000001, which is very close to 2! Notice that the slight misalignment is cause by the rounding of the probability.
Visualizing data you are about to crunch can give you invaluable insights about your data. The most popular and useful way to visualize variables include histogram and boxplot. Visualizing your data is a very important first step of any data analysis.
First, let’s visualize the distribution of the
variable using hist(). This draws a histogram.
hist(df.ready$p.pov)
This histogram suggests that df.ready$p.pov is heavily
skewed with a long tail to the right (=right-skewed). There are many
Census tracts of which the median household income is around $50,000.
There are a very small number of Census tracts that are super rich.
Also note that by including breaks = 5, you can make the
bars wider, as shown below. But notice how having too wide bars obscure
some details. Try different breaks to find out what number
makes the histogram most informative.
hist(df.ready$p.pov, breaks = 5)
An interesting side note: Although a histogram and
a barplot may look similar, they are very different. A histogram uses
a continuous variable and requires only one variable. For
example, if you have df.ready$p.pov, that is all you need
for a histogram. However, creating a barplot requires a
categorical variable (either nominal or ordinal) AND an
associated quantity. For example, you can’t draw a barplot with
df.ready$p.pov; you need another variable with which we can
group df.ready$p.pov. See the barplot below for an
illustration where I create a barplot of the number of Census tracts for
each County.
We first need to count how many Census tracts there are in each
county. table() function returns the unique category names
(in this case, the names of counties stored in
df.ready$county) and the number of their occurences.
number.tract <- table(df.ready$county)
number.tract
##
## Clayton County Cobb County DeKalb County Fulton County
## 49 120 143 202
In number.tract, We now have a categorical variable
(Name of counties) and an associated quantity (the number of their
occurences). The barplot can created in R by:
barplot(number.tract)
Second, boxplots are another way of visualizing the
distribution. The image below shows how to interpret a boxplot.
Now let’s look at a boxplot of df.ready$p.pov.
boxplot(df.ready$p.pov)
It shows that (1) the median (the thick horizontal bar inside the box) is skewed downwards and (2) there are many outliers on the higher side of the distribution. Compare the shape of the boxplot with the histogram above and see how they correspond to each other.
Boxplot is very useful especially when your continuous
variable (in our case, df.ready$p.pov) can be grouped by a
categorical variable. Remember that df.ready$p.pov is the
median household income of Census tracts in the 4 counties around
Atlanta, including Clayton, Cobb, Dekalb, and Fulton County. That means,
we can group the Census tracts into four groups based on what county
they fall into.
In our df.ready, we can draw a boxplot of
p.pov for different county.
boxplot(df.ready$p.pov ~ df.ready$county)
Notice the df.ready$p.pov ~ df.ready$count part in the
code. This is R way of saying “draw a boxplot of
df.ready$p.pov and group it by
df.ready$count”. A few notable things from the plot include
(1) different counties have different median values, (2) Census tracts
in Clayton County has relatively less dispersed distribution of median
income, (3) Fulton County has the most widely dispersed distribution
with many outliers, and (4) Dekalb County has the median similar to
Fulton County but has distinctively many outliers.
Finally, the labels on the histrogram and boxplot are either missing or not so intuitive. Let’s make them more readable by adding some axis labels.
hist(df.ready$p.pov,
main = "Histrogram of Proportion of People below Poverty Line",
xlab = "Proportion of People (Income) below Poverty Line",
ylab = "Frequency",
col = "skyblue")
boxplot(df.ready$p.pov ~ df.ready$county,
main = "Boxplot of Proportion below Poverty Grouped by County",
xlab = "County",
ylab = "Proportion of People (Income) below Poverty Line",
col = c("yellow", "Grey", "orange", "pink"))
hist() and boxplot() are base R functions
to visualize variables. If you demand more visualization functions and
detailed controls of graph elements, package
ggplot2 is the state-of-the-art graph visualization package
in R. It can seamlessly integrate with tidyverse
packages. ggplot2 generates plots with layers of commands,
connected by + sign. The first layer with
ggplot() function specifies the input data and aesthetics
that will be applied to the whole graph, such as which variables go to X
and Y axis, whether the color, shape, and other aesthetics are applied
by groups. The second layer often specifies what types of graph you want
to represent the data. The additional layers specify other graph
elements, such as X and Y axis labels.
ggplot(data=df.ready, aes(x=p.pov)) +
geom_histogram(fill="skyblue", color="black", bins=10) +
labs(x="Proportion of People (Income) below Poverty Line", y="Frequency")+
ggtitle("Histrogram of Proportion of People below Poverty Line")
ggplot(data=df.ready, aes(x=county, y=p.pov, color=county)) +
geom_boxplot() +
labs(x="County", y="Proportion of People (Income) below Poverty Line")+
ggtitle("Boxplot of Proportion below Poverty Grouped by County")
You might have noticed that the histogram created by the
geom_boxplot is slightly different from the histogram
created by the base R hist() function. This is normal
because geom_boxplot use different defaults to create the
intervals. Both histograms are correct as they capture the same
distribution of values. This stackoverflow
post shows how you can create a histogram that is exactly the same
between hist() and geom_boxplot.
Short Exercises
What is the probability that someone has a height of \(220\) \(cm\) \(or\) \(more\)? \(Answer:\) \(0.252\)
What is the probability that someone has a height between \(185\) \(cm\) and \(220\) \(cm\)? \(Answer:\) \(0.439\)