Lab2 - Descriptive stats, frequency distribution, central tendency

This lab will guide you through (1) reading in data from your computer, (2) examining its summary statistics, including mean, median, variance, and range, and (3) drawing some basic charts and graphs to visualize the data. Specifically, this document covers the following functions:

Functions	Tasks
`getwd()`	Print the current working directory
`setwd()`	Set the working directory to the specified path
`dir()`	Returns a character vector of names of files and folders in the current wd
`read.csv()`	Read in files with ‘.csv’ extensions
`head()`	Return the first few elements or rows (usually 6)
`sum()`	Return the sum of all elements
`length()`	Returns the count of elements. Works for vectors and lists
`summary()`	Returns summary statistics for the provided R object (for a numeric vector: minimum, quartiles, median, mean, and maximum)
`var()`	Returns variance of a vector, matrix, or data frame
`sd()`	Returns standard deviation of a vector
`table()`	Returns a table with labels and associated occurrences of each label
`hist()`	Generates a histogram of the data in the given vector
`boxplot()`	Generates a boxplot of the data in the given data frame or a list

1. Setting up working environment

When you open RStudio, your R session is located in one of the folders on your computer. This folder is called the working directory. When you tell R to read data using the read.csv() function and give it only a file name, R looks for the file in the working directory. If your file is not there, R cannot find it. You need to tell R where to find the file that you want to read.

Let’s see where your R session is currently located by using the getwd() function. ‘wd’ here stands for working directory.

getwd()

## [1] "C:/Users/cod/Desktop/PhD Files/GRA & GTA/GTA/CP6025 Fall 2021/Week 2/Lab2"

Above is the folder your R session is currently in. Note that the path shown above is the current working directory on my (Gabriel’s) computer. Your current working directory may be different. Now, let’s tell R which working directory you want to work in (the folder where the data is located) by using the setwd() function.

Note that the folder names are separated by /, not \. If you copied and pasted a directory path from Windows File Explorer, you will need to change \ to /.

setwd("C:/Users/cod/Desktop/PhD Files/GRA & GTA/GTA/CP6025 Fall 2021/Week 2/Lab2")
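If you would rather not retype the slashes by hand, R can swap them for you. Below is a minimal sketch, assuming a made-up path (`win_path` is hypothetical, not a folder from this lab). Inside an R string each backslash has to be written twice, which is why the pasted path looks doubled.

```r
# Hypothetical Windows-style path; each backslash is doubled inside an R string.
win_path <- "C:\\Users\\me\\Desktop\\Lab2"

# Replace every backslash with a forward slash.
# fixed = TRUE tells gsub() to treat the pattern as plain text, not a regular expression.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)

# r_path could then be passed to setwd(), provided the folder exists on your computer.
```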

You can also use the choose.dir() function to pick the folder interactively. Note that choose.dir() is only available in R on Windows.

setwd(choose.dir())

Every time you open RStudio, you are starting a new R session, and the working directory will revert to where it first was. You will need to set the working directory to the appropriate folder every time you open RStudio.

2. Manually calculating mean

Before we start using shortcut methods for calculating central tendencies, let’s try doing it manually. This will give you better insight into what is happening behind the scenes.

First, let’s read in some data for the exercise. To see what files are in the new working directory, you can use the dir() function.

dir() # When there is no argument inside the parenthesis, dir() returns the names of files or folders in the working directory. The output below is showing files and folders in my (Gabriel's) computer, and they may be different from yours.

## [1] "archive"                    "boxplot.png"               
## [3] "boxplot2.png"               "lab2.html"                 
## [5] "lab2.Rmd"                   "rsconnect"                 
## [7] "testdata.csv"               "Week2_Lab_Descriptive.pdf" 
## [9] "Week2_Lab_Descriptive.pptx"

If you see “testdata.csv”, you are in the right folder! Now that we have confirmed that R is in the right folder and can locate the file, let’s read it using the read.csv() function and assign it to an R object called testdata. We can view the first few rows of the dataset using head().

# setwd("insert the path to the folder where data file is located")
testdata <- read.csv("testdata.csv")
head(testdata)

##   X        GISJOIN      YEAR   STATE         COUNTY TRACT tot_age25 less_hs
## 1 1 G1300630040202 2013-2017 Georgia Clayton County 40202      1393     318
## 2 2 G1300630040203 2013-2017 Georgia Clayton County 40203      2134     306
## 3 3 G1300630040204 2013-2017 Georgia Clayton County 40204      2452     190
## 4 4 G1300630040302 2013-2017 Georgia Clayton County 40302      3520    1034
## 5 5 G1300630040303 2013-2017 Georgia Clayton County 40303      4072    1059
## 6 6 G1300630040306 2013-2017 Georgia Clayton County 40306      1959     781
##     hs some_college college graduate  hinc
## 1  581          338     123       33 31524
## 2  688          753     265      122 36786
## 3  877          825     371      189 39194
## 4 1347          855     235       49 33190
## 5 1319         1075     530       89 37236
## 6  430          567     169       12 27372
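If read.csv() complains that it cannot open the file, the file is usually not in the folder R is looking in. The sketch below shows one way to check before reading. It is self-contained, so it writes a tiny made-up file (`toy_testdata.csv`, a hypothetical name) to a temporary folder instead of using the lab data.

```r
# Made-up miniature data set, written to a temporary folder so the example runs anywhere.
toy <- data.frame(TRACT = c(101, 102, 103),
                  hinc  = c(31524, 36786, 39194))
toy_path <- file.path(tempdir(), "toy_testdata.csv")  # hypothetical file name
write.csv(toy, toy_path, row.names = FALSE)

# Check that the file exists before trying to read it.
if (file.exists(toy_path)) {
  toy_read <- read.csv(toy_path)
  print(head(toy_read))
} else {
  message("File not found in: ", dirname(toy_path))
}
```

With your own data, you would replace toy_path with "testdata.csv" (or the full path to the file).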

Another way to examine the data is by using the str() function. Note that str() shows not only what variables are in the data but also how many rows there are and what type of data each variable is.

str(testdata) # prints the number of rows and columns, then each variable's name, type, and first few values

## 'data.frame':    514 obs. of  13 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GISJOIN     : chr  "G1300630040202" "G1300630040203" "G1300630040204" "G1300630040302" ...
##  $ YEAR        : chr  "2013-2017" "2013-2017" "2013-2017" "2013-2017" ...
##  $ STATE       : chr  "Georgia" "Georgia" "Georgia" "Georgia" ...
##  $ COUNTY      : chr  "Clayton County" "Clayton County" "Clayton County" "Clayton County" ...
##  $ TRACT       : int  40202 40203 40204 40302 40303 40306 40307 40308 40407 40408 ...
##  $ tot_age25   : int  1393 2134 2452 3520 4072 1959 3098 3089 2421 4892 ...
##  $ less_hs     : int  318 306 190 1034 1059 781 979 1001 618 989 ...
##  $ hs          : int  581 688 877 1347 1319 430 1087 1127 914 1879 ...
##  $ some_college: int  338 753 825 855 1075 567 787 703 582 1426 ...
##  $ college     : int  123 265 371 235 530 169 214 245 209 431 ...
##  $ graduate    : int  33 122 189 49 89 12 31 13 98 167 ...
##  $ hinc        : int  31524 36786 39194 33190 37236 27372 37064 25159 45768 37224 ...

Now the R object testdata contains the data from “testdata.csv”. testdata is a data frame (similar to an Excel spreadsheet) with 514 rows and 13 columns (a.k.a. variables). For this exercise, we will only use one of them: the variable named hinc. This variable represents the median household income of the Census tracts (similar to neighborhoods) in the 4 counties around Atlanta.

In R, testdata$hinc means “Give me the variable named hinc in the dataset called testdata.” Here, the $ operator extracts parts of an R object. If used with a data.frame, it extracts columns.

head(testdata$hinc)

## [1] 31524 36786 39194 33190 37236 27372
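The $ operator is not the only way to pull out a column. As a small illustration, the sketch below builds a made-up data frame (`toy` and its values are hypothetical) and extracts the same column in three ways that should give identical results.

```r
# Made-up data frame with two columns.
toy <- data.frame(COUNTY = c("A County", "B County", "A County"),
                  hinc   = c(30000, 52000, 41000))

by_dollar  <- toy$hinc        # extract by name with $
by_bracket <- toy[["hinc"]]   # extract by name with [[ ]]
by_index   <- toy[, "hinc"]   # extract with [row, column] indexing

print(by_dollar)
print(identical(by_dollar, by_bracket))
print(identical(by_dollar, by_index))
```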

Second, let’s calculate the mean of testdata$hinc. Here is the equation for calculating the mean: $\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right )$. The first part of the equation we need to solve is $\left (\sum_{i=1}^n{x_i}\right )$, which is a mathematical way of saying “sum every element in x”. In R, this can be written as:

sum.hinc <- sum(testdata$hinc)

Note that we created an R object called sum.hinc by assigning the result of sum(testdata$hinc) into the object using <-. To see what is stored in sum.hinc, you can simply call the object.

sum.hinc

## [1] 34309653

Third, the equation $\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right )$ suggests that sum.hinc needs to be multiplied by $\frac{1}{n}$, where n is the number of elements in testdata$hinc. You can get the number of elements using length().

1 / length(testdata$hinc) * sum.hinc

## [1] 66750.3

There you have it! Let’s also use the R function mean() to check whether the manually calculated mean is correct.

mean(testdata$hinc)

## [1] 66750.3

Matches perfectly! It shows that, averaged across the Census tracts in these four counties, the tract-level median household income is about $67,000 per year.
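To see the same steps with numbers small enough to check in your head, here is a short sketch using a made-up vector (`x` is hypothetical, not part of testdata).

```r
# Made-up vector with four elements.
x <- c(2, 4, 6, 8)

sum_x <- sum(x)      # 2 + 4 + 6 + 8 = 20
n     <- length(x)   # 4 elements

manual_mean <- (1 / n) * sum_x   # (1/4) * 20 = 5
print(manual_mean)
print(mean(x))       # should agree with manual_mean
```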

3. Manually calculating variance

This time we will calculate the variance using the same testdata and the mean we just calculated. As we did for the mean, let’s break down the equation for the variance and tackle it piece by piece. The equation for the sample variance (the version with $n-1$ in the denominator, which is what var() computes) is: $\sigma_Y^2 = \frac {1}{n-1} \sum_{i=1}^n \left(Y_i - \overline{Y} \right)^2$

First, let’s focus on the $\left(Y_i - \overline{Y} \right)^2$ part. The following code calculates it and stores it in an object called parenthesis.

parenthesis <- (testdata$hinc - mean(testdata$hinc))^2 # ^2 means to square it
head(parenthesis) # Prints the first six elements in parenthesis

## [1] 1240892047  897859135  759349541 1126293579  871093767 1550650327

Second, now that the $\left(Y_i - \overline{Y} \right)^2$ part is done, we need to sum all the numbers in this vector to finish the $\sum_{i=1}^n \left(Y_i - \overline{Y} \right)^2$ part.

sum.parenthesis <- sum(parenthesis)
sum.parenthesis

## [1] 671687216869

Third, the last thing we need to do is divide sum.parenthesis by $n-1$, where $n$ is the number of elements in the vector. You can get the number of elements using length(). Let’s also use the R function var() to see if the numbers match.

sum.parenthesis / (length(testdata$hinc) - 1)

## [1] 1309331807

var(testdata$hinc)

## [1] 1309331807

Matches perfectly again!
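The same three steps can be followed by hand on a tiny made-up vector (`y` is hypothetical). The sketch also divides by $n$ instead of $n-1$, only to show that the two denominators give different answers and that var() matches the $n-1$ version.

```r
# Made-up vector; its mean is 5.
y <- c(2, 4, 6, 8)

squared_dev <- (y - mean(y))^2   # (-3)^2, (-1)^2, 1^2, 3^2 = 9, 1, 1, 9
ss <- sum(squared_dev)           # 9 + 1 + 1 + 9 = 20

sample_var <- ss / (length(y) - 1)   # 20 / 3, the formula used in this lab
divide_by_n <- ss / length(y)        # 20 / 4 = 5, shown for comparison only

print(sample_var)
print(var(y))        # should agree with sample_var
print(divide_by_n)
```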

4. Using base R functions

Instead of doing this every time we need to calculate the mean, variance, etc., we can simply use base R functions that do it for us.

mean(testdata$hinc) # This function outputs mean of the given vector

## [1] 66750.3

var(testdata$hinc) # This function outputs variance of the given vector

## [1] 1309331807

sd(testdata$hinc) # This function outputs standard deviation of the given vector

## [1] 36184.69
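The standard deviation is the square root of the variance, so it is expressed in the same units as the data (dollars here) rather than squared units. A quick sketch of that relationship, using a made-up vector (`y` is hypothetical) and the variance printed above:

```r
# Made-up vector.
y <- c(2, 4, 6, 8)

manual_sd <- sqrt(var(y))   # square root of the sample variance
print(manual_sd)
print(sd(y))                # should agree with manual_sd

# The same relationship using the (rounded) variance of hinc printed above.
sd_from_printed_var <- sqrt(1309331807)
print(round(sd_from_printed_var, 2))
```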

summary(testdata$hinc) # This function outputs various descriptive statistics. Note that the printed numbers are rounded!

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9815   40153   58418   66750   84288  200179

NOTE: Using summary() to view the descriptive statistics can sometimes be misleading. This is because summary() rounds the results when it prints them (see the difference between mean(testdata$hinc) and the mean from summary(testdata$hinc)). If you need to report precise statistics (e.g., for your option paper or thesis), I recommend using mean(), min(), max(), median(), etc. instead of summary().
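The rounding can be seen on a small made-up vector (`x` is hypothetical). As far as I know, in recent versions of R (3.4.0 and later) the rounding happens only when the summary is printed, so the stored values keep full precision and you can ask print() for more digits. Treat that version detail as an assumption and check it on your own installation.

```r
# Made-up values with decimals.
x <- c(31524.25, 36786.5, 39194.75)

s <- summary(x)
print(s)          # printed with only a few significant digits
print(mean(x))    # printed with more digits

# Ask for more digits when printing the summary.
print(s, digits = 8)

# Pull a single statistic out of the summary object.
mean_from_summary <- s[["Mean"]]
print(mean_from_summary)
```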

5. Visualizing the variable

Visualizing the data you are about to crunch can give you invaluable insights. Two of the most popular and useful ways to visualize a variable are the histogram and the boxplot. Visualizing your data is a very important first step in any data analysis.

First, let’s visualize the distribution of the variable using hist(). This draws a histogram.

hist(testdata$hinc)

This histogram suggests that testdata$hinc is heavily skewed with a long tail to the right (i.e., right-skewed). There are many Census tracts whose median household income is around $50,000. There are a very small number of Census tracts that are super rich.

Also note that by including breaks = 5, you can make the bars wider, as shown below. R treats this number as a suggestion and picks round break points, so you may not get exactly five bars. Notice how bars that are too wide obscure some details. Try different values of breaks to find out what number makes the histogram most informative.

hist(testdata$hinc, breaks = 5)
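To compare different values of breaks without drawing anything, hist() can be called with plot = FALSE, which returns the break points and counts instead of a picture. The sketch below uses made-up right-skewed numbers (`x` is simulated, not the lab data) and counts how many bars each setting produces.

```r
# Simulated right-skewed values standing in for incomes (not the lab data).
set.seed(1)
x <- rlnorm(500, meanlog = 11, sdlog = 0.5)

# plot = FALSE returns the histogram's numbers without drawing it.
h_default <- hist(x, plot = FALSE)
h_five    <- hist(x, breaks = 5,  plot = FALSE)
h_many    <- hist(x, breaks = 50, plot = FALSE)

# Number of bars under each setting.
cat("default:    ", length(h_default$counts), "bars\n")
cat("breaks = 5: ", length(h_five$counts), "bars\n")
cat("breaks = 50:", length(h_many$counts), "bars\n")

# The break points R actually chose for breaks = 5.
print(h_five$breaks)
```

Dropping plot = FALSE draws the histogram as usual.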

An interesting side note: Although a histogram and a barplot may look similar, they are very different. A histogram uses a continuous variable and requires only one variable. For example, if you have testdata$hinc, that is all you need for a histogram. However, creating a barplot requires a categorical variable (either nominal or ordinal) AND an associated quantity. For example, you can’t draw a barplot with testdata$hinc; you need another variable with which we can group testdata$hinc. See the barplot below for an illustration where I create a barplot of the number of Census tracts for each County.

We first need to count how many Census tracts there are in each county. The table() function returns the unique category names (in this case, the names of counties stored in testdata$COUNTY) and the number of their occurrences.

number.tract <- table(testdata$COUNTY)
number.tract

## 
## Clayton County    Cobb County  DeKalb County  Fulton County 
##             49            120            143            202

In number.tract, we now have a categorical variable (the names of the counties) and an associated quantity (the number of their occurrences). The barplot can be created in R by:

barplot(number.tract)
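The quantity attached to each category does not have to be a count. As a sketch, the code below uses a made-up data frame (`toy`, with hypothetical county names and incomes) and tapply() to compute the mean of hinc within each county, which barplot() can then draw.

```r
# Made-up data: two counties with a few tracts each.
toy <- data.frame(
  COUNTY = c("A County", "A County", "B County", "B County", "B County"),
  hinc   = c(30000, 40000, 50000, 60000, 70000)
)

# Quantity 1: number of tracts per county.
tract_counts <- table(toy$COUNTY)
print(tract_counts)

# Quantity 2: mean of hinc per county.
mean_hinc <- tapply(toy$hinc, toy$COUNTY, mean)
print(mean_hinc)

# Either quantity can be passed to barplot().
barplot(mean_hinc,
        main = "Mean of hinc by County (made-up data)",
        xlab = "County",
        ylab = "Mean of hinc")
```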

Second, boxplots are another way of visualizing the distribution. The image below shows how to interpret a boxplot.

Now let’s look at a boxplot of testdata$hinc.

boxplot(testdata$hinc)

It shows that (1) the median (the thick horizontal bar inside the box) sits closer to the bottom of the box than to the top and (2) there are many outliers on the higher side of the distribution. Both are signs of the right skew we saw earlier. Compare the shape of the boxplot with the histogram above and see how they correspond to each other.

A boxplot is especially useful when your continuous variable (in our case, testdata$hinc) can be grouped by a categorical variable. Remember that testdata$hinc is the median household income of Census tracts in the 4 counties around Atlanta: Clayton, Cobb, DeKalb, and Fulton County. That means we can group the Census tracts into four groups based on which county they fall into.

In our testdata, we can draw a boxplot of hinc for each COUNTY.
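The numbers behind a boxplot can also be computed directly with boxplot.stats(). The sketch below uses a made-up vector (`x` is hypothetical) with one obviously large value. By default, R flags a point as an outlier when it lies more than 1.5 times the height of the box beyond the edge of the box.

```r
# Made-up vector with one very large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)

# Five numbers: lower whisker, bottom of box, median, top of box, upper whisker.
print(bs$stats)

# Points drawn individually as outliers.
print(bs$out)

# Reproduce the upper cutoff by hand.
box_height  <- bs$stats[4] - bs$stats[2]       # top of box minus bottom of box
upper_fence <- bs$stats[4] + 1.5 * box_height  # anything above this is an outlier
print(upper_fence)
```

Here the box runs from 3 to 8, so the cutoff is 8 + 1.5 × 5 = 15.5. The value 50 is above it and is flagged, and the upper whisker stops at 9, the largest value that is not an outlier.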

boxplot(testdata$hinc ~ testdata$COUNTY)

Notice the testdata$hinc ~ testdata$COUNTY part in the code. This is R’s way of saying “draw a boxplot of testdata$hinc and group it by testdata$COUNTY”. A few notable things from the plot include (1) different counties have different median values, (2) Census tracts in Clayton County have a relatively less dispersed distribution of median income, (3) Fulton County has the most widely dispersed distribution with many outliers, and (4) DeKalb County has a median similar to Fulton County’s but has distinctively many outliers.

Finally, the labels on the histogram and boxplot are either missing or not so intuitive. Let’s make them more readable by adding some axis labels.
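The same grouping can be written with column names only by giving boxplot() a data argument. The sketch below uses a made-up data frame (`toy` is hypothetical) and plot = FALSE, so that it returns the group names and statistics instead of drawing, which makes it easy to confirm that the middle line of each box is the group median.

```r
# Made-up data: two counties with three tracts each.
toy <- data.frame(
  COUNTY = c("A County", "A County", "A County",
             "B County", "B County", "B County"),
  hinc   = c(30000, 35000, 40000, 50000, 65000, 90000)
)

# "hinc grouped by COUNTY", with columns looked up inside toy.
b <- boxplot(hinc ~ COUNTY, data = toy, plot = FALSE)

print(b$names)   # one box per county
print(b$stats)   # one column per county; row 3 holds the medians

# Group medians computed separately, for comparison with row 3 of b$stats.
group_medians <- tapply(toy$hinc, toy$COUNTY, median)
print(group_medians)
```

Dropping plot = FALSE draws the grouped boxplot as usual.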

hist(testdata$hinc, 
     main = "Histogram of Median Household Income of Four Counties Around Atlanta",
     xlab = "Median Household Income",
     ylab = "Frequency",
     col = "skyblue")

boxplot(testdata$hinc ~ testdata$COUNTY,
        main = "Boxplot of Median Household Income Grouped by County",
        xlab = "County",
        ylab = "Median Household Income",
        col = c("yellow", "grey", "orange", "pink"))