What is data analysis?

Data analysis involves making sense of the world by collecting, summarizing, and modelling quantitative information (numbers).

  • The data analysis workflow is described below:
  • The first part of data analysis is finding or creating the dataset that you will analyze. In this class, we will use, among others, the data from the General Social Survey (described later in this document).

  • The second part is tidying your data. This means that you need to have data in a format that can be analyzed. Specifically, cases have to be in rows and variables have to be in columns.

  • Once you have tidied the data, you often need to transform the data so that you can answer your research questions most effectively. For instance, you can recode the age variable into two groups: people older than 65 and people 65 and younger.

  • We will discuss the visualization, modeling, and communication parts later in the semester.
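
As a minimal sketch of the transformation step, the age recode could look like this in base R. The data frame people and its values are made up for illustration; they are not taken from the General Social Survey.

```r
# Hypothetical tidy data: one row per person, one column per variable
people <- data.frame(
  id  = 1:5,
  age = c(21, 49, 70, 65, 66)
)

# Recode age into two groups: older than 65 vs. 65 and younger
people$age_group <- ifelse(people$age > 65, "older than 65", "65 and younger")

table(people$age_group)
```

Note that a person who is exactly 65 falls into the "65 and younger" group because the condition uses > rather than >=.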

Why is data management important?

Data management is especially important when you only have access to messy data. Data are typically messy when you see one or more of these three things occur:

  1. Column headers are values, not variable names.
  2. Variables are stored in both rows and columns.
  3. Multiple variables are stored in one column.

In sum, before any analysis, such as calculating correlations or visualizing data, your data should be tidy: each variable is a column and each observation is a row.
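
To make the first kind of messiness concrete, here is a small sketch in base R. The table messy, its column names, and its counts are invented for illustration only.

```r
# Hypothetical messy table: the column headers "male" and "female" are
# values of a variable (sex), not variable names
messy <- data.frame(
  race   = c("white", "black"),
  male   = c(10, 4),
  female = c(12, 6)
)

# Tidy version: one column per variable (race, sex, count),
# one row per observation (a race-by-sex cell)
tidy <- data.frame(
  race  = rep(messy$race, times = 2),
  sex   = rep(c("male", "female"), each = nrow(messy)),
  count = c(messy$male, messy$female)
)

tidy
```

In the tidy version every column is a variable, so sex can be used directly in a table, a plot, or a model.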

Frequently used data management operations include selecting columns (variables), filtering rows (cases), and creating new columns. These are the operations we will cover here.

dplyr package

Data management in R can be done without any additional packages, but the resulting code is often not very intuitive. Instead, consider using the dplyr package.
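
The three operations named above (selecting columns, filtering rows, creating new columns) might look like this with dplyr. This is only a sketch: it assumes dplyr is already installed, and the data frame survey and the variable college are made up for illustration.

```r
library(dplyr)  # assumes dplyr is already installed

# Hypothetical data, loosely modeled on a survey extract
survey <- data.frame(
  age  = c(21, 49, 70, 50),
  sex  = c("male", "female", "female", "female"),
  educ = c(12, 13, 16, 19)
)

survey_small <- survey %>%
  select(age, educ) %>%          # keep two columns (variables)
  filter(age >= 30) %>%          # keep rows (cases) aged 30 or older
  mutate(college = educ >= 16)   # create a new column

survey_small
```

Each verb takes a data frame and returns a data frame, which is why the steps can be chained with the %>% operator.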

Computing using R and R Studio

Computing is a fundamental part of modern statistics and data analysis. We will use R, a free software environment for statistical computing and graphics. Learning how to program in R is also a useful foundation for learning other programming languages.

  • We will use R via R Studio (integrated development environment, IDE).

R Studio

R packages

  • R packages need to be installed and loaded for many specialized operations. Packages are also loaded when you want to perform standard operations more efficiently compared to what is possible in the R base installation.

  • Once an R package is installed on your computer, you do not need to install the package again when you want to use it but you need to load the package for each new R session.

Say you want to install a package that we will use to visualize data called ggplot2. You would do that with a single line of code (package name has to be in parentheses and surrounded with quotation marks):

install.packages("ggplot2")

You will not be able to use the package until you load it with the function library() (after you have installed the package).

library(ggplot2)

Advice

  1. Write your code in a plain text editor (no MS Word, ever)
  2. Reproducibility!

Other people should be able to understand your code; this means others (and you!) can reproduce your original output exactly. Ensure this happens by writing comments that describe why you did something, not just what you did.

  3. Consider using R Markdown: https://rmarkdown.rstudio.com/.

R style guide

Writing good code means you stick to consistent and meaningful rules.

Different style guides exist. You can certainly make your own.

Let’s go over Hadley Wickham’s guide: http://adv-r.had.co.nz/Style.html.

Some takeaways:

  1. File names should be meaningful and end in .R

  2. If files need to be run in sequence, prefix them with numbers.

  3. Variable and function names should be lowercase.

  4. Use an underscore (_) to separate words within a name.

  5. Generally, variable names should be nouns and function names should be verbs.

  6. Avoid using names of existing functions and variables.

  7. Place spaces around all infix operators (=, +, -, <-, etc.).

  8. Always put a space after a comma, and never before.

  9. Place a space before left parentheses, except in a function call.

  10. An opening curly brace should never go on its own line.

  11. Always indent the code inside curly braces.

  12. Strive to limit your code to 80 characters per line.

  13. When indenting your code, use two spaces.
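
A short sketch of what several of these rules look like together. The function compute_mean_age and the vector respondent_ages are made-up names used only to illustrate the style.

```r
# Function name is a verb, variable names are nouns; all lowercase with underscores
compute_mean_age <- function(ages) {
  # Opening brace stays on the line above; body is indented two spaces
  if (length(ages) == 0) {
    return(NA_real_)
  }
  sum(ages) / length(ages)
}

respondent_ages <- c(21, 49, 70)
compute_mean_age(respondent_ages)
```

Note the spaces around <-, == and /, the space after each comma, and the space before the parenthesis in if but not in the function call.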

R essentials

Everything has a name: variables, data, and functions

  • some names are reserved and cannot be used (e.g. FALSE); others already belong to existing functions (e.g. mean()) and are best avoided.

Everything is an object

You do things using functions

  • functions take arguments: the things a function needs to know to perform an action.

To see inside an object, ask for its structure: str()
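
These ideas can be tried out with a small sketch. The object names pets and rounded_mean and the values are invented for illustration.

```r
# A small made-up data frame (an object with a name)
pets <- data.frame(
  name = c("Rex", "Tom"),
  age  = c(3, 5)
)

# round() is a function; x and digits are its arguments
rounded_mean <- round(x = mean(pets$age), digits = 1)

str(pets)          # structure of the data frame
str(rounded_mean)  # structure of a single number
```

str() reports the type of each object and, for a data frame, the type of every column.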

Common errors

  • Incomplete parentheses
  • Watch out for extra spaces
  • Watch out for cases!

Let’s code

R is a powerful programming language. Mastering R requires a lot of effort but anyone can start using R for basic analysis quickly. Here is an example of using R as a calculator:

1 + 1
## [1] 2
2 * 3 
## [1] 6
24 / 6
## [1] 4

Exercise

Open R Studio on your computers and do the following exercises:

  1. What is the sum of 15 and 176?
  2. Multiply 8 by 12 and divide by 3.

Data exploration

  • Data exploration is the art of looking at your data and quickly detecting important patterns and potential problems. Every data analysis project should start with data exploration.

  • The goal of data exploration is to generate promising leads that you can later explore in more depth. Transformation and visualization are central parts of data exploration.

Loading data into R

  • In this class, we will use, among others, the General Social Survey (gss) dataset that was collected from a probability sample of the US population in 2012.

  • The gss is one of the most important social science surveys and has been conducted continuously since 1972. Its goal is to provide clear and unbiased information on public opinion.

  • You can learn more about the General Social Survey, and download and explore the gss data, at the following website: http://gss.norc.org/.

Data frame

In R, datasets are called data frames. A data frame is a rectangular collection of variables in columns and observations in rows.

Setting the working directory (1)

Before we load a dataset, it is good practice to set a working directory. The working directory (wd) should usually be the directory on your computer where you stored the dataset. Unless you give a different path, this is the directory (or folder) where R looks for files to read and saves the output you write.

  • You can set the working directory using the setwd() function.
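
A minimal sketch of checking and changing the working directory. It uses tempdir() as a stand-in for the folder where you keep your data, so that the example runs on any computer; in practice you would put your own folder path inside setwd().

```r
old_wd <- getwd()   # remember the current working directory

setwd(tempdir())    # switch to a temporary folder (stand-in for your data folder)
getwd()             # confirm the change

setwd(old_wd)       # switch back
```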

Setting the working directory (2)

The easiest way to set the working directory is by using the drop down menu:

Loading data into R

Since the gss data are stored as a comma-separated values (csv) file, we need to use the read.csv function.

  • In the line of code below, I stored the dataset in an object that I named gss. From now on, whenever I want to operate on the dataset I just loaded into R, I need to refer to that dataset as gss.
gss <- read.csv("gss2012.csv")
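
If you do not have the gss file at hand yet, the same steps can be tried on a tiny stand-in file. This is only a sketch: the file contents and the object names csv_path and toy are made up.

```r
# Create a tiny stand-in csv file (made-up values) in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,sex,educ",
             "21,1,12",
             "49,2,13"),
           csv_path)

# Read it the same way as the gss file
toy <- read.csv(csv_path)

dim(toy)    # 2 rows, 3 columns
names(toy)
```

The first line of the file becomes the variable names because read.csv treats it as a header by default.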

Inspecting the loaded dataset

You can see how many rows (observations) and how many columns (variables) there are in the dataset by using the dim function (dimensions of the dataframe).

dim(gss)
## [1] 1545    8

The dim function says our dataset has 1,545 rows (observations or cases) and 8 columns (variables).

You can see the names of the variables in the dataset by using the function names.

names(gss)
## [1] "age"      "sex"      "race"     "arrest"   "lockedup" "cappun"  
## [7] "prestg10" "educ"

Let’s use the head function to see the first six rows in the gss dataset. Each row represents one person in the sample.

For instance, the person in row 2 is 49 years old, white (race is coded the following way: 1=white, 2=black, 3=other), female (because in gss the variable sex is coded as 1=male, 2=female), their occupational prestige score is 60, and they completed 13 years of schooling.

head(gss)
##   age sex race arrest lockedup cappun prestg10 educ
## 1  21   1    1      2        2      1       43   12
## 2  49   2    1      2        2      1       60   13
## 3  70   2    2      1        2      2       40   16
## 4  50   2    1      1        1      2       73   19
## 5  35   2    1      2        2      1       31   15
## 6  28   2    1      1        2      2       53   17

Let’s see the last 15 instances:

tail(gss, n=15)
##      age sex race arrest lockedup cappun prestg10 educ
## 1531  74   1    1      2        2      1       40   14
## 1532  60   1    1      2        2      1       45   12
## 1533  42   1    1      2        2      1       61   16
## 1534  36   2    1      2        2      1       70   20
## 1535  50   2    1      2        2      2       53   19
## 1536  63   1    1      2        2      1       52   16
## 1537  71   1    1      2        2      1       35   12
## 1538  50   1    3      1        1      2       32   14
## 1539  65   2    1      2        2      2       51   19
## 1540  60   1    1      1        1      1       35    8
## 1541  78   1    1      2        2      1       28    9
## 1542  61   2    3      2        2      1       38   16
## 1543  53   2    3      2        1      2       47   13
## 1544  48   1    1      2        2      1       41   13
## 1545  37   2    3      2        1      1       53   12

What about the number of rows?

nrow(gss)
## [1] 1545

And columns?

ncol(gss)
## [1] 8
  • You can quickly generate summary information for all the variables in your dataset by using the summary function.

  • For continuous variables (like age), this will calculate six statistics: the minimum, first quartile, median (second quartile), mean, third quartile, and maximum.

  • For categorical variables (like race), the summary function will produce the count of cases in each category.

summary(gss)
##       age             sex             race           arrest     
##  Min.   :18.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:33.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :46.00   Median :2.000   Median :1.000   Median :2.000  
##  Mean   :47.12   Mean   :1.538   Mean   :1.336   Mean   :1.787  
##  3rd Qu.:60.00   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :89.00   Max.   :2.000   Max.   :3.000   Max.   :2.000  
##     lockedup        cappun         prestg10          educ     
##  Min.   :1.00   Min.   :1.000   Min.   :16.00   Min.   : 0.0  
##  1st Qu.:2.00   1st Qu.:1.000   1st Qu.:35.00   1st Qu.:12.0  
##  Median :2.00   Median :1.000   Median :45.00   Median :14.0  
##  Mean   :1.85   Mean   :1.335   Mean   :45.21   Mean   :13.9  
##  3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:55.00   3rd Qu.:16.0  
##  Max.   :2.00   Max.   :2.000   Max.   :80.00   Max.   :20.0
  • However, it seems like our categorical variables are not stored as factors! Let’s convert them now.
gss$race <- factor(gss$race, labels = c("white", "black", "other"))
gss$sex <- factor(gss$sex, levels = c(1, 2), 
                  labels = c("male", "female"))

Let’s check again if the changes were actually done.

summary(gss)
##       age            sex         race          arrest         lockedup   
##  Min.   :18.00   male  :714   white:1179   Min.   :1.000   Min.   :1.00  
##  1st Qu.:33.00   female:831   black: 213   1st Qu.:2.000   1st Qu.:2.00  
##  Median :46.00                other: 153   Median :2.000   Median :2.00  
##  Mean   :47.12                             Mean   :1.787   Mean   :1.85  
##  3rd Qu.:60.00                             3rd Qu.:2.000   3rd Qu.:2.00  
##  Max.   :89.00                             Max.   :2.000   Max.   :2.00  
##      cappun         prestg10          educ     
##  Min.   :1.000   Min.   :16.00   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:35.00   1st Qu.:12.0  
##  Median :1.000   Median :45.00   Median :14.0  
##  Mean   :1.335   Mean   :45.21   Mean   :13.9  
##  3rd Qu.:2.000   3rd Qu.:55.00   3rd Qu.:16.0  
##  Max.   :2.000   Max.   :80.00   Max.   :20.0

You can check the class or how the variable is stored by using the class function:

class(gss$race)
## [1] "factor"

Inspecting individual variables

Instead of telling R to calculate summary statistics for all the variables in the dataset, you can tell R to calculate those measures for only a single variable.

  • To analyze a single variable from a dataset, you have to write the command as datasetname$variablename
summary(gss$race)
## white black other 
##  1179   213   153

We can also calculate single descriptive measures that describe a variable such as the mean. Below, we calculate the mean and store it into a new object called mean_age.

mean_age <- mean(gss$age)

We can print that object to the console by simply writing out its name mean_age.

mean_age
## [1] 47.12104

The mean age among gss respondents in 2012 was 47 years. Note, however, that I have removed all the missing data using the na.omit(gss) function.
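
Missing data matter here because mean() returns NA as soon as one value is missing. A small sketch with made-up values (the objects ages, toy, and toy_complete are for illustration only):

```r
ages <- c(21, 49, NA, 70)   # made-up values with one missing case

mean(ages)                  # NA: the missing value propagates
mean(ages, na.rm = TRUE)    # drop missing values for this one calculation

# na.omit() drops every row that has a missing value on any variable
toy <- data.frame(age = ages, educ = c(12, 13, 16, NA))
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

na.omit() removes a whole row even if only one variable is missing, so it can shrink the dataset more than na.rm = TRUE on a single variable would.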

Exercise

  • Download and load the gss2012-r.csv dataset into R.

  • Set your working directory to where you stored the dataset.

  • What are the dimensions of the gss2012-r dataset?

  • What are the names of the variables in the dataset?

  • What are the values on all the variables for the first two cases?

  • What is the mean of the educ variable?

Other useful functions

List objects in the current workspace: ls()

Remove all objects from workspace: rm(list=ls())

Logical operators

== exactly equal to

!= not equal to

> greater than

< less than

<= less than or equal to

>= greater than or equal to

x & y x AND y

x | y x OR y
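
A short sketch of these operators applied to made-up vectors (age and sex here are standalone vectors, not the gss variables):

```r
age <- c(21, 49, 70, 65)                       # made-up values
sex <- c("male", "female", "female", "male")

age > 65                      # FALSE FALSE  TRUE FALSE
age >= 65                     # FALSE FALSE  TRUE  TRUE
sex != "male"                 # FALSE  TRUE  TRUE FALSE
age >= 65 & sex == "male"     # AND: both conditions must hold
age >= 65 | sex == "male"     # OR: at least one condition must hold

sum(age >= 65 | sex == "male")  # TRUE counts as 1, so this counts matching cases
```

Each comparison is evaluated element by element, which is what makes these operators useful for picking out cases in a dataset.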

Arithmetic operators

+ - * / ^

Summary: single variable

summary(gss$age) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   46.00   47.12   60.00   89.00
mean(gss$age) # mean
## [1] 47.12104
mean(gss$age[gss$sex=="male"]) # mean age for males
## [1] 47.41317
summary(gss$age[gss$sex=="female"]) # summary of age for females
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   34.00   46.00   46.87   59.00   89.00
sd(gss$age) # standard deviation
## [1] 16.57035
var(gss$age) # variance
## [1] 274.5767

Summary: two variables

summary(data.frame(gss$sex, gss$age)) # summarize two variables 
##    gss.sex       gss.age     
##  male  :714   Min.   :18.00  
##  female:831   1st Qu.:33.00  
##               Median :46.00  
##               Mean   :47.12  
##               3rd Qu.:60.00  
##               Max.   :89.00

Table of frequencies for a categorical variable:

table(gss$sex)
## 
##   male female 
##    714    831
table(gss$arrest)
## 
##    1    2 
##  329 1216

Exercise: calculate mean, median, range, and standard deviation for occupational prestige score.

Recode variables

Recode variable arrest into 0/1 variable:

table(gss$arrest)
## 
##    1    2 
##  329 1216
gss$arrest[gss$arrest==2] <- 0 # recode 2 (never arrested) to 0
gss$arrest[gss$arrest==1] <- 1 # 1 (arrested) already has the target value; kept for clarity
table(gss$arrest)
## 
##    0    1 
## 1216  329
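
Recoding one value at a time works here, but the order of the lines can matter when new codes overlap with old ones. As an alternative sketch, ifelse() computes all the new codes from the original values in a single step. The vector arrest below holds made-up values and arrest01 is a hypothetical name.

```r
arrest <- c(1, 2, 2, 1, 2)   # made-up values: 1 = arrested, 2 = never arrested

# Recode in a single step: 2 becomes 0, everything else (here only 1) becomes 1
arrest01 <- ifelse(arrest == 2, 0, 1)

table(arrest, arrest01)      # cross-tabulate old and new codes to check the recode
```

Storing the result under a new name keeps the original variable intact, which makes it easy to check the recode with a cross-tabulation.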

Label levels of the variable

gss$arrest <- factor(gss$arrest, 
                  levels = c(0, 1),
                  labels = c("no", "yes"))
table(gss$arrest)
## 
##   no  yes 
## 1216  329
prop.table(table(gss$arrest))
## 
##       no      yes 
## 0.787055 0.212945

Exercise: recode variable lockedup into 0/1 variable and label the values. Value 1 (has been incarcerated) should remain coded as 1 and 2 should be recoded as 0 (never been incarcerated). How many people in the sample have been incarcerated? How many women have been incarcerated? How many men?

Rename variables

Rename variable lockedup into jail:

names(gss)[names(gss) == 'lockedup'] <- 'jail'
names(gss)
## [1] "age"      "sex"      "race"     "arrest"   "jail"     "cappun"  
## [7] "prestg10" "educ"
table(gss$jail)
## 
##    0    1 
## 1313  232

Exercise

  1. Rename variable jail back into lockedup and
  2. Rename variable sex into female

Select variables

Keep only three variables: sex, age and arrest

Method 1

gss_small <- subset(gss, select=c(sex, age, arrest))
summary(gss_small)
##      sex           age        arrest    
##  male  :714   Min.   :18.00   no :1216  
##  female:831   1st Qu.:33.00   yes: 329  
##               Median :46.00             
##               Mean   :47.12             
##               3rd Qu.:60.00             
##               Max.   :89.00

Method 2

myvars <- c("sex", "age", "arrest")
myvars
## [1] "sex"    "age"    "arrest"
gss_small <- gss[myvars]
summary(gss_small)
##      sex           age        arrest    
##  male  :714   Min.   :18.00   no :1216  
##  female:831   1st Qu.:33.00   yes: 329  
##               Median :46.00             
##               Mean   :47.12             
##               3rd Qu.:60.00             
##               Max.   :89.00

Method 3 (based on order)

head(gss)
##   age    sex  race arrest jail cappun prestg10 educ
## 1  21   male white     no    0      1       43   12
## 2  49 female white     no    0      1       60   13
## 3  70 female black    yes    0      2       40   16
## 4  50 female white    yes    1      2       73   19
## 5  35 female white     no    0      1       31   15
## 6  28 female white    yes    0      2       53   17
gss_small <- gss[c(1:2,4)] 
summary(gss_small)
##       age            sex      arrest    
##  Min.   :18.00   male  :714   no :1216  
##  1st Qu.:33.00   female:831   yes: 329  
##  Median :46.00                          
##  Mean   :47.12                          
##  3rd Qu.:60.00                          
##  Max.   :89.00

Exercise

Keep only variables that describe attitudes about capital punishment (cappun) and the occupational prestige score (prestg10).

Exclude only these two variables: sex and age

gss_excl <- subset(gss, select = -c(sex, age))
summary(gss_excl)
##     race      arrest          jail            cappun         prestg10    
##  white:1179   no :1216   Min.   :0.0000   Min.   :1.000   Min.   :16.00  
##  black: 213   yes: 329   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:35.00  
##  other: 153              Median :0.0000   Median :1.000   Median :45.00  
##                          Mean   :0.1502   Mean   :1.335   Mean   :45.21  
##                          3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:55.00  
##                          Max.   :1.0000   Max.   :2.000   Max.   :80.00  
##       educ     
##  Min.   : 0.0  
##  1st Qu.:12.0  
##  Median :14.0  
##  Mean   :13.9  
##  3rd Qu.:16.0  
##  Max.   :20.0

Exercise: exclude variables that describe occupational prestige score and educational attainment.

Select observations

Keep only men who have been arrested

table(gss$arrest)
## 
##   no  yes 
## 1216  329
gss_arrested_men <- subset(gss, sex=="male" & arrest=="yes")
summary(gss_arrested_men)
##       age            sex         race     arrest         jail       
##  Min.   :18.00   male  :213   white:164   no :  0   Min.   :0.0000  
##  1st Qu.:31.00   female:  0   black: 27   yes:213   1st Qu.:0.0000  
##  Median :42.00                other: 22             Median :1.0000  
##  Mean   :44.28                                      Mean   :0.5962  
##  3rd Qu.:54.00                                      3rd Qu.:1.0000  
##  Max.   :89.00                                      Max.   :1.0000  
##      cappun         prestg10          educ      
##  Min.   :1.000   Min.   :17.00   Min.   : 4.00  
##  1st Qu.:1.000   1st Qu.:33.00   1st Qu.:12.00  
##  Median :1.000   Median :39.00   Median :12.00  
##  Mean   :1.319   Mean   :42.01   Mean   :13.09  
##  3rd Qu.:2.000   3rd Qu.:49.00   3rd Qu.:15.00  
##  Max.   :2.000   Max.   :80.00   Max.   :20.00
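
The same selection can also be written with square brackets instead of subset(). A sketch on a made-up data frame (toy and arrested_men are hypothetical names):

```r
toy <- data.frame(
  sex    = c("male", "female", "male", "male"),
  arrest = c("yes", "yes", "no", "yes"),
  age    = c(30, 41, 52, 63)
)

# Same idea as subset(): a logical condition picks the rows; the empty slot
# after the comma keeps all columns
arrested_men <- toy[toy$sex == "male" & toy$arrest == "yes", ]

nrow(arrested_men)
```

With brackets you have to spell out toy$ for each variable, which is one reason subset() is often easier to read.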

Exercise: keep only women who have never been arrested and who are older than 30.

Linear regression

  1. Install (if necessary) and load packages
library(foreign)
  1. Set working directory
setwd("~/Dropbox/Teaching/Rutgers/Data Science/Lectures/3 R Basics")
  1. Load dataset into R and drop all missing cases
gss <- read.dta("gss2012prison.dta", convert.factors = FALSE)
gss <- na.omit(gss)
  1. Run ordinary least squares regression (OLS): predicting occupational prestige score
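# Named predictors; factor(health) treats health as categorical
model <- lm(prestg10 ~ age + factor(health) + prison + female, data = gss)
# "." uses all other columns (health as numeric); overwrites the model above
model <- lm(prestg10 ~ ., data = gss)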
  1. Print model summary

How do we interpret the coefficients?

summary(model)
## 
## Call:
## lm(formula = prestg10 ~ ., data = gss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.299 -10.214  -0.646   9.947  38.512 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.69485    1.56481  30.480  < 2e-16 ***
## age          0.10247    0.02485   4.123 4.02e-05 ***
## health      -2.72415    0.49314  -5.524 4.13e-08 ***
## prison      -5.85862    1.19423  -4.906 1.07e-06 ***
## female      -1.57548    0.82411  -1.912   0.0562 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 1108 degrees of freedom
## Multiple R-squared:  0.06862,    Adjusted R-squared:  0.06526 
## F-statistic: 20.41 on 4 and 1108 DF,  p-value: 3.066e-16
  1. Print only coefficients
coef(model)
## (Intercept)         age      health      prison      female 
##  47.6948527   0.1024676  -2.7241485  -5.8586227  -1.5754846
mymodel_coefs <- coef(model)
mymodel_coefs
## (Intercept)         age      health      prison      female 
##  47.6948527   0.1024676  -2.7241485  -5.8586227  -1.5754846
  1. Calculate predicted values
gss$p <- predict(model, newdata = gss) # predict on full data and store in p
summary(gss$p)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32.44   43.38   45.69   45.29   47.74   54.09
head(gss)
##    prestg10 age health prison female        p
## 1        38  22      2      0      0 44.50084
## 2        43  21      1      0      0 47.12252
## 3        75  42      2      0      0 46.55020
## 6        73  50      4      1      1 34.48753
## 10       53  28      2      0      1 43.54016
## 11       49  55      4      0      0 42.43398
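
One way to see what the coefficients mean is to rebuild a predicted value by hand. The sketch below copies the coefficients from the coef(model) output and the values of the respondent in the first row of the head(gss) output; the object names coefs, respondent, and predicted are made up.

```r
# Coefficients copied from the coef(model) output above
coefs <- c(intercept = 47.6948527, age = 0.1024676, health = -2.7241485,
           prison = -5.8586227, female = -1.5754846)

# Values for the respondent in the first row of head(gss);
# the leading 1 multiplies the intercept
respondent <- c(intercept = 1, age = 22, health = 2, prison = 0, female = 0)

# Predicted prestige = intercept + sum of (coefficient * value)
predicted <- sum(coefs * respondent)
predicted   # about 44.50, matching the p column for row 1
```

Read this way, each coefficient is the change in predicted prestige for a one-unit increase in that variable, holding the others fixed: for example, one more year of age adds about 0.10 points.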
  1. Compute errors
gss$error <- gss$p - gss$prestg10
summary(gss$error)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -38.5125  -9.9475   0.6458   0.0000  10.2143  33.2990
mean(gss$error)
## [1] 1.224966e-13
round(mean(gss$error))
## [1] 0
hist(gss$error)

  1. Calculate RMSE
sqrt(mean(gss$error^2))
## [1] 13.40681
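
The RMSE (13.41) is slightly smaller than the residual standard error in the model summary (13.44). The sketch below suggests why: the RMSE divides the squared errors by the number of cases, while the residual standard error divides by the residual degrees of freedom. It uses the built-in mtcars data and a made-up model fit only so that it runs on its own.

```r
# Built-in example data, used here only so the sketch runs on its own
fit <- lm(mpg ~ wt + hp, data = mtcars)

errors <- predict(fit, newdata = mtcars) - mtcars$mpg   # predicted minus observed

rmse <- sqrt(mean(errors^2))                     # divides by n
rse  <- sqrt(sum(errors^2) / df.residual(fit))   # divides by n minus number of coefficients

rmse
rse
summary(fit)$sigma   # the "Residual standard error" line reports this value
```

With many cases and few predictors the two numbers are close, as they are in the gss model.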
  1. Plot coefficients
if(!require(coefplot)) install.packages("coefplot",repos = "http://cran.us.r-project.org")
## Loading required package: coefplot
## Warning: package 'coefplot' was built under R version 3.4.3
library(coefplot)
coefplot(model)

coefplot(model, intercept = FALSE) # without intercept

coefplot(model, intercept = FALSE, sort = "magnitude") # without intercept, sorted by magnitude

Resources

Healy, K. (2019). Data Visualization. Cambridge University Press.

Wickham, H. & Grolemund, G. (2017). R for Data Science. O’Reilly.