Data analysis involves making sense of the world by collecting, summarizing, and modelling quantitative information (numbers).
The first part of data analysis is finding or creating the dataset that you will analyze. In this class, we will use, among others, the data from the General Social Survey (described later in this document).
The second part is tidying your data. This means that you need to have data in a format that can be analyzed. Specifically, cases have to be in rows and variables have to be in columns.
Once you have tidied the data, you often need to transform the data so that you can answer your research questions most effectively. For instance, you can recode the age variable into two groups: people older than 65 and people 65 and younger.
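As a minimal sketch of that recoding, the snippet below splits a small made-up vector of ages into the two groups. The vector `age` and the group labels are hypothetical and chosen only for illustration.

```r
# Hypothetical ages, for illustration only
age <- c(23, 65, 66, 80, 41)

# ifelse() checks each age and assigns one of two labels
age_group <- ifelse(age > 65, "older than 65", "65 and younger")

# Count how many cases fall into each group
table(age_group)
```

Note that a person who is exactly 65 lands in the "65 and younger" group because the comparison is strictly greater than.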
We will discuss the visualization, modeling, and communication part later in the semester.
Data management is especially important when you only have access to messy data. Data are typically messy when you see one or all of these three things occur:
In sum, before the analysis, such as calculating correlations and doing data visualization, your data should be tidy: each variable is a column and each observation is a row.
Frequently used data management operations include selecting columns (variables), filtering rows (cases), and creating new columns. These are the operations we will cover here.
Data management in R can be done without using any additional packages, but the code is not very intuitive. Instead, consider using the dplyr package.
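The three operations mentioned above (selecting columns, filtering rows, creating new columns) might look like this with dplyr. This is only a sketch: it assumes dplyr is already installed, and the data frame `people` and the new column `decade` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data frame, for illustration only
people <- data.frame(
  name = c("Ann", "Bob", "Cal", "Dee"),
  age  = c(34, 70, 58, 81),
  educ = c(16, 12, 14, 10)
)

older <- people %>%
  select(name, age) %>%                   # keep two columns (variables)
  filter(age > 65) %>%                    # keep rows (cases) older than 65
  mutate(decade = floor(age / 10) * 10)   # create a new column

older
```

Each step takes a data frame and returns a data frame, which is why the steps can be chained with the pipe operator.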
Computing is a fundamental part of modern statistics and data analysis. We will use R, a free software environment for statistical computing and graphics. Learning how to program in R is also a useful foundation for learning other programming languages.
R packages need to be installed and loaded for many specialized operations. Packages are also loaded when you want to perform standard operations more efficiently compared to what is possible in the R base installation.
Once an R package is installed on your computer, you do not need to install the package again when you want to use it but you need to load the package for each new R session.
Say you want to install a package called ggplot2 that we will use to visualize data. You would do that with a single line of code (the package name has to be in parentheses and surrounded with quotation marks):
install.packages("ggplot2")
You will not be able to use the package until you load it with the function library() (after you have installed the package).
library(ggplot2)
Other people should be able to understand your code, so that others (and you!) can reproduce your original output exactly. Make this possible by writing comments that describe why (not what) you did something.
Writing good code means you stick to consistent and meaningful rules.
Different style guides exist. You can certainly make your own.
Let’s go over Hadley Wickham’s guide: http://adv-r.had.co.nz/Style.html.
Some takeaways:
File names should be meaningful and end in .R
If files need to be run in sequence, prefix them with numbers.
Variable and function names should be lowercase.
Use an underscore (_) to separate words within a name.
Generally, variable names should be nouns and function names should be verbs.
Avoid using names of existing functions and variables.
Place spaces around all infix operators (=, +, -, <-, etc.).
Always put a space after a comma, and never before.
Place a space before left parentheses, except in a function call.
An opening curly brace should never go on its own line.
Always indent the code inside curly braces.
Strive to limit your code to 80 characters per line.
When indenting your code, use two spaces.
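A short sketch of what code following these takeaways could look like. The function `compute_share_above` is hypothetical and written only to illustrate the naming, spacing, brace, and indentation rules, along with a comment that explains why rather than what.

```r
# We return a proportion (not a count) so that groups of different
# sizes can be compared directly.
compute_share_above <- function(values, cutoff) {
  if (length(values) == 0) {
    return(NA_real_)
  }
  mean(values > cutoff)
}

respondent_ages <- c(21, 49, 70, 50, 35, 28)
compute_share_above(respondent_ages, cutoff = 65)
```

The object name is a noun, the function name starts with a verb, words are separated by underscores, and the body is indented by two spaces.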
Everything has a name: variables, data, and functions (avoid names that R already uses, such as FALSE or mean()).
Everything is an object.
You do things using functions.
To see inside an object, ask for its structure: str()
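These ideas can be tried out in a few lines. The names `ages` and `people` and their values are made up for illustration.

```r
# A numeric vector stored under the name `ages` (made-up values)
ages <- c(21, 49, 70)

# A data frame is also an object; it bundles several vectors as columns
people <- data.frame(age = ages, sex = c("male", "female", "female"))

# Functions do things to objects
mean(ages)

# str() shows what is inside an object
str(ages)
str(people)
```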
R is a powerful programming language. Mastering R requires a lot of effort but anyone can start using R for basic analysis quickly. Here is an example of using R as a calculator:
1 + 1
## [1] 2
2 * 3
## [1] 6
24 / 6
## [1] 4
Exercise
Open RStudio on your computer and do the following exercises:
Data exploration is the art of looking at your data, quickly detecting important patterns, and potential problems. Every data analysis project should start with data exploration.
The goal of data exploration is to generate promising leads that you can later explore in more depth. Transformation and visualization are central parts of data exploration.
In this class, we will use, among others, the General Social Survey (gss) dataset that was collected from a probability sample of the US population in 2012.
The gss is one of the most important social science surveys and has been conducted continuously since 1972. The goal is to provide clear and unbiased information on public opinion.
You can find out more about the General Social Survey, and download and explore the gss data, at the following website: http://gss.norc.org/.
In R, datasets are called data frames. A data frame is a rectangular collection of variables in columns and observations in rows.
Before we load a dataset, a good practice is to set a working directory. The working directory (wd) should usually be the directory on your computer where you stored the dataset. This will be the directory (or folder) where R will automatically store all the output you produce.
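You can set the working directory with the setwd() function. The easiest way to set the working directory in RStudio is by using the drop-down menu: Session > Set Working Directory > Choose Directory.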
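The working directory can also be checked and changed from code. The sketch below uses `tempdir()` only so that it runs on any computer; in practice you would pass the path of the folder that holds your dataset.

```r
# Remember the current working directory so it can be restored
old_wd <- getwd()

# tempdir() is a stand-in here; replace it with the folder
# where your dataset is stored
setwd(tempdir())
getwd()

# Go back to where we started
setwd(old_wd)
```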
Since the gss data are stored as a comma-separated values (csv) file, we need to use the read.csv function.
The code below stores the dataset in an object named gss. From now on, whenever I want to operate on the dataset I just loaded into R, I need to refer to that dataset as gss.
gss <- read.csv("gss2012.csv")
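To see what read.csv does without needing the course file, here is a self-contained sketch. It writes a tiny csv file to a temporary location and reads it back; the file and the object name `toy` are made up for illustration.

```r
# Write a tiny, made-up csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,sex,educ",
             "21,1,12",
             "49,2,13"),
           csv_path)

# Read it back into a data frame and store it under a name
toy <- read.csv(csv_path)
toy
```

The first line of the file supplies the variable names, and every following line becomes one row of the data frame.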
You can see how many rows (observations) and how many columns (variables) there are in the dataset by using the dim function (dimensions of the data frame).
dim(gss)
## [1] 1545 8
The dim function says our dataset has 1,545 rows (observations or cases) and 8 columns (variables).
You can see the names of the variables in the dataset by using the function names.
names(gss)
## [1] "age" "sex" "race" "arrest" "lockedup" "cappun"
## [7] "prestg10" "educ"
Let’s use the head function to see the first six rows in the gss dataset. Each row represents one person in the sample.
For instance, the person in row 2 is 49 years old, white (race is coded as 1=white, 2=black, 3=other), female (because in gss the variable sex is coded as 1=male, 2=female), has an occupational prestige score of 60, and completed 13 years of schooling.
head(gss)
## age sex race arrest lockedup cappun prestg10 educ
## 1 21 1 1 2 2 1 43 12
## 2 49 2 1 2 2 1 60 13
## 3 70 2 2 1 2 2 40 16
## 4 50 2 1 1 1 2 73 19
## 5 35 2 1 2 2 1 31 15
## 6 28 2 1 1 2 2 53 17
Let’s see the last 15 instances:
tail(gss, n = 15)
## age sex race arrest lockedup cappun prestg10 educ
## 1531 74 1 1 2 2 1 40 14
## 1532 60 1 1 2 2 1 45 12
## 1533 42 1 1 2 2 1 61 16
## 1534 36 2 1 2 2 1 70 20
## 1535 50 2 1 2 2 2 53 19
## 1536 63 1 1 2 2 1 52 16
## 1537 71 1 1 2 2 1 35 12
## 1538 50 1 3 1 1 2 32 14
## 1539 65 2 1 2 2 2 51 19
## 1540 60 1 1 1 1 1 35 8
## 1541 78 1 1 2 2 1 28 9
## 1542 61 2 3 2 2 1 38 16
## 1543 53 2 3 2 1 2 47 13
## 1544 48 1 1 2 2 1 41 13
## 1545 37 2 3 2 1 1 53 12
What about the number of rows?
nrow(gss)
## [1] 1545
And columns?
ncol(gss)
## [1] 8
You can quickly generate summary information for all the variables in your dataset by using the summary function.
For continuous variables (like age), this will calculate six statistics: the minimum, the first quartile, the median (which is the second quartile), the mean, the third quartile, and the maximum.
For categorical variables (like race), the summary function will produce the count of cases in each category, provided the variable is stored as a factor rather than as numeric codes.
summary(gss)
## age sex race arrest
## Min. :18.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:33.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :46.00 Median :2.000 Median :1.000 Median :2.000
## Mean :47.12 Mean :1.538 Mean :1.336 Mean :1.787
## 3rd Qu.:60.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :89.00 Max. :2.000 Max. :3.000 Max. :2.000
## lockedup cappun prestg10 educ
## Min. :1.00 Min. :1.000 Min. :16.00 Min. : 0.0
## 1st Qu.:2.00 1st Qu.:1.000 1st Qu.:35.00 1st Qu.:12.0
## Median :2.00 Median :1.000 Median :45.00 Median :14.0
## Mean :1.85 Mean :1.335 Mean :45.21 Mean :13.9
## 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:55.00 3rd Qu.:16.0
## Max. :2.00 Max. :2.000 Max. :80.00 Max. :20.0
# Store the numeric codes as labelled factors
gss$race <- factor(gss$race, labels = c("white", "black", "other"))
gss$sex <- factor(gss$sex, levels = c(1, 2),
                  labels = c("male", "female"))
Let’s run summary again to check that the changes were applied.
summary(gss)
## age sex race arrest lockedup
## Min. :18.00 male :714 white:1179 Min. :1.000 Min. :1.00
## 1st Qu.:33.00 female:831 black: 213 1st Qu.:2.000 1st Qu.:2.00
## Median :46.00 other: 153 Median :2.000 Median :2.00
## Mean :47.12 Mean :1.787 Mean :1.85
## 3rd Qu.:60.00 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :89.00 Max. :2.000 Max. :2.00
## cappun prestg10 educ
## Min. :1.000 Min. :16.00 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:35.00 1st Qu.:12.0
## Median :1.000 Median :45.00 Median :14.0
## Mean :1.335 Mean :45.21 Mean :13.9
## 3rd Qu.:2.000 3rd Qu.:55.00 3rd Qu.:16.0
## Max. :2.000 Max. :80.00 Max. :20.0
You can check the class of a variable (how it is stored) by using the class function:
class(gss$race)
## [1] "factor"
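The difference between numeric codes and factors can be seen on a small example. The vector `race_code` is made up, and uses the coding described earlier (1=white, 2=black, 3=other).

```r
# Made-up race codes: 1 = white, 2 = black, 3 = other
race_code <- c(1, 1, 2, 3, 1, 2)

# Stored as numbers, summary() reports quartiles and a mean,
# which are not meaningful for category codes
summary(race_code)

# Stored as a factor, summary() reports a count per category
race <- factor(race_code,
               levels = c(1, 2, 3),
               labels = c("white", "black", "other"))
summary(race)
class(race)
```

The `levels` argument lists the codes found in the data and `labels` gives the name attached to each code, in the same order.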
Instead of telling R to calculate summary statistics for all the variables in the dataset, you can tell R to calculate those measures for only a single variable. Refer to a single variable with the dollar-sign syntax:
datasetname$variablename
summary(gss$race)
## white black other
## 1179 213 153
We can also calculate single descriptive measures that describe a variable, such as the mean. Below, we calculate the mean and store it in a new object called mean_age.
mean_age <- mean(gss$age)
We can print that object to the console by simply writing out its name, mean_age.
mean_age
## [1] 47.12104
The mean age among gss respondents in 2012 was 47 years. Note, however, that I had already removed all cases with missing data using the na.omit(gss) function.
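Why removing missing data matters for the mean can be sketched with a few made-up values. The objects `age` and `toy` are hypothetical.

```r
# Made-up ages with one missing value
age <- c(21, 49, NA, 50)

mean(age)                 # NA: one missing value makes the mean missing
mean(age, na.rm = TRUE)   # ignore missing values for this calculation only
mean(na.omit(age))        # same result after removing missing values

# On a data frame, na.omit() drops every row that has any missing value
toy <- data.frame(age = age, educ = c(12, NA, 16, 19))
nrow(na.omit(toy))
```

Because na.omit on a data frame removes a whole row when any variable is missing, it can discard more cases than are missing on the variable you care about.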
Download and load the gss2012-r.csv dataset into R.
Set your working directory to where you stored the dataset.
What are the dimensions of the gss2012-r dataset?
What are the names of the variables in the dataset?
What are the values on all the variables for the first two cases?
What is the mean of the educ variable?
List objects in the current workspace: ls()
Remove all objects from workspace: rm(list = ls())
Operator    Meaning
==          exactly equal to
!=          not equal to
>           greater than
<           less than
<=          less than or equal to
>=          greater than or equal to
x & y       x AND y
x | y       x OR y
Operator    Meaning
+           addition
-           subtraction
*           multiplication
/           division
^           exponentiation
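A short sketch of how these operators behave on vectors. The values of `age` and `educ` are made up for illustration.

```r
# Made-up values for illustration
age  <- c(21, 49, 70, 50)
educ <- c(12, 13, 16, 19)

age > 65                   # one TRUE/FALSE per element
age >= 50 & educ > 15      # both conditions must hold
age < 30 | educ == 19      # at least one condition must hold
age != 49                  # TRUE for everything except 49

# Arithmetic also works element by element
(age - 18) / 2
educ ^ 2
```

The TRUE/FALSE vectors produced by comparisons are what make it possible to pick out subsets of cases, as in the examples that follow.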
summary(gss$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 46.00 47.12 60.00 89.00
mean(gss$age) # mean
## [1] 47.12104
mean(gss$age[gss$sex=="male"]) # mean age for males
## [1] 47.41317
summary(gss$age[gss$sex == "female"]) # summary of age for females (includes the mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 34.00 46.00 46.87 59.00 89.00
sd(gss$age) # standard deviation
## [1] 16.57035
var(gss$age) # variance
## [1] 274.5767
summary(data.frame(gss$sex, gss$age)) # summarize two variables
## gss.sex gss.age
## male :714 Min. :18.00
## female:831 1st Qu.:33.00
## Median :46.00
## Mean :47.12
## 3rd Qu.:60.00
## Max. :89.00
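The expression `gss$age[gss$sex=="male"]` used above works in steps, which can be sketched on made-up data. The vectors `age` and `sex` here are hypothetical.

```r
# Made-up data for illustration
age <- c(21, 49, 70, 50, 35)
sex <- factor(c("male", "female", "female", "female", "male"))

# Step 1: the comparison yields one TRUE/FALSE per case
is_male <- sex == "male"
is_male

# Step 2: indexing with that logical vector keeps only the TRUE cases
age[is_male]

# Step 3: any summary function can be applied to the subset
mean(age[is_male])

# The variance is the square of the standard deviation
all.equal(var(age), sd(age)^2)
```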
Table of frequencies for a categorical variable:
table(gss$sex)
##
## male female
## 714 831
table(gss$arrest)
##
## 1 2
## 329 1216
Exercise: calculate mean, median, range, and standard deviation for occupational prestige score.
Recode variable arrest into 0/1 variable:
table(gss$arrest)
##
## 1 2
## 329 1216
gss$arrest[gss$arrest == 2] <- 0  # 2 (never arrested) becomes 0
gss$arrest[gss$arrest == 1] <- 1  # 1 (arrested) stays 1; kept to make the coding explicit
table(gss$arrest)
##
## 0 1
## 1216 329
Label levels of the variable
gss$arrest <- factor(gss$arrest,
                     levels = c(0, 1),
                     labels = c("no", "yes"))
table(gss$arrest)
##
## no yes
## 1216 329
prop.table(table(gss$arrest))
##
## no yes
## 0.787055 0.212945
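The same recode, label, and tabulate sequence can be sketched on a small made-up vector. This version uses `ifelse()` to recode in one step, which is an alternative to the two assignments used above rather than the approach taken in these notes; `arrest_raw` is hypothetical.

```r
# Made-up codes using the original scheme: 1 = yes, 2 = no
arrest_raw <- c(1, 2, 2, 1, 2)

# Recode in one step: 1 stays 1, and 2 becomes 0
arrest01 <- ifelse(arrest_raw == 1, 1, 0)

# Attach labels, then tabulate counts and proportions
arrest <- factor(arrest01, levels = c(0, 1), labels = c("no", "yes"))
table(arrest)
prop.table(table(arrest))
```

Recoding into a new object, instead of overwriting the original, keeps the raw codes available in case the recode needs to be checked.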
Exercise: recode variable lockedup into 0/1 variable and label the values. Value 1 (has been incarcerated) should remain coded as 1 and 2 should be recoded as 0 (never been incarcerated). How many people in the sample have been incarcerated? How many women have been incarcerated? How many men?
Rename variable lockedup into jail:
names(gss)[names(gss) == 'lockedup'] <- 'jail'
names(gss)
## [1] "age" "sex" "race" "arrest" "jail" "cappun"
## [7] "prestg10" "educ"
table(gss$jail)
##
## 0 1
## 1313 232
Exercise
Keep only three variables: sex, age and arrest
Method 1
gss_small <- subset(gss, select=c(sex, age, arrest))
summary(gss_small)
## sex age arrest
## male :714 Min. :18.00 no :1216
## female:831 1st Qu.:33.00 yes: 329
## Median :46.00
## Mean :47.12
## 3rd Qu.:60.00
## Max. :89.00
Method 2
myvars <- c("sex", "age", "arrest")
myvars
## [1] "sex" "age" "arrest"
gss_small <- gss[myvars]
summary(gss_small)
## sex age arrest
## male :714 Min. :18.00 no :1216
## female:831 1st Qu.:33.00 yes: 329
## Median :46.00
## Mean :47.12
## 3rd Qu.:60.00
## Max. :89.00
Method 3 (based on order)
head(gss)
## age sex race arrest jail cappun prestg10 educ
## 1 21 male white no 0 1 43 12
## 2 49 female white no 0 1 60 13
## 3 70 female black yes 0 2 40 16
## 4 50 female white yes 1 2 73 19
## 5 35 female white no 0 1 31 15
## 6 28 female white yes 0 2 53 17
gss_small <- gss[c(1:2,4)]
summary(gss_small)
## age sex arrest
## Min. :18.00 male :714 no :1216
## 1st Qu.:33.00 female:831 yes: 329
## Median :46.00
## Mean :47.12
## 3rd Qu.:60.00
## Max. :89.00
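The three methods can be compared side by side on a made-up data frame. The object `toy` and its values are hypothetical.

```r
# Made-up data frame with the same kinds of columns
toy <- data.frame(
  age    = c(21, 49, 70),
  sex    = c("male", "female", "female"),
  race   = c("white", "white", "black"),
  arrest = c("no", "no", "yes")
)

by_subset   <- subset(toy, select = c(sex, age, arrest))  # Method 1
by_name     <- toy[c("sex", "age", "arrest")]             # Method 2
by_position <- toy[c(1:2, 4)]                             # Method 3

names(by_subset)     # columns come back in the order requested
names(by_name)       # same here
names(by_position)   # columns come back in the order of the positions
```

Selecting by position is the most fragile of the three, because the code silently picks different variables if the column order of the dataset ever changes.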
Exercise
Keep only variables that describe attitudes about capital punishment (cappun) and the occupational prestige score (prestg10).
Exclude only these two variables: sex and age
gss_excl <- subset(gss, select = -c(sex, age))
summary(gss_excl)
## race arrest jail cappun prestg10
## white:1179 no :1216 Min. :0.0000 Min. :1.000 Min. :16.00
## black: 213 yes: 329 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:35.00
## other: 153 Median :0.0000 Median :1.000 Median :45.00
## Mean :0.1502 Mean :1.335 Mean :45.21
## 3rd Qu.:0.0000 3rd Qu.:2.000 3rd Qu.:55.00
## Max. :1.0000 Max. :2.000 Max. :80.00
## educ
## Min. : 0.0
## 1st Qu.:12.0
## Median :14.0
## Mean :13.9
## 3rd Qu.:16.0
## Max. :20.0
Exercise: exclude variables that describe occupational prestige score and educational attainment.
Keep only men who have been arrested
table(gss$arrest)
##
## no yes
## 1216 329
gss_arrested_men <- subset(gss, sex=="male" & arrest=="yes")
summary(gss_arrested_men)
## age sex race arrest jail
## Min. :18.00 male :213 white:164 no : 0 Min. :0.0000
## 1st Qu.:31.00 female: 0 black: 27 yes:213 1st Qu.:0.0000
## Median :42.00 other: 22 Median :1.0000
## Mean :44.28 Mean :0.5962
## 3rd Qu.:54.00 3rd Qu.:1.0000
## Max. :89.00 Max. :1.0000
## cappun prestg10 educ
## Min. :1.000 Min. :17.00 Min. : 4.00
## 1st Qu.:1.000 1st Qu.:33.00 1st Qu.:12.00
## Median :1.000 Median :39.00 Median :12.00
## Mean :1.319 Mean :42.01 Mean :13.09
## 3rd Qu.:2.000 3rd Qu.:49.00 3rd Qu.:15.00
## Max. :2.000 Max. :80.00 Max. :20.00
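In the summary above, the categories female and no are still listed with a count of 0. Filtering rows does not remove factor levels that no longer occur. A small sketch on made-up data shows this, and how base R's `droplevels()` can tidy it up; the data frame `toy` is hypothetical.

```r
# Made-up data for illustration
toy <- data.frame(
  sex    = factor(c("male", "female", "male", "female")),
  arrest = factor(c("yes", "yes", "no", "no")),
  age    = c(30, 45, 52, 61)
)

arrested_men <- subset(toy, sex == "male" & arrest == "yes")

# The unused level is still listed, with a count of zero
table(arrested_men$sex)

# droplevels() removes factor levels that no longer occur
arrested_men <- droplevels(arrested_men)
table(arrested_men$sex)
```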
Exercise: keep only women who have never been arrested and who are older than 30.
library(foreign)
setwd("~/Dropbox/Teaching/Rutgers/Data Science/Lectures/3 R Basics")
gss <- read.dta("gss2012prison.dta", convert.factors = FALSE)
gss <- na.omit(gss)
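# Explicit formula, treating health as a categorical predictor
model <- lm(prestg10 ~ age + factor(health) + prison + female, data = gss)
# Shorthand: regress prestg10 on every other column (health is numeric here); this overwrites the model above
model <- lm(prestg10 ~ ., data = gss)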
How do we interpret the coefficients?
summary(model)
##
## Call:
## lm(formula = prestg10 ~ ., data = gss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.299 -10.214 -0.646 9.947 38.512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.69485 1.56481 30.480 < 2e-16 ***
## age 0.10247 0.02485 4.123 4.02e-05 ***
## health -2.72415 0.49314 -5.524 4.13e-08 ***
## prison -5.85862 1.19423 -4.906 1.07e-06 ***
## female -1.57548 0.82411 -1.912 0.0562 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.44 on 1108 degrees of freedom
## Multiple R-squared: 0.06862, Adjusted R-squared: 0.06526
## F-statistic: 20.41 on 4 and 1108 DF, p-value: 3.066e-16
coef(model)
## (Intercept) age health prison female
## 47.6948527 0.1024676 -2.7241485 -5.8586227 -1.5754846
mymodel_coefs <- coef(model)
mymodel_coefs
## (Intercept) age health prison female
## 47.6948527 0.1024676 -2.7241485 -5.8586227 -1.5754846
gss$p <- predict(model, newdata = gss) # predict on full data and store in p
summary(gss$p)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.44 43.38 45.69 45.29 47.74 54.09
head(gss)
## prestg10 age health prison female p
## 1 38 22 2 0 0 44.50084
## 2 43 21 1 0 0 47.12252
## 3 75 42 2 0 0 46.55020
## 6 73 50 4 1 1 34.48753
## 10 53 28 2 0 1 43.54016
## 11 49 55 4 0 0 42.43398
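One way to get a feel for the coefficients is to reproduce a predicted value by hand. The sketch below copies the coefficients from the coef(model) output and the values of the first respondent from the head(gss) output; the object names `coefs` and `respondent` are made up.

```r
# Coefficients copied from the coef(model) output
coefs <- c(intercept = 47.6948527, age = 0.1024676, health = -2.7241485,
           prison = -5.8586227, female = -1.5754846)

# First respondent: age 22, health 2, prison 0, female 0.
# The leading 1 multiplies the intercept.
respondent <- c(intercept = 1, age = 22, health = 2, prison = 0, female = 0)

# Predicted prestige = intercept + sum of (coefficient * value)
predicted <- sum(coefs * respondent)
predicted
```

The result agrees with the value of p in the first row, 44.50084. Each coefficient is the change in predicted prestige for a one-unit increase in that variable while the other variables are held fixed.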
gss$error <- gss$p - gss$prestg10 # prediction error: predicted minus observed
summary(gss$error)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -38.5125 -9.9475 0.6458 0.0000 10.2143 33.2990
mean(gss$error)
## [1] 1.224966e-13
round(mean(gss$error))
## [1] 0
hist(gss$error)
sqrt(mean(gss$error^2)) # root mean squared error (RMSE)
## [1] 13.40681
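This value (13.41) is slightly smaller than the residual standard error in the model summary (13.44). Both are built from the same squared errors, but the first divides by the number of cases and the second by the residual degrees of freedom (1108 here). The sketch below shows the relationship on `mtcars`, a dataset that ships with R and is used only as a stand-in so the example runs without the course data.

```r
# mtcars ships with R; it stands in here for the survey data
fit <- lm(mpg ~ wt + hp, data = mtcars)

errors <- fitted(fit) - mtcars$mpg   # predicted minus observed

# Root mean squared error: divide by the number of cases
rmse <- sqrt(mean(errors^2))

# Residual standard error: divide by the residual degrees of freedom
rse <- sqrt(sum(errors^2) / df.residual(fit))

rmse
rse
summary(fit)$sigma   # matches rse
```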
if (!require(coefplot)) install.packages("coefplot", repos = "http://cran.us.r-project.org")
## Loading required package: coefplot
## Warning: package 'coefplot' was built under R version 3.4.3
library(coefplot)
coefplot(model)
coefplot(model, intercept = FALSE) # without intercept
coefplot(model, intercept = FALSE, sort = "magnitude") # without intercept, sorted by coefficient magnitude
Healy, K. (2019). Data Visualization. Cambridge University Press.
Wickham, H. & Grolemund, G. (2017). R for Data Science. O’Reilly.