March 22, 2016

Background

Why I use R

  • A flexible statistical analysis toolkit

    • All of the standard data analysis tools are built right into the R language (and many others are available via "packages")
  • A language

    • In R, you do data analysis by writing functions and scripts, not by pointing and clicking

    • A script documents all your work, from data access to reporting, and can instantly be re-run at any time

  • Graphics and data visualization

    • One of the design principles of R was that visualization of data through charts and graphs is an essential part of the data analysis process

    • As a result, it has excellent tools for creating graphics

Why I use R

  • Access to powerful, cutting-edge analytics

    • Leading academics and researches from around the world use R to develop the latest methods in statistics
  • A robust, vibrant community

    • There's a wealth of community resources for R available on the Web, for help in just about every domain
  • Platform independence

    • For Linux/Mac users that work with people who use windows, this is helpful

    • You can use code contributed by others in the open-source community, or extend R with your own functions

R syntax

This is code (syntax):

c() # makes a vector
: # makes a sequence

Let's make a vector of the numbers 1 to 10:

my_vector <- c(1:10)

This is the result of typing the name of "my_vector":

my_vector
##  [1]  1  2  3  4  5  6  7  8  9 10

This finds the mean:

mean(my_vector)
## [1] 5.5

Figuring out what a function does or what an object is

- ? # this is to find out what a function does
- str() # this is to find out the 'structure' of anything
- View() # this allows you to view a data frame (think spreadsheet) or matrix
- class() # this tells you what kind of object this is

Loading data

Loading a comma separated values (CSV) file:

setwd("~/documents/r_introduction") # this sets the working directory
my_data <- read.csv("r_introduction_data.csv") # loads a CSV and saves it to `my_data` 
str(my_data) # this examines the "structure" of any object (like `my_data`)
## 'data.frame':    212 obs. of  15 variables:
##  $ Int         : int  1 1 0 0 1 0 NA 0 1 1 ...
##  $ StudentID   : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Grade       : int  1 0 0 1 1 1 1 0 1 1 ...
##  $ Age         : int  11 10 10 11 11 12 11 9 11 11 ...
##  $ Gender      : int  0 1 0 1 1 0 0 0 1 1 ...
##  $ ClassTeacher: int  5 0 1 4 7 5 4 1 6 4 ...
##  $ SciTeacher  : int  2 0 0 2 3 2 2 0 3 2 ...
##  $ PrevIQWST   : int  NA 0 0 NA NA NA NA 0 NA NA ...
##  $ PreEff_Ave  : num  5 6.2 5.4 5.4 5 6 6.6 4.6 5.2 5.8 ...
##  $ PreInt_Ave  : num  7 4 6.4 1.8 4.8 6.6 6.8 6.6 5.2 3.4 ...
##  $ PreVal_Ave  : num  5.5 4.43 4.38 2.75 6.38 ...
##  $ PostEff_Ave : num  5 6.2 6.6 5.4 5.4 4.6 6.2 5.2 3.6 6.8 ...
##  $ PostInt_Ave : num  6.6 3.2 5.4 3 2.8 4.8 6.6 6.6 2.8 3.2 ...
##  $ PostVal_Ave : num  6.38 5.62 4.25 3.62 6.25 ...
##  $ Achievement : int  6 8 7 6 7 7 5 NA 9 6 ...

(Akcaoglu & Rosenberg, in preparation)

Finding information about objects

my_data[1, ] # just the first row of data frame
my_data[, 1] # just the first column of data frame
head(my_data) # first six rows of data frame
##   Int StudentID Grade Age Gender ClassTeacher SciTeacher PrevIQWST
## 1   1      1001     1  11      0            5          2        NA
## 2   1      1002     0  10      1            0          0         0
## 3   0      1003     0  10      0            1          0         0
## 4   0      1004     1  11      1            4          2        NA
## 5   1      1005     1  11      1            7          3        NA
## 6   0      1006     1  12      0            5          2        NA
##   PreEff_Ave PreInt_Ave PreVal_Ave PostEff_Ave PostInt_Ave PostVal_Ave
## 1        5.0        7.0   5.500000         5.0         6.6       6.375
## 2        6.2        4.0   4.428571         6.2         3.2       5.625
## 3        5.4        6.4   4.375000         6.6         5.4       4.250
## 4        5.4        1.8   2.750000         5.4         3.0       3.625
## 5        5.0        4.8   6.375000         5.4         2.8       6.250
## 6        6.0        6.6   6.500000         4.6         4.8       6.125
##   Achievement
## 1           6
## 2           8
## 3           7
## 4           6
## 5           7
## 6           7

Loading data

Examples of loading data for different sorts of files:

read.delim("filename.txt") # Tab-delimited
readxl::read_excel("filename.xlsx") # Excel
haven::read_sav("filename.sav") # SPSS

Descriptive statistics

my_data_ss <- my_data[, 9:15] # this selects columns 9 through 15 
summary(my_data_ss) # creates summary statistics for continuous variables 
##    PreEff_Ave      PreInt_Ave      PreVal_Ave     PostEff_Ave   
##  Min.   :2.200   Min.   :1.000   Min.   :1.875   Min.   :1.200  
##  1st Qu.:5.000   1st Qu.:4.800   1st Qu.:5.125   1st Qu.:4.800  
##  Median :5.600   Median :6.200   Median :6.125   Median :5.600  
##  Mean   :5.466   Mean   :5.739   Mean   :5.780   Mean   :5.354  
##  3rd Qu.:6.200   3rd Qu.:6.800   3rd Qu.:6.625   3rd Qu.:6.200  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##                                                  NA's   :16     
##   PostInt_Ave     PostVal_Ave     Achievement    
##  Min.   :1.000   Min.   :1.500   Min.   : 2.000  
##  1st Qu.:4.400   1st Qu.:4.875   1st Qu.: 5.000  
##  Median :5.800   Median :6.250   Median : 6.000  
##  Mean   :5.394   Mean   :5.691   Mean   : 6.263  
##  3rd Qu.:6.600   3rd Qu.:6.750   3rd Qu.: 7.750  
##  Max.   :7.000   Max.   :7.000   Max.   :10.000  
##  NA's   :16      NA's   :16      NA's   :18

Descriptive statistics

table(my_data$SciTeacher) # counts frequencies and creates a table
## 
##  0  1  2  3 
## 53 55 53 51

Packages

Here's how to install a package:

install.packages("dplyr")

Here's how to use a package (must do every time you use R):

library(dplyr)

You can also use a package using ::, without using library first

dplyr::select()

dplyr for data manipulation

dplyr::select(my_data, ) # Select only certain columns

dplyr::filter(mydata, ) # select only certain rows

# how to find out more about how to use dplyr

# Vignettes are documents that explain how to use packages in fleshed out examples
vignette(package = "dplyr")
vignette("introduction", package = "dplyr")

dplyr::arrange(my_data, ) # arrange data by a variable

psych for summary statistics

my_data_ss <- dplyr::select(my_data, PreEff_Ave:Achievement)
psych::describe(my_data_ss)
##             vars   n mean   sd median trimmed  mad  min max range  skew
## PreEff_Ave     1 212 5.47 1.03   5.60    5.56 0.89 2.20   7  4.80 -0.82
## PreInt_Ave     2 212 5.74 1.27   6.20    5.92 1.19 1.00   7  6.00 -1.18
## PreVal_Ave     3 212 5.78 1.16   6.12    5.94 1.02 1.88   7  5.12 -1.14
## PostEff_Ave    4 196 5.35 1.11   5.60    5.46 0.89 1.20   7  5.80 -0.91
## PostInt_Ave    5 196 5.39 1.55   5.80    5.58 1.48 1.00   7  6.00 -0.89
## PostVal_Ave    6 196 5.69 1.30   6.25    5.87 0.93 1.50   7  5.50 -1.12
## Achievement    7 194 6.26 1.77   6.00    6.29 1.48 2.00  10  8.00 -0.17
##             kurtosis   se
## PreEff_Ave      0.31 0.07
## PreInt_Ave      1.01 0.09
## PreVal_Ave      0.84 0.08
## PostEff_Ave     0.65 0.08
## PostInt_Ave    -0.27 0.11
## PostVal_Ave     0.56 0.09
## Achievement    -0.45 0.13

Other packages

There are over 7000 packages, covering pretty much every topic you'd want:

# Great plotting functionality
library(ggplot2)
# Great data manipulation
library(dplyr)
# A number of basic statistical procedures you will want to use
library(psych)
# Miscellaneous functions that are quite useful
library(Hmisc)
# Importing data from other file formats, like .sav
library(foreign)
# Deal with dates in your data
library(lubridate)

Example

Load data and select variables

# load built-in dataset `mtcars`
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
mtcars_ss <- dplyr::select(mtcars, mpg, cyl, disp, hp, wt)

Example

Calculate descriptive statistics and correlations

psych::describe(mtcars_ss) # descriptive statistics (not run)
mtcars_corrs <- Hmisc::rcorr(as.matrix(mtcars_ss)) # correlations
round(mtcars_corrs[[1]], 3)
##         mpg    cyl   disp     hp     wt
## mpg   1.000 -0.852 -0.848 -0.776 -0.868
## cyl  -0.852  1.000  0.902  0.832  0.782
## disp -0.848  0.902  1.000  0.791  0.888
## hp   -0.776  0.832  0.791  1.000  0.659
## wt   -0.868  0.782  0.888  0.659  1.000

Example

Plot histogram

# Scatterplot of mpg and hp
ggplot(mtcars_ss, aes(x = hp, y = mpg)) +
    geom_point() 

Exapmle

Use lm() for linear models (i.e., regression)

# regress mpg on hp
model1 <- lm(mpg ~ hp, data = mtcars_ss)

summary(model1)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars_ss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Example

Plot fitted values

# plot fitted values
ggplot(mtcars_ss, aes(x = hp, y = mpg)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")

Example

Testing moderation (interaction)

model2 <- lm(mpg ~ cyl + hp + cyl:hp,  data = mtcars_ss)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + cyl:hp, data = mtcars_ss)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.778 -1.969 -0.228  1.403  6.491 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.751207   6.511686   7.794 1.72e-08 ***
## cyl         -4.119140   0.988229  -4.168 0.000267 ***
## hp          -0.170680   0.069102  -2.470 0.019870 *  
## cyl:hp       0.019737   0.008811   2.240 0.033202 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.974 on 28 degrees of freedom
## Multiple R-squared:  0.7801, Adjusted R-squared:  0.7566 
## F-statistic: 33.11 on 3 and 28 DF,  p-value: 2.386e-09
# This is the same as model2 (not run)
model3 <- lm(mpg ~ cyl*hp,  data = mtcars_ss)
summary(model3)
# Let's imagine we have a factor for manufacturer
model4 <- lm(mphg ~ cyl*hp + manufact) # this will automatically dummy-code the factor

Structural equation models

library(lavaan)

model <- '
  # measurement model
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8
  # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60
  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'

fit <- sem(model, data = PoliticalDemocracy)
summary(fit, standardized = TRUE, fit.measures = T)

Social network analysis

library(igraph)
g <- graph.edgelist(edgematrix) # loads two column matrix with ties
V(g)$size = log(degree(g)) # changes size of vertices 
V(g)$label <- NA # removes vertex names
V(g)$color[year_1_cohort_bool] <- "blue" # changes vertex color based on logical index
l <- layout.fruchterman.reingold(g) # selects layout
E(g)$weight <- 1 # weights each edge equal to 1
g <- simplify(g, edge.attr.comb = list(weight = "sum")) # sums edgeweights for equivalent ties

plot(g, layout = l,
     edge.width = E(g)weight)

Social network analysis

(Rosenberg, Greenhalgh, & Wolf, in preparation)

Text analysis

library(quanteda)
my_corpus <- quanteda::corpus(inaugTexts)
summary(my_corpus, n = 3)
## Corpus consisting of 57 documents, showing 3 documents.
## 
##             Text Types Tokens Sentences
##  1789-Washington   626   1540        24
##  1793-Washington    96    147         4
##       1797-Adams   826   2584        37
## 
## Source:  /Users/joshuarosenberg/Dropbox/research/R_Presentation/* on x86_64 by joshuarosenberg
## Created: Wed Mar 23 08:07:41 2016
## Notes:

Text analysis

my_dfm <- dfm(my_corpus, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 57 documents
##    ... indexing features: 9,215 feature types
##    ... removed 133 features, from 174 supplied (glob) feature types
##    ... created a 57 x 9082 sparse dfm
##    ... complete. 
## Elapsed time: 0.48 seconds.

Linear mixed (hierarchical linear) models

library(nlme)

model_1 <- lme(Prop ~ Wave + I(Wave^2) + I(Wave^3) + High_Final + Low_Final,
                random = ~ Wave + I(Wave^2), 
                correlation = corAR1 (form = ~ Wave|ID), 
                method = "REML", 
                data = df, na.action = na.omit)

summary(model5_1)    

Linear mixed (hierarchical linear) models

(Rosenberg, Ranellucci, Lee, Robinson, Saltarelli, Linnenbrink-Garcia, & Roseth, 2016)

Rmarkdown and knitr

  • Rmarkdown documents are documents that include plain text that can be styled with very simple syntax as well as executable R code

  • This presentation was made from an R markdown document

  • R markdown documents can be turned into nice looking pdf, word, and html files

  • Makes sharing results extremely easy

Shiny

Additional resources

Thank you!