March 22, 2016

## Why I use R

• A flexible statistical analysis toolkit

• All of the standard data analysis tools are built right into the R language (and many others are available via "packages")
• A language

• In R, you do data analysis by writing functions and scripts, not by pointing and clicking

• A script documents all your work, from data access to reporting, and can instantly be re-run at any time

• Graphics and data visualization

• One of the design principles of R was that visualization of data through charts and graphs is an essential part of the data analysis process

• As a result, it has excellent tools for creating graphics

## Why I use R

• Leading academics and researches from around the world use R to develop the latest methods in statistics
• A robust, vibrant community

• There's a wealth of community resources for R available on the Web, for help in just about every domain
• Platform independence

• For Linux/Mac users that work with people who use windows, this is helpful

• You can use code contributed by others in the open-source community, or extend R with your own functions

## R syntax

This is code (syntax):

c() # makes a vector
: # makes a sequence

Let's make a vector of the numbers 1 to 10:

my_vector <- c(1:10)

This is the result of typing the name of "my_vector":

my_vector
##  [1]  1  2  3  4  5  6  7  8  9 10

This finds the mean:

mean(my_vector)
## [1] 5.5

## Figuring out what a function does or what an object is

- ? # this is to find out what a function does
- str() # this is to find out the 'structure' of anything
- View() # this allows you to view a data frame (think spreadsheet) or matrix
- class() # this tells you what kind of object this is

Loading a comma separated values (CSV) file:

setwd("~/documents/r_introduction") # this sets the working directory
my_data <- read.csv("r_introduction_data.csv") # loads a CSV and saves it to my_data
str(my_data) # this examines the "structure" of any object (like my_data)
## 'data.frame':    212 obs. of  15 variables:
##  $Int : int 1 1 0 0 1 0 NA 0 1 1 ... ##$ StudentID   : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $Grade : int 1 0 0 1 1 1 1 0 1 1 ... ##$ Age         : int  11 10 10 11 11 12 11 9 11 11 ...
##  $Gender : int 0 1 0 1 1 0 0 0 1 1 ... ##$ ClassTeacher: int  5 0 1 4 7 5 4 1 6 4 ...
##  $SciTeacher : int 2 0 0 2 3 2 2 0 3 2 ... ##$ PrevIQWST   : int  NA 0 0 NA NA NA NA 0 NA NA ...
##  $PreEff_Ave : num 5 6.2 5.4 5.4 5 6 6.6 4.6 5.2 5.8 ... ##$ PreInt_Ave  : num  7 4 6.4 1.8 4.8 6.6 6.8 6.6 5.2 3.4 ...
##  $PreVal_Ave : num 5.5 4.43 4.38 2.75 6.38 ... ##$ PostEff_Ave : num  5 6.2 6.6 5.4 5.4 4.6 6.2 5.2 3.6 6.8 ...
##  $PostInt_Ave : num 6.6 3.2 5.4 3 2.8 4.8 6.6 6.6 2.8 3.2 ... ##$ PostVal_Ave : num  6.38 5.62 4.25 3.62 6.25 ...
##  $Achievement : int 6 8 7 6 7 7 5 NA 9 6 ... (Akcaoglu & Rosenberg, in preparation) ## Finding information about objects my_data[1, ] # just the first row of data frame my_data[, 1] # just the first column of data frame head(my_data) # first six rows of data frame ## Int StudentID Grade Age Gender ClassTeacher SciTeacher PrevIQWST ## 1 1 1001 1 11 0 5 2 NA ## 2 1 1002 0 10 1 0 0 0 ## 3 0 1003 0 10 0 1 0 0 ## 4 0 1004 1 11 1 4 2 NA ## 5 1 1005 1 11 1 7 3 NA ## 6 0 1006 1 12 0 5 2 NA ## PreEff_Ave PreInt_Ave PreVal_Ave PostEff_Ave PostInt_Ave PostVal_Ave ## 1 5.0 7.0 5.500000 5.0 6.6 6.375 ## 2 6.2 4.0 4.428571 6.2 3.2 5.625 ## 3 5.4 6.4 4.375000 6.6 5.4 4.250 ## 4 5.4 1.8 2.750000 5.4 3.0 3.625 ## 5 5.0 4.8 6.375000 5.4 2.8 6.250 ## 6 6.0 6.6 6.500000 4.6 4.8 6.125 ## Achievement ## 1 6 ## 2 8 ## 3 7 ## 4 6 ## 5 7 ## 6 7 ## Loading data Examples of loading data for different sorts of files: read.delim("filename.txt") # Tab-delimited readxl::read_excel("filename.xlsx") # Excel haven::read_sav("filename.sav") # SPSS ## Descriptive statistics my_data_ss <- my_data[, 9:15] # this selects columns 9 through 15 summary(my_data_ss) # creates summary statistics for continuous variables  ## PreEff_Ave PreInt_Ave PreVal_Ave PostEff_Ave ## Min. :2.200 Min. :1.000 Min. :1.875 Min. :1.200 ## 1st Qu.:5.000 1st Qu.:4.800 1st Qu.:5.125 1st Qu.:4.800 ## Median :5.600 Median :6.200 Median :6.125 Median :5.600 ## Mean :5.466 Mean :5.739 Mean :5.780 Mean :5.354 ## 3rd Qu.:6.200 3rd Qu.:6.800 3rd Qu.:6.625 3rd Qu.:6.200 ## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000 ## NA's :16 ## PostInt_Ave PostVal_Ave Achievement ## Min. :1.000 Min. :1.500 Min. : 2.000 ## 1st Qu.:4.400 1st Qu.:4.875 1st Qu.: 5.000 ## Median :5.800 Median :6.250 Median : 6.000 ## Mean :5.394 Mean :5.691 Mean : 6.263 ## 3rd Qu.:6.600 3rd Qu.:6.750 3rd Qu.: 7.750 ## Max. :7.000 Max. :7.000 Max. :10.000 ## NA's :16 NA's :16 NA's :18 ## Descriptive statistics table(my_data$SciTeacher) # counts frequencies and creates a table
##
##  0  1  2  3
## 53 55 53 51

## Packages

Here's how to install a package:

install.packages("dplyr")

Here's how to use a package (must do every time you use R):

library(dplyr)

You can also use a package using ::, without using library first

dplyr::select()

## dplyr for data manipulation

dplyr::select(my_data, ) # Select only certain columns

dplyr::filter(mydata, ) # select only certain rows

# how to find out more about how to use dplyr

# Vignettes are documents that explain how to use packages in fleshed out examples
vignette(package = "dplyr")
vignette("introduction", package = "dplyr")

dplyr::arrange(my_data, ) # arrange data by a variable

## psych for summary statistics

my_data_ss <- dplyr::select(my_data, PreEff_Ave:Achievement)
psych::describe(my_data_ss)
##             vars   n mean   sd median trimmed  mad  min max range  skew
## PreEff_Ave     1 212 5.47 1.03   5.60    5.56 0.89 2.20   7  4.80 -0.82
## PreInt_Ave     2 212 5.74 1.27   6.20    5.92 1.19 1.00   7  6.00 -1.18
## PreVal_Ave     3 212 5.78 1.16   6.12    5.94 1.02 1.88   7  5.12 -1.14
## PostEff_Ave    4 196 5.35 1.11   5.60    5.46 0.89 1.20   7  5.80 -0.91
## PostInt_Ave    5 196 5.39 1.55   5.80    5.58 1.48 1.00   7  6.00 -0.89
## PostVal_Ave    6 196 5.69 1.30   6.25    5.87 0.93 1.50   7  5.50 -1.12
## Achievement    7 194 6.26 1.77   6.00    6.29 1.48 2.00  10  8.00 -0.17
##             kurtosis   se
## PreEff_Ave      0.31 0.07
## PreInt_Ave      1.01 0.09
## PreVal_Ave      0.84 0.08
## PostEff_Ave     0.65 0.08
## PostInt_Ave    -0.27 0.11
## PostVal_Ave     0.56 0.09
## Achievement    -0.45 0.13

## Other packages

There are over 7000 packages, covering pretty much every topic you'd want:

# Great plotting functionality
library(ggplot2)
# Great data manipulation
library(dplyr)
# A number of basic statistical procedures you will want to use
library(psych)
# Miscellaneous functions that are quite useful
library(Hmisc)
# Importing data from other file formats, like .sav
library(foreign)
# Deal with dates in your data
library(lubridate)

## Example

Load data and select variables

# load built-in dataset mtcars
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ##$ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $disp: num 160 160 108 258 360 ... ##$ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ##$ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $qsec: num 16.5 17 18.6 19.4 17 ... ##$ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $am : num 1 1 1 0 0 0 0 0 0 0 ... ##$ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $carb: num 4 4 1 1 2 1 4 2 2 4 ... mtcars_ss <- dplyr::select(mtcars, mpg, cyl, disp, hp, wt) ## Example Calculate descriptive statistics and correlations psych::describe(mtcars_ss) # descriptive statistics (not run) mtcars_corrs <- Hmisc::rcorr(as.matrix(mtcars_ss)) # correlations round(mtcars_corrs[[1]], 3) ## mpg cyl disp hp wt ## mpg 1.000 -0.852 -0.848 -0.776 -0.868 ## cyl -0.852 1.000 0.902 0.832 0.782 ## disp -0.848 0.902 1.000 0.791 0.888 ## hp -0.776 0.832 0.791 1.000 0.659 ## wt -0.868 0.782 0.888 0.659 1.000 ## Example Plot histogram # Scatterplot of mpg and hp ggplot(mtcars_ss, aes(x = hp, y = mpg)) + geom_point()  ## Exapmle Use lm() for linear models (i.e., regression) # regress mpg on hp model1 <- lm(mpg ~ hp, data = mtcars_ss) summary(model1) ## ## Call: ## lm(formula = mpg ~ hp, data = mtcars_ss) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.7121 -2.1122 -0.8854 1.5819 8.2360 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 30.09886 1.63392 18.421 < 2e-16 *** ## hp -0.06823 0.01012 -6.742 1.79e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.863 on 30 degrees of freedom ## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892 ## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07 ## Example Plot fitted values # plot fitted values ggplot(mtcars_ss, aes(x = hp, y = mpg)) + geom_point() + stat_smooth(method = "lm", col = "red") ## Example Testing moderation (interaction) model2 <- lm(mpg ~ cyl + hp + cyl:hp, data = mtcars_ss) summary(model2) ## ## Call: ## lm(formula = mpg ~ cyl + hp + cyl:hp, data = mtcars_ss) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.778 -1.969 -0.228 1.403 6.491 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 50.751207 6.511686 7.794 1.72e-08 *** ## cyl -4.119140 0.988229 -4.168 0.000267 *** ## hp -0.170680 0.069102 -2.470 0.019870 * ## cyl:hp 0.019737 0.008811 2.240 0.033202 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.974 on 28 degrees of freedom ## Multiple R-squared: 0.7801, Adjusted R-squared: 0.7566 ## F-statistic: 33.11 on 3 and 28 DF, p-value: 2.386e-09 # This is the same as model2 (not run) model3 <- lm(mpg ~ cyl*hp, data = mtcars_ss) summary(model3) # Let's imagine we have a factor for manufacturer model4 <- lm(mphg ~ cyl*hp + manufact) # this will automatically dummy-code the factor ## Structural equation models library(lavaan) model <- ' # measurement model ind60 =~ x1 + x2 + x3 dem60 =~ y1 + y2 + y3 + y4 dem65 =~ y5 + y6 + y7 + y8 # regressions dem60 ~ ind60 dem65 ~ ind60 + dem60 # residual correlations y1 ~~ y5 y2 ~~ y4 + y6 y3 ~~ y7 y4 ~~ y8 y6 ~~ y8 ' fit <- sem(model, data = PoliticalDemocracy) summary(fit, standardized = TRUE, fit.measures = T) ## Social network analysis library(igraph) g <- graph.edgelist(edgematrix) # loads two column matrix with ties V(g)$size = log(degree(g)) # changes size of vertices
V(g)$label <- NA # removes vertex names V(g)$color[year_1_cohort_bool] <- "blue" # changes vertex color based on logical index
l <- layout.fruchterman.reingold(g) # selects layout
E(g)\$weight <- 1 # weights each edge equal to 1
g <- simplify(g, edge.attr.comb = list(weight = "sum")) # sums edgeweights for equivalent ties

plot(g, layout = l,
edge.width = E(g)weight)

## Social network analysis

(Rosenberg, Greenhalgh, & Wolf, in preparation)

## Text analysis

library(quanteda)
my_corpus <- quanteda::corpus(inaugTexts)
summary(my_corpus, n = 3)
## Corpus consisting of 57 documents, showing 3 documents.
##
##             Text Types Tokens Sentences
##  1789-Washington   626   1540        24
##  1793-Washington    96    147         4
##       1797-Adams   826   2584        37
##
## Source:  /Users/joshuarosenberg/Dropbox/research/R_Presentation/* on x86_64 by joshuarosenberg
## Created: Wed Mar 23 08:07:41 2016
## Notes:

## Text analysis

my_dfm <- dfm(my_corpus, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 57 documents
##    ... indexing features: 9,215 feature types
##    ... removed 133 features, from 174 supplied (glob) feature types
##    ... created a 57 x 9082 sparse dfm
##    ... complete.
## Elapsed time: 0.48 seconds.

## Linear mixed (hierarchical linear) models

library(nlme)

model_1 <- lme(Prop ~ Wave + I(Wave^2) + I(Wave^3) + High_Final + Low_Final,
random = ~ Wave + I(Wave^2),
correlation = corAR1 (form = ~ Wave|ID),
method = "REML",
data = df, na.action = na.omit)

summary(model5_1)    

## Linear mixed (hierarchical linear) models

(Rosenberg, Ranellucci, Lee, Robinson, Saltarelli, Linnenbrink-Garcia, & Roseth, 2016)

## Rmarkdown and knitr

• Rmarkdown documents are documents that include plain text that can be styled with very simple syntax as well as executable R code

• This presentation was made from an R markdown document

• R markdown documents can be turned into nice looking pdf, word, and html files

• Makes sharing results extremely easy