Intro to R

R is a flexible, open-source program that can manage advanced statistical models and produce elegant graphics, but it does have a learning curve. The purpose of this presentation is to provide a brief introduction to the basics of using R.

Packages and Libraries

The basic R installation has many functions built in, but R users have also written a large number of add-on collections of functions, called packages. Using a package takes two commands. The first installs the package on your computer. The code is:

install.packages("tidyverse")

The name of the package you want to install goes in “quotes.” You only need to install a package once on your computer, but each time you open R and want to use it, you need to load the package with the library command:

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.3.3

## Warning: package 'ggplot2' was built under R version 3.3.3

## Warning: package 'tibble' was built under R version 3.3.3

## Warning: package 'tidyr' was built under R version 3.3.3

## Warning: package 'readr' was built under R version 3.3.3

## Warning: package 'purrr' was built under R version 3.3.3

## Warning: package 'dplyr' was built under R version 3.3.3

Note that this time, you do not need to put the name of the package in “quotes.” For this example, we installed and called up the package “tidyverse,” which includes several useful packages for data management and visualization designed to work together, including ggplot2, tidyr, and dplyr.
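The install-once, load-every-session pattern can be wrapped in a few lines so that a script does not reinstall a package that is already there. This is only a sketch: “use_package” is a hypothetical helper name, not a built-in R function, and the example calls it on “stats,” which ships with R, so nothing is downloaded.

```r
# Hypothetical helper (not part of base R): install a package only if it is
# missing, then attach it for the current session.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is not installed
  }
  # character.only = TRUE tells library() that pkg holds the name as text
  library(pkg, character.only = TRUE)
}

# "stats" is always installed with R, so this call only attaches it
use_package("stats")
```

In your own script you would call use_package("tidyverse") instead, which installs the package the first time and simply loads it afterwards.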

Loading a Dataset

The first, most basic thing to learn in R is how to open an existing dataset. It is possible to open many types of data in R, but my preference is to load .csv files. To do this, first save your existing dataset as a .csv file (from Excel, Stata, SPSS, or another program).

When you load a file in R, you need to assign it to an object with the <- sign. Here, I load a dataset with information on countries in the world in 2000 to an object I call “inequality.” Object names can contain letters, numbers, periods, and underscores, but they cannot start with a number or an underscore, and they cannot contain spaces or other special characters (?, <, etc.). R is case sensitive, so “inequality” and “Inequality” would be two different objects.

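# Load the file using read.csv's default options
inequality <- read.csv("world_inequality.csv")
# The same call with the header and separator options written out
inequality <- read.csv("world_inequality.csv", header = TRUE, sep = ",")
View(inequality)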

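Note that the options for the function “read.csv” include whether to use the first row of data as variable names (header) and what character separates pieces of data (sep). The values shown are the defaults, so the two read.csv calls above load the data identically.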

To view the dataset, you can either type “View(data)” (note the capital V) or click on the object in RStudio’s Global Environment box.
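If you do not have a .csv file on hand, the round trip can be tried on a small made-up table. The sketch below is an illustration only: the data frame “toy” and its column names are invented, and it is written to a temporary file rather than to “world_inequality.csv.” It uses str() and head() instead of View() so that it also runs outside RStudio.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  country = c("A", "B", "C"),
  gdppc   = c(1200.5, 8300.0, NA),  # NA marks a missing value
  dem     = c(0, 1, 1)
)

# Write the table to a temporary .csv file, then read it back in
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
loaded <- read.csv(path)

str(loaded)   # variable names and types
head(loaded)  # first rows of the data
```

Missing values written as NA in the file come back as NA in R, which is why functions such as summary report a count of NA's.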

Summary Statistics

There are several functions that produce summary statistics for a dataset. The most basic is “summary,” which prints the minimum, quartiles, median, mean, and maximum of each numeric variable in the dataset (or of the variables within it that you specify), along with a count of missing values. If you would also like the number of observations, variance, and standard deviation, use the “stat.desc” function.

summary(inequality) # prints the minimum, quartiles, median, mean, and maximum of each numeric variable

##                 country       colbrit           colfra      
##  Afghanistan        :  1   Min.   :0.0000   Min.   :0.0000  
##  Albania            :  1   1st Qu.:0.0000   1st Qu.:0.0000  
##  Algeria            :  1   Median :0.0000   Median :0.0000  
##  Andorra            :  1   Mean   :0.3542   Mean   :0.1406  
##  Angola             :  1   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Antigua and Barbuda:  1   Max.   :1.0000   Max.   :1.0000  
##  (Other)            :186                                    
##     cgv_dem          wb_gdppc           wb_gini     
##  Min.   :0.0000   Min.   :   136.6   Min.   :27.22  
##  1st Qu.:0.0000   1st Qu.:   758.3   1st Qu.:34.26  
##  Median :1.0000   Median :  2569.2   Median :40.81  
##  Mean   :0.5737   Mean   : 10216.8   Mean   :43.57  
##  3rd Qu.:1.0000   3rd Qu.:  9326.1   3rd Qu.:51.96  
##  Max.   :1.0000   Max.   :122438.5   Max.   :63.00  
##  NA's   :2        NA's   :8          NA's   :151

library(pastecs)

## Warning: package 'pastecs' was built under R version 3.3.3

stat.desc(inequality)

##          country      colbrit       colfra      cgv_dem     wb_gdppc
## nbr.val       NA 192.00000000 192.00000000 190.00000000 1.840000e+02
## nbr.null      NA 124.00000000 165.00000000  81.00000000 0.000000e+00
## nbr.na        NA   0.00000000   0.00000000   2.00000000 8.000000e+00
## min           NA   0.00000000   0.00000000   0.00000000 1.366303e+02
## max           NA   1.00000000   1.00000000   1.00000000 1.224385e+05
## range         NA   1.00000000   1.00000000   1.00000000 1.223019e+05
## sum           NA  68.00000000  27.00000000 109.00000000 1.879889e+06
## median        NA   0.00000000   0.00000000   1.00000000 2.569183e+03
## mean          NA   0.35416667   0.14062500   0.57368421 1.021679e+04
## SE.mean       NA   0.03460568   0.02515394   0.03597255 1.311363e+03
## CI.mean       NA   0.06825839   0.04961518   0.07095928 2.587335e+03
## var           NA   0.22993019   0.12148233   0.24586466 3.164198e+08
## std.dev       NA   0.47951037   0.34854315   0.49584742 1.778819e+04
## coef.var      NA   1.35391162   2.47852909   0.86432119 1.741075e+00
##               wb_gini
## nbr.val    41.0000000
## nbr.null    0.0000000
## nbr.na    151.0000000
## min        27.2200000
## max        63.0000000
## range      35.7800000
## sum      1786.5400000
## median     40.8100000
## mean       43.5741463
## SE.mean     1.6024619
## CI.mean     3.2386964
## var       105.2832549
## std.dev    10.2607629
## coef.var    0.2354782

Note that “stat.desc” is in the package “pastecs,” which you have to install once and then call up with library() before you can run the function.
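If you only need one or two statistics for a single variable, base R functions work without any extra package. The sketch below uses a made-up vector called “gini” (the values are invented, not taken from the dataset above). Because the vector contains a missing value, mean, sd, and var return NA unless you add na.rm = TRUE.

```r
# Toy values, invented for illustration only
gini <- c(30, 40, 50, 60, NA)

n_obs    <- sum(!is.na(gini))         # number of non-missing observations
avg_raw  <- mean(gini)                # NA, because one value is missing
avg      <- mean(gini, na.rm = TRUE)  # na.rm = TRUE drops missing values first
variance <- var(gini, na.rm = TRUE)
std_dev  <- sd(gini, na.rm = TRUE)

summary(gini)  # summary also works on a single variable
```

With a data frame, the same functions take a column selected with the $ sign, for example mean(inequality$wb_gini, na.rm = TRUE).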

Linear Regression

To run an ordinary least squares (OLS) regression, the command is “lm,” with the formula (depvar ~ indepvar + control) and the object holding your data inside the parentheses. To view the estimates for your intercept and coefficient(s), along with their standard errors, assign the output of your model to an object and then call up the summary of that output. For example:

reg <- lm(wb_gdppc ~ cgv_dem, data = inequality)
summary(reg)

## 
## Call:
## lm(formula = wb_gdppc ~ cgv_dem, data = inequality)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13025  -9861  -3999    336  92370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4524       1760   2.571 0.010953 *  
## cgv_dem         8725       2295   3.802 0.000196 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15240 on 180 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.07433,    Adjusted R-squared:  0.06919 
## F-statistic: 14.45 on 1 and 180 DF,  p-value: 0.0001965

mvreg <- lm(wb_gdppc ~ cgv_dem + colbrit + colfra + wb_gini, data = inequality)
summary(mvreg)

## 
## Call:
## lm(formula = wb_gdppc ~ cgv_dem + colbrit + colfra + wb_gini, 
##     data = inequality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15897.4  -2256.8    137.1   1493.6  25750.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   5903.5     5013.0   1.178  0.24689   
## cgv_dem       5651.3     2399.3   2.355  0.02424 * 
## colbrit       9737.1     2934.1   3.319  0.00212 **
## colfra        1273.8     3844.6   0.331  0.74238   
## wb_gini       -150.7      107.5  -1.401  0.16997   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6885 on 35 degrees of freedom
##   (152 observations deleted due to missingness)
## Multiple R-squared:  0.3212, Adjusted R-squared:  0.2437 
## F-statistic: 4.141 on 4 and 35 DF,  p-value: 0.007515
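Once a model is stored in an object, you can also pull out individual pieces instead of reading them off the printed summary. The sketch below fits a model to a made-up data frame called “toy” (the values and the variable names “gdp” and “dem” are invented for illustration). One row has a missing predictor, so lm drops it, which is the same behavior that produces the “observations deleted due to missingness” lines above.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  gdp = c(1000, 3000, 6000, 8000, 10000, 5000),
  dem = c(0, 0, 1, 1, 1, NA)  # the last row is missing the predictor
)

fit <- lm(gdp ~ dem, data = toy)

estimates <- coef(fit)           # named vector: intercept and slope
coef_tab  <- coef(summary(fit))  # estimates, std. errors, t values, p values
n_used    <- nobs(fit)           # rows actually used after dropping NAs

estimates
coef_tab
n_used
```

With a single 0/1 predictor like this one, the intercept is the mean of the outcome where the predictor is 0, and the coefficient is the difference between the two group means.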

Carolyn Coberly

November 5, 2017
