R is a flexible, open-source program that can manage advanced statistical models and produce elegant graphics, but it does have a learning curve. The purpose of this presentation is to provide a brief introduction to the basics of using R.
The basic R installation has many functions built into it, and R users have also written a large number of add-ons, called packages, that provide additional functions. To use a package, you need to write two commands. The first installs the package within R. The code is:
install.packages("tidyverse")
The name of the package you want to install goes in quotation marks. You only need to install a package once on your computer, but each time you open R and want to use it, you need to call up the package as a library using the following code:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.3
## Warning: package 'ggplot2' was built under R version 3.3.3
## Warning: package 'tibble' was built under R version 3.3.3
## Warning: package 'tidyr' was built under R version 3.3.3
## Warning: package 'readr' was built under R version 3.3.3
## Warning: package 'purrr' was built under R version 3.3.3
## Warning: package 'dplyr' was built under R version 3.3.3
Note that this time, you do not need to put the name of the package in “quotes.” For this example, we installed and called up the package “tidyverse,” which includes several useful packages for data management and visualization designed to work together, including ggplot2, tidyr, and dplyr.
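The install-once, load-every-session pattern described above can be wrapped in a small helper so that a script installs a package only when it is missing. This is a minimal sketch, not part of the original presentation: the function name load_or_install is hypothetical, and it is demonstrated on the "stats" package only because that package ships with R and needs no download.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}

# "stats" is already installed with R, so nothing is downloaded here
load_or_install("stats")
```

Calling load_or_install("tidyverse") would work the same way, downloading the package the first time only.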
The first, most basic thing to learn in R is how to open an existing dataset. R can open many types of data, but my preference is to load .csv files. To do this, first save your existing dataset as a .csv file (from Excel, Stata, SPSS, or another program).
When you load a file in R, you need to assign it to an object with the assignment operator <-. Here, I load a dataset with information on countries in the world in 2000 to an object I call “inequality.” You can name objects almost anything, but names should start with a letter, not a number or special character (?, <, etc.), and may contain only letters, numbers, periods, and underscores. Everything in R is case sensitive, so “inequality” and “Inequality” would be two different objects.
inequality <- read.csv("world_inequality.csv")
inequality <- read.csv("world_inequality.csv", header=TRUE, sep=",")
View(inequality)
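Note that the options for the function “read.csv” include “header,” which controls whether the first row of data is used as variable names, and “sep,” which sets the character that separates pieces of data. The two read.csv lines above do the same thing, because header=TRUE and sep="," are the defaults for read.csv.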
To view the dataset, you can either type “View(data)” (note the capital V) or click on the object in RStudio’s Global Environment box.
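If you want to try the loading step without the original file, the sketch below writes a tiny made-up dataset to a temporary .csv file and reads it back. The object names (toy, csv_path, loaded) and all values are hypothetical stand-ins, not the contents of world_inequality.csv. It also shows a few functions for inspecting a dataset from the console rather than through View.

```r
# Hypothetical toy data standing in for a real .csv file
toy <- data.frame(
  country = c("A", "B", "C"),
  gdppc   = c(500, 2500, NA),   # NA marks a missing value
  dem     = c(0, 1, 1)
)

# Write the toy data to a temporary .csv file, then load it as in the text
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
loaded <- read.csv(csv_path)

names(loaded)  # variable names taken from the first row of the file
dim(loaded)    # number of rows and columns
str(loaded)    # type of each variable
head(loaded)   # first rows of the data
```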
There are several functions to produce summary statistics for a dataset. The most basic is “summary,” which prints the minimum, quartiles, median, mean, and maximum for each numeric variable in the dataset (or the variables within it that you specify), along with the number of missing values (NA's). For text variables stored as factors, such as country, it prints counts instead. If you would also like the number of observations, variance, and standard deviation, use the “stat.desc” function.
summary(inequality) # prints min, quartiles, median, mean, and max for each variable
## country colbrit colfra
## Afghanistan : 1 Min. :0.0000 Min. :0.0000
## Albania : 1 1st Qu.:0.0000 1st Qu.:0.0000
## Algeria : 1 Median :0.0000 Median :0.0000
## Andorra : 1 Mean :0.3542 Mean :0.1406
## Angola : 1 3rd Qu.:1.0000 3rd Qu.:0.0000
## Antigua and Barbuda: 1 Max. :1.0000 Max. :1.0000
## (Other) :186
## cgv_dem wb_gdppc wb_gini
## Min. :0.0000 Min. : 136.6 Min. :27.22
## 1st Qu.:0.0000 1st Qu.: 758.3 1st Qu.:34.26
## Median :1.0000 Median : 2569.2 Median :40.81
## Mean :0.5737 Mean : 10216.8 Mean :43.57
## 3rd Qu.:1.0000 3rd Qu.: 9326.1 3rd Qu.:51.96
## Max. :1.0000 Max. :122438.5 Max. :63.00
## NA's :2 NA's :8 NA's :151
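To summarize only the variables you specify, you can pass summary a single column (with $) or a subset of columns. The sketch below uses a small made-up data frame; the name toy and its values are hypothetical and are not drawn from the inequality dataset.

```r
# Hypothetical data frame used only for illustration
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  gdppc   = c(500, 1500, 2500, 3500),
  gini    = c(30, NA, 40, 50)
)

# Summary of one variable, selected with $
gini_summary <- summary(toy$gini)
gini_summary

# Summary of several variables, selected by name
summary(toy[, c("gdppc", "gini")])
```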
library(pastecs)
## Warning: package 'pastecs' was built under R version 3.3.3
stat.desc(inequality)
## country colbrit colfra cgv_dem wb_gdppc
## nbr.val NA 192.00000000 192.00000000 190.00000000 1.840000e+02
## nbr.null NA 124.00000000 165.00000000 81.00000000 0.000000e+00
## nbr.na NA 0.00000000 0.00000000 2.00000000 8.000000e+00
## min NA 0.00000000 0.00000000 0.00000000 1.366303e+02
## max NA 1.00000000 1.00000000 1.00000000 1.224385e+05
## range NA 1.00000000 1.00000000 1.00000000 1.223019e+05
## sum NA 68.00000000 27.00000000 109.00000000 1.879889e+06
## median NA 0.00000000 0.00000000 1.00000000 2.569183e+03
## mean NA 0.35416667 0.14062500 0.57368421 1.021679e+04
## SE.mean NA 0.03460568 0.02515394 0.03597255 1.311363e+03
## CI.mean NA 0.06825839 0.04961518 0.07095928 2.587335e+03
## var NA 0.22993019 0.12148233 0.24586466 3.164198e+08
## std.dev NA 0.47951037 0.34854315 0.49584742 1.778819e+04
## coef.var NA 1.35391162 2.47852909 0.86432119 1.741075e+00
## wb_gini
## nbr.val 41.0000000
## nbr.null 0.0000000
## nbr.na 151.0000000
## min 27.2200000
## max 63.0000000
## range 35.7800000
## sum 1786.5400000
## median 40.8100000
## mean 43.5741463
## SE.mean 1.6024619
## CI.mean 3.2386964
## var 105.2832549
## std.dev 10.2607629
## coef.var 0.2354782
Note that “stat.desc” is in the package “pastecs,” which you have to install once and then call up with library before you can run the function.
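The same quantities can also be computed one variable at a time with functions built into R, which is useful when you only need one or two numbers. This is a minimal sketch on a made-up vector (the name gini and its values are hypothetical). Note the na.rm = TRUE option: without it, any missing value makes the result NA.

```r
# Hypothetical values with one missing observation
gini <- c(30, NA, 40, 50)

n_obs     <- sum(!is.na(gini))       # number of non-missing observations: 3
gini_mean <- mean(gini, na.rm = TRUE) # 40
gini_var  <- var(gini, na.rm = TRUE)  # 100
gini_sd   <- sd(gini, na.rm = TRUE)   # 10

mean(gini)  # NA, because the missing value was not removed
```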
To run an ordinary least squares (OLS) regression, use the function “lm” with the formula (depvar ~ indepvar + control) and the object containing your data inside the parentheses. To view the estimates for your intercept and coefficient(s), along with their standard errors, assign the output of your model to an object and then call up the summary of that output. For example:
reg <- lm(wb_gdppc ~ cgv_dem, data = inequality)
summary(reg)
##
## Call:
## lm(formula = wb_gdppc ~ cgv_dem, data = inequality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13025 -9861 -3999 336 92370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4524 1760 2.571 0.010953 *
## cgv_dem 8725 2295 3.802 0.000196 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15240 on 180 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.07433, Adjusted R-squared: 0.06919
## F-statistic: 14.45 on 1 and 180 DF, p-value: 0.0001965
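Because the model is stored in an object, you can also pull out individual pieces of the output instead of reading them off the printed summary. The sketch below is an illustration under one assumption: it uses mtcars, a dataset that ships with R, rather than the inequality data, so that it runs without any file. The object names fit and coef_table are hypothetical.

```r
# mtcars is built into R; mpg and wt are two of its columns
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                                # intercept and slope estimates
coef_table <- summary(fit)$coefficients  # estimates, std. errors, t and p values
coef_table["wt", "Estimate"]             # one cell of the coefficient table
coef_table["wt", "Pr(>|t|)"]             # p-value for the slope
confint(fit, level = 0.95)               # 95% confidence intervals
summary(fit)$r.squared                   # multiple R-squared
```

The same calls work on the reg object from the example above.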
mvreg <- lm(wb_gdppc ~ cgv_dem + colbrit + colfra + wb_gini, data = inequality)
summary(mvreg)
##
## Call:
## lm(formula = wb_gdppc ~ cgv_dem + colbrit + colfra + wb_gini,
## data = inequality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15897.4 -2256.8 137.1 1493.6 25750.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5903.5 5013.0 1.178 0.24689
## cgv_dem 5651.3 2399.3 2.355 0.02424 *
## colbrit 9737.1 2934.1 3.319 0.00212 **
## colfra 1273.8 3844.6 0.331 0.74238
## wb_gini -150.7 107.5 -1.401 0.16997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6885 on 35 degrees of freedom
## (152 observations deleted due to missingness)
## Multiple R-squared: 0.3212, Adjusted R-squared: 0.2437
## F-statistic: 4.141 on 4 and 35 DF, p-value: 0.007515
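Both summaries above report observations deleted due to missingness. By default, lm drops any row that has a missing value in a variable used by the model, so the two models are estimated on different samples (182 countries in the first, 40 in the second). The sketch below shows how to check the sample size of a fitted model with nobs. The data frame toy and its values are made up for illustration.

```r
# Hypothetical data: z is missing for two of the six rows
toy <- data.frame(
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
  x = c(1, 2, 3, 4, 5, 6),
  z = c(0, 1, NA, 1, 0, NA)
)

fit_small <- lm(y ~ x, data = toy)      # uses all 6 rows
fit_large <- lm(y ~ x + z, data = toy)  # rows with missing z are dropped

nobs(fit_small)  # 6
nobs(fit_large)  # 4
```

When comparing coefficients across models, it is worth confirming with nobs that the samples are the ones you expect.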