1 Introduction to R and RStudio

1.1 What is R?

R is a free, open source software program for statistical computing and analysis.

Things to know about R:

Statistical computing environment with its own language.
Version 1.0.0 was released in 2000; R is an open source implementation of the S language.
Available for Windows, macOS, and Linux.
Produces publication-quality graphs.
Offers numerous advanced statistical methods and algorithms through user-contributed packages.
Has packages for weaving written reports and analysis code into one document, such as R Markdown and R Notebooks.

1.2 What is RStudio?

RStudio is a free, open source IDE (integrated development environment) for R.

Things to know about RStudio:

Before installing RStudio, R must first be installed.
The interface is structured such that users can clearly view:
- data frames (tables) and graphs.
- R code and output, all in one place at the same time.
It also allows users to import CSV, Excel, text (.txt), SAS (.sas7bdat), SPSS (.sav), and Stata (.dta) files into R without having to write the code to do so.

1.3 Installing R & RStudio

To install R and RStudio on any computer, download the software from their associated websites.
- Click \(\Rightarrow\) The R Project for Statistical Computing.
- Click \(\Rightarrow\) RStudio.
RStudio Cloud is a hosted version of RStudio in the cloud.
- Click \(\Rightarrow\) RStudio Cloud.

1.3.1 Understanding your Working Environment

1.3.2 Variable & Assignment Operator

1.4 Installing R Packages

Packages are collections of functions and datasets that add new features to R.
Most R packages are available from CRAN, the official R repository.
- CRAN is a network of servers (called mirrors) around the world.
Packages on CRAN are checked before they are published, to make sure they don’t contain malicious components.

1.4.1 Installing R packages from the Terminal

To install R packages from the Terminal, follow these steps:
1. Open the Terminal and start R (for example, by typing R).
2. Type and run the following command at the R prompt. Make sure to replace package_name1 with an actual package name, such as ggplot2:
- install.packages("package_name1")
You can also install multiple packages at the same time, from both the R command line and RStudio. Pass the package names as a character vector created with c():
- install.packages(c("package_name1", "package_name2"))
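Before installing, it can help to check which packages are already available. The following is a minimal sketch using base R's requireNamespace(); the package names in pkgs are only illustrative, and the install.packages() call is left commented out so that nothing is downloaded unintentionally.

```r
# Illustrative package names; replace them with the ones you need.
pkgs <- c("ggplot2", "dplyr")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
not_installed

# Install only what is missing (uncomment to run):
# if (length(not_installed) > 0) install.packages(not_installed)
```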

1.4.2 Installing R packages from inside RStudio

To install R packages from RStudio, follow these steps:
1. Open RStudio.
2. Click on Tools from the menu bar and then click on Install Packages….
3. In the Install Packages dialog box, type the package name in the Packages text box and click on the Install button.

A list of common problems when installing packages is also available on the RStudio support page.

1.5 R Basics

1.5.1 Basic Math

R can be used to do basic math.
Calculations follow the basic order of operations: Parentheses, Exponents, Multiplication, Division, Addition and Subtraction (PEMDAS).
In the console, code is entered at the prompt, which is marked by a right angle bracket (>).

2 + 3
3 * 5 * 7
5/2
10/5
4 + 6 * 5
2 * (3 + 5)
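Two precedence details are easy to get wrong, so a short sketch may help: exponentiation is evaluated before a leading minus sign, and repeated exponents group from right to left. The expressions are illustrative only.

```r
# Exponentiation is evaluated before the unary minus.
-2^2        # -4
(-2)^2      # 4

# Repeated exponents group from right to left: 2^(3^2) = 2^9.
2^3^2       # 512

# Multiplication and division are evaluated from left to right.
10 / 5 * 2  # 4
```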

1.5.2 Variable

A variable can take on any available data type as will be described later.
It can also hold any R object such as a function, the result of an analysis or a plot.

1.5.2.1 Variable Assignment

The valid assignment operators are <- and =, with the first being preferred. The right-pointing arrow -> also works, assigning the value on its left to the name on its right.

x <- 2
x
2 -> x
x
x = 2
x
u <- v <- 7
u
v

Removing a variable is accomplished using the rm() function.

y <- 15
y
rm(y)
y # object 'y' not found
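Because referring to a removed variable stops with an error, it can be safer to test for it with exists(). A minimal sketch, where tmp_value is a hypothetical variable name:

```r
tmp_value <- 15                      # hypothetical variable name
before_rm <- exists("tmp_value")     # TRUE: the variable is defined
rm(tmp_value)
after_rm <- exists("tmp_value")      # FALSE: the variable has been removed
c(before_rm, after_rm)
```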

1.5.3 Data Types

There are many data types in R.
We will consider the four main data types:
- Numeric
- Character (string)
- Date (time-based)
- Logical (TRUE/FALSE)
The type of data contained in a variable can be checked with the class() function.
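Dates are listed above but not demonstrated elsewhere in this chapter, so here is a small sketch using base R's as.Date(). The specific dates are arbitrary examples.

```r
d <- as.Date("2024-03-15")          # ISO format: year-month-day
class(d)                            # "Date"

# Dates are stored as the number of days since 1970-01-01.
as.numeric(as.Date("1970-01-10"))   # 9

# Subtracting two dates gives a time difference in days.
d - as.Date("2024-03-01")           # Time difference of 14 days

# format() extracts parts of a date as text.
format(d, "%Y")                     # "2024"
```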

1.5.3.1 Numeric Data

Numeric data is similar to a float or double in other languages.
It handles integers and decimals, both positive and negative, and of course, zero.
Testing whether a variable is numeric is done with the function is.numeric().
Testing whether a variable is an integer is done with the function is.integer().

x <- 25
class(x)
is.numeric(x)
x <- 25L
class(x)
is.integer(x)
is.numeric(x)
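The distinction between integer and numeric matters in arithmetic. As a brief illustration with arbitrary values, multiplying two integers gives an integer, while division returns a double even when both inputs are integers:

```r
class(5L * 2L)    # "integer"
class(5L / 2L)    # "numeric": division returns a double
5L / 2L           # 2.5
class(5L + 2.5)   # "numeric": mixing integer and double gives a double
```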

1.5.4 Character

Character data is very common in statistical analysis and must be handled with care.
R handles character data in two ways: character and factor. While they may seem similar on the surface, they are treated quite differently.

x <- "Missouri"
x
class(x)
nchar(x)
y <- factor(x)
y
class(y)
nchar(y) # Error in nchar(y) : 'nchar()' requires a character vector
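One way around the nchar() error is to convert the factor back to character first. A minimal sketch:

```r
y <- factor("Missouri")
levels(y)                  # "Missouri"
y_chr <- as.character(y)   # convert the factor back to a character vector
class(y_chr)               # "character"
nchar(y_chr)               # 8
```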

1.5.5 Logical

Logicals are a way of representing data that can be either TRUE or FALSE.
Numerically, TRUE is the same as 1 and FALSE is the same as 0.

TRUE
T
FALSE
F
10 * TRUE
10 * FALSE
x <- TRUE
class(x)
is.logical(x)

Logicals can result from comparing two numbers or two character strings.

# does 7 equal 10?
7 == 10
# does 10 not equal 7?
10 != 7
# is 7 less than 10?
7 < 10
# is 7 less than or equal to 10?
7 <= 10
# is 7 greater than 10?
7 > 10
# is 7 greater than or equal to 10?
7 >= 10
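Because TRUE counts as 1 and FALSE as 0, comparisons on a vector can be summed or averaged to count matches. A small sketch with made-up values:

```r
scores <- c(7, 12, 3, 10)   # hypothetical values
scores > 5                  # TRUE TRUE FALSE TRUE
sum(scores > 5)             # 3: number of values above 5
mean(scores > 5)            # 0.75: proportion of values above 5
```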

1.6 Vectors

A vector is a collection of elements, all of the same type.
c(2, 1, 5, 10, -9) is a vector of numbers.
c("high", "medium", "low", "unknown") is a vector of characters.
The most common way to create a vector is with c(). The c stands for combine.

dat <- c(2.24, 2.05, 1.76, 2.43, 1.75, 1.54, 1.84, 1.94, 1.64, 1.50)
dat
dat[2]
dat[1:3]
dat[-(1:3)]
dat[c(1,5,8)]
dat[-c(1,5,8)]
length(dat)
x <- 1:10
length(x)

1.6.1 Vector Operations

dat - 0.5
dat + 3.2
dat/3
dat^2
sqrt(dat)
length(sqrt(dat))
dat == x
dat < x
dat >= x
dat != x
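The operations above work element by element. When the two vectors differ in length, R recycles the shorter one, which the following sketch with arbitrary numbers illustrates:

```r
# The shorter vector c(10, 20) is reused as 10, 20, 10, 20.
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24

# A single number is recycled across every element.
c(1, 2, 3, 4) * 2           # 2 4 6 8
```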

1.6.2 Factor Vectors

Factors are an important concept in R, especially when building models.
We use the function as.factor() to convert a character vector to a factor vector.
Notice that after printing out every element of fac2, R also prints the levels.
- The levels of a factor are the unique values of that factor variable.
- Technically, R is giving each unique value of a factor a unique integer. This can be seen with the function as.numeric().

fac1 <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
fac2 <- as.factor(fac1)
fac2
as.numeric(fac2)
# Relevel
fac3 <- relevel(fac2, ref = "Soccer")
fac3
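Since as.numeric() returns the integer codes rather than the labels, converting a factor whose labels look like numbers needs an extra step. A short sketch with made-up values:

```r
f <- factor(c("10", "20", "10"))
as.numeric(f)                    # 1 2 1: the integer codes, not the labels
as.numeric(as.character(f))      # 10 20 10: the original values
levels(f)                        # "10" "20"
levels(relevel(f, ref = "20"))   # "20" "10": the reference level moves first
```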

1.7 Calling Functions

mean(dat)
median(dat)

1.8 Function Documentation

Any function provided in R has accompanying documentation.
The easiest way to access that documentation is to place a question mark in front of the function name, like this: ?mean.
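As a quick alternative to opening the help page, a function's arguments can be inspected directly. This sketch uses base R's args() and formals() on sd(), chosen only as an example:

```r
# help("mean") is equivalent to ?mean.
# args() prints a function's arguments and their defaults.
args(sd)

# formals() returns the arguments as a list, so the names can be extracted.
arg_names <- names(formals(sd))
arg_names    # "x" "na.rm"
```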

1.9 Missing Data

Often we will have data that has missing values.
Statistical programs use various techniques to represent missing data such as a dash, a period or even the number 99.
R uses NA. NA will often be seen as just another element of a vector. The function is.na() tests each element of a vector for missingness.

x <- c(2, 3, 5, 7, NA, 5, NA, NA)
x
mean(x)
mean(x, na.rm = TRUE)
is.na(x)
z <- c("Male", NA, "Female")
z
is.na(z)

Handling missing data is a key part of statistical data analysis.
There are many techniques depending on field and preference. One popular technique is multiple imputation, which is discussed in detail in Chapter 25 of Andrew Gelman and Jennifer Hill’s book Data Analysis Using Regression and Multilevel/Hierarchical Models, and is implemented in the mi, mice and Amelia packages.
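Multiple imputation is beyond a short example, but two much simpler approaches can be sketched in base R: dropping missing values and filling them with the mean. Note that this is single mean imputation, shown only for illustration, and not the multiple imputation method mentioned above.

```r
x <- c(2, 3, 5, 7, NA, 5, NA, NA)
sum(is.na(x))                   # 3 missing values

# Option 1: drop the missing values.
x_complete <- x[!is.na(x)]
x_complete                      # 2 3 5 7 5

# Option 2: replace each NA with the mean of the observed values.
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_imputed                       # 2.0 3.0 5.0 7.0 4.4 5.0 4.4 4.4
```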

1.10 Pipes

The pipe is a newer way of calling functions in R.
The pipe operator (%>%) from the magrittr package takes the value (object) on its left-hand side and inserts it as the first argument of the function on its right-hand side.
- Pipes can reduce development time and improve the readability of code.

library(magrittr)
z <- c(2, 5, 9, 3, 7)
z %>% mean(na.rm = TRUE)
mean(z)
u <- c(1, 2, NA, 8, 3, NA, 3, NA, NA, 15)
u %>% is.na %>% sum
sum(is.na(u))
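For comparison, recent versions of R also include a native pipe, |>, that needs no package. The sketch below assumes R 4.1 or later; unlike %>%, the native pipe requires parentheses after each function name.

```r
z <- c(2, 5, 9, 3, 7)
z |> mean()                         # 5.2

u <- c(1, 2, NA, 8, 3, NA, 3, NA, NA, 15)
na_count <- u |> is.na() |> sum()   # same as sum(is.na(u))
na_count                            # 4
```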

2 Advanced Data Structures

2.1 Introduction

At times data require more complex storage than simple vectors.
R provides a host of data structures. The most popular ones are:
- data.frame
- matrix
- list
- array

2.2 Data.frames

A data.frame is like an Excel spreadsheet with rows and columns. Rows represent observations and columns represent variables.
In R, each column of a data.frame is actually a vector. All columns have the same length, but different columns can hold different types of data.

a <- 20:11
b <- -12:-3
c <- c("Hockey", "Football", "Baseball", "Curling", "Rugby", 
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
DF <- data.frame(Col_1 = a, Col_2 = b, Sport = c)
DF
names(DF)
names(DF)[3]
class(DF)
nrow(DF)
ncol(DF)
dim(DF)
head(DF)
tail(DF, 3)
DF$Sport
DF[3,]
DF[, 2]
DF[3, 2]
DF[3, 2:3]
DF[c(3, 5), 2]
DF[c(3, 5), 2:3]
DF[c(3, 5), c(2, 3)]
DF[, c("Col_1", "Sport")]
DF[, "Sport"]
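Beyond selecting by position or name, rows are often selected with a logical condition, and new columns are added with $. The following is a minimal sketch on a small hypothetical data frame (df, id, score and passed are made-up names):

```r
# Small hypothetical data frame
df <- data.frame(id = 1:4, score = c(10, 25, 17, 30))

# Keep only the rows where score exceeds 15.
high <- df[df$score > 15, ]
high

# Add a logical column based on an existing column.
df$passed <- df$score >= 17
df

# subset() offers a more readable way to filter rows and pick columns.
subset(df, passed, select = c(id, score))
```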

2.3 Matrices

This is similar to a data.frame in that it is rectangular with rows and columns.
Every single element in a matrix must be of the same type, most commonly all numeric.

# create a 5x2 matrix
A <- matrix(1:10, nrow=5, byrow = FALSE)
A
nrow(A); ncol(A); dim(A)
# create another 5x2 matrix
B <- matrix(21:30, nrow=5)
nrow(B); ncol(B); dim(B)
# create a 2x10 matrix
C <- matrix(21:40, nrow=2)
nrow(C); ncol(C); dim(C)
# add them element by element
A + B
# multiply them element by element (this is not matrix multiplication)
A * B

2.3.1 Naming Rows and Columns

Rows and columns of a matrix can be named with rownames() and colnames().

colnames(A) <- c("Left", "Right")
rownames(A) <- c("1st", "2nd", "3rd", "4th", "5th")
colnames(B) <- c("First", "Second")
rownames(B) <- c("One", "Two", "Three", "Four", "Five")
colnames(C) <- LETTERS[1:10]
rownames(C) <- c("Top", "Bottom")
A; B; C

2.3.2 Matrix Transpose and Multiplication

The function t() transposes a matrix, and the %*% operator performs matrix multiplication.

# matrix transpose
t(A); t(C)
# matrix multiplication
A %*% t(B)
A %*% C
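Matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second. The sketch below rebuilds the three matrices so that it runs on its own and checks the dimensions of each product:

```r
A <- matrix(1:10, nrow = 5)    # 5 x 2
B <- matrix(21:30, nrow = 5)   # 5 x 2
C <- matrix(21:40, nrow = 2)   # 2 x 10

dim(A %*% t(B))   # 5 5:  (5 x 2) times (2 x 5)
dim(A %*% C)      # 5 10: (5 x 2) times (2 x 10)

# A %*% B would stop with an error, because (5 x 2) times (5 x 2)
# is not conformable.
```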

2.4 Lists

Sometimes a container is needed to hold arbitrary objects of either the same type or varying types.
R accomplishes this through lists.
A list can contain numeric values, characters, a mix of the two, data.frames, or even other lists.

# creates a two element list.
list1 <- list(DF, A)
list1
# creates a four element list.
list2 <- list(DF, A, B, C)
list2
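Elements of a list are usually easier to work with when they are named. As a small illustration (info and its element names are hypothetical), double brackets or $ return the element itself, while single brackets return a smaller list:

```r
info <- list(name = "Missouri", values = c(2, 5, 9), flag = TRUE)

info$name               # "Missouri"
info[["values"]]        # 2 5 9
info[["values"]][2]     # 5
class(info["values"])   # "list": single brackets keep the list wrapper
length(info)            # 3
names(info)             # "name" "values" "flag"
```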

2.5 Arrays

An array is essentially a multidimensional vector.
All elements must be of the same type, and individual elements are accessed in a similar fashion using square brackets.
The first index is the row, the second is the column, and the remaining indices are for outer dimensions.

Arry <- array(1:18, dim=c(2, 3, 3))
Arry
Arry[1, , 1]
Arry[, 2, 3]
Arry[1, 2, 2]
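To make the indexing concrete, the sketch below rebuilds the same array and notes the values each expression returns. The array is filled column by column within each 2 x 3 slice, and apply() over the third dimension is added as one possible way to summarize each slice.

```r
Arry <- array(1:18, dim = c(2, 3, 3))
dim(Arry)             # 2 3 3

Arry[1, 2, 2]         # 9: row 1, column 2 of the second slice
Arry[1, , 1]          # 1 3 5: first row of the first slice
Arry[, 2, 3]          # 15 16: second column of the third slice

# Sum of each 2 x 3 slice along the third dimension.
apply(Arry, 3, sum)   # 21 57 93
```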

3 Reading Data into R

3.1 Introduction

There are numerous ways to get data into R. The most common is probably reading comma separated values (CSV) files.

Setting Working Directory

The first thing we often do in an R script is set our working directory.
- You usually set your working directory to where your data files are located.
There are two ways to set your working directory:
1. To set the working directory via point-and-click, use either of the following:
  1. Session…Set Working Directory…Choose Directory. In the dialog, highlight the directory and click Open.
  2. Use the Files tab. Navigate to the folder and select “Set As Working Directory” under More.
2. To set the working directory with R code:
  - Use the setwd() function; the path must be in quotes.
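Setting the working directory in code might look like the sketch below. It uses tempdir() as a stand-in for a real data folder (an assumption made so that the example runs anywhere) and restores the original directory at the end.

```r
old_wd <- getwd()    # remember the current working directory
setwd(tempdir())     # tempdir() stands in for your own data folder
getwd()              # confirm the change
list.files()         # list the files in the working directory
setwd(old_wd)        # restore the original working directory
```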
You can import just about any kind of data into R: Excel, Stata, SPSS, SAS, CSV, JSON, fixed-width, TXT, DAT etc.
- You can even connect to databases.

3.2 Reading CSVs

The easiest way to read data from a CSV file is to use read.table(). Most people prefer to use read.csv(), which is a wrapper around read.table() with the sep argument preset to a comma (,).
The result of read.table() is a data.frame.
We will learn to import a CSV file from your local computer into R using the credit data set.

# Set your working directory to the folder on the computer that 
# contains the credit data set.
setwd("C:/Users/ethom/Dropbox/data")

# Read data into R using the read.table() function.
dat_1 <- read.table("credit.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
dat_1

# Read data into R using the read.csv() function.
dat_1 <- read.csv("credit.csv", header = TRUE, stringsAsFactors = TRUE)
dat_1
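The effect of stringsAsFactors can be seen without the credit file. The sketch below writes a tiny CSV with made-up rows to a temporary file and reads it back both ways; the column names and values are hypothetical.

```r
# Write a small hypothetical CSV to a temporary file.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,housing,amount",
             "1,own,1169",
             "2,rent,5951",
             "3,own,2096"), csv_file)

as_chr <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)
as_fac <- read.csv(csv_file, header = TRUE, stringsAsFactors = TRUE)

class(as_chr$housing)    # "character"
class(as_fac$housing)    # "factor"
levels(as_fac$housing)   # "own" "rent"
```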

We will now learn to import data from a website.

theUrl <-  "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
test.tab <- read.table(file=theUrl, header=TRUE, sep=",", stringsAsFactors=FALSE)
head(test.tab)
test.txt <- read.table("https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test.txt", header=TRUE)
head(test.txt)

Large files can be slow to read into memory using read.table(), but there are faster alternatives.
The two most relevant functions for reading large CSVs and other text files are:
- read_delim() from the readr package by Hadley Wickham.
- fread() from the data.table package by Matt Dowle.
By default, neither converts character data to factors.

3.3 read_delim

read_delim(), like all the data-reading functions in readr, returns a tibble, which is an extension of data.frame.

library(readr)
theUrl <-  "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
test.del <- read_delim(file=theUrl, delim=',')
head(test.del)
class(test.del)

The functions read_csv(), read_csv2() and read_tsv() are special cases for when the delimiters are commas (,), semicolons (;) and tabs (\t), respectively.

3.4 fread

fread() results in a data.table object which is an extension of data.frame.

library(data.table)
theUrl <- "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/test-1.csv"
test.fre <- fread(input=theUrl, sep=',', header=TRUE)
head(test.fre)
class(test.fre)

Both read_delim() and fread() are fast; the decision of which one to use depends on whether dplyr or data.table is preferred for data manipulation.

3.5 Excel Data

# Use readxl package to read xls|xlsx
library("readxl")
my_data <- read_excel(file, sheet = "------") # For sheet specify an index or name.

# Use xlsx package
library("xlsx")
my_data <- read.xlsx(file, sheetIndex, header = TRUE)

3.6 Data from Other Statistical Software

The foreign package has a number of functions, similar to read.table(), for reading data from other tools.
A partial list of functions to read data from commonly used statistical tools is given below:

Function	Format
read.spss	SPSS
read.dta	Stata
read.ssd	SAS
read.octave	Octave
read.mtp	Minitab
read.systat	Systat

A newer package called haven, written by Hadley Wickham and optimized for speed, can also be used to read data from some standard statistical software. Its functions return a tibble rather than a data.frame.

3.7 Loading Built-in Datasets in R

To see the list of pre-loaded data, type the function data():

data()

Load and print data as follows:

data(Seatbelts)
head(Seatbelts, 5)

data(iris)
head(iris, 5)

3.8 Exporting Data from R

Base functions for writing data: write.table(), write.csv().
For faster writing of data from R to txt or csv files, use the readr functions write_tsv() and write_csv(). For Excel files, use the function write.xlsx() from the xlsx package.

data(iris)
write.csv(iris, file = "iris1.csv")
library(readr)
write_csv(iris, file = "iris2.csv")
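By default, write.csv() also writes the row names as an extra first column. The following sketch, which uses a temporary file so nothing is left in the working directory, shows how row.names = FALSE avoids this and checks that the data can be read back:

```r
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE omits the extra column of row names.
write.csv(iris, file = out_file, row.names = FALSE)

iris_back <- read.csv(out_file, stringsAsFactors = TRUE)
dim(iris_back)                             # 150 5
identical(names(iris_back), names(iris))   # TRUE
```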

4 Data Manipulation

4.1 Data Inspection

head() for first few rows of a matrix or data frame.
tail() for last few rows of a matrix or data frame.
dim() for dimension of a matrix or data frame.
str() for displaying the structure of an R object.
nrow() for number of rows of a matrix or data frame.
ncol() for number of columns of a matrix or data frame.
summary() for numeric variables.
quantile() for quartiles.
table() for categorical variables.
sum(is.na()) for counting the number of NAs in the entire dataset.

If you need to change the data type for any column, use the following functions:

as.character() converts to a text string.
as.numeric() converts to a number.
as.factor() converts to a categorical variable.
as.integer() converts to an integer.
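A few of these functions can be tried together on a small example. The data frame below is hypothetical and deliberately contains a missing value and a non-numeric entry, to show what the conversion functions do with them.

```r
# Hypothetical data with one NA and one entry ("x") that is not a number.
df <- data.frame(group = c("a", "b", "a", NA),
                 value = c("1.5", "2", "x", "4"),
                 stringsAsFactors = FALSE)

str(df)                # both columns are read as character
table(df$group)        # counts per category; NA is excluded by default
na_before <- sum(is.na(df))
na_before              # 1

df$group <- as.factor(df$group)
# as.numeric() turns entries that are not numbers into NA (with a warning).
df$value <- suppressWarnings(as.numeric(df$value))
na_after <- sum(is.na(df))
na_after               # 2
summary(df$value)
```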

4.2 Inspect the credit dataset using str():

url_credit <- "https://raw.githubusercontent.com/sylvadon5/data-files/main/credit.csv"
credit_data <- read.csv(url_credit, header = TRUE,  stringsAsFactors = FALSE)
# credit_data <- read.csv("data/credit.csv", header = TRUE,  stringsAsFactors = FALSE)
dim(credit_data)

[1] 1000   17

str(credit_data)

'data.frame':   1000 obs. of  17 variables:
 $ checking_balance    : chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : chr  "critical" "good" "critical" "good" ...
 $ purpose             : chr  "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ employment_duration : chr  "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : chr  "none" "none" "none" "none" ...
 $ housing             : chr  "own" "own" "own" "other" ...
 $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
 $ job                 : chr  "skilled" "skilled" "unskilled" "skilled" ...
 $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
 $ phone               : chr  "yes" "no" "no" "no" ...
 $ default             : chr  "no" "yes" "no" "no" ...

4.3 dplyr Package

We will use the dplyr package from the tidyverse packages to manipulate data.
Here are some of the most useful functions in dplyr:
- select: Choose which columns to include.
- filter: Filter the data.
- group_by: Group the data by a categorical variable.
- summarize: Summarize, or aggregate (for each group if following group_by). Often used in conjunction with functions including: mean, median, max, min, sum, n etc.
- mutate: Create new column(s) in the data, or change existing column(s).

These functions can be chained together using the operator %>% which makes the output of one line of code the input for the next.
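As a rough illustration of chaining, the sketch below combines all five functions in one pipeline. It assumes the dplyr package is installed and uses the built-in mtcars data set purely for convenience; the column name hp_per_cyl is made up for the example.

```r
library(dplyr)

# mtcars is a built-in data set, used here only for illustration.
result <- mtcars %>%
  filter(mpg > 20) %>%                 # keep cars with mpg above 20
  select(mpg, cyl, hp) %>%             # keep three columns
  mutate(hp_per_cyl = hp / cyl) %>%    # add a derived column
  group_by(cyl) %>%                    # one group per cylinder count
  summarize(n = n(), mean_hp_per_cyl = mean(hp_per_cyl))
result
```

Each step receives the output of the previous one as its first argument, so the pipeline reads from top to bottom in the order the operations are applied.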

4.3.1 Comparison Operators

The comparison operators we can use for filtering include:
- x < y (less than)
- x > y (greater than)
- x <= y (less than or equal to)
- x >= y (greater than or equal to)
- x == y (equal)
- x != y (not equal)
The relevant logical operators include:
- ! x (NOT operator)
- x & y (AND operator)
- x | y (OR operator)

Gapminder data: Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country

# install.packages("gapminder")
library(gapminder)
library(tidyverse)

4.3.2 filter()

filter_1 <- filter(gapminder, country == "United States")
filter_2 <- filter(gapminder, country != "United States")
filter_3 <- filter(gapminder, pop < 1000000)
filter_4 <- filter(gapminder, pop < 1000000 | year == 2007)
filter_5 <- filter(gapminder, country == "United States") %>% 
  filter(lifeExp >= 66 & lifeExp <= 80)
filter_6 <- filter(gapminder, pop < 1000000 & year != 2007)
filter_7 <- filter(gapminder, country %in% c("United States", "Canada")) %>%
  filter(year > 2000) %>%
  filter(pop > 100000) %>%
  filter(lifeExp >= 18)
filter_8 <- filter(gapminder, !continent %in% c("Asia", "Europe", "Americas")) %>%
  filter(year > 2000) %>%
  filter(pop > 100000) %>%
  filter(lifeExp >= 18)

4.3.3 select()

select_1 <- select(gapminder, -country)
# Three equivalent ways to keep the same four columns
select_2 <- select(gapminder, lifeExp, pop, gdpPercap, continent)
select_2 <- select(gapminder, c(lifeExp, pop, gdpPercap, continent))
select_2 <- select(gapminder, c("lifeExp", "pop", "gdpPercap", "continent"))
# Two equivalent ways to drop those four columns
select_3 <- select(gapminder, -c(lifeExp, pop, gdpPercap, continent))
select_3 <- select(gapminder, -c("lifeExp", "pop", "gdpPercap", "continent"))

4.3.4 filter() and select()

filter_select_1 <- filter(gapminder, year == 2007) %>% 
  select(country, year, lifeExp)
filter_select_2 <- filter(gapminder, country == "United States" | country == "Canada", 
      year > 2000) %>% 
  select(country, year, lifeExp)

4.3.5 mutate()

mutate_1 <- mutate(gapminder, popMil = round(pop / 1000000, 1))
mutate_2 <- mutate(gapminder, popMil = round(pop / 1000000, 1)) %>%
  mutate(Log_lifeExp = log(lifeExp))

4.3.6 Group Data

group_year <- group_by(gapminder, year)
group_continent <- group_by(gapminder, continent)

4.3.7 Summarise or Summarize with Groups

# Needed to get higher moments like skewness/kurtosis
library(moments) 
group_1 <- group_by(gapminder, continent) %>%
  summarize(mean = mean(lifeExp),
            stdev = sd(lifeExp),
            median = median(lifeExp),
            min = min(lifeExp),
            max = max(lifeExp),
            n = n(),
            se = stdev/sqrt(n),
            skew = skewness(lifeExp),
            kur = kurtosis(lifeExp))

group_2 <- group_by(gapminder, year) %>%
  summarise(mean = mean(lifeExp),
            stdev = sd(lifeExp),
            median = median(lifeExp),
            min = min(lifeExp),
            max = max(lifeExp),
            n = n(),
            se = stdev/sqrt(n),
            skew = skewness(lifeExp),
            kur = kurtosis(lifeExp))

4.3.8 Summarize

sumr_1 <- summarise(gapminder, 
          mean = mean(lifeExp),
          stdev = sd(lifeExp),
          median = median(lifeExp),
          min = min(lifeExp),
          max = max(lifeExp),
          n = n(),
          se = stdev/sqrt(n),
          skew = skewness(lifeExp),
          kur = kurtosis(lifeExp))

sumr_2 <- summarise(gapminder, 
          mean = mean(gdpPercap),
          stdev = sd(gdpPercap),
          median = median(gdpPercap),
          min = min(gdpPercap),
          max = max(gdpPercap),
          n = n(),
          se = stdev/sqrt(n),
          skew = skewness(gdpPercap),
          kur = kurtosis(gdpPercap))

# Bind to create a Data frame 
dd <- round(as.data.frame(rbind(sumr_1, sumr_2)), 3)
rownames(dd) <- c("Life Expectancy", "GDP Per Capita")
dd


5 Graphics in R

5.1 Base R Plots

R has several built-in plotting functions.

Scatter Plot

URL <- "https://raw.githubusercontent.com/sylvadon5/data-files/main/credit.csv"
credit_data <- read.csv(URL, header = TRUE)
plot(credit_data$age, credit_data$amount,
     xlab = "Age", ylab = "Amount", main = "Amount vs Age",
     pch = 25, col = "red")

Histogram

hist(credit_data$age, xlab = "Age", ylab = "Frequency", main = "Histogram",
     freq = TRUE, col = "purple", border = "red", breaks = 7)
grid()

Density Plot

plot(density(credit_data$amount), col = "green",
     xlab = "Amount", main = "Density", lwd = 5)
grid()

Boxplot

boxplot(credit_data$amount~credit_data$job, xlab = "Job", ylab = "Amount",
        main = "Boxplot", horizontal = FALSE, col = "blue")
grid()

Bar Graph

frequency <- table(credit_data$years_at_residence)
barplot(frequency, xlab = "Years at Residence", main = "Years at Residence",
        ylim = c(0, 500))

5.2 ggplot2

ggplot2 is a plotting system developed by Hadley Wickham in 2005.
It makes it easy to create complicated graphs.
ggplot graphs are built layer by layer by adding new elements.
The ggplot function uses the following basic syntax for different types of graphs:

ggplot(<DATA>, mapping = aes(<MAPPINGS>)) + 
  <GEOM_FUNCTION>()

DATA: Data set containing the variables to be used for plotting.
aes: Stands for “Aesthetic”. Function that defines the variables to be plotted and other plotting characteristics such as color, shape, size etc.
GEOM_FUNCTION: Defines how the data is to be represented in the plot. Popular GEOM_FUNCTIONS include:
- geom_point() for scatter plots.
- geom_boxplot() for boxplots.
- geom_histogram() for histograms.
- geom_bar() for bar graphs.
To add a GEOM_FUNCTION to the plot, we use the + operator.
The + operator can also be used to add other layers, such as labs(), to the plot.
We can install and load the ggplot2 package via the tidyverse packages.
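Putting the template together, a scatter plot might be built as in the sketch below. It assumes the ggplot2 package is installed and uses the built-in mtcars data set only as an example; the axis labels and title are illustrative.

```r
library(ggplot2)

# DATA = mtcars, MAPPINGS = wt on the x-axis and mpg on the y-axis,
# GEOM_FUNCTION = geom_point() for a scatter plot.
p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(color = "red") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "MPG vs Weight")
p
```

Storing the plot in an object such as p makes it possible to add further layers later with the + operator.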