Autumn School - Day 1

Introduction to R - Session 1

Dr. J. Kavanagh

2023-09-09

Setting a working directory

In this document, there is a series of examples setting out the basic commands of R.

You must have R and RStudio installed on your computer before we begin.

The appropriate links for these downloads are provided in the Moodle page for this module.

RStudio Environment

When you open RStudio for the first time there are four windows, which you can rearrange to your own preference. The text colour, font type and size can also be customised to suit your needs. RStudio uses the memory of your computer, and you need to set a Working Directory for your analysis.

Go to Console and type the following:

# This command shows you the current working directory that RStudio is set up to use
getwd()
## [1] "/Users/jackkavanagh/Dropbox/R_Business"

My advice is to create a specific folder and link RStudio to it using the Session -> Set Working Directory -> Choose Directory… option in the Menu Bar. Alternatively, you can specify the folder in the Console as follows:

# This command sets a distinct working directory for RStudio to utilise 
setwd("/Users/jackkavanagh/Dropbox/R_Business")
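Before calling setwd() it can help to confirm that the folder actually exists. The following is a minimal sketch only: the folder name r_autumn_school and the use of R's temporary directory are assumptions made for illustration, so substitute the path to your own folder.

```r
# Hypothetical project folder inside R's temporary directory (illustration only)
project_dir <- file.path(tempdir(), "r_autumn_school")

# Create the folder if it does not exist yet
if (!dir.exists(project_dir)) {
  dir.create(project_dir, recursive = TRUE)
}

# setwd() invisibly returns the previous working directory, so keep hold of it
old_dir <- setwd(project_dir)

# Confirm the change
getwd()

# Return to where we started
setwd(old_dir)
```

Using file.path() rather than pasting strings together means the same code should work on Windows, macOS and Linux.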

Libraries

This lecture is designed to show you the potential for using R for statistical analysis. You must have the following libraries installed via RStudio before we begin. Run the following command to install the necessary packages.

# The package names must be combined into a single character vector with c()
install.packages(c("tidyverse", "ggthemes", "historydata", "lubridate"))

# The library() command will load the relevant libraries
library(tidyverse)
library(ggthemes)
library(historydata)
library(lubridate)
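If you are not sure which of these packages are already on your computer, one possible check is sketched below. It only reports what is missing and does not install anything; the object names required_pkgs, is_installed and missing_pkgs are made up for this example.

```r
# Packages used in this session
required_pkgs <- c("tidyverse", "ggthemes", "historydata", "lubridate")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Running install.packages(missing_pkgs) afterwards would then install only the packages that are absent.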

Since the creation of the ‘Tidyverse’ by Hadley Wickham there has been a trend to teach solely within the Tidyverse framework, which involves concentrating on a number of small and interlinked packages in R. This approach, while useful for creating a basic overview of R, prevents users from understanding the full potential of R for analysis and leads to errors which can often be quickly solved using base R commands.

Data Types in R

The following definitions were set out by Deborah Nolan in ‘An introduction to programming in R (2019)’:

In R, vectors are the primitive objects. A vector is simply an ordered collection of values grouped together into a single container. Some primitive types of vectors are numeric, logical, and character. A very important characteristic of these vectors is that they can only store values of the same type. A vector contains values that are homogeneous primitive elements. That is, a numeric vector contains only real numbers, a logical vector stores values that are either TRUE or FALSE, and character vectors store strings.
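The short example below illustrates the point about vectors holding values of a single type. The object names are invented for illustration. Note what happens in the last step: when types are mixed, R quietly converts everything to one common type.

```r
# Three vectors, one of each primitive type described above
num_vec <- c(1.5, 2, 3)
lgl_vec <- c(TRUE, FALSE, TRUE)
chr_vec <- c("a", "b", "c")

class(num_vec)   # "numeric"
class(lgl_vec)   # "logical"
class(chr_vec)   # "character"

# Mixing types forces R to convert every element to a single type
mixed <- c(1, "a", TRUE)
class(mixed)     # "character"
mixed            # "1" "a" "TRUE"
```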

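To sum up:

    numeric vectors contain real numbers
    logical vectors contain TRUE or FALSE
    character vectors contain strings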

Base R

The Console is where you type in commands and after hitting return you get results. As R was originally designed for statistical purposes, it has fully operational calculator functions as follows:

            + = add
            - = subtract
            * = multiply
            / = divide
1+3
## [1] 4
2*10
## [1] 20
100 - 48
## [1] 52
150 / 25
## [1] 6
pi
## [1] 3.141593
3*pi
## [1] 9.424778
sin(pi/2)
## [1] 1

You can store the result of a calculation or a function in a named object (a vector) using the <- operator

x <- 3*pi

print(x)
## [1] 9.424778
y <- 1:5

y
## [1] 1 2 3 4 5
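Once a vector has been stored, arithmetic and summary functions can be applied to all of its values at once. This is a small sketch using the y vector from above; the expected results are shown as comments.

```r
# Recreate the vector so this example runs on its own
y <- 1:5

# Arithmetic is applied to every element in turn
y * 2     # 2 4 6 8 10
y + 10    # 11 12 13 14 15

# Summary functions reduce the vector to a single value
sum(y)    # 15
mean(y)   # 3
```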

R Loops

Loops are essential automated processes that can analyse multiple datasets. They often appear quite off-putting, but they are simple to understand once broken down into the distinct types that are available.

# Create a new vector 
r_loop <- 1 

r_loop
## [1] 1

Now this loop is going to add 4 to the r_loop vector 15 times.

for(i in 1:15) {              # Head of for-loop
  
  r_loop <- r_loop + 4          # Body of for-loop
  print(r_loop)
}
## [1] 5
## [1] 9
## [1] 13
## [1] 17
## [1] 21
## [1] 25
## [1] 29
## [1] 33
## [1] 37
## [1] 41
## [1] 45
## [1] 49
## [1] 53
## [1] 57
## [1] 61
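The for-loop is only one of the loop types available in R. As a rough comparison, the same calculation can be sketched with a while-loop, which keeps running for as long as its condition is TRUE. The names r_loop_w and steps are made up for this example.

```r
# Start from the same value as before
r_loop_w <- 1

# Counter to keep track of how many times the loop has run
steps <- 0

while (steps < 15) {            # Head of while-loop: keep going until 15 steps are done
  r_loop_w <- r_loop_w + 4      # Body of while-loop
  steps <- steps + 1            # Without this line the loop would never end
}

r_loop_w   # 61, the same final value as the for-loop
```

A for-loop is usually the tidier choice when the number of repetitions is known in advance; a while-loop suits cases where you only know the stopping condition.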

R Loops - Character Vectors

# This creates a character vector
r_loop_2 <- c("Ringo", "John", "Paul", "George", "Linda", 
              "Janice", "Ella", "Sarah", "Barbara")
for(i in r_loop_2) {         # Loop over character vector
  print(paste("The name", i, "consists of", nchar(i), "characters."))
}
## [1] "The name Ringo consists of 5 characters."
## [1] "The name John consists of 4 characters."
## [1] "The name Paul consists of 4 characters."
## [1] "The name George consists of 6 characters."
## [1] "The name Linda consists of 5 characters."
## [1] "The name Janice consists of 6 characters."
## [1] "The name Ella consists of 4 characters."
## [1] "The name Sarah consists of 5 characters."
## [1] "The name Barbara consists of 7 characters."
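It is worth knowing that many base R functions, including nchar() and paste(), already work on a whole vector at once, so a loop is not always required. The sketch below repeats the example above without a loop; name_lengths is a name invented for illustration.

```r
# Recreate the character vector so this example runs on its own
r_loop_2 <- c("Ringo", "John", "Paul", "George", "Linda",
              "Janice", "Ella", "Sarah", "Barbara")

# nchar() counts the characters in every element in one step
name_lengths <- nchar(r_loop_2)
name_lengths   # 5 4 4 6 5 6 4 5 7

# paste() is also applied element by element, giving one sentence per name
paste("The name", r_loop_2, "consists of", name_lengths, "characters.")
```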

R Loops - Breaks & Skipping

You can tell your loop either to skip an iteration or to stop altogether, using the very simple keywords next and break.

for(i in 1:5) {                     # for-loop with break
  
  if(i == 4) {
    break
  }
  
  print(paste("Sin é", i))
}
## [1] "Sin é 1"
## [1] "Sin é 2"
## [1] "Sin é 3"
for(i in 1:7) {                     # for-loop with skip
  
  if(i == 4) {
    next
  }
  
  print(paste("Anseo", i))
}
## [1] "Anseo 1"
## [1] "Anseo 2"
## [1] "Anseo 3"
## [1] "Anseo 5"
## [1] "Anseo 6"
## [1] "Anseo 7"
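Both keywords can be used in the same loop. The following sketch, with a made-up vector called kept, skips the even numbers with next and stops the loop with break once it passes 7.

```r
# Empty container for the values we decide to keep
kept <- c()

for (i in 1:10) {
  if (i %% 2 == 0) {   # %% gives the remainder, so this is TRUE for even numbers
    next               # Skip the rest of this iteration
  }
  if (i > 7) {
    break              # Leave the loop entirely
  }
  kept <- c(kept, i)   # Only reached for odd numbers up to 7
}

kept   # 1 3 5 7
```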

Sample Data

Sample data is included with R and with individual libraries. Use the following command to bring up all the sample datasets that can be called into the Environment panel of RStudio.

data()

We want to bring in the following datasets from the historydata library

data("judges_people", "judges_appointments", "early_colleges")

These will now be listed as <Promise> in the Environment panel, so you need to call each one by name (judges_people, judges_appointments and early_colleges) to load it into the RStudio environment

## # A tibble: 3,532 × 13
##    judge_id name_first name_middle name_last name_suffix birth_date
##       <int> <chr>      <chr>       <chr>     <chr>            <int>
##  1     3419 Ronnie     <NA>        Abrams    <NA>              1968
##  2        1 Matthew    T.          Abruzzo   <NA>              1889
##  3        2 Marcus     Wilson      Acheson   <NA>              1828
##  4        3 William    Marsh       Acker     Jr.               1927
##  5        4 Harold     Arnold      Ackerman  <NA>              1928
##  6        5 James      Waldo       Ackerman  <NA>              1926
##  7        6 Raymond    L.          Acosta    <NA>              1925
##  8        7 J[ackson]  Leroy       Adair     <NA>              1887
##  9        8 Arlin      Marvin      Adams     <NA>              1921
## 10        9 Elmer      Bragg       Adams     <NA>              1842
## # ℹ 3,522 more rows
## # ℹ 7 more variables: birthplace_city <chr>, birthplace_state <chr>,
## #   death_date <int>, death_city <chr>, death_state <chr>, gender <chr>,
## #   race <chr>
## # A tibble: 4,202 × 15
##    judge_id court_name court_type president_name president_party nomination_date
##       <int> <chr>      <chr>      <chr>          <chr>           <chr>          
##  1     3419 U. S. Dis… USDC       Barack Obama   Democratic      07/28/2011     
##  2        1 U. S. Dis… USDC       Franklin D. R… Democratic      02/03/1936     
##  3        2 U. S. Dis… USDC       Rutherford B.… Republican      01/06/1880     
##  4        3 U. S. Dis… USDC       Ronald Reagan  Republican      07/22/1982     
##  5        4 U. S. Dis… USDC       Jimmy Carter   Democratic      09/28/1979     
##  6        5 U. S. Dis… USDC       Gerald Ford    Republican      06/18/1976     
##  7        6 U. S. Dis… USDC       Ronald Reagan  Republican      09/09/1982     
##  8        7 U. S. Dis… USDC       Franklin D. R… Democratic      03/24/1937     
##  9        8 U. S. Cou… USCA       Richard M. Ni… Republican      09/22/1969     
## 10        9 U. S. Dis… USDC       Grover Clevel… Democratic      12/04/1895     
## # ℹ 4,192 more rows
## # ℹ 9 more variables: predecessor_last_name <chr>,
## #   predecessor_first_name <chr>, senate_confirmation_date <chr>,
## #   commission_date <chr>, chief_judge_begin <int>, chief_judge_end <int>,
## #   retirement_from_active_service <chr>, termination_date <chr>,
## #   termination_reason <chr>
## # A tibble: 65 × 6
##    college                original_name      city  state established sponsorship
##    <chr>                  <chr>              <chr> <chr>       <int> <chr>      
##  1 Harvard                <NA>               Camb… MA           1636 Congregati…
##  2 William and Mary       <NA>               Will… VA           1693 Anglican   
##  3 Yale                   <NA>               New … CT           1701 Congregati…
##  4 Pennsylvania, Univ. of <NA>               Phil… PA           1740 Nondenomin…
##  5 Princeton              College of New Je… Prin… NJ           1746 Presbyteri…
##  6 Columbia               King's College     New … NY           1754 Anglican   
##  7 Brown                  <NA>               Prov… RI           1765 Baptist    
##  8 Rutgers                Queen's College    New … NJ           1766 Dutch Refo…
##  9 Dartmouth              <NA>               Hano… NH           1769 Congregati…
## 10 Charleston, Coll. Of   <NA>               Char… SC           1770 Anglican   
## # ℹ 55 more rows

TidyR Commands

The $ operator is used to access the internal components (columns) of a dataframe. Typing the name of a dataframe followed by $ in RStudio brings up a list of its columns

    early_colleges$

The %>% pipe operator will be used throughout this week to link commands together, passing the result of one command on to the next

    early_colleges %>% select(established)

The %in% operator checks whether each value in a column matches any of the values in a given vector

    early_colleges %>% filter(established %in% c(1795, 1797, 1802))

Please try each of these now in the Console section of RStudio
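To see what $ and %in% return on their own, here is a minimal base R sketch. It uses a small hand-made data frame called mini_colleges, which is an assumption for illustration (four rows copied from the early_colleges output above), so that it runs without any libraries.

```r
# Hypothetical four-row extract of early_colleges (illustration only)
mini_colleges <- data.frame(
  college     = c("Harvard", "William and Mary", "Yale", "Columbia"),
  state       = c("MA", "VA", "CT", "NY"),
  established = c(1636L, 1693L, 1701L, 1754L)
)

# $ pulls out one column as a vector
mini_colleges$established                      # 1636 1693 1701 1754

# %in% returns TRUE or FALSE for every value in the column
mini_colleges$established %in% c(1693, 1754)   # FALSE TRUE FALSE TRUE

# The TRUE/FALSE vector can be used to keep only the matching rows
mini_colleges[mini_colleges$established %in% c(1693, 1754), ]
```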

Explore the Data

The head() and tail() commands are useful for exploring the datasets: they show the first and last rows of the dataset respectively. This is particularly useful when importing data, to check that all the information has been read in correctly.

# Note the n= sets the number of rows being displayed
head(early_colleges, n=10)
## # A tibble: 10 × 6
##    college                original_name      city  state established sponsorship
##    <chr>                  <chr>              <chr> <chr>       <int> <chr>      
##  1 Harvard                <NA>               Camb… MA           1636 Congregati…
##  2 William and Mary       <NA>               Will… VA           1693 Anglican   
##  3 Yale                   <NA>               New … CT           1701 Congregati…
##  4 Pennsylvania, Univ. of <NA>               Phil… PA           1740 Nondenomin…
##  5 Princeton              College of New Je… Prin… NJ           1746 Presbyteri…
##  6 Columbia               King's College     New … NY           1754 Anglican   
##  7 Brown                  <NA>               Prov… RI           1765 Baptist    
##  8 Rutgers                Queen's College    New … NJ           1766 Dutch Refo…
##  9 Dartmouth              <NA>               Hano… NH           1769 Congregati…
## 10 Charleston, Coll. Of   <NA>               Char… SC           1770 Anglican
tail(early_colleges, n=10)
## # A tibble: 10 × 6
##    college               original_name city        state established sponsorship
##    <chr>                 <chr>         <chr>       <chr>       <int> <chr>      
##  1 Holy Cross            <NA>          Worchester  MA           1843 Roman Cath…
##  2 Mississipps, Univ. of <NA>          Oxford      MI           1844 Secular    
##  3 Louisiana, Univ. of   <NA>          New Orleans LA           1845 Secular    
##  4 U.S. Naval Academy    <NA>          Annapolis   MD           1845 Secular    
##  5 Beloit                <NA>          Beloit      WI           1846 Congregati…
##  6 Bucknell              <NA>          Lewisburg   PA           1846 Baptist    
##  7 Grinnell              <NA>          Grinnell    IA           1846 Congregati…
##  8 Mount Union           <NA>          Alliance    OH           1846 Methodist  
##  9 Earlham               <NA>          Richmond    IN           1847 Quaker     
## 10 Wisconsin, Univ. of   <NA>          Madison     WI           1848 Secular

Another way to view an entire dataset is to use the glimpse() command, which displays the overall dataset and the data type of each column

glimpse(early_colleges)
## Rows: 65
## Columns: 6
## $ college       <chr> "Harvard", "William and Mary", "Yale", "Pennsylvania, Un…
## $ original_name <chr> NA, NA, NA, NA, "College of New Jersey", "King's College…
## $ city          <chr> "Cambridge", "Williamsburg", "New Haven", "Philadelphia"…
## $ state         <chr> "MA", "VA", "CT", "PA", "NJ", "NY", "RI", "NJ", "NH", "S…
## $ established   <int> 1636, 1693, 1701, 1740, 1746, 1754, 1765, 1766, 1769, 17…
## $ sponsorship   <chr> "Congregational; after 1805 Unitarian", "Anglican", "Con…

The summary() command provides an overview of the dataset, and is most beneficial with numerical data

summary(early_colleges)
##    college          original_name          city              state          
##  Length:65          Length:65          Length:65          Length:65         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   established   sponsorship       
##  Min.   :1636   Length:65         
##  1st Qu.:1793   Class :character  
##  Median :1823   Mode  :character  
##  Mean   :1810                     
##  3rd Qu.:1838                     
##  Max.   :1848
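Base R has its own commands for the same kind of exploration, which work even when no libraries are loaded. The sketch below applies them to mini_colleges, a small hand-made data frame assumed for illustration (rows copied from the output above).

```r
# Hypothetical four-row extract of early_colleges (illustration only)
mini_colleges <- data.frame(
  college     = c("Harvard", "William and Mary", "Yale", "Columbia"),
  state       = c("MA", "VA", "CT", "NY"),
  established = c(1636L, 1693L, 1701L, 1754L)
)

dim(mini_colleges)     # Number of rows and columns: 4 3
names(mini_colleges)   # Column names
str(mini_colleges)     # Structure: the type and first values of each column

# summary() can also be applied to a single numerical column
summary(mini_colleges$established)
range(mini_colleges$established)   # 1636 1754
```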

Selecting and Filtering the Data

The select() command from ‘dplyr’ picks out the columns you name. It is very versatile and can be chained with other commands using the %>% pipe.

early_colleges %>% select(college, city, state) %>% head()
## # A tibble: 6 × 3
##   college                city         state
##   <chr>                  <chr>        <chr>
## 1 Harvard                Cambridge    MA   
## 2 William and Mary       Williamsburg VA   
## 3 Yale                   New Haven    CT   
## 4 Pennsylvania, Univ. of Philadelphia PA   
## 5 Princeton              Princeton    NJ   
## 6 Columbia               New York     NY

The filter() command, also from ‘dplyr’, keeps only the rows that meet a condition. The condition can be numerical or text based.

This example shows the first six colleges established prior to 1800

early_colleges %>% filter(established < 1800) %>% head()
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Columbia               King's College      New … NY           1754 Anglican
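If you want the number of matching colleges rather than the rows themselves, one option is to pass the filtered result on to nrow(). The sketch below assumes the dplyr library is installed and uses a hand-made data frame, mini_colleges, with five rows copied from the outputs in this document.

```r
library(dplyr)

# Hypothetical five-row extract of early_colleges (illustration only)
mini_colleges <- data.frame(
  college     = c("Harvard", "Columbia", "Union College", "Colgate", "New York Univ."),
  state       = c("MA", "NY", "NY", "NY", "NY"),
  established = c(1636L, 1754L, 1795L, 1819L, 1831L)
)

# filter() keeps the matching rows and nrow() counts them
n_before_1800 <- mini_colleges %>%
  filter(established < 1800) %>%
  nrow()

n_before_1800   # 3
```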

This example shows the colleges in the state of New York using the == operator

early_colleges %>% filter(state == "NY")
## # A tibble: 6 × 6
##   college               original_name  city        state established sponsorship
##   <chr>                 <chr>          <chr>       <chr>       <int> <chr>      
## 1 Columbia              King's College New York    NY           1754 Anglican   
## 2 Union College         <NA>           Schenectady NY           1795 Presbyteri…
## 3 U.S. Military Academy <NA>           West Point  NY           1802 Secular    
## 4 Colgate               <NA>           Hamilton    NY           1819 Baptist    
## 5 New York Univ.        <NA>           New York    NY           1831 Nondenomin…
## 6 Fordham               <NA>           Fordham     NY           1841 Roman Cath…

Using the != operator displays the colleges in all states other than New York

early_colleges %>% filter(state != "NY") %>% head()
## # A tibble: 6 × 6
##   college                original_name       city  state established sponsorship
##   <chr>                  <chr>               <chr> <chr>       <int> <chr>      
## 1 Harvard                <NA>                Camb… MA           1636 Congregati…
## 2 William and Mary       <NA>                Will… VA           1693 Anglican   
## 3 Yale                   <NA>                New … CT           1701 Congregati…
## 4 Pennsylvania, Univ. of <NA>                Phil… PA           1740 Nondenomin…
## 5 Princeton              College of New Jer… Prin… NJ           1746 Presbyteri…
## 6 Brown                  <NA>                Prov… RI           1765 Baptist

The simple logical operators for the filter command are:

    & (and)

    | (or)

    ! (not)
early_colleges %>% filter(state %in% c( 'NY','VA','MA')) %>% head()
## # A tibble: 6 × 6
##   college          original_name  city           state established sponsorship  
##   <chr>            <chr>          <chr>          <chr>       <int> <chr>        
## 1 Harvard          <NA>           Cambridge      MA           1636 Congregation…
## 2 William and Mary <NA>           Williamsburg   VA           1693 Anglican     
## 3 Columbia         King's College New York       NY           1754 Anglican     
## 4 Hampden-Sydney   <NA>           Hampden-Sydney VA           1775 Presbyterian 
## 5 Williams         <NA>           Williamstown   MA           1793 Congregation…
## 6 Union College    <NA>           Schenectady    NY           1795 Presbyterian…
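The logical operators listed above can be tried out as follows. This is a sketch that assumes the dplyr library is installed; mini_colleges is a hand-made data frame (rows copied from the outputs in this document) and the result names are invented for illustration. It also shows that a chain of | conditions gives the same rows as %in%.

```r
library(dplyr)

# Hypothetical five-row extract of early_colleges (illustration only)
mini_colleges <- data.frame(
  college     = c("Harvard", "William and Mary", "Yale", "Columbia", "Union College"),
  state       = c("MA", "VA", "CT", "NY", "NY"),
  established = c(1636L, 1693L, 1701L, 1754L, 1795L)
)

# & (and): both conditions must be TRUE, leaving only Columbia
ny_early <- mini_colleges %>% filter(state == "NY" & established < 1760)

# | (or): at least one condition must be TRUE
any_of_three <- mini_colleges %>%
  filter(state == "NY" | state == "VA" | state == "MA")

# The same rows, written more compactly with %in%
using_in <- mini_colleges %>% filter(state %in% c("NY", "VA", "MA"))

# ! (not): reverses the condition, leaving only Yale
none_of_three <- mini_colleges %>% filter(!(state %in% c("NY", "VA", "MA")))
```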

Note that the %in% command can also be expressed as follows:

# Create a new character vector of three states using their abbreviations
three_states <- c("NY", "VA", "MA")
# Filter for this using the %in% command
early_colleges %>% filter(state %in% three_states) %>% head()
## # A tibble: 6 × 6
##   college          original_name  city           state established sponsorship  
##   <chr>            <chr>          <chr>          <chr>       <int> <chr>        
## 1 Harvard          <NA>           Cambridge      MA           1636 Congregation…
## 2 William and Mary <NA>           Williamsburg   VA           1693 Anglican     
## 3 Columbia         King's College New York       NY           1754 Anglican     
## 4 Hampden-Sydney   <NA>           Hampden-Sydney VA           1775 Presbyterian 
## 5 Williams         <NA>           Williamstown   MA           1793 Congregation…
## 6 Union College    <NA>           Schenectady    NY           1795 Presbyterian…

Saving new variables

Although mutate() creates a new column, the result will be lost unless you save it back into the main dataframe. Therefore always point your code back to the original dataframe using the -> assignment operator.

Some programmers use the = sign for assignment. However, this is not recommended in R, as that sign is also used to set arguments inside function calls (for example sep=","), which makes code harder to read.

# Now when you run this code, the number of variables will increase to 7
early_colleges %>% mutate(location=paste(city,state,sep=",")) -> early_colleges
early_colleges
## # A tibble: 65 × 7
##    college            original_name city  state established sponsorship location
##    <chr>              <chr>         <chr> <chr>       <int> <chr>       <chr>   
##  1 Harvard            <NA>          Camb… MA           1636 Congregati… Cambrid…
##  2 William and Mary   <NA>          Will… VA           1693 Anglican    William…
##  3 Yale               <NA>          New … CT           1701 Congregati… New Hav…
##  4 Pennsylvania, Uni… <NA>          Phil… PA           1740 Nondenomin… Philade…
##  5 Princeton          College of N… Prin… NJ           1746 Presbyteri… Princet…
##  6 Columbia           King's Colle… New … NY           1754 Anglican    New Yor…
##  7 Brown              <NA>          Prov… RI           1765 Baptist     Provide…
##  8 Rutgers            Queen's Coll… New … NJ           1766 Dutch Refo… New Bru…
##  9 Dartmouth          <NA>          Hano… NH           1769 Congregati… Hanover…
## 10 Charleston, Coll.… <NA>          Char… SC           1770 Anglican    Charles…
## # ℹ 55 more rows
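You will often see the same step written with the assignment pointing the other way, using <- at the start of the line; the two forms do the same job. The sketch below assumes the dplyr library is installed and uses a hand-made three-row data frame, mini_colleges, rather than the full dataset.

```r
library(dplyr)

# Hypothetical three-row extract of early_colleges (illustration only)
mini_colleges <- data.frame(
  college = c("Harvard", "Yale", "Columbia"),
  city    = c("Cambridge", "New Haven", "New York"),
  state   = c("MA", "CT", "NY")
)

ncol(mini_colleges)   # 3 columns before

# Save the result of mutate() back into the same object using <-
mini_colleges <- mini_colleges %>%
  mutate(location = paste(city, state, sep = ", "))

ncol(mini_colleges)      # 4 columns after
mini_colleges$location   # "Cambridge, MA" "New Haven, CT" "New York, NY"
```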

Class Exercise

Filter the early_colleges dataset to show the colleges established in the original 13 colonies of the United States of America

Create a new object from the early_colleges dataset showing the largest religious sponsorship of colleges