Introduction to R

What R looks like before tutorials vs. what you look like after learning R

Set up R

  1. Download R from CRAN (the Comprehensive R Archive Network).

  2. Next, download RStudio, an integrated development environment (IDE) with a built-in text editor. It makes working with R much easier.

  3. Once RStudio is installed, launch it and you are good to go!

The frontend

  1. Before I walk you through the interface, some fun first. You can change the default color scheme of your RStudio interface. Go to Tools>Global Options>Appearance. Here you can change the font type, font size, and color scheme. I am using the default. Here is what my interface looks like:

  2. The ordering/layout of these windows/tabs might be different on your screen. That’s ok. On my screen, notice that the console window is in the upper-right corner, the window with the Files and Plots tabs is in the lower-right corner, and the environment window is on the left. You can change this layout to whatever is more convenient for you. Go to Tools>Global Options>Pane Layout. Here are my settings:

  3. What are these tabs then?
  • The console tab is where you can type your commands and see the output. But if you’d like to keep track of your commands, you should start an R script. If you have used Stata before, it’s similar to a do-file. Go to File > New File > R Script to open a new script. You will find your R script in the Source tab on your panel. You can type and run your R commands there. The Source tab is also where your data frames open when you view them.

  • Everything in R is an object. The environment tab lists the objects you saved/created during the session. An object can be a list of values, a data frame, or a function. For instance, as you may see below, in my environment there is a dataset called “vdem15”, a function called “thewitcher” and a list of values called “lucky”.

  • The files tab displays all files in your default workspace. The plots tab displays the plots/figures that you’ll create.

You are good to go! I know R might be intimidating for some of you. But hang in there. You’ll be able to pull off cool stuff with it. I prepared this HTML page for you in R, for instance. Or you can design interactive plots like this (it’s called Rosenbrock’s valley):

Some Basics

Let’s check out the basic syntax and operators in R.

Mathematical Operations

Open your R script and type in the following commands. Then press Ctrl + Enter (Cmd + Enter on a Mac). With this shortcut, you can run the command on the current line or on any selected lines. You will see how the Console tab runs the code and prints the output.

### You can type your notes/headings after a hashtag.
105 + 105 #you may perform arithmetic operations
365/12
5^-2
-5.02 + -4.48
(3*3) + (2/5)

### There are some built-in functions in R.
log(100) #natural log
seq(2,6) #create a sequence of numbers from 2 to 6
seq(1,12, by=2) #create a sequence of numbers from 1 to 12 that increases by 2.
seq(0,1, length.out=11) #a sequence from 0 to 1 with a given total number of elements (here 11)
5:8 #an alternative notation for integer-sequence
sum(5:8) #take a sum of numbers from 5 to 8
mean(10:20) #find mean of numbers from 10 to 20
sqrt(16) #square root of a non-negative number

#You can always consult the documentation of these functions for help with the notation.
?seq
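
Since log() computes the natural logarithm by default, here is a small sketch of how to ask for other bases. It uses only base R functions; the numbers are arbitrary examples.

```r
# Logarithms with different bases (base R only)
log(100)            # natural log (base e)
## [1] 4.60517
log10(100)          # base-10 log
## [1] 2
log(8, base = 2)    # any base via the base argument
## [1] 3
exp(log(100))       # exp() undoes the natural log
## [1] 100
```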

Creating Objects

We store information in R sessions as an object with an assigned name. To that aim, we use the “<-” assignment operator.

### Assignments
result <- sqrt(36) + sum(4:15) / 2^4
result #Write the object name and hit Enter in the console, or Ctrl+Enter in R script. It'll print the result in the Console.
## [1] 13.125

Please note that the result object has just been stored as a value in your environment. If you assign a different value to a stored object, the old value is replaced without warning. Be advised! You can assign numeric values, functions, or strings of characters to an object.

#a string of characters is stored here, so it goes inside double quotation marks.
(result <- "i am a nerd") 

Trick: you can both store and print the object at the same time by putting the assignment in parentheses.

You can list and remove these stored objects.

### Listing and removing objects
a1 <- "number" 
a2 <- 5789 
ls() #lists all objects in the environment
## [1] "a1"     "a2"     "result"
ls(pattern="a") #lists all objects that include the letter a in their name
## [1] "a1" "a2"
rm(a1) #removes a1 object from the environment
rm(list=ls(pattern="a")) #removes all objects that contain the letter a from the environment
rm(list = ls()) #removes all objects

Each object has two intrinsic attributes: its class and its length. For example, an object might be logical, numeric, character, a function, etc.

### Object attributes
a1 <- "Tenet"
class(a1) #class function tells us the main attribute/type.
## [1] "character"
a2 <- 2020
class(a2)
## [1] "numeric"
a3 <- "2020"
class(a3)
## [1] "character"
a4 <- TRUE #logical
class(a4)
## [1] "logical"
class(seq)
## [1] "function"

You may use numeric objects for subsequent mathematical operations.

x <- sum(1:100)
y <- x/50
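
The class matters here: a number stored as text cannot be used in arithmetic until you convert it. A minimal sketch, where year_chr and year_num are hypothetical names chosen for illustration:

```r
year_chr <- "2020"                 # hypothetical name; a number stored as text
class(year_chr)
## [1] "character"
# year_chr + 1 would stop with "non-numeric argument to binary operator"
year_num <- as.numeric(year_chr)   # convert the text to a number
year_num + 1
## [1] 2021
is.numeric(year_num)
## [1] TRUE
```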

Vectors and Lists

We can combine/concatenate multiple elements and objects into one object - a vector.

### Concatenate
a1 <- c("Oppenheimer","is", "a", "good", "movie")
length(a1) #the length of the vector/how many elements
## [1] 5
a2 <- c(TRUE, FALSE, FALSE, TRUE)
length(a2)
## [1] 4
a3 <- c(seq(1,4), sqrt(16), 47+25, sum(6:8))
print(a3) #It will print the object in the console. 
## [1]  1  2  3  4  4 72 21
length(a3)
## [1] 7
a4 <- c(a2, a3) #You can combine vectors. Please note that logical elements are coerced into numeric.
print(a4)
##  [1]  1  0  0  1  1  2  3  4  4 72 21
a5 <- c(a1, a3) #Combined with character elements, numeric elements are coerced into character.
print(a5)
##  [1] "Oppenheimer" "is"          "a"           "good"        "movie"      
##  [6] "1"           "2"           "3"           "4"           "4"          
## [11] "72"          "21"

You can access specific elements of vectors through indexing.

### Indexing
a1[4] #we use square brackets for indexing. This means fourth element.
## [1] "good"
a4[2:7] #elements 2 through 7
## [1] 0 0 1 1 2 3
a4[c(2,7)] #second and seventh element
## [1] 0 3
a1[-4] #omit fourth element
## [1] "Oppenheimer" "is"          "a"           "movie"
a1[-c(3,5)] #omit third and fifth element
## [1] "Oppenheimer" "is"          "good"
a1[-(2:7)] #omit elements 2 through 7
## [1] "Oppenheimer"
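
Besides positions, you can also index with a condition. The sketch below recreates a4 with the same values as above so that it runs on its own; it shows one common pattern, not the only way to filter a vector.

```r
a4 <- c(1, 0, 0, 1, 1, 2, 3, 4, 4, 72, 21)  # same values as a4 above
a4 > 3            # a logical vector, one TRUE/FALSE per element
a4[a4 > 3]        # keep only the elements where the condition is TRUE
## [1]  4  4 72 21
which(a4 > 3)     # positions of those elements
## [1]  8  9 10 11
```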

You can use specific elements of numeric vectors for mathematical operations. Below you see the number of shootings and firearm discharges in Toronto last year.

Date        Cases
January     17
February    16
March       22
April       23
May         33
June        29
July        40
August      37
September   27
October     29
November    37

Table 1: Number of shootings and firearm discharges in Toronto

cases <- c(17, 16, 22, 23, 33, 29, 40, 37, 27, 29, 37, NA) #Storing the number of cases as a vector of values 
## 12 values with the last month missing

Vectorized arithmetic is also possible.

sort(cases, decreasing = TRUE) #sorts the elements in a decreasing fashion
##  [1] 40 37 37 33 29 29 27 23 22 17 16
results <- cases/5 #divide every element by 5 and save the result as a new vector
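
To make "vectorized" concrete: an operation is applied to every element at once, and two vectors of the same length are combined element by element. A short sketch using the same monthly counts as in Table 1:

```r
cases <- c(17, 16, 22, 23, 33, 29, 40, 37, 27, 29, 37, NA)
cases / 5                     # every element is divided by 5; NA stays NA
cases[1:3] + cases[4:6]       # element-by-element sum of two slices
## [1] 40 49 51
sum(cases, na.rm = TRUE)      # total over the 11 observed months
## [1] 310
mean(cases[1:3])              # average of January to March
## [1] 18.33333
```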

Lists are objects that can hold elements of different types and lengths.

dates <- list(c("Sep", "3", "2020"), c("Month", "Day", "Year"), c(1,2))
length(dates)
## [1] 3
class(dates)
## [1] "list"
dates[1]
## [[1]]
## [1] "Sep"  "3"    "2020"
dates[[1]]
## [1] "Sep"  "3"    "2020"
length(dates[[1]]) #indexing list items
## [1] 3
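
List elements can also carry names, which lets you pull them out with $ instead of a position. A minimal sketch; the film list is a made-up example:

```r
film <- list(title = "Tenet", year = 2020, sequel = FALSE)  # hypothetical example list
film$title          # access an element by name with $
## [1] "Tenet"
film[["year"]] + 1  # double brackets return the element itself
## [1] 2021
class(film["year"]) # single brackets return a smaller list
## [1] "list"
names(film)
## [1] "title"  "year"   "sequel"
```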

Functions

Please take a moment to take stock of the built-in functions we have learned so far.

Another built-in function is names, which assigns names to the elements of a vector. Let’s label the shooting statistics with their respective dates.

###Assigning names and saving as date
to.date <- c("January 1 2023", "February 1 2023", "March 1 2023", "April 1 2023", "May 1 2023", "June 1 2023", "July 1 2023", "August 1 2023", "September 1 2023", "October 1 2023", "November 1 2023", "December 1 2023")
names(cases) <- to.date
print(cases) ## it is stored as a named vector of values.
##   January 1 2023  February 1 2023     March 1 2023     April 1 2023 
##               17               16               22               23 
##       May 1 2023      June 1 2023      July 1 2023    August 1 2023 
##               33               29               40               37 
## September 1 2023   October 1 2023  November 1 2023  December 1 2023 
##               27               29               37               NA
constant <- rep(2, times=length(cases)) #rep is another built-in function for R. 

to <- as.data.frame(cbind(cases, to.date, constant)) ## save it as a data frame
#to is a data frame with three columns (cases, to.date, constant). cbind means bind the columns.
rownames(to) <- NULL #don't need rownames anymore. 

R has two ways to mark something as missing: NA and NULL. In data sets, we often encounter missing data, which we represent in R with the value NA. NULL, on the other hand, means that the value in question simply doesn’t exist.
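
A small sketch of the difference, using made-up vectors x and y: NA keeps its place in a vector, while NULL leaves no trace.

```r
x <- c(5, NA, 7)     # NA marks a missing value inside a vector
length(x)            # NA still occupies a slot
## [1] 3
is.na(x)
## [1] FALSE  TRUE FALSE
y <- c(5, NULL, 7)   # NULL is dropped: there is nothing to store
length(y)
## [1] 2
is.null(NULL)
## [1] TRUE
```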

mean(to$cases) ## generates a warning and returns NA, why?
## Warning in mean.default(to$cases): argument is not numeric or logical:
## returning NA
## [1] NA
class(to$cases) #refer to a column in a dataset with $
## [1] "character"
class(to$to.date) #it's also character, but it should be a date.
## [1] "character"
to$cases <- as.numeric(to$cases)
class(to$cases)
## [1] "numeric"
to$to.date <- as.Date(to$to.date, format = "%B %d %Y") 
## format option allows you to set the date format. Check ?as.Date for instructions.

class(to$to.date)
## [1] "Date"
## Other built-in functions
max(to$cases, na.rm=TRUE) #maximum value; spell out TRUE, since T is an ordinary variable that can be overwritten
## [1] 40
min(to$cases, na.rm=TRUE) #minimum value
## [1] 16
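
Why did cases become character in the first place? cbind() on plain vectors builds a matrix, and a matrix can hold only one type, so the numbers were turned into text. One way around this, sketched below with only the first three months and a hypothetical object name to2, is to build the data frame with data.frame() directly.

```r
cases <- c(17, 16, 22)                    # first three months only, for brevity
to.date <- c("January 1 2023", "February 1 2023", "March 1 2023")
class(cbind(cases, to.date)[, "cases"])   # cbind() builds a character matrix
## [1] "character"
to2 <- data.frame(cases = cases, to.date = to.date)  # to2 is a hypothetical name
class(to2$cases)                          # data.frame() keeps each column's own type
## [1] "numeric"
mean(to2$cases)
## [1] 18.33333
```

With this approach the as.numeric() step is not needed, although the date column still has to be converted with as.Date().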

You will probably use the paste function quite often, too.

sentence <- c("toronto", "is", "the", "best", "city", "in the world")
sentence2 <- paste(sentence, collapse = " ")

length(sentence)
## [1] 6
length(sentence2)
## [1] 1
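
paste has two arguments that are easy to mix up: sep goes between the things you paste together, while collapse squeezes one vector into a single string. A quick sketch with made-up inputs:

```r
paste("Toronto", "Ontario")              # default separator is a space
## [1] "Toronto Ontario"
paste("Toronto", "Ontario", sep = ", ")  # sep goes between the arguments
## [1] "Toronto, Ontario"
paste0("Q", 1:4)                         # paste0() uses no separator and is vectorized
## [1] "Q1" "Q2" "Q3" "Q4"
paste(c("a", "b", "c"), collapse = "-")  # collapse merges a vector into one string
## [1] "a-b-c"
```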

You can also create your own functions to avoid typing the same commands over and over again. User-defined functions make us more efficient. Let’s say that, as part of your job, you are expected to report every quarter 1) the cumulative number of shootings and discharges for Toronto and 2) the average number of cases, updated each quarter. Instead of writing the code for each quarter and running it separately, you can simply create a function!

###User-defined functions

casesq1 <- c(17, 16, 22)
casesq2 <- c(23, 33, 29)
casesq3 <- c(40, 37, 27)

police.summary <- function(x){ #creating a function with one input, x, titled police.summary
  out.total <- sum(x) #cumulative cases for the quarter
  out.mean <- mean(x) #average
  out.final <- c(out.total, out.mean) #final output
  names(out.final) <- c("Cumulative Cases", "Average Number") #labeling the output, be careful with the ordering! 
  return(out.final) #return() hands the output back to whoever called the function
}

police.summary(casesq1) #Calling the function police.summary supplying casesq1 vector as an argument
## Cumulative Cases   Average Number 
##         55.00000         18.33333

There are different ways to define your arguments. For example, you can create a function with multiple arguments/inputs. Let’s say you are asked to calculate the percent change in the average number of cases from one quarter to the next.

police.change <- function(w1, w2) { #two arguments defined as w1 and w2 that would be referred to in the function
  out.percent <- (mean(w2) - mean(w1))*100/mean(w1)
  names(out.percent) <- "Percent Change"
  return(out.percent)
}

police.change(casesq1, casesq2) 
## Percent Change 
##       54.54545
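
Arguments can also have default values, which are used whenever the caller leaves them out. The sketch below is a hypothetical variant of police.summary (called police.summary2 here, with a made-up casesq4 vector) that skips missing months unless told otherwise.

```r
# police.summary2 is a hypothetical variant of police.summary with a default argument
police.summary2 <- function(x, na.rm = TRUE) {
  out.total <- sum(x, na.rm = na.rm)   # cumulative cases, skipping NA by default
  out.mean <- mean(x, na.rm = na.rm)   # average over the observed months
  c("Cumulative Cases" = out.total, "Average Number" = out.mean)
}

casesq4 <- c(29, 37, NA)                 # October to December, with December missing
police.summary2(casesq4)                 # default na.rm = TRUE: 66 cases in total, 33 on average
police.summary2(casesq4, na.rm = FALSE)  # overriding the default: both values are NA
```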

Data

  • We don’t really enter our data manually with vectors. More often than not, we import external files into R. Often it’s a .csv (comma-separated values, which also opens in Excel), .txt (text), .dta (Stata) or .RData file. An .RData file is a collection of saved R objects.

  • By default, R looks for files in your working directory. Check the Files tab to see your current working directory. Ideally, you’d want to have your project files in one designated place. Do not save to / import from your Desktop - it’ll be chaotic.

  • There are different ways to check and change your working directory. Manually, you can check your current directory with getwd() and change it with setwd() or through Session > Set Working Directory. I will show you a better way towards the end.

  • Please download the datasets from Imai’s website for Chapter 1. Unzip and place these files in your working directory.
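
Here is a minimal sketch of checking and changing the working directory from code. It uses a temporary folder (tempdir()) as a stand-in for your own project folder, which is an assumption made only so that the example runs anywhere.

```r
old_wd <- getwd()              # current working directory, returned as a string
project_dir <- tempdir()       # stand-in for your project folder (assumption)
setwd(project_dir)             # change the working directory
list.files()                   # files R can now read by name alone
file.exists("UNpop.csv")       # TRUE only once the data file has been placed here
setwd(old_wd)                  # switch back to where we started
```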

UNpop <- read.csv("UNpop.csv") #read.csv is a built-in function that imports .csv files into an R object. Don't forget to assign the data to an object, otherwise it will not be stored in your environment.
class(UNpop)
length(UNpop)

load("UNpop.RData") #for Rdata, we use load function. 
  • Check your Environment tab. The UNpop object is a data.frame with 2 variables and 7 observations. A data frame is an R object made up of a collection of vectors of equal length. In this case, it’s as if two vectors/columns were merged into a data frame (which is why the length of UNpop is 2).

Let’s work on this data a bit.

ncol(UNpop) #gives number of columns in a data frame
## [1] 2
nrow(UNpop) #gives number of rows in a data frame
## [1] 7
summary(UNpop) #descriptive statistics for this data frame (mean, median, percentile, etc.)
##       year        world.pop      
##  Min.   :1950   Min.   :2525779  
##  1st Qu.:1965   1st Qu.:3358588  
##  Median :1980   Median :4449049  
##  Mean   :1980   Mean   :4579529  
##  3rd Qu.:1995   3rd Qu.:5724258  
##  Max.   :2010   Max.   :6916183

Please familiarize yourself with the $ operator. It allows us to access variables/columns in data frames and named elements in lists.

### $ Operator and Brackets
UNpop$year #accessing year column
## [1] 1950 1960 1970 1980 1990 2000 2010
(descriptive <- summary(UNpop$world.pop))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 2525779 3358588 4449049 4579529 5724258 6916183
UNpop[,"year"] #extract year column: using brackets with a comma to separate rows and columns
## [1] 1950 1960 1970 1980 1990 2000 2010
UNpop[c(1,2,3),"year"] #extract first three rows of the year column
## [1] 1950 1960 1970
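
Brackets also accept a condition in the row position, which is how you filter rows. In the sketch below, the data frame is typed in by hand as a stand-in for read.csv("UNpop.csv"); the values were chosen to be consistent with the summary output above, so treat them as an assumption rather than as the file itself.

```r
# Stand-in for read.csv("UNpop.csv"): values typed by hand (assumption)
UNpop <- data.frame(
  year = seq(1950, 2010, by = 10),
  world.pop = c(2525779, 3026003, 3691173, 4449049, 5320817, 6127700, 6916183)
)
UNpop[UNpop$year >= 1990, ]              # rows where the condition holds, all columns
UNpop[UNpop$year >= 1990, "world.pop"]   # same rows, one column
## [1] 5320817 6127700 6916183
# growth factor between 1950 and 2010
UNpop$world.pop[UNpop$year == 2010] / UNpop$world.pop[UNpop$year == 1950]
```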

Let’s create a new dataset for running some descriptive statistics. Assume that this is a dataset on schools in an imaginary province that lays out information on their status, funding, and number of teachers.

### Expand Grid
schools <- expand.grid(status=c("Public", "Private"), 
                       funding=seq(1500, 2000, by=100), 
                       teacher=c(seq(5,15,by=5),NA)) 
#expand.grid is a built-in function that creates a data frame with all possible combinations of given vectors.
head(schools)
##    status funding teacher
## 1  Public    1500       5
## 2 Private    1500       5
## 3  Public    1600       5
## 4 Private    1600       5
## 5  Public    1700       5
## 6 Private    1700       5
mean(schools$teacher) #not gonna work because of the missing values NA
## [1] NA
mean(schools$teacher, na.rm = TRUE) #TRUE and FALSE are logical values in R. The na.rm option discards missing values before taking the average.
## [1] 10
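
A few more descriptive checks you might run on this dataset, sketched with base R functions only. The data frame is rebuilt here so that the example runs on its own.

```r
schools <- expand.grid(status = c("Public", "Private"),
                       funding = seq(1500, 2000, by = 100),
                       teacher = c(seq(5, 15, by = 5), NA))
nrow(schools)                 # 2 statuses x 6 funding levels x 4 teacher values
## [1] 48
sum(is.na(schools$teacher))   # how many rows have a missing teacher count
## [1] 12
table(schools$status)         # number of rows per status
tapply(schools$teacher, schools$status, mean, na.rm = TRUE)  # mean teachers by status
```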

You can save/export your data to your working directory.

### Saving Data
write.csv(schools, file="school.csv") #saving as csv file
save(schools, file="school.RData") #saving as RData
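
To convince yourself that nothing gets lost on the way, you can write a file and read it straight back. This sketch uses a toy data frame and a temporary file (both assumptions for illustration) so that it does not touch your project folder.

```r
schools_small <- data.frame(status = c("Public", "Private"),
                            funding = c(1500, 1600))          # hypothetical toy data
csv_path <- tempfile(fileext = ".csv")                        # temporary file, nothing is overwritten
write.csv(schools_small, file = csv_path, row.names = FALSE)  # row.names = FALSE skips the extra index column
schools_back <- read.csv(csv_path)                            # read the file back in
identical(names(schools_back), names(schools_small))
## [1] TRUE
nrow(schools_back)
## [1] 2
```

Without row.names = FALSE, write.csv adds a column of row numbers, which then shows up as an extra variable when you read the file back.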

Packages

An R package is a collection of code, data, and documentation that expands R’s functionality. You can think of packages as the apps we install on our phones. Our phone can make a call, send a text, etc., but with extra apps, you can shoot a TikTok video…

In order to use these packages, we must install them first. One useful package is “tidyverse”. It is a collection of packages that share a common design and make many operations in R easier, from data manipulation to plotting.

### Install Packages
install.packages("tidyverse") #Run this in the console, not in your script; you need to install a package only once on your computer. 
library("tidyverse") #Once installed, you must load the package with the library command in each new R session. 
## It's like each time you need to use an app, you must tap on the icon, right? The library function does just that. 

## You may access specific functions loaded in each package with ::
## such as 
## dplyr::
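
The :: notation works for any installed package, including the ones that ship with R. A small sketch: the first two calls use packages that come with every R installation, and the last one runs only if dplyr happens to be installed.

```r
stats::median(c(3, 1, 2))    # call median() from the stats package explicitly
## [1] 2
utils::head(letters, 3)      # head() lives in the utils package
## [1] "a" "b" "c"
# requireNamespace() checks whether a package is installed without attaching it
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::n_distinct(c(1, 1, 2))   # runs only if dplyr is installed
}
```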

Plots

ggplot(to, aes(x = to.date, y=cases)) +
  geom_line() +
  theme_bw() +
  labs(x = "Date", y = "Number of cases")

Other Syntax and Operators

An operator helps us with mathematical and logical manipulations.

  • The built-in arithmetic operators are +, -, *, /, ^ etc.
  • Relational and logical operators: >, <, ==, <=, >=, != . ! introduces negation, so != means not equal. & and | are the logical AND and OR operators.
  • Other operators: : is the colon operator, which generates a sequence of consecutive integers. %in% tests whether an element belongs to a vector.

You may find some examples below.

### Operators
schools$status[!(schools$funding<1900)] #list status of schools with funding of 1900 or more (! negates "less than 1900").
#Let's calculate total number of teachers in public schools.
sum(schools$teacher[schools$status=="Public"], na.rm=TRUE) #access teachers in public schools. Notice the double equals sign: this is subsetting with a logical condition built with "=="

sum(schools$teacher[schools$status=="Public" & schools$funding>1900], na.rm=TRUE) #total number of teachers in public schools with funding more than 1900. 
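
The examples above cover !, == and &. Here is a short sketch of the remaining operators from the list, %in%, | and !=, on the same schools data (rebuilt here so that the example runs on its own).

```r
schools <- expand.grid(status = c("Public", "Private"),
                       funding = seq(1500, 2000, by = 100),
                       teacher = c(seq(5, 15, by = 5), NA))
1600 %in% schools$funding                       # is 1600 one of the funding levels?
## [1] TRUE
sum(schools$funding %in% c(1500, 2000))         # rows at the lowest or highest funding level
## [1] 16
sum(schools$funding == 1500 | schools$funding == 2000)  # the same count written with |
## [1] 16
sum(schools$status != "Public")                 # rows that are not public schools
## [1] 24
```

Summing a logical vector counts its TRUE values, which is why sum() works as a row counter here.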

Other Tips and Shortcuts

  • If you cannot run your R code and keep getting errors, either you forgot a comma or a closing bracket somewhere, or you just need to “turn it off and on again”. Go to Session>Restart R. But that means you must run the whole code again.

  • You can comment out blocks of code by selecting the lines and using the following shortcut: Ctrl + Shift + C. For Mac users: Cmd + Shift + C. You can undo it with the same shortcut.

  • You can edit several lines at the same time by holding Alt while you drag the cursor across them.

  • Get familiar with some of these RStudio shortcuts. It’ll make your life easier.

  • You can create a heading in the navigation menu at the bottom of your R script tab by ending a comment line with at least four # or - characters (e.g. #### or ----).

  • If you cannot figure something out, just google checkmarked answers on Stack Overflow or ask ChatGPT. 80% of learning how to code is just that.

A Sample R Script

##--------------------------------------------------------------##
##                       Tutorial #1                            ##
##                    Introduction to R                         ##
##                     Semuhi Sinanoglu                         ##
##                        January 2024                          ##
##--------------------------------------------------------------##

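pacman::p_load("tidyverse") #loads tidyverse, installing it first if needed; the pacman package itself must already be installed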

## Get data ------------------------------------------------------
UNpop <- read.csv("UNpop.csv")
UNpop.analysis <- UNpop #keep the original in the environment and use the new one for data manipulation

## Descriptive Statistics ----------------------------------------
summary(UNpop.analysis)

Project-based Workflow

  • For every new research project/homework, I highly encourage you to start an R project. Each project will be self-contained and easily reproducible, especially when used with the here package. It’ll give you a file system structure in which all of your files for a project are stored in one designated folder.

  • Go to File->New Project->New Directory->New Project and then create your new project in a designated folder.

  • An Rproj file gives you access to the other files stored in the project folder through the Files tab. It also allows you to switch between recent projects. Check the upper-right corner of your screen for a drop-down menu with the Rproj icon. Once you are done working on your project, close it. When you open it again, you’ll see that you start where you left off!
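
Inside a project, you can build file paths relative to the project folder instead of hard-coding them. The sketch below assumes a hypothetical data subfolder; it uses here::here() if the here package is installed and falls back to base R’s file.path() otherwise.

```r
# Build a path relative to the project folder instead of hard-coding it.
# "data" is a hypothetical subfolder name.
if (requireNamespace("here", quietly = TRUE)) {
  csv_path <- here::here("data", "UNpop.csv")   # anchored at the project root
} else {
  csv_path <- file.path("data", "UNpop.csv")    # base-R fallback, relative to the working directory
}
basename(csv_path)
## [1] "UNpop.csv"
```

You could then pass csv_path to read.csv(), and the same script would run unchanged on someone else’s computer as long as they open the project.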