POLS 3000: Lab 1
An Introduction to R for Absolute Beginners
Before we start learning about research methodologies or data analysis, let’s get our bearings with R. The following information is adapted from the Imai and Williams (2022) textbook, section 1.3.
To begin, let’s open a new R Script (File > New File > R Script).
Next, let’s name our R Script as lab-1.R and save it to the labs subfolder of pols-3000 on our local computer.
Once we’ve saved our script in the appropriate folder (and we know where to find it), we can begin programming in R.
The Basic Workflow
While learning to code in R, and later while doing data analysis, you will find yourself constantly pinging back and forth between three things:
writing code
looking at the output of code
taking notes on the output and original code
How can you do all this effectively? The simplest way to keep code and notes together is to write your code and intersperse it with comments.
All programming languages have some way of demarcating lines as comments. In R, comments are marked with the # symbol. Anything on a line after a # is a comment: R will not execute it, but it will still be echoed in the console when you run the line.
For example, in your R script you might write:
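The lines below are a minimal sketch of a commented script. The calculations are made up for illustration and are not taken from the textbook.

```r
# This whole line is a comment, so R skips it.
2 + 3  # a comment can also follow code on the same line

# 100 / 4   (this calculation is "commented out," so R never runs it)
```

Only the 2 + 3 line produces a result when this is run. Everything after a # is there for human readers.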
NOTE: Save all your code and comments to the lab-1.R file. DO NOT rely on the console (lower left-hand box) to save your work (it will disappear as soon as you quit R Studio!).
Running Your Code
To execute a line of code, place the cursor anywhere on that line and press Ctrl+Enter (or Cmd+Return on a Mac). You can also highlight several lines or a chunk of code and press the same keys to run them all at once.
You can also click the Run button in the upper right-hand corner of your R script (the top left panel of R Studio).
Note: physically typing the code is the best way to familiarize yourself with R and the infamous “try-fail-try-fail-try-succeed” cycle of a data analysis workflow.
Another Pep Talk
Adapted from Kieran Healy’s Data Visualization: a Practical Introduction (chapter 2):
“Like all programming languages, R does exactly what you tell it to, rather than exactly what you want it to. This can make it frustrating to work with. It is as if one had an endlessly energetic, powerful, but also extremely literal-minded robot to order around.
Remember that no-one writes fluent, error-free code on the first go all the time. From simple typos to big misunderstandings, mistakes are a standard part of the activity of programming. This is why error-checking, debugging, and testing are also a central part of programming. So, just try to be patient with yourself and with R while you use it.
Expect to make errors, and don’t worry when that happens. You won’t break anything. Each time you figure out why a bit of code has gone wrong you will have learned a new thing about how the language works.”
Using R as a Calculator
The most basic way to use R (and R Studio) is as a souped-up calculator. Try running the code below in your own R script:
5+2 # addition
150-81 # subtraction
145*10304 # multiplication
6/2 # division
sqrt(16) # sqrt() is a function that takes the square root of the number within the parentheses
Objects
Everything in R has a name. We can assign numerical values (or text) to named objects. In the first line of code below, the arrow (<-) means: “assign the value of 5 to the object a.”
What does b <- 10 mean in plain English?
Saved objects will appear in the “environment” window to the right.
Note: Some names are forbidden (NA, TRUE, FALSE, etc.) or strongly not recommended (c, mean, table)
a <- 5
b <- 10
c <- "Brutus Buckeye" # this works, but c is also the name of a built-in function (see the note above)
Every object has a class: a vector, a character string, a function, a list, and so on. Knowing an object’s class tells you a lot about what you can and can’t do with it. The class() function is useful for discovering an object’s class.
The str() function is sometimes useful too. It lets you see what is inside an object.
class(a) # "numeric"
class(c) # "character"
str(a) # What is the structure of the object "a"?
str(c) # What is the structure of the object "c"?
Vectors
One really useful object in R is a vector. Vectors are one-dimensional arrays that store information in a specified order. We use the c() function (short for “concatenate”) to create a vector with multiple pieces of information.
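Because a vector keeps its elements in order, we can ask for them by position. The sketch below uses a hypothetical vector called quiz_scores with made-up values. The square brackets, length(), and element-wise arithmetic are standard base R.

```r
# A hypothetical vector of quiz scores (made-up values)
quiz_scores <- c(8, 10, 7, 9)

length(quiz_scores)   # how many elements are in the vector
quiz_scores[1]        # the first element (R starts counting at 1)
quiz_scores[2:3]      # the second and third elements
quiz_scores + 1       # arithmetic is applied to every element
```

Note that R counts positions starting from 1, not 0 as some other languages do.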
my_numbers <- c(1, 2, 3, 1, 3, 5, 25) # a numerical vector
my_names <- c("Austin", "Amy", "Elliott", "Ethan") # a character vector (i.e., text)
Functions
Functions take in objects, perform actions, and return outputs. In plain English, we give a function some information (e.g., a number), it acts on that information (e.g., addition), and some result is produced (e.g., a sum).
Like our family pet, when we want a function to do something for us, we “call” it. Unlike our pets, however, functions are 100% reliable. This is both beautiful and incredibly frustrating when we don’t understand what’s going on.
What would be the output from sum(a,b)? (Hint: sum is a function; a and b are the arguments.)
When we call a function in R, we put parentheses after its name. This distinguishes function calls from other objects such as vectors, tables, and data frames.
Anytime you’re not sure what a function does, you can type ?? before its name. For example, ??sum. A list of matching help pages will show up in the bottom right panel; look at the entry for base::sum. (If you already know the exact name, a single ? as in ?sum opens the help page directly.)
Let’s try it now:
sum(a,b) # sum function
mean(x = my_numbers) # find the mean of the `my_numbers` vector
In the mean() example, x is the argument name and my_numbers is what we’re passing to that argument.
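Many functions accept more than one argument. As a minimal sketch, the code below uses a hypothetical vector called ages (made-up values, including one missing value) to show how the na.rm argument of mean() changes the result.

```r
# A hypothetical vector with one missing value (NA)
ages <- c(20, 22, NA, 30)

mean(x = ages)                 # returns NA because one value is missing
mean(x = ages, na.rm = TRUE)   # na.rm = TRUE drops the NA first, giving 24
mean(ages, na.rm = TRUE)       # same result: the first argument is matched to x by position
```

Writing argument names out (x = ..., na.rm = ...) takes a little more typing but makes code easier to read later.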
Packages (also called Libraries)
All functions live in packages. Packages are bundles of functions written by other users that we can use.
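One way to see which package a function lives in is the package::function() notation, the same notation as the base::sum() entry in the help search. The short sketch below uses only packages that ship with R, so nothing needs to be installed. The numbers are arbitrary.

```r
sum(1, 2, 3)                # R finds sum() in the base package automatically
base::sum(1, 2, 3)          # the same call, naming the package explicitly
stats::median(c(1, 5, 9))   # median() lives in the stats package
```

The first two lines return the same value. The prefix only tells R (and the reader) where to look.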
The install.packages() function downloads a package and stores it on your computer. You only need to install a package once, but you need to load it with library() each time you start a new R session.
The tidyverse package contains hundreds of useful functions for exploring, analyzing, and visualizing data in R. It will be the most important package we use all semester.
install.packages("tidyverse") # install the package
library(tidyverse) # load the package
A Basic Example with Real-Life Data
Let’s look at some life expectancy data from the Swedish non-profit Gapminder.
First, let’s load the packages we need.
library(tidyverse) # load the tidyverse library
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.0.10
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(gapminder) # load the gapminder library
Second, let’s load the dataset from the gapminder package.
gapminder <- gapminder # load the gapminder dataset from the package
class(gapminder) # gapminder is a data frame
[1] "tbl_df" "tbl" "data.frame"
str(gapminder) # information about each variable (column) in the data frame
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Third, let’s look at the mean life expectancy among countries in the dataset.
Note: The $ operator is used to extract or subset a specific part of an object. In the example below, gapminder$lifeExp means: “Use the lifeExp column of the gapminder dataframe.”
mean(gapminder$lifeExp) # find the mean of life expectancy
[1] 59.47444
The mean life expectancy across all countries and years in our dataset is 59.47 years. (The mean life expectancy in the US for the year 2007 was 78 years.)
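The same $ pattern works with other summary functions. To keep the sketch self-contained, it uses a tiny hypothetical data frame called toy with made-up values rather than the real gapminder data.

```r
# A tiny made-up data frame (values are illustrative, not from gapminder)
toy <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(60, 70, 80)
)

toy$lifeExp          # extract the lifeExp column as a vector
mean(toy$lifeExp)    # mean of the column: 70
max(toy$lifeExp)     # largest value: 80
range(toy$lifeExp)   # smallest and largest values: 60 80
```

Swapping toy for gapminder in any of these lines would apply the same function to the real life expectancy column.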
Finally, let’s plot the relationship between life expectancy and GDP-per-capita.
ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + # set x and y axes
  geom_point() + # add data points to the plot
  geom_smooth(method = "loess") + # add a trend line
  labs(x = "GDP-Per-Capita (USD)", y = "Life Expectancy (Years)", # change labels
       title = "What's the relationship between GDP-per-capita and life expectancy?")
`geom_smooth()` using formula = 'y ~ x'
The relationship looks like an “inverted U.” This trend implies that countries with a GDP-per-capita between $30,000 and $55,000 have the highest average life expectancy (around 80 years).