SLS 670 - Intro to R and RStudio

In this course, we exclusively use R and RStudio to learn and carry out data wrangling, statistical analyses, simulation, visualization, and all sorts of other tasks.

But we have to be able to crawl before we can walk. So today we’ll be starting with some basic, simple uses of R to get everyone familiar with the interface and commonly used commands and conventions of writing code.

To begin, let’s do some math. You can use R like a calculator (it’s quite a powerful one!).

2 + 2 #addition

## [1] 4

Now for some other operations. Try typing these lines of code. You can press Ctrl+Enter (Cmd+Return on a Mac) to run a line of code while working in a script.

100-77.2 #subtraction

## [1] 22.8

10*2.54 #multiplication

## [1] 25.4

3/2 #division

## [1] 1.5

Clearly R can handle basic arithmetic. Let’s make things a little more complicated.

1+2*3

## [1] 7

The code 1+2*3 returns a result of 7. Were you expecting 9 instead? Verbally, when we ask “What’s one plus two times three?”, the implication is usually (depending on pausing and intonation…) “What is the result when you add one and two and then multiply the sum by three?” But computers are quite picky about the order in which they apply mathematical operations.

Let’s do a little practice with order of operations using parentheses to control what the computer does. In general, computers will execute multiplication and division before addition and subtraction, even if the addition/subtraction operation shows up first in a left-to-right expression.

(1+2)*3

## [1] 9

When you put (embed) something in parentheses, the computer will attend to it first, breaking the usual order of operations. Try this out:

9*(10+(1*5))

## [1] 135

Embeddings can get quite deep when using R, so you’ll want to be careful that you have your parentheses in the right places. Luckily, RStudio highlights matching pairs of parentheses as you type and as you move the cursor around, which helps avoid many errors.
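Here are a few more expressions to practice on, offered as a small sketch rather than an exhaustive list of rules. The comments show which step R carries out first; the particular numbers are made up for illustration.

```r
10 - 4 / 2          # division runs first: 10 - 2
(10 - 4) / 2        # parentheses run first: 6 / 2
2 * (3 + (8 / 4))   # innermost parentheses first: 2 * (3 + 2)
8 / 4 / 2           # operators of equal rank run left to right: (8 / 4) / 2
```

Try predicting each result before you run the line.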

We can do some fancier math things in R, too. Check out exponents and square roots:

2^2 # 2 to the 2nd power

## [1] 4

4^(1/2) # 4 to the 1/2 power

## [1] 2

sqrt(4) #square root of 4

## [1] 2
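Exponents also have a place in the order of operations: they are applied before multiplication, division, and even a leading minus sign. The lines below are a short illustration of that, using made-up numbers.

```r
2^3        # 2 * 2 * 2
-2^2       # the exponent is applied before the minus sign, so this is -(2^2)
(-2)^2     # parentheses make the base negative two
27^(1/3)   # a fractional exponent of 1/3 gives a cube root
```

The second and third lines differ only in their parentheses, but they give different answers.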

I call the ^ symbol “up-carrot”, by the way. It’s more commonly known as a “caret”, but I picked up my name for it somewhere/some time ago when I poorly learned some Java programming.

Anyhow, you may notice that we’ve done something a little different in the last line of code above: We’ve used a function with a name to calculate a square root. While the use of shortcut symbols, often called “operators”, is common in R, you will actually get most things done by using functions.
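To give a feel for how functions are written, here is a small sketch with three base R functions. Each takes its input (its “arguments”) inside parentheses, and some arguments can be given by name. The numbers are arbitrary examples.

```r
sqrt(16)                     # one argument: the number to take the root of
abs(-7.5)                    # absolute value: drops the minus sign
round(3.14159, digits = 2)   # a second, named argument sets the number of decimals
```

If you are ever unsure what arguments a function accepts, typing a question mark before its name (for example `?round`) opens its help page.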

Along with functions, we rely on objects in R to get a lot of things done. Essentially, objects are static (but updateable) things we want to hold on to and use for various purposes. Objects can be single numbers, a single character of text, a whole paragraph, a group of several numbers, a whole data set, or a statistical model - there aren’t many limits on what you can store as an object.

We’ll start small:

x <- sqrt(9)
x

## [1] 3

We use the <- operator (“assignment operator” or “assignment arrow”) to define objects. We can name objects just about anything, but names can’t start with a number, and in general it’s a good idea to be brief and avoid spaces and peculiar characters.

As you can see, we’ve saved the output of the function sqrt(9) to an object named x. Now we can do all kinds of things to/with x:

x*2

## [1] 6

x^2

## [1] 9

3*x+5

## [1] 14

((x^2)-1)/2

## [1] 4
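Objects are updateable: assigning to a name that already exists replaces the old value. The sketch below shows this with a made-up object called total_score (the name is just an example), and also shows that R treats upper- and lowercase letters in names as different.

```r
total_score <- 40                # create an object (hypothetical name)
total_score <- total_score + 5   # overwrite it with an updated value
total_score                      # now holds 45

Total_Score <- 100               # different capitalization, so a separate object
total_score                      # still 45
```

Notice that none of the calculations we did with x earlier changed x itself, because we never assigned the results back to it.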

We can also save non-numeric output in objects. Try making your name into an object:

name <- "Dan"
name

## [1] "Dan"
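Text objects can be passed to functions just like numbers can. As a quick sketch, here are three base R functions that work on text; the object greeting is a name made up for this example.

```r
name <- "Dan"

nchar(name)                         # number of characters in the text
toupper(name)                       # the same text in capital letters
greeting <- paste("Hello,", name)   # join pieces of text, separated by a space
greeting
```

Swap in your own name and run the lines again to see the output change.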

The real power of R starts to become apparent when working with multiple values/numbers/strings of text/etc. We’ll now use a new function c() (for “concatenate”) to create a set, or vector, of numbers.

scores <- c(54, 60, 67, 67, 71, 73, 74, 84, 88, 90)
scores

##  [1] 54 60 67 67 71 73 74 84 88 90

We can imagine that these are language test scores of some sort, with a maximum possible score of 100.

Now let’s do some math with this vector of scores:

#give everyone 5 points of extra credit:
scores + 5

##  [1] 59 65 72 72 76 78 79 89 93 95

#divide by 100 to get a proportion
scores/100

##  [1] 0.54 0.60 0.67 0.67 0.71 0.73 0.74 0.84 0.88 0.90
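In both cases R applied the operation to every value in the vector at once. The same idea works with two vectors of equal length, and with comparisons. The sketch below assumes a made-up vector of bonus points and an arbitrary passing mark of 70; neither is part of the original data.

```r
scores <- c(54, 60, 67, 67, 71, 73, 74, 84, 88, 90)
bonus  <- c(0, 0, 5, 5, 0, 0, 10, 0, 0, 15)   # hypothetical bonus points

adjusted <- scores + bonus      # added pair by pair: first with first, and so on
capped   <- pmin(adjusted, 100) # keep each value at or below the maximum of 100
capped

passed <- scores >= 70          # TRUE or FALSE for each score
sum(passed)                     # TRUE counts as 1, so this counts the passing scores
```

Without the pmin() step, the last student would end up with 105 points on a 100-point test.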

Neat! To get us thinking about describing data, let’s try some other functions for some statistics you are likely familiar with.

length(scores) #how many values?

## [1] 10

mode(scores) #what kind of data? note: not the statistical mode

## [1] "numeric"

class(scores) #the object's class; for this vector, same as above

## [1] "numeric"

mean(scores)

## [1] 72.8

median(scores)

## [1] 72

sd(scores)

## [1] 11.74545

min(scores)

## [1] 54

max(scores)

## [1] 90
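To see what mean() and sd() are doing behind the scenes, we can rebuild them from simpler pieces. This is only a sketch for learning purposes (in practice, just use the built-in functions), and the object names are made up. It also shows one way to find the statistical mode, since mode() does not do that.

```r
scores <- c(54, 60, 67, 67, 71, 73, 74, 84, 88, 90)

n <- length(scores)
mean_by_hand <- sum(scores) / n                    # add everything up, divide by n
deviations   <- scores - mean_by_hand              # distance of each score from the mean
sd_by_hand   <- sqrt(sum(deviations^2) / (n - 1))  # sample standard deviation

mean_by_hand
sd_by_hand

# Statistical mode: count each value, then pick the most frequent one.
# Note: if several values tie for most frequent, which.max() returns only the first.
counts    <- table(scores)
stat_mode <- as.numeric(names(counts)[which.max(counts)])
stat_mode
```

The two hand-built values should match the output of mean(scores) and sd(scores) above.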

Let’s make a vector with a different type of data:

gender <- c("M", "M", "F", "F", "F", "M", "M", "F", "F", "F")
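Functions like mean() don’t make sense for text, but we can still describe this vector. Here is a brief sketch using base R functions to check the type of data and count how often each value appears.

```r
gender <- c("M", "M", "F", "F", "F", "M", "M", "F", "F", "F")

class(gender)    # text data is stored as "character"
length(gender)   # one value per test taker
table(gender)    # how many times each value occurs
```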

We can also look at the data with a simple function.

plot(scores)

Not the best visualization, but you can see how it works. Let’s try another:

hist(scores)

A histogram! Needs a bit of work. But let’s try a slightly different approach. We’re going to install and load a package called tidyverse which has loads of helpful functions - functions for working with data, making pretty graphs, and all kinds of other neat and useful tasks.

#install.packages("tidyverse")
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We’re going to build a data set now by combining the test scores and gender. Data sets in R are generally called “data frames”, though the tidyverse package has a variation called a “tibble” (which is a portmanteau of “tidy” and “table”).

df <- tibble(scores, gender)
df
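Once the two vectors sit side by side in one table, each row describes one person. The sketch below uses base R’s data.frame() as a stand-in for tibble() so that it runs without any packages; the name df_base is made up for this example, and a tibble prints a little differently.

```r
scores <- c(54, 60, 67, 67, 71, 73, 74, 84, 88, 90)
gender <- c("M", "M", "F", "F", "F", "M", "M", "F", "F", "F")

df_base <- data.frame(scores, gender)   # base R counterpart of tibble(scores, gender)

nrow(df_base)       # number of rows (people)
ncol(df_base)       # number of columns (variables)
head(df_base, 3)    # first three rows
df_base$scores      # the $ sign pulls out a single column as a vector

# Mean score for each gender group
group_means <- tapply(df_base$scores, df_base$gender, mean)
group_means
```

The $ notation works the same way on a tibble, so df$scores would return the same vector.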

Let’s try making a much nicer looking histogram. We’ll use the ggplot2 package that comes with tidyverse to make it. This code might not make much sense yet, but we’ll do more practice with creating plots throughout the semester.

histogram <- ggplot(df, aes(x=scores))+
  geom_histogram(binwidth = 5)

histogram

So this actually isn’t that pretty yet… let’s tweak a few things:

histogram <- ggplot(df, aes(x=scores))+
  geom_histogram(binwidth = 5)+
  scale_x_continuous(limits = c(0, 100), breaks = seq(0, 100, 10))+
  scale_y_continuous(limits = c(0, 3), breaks = c(1,2,3), expand = c(0,0))+
  labs(x = "Test Scores", y = "Frequency")+
  theme_bw()

histogram

## Warning: Removed 2 rows containing missing values (`geom_bar()`).

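This is starting to look better. (The warning most likely comes from the x-axis limits: the empty outermost bins extend just past 0 and 100, so ggplot2 drops them. None of our scores are affected.) Let’s add a twist: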

histogram <- ggplot(df, aes(x=scores, color = gender, fill = gender))+
  geom_histogram(binwidth = 5, position = "dodge")+
  scale_x_continuous(limits = c(0, 100), breaks = seq(0, 100, 10))+
  scale_y_continuous(limits = c(0, 3), breaks = c(1,2,3), expand = c(0,0))+
  labs(x = "Test Scores", y = "Frequency")+
  theme_bw()

histogram

## Warning: Removed 2 rows containing missing values (`geom_bar()`).

So what we’ve just done is plot histograms for two different groups side by side. This is neat, but could look a lot better if we had better data. So let’s make some… at scale.

scores_sim <- round(rnorm(1000, mean = 61, sd = 8), digits = 0)
gender_500 <- rep(c("F", "M"), each=500)

big_df <- tibble(scores_sim, gender_500)

Now we have a much bigger dataset - 1000 imaginary people!
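One thing to keep in mind: rnorm() draws random numbers, so your simulated scores will differ from anyone else’s, and from your own the next time you run the code. If you want the same numbers every time, you can set a “seed” first. The sketch below assumes an arbitrary seed value of 670 and made-up object names; any whole number works as a seed.

```r
set.seed(670)   # arbitrary seed, chosen only for illustration
first_draw <- round(rnorm(1000, mean = 61, sd = 8), digits = 0)

set.seed(670)   # resetting the same seed replays the same random numbers
second_draw <- round(rnorm(1000, mean = 61, sd = 8), digits = 0)

identical(first_draw, second_draw)   # TRUE: both draws match exactly

# rep() with each = 500 repeats "F" 500 times, then "M" 500 times
gender_500 <- rep(c("F", "M"), each = 500)
table(gender_500)
```

The mean and standard deviation of the simulated scores will be close to 61 and 8, but not exactly equal to them, because of random variation and rounding.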

big_hist <- ggplot(big_df, aes(x=scores_sim, color = gender_500, fill = gender_500))+
  geom_histogram(binwidth = 1, position = "dodge")+
  scale_x_continuous(limits = c(0, 100), breaks = seq(0, 100, 10))+
  scale_y_continuous(expand = c(0,0))+
  labs(x = "Test Scores", y = "Frequency")+
  theme_bw()

big_hist

## Warning: Removed 2 rows containing missing values (`geom_bar()`).