Getting started with R and RStudio

Clemens Brunner

Installing R and RStudio
Importing data
Vectors and data frames
Data wrangling
t-test
Linear regression
Analysis of variance

Why R?

R is free, cross-platform, and open-source
R enables reproducible analyses because:
- R works with text commands (no clicking around in dozens of dialogs)
- Text commands can be permanently stored in a text file
- This text file (R script) can be rerun at any time and reproduces the same results
R is the de facto standard for statistical data analysis
R has a large and friendly community

Installing R and RStudio

R is a statistical programming environment
- Full-fledged programming language
- Tailored towards data analysis and statistics
RStudio is a GUI (graphical user interface)
- Completely optional
- Makes using R easier
Download R from https://www.r-project.org/
Download RStudio from https://www.rstudio.com/products/rstudio/download/

RStudio components

Console
Editor
Environment, History
Files, Plots, Packages, Help

Installing packages

Packages extend the functionality of R
The Tidyverse is a collection of packages that work together

Although almost anything can be done with “base R” (no additional packages), most tasks are easier and more convenient with the Tidyverse! — (my opinion)

The tidyverse meta-package provides all core packages
Let’s install it (using the RStudio “Packages” tab)!

Where do I put my code?

R executes commands in the console
This is where we enter commands and R runs them
We permanently store R commands in a text file
This is called an R script (file extension .R)
Reproducible (same result every time we run the script)
Comments are introduced with #
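
A minimal sketch of what such a script might contain; the file name analysis.R and the values are made up for illustration:

```r
# analysis.R (hypothetical file name)
# Everything after a # is a comment and is ignored by R

heights = c(1.72, 1.85, 1.64)  # made-up body heights in metres
mean_height = mean(heights)  # compute the arithmetic mean
print(mean_height)  # show the result in the console
```

In RStudio, the whole script can be run with the “Source” button, or line by line with Ctrl+Enter (Cmd+Enter on macOS).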

Integrated help

In the console, type ? followed by a function name
This will open the integrated help browser
Always read the documentation if you use a function for the first time!

?mean  # shows help for the mean function

Importing data

Let’s import an existing data set!
We need the readr package (which is part of the Tidyverse)
To use a package, we have to activate it first with the library() function
library(tidyverse) activates all Tidyverse packages
library(readr) activates just the readr package:

library(readr)

Importing CSV files

The function read_delim() imports data from text files
In most cases, it only needs the file name because it tries to guess the delimiter
To import the file lecturer.csv:

read_delim("lecturer.csv")

To use the imported data, we need to assign a name
For example, this makes the data from lecturer.csv available as df:

df = read_delim("lecturer.csv")

More CSV examples

read_delim("birds.csv")
read_delim("lecturer.dat")
read_delim("pm10.csv")
read_delim("wahl16.csv")
read_delim("homework.csv")  # temperature column not correct
read_csv2("homework.csv")  # this works
read_delim("cars.csv")  # error, need to set delimiter manually!
read_delim("cars.csv", delim=",")
read_csv2("covid19.csv")  # decimal mark ,
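
The delimiter and decimal-mark settings used above can be tried out without any file. The following sketch uses made-up inline data (wrapped in I(), which assumes readr 2.0 or later) and suggests that read_csv2() is shorthand for a semicolon delimiter combined with a comma as decimal mark:

```r
library(readr)

# Made-up data: semicolon as delimiter, comma as decimal mark
txt = "day;temperature\n1;20,5\n2;18,3\n"

# Spell out delimiter and decimal mark explicitly
a = read_delim(
    I(txt),
    delim=";",
    locale=locale(decimal_mark=","),
    show_col_types=FALSE
)

# read_csv2() uses these settings by default
b = read_csv2(I(txt), show_col_types=FALSE)

a$temperature  # numeric column: 20.5 18.3
b$temperature  # same values
```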

We can also generate import code semi-automatically, for example with the “Import Dataset” dialog in RStudio

Importing Excel files

We need read_excel() from the readxl package:

library(readxl)
read_excel("lecturer.xlsx")

# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13

Importing SPSS files

We need read_spss() from the haven package:

library(haven)
read_spss("lecturer.sav")

# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13

Vectors

Imported data ends up in a so-called data frame (a table)
The columns are vectors, the most basic data type in R
A vector is a collection of values of a specific type (numeric, character, logical, …)
The c() function creates a vector:

c(1, 8, -5, 17, 3)

[1]  1  8 -5 17  3

x = c(1, 8, -5, 17, 3)  # assign the name x
length(x)  # number of elements

[1] 5
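
A short sketch of a few other things that can be done with a vector (the names words, flags, and mixed are only for illustration):

```r
x = c(1, 8, -5, 17, 3)

class(x)   # type of the elements: "numeric"
x[2]       # second element (indexing starts at 1): 8
x * 2      # arithmetic works element-wise: 2 16 -10 34 6
x[x > 2]   # logical indexing keeps matching elements: 8 17 3

words = c("a", "b")  # character vector
flags = c(TRUE, FALSE)  # logical vector
mixed = c(1, "a", TRUE)  # mixing types converts everything to character
class(mixed)  # "character"
```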

Data frames

A data frame is a table consisting of rows and columns
Each column is a vector
Different columns can have different types

data.frame(a=-2:2, b=c("A", "B", "C", "D", "E"), c=10.1:14.1)

   a b    c
1 -2 A 10.1
2 -1 B 11.1
3  0 C 12.1
4  1 D 13.1
5  2 E 14.1
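
To illustrate that each column is a vector with its own type, here is a small sketch that inspects the data frame from above using base R functions:

```r
df = data.frame(a=-2:2, b=c("A", "B", "C", "D", "E"), c=10.1:14.1)

nrow(df)     # number of rows: 5
ncol(df)     # number of columns: 3
df$a         # a single column is a vector: -2 -1 0 1 2
class(df$a)  # "integer"
class(df$c)  # "numeric"
df[2, "c"]   # value in row 2 of column c: 11.1
str(df)      # compact overview of all columns and their types
```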

Tibbles

A tibble is a data frame that “looks better” when printed (it also shows the dimensions and the column types)
It requires the tibble package (part of the Tidyverse)

library(tibble)
tibble(a=-2:2, b=c("A", "B", "C", "D", "E"), c=10.1:14.1)

# A tibble: 5 × 3
      a b         c
  <int> <chr> <dbl>
1    -2 A      10.1
2    -1 B      11.1
3     0 C      12.1
4     1 D      13.1
5     2 E      14.1

Data wrangling

Manipulating a data frame is called data wrangling
The dplyr package makes this task a lot of fun
There are five basic data wrangling activities:
- Filtering rows with filter()
- Sorting rows with arrange()
- Selecting columns with select()
- Computing new columns with mutate()
- Summarizing with group_by() and summarize()
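
A minimal sketch of the five activities, applied to a tiny made-up tibble (the data, the column names, and the assumed maximum of 20 points are invented for illustration):

```r
library(dplyr)

# Made-up data for illustration
people = tibble(
    name=c("Ann", "Bob", "Cem", "Dana"),
    group=c("a", "b", "a", "b"),
    score=c(12, 7, 15, 9)
)

high = filter(people, score > 8)  # keep rows with a score above 8
sorted = arrange(people, desc(score))  # sort rows by descending score
narrow = select(people, name, score)  # keep only two columns
extended = mutate(people, percent=score / 20 * 100)  # add a new column

# one summary row per group
by_group = summarize(group_by(people, group), mean_score=mean(score))
by_group
```

Each function takes a data frame as its first argument and returns a new data frame; the original people tibble is not modified.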

Pipe operator

The pipe operator |> (available since R 4.1) passes the value of the expression on its left as the first argument to the function call on its right
Instead of writing mean(x) we can write x |> mean()
This is useful if we want to combine several processing steps in a pipeline

y = 1:100
log(mean(y))

[1] 3.921973

y |> mean() |> log()

[1] 3.921973
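
Functions in a pipeline can still receive additional arguments. A small sketch with made-up values (the names scores, piped, and nested are only for illustration):

```r
scores = c(2.5, 3.7, NA, 4.1)  # made-up values with one missing entry

# the piped value becomes the first argument, other arguments follow as usual
piped = scores |> mean(na.rm=TRUE) |> round(1)

# equivalent nested call (read from the inside out)
nested = round(mean(scores, na.rm=TRUE), 1)

piped  # 3.4
identical(piped, nested)  # TRUE
```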

Example (1)

Let’s use the penguins data set from the palmerpenguins package

library(dplyr)
library(palmerpenguins)

penguins |>
    group_by(species) |>
    mutate(mass=body_mass_g / 1000) |>
    summarize(
        mean_mass=mean(mass, na.rm=TRUE),
        sd_mass=sd(mass, na.rm=TRUE)
    )

# A tibble: 3 × 3
  species   mean_mass sd_mass
  <fct>         <dbl>   <dbl>
1 Adelie         3.70   0.459
2 Chinstrap      3.73   0.384
3 Gentoo         5.08   0.504

Example (2)

The ggplot2 package creates visualizations of the data

library(ggplot2)
df = penguins |> rename(length=bill_length_mm, depth=bill_depth_mm)
ggplot(df, mapping=aes(x=length, y=depth)) +
    geom_point() +
    geom_smooth(method=lm)

Example (3)

ggplot(df, mapping=aes(x=length, y=depth, color=species)) +
    geom_point() +
    geom_smooth(method=lm)

t-test (1)

The function t.test() performs paired and unpaired t-tests (by default, an unpaired Welch test that does not assume equal variances)

df = df |> filter(species %in% c("Adelie", "Chinstrap"))
t.test(length ~ species, data=df)


    Welch Two Sample t-test

data:  length by species
t = -21.865, df = 106.97, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
95 percent confidence interval:
 -10.952948  -9.131917
sample estimates:
   mean in group Adelie mean in group Chinstrap 
               38.79139                48.83382

t-test (2)

t.test(depth ~ species, data=df)


    Welch Two Sample t-test

data:  depth by species
t = -0.43771, df = 137.75, p-value = 0.6623
alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
95 percent confidence interval:
 -0.4095657  0.2611044
sample estimates:
   mean in group Adelie mean in group Chinstrap 
               18.34636                18.42059
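
t.test() can also be called with two vectors instead of a formula. A minimal sketch with made-up data showing three common variants (var.equal and paired are arguments of t.test(); the data and variable names are invented):

```r
# Made-up scores for illustration
before = c(12, 15, 11, 18, 14, 16)
after = c(14, 17, 11, 21, 15, 19)

welch = t.test(before, after)  # unpaired, Welch (default)
student = t.test(before, after, var.equal=TRUE)  # unpaired, equal variances
paired = t.test(before, after, paired=TRUE)  # paired (same subjects twice)

welch$method
student$method
paired$method
paired$p.value  # individual results can be extracted with $
```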

Linear regression (1)

The function lm() computes a linear regression model
Model specification: dv ~ iv1 + iv2 + ...
(Read: “dv is predicted by iv1 and iv2 and …”)

model = lm(bill_depth_mm ~ bill_length_mm, data=penguins)
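
As a sketch of the dv ~ iv1 + iv2 notation with more than one predictor, the following example uses the mtcars data set that ships with R (it is not part of the penguins example; the name model2 is only for illustration):

```r
# fuel consumption (mpg) predicted by weight (wt) and horsepower (hp)
model2 = lm(mpg ~ wt + hp, data=mtcars)

coef(model2)  # one intercept plus one slope per predictor
```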

Linear regression (2)

summary(model)


Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1381 -1.4263  0.0164  1.3841  4.5255 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    20.88547    0.84388  24.749  < 2e-16 ***
bill_length_mm -0.08502    0.01907  -4.459 1.12e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.922 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.05525,   Adjusted R-squared:  0.05247 
F-statistic: 19.88 on 1 and 340 DF,  p-value: 1.12e-05
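
The values printed by summary() can also be extracted for further use. A minimal sketch, again assuming the built-in mtcars data set (the names fit and s are only for illustration):

```r
fit = lm(mpg ~ wt, data=mtcars)
s = summary(fit)

coef(s)          # coefficient table (estimate, SE, t, p) as a matrix
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
confint(fit)     # 95% confidence intervals for the coefficients
```

With a single predictor, R-squared equals the squared correlation between the two variables, so s$r.squared should match cor(mtcars$mpg, mtcars$wt)^2.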

Linear regression (3)

par(mfrow=c(2, 2))  # put 2 x 2 plots into one figure
plot(model)  # creates 4 plots

Analysis of variance

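ANOVA is a special case of linear regression
Base R has functions for doing ANOVA (you could even use lm()), but their default results can differ slightly from those obtained by e.g. SPSS (for instance, aov() and anova() use sequential Type I sums of squares, whereas SPSS defaults to Type III)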
The package ez generates output that is similar to SPSS
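
To illustrate the connection between ANOVA and linear regression, the following sketch fits the same one-way design with lm() and with aov() and compares the F statistics. It assumes the PlantGrowth data set that ships with R; the variable names are only for illustration:

```r
fit_lm = lm(weight ~ group, data=PlantGrowth)  # linear regression
fit_aov = aov(weight ~ group, data=PlantGrowth)  # one-way ANOVA

f_lm = summary(fit_lm)$fstatistic[["value"]]  # overall F test of the model
f_aov = summary(fit_aov)[[1]][["F value"]][1]  # F value of the group effect

all.equal(f_lm, f_aov)  # TRUE
```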

ANOVA example (1)

library(ez)
library(tidyr)  # for drop_na()

df = drop_na(penguins)  # drop rows with missing data (NA)
df$id = factor(1:nrow(df))  # add id column
ezANOVA(df, dv=bill_depth_mm, wid=id, between=species)

$ANOVA
   Effect DFn DFd        F            p p<.05       ges
1 species   2 330 344.8251 1.446616e-81     * 0.6763596

$`Levene's Test for Homogeneity of Variance`
  DFn DFd      SSn      SSd        F         p p<.05
1   2 330 1.647581 142.1504 1.912417 0.1493565

ANOVA example (2)

pairwise.t.test(df$bill_depth_mm, df$species)


    Pairwise comparisons using t tests with pooled SD 

data:  df$bill_depth_mm and df$species 

          Adelie Chinstrap
Chinstrap 0.66   -        
Gentoo    <2e-16 <2e-16   

P value adjustment method: holm
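
The Holm adjustment reported above can be reproduced with p.adjust(). A small sketch with made-up p-values (pairwise.t.test() accepts the same method names through its p.adjust.method argument):

```r
p = c(0.01, 0.04, 0.03)  # made-up unadjusted p-values

holm = p.adjust(p, method="holm")
bonferroni = p.adjust(p, method="bonferroni")

holm        # 0.03 0.06 0.06
bonferroni  # 0.03 0.12 0.09
```

Holm-adjusted p-values are never larger than Bonferroni-adjusted ones, which is one reason Holm is the default.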

Courses for beginners

Introduction to R

Data wrangling in R using the Tidyverse

Book recommendations

Discovering Statistics using R

Learning Statistics with R
