Getting started with R and RStudio

Clemens Brunner

Contents

  • Installing R and RStudio
  • Importing data
  • Vectors and data frames
  • Data wrangling
  • t-test
  • Linear regression
  • Analysis of variance

Why R?

  • R enables reproducible analyses because:
    • R is free, cross-platform, and open-source
    • R works with text commands (no clicking around in dozens of dialogs)
    • Text commands can be permanently stored in a text file
    • This text file (R script) can be run over and over and always reproduces the same results
  • R is the de facto standard for statistical data analysis
  • R has a large and friendly community

Installing R and RStudio

RStudio components

  1. Console
  2. Editor
  3. Environment, History
  4. Files, Plots, Packages, Help

Installing packages

  • Packages extend the functionality of R
  • The Tidyverse is a collection of packages that work together

Although almost anything can be done with “base R” (no additional packages), most tasks are easier and more convenient with the Tidyverse (in my opinion).

  • The tidyverse meta-package provides all core packages
  • Let’s install it (using the RStudio “Packages” tab)!
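The installation can also be done from the console instead of the “Packages” tab. The following is a minimal sketch: missing_packages() is a hypothetical helper introduced here (it is not part of R), and the actual install.packages() call is left commented out because it needs an internet connection.

```r
# Return the names in pkgs that are not installed yet
# (missing_packages is a hypothetical helper, not part of R)
missing_packages = function(pkgs) {
    setdiff(pkgs, rownames(installed.packages()))
}

to_install = missing_packages("tidyverse")
print(to_install)  # empty if tidyverse is already installed

# Uncomment to install whatever is missing (run once, requires internet):
# if (length(to_install) > 0) install.packages(to_install)
```

Installing is only necessary once per computer; loading a package with library() is necessary in every new R session.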

Where do I put my code?

  • R runs commands in the console
  • This is where we enter commands and R will run them
  • We permanently store R commands in a text file
  • This is called an R script (file extension .R)
  • A script makes the analysis reproducible (same result every time we run the script)
  • Comments are introduced with #
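As an illustration, a very small R script could look like the sketch below. The file name and the numbers are made up for this example.

```r
# example_script.R (hypothetical file name)
# Everything after a # is a comment and is ignored by R

heights = c(1.72, 1.65, 1.80)  # made-up body heights in metres
mean_height = mean(heights)    # arithmetic mean of the three values
print(mean_height)
```

Running the whole script again (for example with source("example_script.R")) repeats exactly the same steps and gives the same result.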

Integrated help

  • In the console, type ? followed by a function name
  • This will open the integrated help browser
  • Always read the documentation if you use a function for the first time!
?mean  # shows help for the mean function

Importing data

  • Let’s import an existing data set!
  • We need the readr package (which is part of tidyverse)
  • To use a package, we have to load it first with the library() function
  • library(tidyverse) loads all core Tidyverse packages
  • library(readr) loads just the readr package:
library(readr)

Importing CSV files

  • The function read_delim() imports data from text files
  • It only needs the file name (in most cases)
  • To import the file lecturer.csv:
read_delim("lecturer.csv")
  • To use the imported data, we need to assign it to a name
  • For example, the following command makes the data from lecturer.csv available as df:
df = read_delim("lecturer.csv")
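Once the data has a name, we can inspect it. The sketch below does not use the real lecturer.csv. Instead it writes three rows in the style of the lecturer data to a temporary file (an assumption made so that the example is self-contained) and sets the delimiter explicitly.

```r
library(readr)

# Create a small CSV file in a temporary location (stand-in for lecturer.csv)
path = tempfile(fileext=".csv")
writeLines(c("name,friends,alcohol", "Ben,5,10", "Martin,2,15", "Andy,0,20"), path)

# Import the file and assign the result to the name df
df = read_delim(path, delim=",", show_col_types=FALSE)

nrow(df)     # number of rows
names(df)    # column names
df$friends   # a single column as a vector
```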

More CSV examples

read_delim("birds.csv")
read_delim("lecturer.dat")
read_delim("pm10.csv")
read_delim("wahl16.csv")
read_delim("homework.csv")  # temperature column is not parsed correctly
read_csv2("homework.csv")  # this works (semicolon as delimiter, comma as decimal mark)
read_delim("cars.csv")  # error, need to set delimiter manually!
read_delim("cars.csv", delim=",")
read_csv2("covid19.csv")  # decimal mark is a comma
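The difference between the import functions comes down to the delimiter and the decimal mark. A minimal sketch with made-up data (semicolons between columns, commas as decimal marks) shows two ways to read such a file. The file content and the variable names are assumptions for illustration only.

```r
library(readr)

# Made-up data with semicolons as delimiter and commas as decimal mark
path = tempfile(fileext=".csv")
writeLines(c("city;temperature", "Graz;21,5", "Wien;19,0"), path)

# Option 1: state delimiter and decimal mark explicitly
a = read_delim(path, delim=";", locale=locale(decimal_mark=","), show_col_types=FALSE)

# Option 2: read_csv2() assumes exactly this combination
b = read_csv2(path, show_col_types=FALSE)

a$temperature  # numeric values 21.5 and 19.0
b$temperature  # same result
```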

We can also generate import code semi-automatically with the “Import Dataset” dialog in RStudio.

Importing Excel files

We need read_excel() from the readxl package:

library(readxl)
read_excel("lecturer.xlsx")
# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13
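Excel workbooks can contain several sheets. The sketch below uses the example workbook datasets.xlsx that ships with the readxl package (not lecturer.xlsx) to list the sheets and import one of them by name.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
path = readxl_example("datasets.xlsx")

excel_sheets(path)  # names of all sheets in the workbook

# Import one specific sheet by name
mt = read_excel(path, sheet="mtcars")
head(mt)  # first rows of the imported tibble
```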

Importing SPSS files

We need read_spss() from the haven package:

library(haven)
read_spss("lecturer.sav")
# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13

Vectors

  • Imported data ends up in a so-called data frame (a table)
  • The columns are vectors, the most basic data type in R
  • A vector is a collection of values of a specific type (numeric, character, logical, …)
  • The c() function creates a vector:
c(1, 8, -5, 17, 3)
[1]  1  8 -5 17  3
x = c(1, 8, -5, 17, 3)  # assign the name x
length(x)  # number of elements
[1] 5
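A few more things we can do with a vector, shown as a short sketch using the same x. The comments state what each line returns.

```r
x = c(1, 8, -5, 17, 3)

x[2]      # second element (indexing starts at 1): 8
x[x > 2]  # elements greater than 2: 8 17 3
x * 2     # arithmetic works element-wise: 2 16 -10 34 6
mean(x)   # 4.8

class(x)               # "numeric"
class(c("a", "b"))     # "character"
class(c(TRUE, FALSE))  # "logical"
class(c(1, "a"))       # mixed types are converted to one type: "character"
```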

Data frames

  • A data frame is a table consisting of rows and columns
  • Each column is a vector
  • Different columns can have different types
data.frame(a=-2:2, b=c("A", "B", "C", "D", "E"), c=10.1:14.1)
   a b    c
1 -2 A 10.1
2 -1 B 11.1
3  0 C 12.1
4  1 D 13.1
5  2 E 14.1
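Rows and columns of a data frame can be accessed directly. The following sketch reuses the data frame from above and assigns it the name df.

```r
df = data.frame(a=-2:2, b=c("A", "B", "C", "D", "E"), c=10.1:14.1)

nrow(df)           # number of rows: 5
ncol(df)           # number of columns: 3
df$b               # column b as a vector
df[2, ]            # second row (all columns)
df[df$a > 0, ]     # rows where a is positive
sapply(df, class)  # type of each column
```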

Tibbles

  • A tibble is a data frame that “looks better”
  • It requires the tibble package (part of the Tidyverse)
library(tibble)
tibble(a=-2:2, b=c("A", "B", "C", "D", "E"), c=10.1:14.1)
# A tibble: 5 × 3
      a b         c
  <int> <chr> <dbl>
1    -2 A      10.1
2    -1 B      11.1
3     0 C      12.1
4     1 D      13.1
5     2 E      14.1
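An existing data frame can be converted to a tibble with as_tibble(). The sketch below also shows that a tibble is still a data frame.

```r
library(tibble)

df = data.frame(a=-2:2, b=c("A", "B", "C", "D", "E"))
tb = as_tibble(df)  # convert an existing data frame

class(df)          # "data.frame"
class(tb)          # "tbl_df" "tbl" "data.frame"
is.data.frame(tb)  # TRUE: functions expecting a data frame accept a tibble
```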

Data wrangling

  • Manipulating a data frame is called data wrangling
  • The dplyr package makes this task a lot of fun
  • There are five basic data wrangling activities:
    • Filtering rows with filter()
    • Sorting rows with arrange()
    • Selecting columns with select()
    • Computing new columns with mutate()
    • Summarizing with group_by() and summarize()
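The five activities can be sketched on a small table. The example uses four rows and four columns copied from the lecturer data; the result names (many_friends, by_income, and so on) are chosen for illustration only.

```r
library(dplyr)

# Four rows and four columns taken from the lecturer data
df = tibble(
    name=c("Ben", "Martin", "Carina", "Zoe"),
    job=c(1, 1, 2, 2),
    friends=c(5, 2, 10, 17),
    income=c(20000, 40000, 5000, 10)
)

many_friends = filter(df, friends > 4)       # keep rows matching a condition
by_income = arrange(df, income)              # sort rows (ascending)
two_cols = select(df, name, income)          # keep only some columns
with_k = mutate(df, income_k=income / 1000)  # add a computed column
per_job = df |>
    group_by(job) |>
    summarize(mean_friends=mean(friends))    # one row per group

per_job
```

Each function returns a new data frame and leaves df unchanged, so the steps can be combined freely.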

Pipe operator

  • The pipe operator |> passes the value of the expression on its left as the first argument to the function call on its right
  • Instead of writing mean(x) we can write x |> mean()
  • This is useful if we want to combine several processing steps in a pipeline
y = 1:100
log(mean(y))
[1] 3.921973
y |> mean() |> log()
[1] 3.921973
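A slightly longer pipeline shows why this is easier to read than nested calls. This is a sketch with a made-up vector; note that the |> operator requires R 4.1 or later.

```r
x = c(1, 8, -5, 17, 3)

# Nested calls are read from the inside out
sqrt(sum(abs(x)))

# The same computation as a pipeline, read from left to right
x |> abs() |> sum() |> sqrt()

# Additional arguments go inside the parentheses as usual
x |> mean() |> round(digits=0)
```

Both versions of the first computation return the square root of 34 (about 5.83), and the last line returns 5.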

Example (1)

  • Let’s use the penguins data set from the palmerpenguins package
library(dplyr)
library(palmerpenguins)

penguins |>
    group_by(species) |>
    mutate(mass=body_mass_g / 1000) |>
    summarize(
        mean_mass=mean(mass, na.rm=TRUE),
        sd_mass=sd(mass, na.rm=TRUE)
    )
# A tibble: 3 × 3
  species   mean_mass sd_mass
  <fct>         <dbl>   <dbl>
1 Adelie         3.70   0.459
2 Chinstrap      3.73   0.384
3 Gentoo         5.08   0.504

Example (2)

  • The ggplot2 package creates visualizations of the data
library(ggplot2)
df = penguins |> rename(length=bill_length_mm, depth=bill_depth_mm)
ggplot(df, mapping=aes(x=length, y=depth)) +
    geom_point() +
    geom_smooth(method=lm)

Example (3)

ggplot(df, mapping=aes(x=length, y=depth, color=species)) +
    geom_point() +
    geom_smooth(method=lm)

t-test (1)

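  • The function t.test() performs paired and unpaired t-tests (for two independent groups it uses Welch’s test by default, which does not assume equal variances)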
df = df |> filter(species %in% c("Adelie", "Chinstrap"))
t.test(length ~ species, data=df)

    Welch Two Sample t-test

data:  length by species
t = -21.865, df = 106.97, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
95 percent confidence interval:
 -10.952948  -9.131917
sample estimates:
   mean in group Adelie mean in group Chinstrap 
               38.79139                48.83382 

t-test (2)

t.test(depth ~ species, data=df)

    Welch Two Sample t-test

data:  depth by species
t = -0.43771, df = 137.75, p-value = 0.6623
alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
95 percent confidence interval:
 -0.4095657  0.2611044
sample estimates:
   mean in group Adelie mean in group Chinstrap 
               18.34636                18.42059 
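The examples above are unpaired Welch tests. The following sketch shows the two other common variants with made-up data: a paired test (paired=TRUE) and the classic Student test that assumes equal variances (var.equal=TRUE). The numbers and variable names are invented for illustration.

```r
# Made-up scores of five people measured twice
before = c(10, 12, 9, 11, 13)
after = c(12, 14, 10, 13, 14)

# Paired t-test: tests whether the mean of the differences is zero
res_paired = t.test(before, after, paired=TRUE)

# Classic (Student) unpaired t-test assuming equal variances
res_student = t.test(before, after, var.equal=TRUE)

# Individual results can be extracted by name
res_paired$p.value
res_student$p.value
```

With five pairs, the paired test has 4 degrees of freedom, whereas the unpaired Student test has 8.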

Linear regression (1)

  • The function lm() computes a linear regression model
  • Model specification: dv ~ iv1 + iv2 + ...
  • (Read: “dv is predicted by iv1 and iv2 and …”)
model = lm(bill_depth_mm ~ bill_length_mm, data=penguins)
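A model with two predictors follows the same pattern. The sketch below uses mtcars, a data set that ships with R, so it runs without additional packages; the names model2 and new_car and the values for the new car are chosen for illustration.

```r
# mtcars ships with R: fuel consumption (mpg), weight (wt), horsepower (hp)
model2 = lm(mpg ~ wt + hp, data=mtcars)  # mpg is predicted by wt and hp

coef(model2)  # intercept and one slope per predictor

# Predict mpg for a hypothetical car (wt is given in 1000 lbs)
new_car = data.frame(wt=3, hp=110)
predict(model2, newdata=new_car)
```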

Linear regression (2)

summary(model)

Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1381 -1.4263  0.0164  1.3841  4.5255 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    20.88547    0.84388  24.749  < 2e-16 ***
bill_length_mm -0.08502    0.01907  -4.459 1.12e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.922 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.05525,   Adjusted R-squared:  0.05247 
F-statistic: 19.88 on 1 and 340 DF,  p-value: 1.12e-05

Linear regression (3)

par(mfrow=c(2, 2))  # put 2 x 2 plots into one figure
plot(model)  # creates 4 plots

Analysis of variance

  • ANOVA is a special case of linear regression (a linear model with categorical predictors)
  • Base R has functions for doing ANOVA (you could even use lm()), but these produce results that are slightly different from those obtained by e.g. SPSS
  • The package ez generates output that is similar to SPSS
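The close relationship between ANOVA and linear regression can be checked with base R. This sketch uses the iris data set that ships with R (not the penguins data) and compares the F statistic from lm() with the one from aov() for a one-way design.

```r
# iris ships with R: sepal length of flowers from three species
fit_lm = lm(Sepal.Length ~ Species, data=iris)
fit_aov = aov(Sepal.Length ~ Species, data=iris)

# Extract the F statistic for the Species effect from both models
f_lm = anova(fit_lm)[["F value"]][1]
f_aov = summary(fit_aov)[[1]][["F value"]][1]

c(f_lm, f_aov)  # both approaches give the same F statistic
```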

ANOVA example (1)

library(ez)
library(tidyr)  # for drop_na()

df = drop_na(penguins)  # drop rows with missing data (NA)
df$id = factor(1:nrow(df))  # add id column
ezANOVA(df, dv=bill_depth_mm, wid=id, between=species)
$ANOVA
   Effect DFn DFd        F            p p<.05       ges
1 species   2 330 344.8251 1.446616e-81     * 0.6763596

$`Levene's Test for Homogeneity of Variance`
  DFn DFd      SSn      SSd        F         p p<.05
1   2 330 1.647581 142.1504 1.912417 0.1493565      

ANOVA example (2)

pairwise.t.test(df$bill_depth_mm, df$species)

    Pairwise comparisons using t tests with pooled SD 

data:  df$bill_depth_mm and df$species 

          Adelie Chinstrap
Chinstrap 0.66   -        
Gentoo    <2e-16 <2e-16   

P value adjustment method: holm 
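The p-values above are adjusted for multiple comparisons with the Holm method, which is the default of pairwise.t.test(); another method can be selected with the p.adjust.method argument. What the adjustment does can be sketched with the base R function p.adjust() and three made-up p-values.

```r
p = c(0.01, 0.02, 0.03)  # made-up unadjusted p-values

p.adjust(p, method="holm")        # 0.03 0.04 0.04
p.adjust(p, method="bonferroni")  # 0.03 0.06 0.09
p.adjust(p, method="none")        # unchanged: 0.01 0.02 0.03
```

Bonferroni multiplies every p-value by the number of comparisons, whereas Holm is a step-down variant that is never more conservative.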

Courses for beginners

Introduction to R

Data wrangling in R using the Tidyverse

Book recommendations

Discovering Statistics using R

Learning Statistics with R