Clemens Brunner
Although almost anything can be done with “base R” (no additional packages), most tasks are easier and more convenient with the Tidyverse! — (my opinion)
The tidyverse meta-package provides all core packages.
R scripts use the file extension .R.
Comments start with #.
Help is available with ? followed by a function name.
Data import uses the readr package (which is part of tidyverse).
Packages are activated with the library() function:
library(tidyverse) activates all Tidyverse packages.
library(readr) activates just the readr package.
read_delim() imports data from text files.
Example file: lecturer.csv
In this example, the data in lecturer.csv will be available as df.

read_delim("birds.csv")
read_delim("lecturer.dat")
read_delim("pm10.csv")
read_delim("wahl16.csv")
read_delim("homework.csv") # temperature column not correct
read_csv2("homework.csv") # this works
read_delim("cars.csv") # error, need to set delimiter manually!
read_delim("cars.csv", delim=",")
read_csv2("covid19.csv") # decimal mark ,

We can also generate import code semi-automatically.
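The delimiter and decimal-mark issues noted in the comments above can be tried without the course files. The following is a minimal sketch, assuming a made-up semicolon-delimited file with decimal commas; the file contents and the column names city and temperature are hypothetical.

```r
library(readr)

# Hypothetical file: semicolon-delimited, decimal comma (values are made up)
path <- tempfile(fileext = ".csv")
writeLines(c("city;temperature", "Graz;20,5", "Wien;18,25"), path)

# read_csv2() assumes ";" as delimiter and "," as decimal mark
df1 <- read_csv2(path, show_col_types = FALSE)

# The same settings spelled out explicitly for read_delim()
df2 <- read_delim(
  path,
  delim = ";",
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)

df1
df2
```

Both calls should yield a numeric temperature column. Spelling out delim and locale is more verbose, but it also covers files that follow neither the read_csv() nor the read_csv2() convention.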
We need read_excel() from the readxl package:
# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13
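The call that produced the output above is not shown. As a rough sketch of how read_excel() is used, the example below reads a workbook that ships with the readxl package; this is a stand-in for the lecturer data, which is not available here.

```r
library(readxl)

# Example workbook shipped with readxl (not the lecturer data shown above)
path <- readxl_example("datasets.xlsx")

excel_sheets(path)                              # names of the sheets in the workbook
df <- read_excel(path)                          # reads the first sheet by default
df_cars <- read_excel(path, sheet = "mtcars")   # reads a specific sheet by name

df
```

Like the readr functions, read_excel() returns a tibble, so the result can be processed in the same way as data imported from text files.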
We need read_spss() from the haven package:
# A tibble: 10 × 7
   name   birth_date   job friends alcohol income neurotic
   <chr>  <chr>      <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
 1 Ben    7/3/1977       1       5      10  20000       10
 2 Martin 5/24/1969      1       2      15  40000       17
 3 Andy   6/21/1973      1       0      20  35000       14
 4 Paul   7/16/1970      1       4       5  22000       13
 5 Graham 10/10/1949     1       1      30  50000       21
 6 Carina 11/5/1983      2      10      25   5000        7
 7 Karina 10/8/1987      2      12      20    100       13
 8 Doug   1/23/1989      2      15      16   3000        9
 9 Mark   5/20/1973      2      12      17  10000       14
10 Zoe    11/12/1984     2      17      18     10       13
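The import call itself is again not shown. A minimal sketch of read_spss(), assuming the example file iris.sav that ships with the haven package in place of the lecturer data:

```r
library(haven)

# Example SPSS file shipped with haven (not the lecturer data shown above)
path <- system.file("examples", "iris.sav", package = "haven")

# For .sav files, read_spss() dispatches to read_sav()
df <- read_spss(path)
df
```

SPSS variables with value labels arrive as labelled numeric columns; haven's as_factor() can convert them to factors if needed.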
The c() function creates a vector.
Data frames (tibbles) come from the tibble package (part of the Tidyverse).
The dplyr package makes this task a lot of fun:
filter()
arrange()
select()
mutate()
group_by() and summarize()
|> can be used to pipe an expression on the left to a function on the right.
Instead of mean(x) we can write x |> mean().
Example with the penguins data set from the palmerpenguins package:

library(dplyr)
library(palmerpenguins)
penguins |>
    group_by(species) |>
    mutate(mass=body_mass_g / 1000) |>  # convert g to kg
    summarize(
        mean_mass=mean(mass, na.rm=TRUE),  # na.rm=TRUE ignores missing values
        sd_mass=sd(mass, na.rm=TRUE)
    )
# A tibble: 3 × 3
  species   mean_mass sd_mass
  <fct>         <dbl>   <dbl>
1 Adelie         3.70   0.459
2 Chinstrap      3.73   0.384
3 Gentoo         5.08   0.504
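The remaining points from the list before the penguins pipeline (vectors with c(), tibbles, the other dplyr verbs, and the pipe) can be sketched in one small example. All values and the column names name, group and mass_g are made up for illustration.

```r
library(tibble)
library(dplyr)

# Made-up values for illustration only (not taken from the penguins data)
name <- c("A", "B", "C", "D")
group <- c("x", "y", "x", "y")
mass_g <- c(3500, 4200, 3900, 5100)

# Combine the vectors into a tibble (columns are named after the vectors)
df <- tibble(name, group, mass_g)

heavy <- df |>
  filter(mass_g > 3600) |>            # keep rows that satisfy a condition
  mutate(mass_kg = mass_g / 1000) |>  # add a new column
  arrange(desc(mass_kg)) |>           # sort rows, largest value first
  select(name, mass_kg)               # keep only these columns

heavy

# The pipe only changes how the call is written, not its result
mass_g |> mean()
mean(mass_g)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes chaining with |> possible.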
The ggplot2 package creates visualizations of the data.
t.test() performs (un)paired t-tests.
    Welch Two Sample t-test

data:  length by species
t = -21.865, df = 106.97, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
95 percent confidence interval:
 -10.952948  -9.131917
sample estimates:
   mean in group Adelie mean in group Chinstrap
               38.79139                48.83382

    Welch Two Sample t-test

data:  depth by species
t = -0.43771, df = 137.75, p-value = 0.6623
alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0
95 percent confidence interval:
 -0.4095657  0.2611044
sample estimates:
   mean in group Adelie mean in group Chinstrap
               18.34636                18.42059
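The calls behind these two outputs are not shown. The sketch below is one plausible reconstruction, assuming that length and depth are renamed versions of bill_length_mm and bill_depth_mm and that only the Adelie and Chinstrap penguins were kept; the data preparation is an assumption, not the original code.

```r
library(dplyr)
library(palmerpenguins)

# Assumed preparation: keep two species and shorten the column names
df <- penguins |>
  filter(species %in% c("Adelie", "Chinstrap")) |>
  mutate(species = droplevels(species)) |>
  select(species, length = bill_length_mm, depth = bill_depth_mm)

# Unpaired tests; var.equal = FALSE (the default) gives the Welch variant
res_length <- t.test(length ~ species, data = df)
res_depth <- t.test(depth ~ species, data = df)

res_length
res_depth
```

The formula length ~ species reads as "length grouped by species", which matches the "data: length by species" line in the output.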
lm() computes a linear regression model.
The model is specified as a formula: dv ~ iv1 + iv2 + ...
(read: “dv is predicted by iv1 and iv2 and …”)
Call:
lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max
-4.1381 -1.4263  0.0164  1.3841  4.5255

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    20.88547    0.84388  24.749  < 2e-16 ***
bill_length_mm -0.08502    0.01907  -4.459 1.12e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.922 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.05525,  Adjusted R-squared:  0.05247
F-statistic: 19.88 on 1 and 340 DF,  p-value: 1.12e-05
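The Call line in the output shows the formula that was used. A short sketch of how such a model might be fitted and summarized follows; the second model with species as an additional predictor is only an illustration of the dv ~ iv1 + iv2 pattern and is not part of the original output.

```r
library(palmerpenguins)

# One predictor, as in the Call shown above
model <- lm(bill_depth_mm ~ bill_length_mm, data = penguins)
summary(model)

# Several predictors are joined with + (species is an illustrative choice here)
model2 <- lm(bill_depth_mm ~ bill_length_mm + species, data = penguins)
coef(model2)
```

Rows with missing values in the model variables are dropped by default, which is what the "observations deleted due to missingness" note in the summary refers to.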
ANOVAs can be computed with built-in functions (which build on lm()), but these produce results that are slightly different from those obtained by e.g. SPSS.
The ez package generates output that is similar to SPSS.

library(ez)
library(tidyr) # for drop_na()
df = drop_na(penguins) # drop rows with missing data (NA)
df$id = factor(1:nrow(df)) # add id column
ezANOVA(df, dv=bill_depth_mm, wid=id, between=species)
$ANOVA
   Effect DFn DFd        F            p p<.05       ges
1 species   2 330 344.8251 1.446616e-81     * 0.6763596

$`Levene's Test for Homogeneity of Variance`
  DFn DFd      SSn      SSd        F         p p<.05
1   2 330 1.647581 142.1504 1.912417 0.1493565
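For comparison, the same one-way design can be fitted with the built-in aov() function. This is a minimal sketch using the same drop_na() preparation as above. With a single between-subjects factor the F value is expected to agree with the ezANOVA output; differences between the built-in functions and SPSS-style results usually concern how sums of squares are computed in unbalanced designs with several factors. That explanation is an assumption about what the statement above refers to.

```r
library(tidyr)
library(palmerpenguins)

df <- drop_na(penguins)  # same preparation as in the ez example

# One-way between-subjects ANOVA with the built-in aov()
fit <- aov(bill_depth_mm ~ species, data = df)
summary(fit)
```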