Best Practice: Always load tidyverse and readxl first

library(tidyverse)
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.4     v readr     2.1.5
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.3     v tidyr     1.3.1
## v purrr     1.0.2     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

# it is not necessary to re-install these packages (via install.packages("") ) if you have already installed them once in the console

Manipulating Data with Tidyverse

Filter

“Filter” is a command to keep certain observations, and remove others.

# "Filter" allows us to remove or keep different observations, depending on values that we select. This is just like the "Filter" in Excel

# We can remove, for example, all cars with less than 100 horsepower

new_cars <- mtcars |> filter(hp>100)

# in the above command, we are asking R to keep all cars with *more* than 100 horsepower. How many cars are left?

length(new_cars$hp)
## [1] 23
# There are now 23 cars instead of the original 32

Select

“Select” is a command to keep certain variables/columns, and remove others.

# Select is just like "filter", but instead of removing rows or observations, we are *selecting* which columns (or variables) to keep

fewer_columns_cars <- mtcars |> select(hp,mpg)

# Now we see that the new dataframe has only 2 columns (hp, mpg) instead of 11 columns
#the old mtcars data frame:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#the new fewer_columns_cars dataframe:

head(fewer_columns_cars)
##                    hp  mpg
## Mazda RX4         110 21.0
## Mazda RX4 Wag     110 21.0
## Datsun 710         93 22.8
## Hornet 4 Drive    110 21.4
## Hornet Sportabout 175 18.7
## Valiant           105 18.1

Like “filter()”, in “select()” you specify which information you want to keep.

Mutate

Mutate is a command to create new columns from scratch.

# here, we are going to create a new column that just adds hp to mpg:

fewer_columns_cars <- fewer_columns_cars |> mutate(hp_plus_mpg=hp+mpg)

head(fewer_columns_cars)
##                    hp  mpg hp_plus_mpg
## Mazda RX4         110 21.0       131.0
## Mazda RX4 Wag     110 21.0       131.0
## Datsun 710         93 22.8       115.8
## Hornet 4 Drive    110 21.4       131.4
## Hornet Sportabout 175 18.7       193.7
## Valiant           105 18.1       123.1

The neat thing about mutate is that you can do this with almost any type of command. Here’s the same variables, but with subtraction instead of addition:

fewer_columns_cars <- fewer_columns_cars |> mutate(hp_minus_mpg=hp-mpg)

head(fewer_columns_cars)
##                    hp  mpg hp_plus_mpg hp_minus_mpg
## Mazda RX4         110 21.0       131.0         89.0
## Mazda RX4 Wag     110 21.0       131.0         89.0
## Datsun 710         93 22.8       115.8         70.2
## Hornet 4 Drive    110 21.4       131.4         88.6
## Hornet Sportabout 175 18.7       193.7        156.3
## Valiant           105 18.1       123.1         86.9

(ADVANCED) Mutate with IF-ELSE

It’s possible to create conditional mutate() columns, where the value of one column influences the new column. For example, we can create a new column that has a “1” if the hp is above 100, otherwise it has a “0”:

ifelse_columns <- mtcars |> select(hp) |> mutate(high_horsepower=ifelse(hp>100,1,0))

# Here, we have selected only "hp" from mtcars, and chained it into mutate. All pipe operations can be chained together for efficiency. 

head(ifelse_columns)
##                    hp high_horsepower
## Mazda RX4         110               1
## Mazda RX4 Wag     110               1
## Datsun 710         93               0
## Hornet 4 Drive    110               1
## Hornet Sportabout 175               1
## Valiant           105               1

See what happened? There is a “1” in the new column if the horsepower is above 100, and a “0” in the new column if it’s 100 or below.

The ifelse() command has a structure like this:

ifelse(“some value you select”,“if the value is TRUE”, “if the value is FALSE”)

So, if you wanted to create a new column where every hp that is exactly 110 get’s a “1”, and everyone else get’s a “0”, it would look like this:

ifelse_columns <- mtcars |> select(hp) |> mutate(exactly_110_hp=ifelse(hp==110,1,0))

head(ifelse_columns)
##                    hp exactly_110_hp
## Mazda RX4         110              1
## Mazda RX4 Wag     110              1
## Datsun 710         93              0
## Hornet 4 Drive    110              1
## Hornet Sportabout 175              0
## Valiant           105              0

See what happened there? The magic happened in “ifelse()” within mutate(). ifelse(hp==110,1,0) means, in plain English, “IF the hp is exactly 110, put a”1” in the new column, otherwise, put a “0” “. This is exactly what has happened.

Arrange

Arrange is a command to re-order the data according to values in certain columns. This is just like the “A|Z” ordering button in Excel:

ifelse_columns_rearranged <- ifelse_columns |> arrange(hp)

head(ifelse_columns_rearranged)
##                hp exactly_110_hp
## Honda Civic    52              0
## Merc 240D      62              0
## Toyota Corolla 65              0
## Fiat 128       66              0
## Fiat X1-9      66              0
## Porsche 914-2  91              0

See what happened? It rearranged “ifelse_columns” so that hp is now lowest to highest. What if you wanted to rearrange it so that it went from highest to lowest instead?

For that, you’d put “desc()” inside “arrange()”. Like this:

# "desc()" stands for "descending" -- we want the values of HP to "descend" from highest to lowest

ifelse_columns_rearranged <- ifelse_columns |> arrange(desc(hp))

head(ifelse_columns_rearranged)
##                      hp exactly_110_hp
## Maserati Bora       335              0
## Ford Pantera L      264              0
## Duster 360          245              0
## Camaro Z28          245              0
## Chrysler Imperial   230              0
## Lincoln Continental 215              0

Summarize

Summarize reduces a whole dataset to just summaries which you specify. For example, let’s say we wanted to know the average hp of the whole mtcars dataset. We could call “summarize()” on hp, and create a new variable called “avg_hp”:

summarized_hp <- mtcars |> summarize(avg_hp=mean(hp))

head(summarized_hp)
##     avg_hp
## 1 146.6875

You’ll notice that the entire dataset has been condensed into “avg_hp”, which gives us the mean(hp). Only use “summarize()” when you want summary statistics in the new dataset.

You can also summarize() more than one variable. For example, let’s say we wanted to know the average hp and mpg:

summarized_hp <- mtcars |> summarize(avg_hp=mean(hp), avg_mpg=mean(mpg))

head(summarized_hp)
##     avg_hp  avg_mpg
## 1 146.6875 20.09062

GROUP BY

Summarize can be used with group_by() to get summary tables by specific variables. This is incredibly useful, and is basically the same as running a pivot table in Excel.

Let’s say we wanted to know average hp, but by transmission type:

summarized_hp <- mtcars |> group_by(am) |> summarize(avg_hp=mean(hp))

head(summarized_hp)
## # A tibble: 2 x 2
##      am avg_hp
##   <dbl>  <dbl>
## 1     0   160.
## 2     1   127.

See what we did there? We still have the avg_hp thanks to “summarize()”, however we’ve asked R to “group_by()” the variable “am” (transmission type). Now we see that the hp is quite different depending on transmission type. Which type of transmission has better horsepower?

Count in Summarize (a special use for group_by() and summarize())

We can use the command “n=n()” in “summarize” to count the number of observations in a dataset. This is good if you want to know how many of each category are in the data. For example, let’s say we wanted to know how many cars there are, by transmission type:

summarized_car <- mtcars |> group_by(am) |> summarize(n=n())

head(summarized_car)
## # A tibble: 2 x 2
##      am     n
##   <dbl> <int>
## 1     0    19
## 2     1    13

How many cars are there in transmission type “0”? In transmission type “1”?

How many cars are there by cylinder?

summarized_car <- mtcars |> group_by(cyl) |> summarize(n=n())

head(summarized_car)
## # A tibble: 3 x 2
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14

left_join() - to combine datasets

Sometimes you’ll want to combine two datasets into one. This can be done – if they have a variable in common (a “key” variable).

Here’s two datasets we can combine:

head(band_members)
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
head(band_instruments)
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

What column/variable do they both have in common?

(It’s the “name”)

To combine them into a single dataset, we can call a “left_join()”. This takes one data set and combines it with the other one on the left-hand side. Like this:

new_band <- band_members |> left_join(band_instruments,by="name")

head(new_band)
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

Three things have happened:

  1. band_members and band_instruments have been combined

  2. Keith from “band_instruments” has been dropped. Why? Because his name isn’t in “band_members” (the left hand side).

  3. Mick in “band_members” now has “NA” for “band_instruments”. Why? Because he was not present in band_instruments.

There are different kinds of joins to have different values show up, but left_join() is typically the most useful.