filter( )select()arrange( )mutate()summarize() and group_by()pivot_longer()pivot_wider()separate()unite()drop_na() and
replace_na()This intensive course introduces the core concepts of data science using R, focusing on data visualization, transformation, tidying, and importing. Through detailed introduction and hands-on exercises, you will learn the essential skills to manage, transform, visualize, and import data, helping you perform data analysis using R’s tidyverse packages. No prior R experience is required.
Lecture 1: Data Visualization
Lecture 2: Data transformation
Lecture 3: Data tidying
Lecture 4: Data import
Install and set up R and RStudio
Understand basic programming concepts like functions and variables
Explore datasets in R
Create visualizations from scratch using
ggplot2
Map variables to different visual elements (aesthetics)
Customize plots to communicate insights effectively
RStudio is an Integrated Development Environment (IDE) for R.
It helps you:
Panels in RStudio:
Console: Where you type and run R code.
Environment: Displays your data and variables.
Files/Plots: Shows files, plots, and outputs.
Let’s start simple by running a basic calculation in R. Type the following code in the Console:
1 + 1
2 * 3
2 * (3 + 6)
8 / 9
R can be extended by installing packages—sets of code that provide additional functionality.
The tidyverse is a collection of R packages that make working with data easier. To install and load the tidyverse, type the codes in your console:
# install.packages("tidyverse")
library(tidyverse)
Each row is an observation.
Each column is a variable.
Each cell is a value.
# Load the palmerpenguins package
library(tidyverse)
library(palmerpenguins)
# Load the penguins dataset
data(penguins)
# View the dataset
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# View(penguins)
ggplot(data = , mapping = aes(x = , y = ,) + geom_point( )
ggplot(): Starts the plot.
aes(): Defines the “aesthetic
mappings” (how variables are mapped to visual properties).
geom_point(): Adds points to create
a scatter plot.
library(ggplot2)
library(palmerpenguins)
# Scatter plot
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()
# Scatter plot with points colored by species
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) + # Color the points by species
geom_point()
# Scatter plot with titles and labels
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
title = "Flipper Length vs. Body Mass by Species",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
# Faceted scatter plot by island
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
facet_wrap(~ island) + # facet by categorical variables
labs(
title = "Flipper Length vs. Body Mass by Island",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
In ggplot2, themes control the overall appearance of the non-data elements in your plots, such as:
Commonly Used Themes
# Scatter plot with theme_minimal
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
title = "Flipper Length vs. Body Mass (Minimal Theme)",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
) +
theme_minimal()
# Scatter plot with theme_classic
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
title = "Flipper Length vs. Body Mass (Classic Theme)",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
) +
theme_classic()
# Scatter plot with theme_bw
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
title = "Flipper Length vs. Body Mass (Black and White Theme)",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
) +
theme_bw()
# Scatter plot with theme_void
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
title = "Flipper Length vs. Body Mass (Void Theme)"
) +
theme_void()
Trendlines are lines added to a plot that help visualize the general direction or pattern of data.
They are often used in scatter plots to reveal relationships between two variables, such as whether there is a positive or negative correlation.
In ggplot2, trendlines (or smoothers) are added to plots using the geom_smooth() function.
# Scatter plot with a linear trendline
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() + # Adds scatter plot
geom_smooth(method = "lm") + # Adds linear regression trendline
labs(
title = "Flipper Length vs. Body Mass with Linear Trendline",
x = "Flipper Length (mm)",
y = "Body Mass (g)"
)
R and RStudio Setup: install and set up R and RStudio, and explored the basic RStudio interface (Console, Environment, Plots/Files panes).
Running Basic R Code: simple arithmetic in R and learned how to use comments (#) to annotate code.
Working with Packages: tidyverse package, which contains essential tools for data science, including ggplot2 for data visualization.
Understanding Datasets: introduced the concept of datasets and used the palmerpenguins dataset to explore real-world data.
Creating Visualizations with ggplot2: how to map variables to the x and y axes using aes(), and added layers like geom_point() to display data.
Customizing Visualizations: added titles, axis labels, and color-coded plots by species using labs() and aesthetics like color.
Faceting: faceting with facet_wrap(), which allows you to create multiple plots for different subsets of data.
Using Themes: explored built-in themes like theme_minimal(), theme_classic(), and learned how to customize plot appearance by adjusting fonts, removing gridlines, and moving legends.
Trendlines: how to add linear trendlines (using method = “lm”, and non-linear smoothers using method = “loess”) to visualize patterns in the data.
Master the key functions in dplyr for transforming datasets.
Filter and select specific rows and columns from a dataset.
Arrange and sort rows based on the values of specific columns.
Create new variables using mutate().
Summarize and group data for analysis using summarize() and group_by().
Handle missing data and clean your dataset before analysis.
Data transformation is the process of modifying, cleaning, and reshaping raw data into a format that is ready for analysis.
In real-world data science, raw data is often messy or too large to be useful as is, so transforming it into a tidy and usable form is essential.
The dplyr package, part of the tidyverse, provides a set of easy-to-use functions for performing these data manipulation tasks efficiently and intuitively.
| Function | Description |
|---|---|
| filter( ) | Selects rows based on a condition |
| select( ) | Selects specific columns |
| arrange( ) | Sorts rows based on the values of one or more columns |
| mutate( ) | Creates new columns or modifies existing ones |
| summarize( ) | Summarizes multiple values into a single value |
| group_by( ) | Groups data by one or more columns before summarizing |
filter( )The filter() function is used to subset rows from a dataset based on specific conditions.
Filtering Rows Based on Numeric Conditions
Filtering Rows Based on Categorical Values
Combining Multiple Conditions
# Load required libraries
library(tidyverse)
library(palmerpenguins)
# Filter penguins with flipper length greater than 200mm
filtered_penguins <- penguins %>%
filter(flipper_length_mm > 200)
head(filtered_penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 35.7 18 202 3550
## 2 Adelie Dream 41.1 18.1 205 4300
## 3 Adelie Dream 40.8 18.9 208 4300
## 4 Adelie Biscoe 41 20 203 4725
## 5 Adelie Torgersen 41.4 18.5 202 3875
## 6 Adelie Torgersen 44.1 18 210 4000
## # ℹ 2 more variables: sex <fct>, year <int>
# Keep only the penguins from the island "Biscoe".
filtered_penguins <- penguins %>%
filter(flipper_length_mm > 200, body_mass_g < 4000)
head(filtered_penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 35.7 18 202 3550
## 2 Adelie Torgersen 41.4 18.5 202 3875
## 3 Gentoo Biscoe 42.7 13.7 208 3950
## 4 Chinstrap Dream 49 19.5 210 3950
## 5 Chinstrap Dream 51.4 19 201 3950
## 6 Chinstrap Dream 50.2 18.8 202 3800
## # ℹ 2 more variables: sex <fct>, year <int>
# Select penguins with a flipper length greater than 200 mm and a body mass under 4000 g.
filtered_penguins <- penguins %>%
filter(flipper_length_mm > 200, body_mass_g < 4000)
head(filtered_penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 35.7 18 202 3550
## 2 Adelie Torgersen 41.4 18.5 202 3875
## 3 Gentoo Biscoe 42.7 13.7 208 3950
## 4 Chinstrap Dream 49 19.5 210 3950
## 5 Chinstrap Dream 51.4 19 201 3950
## 6 Chinstrap Dream 50.2 18.8 202 3800
## # ℹ 2 more variables: sex <fct>, year <int>
select()The select() function allows you to choose specific columns from a dataset.
Reducing Dimensionality
Reordering Columns
Dropping Irrelevant Columns
# Select only the columns for species, bill length, and body mass
penguins_selected <- penguins %>%
select(species, bill_length_mm, body_mass_g)
head(penguins_selected)
## # A tibble: 6 × 3
## species bill_length_mm body_mass_g
## <fct> <dbl> <int>
## 1 Adelie 39.1 3750
## 2 Adelie 39.5 3800
## 3 Adelie 40.3 3250
## 4 Adelie NA NA
## 5 Adelie 36.7 3450
## 6 Adelie 39.3 3650
# Rearranging columns to place species first.
penguins_reordered <- penguins %>%
select(species, everything()) # Moves species to the first column
head(penguins_reordered)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
# Exclude the island and sex columns from the dataset.
penguins_dropped <- penguins %>%
select(-island, -sex)
head(penguins_dropped)
## # A tibble: 6 × 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <fct> <dbl> <dbl> <int> <int> <int>
## 1 Adelie 39.1 18.7 181 3750 2007
## 2 Adelie 39.5 17.4 186 3800 2007
## 3 Adelie 40.3 18 195 3250 2007
## 4 Adelie NA NA NA NA 2007
## 5 Adelie 36.7 19.3 193 3450 2007
## 6 Adelie 39.3 20.6 190 3650 2007
arrange( )The arrange() function sorts the rows of your dataset based on the values in one or more columns.
Sorting by a Numeric Column
Sorting by a Categorical Variable
Sorting by Multiple Columns
# Sort the dataset by body mass in descending order to find the heaviest penguins.
penguins_sorted <- penguins %>%
arrange(desc(body_mass_g))
head(penguins_sorted)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 49.2 15.2 221 6300
## 2 Gentoo Biscoe 59.6 17 230 6050
## 3 Gentoo Biscoe 51.1 16.3 220 6000
## 4 Gentoo Biscoe 48.8 16.2 222 6000
## 5 Gentoo Biscoe 45.2 16.4 223 5950
## 6 Gentoo Biscoe 49.8 15.9 229 5950
## # ℹ 2 more variables: sex <fct>, year <int>
# Sort the dataset alphabetically by species.
penguins_sorted_species <- penguins %>%
arrange(species)
head(penguins_sorted_species)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
# First sort by island, and then by flipper length.
penguins_sorted_multi <- penguins %>%
arrange(island, flipper_length_mm)
head(penguins_sorted_multi)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 37.9 18.6 172 3150
## 2 Adelie Biscoe 37.8 18.3 174 3400
## 3 Adelie Biscoe 37.7 18.7 180 3600
## 4 Adelie Biscoe 38.8 17.2 180 3800
## 5 Adelie Biscoe 40.5 18.9 180 3950
## 6 Adelie Biscoe 36.5 16.6 181 2850
## # ℹ 2 more variables: sex <fct>, year <int>
mutate()The mutate() function is used to create new columns (variables) or modify existing ones.
Creating a New Column
Performing Calculations Across Columns
# Create a new column for body mass in kilograms.
penguins <- penguins %>%
mutate(body_mass_kg = body_mass_g / 1000)
head(penguins)
## # A tibble: 6 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>
# Create a new column for the difference between maximum and minimum body mass.
penguins <- penguins %>%
mutate(body_mass_diff = max(body_mass_g, na.rm = TRUE) - min(body_mass_g, na.rm = TRUE))
head(penguins)
## # A tibble: 6 × 10
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 4 more variables: sex <fct>, year <int>, body_mass_kg <dbl>,
## # body_mass_diff <int>
summarize() and
group_by()The summarize() function condenses your dataset by calculating summary statistics (e.g., mean, median, sum) for one or more columns.
When used with group_by(), you can calculate these statistics for different groups within your data.
Calculating Average Values for Specific Groups
Summarizing Counts of Observations
Aggregating Data for Further Analysis:
# Calculate the average body mass for each species.
penguins_summary <- penguins %>%
group_by(species) %>%
summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
head(penguins_summary)
## # A tibble: 3 × 2
## species mean_body_mass_g
## <fct> <dbl>
## 1 Adelie 3701.
## 2 Chinstrap 3733.
## 3 Gentoo 5076.
# Count the number of penguins from each island.
penguins_count <- penguins %>%
group_by(island) %>%
summarize(count = n())
head(penguins_count)
## # A tibble: 3 × 2
## island count
## <fct> <int>
## 1 Biscoe 168
## 2 Dream 124
## 3 Torgersen 52
# Calculate the total body mass of penguins for each species
total_mass_summary <- penguins %>%
group_by(species) %>%
summarize(total_body_mass_g = sum(body_mass_g, na.rm = TRUE))
head(total_mass_summary)
## # A tibble: 3 × 2
## species total_body_mass_g
## <fct> <int>
## 1 Adelie 558800
## 2 Chinstrap 253850
## 3 Gentoo 624350
Sometimes datasets contain missing values, which can interfere with calculations.
Most functions (e.g., mean() or sum()) will return NA if missing data is not handled.
To address this, we use the argument na.rm = TRUE, which tells R to remove missing values before performing calculations.
# Calculate the mean flipper length, ignoring missing values
mean_flipper_length <- mean(penguins$flipper_length_mm, na.rm = TRUE)
mean_flipper_length
## [1] 200.9152
Filtering Rows with filter(): Learn how to subset your data based on specific conditions.
Selecting Columns with select(): Focus on only the variables you need by selecting specific columns.
Arranging Rows with arrange(): Sort your data by one or more columns in ascending or descending order.
Creating New Variables with mutate(): Add new columns to your dataset or modify existing ones.
Summarizing Data with summarize() and group_by(): Compute summary statistics (mean, sum, etc.) for groups within your data.
Handling Missing Data: Use na.rm = TRUE to remove missing values before performing calculations.
Understand the concept of tidy data and why it is important.
Learn how to reshape data using pivot_longer() and
pivot_wider().
Separate and unite columns using separate() and
unite().
Handle missing data with drop_na() and
replace_na().
Gain proficiency in preparing messy datasets for analysis.
Tidy data is a standardized way of organizing data that makes it easier to work with for analysis. In a tidy dataset:
Each variable forms a column.
Each observation forms a row.
Each value forms a cell.
| year | China_population | India_population | USA_population |
|---|---|---|---|
| 2000 | 1.26 billion | 1.05 billion | 282 million |
| 2010 | 1.34 billion | 1.21 billion | 309 million |
| Year | Country | Population |
|---|---|---|
| 2000 | China | 1.26 billion |
| 2000 | India | 1.05 billion |
| 2000 | USA | 282 million |
| 2010 | China | 1.34 billion |
| 2010 | India | 1.21 billion |
| 2010 | USA | 309 million |
| Function | Purpose |
|---|---|
| pivot_longer( ) | Converts wide data into long format |
| pivot_wider( ) | Converts long data back into wide format |
| separate( ) | Splits one column into multiple columns |
| unite( ) | Combines multiple columns into a single column |
| drop_na( ) | Removes rows with missing values |
| replace_na( ) | Replaces missing values with specified values |
pivot_longer()Use pivot_longer() when you have wide
data, where multiple columns represent the same variable (e.g.,
different years or measurements), and you need to gather them into a
long (tidy) format where each variable has its own
column.
# Create a wide dataset with quarterly sales
sales_data <- data.frame(
year = c(2020, 2021),
q1_sales = c(15000, 18000),
q2_sales = c(20000, 21000),
q3_sales = c(25000, 26000),
q4_sales = c(30000, 31000)
)
sales_data
## year q1_sales q2_sales q3_sales q4_sales
## 1 2020 15000 20000 25000 30000
## 2 2021 18000 21000 26000 31000
# Pivoting the sales data to long format
sales_long <- sales_data %>%
pivot_longer(cols = starts_with("q"),
names_to = "quarter",
values_to = "sales")
# View the long format data
head(sales_long)
## # A tibble: 6 × 3
## year quarter sales
## <dbl> <chr> <dbl>
## 1 2020 q1_sales 15000
## 2 2020 q2_sales 20000
## 3 2020 q3_sales 25000
## 4 2020 q4_sales 30000
## 5 2021 q1_sales 18000
## 6 2021 q2_sales 21000
# Create a wide dataset for population
population_data <- data.frame(
year = c(2000, 2010),
china_population = c(1.26, 1.34),
india_population = c(1.05, 1.21),
usa_population = c(0.28, 0.31)
)
population_data
## year china_population india_population usa_population
## 1 2000 1.26 1.05 0.28
## 2 2010 1.34 1.21 0.31
# Pivoting the population data to long format
population_long <- population_data %>%
pivot_longer(cols = ends_with("population"),
names_to = "country",
values_to = "population")
# View the long format data
head(population_long)
## # A tibble: 6 × 3
## year country population
## <dbl> <chr> <dbl>
## 1 2000 china_population 1.26
## 2 2000 india_population 1.05
## 3 2000 usa_population 0.28
## 4 2010 china_population 1.34
## 5 2010 india_population 1.21
## 6 2010 usa_population 0.31
# Wide dataset of temperatures
temperature_data <- data.frame(
year = c(2020, 2021),
spring_temp = c(15.5, 16.0),
summer_temp = c(25.3, 26.1),
fall_temp = c(10.2, 9.8),
winter_temp = c(0.5, -1.0)
)
# Pivoting temperature data to long format
temperature_long <- temperature_data %>%
pivot_longer(cols = ends_with("temp"),
names_to = "season",
values_to = "temperature")
# View the long format data
head(temperature_long)
## # A tibble: 6 × 3
## year season temperature
## <dbl> <chr> <dbl>
## 1 2020 spring_temp 15.5
## 2 2020 summer_temp 25.3
## 3 2020 fall_temp 10.2
## 4 2020 winter_temp 0.5
## 5 2021 spring_temp 16
## 6 2021 summer_temp 26.1
pivot_wider()Use pivot_wider() when you have long data and need to spread it into a wide format for easier comparisons between different categories or variables.
# Pivoting the long sales data back to wide format
sales_wide <- sales_long %>%
pivot_wider(names_from = "quarter",
values_from = "sales")
# View the wide format data
head(sales_wide)
## # A tibble: 2 × 5
## year q1_sales q2_sales q3_sales q4_sales
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 15000 20000 25000 30000
## 2 2021 18000 21000 26000 31000
# Pivoting long population data back to wide format
population_wide <- population_long %>%
pivot_wider(names_from = "country",
values_from = "population")
# View the wide format data
head(population_wide)
## # A tibble: 2 × 4
## year china_population india_population usa_population
## <dbl> <dbl> <dbl> <dbl>
## 1 2000 1.26 1.05 0.28
## 2 2010 1.34 1.21 0.31
# Pivoting temperature data back to wide format
temperature_wide <- temperature_long %>%
pivot_wider(names_from = "season",
values_from = "temperature")
# View the wide format data
head(temperature_wide)
## # A tibble: 2 × 5
## year spring_temp summer_temp fall_temp winter_temp
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 15.5 25.3 10.2 0.5
## 2 2021 16 26.1 9.8 -1
separate()Use separate( ) when a single column contains multiple pieces of information (e.g., a combined date, or a column like “city-country”) and you need to split it into multiple columns.
# Example of a combined date column
date_data <- data.frame(
transaction_date = c("2021-10-01", "2021-11-15", "2021-12-03")
)
date_data
## transaction_date
## 1 2021-10-01
## 2 2021-11-15
## 3 2021-12-03
# Separate date into year, month, and day
date_separated <- date_data %>%
separate(transaction_date, into = c("year", "month", "day"), sep = "-")
# View the separated data
head(date_separated)
## year month day
## 1 2021 10 01
## 2 2021 11 15
## 3 2021 12 03
# Example of a location column
location_data <- data.frame(
location = c("New York-USA", "London-UK", "Tokyo-Japan")
)
location_data
## location
## 1 New York-USA
## 2 London-UK
## 3 Tokyo-Japan
# Separate location into city and country
location_separated <- location_data %>%
separate(location, into = c("city", "country"), sep = "-")
# View the separated data
head(location_separated)
## city country
## 1 New York USA
## 2 London UK
## 3 Tokyo Japan
# Product data
product_data <- data.frame(
product_info = c("123-Electronics", "456-Furniture", "789-Clothing")
)
# Separate product_info into product_id and category
product_separated <- product_data %>%
separate(product_info, into = c("product_id", "category"), sep = "-")
# View the separated data
head(product_separated)
## product_id category
## 1 123 Electronics
## 2 456 Furniture
## 3 789 Clothing
unite()Use unite() when you have multiple columns that should be combined into one (e.g., combining first and last names into a full name, or combining date components into a single date field).
names_data <- data.frame(
First_Name = c("Alice", "Bob"),
Last_Name = c("Johnson", "Miller"),
Age = c(30, 25)
)
names_data
## First_Name Last_Name Age
## 1 Alice Johnson 30
## 2 Bob Miller 25
# Unite First_Name and Last_Name into Full_Name
names_united <- names_data %>%
unite(Full_Name, First_Name, Last_Name, sep = " ")
# View the united data
head(names_united)
## Full_Name Age
## 1 Alice Johnson 30
## 2 Bob Miller 25
date_data <- data.frame(
Year = c(2021, 2021),
Month = c(10, 10),
Day = c(1, 2),
Sales = c(15000, 18000)
)
date_data
## Year Month Day Sales
## 1 2021 10 1 15000
## 2 2021 10 2 18000
# Unite Year, Month, and Day into a single Date column
date_united <- date_data %>%
unite(Date, Year, Month, Day, sep = "-")
# View the united data
head(date_united)
## Date Sales
## 1 2021-10-1 15000
## 2 2021-10-2 18000
# Sample address data
address_data <- data.frame(
Street = c("123 Main St", "456 Elm St"),
City = c("Springfield", "Greenfield"),
State = c("IL", "MA")
)
address_data
## Street City State
## 1 123 Main St Springfield IL
## 2 456 Elm St Greenfield MA
# Unite Street, City, and State into a single Address column
address_united <- address_data %>%
unite(Address, Street, City, State, sep = ", ")
# View the united data
head(address_united)
## Address
## 1 123 Main St, Springfield, IL
## 2 456 Elm St, Greenfield, MA
drop_na() and
replace_na()Use drop_na( ) to remove rows with missing values, and
use replace_na( ) to fill in missing values with default
values (such as the column mean or median).
# Sample sales data with NA values
sales_data <- data.frame(
Product = c("A", "B", "C", "D"),
Sales = c(15000, NA, 18000, NA)
)
sales_data
## Product Sales
## 1 A 15000
## 2 B NA
## 3 C 18000
## 4 D NA
# Drop rows with missing sales values
cleaned_sales_data <- sales_data %>%
drop_na(Sales)
# View the cleaned data
head(cleaned_sales_data)
## Product Sales
## 1 A 15000
## 2 C 18000
# Replace missing sales values with the mean
filled_sales_data <- sales_data %>%
mutate(Sales = replace_na(Sales, mean(Sales, na.rm = TRUE)))
# View the filled data
head(filled_sales_data)
## Product Sales
## 1 A 15000
## 2 B 16500
## 3 C 18000
## 4 D 16500
# Sample product data with NA values
product_data <- data.frame(
Product = c("E", "F", "G", "H"),
Sales = c(22000, NA, 25000, 27000)
)
product_data
## Product Sales
## 1 E 22000
## 2 F NA
## 3 G 25000
## 4 H 27000
# Introduce missing values in the Sales column
product_data$Sales[c(2)] <- NA
# Drop rows with missing values
cleaned_product_data <- product_data %>%
drop_na(Sales)
# Replace missing values in Sales with the median
filled_product_data <- product_data %>%
mutate(Sales = replace_na(Sales, median(Sales, na.rm = TRUE)))
# View the cleaned and filled data
head(filled_product_data)
## Product Sales
## 1 E 22000
## 2 F 25000
## 3 G 25000
## 4 H 27000
Reshaping Data with pivot_longer(): Convert wide
data into long (tidy) format.
Reshaping Data with pivot_wider(): Convert long data
back into wide format.
Splitting Columns with separate(): Split a single
column into multiple columns based on a delimiter.
Combining Columns with unite(): Combine multiple
columns into a single column.
Handling Missing Data with drop_na() and
replace_na(): Drop or replace missing values in your
dataset.
Learn how to import data from various file formats (CSV, Excel).
Understand how to handle different data sources (databases).
Learn to specify options during data import to handle irregularities and missing values.
Familiarize yourself with tools for importing data in R, particularly the readr and readxl packages.