This intensive course introduces the core concepts of data science using R, focusing on data visualization, transformation, tidying, and importing. Through detailed introductions and hands-on exercises, you will learn the essential skills to import, tidy, transform, and visualize data with R’s tidyverse packages. No prior R experience is required.
Lecture 1: Data visualization
Lecture 2: Data transformation
Lecture 3: Data tidying
Lecture 4: Data import
Understand basic programming concepts like functions and variables
Explore datasets in R
Create visualizations from scratch using ggplot2
Map variables to different visual elements (aesthetics)
Customize plots to communicate insights effectively
RStudio is an Integrated Development Environment (IDE) for R.
It helps you:
Panels in RStudio:
Console: Where you type and run R code.
Environment: Displays your data and variables.
Files/Plots: Shows files, plots, and outputs.
Let’s start with a few basic calculations in R. Type each line below in the Console and press Enter:
1 + 1
2 * 3
2 * (3 + 6)
8 / 9
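The learning objectives mention variables and functions, so here is a minimal sketch of both. The names `flipper_mm`, `flipper_cm`, and `mm_to_cm` are hypothetical and chosen only for illustration.

```r
# Store a value in a variable with the assignment operator <-
flipper_mm <- 181

# Use the variable in a calculation (convert millimetres to centimetres)
flipper_cm <- flipper_mm / 10
flipper_cm

# Call a built-in function: sqrt() returns the square root of its argument
sqrt(16)

# Define a small function of your own (mm_to_cm is a hypothetical helper)
mm_to_cm <- function(mm) {
  mm / 10
}
mm_to_cm(195)
```

Lines that start with `#` are comments: R ignores them, so they are a convenient way to annotate code.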
R can be extended by installing packages—sets of code that provide additional functionality.
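The tidyverse is a collection of R packages that make working with data easier. To install and load the tidyverse, run the following code in the Console. A package only needs to be installed once, so uncomment the first line the first time you run it: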
# install.packages("tidyverse")
library(tidyverse)
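If you are not sure whether a package is already installed, one possible check is sketched below. `missing_pkgs()` is a hypothetical helper written for this example, not part of the tidyverse; it relies only on base R's `requireNamespace()`, and it prints a reminder instead of installing anything.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("tidyverse", "palmerpenguins"))

if (length(to_install) > 0) {
  # Print a reminder rather than installing automatically
  message("Not installed yet: ", paste(to_install, collapse = ", "),
          ". Install each one once with install.packages().")
} else {
  message("All packages are already installed.")
}
```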
Each row is an observation.
Each column is a variable.
Each cell is a value.
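To make the three rules concrete, the sketch below builds a tiny tidy table by hand. The table name `toy_penguins` and the values in it are made up for illustration; they are not taken from a real dataset.

```r
library(tibble)

# A tiny tidy table: each row is one observation (one penguin),
# each column is one variable, and each cell holds a single value.
# The values are invented for this example.
toy_penguins <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  flipper_length_mm = c(181, 217, 195),
  body_mass_g = c(3750, 5200, 3650)
)

nrow(toy_penguins)        # number of observations (rows)
ncol(toy_penguins)        # number of variables (columns)
toy_penguins$species[2]   # a single value (one cell)
```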
# Load the tidyverse and palmerpenguins packages
library(tidyverse)
library(palmerpenguins)
# Load the penguins dataset
data(penguins)
# View the first six rows of the dataset
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# Open the full dataset in the RStudio data viewer (uncomment to run)
# View(penguins)
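Beyond `head()` and `glimpse()`, a few base R functions give a quick overview of a dataset. The sketch below is one possible way to explore `penguins`; it assumes the palmerpenguins package is installed.

```r
library(palmerpenguins)

# Dimensions: number of rows (observations) and columns (variables)
dim(penguins)

# Column (variable) names
names(penguins)

# Count the missing values (NA) in each column
colSums(is.na(penguins))

# Average body mass, ignoring missing values
mean(penguins$body_mass_g, na.rm = TRUE)
```

The `na.rm = TRUE` argument matters here: without it, `mean()` returns `NA` whenever the column contains a missing value.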
ggplot(data = DATA, mapping = aes(x = X_VARIABLE, y = Y_VARIABLE)) + geom_point()
ggplot(): Starts the plot.
aes(): Defines the “aesthetic mappings” (how variables are mapped to visual properties).
geom_point(): Adds points to create a scatter plot.
library(ggplot2)
library(palmerpenguins)
# Scatter plot
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()
# Scatter plot with points colored by species
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) + # Color the points by species
  geom_point()
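Color is only one of several aesthetics. As a sketch, the example below also maps species to point shape, and it sets `size` and `alpha` to fixed values outside `aes()` to show the difference between mapping a variable and setting a constant. The object name `p_shape` is an arbitrary choice for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Map species to both color and shape (inside aes()),
# and set size and transparency to fixed values (outside aes())
p_shape <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g,
                color = species, shape = species)
) +
  geom_point(size = 2, alpha = 0.7)

# Print the plot object to draw it
p_shape
```

Anything inside `aes()` varies with the data; anything outside it applies equally to every point.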
# Scatter plot with titles and labels
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass by Species",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
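`labs()` accepts more than a title and axis labels. The sketch below adds a subtitle, a caption, and a legend title (by naming the `color` aesthetic). The wording of the labels and the object name `p_labeled` are illustrative choices only.

```r
library(ggplot2)
library(palmerpenguins)

p_labeled <- ggplot(data = penguins,
                    mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass by Species",
    subtitle = "Each point is one penguin",
    caption = "Data: palmerpenguins package",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"   # legend title for the color aesthetic
  )

p_labeled
```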
# Faceted scatter plot by island
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  facet_wrap(~ island) + # One panel per level of the categorical variable island
  labs(
    title = "Flipper Length vs. Body Mass by Island",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
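In ggplot2, themes control the overall appearance of the non-data elements in your plots, such as fonts, gridlines, and legends.
Commonly used themes include theme_minimal(), theme_classic(), theme_bw(), and theme_void():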
# Scatter plot with theme_minimal
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Minimal Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal()
# Scatter plot with theme_classic
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Classic Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_classic()
# Scatter plot with theme_bw
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Black and White Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_bw()
# Scatter plot with theme_void
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Void Theme)"
  ) +
  theme_void()
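Built-in themes can be fine-tuned with `theme()`. The sketch below shows one way to adjust the title font, remove the minor gridlines, and move the legend; the specific values (size 14, bold, bottom) and the object name `p_custom` are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_custom <- ggplot(data = penguins,
                   mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Customized Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),  # adjust the title font
    panel.grid.minor = element_blank(),                   # remove minor gridlines
    legend.position = "bottom"                            # move the legend below the plot
  )

p_custom
```

Place `theme()` after the built-in theme: a complete theme such as `theme_minimal()` replaces the theme settings that precede it, so earlier tweaks would be lost.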
Trendlines are lines added to a plot that help visualize the general direction or pattern of data.
They are often used in scatter plots to reveal relationships between two variables, such as whether there is a positive or negative correlation.
In ggplot2, trendlines (or smoothers) are added to plots using the geom_smooth() function.
# Scatter plot with a linear trendline
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + # Adds the scatter plot points
  geom_smooth(method = "lm") + # Adds a linear regression trendline for each species
  labs(
    title = "Flipper Length vs. Body Mass with Linear Trendline",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
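Trendlines do not have to be straight. As a sketch, the example below swaps in a non-linear smoother with `method = "loess"`. It also moves the `color` mapping into `geom_point()` so that a single smoother is fitted across all species, and it hides the confidence band with `se = FALSE`. The object name `p_loess` is an arbitrary choice.

```r
library(ggplot2)
library(palmerpenguins)

p_loess <- ggplot(data = penguins,
                  mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +  # color applies to the points only
  geom_smooth(method = "loess", se = FALSE) +   # one non-linear smoother, no confidence band
  labs(
    title = "Flipper Length vs. Body Mass with Loess Smoother",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

p_loess
```

Aesthetics set in `ggplot()` are inherited by every layer, while aesthetics set inside a geom apply to that layer alone. That is why this version draws one curve instead of one per species.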
R and RStudio Setup: set up R and RStudio and explored the basic RStudio interface (Console, Environment, and Files/Plots panes).
Running Basic R Code: ran simple arithmetic in R and used comments (#) to annotate code.
Working with Packages: installed and loaded the tidyverse, which contains essential tools for data science, including ggplot2 for data visualization.
Understanding Datasets: introduced the concept of datasets and used the penguins dataset from the palmerpenguins package to explore real-world data.
Creating Visualizations with ggplot2: mapped variables to the x and y axes using aes() and added layers like geom_point() to display data.
Customizing Visualizations: added titles and axis labels using labs(), and color-coded points by species using the color aesthetic.
Faceting: used facet_wrap() to create multiple plots for different subsets of the data.
Using Themes: explored built-in themes like theme_minimal() and theme_classic(), and learned how to customize plot appearance by adjusting fonts, removing gridlines, and moving legends.
Trendlines: added linear trendlines (method = "lm") and non-linear smoothers (method = "loess") with geom_smooth() to visualize patterns in the data.