This intensive course introduces the core concepts of data science using R, focusing on data visualization, transformation, tidying, and importing. Through detailed introductions and hands-on exercises, you will learn the essential skills to import, tidy, transform, and visualize data with R’s tidyverse packages. No prior R experience is required.
RStudio is an Integrated Development Environment (IDE) for R.
It helps you:
Panels in RStudio:
Console: Where you type and run R code.
Environment: Displays your data and variables.
Files/Plots: Shows files, plots, and outputs.
Let’s start simple by running a basic calculation in R. Type the following code in the Console:
1 + 1
2 * 3
2 * (3 + 6)
8 / 9
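Results can also be stored and annotated. The short sketch below uses only base R; the variable names `subtotal` and `total` are made up for illustration.

```r
# Anything after a # is a comment: R ignores it when running the code.
subtotal <- 3 + 6        # store a result in a variable with <-
total <- 2 * subtotal    # reuse the stored value in a later calculation
total                    # typing a name prints its value
```

After running these lines, `subtotal` and `total` should appear in the Environment panel.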
R can be extended by installing packages—sets of code that provide additional functionality.
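The tidyverse is a collection of R packages that make working with data easier. To install and load the tidyverse, run the following code in the Console (remove the leading # from the first line the first time, since a package only needs to be installed once):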
# install.packages("tidyverse")
library(tidyverse)
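One way to avoid reinstalling a package every time a script runs is to check for it first. The helper below is a minimal sketch: `ensure_package()` is a hypothetical name, not part of R or the tidyverse, and it relies only on the base functions `requireNamespace()` and `install.packages()`.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_package("stats")        # ships with R, so nothing is installed
# ensure_package("tidyverse")  # would install the tidyverse only if it is missing
```

Installing still happens once per computer; `library()` is what you run in every new session.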
Each row is an observation.
Each column is a variable.
Each cell is a value.
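These three rules can be seen in a very small table. The sketch below builds one by hand with base R's `data.frame()`, using values copied from the first three rows of the penguins data shown in this section; the name `penguin_sample` is made up for illustration.

```r
# A tiny tidy table: three observations (rows) of three variables (columns).
penguin_sample <- data.frame(
  species           = c("Adelie", "Adelie", "Adelie"),
  flipper_length_mm = c(181, 186, 195),
  body_mass_g       = c(3750, 3800, 3250)
)

nrow(penguin_sample)             # number of observations (rows)
names(penguin_sample)            # the variables (columns)
penguin_sample$body_mass_g[2]    # one value (cell): row 2 of body_mass_g
```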
# Load the tidyverse and palmerpenguins packages
library(tidyverse)
library(palmerpenguins)
# Load the penguins dataset
data(penguins)
# View the dataset
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# View(penguins)
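A few base R functions give the same overview in smaller pieces. This is a sketch assuming the palmerpenguins package is installed; the row and column counts should match the glimpse() output above.

```r
library(palmerpenguins)

dim(penguins)             # number of rows and columns
names(penguins)           # column (variable) names
table(penguins$species)   # number of observations for each species
```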
ggplot(data = <DATA>, mapping = aes(x = <X>, y = <Y>)) + geom_point()
ggplot(): Starts the plot.
aes(): Defines the “aesthetic mappings” (how variables are mapped to visual properties).
geom_point(): Adds points to create a scatter plot.
library(ggplot2)
library(palmerpenguins)
# Scatter plot
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()
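To see what each piece contributes, the same plot can be built in two steps and stored in variables. This is a sketch; the names `base_plot` and `scatter_plot` are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Step 1: ggplot() with data and mappings sets up the axes but draws no data.
base_plot <- ggplot(data = penguins,
                    mapping = aes(x = flipper_length_mm, y = body_mass_g))

# Step 2: adding a geom layer draws the data on that canvas.
scatter_plot <- base_plot + geom_point()

scatter_plot  # printing the object displays the plot
```

R may warn that some rows were removed because of missing values; row 4 in the head() output above is one of them.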
# Scatter plot with points colored by species
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) + # Color the points by species
  geom_point()
# Scatter plot with titles and labels
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass by Species",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
# Faceted scatter plot by island
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  facet_wrap(~ island) + # facet by a categorical variable
  labs(
    title = "Flipper Length vs. Body Mass by Island",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
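In ggplot2, themes control the overall appearance of the non-data elements in your plots, such as fonts, gridlines, and the legend position.
Commonly used themes include theme_minimal(), theme_classic(), theme_bw(), and theme_void():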
# Scatter plot with theme_minimal
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Minimal Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal()
# Scatter plot with theme_classic
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Classic Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_classic()
# Scatter plot with theme_bw
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Black and White Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_bw()
# Scatter plot with theme_void
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Void Theme)"
  ) +
  theme_void()
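A built-in theme can be adjusted further with theme(). The sketch below shows one possible combination of font, gridline, and legend changes; the specific settings and the name `custom_plot` are illustrative choices, not recommendations from the course.

```r
library(ggplot2)
library(palmerpenguins)

custom_plot <- ggplot(data = penguins,
                      mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Customized Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal() +                                       # start from a built-in theme
  theme(
    plot.title = element_text(size = 14, face = "bold"),  # adjust the title font
    panel.grid.minor = element_blank(),                   # remove minor gridlines
    legend.position = "bottom"                            # move the legend below the plot
  )

custom_plot
```

The theme() call comes after theme_minimal() because a complete theme added later would overwrite the individual settings.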
Trendlines are lines added to a plot that help visualize the general direction or pattern of data.
They are often used in scatter plots to reveal relationships between two variables, such as whether there is a positive or negative correlation.
In ggplot2, trendlines (or smoothers) are added to plots using the geom_smooth() function.
# Scatter plot with a linear trendline
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + # Adds the points
  geom_smooth(method = "lm") + # Adds a linear regression trendline (one per species, because color is mapped in ggplot())
  labs(
    title = "Flipper Length vs. Body Mass with Linear Trendline",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
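A non-linear smoother follows the same pattern with method = "loess". This sketch leaves out the color mapping, so a single curve is fitted to all penguins, and sets se = FALSE to hide the confidence band; both are illustrative choices, as is the name `loess_plot`.

```r
library(ggplot2)
library(palmerpenguins)

loess_plot <- ggplot(data = penguins,
                     mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +                               # Adds the points
  geom_smooth(method = "loess", se = FALSE) +  # Adds a smooth, non-linear trendline
  labs(
    title = "Flipper Length vs. Body Mass with Loess Smoother",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

loess_plot
```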
R and RStudio Setup: installed and set up R and RStudio, and explored the basic RStudio interface (Console, Environment, Files/Plots panes).
Running Basic R Code: ran simple arithmetic in R and learned how to use comments (#) to annotate code.
Working with Packages: installed and loaded the tidyverse, a collection of packages with essential tools for data science, including ggplot2 for data visualization.
Understanding Datasets: introduced the concept of datasets and used the penguins dataset from the palmerpenguins package to explore real-world data.
Creating Visualizations with ggplot2: mapped variables to the x and y axes using aes(), and added layers like geom_point() to display data.
Customizing Visualizations: added titles and axis labels using labs(), and color-coded points by species using the color aesthetic.
Faceting: used facet_wrap() to create multiple plots for different subsets of the data.
Using Themes: explored built-in themes like theme_minimal() and theme_classic(), and learned how to customize plot appearance by adjusting fonts, removing gridlines, and moving legends.
Trendlines: added trendlines with geom_smooth() to visualize patterns in the data, using method = "lm" for linear trendlines and method = "loess" for non-linear smoothers.