Overview

This intensive course introduces the core concepts of data science in R: visualizing, transforming, tidying, and importing data. Through guided walkthroughs and hands-on exercises, you will learn the essential skills for each of these tasks using R’s tidyverse packages. No prior R experience is required.

Outline

Lecture 1: Data visualization

Learning Objectives

  • Install and set up R and RStudio
  • Understand basic programming concepts like functions and variables
  • Explore datasets in R
  • Create visualizations from scratch using ggplot2
  • Map variables to different visual elements (aesthetics)
  • Customize plots to communicate insights effectively

Preliminaries on R and RStudio

What is R?

  • R is a programming language designed for data analysis and statistics.
  • It is widely used in fields like data science, finance, and academia.
  • With R, you can:
    • Perform complex statistical operations
    • Visualize data
    • Handle large datasets

What is RStudio?

  • RStudio is an Integrated Development Environment (IDE) for R.

  • It helps you:

    • Write and run R code
    • Organize projects
    • Create visualizations
    • Debug your code
  • Panels in RStudio:

    • Console: Where you type and run R code.

    • Environment: Displays your data and variables.

    • Files/Plots: Shows files, plots, and outputs.

Installing R and RStudio

  1. Download and install R from CRAN.
  2. Download and install RStudio from RStudio’s website.
  3. Open RStudio and explore the interface!

Running Basic R Code

Let’s start simple by running a basic calculation in R. Type the following code in the Console:

1 + 1 

2 * 3

2 * (3 + 6)

8 / 9
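The learning objectives mention variables and functions, so here is a minimal sketch of both. The variable names (flipper_mm, masses_g, and so on) are made up for illustration and are not part of any dataset used later.

```r
# Store a value in a variable with the assignment operator <-
flipper_mm <- 181

# Use the variable in a calculation (convert millimetres to centimetres)
flipper_cm <- flipper_mm / 10

# Call a function: a function name followed by its arguments in parentheses
rounded <- round(8 / 9, digits = 2)

# Combine several values into a vector with c(), then summarise it
masses_g <- c(3750, 3800, 3250)
mean_mass_g <- mean(masses_g)

print(flipper_cm)
print(rounded)
print(mean_mass_g)
```

After running this, the variables should appear in the Environment panel of RStudio.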

Visualizing with ggplot2

Tidyverse Package in R

  • R can be extended by installing packages, which are collections of code that provide additional functionality.

  • The tidyverse is a collection of R packages that make working with data easier. Install it once, then load it at the start of every session by running the following code in your console:

# Run once to install (remove the leading # first)
# install.packages("tidyverse")

library(tidyverse)
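If you are not sure whether a package is already installed, one possible approach is to check first and install only what is missing. This is a sketch rather than a required step; the variable names pkgs, installed, and missing_pkgs are illustrative, and install.packages() needs an internet connection.

```r
# Packages used in this lecture
pkgs <- c("tidyverse", "palmerpenguins")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that are not installed yet
missing_pkgs <- pkgs[!installed]

# Install the missing packages, if there are any
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```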

What is a Dataset?

  • A dataset is a table where:
    • Each row is an observation.

    • Each column is a variable.

    • Each cell is a value.

# Load the tidyverse and the palmerpenguins data package
# (install palmerpenguins once with install.packages("palmerpenguins"))
library(tidyverse)
library(palmerpenguins)

# Load the penguins dataset
data(penguins)

# View the dataset
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# View(penguins)
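To connect the row/column/cell definition to the penguins data, the sketch below pulls out the number of observations, the number of variables, one column, and one cell. The variable names on the left of each <- are illustrative.

```r
library(palmerpenguins)

# Observations are rows, variables are columns
n_obs  <- nrow(penguins)
n_vars <- ncol(penguins)

# The names of all variables
var_names <- names(penguins)

# One variable: the $ operator extracts a whole column
first_masses <- penguins$body_mass_g[1:3]

# One value: the bill length recorded in the first row
first_bill <- penguins$bill_length_mm[1]

# Missing values appear as NA; count them in one column
n_missing_mass <- sum(is.na(penguins$body_mass_g))

print(n_obs)
print(n_vars)
print(first_bill)
```

The row and column counts should match the "Rows" and "Columns" lines printed by glimpse().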

Creating Visualizations with ggplot2

  • Template: ggplot(data = <DATA>, mapping = aes(x = <X>, y = <Y>)) + geom_point()

    • ggplot(): Starts the plot.

    • aes(): Defines the “aesthetic mappings” (how variables are mapped to visual properties).

    • geom_point(): Adds points to create a scatter plot.

library(ggplot2) 
library(palmerpenguins)

# Scatter plot
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()

Adding Aesthetic Mappings

  • You can map variables to visual properties such as color, size, and shape.
# Scatter plot with points colored by species
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) + # Color the points by species
  geom_point()
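Color is only one of the available aesthetics. As a sketch of mapping a second variable, the plot below also maps island to point shape; the object name p_shapes is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Color shows the species, shape shows the island
p_shapes <- ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    y = body_mass_g,
    color = species,
    shape = island
  )
) +
  geom_point()

# Print the plot object to draw it
p_shapes
```

Shape works best for categorical variables with only a few levels, which is why island is used here rather than a numeric column.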

Customizing Plots

  • You can add titles and axis labels to a plot using labs().
# Scatter plot with titles and labels
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass by Species",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
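labs() accepts more than a title and axis labels. The sketch below also sets a subtitle, a caption, and the legend title (through the name of the mapped aesthetic, color). The wording of the labels is only an example.

```r
library(ggplot2)
library(palmerpenguins)

p_labels <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass by Species",
    subtitle = "Each point is one penguin",       # shown under the title
    caption = "Data: palmerpenguins package",     # shown below the plot
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"                             # legend title
  )

p_labels
```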

Exercise 1: Create a scatter plot showing the relationship between bill length and body mass, and color the points by species.

Faceting: Creating Multiple Plots

  • Use facets to split a plot into multiple panels, one for each value of a categorical variable.
# Faceted scatter plot by island
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  facet_wrap(~ island) + # facet by categorical variables
  labs(
    title = "Flipper Length vs. Body Mass by Island",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

Exercise 2: Create a faceted scatter plot showing bill length vs body mass, with separate plots for each species.

Customizing the Appearance with Themes

  • In ggplot2, themes control the overall appearance of the non-data elements in your plots, such as:

    • Background colors
    • Gridlines
    • Axis labels and ticks
    • Font sizes and styles
    • Legend appearance
  • Commonly Used Themes

  1. theme_minimal(): A clean, minimalistic theme with no background fill or panel border; light gridlines are kept.
# Scatter plot with theme_minimal
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Minimal Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal()

  2. theme_classic(): A theme with a white background, black axis lines, and no gridlines.
# Scatter plot with theme_classic
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Classic Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_classic()

  3. theme_bw(): A black-and-white theme with gridlines and a panel border, suitable for formal presentations or papers.
# Scatter plot with theme_bw
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Black and White Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_bw()

  4. theme_void(): A completely empty theme, useful when you want to remove all non-data elements (like axis lines and labels).
# Scatter plot with theme_void
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Void Theme)"
  ) +
  theme_void()
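Built-in themes can be fine-tuned with theme(). The sketch below starts from theme_minimal() and then adjusts three of the elements listed above: the legend position, the minor gridlines, and the title font. The specific settings are illustrative choices, not recommendations.

```r
library(ggplot2)
library(palmerpenguins)

p_custom <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Customized Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",                          # move the legend
    panel.grid.minor = element_blank(),                  # remove minor gridlines
    plot.title = element_text(size = 14, face = "bold")  # restyle the title
  )

p_custom
```

The call to theme() comes after theme_minimal() because a complete theme added later would overwrite the individual adjustments.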

Exercise 3: Apply a different theme (e.g., theme_classic(), theme_bw()) to the scatter plot you created in Exercise 1.

Adding Trendlines

  • Trendlines are lines added to a plot that help visualize the general direction or pattern of data.

  • They are often used in scatter plots to reveal relationships between two variables, such as whether there is a positive or negative correlation.

  • In ggplot2, trendlines (or smoothers) are added to plots using the geom_smooth() function.

# Scatter plot with a linear trendline
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +  # Adds scatter plot
  geom_smooth(method = "lm") +  # Adds one linear trendline per species, because color is mapped in ggplot()
  labs(
    title = "Flipper Length vs. Body Mass with Linear Trendline",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )
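Because color = species is set in ggplot(), the plot above fits a separate line for each species. If a single overall trendline is wanted, one option is to map color only inside geom_point(), as sketched below. The second plot swaps in method = "loess" to show a non-linear smoother. The object names p_overall and p_loess are illustrative, and formula = y ~ x is spelled out only to avoid the message ggplot2 prints when it is left implicit.

```r
library(ggplot2)
library(palmerpenguins)

# Color is mapped only for the points, so geom_smooth() sees one group
p_overall <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm", formula = y ~ x, color = "black") +
  labs(
    title = "Flipper Length vs. Body Mass with One Overall Trendline",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

# Same plot with a non-linear (loess) smoother instead of a straight line
p_loess <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "loess", formula = y ~ x, color = "black")

p_overall
p_loess
```

Both plots will warn that rows with missing values were removed, since two penguins have no recorded measurements.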

Exercise 4: Add a trendline to the scatter plot you created in Exercise 1.

Summary: key concepts covered

  1. R and RStudio Setup: installed and set up R and RStudio, and explored the basic RStudio interface (Console, Environment, and Files/Plots panes).

  2. Running Basic R Code: ran simple arithmetic in the Console and used comments (#) to annotate code.

  3. Working with Packages: installed and loaded the tidyverse, a collection of packages for data science that includes ggplot2 for data visualization.

  4. Understanding Datasets: introduced datasets as tables of observations (rows) and variables (columns), and explored the penguins dataset from the palmerpenguins package.

  5. Creating Visualizations with ggplot2: mapped variables to the x and y axes using aes(), and added layers like geom_point() to display the data.

  6. Customizing Visualizations: added titles and axis labels using labs(), and colored points by species using the color aesthetic.

  7. Faceting: used facet_wrap() to create separate panels for different subsets of the data.

  8. Using Themes: explored built-in themes such as theme_minimal(), theme_classic(), theme_bw(), and theme_void(), which restyle non-data elements like backgrounds, gridlines, and fonts.

  9. Trendlines: added linear trendlines using geom_smooth(method = "lm"); setting method = "loess" produces a non-linear smoother instead.