Overview

This intensive course introduces the core concepts of data science using R, focusing on data visualization, transformation, tidying, and importing. Through detailed introduction and hands-on exercises, you will learn the essential skills to manage, transform, visualize, and import data, helping you perform data analysis using R’s tidyverse packages. No prior R experience is required.

Outline

Lecture 1: Data Visualization
Lecture 2: Data transformation
Lecture 3: Data tidying
Lecture 4: Data import

Lecture 1: Data visualization

Learning Objectives

Install and set up R and RStudio
Understand basic programming concepts like functions and variables
Explore datasets in R
Create visualizations from scratch using ggplot2
Map variables to different visual elements (aesthetics)
Customize plots to communicate insights effectively

Preliminaries on R and RStudio

What is R?

R is a programming language designed for data analysis and statistics.
It is widely used in fields like data science, finance, and academia.
With R, you can:
- Perform complex statistical operations
- Visualize data
- Handle large datasets

What is RStudio?

RStudio is an Integrated Development Environment (IDE) for R.
It helps you:
- Write and run R code
- Organize projects
- Create visualizations
- Debug your code
Panels in RStudio:
- Console: Where you type and run R code.
- Environment: Displays your data and variables.
- Files/Plots: Shows files, plots, and outputs.

Installing R and RStudio

Download and install R from CRAN.
Download and install RStudio from RStudio’s website.
Open RStudio and explore the interface!

Running Basic R Code

Let’s start simple by running a basic calculation in R. Type the following code in the Console:

1 + 1 

2 * 3

2 * (3 + 6)

8 / 9

Visualizing with ggplot2

Tidyverse Package in R

R can be extended by installing packages—sets of code that provide additional functionality.
The tidyverse is a collection of R packages that make working with data easier. To install and load the tidyverse, type the codes in your console:

# install.packages("tidyverse")

library(tidyverse)

What is a Dataset?

A dataset is a table where:
- Each row is an observation.
- Each column is a variable.
- Each cell is a value.

# Load the palmerpenguins package
library(tidyverse)
library(palmerpenguins)

# Load the penguins dataset
data(penguins)

# View the dataset
head(penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

glimpse(penguins)

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

# View(penguins)

Creating Visualizations with ggplot2

```
  ggplot(data = , mapping = aes(x = , y = ,) + geom_point( )
```
- ggplot(): Starts the plot.
- aes(): Defines the “aesthetic mappings” (how variables are mapped to visual properties).
- geom_point(): Adds points to create a scatter plot.

library(ggplot2) 
library(palmerpenguins)

# Scatter plot
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()

Adding Aesthetic Mappings

map variables to visual properties like color, size, and shape.

# Scatter plot with points colored by species
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) + # Color the points by species
  geom_point()

Customizing Plots

add titles and labels to our plots using labs().

# Scatter plot with titles and labels
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass by Species",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

Exercise 1: Create a scatter plot showing the relationship between bill length and body mass, and color the points by species.

Faceting: Creating Multiple Plots

use facets to split your plot into multiple panels based on a variable

# Faceted scatter plot by island
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  facet_wrap(~ island) + # facet by categorical variables
  labs(
    title = "Flipper Length vs. Body Mass by Island",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

Exercise 2: Create a faceted scatter plot showing bill length vs body mass, with separate plots for each species.

Customizing the Appearance with Themes

In ggplot2, themes control the overall appearance of the non-data elements in your plots, such as:
- Background colors
- Gridlines
- Axis labels and ticks
- Font sizes and styles
- Legend appearance
Commonly Used Themes

theme_minimal(): A clean, minimalistic theme with no background color or gridlines.

# Scatter plot with theme_minimal
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Minimal Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_minimal()

theme_classic(): A theme with a white background and black axis lines.

# Scatter plot with theme_classic
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Classic Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_classic()

theme_bw(): Black-and-white theme with gridlines, suitable for formal presentations or papers.

# Scatter plot with theme_bw
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Black and White Theme)",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  ) +
  theme_bw()

theme_void(): A completely empty theme, useful when you want to remove all non-data elements (like axis lines and labels).

# Scatter plot with theme_void
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(
    title = "Flipper Length vs. Body Mass (Void Theme)"
  ) +
  theme_void()

Exercise 3: Apply a different theme (e.g., theme_classic(), theme_bw()) to the scatter plot you created in Exercise 1.

Adding Trendlines

Trendlines are lines added to a plot that help visualize the general direction or pattern of data.
They are often used in scatter plots to reveal relationships between two variables, such as whether there is a positive or negative correlation.
In ggplot2, trendlines (or smoothers) are added to plots using the geom_smooth() function.

# Scatter plot with a linear trendline
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +  # Adds scatter plot
  geom_smooth(method = "lm") +  # Adds linear regression trendline
  labs(
    title = "Flipper Length vs. Body Mass with Linear Trendline",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)"
  )

Exercise 4: Add a trendline to the scatter plot you created in Exercise 1.

Summary: Key Concepts Covered

R and RStudio Setup: install and set up R and RStudio, and explored the basic RStudio interface (Console, Environment, Plots/Files panes).
Running Basic R Code: simple arithmetic in R and learned how to use comments (#) to annotate code.
Working with Packages: tidyverse package, which contains essential tools for data science, including ggplot2 for data visualization.
Understanding Datasets: introduced the concept of datasets and used the palmerpenguins dataset to explore real-world data.
Creating Visualizations with ggplot2: how to map variables to the x and y axes using aes(), and added layers like geom_point() to display data.
Customizing Visualizations: added titles, axis labels, and color-coded plots by species using labs() and aesthetics like color.
Faceting: faceting with facet_wrap(), which allows you to create multiple plots for different subsets of data.
Using Themes: explored built-in themes like theme_minimal(), theme_classic(), and learned how to customize plot appearance by adjusting fonts, removing gridlines, and moving legends.
Trendlines: how to add linear trendlines (using method = “lm”, and non-linear smoothers using method = “loess”) to visualize patterns in the data.

Lecture 2: Data Transformation

Learning Objectives

Master the key functions in dplyr for transforming datasets.
Filter and select specific rows and columns from a dataset.
Arrange and sort rows based on the values of specific columns.
Create new variables using mutate().
Summarize and group data for analysis using summarize() and group_by().
Handle missing data and clean your dataset before analysis.

Introduction to Data Transformation

Data transformation is the process of modifying, cleaning, and reshaping raw data into a format that is ready for analysis.
In real-world data science, raw data is often messy or too large to be useful as is, so transforming it into a tidy and usable form is essential.
The dplyr package, part of the tidyverse, provides a set of easy-to-use functions for performing these data manipulation tasks efficiently and intuitively.

Key Functions in dplyr

Function	Description
filter( )	Selects rows based on a condition
select( )	Selects specific columns
arrange( )	Sorts rows based on the values of one or more columns
mutate( )	Creates new columns or modifies existing ones
summarize( )	Summarizes multiple values into a single value
group_by( )	Groups data by one or more columns before summarizing

1. Filtering Rows with `filter( )`

The filter() function is used to subset rows from a dataset based on specific conditions.
- Filtering Rows Based on Numeric Conditions
- Filtering Rows Based on Categorical Values
- Combining Multiple Conditions

# Load required libraries
library(tidyverse)
library(palmerpenguins)

# Filter penguins with flipper length greater than 200mm
filtered_penguins <- penguins %>%
  filter(flipper_length_mm > 200)
head(filtered_penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Dream               35.7          18                 202        3550
## 2 Adelie  Dream               41.1          18.1               205        4300
## 3 Adelie  Dream               40.8          18.9               208        4300
## 4 Adelie  Biscoe              41            20                 203        4725
## 5 Adelie  Torgersen           41.4          18.5               202        3875
## 6 Adelie  Torgersen           44.1          18                 210        4000
## # ℹ 2 more variables: sex <fct>, year <int>

# Keep only the penguins from the island "Biscoe".
filtered_penguins <- penguins %>%
  filter(flipper_length_mm > 200, body_mass_g < 4000)
head(filtered_penguins)

## # A tibble: 6 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Dream               35.7          18                 202        3550
## 2 Adelie    Torgersen           41.4          18.5               202        3875
## 3 Gentoo    Biscoe              42.7          13.7               208        3950
## 4 Chinstrap Dream               49            19.5               210        3950
## 5 Chinstrap Dream               51.4          19                 201        3950
## 6 Chinstrap Dream               50.2          18.8               202        3800
## # ℹ 2 more variables: sex <fct>, year <int>

# Select penguins with a flipper length greater than 200 mm and a body mass under 4000 g.
filtered_penguins <- penguins %>%
  filter(flipper_length_mm > 200, body_mass_g < 4000)
head(filtered_penguins)

## # A tibble: 6 × 8
##   species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie    Dream               35.7          18                 202        3550
## 2 Adelie    Torgersen           41.4          18.5               202        3875
## 3 Gentoo    Biscoe              42.7          13.7               208        3950
## 4 Chinstrap Dream               49            19.5               210        3950
## 5 Chinstrap Dream               51.4          19                 201        3950
## 6 Chinstrap Dream               50.2          18.8               202        3800
## # ℹ 2 more variables: sex <fct>, year <int>

Exercise 1: Filter the penguins dataset to include only penguins with a body mass greater than 4000 grams.

2. Selecting Columns with `select()`

The select() function allows you to choose specific columns from a dataset.
- Reducing Dimensionality
- Reordering Columns
- Dropping Irrelevant Columns

# Select only the columns for species, bill length, and body mass
penguins_selected <- penguins %>%
  select(species, bill_length_mm, body_mass_g)
head(penguins_selected)

## # A tibble: 6 × 3
##   species bill_length_mm body_mass_g
##   <fct>            <dbl>       <int>
## 1 Adelie            39.1        3750
## 2 Adelie            39.5        3800
## 3 Adelie            40.3        3250
## 4 Adelie            NA            NA
## 5 Adelie            36.7        3450
## 6 Adelie            39.3        3650

# Rearranging columns to place species first.
penguins_reordered <- penguins %>%
  select(species, everything())  # Moves species to the first column
head(penguins_reordered)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

# Exclude the island and sex columns from the dataset.
penguins_dropped <- penguins %>%
  select(-island, -sex)
head(penguins_dropped)

## # A tibble: 6 × 6
##   species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##   <fct>            <dbl>         <dbl>             <int>       <int> <int>
## 1 Adelie            39.1          18.7               181        3750  2007
## 2 Adelie            39.5          17.4               186        3800  2007
## 3 Adelie            40.3          18                 195        3250  2007
## 4 Adelie            NA            NA                  NA          NA  2007
## 5 Adelie            36.7          19.3               193        3450  2007
## 6 Adelie            39.3          20.6               190        3650  2007

Exercise 2: Select only the columns species, island, and body_mass_g.

3. Arranging Rows with `arrange( )`

The arrange() function sorts the rows of your dataset based on the values in one or more columns.
- Sorting by a Numeric Column
- Sorting by a Categorical Variable
- Sorting by Multiple Columns

# Sort the dataset by body mass in descending order to find the heaviest penguins.
penguins_sorted <- penguins %>%
  arrange(desc(body_mass_g))
head(penguins_sorted)

## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Gentoo  Biscoe           49.2          15.2               221        6300
## 2 Gentoo  Biscoe           59.6          17                 230        6050
## 3 Gentoo  Biscoe           51.1          16.3               220        6000
## 4 Gentoo  Biscoe           48.8          16.2               222        6000
## 5 Gentoo  Biscoe           45.2          16.4               223        5950
## 6 Gentoo  Biscoe           49.8          15.9               229        5950
## # ℹ 2 more variables: sex <fct>, year <int>

# Sort the dataset alphabetically by species.
penguins_sorted_species <- penguins %>%
  arrange(species)
head(penguins_sorted_species)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

# First sort by island, and then by flipper length.
penguins_sorted_multi <- penguins %>%
  arrange(island, flipper_length_mm)
head(penguins_sorted_multi)

## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Biscoe           37.9          18.6               172        3150
## 2 Adelie  Biscoe           37.8          18.3               174        3400
## 3 Adelie  Biscoe           37.7          18.7               180        3600
## 4 Adelie  Biscoe           38.8          17.2               180        3800
## 5 Adelie  Biscoe           40.5          18.9               180        3950
## 6 Adelie  Biscoe           36.5          16.6               181        2850
## # ℹ 2 more variables: sex <fct>, year <int>

Exercise 3: Arrange the penguins dataset by flipper length in ascending order.

4. Creating New Variables with `mutate()`

The mutate() function is used to create new columns (variables) or modify existing ones.
- Creating a New Column
- Performing Calculations Across Columns

# Create a new column for body mass in kilograms.
penguins <- penguins %>%
  mutate(body_mass_kg = body_mass_g / 1000)
head(penguins)

## # A tibble: 6 × 9
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>

# Create a new column for the difference between maximum and minimum body mass.
penguins <- penguins %>%
  mutate(body_mass_diff = max(body_mass_g, na.rm = TRUE) - min(body_mass_g, na.rm = TRUE))
head(penguins)

## # A tibble: 6 × 10
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 4 more variables: sex <fct>, year <int>, body_mass_kg <dbl>,
## #   body_mass_diff <int>

Exercise 4: Create a new column in the penguins dataset that calculates the flipper length in meters.

5. Summarizing Data with `summarize()` and `group_by()`

The summarize() function condenses your dataset by calculating summary statistics (e.g., mean, median, sum) for one or more columns.
When used with group_by(), you can calculate these statistics for different groups within your data.
- Calculating Average Values for Specific Groups
- Summarizing Counts of Observations
- Aggregating Data for Further Analysis:

# Calculate the average body mass for each species.
penguins_summary <- penguins %>%
  group_by(species) %>%
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
head(penguins_summary)

## # A tibble: 3 × 2
##   species   mean_body_mass_g
##   <fct>                <dbl>
## 1 Adelie               3701.
## 2 Chinstrap            3733.
## 3 Gentoo               5076.

# Count the number of penguins from each island.
penguins_count <- penguins %>%
  group_by(island) %>%
  summarize(count = n())
head(penguins_count)

## # A tibble: 3 × 2
##   island    count
##   <fct>     <int>
## 1 Biscoe      168
## 2 Dream       124
## 3 Torgersen    52

# Calculate the total body mass of penguins for each species
total_mass_summary <- penguins %>%
  group_by(species) %>%
  summarize(total_body_mass_g = sum(body_mass_g, na.rm = TRUE))
head(total_mass_summary)

## # A tibble: 3 × 2
##   species   total_body_mass_g
##   <fct>                 <int>
## 1 Adelie               558800
## 2 Chinstrap            253850
## 3 Gentoo               624350

Exercise 5: Group the penguins dataset by island and calculate the mean body mass for each island.

Handling Missing Data

Sometimes datasets contain missing values, which can interfere with calculations.
Most functions (e.g., mean() or sum()) will return NA if missing data is not handled.
To address this, we use the argument na.rm = TRUE, which tells R to remove missing values before performing calculations.

# Calculate the mean flipper length, ignoring missing values
mean_flipper_length <- mean(penguins$flipper_length_mm, na.rm = TRUE)

mean_flipper_length

## [1] 200.9152

Summary: Key Concepts Covered

Filtering Rows with filter(): Learn how to subset your data based on specific conditions.
Selecting Columns with select(): Focus on only the variables you need by selecting specific columns.
Arranging Rows with arrange(): Sort your data by one or more columns in ascending or descending order.
Creating New Variables with mutate(): Add new columns to your dataset or modify existing ones.
Summarizing Data with summarize() and group_by(): Compute summary statistics (mean, sum, etc.) for groups within your data.
Handling Missing Data: Use na.rm = TRUE to remove missing values before performing calculations.

Lecture 3: Data Tidying

Learning Objectives

Understand the concept of tidy data and why it is important.
Learn how to reshape data using pivot_longer() and pivot_wider().
Separate and unite columns using separate() and unite().
Handle missing data with drop_na() and replace_na().
Gain proficiency in preparing messy datasets for analysis.

What Is Tidy Data?

Tidy data is a standardized way of organizing data that makes it easier to work with for analysis. In a tidy dataset:

Each variable forms a column.
Each observation forms a row.
Each value forms a cell.

Messy vs. Tidy Data

year	China_population	India_population	USA_population
2000	1.26 billion	1.05 billion	282 million
2010	1.34 billion	1.21 billion	309 million

Year	Country	Population
2000	China	1.26 billion
2000	India	1.05 billion
2000	USA	282 million
2010	China	1.34 billion
2010	India	1.21 billion
2010	USA	309 million

Core Functions for Data Tidying from tidyr

Function	Purpose
pivot_longer( )	Converts wide data into long format
pivot_wider( )	Converts long data back into wide format
separate( )	Splits one column into multiple columns
unite( )	Combines multiple columns into a single column
drop_na( )	Removes rows with missing values
replace_na( )	Replaces missing values with specified values

1. Reshaping Data with `pivot_longer()`

Use pivot_longer() when you have wide data, where multiple columns represent the same variable (e.g., different years or measurements), and you need to gather them into a long (tidy) format where each variable has its own column.

Use Case 1: Pivoting Sales Data from Wide to Long Format. Convert quarterly sales data (Q1, Q2, Q3, Q4) into a long format to allow comparison across quarters.

# Create a wide dataset with quarterly sales
sales_data <- data.frame(
  year = c(2020, 2021),
  q1_sales = c(15000, 18000),
  q2_sales = c(20000, 21000),
  q3_sales = c(25000, 26000),
  q4_sales = c(30000, 31000)
)
sales_data

##   year q1_sales q2_sales q3_sales q4_sales
## 1 2020    15000    20000    25000    30000
## 2 2021    18000    21000    26000    31000

# Pivoting the sales data to long format
sales_long <- sales_data %>%
  pivot_longer(cols = starts_with("q"),
               names_to = "quarter",
               values_to = "sales")

# View the long format data
head(sales_long)

## # A tibble: 6 × 3
##    year quarter  sales
##   <dbl> <chr>    <dbl>
## 1  2020 q1_sales 15000
## 2  2020 q2_sales 20000
## 3  2020 q3_sales 25000
## 4  2020 q4_sales 30000
## 5  2021 q1_sales 18000
## 6  2021 q2_sales 21000

Use Case 2: Pivoting Population Data from Wide to Long Format. Transform the population data for different countries from wide to long format to make it easier to analyze population trends by year.

# Create a wide dataset for population
population_data <- data.frame(
  year = c(2000, 2010),
  china_population = c(1.26, 1.34),
  india_population = c(1.05, 1.21),
  usa_population = c(0.28, 0.31)
)
population_data

##   year china_population india_population usa_population
## 1 2000             1.26             1.05           0.28
## 2 2010             1.34             1.21           0.31

# Pivoting the population data to long format
population_long <- population_data %>%
  pivot_longer(cols = ends_with("population"),
               names_to = "country",
               values_to = "population")

# View the long format data
head(population_long)

## # A tibble: 6 × 3
##    year country          population
##   <dbl> <chr>                 <dbl>
## 1  2000 china_population       1.26
## 2  2000 india_population       1.05
## 3  2000 usa_population         0.28
## 4  2010 china_population       1.34
## 5  2010 india_population       1.21
## 6  2010 usa_population         0.31

Exercise 1: Pivoting Temperature Data

Given a wide dataset of average temperatures across four seasons, pivot it into a long format where each season has its own column for comparison.

# Wide dataset of temperatures
temperature_data <- data.frame(
  year = c(2020, 2021),
  spring_temp = c(15.5, 16.0),
  summer_temp = c(25.3, 26.1),
  fall_temp = c(10.2, 9.8),
  winter_temp = c(0.5, -1.0)
)

# Pivoting temperature data to long format
temperature_long <- temperature_data %>%
  pivot_longer(cols = ends_with("temp"),
               names_to = "season",
               values_to = "temperature")

# View the long format data
head(temperature_long)

## # A tibble: 6 × 3
##    year season      temperature
##   <dbl> <chr>             <dbl>
## 1  2020 spring_temp        15.5
## 2  2020 summer_temp        25.3
## 3  2020 fall_temp          10.2
## 4  2020 winter_temp         0.5
## 5  2021 spring_temp        16  
## 6  2021 summer_temp        26.1

2. Reshaping Data with `pivot_wider()`

Use pivot_wider() when you have long data and need to spread it into a wide format for easier comparisons between different categories or variables.

Use Case 1: Pivoting Long Sales Data Back to Wide Format. Convert the long sales_long dataset (created earlier) back to its original wide format for reporting.

# Pivoting the long sales data back to wide format
sales_wide <- sales_long %>%
  pivot_wider(names_from = "quarter",
              values_from = "sales")

# View the wide format data
head(sales_wide)

## # A tibble: 2 × 5
##    year q1_sales q2_sales q3_sales q4_sales
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1  2020    15000    20000    25000    30000
## 2  2021    18000    21000    26000    31000

Use Case 2: Pivoting Long Population Data Back to Wide Format. Convert the long population data (created earlier) back to wide format for summarizing population by country and year.

# Pivoting long population data back to wide format
population_wide <- population_long %>%
  pivot_wider(names_from = "country",
              values_from = "population")

# View the wide format data
head(population_wide)

## # A tibble: 2 × 4
##    year china_population india_population usa_population
##   <dbl>            <dbl>            <dbl>          <dbl>
## 1  2000             1.26             1.05           0.28
## 2  2010             1.34             1.21           0.31

Exercise 2: Pivoting Long Temperature Data

Convert the temperature_long data (from Exercise 1) back to its original wide format.

# Pivoting temperature data back to wide format
temperature_wide <- temperature_long %>%
  pivot_wider(names_from = "season",
              values_from = "temperature")

# View the wide format data
head(temperature_wide)

## # A tibble: 2 × 5
##    year spring_temp summer_temp fall_temp winter_temp
##   <dbl>       <dbl>       <dbl>     <dbl>       <dbl>
## 1  2020        15.5        25.3      10.2         0.5
## 2  2021        16          26.1       9.8        -1

3. Splitting Columns with `separate()`

Use separate( ) when a single column contains multiple pieces of information (e.g., a combined date, or a column like “city-country”) and you need to split it into multiple columns.

Use Case 1: Splitting Date into Year, Month, and Day. Split a date column into year, month, and day to enable time-based analysis.

# Example of a combined date column
date_data <- data.frame(
  transaction_date = c("2021-10-01", "2021-11-15", "2021-12-03")
)
date_data

##   transaction_date
## 1       2021-10-01
## 2       2021-11-15
## 3       2021-12-03

# Separate date into year, month, and day
date_separated <- date_data %>%
  separate(transaction_date, into = c("year", "month", "day"), sep = "-")

# View the separated data
head(date_separated)

##   year month day
## 1 2021    10  01
## 2 2021    11  15
## 3 2021    12  03

Use Case 2: Splitting Location Data into City and Country. Split a column containing combined city-country data into two separate columns for easier analysis.

# Example of a location column
location_data <- data.frame(
  location = c("New York-USA", "London-UK", "Tokyo-Japan")
)
location_data

##       location
## 1 New York-USA
## 2    London-UK
## 3  Tokyo-Japan

# Separate location into city and country
location_separated <- location_data %>%
  separate(location, into = c("city", "country"), sep = "-")

# View the separated data
head(location_separated)

##       city country
## 1 New York     USA
## 2   London      UK
## 3    Tokyo   Japan

Exercise 3: Splitting Product Information

Split a column containing combined product ID and category (e.g., “123-Electronics”) into two separate columns.

# Product data
product_data <- data.frame(
  product_info = c("123-Electronics", "456-Furniture", "789-Clothing")
)

# Separate product_info into product_id and category
product_separated <- product_data %>%
  separate(product_info, into = c("product_id", "category"), sep = "-")

# View the separated data
head(product_separated)

##   product_id    category
## 1        123 Electronics
## 2        456   Furniture
## 3        789    Clothing

4. Combining Columns with `unite()`

Use unite() when you have multiple columns that should be combined into one (e.g., combining first and last names into a full name, or combining date components into a single date field).

Use Case 1: Combining First and Last Names. You have a dataset where you want to create a full name from separate first and last name columns.

names_data <- data.frame(
  First_Name = c("Alice", "Bob"),
  Last_Name = c("Johnson", "Miller"),
  Age = c(30, 25)
)
names_data

##   First_Name Last_Name Age
## 1      Alice   Johnson  30
## 2        Bob    Miller  25

# Unite First_Name and Last_Name into Full_Name
names_united <- names_data %>%
  unite(Full_Name, First_Name, Last_Name, sep = " ")

# View the united data
head(names_united)

##       Full_Name Age
## 1 Alice Johnson  30
## 2    Bob Miller  25

Use Case 2: Combining Date Components.You have a dataset with separate year, month, and day columns, and you want to create a single date column.

date_data <- data.frame(
  Year = c(2021, 2021),
  Month = c(10, 10),
  Day = c(1, 2),
  Sales = c(15000, 18000)
)
date_data

##   Year Month Day Sales
## 1 2021    10   1 15000
## 2 2021    10   2 18000

# Unite Year, Month, and Day into a single Date column
date_united <- date_data %>%
  unite(Date, Year, Month, Day, sep = "-")

# View the united data
head(date_united)

##        Date Sales
## 1 2021-10-1 15000
## 2 2021-10-2 18000

Exercise 4: Combining Address Components

Combine the Street, City, and State columns into a single Address column in the following dataset.

# Sample address data
address_data <- data.frame(
  Street = c("123 Main St", "456 Elm St"),
  City = c("Springfield", "Greenfield"),
  State = c("IL", "MA")
)
address_data

##        Street        City State
## 1 123 Main St Springfield    IL
## 2  456 Elm St  Greenfield    MA

# Unite Street, City, and State into a single Address column
address_united <- address_data %>%
  unite(Address, Street, City, State, sep = ", ")

# View the united data
head(address_united)

##                        Address
## 1 123 Main St, Springfield, IL
## 2   456 Elm St, Greenfield, MA

5. Handling Missing Data with `drop_na()` and `replace_na()`

Use drop_na( ) to remove rows with missing values, and use replace_na( ) to fill in missing values with default values (such as the column mean or median).

Use Case 1: Dropping Missing Values in Sales Data. You have a dataset where some sales records are missing, and you want to drop those entries.

# Sample sales data with NA values
sales_data <- data.frame(
  Product = c("A", "B", "C", "D"),
  Sales = c(15000, NA, 18000, NA)
)
sales_data

##   Product Sales
## 1       A 15000
## 2       B    NA
## 3       C 18000
## 4       D    NA

# Drop rows with missing sales values
cleaned_sales_data <- sales_data %>%
  drop_na(Sales)

# View the cleaned data
head(cleaned_sales_data)

##   Product Sales
## 1       A 15000
## 2       C 18000

Use Case 2: Replacing Missing Values with Mean Sales. Fill in missing sales values with the average sales amount to keep the dataset intact.

# Replace missing sales values with the mean
filled_sales_data <- sales_data %>%
  mutate(Sales = replace_na(Sales, mean(Sales, na.rm = TRUE)))

# View the filled data
head(filled_sales_data)

##   Product Sales
## 1       A 15000
## 2       B 16500
## 3       C 18000
## 4       D 16500

Exercise 5: Handling Missing Data in Product Dataset

In the following dataset, introduce some missing values in the Sales column. Then, use drop_na() to remove rows with missing values and replace_na() to replace missing values with the median of the Sales column.

# Sample product data with NA values
product_data <- data.frame(
  Product = c("E", "F", "G", "H"),
  Sales = c(22000, NA, 25000, 27000)
)
product_data

##   Product Sales
## 1       E 22000
## 2       F    NA
## 3       G 25000
## 4       H 27000

# Introduce missing values in the Sales column
product_data$Sales[c(2)] <- NA

# Drop rows with missing values
cleaned_product_data <- product_data %>%
  drop_na(Sales)

# Replace missing values in Sales with the median
filled_product_data <- product_data %>%
  mutate(Sales = replace_na(Sales, median(Sales, na.rm = TRUE)))

# View the cleaned and filled data
head(filled_product_data)

##   Product Sales
## 1       E 22000
## 2       F 25000
## 3       G 25000
## 4       H 27000

Summary: Key Concepts Covered

Reshaping Data with pivot_longer(): Convert wide data into long (tidy) format.
Reshaping Data with pivot_wider(): Convert long data back into wide format.
Splitting Columns with separate(): Split a single column into multiple columns based on a delimiter.
Combining Columns with unite(): Combine multiple columns into a single column.
Handling Missing Data with drop_na() and replace_na(): Drop or replace missing values in your dataset.

Lecture 4: Data Import

Learning Objectives

Learn how to import data from various file formats (CSV, Excel).
Understand how to handle different data sources (databases).
Learn to specify options during data import to handle irregularities and missing values.
Familiarize yourself with tools for importing data in R, particularly the readr and readxl packages.

Data Science with R

Xu Chen

2024-10-06

Overview

Outline

Lecture 1: Data visualization

Learning Objectives

Preliminaries on R and RStudio

What is R?

What is RStudio?

Installing R and RStudio

Running Basic R Code

Visualizing with ggplot2

Tidyverse Package in R

What is a Dataset?

Creating Visualizations with ggplot2

Adding Aesthetic Mappings

Customizing Plots

Exercise 1: Create a scatter plot showing the relationship between bill length and body mass, and color the points by species.

Faceting: Creating Multiple Plots

Exercise 2: Create a faceted scatter plot showing bill length vs body mass, with separate plots for each species.

Customizing the Appearance with Themes

Exercise 3: Apply a different theme (e.g., theme_classic(), theme_bw()) to the scatter plot you created in Exercise 1.

Adding Trendlines

Exercise 4: Add a trendline to the scatter plot you created in Exercise 1.

Summary: Key Concepts Covered

Lecture 2: Data Transformation

Learning Objectives

Introduction to Data Transformation

Key Functions in dplyr

1. Filtering Rows with filter( )

Exercise 1: Filter the penguins dataset to include only penguins with a body mass greater than 4000 grams.

2. Selecting Columns with select()

Exercise 2: Select only the columns species, island, and body_mass_g.

3. Arranging Rows with arrange( )

Exercise 3: Arrange the penguins dataset by flipper length in ascending order.

4. Creating New Variables with mutate()

Exercise 4: Create a new column in the penguins dataset that calculates the flipper length in meters.

5. Summarizing Data with summarize() and group_by()

Exercise 5: Group the penguins dataset by island and calculate the mean body mass for each island.

Handling Missing Data

Summary: Key Concepts Covered

Lecture 3: Data Tidying

Learning Objectives

What Is Tidy Data?

Messy vs. Tidy Data

Core Functions for Data Tidying from tidyr

1. Reshaping Data with pivot_longer()

Exercise 1: Pivoting Temperature Data

2. Reshaping Data with pivot_wider()

Exercise 2: Pivoting Long Temperature Data

3. Splitting Columns with separate()

Exercise 3: Splitting Product Information

4. Combining Columns with unite()

Exercise 4: Combining Address Components

5. Handling Missing Data with drop_na() and replace_na()

Exercise 5: Handling Missing Data in Product Dataset

Summary: Key Concepts Covered

Lecture 4: Data Import

Learning Objectives

1. Filtering Rows with `filter( )`

2. Selecting Columns with `select()`

3. Arranging Rows with `arrange( )`

4. Creating New Variables with `mutate()`

5. Summarizing Data with `summarize()` and `group_by()`

1. Reshaping Data with `pivot_longer()`

2. Reshaping Data with `pivot_wider()`

3. Splitting Columns with `separate()`

4. Combining Columns with `unite()`

5. Handling Missing Data with `drop_na()` and `replace_na()`