Chapter 1: Introduction to R for Data Science

What is R for Data Science?

R is a powerful tool for working with data.
This book focuses on the tidyverse, a collection of R packages that make data science easier.
Think of data science like cooking:
- Import data → Bring ingredients to the kitchen.
- Tidy data → Organize ingredients.
- Transform data → Chop, mix, or season.
- Visualize data → Plate the food beautifully.
- Model data → Predict how the dish will taste based on past meals.
- Communicate results → Share the recipe with others.

Setting Up R and RStudio

Download and install R from CRAN (https://cran.r-project.org).
Download and install RStudio Desktop from Posit, the company formerly named RStudio (https://posit.co).

Understanding RStudio

Console: Where you run commands.
Script Editor: Where you write and save R code.
Environment Pane: Shows variables and datasets.
Help Pane: Finds help and documentation.

Writing Basic R Code

# Assign values to variables
x <- 10
y <- 5
sum_xy <- x + y
print(sum_xy)  # Output: 15

Exercise

Create a variable a and assign it the value 50.
Create a variable b and assign it the value 25.
Add a and b together and print the result.

Chapter 2: Data Visualization with ggplot2

What is Data Visualization?

Data visualization means turning numbers into pictures.
It helps us spot trends, patterns, and outliers.
This chapter uses ggplot2, which works like building with LEGO blocks:
- Start with a dataset.
- Add layers (points, lines, bars).
- Customize appearance (colors, labels, themes).

Creating a Basic Plot

# Load ggplot2
library(ggplot2)

# Create a scatter plot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

Explanation

ggplot(data = mpg) → Use the mpg dataset.
aes(x = displ, y = hwy) → Map engine size to x-axis, highway MPG to y-axis.
geom_point() → Show each car as a dot.
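
The three building steps described above (dataset, layers, appearance) can be sketched in one plot. This is a minimal illustration, not the only way to build it: the choice of the drv column for color, the label text, and theme_minimal() are assumptions made for this example.

```r
library(ggplot2)

# Step 1: start with a dataset and map columns to the axes
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # Step 2: add a layer of points, colored by drive train (assumed choice)
  geom_point(mapping = aes(color = drv)) +
  # Step 3: customize labels and theme (label text is illustrative)
  labs(
    title = "Engine size vs. highway fuel economy",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "Drive train"
  ) +
  theme_minimal()

print(p)
```

Storing the plot in p before printing makes it easy to add further layers later with p + ...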

Exercise

Load the ggplot2 package.
Use the mpg dataset to create a scatter plot of displ vs. hwy.
Add color to the scatter plot based on class.

Chapter 3: Data Transformation with dplyr

What is Data Transformation?

Transforming data means changing raw data into a useful format.
dplyr helps filter, select, arrange, and summarize data.

Filtering Data

# Load dplyr, plus ggplot2, which provides the mpg dataset
library(dplyr)
library(ggplot2)

# Filter cars with highway MPG greater than 30
mpg_filtered <- mpg %>% filter(hwy > 30)
head(mpg_filtered)

Selecting Columns

# Select only manufacturer and highway MPG
mpg_selected <- mpg %>% select(manufacturer, hwy)
head(mpg_selected)

Grouping and Summarizing Data

# Find average highway MPG for each manufacturer
mpg_grouped <- mpg %>%
  group_by(manufacturer) %>%
  summarize(avg_hwy = mean(hwy))
head(mpg_grouped)
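
The chapter introduction also mentions arranging data, which the examples above do not show. A minimal sketch, assuming the same mpg dataset from ggplot2; the variable name mpg_sorted is chosen for this example only.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Sort cars from highest to lowest highway MPG, keeping three columns
mpg_sorted <- mpg %>%
  arrange(desc(hwy)) %>%
  select(manufacturer, model, hwy)

head(mpg_sorted)
```

Dropping desc() sorts in ascending order instead.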

Exercise

Group the mpg dataset by class instead of manufacturer.
Find the average hwy for each class.
Print the results.

Chapter 4: Data Import with readr

Reading a CSV File (`read_csv()`)

# Load the readr package
library(readr)

# Read a CSV file
data <- read_csv("data.csv")

# View the first few rows
head(data)
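
The example above needs a file named data.csv in the working directory. To try read_csv() without one, a small file can be written first. This is a sketch under stated assumptions: the product and units columns and the inventory name are made up for illustration.

```r
library(readr)

# Write a small, made-up table to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(product = c("A", "B", "C"), units = c(10, 20, 15)),
  csv_path
)

# Read it back; show_col_types = FALSE hides the column-type message
inventory <- read_csv(csv_path, show_col_types = FALSE)
head(inventory)
```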

Exercise

Load the readr package.
Read in a file named sales_data.csv.
Display the first six rows.

Chapter 5: Tidy Data with tidyr

What is Tidy Data?

Tidy data means organizing data into a consistent format.
Each row is an observation, each column is a variable, and each cell holds a single value.

Using `gather()` and `spread()` in R

Gather: Convert wide format to long format

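# Load tidyr, which provides gather() and spread()
library(tidyr)

# Convert wide to long format
# (assumes a data frame `grades` with subject columns Math through English)
long_grades <- grades %>% gather(key = "Subject", value = "Score", Math:English)
head(long_grades)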

Spread: Convert long format back to wide format

# Convert long to wide format
wide_grades <- long_grades %>% spread(key = "Subject", value = "Score")
head(wide_grades)
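
The tidyr documentation marks gather() and spread() as superseded; the newer equivalents are pivot_longer() and pivot_wider(). The sketch below shows the same round trip with them. The grades data frame and its student names are made up for this example, since the chapter does not define one.

```r
library(tidyr)

# A made-up wide table: one column per subject
grades <- data.frame(
  Student = c("Ana", "Ben"),
  Math = c(90, 75),
  English = c(85, 80)
)

# Wide to long: one row per student and subject
long_grades <- pivot_longer(
  grades,
  cols = Math:English,
  names_to = "Subject",
  values_to = "Score"
)
print(long_grades)

# Long back to wide: one column per subject again
wide_grades <- pivot_wider(
  long_grades,
  names_from = Subject,
  values_from = Score
)
print(wide_grades)
```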

Exercise

Convert a dataset from wide to long format using gather().
Convert it back to wide format using spread().

Chapter 6: Data Wrangling with tidyr

Handling Missing Data

# Remove missing values
data_clean <- drop_na(data)
head(data_clean)
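
To see what drop_na() does, it helps to run it on a table where the missing value is visible. The sketch below also shows replace_na(), which fills gaps instead of dropping rows. The scores data frame and the fill value 0 are assumptions for illustration.

```r
library(tidyr)

# A made-up table with one missing score
scores <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(90, NA, 88)
)

# Option 1: drop every row that contains a missing value
scores_complete <- drop_na(scores)
print(scores_complete)

# Option 2: keep all rows and fill the gap with a default (0 here)
scores_filled <- replace_na(scores, list(Score = 0))
print(scores_filled)
```

Dropping rows loses data, while filling with a default can distort averages, so the right choice depends on the analysis.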

Exercise

Remove missing values from a dataset.
Fill missing values with a default value using replace_na().

Separating and Uniting Columns

# Separate a column
data_separated <- data %>% separate(Name, into = c("First", "Last"), sep = "_")
head(data_separated)

# Unite columns back into one
data_united <- data_separated %>% unite("Full_Name", First, Last, sep = " ")
head(data_united)
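
The code above assumes data has a Name column with values such as First_Last. A self-contained sketch with made-up names follows. Unlike the example above, it unites with the same separator it split on, so the result matches the original column exactly.

```r
library(tidyr)

# A made-up table with underscore-separated names
people <- data.frame(Name = c("Alice_Smith", "Bob_Jones"))

# Split Name into two columns at the underscore
people_separated <- separate(people, Name, into = c("First", "Last"), sep = "_")
print(people_separated)

# Join the two columns again with the same separator
people_united <- unite(people_separated, "Name", First, Last, sep = "_")
print(people_united)
```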

Exercise

Separate a column into two using separate().
Unite two columns back into one using unite().

Chapter 7: Working with Factors in R

What are Factors?

Factors are used to represent categorical data (e.g., Gender, Colors, Product Categories).
Factors allow for ordering and grouping data efficiently.

Creating Factors

# Create a factor variable
fruit <- factor(c("Apple", "Banana", "Apple", "Orange", "Banana"))
print(fruit)

Explanation

This converts a character vector into a factor, treating the unique values as categories.

Changing Factor Levels

# Rename levels of a factor
fruit <- factor(fruit, levels = c("Apple", "Banana", "Orange"), labels = c("Red", "Yellow", "Orange"))
print(fruit)
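
The ordering and grouping mentioned at the start of this chapter can be sketched with an ordered factor. The survey-style responses below are made up for illustration.

```r
# Create an ordered factor with an explicit level order
responses <- factor(
  c("Medium", "Low", "High", "Low"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# Levels appear in the order given, not alphabetically
print(levels(responses))

# Grouping: count how many values fall in each category
print(table(responses))

# Ordering: comparisons respect the level order
print(responses < "High")
```

Without the levels argument, R sorts the levels alphabetically, which would put High before Low.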

Exercise

Create a factor variable for vehicle types: “Car”, “Truck”, “Motorcycle”.
Change the factor levels to “Small”, “Large”, “Medium”.
Print the modified factor.

Chapter 8: Working with Dates and Times

Working with Dates in R

The lubridate package provides tools for handling dates and times in R.

Parsing Dates

# Load lubridate
library(lubridate)

# Convert a string into a date
date1 <- ymd("2024-03-20")
print(date1)

Explanation

ymd() converts a YYYY-MM-DD format string into an R Date object.

Extracting Components of a Date

# Extract year, month, and day
print(year(date1))   # Output: 2024
print(month(date1))  # Output: 3
print(day(date1))    # Output: 20
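
lubridate offers sibling parsers for other orderings, such as mdy() and dmy(), and helpers for date arithmetic. A minimal sketch, using the same day as above written in two other formats; the variable names are chosen for this example.

```r
library(lubridate)

# Parse the same day written month-first and day-first
d1 <- mdy("03/20/2024")
d2 <- dmy("20-03-2024")
print(d1 == d2)  # Output: TRUE

# Day of the week as a label (the text depends on your locale)
print(wday(d1, label = TRUE))

# Date arithmetic: add ten days
d_later <- d1 + days(10)
print(d_later)  # Output: "2024-03-30"
```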

Exercise

Convert “2025-07-15” into an R date.
Extract and print the year, month, and day separately.
Create a sequence of dates from “2024-01-01” to “2024-12-31”.

Chapter 9: Writing Functions in R

What is a Function?

A function is a set of instructions bundled together to perform a specific task.
Functions make code reusable and easier to understand.

Creating a Simple Function

# Define a function to add two numbers
add_numbers <- function(x, y) {
  return(x + y)
}

# Use the function
result <- add_numbers(10, 5)
print(result)  # Output: 15

Explanation

function(x, y) {} → Defines a function with two inputs (x and y).
return(x + y) → Returns the sum of x and y.
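
Functions can also give an input a default value and check their inputs before doing any work. The power() function below is a hypothetical example written for this guide, not a built-in.

```r
# Raise a number to a power; exponent defaults to 2 (squaring)
power <- function(base, exponent = 2) {
  # Stop early with a clear message if the inputs are not numbers
  if (!is.numeric(base) || !is.numeric(exponent)) {
    stop("both arguments must be numeric")
  }
  # The last evaluated expression is returned automatically
  base^exponent
}

print(power(3))     # Output: 9
print(power(2, 3))  # Output: 8
```

Here the explicit return() is omitted: R returns the value of the last expression in the function body.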

Exercise

Create a function to multiply two numbers.
Create a function that takes a name as input and prints “Hello, [Name]!”.
Test both functions.

Chapter 10: Iteration with Loops

What is Iteration?

Iteration means repeating an action multiple times.
R provides for loops and while loops for iteration.

Using a For Loop

# Print numbers from 1 to 5
for (i in 1:5) {
  print(i)
}

Explanation

for (i in 1:5) → Loops through numbers 1 to 5.
print(i) → Prints each number.
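
The introduction to this chapter also mentions while loops. A minimal sketch of one, followed by a for loop that builds up a running total; the numbers are arbitrary.

```r
# A while loop repeats as long as its condition is TRUE
count <- 1
while (count <= 3) {
  print(count)
  count <- count + 1  # without this line the loop would never end
}

# A for loop can also walk over any vector and accumulate a result
total <- 0
for (value in c(4, 8, 15)) {
  total <- total + value
}
print(total)  # Output: 27
```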

Exercise

Write a for loop to print numbers from 10 to 20.
Create a loop that prints only even numbers between 1 and 10.

Chapter 11: Working with Data Frames

What is a Data Frame?

A data frame is like a table containing rows and columns.
Data frames store structured data in R.

Creating a Data Frame

# Create a simple data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(90, 85, 88)
)
print(data)

Explanation

data.frame() → Creates a structured dataset.
Each column has a name (Name, Age, Score).
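
Once a data frame exists, columns and rows can be pulled out by name or by condition. A short sketch using the same values as above, stored under the assumed name students so the example runs on its own.

```r
students <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(90, 85, 88)
)

# Access one column with $
print(students$Name)

# Keep only the rows where Age is above 28
older <- students[students$Age > 28, ]
print(older)

# Add a new column computed from an existing one
students$Passed <- students$Score >= 88

# Check the size and column types
print(dim(students))  # Output: 3 4
str(students)
```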

Exercise

Create a data frame with columns City, Country, and Population.
Add 3 rows of data and print the data frame.

Chapter 12: Introduction to Modeling in R

What is Modeling?

Modeling means using data to make predictions.
R provides lm() for linear regression modeling.

Simple Linear Regression

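# Create a dataset
# (these toy values lie exactly on a line, so summary() may warn
# about an essentially perfect fit)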
heights <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# Build a linear model
model <- lm(weight ~ height, data = heights)
print(summary(model))

Explanation

lm(weight ~ height, data = heights) → Predicts weight using height.
summary(model) → Shows model details.
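
A fitted model can also be used to predict a value for new input. The sketch below uses made-up measurements that do not fall exactly on a line, and the names toy and fit are chosen for this example only.

```r
# Made-up heights (cm) and weights (kg), deliberately a little noisy
toy <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(52, 58, 71, 79, 91)
)

# Fit the model
fit <- lm(weight ~ height, data = toy)

# Intercept and slope of the fitted line
print(coef(fit))

# Predict the weight for a height that was not in the data
predicted <- predict(fit, newdata = data.frame(height = 175))
print(predicted)
```

The column name in newdata must match the predictor name used in the formula, here height.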

Exercise

Create a dataset of Experience vs. Salary.
Fit a linear model to predict salary based on experience.
Print the summary of the model.

R for Data Science: Beginner Guide

Evan

2025-03-15
