Chapter 1: Introduction to R for Data Science

What is R for Data Science?

  • R is a powerful tool for working with data.
  • This book focuses on the tidyverse, a collection of R packages that make data science easier.
  • Think of data science like cooking:
    • Import data → Bring ingredients to the kitchen.
    • Tidy data → Organize ingredients.
    • Transform data → Chop, mix, or season.
    • Visualize data → Plate the food beautifully.
    • Model data → Predict how the dish will taste based on past meals.
    • Communicate results → Share the recipe with others.

Setting Up R and RStudio

  • Download and install R from CRAN.
  • Download and install RStudio Desktop from Posit (the company formerly named RStudio).
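
The later chapters load several add-on packages. A minimal sketch for checking which of them are already installed, assuming the list below covers the packages this book loads (the variable names pkgs, installed, and missing_pkgs are illustrative):

```r
# Packages loaded in later chapters (assembled from this book's library() calls)
pkgs <- c("ggplot2", "dplyr", "readr", "tidyr", "lubridate")

# requireNamespace() returns TRUE when a package is installed, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All packages are installed.")
}
```

Running install.packages("tidyverse") once would also install all five packages in a single step.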

Understanding RStudio

  • Console: Where you run commands.
  • Script Editor: Where you write and save R code.
  • Environment Pane: Shows variables and datasets.
  • Help Pane: Shows help pages and documentation.

Writing Basic R Code

# Assign values to variables
x <- 10
y <- 5
sum_xy <- x + y
print(sum_xy)  # Output: 15

Exercise

  1. Create a variable a and assign it the value 50.
  2. Create a variable b and assign it the value 25.
  3. Add a and b together and print the result.

Chapter 2: Data Visualization with ggplot2

What is Data Visualization?

  • Data visualization means turning numbers into pictures.
  • It helps us spot trends, patterns, and outliers.
  • This chapter uses ggplot2, which works like building with LEGO blocks:
    • Start with a dataset.
    • Add layers (points, lines, bars).
    • Customize appearance (colors, labels, themes).

Creating a Basic Plot

# Load ggplot2
library(ggplot2)

# Create a scatter plot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

Explanation

  • ggplot(data = mpg) → Use the mpg dataset.
  • aes(x = displ, y = hwy) → Map engine displacement (displ) to the x-axis and highway MPG (hwy) to the y-axis.
  • geom_point() → Show each car as a dot.
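
The LEGO-block idea (dataset, then layers, then appearance) can be sketched as one plot built step by step. This is an illustrative example rather than the only way to do it; the choice of the drv column for color and the label text are assumptions made for the sketch.

```r
library(ggplot2)

# 1. Start with a dataset and the shared axis mappings
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # 2. Add a layer: one point per car, colored by drive train
  geom_point(mapping = aes(color = drv)) +
  # 3. Customize appearance: labels and a theme
  labs(
    title = "Engine size vs. highway fuel economy",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "Drive train"
  ) +
  theme_minimal()

print(p)
```

Storing the plot in a variable (here p) makes it possible to add further layers later with p + ... before printing.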

Exercise

  1. Load the ggplot2 package.
  2. Use the mpg dataset to create a scatter plot of displ vs. hwy.
  3. Add color to the scatter plot based on class.

Chapter 3: Data Transformation with dplyr

What is Data Transformation?

  • Transforming data means changing raw data into a useful format.
  • dplyr helps filter, select, arrange, and summarize data.

Filtering Data

# Load dplyr, plus ggplot2, which provides the mpg dataset
library(dplyr)
library(ggplot2)

# Filter cars with highway MPG greater than 30
mpg_filtered <- mpg %>% filter(hwy > 30)
head(mpg_filtered)

Selecting Columns

# Select only manufacturer and highway MPG
mpg_selected <- mpg %>% select(manufacturer, hwy)
head(mpg_selected)

Grouping and Summarizing Data

# Find average highway MPG for each manufacturer
mpg_grouped <- mpg %>%
  group_by(manufacturer) %>%
  summarize(avg_hwy = mean(hwy))
head(mpg_grouped)
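
Arranging is the one dplyr verb from the list at the start of this chapter that has not been shown yet. A minimal sketch, assuming we want manufacturers ranked from highest to lowest average highway MPG (the name mpg_ranked is illustrative):

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Summarize as before, then sort the result in descending order
mpg_ranked <- mpg %>%
  group_by(manufacturer) %>%
  summarize(avg_hwy = mean(hwy)) %>%
  arrange(desc(avg_hwy))

head(mpg_ranked, 3)
```

Dropping desc() sorts in ascending order instead.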

Exercise

  1. Group the mpg dataset by class instead of manufacturer.
  2. Find the average hwy for each class.
  3. Print the results.

Chapter 4: Data Import with readr

Reading a CSV File (read_csv())

# Load the readr package
library(readr)

# Read a CSV file
data <- read_csv("data.csv")

# View the first few rows
head(data)
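
The snippet above needs a file named data.csv in the working directory. To try read_csv() without one, the sketch below first writes a small temporary CSV and then reads it back. The column names and values are made up for illustration.

```r
library(readr)

# Write a tiny example file to a temporary location (illustrative data)
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, amount = c(9.5, 12, 7.25)), tmp)

# Read it back; show_col_types = FALSE silences the column-type message
sales <- read_csv(tmp, show_col_types = FALSE)

head(sales)
```

read_csv() guesses each column's type from the file contents, so id and amount both come back as numeric columns here.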

Exercise

  1. Load the readr package.
  2. Read in a file named sales_data.csv.
  3. Display the first six rows.

Chapter 5: Tidy Data with tidyr

What is Tidy Data?

  • Tidy data means organizing data into a consistent format.
  • Each row is an observation, each column is a variable, and each cell holds a single value.

Using gather() and spread() in R

Gather: Convert wide format to long format

# Convert wide to long format
# (assumes grades is a data frame with score columns Math through English)
library(tidyr)
long_grades <- grades %>% gather(key = "Subject", value = "Score", Math:English)
head(long_grades)

Spread: Convert long format back to wide format

# Convert long to wide format
wide_grades <- long_grades %>% spread(key = "Subject", value = "Score")
head(wide_grades)
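
In current tidyr releases, gather() and spread() are superseded: they still work, but pivot_longer() and pivot_wider() are the recommended replacements. The sketch below performs the same round trip with them, using a small made-up grades data frame (the student names and scores are illustrative).

```r
library(tidyr)

# Illustrative wide data: one row per student, one column per subject
grades <- data.frame(
  Student = c("Ana", "Ben"),
  Math    = c(90, 75),
  English = c(85, 80)
)

# Wide to long: subject columns become Subject/Score pairs
long_grades <- pivot_longer(
  grades,
  cols = Math:English,
  names_to = "Subject",
  values_to = "Score"
)

# Long back to wide: one column per subject again
wide_grades <- pivot_wider(
  long_grades,
  names_from = Subject,
  values_from = Score
)

print(long_grades)
print(wide_grades)
```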

Exercise

  1. Convert a dataset from wide to long format using gather().
  2. Convert it back to wide format using spread().

Chapter 6: Data Wrangling with tidyr

Handling Missing Data

# Remove rows that contain missing values (drop_na() comes from tidyr)
library(tidyr)
data_clean <- drop_na(data)
head(data_clean)
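
Dropping rows is not the only option: replace_na() fills missing values with a default instead. A minimal sketch on a made-up data frame, assuming 0 is an acceptable default for a missing score:

```r
library(tidyr)

# Illustrative data with one missing score
scores <- data.frame(
  Name  = c("Alice", "Bob", "Charlie"),
  Score = c(90, NA, 88)
)

# Option 1: drop every row that contains an NA
scores_clean <- drop_na(scores)

# Option 2: keep all rows and replace NA in Score with 0
scores_filled <- replace_na(scores, list(Score = 0))

print(scores_clean)
print(scores_filled)
```

The list passed to replace_na() names each column to fill, so different columns can receive different defaults.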

Exercise

  1. Remove missing values from a dataset.
  2. Fill missing values with a default value using replace_na().

Separating and Uniting Columns

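# Split the Name column into First and Last at the underscore
# (assumes data has a Name column with values such as "Alice_Smith")
data_separated <- data %>% separate(Name, into = c("First", "Last"), sep = "_")
head(data_separated)

# Unite the two columns back into one, separated by a space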
data_united <- data_separated %>% unite("Full_Name", First, Last, sep = " ")
head(data_united)
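
For a version that runs on its own, the sketch below applies the same two steps to a small made-up data frame (the names are illustrative).

```r
library(tidyr)

# Illustrative data: first and last names joined by an underscore
people <- data.frame(Name = c("Alice_Smith", "Bob_Jones"))

# Split Name into two columns at the underscore
people_separated <- separate(people, Name, into = c("First", "Last"), sep = "_")

# Join them back into one column, this time separated by a space
people_united <- unite(people_separated, "Full_Name", First, Last, sep = " ")

print(people_separated)
print(people_united)
```

Note that the round trip does not restore the original text exactly, because unite() is given a space rather than an underscore as the separator.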

Exercise

  1. Separate a column into two using separate().
  2. Unite two columns back into one using unite().

Chapter 7: Working with Factors in R

What are Factors?

  • Factors are used to represent categorical data (e.g., Gender, Colors, Product Categories).
  • Factors allow for ordering and grouping data efficiently.

Creating Factors

# Create a factor variable
fruit <- factor(c("Apple", "Banana", "Apple", "Orange", "Banana"))
print(fruit)

Explanation

  • This converts a character vector into a factor, treating the unique values as categories.

Changing Factor Levels

# Rename the levels; each label replaces the level in the same position
fruit <- factor(fruit, levels = c("Apple", "Banana", "Orange"),
                labels = c("Red", "Yellow", "Orange"))
print(fruit)
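
Factors can also carry an order, which is what makes comparisons such as "smaller than Large" meaningful. A minimal sketch in base R, using made-up size values:

```r
# An ordered factor: the levels argument fixes the order Small < Medium < Large
sizes <- factor(
  c("Medium", "Small", "Large", "Small"),
  levels = c("Small", "Medium", "Large"),
  ordered = TRUE
)

print(levels(sizes))   # the categories, in order
print(table(sizes))    # how many values fall in each category

# Comparisons respect the level order, not alphabetical order
smaller_than_large <- sizes < "Large"
print(smaller_than_large)
```

Without the levels argument, factor() sorts the levels alphabetically, which would put Large before Medium and Small.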

Exercise

  1. Create a factor variable for vehicle types: “Car”, “Truck”, “Motorcycle”.
  2. Relabel the factor levels as “Small”, “Large”, and “Medium”.
  3. Print the modified factor.

Chapter 8: Working with Dates and Times

Working with Dates in R

  • The lubridate package provides tools for handling dates and times in R.

Parsing Dates

# Load lubridate
library(lubridate)

# Convert a string into a date
date1 <- ymd("2024-03-20")
print(date1)

Explanation

  • ymd() parses a string written in year-month-day order (such as YYYY-MM-DD) into an R Date object.

Extracting Components of a Date

# Extract year, month, and day
print(year(date1))   # Output: 2024
print(month(date1))  # Output: 3
print(day(date1))    # Output: 20
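
Once a value is a Date, it supports arithmetic and sequences. The sketch below is an illustrative example that combines lubridate's days() helper with base R's seq(); the specific offsets are arbitrary.

```r
library(lubridate)

date1 <- ymd("2024-03-20")

# Add 30 days to a date
later <- date1 + days(30)
print(later)

# Number of days between two dates
gap <- as.numeric(ymd("2024-12-31") - date1)
print(gap)

# A sequence of seven consecutive days starting at date1
week <- seq(date1, by = "day", length.out = 7)
print(week)
```

seq() also accepts by = "week" or by = "month" for coarser steps.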

Exercise

  1. Convert “2025-07-15” into an R date.
  2. Extract and print the year, month, and day separately.
  3. Create a sequence of dates from “2024-01-01” to “2024-12-31”.

Chapter 9: Writing Functions in R

What is a Function?

  • A function is a set of instructions bundled together to perform a specific task.
  • Functions make code reusable and easier to understand.

Creating a Simple Function

# Define a function to add two numbers
add_numbers <- function(x, y) {
  return(x + y)
}

# Use the function
result <- add_numbers(10, 5)
print(result)  # Output: 15

Explanation

  • function(x, y) {} → Defines a function with two inputs (x and y).
  • return(x + y) → Returns the sum of x and y.
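
Two further conveniences are worth knowing: arguments can have default values, and a function returns the value of its last expression even without return(). A minimal sketch, using a hypothetical function named power:

```r
# exponent defaults to 2, so power(x) squares x
power <- function(base, exponent = 2) {
  base^exponent  # last expression is returned automatically
}

squared <- power(3)               # uses the default exponent
cubed   <- power(2, exponent = 3) # overrides the default

print(squared)  # Output: 9
print(cubed)    # Output: 8
```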

Exercise

  1. Create a function to multiply two numbers.
  2. Create a function that takes a name as input and prints “Hello, [Name]!”.
  3. Test both functions.

Chapter 10: Iteration with Loops

What is Iteration?

  • Iteration means repeating an action multiple times.
  • R provides for loops and while loops for iteration.

Using a For Loop

# Print numbers from 1 to 5
for (i in 1:5) {
  print(i)
}

Explanation

  • for (i in 1:5) → Loops through numbers 1 to 5.
  • print(i) → Prints each number.
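
A while loop repeats as long as a condition stays TRUE, which suits cases where the number of repetitions is not known in advance. A minimal sketch that adds up the numbers 1 to 5 (the variable names are illustrative):

```r
count <- 1
total <- 0

# Keep looping while count is at most 5
while (count <= 5) {
  total <- total + count
  count <- count + 1  # without this update the loop would never end
}

print(total)  # Output: 15
```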

Exercise

  1. Write a for loop to print numbers from 10 to 20.
  2. Create a loop that prints only even numbers between 1 and 10.

Chapter 11: Working with Data Frames

What is a Data Frame?

  • A data frame is like a table containing rows and columns.
  • Data frames store structured data in R.

Creating a Data Frame

# Create a simple data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(90, 85, 88)
)
print(data)

Explanation

  • data.frame() → Creates a structured dataset.
  • Each column has a name (Name, Age, Score).
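
After a data frame exists, the usual next steps are inspecting it, pulling out columns, filtering rows, and adding columns. A minimal base R sketch using the same values as above (stored here under the illustrative name people; the threshold 86 and the Senior column are assumptions made for the example):

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(90, 85, 88)
)

# Inspect the structure: column names, types, and first values
str(people)

# Extract one column as a vector with $
mean_age <- mean(people$Age)

# Keep only the rows where Score is above 86
high_scores <- people[people$Score > 86, ]

# Add a new column computed from an existing one
people$Senior <- people$Age >= 30

print(mean_age)
print(high_scores)
print(people)
```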

Exercise

  1. Create a data frame with columns City, Country, and Population.
  2. Add 3 rows of data and print the data frame.

Chapter 12: Introduction to Modeling in R

What is Modeling?

  • Modeling means using data to make predictions.
  • R provides lm() for linear regression modeling.

Simple Linear Regression

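# Create a dataset (these toy values lie exactly on a line,
# so summary() will warn about an essentially perfect fit)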
heights <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

# Build a linear model
model <- lm(weight ~ height, data = heights)
print(summary(model))

Explanation

  • lm(weight ~ height, data = heights) → Fits a linear model that predicts weight from height.
  • summary(model) → Shows model details.
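
A fitted model becomes useful once it is applied to new inputs. The sketch below refits the same model and uses predict() on a height that is not in the data; the value 175 is an arbitrary choice for illustration.

```r
heights <- data.frame(
  height = c(150, 160, 170, 180, 190),
  weight = c(50, 60, 70, 80, 90)
)

model <- lm(weight ~ height, data = heights)

# Estimated intercept and slope (about -100 and 1 for this toy data)
print(coef(model))

# Predict the weight for a new height; newdata must use the same column name
new_person <- data.frame(height = 175)
predicted <- predict(model, newdata = new_person)
print(predicted)  # approximately 75
```

Because the toy data follow weight = height - 100 exactly, the prediction matches that rule; real data would scatter around the fitted line.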

Exercise

  1. Create a dataset of Experience vs. Salary.
  2. Fit a linear model to predict salary based on experience.
  3. Print the summary of the model.