suppressMessages(suppressWarnings(install.packages("testthat", ask=FALSE)))
library(testthat)
Welcome to your first assignment in R! This notebook will guide you
through the basics of R programming, data manipulation using the
dplyr package, and data visualization using the
ggplot2 package. By the end of this assignment, you will be
prepared to tackle more complex data engineering tasks in the next
assignments.
R is a powerful programming language used primarily for statistical computing and graphics. It is widely used in fields such as data science, bioinformatics, and the social sciences. RStudio is an integrated development environment (IDE) for R that makes it easier to write and execute R code, manage projects, and visualize data.
In this section, we will get you started with R by printing a simple message and installing and loading some necessary packages. Printing a message will help you understand how to execute R code, and installing and loading packages will extend R’s functionality. Getting comfortable with these basics will lay the foundation for more advanced tasks you’ll encounter later.
### Your code here - Start ###
print("Hello, world!")
[1] "Hello, world!"
### Your code here - End ###
# Install and load necessary packages
install.packages(c("dplyr", "ggplot2", "tidyr", "readr", "lubridate"))
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(lubridate)
### Problem 1.1 - Start ###
print("Welcome to R Programming!")
[1] "Welcome to R Programming!"
### Problem 1.1 - End ###
### Problem 1.2 - Start ###
install.packages("stringr")
library(stringr)
### Problem 1.2 - End ###
R can perform basic arithmetic operations and handle variables easily. Variables in R are used to store data that can be used later in the program. Understanding how to perform basic arithmetic and work with variables is fundamental to programming in R.
In this section, you will learn how to perform arithmetic operations and assign values to variables. Arithmetic operations include addition, subtraction, multiplication, and division. Variables allow you to store results and reuse them without retyping the same values. Mastering these basics is crucial for writing more complex R scripts in the future.
# Basic arithmetic
5 + 3
[1] 8
10 - 2
[1] 8
4 * 3
[1] 12
8 / 2
[1] 4
# Assigning values to variables
x <- 10
y <- 5
result <- x + y
result
[1] 15
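Beyond the four operations shown above, R also provides exponentiation, integer division, and a remainder operator. The following is a minimal sketch; the variable names (total_minutes, hours, minutes, squared) are made up for illustration and are not part of the assignment.

```r
# Hypothetical example: split a duration in minutes into hours and minutes
total_minutes <- 135

hours   <- total_minutes %/% 60   # integer division: whole hours
minutes <- total_minutes %% 60    # remainder: leftover minutes
squared <- 4^2                    # exponentiation

# Stored results can be reused without retyping the original values
hours
minutes
squared
```

Storing each intermediate result in its own variable keeps the calculation readable and lets later steps reuse the values.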
Problems for Students:
It’s time to practice basic arithmetic and variable assignments. Solve the following problems:
Calculate 7 * 8 and assign it to a variable named product.
### Problem 2.1 - Start ###
product <- 7 * 8
### Problem 2.1 - End ###
# Test Feedback
tryCatch({
testthat::test_that("Problem 2.1", {
testthat::expect_equal(product, 56)
})
}, error = function(e) {
message("Ensure that you correctly assign the product of 7 * 8 to the variable 'product'.")
})
Test passed 😀
Assign 15 to a variable named a and 20 to a variable named b. Then, calculate the sum of a and b and store it in a variable named sum_ab.
### Problem 2.2 - Start ###
a <- 15
b <- 20
sum_ab <- a + b
### Problem 2.2 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 2.2", {
testthat::expect_equal(a, 15)
testthat::expect_equal(b, 20)
testthat::expect_equal(sum_ab, 35)
})
}, error = function(e) {
message("Ensure that you correctly assign values to 'a' and 'b', and compute their sum in 'sum_ab'.")
})
Test passed 🎉
Vectors are one of the basic data structures in R. They can store a sequence of elements of the same type, such as numbers, characters, or logical values. Understanding vectors and data types is crucial for effective data manipulation and analysis.
In this section, you will learn how to create vectors and perform operations on them. You will also explore different data types in R. Data types determine what kind of data can be stored and how it can be used. Being comfortable with these concepts will enable you to handle and process data efficiently.
Create and Manipulate Vectors: Learn how to create vectors and perform operations on them. Vectors are used to store data in a linear format.
Understand Data Types: Explore different data types in R. Data types determine what kind of data can be stored and how it can be used.
# Creating vectors
numbers <- c(1, 2, 3, 4, 5)
characters <- c("a", "b", "c")
# Basic vector operations
sum(numbers)
[1] 15
mean(numbers)
[1] 3
length(characters)
[1] 3
# Data types
str(numbers)
num [1:5] 1 2 3 4 5
str(characters)
chr [1:3] "a" "b" "c"
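Two further properties of vectors are worth seeing in code: arithmetic and comparisons apply to every element at once, and mixing types in c() forces all elements to a single common type. This is a small sketch using a hypothetical vector named readings; it is not part of the graded exercises.

```r
# Hypothetical numeric vector
readings <- c(10, 20, 30)

doubled  <- readings * 2    # element-wise arithmetic
above_15 <- readings > 15   # element-wise comparison gives a logical vector
second   <- readings[2]     # indexing starts at 1 in R

# Mixing types coerces every element to one common type (here: character)
mixed <- c(1, "a", TRUE)

class(readings)   # "numeric"
class(above_15)   # "logical"
class(mixed)      # "character"
```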
Problems for Students:
Now, let’s apply what you’ve learned about vectors and data types:
Create a numeric vector named temperatures containing the values 23, 25, 20, 19, 22.
### Problem 3.1 - Start ###
temperatures <- c(23, 25, 20, 19, 22)
### Problem 3.1 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 3.1", {
testthat::expect_true(is.numeric(temperatures))
testthat::expect_equal(length(temperatures), 5)
})
}, error = function(e) {
message("Ensure that 'temperatures' is a numeric vector with the correct length.")
})
Test passed 🎉
Calculate the mean of temperatures and store it in a variable named mean_temperature.
### Problem 3.2 - Start ###
mean_temperature <- mean(temperatures)
### Problem 3.2 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 3.2", {
testthat::expect_equal(mean_temperature, 21.8)
})
}, error = function(e) {
message("Ensure that you correctly calculate and store the mean of 'temperatures' in 'mean_temperature'.")
})
Test passed 🥇
Create a character vector named cities with the values “New York”, “Los Angeles”, “Chicago”.
### Problem 3.3 - Start ###
cities <- c("New York", "Los Angeles", "Chicago")
### Problem 3.3 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 3.3", {
testthat::expect_true(is.character(cities))
testthat::expect_equal(length(cities), 3)
})
}, error = function(e) {
message("Ensure that 'cities' is a character vector with the correct length.")
})
Test passed 🥳
Data frames are table-like structures that store data in rows and columns, similar to a spreadsheet. Each column can contain data of a different type, making data frames very flexible and useful for data analysis.
In this section, you will learn how to create data frames by combining vectors. You will also learn how to inspect and manipulate data frames, including viewing the structure of the data, summarizing it, and adding new columns. These skills are essential for handling real-world data in your analyses.
Create a Data Frame: Combine vectors into a data frame. Data frames are the most common data structure used in R for data analysis.
Inspect and Manipulate Data Frames: Use functions to inspect and manipulate data frames. This includes viewing the structure of the data, summarizing it, and adding new columns.
# Creating a data frame
names <- c("John", "Jane", "Jim")
ages <- c(28, 34, 40)
data <- data.frame(Name = names, Age = ages)
data
# Inspect the data frame
head(data)
summary(data)
     Name                Age
 Length:3           Min.   :28
 Class :character   1st Qu.:31
 Mode  :character   Median :34
                    Mean   :34
                    3rd Qu.:37
                    Max.   :40
str(data)
'data.frame': 3 obs. of 2 variables:
$ Name: chr "John" "Jane" "Jim"
$ Age : num 28 34 40
# Manipulate data frames
data$Age_in_5_years <- data$Age + 5
data
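The inspection and manipulation steps above can be complemented with row and column indexing. The sketch below uses a hypothetical data frame named inventory (not part of the assignment) and assumes R 4.0 or later, where character columns are not converted to factors by default.

```r
# Hypothetical data frame built from two vectors
inventory <- data.frame(
  Item  = c("pen", "notebook", "stapler"),
  Price = c(1.5, 3.0, 7.25)
)

n_items <- nrow(inventory)        # number of rows
n_cols  <- ncol(inventory)        # number of columns

prices      <- inventory$Price    # extract one column as a vector
second_item <- inventory[2, "Item"]   # single cell: row 2, column "Item"

# Keep only the rows that satisfy a condition
cheap <- inventory[inventory$Price < 5, ]

# Add a derived column, in the same way as Age_in_5_years above
inventory$Price_with_tax <- inventory$Price * 1.1
inventory
```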
Problems for Students:
Let’s practice creating and manipulating data frames:
Create a data frame named students with the columns StudentName and Score, using the vectors c("Alice", "Bob", "Charlie") and c(85, 90, 88).
### Problem 4.1 - Start ###
StudentName <- c("Alice", "Bob", "Charlie")
Score <- c(85, 90, 88)
students <- data.frame(StudentName, Score)
students
### Problem 4.1 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 4.1", {
testthat::expect_true(is.data.frame(students))
testthat::expect_equal(names(students), c("StudentName", "Score"))
testthat::expect_equal(nrow(students), 3)
})
}, error = function(e) {
message("Ensure that 'students' is a data frame with the correct structure.")
})
Test passed 🎉
Add a new column to the students data frame named Pass, which is TRUE if Score is greater than or equal to 50 and FALSE otherwise.
### Problem 4.2 - Start ###
students$Pass <- students$Score >= 50
students
### Problem 4.2 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 4.2", {
testthat::expect_true("Pass" %in% names(students))
testthat::expect_equal(students$Pass, c(TRUE, TRUE, TRUE))
})
}, error = function(e) {
message("Ensure that you correctly add the 'Pass' column based on the 'Score'.")
})
Test passed 🌈
The dplyr package provides a set of functions for data manipulation, making it easier to perform common tasks such as filtering rows, selecting columns, and summarizing data. These functions are designed to be easy to use and efficient.
In this section, you will learn how to load the dplyr package and use its functions to manipulate data frames. These skills are vital for cleaning and transforming data before analysis.
Load the dplyr Package: Ensure the package is loaded. If not, install it using install.packages("dplyr").
Perform Basic Data Manipulation: Use the filter, select, mutate, and summarize functions to manipulate data frames.
# Ensure the package is loaded
library(dplyr)
# Filtering data
filtered_data <- filter(data, Age > 30)
filtered_data
# Selecting columns
selected_data <- select(data, Name, Age)
selected_data
# Mutating data
mutated_data <- mutate(data, Age_in_10_years = Age + 10)
mutated_data
# Summarizing data
summary_data <- summarize(data, Average_Age = mean(Age))
summary_data
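The four verbs above can also be chained with the pipe operator %>%, which passes the result of one step as the first argument of the next. Later exercises in this assignment use this style. The sketch below uses a hypothetical data frame named people; the names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical data frame
people <- data.frame(
  Name = c("Ana", "Ben", "Cara", "Dev"),
  Age  = c(25, 32, 41, 38)
)

# Read the pipeline top to bottom: filter, then add a column, then summarize
older_summary <- people %>%
  filter(Age > 30) %>%
  mutate(Age_in_10_years = Age + 10) %>%
  summarize(Average_Age = mean(Age), Count = n())

older_summary
```

Writing the steps as a pipeline avoids creating a separate intermediate variable for each stage.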
Problems for Students:
Practice using dplyr to manipulate data:
Using the students data frame, filter the rows where Score is greater than 85.
### Problem 5.1 - Start ###
filtered_students <- filter(students, Score > 85)
filtered_students
### Problem 5.1 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 5.1", {
testthat::expect_equal(nrow(filtered_students), 2)
})
}, error = function(e) {
message("Ensure that you correctly filter 'students' based on 'Score'.")
})
Test passed 🎊
Select the StudentName column from the students data frame.
### Problem 5.2 - Start ###
selected_students <- select(students, StudentName)
selected_students
### Problem 5.2 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 5.2", {
testthat::expect_equal(names(selected_students), "StudentName")
})
}, error = function(e) {
message("Ensure that you correctly select the 'StudentName' column.")
})
Test passed 🌈
Add a new column to the students data frame named Score_in_10_years that adds 10 to the current Score.
### Problem 5.3 - Start ###
mutated_students <- mutate(students, Score_in_10_years = Score + 10)
mutated_students
### Problem 5.3 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 5.3", {
testthat::expect_true("Score_in_10_years" %in% names(mutated_students))
testthat::expect_equal(mutated_students$Score_in_10_years, c(95, 100, 98))
})
}, error = function(e) {
message("Ensure that you correctly add the 'Score_in_10_years' column.")
})
Test passed 😸
### Problem 5.4 - Start ###
average_score <- summarize(students, Average_Score = mean(Score))
average_score
### Problem 5.4 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 5.4", {
testthat::expect_equal(average_score$Average_Score, 87.67, tolerance = 0.01)
})
}, error = function(e) {
message("Ensure that you correctly calculate the average score of 'students'.")
})
Test passed 🥳
The ggplot2 package is a powerful tool for creating visualizations. It uses a coherent “grammar” of graphics to build a variety of plots, making it easier to understand and communicate data insights.
In this section, you will learn how to load the ggplot2 package and create basic plots. Visualizations are crucial for presenting your findings in a clear and compelling way.
Load the ggplot2 Package: Ensure the package is loaded. If not, install it using install.packages("ggplot2").
Create a Simple Plot: Learn the basics of creating plots with ggplot2, such as scatter plots and bar charts.
# Ensure the package is loaded
library(ggplot2)
# Creating a scatter plot
ggplot(data, aes(x = Name, y = Age)) +
geom_point()
# Creating a bar chart
ggplot(data, aes(x = Name, y = Age)) +
geom_bar(stat = "identity")
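Plots are built by adding layers with +, and titles and axis labels can be added the same way with labs(). The sketch below uses a hypothetical data frame named fruit_sales; it also uses geom_col(), which draws bars whose heights come directly from the y values, as geom_bar(stat = "identity") does.

```r
library(ggplot2)

# Hypothetical data frame
fruit_sales <- data.frame(
  Fruit = c("Apple", "Banana", "Cherry"),
  Units = c(12, 7, 15)
)

# Data and aesthetic mapping first, then a geometry layer, then labels
sales_plot <- ggplot(fruit_sales, aes(x = Fruit, y = Units)) +
  geom_col(fill = "steelblue") +
  labs(title = "Units Sold by Fruit", x = "Fruit", y = "Units Sold")

sales_plot
```

Assigning the plot to a variable makes it possible to display it again later or add further layers to it.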
Problems for Students:
Now it’s your turn to create visualizations:
Create a scatter plot of Score vs. StudentName using the students data frame.
### Problem 6.1 - Start ###
ggplot(students, aes(x = StudentName, y = Score)) +
geom_point()
### Problem 6.1 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 6.1", {
testthat::expect_s3_class(ggplot(students, aes(x = StudentName, y = Score)) + geom_point(), "gg")
})
}, error = function(e) {
message("Ensure that you correctly create a scatter plot of 'Score' vs. 'StudentName'.")
})
Test passed 🎉
Create a bar chart showing the Score for each StudentName.
### Problem 6.2 - Start ###
ggplot(students, aes(x = StudentName, y = Score)) +
geom_bar(stat = "identity")
### Problem 6.2 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 6.2", {
testthat::expect_s3_class(ggplot(students, aes(x = StudentName, y = Score)) + geom_bar(stat = "identity"), "gg")
})
}, error = function(e) {
message("Ensure that you correctly create a bar chart showing the 'Score' for each 'StudentName'.")
})
Test passed 🥇
Now that you’ve learned the basics, let’s apply these skills to a practical exercise. We’ll analyze a simple patient dataset to gain insights into patient characteristics. The dataset is provided as patient_data_large.csv and, once read into memory, is used for the following questions.
In this section, you will load a dataset, inspect and clean the data, perform basic data analysis, and create visualizations. This exercise will consolidate your learning and give you hands-on experience with real-world data.
# Load the dataset
patients <- read.csv("patient_data_large.csv")
# Inspect the data
head(patients)
summary(patients)
   PatientID           Age             Gender          HeartDisease       BloodPressure    Cholesterol
 Min.   :   1.0   Min.   :18.00   Length:1000        Length:1000        Min.   : 90.0   Min.   :103.9
 1st Qu.: 250.8   1st Qu.:39.00   Class :character   Class :character   1st Qu.:109.5   1st Qu.:194.2
 Median : 500.5   Median :51.00   Mode  :character   Mode  :character   Median :120.1   Median :219.7
 Mean   : 500.5   Mean   :50.33                                         Mean   :119.9   Mean   :219.6
 3rd Qu.: 750.2   3rd Qu.:61.00                                         3rd Qu.:129.7   3rd Qu.:244.0
 Max.   :1000.0   Max.   :90.00                                         Max.   :169.7   Max.   :362.4
str(patients)
'data.frame': 1000 obs. of 6 variables:
$ PatientID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Age : int 65 53 78 28 62 68 57 47 19 66 ...
$ Gender : chr "Male" "Female" "Female" "Male" ...
$ HeartDisease : chr "No" "No" "No" "No" ...
$ BloodPressure: num 108.7 96.3 123.8 125.9 109.8 ...
$ Cholesterol : num 238 195 199 232 220 ...
# Clean the data
patients <- na.omit(patients)
patients$Age <- as.numeric(patients$Age)
patients$Gender <- as.factor(patients$Gender)
patients$HeartDisease <- as.factor(patients$HeartDisease)
summary(patients)
   PatientID           Age           Gender     HeartDisease   BloodPressure    Cholesterol
 Min.   :   1.0   Min.   :18.00   Female:510   No :590        Min.   : 90.0   Min.   :103.9
 1st Qu.: 250.8   1st Qu.:39.00   Male  :490   Yes:410        1st Qu.:109.5   1st Qu.:194.2
 Median : 500.5   Median :51.00                               Median :120.1   Median :219.7
 Mean   : 500.5   Mean   :50.33                               Mean   :119.9   Mean   :219.6
 3rd Qu.: 750.2   3rd Qu.:61.00                               3rd Qu.:129.7   3rd Qu.:244.0
 Max.   :1000.0   Max.   :90.00                               Max.   :169.7   Max.   :362.4
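To see what the cleaning steps do, it can help to run them on a tiny data frame where the effect is easy to check by eye. The sketch below uses a hypothetical data frame named raw_records with one missing value; it is not the patient dataset.

```r
# Hypothetical data frame with one missing Age
raw_records <- data.frame(
  ID     = 1:4,
  Age    = c(34, NA, 51, 29),
  Gender = c("Male", "Female", "Female", "Male")
)

# na.omit() drops every row that contains at least one NA
clean_records <- na.omit(raw_records)

# as.factor() turns text labels into a categorical variable with fixed levels
clean_records$Gender <- as.factor(clean_records$Gender)

nrow(clean_records)           # one row fewer than raw_records
levels(clean_records$Gender)  # the distinct categories
table(clean_records$Gender)   # counts per category
```

This is also why summary() reports counts per category for Gender and HeartDisease after conversion, instead of only the length and class of the column.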
# Analyze the data
# Age distribution
summary(patients$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 39.00 51.00 50.33 61.00 90.00
hist(patients$Age, main = "Age Distribution", xlab = "Age", col = "lightblue", border = "white")
# Blood pressure distribution
summary(patients$BloodPressure)
Min. 1st Qu. Median Mean 3rd Qu. Max.
90.0 109.5 120.1 119.9 129.7 169.7
hist(patients$BloodPressure, main = "Blood Pressure Distribution", xlab = "Blood Pressure", col = "lightgreen", border = "white")
# Cholesterol distribution
summary(patients$Cholesterol)
Min. 1st Qu. Median Mean 3rd Qu. Max.
103.9 194.2 219.7 219.6 244.0 362.4
hist(patients$Cholesterol, main = "Cholesterol Distribution", xlab = "Cholesterol", col = "lightcoral", border = "white")
# Exploring relationships
# Age and heart disease
boxplot(Age ~ HeartDisease, data = patients, main = "Age and Heart Disease", xlab = "Heart Disease", ylab = "Age", col = c("lightblue", "lightpink"))
# Blood pressure and heart disease
boxplot(BloodPressure ~ HeartDisease, data = patients, main = "Blood Pressure and Heart Disease", xlab = "Heart Disease", ylab = "Blood Pressure", col = c("lightgreen", "lightpink"))
# Cholesterol and heart disease
boxplot(Cholesterol ~ HeartDisease, data = patients, main = "Cholesterol and Heart Disease", xlab = "Heart Disease", ylab = "Cholesterol", col = c("lightcoral", "lightpink"))
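The boxplots compare groups visually; the same comparison can be made numerically with dplyr by combining group_by() and summarize(), which computes one summary row per group. The sketch below uses a hypothetical data frame named visits, not the patient data.

```r
library(dplyr)

# Hypothetical data frame
visits <- data.frame(
  Clinic      = c("North", "North", "South", "South", "South"),
  WaitMinutes = c(10, 20, 30, 40, 50)
)

# One output row per Clinic, with the mean wait and the number of visits
avg_wait <- visits %>%
  group_by(Clinic) %>%
  summarize(Avg_Wait = mean(WaitMinutes), Visits = n())

avg_wait
```

The resulting summary table has one row per group, which makes it a convenient input for a bar chart.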
Problems for Students:
Now it’s your turn to analyze the patient data. Follow the steps below to complete the exercises.
Filter the patients dataset to include only male patients, and then calculate the average age of these patients. Store the result in a variable named avg_age_male.
### Problem 7.1 - Start ###
avg_age_male <- patients %>%
filter(Gender == "Male") %>%
summarize(avg_age = mean(Age)) %>%
pull(avg_age)
avg_age_male
[1] 50.1551
### Problem 7.1 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 7.1", {
testthat::expect_equal(avg_age_male, mean(patients$Age[patients$Gender == "Male"]), tolerance = 0.01)
})
}, error = function(e) {
message("Ensure that you correctly filter and calculate the average age for male patients.")
})
Test passed 🌈
Create a new data frame named high_cholesterol that includes only the PatientID and Cholesterol columns for patients with cholesterol levels greater than 250. Add a new column to this data frame named CholesterolCategory that labels these patients as “High”.
### Problem 7.2 - Start ###
high_cholesterol <- patients %>%
filter(Cholesterol > 250) %>%
select(PatientID, Cholesterol) %>%
mutate(CholesterolCategory = "High")
high_cholesterol
### Problem 7.2 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 7.2", {
testthat::expect_true(all(high_cholesterol$Cholesterol > 250))
testthat::expect_true("CholesterolCategory" %in% names(high_cholesterol))
testthat::expect_equal(unique(high_cholesterol$CholesterolCategory), "High")
})
}, error = function(e) {
message("Ensure that you correctly filter, select, and mutate the data for high cholesterol patients.")
})
Test passed 🥇
Create a boxplot comparing the ages of patients with and without heart disease. Use the ggplot2 package for this visualization.
### Problem 7.3 - Start ###
ggplot(patients, aes(x = HeartDisease, y = Age, fill = HeartDisease)) +
geom_boxplot() +
labs(title = "Age Comparison of Patients with and without Heart Disease", x = "Heart Disease", y = "Age")
### Problem 7.3 - End ###
# Hidden test code
tryCatch({
testthat::test_that("Problem 7.3", {
testthat::expect_s3_class(ggplot(patients, aes(x = HeartDisease, y = Age, fill = HeartDisease)) +
geom_boxplot() +
labs(title = "Age Comparison of Patients with and without Heart Disease", x = "Heart Disease", y = "Age"), "gg")
})
}, error = function(e) {
message("Ensure that you correctly create a boxplot comparing ages of patients with and without heart disease.")
})
Test passed 😀
### Problem 7.4 - Start ###
avg_cholesterol <- patients %>%
group_by(HeartDisease) %>%
summarize(avg_chol = mean(Cholesterol))
ggplot(avg_cholesterol, aes(x = HeartDisease, y = avg_chol, fill = HeartDisease)) +
geom_bar(stat = "identity") +
labs(title = "Average Cholesterol Levels by Heart Disease Status", x = "Heart Disease", y = "Average Cholesterol")
### Problem 7.4 - End ###
### Feedback - Test Case - Problem 7.4 - Start ###
tryCatch({
testthat::test_that("Problem 7.4", {
testthat::expect_s3_class(ggplot(avg_cholesterol, aes(x = HeartDisease, y = avg_chol, fill = HeartDisease)) +
geom_bar(stat = "identity") +
labs(title = "Average Cholesterol Levels by Heart Disease Status", x = "Heart Disease", y = "Average Cholesterol"), "gg")
})
}, error = function(e) {
message("Ensure that you correctly summarize and visualize the average cholesterol levels by heart disease status.")
})
Test passed 😸
### Feedback - Test Case - Problem 7.4 - End ###
In this assignment, you have learned the basics of R programming,
data manipulation with dplyr, and data
visualization with ggplot2. You also
applied these skills to analyze a simple patient dataset. These
foundational skills will prepare you for more complex data engineering
tasks in the next assignments. Well done!