Welcome to the Mastering R for Research Workshop. This comprehensive workshop is designed to help you unlock your research potential by mastering the R software.
R is a language and environment for statistical computing and data visualization (http://www.r-project.org/about.html). R is free and open source, meaning anyone can use, modify, and distribute it. Many add-on packages help read input, implement functions, visualize output, and transform results for further use.
R can be installed on various operating systems. This chapter provides instructions for installing R on Windows, macOS, and Linux. It also includes brief guidance on installing RStudio.
Go to the Comprehensive R Archive Network (CRAN) website: https://cran.r-project.org/
Double-click the downloaded file and follow the installation instructions. You can usually accept the default settings.
Once installed, open the R GUI (or RStudio if you install that too) and type:
version
For Ubuntu systems, follow these steps in your terminal (Debian uses a separate CRAN repository under bin/linux/debian):
# Update your package list and install prerequisites
sudo apt update
sudo apt install --no-install-recommends software-properties-common dirmngr
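# Add the CRAN signing key so apt can verify the packages
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# Add the CRAN repository for your Ubuntu release (lsb_release -cs prints its codename, e.g. focal)
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"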
# Update package lists again and install R
sudo apt update
sudo apt install r-base
# Verify the installation by checking the version
R --version
RStudio is an integrated development environment (IDE) for R. It provides a user-friendly interface that makes writing and running R scripts easier.
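RStudio has several key components that enhance the user experience, including a source editor, a console, an environment pane, and panes for files, plots, packages, and help.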
To install RStudio: Download and install RStudio from RStudio’s website.
You can run R code in RStudio using the following methods:
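Console: type a command at the prompt and press Enter.
Script editor: place the cursor on a line (or select several lines) and press Ctrl + Enter (Windows) or Cmd + Enter (Mac).
# Adding two numbers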
2 + 3
Use ls() to list all objects in your environment.
Use rm(object_name) to remove an object.
Use rm(list = ls()) to clear the entire environment.
# Listing objects in the environment
ls()
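As a small illustration of these environment commands, the sketch below creates two objects, removes one, and checks what is left. The object names are made up for this example.

```r
# Create two example objects (hypothetical names)
first_value <- 1
second_value <- 2

# List the objects currently defined in the environment
ls()

# Remove one object and confirm that it is gone
rm(first_value)
exists("first_value")   # FALSE
exists("second_value")  # TRUE
```

Note that rm(list = ls()) removes every object in the environment, so it is best used at the start of a session rather than in the middle of an analysis.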
You can customize RStudio to suit your preferences, for example the editor theme, font size, and pane layout, under Tools > Global Options.
RStudio is a powerful tool for writing and executing R code efficiently. Understanding its features and functionalities will help you work effectively with R for research and data analysis.
Understanding the basic syntax of R is essential for writing effective scripts. R is case-sensitive and follows a simple, readable syntax.
In R, values can be assigned to variables using <- or =.
x <- 10 # Assigning 10 to x
y = 20 # Assigning 20 to y
x + y # Summing x and y
## [1] 30
R has several basic data types:
Numeric (e.g., 10.5, 2.3)
Integer (e.g., 1L, 5L)
Character (e.g., "Hello")
Logical (TRUE, FALSE)
a <- 5 # Numeric
b <- 2L # Integer
c <- "R" # Character
d <- TRUE # Logical
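To check which type a value has, base R provides class() and the is.*() family of functions. The sketch below uses illustrative variable names that are not part of the workshop material.

```r
# One example value per basic type (names are illustrative)
num_value <- 10.5
int_value <- 5L
chr_value <- "Hello"
lgl_value <- TRUE

class(num_value)  # "numeric"
class(int_value)  # "integer"
class(chr_value)  # "character"
class(lgl_value)  # "logical"

# A number written without the L suffix is stored as a double, not an integer
is.integer(5)   # FALSE
is.integer(5L)  # TRUE
```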
R supports conditional execution with if and else, as well as the vectorized ifelse() function.
x <- 15
if (x > 10) {
print("x is greater than 10")
} else {
print("x is 10 or less")
}
## [1] "x is greater than 10"
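While if and else test a single condition, ifelse() applies a test to every element of a vector. A minimal sketch, using a made-up vector of values:

```r
# Hypothetical example values
values <- c(4, 12, 10, 25)

# ifelse() evaluates the condition element by element
labels <- ifelse(values > 10, "greater than 10", "10 or less")
labels
# [1] "10 or less"      "greater than 10" "10 or less"      "greater than 10"
```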
Loops help execute repetitive tasks efficiently.
for (i in 1:5) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
count <- 1
while (count <= 5) {
print(count)
count <- count + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
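Loops are often used to fill a result vector. The sketch below, with made-up input values, shows that pattern using seq_along(), followed by the vectorized form that gives the same result without a loop.

```r
# Hypothetical input values
values <- c(3, 8, 12)

# Pre-allocate the result, then fill it one element at a time
squares <- numeric(length(values))
for (i in seq_along(values)) {
  squares[i] <- values[i]^2
}
squares  # 9 64 144

# Most arithmetic in R is vectorized, so the loop can be replaced by one expression
values^2  # 9 64 144
```

For simple element-wise arithmetic the vectorized form is usually shorter and faster. Loops remain useful when each step depends on the previous one.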
Functions allow modular programming and reuse of code.
add_numbers <- function(a, b) {
return(a + b)
}
add_numbers(5, 3)
## [1] 8
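Functions can also define default argument values. The sketch below uses a hypothetical helper, scale_value(), which is not part of the workshop material.

```r
# scale_value() is a hypothetical helper; multiplier defaults to 2
scale_value <- function(x, multiplier = 2) {
  # The value of the last evaluated expression is returned automatically
  x * multiplier
}

scale_value(5)                  # uses the default multiplier: 10
scale_value(5, multiplier = 3)  # overrides the default: 15
```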
Try the following exercises to practice R syntax:
Modify the code below to multiply a and b instead of adding them.
# Assign values to variables
# Print the sum of a and b
# Assign values to variables
a <- 7
b <- 3
# Multiply a and b
print(a * b)
## [1] 21
Write an if-else statement to check if x is greater than 20.
# Define a variable x
# Write an if-else statement to check if x is greater than 20
# Define a variable x
x <- 25
# Write an if-else statement to check if x is greater than 20
if (x > 20) {
print("x is greater than 20")
} else {
print("x is 20 or less")
}
## [1] "x is greater than 20"
Write a for loop to print numbers from 1 to 10.
# Write a for loop to print numbers from 1 to 10
Modify the loop to print only even numbers between 1 and 10.
# Starter loop: prints every number from 1 to 10; modify it to print only even numbers
for (i in seq(1, 10, by=1)) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
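One possible solution to the even-number exercise is sketched below. It shows two approaches: generating only the even numbers with seq(), or testing each number with the modulo operator %%.

```r
# Approach 1: step through the sequence in increments of 2
evens <- seq(2, 10, by = 2)
for (i in evens) {
  print(i)
}

# Approach 2: loop over 1 to 10 and print only numbers divisible by 2
for (i in 1:10) {
  if (i %% 2 == 0) {
    print(i)
  }
}
```

Both loops print 2, 4, 6, 8, and 10.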
Mastering R syntax is the foundation for effective programming in R. Understanding variable assignment, data types, conditional statements, loops, and functions is key to becoming proficient in R programming.
R provides several fundamental data structures for handling data. Understanding these structures is essential for efficient data manipulation.
Vectors are the simplest data structure in R. All elements of a vector must be of the same type.
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE)
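A short sketch of common vector operations: indexing, filtering, and element-wise arithmetic. The last lines show what happens when types are mixed; the vector named mixed is made up for this example.

```r
numeric_vector <- c(1, 2, 3, 4, 5)

numeric_vector[2]                   # second element (R indexes from 1): 2
numeric_vector[numeric_vector > 3]  # elements greater than 3: 4 5
numeric_vector * 2                  # element-wise arithmetic: 2 4 6 8 10

# Mixing types forces every element to a common type (here: character)
mixed <- c(1, "apple", TRUE)
class(mixed)  # "character"
```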
Matrices are two-dimensional arrays that contain elements of the same type.
matrix_example <- matrix(1:9, nrow=3, ncol=3)
matrix_example
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
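Matrix elements are accessed with [row, column] indexing. The sketch below also shows the byrow argument, which fills the matrix row by row instead of the default column by column.

```r
matrix_example <- matrix(1:9, nrow = 3, ncol = 3)

matrix_example[2, 3]  # row 2, column 3: 8
matrix_example[, 1]   # entire first column: 1 2 3
dim(matrix_example)   # number of rows and columns: 3 3

# Fill row by row instead of column by column
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)
by_row[1, ]           # first row: 1 2 3
```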
Lists can hold elements of different types, including vectors, matrices, and even other lists.
list_example <- list(name = "John", age = 25, scores = c(90, 85, 88))
list_example
## $name
## [1] "John"
##
## $age
## [1] 25
##
## $scores
## [1] 90 85 88
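List elements can be extracted with $ or [[ ]], while single brackets return a smaller list. A brief sketch using the list defined above:

```r
list_example <- list(name = "John", age = 25, scores = c(90, 85, 88))

list_example$name             # extract by name: "John"
list_example[["scores"]][2]   # second score: 85
class(list_example["age"])    # single brackets keep the list wrapper: "list"
class(list_example[["age"]])  # double brackets return the element itself: "numeric"

mean(list_example$scores)     # average of the three scores
```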
Data frames are table-like structures where each column can contain different types of data.
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(90, 85))
df
## Name Age Score
## 1 Alice 25 90
## 2 Bob 30 85
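A few common data frame operations, sketched on the same example: extracting a column, filtering rows, and adding a column. The column name Passed and the cutoff of 90 are assumptions made for illustration.

```r
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(90, 85))

df$Age                            # extract one column as a vector: 25 30
high_scores <- df[df$Score > 85, ]  # keep rows where Score exceeds 85
high_scores

# Add a logical column (hypothetical pass mark of 90)
df$Passed <- df$Score >= 90

str(df)  # compact overview of column names and types
```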
Create a numeric vector containing the numbers 10, 20, 30, 40, and 50.
numeric_vector <- c(10, 20, 30, 40, 50)
numeric_vector
Create a data frame with two columns: “Student” (names: “Alice”, “Bob”, “Charlie”) and “Score” (values: 85, 90, 78).
df <- data.frame(Student = c("Alice", "Bob", "Charlie"), Score = c(85, 90, 78))
df
Understanding R’s data structures is crucial for effective data analysis. Vectors, matrices, lists, and data frames each serve different purposes in handling and organizing data.
R supports multiple data file formats. Below are some common formats and how to handle them.
CSV (Comma-Separated Values) files are one of the most commonly used data formats.
Reading CSV Files: use the built-in read.csv() function or read_csv() from the readr package.
library(readr)
data <- read_csv("/data/bakeoff.csv")
head(data)
Writing CSV Files:
write_csv(data, "/results/output.csv")
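The built-in functions work the same way without any extra package. The sketch below writes a small made-up data frame to a temporary file and reads it back, so it runs without the workshop data files.

```r
# Hypothetical example data written to a temporary file
csv_path <- tempfile(fileext = ".csv")
example <- data.frame(Student = c("Alice", "Bob"), Score = c(85, 90))

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(example, csv_path, row.names = FALSE)

# Read the file back with base R
imported <- read.csv(csv_path)
head(imported)
```

Compared with read.csv(), read_csv() returns a tibble and is typically faster on large files. Either is fine for small data sets.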
To handle Excel files, use the readxl package for reading and the writexl package for writing.
Reading Excel Files:
library(readxl)
data <- read_excel("/latitude.xlsx", sheet = 1)
head(data)
Writing Excel Files:
library(writexl)
write_xlsx(data, "/results/output.xlsx")
RDS files store R objects efficiently.
Loading an R Object:
data <- readRDS("/data/inventory_parts.rds")
Saving an R Object:
saveRDS(data, "/results/output.rds")
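An RDS file stores exactly one object, and readRDS() lets you assign it to any name on load. A minimal round trip, using a temporary file and a made-up object:

```r
# Hypothetical object saved to a temporary file
rds_path <- tempfile(fileext = ".rds")
original <- list(id = 1:3, label = "inventory")

saveRDS(original, rds_path)

# The restored object can be given any name
restored <- readRDS(rds_path)
identical(original, restored)  # TRUE
```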
RData files need no additional package. They are R's native format: a single file can hold several objects, and each is restored under its original name with its type preserved.
Reading RData Files:
load("/data/wine.RData") # restores the saved objects under their original names; load() itself returns only those names
Writing RData Files:
save(data, file = "/results/output.RData") # the file argument must be named
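The sketch below illustrates how save() and load() behave, using a temporary file and two made-up objects. Two details matter: save() needs the file argument to be named, and load() returns the names of the restored objects rather than the objects themselves.

```r
# Hypothetical objects saved to a temporary file
rdata_path <- tempfile(fileext = ".RData")
wine_a <- c(1, 2, 3)
wine_b <- "red"

# Several objects can go into one file; file must be passed by name
save(wine_a, wine_b, file = rdata_path)

# Remove the objects, then restore them from the file
rm(wine_a, wine_b)
restored_names <- load(rdata_path)

restored_names  # "wine_a" "wine_b"
wine_b          # "red"
```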
R provides multiple ways to handle data formats, including CSV, Excel, RDS, and RData. Using the appropriate package ensures efficient data handling and manipulation.
Data manipulation is a crucial part of data analysis. R provides various packages to facilitate efficient data handling, including dplyr, tidyr, and data.table.
The dplyr package provides functions for filtering, selecting, mutating, and summarizing data.
library(dplyr)
data <- mtcars
filtered_data <- data %>% filter(mpg > 20)
head(filtered_data)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
selected_data <- data %>% select(mpg, hp, wt)
head(selected_data)
## mpg hp wt
## Mazda RX4 21.0 110 2.620
## Mazda RX4 Wag 21.0 110 2.875
## Datsun 710 22.8 93 2.320
## Hornet 4 Drive 21.4 110 3.215
## Hornet Sportabout 18.7 175 3.440
## Valiant 18.1 105 3.460
data <- data %>% mutate(power_to_weight = hp / wt,
abc= mpg/wt)
head(data)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## power_to_weight abc
## Mazda RX4 41.98473 8.015267
## Mazda RX4 Wag 38.26087 7.304348
## Datsun 710 40.08621 9.827586
## Hornet 4 Drive 34.21462 6.656299
## Hornet Sportabout 50.87209 5.436047
## Valiant 30.34682 5.231214
data_summary <- data %>% summarize(avg_mpg = mean(mpg))
data_summary
## avg_mpg
## 1 20.09062
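Summaries are often computed per group. The sketch below combines group_by() with summarize() to get the average mpg and the number of cars for each cylinder count. The column names avg_mpg and n_cars are chosen for this example.

```r
library(dplyr)

# One summary row per value of cyl
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), n_cars = n())

by_cyl
```

The result has one row for each of the three cylinder counts in mtcars (4, 6, and 8).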
library(tidyr)
data_long <- data %>% gather(key = "attribute", value = "value", mpg:hp)
head(data_long)
## drat wt qsec vs am gear carb power_to_weight abc attribute value
## 1 3.90 2.620 16.46 0 1 4 4 41.98473 8.015267 mpg 21.0
## 2 3.90 2.875 17.02 0 1 4 4 38.26087 7.304348 mpg 21.0
## 3 3.85 2.320 18.61 1 1 4 1 40.08621 9.827586 mpg 22.8
## 4 3.08 3.215 19.44 1 0 3 1 34.21462 6.656299 mpg 21.4
## 5 3.15 3.440 17.02 0 0 3 2 50.87209 5.436047 mpg 18.7
## 6 2.76 3.460 20.22 1 0 3 1 30.34682 5.231214 mpg 18.1
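In current versions of tidyr, gather() is superseded by pivot_longer(), which does the same kind of reshaping with more explicit argument names. A minimal sketch on a three-column subset of mtcars (the subset is chosen only to keep the output small):

```r
library(dplyr)
library(tidyr)

# Reshape mpg and cyl into attribute/value pairs, keeping wt as an identifier column
long_subset <- mtcars %>%
  select(mpg, cyl, wt) %>%
  pivot_longer(cols = c(mpg, cyl), names_to = "attribute", values_to = "value")

head(long_subset)
```

Each of the 32 cars contributes two rows, one for mpg and one for cyl. Unlike gather(), pivot_longer() keeps the rows for each car next to each other.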
data_separated <- data_long %>% separate(col = attribute, into = c("Type", "Detail"), sep = "_")
head(data_separated)
## drat wt qsec vs am gear carb power_to_weight abc Type Detail value
## 1 3.90 2.620 16.46 0 1 4 4 41.98473 8.015267 mpg <NA> 21.0
## 2 3.90 2.875 17.02 0 1 4 4 38.26087 7.304348 mpg <NA> 21.0
## 3 3.85 2.320 18.61 1 1 4 1 40.08621 9.827586 mpg <NA> 22.8
## 4 3.08 3.215 19.44 1 0 3 1 34.21462 6.656299 mpg <NA> 21.4
## 5 3.15 3.440 17.02 0 0 3 2 50.87209 5.436047 mpg <NA> 18.7
## 6 2.76 3.460 20.22 1 0 3 1 30.34682 5.231214 mpg <NA> 18.1
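In the output above, Detail is NA because the attribute values (mpg, cyl, disp, hp) contain no underscore to split on. The sketch below uses a made-up data frame whose values do contain the separator, to show what separate() does when it has something to split.

```r
library(tidyr)

# Hypothetical data: each attribute has the form "<type>_<detail>"
measurements <- data.frame(
  attribute = c("engine_hp", "engine_cyl", "body_wt"),
  value = c(110, 6, 2.62)
)

split_data <- separate(measurements, col = attribute,
                       into = c("Type", "Detail"), sep = "_")
split_data
```

Here Type holds the part before the underscore (engine, engine, body) and Detail the part after it (hp, cyl, wt).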
library(data.table)
dt <- as.data.table(mtcars)
dt[, .(avg_mpg = mean(mpg)), by = cyl]
## cyl avg_mpg
## <num> <num>
## 1: 6 19.74286
## 2: 4 26.66364
## 3: 8 15.10000
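The general data.table form is dt[i, j, by]: filter rows in i, compute in j, and group with by. The sketch below combines all three. The column names n_cars and avg_hp are chosen for this example.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Keep cars with mpg above 20, then count them and average hp per cylinder group
efficient <- dt[mpg > 20, .(n_cars = .N, avg_hp = mean(hp)), by = cyl]
efficient
```

.N is a special data.table symbol that holds the number of rows in each group.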
Data manipulation is a key aspect of data analysis in R. The dplyr, tidyr, and data.table packages provide powerful tools for transforming and processing data efficiently.
Question: Select only the mpg, cyl, and hp columns from the mtcars dataset.
# Write your code here
Answer:
selected_data <- mtcars %>% select(mpg, cyl, hp)
head(selected_data)
## mpg cyl hp
## Mazda RX4 21.0 6 110
## Mazda RX4 Wag 21.0 6 110
## Datsun 710 22.8 4 93
## Hornet 4 Drive 21.4 6 110
## Hornet Sportabout 18.7 8 175
## Valiant 18.1 6 105
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
For vectors
library(ggplot2)
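# qplot() is deprecated as of ggplot2 3.4.0; the ggplot() equivalent is:
ggplot(data.frame(x, y), aes(x, y)) + geom_point()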
For datasets
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) + geom_point()
ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point() +
facet_grid(. ~ cyl)
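Plots for reports usually need readable titles and axis labels. The sketch below adds them with labs() and applies a built-in theme. The label text is only a suggestion.

```r
library(ggplot2)

# Store the plot in an object so it can be printed or extended later
labelled_plot <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(
    title = "Fuel efficiency versus horsepower",
    x = "Horsepower (hp)",
    y = "Miles per gallon (mpg)",
    color = "Cylinders"
  ) +
  theme_minimal()

labelled_plot
```

Because ggplot2 builds plots in layers, each + adds one component (points, labels, theme) without changing the ones before it.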
Question: Create a scatter plot of wt vs. mpg with points colored by gear.
# Write your code here
Answer:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(gear))) + geom_point()