My E-Mail: szhang@stcloudstate.edu
I will be using the following textbook:
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, 2nd Edition, by Hadley Wickham et al. (ISBN: 978-1492097402)
An electronic version: https://r4ds.hadley.nz/
RStudio is an integrated development environment (IDE) for R. It provides a user-friendly interface with features that enhance the R programming experience.
R Markdown is a document format that combines R code with narrative text and output, allowing you to create dynamic and reproducible reports.
Importing data into R is a crucial step in data analysis. R provides several functions to read data from various file formats. Here are some commonly used functions:
Reading CSV Files:
# Using read.csv
data <- read.csv("your_file.csv")
Reading Excel Files:
# Using readxl package
library(readxl)
data <- read_excel("your_file.xlsx")
Reading Text Files (Tab-Delimited or Space-Delimited):
# Using read.table
data <- read.table("your_file.txt", header = TRUE, sep = "\t") # for tab-delimited
data <- read.table("your_file.txt", header = TRUE, sep = " ") # for space-delimited
Reading JSON Files:
# Using jsonlite package
install.packages("jsonlite")
library(jsonlite)
data <- fromJSON("your_file.json")
Reading Data from URLs:
# Using read.csv for CSV from URL
url <- "https://example.com/your_data.csv"
data <- read.csv(url)
# Using jsonlite for JSON from URL
url <- "https://example.com/your_data.json"
data <- fromJSON(url)
Reading Data from Databases:
# Using DBI and RMySQL packages for MySQL
install.packages(c("DBI", "RMySQL"))
library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(),
dbname = "your_database",
host = "your_host",
user = "your_user",
password = "your_password")
data <- dbGetQuery(con, "SELECT * FROM your_table")
dbDisconnect(con) # close the connection when you are finished
Reading data with the R function “file.choose()”
spam = read.csv(file.choose()) # I will choose "spam" data from my downloads
head(spam, n = 20)
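If you do not have a data file at hand, you can still practice importing. The following is a minimal sketch that writes a small made-up data frame to a temporary CSV file and reads it back; the object names toy, toy_back, and tmp are only for illustration.

```r
# Write a small data frame to a temporary CSV file, then read it back.
tmp <- tempfile(fileext = ".csv")        # temporary file path chosen by R
toy <- data.frame(id = 1:3,
                  fruit = c("apple", "banana", "orange"))  # made-up data
write.csv(toy, tmp, row.names = FALSE)   # row.names = FALSE avoids an extra index column

toy_back <- read.csv(tmp)                # import the file we just wrote
str(toy_back)                            # inspect column names and types
unlink(tmp)                              # delete the temporary file
```

The same pattern works with read.table() if you write the file with write.table() and a matching sep value.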
We introduce some basic data types in R. Data types can be “double”, “integer”, “character”, “logical”, “list”, and so on. You can use the code: typeof(objectName) to check the type of an object.
# The following line of code will assign the string "Adam" to object "name".
# You can understand the code this way: store the string "Adam" in a container called "name"
name <- "Adam"
# Assign the numeric value 19 to the object called "age"
age <- 19
# Assign the numeric value 95 to the object called "score"
score <- 95
# Assign the logical value "TRUE" to the object called "answer1"
answer1 <- TRUE
# Assign the logical value "FALSE" to the object called "answer2"
answer2 <- FALSE
# Check the type using R function called "typeof"
typeof(name) # character
## [1] "character"
typeof(age) # double
## [1] "double"
typeof(answer1) # logical
## [1] "logical"
We created 5 R objects above. They occupy some of your computer's memory for as long as they exist in the R session.
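The types "integer" and "list" mentioned earlier can be checked the same way. Here is a short sketch; the object name count is only for illustration.

```r
count <- 19L           # the L suffix stores 19 as an integer rather than a double
typeof(count)          # "integer"
typeof(19)             # "double": numbers typed without L are doubles
typeof(list(1, "a"))   # "list"
is.numeric(count)      # TRUE: integers and doubles both count as numeric
```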
R offers several fundamental data structures that are essential for organizing and manipulating data. Here are some basic R data structures:
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "orange")
Matrices: A two-dimensional array with rows and columns. All elements in a matrix must be of the same data type.
matrix_data <- matrix(1:6, nrow = 2, ncol = 3)
matrix_data
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
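As the output shows, matrix() fills the values column by column. The sketch below shows how to fill by row instead and how to pick out elements with square brackets; the names m and m_by_row are only for illustration.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)              # filled column by column (the default)
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row instead

dim(m)          # number of rows and columns: 2 3
m[2, 3]         # element in row 2, column 3: 6
m[, 2]          # the entire second column: 3 4
m_by_row[1, ]   # the entire first row: 1 2 3
```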
data_frame <- data.frame(name = c("John", "Jane", "Bob", "Alice", "Amy"),
age = c(25, 30, 22, 19, 20),
score = c(90, 85, 92, 90, 98))
To extract a column from a data frame, use the $ sign.
# Extract the age column
data_frame$age
## [1] 25 30 22 19 20
The above code extracts the age column from the data frame called “data_frame”.
To extract multiple columns, use the “subset” function:
# Extract the two columns name and score from the data frame
subset(data_frame, select = c(name, score))
## name score
## 1 John 90
## 2 Jane 85
## 3 Bob 92
## 4 Alice 90
## 5 Amy 98
To extract rows that satisfy conditions, use the subset argument:
# Choose people whose age>21 and score > 88
subset(data_frame, subset = age>21 & score > 88)
## name age score
## 1 John 25 90
## 3 Bob 22 92
The above code chooses people whose age>21 and score > 88.
# Choose people whose age>22 and score = 90
subset(data_frame, subset = age>22 & score == 90)
## name age score
## 1 John 25 90
The above code chooses people whose age>22 and score = 90. Note the “==” symbol for testing whether two objects are equal.
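The same selections can also be written with square brackets and a logical vector, which is what subset() does behind the scenes. This is a sketch of one equivalent approach; the names people and keep are only for illustration.

```r
people <- data.frame(name = c("John", "Jane", "Bob", "Alice", "Amy"),
                     age = c(25, 30, 22, 19, 20),
                     score = c(90, 85, 92, 90, 98))

# Logical vector: TRUE for rows where both conditions hold
keep <- people$age > 21 & people$score > 88
keep

people[keep, ]                      # rows that satisfy the conditions, all columns
people[keep, c("name", "score")]    # same rows, only two columns
```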
gender <- factor(c("Male", "Female", "Male", "Female"))
The above code creates a factor with 2 levels (“Female” and “Male”). By default, levels are arranged in alphabetical order. When creating a bar plot for the data, R will therefore place the bar for females ahead of the bar for males.
You can change the order of levels as follows:
gender <- factor(c("Male", "Female", "Male", "Female"))
# Reset the order of levels (assigning to levels() would relabel the data instead of reordering it)
gender <- factor(gender, levels = c("Male", "Female"))
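It is worth checking a factor after setting its level order. The sketch below specifies the order through the levels argument of factor() and then confirms that only the order of levels is affected, not the data values.

```r
gender <- factor(c("Male", "Female", "Male", "Female"),
                 levels = c("Male", "Female"))   # explicit level order

levels(gender)         # "Male" "Female"
table(gender)          # counts are reported in level order
as.character(gender)   # the data values themselves are unchanged
as.integer(gender)     # underlying integer codes: 1 2 1 2
```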
A statement like “x + y equals z” can be either true or false. In R, if the statement is true, its corresponding value is represented as TRUE; otherwise, it is denoted as FALSE.
logical_vector1 <- c(FALSE, FALSE, TRUE, FALSE)
logical_vector1
## [1] FALSE FALSE TRUE FALSE
logical_vector2 <- c(3>2, 4>5, 6==7, 8.0 == 8, 9 != 10)
logical_vector2
## [1] TRUE FALSE FALSE TRUE TRUE
my_variable <- NULL
my_variable
## NULL
A list is an ordered collection of elements that may have different data types. Elements in a list can be vectors, matrices, data frames, etc.
my_list <- list(name = "John", age = 25, scores = c(90, 85, 92), df = data_frame)
my_list
## $name
## [1] "John"
##
## $age
## [1] 25
##
## $scores
## [1] 90 85 92
##
## $df
## name age score
## 1 John 25 90
## 2 Jane 30 85
## 3 Bob 22 92
## 4 Alice 19 90
## 5 Amy 20 98
The created list has 4 elements. To access the scores element, we do
my_list$scores # or do my_list[[3]]
## [1] 90 85 92
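Single and double brackets behave differently on a list. The sketch below uses a smaller version of the list and adds one element by name; the element city is made up for illustration.

```r
my_list <- list(name = "John", age = 25, scores = c(90, 85, 92))

my_list[["scores"]]    # double brackets return the element itself (a numeric vector)
my_list["scores"]      # single brackets return a list of length 1 containing it
my_list$scores[2]      # second score: 85

my_list$city <- "St. Cloud"   # add a new element by name (made-up value)
names(my_list)                # "name" "age" "scores" "city"
```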
These basic data structures form the foundation for working with data in R. Understanding how to create, manipulate, and access elements within these structures is crucial for effective data analysis and modeling in R.
In R, you can use the if…else construct for conditional statements. The basic syntax is as follows:
if (condition) {
# Code to execute if the condition is TRUE
} else {
# Code to execute if the condition is FALSE
}
A simple example:
# Example 1
x <- 10
if (x > 5) {
print("x is greater than 5")
} else {
print("x is not greater than 5")
}
## [1] "x is greater than 5"
You can also include multiple conditions using else if:
# Example 2
y <- 7
if (y > 10) {
print("y is greater than 10")
} else if (y > 5) {
print("y is greater than 5 but not greater than 10")
} else {
print("y is not greater than 5")
}
## [1] "y is greater than 5 but not greater than 10"
In programming, a function is a reusable block of code that performs a specific task or set of tasks. Functions provide modularity, making it easier to organize and maintain code. In R, functions can be built-in (provided by the language) or user-defined (created by the user).
Here are examples of built-in functions.
Example 1. The ifelse() function
The ifelse() function is used for vectorized conditional operations. It allows you to apply a condition to each element of a vector and return a new vector with values based on whether the condition is true or false for each element.
z <- c(3, 8, 12)
ifelse(z > 10, "Yes", "No")
## [1] "No" "No" "Yes"
Explanation of the code:
For each element in the vector z, the condition z > 10 is evaluated. If an element is greater than 10, the corresponding element of the output is “Yes”; otherwise it is “No”. The output is a character vector with the same length as z.
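Calls to ifelse() can be nested when there are more than two outcomes. Here is a small sketch; the labels and the object name size are made up, and the NA is included to show that a missing input gives a missing output.

```r
z <- c(3, 8, 12, NA)

# First test z > 10; for the remaining elements test z > 5
size <- ifelse(z > 10, "large",
               ifelse(z > 5, "medium", "small"))
size   # "small" "medium" "large" NA
```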
Example 2. The paste() function
The paste() function is used to concatenate (combine) character strings. In this case, it concatenates the first_name and last_name variables.
first_name <- "John"
last_name <- "Doe"
paste(first_name, last_name)
## [1] "John Doe"
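Here’s what happens: paste() joins the two strings and places a single space between them, because the default separator is sep = " ".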
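A few variations of paste() are useful in practice. The sketch below changes the separator, drops it with paste0(), and collapses a vector into one string.

```r
first_name <- "John"
last_name <- "Doe"

paste(first_name, last_name, sep = "_")   # "John_Doe"
paste0(first_name, last_name)             # "JohnDoe": paste0() uses no separator
paste("ID", 1:3)                          # vectorized: "ID 1" "ID 2" "ID 3"
paste(c("a", "b", "c"), collapse = "-")   # one string: "a-b-c"
```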
In R, a user-defined function has a name, a list of parameters, and a body. The general structure is:
my_function <- function(par1, par2, ...) {
# Function body
# Code to be executed
# ...
}
Here are some examples of user-defined functions:
Example 1:
# Function to calculate the area of a circle
calculate_circle_area <- function(radius) {
area <- pi * radius^2
return(area)
}
# Usage
calculate_circle_area(5)
## [1] 78.53982
Example 2: Function to Find the Maximum of Two Numbers
# Function to find the maximum of two numbers
find_maximum <- function(a, b) {
if (a > b) {
return(a)
} else {
return(b)
}
}
# Usage
num1 <- 15
num2 <- 8
find_maximum(num1, num2)
## [1] 15
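Parameters can be given default values, which are used when the caller does not supply them. The function to_percent() below is a made-up helper for illustration. It also relies on the fact that R returns the value of the last evaluated expression when return() is omitted.

```r
# Hypothetical helper: convert raw scores to percentages
to_percent <- function(score, max_score = 100, digits = 1) {
  stopifnot(is.numeric(score), max_score > 0)   # basic input checks
  round(100 * score / max_score, digits)        # last expression is the return value
}

# Usage
to_percent(87.25)                 # default max_score = 100 is used
to_percent(45, max_score = 50)    # 90
to_percent(c(18, 20), 20)         # works on vectors too: 90 100
```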
In R, the pipe operator (%>%) is used to chain multiple operations together in a more readable and expressive way. The pipe operator is part of the magrittr package, and it allows you to pass the result of one operation as the first argument to the next operation.
Here’s a brief explanation of how the pipe operator works.
# Example without pipe
result1 <- f(g(h(x))) # here f, g, and h are 3 functions
# Example with pipe
result2 <- x %>% h() %>% g() %>% f()
Explanation:
In the second example, the value x is passed through a series of functions (h(), g(), f()) in a left-to-right fashion. It makes the code more readable by avoiding nested function calls and aligning operations in a sequence.
The two results are the same.
When a function takes two or more parameters, you can use the pipe operator (%>%) to pass the output of the preceding operation as the first argument to the next operation. The pipe operator helps create more readable and concise code.
Here’s an example where a function takes two parameters:
# Example without pipe
result1 <- f(g(x, y), z)
# Example with pipe
result2 <- x %>% g(y) %>% f(z)
Explanation:
In the second example, the value x is passed to the first function g() with the parameter y. The result of g(x, y) is then passed as the first argument to the next function f() with the parameter z.
The two results are the same.
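Since f, g, and h above are placeholders, here is a runnable sketch with real functions. It assumes the magrittr package is installed; the last step, round(1), shows a function that takes a second argument.

```r
library(magrittr)   # provides %>% (also loaded with dplyr or tidyverse)

x <- c(1, 4, 9, 16)

# Nested version: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped version: read from left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

nested                    # 2.5
identical(nested, piped)  # TRUE
```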
We will use the package dplyr which provides the following functions for data manipulation.
# Create a sample data frame
df <- data.frame(
ID = c(1, 2, 3, 4, 5),
Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
Age = c(25, 30, 22, 35, 28),
Salary = c(50000, 60000, 45000, 70000, 55000)
)
library(dplyr) # This package will be loaded if you load package "tidyverse"
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# select(): forms a new data frame with selected columns
selected_df <- select(df, ID, Name)
print(selected_df)
## ID Name
## 1 1 Alice
## 2 2 Bob
## 3 3 Charlie
## 4 4 David
## 5 5 Eva
# arrange(): forms a new data frame with rows arranged in a specified order
arranged_df <- arrange(df, Age)
print(arranged_df)
## ID Name Age Salary
## 1 3 Charlie 22 45000
## 2 1 Alice 25 50000
## 3 5 Eva 28 55000
## 4 2 Bob 30 60000
## 5 4 David 35 70000
# filter(): forms a new data frame consisting of rows that satisfy certain filtering conditions
filtered_df <- filter(df, Age > 25)
print(filtered_df)
## ID Name Age Salary
## 1 2 Bob 30 60000
## 2 4 David 35 70000
## 3 5 Eva 28 55000
# mutate(): forms a new data frame by adding a new column to the existing ones
mutated_df <- mutate(df, SalaryAfterBonus = Salary + 5000)
print(mutated_df)
## ID Name Age Salary SalaryAfterBonus
## 1 1 Alice 25 50000 55000
## 2 2 Bob 30 60000 65000
## 3 3 Charlie 22 45000 50000
## 4 4 David 35 70000 75000
## 5 5 Eva 28 55000 60000
# transmute(): forms a new data frame that keeps only the newly created column
transmuted_df <- transmute(df, SalaryAfterBonus = Salary + 5000)
print(transmuted_df)
## SalaryAfterBonus
## 1 55000
## 2 65000
## 3 50000
## 4 75000
## 5 60000
# summarize(): summarizes a data frame into a single row
summary_df <- summarize(df, AvgAge = mean(Age), TotalSalary = sum(Salary))
print(summary_df)
## AvgAge TotalSalary
## 1 28 280000
# distinct(): collapses identical observations into a single one
distinct_df <- distinct(df, Age)
print(distinct_df)
## Age
## 1 25
## 2 30
## 3 22
## 4 35
## 5 28
# group_by(): groups the data to perform tasks by groups and it often is followed by the summarize() function
grouped_df <- group_by(df, Age)
summary_by_age <- summarize(grouped_df, AvgSalary = mean(Salary))
print(summary_by_age)
## # A tibble: 5 × 2
## Age AvgSalary
## <dbl> <dbl>
## 1 22 45000
## 2 25 50000
## 3 28 55000
## 4 30 60000
## 5 35 70000
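In practice these functions are usually chained with the pipe operator, and group_by() is most useful when groups contain more than one row. The sketch below uses a made-up data frame called staff with a hypothetical Dept column.

```r
library(dplyr)

# Made-up data with a grouping column
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva", "Frank"),
  Dept = c("Sales", "Sales", "IT", "IT", "IT", "Sales"),
  Salary = c(50000, 60000, 45000, 70000, 55000, 40000)
)

dept_summary <- staff %>%
  filter(Salary >= 45000) %>%                    # keep rows with Salary of at least 45000
  group_by(Dept) %>%                             # one group per department
  summarize(n = n(), AvgSalary = mean(Salary)) %>%   # one row per group
  arrange(desc(AvgSalary))                       # highest average first

print(dept_summary)
```

Each step receives the data frame produced by the previous step, so no intermediate objects are needed.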
# Descriptive statistics
data <- c(7,7,9,12,15,18,19,21,21,26,32,35,45,46,49,55,61,63,63,68,70,70,71,75,78,79,81,86,87,98)
mean_value <- mean(data)
sd_value <- sd(data) # Give the standard deviation
Max <- max(data)
Min <- min(data)
five_number_summaries <- quantile(data) # 5-number summary
other_quantiles <- quantile(data, c(0.6, 0.8)) # the 60th and 80th percentiles
Summary <- summary(data) # 5-number summary along with mean
When data contain missing values (NA), functions such as mean(), sd(), max(), min(), and quantile() either return NA or stop with an error unless you set na.rm = TRUE to remove the missing values first.
# Descriptive statistics
data <- c(7,7,9,12,15,18,19,21,21,NA,26,32,35,45,46,49,55,61,63,63,68,NA,70,70,71,75,78,79,81,86,87,98)
mean_value <- mean(data, na.rm = TRUE)
sd_value <- sd(data, na.rm = TRUE) # Give the standard deviation
Max <- max(data, na.rm = TRUE)
Min <- min(data, na.rm = TRUE)
five_number_summaries <- quantile(data, na.rm = TRUE) # 5-number summary
other_quantiles <- quantile(data, c(0.6, 0.8), na.rm = TRUE) # the 60th and 80th percentiles
Summary <- summary(data) # 5-number summary along with mean, handling missing values automatically
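To see why na.rm matters, compare the results with and without it on a short vector. This is a small sketch with made-up numbers.

```r
x <- c(7, 9, NA, 15, NA, 21)

mean(x)                  # NA: a single missing value makes the result missing
sum(is.na(x))            # number of missing values: 2
mean(x, na.rm = TRUE)    # 13: computed from the 4 non-missing values
mean(x[!is.na(x)])       # same result, removing the NAs by hand
```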
R provides functions such as runif(), rnorm(), and rbinom() for generating random numbers. Setting a seed with set.seed() makes the results reproducible: whenever the same seed (an integer) is used, the generated random numbers will be the same.
set.seed(123) # Here 123 is called the seed. Any seed is ok.
runif(10) # generate 10 numbers from the uniform distribution between 0 and 1
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
## [8] 0.8924190 0.5514350 0.4566147
runif(10, 100, 200) # generate 10 numbers between 100 and 200
## [1] 195.6833 145.3334 167.7571 157.2633 110.2925 189.9825 124.6088 104.2060
## [9] 132.7921 195.4504
rnorm(10) # generate 10 numbers from the standard normal distribution
## [1] 1.2240818 0.3598138 0.4007715 0.1106827 -0.5558411 1.7869131
## [7] 0.4978505 -1.9666172 0.7013559 -0.4727914
rnorm(10, 100, 15) # generate 10 numbers from the normal distribution with mean 100 and std 15
## [1] 83.98264 96.73038 84.60993 89.06663 90.62441 74.69960 112.56681
## [8] 102.30060 82.92795 118.80722
rbinom(10, 5, 0.5) # generate 10 numbers from the binomial distribution with n = 5 and p = 0.5
## [1] 3 1 2 2 4 2 3 3 3 2
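The effect of the seed can be checked directly. In the sketch below the seed 2024 is an arbitrary choice, and the object names are only for illustration.

```r
set.seed(2024)
a <- rnorm(5)       # first draw after setting the seed

set.seed(2024)
b <- rnorm(5)       # same seed, so the same 5 numbers

later <- rnorm(5)   # no reset: the generator continues with new numbers

identical(a, b)       # TRUE
identical(a, later)   # FALSE
```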
In base R, you can create a variety of basic data visualizations using functions like plot(), hist(), boxplot(), and others. Here are some examples of data visualizations using base R:
Scatter Plot:
# Create sample data
set.seed(123)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
# Scatter plot
plot(x, y, main = "Scatter Plot", xlab = "X-axis", ylab = "Y-axis")
Histogram:
# Create sample data
set.seed(456)
data <- rnorm(100)
# Histogram
hist(data, main = "Histogram", xlab = "Values", col = "lightblue")
Boxplot:
# Create sample data
set.seed(789)
group1 <- rnorm(50, mean = 0, sd = 1)
group2 <- rnorm(50, mean = 2, sd = 1)
# Boxplot
boxplot(group1, group2, names = c("Group 1", "Group 2"), main = "Boxplot", col = c("lightblue", "lightgreen"))
Bar Chart:
# Create sample data
categories <- c("A", "B", "C", "D")
counts <- c(10, 5, 8, 12)
# Bar chart
barplot(counts, names.arg = categories, main = "Bar Chart", xlab = "Categories", ylab = "Counts", col = "skyblue")
If you have a vector of individual-level data rather than pre-computed counts, you can tabulate it directly. Here’s an example assuming you have a vector named categories:
# Assuming you have a vector of individual-level data
categories <- c("A", "B", "C", "A", "B", "A", "C", "D", "E", "C", "D", "E", "A", "B", "C")
# Bar chart
barplot(table(categories), main = "Bar Chart of Category Counts", xlab = "Categories", ylab = "Counts", col = "skyblue")
This code uses the table() function to generate counts for each category and then creates a bar chart using barplot(). Adjust the categories vector based on your actual data.
Scatter Plot:
library(ggplot2)
# Sample data
set.seed(123)
data <- data.frame(x = rnorm(50), y = 2 * rnorm(50) + 1)
# Scatter plot
ggplot(data, aes(x, y)) +
geom_point() +
labs(title = "Scatter Plot", x = "X-axis", y = "Y-axis")
Histogram:
# Sample data
set.seed(456)
data <- data.frame(values = rnorm(100))
# Histogram
ggplot(data, aes(x = values)) +
geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black") +
labs(title = "Histogram", x = "Values", y = "Frequency")
Boxplot:
# Sample data
set.seed(789)
data <- data.frame(group = rep(c("Group 1", "Group 2"), each = 50),
values = c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 2, sd = 1)))
# Boxplot
ggplot(data, aes(x = group, y = values, fill = group)) +
geom_boxplot() +
labs(title = "Boxplot", x = "Groups", y = "Values")
Line Plot:
# Sample data
set.seed(987)
data <- data.frame(time = 1:20, values = cumsum(rnorm(20)))
# Line plot
ggplot(data, aes(x = time, y = values)) +
geom_line() +
labs(title = "Line Plot", x = "Time", y = "Values")
Bar Chart:
# Sample data
data <- data.frame(categories = c("A", "B", "C", "D", "E"),
counts = c(10, 5, 15, 8, 12))
# Bar chart for discrete distribution
ggplot(data, aes(x = categories, y = counts, fill = categories)) +
geom_bar(stat = "identity") +
labs(title = "Bar Chart", x = "Categories", y = "Counts")
# Bar chart for categorical data
ggplot(diamonds, aes(x = clarity)) +
geom_bar()
# Bar chart for the summary of numeric data
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average mpg by Number of Cylinders", x = "Number of Cylinders", y = "Average mpg")
We can use the facet_wrap() function to create a grid of scatter plots, one panel per group. Let’s try it with the built-in iris dataset:
library(ggplot2)
# Scatter plot with facet_wrap using iris dataset
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
facet_wrap(~ Species) +
labs(title = "Scatter Plots by Species", x = "Sepal Length", y = "Sepal Width")
In this example, the iris dataset contains information about iris flowers, and we create a scatter plot of Sepal Length vs. Sepal Width. The facet_wrap(~ Species) creates a grid of scatter plots, with each facet representing a different species of iris.
These are just a few examples, and ggplot2 offers extensive customization options. You can further customize colors, themes, and other aspects of the plots based on your preferences and data characteristics.
# Confidence interval
x = c(23, 34, 19, 33, 65, 45, 62, 32, 51)
confidence_interval <- t.test(x, conf.level = 0.95)$conf.int
print(confidence_interval)
## [1] 27.89542 52.99347
## attr(,"conf.level")
## [1] 0.95
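The interval reported by t.test() can be reproduced from the usual formula $\bar{x} \pm t_{0.975,\,n-1} \cdot s/\sqrt{n}$. The sketch below does the computation by hand; the names se, t_crit, and manual_ci are only for illustration.

```r
x <- c(23, 34, 19, 33, 65, 45, 62, 32, 51)

n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical value for a 95% interval

manual_ci <- mean(x) + c(-1, 1) * t_crit * se
manual_ci                          # lower and upper limits

# Compare with the interval from t.test()
all.equal(manual_ci, as.numeric(t.test(x, conf.level = 0.95)$conf.int))
```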
# Hypothesis testing (two-sample t-test)
group1 <- c(23, 25, 28, 30, 32)
group2 <- c(18, 20, 22, 25, 27)
t_test_result <- t.test(group1, group2)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: group1 and group2
## t = 2.2545, df = 8, p-value = 0.05419
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1188274 10.5188274
## sample estimates:
## mean of x mean of y
## 27.6 22.4
Confidence intervals and hypothesis testing. Comparing two groups: t-tests, chi-square tests, and non-parametric tests.
# Chi-square test for independence
table_result <- table(diamonds$cut, diamonds$clarity)
chi_square_result <- chisq.test(table_result)
print(chi_square_result)
##
## Pearson's Chi-squared test
##
## data: table_result
## X-squared = 4391.4, df = 28, p-value < 2.2e-16
Understanding the basics of linear regression.
# Simple linear regression
model <- lm(mpg ~ wt, data = mtcars)
summary(model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
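Once a model is fitted, you can extract the coefficients and make predictions. The following sketch predicts mpg for a hypothetical car with wt = 3 (in mtcars, wt is measured in units of 1000 lbs); the object name new_car is only for illustration.

```r
model <- lm(mpg ~ wt, data = mtcars)

coef(model)                        # intercept and slope as a named vector

new_car <- data.frame(wt = 3)      # hypothetical car weighing 3000 lbs
predict(model, newdata = new_car)  # point prediction: intercept + slope * 3

# Prediction with a 95% confidence interval for the mean response
predict(model, newdata = new_car, interval = "confidence")
```

The slope is negative, so the model predicts lower mpg for heavier cars.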