Stat325LectureNotes

Contact Information

Textbook/references

I will be using the following textbook:

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, 2nd Edition, by Hadley Wickham et al. (ISBN: 978-1492097402)
An electronic version: https://r4ds.hadley.nz/

RStudio

RStudio is an integrated development environment (IDE) for R. It provides a user-friendly interface with features that enhance the R programming experience, including:

Script Editor: Write and execute R scripts.
Console: Interactively run R code.
Environment and History: View and manage your workspace.
Plots and Files: Visualize plots and manage files.
Packages: Install and manage R packages.
Help and Documentation: Access R documentation and help files.
Git Integration: Facilitates version control with Git.

R Markdown:

R Markdown is a document format that combines R code with narrative text and output, allowing you to create dynamic and reproducible reports. Key features include:

R Code Chunks: Embed R code within your document.
Markdown Syntax: Use simple and expressive Markdown syntax for text formatting.
Dynamic Output: Include R code outputs (e.g., tables, plots) directly in the document.
Parameterized Reports: Create parameterized reports for easy customization.
Multiple Output Formats: Generate reports in various formats, such as HTML, PDF, and Word.
Integrated into RStudio: Seamless integration with RStudio for a streamlined workflow.

Importing Data

Importing data into R is a crucial step in data analysis. R provides several functions to read data from various file formats. Here are some commonly used functions:

Reading CSV Files:

# Using read.csv
data <- read.csv("your_file.csv")

Reading Excel Files:

# Using readxl package
library(readxl)

data <- read_excel("your_file.xlsx")

Reading Text Files (Tab-Delimited or Space-Delimited):

# Using read.table
data <- read.table("your_file.txt", header = TRUE, sep = "\t")  # for tab-delimited
data <- read.table("your_file.txt", header = TRUE, sep = " ")   # for space-delimited

Reading JSON Files:

# Using jsonlite package
install.packages("jsonlite")  # run once; skip if the package is already installed
library(jsonlite)

data <- fromJSON("your_file.json")

Reading Data from URLs:

# Using read.csv for CSV from URL
url <- "https://example.com/your_data.csv"
data <- read.csv(url)

# Using jsonlite for JSON from URL
url <- "https://example.com/your_data.json"
data <- fromJSON(url)

Reading Data from Databases:

# Using DBI and RMySQL packages for MySQL
install.packages(c("DBI", "RMySQL"))  # run once; skip if the packages are already installed
library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "your_database",
                 host = "your_host",
                 user = "your_user",
                 password = "your_password")

data <- dbGetQuery(con, "SELECT * FROM your_table")

dbDisconnect(con)  # close the connection when you are done

Reading data with the R function “file.choose()”

spam = read.csv(file.choose()) # I will choose "spam" data from my downloads
head(spam, n = 20)
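
The import functions above all need a file that already exists. Here is a minimal self-contained sketch, assuming only base R: it writes a small data frame to a temporary CSV file and reads it back, so you can see what read.csv() returns. The object names toy, tmp, and toy_in and the column names are made up for illustration.

```r
# Hypothetical toy data, written to a temporary file so the example is self-contained
toy <- data.frame(id = 1:3, fruit = c("apple", "banana", "orange"))
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read the file back in
toy_in <- read.csv(tmp)
str(toy_in)    # inspect the column names and types
nrow(toy_in)   # number of rows read

unlink(tmp)    # delete the temporary file
```

Running str() or head() right after an import is a quick way to confirm that the columns were read with the types you expected.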

Introduction to R Programming:

Data Type

We introduce some basic data types in R. Data types include “double”, “integer”, “character”, “logical”, “list”, and so on. Use typeof(objectName) to check the type of an object.

# The following line of code will assign the string "Adam" to object "name".
# You can understand the code this way: store the string "Adam" in a container called "name"
name <- "Adam" 

# Assign the numeric value 19 to the object called "age"
age <- 19

# Assign the numeric value 95 to the object called "score"
score <- 95

# Assign the logical value "TRUE" to the object called "answer1"
answer1 <- TRUE

# Assign the logical value "FALSE" to the object called "answer2"
answer2 <- FALSE

# Check the type using R function called "typeof"
typeof(name)    # character

## [1] "character"

typeof(age)     # double

## [1] "double"

typeof(answer1) # logical

## [1] "logical"

We created 5 R objects above (name, age, score, answer1, and answer2). Each one occupies some of your computer's memory for as long as it exists in your R session.
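
Notice that typeof(age) is "double" even though 19 looks like a whole number. The following sketch, which uses only base R and made-up object names, shows how to request an integer and how to convert between types.

```r
n_double  <- 19     # numbers are stored as double by default
n_integer <- 19L    # the L suffix requests an integer

typeof(n_double)    # "double"
typeof(n_integer)   # "integer"

# Both count as numeric
is.numeric(n_double)    # TRUE
is.numeric(n_integer)   # TRUE

# Explicit conversion between types
as.character(n_double)  # "19"
as.numeric("3.5")       # 3.5
as.logical("TRUE")      # TRUE
as.integer(TRUE)        # 1
```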

Data Structure

R offers several fundamental data structures that are essential for organizing and manipulating data. Here are some basic R data structures:

Vectors: A one-dimensional array that can hold elements of the same data type.

numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "orange")

Matrices: A two-dimensional array with rows and columns. All elements in a matrix must be of the same data type.

matrix_data <- matrix(1:6, nrow = 2, ncol = 3)
matrix_data

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
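
A short sketch of what "same data type" means in practice, and of how to pull elements out of a vector or matrix. It uses only base R; the object name mixed is made up for illustration.

```r
# Mixing types in c() coerces every element to the most flexible type
mixed <- c(1, "a", TRUE)
typeof(mixed)                        # "character"

# Vector indexing starts at 1
numeric_vector <- c(1, 2, 3, 4, 5)
numeric_vector[2]                    # 2
numeric_vector[c(1, 5)]              # 1 5
numeric_vector[numeric_vector > 3]   # 4 5

# Matrix indexing uses [row, column]
matrix_data <- matrix(1:6, nrow = 2, ncol = 3)
matrix_data[2, 3]                    # 6
matrix_data[, 2]                     # the second column: 3 4
dim(matrix_data)                     # 2 3
```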

Data Frames: A two-dimensional tabular data structure similar to a table in a database. Columns can be of different data types.

data_frame <- data.frame(name = c("John", "Jane", "Bob", "Alice", "Amy"),
                         age = c(25, 30, 22, 19, 20),
                         score = c(90, 85, 92, 90, 98))

To extract a column from a data frame, use the $ sign.

# Extract the age column
data_frame$age

## [1] 25 30 22 19 20

The above code extracts the age column from the data frame called “data_frame”.

To extract multiple columns, use the “subset” function:

# Extract the two columns name and score from the data frame
subset(data_frame, select = c(name, score))

##    name score
## 1  John    90
## 2  Jane    85
## 3   Bob    92
## 4 Alice    90
## 5   Amy    98

To extract data that satisfy conditions, add the subset option:

# Choose people whose age>21 and score > 88
subset(data_frame, subset = age>21 & score > 88)

##   name age score
## 1 John  25    90
## 3  Bob  22    92

The above code chooses people whose age>21 and score > 88.

# Choose people whose age>22 and score = 90
subset(data_frame, subset = age>22 & score == 90)

##   name age score
## 1 John  25    90

The above code chooses people whose age>22 and score = 90. Note the “==” symbol for testing whether two objects are equal.
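
The same selections can also be written with square brackets, in the form data_frame[rows, columns]. This is a sketch using base R only; the names keep, by_brackets, and by_subset are made up for illustration.

```r
data_frame <- data.frame(name = c("John", "Jane", "Bob", "Alice", "Amy"),
                         age = c(25, 30, 22, 19, 20),
                         score = c(90, 85, 92, 90, 98))

# A logical vector marking the rows where both conditions hold
keep <- data_frame$age > 21 & data_frame$score > 88

# Those rows, all columns
by_brackets <- data_frame[keep, ]
by_brackets

# Those rows, only the name and score columns
data_frame[keep, c("name", "score")]

# Compare with the subset() version used above
by_subset <- subset(data_frame, subset = age > 21 & score > 88)
all.equal(by_brackets, by_subset)
```

Inside the brackets you must write data_frame$age, whereas subset() lets you refer to the column as age directly.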

Factors: Used to represent categorical data.

gender <- factor(c("Male", "Female", "Male", "Female"))

The above code creates a factor with 2 levels (“Female” and “Male”). Levels are arranged in alphabetical order by default, unless you change the order. When creating a bar plot for the data, R will automatically place the bar for females ahead of the bar for males.

You can change the order of levels as follows:

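gender <- factor(c("Male", "Female", "Male", "Female"))

# Reset the order of levels by re-creating the factor with an explicit levels argument.
# (Assigning to levels(gender) would relabel the data instead of reordering the levels.)
gender <- factor(gender, levels = c("Male", "Female"))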

Logical: Represents logical (Boolean) values TRUE or FALSE.

A statement like “x + y equals z” can be either true or false. In R, if the statement is true, its corresponding value is represented as TRUE; otherwise, it is denoted as FALSE.

logical_vector1 <- c(FALSE, FALSE, TRUE, FALSE)
logical_vector1

## [1] FALSE FALSE  TRUE FALSE

logical_vector2 <- c(3>2, 4>5, 6==7, 8.0 == 8, 9 != 10)
logical_vector2

## [1]  TRUE FALSE FALSE  TRUE  TRUE

NULL: Represents the absence of a value (an empty or undefined object).

my_variable <- NULL
my_variable

## NULL

Lists

A list is an ordered collection of different data types. Elements in a list can be vectors, matrices, data frames, etc.

my_list <- list(name = "John", age = 25, scores = c(90, 85, 92), df = data_frame)
my_list

## $name
## [1] "John"
## 
## $age
## [1] 25
## 
## $scores
## [1] 90 85 92
## 
## $df
##    name age score
## 1  John  25    90
## 2  Jane  30    85
## 3   Bob  22    92
## 4 Alice  19    90
## 5   Amy  20    98

The created list has 4 elements. To access the scores element, we do

my_list$scores # or do my_list[[3]]

## [1] 90 85 92

These basic data structures form the foundation for working with data in R. Understanding how to create, manipulate, and access elements within these structures is crucial for effective data analysis and modeling in R.

If…Else Conditions

In R, you can use the if…else construct for conditional statements. The basic syntax is as follows:

if (condition) {
  # Code to execute if the condition is TRUE
} else {
  # Code to execute if the condition is FALSE
}

A simple example:

# Example 1
x <- 10

if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is not greater than 5")
}

## [1] "x is greater than 5"

You can also include multiple conditions using else if:

# Example 2
y <- 7

if (y > 10) {
  print("y is greater than 10")
} else if (y > 5) {
  print("y is greater than 5 but not greater than 10")
} else {
  print("y is not greater than 5")
}

## [1] "y is greater than 5 but not greater than 10"

The “for” and “while” Loops
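
A minimal sketch of both loop types, assuming nothing beyond base R (the variable names total, value, and steps are made up for illustration). A for loop repeats its body once for each element of a sequence; a while loop repeats its body as long as its condition stays TRUE.

```r
# for loop: add up the numbers 1 to 5
total <- 0
for (i in 1:5) {
  total <- total + i
}
total   # 15

# while loop: keep doubling a value until it exceeds 100
value <- 1
steps <- 0
while (value <= 100) {
  value <- value * 2
  steps <- steps + 1
}
value   # 128
steps   # 7
```

Use a for loop when you know in advance how many repetitions you need, and a while loop when the stopping point depends on what happens inside the loop.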

R Functions

In programming, a function is a reusable block of code that performs a specific task or set of tasks. Functions provide modularity, making it easier to organize and maintain code. In R, functions can be built-in (provided by the language) or user-defined (created by the user).

Here are examples of built-in functions.

Example 1. The ifelse() function

The ifelse() function is used for vectorized conditional operations. It allows you to apply a condition to each element of a vector and return a new vector with values based on whether the condition is true or false for each element.

z <- c(3, 8, 12)
ifelse(z > 10, "Yes", "No")

## [1] "No"  "No"  "Yes"

Explanation of the code:

z is a numeric vector containing the values 3, 8, and 12.
The ifelse() function takes three arguments: (1) The condition: z > 10, (2) The value to be returned when the condition is TRUE: “Yes”, and (3) The value to be returned when the condition is FALSE: “No”

For each element in the vector z, the condition z > 10 is evaluated. If an element is greater than 10, the corresponding result element is “Yes”. If an element is not greater than 10, the corresponding result element is “No”. The result is a character vector of “Yes” and “No” values with the same length as z.

Example 2. The paste() function

The paste() function is used to concatenate (combine) character strings. In this case, it concatenates the first_name and last_name variables.

first_name <- "John"
last_name <- "Doe"
paste(first_name, last_name)

## [1] "John Doe"

Here’s what happens:

first_name is assigned the value “John”.
last_name is assigned the value “Doe”.
paste(first_name, last_name) takes the two character strings and combines them with a space in between.
The result of the paste() function in this case is a single character string: “John Doe”.

In R, a user-defined function has a specific structure, which includes the following elements:

The function name can’t include any space and can’t start with a special symbol or digit.
The “function” keyword is used to define a new function.
The function definition includes parameters (if any) enclosed in parentheses immediately after the “function” keyword. Parameters are variables that represent the input values passed to the function.
The body of the function is enclosed in curly braces {}. The function body contains the code that defines the operations to be performed. It can include conditional statements, loops, and any other valid R code. The return() statement is used to specify the value that the function should return. It is optional; if omitted, the last evaluated expression becomes the return value.

my_function <- function(par1, par2, ...) {
  # Function body
  # Code to be executed
  # ...
}

Here are some examples of user-defined functions:

Example 1:

# Function to calculate the area of a circle
calculate_circle_area <- function(radius) {
  area <- pi * radius^2
  return(area)
}

# Usage
calculate_circle_area(5)

## [1] 78.53982

Example 2: Function to Find the Maximum of Two Numbers

# Function to find the maximum of two numbers
find_maximum <- function(a, b) {
  if (a > b) {
    return(a)
  } else {
    return(b)
  }
}

# Usage
num1 <- 15
num2 <- 8
find_maximum(num1, num2)

## [1] 15
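
Two details from the structure described above are worth seeing in action: a parameter can have a default value, and return() can be omitted. The function rectangle_area() below is a hypothetical example, not part of any package.

```r
# Hypothetical helper: area of a rectangle.
# If width is not supplied, it defaults to length (a square).
rectangle_area <- function(length, width = length) {
  length * width   # the last evaluated expression is the return value
}

rectangle_area(4)                       # 16
rectangle_area(4, 2.5)                  # 10
rectangle_area(width = 3, length = 2)   # 6; named arguments can come in any order

# find_maximum() compares two single numbers.
# For element-by-element maxima of two vectors, base R provides pmax().
pmax(c(1, 9, 4), c(5, 2, 4))            # 5 9 4
```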

The Pipe Operator (%>%)

In R, the pipe operator (%>%) is used to chain multiple operations together in a more readable and expressive way. The pipe operator is part of the magrittr package, and it allows you to pass the result of one operation as the first argument to the next operation.

Here’s a brief explanation of how the pipe operator works.

# Example without pipe
result1 <- f(g(h(x))) # here f, g, and h are 3 functions

# Example with pipe
result2 <- x %>% h() %>% g() %>% f()

Explanation:

In the second example, the value x is passed through a series of functions (h(), g(), f()) in a left-to-right fashion. It makes the code more readable by avoiding nested function calls and aligning operations in a sequence.

The two results are the same.

When a function takes two or more parameters, you can use the pipe operator (%>%) to pass the output of the preceding operation as the first argument to the next operation. The pipe operator helps create more readable and concise code.

Here’s an example where a function takes two parameters:

# Example without pipe
result1 <- f(g(x, y), z)

# Example with pipe
result2 <- x %>% g(y) %>% f(z)

Explanation:

In the second example, the value x is passed to the first function g() with the parameter y. The result of g(x, y) is then passed as the first argument to the next function f() with the parameter z.

The two results are the same.
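
The functions f, g, and h above are placeholders, so that code does not run as written. Here is a runnable sketch with real functions, assuming the magrittr package is installed and R is version 4.1 or later (needed for the native pipe |> shown on the last line). The object names nested, piped, and native are made up for illustration.

```r
library(magrittr)   # provides %>%; dplyr and tidyverse also make it available

x <- c(1, 4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read from left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

nested                      # 2.5
identical(nested, piped)    # TRUE

# R 4.1+ also has a built-in pipe that works the same way for simple chains
native <- x |> sqrt() |> mean() |> round(1)
```

In round(1), the piped value fills the first argument and 1 becomes the second argument (the number of digits), just as in f(z) above.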

Data Manipulation

We will use the package dplyr which provides the following functions for data manipulation.

select() forms a new data frame with selected columns.
arrange() forms a new data frame with row(s) arranged in a specified order.
filter() forms a new data frame consisting of rows that satisfy certain filtering conditions.
mutate() and transmute() allow you to create new columns out of a data frame. mutate adds to the data frame, and transmute creates a new data frame without the original columns.
summarize() summarizes a data frame into a single row.
distinct() collapses the identical observations into a single one.
group_by() groups the data to perform tasks by groups.

# Create a sample data frame
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 22, 35, 28),
  Salary = c(50000, 60000, 45000, 70000, 55000)
)

library(dplyr) # This package will be loaded if you load package "tidyverse"

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# select(): forms a new data frame with selected columns
selected_df <- select(df, ID, Name)
print(selected_df)

##   ID    Name
## 1  1   Alice
## 2  2     Bob
## 3  3 Charlie
## 4  4   David
## 5  5     Eva

# arrange(): forms a new data frame with rows arranged in a specified order
arranged_df <- arrange(df, Age)
print(arranged_df)

##   ID    Name Age Salary
## 1  3 Charlie  22  45000
## 2  1   Alice  25  50000
## 3  5     Eva  28  55000
## 4  2     Bob  30  60000
## 5  4   David  35  70000

# filter(): forms a new data frame consisting of rows that satisfy certain filtering conditions
filtered_df <- filter(df, Age > 25)
print(filtered_df)

##   ID  Name Age Salary
## 1  2   Bob  30  60000
## 2  4 David  35  70000
## 3  5   Eva  28  55000

# mutate(): creates a new column out of a data frame by adding the new column
mutated_df <- mutate(df, SalaryAfterBonus = Salary + 5000)
print(mutated_df)

##   ID    Name Age Salary SalaryAfterBonus
## 1  1   Alice  25  50000            55000
## 2  2     Bob  30  60000            65000
## 3  3 Charlie  22  45000            50000
## 4  4   David  35  70000            75000
## 5  5     Eva  28  55000            60000

# transmute(): creates a new data frame that keeps only the newly created column(s)
transmuted_df <- transmute(df, SalaryAfterBonus = Salary + 5000)
print(transmuted_df)

##   SalaryAfterBonus
## 1            55000
## 2            65000
## 3            50000
## 4            75000
## 5            60000

# summarize(): summarizes a data frame into a single row
summary_df <- summarize(df, AvgAge = mean(Age), TotalSalary = sum(Salary))
print(summary_df)

##   AvgAge TotalSalary
## 1     28      280000

# distinct(): collapses identical observations into a single one
distinct_df <- distinct(df, Age)
print(distinct_df)

##   Age
## 1  25
## 2  30
## 3  22
## 4  35
## 5  28

# group_by(): groups the data to perform tasks by groups and it often is followed by the summarize() function
grouped_df <- group_by(df, Age)
summary_by_age <- summarize(grouped_df, AvgSalary = mean(Salary))
print(summary_by_age)

## # A tibble: 5 × 2
##     Age AvgSalary
##   <dbl>     <dbl>
## 1    22     45000
## 2    25     50000
## 3    28     55000
## 4    30     60000
## 5    35     70000
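
In practice these functions are usually chained with the pipe operator rather than called one at a time. The sketch below assumes dplyr is installed and adds a hypothetical Dept column to the sample data so that each group contains more than one row.

```r
library(dplyr)

# Same sample data as above, plus a made-up Dept column
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 22, 35, 28),
  Salary = c(50000, 60000, 45000, 70000, 55000),
  Dept = c("A", "B", "A", "B", "A")
)

dept_summary <- df %>%
  filter(Age >= 25) %>%                              # drops Charlie (age 22)
  group_by(Dept) %>%                                 # one group per department
  summarize(n = n(), AvgSalary = mean(Salary)) %>%   # one row per group
  arrange(desc(AvgSalary))                           # highest average first

print(dept_summary)
```

After filtering, department A contains Alice and Eva (average 52500) and department B contains Bob and David (average 65000), so B is listed first.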

Basic Statistical Concepts:

# Descriptive statistics
data <- c(7,7,9,12,15,18,19,21,21,26,32,35,45,46,49,55,61,63,63,68,70,70,71,75,78,79,81,86,87,98)
mean_value <- mean(data)
sd_value <- sd(data)    # Give the standard deviation
Max <- max(data)
Min <- min(data)
five_number_summaries <- quantile(data)   # 5-number summary
other_quantiles <- quantile(data, c(0.6, 0.8))  # the 60th and 80th percentiles
Summary <- summary(data)  # 5-number summary along with mean

When the data contain missing values (NA), mean(), sd(), max(), and min() return NA, and quantile() stops with an error, unless you set na.rm = TRUE to remove the missing values first.

# Descriptive statistics with missing values
data <- c(7,7,9,12,15,18,19,21,21,NA,26,32,35,45,46,49,55,61,63,63,68,NA,70,70,71,75,78,79,81,86,87,98)
mean_value <- mean(data, na.rm = TRUE)
sd_value <- sd(data, na.rm = TRUE)    # Give the standard deviation
Max <- max(data, na.rm = TRUE)
Min <- min(data, na.rm = TRUE)
five_number_summaries <- quantile(data, na.rm = TRUE)   # 5-number summary
other_quantiles <- quantile(data, c(0.6, 0.8), na.rm = TRUE)  # the 60th and 80th percentiles
Summary <- summary(data)  # 5-number summary along with mean, handling missing values automatically
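
A smaller example makes the effect of na.rm easier to check by hand. This is a sketch using base R only; the vector v is made up for illustration.

```r
v <- c(4, 8, NA, 6)

mean(v)                      # NA, because one value is unknown
mean(v, na.rm = TRUE)        # 6, the mean of 4, 8, and 6

sum(is.na(v))                # 1, the number of missing values

v_complete <- v[!is.na(v)]   # drop the missing values explicitly
length(v_complete)           # 3
```

Counting the missing values with sum(is.na(v)) before you remove them is a good habit, since a large count may point to a problem with the data.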

Generating Random Numbers in R:

R provides functions such as runif(), rnorm(), and rbinom() for generating random numbers from the uniform, normal, and binomial distributions. Use set.seed() for reproducibility: whenever the same seed (an integer) is used, the generated random numbers will be the same.

set.seed(123) # Here 123 is called the seed. Any seed is ok.
runif(10) # generate 10 numbers between 0 and 1 (the endpoints themselves are not generated)

##  [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
##  [8] 0.8924190 0.5514350 0.4566147

runif(10, 100, 200) # generate 10 numbers between 100 and 200

##  [1] 195.6833 145.3334 167.7571 157.2633 110.2925 189.9825 124.6088 104.2060
##  [9] 132.7921 195.4504

rnorm(10) # generate 10 numbers from the standard normal distribution

##  [1]  1.2240818  0.3598138  0.4007715  0.1106827 -0.5558411  1.7869131
##  [7]  0.4978505 -1.9666172  0.7013559 -0.4727914

rnorm(10, 100, 15) # generate 10 numbers from the normal distribution with mean 100 and std 15

##  [1]  83.98264  96.73038  84.60993  89.06663  90.62441  74.69960 112.56681
##  [8] 102.30060  82.92795 118.80722

rbinom(10, 5, 0.5) # generate 10 numbers from the binomial distribution with n = 5 and p = 0.5

##  [1] 3 1 2 2 4 2 3 3 3 2
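
To see what the seed does, reset it and draw again. A minimal sketch, using base R only; the seed 2024 and the object names are arbitrary choices for illustration.

```r
set.seed(2024)
first_draw <- rnorm(5)

set.seed(2024)             # reset to the same seed
second_draw <- rnorm(5)

identical(first_draw, second_draw)   # TRUE: same seed, same numbers

third_draw <- rnorm(5)     # no reset: the generator simply continues
identical(first_draw, third_draw)    # FALSE
```

The seed fixes the whole sequence of random numbers, so you only need to set it once at the top of a script.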

Data Visualization with Base R

In base R, you can create a variety of basic data visualizations using functions like plot(), hist(), boxplot(), and others. Here are some examples of data visualizations using base R:

Scatter Plot:

# Create sample data
set.seed(123)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Scatter plot
plot(x, y, main = "Scatter Plot", xlab = "X-axis", ylab = "Y-axis")

Histogram:

# Create sample data
set.seed(456)
data <- rnorm(100)

# Histogram
hist(data, main = "Histogram", xlab = "Values", col = "lightblue")

Boxplot:

# Create sample data
set.seed(789)
group1 <- rnorm(50, mean = 0, sd = 1)
group2 <- rnorm(50, mean = 2, sd = 1)

# Boxplot
boxplot(group1, group2, names = c("Group 1", "Group 2"), main = "Boxplot", col = c("lightblue", "lightgreen"))

Bar Chart:

# Create sample data
categories <- c("A", "B", "C", "D")
counts <- c(10, 5, 8, 12)

# Bar chart
barplot(counts, names.arg = categories, main = "Bar Chart", xlab = "Categories", ylab = "Counts", col = "skyblue")

If you have individual-level data (one value per observation) rather than precomputed counts, tabulate it first. Here’s an example assuming you have a vector named categories:

# Assuming you have a vector of individual-level data
categories <- c("A", "B", "C", "A", "B", "A", "C", "D", "E", "C", "D", "E", "A", "B", "C")

# Bar chart
barplot(table(categories), main = "Bar Chart of Category Counts", xlab = "Categories", ylab = "Counts", col = "skyblue")

This code uses the table() function to generate counts for each category and then creates a bar chart using barplot(). Adjust the categories vector based on your actual data.

Data Visualization with ggplot2

Scatter Plot:

library(ggplot2)

# Sample data
set.seed(123)
data <- data.frame(x = rnorm(50), y = 2 * rnorm(50) + 1)

# Scatter plot
ggplot(data, aes(x, y)) +
  geom_point() +
  labs(title = "Scatter Plot", x = "X-axis", y = "Y-axis")

Histogram:

# Sample data
set.seed(456)
data <- data.frame(values = rnorm(100))

# Histogram
ggplot(data, aes(x = values)) +
  geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black") +
  labs(title = "Histogram", x = "Values", y = "Frequency")

Boxplot:

# Sample data
set.seed(789)
data <- data.frame(group = rep(c("Group 1", "Group 2"), each = 50),
                   values = c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 2, sd = 1)))

# Boxplot
ggplot(data, aes(x = group, y = values, fill = group)) +
  geom_boxplot() +
  labs(title = "Boxplot", x = "Groups", y = "Values")

Line Plot:

# Sample data
set.seed(987)
data <- data.frame(time = 1:20, values = cumsum(rnorm(20)))

# Line plot
ggplot(data, aes(x = time, y = values)) +
  geom_line() +
  labs(title = "Line Plot", x = "Time", y = "Values")

Bar Chart:

# Sample data
data <- data.frame(categories = c("A", "B", "C", "D", "E"),
                   counts = c(10, 5, 15, 8, 12))

# Bar chart for discrete distribution
ggplot(data, aes(x = categories, y = counts, fill = categories)) +
  geom_bar(stat = "identity") +
  labs(title = "Bar Chart", x = "Categories", y = "Counts")

# Bar chart for categorical data
ggplot(diamonds, aes(x = clarity)) +
  geom_bar()

# Bar chart for the summary of numeric data
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average mpg by Number of Cylinders", x = "Number of Cylinders", y = "Average mpg")

We can use the facet_wrap() function to create a grid of scatter plots, one panel per group. Let’s try it with the built-in iris dataset:

library(ggplot2)

# Scatter plot with facet_wrap using iris dataset
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_wrap(~ Species) +
  labs(title = "Scatter Plots by Species", x = "Sepal Length", y = "Sepal Width")

In this example, the iris dataset contains information about iris flowers, and we create a scatter plot of Sepal Length vs. Sepal Width. The facet_wrap(~ Species) creates a grid of scatter plots, with each facet representing a different species of iris.

These are just a few examples, and ggplot2 offers extensive customization options. You can further customize colors, themes, and other aspects of the plots based on your preferences and data characteristics.
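
As one illustration of such customization, the sketch below maps color to Species instead of faceting and switches to a built-in theme. It assumes ggplot2 is installed; the point size, transparency, and theme are arbitrary choices, not recommendations.

```r
library(ggplot2)

# Store the plot in an object so that it can be modified or saved later
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 2, alpha = 0.7) +   # larger, semi-transparent points
  labs(title = "Sepal Dimensions by Species",
       x = "Sepal Length", y = "Sepal Width") +
  theme_minimal()                        # a lighter built-in theme

print(p)   # draw the plot
```

Putting all species in one panel makes them easy to compare directly, while facet_wrap() is clearer when the groups overlap heavily.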

Statistical Inference

# Confidence interval
x = c(23, 34, 19, 33, 65, 45, 62, 32, 51)
confidence_interval <- t.test(x, conf.level = 0.95)$conf.int
print(confidence_interval)

## [1] 27.89542 52.99347
## attr(,"conf.level")
## [1] 0.95
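
The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error, s / sqrt(n). The following sketch recomputes it by hand using base R only; the names n, se, t_crit, and manual_ci are made up for illustration.

```r
x <- c(23, 34, 19, 33, 65, 45, 62, 32, 51)

n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical value for a 95% interval

manual_ci <- mean(x) + c(-1, 1) * t_crit * se
manual_ci

# Compare with the interval from t.test()
all.equal(manual_ci, as.numeric(t.test(x, conf.level = 0.95)$conf.int))
```

The value 0.975 is used because a 95% interval leaves 2.5% in each tail.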

# Hypothesis testing (two-sample t-test)
group1 <- c(23, 25, 28, 30, 32)
group2 <- c(18, 20, 22, 25, 27)
t_test_result <- t.test(group1, group2)
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = 2.2545, df = 8, p-value = 0.05419
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1188274 10.5188274
## sample estimates:
## mean of x mean of y 
##      27.6      22.4

The examples above cover confidence intervals and hypothesis testing. Methods for comparing two groups include t-tests, chi-square tests, and non-parametric tests.

# Chi-square test for independence (the diamonds data come with the ggplot2 package)
table_result <- table(diamonds$cut, diamonds$clarity)
chi_square_result <- chisq.test(table_result)
print(chi_square_result)

## 
##  Pearson's Chi-squared test
## 
## data:  table_result
## X-squared = 4391.4, df = 28, p-value < 2.2e-16

Introduction to Regression:

Understanding the basics of linear regression.

# Simple linear regression
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
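
Once the model is fitted, you can extract pieces of it and use it for prediction. This is a sketch using base R only; new_car is a hypothetical car, and in mtcars the variable wt is measured in units of 1000 lbs.

```r
model <- lm(mpg ~ wt, data = mtcars)

coef(model)                 # the intercept and slope from the table above
summary(model)$r.squared    # proportion of the variation in mpg explained by wt

# Predicted mpg for a hypothetical car weighing 3000 lbs (wt = 3)
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(model, newdata = new_car)
predicted_mpg               # about 21.25, i.e. 37.2851 - 5.3445 * 3

# The same prediction with a 95% confidence interval for the mean response
predict(model, newdata = new_car, interval = "confidence")
```

The slope says that each additional 1000 lbs of weight is associated with a drop of about 5.3 miles per gallon on average.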

Datasets

The Cherry Blossom Ten Mile Run data: https://www.cherryblossom.org/post-race/race-results/past-results/. In the results database, click “Search for age group”, then choose event “10M”, year “2022”, and division “Overall women”. The data are spread across 417 pages! How would you read all the data? There is an R package, “cherryblossom”, providing data for the years 2012, 2017, and 2019. Check it out. The data from 1999 to 2012 were analyzed in the book https://afrozhussain.files.wordpress.com/2015/07/data-science-in-r.pdf on pages 45 to 103.

Shiju Zhang

1/1/2024