R is a powerful, open-source programming language and environment for statistical computing and graphics. It’s widely used in academia, research, and industry for data analysis, visualization, and machine learning.
Key features of R:
History:
Why R for Education Data Analysis?
[1] "Hello, Education Data Analysts!"
Using Base R as a calculator
[1] 7
[1] 3.333333
Explanation: This block demonstrates basic arithmetic operations in R. It shows addition and division. R can be used as a simple calculator.
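The code that produced the two results above is not shown here. A minimal sketch that gives the same values, assuming the operands were `3 + 4` and `10 / 3` (the exact expressions are an assumption):

```r
# Assumed operands: any addition giving 7 and any division giving 3.333333 would match
3 + 4     # addition, evaluates to 7
10 / 3    # division, evaluates to 3.333333 (R shows 7 significant digits by default)
```

Typing an expression at the console prints its value immediately; no `print()` call is needed.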
[1] 1 2 3 4 5
Explanation: This creates a vector named ‘numbers’ using the c() function, which combines values into a vector. The vector is then displayed.
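A short sketch of the step described above. The name `numbers` and the values 1 to 5 come from the explanation and the printed output; the exact layout of the original chunk is an assumption:

```r
# Combine five values into a numeric vector with c()
numbers <- c(1, 2, 3, 4, 5)

# Display the vector
print(numbers)
```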
[1] 2 4 6 8 10
[1] 20 40 60 80 100
Explanation: This shows vector arithmetic, which R applies element by element. In the first result each element of the ‘numbers’ vector is multiplied by 2; in the second the vector is rescaled to percentages.
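A sketch that reproduces both results. The doubling is stated in the explanation; the percent formula used here (divide by the largest value, then multiply by 100) is an assumption that happens to match the printed output:

```r
numbers <- c(1, 2, 3, 4, 5)

# Multiply every element by 2 in one step, no loop required
doubled <- numbers * 2
print(doubled)   # 2 4 6 8 10

# Assumed formula: express each value as a percentage of the largest value
percent <- numbers / max(numbers) * 100
print(percent)   # 20 40 60 80 100
```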
[1] 3
[1] 3
[1] 1.581139
Explanation: These are basic statistical functions. mean() calculates the average, median() finds the middle value, and sd() computes the standard deviation of the ‘numbers’ vector.
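The three calls named above can be sketched as follows, assuming `numbers` still holds the values 1 to 5 from the earlier step:

```r
numbers <- c(1, 2, 3, 4, 5)

mean(numbers)     # average: 3
median(numbers)   # middle value: 3
sd(numbers)       # sample standard deviation: 1.581139
```

Note that `sd()` uses the sample formula (dividing by n - 1), which is why the result is the square root of 2.5 rather than of 2.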
[1] 16
Explanation: This defines a custom function named ‘square’ that takes an input ‘x’ and returns its square. The function is then called with the argument 4.
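A minimal sketch of the function described above. The names `square` and `x` come from the explanation; the body shown here is one straightforward way to write it:

```r
# Define a function that returns the square of its input
square <- function(x) {
  x^2
}

# Call it with the argument 4
square(4)   # 16
```

Because arithmetic in R is vectorised, the same function also works on a whole vector, for example `square(c(1, 2, 3))`.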
# Working with data frames
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(10, 11, 9),
  score = c(85, 92, 78)
)
print(df)
Explanation: This creates a data frame, a fundamental data structure in R for storing tabular data. It has three columns: name, age, and score. The data frame is then printed.
Note: data frames are typically not entered manually like this, but imported from a file or connection to a database.
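As a rough illustration of importing instead of typing data in, the sketch below writes a tiny CSV to a temporary file and reads it back with base R's `read.csv()`. The file contents and the name `df_imported` are made up for this example; in practice the path would point to an existing file.

```r
# Create a small CSV in a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,age,score", "Alice,10,85", "Bob,11,92", "Charlie,9,78"),
  csv_path
)

# Read the file into a data frame; the first line supplies the column names
df_imported <- read.csv(csv_path)
str(df_imported)
```

The tidyverse equivalent is `readr::read_csv()`, which returns a tibble.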
     name                age           score
 Length:3           Min.   : 9.0   Min.   :78.0
 Class :character   1st Qu.: 9.5   1st Qu.:81.5
 Mode  :character   Median :10.0   Median :85.0
                    Mean   :10.0   Mean   :85.0
                    3rd Qu.:10.5   3rd Qu.:88.5
                    Max.   :11.0   Max.   :92.0
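Explanation: The summary() function provides a statistical summary of the data frame. For numeric columns it reports the minimum, quartiles, median, mean, and maximum. For character columns such as name it reports only the length, class, and mode; frequency counts appear only when the column is a factor.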
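The difference between character and factor columns can be sketched with the same three-row data frame. The copy `df_factor` is a name introduced only for this illustration:

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(10, 11, 9),
  score = c(85, 92, 78)
)

# name is a character column: summary() shows Length, Class and Mode for it
summary(df)

# Convert name to a factor to get a count per category instead
df_factor <- df
df_factor$name <- factor(df_factor$name)
summary(df_factor)
```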
For public discussion, we often generate synthetic data to run the code against, rather than using real records.
# Generate synthetic data
data <- data.frame(
  name = sample(c("Alice", "Bob", "Charlie", "David", "Emma"), 20, replace = TRUE),
  subject = sample(c("Math", "English", "Science", "Math", "English"), 20, replace = TRUE),
  age = sample(9:11, 20, replace = TRUE),
  score = round(rnorm(20, mean = 85, sd = 7))
)
# Display the generated data frame
print(data)
The tidyverse is a collection of R packages for data science that share a consistent design philosophy.
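The startup message that follows is what attaching the package prints. A minimal sketch of that step, assuming the tidyverse is already installed (the install line is left commented out on purpose):

```r
# One-time setup if the package is missing:
# install.packages("tidyverse")

# Attach the core tidyverse packages (dplyr, ggplot2, tidyr, readr, ...) for this session
library(tidyverse)
```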
── Attaching core tidyverse packages ───────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
dplyr provides a grammar of data manipulation, with key verbs like:
- filter(): subset rows
- select(): choose columns
- mutate(): create new variables
- arrange(): reorder rows
- summarise(): reduce variables to values
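To make the first few verbs concrete, here is a small sketch that chains filter(), select(), and arrange(). The `scores` data frame and its values are hypothetical and exist only for this illustration:

```r
library(dplyr)

# Hypothetical data, made up for this example
scores <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  subject = c("Math", "Math", "English", "Science"),
  score = c(85, 92, 78, 88)
)

math_scores <- scores %>%
  filter(subject == "Math") %>%   # keep only the Math rows
  select(name, score) %>%         # keep only two columns
  arrange(desc(score))            # highest score first

print(math_scores)
```

Each verb takes a data frame and returns a new one, which is what lets the pipe (`%>%`) pass the result of one step straight into the next.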
# Group and summarize
data %>%
  group_by(subject) %>%
  summarise(
    avg_score = mean(score),
    count = n()
  )
# Create new variables
data %>%
  mutate(
    grade = case_when(
      score >= 90 ~ "A",
      score >= 80 ~ "B",
      score >= 70 ~ "C",
      TRUE ~ "D"
    )
  )
ggplot2 is based on the grammar of graphics, allowing you to build complex plots from simple components.
# Basic scatter plot
ggplot(data, aes(x = name, y = score)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Student Scores", x = "Student Name", y = "Score")
# Bar plot by subject: geom_col() stacks the individual scores, so bar height is the subject total
ggplot(data, aes(x = subject, y = score, fill = subject)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Scores by Subject", x = "Subject", y = "Score")
# Boxplot with the y-axis limited to 50-100 (scores outside this range are dropped with a warning)
ggplot(data, aes(x = subject, y = score)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Score Distribution by Subject", x = "Subject", y = "Score") +
  scale_y_continuous(limits = c(50, 100), breaks = seq(50, 100, by = 20))
RStudio is an integrated development environment (IDE) for R, enhancing productivity and ease of use.
Features:
- Code editor with syntax highlighting and auto-completion
- Console for immediate code execution
- Environment pane for managing objects and data
- Plot viewer for visualizations
- Help documentation and package management
- Project management capabilities
- Version control integration (Git)
R Notebooks allow for interactive, literate programming, combining code, output, and narrative text.
Benefits:
- Reproducible research
- Easy sharing of analysis and results
- Inline code execution
- Multiple output formats (HTML, PDF, Word)
set.seed(123)
student_data <- tibble(
  student_id = 1:200,
  gender = sample(c("Male", "Female"), 200, replace = TRUE),
  grade_level = sample(9:12, 200, replace = TRUE),
  program = sample(c("General", "Honors", "AP"), 200, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
  math_score = rnorm(200, mean = 75, sd = 15),
  reading_score = rnorm(200, mean = 70, sd = 12),
  attendance_rate = rbeta(200, shape1 = 5, shape2 = 1) * 100
)
# View the first few rows
head(student_data)
# Summary statistics
summary(student_data)
# Distribution of math scores
ggplot(student_data, aes(x = math_score)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Math Scores", x = "Math Score", y = "Count")
# Boxplot of reading scores by program
ggplot(student_data, aes(x = program, y = reading_score, fill = program)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Reading Scores by Program", x = "Program", y = "Reading Score")
# Scatter plot of math vs reading scores
ggplot(student_data, aes(x = math_score, y = reading_score, color = program)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Math vs Reading Scores by Program", x = "Math Score", y = "Reading Score")
AI tools can be helpful for generating R code snippets, especially for beginners. However, it’s important to understand the code and verify its correctness.
Benefits:
Cautions:
“You are my tutor helping me learn and understand R programming for data analysis and visualization in the education research field. I will be sharing questions, ideas, and code snippets with you and I need you to provide examples and explanations that are detailed and clearly presented for someone just starting out. Don’t provide complex code, or use new packages or functions without explaining them. Be an effective teacher and tutor for learning R, Rstudio, the tidyverse. …”
An effective prompt keeps the AI from generating large, complex, and possibly broken code, and ensures that what it does produce is explained in a way you can learn from.
I would love to hear your thoughts on this introduction to R and whether you’re interested in further sessions. Potential topics for future workshops could include:
The easiest way to reach me is the #r-sharing group on the DUG Slack, or by email.