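# Packages used throughout this file (assumed to be loaded in a setup chunk)
library(tidyverse) # dplyr, tidyr, ggplot2, readr
library(janitor)   # clean_names()
# set.seed() makes the random data reproducible —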
# everyone running this code gets the same values
set.seed(42)
data_lms <- data.frame(
  Student_ID = paste("Student", 1:40, sep = "_"),
  Week_1 = sample(6:20, 40, replace = TRUE),
  Week_2 = sample(6:20, 40, replace = TRUE),
  Week_3 = sample(6:20, 40, replace = TRUE),
  Week_4 = sample(6:20, 40, replace = TRUE),
  Week_5 = sample(6:20, 40, replace = TRUE),
  Week_6 = sample(6:20, 40, replace = TRUE),
  Week_7 = sample(6:20, 40, replace = TRUE),
  Week_8 = sample(6:20, 40, replace = TRUE),
  Week_9 = sample(6:20, 40, replace = TRUE),
  Week_10 = sample(6:20, 40, replace = TRUE),
  Week_11 = sample(6:20, 40, replace = TRUE),
  Week_12 = sample(6:20, 40, replace = TRUE),
  Week_13 = sample(6:20, 40, replace = TRUE),
  Week_14 = sample(6:20, 40, replace = TRUE),
  Week_15 = sample(6:20, 40, replace = TRUE),
  Week_16 = sample(6:20, 40, replace = TRUE)
)
# Inspect the first few rows
head(data_lms)
Analytics Types & Visualization
Learning Analytics — Analytics & Visualization (Required)
Learning objectives
By the end of this file, you will be able to:
- Simulate and save an educational dataset in R
- Apply descriptive analytics using colMeans() and rowMeans()
- Reshape data from wide to long format using pivot_longer()
- Create and interpret scatter plots, bar plots, line plots, and histograms
- Compute and interpret correlation between two variables
- Apply the analytics type (descriptive, diagnostic, predictive) to real questions
The analytics types — a reminder
Before coding, connect each technique to the type of question it answers:
| Analytics type | Question | Technique used in this file |
|---|---|---|
| Descriptive | What happened? | Summary stats, bar plot, histogram |
| Diagnostic | Why did it happen? | Scatter plot, correlation |
| Predictive | What will happen next? | Regression line, risk flagging |
Keep this table in mind as you work through the exercises below. Every output you produce should be connected to one of these questions.
Part 1 · Creating and saving a simulated dataset
Instead of loading existing data, we will create our own simulated dataset. This teaches you how data is structured in R — useful when you need to build a small dataset from scratch for testing or teaching.
Creating the dataset
Question: What does sample(6:20, 40, replace = TRUE) do? What would change if you set replace = FALSE? As always, use your own words to answer the question.
- [After talking it over with my husband (a software engineer), I think I understand what is happening. 6:20 is the pool of whole numbers from 6 through 20, and 40 is the number of values to draw from that pool. With replace = TRUE, a drawn value goes back into the pool, so 12 can be drawn again on a later draw. With replace = FALSE, a drawn value stays out of the pool. Because the pool holds only 15 values and 40 are requested, R would stop with an error.]
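The difference between the two settings can be checked directly in the Console. The following is a minimal sketch in base R; the seed and the names `with_replacement` and `result` are illustrative and not part of the assignment.

```r
set.seed(42)  # illustrative seed so the draws are repeatable

# With replacement: each drawn value goes back into the pool,
# so 40 draws from 15 possible values works and repeats are expected
with_replacement <- sample(6:20, 40, replace = TRUE)
length(with_replacement)
anyDuplicated(with_replacement) > 0  # repeats must occur: 40 draws, 15 values

# Without replacement: the pool holds only 15 values, so asking for 40 fails.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  sample(6:20, 40, replace = FALSE),
  error = function(e) conditionMessage(e)
)
result

# Asking for 15 or fewer values works and returns each value at most once
sample(6:20, 15, replace = FALSE)
```

Running the last line a few times (without resetting the seed) shows the same 15 values in a different order each time.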
Saving the dataset
# Save as a CSV file in your project folder
write.csv(data_lms, "40_students_LMS_time_spent.csv", row.names = FALSE)
# Confirm it saved — check your Files pane for the new file
Question: Why is it important to be able to create and save datasets manually, rather than only working with provided data?
- [Based on my consultation with NAMU, there are a few reasons. My own hypothesis was that it lets you work with custom or simulated data. NAMU clapped for me and then added three more reasons: FERPA, testing and debugging, and being able to save the data in your own desired format.]
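Checking the Files pane works, but the save can also be confirmed in code. Below is a minimal sketch that writes a small data frame and reads it back. The data frame `demo_lms` and its values are hypothetical stand-ins for `data_lms`, and `tempfile()` is used so the example does not write into the project folder.

```r
# Hypothetical three-student data frame standing in for data_lms
demo_lms <- data.frame(
  Student_ID = paste("Student", 1:3, sep = "_"),
  Week_1 = c(12, 7, 18),
  Week_2 = c(9, 15, 11)
)

# tempfile() returns a throwaway path outside the project folder
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_lms, csv_path, row.names = FALSE)

# Check 1: the file exists on disk
file.exists(csv_path)

# Check 2: reading it back gives the same columns and the same number of rows
demo_reloaded <- read.csv(csv_path)
identical(names(demo_reloaded), names(demo_lms))
nrow(demo_reloaded) == nrow(demo_lms)
```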
Part 2 · Descriptive analytics — what happened?
Summary statistics
# Summary of all weekly columns (excluding Student_ID column)
summary_stats <- summary(data_lms[, -1])
summary_stats
Week_1 Week_2 Week_3 Week_4
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.: 9.00 1st Qu.: 8.00 1st Qu.: 9.75 1st Qu.:12.00
Median :13.00 Median :10.50 Median :12.50 Median :14.50
Mean :12.32 Mean :10.75 Mean :12.53 Mean :14.28
3rd Qu.:15.00 3rd Qu.:13.00 3rd Qu.:16.00 3rd Qu.:18.00
Max. :20.00 Max. :20.00 Max. :19.00 Max. :20.00
Week_5 Week_6 Week_7 Week_8
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.: 8.50 1st Qu.: 9.00 1st Qu.: 8.75 1st Qu.:10.75
Median :13.50 Median :15.00 Median :14.00 Median :13.50
Mean :13.18 Mean :13.22 Mean :13.60 Mean :13.05
3rd Qu.:17.00 3rd Qu.:17.00 3rd Qu.:18.25 3rd Qu.:16.00
Max. :20.00 Max. :20.00 Max. :20.00 Max. :20.00
Week_9 Week_10 Week_11 Week_12
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.:10.00 1st Qu.: 9.00 1st Qu.:10.75 1st Qu.: 8.00
Median :14.00 Median :13.00 Median :14.00 Median :12.00
Mean :13.05 Mean :12.75 Mean :13.47 Mean :12.12
3rd Qu.:16.00 3rd Qu.:16.00 3rd Qu.:17.00 3rd Qu.:15.00
Max. :20.00 Max. :20.00 Max. :20.00 Max. :19.00
Week_13 Week_14 Week_15 Week_16
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.: 9.75 1st Qu.:10.75 1st Qu.:10.75 1st Qu.:10.00
Median :14.00 Median :14.00 Median :15.00 Median :15.00
Mean :12.90 Mean :13.70 Mean :13.75 Mean :13.65
3rd Qu.:16.00 3rd Qu.:17.25 3rd Qu.:17.00 3rd Qu.:17.00
Max. :20.00 Max. :20.00 Max. :20.00 Max. :20.00
Question: What insights do you gain from the summary? Pick one week and describe what the min, median, and max values tell you about student engagement that week.
- [In Week 12, student engagement appears more varied and slightly lower overall compared to some other weeks. Time spent ranges from a minimum of 6 hours to a maximum of 19 hours, showing that while some students were highly engaged, others participated very little. The median of 12 hours suggests that a typical student was moderately engaged, leaning toward the lower middle of the range. Overall, Week 12 reflects a mix of engagement levels, with the typical student closer to the middle than to the high end.]
Average time spent per week (colMeans)
# Select only the Week columns explicitly using grep()
# This protects against any extra columns added later (Semester_Average etc.)
# that would break names(average_time) if included accidentally
week_cols <- grep("^Week_", names(data_lms), value = TRUE)
average_time <- colMeans(data_lms[, week_cols])
average_time
Week_1 Week_2 Week_3 Week_4 Week_5 Week_6 Week_7 Week_8 Week_9 Week_10
12.325 10.750 12.525 14.275 13.175 13.225 13.600 13.050 13.050 12.750
Week_11 Week_12 Week_13 Week_14 Week_15 Week_16
13.475 12.125 12.900 13.700 13.750 13.650
Question: If some weeks show notably higher or lower average time, what actions might an instructor take?
- [An instructor can use that information to adjust instruction in a targeted way. For example, lower-engagement weeks might signal that the material was more difficult, less engaging, or that students needed more support, so the instructor could slow the pace, add more guided practice, or incorporate more interactive or hands-on activities.]
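One way to make "notably higher or lower" concrete is to compare each week with the overall mean. The sketch below uses the weekly averages printed by colMeans() above. The one-standard-deviation cutoff is an assumed rule of thumb, not something the assignment prescribes.

```r
# Weekly averages copied from the colMeans() output
average_time <- c(
  Week_1 = 12.325, Week_2 = 10.750, Week_3 = 12.525, Week_4 = 14.275,
  Week_5 = 13.175, Week_6 = 13.225, Week_7 = 13.600, Week_8 = 13.050,
  Week_9 = 13.050, Week_10 = 12.750, Week_11 = 13.475, Week_12 = 12.125,
  Week_13 = 12.900, Week_14 = 13.700, Week_15 = 13.750, Week_16 = 13.650
)

overall_mean <- mean(average_time)
overall_sd <- sd(average_time)

# Assumed rule of thumb: flag weeks more than one standard deviation from the mean
low_weeks <- average_time[average_time < overall_mean - overall_sd]
high_weeks <- average_time[average_time > overall_mean + overall_sd]

low_weeks   # candidates for extra support or a slower pace
high_weeks  # candidates for a closer look at what worked
```

Because the data are simulated at random, any flagged week here reflects chance variation. With real LMS data the same check would point to weeks worth a closer look.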
Each student’s semester average (rowMeans)
# rowMeans() calculates the mean across columns for each row (each student)
data_lms$Semester_Average <- rowMeans(data_lms[, 2:17])
head(data_lms |> select(Student_ID, Semester_Average))
Task: Calculate the average time spent for only Weeks 1–5 and save it as early_semester_average. Add it to the data frame.
# YOUR CODE HERE
# Hint: weeks 1–5 are columns 2–6 in the data frame.
# Follow the same pattern as the row-means chunk above,
# but change the column range to cover only the first 5 weeks.
# Preview the matching column names
grep("^Week_[1-5]$", names(data_lms), value = TRUE)
[1] "Week_1" "Week_2" "Week_3" "Week_4" "Week_5"
weekTo5Cols <- grep("^Week_[1-5]$", names(data_lms), value = TRUE)
# rowMeans() gives one average per student; colMeans() would give one per week
data_lms$early_semester_average <- rowMeans(data_lms[, weekTo5Cols])
Question: How could the early semester average help an instructor identify at-risk students before midterm?
- [The early semester averages can help instructors spot students whose engagement is consistently below the class average. Identifying these students early allows instructors to provide support before midterm and help prevent academic difficulties.]
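The idea of spotting students below the class average can be sketched in a few lines. The data frame `demo_lms` and its values are hypothetical. The cutoff (the class mean of the early averages) is an assumption chosen for illustration, and an instructor might prefer a different threshold.

```r
# Hypothetical five-student data frame; values are made up for illustration
demo_lms <- data.frame(
  Student_ID = paste("Student", 1:5, sep = "_"),
  Week_1 = c(6, 14, 18, 9, 12),
  Week_2 = c(7, 15, 20, 8, 13),
  Week_3 = c(6, 12, 19, 10, 11),
  Week_4 = c(8, 16, 17, 7, 14),
  Week_5 = c(6, 13, 20, 9, 12),
  Week_6 = c(15, 10, 18, 12, 9)
)

# Select Weeks 1-5 by name, then average across columns for each row (student)
early_cols <- paste0("Week_", 1:5)
demo_lms$early_semester_average <- rowMeans(demo_lms[, early_cols])

# Assumed cutoff: the class mean of the early averages
cutoff <- mean(demo_lms$early_semester_average)

# Keep only the students who fall below the cutoff
at_risk <- demo_lms[demo_lms$early_semester_average < cutoff,
                    c("Student_ID", "early_semester_average")]
at_risk
```

Week_6 is deliberately left out of the average, which is the point of an early-semester measure: it uses only what is known before midterm.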
Part 3 · Visualization — bar plot and line plot
Prepare data for plotting
# Confirm average_time exists and has names before reshaping
# This prevents the "zero-length variable name" error
stopifnot(
  "Run the col-means chunk first" = exists("average_time"),
  "average_time has no names" = !is.null(names(average_time)),
  "average_time is empty" = length(average_time) > 0
)
average_time_table <- data.frame(
  Week = factor(names(average_time), levels = names(average_time)),
  Average_Time_Spent = average_time
)
# Quick check — should show 16 rows, one per week
nrow(average_time_table)
[1] 16
head(average_time_table)
Bar plot — average time per week
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent)) +
  geom_bar(stat = "identity", fill = "#1D9E75", color = "white") +
  labs(
    title = "Average Time Spent per Week",
    x = "Week",
    y = "Average Hours"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Line plot — trend over time
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent, group = 1)) +
  geom_line(color = "#185FA5", linewidth = 1.2) +
  geom_point(color = "#185FA5", size = 3) +
  labs(
    title = "Trend of Average Time Spent per Week",
    x = "Week",
    y = "Average Hours"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Question: What differences do you notice between the bar plot and the line plot? Which is more effective for showing a trend and why? Use your own words.
- [The line plot shows much more specific variation than the bar plot. I believe the line plot would be more effective for showing a trend, as long as the reader notes that its y-axis does not run from 0 to 15 the way the bar plot's does.]
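The y-axis matters here because ggplot2 fits a line plot's axis to the range of the data, while bars always start at zero. The sketch below shows one way to put both views on the same footing with `expand_limits()`. The data frame `trend` is a hypothetical six-week excerpt of the weekly averages, used only to keep the example short.

```r
library(ggplot2)

# Six weekly averages taken from the colMeans() output (excerpt for illustration)
trend <- data.frame(
  Week = factor(paste0("Week_", 1:6), levels = paste0("Week_", 1:6)),
  Average_Time_Spent = c(12.325, 10.750, 12.525, 14.275, 13.175, 13.225)
)

# Default line plot: the y-axis is fitted to the data range,
# so small week-to-week changes look large
p_default <- ggplot(trend, aes(x = Week, y = Average_Time_Spent, group = 1)) +
  geom_line() +
  geom_point()

# Same plot with the y-axis forced to include zero,
# which matches the baseline a bar plot uses
p_zero <- p_default + expand_limits(y = 0)

p_default
p_zero
```

Neither version is wrong. The fitted axis makes the trend easier to read, and the zero-based axis makes the size of the changes easier to judge.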
Line plot — individual students
# Reshape from wide to long format for individual student lines
data_long <- data_lms |>
  pivot_longer(
    cols = starts_with("Week"),
    names_to = "Week",
    values_to = "TimeSpent"
  ) |>
  # Keep weeks in calendar order; a character column would sort Week_10 before Week_2
  mutate(Week = factor(Week, levels = paste0("Week_", 1:16)))
ggplot(data_long, aes(x = Week, y = TimeSpent,
                      group = Student_ID, color = Student_ID)) +
  geom_line(alpha = 0.5) +
  labs(
    title = "Weekly Time Spent by Each Student",
    x = "Week",
    y = "Hours"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
Question: What patterns do you notice when looking at all 40 students at once? Is this visualization easy to interpret? Why or why not?
- [I see the pattern of a rainbow. There is no discernible way for me to weed through this data of 40 students. Even when expanding the window or saving the graph as a PDF, I cannot tell what in the world is happening here. ]
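Part of what makes a plot like this hard to read is the order of the x-axis. pivot_longer() returns the week names as plain character strings, and a character axis is sorted alphabetically rather than by week number. A minimal sketch of the difference, in base R only:

```r
# Week labels as they come out of pivot_longer(): plain character strings
weeks <- paste0("Week_", 1:16)

# Alphabetical order, which is how a character x-axis is arranged
sort(weeks)[1:4]

# A factor with explicit levels keeps calendar order instead
weeks_ordered <- factor(weeks, levels = paste0("Week_", 1:16))
levels(weeks_ordered)[1:4]
```

In a reshaping pipeline, the same conversion can be applied to the Week column with `mutate(Week = factor(Week, levels = paste0("Week_", 1:16)))` before plotting.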
Line plot — selected students only
Task: Choose 5 students you want to compare and update the code below.
# YOUR CODE HERE
# Step 1: Choose 5 Student_IDs from the data and filter for them.
# Student IDs are in the format "Student_1", "Student_2", etc.
# Pick students whose patterns you find interesting to compare —
# for example, mix high and low average engagement.
#
# Step 2: Reshape with pivot_longer() — same as the lineplot-all chunk above.
#
# Step 3: Plot with ggplot() — copy the structure from lineplot-all
# and adjust the title and legend position.
Student5 <- filter(
  data_lms,
  Student_ID %in% c("Student_1", "Student_9", "Student_19", "Student_29", "Student_39")
)
data_long <- Student5 |>
  pivot_longer(
    cols = starts_with("Week"),
    names_to = "Week",
    values_to = "TimeSpent"
  ) |>
  # Keep weeks in calendar order; a character column would sort Week_10 before Week_2
  mutate(Week = factor(Week, levels = paste0("Week_", 1:16)))
ggplot(data_long, aes(x = Week, y = TimeSpent,
                      group = Student_ID, color = Student_ID)) +
  geom_line(alpha = 0.5) +
  labs(
    title = "Weekly Time Spent by Five Selected Students",
    x = "Week",
    y = "Hours"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right" # show the legend so each student can be identified
  )
Question: What insights do you gain from this focused view? What design decisions did you make in choosing these five students?
- [I chose Students 1, 9, 19, 29, and 39 because they represent a variety of engagement patterns. Student 39 shows consistently high engagement, Student 19 shows lower engagement, Student 29 is more average and consistent, while Students 1 and 9 demonstrate different levels of variation over time. Comparing these students provides a focused view of how engagement can differ across a class.]
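Students can also be chosen by rule rather than by eye, which makes the selection easier to justify. The sketch below picks five students spread evenly from the lowest to the highest semester average. The data frame `averages` and its values are hypothetical, and the even-spacing rule is just one reasonable design choice.

```r
# Hypothetical semester averages for eight students (made-up values)
averages <- data.frame(
  Student_ID = paste("Student", 1:8, sep = "_"),
  Semester_Average = c(13.1, 9.4, 15.8, 12.2, 11.0, 14.5, 10.3, 12.9)
)

# Sort from lowest to highest semester average
ranked <- averages[order(averages$Semester_Average), ]

# Pick five evenly spaced positions, from the lowest-ranked to the highest-ranked
positions <- unique(round(seq(1, nrow(ranked), length.out = 5)))
selected_ids <- ranked$Student_ID[positions]
selected_ids
```

The resulting vector can be passed straight to a filter such as `filter(data_lms, Student_ID %in% selected_ids)`.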
Histogram — semester averages
ggplot(data_lms, aes(x = Semester_Average)) +
  geom_histogram(binwidth = 1, fill = "#378ADD", color = "white") +
  labs(
    title = "Distribution of Semester Average Time Spent",
    x = "Semester Average (hours/week)",
    y = "Number of Students"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
Part 4 · Diagnostic analytics — why did it happen?
Now we switch to the sci-online-classes dataset to explore the relationship between time spent and final grades.
# Load the dataset used in the previous module
# Make sure sci-online-classes.csv is in your data folder
data_sci <- read_csv("data/sci-online-classes.csv") |>
  clean_names()
glimpse(data_sci)
Rows: 603
Columns: 30
$ student_id <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ gradebook_item <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ grade_category <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ final_grade_cems <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ points_possible <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ points_earned <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ gender <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1 <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2 <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3 <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4 <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5 <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6 <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7 <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8 <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9 <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10 <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ time_spent <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ time_spent_std <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
This is the same dataset from the previous module. We are reloading it here because the LMS time data (Parts 1–3) and the sci-online-classes data (Part 4) are separate files. Reloading makes this file self-contained.
Scatter plot with regression line
ggplot(data_sci,
       aes(x = time_spent_hours, y = final_grade_cems)) +
  geom_point(color = "#185FA5", size = 2.5, alpha = 0.6) +
  geom_smooth(method = "lm", color = "#993C1D", se = TRUE) +
  labs(
    title = "Time Spent vs. Final Grade",
    x = "Time Spent on LMS (hours)",
    y = "Final Grade"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
Question: Based on the scatter plot, what do you expect the relationship between time spent and final grades to be? Write your hypothesis before looking at the correlation.
- [My hypothesis is that the more time spent on LMS, the higher the final grade.]
Correlation
# cor() computes the Pearson correlation coefficient
# use = "complete.obs" ignores rows with missing data
correlation <- cor(data_sci$time_spent_hours,
                   data_sci$final_grade_cems,
                   use = "complete.obs")
correlation
[1] 0.3654121
- Values close to +1: strong positive relationship (more time → higher grade)
- Values close to -1: strong negative relationship
- Values close to 0: little or no linear relationship
- This is NOT a statistics course — focus on interpreting what this number means for learners, not on p-values.
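Two small checks can help with interpreting the number. The sketch below first shows why `use = "complete.obs"` is needed, using hypothetical vectors `hours` and `grade` with made-up values, and then squares the reported correlation to express it as a share of variation.

```r
# Hypothetical data with missing values (made-up numbers)
hours <- c(10, 25, NA, 40, 18)
grade <- c(70, 82, 90, NA, 75)

# Without the use argument, any missing value makes the result NA
cor(hours, grade)

# With complete.obs, only rows where both values are present are used
cor(hours, grade, use = "complete.obs")

# Correlation reported above for time_spent_hours and final_grade_cems
r <- 0.3654121

# Squaring r gives the share of variation in grades that a straight-line
# relationship with time spent accounts for
r_squared <- r^2
round(r_squared, 3)
```

Here the squared value comes to roughly 13 percent, which means most of the variation in final grades is associated with something other than time spent.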
Question: With both the scatter plot and the correlation value in front of you, what can you say about the relationship between time spent and final grades? What would you recommend to an instructor based on this finding?
- [There is a weak-to-moderate positive correlation (about 0.37) between the time spent on LMS and the final grade. I would not simply equate time to grade, but I would take it into consideration when viewing the data.]
Practice — grouped summary by subject
Task: Using data_sci, calculate the mean final_grade_cems and mean time_spent_hours grouped by subject. Arrange by mean grade descending. Which subject has the highest average grade? Is it also the subject with the most time spent?
data_sci |>
  group_by(subject) |>
  summarize(
    mean_grade = mean(final_grade_cems, na.rm = TRUE),
    mean_time = mean(time_spent_hours, na.rm = TRUE)
  ) |>
  arrange(desc(mean_grade))
Hint: You have used group_by() and summarise() in the previous file. Apply the same pattern here with a different grouping variable. If you need a column name reminder, run names(data_sci) in the Console.
# YOUR CODE HERE
# Steps: group_by(subject) |> summarise(mean_grade = ..., mean_time = ...) |> arrange(desc(...))
# Additional comparison: the same summary grouped by gender instead of subject
data_sci |>
  group_by(gender) |>
  summarise(
    mean_grade = mean(final_grade_cems, na.rm = TRUE),
    mean_time = mean(time_spent_hours, na.rm = TRUE)
  ) |>
  arrange(desc(mean_grade))
Question: Does the subject with the highest average grade also have the most time spent? What might explain any differences you find?
- [The subject with the highest average grade does not have the most time spent. Comfort with the content could be one cause of the difference. In comparison, the gender grouping is consistent with the idea that more time spent in the LMS goes along with a higher grade.]
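The "is it the same subject?" comparison can be made in code rather than by reading the table. The sketch below uses a hypothetical summary table, `by_subject`, with made-up subject names and values. It does not reproduce the real results from data_sci.

```r
# Hypothetical subject-level summary (made-up values, not the real results)
by_subject <- data.frame(
  subject = c("SubjA", "SubjB", "SubjC"),
  mean_grade = c(81.2, 77.5, 79.9),
  mean_time = c(26.4, 31.0, 28.3)
)

# Which subject is highest on each measure?
top_grade <- by_subject$subject[which.max(by_subject$mean_grade)]
top_time <- by_subject$subject[which.max(by_subject$mean_time)]
top_grade == top_time

# Rank both measures (1 = highest); matching ranks would mean
# grade and time move together across subjects
by_subject$grade_rank <- rank(-by_subject$mean_grade)
by_subject$time_rank <- rank(-by_subject$mean_time)
by_subject
```

With only a handful of groups, a pattern in the ranks is a prompt for further questions rather than evidence that time causes grades.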
Part 5 · Box plot
A box plot shows the distribution of a variable across categories — useful for comparing groups and spotting outliers.
ggplot(data_sci, aes(x = gender, y = final_grade_cems, fill = gender)) +
  geom_boxplot(color = "gray30",
               outlier.colour = "#993C1D",
               outlier.shape = 16,
               outlier.size = 2) +
  scale_fill_manual(values = c("F" = "#E1F5EE", "M" = "#E6F1FB")) +
  labs(
    title = "Final Grade Distribution by Gender",
    x = "Gender",
    y = "Final Grade"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    legend.position = "none"
  )
Question: What does the box plot tell you about the distribution of final grades by gender? Are there differences worth investigating?
- [Visually, both genders performed pretty much the same here. Their typical grades (the medians) are sitting right around the mid-80s, and the boxes overlap almost perfectly, meaning there’s no major difference in average performance. The real thing worth looking into is that huge trail of brown dots at the bottom. Both groups have a bunch of students scoring under 45, all the way down to zero. It’s worth digging into the data to figure out why so many students completely tanked, like whether they stopped showing up, missed the final exam, or if there’s a specific reason the female group has a slightly thicker cluster of these low scores.]
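The dots at the bottom of a box plot are points that fall beyond the whiskers. Under the usual convention, a point is drawn as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below applies that rule by hand to a hypothetical vector of grades with made-up values.

```r
# Hypothetical final grades (made-up values), including a few very low scores
grades <- c(0, 12, 40, 72, 78, 80, 83, 85, 86, 88, 90, 92, 95)

# Quartiles and interquartile range
q1 <- quantile(grades, 0.25, names = FALSE)
q3 <- quantile(grades, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: points outside these limits are drawn as individual dots
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- grades[grades < lower_fence | grades > upper_fence]
outliers
```

Listing the flagged values this way is a natural first step toward the follow-up the answer above suggests, such as checking whether those students stopped participating.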
Final reflection
After completing both the LMS time analysis and the sci-online-classes analysis, reflect on the following:
Question: How could these analytics techniques be applied in a real classroom or course design context? Describe one specific scenario — from your track (K–12 or ID/higher ed) — where the combination of a bar plot, line plot, and correlation would help an educator or designer make a better decision.
- [In a classroom, these analytics tools can help a teacher better understand student learning and adjust instruction in real time. A bar plot can show which students or groups are performing better or struggling by comparing average scores or engagement levels. A line plot can track how student performance or participation changes over time, helping the teacher spot improvement or decline after certain lessons or assignments. A correlation analysis can show whether factors like time spent on practice or homework are actually related to higher achievement. Together, these tools help teachers identify who needs support, what instruction is working, and how to improve student learning outcomes. In my classroom, I could use a bar plot to compare average reading scores across proficiency levels. I could use a line plot to track weekly engagement on an online reading program. Lastly, I could use correlation analysis to determine if time on the reading program has an impact on the average reading scores across proficiency levels.]
Render & submit
Step 1 — Add your name
Change the author: field in the YAML header at the top to your name.
Step 2 — Render
Click Render in the toolbar. A formatted HTML page will appear in your Viewer tab or a new browser window. Check the Console for any error messages if the render fails.
Step 3 — Publish
| Option | Best for | Link |
|---|---|---|
| Posit Cloud | Quickest — one click from your workspace | Guide |
| RPubs | Free, public, easy to share a link | rpubs.com |
| Quarto Pub | Clean public portfolio pages | Guide |
| GitHub Pages | Best for a professional portfolio | Guide |
This document shows three levels of analytics work: descriptive (summary statistics and bar plots), trend analysis (line plots), and diagnostic (scatter plot and correlation). Together they demonstrate a complete analytical workflow that is worth showcasing in a professional portfolio.
Share your published link with your instructor once you have rendered and published. Post in the course discussion board if you run into any technical issues.