Analytics Types & Visualization

Learning Analytics — Analytics & Visualization (Required)

Author

Coy Johnson

Published

June 14, 2026

Learning objectives

By the end of this file, you will be able to:

Simulate and save an educational dataset in R
Apply descriptive analytics using colMeans() and rowMeans()
Reshape data from wide to long format using pivot_longer()
Create and interpret scatter plots, bar plots, line plots, and histograms
Compute and interpret correlation between two variables
Apply the analytics type (descriptive, diagnostic, predictive) to real questions

The analytics types — a reminder

Before coding, connect each technique to the type of question it answers:

Analytics type	Question	Technique used in this file
Descriptive	What happened?	Summary stats, bar plot, histogram
Diagnostic	Why did it happen?	Scatter plot, correlation
Predictive	What will happen next?	Regression line, risk flagging

Keep this table in mind as you work through the exercises below. Every output you produce should be connected to one of these questions.

Part 1 · Creating and saving a simulated dataset

Instead of loading existing data, we will create our own simulated dataset. This teaches you how data is structured in R — useful when you need to build a small dataset from scratch for testing or teaching.

Creating the dataset

# set.seed() makes the random data reproducible —
# everyone running this code gets the same values
set.seed(42)

data_lms <- data.frame(
  Student_ID = paste("Student", 1:40, sep = "_"),
  Week_1  = sample(6:20, 40, replace = TRUE),
  Week_2  = sample(6:20, 40, replace = TRUE),
  Week_3  = sample(6:20, 40, replace = TRUE),
  Week_4  = sample(6:20, 40, replace = TRUE),
  Week_5  = sample(6:20, 40, replace = TRUE),
  Week_6  = sample(6:20, 40, replace = TRUE),
  Week_7  = sample(6:20, 40, replace = TRUE),
  Week_8  = sample(6:20, 40, replace = TRUE),
  Week_9  = sample(6:20, 40, replace = TRUE),
  Week_10 = sample(6:20, 40, replace = TRUE),
  Week_11 = sample(6:20, 40, replace = TRUE),
  Week_12 = sample(6:20, 40, replace = TRUE),
  Week_13 = sample(6:20, 40, replace = TRUE),
  Week_14 = sample(6:20, 40, replace = TRUE),
  Week_15 = sample(6:20, 40, replace = TRUE),
  Week_16 = sample(6:20, 40, replace = TRUE)
)

# Inspect the first few rows
head(data_lms)

Question: What does sample(6:20, 40, replace = TRUE) do? What would change if you set replace = FALSE? As always, use your own words to answer the question.

[It draws 40 random weekly values from the numbers 6 to 20, reusing numbers as needed. If you set replace = FALSE, R would try to use each number only once, and the code would fail because there are only 15 numbers available to fill 40 spots.]
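A small sketch of the difference, using the same range 6:20 as the chunk above. The seed value and the variable names are illustrative assumptions, not part of the assignment.

```r
# 6:20 contains only 15 distinct values
length(6:20)

set.seed(1)  # illustrative seed; any value works

# With replacement: values can repeat, so 40 draws succeed
with_replacement <- sample(6:20, 40, replace = TRUE)
length(with_replacement)
anyDuplicated(with_replacement) > 0  # TRUE: 40 draws from 15 values must repeat

# Without replacement: asking for 40 draws from 15 values is an error.
# tryCatch() captures the error message instead of stopping the script.
without_replacement <- tryCatch(
  sample(6:20, 40, replace = FALSE),
  error = function(e) conditionMessage(e)
)
without_replacement

# Without replacement works when the request fits: this is a shuffle of 6:20
shuffled <- sample(6:20, 15, replace = FALSE)
sort(shuffled)
```

The last call shows that replace = FALSE is not wrong in itself; it only fails when more draws are requested than there are values to draw from.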

Saving the dataset

# Save as a CSV file in your project folder
write.csv(data_lms, "40_students_LMS_time_spent.csv", row.names = FALSE)

# Confirm it saved — check your Files pane for the new file
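Besides checking the Files pane, the save can be confirmed from code. The sketch below uses a hypothetical two-row data frame and a temporary folder (both assumptions for illustration) so that nothing in your project is overwritten.

```r
# Hypothetical two-row dataset for the round trip
demo <- data.frame(
  Student_ID = c("Student_1", "Student_2"),
  Week_1     = c(10L, 15L)
)
demo_path <- file.path(tempdir(), "demo_lms.csv")

write.csv(demo, demo_path, row.names = FALSE)

file.exists(demo_path)           # TRUE if the file was written
reloaded <- read.csv(demo_path)  # read it back in
str(reloaded)                    # same two columns, no extra row-name column
```

Because row.names = FALSE was used when writing, the reloaded data has no extra index column in front of Student_ID.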

Question: Why is it important to be able to create and save datasets manually, rather than only working with provided data?

[Creating and saving datasets manually is important because researchers and teachers often need to collect their own data rather than rely on data that has already been provided. Creating a dataset lets you practice working with data and logging it to make sure you are doing it right. Saving it is especially important so you can look back at the data in the future or add to it.]

Part 2 · Descriptive analytics — what happened?

Summary statistics

# Summary of all weekly columns (excluding Student_ID column)
summary_stats <- summary(data_lms[, -1])
summary_stats

     Week_1          Week_2          Week_3          Week_4     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.: 9.75   1st Qu.:12.00  
 Median :13.00   Median :10.50   Median :12.50   Median :14.50  
 Mean   :12.32   Mean   :10.75   Mean   :12.53   Mean   :14.28  
 3rd Qu.:15.00   3rd Qu.:13.00   3rd Qu.:16.00   3rd Qu.:18.00  
 Max.   :20.00   Max.   :20.00   Max.   :19.00   Max.   :20.00  
     Week_5          Week_6          Week_7          Week_8     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.: 8.50   1st Qu.: 9.00   1st Qu.: 8.75   1st Qu.:10.75  
 Median :13.50   Median :15.00   Median :14.00   Median :13.50  
 Mean   :13.18   Mean   :13.22   Mean   :13.60   Mean   :13.05  
 3rd Qu.:17.00   3rd Qu.:17.00   3rd Qu.:18.25   3rd Qu.:16.00  
 Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :20.00  
     Week_9         Week_10         Week_11         Week_12     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:10.75   1st Qu.: 8.00  
 Median :14.00   Median :13.00   Median :14.00   Median :12.00  
 Mean   :13.05   Mean   :12.75   Mean   :13.47   Mean   :12.12  
 3rd Qu.:16.00   3rd Qu.:16.00   3rd Qu.:17.00   3rd Qu.:15.00  
 Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :19.00  
    Week_13         Week_14         Week_15         Week_16     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.: 9.75   1st Qu.:10.75   1st Qu.:10.75   1st Qu.:10.00  
 Median :14.00   Median :14.00   Median :15.00   Median :15.00  
 Mean   :12.90   Mean   :13.70   Mean   :13.75   Mean   :13.65  
 3rd Qu.:16.00   3rd Qu.:17.25   3rd Qu.:17.00   3rd Qu.:17.00  
 Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :20.00
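To see where the six numbers in each block of this output come from, here is a minimal sketch on a single hypothetical vector of hours (the values are made up for illustration).

```r
# Hypothetical hours for five students in one week
hours <- c(6, 9, 13, 15, 20)

min(hours)             # smallest value: 6
quantile(hours, 0.25)  # 1st quartile: 9 (a quarter of students are at or below this)
median(hours)          # middle value: 13
mean(hours)            # arithmetic mean: 12.6
quantile(hours, 0.75)  # 3rd quartile: 15
max(hours)             # largest value: 20

# summary() reports the same six numbers in one call
summary(hours)
```

Note that "1st Qu." refers to a quartile of the students in that week, not to a quarter of the semester.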

Question: What insights do you gain from the summary? Pick one week and describe what the min, median, and max values tell you about student engagement that week.

[With this summary we can compare how much time the students spent on their work from week to week. For instance, in Week 5 the min (6) and max (20) are consistent with the other weeks, and the median is 13.50. However, the first quartile (8.50) is lower than in most other weeks. This tells me that something may have happened during Week 5 that caused the least engaged quarter of students to work less than usual.]

Average time spent per week (colMeans)

# Select only the Week columns explicitly using grep()
# This protects against any extra columns added later (Semester_Average etc.)
# that would break names(average_time) if included accidentally
week_cols    <- grep("^Week_", names(data_lms), value = TRUE)
average_time <- colMeans(data_lms[, week_cols])
average_time

 Week_1  Week_2  Week_3  Week_4  Week_5  Week_6  Week_7  Week_8  Week_9 Week_10 
 12.325  10.750  12.525  14.275  13.175  13.225  13.600  13.050  13.050  12.750 
Week_11 Week_12 Week_13 Week_14 Week_15 Week_16 
 13.475  12.125  12.900  13.700  13.750  13.650
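colMeans() and rowMeans() (used in the next section) differ only in the direction they average over. A minimal sketch on a hypothetical 2-student, 3-week table:

```r
# Hypothetical wide table: rows are students, columns are weeks
toy <- data.frame(
  Week_1 = c(10, 14),
  Week_2 = c(12, 18),
  Week_3 = c(14, 22)
)
rownames(toy) <- c("Student_1", "Student_2")

colMeans(toy)  # one mean per week:    Week_1 = 12, Week_2 = 15, Week_3 = 18
rowMeans(toy)  # one mean per student: Student_1 = 12, Student_2 = 18
```

Column means answer "how did the class do each week?", while row means answer "how did each student do across weeks?".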

Question: If some weeks show notably higher or lower average time, what actions might an instructor take?

[During the weeks with high average times, the teacher could cut back some of the work and move it into the weeks with lower average times.]

Each student’s semester average (rowMeans)

# rowMeans() calculates the mean across columns for each row (each student)
data_lms$Semester_Average <- rowMeans(data_lms[, 2:17])

# select() comes from dplyr, which is loaded with the tidyverse
library(tidyverse)
head(data_lms |> select(Student_ID, Semester_Average))

Task: Calculate the average time spent for only Weeks 1–5 and save it as early_semester_average. Add it to the data frame.

# Store the result in a new column so Semester_Average is not overwritten
data_lms$early_semester_average <- rowMeans(data_lms[, 2:6])

head(data_lms |> select(Student_ID, early_semester_average))

# YOUR CODE HERE
# Hint: weeks 1–5 are columns 2–6 in the data frame.
# Follow the same pattern as the row-means chunk above,
# but change the column range to cover only the first 5 weeks.

write.csv(data_lms, "early_semester_average.csv", row.names = FALSE)

Question: How could the early semester average help an instructor identify at-risk students before midterm?

[The early semester average time spent would help identify the students who are falling behind sooner, so the teacher could give them help faster.]
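One way to turn this idea into a simple risk flag is sketched below. The data are hypothetical, and the cutoff (the class's 25th percentile) is an assumption chosen for illustration; an instructor would pick a threshold that fits their own course.

```r
set.seed(42)  # illustrative seed

# Hypothetical early-semester averages (hours per week) for 10 students
toy <- data.frame(
  Student_ID             = paste("Student", 1:10, sep = "_"),
  early_semester_average = round(runif(10, min = 6, max = 20), 1)
)

# Assumed rule: flag anyone below the 25th percentile of the class
cutoff <- quantile(toy$early_semester_average, 0.25)
toy$at_risk <- toy$early_semester_average < cutoff

# Students an instructor might check in with first
toy[toy$at_risk, ]
```

A flag like this is only a starting point for a conversation; low LMS time alone does not show why a student is less engaged.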

Part 3 · Visualization — bar plot and line plot

Prepare data for plotting

# Confirm average_time exists and has names before reshaping
# This prevents the "zero-length variable name" error
stopifnot(
  "Run the col-means chunk first" = exists("average_time"),
  "average_time has no names"     = !is.null(names(average_time)),
  "average_time is empty"         = length(average_time) > 0
)

average_time_table <- data.frame(
  Week               = factor(names(average_time), levels = names(average_time)),
  Average_Time_Spent = average_time
)

# Quick check — should show 16 rows, one per week
nrow(average_time_table)

[1] 16

head(average_time_table)
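The factor(..., levels = ...) call in the chunk above matters because week labels are text, and text sorts alphabetically. A minimal sketch with hypothetical labels for 12 weeks:

```r
weeks <- paste0("Week_", 1:12)

# As plain text, "Week_10" sorts before "Week_2"
head(sort(weeks), 4)

# factor() without levels uses that same sorted order
head(levels(factor(weeks)), 4)

# Supplying levels explicitly keeps calendar order
head(levels(factor(weeks, levels = weeks)), 4)
```

ggplot2 places a discrete x-axis in factor-level order, so setting the levels is what keeps Week_2 next to Week_1 in the plots.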

Bar plot — average time per week

ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent)) +
  geom_bar(stat = "identity", fill = "#1D9E75", color = "white") +
  labs(
    title = "Average Time Spent per Week",
    x = "Week",
    y = "Average Hours"
  ) +
  theme_minimal() +
  theme(
    plot.title  = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Average LMS time spent per week across all 40 students

Line plot — trend over time

ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent, group = 1)) +
  geom_line(color = "#185FA5", linewidth = 1.2) +
  geom_point(color = "#185FA5", size = 3) +
  labs(
    title = "Trend of Average Time Spent per Week",
    x = "Week",
    y = "Average Hours"
  ) +
  theme_minimal() +
  theme(
    plot.title  = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Trend of average LMS time across the semester

Question: What differences do you notice between the bar plot and the line plot? Which is more effective for showing a trend and why? Use your own words.

[write your response here]

Line plot — individual students

# Reshape from wide to long format for individual student lines
data_long <- data_lms |>
  pivot_longer(
    cols      = starts_with("Week"),
    names_to  = "Week",
    values_to = "TimeSpent"
  )

ggplot(data_long, aes(x = Week, y = TimeSpent,
                      group = Student_ID, color = Student_ID)) +
  geom_line(alpha = 0.5) +
  labs(
    title = "Weekly Time Spent by Each Student",
    x = "Week",
    y = "Hours"
  ) +
  theme_minimal() +
  theme(
    plot.title     = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x    = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
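The wide-to-long reshape used above is easier to follow on a tiny table. The sketch below assumes the tidyr package is installed and uses a hypothetical two-student, two-week data frame.

```r
library(tidyr)

# Hypothetical wide data: one row per student, one column per week
toy_wide <- data.frame(
  Student_ID = c("Student_1", "Student_2"),
  Week_1     = c(10, 8),
  Week_2     = c(12, 7)
)

# Long data: one row per student-week combination
toy_long <- toy_wide |>
  pivot_longer(
    cols      = starts_with("Week"),
    names_to  = "Week",
    values_to = "TimeSpent"
  )

toy_long
dim(toy_long)  # 4 rows (2 students x 2 weeks), 3 columns
```

The long format gives ggplot() one column to map to the x-axis (Week) and one to the y-axis (TimeSpent), which is why the reshape comes before the plot.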

Question: What patterns do you notice when looking at all 40 students at once? Is this visualization easy to interpret? Why or why not?

[The first thing I noticed was that week 2 was really low and week 4 was really high. I thought the visualization was easy to interpret. The chart title and labels made it really easy to read and see what was going on.]

Line plot — selected students only

Task: Choose 5 students you want to compare and update the code below.

# YOUR CODE HERE
# Step 1: Choose 5 Student_IDs from the data and filter for them.
#         Student IDs are in the format "Student_1", "Student_2", etc.
#         Pick students whose patterns you find interesting to compare —
#         for example, mix high and low average engagement.
#
# Step 2: Reshape with pivot_longer() — same as the lineplot-all chunk above.
#
# Step 3: Plot with ggplot() — copy the structure from lineplot-all
#         and adjust the title and legend position.



selected_students <- data_lms |>
  filter(Student_ID %in% c(
    "Student_1",
    "Student_5",
    "Student_10",
    "Student_20",
    "Student_35"
  ))


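data_long <- selected_students |>
  pivot_longer(
    cols = starts_with("Week"),
    names_to = "Week",
    values_to = "TimeSpent"
  )

# Keep weeks in calendar order on the x-axis instead of alphabetical order
data_long$Week <- factor(data_long$Week, levels = unique(data_long$Week))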


ggplot(data_long,
       aes(x = Week,
           y = TimeSpent,
           group = Student_ID,
           color = Student_ID)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Weekly Time Spent for Selected Students",
    x = "Week",
    y = "Hours",
    color = "Student"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right"
  )

Question: What insights do you gain from this focused view? What design decisions did you make in choosing these five students?

[Using this view I can see the students' time spent more closely and really focus in on them. It is easier to see patterns among them and identify similarities or differences. When choosing these 5 students I just tried to pick ones that looked like they had different data.]

Histogram — semester averages

ggplot(data_lms, aes(x = Semester_Average)) +
  geom_histogram(binwidth = 1, fill = "#378ADD", color = "white") +
  labs(
    title = "Distribution of Semester Average Time Spent",
    x = "Semester Average (hours/week)",
    y = "Number of Students"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))

Distribution of semester averages across 40 students

Part 4 · Diagnostic analytics — why did it happen?

Now we switch to the sci-online-classes dataset to explore the relationship between time spent and final grades.

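# Load the dataset used in the previous module
# Make sure sci-online-classes.csv is in your data folder
# read_csv() and glimpse() come from the tidyverse; clean_names() from janitor
library(tidyverse)
library(janitor)
data_sci <- read_csv("data/sci-online-classes.csv") |>
  clean_names()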

glimpse(data_sci)

Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ gradebook_item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ grade_category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ final_grade_cems      <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ points_possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ points_earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ time_spent            <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours      <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ time_spent_std        <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Note

This is the same dataset from the previous module. We are reloading it here because the LMS time data (Parts 1–3) and the sci-online-classes data (Part 4) are separate files. Reloading makes this file self-contained.

Scatter plot with regression line

ggplot(data_sci,
       aes(x = time_spent_hours, y = final_grade_cems)) +
  geom_point(color = "#185FA5", size = 2.5, alpha = 0.6) +
  geom_smooth(method = "lm", color = "#993C1D", se = TRUE) +
  labs(
    title = "Time Spent vs. Final Grade",
    x = "Time Spent on LMS (hours)",
    y = "Final Grade"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
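The regression line drawn by geom_smooth(method = "lm") can also be fitted directly with lm(), which is what links this diagnostic plot to the predictive row of the analytics table. The sketch below uses hypothetical data for six students; the numbers are assumptions, not values from sci-online-classes.

```r
# Hypothetical data: hours on the LMS and final grade for six students
toy <- data.frame(
  hours = c(5, 10, 15, 20, 25, 30),
  grade = c(60, 66, 71, 79, 83, 90)
)

# Fit a straight line of grade on hours
fit <- lm(grade ~ hours, data = toy)
coef(fit)  # intercept and slope (grade points per extra hour)

# Predicted grade for a student who spends 22 hours
predicted <- predict(fit, newdata = data.frame(hours = 22))
predicted
```

A positive slope means the line rises from left to right, as in the scatter plot; how tightly the points cluster around the line is a separate question, which the correlation addresses.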

Question: Based on the scatter plot, what do you expect the relationship between time spent and final grades to be? Write your hypothesis before looking at the correlation.

[I think there will be a correlation between time spent and final grade. I think more time spent will result in a higher grade.]

Correlation

# cor() computes the Pearson correlation coefficient
# use = "complete.obs" ignores rows with missing data
correlation <- cor(data_sci$time_spent_hours,
                   data_sci$final_grade_cems,
                   use = "complete.obs")

correlation

[1] 0.3654121

Interpreting correlation

Values close to +1: strong positive relationship (more time → higher grade)
Values close to -1: strong negative relationship
Values close to 0: little or no linear relationship
This is NOT a statistics course — focus on interpreting what this number means for learners, not on p-values.
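Two details from the correlation chunk can be sketched on hypothetical data: what use = "complete.obs" does with missing values, and one common way to put the reported value of about 0.37 into plain terms. The vectors below are made up for illustration.

```r
# Hypothetical data with one missing value in each variable
hours <- c(5, 10, NA, 20, 25, 30)
grade <- c(60, 66, 71, NA, 83, 90)

# Without `use`, any NA makes the result NA
cor(hours, grade)

# "complete.obs" keeps only rows where both values are present (4 rows here)
r <- cor(hours, grade, use = "complete.obs")
r

# Squaring a correlation gives the share of variation in one variable that a
# straight-line relationship with the other accounts for.
# For the value reported above:
round(0.3654121^2, 3)  # 0.134, roughly 13%
```

Read this way, time spent goes along with higher grades, but most of the variation in grades is associated with things other than LMS time.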

Question: With both the scatter plot and the correlation value in front of you, what can you say about the relationship between time spent and final grades? What would you recommend to an instructor based on this finding?

[When viewing the scatter plot and the correlation value, I would say that there is a moderate positive correlation between time spent and final grade. I would recommend that the instructor make it clear to their students how important it is that they spend time on their work.]

Practice — grouped summary by subject

Task: Using data_sci, calculate the mean final_grade_cems and mean time_spent_hours grouped by subject. Arrange by mean grade descending. Which subject has the highest average grade? Is it also the subject with the most time spent?

Hint

You have used group_by() and summarise() in the previous file. Apply the same pattern here with a different grouping variable. If you need a column name reminder, run names(data_sci) in the Console.

# YOUR CODE HERE
# Steps: group_by(subject) |> summarise(mean_grade = ..., mean_time = ...) |> arrange(desc(...))


data_sci |>
  group_by(subject) |>
  summarise(
    mean_grade = mean(final_grade_cems, na.rm = TRUE),
    mean_time = mean(time_spent_hours, na.rm = TRUE)
  ) |>
  arrange(desc(mean_grade))

Question: Does the subject with the highest average grade also have the most time spent? What might explain any differences you find?

[No, it does not. This difference is most likely caused by differences in difficulty between these classes. Some classes are harder, which requires more time but can still result in a lower grade.]

Part 5 · Box plot

A box plot shows the distribution of a variable across categories — useful for comparing groups and spotting outliers.
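The points a box plot draws as outliers follow a simple rule: by default, geom_boxplot() marks values more than 1.5 times the interquartile range (IQR) beyond the box. A minimal sketch of that rule on hypothetical grades:

```r
# Hypothetical final grades with one unusually low value
grades <- c(20, 70, 75, 78, 80, 82, 85, 88)

q1  <- quantile(grades, 0.25)  # bottom edge of the box: 73.75
q3  <- quantile(grades, 0.75)  # top edge of the box:    82.75
iqr <- q3 - q1                 # height of the box:      9

lower_fence <- q1 - 1.5 * iqr  # 60.25
upper_fence <- q3 + 1.5 * iqr  # 96.25

# Values outside the fences are drawn as separate outlier points
grades[grades < lower_fence | grades > upper_fence]  # 20
```

An outlier in this sense is only "far from the middle half of the group"; it is a prompt to look closer, not proof that the value is an error.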

ggplot(data_sci, aes(x = gender, y = final_grade_cems, fill = gender)) +
  geom_boxplot(color = "gray30",
               outlier.colour = "#993C1D",
               outlier.shape  = 16,
               outlier.size   = 2) +
  scale_fill_manual(values = c("F" = "#E1F5EE", "M" = "#E6F1FB")) +
  labs(
    title = "Final Grade Distribution by Gender",
    x     = "Gender",
    y     = "Final Grade"
  ) +
  theme_minimal() +
  theme(
    plot.title     = element_text(size = 14, face = "bold", hjust = 0.5),
    legend.position = "none"
  )

Question: What does the box plot tell you about the distribution of final grades by gender? Are there differences worth investigating?

[The box plot shows for the most part that the grades are very similar between the genders. I do not think the differences are worth investigating. ]

Final reflection

After completing both the LMS time analysis and the sci-online-classes analysis, reflect on the following:

Question: How could these analytics techniques be applied in a real classroom or course design context? Describe one specific scenario — from your track (K–12 or ID/higher ed) — where the combination of a bar plot, line plot, and correlation would help an educator or designer make a better decision.

[I think these analytics techniques could be super helpful for course design. I think they would help pinpoint problems, find holes in content, and show where people are getting lost. For example, if you made a course and week 2 had the most time spent on it and also the lowest grade, this would show you that something was wrong. These different charts would help you see these anomalies easily and quickly.]

Render & submit

Step 1 — Add your name

Change the author: field in the YAML header at the top to your name.

Step 2 — Render

Click Render in the toolbar. A formatted HTML page will appear in your Viewer tab or a new browser window. Check the Console for any error messages if the render fails.

Step 3 — Publish

Option	Best for	Link
Posit Cloud	Quickest — one click from your workspace	Guide
RPubs	Free, public, easy to share a link	rpubs.com
Quarto Pub	Clean public portfolio pages	Guide
GitHub Pages	Best for a professional portfolio	Guide

E-portfolio tip

This document shows three levels of analytics work: descriptive (summary statistics and bar plots), trend analysis (line plots), and diagnostic (scatter plot and correlation). Together they demonstrate a complete analytical workflow that is worth showcasing in a professional portfolio.

Share your published link with your instructor once you have rendered and published. Post in the course discussion board if you run into any technical issues.