Advanced Visualization Techniques

Learning Analytics — Week 4 (Required)

Author

Juliette Duthoit

Published

June 20, 2026

Learning objectives

By the end of this file, you will be able to:

Create a word cloud from text data and interpret frequency patterns
Build a heatmap to reveal patterns across two dimensions
Visualize a social network of student interactions using SNA
Reflect on when each visualization type is the most appropriate choice

When to use which advanced visualization

Visualization	Best for	Typical data source
Word cloud	Frequency of words in text	Survey responses, discussion posts, feedback
Heatmap	Patterns across two categories	Scores by subject + week, engagement by module + cohort
SNA plot	Relationships between individuals	Forum reply data, collaboration records

Before you run any code, read through the whole file once. Notice which dataset each part uses and make sure the file is in your data folder.

Part 1 · Word cloud

A word cloud visualizes how often words appear in a text — larger words appear more frequently. We will use the Handbook of Learning Analytics (2022), which has been converted to a plain text file.

Packages for this section

# Text mining packages
# install.packages(c("tm", "wordcloud")) if you get errors here
library(tm)
library(wordcloud)

Load the text data

# read.delim() reads a plain text file
# header = FALSE and stringsAsFactors = FALSE read it as raw text
la_text <- read.delim("data/Handbook of LA.txt",
                      header = FALSE,
                      stringsAsFactors = FALSE)

# Create a text corpus — a collection of text documents the tm package can process
doc <- Corpus(VectorSource(la_text))

head(doc)

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

Clean the text

Before generating the word cloud, we remove common words, punctuation, numbers, and other noise. This is called text preprocessing.

# Helper function: replace specific characters with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

doc <- doc |>
  (\(d) {
    d <- tm_map(d, toSpace, "/")
    d <- tm_map(d, toSpace, "@")
    d <- tm_map(d, toSpace, "\\|")
    d <- tm_map(d, content_transformer(tolower))
    d <- tm_map(d, removeWords,
                c(stopwords("english"), "https", "can", "doi", "also", "use", "http", "chapter", "journal", "one", "com", "org", "I", "example", "url", "ing", "may"))
    d <- tm_map(d, removeNumbers)
    d <- tm_map(d, removePunctuation)
    d <- tm_map(d, stripWhitespace)
    d
  })()

Note

You may see warning messages when running the preprocessing chunk. These are expected from the tm package — as long as you see no error messages, you can continue.
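To make the preprocessing steps more concrete, here is a minimal base R sketch applied to a single made-up sentence. It does not use tm, the order of steps is simplified, and the short stopword list is a hypothetical stand-in for stopwords("english"), so read it as an illustration of the idea rather than a description of what tm_map() does internally.

```r
# Made-up sentence for illustration only (not taken from the Handbook)
raw_text <- "Learning analytics can help 25 teachers: see https://example.org/data"

# Hypothetical mini stopword list; tm's stopwords("english") is much longer
my_stopwords <- c("can", "see", "https")

clean_text <- tolower(raw_text)                      # lower-case everything
clean_text <- gsub("[/@|]", " ", clean_text)         # special characters become spaces
clean_text <- gsub("[[:digit:]]+", "", clean_text)   # drop numbers
clean_text <- gsub("[[:punct:]]+", " ", clean_text)  # drop punctuation
words      <- strsplit(trimws(clean_text), "\\s+")[[1]]  # split on whitespace
words      <- words[!words %in% my_stopwords]        # drop stopwords

words
```

Notice that fragments of the URL ("example", "org") survive the cleaning. That is the reason words such as "https", "org", and "com" end up in custom stopword lists.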

Generate the word cloud

# Build a term-document matrix — a table of word frequencies
dtm <- TermDocumentMatrix(doc)
m   <- as.matrix(dtm)
v   <- sort(rowSums(m), decreasing = TRUE)
word_freq <- data.frame(word = names(v), freq = v)

# Top 10 most frequent words
head(word_freq, 10)
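The frequency table is nothing more than a count of how often each word occurs, sorted from most to least frequent. A rough base R sketch of the same idea on a handful of made-up tokens (toy_freq and tokens are names invented for this example):

```r
# Made-up tokens for illustration only
tokens <- c("learning", "data", "learning", "students", "data", "learning")

# table() counts each distinct word; sort() puts the most frequent first
counts <- sort(table(tokens), decreasing = TRUE)

# Same two-column shape as word_freq: one row per word
toy_freq <- data.frame(
  word = names(counts),
  freq = as.vector(counts),
  stringsAsFactors = FALSE
)
toy_freq
```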

Word cloud from the Handbook of Learning Analytics (2022)

# Generate the word cloud
set.seed(1234)
wordcloud(
  words       = word_freq$word,
  freq        = word_freq$freq,
  min.freq    = 10,
  max.words   = 50,
  random.order = FALSE,
  rot.per     = 0.35,
  colors      = brewer.pal(8, "Dark2")
)

Task 1 — Parameter exploration: Try changing max.words to 50 or 150 (original=100) and min.freq to 10 (original=5). How does the word cloud change? Which version is more readable?

[With 100 words and a minimum frequency of 10, the cloud is a little messy; we seem to see important themes and maybe secondary themes. Only the most frequent words appear. Having fewer words makes the cloud a lot cleaner and easier to read. It feels like we still have a lot of words, so hopefully secondary themes are still visible. Going above 100 words makes the cloud illegible. Allowing a lower frequency (5) changed nothing in my clouds when I compared the same number of words, but I could imagine that with a frequency that is too low, we would make themes out of random words. The frequency should probably be adjusted according to the number of entries; for instance, with feedback from 20 students, we might set the frequency to 15 so that we know a word appeared in at least half of the feedback.]
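A caveat on the last idea: in this file, freq is the total number of times a word occurs, so min.freq filters on total counts, not on how many responses contain the word. A word that one student repeats many times would also pass the threshold. The sketch below, in base R on made-up feedback, shows the difference between the two counts; total_freq and response_freq are names invented for this example.

```r
# Made-up feedback from three students, for illustration only
feedback <- c(
  "the videos were helpful helpful helpful",
  "more videos please",
  "the quizzes were hard"
)
tokens_per_response <- strsplit(tolower(feedback), "\\s+")

# Total count: how many times each word occurs across all responses
total_freq <- table(unlist(tokens_per_response))

# Response count: in how many responses each word occurs at least once
response_freq <- table(unlist(lapply(tokens_per_response, unique)))

total_freq["helpful"]     # counts every repetition
response_freq["helpful"]  # counts each response only once
response_freq["videos"]
```

If the goal is "mentioned by at least half of the students", the response count is the one to threshold on.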

Task 2 — Add your own stopwords: Look at the word cloud above. Find 2–3 words that appear large but feel too generic to be meaningful — words that are frequent only because they appear in every chapter, not because they signal a specific theme.

Add those words to the removeWords() line in the text-preprocessing chunk above, then re-run that chunk and the wordcloud-generate chunk. The line to edit looks like this:

d <- tm_map(d, removeWords,
            c(stopwords("english"), "https", "can", "doi", "also", "use"))

Add your words inside the c(...) vector, then re-run both chunks.

Reflection: Which words did you remove? Why did you choose them? Does the revised cloud reveal any themes the original missed?

[I removed “http”, “chapter”, “journal”, “one”, “com”, “org”, “I”, “example”, “url”, “ing”, and “may”, all of which I saw appear in one of my word clouds. Those words felt too generic and brought no interesting information, so I felt it was safe to remove them. The main themes are still the same, but new words have appeared in the cloud, including “conference”.]

Question: Looking at the most frequent words, what does this tell you about the themes the learning analytics field focuses on? Does anything surprise you?

[The words now mainly focus on vocabulary about education and computer-related topics. I was surprised to see words connected to social/human interaction, but in hindsight, those words also make sense.]

Part 2 · Heatmap

A heatmap uses color intensity to show values across two categorical dimensions — useful for spotting patterns that would be invisible in a table or bar chart.

Package for this section

# reshape2 gives us melt() for converting wide to long format
# install.packages("reshape2") if needed
library(reshape2)

Load the data

data_hm <- read.csv("data/student_assignment_scores.csv")

head(data_hm)

glimpse(data_hm)

Rows: 30
Columns: 11
$ Student_ID    <chr> "Student_1", "Student_2", "Student_3", "Student_4", "Stu…
$ Assignment_1  <int> 98, 86, 50, 74, 59, 85, 67, 98, 96, 73, 56, 85, 74, 86, …
$ Assignment_2  <int> 77, 89, 52, 82, 91, 73, 79, 92, 64, 71, 57, 85, 53, 71, …
$ Assignment_3  <int> 52, 70, 51, 87, 59, 89, 54, 82, 98, 88, 85, 94, 91, 58, …
$ Assignment_4  <int> 98, 67, 54, 95, 89, 89, 70, 85, 85, 88, 91, 67, 76, 62, …
$ Assignment_5  <int> 54, 58, 65, 75, 92, 91, 91, 82, 78, 74, 100, 81, 66, 95,…
$ Assignment_6  <int> 87, 80, 55, 69, 64, 83, 91, 72, 61, 75, 90, 50, 51, 73, …
$ Assignment_7  <int> 54, 84, 63, 63, 77, 82, 86, 84, 75, 80, 94, 92, 81, 66, …
$ Assignment_8  <int> 80, 57, 65, 52, 86, 62, 61, 50, 79, 100, 80, 86, 96, 55,…
$ Assignment_9  <int> 82, 87, 92, 90, 71, 75, 69, 94, 86, 92, 64, 70, 83, 74, …
$ Assignment_10 <int> 79, 54, 71, 69, 99, 90, 60, 77, 87, 91, 58, 72, 75, 100,…

Reshape and plot

# melt() converts wide format (one column per assignment) to long format
# (one row per student-assignment combination) — required for ggplot heatmaps
data_hm_long <- melt(data_hm,
                     id.vars      = "Student_ID",
                     variable.name = "Assignment",
                     value.name   = "Score")

ggplot(data_hm_long,
       aes(x = Assignment, y = Student_ID, fill = Score)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_gradient(low = "#F0FAF6", high = "#0F6E56") +
  labs(
    title = "Heatmap of Student Progress Across Assignments",
    x     = "Assignment",
    y     = "Student ID",
    fill  = "Score"
  ) +
  theme_minimal() +
  theme(
    plot.title  = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(size = 8)
  )

Student scores across assignments — darker = higher score
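If melt() feels opaque, the wide-to-long reshape can be sketched in base R on a two-student, two-assignment slice of the scores shown in the glimpse() output above. This only illustrates the idea and is not how reshape2 implements it; scores_wide and scores_long are names made up for this example.

```r
# Small slice of the data (values copied from the glimpse() output)
scores_wide <- data.frame(
  Student_ID   = c("Student_1", "Student_2"),
  Assignment_1 = c(98, 86),
  Assignment_2 = c(77, 89),
  stringsAsFactors = FALSE
)

score_cols <- setdiff(names(scores_wide), "Student_ID")

# One row per student-assignment pair: stack the score columns on top of
# each other and repeat the ID and assignment labels to match
scores_long <- data.frame(
  Student_ID = rep(scores_wide$Student_ID, times = length(score_cols)),
  Assignment = rep(score_cols, each = nrow(scores_wide)),
  Score      = unlist(scores_wide[score_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
scores_long
```

Two students times two assignments gives four rows, which is the shape geom_tile() expects: one row per tile.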

Task: Copy the ggplot() call from the heatmap-plot chunk above into the blank chunk below. Make two changes:

Change the low and high colors in scale_fill_gradient() — for example: low = "white", high = "#993C1D" (red gradient).
Update the x and y labels in labs() to something more descriptive than "Assignment" and "Student ID".

Hint

Copy the entire ggplot(data_hm_long, ...) block from above. You only need to change two things: scale_fill_gradient() and labs(). Everything else stays the same.

# YOUR CODE HERE — copy the ggplot() call above and apply your two changes
data_hm_long <- melt(data_hm,
                     id.vars      = "Student_ID",
                     variable.name = "Assignment",
                     value.name   = "Score")

ggplot(data_hm_long,
       aes(x = Assignment, y = Student_ID, fill = Score)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_gradient(low = "white", high = "#993c1d") +
  labs(
    title = "Heatmap of Student Progress Across Assignments",
    x     = "Homework",
    y     = "Student Name",
    fill  = "Score"
  ) +
  theme_minimal() +
  theme(
    plot.title  = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(size = 8)
  )

Reflection: How does the color change affect how you interpret the data? Does the red gradient feel different from the green one — and why might that matter when presenting to a principal or department chair?

[Red is usually associated with danger or problems, so using it for high scores might be misinterpreted. If it were an activity heat map, it would have worked: redder = more activity, crowded times, etc. For higher grades, we need a different color scheme. White is also problematic, as it suggests “no data” rather than a low grade. It might be interpreted as “no assignment submitted”. A gradient from light green to darker green was better.]

Question: Looking at the heat map, identify one student and one assignment that stand out. What would you do as an instructor based on what you see?

[I have difficulty reading this map, to be honest. I can see that Students 3 and 28 started with very low grades and then improved through the rest of the semester. As an instructor, I would hopefully have started that heat map as soon as the first assignment was graded, and I would have monitored those two students and offered extra help after the second or third assignment was graded. Assignment 5 seems to stand out a little by being almost entirely red: everyone was rather successful on that assignment. It could mean that the content related to this assignment is very good and well received, or that the assignment is too easy.]
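When the heatmap is hard to read by eye, row and column averages can back up a visual impression. The sketch below uses base R on the first three students and first three assignments from the glimpse() output above; with the full data you would use all rows and columns. The object names are made up for this example.

```r
# First three students and assignments (values copied from the glimpse() output)
scores <- data.frame(
  Student_ID   = c("Student_1", "Student_2", "Student_3"),
  Assignment_1 = c(98, 86, 50),
  Assignment_2 = c(77, 89, 52),
  Assignment_3 = c(52, 70, 51),
  stringsAsFactors = FALSE
)

# Numeric matrix with students as rows and assignments as columns
score_matrix <- as.matrix(scores[, -1])
rownames(score_matrix) <- scores$Student_ID

student_means    <- rowMeans(score_matrix)   # average score per student
assignment_means <- colMeans(score_matrix)   # average score per assignment

names(which.min(student_means))      # student with the lowest average
names(which.max(assignment_means))   # assignment with the highest average
```

An average hides trends over time, so it complements the heatmap rather than replacing it.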

Part 3 · Social network analysis (SNA)

Social network analysis maps relationships and interactions between people. In education, this reveals who is connected, who is isolated, and who acts as a hub or bridge in a learning community.

Package for this section

# igraph is the main package for network analysis in R
# install.packages("igraph") if needed
library(igraph)

Load the interaction data

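# This dataset has From and To columns; the summary() output below also
# lists a weight attribute, so the file appears to include a weight column
# Each row is one interaction (e.g., Student_1 replied to Student_5)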
data_sna <- read.csv("data/student_interactions.csv")

head(data_sna)

Build the network

# Convert the data frame to a graph object
# directed = FALSE means interactions go both ways
g <- graph_from_data_frame(d = data_sna, directed = FALSE)

# Summary of the network
summary(g)

IGRAPH 152b310 UNW- 30 100 -- 
+ attr: name (v/c), weight (e/n)

Basic network plot

plot(
  g,
  vertex.label = V(g)$name,
  vertex.size  = 30,
  vertex.color = "#E1F5EE",
  edge.color   = "gray60",
  main         = "Student Interaction Network"
)

Basic social network of student interactions

Enhanced network — node size by connections

# Compute degree — how many connections each student has
V(g)$degree <- degree(g)

# Color palette from RColorBrewer (loaded in the main setup chunk)
n_nodes       <- length(V(g))
node_colors   <- brewer.pal(min(n_nodes, 12), "Set3")

plot(
  g,
  vertex.label       = V(g)$name,
  vertex.size        = V(g)$degree * 1.5 + 5,   # bigger = more connected
  vertex.color       = node_colors[seq_len(n_nodes) %% length(node_colors) + 1],
  edge.color         = "gray60",
  vertex.label.cex   = 0.6,
  vertex.label.color = "#2C2C2A",
  layout             = layout_with_fr,
  main               = "Student Interaction Network — Node Size = Degree"
)

Network with node size scaled by number of connections (degree)

Click “Show in New Window” for a better view

Network plots can be hard to read in the small viewer. Click the expand icon in the top right of the plot pane to open it full size.
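Degree is simply the number of edges that touch a student. In an undirected network, each row of the edge list adds one connection to both the From student and the To student. A minimal base R sketch on a hypothetical four-row edge list (the names are borrowed from the network, but the edges themselves are invented):

```r
# Hypothetical edge list in the same From/To shape as the interaction data
edges <- data.frame(
  From = c("David", "David", "Mia",   "Emily"),
  To   = c("Mia",   "Emily", "Emily", "James"),
  stringsAsFactors = FALSE
)

# Count how often each name appears at either end of an edge
toy_degree <- table(c(edges$From, edges$To))

sort(toy_degree, decreasing = TRUE)
```

The degrees always sum to twice the number of edges, because every edge has two ends.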

Task: The enhanced plot above shows node size by degree, but a visual ranking is hard to read precisely when many nodes are close in size. The chunk below has already built a data frame called degree_df with each student’s name and degree count. Your job is to sort it and find the top 3 most-connected students.

# This data frame is already built for you from the sna-enhanced chunk above.
# V(g)$name = student names, V(g)$degree = number of connections each has
degree_df <- data.frame(
  student = V(g)$name,
  degree  = V(g)$degree
)
degree_df

# YOUR CODE HERE
# Use arrange() and head() to show the top 3 students by degree, descending.
# Hint: arrange(desc(degree)) |> head(3)
degree_df_arranged <- degree_df |>
  arrange(desc(degree)) |>
  head(3)

# Print the result so the top 3 appear in the rendered output
degree_df_arranged

Reflection: Does the result match what you expected from the visual? Which student is most connected and what does that mean for an instructor?

[My result says “David, Mia, Emily”. David was the clear winner, but from the plot I could not decide who would come after him among Emily, Mia, Sophia, Christopher, Sarah, and Madison. I suspect that some among them have the same degree or differ by only 1.]
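If several students share the same degree, head(3) keeps whichever tied rows happen to come first and silently drops the rest. One possible way to check for this, sketched in base R on hypothetical degree counts (toy_df and its values are invented for illustration):

```r
# Hypothetical degree counts, for illustration only
toy_df <- data.frame(
  student = c("David", "Mia", "Emily", "Sophia", "James"),
  degree  = c(12, 9, 9, 9, 3),
  stringsAsFactors = FALSE
)

# Degree of the third-ranked student
cutoff <- sort(toy_df$degree, decreasing = TRUE)[3]

# Keep everyone at or above that cutoff, so ties are not dropped
top_with_ties <- toy_df[toy_df$degree >= cutoff, ]
top_with_ties <- top_with_ties[order(-top_with_ties$degree), ]
top_with_ties
```

Here the "top 3" contains four students, which tells you that second place is a three-way tie.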

Question: Looking at the enhanced network plot, identify:

Which students appear most central (largest nodes)? What does that mean?
Are any students isolated (small nodes with few or no connections)?
What would you do as an instructor based on what you see?

[(Was this question supposed to be higher up?) The largest node, to me, is David. Then it’s probably a tie or close between Emily, Mia, Sophia, Christopher, Sarah, and Madison. The connections are difficult to see under the big nodes; it feels like James, Samantha, and Joshua are the smallest nodes, with only 2 or 3 connections. As an instructor, I would create an assignment that intentionally pairs the lonely students with a central student, in the hope that those lonely students become more integrated socially and academically.]

Part 4 · Visualization choice reflection

You have now used seven visualization types: scatter plot, bar plot, line plot, histogram, word cloud, heatmap, and SNA plot (the last three this week).

Question 1: For each of the three advanced techniques, describe one specific scenario from your track (K–12 or ID/higher ed) where it would be the most useful choice:

Word cloud: [I could see this technique being very useful for students’ comments in course evaluations (mid-semester or final). It would help identify themes about what works well in the course and what does not.]
Heatmap: [I would use this map to analyze activity on the LMS throughout the semester. It could contain logins per week and/or discussion posts and/or assignment submissions.]
SNA plot: [I would use this one on discussion posts to see who participates the most and whether they always reply to the same people. I don’t think IDs can really use it the way we did in this activity, as students do not use the LMS to contact one another but rather email and other messaging apps; the data extracted from the LMS would not be representative.]

Question 2: What challenges might you face implementing these visualizations in a real school or institutional setting? Think about data availability, privacy, and stakeholder interpretation.

[For the word cloud on course evaluations, the challenge is obvious: students do not always fill out those evaluations, and their comments are often written quickly and not thoroughly thought out. The data is therefore somewhat skewed and not very representative. But combined with other data, it can be useful. IDs also don’t have access to the data unless the course instructor shares it. And the ID will need to check the comments themselves, too, because negations don’t show up in clouds (with “not helpful!”, the word “helpful” might show up large and be interpreted as positive feedback). For the heatmap, I believe there might be a FERPA or privacy issue; it may be necessary to obtain specific permissions. The data should be available on the LMS, and complete, though. In terms of interpretation, it is possible to read “no activity in week X” as “students are just lazy”. It would be important to cross-check the data with something else. For the SNA, I mentioned using discussion posts only, which is already limited, but they should be available in the LMS. As discussion posts are visible in the LMS and are usually an assignment, I believe it’s not an issue to mine this data as long as students were warned. It would be close to impossible to map all of the students’ communication, as much of it might happen offline or outside the LMS, so the interpretation of the data needs to be taken with a grain of salt or be used in a very specific context (in my example, in the context of discussion posts only, not as a representation of the SNA of the whole class for the whole course).]
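The negation problem mentioned above can be seen in a few lines. One possible workaround, sketched here in base R on made-up comments, is to count pairs of adjacent words (bigrams) instead of single words, so that "not helpful" stays together. This is an illustration under simple assumptions (whitespace tokenization, no stopword removal), not a full text-mining pipeline.

```r
# Made-up evaluation comments, for illustration only
comments <- c("not helpful at all", "very helpful examples", "not helpful")
tokens   <- strsplit(tolower(comments), "\\s+")

# Single-word counts: "helpful" looks like the dominant, positive theme
word_freq_toy <- sort(table(unlist(tokens)), decreasing = TRUE)

# Bigram counts: paste each word with the word that follows it
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

word_freq_toy["helpful"]
bigram_freq["not helpful"]
```

In this toy example, two of the three mentions of "helpful" are actually negative, which a single-word cloud cannot show.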

Question 3: How could these techniques evolve in your field? What would be possible if these tools were integrated into an LMS dashboard that teachers or designers could access in real time?

[If the word cloud could be integrated and connected to a weekly feedback assignment, we could see the evolution of the general feeling about the course and maybe identify weeks that are problematic. It would be a live version that instructors would want to use themselves! And it would maybe give instructors an opportunity to adjust their course during the semester instead of waiting until the end to review the course for the next semester. For the heat map, it would mean a live analysis of engagement and maybe the possibility of creating alerts or protocols based on the results. I know iLearn had “intelligent agents” that could be programmed to send an automatic alert when a student had two low grades in a row, for instance, or had not connected to the LMS in more than 48 hours. Paired with a heat map for engagement, this could help identify at-risk students and get them support as soon as they have been identified. The same could be possible with a live SNA map: it could help identify isolated students right away and allow immediate action to integrate them.]
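The "two low grades in a row" rule described above is easy to prototype. The sketch below uses base R on the first three students and first five assignments from the glimpse() output in Part 2. The cutoff of 60 and the function name has_two_low_in_a_row are assumptions made for this example, and it does not reflect how any particular LMS implements its alerts.

```r
# First three students, first five assignments (values from the glimpse() output)
scores <- data.frame(
  Student_ID   = c("Student_1", "Student_2", "Student_3"),
  Assignment_1 = c(98, 86, 50),
  Assignment_2 = c(77, 89, 52),
  Assignment_3 = c(52, 70, 51),
  Assignment_4 = c(98, 67, 54),
  Assignment_5 = c(54, 58, 65),
  stringsAsFactors = FALSE
)

low_cutoff <- 60  # hypothetical threshold for a "low" grade

# TRUE if any two consecutive scores are both below the cutoff
has_two_low_in_a_row <- function(x, cutoff) {
  low <- x < cutoff
  n   <- length(low)
  if (n < 2) return(FALSE)
  any(low[-n] & low[-1])   # compare each score with the next one
}

# Apply the rule to every student (row by row)
flagged <- apply(scores[, -1], 1, has_two_low_in_a_row, cutoff = low_cutoff)
scores$Student_ID[flagged]
```

Student_1 has two scores below 60, but they are not consecutive, so only Student_3 is flagged under this rule.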

Render & submit

Step 1 — Add your name

Change the author: field in the YAML header at the top to your name.

Step 2 — Render

Click Render in the toolbar. This file uses several packages (tm, wordcloud, igraph) that produce warnings during preprocessing — that is expected. As long as the final HTML page appears, the render was successful. If you see a true error (red text that stops the render), check that all packages are installed:

install.packages(c("tm", "wordcloud", "RColorBrewer", "reshape2", "igraph"))

Step 3 — Publish

Option	Best for	Link
Posit Cloud	Quickest — one click from your workspace	Guide
RPubs	Free, public, easy to share a link	rpubs.com
Quarto Pub	Clean public portfolio pages	Guide
GitHub Pages	Best for a professional portfolio	Guide

E-portfolio tip

This is the most visually impressive of the four files — word clouds, heatmaps, and network graphs are immediately recognizable as advanced data work to anyone reviewing your portfolio. If you are sharing one document from this course with a hiring committee or graduate school application, this is the one to lead with. Pair it with your capstone analysis for the full picture.

Share your published link with your instructor once you have rendered and published. Post in the course discussion board if you run into any technical issues.