Deliverable 1 — Posit Cloud analysis

Author

Juliette Duthoit

Published

June 27, 2026

Introduction

An instructor reached out to me with concerns about an online course she inherited from a former instructor. Although overall performance appeared stable and acceptable, she noticed that some quizzes and grade book items had lower averages while others were completed very successfully by students. These irregularities, paired with some negative feedback from students in the final course evaluation, raised questions: she wanted to know whether there was an issue with the course design and, if so, where to make changes.

If some assessments in the course are disproportionately difficult, unclear, or misaligned with the course’s learning objectives, it is important to identify them because they are barriers to students’ learning. It is also essential that assessments accurately evaluate what students know and what they can do. With this analysis, I aim to help the instructor determine whether specific assessments need redesign. This will answer the instructor’s inquiry and support student success in the course.

Data Overview

Setting up for this project

#| include: false
#tidyverse already attaches dplyr, readr, and ggplot2
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(janitor))
library(skimr)

As I am a GA, and neither an instructor nor an official Instructional Designer at the institution, I can technically access the course data through my GA position, but I do not have the ethical grounds to use it for this project.

Therefore, I am using a data set from other courses, provided by my instructor, that I loaded and cleaned.

See this Process

As explained above, I will use an existing data set and select data in it to emulate a data set that would have been extracted from the course’s LMS.

# loading the data into a data frame
raw_data <- read.csv("sci-online-classes.csv")

#exploring the first 6 rows of the data frame
head(raw_data)

Checking the result, we notice that two variables (columns) seem to contain problematic information: “Gradebook_Item” seems to have only one value (“POINTS EARNED & TOTAL COURSE POINTS”) and “Grade_Category” appears to be empty.

#listing the values existing in "Gradebook_Item"
unique(raw_data$Gradebook_Item)

[1] "POINTS EARNED & TOTAL COURSE POINTS" "ATTEMPTED"                          
[3] "OutcomeDefinition.Total.title"

#listing the values existing in "Grade_Category"
unique(raw_data$Grade_Category)

[1] NA

“Grade_Category” is entirely empty. “Gradebook_Item” holds three values rather than the single one expected, but these values make little sense in the context of this data set. Both columns are therefore uninformative and need to be deleted.

#deleting the two columns
raw_data<-raw_data |>
  select(-Gradebook_Item, -Grade_Category)

#checking the first rows of the frame
head(raw_data)
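The column-by-column check above can also be run on every column at once. The following is a minimal sketch in base R, using a small made-up data frame (`toy` is hypothetical and only borrows column names from the real set for illustration) to list the columns that are entirely missing.

```r
# Hypothetical toy data, standing in for raw_data
toy <- data.frame(
  student_id     = c(1, 2, 3),
  Gradebook_Item = c("A", "B", "A"),
  Grade_Category = c(NA, NA, NA)
)

# TRUE for each column in which every value is missing
all_missing <- vapply(toy, function(col) all(is.na(col)), logical(1))

# Names of the columns that carry no information at all
empty_cols <- names(all_missing)[all_missing]
empty_cols
```

A check like this only catches fully empty columns; a column such as “Gradebook_Item”, which has values that are simply not meaningful, still needs a manual look.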

Our data set is now clean and ready for minimizing.

Then, I minimized the set to:

keep only one course data (one course, one semester, one section)
See this Process
The first six rows displayed above show that this data set contains data from several courses. Since we only want the data of one course, we need to filter the data down to one subject, one semester, and one section. The course_id appears to be a combination of the course subject, semester, and section, so we can explore which course_id has enough entries to be analyzed for this project.

#exploring how many course IDs exist in the set and how many entries there are for each of them
count(raw_data, course_id) |>
  #displays the result in descending order, from the course ID with the most entries to the one with the least
  arrange(desc(n))

According to this result, the course ID “FrScA-S216-01” has the most entries, with 81 rows under that ID. This is a good amount for this project, so we will select that course and create a data set with only these rows.

#creating a new data frame with only the FrScA course from S216, section 01
full_course_data <- raw_data |>
  filter(course_id == "FrScA-S216-01")

Since we picked only one course ID, our data frame should have only one value in “course_id”, one value in “subject”, one value in “section”, and one value in “semester”. Let’s check that this is correct.

#listing the values existing in "course_id"
unique(full_course_data$course_id)

[1] "FrScA-S216-01"

#listing the values existing in "subject"
unique(full_course_data$subject)

[1] "FrScA"

#listing the values existing in "semester"
unique(full_course_data$semester)

[1] "S216"

#listing the values existing in "section"
unique(full_course_data$section)

[1] 1
As this is correct, these four columns are now redundant and we can delete them.

#updating the data frame by removing the redundant columns
full_course_data <- full_course_data |>
  select(-course_id, -subject, -semester, -section)

#exploring the result
head(full_course_data)

Our one course data set is ready.
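The four `unique()` calls used above can be condensed into a single check before any column is dropped. This is a sketch under assumptions: `toy_course` is a made-up data frame, not the real data, and it only illustrates how to find columns that hold a single value.

```r
# Hypothetical toy data, standing in for full_course_data
toy_course <- data.frame(
  course_id  = rep("FrScA-S216-01", 3),
  subject    = rep("FrScA", 3),
  student_id = c(101, 102, 103)
)

# Number of distinct values in each column
n_values <- vapply(toy_course, function(col) length(unique(col)), integer(1))

# Columns holding a single value are redundant once the course is fixed
constant_cols <- names(n_values)[n_values == 1]
constant_cols
```

Dropping only the columns returned by such a check avoids removing a column that unexpectedly still varies.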
keep only data available from my institution’s LMS
See this Process
Now that the course is selected, we need to remove the variables that would not be available to an instructional designer at my institution and focus on the data that could be extracted from the course’s shell in the LMS.

Let’s start by exploring which variables are in the set by looking at the column names.

#extracting the names of the variables/columns
colnames(full_course_data)

 [1] "student_id"            "total_points_possible" "total_points_earned"  
 [4] "percentage_earned"     "FinalGradeCEMS"        "Points_Possible"      
 [7] "Points_Earned"         "Gender"                "q1"                   
[10] "q2"                    "q3"                    "q4"                   
[13] "q5"                    "q6"                    "q7"                   
[16] "q8"                    "q9"                    "q10"                  
[19] "TimeSpent"             "TimeSpent_hours"       "TimeSpent_std"        
[22] "int"                   "pc"                    "uv"                   

My institution uses D2L as an LMS, so we will only keep the data that this specific LMS can produce. This means that we should keep:
- Student identifiers: student_id
- Performance variables: total_points_possible, total_points_earned, percentage_earned, FinalGradeCEMS, Points_Possible, Points_Earned, the quiz variables (q1 to q10), and the time spent in the LMS (TimeSpent)
The other variables are either calculations made by the creator of the set (TimeSpent_hours, TimeSpent_std), unclear (“int”, “pc”, “uv”), or not available in the LMS (“Gender”). Let’s remove those columns from the set.

#removing the variables not available in the LMS and storing the result in a new data frame
LMS_data <- full_course_data |>
  select(-Gender, -TimeSpent_hours, -TimeSpent_std, -int, -pc, -uv)

#checking the result in the first 6 rows of the data frame
head(LMS_data)

The set now only contains data that would be available in my institution’s LMS.
keep only data that I would have extracted from the LMS for this specific project, according to my ethics.
See this Process
We now have the data of one course, limited to what would be available in the LMS.

However, as an Instructional Designer, I would not extract everything available; it would be an overuse of my position and could lead to ethical issues.
Therefore, we now need to delete the data that I would not have extracted for this project; we want to keep only the data that is necessary for the project and ethical for me to use without putting students’ identities at risk.

Let’s look at the variables we have left and decide what is strictly necessary for the project.

#extracting the names of the variables/columns in the data frame
colnames(LMS_data)

 [1] "student_id"            "total_points_possible" "total_points_earned"  
 [4] "percentage_earned"     "FinalGradeCEMS"        "Points_Possible"      
 [7] "Points_Earned"         "q1"                    "q2"                   
[10] "q3"                    "q4"                    "q5"                   
[13] "q6"                    "q7"                    "q8"                   
[16] "q9"                    "q10"                   "TimeSpent"            

#exploring the first rows of the data frame
head(LMS_data)

This analysis will look at the assessments and try to determine whether one or several of them are disproportionately difficult.

For this aim, we absolutely need:
- “student_id”, to group the scores by student. These IDs are not personally identifiable.
- the individual quiz grades, which are central to the analysis as they are the assessments being evaluated
- a course performance metric: the final grade (“FinalGradeCEMS”), which differs from the quiz grades as it includes more than just the quizzes
“TimeSpent” could, in theory, be used to evaluate the effort students put into the course and verify whether that effort correlates with their performance. However, this variable is not reliable enough for a meaningful analysis: LMS session length is easily distorted, for instance if a student leaves the browser tab open without actively working on the course. In addition, “TimeSpent” reflects individual differences in reading speed, processing time, or study habits; some students take longer to complete activities for reasons unrelated to engagement or motivation. This variable therefore does not reflect course quality, and it will not be included in the data set as it is unreliable, sensitive, and not relevant.

The remaining metrics in this data set should also be removed, as they do not contribute to answering the research question. The variables “total_points_possible”, “total_points_earned”, “percentage_earned”, “Points_Possible”, and “Points_Earned” do not provide information related to quiz difficulty or quiz alignment and are therefore not relevant to this analysis. They will be removed according to the principle of data minimization.

#erasing the selected columns
ethical_LMS_data <- LMS_data |>
  select(-TimeSpent, -total_points_possible, -total_points_earned, -percentage_earned, -Points_Possible, -Points_Earned)

#checking the result
head(ethical_LMS_data)

Our LMS set from one course is ready for analysis! We can now turn it into a csv file that will be saved into the project and used as the basis of this analysis.

#save the data frame as a separate file in the project
write.csv(ethical_LMS_data, "LMS_dataset.csv", row.names = FALSE)
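
Data minimization can also be written as an allowlist, naming the columns to keep instead of the columns to drop, so that any unexpected extra column is excluded by default. The sketch below is an alternative formulation, not the approach used in this project; `toy_lms` and `keep_cols` are hypothetical names and the values are made up.

```r
# Hypothetical toy data, standing in for LMS_data
toy_lms <- data.frame(
  student_id     = c(101, 102),
  FinalGradeCEMS = c(88.5, 76.9),
  q1             = c(5, 4),
  TimeSpent      = c(1200, 950)
)

# Allowlist: only the variables the analysis strictly needs
keep_cols <- c("student_id", "FinalGradeCEMS", "q1")

# Fail early if an expected column is absent from the export
stopifnot(all(keep_cols %in% names(toy_lms)))

# Keep the allowlisted columns and nothing else
minimal <- toy_lms[, keep_cols, drop = FALSE]
names(minimal)
```

With a denylist, a new sensitive column added to a future export would slip through; with an allowlist it would not.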

Data Description

#loading the data set into a dataframe
LMS_data <- read.csv("LMS_dataset.csv")

#exploring the data frame
glimpse(LMS_data)

Rows: 81
Columns: 12
$ student_id     <int> 47448, 53475, 55078, 57188, 65116, 66689, 67463, 68795,…
$ FinalGradeCEMS <dbl> 88.48758, 81.03837, 97.74266, 76.97517, 93.22799, 98.64…
$ q1             <int> 5, NA, NA, NA, 5, 4, 4, 4, 5, 4, 5, 5, NA, 5, 5, 5, 5, …
$ q2             <int> 4, NA, NA, NA, 4, 4, 3, 4, 4, 4, 3, 2, NA, NA, 4, 5, 4,…
$ q3             <int> 4, NA, NA, NA, 4, 3, 3, 4, 4, 3, 3, 4, NA, 5, 3, 4, 4, …
$ q4             <int> 5, NA, NA, NA, 5, 4, 4, 5, 5, 4, 5, 5, NA, 5, 4, 5, 4, …
$ q5             <int> 5, NA, NA, NA, 5, 4, 4, 4, 5, 4, 5, 5, NA, 5, 4, 5, 5, …
$ q6             <int> 4, NA, NA, NA, 4, 4, 4, 4, 4, 5, 3, 4, NA, 4, NA, 4, 5,…
$ q7             <int> 4, NA, NA, NA, 5, 4, 3, 5, 5, 3, 4, 5, NA, 5, NA, 3, NA…
$ q8             <int> 5, NA, NA, NA, 5, 4, 4, 5, 5, 5, 5, 5, NA, 5, NA, 5, 5,…
$ q9             <int> 3, NA, NA, NA, 5, 3, 4, 4, 5, 3, 4, 5, NA, 2, NA, 3, 4,…
$ q10            <int> 5, NA, NA, NA, 5, 4, 5, 4, 5, 5, 5, 5, NA, 4, NA, 5, 4,…

This data contains 12 columns, which define our 12 variables. These variables can be grouped into three categories, as defined in the table below.

Variable Description
Variable	What it is	Use
student_id	Unique student identifier	Distinguish one student from another in the set (one row, one student)
FinalGradeCEMS	Final course grade	Course performance, outcome of the course
q1, q2, q3…q10	Quiz scores, numbered in the order they occur in the semester	Performance through the semester

Each row represents one student’s data, so with 81 rows, we have the data of 81 students who took the course.

Data Cleaning

From the previous look at the data, we noticed that the variable/column names are inconsistently capitalized. Removing capital letters and spaces from column names makes them easier to process.

#remove capital letters and replace spaces with underscores in variable names
LMS_data <- LMS_data |>
  clean_names()

#check the result
glimpse(LMS_data)

Rows: 81
Columns: 12
$ student_id       <int> 47448, 53475, 55078, 57188, 65116, 66689, 67463, 6879…
$ final_grade_cems <dbl> 88.48758, 81.03837, 97.74266, 76.97517, 93.22799, 98.…
$ q1               <int> 5, NA, NA, NA, 5, 4, 4, 4, 5, 4, 5, 5, NA, 5, 5, 5, 5…
$ q2               <int> 4, NA, NA, NA, 4, 4, 3, 4, 4, 4, 3, 2, NA, NA, 4, 5, …
$ q3               <int> 4, NA, NA, NA, 4, 3, 3, 4, 4, 3, 3, 4, NA, 5, 3, 4, 4…
$ q4               <int> 5, NA, NA, NA, 5, 4, 4, 5, 5, 4, 5, 5, NA, 5, 4, 5, 4…
$ q5               <int> 5, NA, NA, NA, 5, 4, 4, 4, 5, 4, 5, 5, NA, 5, 4, 5, 5…
$ q6               <int> 4, NA, NA, NA, 4, 4, 4, 4, 4, 5, 3, 4, NA, 4, NA, 4, …
$ q7               <int> 4, NA, NA, NA, 5, 4, 3, 5, 5, 3, 4, 5, NA, 5, NA, 3, …
$ q8               <int> 5, NA, NA, NA, 5, 4, 4, 5, 5, 5, 5, 5, NA, 5, NA, 5, …
$ q9               <int> 3, NA, NA, NA, 5, 3, 4, 4, 5, 3, 4, 5, NA, 2, NA, 3, …
$ q10              <int> 5, NA, NA, NA, 5, 4, 5, 4, 5, 5, 5, 5, NA, 4, NA, 5, …
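
To see what the cleaning step does to individual names, the same rule can be applied to a plain character vector. This sketch assumes the janitor package is installed; `make_clean_names()` is the function that `clean_names()` applies to a data frame's column names, and `old_names` is just a sample of the names from this data set.

```r
library(janitor)

# Column names as they appear before cleaning
old_names <- c("student_id", "FinalGradeCEMS", "q1")

# Convert the names to lowercase snake_case
new_names <- make_clean_names(old_names)
new_names
```

Names that are already lowercase with underscores, such as “student_id”, are left unchanged.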

Looking at the set, we also see that some values are NA (missing values). We need to verify which variables have NA values.

#looking for extra information on the data set, including number of missing value per column
skim(LMS_data)

Data summary
Name	LMS_data
Number of rows	81
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
student_id	0	1.00	87173.63	11136.53	47448.0	85659.00	90622.00	94419.00	95991.00	▁▁▁▂▇
final_grade_cems	3	0.96	80.47	19.23	0.9	75.59	87.47	91.87	98.65	▁▁▁▂▇
q1	16	0.80	4.38	0.63	2.0	4.00	4.00	5.00	5.00	▁▁▁▇▇
q2	17	0.79	3.58	0.94	2.0	3.00	4.00	4.00	5.00	▃▅▁▇▃
q3	16	0.80	3.40	0.93	1.0	3.00	3.00	4.00	5.00	▁▃▇▆▂
q4	16	0.80	4.28	0.96	1.0	4.00	5.00	5.00	5.00	▁▁▁▆▇
q5	16	0.80	4.25	0.61	3.0	4.00	4.00	5.00	5.00	▁▁▇▁▅
q6	17	0.79	3.91	0.75	2.0	3.00	4.00	4.00	5.00	▁▃▁▇▃
q7	16	0.80	4.05	0.84	2.0	3.00	4.00	5.00	5.00	▁▅▁▇▇
q8	17	0.79	4.38	0.65	2.0	4.00	4.00	5.00	5.00	▁▁▁▇▇
q9	17	0.79	3.48	0.96	2.0	3.00	3.00	4.00	5.00	▂▇▁▅▃
q10	17	0.79	4.17	0.86	2.0	4.00	4.00	5.00	5.00	▁▂▁▇▇

Each quiz has 16 or 17 missing values. We interpret a missing value as a quiz that was not submitted, which is important information: it can be used to calculate a submission rate, for instance. Those rows therefore need to be kept.

There is no missing student ID. However, we are missing 3 final grades. Either the data is incomplete or these students did not complete the course. Neither case is useful for this analysis, so we will remove those rows.

#keep everything but the NA value in the final grade column
LMS_data <- LMS_data |>
  filter(!is.na(final_grade_cems))

#check that there are no more NA values in the final grade column
sum(is.na(LMS_data$final_grade_cems))

[1] 0

Our data is now clean and ready to be analyzed.

Data Analysis

The goal of this analysis is to identify if some of the quizzes are too difficult and misaligned with overall student performance. For this, we need to look at:

Quiz difficulty
Quiz submission rate
Quiz alignment with final grade

Quiz Difficulty

We need to know if some quizzes have a much lower average score than the others.

Calculating the Data

The average for each quiz is calculated as follows:

average score = sum of all scores for the quiz / number of students who submitted the quiz

#creating a dataframe with only the quiz names and quiz means
quiz_means <- LMS_data |>
  select(-student_id, -final_grade_cems) |>
  colMeans(na.rm = TRUE) #calculates the mean of each column WITHOUT the missing values (they are not 0, they are missing)

#checking the result
quiz_means

      q1       q2       q3       q4       q5       q6       q7       q8 
4.396825 3.612903 3.396825 4.285714 4.253968 3.951613 4.079365 4.387097 
      q9      q10 
3.516129 4.193548
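
The choice of `na.rm = TRUE` matters because it changes the denominator of the average. A small worked example, with made-up scores for a single hypothetical quiz, shows the difference between averaging over submitted quizzes only and counting each missing quiz as a 0.

```r
# Hypothetical scores for one quiz; NA means the quiz was not submitted
scores <- c(5, 4, NA, 3, NA)

# Mean over submitted quizzes only: (5 + 4 + 3) / 3
mean_submitted <- mean(scores, na.rm = TRUE)

# Mean if every missing quiz were counted as a 0: (5 + 4 + 0 + 3 + 0) / 5
mean_with_zeros <- mean(ifelse(is.na(scores), 0, scores))

c(submitted_only = mean_submitted, missing_as_zero = mean_with_zeros)
```

Counting missing quizzes as zeros would mix difficulty with participation, which is why the two are measured separately in this analysis.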

The result is a named numeric vector, but we need a data frame to create a plot, so we convert that vector into a usable frame.

#creating a dataframe from the named vector
quiz_means_df <- data.frame(
  quiz = names(quiz_means),
  mean_score = quiz_means)

#checking the result
quiz_means_df

Visualizing the Data

#creating a bar graph from the new quiz dataframe
ggplot(quiz_means_df,
  aes(x=factor(quiz, levels = paste0("q", 1:10)),
      y=mean_score)) +
  
#Institution color & spacing the bars
  geom_col(fill="#582c83", width=0.7) +

#forcing the y axes to go from 0 to 5
  scale_y_continuous(limits = c(0, 5)) +
  
#Labeling axes, title, and keeping design clean
  labs(title = "Average Score per Quiz",
       x= "Quiz",
       y= "Average Score (out of 5)") +
  theme_minimal()

Interpretation

We can see from the graph that quizzes 2, 3, and 9 have noticeably lower average scores (between 3.4 and 3.6 out of 5) than the other quizzes (about 4.0 to 4.4). This suggests that those 3 quizzes may be more challenging or less clear to students than the others. They may warrant a closer review for clarity, alignment, or cognitive load.
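
The visual reading can be backed by a simple numeric rule. The sketch below reuses the quiz means printed earlier and flags the quizzes that fall more than one standard deviation below the mean of the quiz means; that cutoff is an assumption chosen for illustration, not an established threshold, and `quiz_means_reported` is a hypothetical name.

```r
# Quiz means copied from the output reported earlier in this analysis
quiz_means_reported <- c(
  q1 = 4.396825, q2 = 3.612903, q3 = 3.396825, q4 = 4.285714,
  q5 = 4.253968, q6 = 3.951613, q7 = 4.079365, q8 = 4.387097,
  q9 = 3.516129, q10 = 4.193548
)

# Assumed cutoff: one standard deviation below the mean of the quiz means
cutoff <- mean(quiz_means_reported) - sd(quiz_means_reported)

# Quizzes whose average falls below that cutoff
flagged <- names(quiz_means_reported)[quiz_means_reported < cutoff]
flagged
```

With these values the rule flags q2, q3, and q9, which matches the reading of the bar chart. With only ten quizzes, a different cutoff would be equally defensible.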

Quiz Submission Rate

We want to know how many students skipped each quiz. This could help us identify quizzes that were overwhelmingly skipped.

Calculating the Data

The submission rate is calculated as follows:

submission rate = number of submitted quizzes / total number of students

For each quiz, “NA” indicates that the student did not submit their quiz.

#creating a dataframe with only the quiz names and quiz scores
quiz_data <- LMS_data |>
  select(-student_id, -final_grade_cems)

#creating a named vector with the submission rate of each quiz
quiz_submission <- colSums(!is.na(quiz_data)) / nrow(quiz_data)

#checking the result
quiz_submission

       q1        q2        q3        q4        q5        q6        q7        q8 
0.8076923 0.7948718 0.8076923 0.8076923 0.8076923 0.7948718 0.8076923 0.7948718 
       q9       q10 
0.7948718 0.7948718
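
Only two distinct rates appear in this output, and they can be traced back to counts. The arithmetic below is a sketch under one assumption: the submission counts of 63 and 62 are inferred from the printed rates, not read from the data. The 78 students are the 81 rows minus the 3 rows removed for a missing final grade.

```r
# Rows left after removing the 3 students without a final grade
n_students <- 81 - 3

# Assumed submission counts, inferred from the two rates printed above
n_submitted <- c(higher = 63, lower = 62)

# Submission rate = submitted quizzes / students remaining in the data
rates <- n_submitted / n_students
round(rates, 7)
```

Both rates therefore sit close to 80%, and the difference between quizzes amounts to a single student.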

The result is again a named numeric vector, but we need a data frame to create a plot, so we convert that vector into a usable frame. For clarity, we also convert the rates into percentages by multiplying them by 100.

#creating a dataframe from the named vector
quiz_submission_df <- data.frame(
  quiz = names(quiz_submission),
  submission_rate = quiz_submission*100)

#checking the result
quiz_submission_df
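
As an alternative to building two separate data frames, the average score and the submission rate can be computed in one pass by reshaping the quiz columns to long format. This is a sketch of that alternative, assuming dplyr and tidyr are available; `toy_quiz` is made-up data with two quizzes, not the course data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with two quizzes; NA = not submitted
toy_quiz <- data.frame(
  student_id = 1:4,
  q1 = c(5, 4, NA, 3),
  q2 = c(NA, 2, NA, 4)
)

quiz_summary <- toy_quiz |>
  # One row per student and quiz
  pivot_longer(cols = starts_with("q"), names_to = "quiz", values_to = "score") |>
  group_by(quiz) |>
  summarise(
    mean_score      = mean(score, na.rm = TRUE),  # submitted quizzes only
    submission_rate = mean(!is.na(score)) * 100,  # percent of students
    .groups = "drop"
  )

quiz_summary
```

The resulting frame has one row per quiz and can be passed to ggplot() directly for either chart.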

Visualizing the Data

#package to be able to have % in the y axis
suppressPackageStartupMessages(library(scales))

#creating a bar graph from the new quiz dataframe
ggplot(quiz_submission_df,
       aes(x=factor(quiz, levels = paste0("q", 1:10)),
           y=submission_rate)) +
  
#Institution color & spacing the bars
  geom_col(fill="#FFD100", width=0.7) + 

#y axis parameters (0 to 100%)  
  scale_y_continuous(limits = c(0, 100), 
                     labels = percent_format(scale = 1)) +

#labels axis, title, and keeping the design clean and minimal
  labs(title = "Submission Rate per Quiz",
       x= "Quiz",
       y= "Submission Rate") +
  theme_minimal()

Interpretation

Each quiz has a submission rate of around 80% (between 79.5% and 80.8%), indicating that about one in five students did not submit each quiz. Because the rate is nearly identical across quizzes, this suggests a course-level participation issue rather than a problem with any specific quiz.
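
Whether it is the same students who skip every quiz cannot be read from per-quiz rates alone; it requires counting missing quizzes per student. The sketch below shows one way to do this in base R on made-up data (`toy_scores` is hypothetical and has three quizzes only).

```r
# Hypothetical toy data; NA = quiz not submitted
toy_scores <- data.frame(
  q1 = c(5, NA, 4, NA),
  q2 = c(4, NA, NA, NA),
  q3 = c(3, NA, 5, NA)
)

# Number of quizzes each student skipped
n_skipped <- rowSums(is.na(toy_scores))

# How many students skipped 0, 1, 2, ... quizzes
table(n_skipped)

# Students who never submitted a single quiz
never_submitted <- sum(n_skipped == ncol(toy_scores))
never_submitted
```

If most skipped quizzes come from students who skipped all of them, the participation issue is concentrated in one group rather than spread across the class.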

Alignment With Final Grade

We want to know whether quiz performance reflects overall course performance for each student. This means checking whether the quiz scores align with the final grades.

Calculating the Data

For this analysis, we will compare each student's score on each quiz with their final grade. This will give us a more detailed picture of alignment than simply correlating each student’s average quiz score with their final grade.

To do this, we will compute the correlation between each quiz and the final grade across all students. Each correlation value will tell us how strongly performance on that specific quiz is aligned with overall course performance.

#creating a new frame with the needed data
alignment_data <- LMS_data |>
  select(-student_id) #keeps the quiz scores and the final grade
head(alignment_data)

#calculates the correlations when the quiz was submitted (this ignores the pair when a quiz has NA as a value) and puts it in a matrix
alignment_cor <- cor(alignment_data, use = "pairwise.complete.obs")

#put new matrix into a dataframe
alignment_df <- data.frame(
  quiz = rownames(alignment_cor)[rownames(alignment_cor) != "final_grade_cems"], #selects all rows except the final grade row
  correlation = alignment_cor[rownames(alignment_cor) != "final_grade_cems", "final_grade_cems"]) #extracts the correlation coefficient of each quiz with the final grade

#checking the result
alignment_df
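
To make explicit what "pairwise.complete.obs" does, the sketch below compares the matrix approach with a quiz-by-quiz calculation restricted to students who submitted that quiz. The data are made up (`toy` is hypothetical), and the last line counts the pairs behind each coefficient, since a correlation based on few pairs is less stable.

```r
# Hypothetical toy data; NA = quiz not submitted
toy <- data.frame(
  final_grade = c(90, 80, 70, 60, 85),
  q1          = c(5, 4, NA, 2, 5),
  q2          = c(3, NA, 4, 2, 5)
)

# Column of the pairwise correlation matrix that relates to the final grade
from_matrix <- cor(toy, use = "pairwise.complete.obs")[c("q1", "q2"), "final_grade"]

# Same values computed one quiz at a time, on students who submitted that quiz
one_by_one <- vapply(c("q1", "q2"), function(q) {
  submitted <- !is.na(toy[[q]])
  cor(toy[[q]][submitted], toy$final_grade[submitted])
}, numeric(1))

# Number of student pairs behind each coefficient
n_pairs <- colSums(!is.na(toy[c("q1", "q2")]))

all.equal(from_matrix, one_by_one)
n_pairs
```

Each quiz's coefficient is thus computed on a slightly different set of students, which is worth keeping in mind when comparing quizzes.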

Visualizing the Data

#Create a graph from dataframe
ggplot(alignment_df,
       aes(x = factor(quiz, levels = paste0("q", 1:10)),
           y = correlation)) +
  
#creating background zones for clear interpretation
  #Weak zone (-0.30 to +0.30)
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = -0.30, ymax = 0.30),
            fill = "#FFF9D9") +
  #Moderate zone (-0.75 to -0.30) and (+0.30 to +0.75)
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = -0.75, ymax = -0.30),
            fill = "#FFF2B3") +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = 0.30, ymax = 0.75),
            fill = "#FFF2B3") +
  #Strong zone (-1.00 to -0.75) and (+0.75 to +1.00)
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = -1.00, ymax = -0.75),
            fill = "#FFE066") +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = 0.75, ymax = 1.00),
            fill = "#FFE066") +
  
#bars
  geom_col(fill = "#582c83", width = 0.7) +
  
#y axis scale and second y axis
  scale_y_continuous(
    limits = c(-1, 1),
    sec.axis = dup_axis(
      breaks = c(-0.875, -0.5, -0.200, 0, 0.200, 0.5, 0.875),
      labels = c("Strong", "Moderate", "Weak", "None", "Weak", "Moderate", "Strong"),
      name = "Correlation Strength")) +

#add a line at y=0
  geom_hline(yintercept = 0, color = "black", linewidth = 0.6) +
  
#labels and theme
  labs(
    title = "Alignment Between Quiz Scores and Final Grade",
    x = "Quiz",
    y = "Correlation with Final Grade") +
  theme_minimal()

Interpretation

Quiz performance is only weakly to very weakly correlated with the final grade. This means that the quiz scores do not reliably predict students’ overall achievement in the course. That suggests a misalignment between what the quizzes measure and what the final grade represents. This could be because the quizzes are too easy or because the final grade is dominated by other grade items, decreasing the influence of the quiz scores.

Overall, it indicates that the quizzes, as currently designed, are not strong indicators of course mastery and may benefit from revision to better align with the learning outcomes reflected in the final grade.

Findings summary (text)

Across the semester, quiz performance showed clear patterns in difficulty, participation, and alignment with overall course achievement:

Average quiz scores varied, with several quizzes (quizzes 2, 3, and 9) showing noticeably lower means, suggesting potential differences in difficulty or clarity.
Submission rates are around 80% for each quiz, indicating a stable pattern of partial participation and suggesting a course-level engagement issue rather than quiz-specific problems.
Correlations between quiz scores and final grades are uniformly weak and slightly negative, indicating that quiz performance does not meaningfully predict overall course outcomes and suggesting that the quizzes do not align with course objectives.

Taken together, these results suggest that the quizzes may not be fully aligned with the course’s learning objectives or assessment structure and need to be redesigned to improve the learning experience.