# Creating a dataset called `data2` with simulated data for 40 students for our exercise this week.
# The first column, 'Student_ID', generates a unique ID for each student by
# combining the word "Student" with numbers 1 through 40.
#
# The subsequent columns, 'Week_1' through 'Week_16', represent data for 16 weeks.
#
# Each week’s column contains 40 random values, sampled from the range 6 to 20,
# representing hypothetical time spent (in hours) on the learning management system by each student during that week.
set.seed(42) # Setting seed for reproducibility
data2 <- data.frame(
Student_ID = paste("Student", 1:40, sep = "_"),
Week_1 = sample(6:20, 40, replace = TRUE),
Week_2 = sample(6:20, 40, replace = TRUE),
Week_3 = sample(6:20, 40, replace = TRUE),
Week_4 = sample(6:20, 40, replace = TRUE),
Week_5 = sample(6:20, 40, replace = TRUE),
Week_6 = sample(6:20, 40, replace = TRUE),
Week_7 = sample(6:20, 40, replace = TRUE),
Week_8 = sample(6:20, 40, replace = TRUE),
Week_9 = sample(6:20, 40, replace = TRUE),
Week_10 = sample(6:20, 40, replace = TRUE),
Week_11 = sample(6:20, 40, replace = TRUE),
Week_12 = sample(6:20, 40, replace = TRUE),
Week_13 = sample(6:20, 40, replace = TRUE),
Week_14 = sample(6:20, 40, replace = TRUE),
Week_15 = sample(6:20, 40, replace = TRUE),
Week_16 = sample(6:20, 40, replace = TRUE)
)
# Saving the dataset as a CSV file.
# I named the file '40_students_LMS_time_spent.csv'. You can name the file differently if you'd like.
# The row.names = FALSE argument prevents R from writing an unnecessary column of row numbers.
write.csv(data2, "40_students_LMS_time_spent.csv", row.names = FALSE)
# Inspect the first few rows of the dataset
# head(data2)
Data Collection and Basic Analytics
CUED 7540: Learning Analytics II
Learning Objectives
By the end of this lesson, you will be able to:
- Perform descriptive and predictive analytics techniques.
- Create and save datasets manually.
- Interpret and visualize your data using different plot types.
- Understand the importance of data cleaning and transformation for analysis.
Part 1: Creating dataset & datafile
Instead of loading an existing dataset, we’ll manually create our own simulated data to practice with. This will help you understand the structure of data in R and the importance of data management.
Setup
Let’s start by loading the necessary libraries. If you encounter any errors, revisit our first module for guidance on installing packages.
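The code in this lesson calls ggplot(), %>%, filter(), pivot_longer(), and read_csv(), so a setup chunk along the following lines should cover it. The package list is an assumption inferred from those function calls, not the original setup chunk; loading tidyverse as a whole would work too.

```r
# Packages assumed from the functions used later in this lesson
needed <- c("ggplot2", "dplyr", "tidyr", "readr")

# Check which of them are not installed yet
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Nothing is installed automatically; this only reports what is missing
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach every package so its functions can be called directly
  invisible(lapply(needed, library, character.only = TRUE))
}
```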
Creating Simulated Data
Task 1: We’ll simulate data for 40 students’ time spent on a learning management system (LMS) over 16 weeks. We will then save this new dataset as a CSV file.
The set.seed() function is used here to ensure that the random data we generate is the same every time you run the code. Think of it as “shuffling a deck of cards in the same way” so that we all get the same results.
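As a small illustration of that idea, the sketch below resets the seed before each of two draws and checks that they match. The draw size of 5 is arbitrary and chosen only to keep the example short.

```r
# Reset the seed, then draw 5 values from the same 6-to-20 range used in this lesson
set.seed(42)
first_draw <- sample(6:20, 5, replace = TRUE)

# Resetting to the same seed replays the same sequence of "random" values
set.seed(42)
second_draw <- sample(6:20, 5, replace = TRUE)

identical(first_draw, second_draw)  # TRUE
```

Without the second set.seed() call, the next draw would continue the random sequence and would usually differ from the first.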
Once your data is created and saved as a CSV file, you should see the new file in your Files Pane.
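To confirm that a file was written as expected, you can read it back in. The sketch below uses a tiny made-up data frame and a temporary file path (both hypothetical, so your real CSV is not touched) and also shows what row.names = FALSE prevents.

```r
# A tiny stand-in for data2 (values are made up for illustration)
toy <- data.frame(
  Student_ID = paste("Student", 1:3, sep = "_"),
  Week_1 = c(6, 10, 14)
)

# Write without row names, then read the file back
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
read_back <- read.csv(csv_path)
names(read_back)  # "Student_ID" "Week_1"

# Writing with the default row.names = TRUE adds an extra, unnamed column,
# which read.csv() labels "X"
csv_with_rows <- tempfile(fileext = ".csv")
write.csv(toy, csv_with_rows)
names(read.csv(csv_with_rows))  # "X" "Student_ID" "Week_1"
```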
Reflect & Respond
Question: Why do you think it’s important to be able to manually create and save datasets?
- [It builds understanding of how data is structured, improves data management skills, and provides control over the variables being analyzed.]
Part 2: Descriptive Analytics & Visualization
Descriptive Analytics
Now that we have our dataset, let’s conduct some basic analytics to better understand the data.
Task 2: Calculate summary statistics for each week. Use the summary() function to analyze the dataset, excluding the first column which is the Student_ID.
# Summary statistics for each week
summary_stats <- summary(data2[, -1])
# summary() function calculates summary statistics (such as minimum, maximum, median, mean, etc.)
# data2[, -1]: excluding the first column (-1), which is the Student_ID column.
summary_stats
Week_1 Week_2 Week_3 Week_4
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.: 9.00 1st Qu.: 8.00 1st Qu.: 9.75 1st Qu.:12.00
Median :13.00 Median :10.50 Median :12.50 Median :14.50
Mean :12.32 Mean :10.75 Mean :12.53 Mean :14.28
3rd Qu.:15.00 3rd Qu.:13.00 3rd Qu.:16.00 3rd Qu.:18.00
Max. :20.00 Max. :20.00 Max. :19.00 Max. :20.00
Week_5 Week_6 Week_7 Week_8
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.: 8.50 1st Qu.: 9.00 1st Qu.: 8.75 1st Qu.:10.75
Median :13.50 Median :15.00 Median :14.00 Median :13.50
Mean :13.18 Mean :13.22 Mean :13.60 Mean :13.05
3rd Qu.:17.00 3rd Qu.:17.00 3rd Qu.:18.25 3rd Qu.:16.00
Max. :20.00 Max. :20.00 Max. :20.00 Max. :20.00
Week_9 Week_10 Week_11 Week_12
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.:10.00 1st Qu.: 9.00 1st Qu.:10.75 1st Qu.: 8.00
Median :14.00 Median :13.00 Median :14.00 Median :12.00
Mean :13.05 Mean :12.75 Mean :13.47 Mean :12.12
3rd Qu.:16.00 3rd Qu.:16.00 3rd Qu.:17.00 3rd Qu.:15.00
Max. :20.00 Max. :20.00 Max. :20.00 Max. :19.00
Week_13 Week_14 Week_15 Week_16
Min. : 6.00 Min. : 6.00 Min. : 6.00 Min. : 6.00
1st Qu.: 9.75 1st Qu.:10.75 1st Qu.:10.75 1st Qu.:10.00
Median :14.00 Median :14.00 Median :15.00 Median :15.00
Mean :12.90 Mean :13.70 Mean :13.75 Mean :13.65
3rd Qu.:16.00 3rd Qu.:17.25 3rd Qu.:17.00 3rd Qu.:17.00
Max. :20.00 Max. :20.00 Max. :20.00 Max. :20.00
Question: What insights do you gain from the summary results?
- [The summary shows minimum, maximum, median, and quartiles. It highlights variation between weeks and potential outliers.]
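summary() does not report spread beyond the quartiles. If you want a standard deviation per week as well, one option is sapply() over the same column selection. This is a sketch on a made-up three-student data frame, not the lesson's data2.

```r
# Made-up miniature of data2: an ID column followed by weekly hours
toy <- data.frame(
  Student_ID = c("Student_1", "Student_2", "Student_3"),
  Week_1 = c(6, 10, 14),
  Week_2 = c(20, 11, 8)
)

# Negative indexing drops column 1 and keeps everything else
weeks_only <- toy[, -1]
names(weeks_only)  # "Week_1" "Week_2"

# Apply sd() to each remaining column to get one value per week
weekly_sd <- sapply(weeks_only, sd)
weekly_sd
```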
Task 3: Calculate the average time spent per week for all students using the colMeans() function.
# Calculate the average time spent per week
average_time <- colMeans(data2[, -1])
# The colMeans() function computes the mean for each column in the data2 dataframe, excluding the first column (-1), which is typically the Student_ID column.
average_time
Week_1 Week_2 Week_3 Week_4 Week_5 Week_6 Week_7 Week_8 Week_9 Week_10
12.325 10.750 12.525 14.275 13.175 13.225 13.600 13.050 13.050 12.750
Week_11 Week_12 Week_13 Week_14 Week_15 Week_16
13.475 12.125 12.900 13.700 13.750 13.650
Question: How would you interpret these results?
- [Students are spending between about 10.8 and 14.3 hours per week on the LMS.]
Question: If some weeks show significantly higher or lower time spent, what actions would you take as an instructor or course designer?
- [I might provide additional reminders, clarify expectations, or add interactive content to motivate students. For weeks with unusually high time spent, I would check whether assignments are too long or confusing, and consider breaking them into smaller, more manageable tasks.]
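To find the highest and lowest weeks without scanning the printout, which.max() and which.min() can be applied to the named vector. The sketch below re-enters the averages printed above by hand so that it runs on its own; in your session you would use average_time directly.

```r
# Weekly averages copied from the colMeans() output above
average_time <- c(
  Week_1 = 12.325, Week_2 = 10.750, Week_3 = 12.525, Week_4 = 14.275,
  Week_5 = 13.175, Week_6 = 13.225, Week_7 = 13.600, Week_8 = 13.050,
  Week_9 = 13.050, Week_10 = 12.750, Week_11 = 13.475, Week_12 = 12.125,
  Week_13 = 12.900, Week_14 = 13.700, Week_15 = 13.750, Week_16 = 13.650
)

# which.min()/which.max() return the position of the extreme value, with its name
names(which.min(average_time))  # "Week_2"
names(which.max(average_time))  # "Week_4"

# Gap between the busiest and quietest week, in hours
max(average_time) - min(average_time)  # 3.525
```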
Task 4: Calculate each individual’s average time spent across 16 weeks. Use the rowMeans() function and add this new variable to the data2 data frame.
# Calculate the mean time spent for each student across all 16 weeks
# The rowMeans() function calculates the mean for each row (i.e., each student)
data2$Semester_Average <- rowMeans(data2[, 2:17])
# rowMeans(data2[, 2:17]) calculates the mean of each row across columns 2 to 17, which hold the weekly time spent values for each student. The result is stored in the 'Semester_Average' column.
# Inspect the first few rows to see the new column with mean time spent
head(data2)
Student_ID Week_1 Week_2 Week_3 Week_4 Week_5 Week_6 Week_7 Week_8 Week_9
1 Student_1 6 20 18 17 16 20 20 17 9
2 Student_2 10 11 15 18 19 16 19 20 9
3 Student_3 6 15 14 6 10 15 17 18 20
4 Student_4 14 13 18 17 6 14 20 14 18
5 Student_5 15 9 17 14 11 17 6 9 12
6 Student_6 9 9 9 14 6 11 18 14 6
Week_10 Week_11 Week_12 Week_13 Week_14 Week_15 Week_16 Semester_Average
1 11 18 18 7 19 15 15 15.3750
2 20 11 9 11 8 18 12 14.1250
3 14 11 15 14 20 19 19 14.5625
4 15 17 7 13 20 11 13 14.3750
5 13 18 12 16 7 6 18 12.5000
6 7 12 6 11 15 16 13 11.0000
Task 5: Calculate the average time spent for each student only from week 1 to week 5. Add this as a new variable named early_semester_average.
# Now, complete the code to calculate the mean time spent for each student ONLY from week 1 to week 5. Save the result to a variable named 'early_semester_average'
# Revise the code to choose from week 1 to week 5.
# COMPLETE THE CODE BELOW
data2$early_semester_average <- rowMeans(data2[, c("Week_1", "Week_2", "Week_3", "Week_4", "Week_5")])
# Inspect the first few rows to see the new column
#TYPE YOUR CODE
head(data2)
Student_ID Week_1 Week_2 Week_3 Week_4 Week_5 Week_6 Week_7 Week_8 Week_9
1 Student_1 6 20 18 17 16 20 20 17 9
2 Student_2 10 11 15 18 19 16 19 20 9
3 Student_3 6 15 14 6 10 15 17 18 20
4 Student_4 14 13 18 17 6 14 20 14 18
5 Student_5 15 9 17 14 11 17 6 9 12
6 Student_6 9 9 9 14 6 11 18 14 6
Week_10 Week_11 Week_12 Week_13 Week_14 Week_15 Week_16 Semester_Average
1 11 18 18 7 19 15 15 15.3750
2 20 11 9 11 8 18 12 14.1250
3 14 11 15 14 20 19 19 14.5625
4 15 17 7 13 20 11 13 14.3750
5 13 18 12 16 7 6 18 12.5000
6 7 12 6 11 15 16 13 11.0000
early_semester_average
1 15.4
2 14.6
3 10.2
4 13.6
5 13.2
6 9.4
Question: How do these newly added variables provide you with new insights? What could you do with this information as an instructor?
- [You can quickly see which students are consistently spending time on the LMS and which students may be falling behind. You could use this information to provide targeted support.]
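One way to turn these averages into targeted support is to flag students whose early-semester average falls below a cutoff. The sketch below uses the six students shown in the head(data2) output above; the 12-hour threshold is a hypothetical value chosen for illustration, not a recommendation.

```r
# The six students printed above, re-entered so the example runs on its own
snapshot <- data.frame(
  Student_ID = paste("Student", 1:6, sep = "_"),
  Semester_Average = c(15.3750, 14.1250, 14.5625, 14.3750, 12.5000, 11.0000),
  early_semester_average = c(15.4, 14.6, 10.2, 13.6, 13.2, 9.4),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: fewer than 12 hours per week in weeks 1 to 5
threshold <- 12

# Keep only the rows below the cutoff
needs_check_in <- snapshot[snapshot$early_semester_average < threshold, ]
needs_check_in$Student_ID  # "Student_3" "Student_6"
```

Student_3 is a useful case: the early average is low (10.2) but the semester average is not (14.5625), which suggests a slow start followed by a recovery.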
Data Visualization
Visualizing data helps to better understand and communicate your findings. Let’s create some plots with our data.
Bar Plot of Average Time Spent Per Week
For this data, a bar plot is an excellent way to show the average time spent each week.
# First, reshape the average_time vector into a data frame for ggplot
average_time_table <- data.frame(
Week = factor(names(average_time), levels = names(average_time)),
Average_Time_Spent = average_time
)
Task 6: Create a bar plot using the average_time_table data frame. Experiment with different colors and text for the parameters (fill, color, size, face, etc.).
# Create a bar plot of 'Average Time Spent' by 'Week'
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent)) +
# Add the bars; stat = "identity" uses the precomputed weekly averages as bar heights
geom_bar(stat = "identity", fill = "steelblue", color = "black") +
# Add titles and labels to the plot
labs(title = "Average Time Spent per Week", x = "Week", y = "Average Time Spent (Hours)") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1)
)
Line Plot of Average Time Spent Per Week
Task 7: Line plots are great for showing trends over time. Create a line plot to visualize the trend of average time spent per week using average_time_table.
# Create a line plot
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent, group = 1)) +
geom_line(color = "blue", linewidth = 1.2) + # linewidth replaces size for lines as of ggplot2 3.4.0
geom_point(color = "red", size = 3) +
labs(title = "Trend of Average Time Spent per Week",
x = "Week",
y = "Average Time Spent (Hours)") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1)
)
Question: What differences do you notice between the bar plot and the line plot? Which one is more effective for showing a trend over time, and why?
- [write your response here.]
Task 7-2: If you are interested in the trends of each student’s time spent from week 1 to week 16, a line plot can be helpful.
# Reshape the data for easier plotting
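data_long <- data2 %>%
pivot_longer(cols = starts_with("Week"), names_to = "Week", values_to = "TimeSpent") %>%
# Keep weeks in calendar order; as plain text they would sort as Week_1, Week_10, Week_11, ...
mutate(Week = factor(Week, levels = paste0("Week_", 1:16)))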
# Create a line plot for each student's weekly TimeSpent
ggplot(data_long, aes(x = Week, y = TimeSpent, group = Student_ID, color = Student_ID)) +
geom_line() +
labs(title = "Weekly Time Spent by Each Student",
x = "Week",
y = "Time Spent (Hours)") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title.x = element_text(size = 12, face = "bold"),
axis.title.y = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none" # Hides the legend to reduce clutter
)
Question: What do you think about the analytics result?
- [It looks like a bunch of jumbled up pasta.]
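One thing to check in the long-format data: pivot_longer() stores the week names as text, and text labels sort alphabetically rather than by week number, which can scramble the x-axis. Below is a minimal base R sketch of the issue and one possible remedy, setting the factor levels explicitly as average_time_table does for the bar plot.

```r
# The 16 week labels in calendar order
weeks <- paste0("Week_", 1:16)

# Sorting them as plain text typically puts Week_10 to Week_16 ahead of Week_2
alphabetical <- sort(weeks)
head(alphabetical, 4)

# Fixing the levels tells R (and ggplot2) the intended order
ordered_weeks <- factor(alphabetical, levels = weeks)
levels(ordered_weeks)

# Sorting the factor now follows the level order, Week_1 through Week_16
as.character(sort(ordered_weeks))
```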
Task 7-3: To better interpret the analytics, let’s focus on a subset of students and create a line plot.
# Select 5 specific students
selected_students <- data2 %>%
filter(Student_ID %in% c("Student_1", "Student_10", "Student_20", "Student_30", "Student_40")) #TYPE YOUR CODE
# Reshape the data for easier plotting
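data_long_selected <- selected_students %>%
pivot_longer(cols = starts_with("Week"), names_to = "Week", values_to = "TimeSpent") %>%
# Keep weeks in calendar order; as plain text they would sort as Week_1, Week_10, Week_11, ...
mutate(Week = factor(Week, levels = paste0("Week_", 1:16)))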
# Create a line plot for the selected students
ggplot(data_long_selected, aes(x = Week, y = TimeSpent, group = Student_ID, color = Student_ID)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
labs(title = "Weekly Time Spent by Selected Students",
x = "Week",
y = "Time Spent (Hours)") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title.x = element_text(size = 12, face = "bold"),
axis.title.y = element_text(size = 12, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_blank(), # Hides the legend title for simplicity
legend.position = "top" # Position the legend at the top for better visibility
)
Question: Pick 3-6 students you are particularly interested in comparing. Update the R code above (Task 7-3 chunk) with your selected students. What insights do you gain from this more focused analysis?
- [It makes it a lot clearer and helps me see that the students sort of follow their own trends.]
Question: What changes did you make to the visualization? Why?
- [I adjusted the color palette to make each student’s line more visually distinct and increased the line thickness so individual patterns were easier to track. I also added points at each week to highlight when students’ engagement changed sharply. These small adjustments made it much easier to identify who was increasing, decreasing, or maintaining consistent time spent across the semester.]
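The subset in Task 7-3 depends on the %in% operator, which returns TRUE for every element of the left-hand vector that appears in the right-hand vector. Here is a base R sketch of what filter() receives, using the same five IDs as the chunk above:

```r
# All 40 student IDs, built the same way as in data2
ids <- paste("Student", 1:40, sep = "_")

# The students to compare (swap in your own picks)
chosen <- c("Student_1", "Student_10", "Student_20", "Student_30", "Student_40")

# One TRUE/FALSE per student; filter() keeps the rows where this is TRUE
keep <- ids %in% chosen
sum(keep)    # 5
which(keep)  # 1 10 20 30 40
```

An ID that matches no student (for example, a typo such as "Student_41") is silently ignored, so comparing sum(keep) with length(chosen) is a quick sanity check.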
Histogram
Task 8: A histogram is useful for understanding the overall distribution of a single variable. Create a histogram of the Semester_Average variable to see the general pattern and frequency of time spent across all students.
# Histogram of Semester_Average (each student's mean weekly time spent)
ggplot(data2, aes(x = Semester_Average)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
labs(title = "Histogram of Mean Time Spent by 40 Students",
x = "Mean Time Spent (Hours)",
y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title.x = element_text(size = 12, face = "bold"),
axis.title.y = element_text(size = 12, face = "bold")
)
Question: Which plot do you find more insightful, the bar plot, line plot, or histogram? Why?
- [The histogram is the neatest so it is the easiest to get insight from.]
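To see what a histogram is counting, you can bin the values by hand. The sketch below uses only the six Semester_Average values printed earlier and one-hour bins centered on whole numbers; those bin edges are an assumption for illustration, and ggplot2 may place its bins differently.

```r
# Semester averages for the six students shown in the head(data2) output
semester_avg <- c(15.3750, 14.1250, 14.5625, 14.3750, 12.5000, 11.0000)

# One-hour bins: (10.5, 11.5], (11.5, 12.5], ..., (15.5, 16.5]
bins <- cut(semester_avg, breaks = seq(10.5, 16.5, by = 1))

# Count how many students fall into each bin; these counts are the bar heights
bin_counts <- table(bins)
bin_counts
```

With these edges the counts are 1, 1, 0, 2, 2, 0, so four of the six students average between 13.5 and 15.5 hours per week.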
Part 3: Predictive Analytics and Visualization
Now, let’s switch gears and explore the relationship between two variables. We will return to the dataset from the previous module, sci-online-classes.csv. We want to see if there is a relationship between the time spent on the LMS (TimeSpent_hours) and students’ final grades (FinalGradeCEMS).
Load data
First, we need to load the data we used in our first module.
#import/load the dataset
# COMPLETE THE CODE WITH THE FUNCTION NAME (read_csv) & THE FILE NAME (sci-online-classes.csv).
data <- read_csv("data/sci-online-classes.csv", show_col_types = FALSE)
# Inspect your data
str(data)
spc_tbl_ [603 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ student_id : num [1:603] 43146 44638 47448 47979 48797 ...
$ course_id : chr [1:603] "FrScA-S216-02" "OcnA-S116-01" "FrScA-S216-01" "OcnA-S216-01" ...
$ total_points_possible: num [1:603] 3280 3531 2870 4562 2207 ...
$ total_points_earned : num [1:603] 2220 2672 1897 3090 1910 ...
$ percentage_earned : num [1:603] 0.677 0.757 0.661 0.677 0.865 ...
$ subject : chr [1:603] "FrScA" "OcnA" "FrScA" "OcnA" ...
$ semester : chr [1:603] "S216" "S116" "S216" "S216" ...
$ section : chr [1:603] "02" "01" "01" "01" ...
$ Gradebook_Item : chr [1:603] "POINTS EARNED & TOTAL COURSE POINTS" "ATTEMPTED" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" ...
$ Grade_Category : logi [1:603] NA NA NA NA NA NA ...
$ FinalGradeCEMS : num [1:603] 93.5 81.7 88.5 81.9 84 ...
$ Points_Possible : num [1:603] 5 10 10 5 438 5 10 10 443 5 ...
$ Points_Earned : num [1:603] NA 10 NA 4 399 NA NA 10 425 2.5 ...
$ Gender : chr [1:603] "M" "F" "M" "M" ...
$ q1 : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
$ q2 : num [1:603] 4 4 4 5 3 NA 5 3 3 NA ...
$ q3 : num [1:603] 4 3 4 3 3 NA 3 3 3 NA ...
$ q4 : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
$ q5 : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
$ q6 : num [1:603] 5 4 4 5 4 NA 5 4 3 NA ...
$ q7 : num [1:603] 5 4 4 4 4 NA 4 3 3 NA ...
$ q8 : num [1:603] 5 5 5 5 4 NA 5 3 4 NA ...
$ q9 : num [1:603] 4 4 3 5 NA NA 5 3 2 NA ...
$ q10 : num [1:603] 5 4 5 5 3 NA 5 3 5 NA ...
$ TimeSpent : num [1:603] 1555 1383 860 1599 1482 ...
$ TimeSpent_hours : num [1:603] 25.9 23 14.3 26.6 24.7 ...
$ TimeSpent_std : num [1:603] -0.181 -0.308 -0.693 -0.148 -0.235 ...
$ int : num [1:603] 5 4.2 5 5 3.8 4.6 5 3 4.2 NA ...
$ pc : num [1:603] 4.5 3.5 4 3.5 3.5 4 3.5 3 3 NA ...
$ uv : num [1:603] 4.33 4 3.67 5 3.5 ...
- attr(*, "spec")=
.. cols(
.. student_id = col_double(),
.. course_id = col_character(),
.. total_points_possible = col_double(),
.. total_points_earned = col_double(),
.. percentage_earned = col_double(),
.. subject = col_character(),
.. semester = col_character(),
.. section = col_character(),
.. Gradebook_Item = col_character(),
.. Grade_Category = col_logical(),
.. FinalGradeCEMS = col_double(),
.. Points_Possible = col_double(),
.. Points_Earned = col_double(),
.. Gender = col_character(),
.. q1 = col_double(),
.. q2 = col_double(),
.. q3 = col_double(),
.. q4 = col_double(),
.. q5 = col_double(),
.. q6 = col_double(),
.. q7 = col_double(),
.. q8 = col_double(),
.. q9 = col_double(),
.. q10 = col_double(),
.. TimeSpent = col_double(),
.. TimeSpent_hours = col_double(),
.. TimeSpent_std = col_double(),
.. int = col_double(),
.. pc = col_double(),
.. uv = col_double()
.. )
- attr(*, "problems")=<externalptr>
# Display the first few rows of the dataset
# COMPLETE THE CODE
head(data)
# A tibble: 6 × 30
student_id course_id total_points_possible total_points_earned
<dbl> <chr> <dbl> <dbl>
1 43146 FrScA-S216-02 3280 2220
2 44638 OcnA-S116-01 3531 2672
3 47448 FrScA-S216-01 2870 1897
4 47979 OcnA-S216-01 4562 3090
5 48797 PhysA-S116-01 2207 1910
6 51943 FrScA-S216-03 4208 3596
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
# section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
# FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
# Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
# q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
# TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
summary(data)
student_id course_id total_points_possible total_points_earned
Min. :43146 Length:603 Min. : 840 Min. : 651
1st Qu.:85612 Class :character 1st Qu.: 2810 1st Qu.: 2050
Median :88340 Mode :character Median : 3583 Median : 2757
Mean :86070 Mean : 4274 Mean : 3245
3rd Qu.:92730 3rd Qu.: 5069 3rd Qu.: 3875
Max. :97441 Max. :15552 Max. :12208
percentage_earned subject semester section
Min. :0.3384 Length:603 Length:603 Length:603
1st Qu.:0.7047 Class :character Class :character Class :character
Median :0.7770 Mode :character Mode :character Mode :character
Mean :0.7577
3rd Qu.:0.8262
Max. :0.9106
Gradebook_Item Grade_Category FinalGradeCEMS Points_Possible
Length:603 Mode:logical Min. : 0.00 Min. : 5.00
Class :character NA's:603 1st Qu.: 71.25 1st Qu.: 10.00
Mode :character Median : 84.57 Median : 10.00
Mean : 77.20 Mean : 76.87
3rd Qu.: 92.10 3rd Qu.: 30.00
Max. :100.00 Max. :935.00
NA's :30
Points_Earned Gender q1 q2
Min. : 0.00 Length:603 Min. :1.000 Min. :1.000
1st Qu.: 7.00 Class :character 1st Qu.:4.000 1st Qu.:3.000
Median : 10.00 Mode :character Median :4.000 Median :4.000
Mean : 68.63 Mean :4.296 Mean :3.629
3rd Qu.: 26.12 3rd Qu.:5.000 3rd Qu.:4.000
Max. :828.20 Max. :5.000 Max. :5.000
NA's :92 NA's :123 NA's :126
q3 q4 q5 q6
Min. :1.000 Min. :1.000 Min. :2.000 Min. :1.000
1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:4.000
Median :3.000 Median :4.000 Median :4.000 Median :4.000
Mean :3.327 Mean :4.268 Mean :4.191 Mean :4.008
3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000
Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
NA's :123 NA's :125 NA's :127 NA's :127
q7 q8 q9 q10
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:4.000
Median :4.000 Median :4.000 Median :4.000 Median :4.000
Mean :3.907 Mean :4.289 Mean :3.487 Mean :4.101
3rd Qu.:4.750 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000
Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
NA's :129 NA's :129 NA's :129 NA's :129
TimeSpent TimeSpent_hours TimeSpent_std int
Min. : 0.45 Min. : 0.0075 Min. :-1.3280 Min. :2.000
1st Qu.: 851.90 1st Qu.: 14.1983 1st Qu.:-0.6996 1st Qu.:3.900
Median :1550.91 Median : 25.8485 Median :-0.1837 Median :4.200
Mean :1799.75 Mean : 29.9959 Mean : 0.0000 Mean :4.219
3rd Qu.:2426.09 3rd Qu.: 40.4348 3rd Qu.: 0.4623 3rd Qu.:4.700
Max. :8870.88 Max. :147.8481 Max. : 5.2188 Max. :5.000
NA's :5 NA's :5 NA's :5 NA's :76
pc uv
Min. :1.500 Min. :1.000
1st Qu.:3.000 1st Qu.:3.333
Median :3.500 Median :3.667
Mean :3.608 Mean :3.719
3rd Qu.:4.000 3rd Qu.:4.167
Max. :5.000 Max. :5.000
NA's :75 NA's :75
Visualize Relationships
Task 9: To explore the relationship between two variables (e.g., TimeSpent_hours and FinalGradeCEMS), create a scatter plot with a regression line. This visual will help us see if one variable might predict another.
# Create a scatter plot of TimeSpent_hours vs. FinalGradeCEMS with a regression line
ggplot(data, aes(x = TimeSpent_hours, y = FinalGradeCEMS)) + # TYPE YOUR CODE
geom_point(color = "blue", size = 3, alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = TRUE) + # This line will add a linear regression line
labs(title = "Scatter Plot of Time Spent vs. Final Grade with Regression Line",
x = "Time Spent (Hours)",
y = "Final Grade (CEMS)") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title.x = element_text(size = 12, face = "bold"),
axis.title.y = element_text(size = 12, face = "bold")
)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 30 rows containing missing values or values outside the scale range
(`geom_point()`).
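The red line drawn by geom_smooth(method = "lm") is an ordinary linear regression of y on x. The sketch below fits the same kind of model with lm() on a made-up six-student data frame (the values are hypothetical, not taken from sci-online-classes.csv) and shows that rows with a missing value are dropped, much like the 30 rows in the warnings above.

```r
# Hypothetical data: hours on the LMS and final grade, with one missing grade
toy <- data.frame(
  TimeSpent_hours = c(5, 10, 20, 30, 40, 50),
  FinalGradeCEMS  = c(62, 68, 75, 83, 91, NA)
)

# Fit grade as a linear function of time spent
fit <- lm(FinalGradeCEMS ~ TimeSpent_hours, data = toy)

# Intercept and slope of the fitted line
coef(fit)

# lm() silently omits the incomplete row, so only 5 of 6 students are used
nobs(fit)  # 5

# Predicted grade for a student who spends 25 hours
predicted <- predict(fit, newdata = data.frame(TimeSpent_hours = 25))
predicted
```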
Reflect & Respond
Question: Based on the scatter plot, what do you expect the relationship between time spent and final grades to be? Write down your hypothesis.
- [I expect a positive relationship. As students spend more time engaging with the LMS, their final grades tend to increase.]
Correlation
Task 10: After visualizing the data, let’s quantify the relationship by computing the correlation between TimeSpent_hours and FinalGradeCEMS with the cor() function. The argument use = "complete.obs" tells R to drop any row where either variable is missing before performing the calculation.
# Compute the correlation
correlation <- cor(data$TimeSpent_hours, data$FinalGradeCEMS, use = "complete.obs")
# Display the correlation
correlation
[1] 0.3654121
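To see why use = "complete.obs" matters, compare cor() with and without it on a small vector pair that contains missing values. The numbers below are made up for illustration.

```r
# Hypothetical hours and grades; each vector has one missing value
hours  <- c(10, 20, 30, 40, NA)
grades <- c(60, 70, NA, 90, 85)

# With the default settings, any missing value makes the result NA
cor(hours, grades)  # NA

# complete.obs keeps only the pairs where both values are present:
# (10, 60), (20, 70), (40, 90)
r_complete <- cor(hours, grades, use = "complete.obs")
r_complete  # 1, because the three remaining points fall on a straight line
```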
Question: With the scatter plot and correlation results in mind, what insights can you draw about the relationship between time spent and final grades? Remember, this is NOT a traditional statistics course, so focus on interpreting the data in context.
- [The correlation shows a moderate positive relationship between time spent on the LMS and final grades. This means that, in general, students who dedicate more hours to course materials tend to achieve higher final grades. As an educator, this insight reinforces the importance of designing meaningful, efficient learning activities that encourage consistent engagement rather than simply increasing workload.]
Render & Submit
Congratulations, you’ve completed the second module!
To receive full credit, you will need to render this document and publish it via a method such as Quarto Pub, Posit Cloud, RPubs, GitHub Pages, or another method. Once you have shared a link to your published document with me and I have reviewed your work, you will be officially done with the current module.
Complete the following steps to submit your work for review:
First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the look and feel of the rendered document but doesn’t actually display in the final output.
Next, click the “Render” button in the toolbar above to render your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let me know if you run into any issues with rendering.
Finally, publish. To publish, follow the steps from the link.
If you have any questions about this module, or run into any technical issues, don’t hesitate to contact me.
Once I have checked your link, you will be notified!