Data Collection and Basic Analytics

CUED 7540: Learning Analytics II

Author

[Caroline Wendt, cghuntley42@tntech.edu]

Published

September 22, 2025

Learning Objectives

By the end of this lesson, you will be able to:

- Apply descriptive and predictive analytics techniques.

- Create and save datasets manually.

- Interpret and visualize your data using different plot types.

- Understand the importance of data cleaning and transformation for analysis.

Part 1: Creating dataset & datafile

Instead of loading an existing dataset, we’ll manually create our own simulated data to practice with. This will help you understand the structure of data in R and the importance of data management.

Setup

Let’s start by loading the necessary libraries. If you encounter any errors, revisit our first module for guidance on installing packages.
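The setup chunk itself does not appear in this rendering. A minimal sketch of what it might contain, assuming the packages behind the functions used later in this lesson (ggplot(), %>%, filter(), pivot_longer(), read_csv()) are already installed:

```r
# Assumed setup: the package list is inferred from the functions used later
# in the lesson, not copied from the original chunk.
library(ggplot2)  # ggplot(), geom_*(), theme_*()
library(dplyr)    # %>%, filter(), mutate()
library(tidyr)    # pivot_longer()
library(readr)    # read_csv()
```

Running library(tidyverse) instead would attach all four packages in one call.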

Creating Simulated Data

Task 1: We’ll simulate data for 40 students’ time spent on a learning management system (LMS) over 16 weeks. We will then save this new dataset as a CSV file.

The set.seed() function is used here to ensure that the random data we generate is the same every time you run the code. Think of it as “shuffling a deck of cards in the same way” so that we all get the same results.
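As a small illustration of the idea, the sketch below draws five values twice, resetting the seed before each draw. The object names first_draw and second_draw are made up for this example.

```r
# Reset the seed before each draw so both draws start from the same state
set.seed(42)
first_draw <- sample(6:20, 5, replace = TRUE)

set.seed(42)
second_draw <- sample(6:20, 5, replace = TRUE)

# Same seed, same call, same values
identical(first_draw, second_draw)  # TRUE
```

Without the second set.seed(42), the second draw would continue from wherever the random number generator left off and would generally differ from the first.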

# Creating a dataset called `data2` with simulated data for 40 students for our exercise this week.
# The first column, 'Student_ID', generates a unique ID for each student by 
# combining the word "Student" with numbers 1 through 40.
# 
# The subsequent columns, 'Week_1' through 'Week_16', represent data for 16 weeks.
#
# Each week’s column contains 40 random values, sampled from the range 6 to 20, 
# representing hypothetical time spent (in hours) on the learning management system by each student during that week.
set.seed(42)  # Setting seed for reproducibility
data2 <- data.frame(
  Student_ID = paste("Student", 1:40, sep = "_"),
  Week_1 = sample(6:20, 40, replace = TRUE),
  Week_2 = sample(6:20, 40, replace = TRUE),
  Week_3 = sample(6:20, 40, replace = TRUE),
  Week_4 = sample(6:20, 40, replace = TRUE),
  Week_5 = sample(6:20, 40, replace = TRUE),
  Week_6 = sample(6:20, 40, replace = TRUE),
  Week_7 = sample(6:20, 40, replace = TRUE),
  Week_8 = sample(6:20, 40, replace = TRUE),
  Week_9 = sample(6:20, 40, replace = TRUE),
  Week_10 = sample(6:20, 40, replace = TRUE),
  Week_11 = sample(6:20, 40, replace = TRUE),
  Week_12 = sample(6:20, 40, replace = TRUE),
  Week_13 = sample(6:20, 40, replace = TRUE),
  Week_14 = sample(6:20, 40, replace = TRUE),
  Week_15 = sample(6:20, 40, replace = TRUE),
  Week_16 = sample(6:20, 40, replace = TRUE)
)

# Saving the dataset as a CSV file. 
# The file is named 'students_timespent.csv'. You can name the file differently if you'd like.
# The row.names = FALSE argument prevents R from writing an unnecessary column of row numbers.
write.csv(data2, "students_timespent.csv", row.names = FALSE)

# Inspect the first few rows of the dataset
# TYPE YOUR CODE
head(data2)

  Student_ID Week_1 Week_2 Week_3 Week_4 Week_5 Week_6 Week_7 Week_8 Week_9
1  Student_1      6     20     18     17     16     20     20     17      9
2  Student_2     10     11     15     18     19     16     19     20      9
3  Student_3      6     15     14      6     10     15     17     18     20
4  Student_4     14     13     18     17      6     14     20     14     18
5  Student_5     15      9     17     14     11     17      6      9     12
6  Student_6      9      9      9     14      6     11     18     14      6
  Week_10 Week_11 Week_12 Week_13 Week_14 Week_15 Week_16
1      11      18      18       7      19      15      15
2      20      11       9      11       8      18      12
3      14      11      15      14      20      19      19
4      15      17       7      13      20      11      13
5      13      18      12      16       7       6      18
6       7      12       6      11      15      16      13

Once your data is created and saved as a CSV file, you should see the new file in your Files Pane.
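Besides looking in the Files Pane, you can confirm the save by reading the file back into R. The following is a minimal sketch under stated assumptions: it uses a small stand-in data frame (demo) and a temporary file path (demo_path), both hypothetical, so that it runs on its own without touching your project folder.

```r
# Small stand-in dataset: 5 students, 3 weeks (not the lesson's data2)
set.seed(42)
demo <- data.frame(
  Student_ID = paste("Student", 1:5, sep = "_"),
  Week_1 = sample(6:20, 5, replace = TRUE),
  Week_2 = sample(6:20, 5, replace = TRUE),
  Week_3 = sample(6:20, 5, replace = TRUE)
)

# Write to a temporary location instead of the project folder
demo_path <- tempfile(fileext = ".csv")
write.csv(demo, demo_path, row.names = FALSE)

# Check that the file exists, then read it back and compare its shape
file.exists(demo_path)
demo_back <- read.csv(demo_path)
dim(demo_back)      # rows, then columns
names(demo_back)    # column names survive the round trip
```

The same check applies to the lesson's file by replacing demo_path with "students_timespent.csv".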

Reflect & Respond

Question: Why do you think it’s important to be able to manually create and save datasets?

[I think it is important to be able to create and save datasets because this gives you the freedom to collect and analyze data that you compile yourself. It is also an efficient way to store and analyze the data you collect in one place. Once the data is compiled in the system, you can run any analyses you need in R.]

Part 2: Descriptive Analytics & Visualization

Descriptive Analytics

Now that we have our dataset, let’s conduct some basic analytics to better understand the data.

Task 2: Calculate summary statistics for each week. Use the summary() function to analyze the dataset, excluding the first column which is the Student_ID.

# Summary statistics for each week
summary_stats <- summary(data2[, -1])
# summary() function calculates summary statistics (such as minimum, maximum, median, mean, etc.)
# data2[, -1]: excludes the first column (-1), which is the Student_ID column.

summary_stats

     Week_1          Week_2          Week_3          Week_4     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.: 9.75   1st Qu.:12.00  
 Median :13.00   Median :10.50   Median :12.50   Median :14.50  
 Mean   :12.32   Mean   :10.75   Mean   :12.53   Mean   :14.28  
 3rd Qu.:15.00   3rd Qu.:13.00   3rd Qu.:16.00   3rd Qu.:18.00  
 Max.   :20.00   Max.   :20.00   Max.   :19.00   Max.   :20.00  
     Week_5          Week_6          Week_7          Week_8     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.: 8.50   1st Qu.: 9.00   1st Qu.: 8.75   1st Qu.:10.75  
 Median :13.50   Median :15.00   Median :14.00   Median :13.50  
 Mean   :13.18   Mean   :13.22   Mean   :13.60   Mean   :13.05  
 3rd Qu.:17.00   3rd Qu.:17.00   3rd Qu.:18.25   3rd Qu.:16.00  
 Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :20.00  
     Week_9         Week_10         Week_11         Week_12     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.:10.00   1st Qu.: 9.00   1st Qu.:10.75   1st Qu.: 8.00  
 Median :14.00   Median :13.00   Median :14.00   Median :12.00  
 Mean   :13.05   Mean   :12.75   Mean   :13.47   Mean   :12.12  
 3rd Qu.:16.00   3rd Qu.:16.00   3rd Qu.:17.00   3rd Qu.:15.00  
 Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :19.00  
    Week_13         Week_14         Week_15         Week_16     
 Min.   : 6.00   Min.   : 6.00   Min.   : 6.00   Min.   : 6.00  
 1st Qu.: 9.75   1st Qu.:10.75   1st Qu.:10.75   1st Qu.:10.00  
 Median :14.00   Median :14.00   Median :15.00   Median :15.00  
 Mean   :12.90   Mean   :13.70   Mean   :13.75   Mean   :13.65  
 3rd Qu.:16.00   3rd Qu.:17.25   3rd Qu.:17.00   3rd Qu.:17.00  
 Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :20.00

Question: What insights do you gain from the summary results?

[One of the pieces of information the summary results allowed me to see was the mean hours of time spent each week on the learning management system. This then allowed me to look at what weeks had the most time spent and least time spent. This can then enable further studies into why some weeks had more hours spent than others and potentially would allow research on what amount of time could optimize learning for students.]

Task 3: Calculate the average time spent per week for all students using the colMeans() function.

# Calculate the average time spent per week
average_time <- colMeans(data2[, -1])
# The colMeans() function computes the mean for each column in the data2 dataframe, excluding the first column (-1), which is typically the Student_ID column. 
average_time

 Week_1  Week_2  Week_3  Week_4  Week_5  Week_6  Week_7  Week_8  Week_9 Week_10 
 12.325  10.750  12.525  14.275  13.175  13.225  13.600  13.050  13.050  12.750 
Week_11 Week_12 Week_13 Week_14 Week_15 Week_16 
 13.475  12.125  12.900  13.700  13.750  13.650
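To read these averages more systematically, a short sketch like the one below can locate the lowest and highest weeks. It re-enters the values printed above so that it runs on its own; in the lesson you would use the existing average_time object directly.

```r
# Weekly averages copied from the colMeans() output above
average_time <- c(
  Week_1 = 12.325, Week_2 = 10.750, Week_3 = 12.525, Week_4 = 14.275,
  Week_5 = 13.175, Week_6 = 13.225, Week_7 = 13.600, Week_8 = 13.050,
  Week_9 = 13.050, Week_10 = 12.750, Week_11 = 13.475, Week_12 = 12.125,
  Week_13 = 12.900, Week_14 = 13.700, Week_15 = 13.750, Week_16 = 13.650
)

names(which.min(average_time))  # "Week_2"
names(which.max(average_time))  # "Week_4"
range(average_time)             # 10.750 14.275

# Average of the values the simulation draws from
mean(6:20)                      # 13
```

Because each value was sampled uniformly from 6:20, whose mean is 13, the week-to-week differences in this simulated dataset reflect random sampling rather than course events. With real LMS data, the same steps would point to weeks worth investigating.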

Question: How would you interpret these results?

[Looking at the results, I see a range of average time spent from 10.75 hours to 14.275 hours. Most weeks show an average of 12 or 13 hours on the learning management system, with averages near 10 or 14 hours being uncommon. This shows that students typically spend a similar amount of time on the learning management system each week (12-13 hours). It also suggests that something occurred in week 2 that had students spending less time on the system. Any number of things could have occurred, such as more group work, a fire drill, or a field trip, that lowered the typical time spent to 10.75 hours. Week 4, by contrast, had an increase in time spent, at 14.275 hours. This could lead to an inference such as students having a test that week and possibly having more time afterward on the learning management system.]

Question: If some weeks show significantly higher or lower time spent, what actions would you take as an instructor or course designer?

[The above answer touched on this, but as an interpreter of the results, I would analyze what specific things happened in those weeks that may have caused more or less time. This could help explain the occurrences. Additionally, I would want to look at overall performance in those specific weeks compared to typical weeks. Did students perform better or worse on assignments? Did students spend more attentive time on the system compared to other weeks? This would help indicate a sweet spot for time spent on a learning management system to optimize academic growth and understanding for students. It could be that the average time spent is not as optimal as, perhaps, the week in which 10 hours were spent. More data and research would be needed to show potential relationships here.]

Task 4: Calculate each individual’s average time spent across 16 weeks. Use the rowMeans() function and add this new variable to the data2 data frame.

# Calculate the mean time spent for each student across all 16 weeks
# The rowMeans() function calculates the mean for each row (i.e., each student)
data2$Semester_Average <- rowMeans(data2[, 2:17])

# rowMeans(data2[, 2:17]): This function calculates the mean of each row across columns 2 to 17, which correspond to the weekly time spent values for each student. The result is stored in the 'Semester_Average' column.

# Inspect the first few rows to see the new column with mean time spent
head(data2)

  Student_ID Week_1 Week_2 Week_3 Week_4 Week_5 Week_6 Week_7 Week_8 Week_9
1  Student_1      6     20     18     17     16     20     20     17      9
2  Student_2     10     11     15     18     19     16     19     20      9
3  Student_3      6     15     14      6     10     15     17     18     20
4  Student_4     14     13     18     17      6     14     20     14     18
5  Student_5     15      9     17     14     11     17      6      9     12
6  Student_6      9      9      9     14      6     11     18     14      6
  Week_10 Week_11 Week_12 Week_13 Week_14 Week_15 Week_16 Semester_Average
1      11      18      18       7      19      15      15          15.3750
2      20      11       9      11       8      18      12          14.1250
3      14      11      15      14      20      19      19          14.5625
4      15      17       7      13      20      11      13          14.3750
5      13      18      12      16       7       6      18          12.5000
6       7      12       6      11      15      16      13          11.0000

Task 5: Calculate the average time spent for each student only from week 1 to week 5. Add this as a new variable named early_semester_average.

# Now, complete the code to calculate the mean time spent for each student ONLY from week 1 to week 5. Save the result to a variable named 'early_semester_average'
# Revise the code to choose from week 1 to week 5.

# COMPLETE THE CODE BELOW
data2$early_semester_average <- rowMeans(data2[, 2:6])

# Inspect the first few rows to see the new column
#TYPE YOUR CODE
head(data2)

  Student_ID Week_1 Week_2 Week_3 Week_4 Week_5 Week_6 Week_7 Week_8 Week_9
1  Student_1      6     20     18     17     16     20     20     17      9
2  Student_2     10     11     15     18     19     16     19     20      9
3  Student_3      6     15     14      6     10     15     17     18     20
4  Student_4     14     13     18     17      6     14     20     14     18
5  Student_5     15      9     17     14     11     17      6      9     12
6  Student_6      9      9      9     14      6     11     18     14      6
  Week_10 Week_11 Week_12 Week_13 Week_14 Week_15 Week_16 Semester_Average
1      11      18      18       7      19      15      15          15.3750
2      20      11       9      11       8      18      12          14.1250
3      14      11      15      14      20      19      19          14.5625
4      15      17       7      13      20      11      13          14.3750
5      13      18      12      16       7       6      18          12.5000
6       7      12       6      11      15      16      13          11.0000
  early_semester_average
1                   15.4
2                   14.6
3                   10.2
4                   13.6
5                   13.2
6                    9.4
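The code above selects weeks by column position (2:17 and 2:6). Once new columns such as Semester_Average are added to the data frame, positions become easier to get wrong, so selecting columns by name can be a safer habit. A minimal sketch, using a hypothetical two-student, six-week data frame (demo) built from the first rows of the output above:

```r
# Stand-in data: the first two students and first six weeks from the output above
demo <- data.frame(
  Student_ID = c("Student_1", "Student_2"),
  Week_1 = c(6, 10), Week_2 = c(20, 11), Week_3 = c(18, 15),
  Week_4 = c(17, 18), Week_5 = c(16, 19), Week_6 = c(20, 16)
)

# Build the column names for weeks 1 to 5 instead of counting positions
early_cols <- paste0("Week_", 1:5)
demo$early_semester_average <- rowMeans(demo[, early_cols])

# Select every weekly column by pattern; derived columns are not matched
week_cols <- grep("^Week_", names(demo), value = TRUE)
demo$all_weeks_average <- rowMeans(demo[, week_cols])

demo$early_semester_average  # 15.4 14.6
```

The early-semester values match the first two rows of the lesson's output, and the pattern "^Week_" keeps working no matter how many summary columns are added afterward.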

Question: How do these newly added variables provide you with new insights? What could you do with this information as an instructor?

[This new information provides insights into the increase or decrease in students' hours from early in the semester to later in the semester (indicated by a higher or lower overall semester average compared to the early-semester average). As an instructor, I could look at the average time spent in the early semester and see what caused an increase in the use of the LMS. I could also look at what potentially caused a decrease for some students. I could then gain insights into what increases student motivation and usage. These strategies could then be employed throughout the rest of the year to encourage steadier LMS usage from all students. I could also analyze what materials students were working on in weeks that showed a decrease in time spent at the beginning of the semester. Maybe a particular subject or assignment was hard, which caused a decrease in time from the beginning of the year. This could then allow me to better plan instructional materials for future semesters and school years.]
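One way to make the comparison described above concrete is to subtract the early-semester average from the semester average. The sketch below re-enters the six rows printed above so that it runs on its own; the data frame name averages and the column name change are hypothetical.

```r
# Values copied from the head(data2) output above (first six students only)
averages <- data.frame(
  Student_ID = paste("Student", 1:6, sep = "_"),
  Semester_Average = c(15.3750, 14.1250, 14.5625, 14.3750, 12.5000, 11.0000),
  early_semester_average = c(15.4, 14.6, 10.2, 13.6, 13.2, 9.4)
)

# Positive values: the full-semester average is above the early-semester average
averages$change <- averages$Semester_Average - averages$early_semester_average

# List students from the largest increase to the largest decrease
averages[order(averages$change, decreasing = TRUE), c("Student_ID", "change")]
```

Among these six students, Student_3 shows the largest increase. Note that Semester_Average includes weeks 1 to 5, so this difference understates the gap between early and later weeks.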

Data Visualization

Visualizing data helps to better understand and communicate your findings. Let’s create some plots with our data.

Bar Plot of Average Time Spent Per Week

For this data, a bar plot is an excellent way to show the average time spent each week.

# First, reshape the average_time vector into a data frame for ggplot
average_time_table <- data.frame(
  Week = factor(names(average_time), levels = names(average_time)),
  Average_Time_Spent = average_time
)
head(average_time_table)

         Week Average_Time_Spent
Week_1 Week_1             12.325
Week_2 Week_2             10.750
Week_3 Week_3             12.525
Week_4 Week_4             14.275
Week_5 Week_5             13.175
Week_6 Week_6             13.225
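The levels argument in the factor() call above is what keeps the weeks in calendar order on the plot axis. A short sketch of the difference, using a hypothetical three-element vector:

```r
weeks <- c("Week_1", "Week_2", "Week_10")

# Default: levels are sorted as text, which typically puts Week_10 before Week_2
levels(factor(weeks))

# Explicit levels: the order given is the order kept
levels(factor(weeks, levels = weeks))  # "Week_1" "Week_2" "Week_10"
```

ggplot2 orders a discrete axis by factor levels, so setting them explicitly avoids an axis that jumps from Week_1 to Week_10.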

Task 6: Create a bar plot using the average_time_table data frame. Experiment with different colors and text for the parameters (fill, color, size, face, etc.).

# Create a bar plot of 'Average Time Spent' by 'Week'
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent)) +
  # Draw one bar per week; stat = "identity" uses the averages as the bar heights
  geom_bar(stat = "identity", fill = "red", color = "black") +
  # Add titles and labels to the plot
  labs(title = "Bar Plot of Average Time Spent per Week", x = "Week", y = "Average Time Spent") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1)
  )
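As an aside, ggplot2 also provides geom_col(), which draws bars whose heights come directly from the data and so does not need stat = "identity". A minimal sketch, assuming ggplot2 is installed and using a hypothetical three-week table (demo_table) with values taken from the output above:

```r
library(ggplot2)

# Stand-in table: the first three weekly averages from the lesson output
demo_table <- data.frame(
  Week = factor(c("Week_1", "Week_2", "Week_3"),
                levels = c("Week_1", "Week_2", "Week_3")),
  Average_Time_Spent = c(12.325, 10.750, 12.525)
)

# geom_col() uses the y values as bar heights
bar_plot <- ggplot(demo_table, aes(x = Week, y = Average_Time_Spent)) +
  geom_col(fill = "steelblue", color = "black") +
  labs(title = "Average Time Spent per Week",
       x = "Week",
       y = "Average Time Spent (Hours)") +
  theme_minimal()

bar_plot
```

Either form produces the same bars; geom_col() simply states the intent more directly.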

Line Plot of Average Time Spent Per Week

Task 7: Line plots are great for showing trends over time. Create a line plot to visualize the trend of average time spent per week using average_time_table.

# Create a line plot
ggplot(average_time_table, aes(x = Week, y = Average_Time_Spent, group = 1)) +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "red", size = 3) +
  labs(title = "Trend of Average Time Spent per Week",
       x = "Week",
       y = "Average Time Spent (Hours)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1)
  )

Question: What differences do you notice between the bar plot and the line plot? Which one is more effective for showing a trend over time, and why?

[I notice that the values for average time spent (y-axis) are easier to read precisely in the line plot than in the bar plot. The line plot also shows clear trends, up or down, whereas in the bar plot you can only see the bar tops moving up or down a little each week. The averages for each week are all relatively close to one another. The line plot does a much better job than the bar plot of presenting the changes in weekly averages. For these reasons, the line plot is more effective at showing a trend over time. I can clearly see the dips and increases in hours spent on the LMS each week.]
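Part of the difference described above likely comes from the axes: bars are drawn from zero, while the line plot's y-axis by default spans only the observed range, which magnifies small changes. The sketch below, which assumes ggplot2 3.4.0 or later and uses a hypothetical six-week table (demo_table) with values from the earlier output, puts the line plot on a zero-based axis for a like-for-like comparison.

```r
library(ggplot2)

# Stand-in table: the first six weekly averages from the lesson output
week_names <- paste0("Week_", 1:6)
demo_table <- data.frame(
  Week = factor(week_names, levels = week_names),
  Average_Time_Spent = c(12.325, 10.750, 12.525, 14.275, 13.175, 13.225)
)

# expand_limits(y = 0) forces the y-axis to include zero, as a bar plot does
line_zero <- ggplot(demo_table, aes(x = Week, y = Average_Time_Spent, group = 1)) +
  geom_line(color = "blue", linewidth = 1.2) +
  geom_point(color = "red", size = 3) +
  expand_limits(y = 0) +
  labs(title = "Trend of Average Time Spent per Week (Zero-Based Axis)",
       x = "Week",
       y = "Average Time Spent (Hours)") +
  theme_minimal()

line_zero
```

On the zero-based axis the weekly changes look as modest as they do in the bar plot, which is a useful reminder to check the axis range before reading much into a dip or spike.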

Task 7-2: If you are interested in the trends of each student’s time spent from week 1 to week 16, a line plot can be helpful.

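# Reshape the data for easier plotting
data_long <- data2 %>%
  pivot_longer(cols = starts_with("Week"), names_to = "Week", values_to = "TimeSpent") %>%
  # Keep weeks in calendar order; as plain text they would sort as Week_1, Week_10, Week_11, ...
  mutate(Week = factor(Week, levels = paste0("Week_", 1:16)))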

# Create a line plot for each student's weekly TimeSpent
ggplot(data_long, aes(x = Week, y = TimeSpent, group = Student_ID, color = Student_ID)) +
  geom_line() +
  labs(title = "Weekly Time Spent by Each Student",
       x = "Week",
       y = "Time Spent (Hours)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"  # Hides the legend to reduce clutter
  )

Question: What do you think about the analytics result?

[The new line plot is very jumbled and hard to follow. All of the student data is interspersed within the plot, making it hard to understand what is taking place or to interpret the results.]

Task 7-3: To better interpret the analytics, let’s focus on a subset of students and create a line plot.

# Select 5 specific students
selected_students <- data2 %>%
  filter(Student_ID %in% c("Student_1", "Student_10", "Student_20", "Student_30", "Student_40")) 

#Type your code here
selected_students <- data2 %>%
  filter(Student_ID %in% c("Student_3", "Student_6", "Student_7", "Student_15")) 

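# Reshape the data for easier plotting
data_long_selected <- selected_students %>%
  pivot_longer(cols = starts_with("Week"), names_to = "Week", values_to = "TimeSpent") %>%
  # Keep weeks in calendar order; as plain text they would sort as Week_1, Week_10, Week_11, ...
  mutate(Week = factor(Week, levels = paste0("Week_", 1:16)))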

# Create a line plot for the selected students
ggplot(data_long_selected, aes(x = Week, y = TimeSpent, group = Student_ID, color = Student_ID)) +
  geom_line(linewidth = 0.5) +
  geom_point(size = 4) +
  labs(title = "Weekly Time Spent by Selected Students",
       x = "Week",
       y = "Time Spent (Hours)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.title = element_blank(),  # Hides the legend title for simplicity
    legend.position = "top"  # Position the legend at the top for better visibility
  )

Question: Pick 3-6 students you are particularly interested in comparing. Update the R code above (Task 7-3 chunk) with your selected students. What insights do you gain from this more focused analysis?

[I wanted to analyze students who had lower LMS averages in the early semester but had a significant increase in overall LMS usage by the end. I chose students 3, 6, 7, and 15 for this. By analyzing these specific students, I see an increase around week 15 in all of the students analyzed. I see another increase for most in week 7. I also notice a dip in time in week 5 and, for some, in week 12. By looking at this, I can further research potential causes for the increases and decreases in those weeks. It makes me wonder why the students spent more time in week 15 specifically. This week raised the final average of time spent for all of these students. Maybe these students were more successful or found the assignments more interesting this week. Maybe something caused higher levels of engagement. These things can be further analyzed to bring about more consistent amounts of time spent on the LMS for students who started out not spending as much time. The dips that lowered the averages could also be analyzed to see what deterred students from spending as much time on the LMS. These factors can then be minimized to increase the weekly hours spent participating in LMS activities.]

Question: What changes did you make to the visualization? Why?

[In terms of changes to the visualization, I made the lines thinner and the points larger. The thinner lines allowed me to better visualize all of the students' data because I could see between them more easily. The larger points allowed me to see the areas of concentrated data better because they clustered together and visually drew my eye to those areas.]

Histogram

Task 8: A histogram is useful for understanding the overall distribution of a single variable. Create a histogram of the Semester_Average variable to see the general pattern and frequency of time spent across all students.

# Histogram of Semester_Average
ggplot(data2, aes(x = Semester_Average)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Mean Time Spent by 40 Students",
       x = "Mean Time Spent (Hours)",
       y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold")
  )

Question: Which plot do you find most insightful: the bar plot, line plot, or histogram? Why?

[I found the line plot to be most insightful because I could concentrate on specific students to find patterns in the data. It also allowed me to visualize areas that I could further research or analyze to find more trends or patterns. It was also more visually useful, as I could easily make inferences based on the chart.]

Part 3: Predictive Analytics and Visualization

Now, let’s switch gears and explore the relationship between two variables. We will return to the dataset from the previous module, sci-online-classes.csv. We want to see if there is a relationship between the time spent on the LMS (TimeSpent_hours) and students’ final grades (FinalGradeCEMS).

Load data

First, we need to load the data we used in our first module.

#import/load the dataset
# COMPLETE THE CODE WITH THE FUNCTION NAME (read_csv) & THE FILE NAME (sci-online-classes.csv).
data <- read_csv("data/sci-online-classes.csv")

Rows: 603 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): course_id, subject, semester, section, Gradebook_Item, Gender
dbl (23): student_id, total_points_possible, total_points_earned, percentage...
lgl  (1): Grade_Category

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
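The message above names two ways to quiet it. The sketch below shows both, assuming readr 2.0 or later is installed. It writes a tiny hypothetical file to a temporary path (demo_path) so that it runs without the lesson's dataset.

```r
library(readr)

# Write a two-row stand-in file to a temporary location
demo_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(student_id = c(43146, 44638), TimeSpent_hours = c(25.9, 23.0)),
  demo_path
)

# Option 1: keep the guessed types but silence the message
demo_quiet <- read_csv(demo_path, show_col_types = FALSE)

# Option 2: state the column types explicitly, which also silences the message
demo_typed <- read_csv(
  demo_path,
  col_types = cols(
    student_id = col_double(),
    TimeSpent_hours = col_double()
  )
)
```

Stating the types explicitly has the added benefit of raising a parsing problem if a column ever arrives in an unexpected format.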

# Inspect your data
str(data)

spc_tbl_ [603 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ student_id           : num [1:603] 43146 44638 47448 47979 48797 ...
 $ course_id            : chr [1:603] "FrScA-S216-02" "OcnA-S116-01" "FrScA-S216-01" "OcnA-S216-01" ...
 $ total_points_possible: num [1:603] 3280 3531 2870 4562 2207 ...
 $ total_points_earned  : num [1:603] 2220 2672 1897 3090 1910 ...
 $ percentage_earned    : num [1:603] 0.677 0.757 0.661 0.677 0.865 ...
 $ subject              : chr [1:603] "FrScA" "OcnA" "FrScA" "OcnA" ...
 $ semester             : chr [1:603] "S216" "S116" "S216" "S216" ...
 $ section              : chr [1:603] "02" "01" "01" "01" ...
 $ Gradebook_Item       : chr [1:603] "POINTS EARNED & TOTAL COURSE POINTS" "ATTEMPTED" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" ...
 $ Grade_Category       : logi [1:603] NA NA NA NA NA NA ...
 $ FinalGradeCEMS       : num [1:603] 93.5 81.7 88.5 81.9 84 ...
 $ Points_Possible      : num [1:603] 5 10 10 5 438 5 10 10 443 5 ...
 $ Points_Earned        : num [1:603] NA 10 NA 4 399 NA NA 10 425 2.5 ...
 $ Gender               : chr [1:603] "M" "F" "M" "M" ...
 $ q1                   : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
 $ q2                   : num [1:603] 4 4 4 5 3 NA 5 3 3 NA ...
 $ q3                   : num [1:603] 4 3 4 3 3 NA 3 3 3 NA ...
 $ q4                   : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
 $ q5                   : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
 $ q6                   : num [1:603] 5 4 4 5 4 NA 5 4 3 NA ...
 $ q7                   : num [1:603] 5 4 4 4 4 NA 4 3 3 NA ...
 $ q8                   : num [1:603] 5 5 5 5 4 NA 5 3 4 NA ...
 $ q9                   : num [1:603] 4 4 3 5 NA NA 5 3 2 NA ...
 $ q10                  : num [1:603] 5 4 5 5 3 NA 5 3 5 NA ...
 $ TimeSpent            : num [1:603] 1555 1383 860 1599 1482 ...
 $ TimeSpent_hours      : num [1:603] 25.9 23 14.3 26.6 24.7 ...
 $ TimeSpent_std        : num [1:603] -0.181 -0.308 -0.693 -0.148 -0.235 ...
 $ int                  : num [1:603] 5 4.2 5 5 3.8 4.6 5 3 4.2 NA ...
 $ pc                   : num [1:603] 4.5 3.5 4 3.5 3.5 4 3.5 3 3 NA ...
 $ uv                   : num [1:603] 4.33 4 3.67 5 3.5 ...
 - attr(*, "spec")=
  .. cols(
  ..   student_id = col_double(),
  ..   course_id = col_character(),
  ..   total_points_possible = col_double(),
  ..   total_points_earned = col_double(),
  ..   percentage_earned = col_double(),
  ..   subject = col_character(),
  ..   semester = col_character(),
  ..   section = col_character(),
  ..   Gradebook_Item = col_character(),
  ..   Grade_Category = col_logical(),
  ..   FinalGradeCEMS = col_double(),
  ..   Points_Possible = col_double(),
  ..   Points_Earned = col_double(),
  ..   Gender = col_character(),
  ..   q1 = col_double(),
  ..   q2 = col_double(),
  ..   q3 = col_double(),
  ..   q4 = col_double(),
  ..   q5 = col_double(),
  ..   q6 = col_double(),
  ..   q7 = col_double(),
  ..   q8 = col_double(),
  ..   q9 = col_double(),
  ..   q10 = col_double(),
  ..   TimeSpent = col_double(),
  ..   TimeSpent_hours = col_double(),
  ..   TimeSpent_std = col_double(),
  ..   int = col_double(),
  ..   pc = col_double(),
  ..   uv = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

# Summarize each variable in the dataset
# COMPLETE THE CODE
summary(data)

   student_id     course_id         total_points_possible total_points_earned
 Min.   :43146   Length:603         Min.   :  840         Min.   :  651      
 1st Qu.:85612   Class :character   1st Qu.: 2810         1st Qu.: 2050      
 Median :88340   Mode  :character   Median : 3583         Median : 2757      
 Mean   :86070                      Mean   : 4274         Mean   : 3245      
 3rd Qu.:92730                      3rd Qu.: 5069         3rd Qu.: 3875      
 Max.   :97441                      Max.   :15552         Max.   :12208      
                                                                             
 percentage_earned   subject            semester           section         
 Min.   :0.3384    Length:603         Length:603         Length:603        
 1st Qu.:0.7047    Class :character   Class :character   Class :character  
 Median :0.7770    Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.7577                                                            
 3rd Qu.:0.8262                                                            
 Max.   :0.9106                                                            
                                                                           
 Gradebook_Item     Grade_Category FinalGradeCEMS   Points_Possible 
 Length:603         Mode:logical   Min.   :  0.00   Min.   :  5.00  
 Class :character   NA's:603       1st Qu.: 71.25   1st Qu.: 10.00  
 Mode  :character                  Median : 84.57   Median : 10.00  
                                   Mean   : 77.20   Mean   : 76.87  
                                   3rd Qu.: 92.10   3rd Qu.: 30.00  
                                   Max.   :100.00   Max.   :935.00  
                                   NA's   :30                       
 Points_Earned       Gender                q1              q2       
 Min.   :  0.00   Length:603         Min.   :1.000   Min.   :1.000  
 1st Qu.:  7.00   Class :character   1st Qu.:4.000   1st Qu.:3.000  
 Median : 10.00   Mode  :character   Median :4.000   Median :4.000  
 Mean   : 68.63                      Mean   :4.296   Mean   :3.629  
 3rd Qu.: 26.12                      3rd Qu.:5.000   3rd Qu.:4.000  
 Max.   :828.20                      Max.   :5.000   Max.   :5.000  
 NA's   :92                          NA's   :123     NA's   :126    
       q3              q4              q5              q6       
 Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
 Median :3.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.327   Mean   :4.268   Mean   :4.191   Mean   :4.008  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :123     NA's   :125     NA's   :127     NA's   :127    
       q7              q8              q9             q10       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000  
 Median :4.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.907   Mean   :4.289   Mean   :3.487   Mean   :4.101  
 3rd Qu.:4.750   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :129     NA's   :129     NA's   :129     NA's   :129    
   TimeSpent       TimeSpent_hours    TimeSpent_std          int       
 Min.   :   0.45   Min.   :  0.0075   Min.   :-1.3280   Min.   :2.000  
 1st Qu.: 851.90   1st Qu.: 14.1983   1st Qu.:-0.6996   1st Qu.:3.900  
 Median :1550.91   Median : 25.8485   Median :-0.1837   Median :4.200  
 Mean   :1799.75   Mean   : 29.9959   Mean   : 0.0000   Mean   :4.219  
 3rd Qu.:2426.09   3rd Qu.: 40.4348   3rd Qu.: 0.4623   3rd Qu.:4.700  
 Max.   :8870.88   Max.   :147.8481   Max.   : 5.2188   Max.   :5.000  
 NA's   :5         NA's   :5          NA's   :5         NA's   :76     
       pc              uv       
 Min.   :1.500   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:3.333  
 Median :3.500   Median :3.667  
 Mean   :3.608   Mean   :3.719  
 3rd Qu.:4.000   3rd Qu.:4.167  
 Max.   :5.000   Max.   :5.000  
 NA's   :75      NA's   :75

# Display the first few rows of the dataset
head(data)

# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      43146 FrScA-S216-02                  3280                2220
2      44638 OcnA-S116-01                   3531                2672
3      47448 FrScA-S216-01                  2870                1897
4      47979 OcnA-S216-01                   4562                3090
5      48797 PhysA-S116-01                  2207                1910
6      51943 FrScA-S216-03                  4208                3596
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Visualize Relationships

Task 9: To explore the relationship between two variables (e.g., TimeSpent_hours and FinalGradeCEMS), create a scatter plot with a regression line. This visual will help us see if one variable might predict another.

# Create a scatter plot of TimeSpent_hours vs. FinalGradeCEMS with a regression line
ggplot(data, aes(x = TimeSpent_hours, y = FinalGradeCEMS)) + # TYPE YOUR CODE
  geom_point(color = "blue", size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +  # This line will add a linear regression line
  labs(title = "Scatter Plot of Time Spent vs. Final Grade with Regression Line",
       x = "Time Spent (Hours)",
       y = "Final Grade (CEMS)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold")
  )

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 30 rows containing missing values or values outside the scale range
(`geom_point()`).
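The two warnings above appear because ggplot2 drops any row where one of the plotted variables is missing. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the course data) to show one way to count such rows before plotting, and to fit the same kind of straight line that geom_smooth(method = "lm") draws.

```r
# Hypothetical toy data standing in for the course dataset (values are made up)
toy <- data.frame(
  TimeSpent_hours = c(5, 12, NA, 30, 41, 18, 25, NA),
  FinalGradeCEMS  = c(60, 68, 75, NA, 92, 71, 80, 66)
)

# Count rows where either plotted variable is missing;
# these are the rows a scatter plot would drop with a warning
vars <- c("TimeSpent_hours", "FinalGradeCEMS")
n_dropped <- sum(!complete.cases(toy[, vars]))
n_dropped

# Least-squares fit of y ~ x, the same model form geom_smooth(method = "lm") uses;
# lm() also drops incomplete rows by default
fit <- lm(FinalGradeCEMS ~ TimeSpent_hours, data = toy)
coef(fit)   # intercept and slope of the fitted line
nobs(fit)   # number of complete rows actually used
```

Running the same complete.cases() check on `data` with these two columns should match the row count reported in the warnings, assuming both layers drop the same rows.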

Reflect & Respond

Question: Based on the scatter plot, what do you expect the relationship between time spent and final grades to be? Write down your hypothesis.

[Because the regression line shows a positive, linear relationship between time spent and final grade, my hypothesis is that more time spent on the LMS is associated with higher final grades. However, additional time probably helps only up to a point, so there may be an optimal amount of time associated with the highest grades.]

Correlation

Task 10: After visualizing the data, let’s quantify the relationship by computing the correlation between TimeSpent_hours and FinalGradeCEMS. The cor() function is used for this. The argument use = "complete.obs" tells R to drop any row with a missing value in either variable before performing the calculation.

# Compute the correlation
correlation <- cor(data$TimeSpent_hours, data$FinalGradeCEMS, use = "complete.obs")

# Display the correlation
correlation

[1] 0.3654121
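To see what use = "complete.obs" does, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical). Without the argument, a single missing value makes cor() return NA. With it, the result matches what you would get by filtering to complete rows yourself.

```r
# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  hours = c(2, 4, NA, 8, 10),
  grade = c(55, 60, 70, NA, 90)
)

# Default behavior: any NA in the inputs gives NA
r_default <- cor(toy$hours, toy$grade)
r_default

# Drop incomplete pairs, then correlate
r_complete <- cor(toy$hours, toy$grade, use = "complete.obs")

# Same calculation done by hand: keep only rows with both values present
keep <- complete.cases(toy)
r_manual <- cor(toy$hours[keep], toy$grade[keep])

all.equal(r_complete, r_manual)  # TRUE
```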

Question: With the scatter plot and correlation results in mind, what insights can you draw about the relationship between time spent and final grades? Remember, this is NOT a traditional statistics course, so focus on interpreting the data in context.

[A correlation of 0.365 indicates a positive relationship between time spent and final grades: final grades tend to be higher when time spent on the LMS is higher. However, this is not a strong correlation. The two variables rise together, but time spent is clearly not the biggest factor in final grades. This suggests that an LMS may help boost final grades, but it is not the sole factor. An LMS can be used in conjunction with other strategies and tools to support student academic achievement.]
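One common way to gauge the strength of a correlation of about 0.365 is to square it. The squared correlation is the share of variation in one variable that a straight-line relationship with the other accounts for. This is a quick arithmetic sketch using the value printed above, not an additional analysis of the course data.

```r
# Correlation reported by cor() above
r <- 0.3654121

# Squared correlation: proportion of variance accounted for by a linear relationship
r_squared <- r^2

round(r_squared, 3)        # 0.134
round(100 * r_squared, 1)  # 13.4 (percent)
```

Roughly 13% of the variation in final grades lines up with time spent, which fits the reading that time on the LMS matters but is far from the whole story.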

Render & Submit

Congratulations, you’ve completed the second module!

To receive full credit, you will need to render this document and publish it via a method such as Quarto Pub, Posit Cloud, RPubs, GitHub Pages, or another method. Once you have shared a link to your published document with me and I have reviewed your work, you will be officially done with the current module.

Complete the following steps to submit your work for review:

First, change the name after author: in the YAML header at the very top of this document to your name. The YAML header controls the look and feel of the rendered document but doesn’t actually display in the final output.
Next, click the “Render” button in the toolbar above to render your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let me know if you run into any issues with rendering.
Finally, publish. To publish, follow the steps from the link.

If you have any questions about this module, or run into any technical issues, don’t hesitate to contact me.

Once I have checked your link, you will be notified!