Learning Analytic Workflow Case Study

Independent work

Author

SON PHAM

Published

July 18, 2025

0. INTRODUCTION

We will focus on online science classes provided through a state-wide online virtual school and conduct an analysis that helps predict students’ performance in these online courses. This case study is guided by a foundational study in Learning Analytics that illustrates how analyses like these can be used to develop an early warning system that lets educators identify students at risk of failing and intervene before that happens.

Over the next labs we will dive into the Learning Analytics Workflow as follows:

Steps of Data-Intensive Research Workflow

  1. Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. You’ll also need to become familiar with and load essential packages for analysis, and learn to load and view the data for analysis.
  2. Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In Part 2 we focus on importing CSV files, tidying and joining our data.
  3. Explore: In Part 3, we use basic data visualization and calculate some summary statistics to explore our data and see what insight it provides in response to our questions.
  4. Model: After identifying variables that may be related to student performance through exploratory analysis, we’ll look at correlations and create some simple models of our data using linear regression.
  5. Communicate: To wrap up our case study, we’ll develop our first “data product” and share our analyses and findings by creating our first web page using Markdown.
  6. Change Idea: Having developed a webpage using Markdown, share your findings with your colleagues. The page will include interactive plots and a detailed explanation of the analysis process, serving as a case study for other educators in your school. Present your findings at a staff meeting, advocating for broader adoption of data-driven strategies across curricula.

1. PREPARE (Module 1)

About the Study

This case study is guided by a well-cited publication from two authors who have made numerous contributions to the field of Learning Analytics over the years. The article focuses on “early warning systems” in higher education, where adoption of learning management systems (LMS) like Moodle and Canvas gained an earlier foothold.

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & Education, 54(2), 588–599.

Previous research has indicated that universities and colleges could utilize Learning Management System (LMS) data to create reporting tools that identify at-risk students and enable prompt pedagogical interventions. The present study confirms and extends this proposition by presenting data from an international research project investigating which student online activities reliably predict academic achievement.

The data analyzed in this exploratory research was extracted from the course-based instructor tracking logs and the BB Vista production server.

Data collected on each student included ‘whole term’ counts for frequency of usage of course materials and tools supporting content delivery, engagement and discussion, assessment and administration/management. In addition, tracking data indicating total time spent on certain tool-based activities (assessments, assignments, total time online) offered a total measure of individual student time on task.

The authors used scatter plots to identify potential relationships between the variables under investigation, followed by a simple correlation analysis of each variable to further interrogate the significance of selected variables as indicators of student achievement. Finally, a linear multiple regression analysis was conducted to develop a predictive model in which a student’s final grade was the continuous dependent variable.
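In R terms, that three-step sequence might look like the following minimal sketch. The data frame lms and its columns forum_posts and final_grade are hypothetical stand-ins for illustration, not the study’s actual variables:

#sketch of the study's analytic sequence on hypothetical data
library(tidyverse)

lms <- tibble(
  forum_posts = c(2, 15, 7, 30, 11, 4, 22, 9),
  final_grade = c(55, 82, 70, 91, 78, 60, 88, 74)
)

#1) scatter plot to spot potential relationships
ggplot(lms, aes(x = forum_posts, y = final_grade)) +
  geom_point()

#2) simple correlation between the candidate indicator and achievement
cor(lms$forum_posts, lms$final_grade)

#3) linear regression with final grade as the continuous dependent variable
summary(lm(final_grade ~ forum_posts, data = lms))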

Introduction to the Stakeholder

Name: Alex Johnson

Role: University Science Professor

Experience: 5 years teaching, enthusiastic about integrating technology in education

Goal: Alex aims to improve student engagement and performance in her online science classes.

Teacher Persona: Alex begins by recognizing the importance of data analysis in identifying students who might need extra support. The cited foundational study motivates her to explore similar analyses to develop her own early warning system.

Load Libraries

Remember libraries are also called packages. They are shareable collections of code that can contain functions, data, and/or documentation and extend the functionality of the coding language.

tidyverse is a collection of R packages designed for data manipulation, visualization, and analysis.
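If the tidyverse has never been installed on your machine, install it once before loading it (skip this step if it is already installed):

#install the package once if needed; installation is separate from loading
install.packages("tidyverse")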

#Load Libraries below needed for analysis
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data Source #1: Log Data

Log-trace data is data generated from our interactions with digital technologies, such as archived data from social media postings. In education, an increasingly common source of log-trace data is that generated from interactions with LMS and other digital tools.

The data we will use has already been “wrangled” quite a bit and is a summary type of log-trace data: the number of minutes students spent on the course. While this data type is fairly straightforward, there are even more complex sources of log-trace data out there (e.g., time stamps associated with when students started and stopped accessing the course).
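As an illustration, if you had raw session-level logs with start and stop timestamps (a hypothetical sessions table, not one of this lab’s data sources), you could wrangle them into the kind of per-student summary we will use here:

#sketch: collapse hypothetical session logs into total minutes per student
library(tidyverse)

sessions <- tibble(
  student_id = c(1, 1, 2),
  start = ymd_hm(c("2025-01-06 09:00", "2025-01-07 10:00", "2025-01-06 13:00")),
  stop  = ymd_hm(c("2025-01-06 09:45", "2025-01-07 10:30", "2025-01-06 14:10"))
)

sessions %>%
  mutate(minutes = as.numeric(difftime(stop, start, units = "mins"))) %>%
  group_by(student_id) %>%
  summarize(time_spent = sum(minutes))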

Log Data Variables

Variable            Description
student_id          student’s ID at the institution
course_id           course subject abbreviation, semester, and section
gender              male/female/NA
enrollment_reason   reason the student decided to take the course
enrollment_status   approved/enrolled, dropped, withdrawn
time_spent          time spent (in minutes) for the entire course

Course Acronym Descriptions

Acronym   Course Name
AnPhA     Anatomy
BioA      Biology
FrScA     Forensics
OcnA      Oceanography
PhysA     Physics

Data Source #2: Academic Achievement Data

Academic Achievement Data Variables

Variable Description
total_points_possible available points for the course
total_points_earned points earned for the entire course

Data Source #3: Self-Report Survey

The third data source is a self-report survey. This data was collected before the start of the course. The survey included ten items, each corresponding to one of three motivation measures: interest, utility value, and perceived competence. These were chosen because they align with one common way of thinking about students’ motivation: the extent to which students expect to do well (perceived competence) and the value they place on what they are learning (interest and utility value). Note that two of the items are reverse coded, as illustrated in the sketch after the item list below.

Self-Report Survey Variables

Variable   Description
int        interest in science
tv         hours of TV watched
q1–q10     responses to the ten survey items below:
  1. I think this course is an interesting subject. (Interest)
  2. What I am learning in this class is relevant to my life. (Utility value)
  3. I consider this topic to be one of my best subjects. (Perceived competence)
  4. I am not interested in this course. (Interest—reverse coded)
  5. I think I will like learning about this topic. (Interest)
  6. I think what we are studying in this course is useful for me to know. (Utility value)
  7. I don’t feel comfortable when it comes to answering questions in this area. (Perceived competence–reverse coded)
  8. I think this subject is interesting. (Interest)
  9. I find the content of this course to be personally meaningful. (Utility value)
  10. I’ve always wanted to learn more about this subject. (Interest)
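Items 4 and 7 are reverse coded. The survey file we load later already contains composite scores (such as int), but as a minimal sketch, assuming a 1–5 response scale, reverse coding and a hand-built interest composite could look like this (q4_r, q7_r, and interest_score are hypothetical helper columns):

#sketch: reverse code on a 1-5 scale (6 minus the response), then average the interest items
survey %>%
  mutate(
    q4_r = 6 - q4,
    q7_r = 6 - q7,
    interest_score = rowMeans(across(c(q1, q4_r, q5, q8, q10)), na.rm = TRUE)
  )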

2. WRANGLE (Module 1)

Import data

We will need to load in and inspect each of the data frames that we will use for this lab. For each one, you will first read about the data frame and then learn how to load (or read in) the data frame into the Quarto document.

time_spent

For our first data frame object, let’s use the read_csv() function from the {readr} package (loaded as part of the tidyverse) to import our log-data.csv file directly from our data folder.

We will save this data frame as an object called time_spent to help us quickly recall what role it serves in this analysis. To do that, we enter the new name and assign it a value using the <- operator.

👉 Your Turn

You can run any code in a code block by pressing the green arrowhead in the top right corner. Try this one:

#load log-data file from data folder
time_spent <- read_csv("module_1/data/log-data.csv")
Rows: 716 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): course_id, gender, enrollment_reason, enrollment_status
dbl (2): student_id, time_spent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
time_spent
# A tibble: 716 × 6
   student_id course_id    gender enrollment_reason enrollment_status time_spent
        <dbl> <chr>        <chr>  <chr>             <chr>                  <dbl>
 1      60186 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2087.
 2      66693 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2309.
 3      66811 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      5299.
 4      66862 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1747.
 5      67508 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      2668.
 6      70532 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      2938.
 7      77010 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      1533.
 8      85249 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1210.
 9      85411 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled       473.
10      85583 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      5532.
# ℹ 706 more rows

gradebook

Load the file gradebook-summary.csv from data folder and save the object as gradebook.

❗️In R, everything is an object. An object can be a simple value (like a number or a string), a complex structure (like a data frame or a list), or even a function or a model. For example, when you load a CSV file into R and store it in a variable, that variable is an object that contains your dataset.

A data set typically refers to a collection of data, often stored in a tabular format with rows and columns.

👉 Your Turn

  1. Use the same function as before to read in the gradebook-summary.csv file.
  2. Make sure the output is assigned to a new object, this time called gradebook.
  3. Press the green arrowhead to run the code.
#load grade book data from data folder
#(add code below)
gradebook <- read_csv("module_1/data/gradebook-summary.csv")
Rows: 717 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): course_id
dbl (3): student_id, total_points_possible, total_points_earned

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

survey

Load the file survey.csv from data folder.

👉 Your Turn

  1. Use the same function as before to read in the survey.csv file.
  2. Make sure the output is assigned to a new object, this time called survey.
  3. Press the green arrowhead to run the code.
#load survey data from data folder
#(add code below)
survey <- read_csv("module_1/data/survey.csv")
Rows: 662 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): student_ID, course_ID, subject, semester, section
dbl  (18): int, val, percomp, tv, q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, p...
dttm  (3): date.x, date.y, date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Inspect data

There are several ways you can look at data objects in R:

  • Simply typing the name of your object and running the code
time_spent
# A tibble: 716 × 6
   student_id course_id    gender enrollment_reason enrollment_status time_spent
        <dbl> <chr>        <chr>  <chr>             <chr>                  <dbl>
 1      60186 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2087.
 2      66693 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2309.
 3      66811 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      5299.
 4      66862 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1747.
 5      67508 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      2668.
 6      70532 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      2938.
 7      77010 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      1533.
 8      85249 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1210.
 9      85411 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled       473.
10      85583 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      5532.
# ℹ 706 more rows
  • Using glimpse()
glimpse(gradebook)
Rows: 717
Columns: 4
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 1217, 1676, 1232, 1833, 2225, 1222, 1775, 2225, …
$ total_points_earned   <dbl> 1150.00, 1384.23, 1116.00, 1492.73, 1994.75, 70.…
  • Using the Global Environment Pane

  • Using head() and tail() to look at the first and last few rows
#first few rows
head(survey)
# A tibble: 6 × 26
  student_ID course_ID  subject semester section   int   val percomp    tv    q1
  <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
1 43146      FrScA-S21… FrScA   S216     02        4.2  3.67     4    3.86     4
2 44638      OcnA-S116… OcnA    S116     01        4    3        3    3.57     4
3 47448      FrScA-S21… FrScA   S216     01        4.2  3        3    3.71     5
4 47979      OcnA-S216… OcnA    S216     01        4    3.67     2.5  3.86     4
5 48797      PhysA-S11… PhysA   S116     01        3.8  3.67     3.5  3.71     4
6 51943      FrScA-S21… FrScA   S216     03        3.8  3.67     3.5  3.71     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>
#last few rows
tail(survey)
# A tibble: 6 × 26
  student_ID course_ID  subject semester section   int   val percomp    tv    q1
  <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
1 19         AnPhA-S21… AnPhA   S217     02        4.2  5        5    4.5      5
2 42         FrScA-S21… FrScA   S217     01        4    4        4    4        4
3 52         FrScA-S21… FrScA   S217     03        4.4  2.67     3.5  3.75     4
4 57         FrScA-S21… FrScA   S217     01        4.4  2.33     2.5  3.62     5
5 72         FrScA-S21… FrScA   S217     01        5    3        4    4.25     5
6 80         FrScA-S21… FrScA   S217     01        3.6  2.33     3    3.12     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>
  • Using sample() (note that when applied to a data frame, sample() shuffles the columns rather than the rows)
sample(survey)
# A tibble: 662 × 26
   date.y                val semester    q4 course_ID        tv    q8 post_tv
   <dttm>              <dbl> <chr>    <dbl> <chr>         <dbl> <dbl>   <dbl>
 1 NA                   3.67 S216         5 FrScA-S216-02  3.86     4   NA   
 2 NA                   3    S116         4 OcnA-S116-01   3.57     4   NA   
 3 NA                   3    S216         4 FrScA-S216-01  3.71     4   NA   
 4 NA                   3.67 S216         4 OcnA-S216-01   3.86     4   NA   
 5 NA                   3.67 S116         4 PhysA-S116-01  3.71     4   NA   
 6 NA                   3.67 S216         4 FrScA-S216-03  3.71     4   NA   
 7 NA                   4    S216         2 AnPhA-S216-01  4        4   NA   
 8 2016-01-02 00:41:00  3.67 S116         4 PhysA-S116-01  4        5    3.57
 9 2015-10-13 14:11:00  2    S116         4 FrScA-S116-01  3        4    3   
10 NA                   3.33 S216         5 FrScA-S216-01  4.14     5   NA   
# ℹ 652 more rows
# ℹ 18 more variables: post_int <dbl>, q5 <dbl>, post_percomp <dbl>,
#   date <dttm>, section <chr>, date.x <dttm>, q2 <dbl>, student_ID <chr>,
#   q3 <dbl>, subject <chr>, int <dbl>, percomp <dbl>, post_uv <dbl>, q1 <dbl>,
#   q6 <dbl>, q7 <dbl>, q10 <dbl>, q9 <dbl>
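If you want a random sample of rows rather than shuffled columns, dplyr’s slice_sample() is one option:

#draw 5 random rows from the survey data
survey %>%
  slice_sample(n = 5)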

👉 Your Turn

Inspect the three datasets we loaded and answer the questions:

❓ What do you notice or wonder about?

  • The three datasets differ in size: time_spent has 716 rows, gradebook has 717, and survey has only 662; each appears to hold one record per student-course enrollment. The student_id variable appears in all three and can be used for joins, with variable types mostly consistent (chr, dbl, and some dttm). Questions remain about the mismatched row counts, possibly due to missing data or dropouts, and about the meaning of date.x, date.y, and date in the survey. It is also worth checking whether gender appears in both time_spent and survey and whether the values match.

❓ How many observations do you see?

  • The datasets differ slightly in size: time_spent has 716 rows, gradebook 717, and survey 662, suggesting not all students completed the survey.

❓ How many variables are there?

  • The datasets have different numbers of variables: time_spent has 6, gradebook has 4, and survey has 26.

❓ What do you notice about the variable classes?

  • The variable classes show clear patterns. In time_spent, most variables are character types like course_id and gender, while student_id and time_spent are numeric. In gradebook, course_id is character and student_id and the total-points columns are numeric. The survey data includes character variables such as student_ID and course_ID, many numeric survey items, and datetime variables like date.x and date.y. Notably, student_id is numeric (dbl) in the first two datasets but character (chr) in survey, which will matter when merging; see the quick check below.
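You can confirm the counts above directly with dim() (or nrow() and ncol()):

#check the dimensions (rows, columns) of each dataset
dim(time_spent)  #716 rows, 6 columns
dim(gradebook)   #717 rows, 4 columns
dim(survey)      #662 rows, 26 columns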

Tidy data

time_spent

Use the separate() function from tidyr

Using separate(), we will turn the course_id variable in time_spent into three different variables: The course subject, semester, and section.

The c() function in R combines its arguments into a vector. Here we use it to pass the three new column names to separate().

#separate variable to individual subject, semester and section
time_spent %>%  
  separate(course_id,
           c("subject", "semester", "section"))
# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>
#inspect
time_spent
# A tibble: 716 × 6
   student_id course_id    gender enrollment_reason enrollment_status time_spent
        <dbl> <chr>        <chr>  <chr>             <chr>                  <dbl>
 1      60186 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2087.
 2      66693 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2309.
 3      66811 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      5299.
 4      66862 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1747.
 5      67508 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      2668.
 6      70532 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      2938.
 7      77010 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      1533.
 8      85249 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1210.
 9      85411 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled       473.
10      85583 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      5532.
# ℹ 706 more rows

Make sure to save the result back to the time_spent object. Saving output to an object is done with the assignment operator, which looks like an arrow: <-. It is generally best practice to put the assigned object on the left, at the beginning of the line of code.

time_spent <- time_spent %>% #save function output to existing object 
  separate(course_id,
           c("subject", "semester", "section"))

#inspect
time_spent
# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>

Use the mutate() function from dplyr

As you can see from the dataset, the time_spent variable is recorded in minutes rather than hours. Let’s change that.

In R, you can create new variables in a dataset (data frame or tibble) using mutate(), which allows you to add new columns to your data frame or modify existing ones.

#mutate minutes to hours on time spent and save as new variable.
time_spent <- time_spent %>% 
  mutate(time_spent_hours = time_spent / 60)

#inspect 
time_spent
# A tibble: 716 × 9
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 3 more variables: enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>

gradebook

Use the separate() function from tidyr

Now, we will work on the gradebook dataset. Like the previous dataset, we will separate the course_id variable again.

👉 Your Turn

  1. Use the pipe operator to separate course_id variable (like we just did in time_spent).
  2. Run the code.
#separate the course_id variable and save to 'gradebook' object
#(add code below)
gradebook <- gradebook %>%  
  separate(course_id, c("subject", "semester", "section"))
#inspect
#(add code below)
gradebook
# A tibble: 717 × 6
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 707 more rows

Use the mutate() function from dplyr

As you can see in gradebook, total_points_earned is hard to interpret on its own. We can use mutate() to express it as a percentage of the points possible.

👉 Your Turn

  1. Take total_points_earned divide by total_points_possible and multiply by 100. Save this as proportion_earned.
  2. Run the code.
# Mutate to a proportion_earned, take 'total points earned' divide by 'total points possible.' Save as a new variable proportion_earned.
#(add code below)
gradebook <- gradebook %>%
  mutate(proportion_earned = (total_points_earned / total_points_possible) * 100)


#inspect data
#(add code below)

head(gradebook)
# A tibble: 6 × 7
  student_id subject semester section total_points_possible total_points_earned
       <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
1      43146 FrScA   S216     02                       1217               1150 
2      44638 OcnA    S116     01                       1676               1384.
3      47448 FrScA   S216     01                       1232               1116 
4      47979 OcnA    S216     01                       1833               1493.
5      48797 PhysA   S116     01                       2225               1995.
6      51943 FrScA   S216     03                       1222                 70 
# ℹ 1 more variable: proportion_earned <dbl>

survey

Let’s process our data. First, though, take another quick look by typing survey into the console or using your preferred viewing method.

❓ Does it appear to be the correct file? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peek at the data helps us begin to formulate answers to these questions and is an important step in any data analysis, especially as we prepare for what we are going to do.

#inspect data to view the column names
survey
# A tibble: 662 × 26
   student_ID course_ID subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 43146      FrScA-S2… FrScA   S216     02        4.2  3.67     4    3.86     4
 2 44638      OcnA-S11… OcnA    S116     01        4    3        3    3.57     4
 3 47448      FrScA-S2… FrScA   S216     01        4.2  3        3    3.71     5
 4 47979      OcnA-S21… OcnA    S216     01        4    3.67     2.5  3.86     4
 5 48797      PhysA-S1… PhysA   S116     01        3.8  3.67     3.5  3.71     4
 6 51943      FrScA-S2… FrScA   S216     03        3.8  3.67     3.5  3.71     4
 7 52326      AnPhA-S2… AnPhA   S216     01        3.6  4        3    4        4
 8 52446      PhysA-S1… PhysA   S116     01        4.2  3.67     3    4        4
 9 53447      FrScA-S1… FrScA   S116     01        3.8  2        3    3        5
10 53475      FrScA-S2… FrScA   S216     01        4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

👉 Your turn

💡 Look at the variable names. Add one or more of the things you notice or wonder about the data here:

  • The dataset has 662 rows and 26 columns. It includes student information such as their ID, course, subject, semester, and section. There are also survey questions labeled q1 to q10, with answers in numbers.

  • The different column names show the data may come from more than one source. It’s important to find out what the q questions mean and how scores like “tv” were calculated. The presence of several date columns suggests the data was collected at different times or combined from multiple sources.


You may have noticed that student_ID is not formatted exactly the same as student_id in our other files. This is important because in the next section when we “join,” or merge, our data files, these variables will need to have identical names.
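One way to fix this would be to rename the mismatched columns by hand with dplyr’s rename():

#manual alternative: rename survey columns to match the other files
survey %>%
  rename(student_id = student_ID,
         course_id = course_ID)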

Use the janitor package

Fortunately the {janitor} package has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. There is also a handy function called clean_names() in the {janitor} package for standardizing variable names.

👉 Your Turn

First, add the janitor package using the library() function.

#load janitor library to clean variable names that do not match
#(add code below)
library(janitor)
  1. Clean the columns by piping the survey object into the clean_names() function and saving the result back to the survey object.
  2. Inspect the data.
  3. Run the code.
#clean columns of the survey data and save to survey object
#(add code below)
survey <- survey %>%
  clean_names()

#inspect data to check for consistency with other data
#(add code below)
glimpse(survey)
Rows: 662
Columns: 26
$ student_id   <chr> "43146", "44638", "47448", "47979", "48797", "51943", "52…
$ course_id    <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01", "OcnA-S…
$ subject      <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrScA", "AnPh…
$ semester     <chr> "S216", "S116", "S216", "S216", "S116", "S216", "S216", "…
$ section      <chr> "02", "01", "01", "01", "01", "03", "01", "01", "01", "01…
$ int          <dbl> 4.2, 4.0, 4.2, 4.0, 3.8, 3.8, 3.6, 4.2, 3.8, 4.8, 4.6, 3.…
$ val          <dbl> 3.666667, 3.000000, 3.000000, 3.666667, 3.666667, 3.66666…
$ percomp      <dbl> 4.0, 3.0, 3.0, 2.5, 3.5, 3.5, 3.0, 3.0, 3.0, 4.0, 4.0, 3.…
$ tv           <dbl> 3.857143, 3.571429, 3.714286, 3.857143, 3.714286, 3.71428…
$ q1           <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 5, 5, 4, 3, 4, 4, 5, 4, 4, 5, 5, …
$ q2           <dbl> 4, 2, 3, 3, 4, 4, 4, 4, 2, 4, 5, 1, 3, 2, 5, 4, 4, 4, 3, …
$ q3           <dbl> 4, 2, 3, 2, 3, 3, 4, 3, 3, 4, 4, 3, 3, 2, 4, 3, 3, 4, 4, …
$ q4           <dbl> 5, 4, 4, 4, 4, 4, 2, 4, 4, 5, 4, 4, 4, 5, 5, 4, 4, 5, 5, …
$ q5           <dbl> 4, 4, 4, 4, 4, 3, 4, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, …
$ q6           <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 2, 3, 5, 3, 3, 3, 5, 4, 4, 4, 5, …
$ q7           <dbl> 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 5, 4, …
$ q8           <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 4, 5, 5, 3, 4, 5, 5, 4, 5, 5, 5, …
$ q9           <dbl> 3, 3, 3, 4, 3, 4, 4, 3, 2, 3, 4, 2, 3, 1, 5, 3, 3, 4, 3, …
$ q10          <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 2, 4, 5, 3, 4, 3, 5, 4, 5, 5, 5, …
$ date_x       <dttm> 2016-02-02 18:44:00, 2015-09-09 13:41:00, 2016-01-28 14:…
$ post_int     <dbl> NA, NA, NA, NA, NA, NA, NA, 3.50, 3.75, NA, 5.00, NA, NA,…
$ post_uv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.666667, 2.000000, NA, 4.666…
$ post_tv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.571429, 3.000000, NA, 4.857…
$ post_percomp <dbl> NA, NA, NA, NA, NA, NA, NA, 3.5, 3.0, NA, 4.0, NA, NA, 3.…
$ date_y       <dttm> NA, NA, NA, NA, NA, NA, NA, 2016-01-02 00:41:00, 2015-10…
$ date         <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Data merging and reshaping

Join the gradebook and time_spent datasets

As a reminder, there are different joins that we could use to combine our datasets:

Source: https://medium.com/@imanjokko/data-analysis-in-r-series-vi-joining-data-using-dplyr-fc0a83f0f064

A full_join() is best for our need to combine information from all three datasets. full_join() returns all of the records in a new table, whether or not they have a match in the left or right table. Where rows match, their columns are combined; where a matching row does not exist, the missing values are filled with NA.
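To see the difference on toy data (two small hypothetical tibbles, a and b; the tidyverse is already loaded):

#toy example contrasting inner_join() and full_join()
a <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
b <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

inner_join(a, b, by = "id")  #only ids 2 and 3, which appear in both tables
full_join(a, b, by = "id")   #ids 1 through 4, with NA where a match is missing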

Similar to what we learned in the code-a-long, we will combine the gradebook and time_spent datasets. We first should identify the variables (column names) to combine. In this case, we will use the following:

  • student_id

  • subject

  • semester

  • section

#use single join to join data sets by student_id, subject, semester and section.
#(add code below)
joined_data <- gradebook %>%
  inner_join(time_spent, by = c("student_id", "subject", "semester", "section"))
#inspect 
#(add code below)
glimpse(joined_data)
Rows: 603
Columns: 12
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ total_points_possible <dbl> 1217, 1676, 1232, 1833, 2225, 1222, 1775, 2225, …
$ total_points_earned   <dbl> 1150.00, 1384.23, 1116.00, 1492.73, 1994.75, 70.…
$ proportion_earned     <dbl> 94.494659, 82.591289, 90.584416, 81.436443, 89.6…
$ gender                <chr> "F", "M", "F", "M", "M", "M", "F", "F", "F", "F"…
$ enrollment_reason     <chr> "Course Unavailable at Local School", "Course Un…
$ enrollment_status     <chr> "Approved/Enrolled", "Approved/Enrolled", "Appro…
$ time_spent            <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours      <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…

As you can see, we have the new dataset joined_data. Those 12 variables came from the gradebook and time_spent datasets.

Join the survey and joined_data data sets

👉 Your Turn

  1. Use the full_join() function to join joined_data with survey, identifying the following variables:

    • student_id

    • subject

    • semester

    • section

  2. Save to a new object called data_to_explore.

  3. Inspect the data.

  4. Run the code.

#join data sets by student_id, subject, semester and section
#(add code below)
 data_to_explore <- joined_data %>%
  full_join(survey, by = c("student_id", "subject", "semester", "section")) 

#inspect
#(add code below)
glimpse(data_to_explore)
Note

DON’T PANIC if you are getting an error - read below!!

These datasets cannot be joined yet because the class (data type) of student_id in survey is different from its class in joined_data.


👉Your Turn

❓ Check out what class student_id is in joined_data compared to survey. What do you notice? (Hint: think about the class())

  • student_id in joined_data is numeric, but in survey, it’s text.
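You can check this directly with class():

#compare the class of student_id in the two datasets
class(joined_data$student_id)  #"numeric"
class(survey$student_id)       #"character"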

Use as.character()

We need to change joined_data’s student_id variable into a character class to match that of survey using mutate().

👉 Your Turn

  1. Use the mutate() function and as.character() function to change student_id variable from numeric to character class.

  2. Save the new value to the student_id variable.

  3. Run the code.

#mutate to change variable class from double or numeric to character
#(add code below)
joined_data <- joined_data %>%
  mutate(student_id = as.character(student_id))
#inspect
#(add code below)
class(joined_data$student_id)
[1] "character"

NOW full_join()!

👉 Your Turn

Now that the variables are the same class:

  1. Use the full_join() function to join joined_data with survey, identifying the following variables:
    • student_id

    • subject

    • semester

    • section

  2. Save to a new object called data_to_explore.
  3. Inspect the data.
  4. Run the code.
#try again to join joined_data with survey, assigning it to the data_to_explore object
#(add code below) 
data_to_explore <- joined_data %>%
  full_join(survey, by = c("student_id", "subject", "semester", "section"))
#inspect
#(add code below)
glimpse(data_to_explore)
Rows: 716
Columns: 34
$ student_id            <chr> "43146", "44638", "47448", "47979", "48797", "51…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ total_points_possible <dbl> 1217, 1676, 1232, 1833, 2225, 1222, 1775, 2225, …
$ total_points_earned   <dbl> 1150.00, 1384.23, 1116.00, 1492.73, 1994.75, 70.…
$ proportion_earned     <dbl> 94.494659, 82.591289, 90.584416, 81.436443, 89.6…
$ gender                <chr> "F", "M", "F", "M", "M", "M", "F", "F", "F", "F"…
$ enrollment_reason     <chr> "Course Unavailable at Local School", "Course Un…
$ enrollment_status     <chr> "Approved/Enrolled", "Approved/Enrolled", "Appro…
$ time_spent            <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours      <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ int                   <dbl> 4.2, 4.0, 4.2, 4.0, 3.8, 3.8, 3.6, 4.2, 3.8, NA,…
$ val                   <dbl> 3.666667, 3.000000, 3.000000, 3.666667, 3.666667…
$ percomp               <dbl> 4.0, 3.0, 3.0, 2.5, 3.5, 3.5, 3.0, 3.0, 3.0, NA,…
$ tv                    <dbl> 3.857143, 3.571429, 3.714286, 3.857143, 3.714286…
$ q1                    <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 5, NA, 5, 4, 3, 4, 4, 5,…
$ q2                    <dbl> 4, 2, 3, 3, 4, 4, 4, 4, 2, NA, 4, 5, 1, 3, 2, 5,…
$ q3                    <dbl> 4, 2, 3, 2, 3, 3, 4, 3, 3, NA, 4, 4, 3, 3, 2, 4,…
$ q4                    <dbl> 5, 4, 4, 4, 4, 4, 2, 4, 4, NA, 5, 4, 4, 4, 5, 5,…
$ q5                    <dbl> 4, 4, 4, 4, 4, 3, 4, 4, 4, NA, 5, 5, 4, 4, 5, 5,…
$ q6                    <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 2, NA, 3, 5, 3, 3, 3, 5,…
$ q7                    <dbl> 4, 4, 3, 3, 4, 4, 2, 3, 3, NA, 4, 4, 4, 4, 4, 4,…
$ q8                    <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 4, NA, 5, 5, 3, 4, 5, 5,…
$ q9                    <dbl> 3, 3, 3, 4, 3, 4, 4, 3, 2, NA, 3, 4, 2, 3, 1, 5,…
$ q10                   <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 2, NA, 4, 5, 3, 4, 3, 5,…
$ date_x                <dttm> 2016-02-02 18:44:00, 2015-09-09 13:41:00, 2016-…
$ post_int              <dbl> NA, NA, NA, NA, NA, NA, NA, 3.50, 3.75, NA, NA, …
$ post_uv               <dbl> NA, NA, NA, NA, NA, NA, NA, 3.666667, 2.000000, …
$ post_tv               <dbl> NA, NA, NA, NA, NA, NA, NA, 3.571429, 3.000000, …
$ post_percomp          <dbl> NA, NA, NA, NA, NA, NA, NA, 3.5, 3.0, NA, NA, 4.…
$ date_y                <dttm> NA, NA, NA, NA, NA, NA, NA, 2016-01-02 00:41:00…
$ date                  <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Let’s also transform the subject names to a more convenient format:

data_to_explore <- data_to_explore %>%
  mutate(subject = case_when(
    subject == "AnPhA"  ~ "Anatomy",
    subject == "BioA"   ~ "Biology",
    subject == "FrScA"  ~ "Forensics",
    subject == "OcnA"   ~ "Oceanography",
    subject == "PhysA"  ~ "Physics",
    TRUE ~ subject  #This line keeps the original value if none of the conditions above are met
  ))
data_to_explore
# A tibble: 716 × 34
   student_id subject semester section total_points_possible total_points_earned
   <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1 43146      Forens… S216     02                       1217               1150 
 2 44638      Oceano… S116     01                       1676               1384.
 3 47448      Forens… S216     01                       1232               1116 
 4 47979      Oceano… S216     01                       1833               1493.
 5 48797      Physics S116     01                       2225               1995.
 6 51943      Forens… S216     03                       1222                 70 
 7 52326      Anatomy S216     01                       1775               1519.
 8 52446      Physics S116     01                       2225               2198 
 9 53447      Forens… S116     01                       1212               1173 
10 53475      Forens… S116     02                       1212                  0 
# ℹ 706 more rows
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

Teacher Persona: Alex follows the steps to load and wrangle data, reflecting on how each step can provide insights into her students’ engagement levels. She is particularly interested in understanding patterns in the time students spend on different course materials and how these patterns correlate with their performance.

Filter and sort data

Use filter() from the dplyr package

We can identify students at risk of failing the course by using the filter() function to look at students whose proportion_earned is below 70:

#Filter students with lower grades
at_risk_students <- data_to_explore %>%
  filter(proportion_earned<70)

#Print the at-risk students
at_risk_students
# A tibble: 133 × 34
   student_id subject semester section total_points_possible total_points_earned
   <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1 51943      Forens… S216     03                       1222                 70 
 2 53475      Forens… S116     02                       1212                  0 
 3 54346      Oceano… S116     01                       1676                902.
 4 54567      Oceano… S216     02                       1676               1092.
 5 55283      Forens… S116     01                       1212                110 
 6 61357      Forens… S116     02                       1212                  0 
 7 66508      Anatomy T116     01                       1775                  0 
 8 66862      Anatomy S116     01                       1775               1119.
 9 67013      Anatomy S216     01                       1775                871.
10 68768      Forens… S116     02                       1212                816.
# ℹ 123 more rows
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

You can also use arrange() to sort values:

#sort in ascending order
data_to_explore %>%
  arrange(proportion_earned)
# A tibble: 716 × 34
   student_id subject semester section total_points_possible total_points_earned
   <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1 53475      Forens… S116     02                       1212                   0
 2 61357      Forens… S116     02                       1212                   0
 3 66508      Anatomy T116     01                       1775                   0
 4 69937      Biology S116     01                       2420                   0
 5 85258      Forens… S116     02                       1212                   0
 6 85487      Oceano… S116     02                       1676                   0
 7 85659      Forens… S116     01                       1212                   0
 8 88568      Biology S216     01                       2425                   0
 9 89465      Forens… S216     01                       1232                   0
10 90995      Biology S116     01                       2420                   0
# ℹ 706 more rows
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

Wrapping the sort variable in desc() inside arrange() sorts in descending order:

#sort in descending order
data_to_explore %>%
  arrange(desc(proportion_earned))
# A tibble: 716 × 34
   student_id subject semester section total_points_possible total_points_earned
   <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1 85650      Forens… S116     01                       1212               1221 
 2 91067      Biology S116     01                       2420               2414.
 3 78153      Physics S216     01                       2225               2216 
 4 88261      Forens… S116     01                       1212               1205 
 5 66740      Oceano… S116     01                       1676               1663 
 6 86792      Forens… S116     01                       1212               1201 
 7 85522      Physics S116     01                       2225               2200 
 8 66689      Forens… S216     01                       1232               1218 
 9 52446      Physics S116     01                       2225               2198 
10 86365      Forens… S116     01                       1212               1197 
# ℹ 706 more rows
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

👉 Your Turn

Think about what other factors might help identify students at risk. Write and run your code, then analyze the results:

#(add code below)
#Filter students with time_spent
at_risk_students_timespent <- data_to_explore %>%
  filter(time_spent_hours<20)

#Print the at-risk students
at_risk_students_timespent %>%
  arrange(proportion_earned)
# A tibble: 220 × 34
   student_id subject semester section total_points_possible total_points_earned
   <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1 66508      Anatomy T116     01                       1775                 0  
 2 69937      Biology S116     01                       2420                 0  
 3 85487      Oceano… S116     02                       1676                 0  
 4 89465      Forens… S216     01                       1232                 0  
 5 90995      Biology S116     01                       2420                 0  
 6 92606      Biology S216     01                       2425                 5  
 7 95684      Forens… S216     01                       1232                12  
 8 95738      Oceano… S216     01                       1833                30  
 9 89168      Forens… S116     03                       1212                25  
10 90996      Forens… S116     04                       1212                27.5
# ℹ 210 more rows
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …
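You could also combine both criteria to flag students who are low on grades and on time spent at once:

#flag students low on both proportion earned and hours spent
data_to_explore %>%
  filter(proportion_earned < 70, time_spent_hours < 20)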

Export and back up data

Now let’s write the file to our data folder using write_csv() to save for later or download.

# add the function to write data to file to use later
write_csv(data_to_explore, "module_1/data/data_to_explore.csv")

Check the data folder to confirm the location of your new file.
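You can also confirm from R that the file was written (assuming the same path used above):

#check that the exported file exists
file.exists("module_1/data/data_to_explore.csv")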

🛑 Stop here! Congratulations, you finished the first part of the case study.

3. EXPLORE (Module 2)

Exploratory Data Analysis

Use the skimr package

We’ve already wrangled our data, but let’s look at the data frame to make sure it is still correct. A quick way to look at the data frame is with skimr.

This output is best for internal use: it is rich, but not well-suited to exporting as a table that you add, for instance, to a Google Docs or Microsoft Word manuscript.

Of course, these values can be entered into a table manually, but later on we’ll also discuss ways to create tables that are ready, or nearly ready, to be added directly to manuscripts.

👉 Your Turn

  1. Load skimr with the correct function.
#load library by adding skimr as the package name
#(add code below)
library(skimr)

Normally you would load this at the top of the document, but we want to make sure you know which package each new function comes from.

  2. Next, use skim() to view data_to_explore.
#skim the data by adding the skim function in front of the data
#(add code below)
skim(data_to_explore)
Data summary
Name data_to_explore
Number of rows 716
Number of columns 34
_______________________
Column type frequency:
character 8
numeric 23
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
student_id 0 1.00 2 5 0 655 0
subject 0 1.00 7 12 0 5 0
semester 0 1.00 4 4 0 4 0
section 0 1.00 2 2 0 4 0
gender 113 0.84 1 1 0 2 0
enrollment_reason 113 0.84 5 34 0 5 0
enrollment_status 113 0.84 7 17 0 3 0
course_id 54 0.92 12 13 0 36 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_points_possible 113 0.84 1639.72 398.02 1212.00 1217.00 1676.00 1775.00 2425.00 ▇▂▆▁▃
total_points_earned 113 0.84 1242.90 522.39 0.00 1006.00 1182.00 1584.82 2413.50 ▂▃▇▅▂
proportion_earned 113 0.84 76.17 25.55 0.00 72.13 86.15 92.60 100.74 ▁▁▁▃▇
time_spent 118 0.84 1799.75 1354.93 0.45 851.90 1550.91 2426.09 8870.88 ▇▅▁▁▁
time_spent_hours 118 0.84 30.00 22.58 0.01 14.20 25.85 40.43 147.85 ▇▅▁▁▁
int 66 0.91 4.30 0.60 1.80 4.00 4.40 4.80 5.00 ▁▁▂▆▇
val 60 0.92 3.75 0.75 1.00 3.33 3.67 4.33 5.00 ▁▁▆▇▆
percomp 61 0.91 3.64 0.69 1.50 3.00 3.50 4.00 5.00 ▁▁▇▃▃
tv 65 0.91 4.07 0.59 1.00 3.71 4.12 4.46 5.00 ▁▁▂▇▇
q1 58 0.92 4.34 0.66 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q2 58 0.92 3.66 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▃
q3 59 0.92 3.31 0.85 1.00 3.00 3.00 4.00 5.00 ▁▂▇▅▂
q4 62 0.91 4.35 0.80 1.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q5 59 0.92 4.28 0.69 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▆
q6 58 0.92 4.05 0.80 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▅
q7 59 0.92 3.96 0.85 1.00 3.00 4.00 5.00 5.00 ▁▁▅▇▆
q8 59 0.92 4.35 0.65 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q9 59 0.92 3.55 0.92 1.00 3.00 4.00 4.00 5.00 ▁▂▇▇▃
q10 58 0.92 4.17 0.87 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▇
post_int 621 0.13 3.88 0.94 1.00 3.50 4.00 4.50 5.00 ▁▁▃▇▇
post_uv 621 0.13 3.48 0.99 1.00 3.00 3.67 4.00 5.00 ▂▂▅▇▅
post_tv 621 0.13 3.71 0.90 1.00 3.29 3.86 4.29 5.00 ▁▂▃▇▆
post_percomp 621 0.13 3.47 0.88 1.00 3.00 3.50 4.00 5.00 ▁▂▂▇▂

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date_x 166 0.77 2015-09-02 15:40:00 2016-05-24 15:53:00 2015-10-01 15:57:30 536
date_y 621 0.13 2015-09-02 15:31:00 2016-01-22 15:43:00 2016-01-04 13:25:00 95
date 607 0.15 2017-01-23 13:14:00 2017-02-13 13:00:00 2017-01-25 18:43:00 107
  3. Finally, use the group_by() function (from dplyr) on the subject variable, then the skim() function (from skimr).
#(add code below)
data_to_explore %>%
  group_by(subject) %>%
  skim()
Data summary
Name Piped data
Number of rows 716
Number of columns 34
_______________________
Column type frequency:
character 7
numeric 23
POSIXct 3
________________________
Group variables subject

Variable type: character

skim_variable subject n_missing complete_rate min max empty n_unique whitespace
student_id Anatomy 0 1.00 2 5 0 163 0
student_id Biology 0 1.00 3 5 0 43 0
student_id Forensics 0 1.00 2 5 0 286 0
student_id Oceanography 0 1.00 2 5 0 127 0
student_id Physics 0 1.00 3 5 0 68 0
semester Anatomy 0 1.00 4 4 0 4 0
semester Biology 0 1.00 4 4 0 4 0
semester Forensics 0 1.00 4 4 0 4 0
semester Oceanography 0 1.00 4 4 0 4 0
semester Physics 0 1.00 4 4 0 4 0
section Anatomy 0 1.00 2 2 0 2 0
section Biology 0 1.00 2 2 0 1 0
section Forensics 0 1.00 2 2 0 4 0
section Oceanography 0 1.00 2 2 0 3 0
section Physics 0 1.00 2 2 0 1 0
gender Anatomy 23 0.86 1 1 0 2 0
gender Biology 2 0.96 1 1 0 2 0
gender Forensics 65 0.79 1 1 0 2 0
gender Oceanography 20 0.85 1 1 0 2 0
gender Physics 3 0.96 1 1 0 2 0
enrollment_reason Anatomy 23 0.86 5 34 0 4 0
enrollment_reason Biology 2 0.96 5 34 0 5 0
enrollment_reason Forensics 65 0.79 5 34 0 5 0
enrollment_reason Oceanography 20 0.85 5 34 0 5 0
enrollment_reason Physics 3 0.96 5 34 0 4 0
enrollment_status Anatomy 23 0.86 7 17 0 2 0
enrollment_status Biology 2 0.96 7 17 0 3 0
enrollment_status Forensics 65 0.79 7 17 0 3 0
enrollment_status Oceanography 20 0.85 7 17 0 3 0
enrollment_status Physics 3 0.96 7 17 0 2 0
course_id Anatomy 14 0.92 13 13 0 7 0
course_id Biology 3 0.93 12 12 0 4 0
course_id Forensics 21 0.93 13 13 0 12 0
course_id Oceanography 11 0.92 12 12 0 9 0
course_id Physics 5 0.93 13 13 0 4 0

Variable type: numeric

skim_variable subject n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_points_possible Anatomy 23 0.86 1774.16 10.03 1655.00 1775.00 1775.00 1775.00 1775.00 ▁▁▁▁▇
total_points_possible Biology 2 0.96 2420.81 1.87 2420.00 2420.00 2420.00 2420.00 2425.00 ▇▁▁▁▂
total_points_possible Forensics 65 0.79 1219.47 9.13 1212.00 1212.00 1212.00 1232.00 1232.00 ▇▂▁▁▅
total_points_possible Oceanography 20 0.85 1733.64 80.05 1480.00 1676.00 1676.00 1833.00 1833.00 ▁▁▇▁▅
total_points_possible Physics 3 0.96 2225.00 0.00 2225.00 2225.00 2225.00 2225.00 2225.00 ▁▁▇▁▁
total_points_earned Anatomy 23 0.86 1348.95 410.98 0.00 1277.26 1511.14 1605.72 1732.52 ▁▁▁▂▇
total_points_earned Biology 2 0.96 1531.28 826.39 0.00 959.58 1865.13 2199.65 2413.50 ▃▁▁▃▇
total_points_earned Forensics 65 0.79 950.13 304.97 0.00 909.73 1067.04 1133.00 1221.00 ▁▁▁▂▇
total_points_earned Oceanography 20 0.85 1253.63 447.50 0.00 1210.76 1394.41 1551.39 1786.76 ▁▁▁▆▇
total_points_earned Physics 3 0.96 1885.98 476.43 110.00 1889.50 2067.50 2145.50 2216.00 ▁▁▁▂▇
proportion_earned Anatomy 23 0.86 76.03 23.15 0.00 72.99 85.13 90.46 97.61 ▁▁▁▂▇
proportion_earned Biology 2 0.96 63.26 34.14 0.00 39.65 77.07 90.89 99.73 ▃▁▁▃▇
proportion_earned Forensics 65 0.79 77.91 25.00 0.00 74.83 87.46 92.82 100.74 ▁▁▁▂▇
proportion_earned Oceanography 20 0.85 72.54 26.03 0.00 68.22 81.44 91.10 99.22 ▁▁▁▃▇
proportion_earned Physics 3 0.96 84.76 21.41 4.94 84.92 92.92 96.43 99.60 ▁▁▁▂▇
time_spent Anatomy 23 0.86 2408.96 1672.01 0.45 1239.97 2269.88 3528.55 7084.70 ▇▇▅▂▁
time_spent Biology 3 0.93 1275.63 1394.31 1.22 296.04 696.76 1734.13 6664.45 ▇▂▁▁▁
time_spent Forensics 69 0.77 1541.30 935.75 3.45 885.92 1378.98 2101.93 4323.83 ▅▇▃▂▁
time_spent Oceanography 20 0.85 1987.03 1520.68 0.58 1030.92 1734.68 2515.32 8870.88 ▇▆▂▁▁
time_spent Physics 3 0.96 1420.69 1004.80 0.70 719.28 1276.52 2047.00 5373.35 ▇▆▃▁▁
time_spent_hours Anatomy 23 0.86 40.15 27.87 0.01 20.67 37.83 58.81 118.08 ▇▇▅▂▁
time_spent_hours Biology 3 0.93 21.26 23.24 0.02 4.93 11.61 28.90 111.07 ▇▂▁▁▁
time_spent_hours Forensics 69 0.77 25.69 15.60 0.06 14.77 22.98 35.03 72.06 ▅▇▃▂▁
time_spent_hours Oceanography 20 0.85 33.12 25.34 0.01 17.18 28.91 41.92 147.85 ▇▆▂▁▁
time_spent_hours Physics 3 0.96 23.68 16.75 0.01 11.99 21.28 34.12 89.56 ▇▆▃▁▁
int Anatomy 18 0.89 4.42 0.57 1.80 4.00 4.40 5.00 5.00 ▁▁▁▅▇
int Biology 5 0.89 3.69 0.63 2.40 3.35 3.80 4.00 5.00 ▂▆▇▆▂
int Forensics 25 0.92 4.42 0.52 2.60 4.00 4.40 5.00 5.00 ▁▁▃▃▇
int Oceanography 12 0.91 4.24 0.58 2.20 4.00 4.20 4.60 5.00 ▁▁▂▇▆
int Physics 6 0.91 4.00 0.65 2.20 3.60 4.00 4.40 5.00 ▁▂▆▇▅
val Anatomy 15 0.91 4.29 0.62 1.00 4.00 4.33 4.67 5.00 ▁▁▁▅▇
val Biology 3 0.93 3.50 0.58 2.67 3.00 3.33 3.67 5.00 ▆▆▇▁▂
val Forensics 26 0.92 3.53 0.72 1.67 3.00 3.67 4.00 5.00 ▂▅▇▅▂
val Oceanography 11 0.92 3.62 0.77 1.00 3.00 3.67 4.00 5.00 ▁▁▅▇▃
val Physics 5 0.93 3.89 0.56 2.00 3.67 4.00 4.33 5.00 ▁▁▇▇▃
percomp Anatomy 17 0.90 3.80 0.67 2.00 3.50 4.00 4.50 5.00 ▂▃▇▆▇
percomp Biology 4 0.91 3.34 0.75 2.00 3.00 3.00 4.00 5.00 ▅▇▃▇▂
percomp Forensics 23 0.92 3.64 0.63 1.50 3.00 3.50 4.00 5.00 ▁▁▇▅▃
percomp Oceanography 12 0.91 3.57 0.67 2.00 3.00 3.50 4.00 5.00 ▂▇▆▅▅
percomp Physics 5 0.93 3.56 0.84 2.00 3.00 3.50 4.00 5.00 ▅▅▇▅▇
tv Anatomy 16 0.90 4.35 0.57 1.00 4.00 4.43 4.83 5.00 ▁▁▁▅▇
tv Biology 5 0.89 3.61 0.56 2.29 3.14 3.57 3.86 5.00 ▁▃▇▂▁
tv Forensics 27 0.91 4.04 0.52 2.29 3.71 4.00 4.43 5.00 ▁▂▆▇▅
tv Oceanography 11 0.92 3.97 0.62 1.71 3.71 4.00 4.38 5.00 ▁▁▂▇▅
tv Physics 6 0.91 3.94 0.56 2.14 3.57 4.00 4.29 5.00 ▁▂▃▇▂
q1 Anatomy 15 0.91 4.43 0.64 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q1 Biology 3 0.93 3.76 0.66 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▁
q1 Forensics 24 0.92 4.50 0.57 2.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q1 Oceanography 11 0.92 4.20 0.69 2.00 4.00 4.00 5.00 5.00 ▁▂▁▇▅
q1 Physics 5 0.93 4.03 0.72 2.00 4.00 4.00 4.50 5.00 ▁▃▁▇▃
q2 Anatomy 15 0.91 4.30 0.74 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q2 Biology 3 0.93 3.48 0.71 2.00 3.00 3.00 4.00 5.00 ▁▇▁▆▁
q2 Forensics 23 0.92 3.35 0.89 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q2 Oceanography 12 0.91 3.46 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▂
q2 Physics 5 0.93 4.03 0.76 2.00 4.00 4.00 5.00 5.00 ▁▂▁▇▅
q3 Anatomy 16 0.90 3.53 0.87 1.00 3.00 3.00 4.00 5.00 ▁▁▇▅▃
q3 Biology 3 0.93 2.98 0.87 2.00 2.00 3.00 3.00 5.00 ▅▇▁▂▁
q3 Forensics 23 0.92 3.25 0.79 1.00 3.00 3.00 4.00 5.00 ▁▂▇▃▁
q3 Oceanography 12 0.91 3.30 0.86 2.00 3.00 3.00 4.00 5.00 ▃▇▁▅▂
q3 Physics 5 0.93 3.32 0.95 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q4 Anatomy 17 0.90 4.52 0.78 1.00 4.00 5.00 5.00 5.00 ▁▁▁▃▇
q4 Biology 3 0.93 3.69 0.81 2.00 3.00 4.00 4.00 5.00 ▂▃▁▇▂
q4 Forensics 25 0.92 4.44 0.74 1.00 4.00 5.00 5.00 5.00 ▁▁▁▅▇
q4 Oceanography 12 0.91 4.29 0.75 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q4 Physics 5 0.93 4.02 0.87 2.00 4.00 4.00 5.00 5.00 ▁▃▁▇▆
q5 Anatomy 15 0.91 4.36 0.69 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q5 Biology 4 0.91 3.88 0.68 2.00 4.00 4.00 4.00 5.00 ▁▃▁▇▂
q5 Forensics 24 0.92 4.38 0.62 2.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q5 Oceanography 11 0.92 4.20 0.77 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▆
q5 Physics 5 0.93 4.06 0.67 2.00 4.00 4.00 4.00 5.00 ▁▁▁▇▃
q6 Anatomy 15 0.91 4.50 0.65 1.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q6 Biology 3 0.93 3.83 0.70 3.00 3.00 4.00 4.00 5.00 ▅▁▇▁▂
q6 Forensics 24 0.92 3.88 0.79 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▃
q6 Oceanography 11 0.92 3.84 0.84 1.00 3.00 4.00 4.00 5.00 ▁▁▅▇▃
q6 Physics 5 0.93 4.27 0.68 2.00 4.00 4.00 5.00 5.00 ▁▁▁▇▆
q7 Anatomy 16 0.90 4.08 0.85 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▆
q7 Biology 4 0.91 3.71 0.96 2.00 3.00 4.00 4.00 5.00 ▂▇▁▇▆
q7 Forensics 23 0.92 4.02 0.83 1.00 3.00 4.00 5.00 5.00 ▁▁▅▇▆
q7 Oceanography 11 0.92 3.83 0.82 2.00 3.00 4.00 4.00 5.00 ▁▆▁▇▅
q7 Physics 5 0.93 3.81 0.90 2.00 3.00 4.00 4.00 5.00 ▂▅▁▇▅
q8 Anatomy 16 0.90 4.45 0.65 1.00 4.00 5.00 5.00 5.00 ▁▁▁▇▇
q8 Biology 3 0.93 3.79 0.72 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▂
q8 Forensics 23 0.92 4.45 0.58 3.00 4.00 4.00 5.00 5.00 ▁▁▇▁▇
q8 Oceanography 11 0.92 4.33 0.60 3.00 4.00 4.00 5.00 5.00 ▁▁▇▁▆
q8 Physics 6 0.91 4.05 0.73 2.00 4.00 4.00 4.00 5.00 ▁▁▁▇▃
q9 Anatomy 15 0.91 4.07 0.81 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▆
q9 Biology 3 0.93 3.19 0.86 2.00 3.00 3.00 4.00 5.00 ▃▇▁▅▁
q9 Forensics 25 0.92 3.37 0.91 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q9 Oceanography 11 0.92 3.54 0.91 1.00 3.00 4.00 4.00 5.00 ▁▂▇▇▃
q9 Physics 5 0.93 3.38 0.83 2.00 3.00 3.00 4.00 5.00 ▃▇▁▇▂
q10 Anatomy 15 0.91 4.35 0.74 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q10 Biology 4 0.91 3.37 0.89 2.00 3.00 3.00 4.00 5.00 ▂▇▁▅▂
q10 Forensics 23 0.92 4.30 0.81 1.00 4.00 4.00 5.00 5.00 ▁▁▂▆▇
q10 Oceanography 11 0.92 4.13 0.93 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▇
q10 Physics 5 0.93 3.78 0.89 2.00 3.00 4.00 4.00 5.00 ▂▆▁▇▅
post_int Anatomy 165 0.01 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_int Biology 36 0.20 3.06 0.69 1.75 2.75 3.00 3.25 4.25 ▂▃▇▂▂
post_int Forensics 263 0.14 4.00 0.93 1.50 3.75 4.00 4.88 5.00 ▁▃▁▇▇
post_int Oceanography 113 0.14 4.33 0.56 3.00 4.00 4.25 4.75 5.00 ▁▂▅▅▇
post_int Physics 44 0.35 3.75 0.88 1.50 3.50 4.00 4.25 5.00 ▁▁▂▇▂
post_uv Anatomy 165 0.01 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_uv Biology 36 0.20 3.11 0.80 1.67 2.67 3.33 3.67 4.33 ▂▃▂▇▂
post_uv Forensics 263 0.14 3.38 1.11 1.00 2.67 3.67 4.00 5.00 ▃▃▆▇▆
post_uv Oceanography 113 0.14 3.93 0.88 1.33 3.67 4.00 4.58 5.00 ▁▁▁▇▇
post_uv Physics 44 0.35 3.57 0.66 1.67 3.33 3.67 4.00 4.67 ▁▁▃▇▂
post_tv Anatomy 165 0.01 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_tv Biology 36 0.20 3.08 0.70 1.71 2.86 3.00 3.29 4.29 ▂▂▇▃▂
post_tv Forensics 263 0.14 3.73 0.96 1.29 3.29 4.00 4.43 5.00 ▁▃▅▆▇
post_tv Oceanography 113 0.14 4.16 0.60 3.00 3.86 4.14 4.71 4.86 ▂▁▅▅▇
post_tv Physics 44 0.35 3.67 0.74 1.57 3.43 3.86 4.04 4.71 ▂▁▃▇▅
post_percomp Anatomy 165 0.01 3.00 NA 3.00 3.00 3.00 3.00 3.00 ▁▁▇▁▁
post_percomp Biology 36 0.20 3.06 0.58 2.00 2.50 3.50 3.50 3.50 ▂▃▁▂▇
post_percomp Forensics 263 0.14 3.51 0.96 1.00 3.00 3.50 4.00 5.00 ▁▂▆▇▅
post_percomp Oceanography 113 0.14 3.69 0.75 2.00 3.50 4.00 4.00 5.00 ▃▁▆▇▃
post_percomp Physics 44 0.35 3.40 0.91 1.50 3.00 3.50 4.00 4.50 ▂▂▂▆▇

Variable type: POSIXct

skim_variable subject n_missing complete_rate min max median n_unique
date_x Anatomy 36 0.78 2015-09-02 15:40:00 2016-03-23 16:11:00 2015-09-27 20:10:30 129
date_x Biology 5 0.89 2015-09-08 19:52:00 2016-03-09 14:07:00 2015-09-16 14:27:00 40
date_x Forensics 86 0.72 2015-09-08 13:10:00 2016-04-27 02:12:00 2015-10-08 19:19:30 218
date_x Oceanography 31 0.76 2015-09-08 20:08:00 2016-03-03 15:57:00 2016-01-25 20:17:00 97
date_x Physics 8 0.88 2015-09-09 12:24:00 2016-05-24 15:53:00 2015-10-08 21:17:00 60
date_y Anatomy 165 0.01 2015-09-02 15:31:00 2015-09-02 15:31:00 2015-09-02 15:31:00 1
date_y Biology 36 0.20 2015-11-17 03:04:00 2016-01-21 23:38:00 2016-01-16 23:48:00 9
date_y Forensics 263 0.14 2015-09-09 15:21:00 2016-01-22 15:43:00 2016-01-04 13:13:00 43
date_y Oceanography 113 0.14 2015-09-12 15:56:00 2016-01-08 17:51:00 2015-09-18 04:08:30 18
date_y Physics 44 0.35 2015-09-14 14:45:00 2016-01-22 05:36:00 2016-01-17 08:24:30 24
date Anatomy 145 0.13 2017-01-23 14:28:00 2017-02-10 15:25:00 2017-02-01 17:09:00 21
date Biology 43 0.04 2017-02-06 20:12:00 2017-02-09 19:15:00 2017-02-08 07:43:30 2
date Forensics 243 0.21 2017-01-23 13:14:00 2017-02-13 13:00:00 2017-01-24 17:23:00 62
date Oceanography 111 0.15 2017-01-23 14:07:00 2017-02-09 18:45:00 2017-02-01 21:53:30 20
date Physics 65 0.04 2017-01-30 14:41:00 2017-02-03 15:23:00 2017-02-02 20:54:00 3

Missing values

The summary() function provides additional information. It can be used for the entire dataset or individual variables.

#use summary() to look at your data
summary(data_to_explore)
  student_id          subject            semester           section         
 Length:716         Length:716         Length:716         Length:716        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 total_points_possible total_points_earned proportion_earned    gender         
 Min.   :1212          Min.   :   0        Min.   :  0.00    Length:716        
 1st Qu.:1217          1st Qu.:1006        1st Qu.: 72.13    Class :character  
 Median :1676          Median :1182        Median : 86.15    Mode  :character  
 Mean   :1640          Mean   :1243        Mean   : 76.17                      
 3rd Qu.:1775          3rd Qu.:1585        3rd Qu.: 92.60                      
 Max.   :2425          Max.   :2414        Max.   :100.74                      
 NA's   :113           NA's   :113         NA's   :113                         
 enrollment_reason  enrollment_status    time_spent      time_spent_hours  
 Length:716         Length:716         Min.   :   0.45   Min.   :  0.0075  
 Class :character   Class :character   1st Qu.: 851.90   1st Qu.: 14.1983  
 Mode  :character   Mode  :character   Median :1550.91   Median : 25.8485  
                                       Mean   :1799.75   Mean   : 29.9959  
                                       3rd Qu.:2426.09   3rd Qu.: 40.4348  
                                       Max.   :8870.88   Max.   :147.8481  
                                       NA's   :118       NA's   :118       
  course_id              int             val           percomp     
 Length:716         Min.   :1.800   Min.   :1.000   Min.   :1.500  
 Class :character   1st Qu.:4.000   1st Qu.:3.333   1st Qu.:3.000  
 Mode  :character   Median :4.400   Median :3.667   Median :3.500  
                    Mean   :4.301   Mean   :3.754   Mean   :3.636  
                    3rd Qu.:4.800   3rd Qu.:4.333   3rd Qu.:4.000  
                    Max.   :5.000   Max.   :5.000   Max.   :5.000  
                    NA's   :66      NA's   :60      NA's   :61     
       tv              q1              q2              q3       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.714   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:3.000  
 Median :4.125   Median :4.000   Median :4.000   Median :3.000  
 Mean   :4.065   Mean   :4.337   Mean   :3.661   Mean   :3.312  
 3rd Qu.:4.464   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :65      NA's   :58      NA's   :58      NA's   :59     
       q4              q5              q6              q7             q8       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
 1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:4.000  
 Median :5.000   Median :4.000   Median :4.000   Median :4.00   Median :4.000  
 Mean   :4.346   Mean   :4.282   Mean   :4.049   Mean   :3.96   Mean   :4.346  
 3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.00   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
 NA's   :62      NA's   :59      NA's   :58      NA's   :59     NA's   :59     
       q9             q10            date_x                      
 Min.   :1.000   Min.   :1.000   Min.   :2015-09-02 15:40:00.00  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:2015-09-11 16:55:15.00  
 Median :4.000   Median :4.000   Median :2015-10-01 15:57:30.00  
 Mean   :3.553   Mean   :4.173   Mean   :2015-11-17 22:56:45.16  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:2016-01-27 16:11:30.00  
 Max.   :5.000   Max.   :5.000   Max.   :2016-05-24 15:53:00.00  
 NA's   :59      NA's   :58      NA's   :166                     
    post_int        post_uv         post_tv       post_percomp  
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.500   1st Qu.:3.000   1st Qu.:3.286   1st Qu.:3.000  
 Median :4.000   Median :3.667   Median :3.857   Median :3.500  
 Mean   :3.879   Mean   :3.481   Mean   :3.708   Mean   :3.468  
 3rd Qu.:4.500   3rd Qu.:4.000   3rd Qu.:4.286   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :621     NA's   :621     NA's   :621     NA's   :621    
     date_y                            date                       
 Min.   :2015-09-02 15:31:00.00   Min.   :2017-01-23 13:14:00.00  
 1st Qu.:2015-10-16 14:27:00.00   1st Qu.:2017-01-23 19:13:00.00  
 Median :2016-01-04 13:25:00.00   Median :2017-01-25 18:43:00.00  
 Mean   :2015-12-06 18:24:55.57   Mean   :2017-01-30 08:03:39.07  
 3rd Qu.:2016-01-18 18:53:30.00   3rd Qu.:2017-02-08 13:04:00.00  
 Max.   :2016-01-22 15:43:00.00   Max.   :2017-02-13 13:00:00.00  
 NA's   :621                      NA's   :607                     
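
To summarize a single variable instead of the whole data frame, you can pass one column directly (a quick sketch using the time_spent_hours column):

# summary() also works on an individual column
summary(data_to_explore$time_spent_hours)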

If you want to count the NAs in all your columns, you can combine the base R function is.na() with across() from dplyr.

data_to_explore %>%
  select(everything()) %>%  # adjust this selection to your needs
  summarize(across(everything(), ~ sum(is.na(.)))) #across() applies functions to multiple columns
# A tibble: 1 × 34
  student_id subject semester section total_points_possible total_points_earned
       <int>   <int>    <int>   <int>                 <int>               <int>
1          0       0        0       0                   113                 113
# ℹ 28 more variables: proportion_earned <int>, gender <int>,
#   enrollment_reason <int>, enrollment_status <int>, time_spent <int>,
#   time_spent_hours <int>, course_id <int>, int <int>, val <int>,
#   percomp <int>, tv <int>, q1 <int>, q2 <int>, q3 <int>, q4 <int>, q5 <int>,
#   q6 <int>, q7 <int>, q8 <int>, q9 <int>, q10 <int>, date_x <int>,
#   post_int <int>, post_uv <int>, post_tv <int>, post_percomp <int>,
#   date_y <int>, date <int>

👉 Your Turn

Let’s look at how many NAs are in the semester column only. Most of the code is completed, but you need to:

  1. Add the semester column to the select() function
  2. Add semester to the sum(is.na()) function
data_to_explore %>%
  select(semester) %>%  # Fill in the variable you want to look at
  summarize(na_count = sum(is.na(semester)))  # count NA values in the chosen variable

Exploration with Data Visualization using ggplot2

ggplot2 is designed to work iteratively. You start with a layer that shows the raw data, then add layers of annotations and statistical summaries. ggplot2 is a part of the tidyverse, so we do not need to load it again.

One helpful resource for ggplot2, the book ggplot2: Elegant Graphics for Data Analysis, states that

“every ggplot2 plot has three key components:

  • data,

  • a set of aesthetic mappings between variables in the data and visual properties, and

  • at least one layer which describes how to render each observation. Layers are usually created with a geom() function.”

We’ll start off by creating some basic visualizations that each examine a single variable of interest: bar plots (for a categorical variable) and histograms (for a continuous one).

One categorical variable: Bar plot

For this visualization, we will be guided by the following research questions about a categorical variable:

❓ Which online course had the largest enrollment numbers?

❓ Which variable should we be looking at?

👉 Your Turn

1. Before visualizing anything, inspect data_to_explore to understand what variables we might need to explore the research questions.

#inspect the data frame
#(add code below)
head(data_to_explore)
# A tibble: 6 × 34
  student_id subject  semester section total_points_possible total_points_earned
  <chr>      <chr>    <chr>    <chr>                   <dbl>               <dbl>
1 43146      Forensi… S216     02                       1217               1150 
2 44638      Oceanog… S116     01                       1676               1384.
3 47448      Forensi… S216     01                       1232               1116 
4 47979      Oceanog… S216     01                       1833               1493.
5 48797      Physics  S116     01                       2225               1995.
6 51943      Forensi… S216     03                       1222                 70 
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>,
#   date_y <dttm>, date <dttm>
names(data_to_explore)
 [1] "student_id"            "subject"               "semester"             
 [4] "section"               "total_points_possible" "total_points_earned"  
 [7] "proportion_earned"     "gender"                "enrollment_reason"    
[10] "enrollment_status"     "time_spent"            "time_spent_hours"     
[13] "course_id"             "int"                   "val"                  
[16] "percomp"               "tv"                    "q1"                   
[19] "q2"                    "q3"                    "q4"                   
[22] "q5"                    "q6"                    "q7"                   
[25] "q8"                    "q9"                    "q10"                  
[28] "date_x"                "post_int"              "post_uv"              
[31] "post_tv"               "post_percomp"          "date_y"               
[34] "date"                 
summary(data_to_explore)
  student_id          subject            semester           section         
 Length:716         Length:716         Length:716         Length:716        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 total_points_possible total_points_earned proportion_earned    gender         
 Min.   :1212          Min.   :   0        Min.   :  0.00    Length:716        
 1st Qu.:1217          1st Qu.:1006        1st Qu.: 72.13    Class :character  
 Median :1676          Median :1182        Median : 86.15    Mode  :character  
 Mean   :1640          Mean   :1243        Mean   : 76.17                      
 3rd Qu.:1775          3rd Qu.:1585        3rd Qu.: 92.60                      
 Max.   :2425          Max.   :2414        Max.   :100.74                      
 NA's   :113           NA's   :113         NA's   :113                         
 enrollment_reason  enrollment_status    time_spent      time_spent_hours  
 Length:716         Length:716         Min.   :   0.45   Min.   :  0.0075  
 Class :character   Class :character   1st Qu.: 851.90   1st Qu.: 14.1983  
 Mode  :character   Mode  :character   Median :1550.91   Median : 25.8485  
                                       Mean   :1799.75   Mean   : 29.9959  
                                       3rd Qu.:2426.09   3rd Qu.: 40.4348  
                                       Max.   :8870.88   Max.   :147.8481  
                                       NA's   :118       NA's   :118       
  course_id              int             val           percomp     
 Length:716         Min.   :1.800   Min.   :1.000   Min.   :1.500  
 Class :character   1st Qu.:4.000   1st Qu.:3.333   1st Qu.:3.000  
 Mode  :character   Median :4.400   Median :3.667   Median :3.500  
                    Mean   :4.301   Mean   :3.754   Mean   :3.636  
                    3rd Qu.:4.800   3rd Qu.:4.333   3rd Qu.:4.000  
                    Max.   :5.000   Max.   :5.000   Max.   :5.000  
                    NA's   :66      NA's   :60      NA's   :61     
       tv              q1              q2              q3       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.714   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:3.000  
 Median :4.125   Median :4.000   Median :4.000   Median :3.000  
 Mean   :4.065   Mean   :4.337   Mean   :3.661   Mean   :3.312  
 3rd Qu.:4.464   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :65      NA's   :58      NA's   :58      NA's   :59     
       q4              q5              q6              q7             q8       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
 1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:4.000  
 Median :5.000   Median :4.000   Median :4.000   Median :4.00   Median :4.000  
 Mean   :4.346   Mean   :4.282   Mean   :4.049   Mean   :3.96   Mean   :4.346  
 3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.00   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
 NA's   :62      NA's   :59      NA's   :58      NA's   :59     NA's   :59     
       q9             q10            date_x                      
 Min.   :1.000   Min.   :1.000   Min.   :2015-09-02 15:40:00.00  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:2015-09-11 16:55:15.00  
 Median :4.000   Median :4.000   Median :2015-10-01 15:57:30.00  
 Mean   :3.553   Mean   :4.173   Mean   :2015-11-17 22:56:45.16  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:2016-01-27 16:11:30.00  
 Max.   :5.000   Max.   :5.000   Max.   :2016-05-24 15:53:00.00  
 NA's   :59      NA's   :58      NA's   :166                     
    post_int        post_uv         post_tv       post_percomp  
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.500   1st Qu.:3.000   1st Qu.:3.286   1st Qu.:3.000  
 Median :4.000   Median :3.667   Median :3.857   Median :3.500  
 Mean   :3.879   Mean   :3.481   Mean   :3.708   Mean   :3.468  
 3rd Qu.:4.500   3rd Qu.:4.000   3rd Qu.:4.286   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :621     NA's   :621     NA's   :621     NA's   :621    
     date_y                            date                       
 Min.   :2015-09-02 15:31:00.00   Min.   :2017-01-23 13:14:00.00  
 1st Qu.:2015-10-16 14:27:00.00   1st Qu.:2017-01-23 19:13:00.00  
 Median :2016-01-04 13:25:00.00   Median :2017-01-25 18:43:00.00  
 Mean   :2015-12-06 18:24:55.57   Mean   :2017-01-30 08:03:39.07  
 3rd Qu.:2016-01-18 18:53:30.00   3rd Qu.:2017-02-08 13:04:00.00  
 Max.   :2016-01-22 15:43:00.00   Max.   :2017-02-13 13:00:00.00  
 NA's   :621                      NA's   :607                     

Level a: The most basic level for a plot

As a reminder, the most basic visualization that you can make with ggplot2 includes three layers:

  • Layer 1: data: the data_to_explore data frame
  • Layer 2: aes() function - one categorical variable:
    • subject mapped to x position
  • Layer 3: geom: the geom_bar() function - bar graph
# layer 1: add data
# layer 2: add aesthetics mapping
ggplot(data_to_explore, aes(x = subject)) +
  # layer 3: add geom
  geom_bar() +
  labs(title = "Number of Observations per Subject",
       x = "Subject",
       y = "Count")

Level b: Add another layer with labels

To communicate context, we can add the following labels to our ggplot2 visualization as Layer 4.

  • Title: “Number of Student Enrollments per Subject”

  • Caption: “Which online courses have had the largest enrollment numbers?”

# layer 1: add data
# layer 2: add aesthetics mapping
ggplot(data_to_explore, aes(x = subject)) +
  # layer 3: add geom
  geom_bar() +
  # layer 4: add labels
  labs(title = "Number of Student Enrollments per Subject",
       caption = "Which online courses have had the largest enrollment numbers?")

Level c: Add scale with a different color.

We will be guided by the following research question.

❓ What can we notice about gender?

For Layer 5, we’ll add a scale by telling the aes() function to make the fill colors of the bars correspond to the gender variable.

# layer 1: add data
# layer 2: add aesthetics mapping and layer 5: add scale
ggplot(data_to_explore, aes(x = subject, fill = gender)) +
  # layer 3: add geom
  geom_bar() +
  # layer 4: add labels
  labs(title = "Gender Distribution of Students Across Subjects",
       caption = "Which subjects enroll more female students?")

One continuous variable: Histogram

Now we’ll create a basic visualization that examines the distribution of a different, continuous variable using a histogram via geom_histogram().

For this visualization, we will be guided by the following research question:

❓ How many hours of TV do students watch per day?

👉 Your Turn

Write a new visualization using geom_histogram().

  • Data: data_to_explore
  • aes() function: tv variable mapped to the x position
  • Geom: geom_histogram() (already included in code block)
  • Title: “Number of Hours Students Watch TV per Day”
  • Caption: “Approximately how many students watch 4+ hours of TV per day?”, or something similar that captures the research question

NEED HELP? Try Statistical Tools for High-throughput Data Analysis (STHDA)!

#(add code below)
ggplot(data_to_explore, aes(x = tv)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  labs(
    title = "Number of Hours Students Watch TV per Day",
    x = "Hours of TV Watched per Day",
    y = "Number of Students",
    caption = "How many students watch 4 or more hours of TV daily?"
  )


Two categorical variables: Heatmap

Create a basic visualization that examines the relationship between two categorical variables using a heatmap via geom_tile().

For this visualization, we will be guided by the following research question:

❓ What are the reasons for enrollment in various courses?

We can make a heatmap (via geom_tile()) to visualize the relationship between the course subject and reasons for enrollment.

👉 Your Turn

  • Data: data_to_explore
  • First, use the count() function for the categorical variables subject & enrollment_reason, then
  • ggplot()
  • aes():
    • subject mapped to x position
    • enrollment_reason mapped to y position
  • Geom: geom_tile()
  • Title: “Reasons for Enrollment by Subject”
  • Caption: “Which subjects were the least available at local schools?”

We’ve added extra code, scale_fill_gradient() and labs(), to help show a heatmap’s effectiveness. Play around with it if you’d like!

#(add code below)

enrollment_summary <- data_to_explore %>%
  count(subject, enrollment_reason)

ggplot(enrollment_summary, aes(x = subject, y = enrollment_reason, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "red", high = "maroon") +  # try swapping in other colors!
  labs(title = "Reasons for Enrollment by Subject",
       caption = "Which subjects were the least available at local schools?",
       x = "Subject",
       y = "Enrollment Reason",
       fill = "Count")


Two continuous variables: Scatter plot

Create a basic visualization that examines the relationship between two continuous variables using a scatter plot via geom_point().

For this visualization, we will be guided by the following research question:

❓ Can we predict the grade on a course from the time spent in the course LMS?

👉 Your turn

Take another look at your data:

#look at the data frame
#(add code below)
head(data_to_explore) 
# A tibble: 6 × 34
  student_id subject  semester section total_points_possible total_points_earned
  <chr>      <chr>    <chr>    <chr>                   <dbl>               <dbl>
1 43146      Forensi… S216     02                       1217               1150 
2 44638      Oceanog… S116     01                       1676               1384.
3 47448      Forensi… S216     01                       1232               1116 
4 47979      Oceanog… S216     01                       1833               1493.
5 48797      Physics  S116     01                       2225               1995.
6 51943      Forensi… S216     03                       1222                 70 
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>,
#   date_y <dttm>, date <dttm>

❓ Which variables should we be looking at?

  • The time spent variable is used to measure engagement, while grade variables reflect performance.

Level a. The most basic level for a scatter plot

👉 Your Turn

  • Data: data_to_explore

  • aes(): Two continuous variables

    • Time spent in hours mapped to x position
    • Proportion earned mapped to y position
  • Geom: geom_point()

#(add code below)
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned)) +
  geom_point() +
  labs(
    title = "Relationship Between Time Spent and Proportion Earned",
    x = "Time Spent (Hours)",
    y = "Proportion Earned"
  )


Level b. Add another layer with labels

👉 Your Turn

  • Title: “How Time Spent on Course LMS is Related to Points Earned in the Course”
  • x label: “Time Spent (Hours)”
  • y label: “Proportion of Points Earned”
#(add code below)
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned)) +
  geom_point() +
  labs(
    title = "How Time Spent on Course LMS is Related to Points Earned in the Course",
    x = "Time Spent (Hours)",
    y = "Proportion of Points Earned"
  )


Level c. Add scale with a different color.

❓ Can we notice anything about enrollment status?

We’ll add a scale by telling the aes() function to make the colors of the points correspond to the enrollment_status variable.

👉 Your Turn

Create a scatter plot with color based on enrollment_status.

#(add code below)
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point() +
  labs(
    title = "How Time Spent on Course LMS is Related to Points Earned in the Course",
    x = "Time Spent (Hours)",
    y = "Proportion of Points Earned",
    color = "Enrollment Status"
  ) +
  scale_color_brewer(palette = "Set1")


Level d. Divide up graphs using facet to visualize by subject.

The facet_wrap() function allows you to generate multiple visualizations side-by-side for easier comparison. We’ll use subject as the variable.

👉 Your Turn

Create a scatter plot with facets for each subject using facet_wrap().

#(add code below)
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point() +
  labs(
    title = "Time Spent vs. Proportion Earned by Subject",
    x = "Time Spent (Hours)",
    y = "Proportion of Points Earned",
    color = "Enrollment Status"
  ) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~ subject)


Level e. Remove NAs from plot

  1. Start with data_to_explore.
  2. Pipe into drop_na() to drop rows with missing values in:
    • time_spent_hours
    • proportion_earned
    • enrollment_status
  3. Use your previous ggplot() code from above. Remember that you already piped in data_to_explore, so you can remove it from the first argument of ggplot().
#(add code here)
data_to_explore %>%
  drop_na(time_spent_hours, proportion_earned, enrollment_status) %>%
  ggplot(aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point() +
  labs(
    title = "Time Spent vs. Proportion Earned by Subject",
    x = "Time Spent (Hours)",
    y = "Proportion of Points Earned",
    color = "Enrollment Status"
  ) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~ subject)


Teacher Persona

As Alex explores the data through visualizations and summary statistics, she begins to see trends that could indicate which students are at risk. Her observations guide her to consider changes in her teaching approach or additional support for certain students.

🛑 Stop here! Congratulations, you finished the second part of the case study.

4. MODEL (Module 3)

As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.”

The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.

Simple Correlation

As highlighted by Macfadyen and Dawson (2010), scatter plots are a useful initial approach for identifying potential correlational trends between the variables under investigation. To further interrogate the significance of selected variables as indicators of student achievement, however, we can conduct a simple correlation analysis of each variable with students’ final grades.

There are two efficient ways to create correlation matrices, one that is best for internal use, and one that is best for inclusion in a manuscript.

correlate()

The corrr package provides a tidyverse-friendly way to create a correlation matrix. As with the skimr package, it can take as little as a single line of code. If you’re unfamiliar with the term, a correlation matrix is a table that presents how each variable is related to every other variable.

👉 Your Turn

First, load the corrr package using the correct function. You may need to run install.packages() in the console if this is your first time using the package.

# load in corrr library
#(add code below)
install.packages("corrr")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
(as 'lib' is unspecified)
library(corrr)

Attaching package: 'corrr'
The following object is masked from 'package:skimr':

    focus

Look and see if there is a simple correlation between time spent in hours and the proportion of points earned:

1. Use data_to_explore

2. select(): time_spent_hours and proportion_earned

3. Use correlate()

#(add code below)
data_to_explore %>%
  select(time_spent_hours, proportion_earned) %>%
  correlate()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 × 3
  term              time_spent_hours proportion_earned
  <chr>                        <dbl>             <dbl>
1 time_spent_hours            NA                 0.438
2 proportion_earned            0.438            NA    

fashion()

For printing purposes, the fashion() function can be added to convert the correlation matrix into cleanly formatted output, with leading zeros removed, spaces for signs, and the diagonal (or any NA) left blank.

#add fashion function
data_to_explore %>% 
  select(proportion_earned, time_spent_hours) %>% 
  correlate() %>% 
  rearrange() %>%
  shave() %>%
  fashion()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
               term proportion_earned time_spent_hours
1 proportion_earned                                   
2  time_spent_hours               .44                 

👉 Your Turn⤵

❓ What could we write up for a manuscript in APA format or another format? Offer a short summary below:

  • Results indicated a moderate positive correlation between time spent (in hours) and proportion earned, r = .44, suggesting that students who spent more time engaging with the course content tended to earn higher scores.

❓ What other variables would you like to check out?

  • The variable tv can help explore the relationship between media consumption and academic performance. Additionally, variables related to discussion participation may serve as indicators of social engagement, offering insights into its potential role as a predictor of academic success. And the variable enrollment_status allows for comparison of engagement patterns between active and withdrawn students.

Take some of those variables and explore them using correlate():

#(add code below)
data_to_explore %>%
  select(tv, proportion_earned) %>%
  correlate()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 × 3
  term                   tv proportion_earned
  <chr>               <dbl>             <dbl>
1 tv                NA                 0.0918
2 proportion_earned  0.0918           NA     
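
To follow up on the enrollment_status idea, one possible sketch uses base R's cor() with dplyr's group_by() (rather than correlate()) to compare the time-on-task correlation across groups:

# compare the time-on-task/grade correlation across enrollment statuses
data_to_explore %>%
  drop_na(time_spent_hours, proportion_earned, enrollment_status) %>%
  group_by(enrollment_status) %>%
  summarize(r = cor(time_spent_hours, proportion_earned))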

APA Formatting

While corrr is a nice package to quickly create a correlation matrix, you may wish to create one that is ready to be added directly to a dissertation or journal article. apaTables is great for creating more formal forms of output that can be added directly to an APA-formatted manuscript. It also has functionality for regression and other types of model output.

However, apaTables is not as friendly to tidyverse functions. First, we need to select only the variables we wish to correlate. Then, we can use that subset of the variables as the argument to the apa.cor.table() function.

👉 Your Turn

Run the following code to create a subset of the larger data_to_explore data frame with the variables you wish to correlate, then create a correlation table using apa.cor.table().

  1. Load apaTables library.
  2. Save your selected variables for comparison to a new object data_to_explore_subset.
  3. Run apa.cor.table().
#load apatables library
#(add code below)
library(apaTables)

#save a new subset for comparison to the new object
#(add code below)
data_to_explore_subset <- data_to_explore %>%
  select(proportion_earned, time_spent_hours, tv)

#use apa.cor.table() on that subset
#(add code below)
apa.cor.table(data_to_explore_subset)


Means, standard deviations, and correlations with confidence intervals
 

  Variable             M     SD    1          2         
  1. proportion_earned 76.17 25.55                      
                                                        
  2. time_spent_hours  30.00 22.58 .44**                
                                   [.37, .50]           
                                                        
  3. tv                4.07  0.59  .09*       .10*      
                                   [.01, .17] [.02, .18]
                                                        

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.
 

This may look nice, but how do we add it to a dissertation or article that you might be interested in publishing?

Read the documentation for apa.cor.table() by running ?apa.cor.table() in the console. Look through the documentation and examples to understand how to output a file with the formatted correlation table, and then run the code to do that with your subset data_to_explore_subset.

?apa.cor.table

apa.cor.table(data_to_explore_subset, filename = "cor-table.doc")


Means, standard deviations, and correlations with confidence intervals
 

  Variable             M     SD    1          2         
  1. proportion_earned 76.17 25.55                      
                                                        
  2. time_spent_hours  30.00 22.58 .44**                
                                   [.37, .50]           
                                                        
  3. tv                4.07  0.59  .09*       .10*      
                                   [.01, .17] [.02, .18]
                                                        

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.
 

You should now see a new Word document in your project folder called cor-table.doc. Click on it and you’ll be prompted to download it from your browser.

Linear Regression

In brief, a linear regression model involves estimating the relationships between one or more independent variables with one dependent variable. Mathematically, it can be written like the following.

\[ \operatorname{dependentvar} = \beta_{0} + \beta_{1}(\operatorname{independentvar}) + \epsilon \]

❓ Does time spent predict grade earned?

👉 Your Turn

Use lm() to estimate a model in which proportion_earned is the dependent variable, predicted by two independent variables: time_spent_hours and int (students’ interest in science).

#(add code below)
lm(proportion_earned ~ time_spent_hours + int, data = data_to_explore)

Call:
lm(formula = proportion_earned ~ time_spent_hours + int, data = data_to_explore)

Coefficients:
     (Intercept)  time_spent_hours               int  
         44.9657            0.4255            4.6283  

We can see that the intercept is estimated at about 45, which tells us that when students’ time spent and interest are both zero, they are likely to fail the course (unsurprisingly). Note that the estimate for interest in science is 4.6, so for every one-unit increase in int, we should expect about a 4.6 percentage point increase in students’ grades.

We can save the output of the function to an object. Let’s call it m1 for “Model 1.” We can then use the summary() function built into R to view a much more feature-rich summary of the estimated model.

# save the model
m1 <- lm(proportion_earned ~ time_spent_hours + int, data = data_to_explore)

👉 Your Turn

Run a summary for the model you just created, called m1.

#run the summary
#(add code below)
summary(m1)

Call:
lm(formula = proportion_earned ~ time_spent_hours + int, data = data_to_explore)

Residuals:
    Min      1Q  Median      3Q     Max 
-66.705  -7.836   5.049  14.695  35.766 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       44.9657     6.6488   6.763 3.54e-11 ***
time_spent_hours   0.4255     0.0410  10.378  < 2e-16 ***
int                4.6283     1.5364   3.012  0.00271 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.42 on 536 degrees of freedom
  (177 observations deleted due to missingness)
Multiple R-squared:  0.1859,    Adjusted R-squared:  0.1828 
F-statistic: 61.18 on 2 and 536 DF,  p-value: < 2.2e-16

Let’s save this as a nice APA table for possible publication.

# use the {apaTables} package to create a nice regression table that could be used for later publication.
apa.reg.table(m1, filename = "lm-table.doc")


Regression results using proportion_earned as the criterion
 

        Predictor       b       b_95%_CI beta  beta_95%_CI sr2  sr2_95%_CI
      (Intercept) 44.97** [31.90, 58.03]                                  
 time_spent_hours  0.43**   [0.34, 0.51] 0.41 [0.33, 0.48] .16  [.11, .22]
              int  4.63**   [1.61, 7.65] 0.12 [0.04, 0.19] .01 [-.00, .03]
                                                                          
                                                                          
                                                                          
     r             Fit
                      
 .41**                
 .15**                
           R2 = .186**
       95% CI[.13,.24]
                      

Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights. 
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.
 

Teacher Persona

By creating simple models, Alex hopes to predict student outcomes more accurately. She is interested in how variables like time spent on tasks correlate with student grades and uses this information to adjust her instructional strategies.

Summarize predictors

The summarize() function from the dplyr package creates summary statistics such as the mean, standard deviation, or the minimum or maximum of a value. At its core, think of summarize() as a function that returns a single statistic summarizing a single column.

👉 Your Turn

In the space below, find the mean int of students using summarize(), removing any NAs.

#(add code below)
data_to_explore %>%
  summarize(mean_int = mean(int, na.rm = TRUE))
# A tibble: 1 × 1
  mean_int
     <dbl>
1     4.30

The mean value for interest is quite high. If we multiply the estimated coefficient for interest (4.63) by the mean interest across all of the students (4.30), we can determine that interest contributes 4.63 x 4.30, or about 19.9 percentage points, to students’ estimated final grade.

👉 Your Turn

Do the same for time_spent_hours by finding the mean and removing any NAs using summarize().

#(add code below)
data_to_explore %>%
  summarize(mean_time_spent = mean(time_spent_hours, na.rm = TRUE))
# A tibble: 1 × 1
  mean_time_spent
            <dbl>
1            30.0

For hours spent, multiplying the coefficient (0.43) by the mean time spent (30.0 hours) contributes 0.43 x 30.0, or about 12.8 percentage points.

If we add both 19.9 and 12.8 to the intercept, 44.97, that equals about 77.7, or roughly 78%. In other words, a student with average interest in science who spent an average amount of time in the course earned a pretty average grade.
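
We can sanity-check this back-of-the-envelope arithmetic with predict(), feeding the model the mean of each predictor (a minimal sketch):

# predicted grade for a student with average time spent and average interest
avg_student <- data.frame(
  time_spent_hours = mean(data_to_explore$time_spent_hours, na.rm = TRUE),
  int = mean(data_to_explore$int, na.rm = TRUE)
)
predict(m1, newdata = avg_student)  # roughly 78, matching the hand calculation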

Checking Assumptions

Great! You have now defined your linear model m1 in R, which predicts proportion_earned from time_spent_hours and int.

Let’s go through how to check the assumptions of this linear model using the various diagnostic plots and tests.

We’ll need to check:

  • Linearity and Interaction Effects

  • Residuals Analysis

  • Normality of Residuals

  • Multicollinearity
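
Before walking through each check individually, note that base R’s plot() method for lm objects produces four standard diagnostic panels in one call (an optional shortcut, not a required step):

# optional: view all four standard diagnostic plots at once
par(mfrow = c(2, 2))  # arrange the plots in a 2x2 grid
plot(m1)
par(mfrow = c(1, 1))  # reset the plotting layout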

Linearity and Interaction Effect

Since int enters our model as a second predictor, it’s good to first check whether the linearity assumption holds for each predictor in relation to the dependent variable, and whether interest appears to moderate the relationship between time spent and grades (which would suggest adding a true interaction term).

ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = int)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 118 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
Warning: Removed 118 rows containing missing values or values outside the scale range
(`geom_point()`).

This plot helps visualize whether the relationship between time spent and the dependent variable appears to vary with students’ interest.
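
If the dropped-colour warning bothers you, one workaround (a sketch, not part of the original lab) is to bin the continuous int variable into a factor so that geom_smooth() fits a separate line per interest level:

# bin interest into levels so each group gets its own fitted line
data_to_explore %>%
  drop_na(time_spent_hours, proportion_earned, int) %>%
  mutate(int_level = cut(int, breaks = c(0, 3, 4, 5),
                         labels = c("low", "medium", "high"))) %>%
  ggplot(aes(x = time_spent_hours, y = proportion_earned, color = int_level)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE)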

Residuals Analysis

Next, we’ll plot the residuals against the fitted values to check for independence, homoscedasticity, and any unusual patterns.

plot(residuals(m1) ~ fitted(m1))
abline(h = 0, col = "red")

Look for a random dispersion of points. Any pattern or funnel shape indicates issues with homoscedasticity or linearity.

Normality of Residuals

A deviation from the straight line in the Normal Q-Q plot indicates deviations from normality.

qqnorm(residuals(m1))
qqline(residuals(m1))

The Shapiro-Wilk test is another way to test normality.

shapiro.test(residuals(m1))

    Shapiro-Wilk normality test

data:  residuals(m1)
W = 0.88967, p-value < 2.2e-16
  • W statistic: The test statistic W is 0.88967. This value indicates the extent to which the data are normally distributed. W values close to 1 suggest that the data closely follow a normal distribution. In this case, a W value of 0.88967 suggests some deviation from normality.

  • p-value: The p-value is less than 2.2e-16 (a very small number close to zero). In statistical testing, a p-value less than the chosen significance level (typically 0.05) leads to the rejection of the null hypothesis.

Multicollinearity

We can use vif() from the car package to identify multicollinearity. You try it!

👉 Your Turn

  • Load the package car.

  • Use the vif() function for your model m1.

#(add code below)
install.packages("car")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
(as 'lib' is unspecified)
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
vif(m1)
time_spent_hours              int 
        1.006076         1.006076 

Interpretation of VIF Scores

VIF Value   Interpretation
= 1         No correlation between the predictor and the other predictors
1 to 5      Generally indicates a moderate level of multicollinearity
5 to 10     May indicate a problematic amount of multicollinearity, depending on the context and sources
> 10        Signifies high multicollinearity that can severely distort the least squares estimates

Both VIF scores are only slightly above 1, which suggests there is almost no multicollinearity between these predictors. This is a good sign: each predictor provides unique, independent information to the model and is not unduly influenced by the other.

Our model violates the normality assumption. However, we are going to practice the remaining functions as if it had passed all assumptions; the assumption testing above is included so that you know how to do it in the future. In practice, you would need to correct issues like non-normality by exploring data transformations, adding polynomial or interaction terms to the model, or using a type of regression that does not assume normality of residuals.
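
As one illustrative sketch (the particular remedy chosen here is an assumption, not a prescription), you could add a quadratic term for time spent and then re-inspect the residuals:

# refit with a quadratic term for time spent, then re-check normality
m2 <- lm(proportion_earned ~ time_spent_hours + I(time_spent_hours^2) + int,
         data = data_to_explore)
qqnorm(residuals(m2))
qqline(residuals(m2))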

🛑 Stop here! Congratulations, you finished the third part of the case study.

5. COMMUNICATE (Module 4)

For your final 👉 Your Turn, your goal is to distill our analysis into a Quarto “data product” designed to illustrate key findings. Feel free to use the template in the lab 4 folder.

The final step in the workflow/process is sharing the results of your analysis with a wider audience. Krumm et al. (2018) have outlined the following three-step process for communicating findings from an analysis to education stakeholders:

  1. Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”

  2. Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.

  3. Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question and might be used to inform new analyses or a “change idea” for improving student learning.

👉 Your Turn

Create a data story with our current data set, or your own. Make sure to use the LA workflow as your guide to include:

  • Develop a research question

  • Add ggplot visualizations

  • Offer a model (with visualizations)

  • Communicate by writing a short summary for your intended stakeholders. Remember to write it in terms the stakeholders understand.

  • NARRATIVE REPORT:

In an effort to support student success in online science courses, we explored how student engagement, measured by time spent in the course and self-reported motivation, relates to academic performance. Using data from over 700 students enrolled in courses such as Biology, Physics, and Anatomy, we combined three datasets: system log data showing time spent, gradebook data with points earned, and survey responses capturing student interest, perceived value of the subject, and self-confidence. This approach allowed us to understand both what students did and how they felt at the start of the course.

Visual 1. Average Grade by Subject

data_to_explore %>%
  group_by(subject) %>%
  summarise(avg_grade = mean(proportion_earned, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(subject, avg_grade), y = avg_grade, fill = subject)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Grade by Science Subject",
    x = "Subject",
    y = "Average Final Grade (%)"
  ) +
  theme_minimal()

Visual 2. Motivation vs Performance

ggplot(data_to_explore, aes(x = int, y = proportion_earned)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(
    title = "Student Interest in Science vs. Performance",
    x = "Interest Rating",
    y = "Proportion of Points Earned (%)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 176 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 176 rows containing missing values or values outside the scale range
(`geom_point()`).

Our analysis revealed two important patterns. First, students who spent more time engaging with course content tended to earn higher final grades. Second, those who expressed greater interest in science and stronger confidence in their abilities also performed better. A basic regression model confirmed that both time spent and motivational factors were significant predictors of academic success. This shows that tracking both behavioral and psychological engagement can provide early indicators of who may need support in an online learning environment.

Linear Model: Predict Final Grade

model <- lm(proportion_earned ~ time_spent_hours + int + val + percomp, data = data_to_explore)
summary(model)

Call:
lm(formula = proportion_earned ~ time_spent_hours + int + val + 
    percomp, data = data_to_explore)

Residuals:
    Min      1Q  Median      3Q     Max 
-66.164  -7.704   4.727  14.710  36.730 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      48.92459    6.84391   7.149 2.94e-12 ***
time_spent_hours  0.43222    0.04126  10.476  < 2e-16 ***
int               6.45945    2.03113   3.180  0.00156 ** 
val              -4.05761    1.53651  -2.641  0.00852 ** 
percomp           0.94187    1.73846   0.542  0.58820    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.32 on 527 degrees of freedom
  (184 observations deleted due to missingness)
Multiple R-squared:  0.1944,    Adjusted R-squared:  0.1883 
F-statistic: 31.79 on 4 and 527 DF,  p-value: < 2.2e-16
library(broom)
tidy(model) %>%
  ggplot(aes(x = reorder(term, estimate), y = estimate, fill = term)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Predictors of Student Performance", x = "Predictor", y = "Effect Size") +
  theme_minimal()

These insights offer direct value to instructors like Professor Alex Johnson. For example, students with low engagement and low motivation, especially in subjects with lower average performance such as Anatomy, can be identified early in the term. Instructors could then intervene through one-on-one check-ins, encouragement messages, or study support resources. With this evidence in hand, it becomes possible to build a practical early warning system to improve outcomes for students who might otherwise fall behind.


Teacher Persona

Finally, Alex prepares to communicate her findings. She creates a simple web page using Markdown to share her insights with colleagues. This acts as a practical example of how data can inform teaching practices.

🏆 Congratulations, you finished the case study!