Unit 1 Case Study: Predicting Student Achievement

ECI 586 Intro to Learning Analytics

Author

Dr. Joey Huang

Published

September 14, 2025

1. PREPARE

During the final week of each unit, we will complete a “case study” to illustrate how Learning Analytics methods and techniques can be applied to address research questions of interest, create useful data products, and conduct reproducible research. Each case study is structured around a basic research workflow modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm, Means, and Bienkowski 2018):

Figure 2.2 Steps of Data-Intensive Research Workflow

For Unit 1, we will focus on online science classes provided through a state-wide online virtual school and conduct an analysis that helps predict students’ performance in these online courses. This case study is guided by a foundational study in Learning Analytics that illustrates how analyses like these can be used to develop an early warning system for educators to identify students at risk of failing and intervene before that happens.

The Unit 1 case study will cover the following workflow topics:

  1. Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. You’ll also need to become familiar with and load essential packages for analysis.
  2. Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In Part 2 we focus on importing CSV files, tidying and joining our data.
  3. Explore: In Part 3, we use basic data visualization and calculate some summary statistics to explore our data and see what insight it provides in response to our questions.
  4. Model: After identifying variables that may be related to student performance through exploratory analysis, we’ll look at correlations and create some simple models of our data using linear regression.
  5. Communicate: To wrap up our case study, we’ll develop our first “data product” and share our analyses and findings by creating our first web page using RPubs.

1a. Review the Literature

Our Unit 1 Case Study is guided by a well-cited publication from two authors who have made numerous contributions to the field of Learning Analytics over the years. Although this article focuses on “early warning systems” in higher education, where adoption of learning management systems (LMS) like Moodle and Canvas gained a quicker foothold, the study is particularly relevant since COVID-19. Many districts across the country have incorporated an LMS into their remote instruction and have set up virtual academies that are likely to continue post-pandemic. In North Carolina specifically, student disengagement and learning loss have become a particular concern among districts; in response, the NC Department of Public Instruction established the Office of Learning Recovery & Acceleration, which has since transitioned into the Office of Learning & Research.

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & Education, 54(2), 588–599.

Abstract

Earlier studies have suggested that higher education institutions could harness the predictive power of Learning Management System (LMS) data to develop reporting tools that identify at-risk students and allow for more timely pedagogical interventions. This paper confirms and extends this proposition by providing data from an international research project investigating which student online activities accurately predict academic achievement. Analysis of LMS tracking data from a Blackboard Vista-supported course identified 15 variables demonstrating a significant simple correlation with student final grade… Moreover, network analysis of course discussion forums afforded insight into the development of the student learning community by identifying disconnected students, patterns of student-to-student communication, and instructor positioning within the network. This study affirms that pedagogically meaningful information can be extracted from LMS-generated student tracking data, and discusses how these findings are informing the development of a dashboard-like reporting tool for educators that will extract and visualize real-time data on student engagement and likelihood of success.

Data Sources & Analysis

The data analyzed in this exploratory research was extracted from the course-based instructor tracking logs and the BB Vista production server. Data collected on each student included ‘whole term’ counts for frequency of usage of course materials and tools supporting content delivery, engagement and discussion, assessment and administration/management. In addition, tracking data indicating total time spent on certain tool-based activities (assessments, assignments, total time online) offered a total measure of individual student time on task.

The authors used scatter plots to identify potential relationships between variables under investigation, followed by a simple correlation analysis of each variable to further interrogate the significance of selected variables as indicators of student achievement. Finally, a linear multiple regression analysis was conducted to develop a predictive model in which student final grade was the continuous dependent variable.
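
To preview what this three-step approach looks like in R, here is a minimal sketch using a small simulated data set. The variable names and values below are hypothetical, not the authors’ actual data or code:

# simulate a small data set with hypothetical LMS activity and grades
set.seed(586)
lms_data <- data.frame(
  forum_posts = rpois(50, lambda = 10),         # hypothetical activity counts
  final_grade = rnorm(50, mean = 80, sd = 10)   # hypothetical final grades
)

plot(final_grade ~ forum_posts, data = lms_data)          # 1. scatter plot
cor.test(lms_data$forum_posts, lms_data$final_grade)      # 2. simple correlation
summary(lm(final_grade ~ forum_posts, data = lms_data))   # 3. linear regression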

Your Turn

Take a quick scan of Table 1 in the article linked above (and also located in your “lit” folder in the files pane) and, in the space below, answer the following question: Of the 13 LMS variables correlated with student final grade, which 2-3 do you think will be found to significantly predict final grades?

  • I think there are a number of useful predictors for the final grades. I think that Assignments submitted would indicate the completion of required work. Visits to the My Progress Tool may also indicate an awareness of course completion and a desire to track progress. Discussion messages replied to would also indicate interaction with the course, which may correlate with a boost in grades, comprehension, and collaboration.

Now take a quick look at the Discussion and Conclusions sections and answer the following questions: What factors in the model ultimately predicted final grades? How accurate was this model in identifying “at risk” students?

  • The predictors of student activity that proved most useful were the number of forum posts, mail messages sent, and assignments completed. The article says that the tracking variables were quite effective, identifying ‘at risk’ students with 80.9% accuracy.

1b. Define Questions

In this study, exploratory research was undertaken to identify the data variables that would inform the development of a data visualization tool for instructors. This involved the extraction of all LMS tracking variables for selected course sections at The University of British Columbia, Canada. In so doing, the study aimed to address the following research questions:

  1. Which LMS tracking data variables correlate significantly with student achievement?

  2. How accurately can measures of student online activity in an online course site predict student achievement in the course under study?

  3. Can tracking data recording online student communication patterns offer pedagogically meaningful insights into development of a student learning community?

For our case study, we’ll adopt questions 1 & 2 to guide our exploratory analysis and modeling, with a special emphasis on time spent in the LMS. We’ll also use analytical approaches and data similar to those used by Macfadyen and Dawson (2010) to better understand how LMS, gradebook, and survey data might be predictive of student performance.

1c. Load Libraries

As noted in our Getting Started activity and tutorials, R uses “packages,” or add-ons, that enhance its functionality. One of our first steps in any workflow is to load the packages necessary for data wrangling, analysis, and reporting. We’ll load the familiar {tidyverse} package in this section and introduce new packages and corresponding functions throughout the case study.

tidyverse 📦

Recall from earlier tutorials that the {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures. These shared features are sometimes called “tidy data principles.”

Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library introduced in our Getting Started case study.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Don’t worry if you saw a number of messages: they most likely mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in the packages you loaded have the same name as functions in other packages, and R will default to the function from the last loaded package unless you specify otherwise.
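
If you ever need the masked version of a function, you can call it explicitly by prefixing it with its package name and the :: operator. For example, using two of R’s built-in data sets:

# dplyr::filter() now masks stats::filter(), so plain filter() is dplyr's version
stats::filter(presidents, rep(1/3, 3))   # moving average on a built-in time series
dplyr::filter(mtcars, mpg > 30)          # dplyr's filter on a built-in data frame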

2. WRANGLE

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham, Çetinkaya-Rundel, and Grolemund 2023). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm, Means, and Bienkowski 2018) and often comprises a large share of the time spent on data-intensive research workflows. In Part 2, we focus on the following workflow processes:

  1. Import Data. In this section, we introduce the read_csv() function for working with CSV files and revisit some key functions for inspecting our data.

  2. Tidy Data. We introduce the separate() and clean_names() functions for getting our data nice and tidy, and revisit the mutate() function for creating new variables and feature engineering.

  3. Join Data. We conclude our data wrangling by introducing join() functions for merging our processed files into a single data frame for analysis.

a. Import Data

Education data are stored in all sorts of different file formats and structures. In this course, we’ll discuss several of these common formats, how to import your data into R, and how to transform your data into other data structures, such as the network objects required for social network analysis in Unit 3. In this case study, we’ll focus on working with comma-separated values (CSV) files.

Similar to spreadsheet formats like Excel and Google Sheets, CSVs allow us to store rectangular data frames, but in a much simpler plain-text format, where all the important information in the file is represented by text. Note that “text” here refers to numbers, letters, and symbols you can type on your keyboard. In Tidyverse Skills for Data Science, Wright et al. (2021) note that the advantage of .csv files is that:

… there are no workbooks or metadata making it difficult to open these files. CSVs are flexible files and are thus the preferred storage method for tabular data for many data scientists.
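
To make this concrete, here is a small string of CSV text in the same shape as the log data we are about to import (the time values here are made up for illustration). The {readr} package, which we meet below, can even read a literal string of CSV text when it is wrapped in I():

# read a literal string of CSV text (hypothetical values, for illustration only)
read_csv(I("student_id,course_id,time_spent
60186,AnPhA-S116-01,2087
66693,AnPhA-S116-01,1826"))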

Data Source #1: Log Data

Log-trace data is data generated from our interactions with digital technologies, such as archived data from social media postings. In education, an increasingly common source of log-trace data is that generated from interactions with LMS and other digital learning tools.

The data we will use in this case study has already been “wrangled” quite a bit and is a summary of log-trace data: the number of minutes students spent on the course. While this data type is fairly straightforward, there are far more complex sources of log-trace data out there (e.g., time stamps associated with when students started and stopped accessing the course).

To help us import our data, we’ll be using the {readr} package, which provides a “fast and friendly way” to read rectangular data stored in plain-text file formats like csv, tsv, and fwf. If you are new to {readr}, I highly recommend the data import chapter in R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023). Note that we don’t need to load the {readr} package because it was already loaded as part of the tidyverse package we called earlier.

Let’s use the read_csv() function from {readr} to import our log-data.csv file directly from our data folder and name this data set time_spent, which will now be saved in our R environment:

time_spent <- read_csv("data/log-data.csv")
Rows: 716 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): course_id, gender, enrollment_reason, enrollment_status
dbl (2): student_id, time_spent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

There are two quick things to note. First, importing our data and saving it in our environment does not alter the original file saved in our data folder, so we don’t have to worry about causing any harm to the original file as we wrangle and explore our data!

Second, the message in the output indicated that four of the columns were specified as “character” data types and two of the columns, student_id and time_spent, as double, or numeric, types.
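
As the message suggests, you can quiet it by setting show_col_types = FALSE, or by supplying the column types yourself. Here is a quick sketch of both options (the explicit specification assumes the column names shown above):

# option 1: quiet the column specification message
time_spent <- read_csv("data/log-data.csv", show_col_types = FALSE)

# option 2: state the column types explicitly, with a default for the rest
time_spent <- read_csv("data/log-data.csv",
                       col_types = cols(student_id = col_double(),
                                        time_spent = col_double(),
                                        .default = col_character()))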

Your Turn

Let’s take a look at this data in a couple of ways. First, type time_spent into the console (the pane below this one) and then hit return/enter. You should see a printed summary of this data frame.

Next, go to your files pane (bottom right corner of R Studio), click on the data folder then click on the file named “log-data.csv” and in the pop-up click “View File.”

What do you notice about this data? What questions do you have? Add a couple of notes (or more—you can type return/enter after a bullet point to add another) on your observations and/or questions here:

  • There are 716 observations across 6 variables, which means that 716 unique students’ data is present. Glancing at the data, I have some immediate questions: How many different enrollment_reasons were available? Is the time spent in minutes or seconds (I assume minutes)? How many different courses are captured in this data? Future EDA could look at all sorts of relationships, like gender and enrollment reason, average time_spent in each course, etc.

Data Source #2: Academic Achievement Data

Academic achievement data is (obviously) a very common form of data in education. In this case study, we’ll use both the sum of the points students earned and the number of points possible to compute the percentage of points they earned in the course—a measure comparable (though likely a little different, depending on teachers’ grading policies) to their final grade.

We’ll load the data in the same way as earlier but take a quick peek by including the name of our data frame in the code chunk as well:

gradebook <- read_csv("data/gradebook-summary.csv")
Rows: 717 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): course_id
dbl (3): student_id, total_points_possible, total_points_earned

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gradebook
# A tibble: 717 × 4
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      43146 FrScA-S216-02                  1217               1150 
 2      44638 OcnA-S116-01                   1676               1384.
 3      47448 FrScA-S216-01                  1232               1116 
 4      47979 OcnA-S216-01                   1833               1493.
 5      48797 PhysA-S116-01                  2225               1995.
 6      51943 FrScA-S216-03                  1222                 70 
 7      52326 AnPhA-S216-01                  1775               1519.
 8      52446 PhysA-S116-01                  2225               2198 
 9      53447 FrScA-S116-01                  1212               1173 
10      53475 FrScA-S116-02                  1212                  0 
# ℹ 707 more rows

Data Source #3: Self-Report Survey

The third data source is a self-report survey. This was data collected before the start of the course. The survey included ten items, each corresponding to one of three motivation measures: interest, utility value, and perceived competence. The three motivation measures we explore here come from Expectancy-Value Theory, which states that students are motivated to learn when they both believe that they can achieve something (expectancy, also known as “perceived competence”) and believe that the concept they are trying to learn is important (value) (Wigfield and Eccles 2000). For more information about this survey, including the specific items included, see Chapter 7 of Data Science in Education Using R (Estrellado et al. 2020).

Your Turn

In the code chunk below, read the survey.csv file located in the data folder into R. You can use the code above as a template. Assign the output from the read_csv() function to a new object named survey.

Hint: By asking you to “assign” the output from the read_csv() function to the name survey, consider how in the code chunk above this “Your Turn” we assigned the output from the read_csv() function to the name gradebook.

# YOUR CODE HERE
survey <- read_csv("data/survey.csv")
Rows: 662 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): course_id, subject, semester
dbl  (20): student_id, section, int, val, percomp, tv, q1, q2, q3, q4, q5, q...
dttm  (3): date.x, date.y, date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
survey
# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
        <dbl> <chr>     <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1      43146 FrScA-S2… FrScA   S216           2   4.2  3.67     4    3.86     4
 2      44638 OcnA-S11… OcnA    S116           1   4    3        3    3.57     4
 3      47448 FrScA-S2… FrScA   S216           1   4.2  3        3    3.71     5
 4      47979 OcnA-S21… OcnA    S216           1   4    3.67     2.5  3.86     4
 5      48797 PhysA-S1… PhysA   S116           1   3.8  3.67     3.5  3.71     4
 6      51943 FrScA-S2… FrScA   S216           3   3.8  3.67     3.5  3.71     4
 7      52326 AnPhA-S2… AnPhA   S216           1   3.6  4        3    4        4
 8      52446 PhysA-S1… PhysA   S116           1   4.2  3.67     3    4        4
 9      53447 FrScA-S1… FrScA   S116           1   3.8  2        3    3        5
10      53475 FrScA-S2… FrScA   S216           1   4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

After reading in the data, let’s continue the practice of looking at our data. Type survey into the console to take a look at the data, or add survey on a new line in your code chunk above. Does it appear to be the correct file? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peek at the data helps us begin to formulate answers to these questions and is an important step in any data analysis.

Add one or more of the things you notice or wonder about the data here:

  • This dataset has 662 observations (rows) and 26 variables (columns). It includes chr, dbl, and a new type, dttm, which holds date-time information.
  • We could use this information coupled with the log data, joining on student_id, to see perception information against time spent in the course, and then with the gradebook data to see whether positive perceptions at the outset and time spent correlate with higher grades.

RStudio Tip: If you happen to run into issues with data import, RStudio has a handy “Import Dataset” feature for a point-and-click approach to adding data to your environment. If you want to give this a try, be sure to pay attention to the default settings and the name it will give your data frame when imported.
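
Under the hood, that feature simply writes and runs {readr} code for you; the generated code looks roughly like this (the exact call depends on the options you choose):

# approximately what the Import Dataset feature generates
library(readr)
survey <- read_csv("data/survey.csv")
View(survey)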

View Data

Once your data is in R, there are many different ways you can view it. Give each of the following a try:

# enter the name of your data frame and view directly in the console or a code chunk
survey
# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
        <dbl> <chr>     <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1      43146 FrScA-S2… FrScA   S216           2   4.2  3.67     4    3.86     4
 2      44638 OcnA-S11… OcnA    S116           1   4    3        3    3.57     4
 3      47448 FrScA-S2… FrScA   S216           1   4.2  3        3    3.71     5
 4      47979 OcnA-S21… OcnA    S216           1   4    3.67     2.5  3.86     4
 5      48797 PhysA-S1… PhysA   S116           1   3.8  3.67     3.5  3.71     4
 6      51943 FrScA-S2… FrScA   S216           3   3.8  3.67     3.5  3.71     4
 7      52326 AnPhA-S2… AnPhA   S216           1   3.6  4        3    4        4
 8      52446 PhysA-S1… PhysA   S116           1   4.2  3.67     3    4        4
 9      53447 FrScA-S1… FrScA   S116           1   3.8  2        3    3        5
10      53475 FrScA-S2… FrScA   S216           1   4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>
# view your data frame transposed so you can see every column and the first few entries
glimpse(survey) 
Rows: 662
Columns: 26
$ student_id   <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326, 52446, 5…
$ course_id    <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01", "OcnA-S…
$ subject      <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrScA", "AnPh…
$ semester     <chr> "S216", "S116", "S216", "S216", "S116", "S216", "S216", "…
$ section      <dbl> 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, …
$ int          <dbl> 4.2, 4.0, 4.2, 4.0, 3.8, 3.8, 3.6, 4.2, 3.8, 4.8, 4.6, 3.…
$ val          <dbl> 3.666667, 3.000000, 3.000000, 3.666667, 3.666667, 3.66666…
$ percomp      <dbl> 4.0, 3.0, 3.0, 2.5, 3.5, 3.5, 3.0, 3.0, 3.0, 4.0, 4.0, 3.…
$ tv           <dbl> 3.857143, 3.571429, 3.714286, 3.857143, 3.714286, 3.71428…
$ q1           <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 5, 5, 4, 3, 4, 4, 5, 4, 4, 5, 5, …
$ q2           <dbl> 4, 2, 3, 3, 4, 4, 4, 4, 2, 4, 5, 1, 3, 2, 5, 4, 4, 4, 3, …
$ q3           <dbl> 4, 2, 3, 2, 3, 3, 4, 3, 3, 4, 4, 3, 3, 2, 4, 3, 3, 4, 4, …
$ q4           <dbl> 5, 4, 4, 4, 4, 4, 2, 4, 4, 5, 4, 4, 4, 5, 5, 4, 4, 5, 5, …
$ q5           <dbl> 4, 4, 4, 4, 4, 3, 4, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, …
$ q6           <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 2, 3, 5, 3, 3, 3, 5, 4, 4, 4, 5, …
$ q7           <dbl> 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 5, 4, …
$ q8           <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 4, 5, 5, 3, 4, 5, 5, 4, 5, 5, 5, …
$ q9           <dbl> 3, 3, 3, 4, 3, 4, 4, 3, 2, 3, 4, 2, 3, 1, 5, 3, 3, 4, 3, …
$ q10          <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 2, 4, 5, 3, 4, 3, 5, 4, 5, 5, 5, …
$ date.x       <dttm> 2016-02-02 18:44:00, 2015-09-09 13:41:00, 2016-01-28 14:…
$ post_int     <dbl> NA, NA, NA, NA, NA, NA, NA, 3.50, 3.75, NA, 5.00, NA, NA,…
$ post_uv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.666667, 2.000000, NA, 4.666…
$ post_tv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.571429, 3.000000, NA, 4.857…
$ post_percomp <dbl> NA, NA, NA, NA, NA, NA, NA, 3.5, 3.0, NA, 4.0, NA, NA, 3.…
$ date.y       <dttm> NA, NA, NA, NA, NA, NA, NA, 2016-01-02 00:41:00, 2015-10…
$ date         <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# look at just the first six entries
head(survey) 
# A tibble: 6 × 26
  student_id course_id  subject semester section   int   val percomp    tv    q1
       <dbl> <chr>      <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
1      43146 FrScA-S21… FrScA   S216           2   4.2  3.67     4    3.86     4
2      44638 OcnA-S116… OcnA    S116           1   4    3        3    3.57     4
3      47448 FrScA-S21… FrScA   S216           1   4.2  3        3    3.71     5
4      47979 OcnA-S216… OcnA    S216           1   4    3.67     2.5  3.86     4
5      48797 PhysA-S11… PhysA   S116           1   3.8  3.67     3.5  3.71     4
6      51943 FrScA-S21… FrScA   S216           3   3.8  3.67     3.5  3.71     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>
# or the last six entries
tail(survey) 
# A tibble: 6 × 26
  student_id course_id  subject semester section   int   val percomp    tv    q1
       <dbl> <chr>      <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
1         19 AnPhA-S21… AnPhA   S217           2   4.2  5        5    4.5      5
2         42 FrScA-S21… FrScA   S217           1   4    4        4    4        4
3         52 FrScA-S21… FrScA   S217           3   4.4  2.67     3.5  3.75     4
4         57 FrScA-S21… FrScA   S217           1   4.4  2.33     2.5  3.62     5
5         72 FrScA-S21… FrScA   S217           1   5    3        4    4.25     5
6         80 FrScA-S21… FrScA   S217           1   3.6  2.33     3    3.12     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>
# view the names of your variables or columns
names(survey) 
 [1] "student_id"   "course_id"    "subject"      "semester"     "section"     
 [6] "int"          "val"          "percomp"      "tv"           "q1"          
[11] "q2"           "q3"           "q4"           "q5"           "q6"          
[16] "q7"           "q8"           "q9"           "q10"          "date.x"      
[21] "post_int"     "post_uv"      "post_tv"      "post_percomp" "date.y"      
[26] "date"        
# or view in source pane
View(survey)

Yes, the “V” is capitalized—very unusual for an R function (the tidyverse does have its own lowercase view() function if that bothers you, though). Because this function is a bit atypical in more ways than one, I have two recommendations concerning its use:

  • Use it strictly in the console. Because it opens a new viewing window, including it in an R Markdown or Quarto document can cause issues when “knitting” or “rendering” an HTML (or PDF) file. Hence I have included the eval: false option in the code chunk so it does not run when you render your document (see the example after this list).

  • Close the viewer window that opens once you have viewed the data. Keeping it open can clutter your work space a bit and can lead to confusion about what data frame it was you viewed.
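
In a Quarto document, that option is written as a special #| comment at the very top of the code chunk, like this:

#| eval: false
View(survey)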

b. Tidy Data

Tidy data refers to a consistent way to organize your data in R. In our course text, R for Data Science, Hadley Wickham notes that getting your data into this format requires some work, but that work pays off in the long term:

Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

Not surprisingly, the tidyverse set of packages, including packages like dplyr, adheres to “tidy” data principles. Tidy data has three interrelated rules that make a dataset tidy:

  1. Each variable is a column

  2. Each observation is a row

  3. Each type of observational unit is a table
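
For example, consider the small data frame below, in which each year’s percentage is stored in its own column (the countries and values here are hypothetical):

# a hypothetical "untidy" data frame: years are spread across columns,
# and the cell values are percentages
untidy_df <- tibble(country = c("Brazil", "China", "Kenya"),
                    `1999` = c(37.4, 21.2, 15.6),
                    `2000` = c(40.5, 23.8, 17.1))

untidy_df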

Your Turn

Why would this data be considered “untidy”? Hint: Look at the column names.

  • I think because it’s all mixed up: the year columns are not really variables, but values of a single year variable.
  • The variables would be better described as country and year, with the observations being the percent.

Process Log Data

Earlier, we imported time_spent, which contains information on the number of minutes that students spent on the course, as well as other variables, particularly course_id.

Information about the course subject, semester, and section is not stored in a “tidy” format, but rather in a single variable—course_id. This format of data storage is not ideal. If we instead give each piece of information its own column, we’ll have more opportunities for later analysis. We’ll use a function called separate() to do this.

First, let’s practice with a small data set. We’ll create it directly in R; run the code below to do that (and to assign the name df to the dataset).

df <- tibble(course_var = c("Fall - Chemistry", 
                            "Fall - Earth Science", 
                            "Spring - Forensic Science",
                            "Spring - Earth Science",
                            "Spring - Biology"))

df
# A tibble: 5 × 1
  course_var               
  <chr>                    
1 Fall - Chemistry         
2 Fall - Earth Science     
3 Spring - Forensic Science
4 Spring - Earth Science   
5 Spring - Biology         

Print df to the console. You should see a single variable, course_var, with five rows.

In this (very small) data frame, the information about both the semester and the course are encoded within the same variable. The separate() function has two primary arguments, one each for:

  1. the variable you want to separate
  2. the names of the new variables to create

Below, see how using course_var for #1 and c("semester", "course") for #2 separates the semester and course data into two separate variables:

df |> 
  separate(course_var, c("semester", "course"))
Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].
# A tibble: 5 × 2
  semester course   
  <chr>    <chr>    
1 Fall     Chemistry
2 Fall     Earth    
3 Spring   Forensic 
4 Spring   Earth    
5 Spring   Biology  

Note that we used the |> operator, called a pipe, which was introduced in our Getting Started case study. Recall that pipes, including the original %>% pipe operator, are a powerful tool for combining a sequence of functions or processes. Here we used it to send our df data frame to the separate() function. Also note the warning above: by default, separate() splits on any non-alphanumeric character, so two-word course names like “Earth Science” lose their second word. That is fine for this practice example, but it is worth watching for with real data.

We could also have written the function as follows, with our data frame as the first argument in the function:

separate(df, course_var, c("semester", "course"))
Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].
# A tibble: 5 × 2
  semester course   
  <chr>    <chr>    
1 Fall     Chemistry
2 Fall     Earth    
3 Spring   Forensic 
4 Spring   Earth    
5 Spring   Biology  

All of the {dplyr} functions and many {tidyverse} functions like separate() take a data frame (or tibble) as the first argument. However, it can be helpful for general code readability to separate (no pun intended) your data frame from the functions by starting your first line of code with the data frame to be wrangled. This is a pretty standard convention when writing R code.

Next, let’s try something slightly different. Here, we have a data frame with a variable that encodes three pieces of information within the same variable: the year, semester, and subject. There are also a few other differences.

df2 <- tibble(course_variable = c("19-Fall-Algebra1", 
                                  "20-Fall-Algebra2", 
                                  "20-Spring-Algebra2",
                                  "20-Spring-Algebra2",
                                  "21-Fall-Algebra1"))
df2
# A tibble: 5 × 1
  course_variable   
  <chr>             
1 19-Fall-Algebra1  
2 20-Fall-Algebra2  
3 20-Spring-Algebra2
4 20-Spring-Algebra2
5 21-Fall-Algebra1  

Your Turn

Can you separate the variable in the above data frame not into two, but rather three, new variables?

# YOUR CODE HERE
separate(df2, course_variable, c("year", "semester", "subject"))
# A tibble: 5 × 3
  year  semester subject 
  <chr> <chr>    <chr>   
1 19    Fall     Algebra1
2 20    Fall     Algebra2
3 20    Spring   Algebra2
4 20    Spring   Algebra2
5 21    Fall     Algebra1

Hint: Try to modify the code from above (in which you separated course_var into two variables) based on a) the name of the variable in df2 and b) adding the name for the third new variable you wish to create.

Let’s return now to our time_spent data frame. It is often helpful to take a look at the data before writing code.

Below, we take time_spent and run the separate() function on the course_id variable to split up the subject, semester, and section so we can use them later on. In other words, whereas above we separated the variable course_variable, in the data set we’ll use here we’ll separate the course_id variable.

time_spent |>
  separate(course_id, c("subject", "semester", "section"))
# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>

Your Turn

There is one last key step – one that is likely to be a bit disorienting at first. Once we’ve processed the data how we would like, we have to assign, or save, the results back to the name for the data with which we have been working. This is done with the assignment operator, or the <- symbol.

Copy the code you successfully ran in the chunk above to follow the assignment operator in the chunk below. In other words, write the code you wrote above, but assign the output back to time_spent so that it now includes the course id separated into three distinct variables.

time_spent <- separate(time_spent, course_id, c("subject", "semester", "section"))

Note: It would have been perfectly acceptable, and in some cases preferable, to assign to a new name such as time_spent_tidy. However, to help keep our environment pane from becoming too cluttered and avoid confusion, we will reuse the same object name.

We have made a habit of continually looking at our data after running code to ensure that the step worked as intended. Type the name of the data we have been working with in the code chunk below to ensure that the course_id variable has been separated into three variables that correspond to the subject, semester, and section.

# YOUR CODE HERE
time_spent
# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>

If those look good, let’s proceed to the next step. If something doesn’t look right, consider re-running the code chunks above, perhaps returning all the way to the first code chunk that you ran (to load the data) to ensure that the output is as you intended for it to be.

“Mutate” a column to change the time spent variable to represent hours

In the above code, you used separate() to create new variables based on an existing variable. While that function solves a specific problem (when there are effectively multiple variables combined in one), its use represents a fairly common pattern: you use a function to solve a problem, figure out how it works, check the output, then assign the output back to the name of the data frame, after which you can proceed to the next step.

There are a lot of other functions like separate() that help you solve specific problems; we’ll introduce many throughout the course and will point you to resources that describe many more.

There are also functions that serve as general-purpose tools for solving many problems; one of the most useful is mutate(), a function for creating new variables in a data set. Specifically, we’ll use mutate() to create a new variable for the percentage of points each student earned. Keep in mind as you work through these steps how many parts of wrangling data involve either changing a variable or creating a new one; for these purposes, mutate() can be very helpful.

Let’s begin again with a small data set with two variables, var_a and var_b. Run the chunk below.

df3 <- tibble(var_a = c(30, 50, 30, 10, 30, 40, 40, 30, 20, 50),
              var_b = c(100, 90, 60, 70, 60, 80, 70, 50, 30, 20)
              )

Next, print df3 to the console. You should see two numeric variables.

Now imagine they represent points that students earned on a 50-point quiz and a 100-point test, respectively. There are a lot of things that you might wish to do with these variables. For instance, you may wish to create a new variable named points_sum that is the sum of var_a and var_b. The code below does this. Give it a try:

df3 |> 
  mutate(points_sum = var_a + var_b)
# A tibble: 10 × 3
   var_a var_b points_sum
   <dbl> <dbl>      <dbl>
 1    30   100        130
 2    50    90        140
 3    30    60         90
 4    10    70         80
 5    30    60         90
 6    40    80        120
 7    40    70        110
 8    30    50         80
 9    20    30         50
10    50    20         70

Your Turn

We can chain multiple mutate() calls together. Below, complete the following code to create a new variable (let’s call it points_proportion) that represents the proportion of the total points students could, potentially, earn. To do this, you can divide points_sum, created in the second line of code, by the maximum possible points—150.

df3 |> 
  mutate(points_sum = var_a + var_b) |> 
  mutate(points_proportion = points_sum/150)
# A tibble: 10 × 4
   var_a var_b points_sum points_proportion
   <dbl> <dbl>      <dbl>             <dbl>
 1    30   100        130             0.867
 2    50    90        140             0.933
 3    30    60         90             0.6  
 4    10    70         80             0.533
 5    30    60         90             0.6  
 6    40    80        120             0.8  
 7    40    70        110             0.733
 8    30    50         80             0.533
 9    20    30         50             0.333
10    50    20         70             0.467

Hint: Just like you can use the + symbol to add variables together, you can use the / symbol to divide a variable by another – or by a value, like 150!

You should see output that contains four variables, one each for var_a and var_b, points_sum, which represents the sum of the points students earned, and points_proportion, which represents the percentage of the total points students earned.

Your Turn

Let’s now process the time_spent variable. Specifically, this variable currently represents the number of minutes that students spent on the course LMS. Below, complete the code template to create a new variable, time_spent_hours, that represents the number of hours that students spent on the course LMS.

time_spent |> 
  mutate(time_spent_hours = time_spent/60)
# A tibble: 716 × 9
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 3 more variables: enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>

Hint: Refer to the code you wrote above, being clear about a) what the name of the new variable you are creating is and b) how you will create this variable using division (by the number of minutes in an hour). Also, it’s a bit confusing and poor naming on my part, but you will need to use the time_spent variable in our time_spent data frame.

We used the above as a test bed to ensure that our code worked as intended. Once we are confident that we are creating the variable in the way we intend, we can assign the output back to the data frame that time_spent refers to. Paste your working code here to save the new time_spent_hours variable to the time_spent data frame:

time_spent <- time_spent |> 
  mutate(time_spent_hours = time_spent/60)

Good work wrangling this dataset!

Process Gradebook Data

Now let’s process the gradebook data. In particular, we’ll separate the course_id variable in the same way we separated that variable in the log data, and we’ll also calculate a new variable representing the proportion of points students earned (out of the points possible to earn).

Let’s start with separating the course_id variable. Run the code in the next chunk to do this. If you named the three parts of the course ID differently than they’re named below, be sure that these three variables are named identically; this is the key (literally) to these variables joining correctly in the next section.

gradebook <- gradebook |> 
  separate(course_id,
           c("subject", "semester", "section"))

gradebook
# A tibble: 717 × 6
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 707 more rows

Next, we’ll mutate our data set to create a new column, one representing the proportion of points students earned. Let’s consider a data frame with example data, df4.

df4 <- tibble(var_a = c(8, 8, 7, 8, 9, 6, 8, 8, 7, 8),
              var_b = 9)

Note: To create df4, for var_a, we passed a vector that we created with the function c() that contains 10 values. Consider these to be the number of times that learners participated in an outside-of-school STEM club. Instead of passing another vector for var_b, we simply used the value 9, which represents the number of opportunities students had to participate in the outside-of-school STEM club. In this case, the value 9 is repeated for however many rows there are in the data frame. Thus, in the context of creating a data frame, var_b = 9 is the same as var_b = c(9, 9, 9, 9, 9, 9, 9, 9, 9, 9).
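
You can verify this recycling behavior with an even smaller example:

# the single value 9 is recycled to match the length of var_a
tibble(var_a = c(8, 8, 7), var_b = 9)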

Since values out of nine opportunities can be difficult to interpret at a glance, we may wish to create a variable for the proportion.

After running the chunk above, print df4 to the console or view using another method to get a sense of what the data frame consists of.

Your Turn

To create a third variable that represents the proportion of STEM club activities students participated in, divide var_a by var_b.

# YOUR CODE HERE
df4 |>
  mutate(var_c = var_a/var_b)
# A tibble: 10 × 3
   var_a var_b var_c
   <dbl> <dbl> <dbl>
 1     8     9 0.889
 2     8     9 0.889
 3     7     9 0.778
 4     8     9 0.889
 5     9     9 1    
 6     6     9 0.667
 7     8     9 0.889
 8     8     9 0.889
 9     7     9 0.778
10     8     9 0.889

What happens if the output is different than you intended? That’s no problem! Re-run the code chunk above (in which we created df4) to have a blank slate with which to try again.

Your Turn

Your turn once more. This time, create a new variable—here, let’s name it proportion_earned—using the gradebook data. This will involve using the mutate() function with the gradebook data, creating a new variable (proportion_earned) on the basis of the values of two existing variables:

  • total_points_earned

  • total_points_possible

Also, once your code is ready, you’ll need to assign the results back to gradebook. This is challenging as you’re starting from scratch with the code. However, good R programmers use other code (that they or others wrote!) often, so feel free to copy and paste code from other, similar problems to give yourself a head start.

# YOUR CODE HERE
gradebook <- gradebook |>
  mutate(proportion_earned = total_points_earned/total_points_possible)
gradebook
# A tibble: 717 × 7
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 707 more rows
# ℹ 1 more variable: proportion_earned <dbl>

Once the above step is complete, take another look at gradebook by printing it to the console or viewing it using a preferred method. There should now be seven columns, the six originally in the data and a new, seventh variable you’ve “mutated.”

Process Survey Data

Finally, let’s process our survey data that we imported earlier. First though, take a quick look again by typing survey into the console or using the code chunk below.

# YOUR CODE HERE
survey
# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
        <dbl> <chr>     <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1      43146 FrScA-S2… FrScA   S216           2   4.2  3.67     4    3.86     4
 2      44638 OcnA-S11… OcnA    S116           1   4    3        3    3.57     4
 3      47448 FrScA-S2… FrScA   S216           1   4.2  3        3    3.71     5
 4      47979 OcnA-S21… OcnA    S216           1   4    3.67     2.5  3.86     4
 5      48797 PhysA-S1… PhysA   S116           1   3.8  3.67     3.5  3.71     4
 6      51943 FrScA-S2… FrScA   S216           3   3.8  3.67     3.5  3.71     4
 7      52326 AnPhA-S2… AnPhA   S216           1   3.6  4        3    4        4
 8      52446 PhysA-S1… PhysA   S116           1   4.2  3.67     3    4        4
 9      53447 FrScA-S1… FrScA   S116           1   3.8  2        3    3        5
10      53475 FrScA-S2… FrScA   S216           1   4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

Does it appear to be the correct data frame? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peek at the data helps us begin to formulate answers to these questions and is an important step in any data analysis, especially as we prepare for what we are going to do.

Add one or more of the things you notice or wonder about the data here:

  • The survey data seems fairly clean, except it has a lot of NAs, which indicate missing data for those observations. There appear to be 10 questions on the survey, with answers on a 1-5 scale. There are also dttm columns with date-time information.

You may have noticed that a few of the survey’s variable names, such as date.x and date.y, are formatted differently than the names in our other files. Consistent naming matters because in the next section, when we “join,” or merge, our data files, key variables will need to have identical names.

Fortunately, the {janitor} package has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Its handy clean_names() function standardizes variable names.
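
To see what clean_names() does, try it on a tiny tibble with intentionally messy (made-up) column names:

# spaces, capitals, and periods all become lower-case snake_case
tibble(`Student ID` = 1:2, `date.x` = c("a", "b")) |>
  janitor::clean_names()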

Run the following code to load the {janitor} package and clean our column names so both data frames are consistent:

library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
survey <- clean_names(survey)

Let’s take one more look at the data by typing its name into the code chunk below to check that the above function appeared to work; if it did, the names should be lower-case, and any symbols or spaces should now be replaced by an underscore (_).

survey
# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
        <dbl> <chr>     <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1      43146 FrScA-S2… FrScA   S216           2   4.2  3.67     4    3.86     4
 2      44638 OcnA-S11… OcnA    S116           1   4    3        3    3.57     4
 3      47448 FrScA-S2… FrScA   S216           1   4.2  3        3    3.71     5
 4      47979 OcnA-S21… OcnA    S216           1   4    3.67     2.5  3.86     4
 5      48797 PhysA-S1… PhysA   S116           1   3.8  3.67     3.5  3.71     4
 6      51943 FrScA-S2… FrScA   S216           3   3.8  3.67     3.5  3.71     4
 7      52326 AnPhA-S2… AnPhA   S216           1   3.6  4        3    4        4
 8      52446 PhysA-S1… PhysA   S116           1   4.2  3.67     3    4        4
 9      53447 FrScA-S1… FrScA   S116           1   3.8  2        3    3        5
10      53475 FrScA-S2… FrScA   S216           1   4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>
# note: most names were already clean, but date.x and date.y became date_x and date_y

c. Join Data

We’re now ready to join! At their core, joins involve operations on two data frames at the same time. This may seem useful only in certain cases, but consider the following data analysis tasks:

  • You have collected data from students in one of ten classrooms; at the same time, you have data on the teachers of those ten classes (five of whom tried out a new curriculum, and five of whom taught a “business-as-usual” curriculum)

  • You are studying the Twitter and Pinterest posts of around 100 mathematics teachers

  • After working with a local school district, you collected survey responses from 100s of teachers who teach in one of approximately 25 elementary, middle, and high schools; you received data from the district on the characteristics of the schools, including how many students they serve and how many teachers work in them

In each of these cases—and many others like them—your single analysis involves multiple data files. While in some cases it is possible to analyze each data set individually, it is often useful (or necessary, depending upon your goal) to join these sources of data together. This is especially the case for learning analytics research, in which researchers and analysts are often interested in understanding teaching and learning through the lens of multiple data sources, including digital data, institutional records, and survey data, among other sources. In all of these cases, knowing how to promptly join together files—even files with tens of thousands or hundreds of thousands of rows—can be empowering.

Consider two example data frames. df5 contains a variable with three student names, name, and a variable for the number of STEM-related classes they have taken, n_stem_classes.

df6 contains a variable with four student names, name (like in df5), as well as another, different variable for students’ self-reported interest in STEM topics, interest_in_stem, measured on a one-to-seven scale, with seven indicating higher levels of interest.

Run the code below and then type df5 and df6 in the console.

df5 <- tibble(name = c("Sheila", "Tayla", "Marcus"),
              n_stem_classes = c(4, 5, 6))

df5
# A tibble: 3 × 2
  name   n_stem_classes
  <chr>           <dbl>
1 Sheila              4
2 Tayla               5
3 Marcus              6
df6 <- tibble(name = c("Tayla", "Marcus", "Sheila", "Vin"),
              interest_in_stem = c(4, 7, 6, 6))

df6
# A tibble: 4 × 2
  name   interest_in_stem
  <chr>             <dbl>
1 Tayla                 4
2 Marcus                7
3 Sheila                6
4 Vin                   6

A key (pun intended!) with joins is to consider what variable(s) will serve as the key. This is the variable to join by.

A key must have two characteristics:

  1. The variable(s) used as the key must be present in both of the data frames you are joining.

  2. The variable(s) used as the key must be of the same data type (e.g., character, numeric, integer, double, etc.).

To join two datasets, it is important that the key (or keys) on which you are joining the data is formatted identically. The key represents an identifier that is present in both of the data sets you are joining. For instance, you may have data collected from (or created about) the same students that are from two very different sources, such as a self-report survey of students and their teacher-assigned grades in class.

While some of the time it takes some thought to determine what the key is (or what the keys are—you can join on multiple keys!), in the above case, there is just one variable that meets both of the above characteristics.
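
Before joining, it can be worth quickly checking both characteristics in the console. A minimal sketch using df5 and df6:

# 1. is the key present in both data frames?
"name" %in% names(df5)
"name" %in% names(df6)

# 2. is the key stored as the same data type in both?
class(df5$name) == class(df6$name)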

Your Turn

In the code below, enter the name of the variable that is the key within the quotation marks following the by = argument. Then run the code chunk and note the output.

full_join(df5, df6, by = "name")
# A tibble: 4 × 3
  name   n_stem_classes interest_in_stem
  <chr>           <dbl>            <dbl>
1 Sheila              4                6
2 Tayla               5                4
3 Marcus              6                7
4 Vin                NA                6

What do you notice about the output of full_join()? Consider how the output is similar to and different from df5 and df6, and record one or more notes in the bullet points below.

  • It appears to join the two data frames by name into one data frame. But the result isn’t saved anywhere, so how do I recall that new data frame later?
  • The same columns exist, now just extended by one row. The two STEM columns are dbl type.

full_join() is one of a number of joins from which we can choose. full_join() is distinguished from the other joins by how it returns all of the rows in both of the data frames being joined. If a particular key is present in one of the data frames but not the other, the values for the variable in the data set for which the key is not present are simply recorded as missing (like in the above, where there is no value for the number of STEM classes Vin has taken).

There is one other join on which we’ll focus for now. That is left_join(), which differs from full_join() in that it returns all of the rows in the “left” data frame (the data frame named first in the function), but not all of the rows in the “right” data frame: it retains only the rows in the “right” data frame (the data frame named second) that have a matching key. An example will make this clearer. Before running the code below, add the same key you added above.

left_join(df5, df6, by = "name")
# A tibble: 3 × 3
  name   n_stem_classes interest_in_stem
  <chr>           <dbl>            <dbl>
1 Sheila              4                6
2 Tayla               5                4
3 Marcus              6                7

Different from the above, left_join() did not return all of the rows from both data frames, instead returning all of the rows in the “left” data frame (and those in the “right” data frame with a match).
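To see how the order of the data frames matters, here is a quick sketch that simply reverses the two data frames, making df6 the “left” data frame:

# now all four rows of df6 are retained; Vin gets an NA for
# n_stem_classes because there is no matching key in df5
left_join(df6, df5, by = "name")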

Join Gradebook and Log Data

For now, we’re going to use a single join function, full_join(). In the code below, join gradebook and time_spent; type the names of those two data frames as arguments to the full_join() function in a similar manner as in the full_join() code above, and then run this code chunk. For now, don’t specify anything for the by = argument of the function.

# join together the gradebook and time_spent data frames
joined_data <- full_join(gradebook, time_spent)
Joining with `by = join_by(student_id, subject, semester, section)`
joined_data
# A tibble: 830 × 12
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 820 more rows
# ℹ 6 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>

You may notice a message in the console or first output box above that says Joining with `by = join_by(student_id, subject, semester, section)`. This is telling us that these files are being joined on the basis of all four of these variables matching in both data sets; in other words, for rows to be joined, they must match identically on all four of these variables.

This is related to not specifying anything for the by = part of the function; by default, full_join() (and left_join()) will treat any variables with identical names that are present in both data sets as keys. But it’s generally better practice to specify the variables on which we are joining.

Your Turn

In the code chunk below, write your join like above, but add the by = c("student_id", "subject", "semester", "section") part to your code. You may notice the message does not appear. This is generally a better practice because you know precisely on which variables your data sets are joining.

# YOUR CODE HERE
joined_data <- full_join(gradebook, time_spent, by = c("student_id", "subject", "semester", "section"))

joined_data
# A tibble: 830 × 12
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 820 more rows
# ℹ 6 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>

Hint: If you’re curious about how to format the use of the by = part of your code, look up above at how you used this argument to the full_join() function.

What do you notice about the result—the data you joined? In particular, how does it differ from the two data sets from which it was created? Add one or more notes below.

  • The joined data frame has 830 observations and 12 variables (created by joining the 717 obs/7 var gradebook with the 716 obs/9 var time_spent).
  • All three data frames share student_id, subject, semester, and section, and the additional variables are combined from both sets.

Join Survey Data

Now let’s try joining our final data set, survey, so that our LMS, gradebook, and survey data are all in a single data frame for analysis called data_to_explore:

# join together joined_data and the survey data
data_to_explore <- full_join(joined_data, survey, by = c("student_id", "subject", "semester", "section"))

data_to_explore

Oops! You probably noticed a red error message telling us that x$section and y$section are not of the same data type. This is telling us we are not meeting the second criterion of a key noted above. Specifically, somewhere we have a mismatch in data types between our joined_data and survey data frames among our student_id, subject, semester, or section variables; something we were warned about earlier.

You probably also noticed that I included something strange in this code chunk, specifically the #| eval: false code chunk option. The eval option is just one of many code chunk options you can include to change the output and behavior of code in Quarto docs. The reason I included it in this code chunk is that the code doesn’t work until we change the data type, which we’ll do below. Setting eval to false tells Quarto to show the code but not run it when you render.
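For reference, here is a minimal sketch of how that looks inside a Quarto code chunk (the #| option comment must be the first line of the chunk):

#| eval: false

# this chunk is displayed when you render, but not run
data_to_explore <- full_join(joined_data, survey,
                             by = c("student_id", "subject", "semester", "section"))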

Your Turn

Run the following code chunk to inspect our two data frames and record below the variable used in our key where we have a mismatch in data type:

  • The section variable is mismatched - chr in one, dbl in the other.
joined_data
# A tibble: 830 × 12
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 820 more rows
# ℹ 6 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>
survey
# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
        <dbl> <chr>     <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1      43146 FrScA-S2… FrScA   S216           2   4.2  3.67     4    3.86     4
 2      44638 OcnA-S11… OcnA    S116           1   4    3        3    3.57     4
 3      47448 FrScA-S2… FrScA   S216           1   4.2  3        3    3.71     5
 4      47979 OcnA-S21… OcnA    S216           1   4    3.67     2.5  3.86     4
 5      48797 PhysA-S1… PhysA   S116           1   3.8  3.67     3.5  3.71     4
 6      51943 FrScA-S2… FrScA   S216           3   3.8  3.67     3.5  3.71     4
 7      52326 AnPhA-S2… AnPhA   S216           1   3.6  4        3    4        4
 8      52446 PhysA-S1… PhysA   S116           1   4.2  3.67     3    4        4
 9      53447 FrScA-S1… FrScA   S116           1   3.8  2        3    3        5
10      53475 FrScA-S2… FrScA   S216           1   4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>

Change Data Type

We can use the mutate() function that we learned about in our tutorials to also change data types!

Run the code chunk below to “mutate” the section variable in our joined_data table from a “character” data type to a “dbl” data type, i.e., from a character variable to a numeric variable, using the base R as.double() function:

joined_data <- joined_data |>
  mutate(section = as.double(section))

joined_data
# A tibble: 830 × 12
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>      <dbl>                 <dbl>               <dbl>
 1      43146 FrScA   S216           2                  1217               1150 
 2      44638 OcnA    S116           1                  1676               1384.
 3      47448 FrScA   S216           1                  1232               1116 
 4      47979 OcnA    S216           1                  1833               1493.
 5      48797 PhysA   S116           1                  2225               1995.
 6      51943 FrScA   S216           3                  1222                 70 
 7      52326 AnPhA   S216           1                  1775               1519.
 8      52446 PhysA   S116           1                  2225               2198 
 9      53447 FrScA   S116           1                  1212               1173 
10      53475 FrScA   S116           2                  1212                  0 
# ℹ 820 more rows
# ℹ 6 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>

A quick note about “dbl” (short for double) vs. “numeric” data types. “dbl” is the designation you’ll commonly see for numeric columns when printing a tibble (a modern form of data frame provided by the tidyverse). This is just a user-friendly way to indicate that the column contains numeric (double-precision) data. For a nice simple list of data types and examples, check out this short article on the tidyverse website: https://tibble.tidyverse.org/articles/types.html.
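Here is a quick sketch you can run in the console to see this for yourself:

# "double" is the underlying storage type; "numeric" is the class
typeof(1.5)                 # "double"
class(1.5)                  # "numeric"
typeof(joined_data$section) # "double", thanks to as.double() above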

Your Turn

In the code chunk below, try joining our two data frames again using the same code from above:

# YOUR CODE HERE
data_to_explore <- full_join(joined_data, survey, by = c("student_id", "subject", "semester", "section"))

data_to_explore
# A tibble: 943 × 34
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>      <dbl>                 <dbl>               <dbl>
 1      43146 FrScA   S216           2                  1217               1150 
 2      44638 OcnA    S116           1                  1676               1384.
 3      47448 FrScA   S216           1                  1232               1116 
 4      47979 OcnA    S216           1                  1833               1493.
 5      48797 PhysA   S116           1                  2225               1995.
 6      51943 FrScA   S216           3                  1222                 70 
 7      52326 AnPhA   S216           1                  1775               1519.
 8      52446 PhysA   S116           1                  2225               2198 
 9      53447 FrScA   S116           1                  1212               1173 
10      53475 FrScA   S116           2                  1212                  0 
# ℹ 933 more rows
# ℹ 28 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>, course_id <chr>, int <dbl>, val <dbl>,
#   percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>,
#   q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
#   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, …

We’ll revisit joins in our Unit 2 tutorials, but for a quick overview of the different join functions with helpful visuals, visit: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

Congratulations! We’re done wrangling our data and it is ready to explore!

3. EXPLORE

As highlighted in both DSEIUR and Learning Analytics Goes to School, calculating summary statistics and data visualization are a key part of exploratory data analysis. One goal in this phase is to explore questions that drove the original analysis and develop new questions and hypotheses to test in later stages. In Part 3 we’ll learn to:

  • Summarize Key Stats. We’ll learn about the {skimr} package for pulling together some quick descriptive statistics when your goal is to understand your data internally.

  • Visualize Data. We’ll introduce the histogram geom for taking a quick peek at the distributions of some key variables and put together some scatter plots for examining potential relationships between time spent and student performance.

  • Correlate Variables. We’ll wrap up Part 3 by learning to create a correlation matrix to quantify the relationships between key variables and student performance.

a. Table Summaries

At this point, we have quite the comprehensive data set, including a) measures of the time students spent in the course LMS and other information about them, such as why they enrolled in the course, and b) their academic achievement.

Skimr Package

An efficient package for creating descriptive statistics when your goal is to understand your data internally (rather than to create a table for an external-to-the-research-team audience, like for a journal article) is the {skimr} package. A key feature of the {skimr} package is that it works well with the {tidyverse} packages we are using: it takes data frames as input, and returns data frames as output, which means we can manipulate them with {tidyverse} functions like select(), filter(), and arrange(), for example.

Let’s load the {skimr} package:

library(skimr)

The challenge here is not the complexity of the skim() function, per se, but comprehending the terminology. In the code chunk below:

  • Pass to the skim() function a single argument (recall from our tutorials last week that functions have names and arguments!)

  • That single argument is the data frame (aka in tidyverse parlance, a tibble) for which you are aiming to calculate descriptive statistics

Run the following code to “skim” our data_to_explore tibble:

skim(data_to_explore)
Data summary
Name data_to_explore
Number of rows 943
Number of columns 34
_______________________
Column type frequency:
character 6
numeric 25
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
subject 0 1.00 4 5 0 5 0
semester 0 1.00 4 4 0 4 0
gender 227 0.76 1 1 0 2 0
enrollment_reason 227 0.76 5 34 0 5 0
enrollment_status 227 0.76 7 17 0 3 0
course_id 281 0.70 12 13 0 36 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
student_id 0 1.00 98989.73 66673.70 1.00 84717.00 89345.00 96423.50 265635.00 ▂▇▁▁▂
section 0 1.00 1.48 0.73 1.00 1.00 1.00 2.00 4.00 ▇▃▁▁▁
total_points_possible 226 0.76 1619.55 387.12 1212.00 1217.00 1676.00 1791.00 2425.00 ▇▂▆▁▃
total_points_earned 226 0.76 1229.98 510.64 0.00 1002.50 1177.13 1572.45 2413.50 ▂▂▇▅▂
proportion_earned 226 0.76 0.76 0.25 0.00 0.72 0.86 0.92 1.01 ▁▁▁▃▇
time_spent 232 0.75 1828.80 1363.13 0.45 895.57 1559.97 2423.94 8870.88 ▇▅▁▁▁
time_spent_hours 232 0.75 30.48 22.72 0.01 14.93 26.00 40.40 147.85 ▇▅▁▁▁
int 293 0.69 4.30 0.60 1.80 4.00 4.40 4.80 5.00 ▁▁▂▆▇
val 287 0.70 3.75 0.75 1.00 3.33 3.67 4.33 5.00 ▁▁▆▇▆
percomp 288 0.69 3.64 0.69 1.50 3.00 3.50 4.00 5.00 ▁▁▇▃▃
tv 292 0.69 4.07 0.59 1.00 3.71 4.12 4.46 5.00 ▁▁▂▇▇
q1 285 0.70 4.34 0.66 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q2 285 0.70 3.66 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▃
q3 286 0.70 3.31 0.85 1.00 3.00 3.00 4.00 5.00 ▁▂▇▅▂
q4 289 0.69 4.35 0.80 1.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q5 286 0.70 4.28 0.69 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▆
q6 285 0.70 4.05 0.80 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▅
q7 286 0.70 3.96 0.85 1.00 3.00 4.00 5.00 5.00 ▁▁▅▇▆
q8 286 0.70 4.35 0.65 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q9 286 0.70 3.55 0.92 1.00 3.00 4.00 4.00 5.00 ▁▂▇▇▃
q10 285 0.70 4.17 0.87 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▇
post_int 848 0.10 3.88 0.94 1.00 3.50 4.00 4.50 5.00 ▁▁▃▇▇
post_uv 848 0.10 3.48 0.99 1.00 3.00 3.67 4.00 5.00 ▂▂▅▇▅
post_tv 848 0.10 3.71 0.90 1.00 3.29 3.86 4.29 5.00 ▁▂▃▇▆
post_percomp 848 0.10 3.47 0.88 1.00 3.00 3.50 4.00 5.00 ▁▂▂▇▂

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date_x 393 0.58 2015-09-02 15:40:00 2016-05-24 15:53:00 2015-10-01 15:57:30 536
date_y 848 0.10 2015-09-02 15:31:00 2016-01-22 15:43:00 2016-01-04 13:25:00 95
date 834 0.12 2017-01-23 13:14:00 2017-02-13 13:00:00 2017-01-25 18:43:00 107

Note: If you are having difficulty viewing your data in the code chunk, try clicking the icon in the output that looks like a spreadsheet with an arrow on it to expand your output in a separate window.

What do you notice about the output? These observations might pertain to the format of the output or its values (i.e., what the mean for the val variable is). Note one or two observations or questions below:

  • One of the most interesting columns in my opinion is the n_missing, especially next to the complete_rate. I assume that n_missing is how many responses are lacking for that row. I could infer that the same people who did not fill in gender also did not fill in enrollment reason or status. It also appears that almost all (848 of 943 observations) failed to fill out the post_(int, uv, tv, percomp) surveys.

As we noted earlier, the {skimr} package works nicely with other {tidyverse} functions.
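For instance, here is a minimal sketch (using the {skimr} and {dplyr} functions we have already loaded) that pipes the output of skim() into {tidyverse} verbs to list the variables with the most missing values; n_missing is one of the columns skim() produces:

# skim() returns a data frame, so we can filter and sort its rows
skim(data_to_explore) |>
  filter(n_missing > 0) |>
  arrange(desc(n_missing))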

Hint: For help, also consider running ?skim() in the console to view the documentation for the function.

Your Turn

Recall from the Week 3 tutorials and exercises how we isolated data using the select() and filter() functions. In the code chunk below, look at descriptives for just proportion_earned, time_spent, and gender, but only for the “OcnA” and “PhysA” subjects.

Can you modify the code below to do this?

data_to_explore |>
  filter(subject %in% c("OcnA", "PhysA")) |>
  select(proportion_earned, time_spent, gender) |> 
  skim()
Data summary
Name select(…)
Number of rows 249
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 48 0.81 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
proportion_earned 48 0.81 0.78 0.24 0.00 0.73 0.86 0.94 1.00 ▁▁▁▃▇
time_spent 48 0.81 1828.56 1374.13 0.58 943.07 1601.13 2356.88 8870.88 ▇▅▁▁▁

We noted earlier that this output is best for internal use. This is because the output is rich, but not well-suited to exporting to a table that you add, for instance, to a Microsoft Word manuscript. Of course, these values can be entered manually into a table, but we’ll also discuss ways later on to create tables that are ready, or nearly ready, to be added directly to journal articles or evaluation reports.

If you are curious about doing more with {skimr}, check out: <https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html>

b. Data Visualization

As we noted in our Getting Started Case Study, data visualization is an extremely common practice in Learning Analytics, especially in the use of data dashboards. Data visualization involves graphically representing one or more variables with the goal of discovering patterns in data. These patterns may help us to answer research questions or generate new questions about our data, to discover relationships between and among variables, and to create or select features for data modeling.

The Graphing Workflow

Before visually exploring our data, let’s revisit the basic code template for the {ggplot2} package from the tidyverse. ggplot2 is a system for declaratively creating graphics, based on the grammar of graphics (Wickham, Çetinkaya-Rundel, and Grolemund 2023). You provide the data and tell ggplot2 how to map variables to aesthetics and what graphical elements to use, and it takes care of the details.

At its core, you can create some very simple but attractive graphs with just a couple of lines of code. {ggplot2} follows the common workflow for making graphs. To make a graph, you simply:

  1. Start the graph with ggplot() and include your data as an argument;

  2. “Add” elements to the graph using the + operator and a geom_() function;

  3. Select variables to graph on each axis with the aes() argument.

Let’s give it a try by creating a simple histogram of our time_spent_hours variable. You likely noticed that the skim() function created some miniature histograms for each of the numeric variables in our data_to_explore data frame. Since we are especially interested in whether time spent in the online course is related to student performance, let’s take a closer look at this variable.

The code below creates a histogram with 30 bins – the default number for geom_histogram:

ggplot(data_to_explore) +
  geom_histogram(aes(x = time_spent_hours), bins = 30)
Warning: Removed 232 rows containing non-finite outside the scale range
(`stat_bin()`).

Note that the first function, ggplot(), creates a coordinate system that you can “add” layers to using additional functions and the + operator. The first argument of ggplot() is the dataset, in our case data_to_explore, to use for the graph.

By itself, ggplot(data_to_explore) just creates an empty graph. But when you add a required geom_() function like geom_histogram(), you tell it which type of graph you want to make, in our case a histogram. A geom is the geometrical object that a plot uses to represent observations. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots, which we’ll see in a bit, break the trend; they use the point geom.

The final required element for any graph is a mapping = argument that defines which variables in your dataset are mapped to which axes in your graph. The mapping argument is always paired with the function aes(), which you use to gather together all of the mappings that you want to create. In our case, since we just created a simple histogram, we only had to specify what variable to place on the x axis, which in our case was time_spent_hours.

Note that we also included the bins = argument in our geom_histogram() function. Again, by default this is set to 30. Aside from the bins = argument, geom_histogram() has other arguments that you can modify; search the Help tab in the Files pane for “geom_histogram” or type ?geom_histogram in the Console to open the help page.
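For example, here is a quick sketch using the binwidth = argument, an alternative to bins = that sets the width of each bin in the units of the variable (hours, in our case):

# each bin now spans 10 hours, rather than dividing the range into 30 bins
ggplot(data_to_explore) +
  geom_histogram(aes(x = time_spent_hours), binwidth = 10)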

Pro Tip: If you scroll to the bottom of most help pages for functions, you will see an examples section with code for executing those functions. You can copy and paste that code in the console to view the output of those functions and discover different ways they can be used.

Your Turn

Now change the number of bins using the code chunk below and note any differences in what you interpret about the data.

# YOUR CODE HERE
ggplot(data_to_explore) +
  geom_histogram(aes(x = time_spent_hours), bins = 5)
Warning: Removed 232 rows containing non-finite outside the scale range
(`stat_bin()`).

What do you think the ideal number of bins is—with the ideal being the number of bins that helps you to interpret the overall distribution of the values for how much time students spent (note: there is no one right or wrong answer here!)?

  • Fewer bins chunks the data into bigger pieces; more bins segments it out, which starts to show nuances in the peaks and valleys. Without quite a bit more information, this plot is hard to interpret regardless of more or fewer bins. I think the y axis is the number of observations, minus NA ones, out of the full 943 rows.
  • Also, time_spent_hours wasn’t working as the x = at first… I think because it’s in a whole different data frame external to the one being called for the ggplot itself. Please correct me if I’m wrong!

Faceting

We’ll next be using the facet_wrap() function to create small multiples, or plots that are specific to subsets of your data. These subsets are identified based on another variable in your dataset. For example, the code below uses the built-in mpg dataset from the tutorials to plot the relationship between the displacement of a car’s engine and its highway miles per gallon fuel efficiency.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point()

The code in the next plot creates individual plots for each class—think compact car or SUV.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  facet_wrap(~class)

Your Turn

In the code below, create a faceted histogram based on the subject of the course. To do so, consider both:

  • What code you used to create the histogram of the time students’ spent on the course

  • How, in the example above, facet_wrap() refers to the variable in that data frame that represents the class of the car; modify that code to work for your subject variable

You may also wish to change the color; reflect back to the getting started task for an example of how to do this.

# YOUR CODE HERE
ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, y = proportion_earned, color = gender)) +
  facet_wrap(~subject)
Warning: Removed 345 rows containing missing values or values outside the scale range
(`geom_point()`).

What do you notice about this figure? What questions do you have? Add a note (or a few notes!) below:

  • Data viz is great ;) We can now see that some courses required more time to be spent in general (AnPhA and OcnA at least). We can also see the clusters of grades and a bit of separation between upper and lower performers. I will certainly pop back up and add a color… then I’ll be back.
  • Separated by color now… at a glance we can begin to see which courses were gender neutral and which were segmented, as well as the distribution of the higher-performing gender by course and over time.

Scatter Plots

Having worked hard to prepare and join our data sets, we’re now ready to use the joined data in our exploration of how the time students spent on the course LMS relates to the number of points they earned throughout the course.

We’ll be using the {ggplot2} package again, but this time we’ll be creating a different type of plot.

Run the code below to create a scatter plot of the proportion of points students earned and the number of hours they spent on the course LMS.

ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, y = proportion_earned))
Warning: Removed 345 rows containing missing values or values outside the scale range
(`geom_point()`).

What do you notice about this graph? And what do you wonder? How about the code—what do you notice about it (and what do you wonder)? Add one or more notes about what you see as the most important elements.

  • It appears that there is a correlation between amount of time spent and higher grade. However, there also seems to be a saturation point in time spent which may indicate that about 50 hours is enough.
  • The code is bare bones and would be more useful with additional aesthetics layered on.

Using {ggplot2} makes it efficient to iterate through different versions of similar plots. For instance, we can color the points by a third variable, such as students’ enrollment status, to begin to explore what was going on for students who spent very little time on the course:

ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, y = proportion_earned, 
                 color = enrollment_status))
Warning: Removed 345 rows containing missing values or values outside the scale range
(`geom_point()`).

Your Turn

We can also additionally create faceted plots, like the one you created for the histogram. In the code below, facet the plot by subject.

# YOUR CODE HERE
ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, y = proportion_earned, 
                 color = enrollment_status)) +
  facet_wrap(~subject)
Warning: Removed 345 rows containing missing values or values outside the scale range
(`geom_point()`).

In ggplot2, there are various ways to style a scatterplot, or any plot for that matter, to enhance its visual appeal and convey more information. Aside from using facets, here are some key styling options:

  1. Changing Point Shapes and Sizes: You can use shape to differentiate groups by using different symbols and size to change the size of the points.

  2. Adjusting Colors: Use color or fill to map variables to point colors. This can be done for both continuous and categorical variables.

  3. Transparency (Alpha): Modify the alpha parameter to handle overplotting and make points more transparent.

  4. Adding Titles and Labels: Use ggtitle(), xlab(), and ylab() to add plot titles and axis labels.

  5. Customizing Themes: Modify the overall appearance with different themes such as theme_minimal(), theme_bw(), or create custom themes using theme().

Some styling requires you to modify arguments in a function, like adding color = enrollment_status to the aes() function; for others, you can “add” layers to your plot by using the plus (+) operator at the end of the prior line and then adding a new function. For instance, the following code adds an x-axis label to a plot:

ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, y = proportion_earned, 
                 color = enrollment_status)) +
  xlab("Time Spent (Hours)")
Warning: Removed 345 rows containing missing values or values outside the scale range
(`geom_point()`).
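Similarly, here is a quick sketch of option 3 from the list above, using the alpha parameter to handle overplotting (alpha is set outside of aes() because it applies uniformly to all points rather than being mapped to a variable):

# semi-transparent points make dense regions of the plot easier to read
ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, y = proportion_earned,
                 color = enrollment_status),
             alpha = 0.5)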

Try adding (and modifying, if you’d like) any of the following to the faceted plot you created in the code chunk below:

  • xlab("Time Spent (Hours)")

  • ylab("Proportion of Points Earned")

  • scale_color_brewer("Enrollment Status", type = "qual", palette = 3)

  • ggtitle("How Time Spent on Course LMS is Related to Points Earned in the Course")

  • theme(legend.position = "bottom")

# YOUR CODE HERE
ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, y = proportion_earned, 
                 color = enrollment_status)) +
  xlab("Time Spent (Hours)") +
  ylab("Proportion of Points Earned") +
  scale_color_brewer("Enrollment Status", type = "qual", palette = 3) +
  ggtitle("More Time Spent on Course LMS Increases Points Earned") +
  theme(legend.background = element_rect(fill = "gray", linewidth = 2), legend.position = "right")
Warning: Removed 345 rows containing missing values or values outside the scale range
(`geom_point()`).

Once you have settled on a plot you are happy with (for now!), add a sentence or two interpreting your graph (like you were describing it within a journal article):

  • This plot suggests that, among the students who remained in the course, grades generally increased with time spent within the LMS. Students who spent less than about 50 hours were at greater risk of failing.

c. Create a Correlation Matrix

As highlighted in Macfadyen and Dawson (2010), scatter plots are a useful initial approach for identifying potential correlational trends between variables under investigation, but to further interrogate the significance of selected variables as indicators of student achievement, a simple correlation analysis of each variable with student final grade can be conducted.

There are two efficient ways to create correlation matrices, one that is best for internal use, and one that is best for inclusion in a manuscript.

Corrr Package

First, the {corrr} package provides a way to create a correlation matrix in a {tidyverse}-friendly way. As with the {skimr} package, it can take as little as a line of code to create a correlation matrix. If you’re not familiar, a correlation matrix is a table that presents how each of the variables is related to each of the other variables.

Run the following code to load the {corrr} package:

library(corrr)

Attaching package: 'corrr'
The following object is masked from 'package:skimr':

    focus

Time Spent and Course Grade

Since one main goal of this case study is to investigate whether time spent in an online course is predictive of student achievement, let’s first take a look and see if there is a simple correlation between time spent and student achievement.

Run the following code to create a simple correlation matrix using the correlate() function from the {corrr} package:

data_to_explore |> 
  select(proportion_earned, time_spent_hours) |>
  correlate()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 × 3
  term              proportion_earned time_spent_hours
  <chr>                         <dbl>            <dbl>
1 proportion_earned            NA                0.438
2 time_spent_hours              0.438           NA    

As the scatter plot we created earlier suggested, the correlation matrix (printed above as a tibble) shows a positive correlation (0.438) between time spent and the percentage of total points a student earned, suggesting that students who spend more time in the LMS also tend to earn a higher proportion of points.
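If you also want a significance test for a single correlation like this one, in the spirit of Macfadyen and Dawson’s simple correlation analysis, a minimal sketch using base R’s cor.test() would look like this:

# reports the correlation along with its p-value and 95% confidence interval
cor.test(data_to_explore$time_spent_hours,
         data_to_explore$proportion_earned)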

For the purpose of printing, and as a quick aside, the {corrr} package also has several nice functions for converting a correlation data frame into a matrix with the correlations cleanly formatted (leading zeros removed; spaced for signs) and the diagonal (or any NA) left blank.

Run the following code to try it out:

data_to_explore |> 
  select(proportion_earned, time_spent_hours, int, val, percomp) |> 
  correlate() |> 
  rearrange() |>
  shave() |>
  fashion()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
               term time_spent_hours proportion_earned  int  val percomp
1  time_spent_hours                                                     
2 proportion_earned              .44                                    
3               int              .08               .14                  
4               val              .10               .02  .53             
5           percomp              .05               .08  .56  .48        

Your Turn

In the code chunk below, select 3-4 numeric variables, in addition to time_spent_hours, that you think may be related to student achievement (i.e., proportion_earned), and run a simple correlation.

# YOUR CODE HERE
data_to_explore |> 
  select(proportion_earned, total_points_possible, time_spent_hours, total_points_earned) |> 
  correlate() |> 
  rearrange() |>
  shave() |>
  fashion()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
                   term proportion_earned time_spent_hours total_points_earned
1     proportion_earned                                                       
2      time_spent_hours               .44                                     
3   total_points_earned               .79              .38                    
4 total_points_possible              -.05              .02                 .53
  total_points_possible
1                      
2                      
3                      
4                      

Hint: One key is to correlate only numeric variables. Note that while some numeric variables can technically be used, it is likely not sensible to correlate all of the variables; some—for instance, the student_id variable—are not very sensible to correlate!

What did you find? Were your selected variables related to time spent in the course? These observations might pertain to the format of the output or its values (e.g., the strength of a particular correlation). Note one or two of these observations or questions below:

  • The correlation matrix shows a .44 correlation between proportion_earned and time_spent_hours, and a strong .79 correlation between proportion_earned and total_points_earned, which makes mathematical sense. I’m not certain how to interpret the near-zero -.05 correlation between total_points_possible and proportion_earned.

If you are interested in learning more about the {corrr} package, visit: <https://corrr.tidymodels.org>

APA Formatted Tables

While {corrr} is a nice package to quickly create a correlation matrix, you may wish to create one that is ready to be added directly to a dissertation or journal article. {apaTables} is great for creating more formal output that can be added directly to an APA-formatted manuscript; it also has functionality for regression and other types of model output. It is not as friendly to {tidyverse} functions, however; first, we need to select only the variables we wish to correlate.

Then, we can use that subset of the variables as the argument to the apa.cor.table() function.

Run the following code to create a subset of the larger data_to_explore data frame with the variables you wish to correlate, then create a correlation table using apa.cor.table().

library(apaTables)

data_to_explore_subset <- data_to_explore |> 
  select(time_spent_hours, proportion_earned, int)

apa.cor.table(data_to_explore_subset)


Means, standard deviations, and correlations with confidence intervals
 

  Variable             M     SD    1           2         
  1. time_spent_hours  30.48 22.72                       
                                                         
  2. proportion_earned 0.76  0.25  .44**                 
                                   [.37, .50]            
                                                         
  3. int               4.30  0.60  .08         .14**     
                                   [-.01, .16] [.06, .22]
                                                         

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.
 

This may look nice, but how do you actually add this to a dissertation or journal article that you might be interested in publishing? Read the documentation for apa.cor.table() by running ?apa.cor.table() in the console. Look through the documentation and examples to understand how to output a file with the formatted correlation table, and then run the code to do that with your subset of the data_to_explore data frame.

apa.cor.table(data_to_explore_subset, filename = "cor-table.doc")


Means, standard deviations, and correlations with confidence intervals
 

  Variable             M     SD    1           2         
  1. time_spent_hours  30.48 22.72                       
                                                         
  2. proportion_earned 0.76  0.25  .44**                 
                                   [.37, .50]            
                                                         
  3. int               4.30  0.60  .08         .14**     
                                   [-.01, .16] [.06, .22]
                                                         

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.
 

You should now see a new Word document in your project folder called cor-table.doc. Click on that and you’ll be prompted to download from your browser.

4. MODEL

As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.

Predict Academic Achievement

For the purpose of this case study, let’s consider the proportion_earned variable to be our dependent, or the outcome, variable. You may be new to linear regression models, or you may have a lot of experience. In brief, a linear regression model involves estimating the relationships between one or more independent variables with one dependent variable. Mathematically, it can be written like the following.

\[ \operatorname{dependentvar} = \beta_{0} + \beta_{1}(\operatorname{independentvar}) + \epsilon \]

Here, the dependentvar is predicted by two coefficients, or values that help to explain the dependent variable. The first coefficient, \(\beta_0\), is the intercept. This coefficient tells us the estimated value of the dependent variable when the independent variable (independentvar) is equal to 0. The other coefficient, \(\beta_1\), or the slope, represents the estimated change in the dependent variable associated with a one-unit change in the independent variable.

Does Time Spent Predict Grade Earned?

Let’s consider a simple concrete example. We’ll use the lm() function in R to estimate a linear regression model.

The following code estimates a model in which proportion_earned, the proportion of points students earned, is the dependent variable. It is predicted by one independent variable, time_spent_hours, the number of hours students spent in the course LMS.
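Substituting our variable names into the general form above, the model we are estimating is:

\[ \operatorname{proportion\_earned} = \beta_{0} + \beta_{1}(\operatorname{time\_spent\_hours}) + \epsilon \]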

lm(proportion_earned ~ time_spent_hours, 
   data = data_to_explore)

Call:
lm(formula = proportion_earned ~ time_spent_hours, data = data_to_explore)

Coefficients:
     (Intercept)  time_spent_hours  
        0.624306          0.004792  

Let’s take a look at the output.

We can see that the intercept is estimated at 0.62. This tells us that when students’ time spent in the online course is equal to zero, their predicted proportion of points earned is 0.62—not such a great grade, but also not surprising! And for every one-unit (i.e., one-hour) increase in time spent in the course, the estimated proportion of points earned increases by 0.0048. So if a student spent, for instance, 40 hours on the course, their estimated final grade would be .62 + (.0048 * 40), or around .82, or 82%. A pretty solid B-!
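We can check that arithmetic in R. Here is a quick sketch that saves the model to an object (m0 is just a hypothetical name) and uses the built-in predict() function:

# save the model, then predict proportion_earned for 40 hours of time spent
m0 <- lm(proportion_earned ~ time_spent_hours, data = data_to_explore)
predict(m0, newdata = data.frame(time_spent_hours = 40))
# returns roughly 0.82, matching .62 + (.0048 * 40)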

How about interest in science?

We can add additional predictor variables by separating variables with a plus symbol. Run the following code to add int, students’ self-reported interest in science, to our linear model:

lm(proportion_earned ~ time_spent_hours + int, 
   data = data_to_explore)

Call:
lm(formula = proportion_earned ~ time_spent_hours + int, data = data_to_explore)

Coefficients:
     (Intercept)  time_spent_hours               int  
        0.449657          0.004255          0.046283  

We can see that the intercept is now estimated at 0.449, which tells us that when students’ time spent and interest are both equal to zero, they are likely to fail the course. Note that the estimate for interest in science is .046, so for every one-unit increase in int, we should expect roughly a 5 percentage point increase in their grade.

We can save the output of the function to an object—let’s say m1, standing for model 1. We can then use the summary() function built into R to view a much more feature-rich summary of the estimated model.

m1 <- lm(proportion_earned ~ time_spent_hours + int, data = data_to_explore)

summary(m1)

Call:
lm(formula = proportion_earned ~ time_spent_hours + int, data = data_to_explore)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.66705 -0.07836  0.05049  0.14695  0.35766 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.449657   0.066488   6.763 3.54e-11 ***
time_spent_hours 0.004255   0.000410  10.378  < 2e-16 ***
int              0.046282   0.015364   3.012  0.00271 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2142 on 536 degrees of freedom
  (404 observations deleted due to missingness)
Multiple R-squared:  0.1859,    Adjusted R-squared:  0.1828 
F-statistic: 61.18 on 2 and 536 DF,  p-value: < 2.2e-16

There is a lot to unpack in this output, but for now the most important values to look at are those in the Estimate column, which represent the intercept and slopes for your linear regression model.

Note that the estimate for time_spent_hours is now 0.0042 and statistically significant. We see that int, interest in science, is also statistically significant.

Do “average” students earn an average grade?

Now let’s consider the mean values for each of these predictors. Recall from our tutorials last week the summarize() function from the {dplyr} package that was used to create summary statistics such as the mean, standard deviation, or the minimum or maximum of a value. At its core, think of summarize() as a function that returns a single value (whether it’s a mean, median, standard deviation—whichever!) that summarizes a single column.

Let’s use the summarize function to calculate the means for time spent and interest in science and add the argument na.rm = TRUE to tell R that it can ignore missing, or NA values, and to calculate the summary statistic using the non-missing values.

data_to_explore |> 
  summarize(mean_interest = mean(int, na.rm = TRUE),
            mean_time = mean(time_spent_hours, na.rm = TRUE))
# A tibble: 1 × 2
  mean_interest mean_time
          <dbl>     <dbl>
1          4.30      30.5

The mean value for interest is quite high. If we multiply the estimated relationship between interest and proportion of points earned—0.046—by this, the mean interest across all of the students, we can determine that interest’s estimated contribution to students’ final grade was 0.046 x 4.3, or 0.197. For hours spent, the average student’s estimated contribution was 0.0042 x 30.48, or 0.128.

If we add both 0.197 and 0.128 to the intercept, 0.449, that equals 0.774, or about 77%. In other words, a student with average interest in science who spent an average amount of time in the course earned a pretty average grade.
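As a quick check, here is a sketch that plugs the mean values into our saved model m1 using predict(); any small difference from the hand calculation is just rounding:

# predicted proportion earned for a student at the mean of both predictors
predict(m1, newdata = data.frame(time_spent_hours = 30.48, int = 4.30))
# returns about 0.78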

Finally, similar to our APA formatted correlation table above, we can use the {apaTables} package to create a nice regression table that could be used for later publication:

apa.reg.table(m1, filename = "lm-table.doc")


Regression results using proportion_earned as the criterion
 

        Predictor      b     b_95%_CI beta  beta_95%_CI sr2  sr2_95%_CI     r
      (Intercept) 0.45** [0.32, 0.58]                                        
 time_spent_hours 0.00** [0.00, 0.01] 0.41 [0.33, 0.48] .16  [.11, .22] .41**
              int 0.05** [0.02, 0.08] 0.12 [0.04, 0.19] .01 [-.00, .03] .15**
                                                                             
                                                                             
                                                                             
             Fit
                
                
                
     R2 = .186**
 95% CI[.13,.24]
                

Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights. 
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.
 

Your Turn

Repurpose the code from above to estimate a different regression model with at least 2 variables, save it as m2, and view a summary() of the results:

# YOUR CODE HERE
m2 <- lm(proportion_earned ~ percomp + total_points_earned, data = data_to_explore)

summary(m2)

Call:
lm(formula = proportion_earned ~ percomp + total_points_earned, 
    data = data_to_explore)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33338 -0.12393  0.00187  0.15022  0.26452 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.433e-01  3.821e-02   6.366 4.13e-10 ***
percomp             2.002e-02  9.490e-03   2.110   0.0353 *  
total_points_earned 3.621e-04  1.301e-05  27.841  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1535 on 542 degrees of freedom
  (398 observations deleted due to missingness)
Multiple R-squared:  0.5914,    Adjusted R-squared:  0.5899 
F-statistic: 392.3 on 2 and 542 DF,  p-value: < 2.2e-16

Add a brief note or two interpreting the above model (m2):

  • Both predictors are positively associated with proportion_earned, with total_points_earned showing a much stronger effect, which again makes mathematical sense. Together, the model’s predictors do a good job of explaining the outcome: about 59% of the differences in students’ proportion_earned scores can be accounted for by percomp and total_points_earned.

5. COMMUNICATE

The final(ish) step in our workflow/process is sharing the results of our analysis with a wider audience. Krumm et al. (2018) have outlined the following 3-step process for communicating with education stakeholders what you have learned through analysis:

  1. Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”

  2. Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.

  3. Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.

For Unit 1 we will keep it simple. In the code chunk below, select a chart, table, or model created above (or create an entirely new one based on a new analysis) that you think an education stakeholder might find interesting and that addresses one of our research questions. Beneath the code chunk, write a very brief narrative to accompany your data product.

Time Spent in LMS Predicts Performance

# YOUR CODE HERE
ggplot(data_to_explore) +
  geom_point(aes(x = time_spent_hours, 
                 y = proportion_earned, 
                 color = enrollment_status), 
             alpha = 0.6) +
  facet_wrap(~subject) +
  xlab("Time Spent (Hours in LMS)") +
  ylab("Proportion of Points Earned") +
  scale_color_brewer("Enrollment Status", type = "qual", palette = 3) +
  ggtitle("Relationship Between LMS Time and Course Performance by Subject") +
  theme_minimal() +
  theme(legend.position = "bottom")
Warning: Removed 345 rows containing missing values or values outside the scale range
(`geom_point()`).

This faceted scatter plot illustrates the relationship between time spent in the LMS and course performance (proportion of points earned) across several science subjects. Each dot represents a student, and colors reflect different enrollment statuses (e.g., active, withdrawn, transferred).

Across most subjects, students who spent more time in the LMS generally achieved higher course performance, aligning with the predictive relationship identified in the regression model.

Subjects differ in how sharply performance increases with time; for example, Oceanography shows a steeper rise than Physics, suggesting course-specific dynamics.

Notably, students who spent fewer than ~30 hours in the LMS often clustered around lower performance bands, a potential early-warning threshold.

The distribution of enrollment statuses also varies across subjects, suggesting systemic or curricular influences that warrant further investigation.

This visualization is designed for instructional designers or school administrators interested in early intervention: it provides both a broad overview and subject-specific trends, helping prioritize which courses and student cohorts might benefit most from targeted support.

Congratulations!

You’ve completed the first case study! To “turn in” your work, you can click the “Render” icon at the top of the file. This will create a HTML file in your Files pane that serves as a record of your completed assignment.

References

Estrellado, Ryan A, Emily Freer, Jesse Mostipak, Joshua M Rosenberg, and Isabella C Velásquez. 2020. Data Science in Education Using R. Routledge.
Krumm, Andrew, Barbara Means, and Marie Bienkowski. 2018. Learning Analytics Goes to School. Routledge. https://doi.org/10.4324/9781315650722.
Macfadyen, Leah P., and Shane Dawson. 2010. “Mining LMS Data to Develop an ‘Early Warning System’ for Educators: A Proof of Concept.” Computers & Education 54 (2): 588–99. https://doi.org/10.1016/j.compedu.2009.09.008.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media.
Wigfield, Allan, and Jacquelynne S Eccles. 2000. “Expectancy–Value Theory of Achievement Motivation.” Contemporary Educational Psychology 25 (1): 68–81.
Wright, Carrie, Shannon E. Ellis, Stephanie C. Hicks, and Roger D. Peng. 2021. “Tidyverse Skills for Data Science.”