Week 4: Predicting Student Achievement

Author

Dr.Cansu Tatar

1. Prepare

During the final week of each unit, we will complete a “case-study” to illustrate how Learning Analytics methods and techniques can be applied to address research questions of interest, create useful data products, and conduct reproducible research. Each case study is structured around a basic research workflow modeled after the Data Intensive Research Workflow from Learning Analytics Goes to School (Krumm, 2018).

Figure 1.

Steps of Data Intensive Research Workflow

For Unit 1, we will focus on online science classes provided through a state-wide online virtual school and conduct an analysis that help predict students’ performance in these online courses. This case study is guided by a foundational study in Learning Analytics that illustrates how analyses like these can be used develop an early warning system to help educators to identify students at risk of failing and intervene before that happens.

The Unit 1 case study will cover the following workflow topics.

Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. You’ll also need to become familiar with and load essential packages for analysis.
Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In the part 2, we focus on importing CSV files, tidying and joining our data. For this assignment, we will only complete the half of the tidying process.
Explore: In Part 3, we use basic data visualization and calculate some summary statistics to explore our data and see what insight it provides in response to our question.
Model: After identifying variable that may be related to student performance through exploratory analysis, we’ll look at correlations and create some simple models of our data using linear regression.
Communicate: To wrap up our case study, we’ll develop our first “data product” and share our analyses and findings by creating our first web page using R pubs.

1a. Review the Literature

Our Unit 1 Case Study is guided by a well-cited publication from two authors that have made numerous contributions to the field of Learning Analytics over the years. Although this article is focused on “early warning systems” in higher education, and where adoption of learning management systems (LMS) like Moodle and Canvas gained a quicker foothold, this study is particularly relevant since COVID-19. Many districts across the county have incorporated a LMS into their remote instruction and have set up virtual academies likely to continue post-pandemic.

Note: You have the following reading in your literature file. Check the full-text from there.

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & education, 54(2), 588-599

Abstract

Earlier studies have suggested that higher education institutions could harness the predictive power of Learning Management System (LMS) data to develop reporting tools that identify at-risk students and allow for more timely pedagogical interventions. This paper confirms and extends this proposition by providing data from an international research project investigating which student online activities accurately predict academic achievement. Analysis of LMS tracking data from a Blackboard Vista-supported course identified 15 variables demonstrating a significant simple correlation with student final grade… Moreover, network analysis of course discussion forums afforded insight into the development of the student learning community by identifying disconnected students, patterns of student-to-student communication, and instructor positioning within the network. This study affirms that pedagogically meaningful information can be extracted from LMS-generated student tracking data, and discusses how these findings are informing the development of a dashboard-like reporting tool for educators that will extract and visualize real-time data on student engagement and likelihood of success.

Data Sources & Analysis

The data analyzed in this exploratory research was extracted from the course-based instructor tracking logs and the BB VistaTM production server. Data collected on each student included ‘whole term’ counts for frequency of usage of course materials and tools supporting content delivery, engagement and discussion, assessment and administration/management. In addition, tracking data indicating total time spent on certain tool-based activities (assessments, assignments, total time online) offered a total measure of individual student time on task.

The authors used scatter plots for identifying potential correlational trends between variables under investigation, followed by a a simple correlation analysis of each variable to further interrogate the significance of selected variables as indicators of student achievement. Finally, a linear multiple regression analysis was therefore conducted, in order to develop a predictive model in which ‘Student final grade’ was the continuous dependent variable.

Your Turn ⤵

Take a quick scan of Table 1 in the article linked above (and also located in your “lit” folder in the files pane) and in the space below, answer the following question: Of the 13 LMS variables correlated with student final grade, which 2-3 do you think will found to significantly predict final grades?

I think that of the 13 LMS variables correlated with student final grade, the ones that would be found to significantly predict final grades would be total time online, assessments finished, and total number of online sessions.

Now take a quick look at Discussion and Conclusions sections and answer the following questions: What factors in the model did ultimately predict final grades? How accurate was this model in identifying “at risk” students?

In the article, three factors were identified as predictors of students final grades: the number of forum postings, mail messages sent, and assessments completed. The model proved reasonably accurate in identifying “at risk” students with a 70.3% accuracy rate in classifying students with final grades below 60% and an 80.9% accuracy rate in flagging students who actually failed the course.

1b. Define Questions

In this study, exploratory research was undertaken to identify the data variables that would inform the development of a data visualization tool for instructors. This involved the extraction of all LMS tracking variables for selected course sections at The University of British Columbia, Canada. In so doing, the study aimed to address the following research questions:

Which LMS tracking data variables correlate significantly with student achievement?
How accurately can measures of student online activity in an online course site predict student achievement in the course under study?
Can tracking data recording online student communication patterns offer pedagogically meaningful insights into development of a student learning community?

For our case study, we’ll adopt questions 1 & 2 wholesale to guide our exploratory analysis and modeling, with a special emphasis on time spent in the LMS. We’ll also use analytical approaches and data similar to those used by @macfadyen2010mining to better understand how LMS, gradebook, and survey data might be predictive of student performance.

1c. Load Libraries

As noted in our earlier class, R uses “packages” and add-ons that enhance its functionality. One of our first steps in any workflow is to load packages necessary for data wrangling, analysis, and reporting. We’ll load the familiar {tidyverse} package in this section and introduce new packages and corresponding functions throughout the case study.

Recall from earlier class that the {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data and which all share an underlying design philosophy, grammar, and data structures. These shared features are sometimes “tidy data principles.”

Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library introduced.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Don’t worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in these packages you loaded have the same name as functions in other packages and R will default to function from the last loaded package unless you specify otherwise.

2. Wrangle

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data [@wickham2023r]. The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled [@krumm2018] and often comprises a large share of time spent on data-intensive research workflows. In Part 2, we focus on the the following workflow processes:

Import Data. In this section, we introduce the read_csv() function for working with CSV files and revisit some key functions for inspecting our data.
Tidy Data. We introduce the separate() and clean_names() functions for getting our data nice and tidy, and revisit the mutate() function for creating new variables and feature engineering.
Join Data. We conclude our data wrangling by introducing join() functions for merging our processed files into a single data frame for analysis.

2a. Import Data

Education data are stored in all sorts of different file formats and structures. In this course, we’ll discuss several of these common formats, how to import your data into R, and how to transform you data into other data structures such as network objects required for social network analysis in Unit 3. In this case study, we’ll focus on working with comma-separated values (CSV) files.

Similar to spreadsheet formats like Excel and Google Sheets, CSVs allow us to store rectangular data frames, but in a much simpler plain-text format, where all the important information in the file is represented by text. Note that “text” here refers to numbers, letters, and symbols you can type on your keyboard. In Tidyverse Skills for Data Science, Wright et al. @wright2021tidyverse note that the advantage of a .csv files is that:

… there are no workbooks or metadata making it difficult to open these files. CSVs are flexible files and are thus the preferred storage method for tabular data for many data scientists.

Data Source #1: Log Data

Log-trace data is data generated from our interactions with digital technologies, such as archived data from social media postings. In education, an increasingly common source of log-trace data is that generated from interactions with LMS and other digital learning tools.

The data we will use in this case study and already been “wrangled” quite a bit and is a summary of log-trace data: the number of minutes students spent on the course. While this data type is fairly straightforward, there are even more complex sources of log-trace data out there (e.g., time stamps associated with when students started and stopped accessing the course).

To help us import our data, we’ll be using the {readr} package, which provides a “fast and friendly way” to read rectangular data stored in plain-text file formats like csv, tsv, and fwf. If you are new to {readr}, I highly recommend the data import chapter in R for Data Science [@wickham2023r]. Note that we don’t need to load the {readr} package because it was already loaded as part of the tidyverse package we called earlier.

Let’s use the read_csv() function from {readr} to import our log-data.csv file directly from our data folder and name this data set time_spent, which will now be saved in our R environment:

time_spent <- read_csv("data/log-data.csv")

Rows: 716 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): course_id, gender, enrollment_reason, enrollment_status
dbl (2): student_id, time_spent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

There are two quick things to note. First, importing our data and saving in our environment does not alter the original file saved in our data folder, so we don’t have to worry about causing any harm to the original file as we wrangle and explore our data!

Second, the message in the output indicated that four of the columns were specified as “character” data types and two of the columns student_id and time_spent as double or numeric types.

Your Turn ⤵

Let’s take a look at this data in a couple ways. First, type time_spent into the console (the pane below this one) and then hit return/enter. You should see a printed summary of this data frame.

Next, got your files pane (bottom right corner of R Studio), click on the data folder then click on the file named “log-data.csv” and in the pop-up click “View File.”

What do you notice about this data? What questions do you have? Add a couple of notes (or more—you can type return/enter after a bullet point to add another) on your observations and/or questions here:

It seems that the difference between typing “time_spent” in the console and using the “view file” option is that typing “time_spent” into the console displays just the first ten rows of the data, while the “view file” method provides all the log data.

Data Source #2: Academic Achievement Data

Academic achievement data is (obviously) is a very common form of data in education. In this case study, we’ll use both the sum of the points students earned as well as the number of points possible to compute the percentage of points they earned in the course—a measure comparable (but likely a little different based on teachers’ grading policies) to their final grade.

Your Turn ⤵

In the code chunk below, read in to R the gradebook-summary.csv file located in the data folder. You can use the code above as a template. Assign the output from the read_csv() function to a new object named gradebook.

# YOUR CODE HERE

gradebook <- read_csv("data/gradebook-summary.csv")

Rows: 717 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): course_id
dbl (3): student_id, total_points_possible, total_points_earned

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

gradebook

# A tibble: 717 × 4
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      43146 FrScA-S216-02                  1217               1150 
 2      44638 OcnA-S116-01                   1676               1384.
 3      47448 FrScA-S216-01                  1232               1116 
 4      47979 OcnA-S216-01                   1833               1493.
 5      48797 PhysA-S116-01                  2225               1995.
 6      51943 FrScA-S216-03                  1222                 70 
 7      52326 AnPhA-S216-01                  1775               1519.
 8      52446 PhysA-S116-01                  2225               2198 
 9      53447 FrScA-S116-01                  1212               1173 
10      53475 FrScA-S116-02                  1212                  0 
# ℹ 707 more rows

Data Source #3: Self-Report Survey

The third data source is a self-report survey. This was data collected before the start of the course. The survey included ten items, each corresponding to one of three motivation measures: interest, utility value, and perceived competence. The three motivation measures we explore here come from Expectancy-Value Theory, which states that students are motivated to learn when they both believe that they can achieve something (expectancy, also known as “perceived competence”) and believe that the concept they are trying to learn is important (value) [@wigfield2000expectancy]. For a more information about this survey, including the specific items included, see Chapter 7 of Data Science in Education Using R [@estrellado2020data].

Your Turn ⤵

In the code chunk below, read in to R the survey.csv file located in the data folder. You can use the code above as a template. Assign the output from the read_csv() function to a new object named survey.

# YOUR CODE HERE

survey <- read_csv("data/survey.csv")

Rows: 662 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): course_id, subject, semester
dbl  (20): student_id, section, int, val, percomp, tv, q1, q2, q3, q4, q5, q...
dttm  (3): date.x, date.y, date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

survey

# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
        <dbl> <chr>     <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1      43146 FrScA-S2… FrScA   S216           2   4.2  3.67     4    3.86     4
 2      44638 OcnA-S11… OcnA    S116           1   4    3        3    3.57     4
 3      47448 FrScA-S2… FrScA   S216           1   4.2  3        3    3.71     5
 4      47979 OcnA-S21… OcnA    S216           1   4    3.67     2.5  3.86     4
 5      48797 PhysA-S1… PhysA   S116           1   3.8  3.67     3.5  3.71     4
 6      51943 FrScA-S2… FrScA   S216           3   3.8  3.67     3.5  3.71     4
 7      52326 AnPhA-S2… AnPhA   S216           1   3.6  4        3    4        4
 8      52446 PhysA-S1… PhysA   S116           1   4.2  3.67     3    4        4
 9      53447 FrScA-S1… FrScA   S116           1   3.8  2        3    3        5
10      53475 FrScA-S2… FrScA   S216           1   4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

After reading the data, let’s continue the practice of looking at our data. Type survey into the console to take a look at the data or add survey on a new line in your code chunk above. Does it appear to be the correct file? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peak at the data helps us to begin to formulate answers to these questions is an important step in any data analysis.

Add one or more of the things you notice or wonder about the data here:

The file does seem to be correct as it shows what I assume are the results to the survey questions 1-10. It also seems to be in a Likert scale design between 1-5; I say 1-5 on the Likert scale because I did not see any numbers higher than 5. I also assume that the 5 is associated with the most positive.

RStudio Tip: If you happen to run into issues with data import, RStudio has a handy “Import Dataset” feature for a point and click approach to adding data to your environment. If you want to give this a try, be sure to pay attention to the default settings and the name it will give your data frame when imported.

View Data

Once your data is in R, there are many different ways you can view it. Give each of the following at try:

# enter the name of your data frame and view directly in the console or a code chunk
survey

# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
        <dbl> <chr>     <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
 1      43146 FrScA-S2… FrScA   S216           2   4.2  3.67     4    3.86     4
 2      44638 OcnA-S11… OcnA    S116           1   4    3        3    3.57     4
 3      47448 FrScA-S2… FrScA   S216           1   4.2  3        3    3.71     5
 4      47979 OcnA-S21… OcnA    S216           1   4    3.67     2.5  3.86     4
 5      48797 PhysA-S1… PhysA   S116           1   3.8  3.67     3.5  3.71     4
 6      51943 FrScA-S2… FrScA   S216           3   3.8  3.67     3.5  3.71     4
 7      52326 AnPhA-S2… AnPhA   S216           1   3.6  4        3    4        4
 8      52446 PhysA-S1… PhysA   S116           1   4.2  3.67     3    4        4
 9      53447 FrScA-S1… FrScA   S116           1   3.8  2        3    3        5
10      53475 FrScA-S2… FrScA   S216           1   4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

# view your data frame transposed so your can see every column and the first few entries
glimpse(survey)

Rows: 662
Columns: 26
$ student_id   <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326, 52446, 5…
$ course_id    <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01", "OcnA-S…
$ subject      <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrScA", "AnPh…
$ semester     <chr> "S216", "S116", "S216", "S216", "S116", "S216", "S216", "…
$ section      <dbl> 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, …
$ int          <dbl> 4.2, 4.0, 4.2, 4.0, 3.8, 3.8, 3.6, 4.2, 3.8, 4.8, 4.6, 3.…
$ val          <dbl> 3.666667, 3.000000, 3.000000, 3.666667, 3.666667, 3.66666…
$ percomp      <dbl> 4.0, 3.0, 3.0, 2.5, 3.5, 3.5, 3.0, 3.0, 3.0, 4.0, 4.0, 3.…
$ tv           <dbl> 3.857143, 3.571429, 3.714286, 3.857143, 3.714286, 3.71428…
$ q1           <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 5, 5, 4, 3, 4, 4, 5, 4, 4, 5, 5, …
$ q2           <dbl> 4, 2, 3, 3, 4, 4, 4, 4, 2, 4, 5, 1, 3, 2, 5, 4, 4, 4, 3, …
$ q3           <dbl> 4, 2, 3, 2, 3, 3, 4, 3, 3, 4, 4, 3, 3, 2, 4, 3, 3, 4, 4, …
$ q4           <dbl> 5, 4, 4, 4, 4, 4, 2, 4, 4, 5, 4, 4, 4, 5, 5, 4, 4, 5, 5, …
$ q5           <dbl> 4, 4, 4, 4, 4, 3, 4, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, …
$ q6           <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 2, 3, 5, 3, 3, 3, 5, 4, 4, 4, 5, …
$ q7           <dbl> 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 5, 4, …
$ q8           <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 4, 5, 5, 3, 4, 5, 5, 4, 5, 5, 5, …
$ q9           <dbl> 3, 3, 3, 4, 3, 4, 4, 3, 2, 3, 4, 2, 3, 1, 5, 3, 3, 4, 3, …
$ q10          <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 2, 4, 5, 3, 4, 3, 5, 4, 5, 5, 5, …
$ date.x       <dttm> 2016-02-02 18:44:00, 2015-09-09 13:41:00, 2016-01-28 14:…
$ post_int     <dbl> NA, NA, NA, NA, NA, NA, NA, 3.50, 3.75, NA, 5.00, NA, NA,…
$ post_uv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.666667, 2.000000, NA, 4.666…
$ post_tv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.571429, 3.000000, NA, 4.857…
$ post_percomp <dbl> NA, NA, NA, NA, NA, NA, NA, 3.5, 3.0, NA, 4.0, NA, NA, 3.…
$ date.y       <dttm> NA, NA, NA, NA, NA, NA, NA, 2016-01-02 00:41:00, 2015-10…
$ date         <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

# look at just the first six entries
head(survey)

# A tibble: 6 × 26
  student_id course_id  subject semester section   int   val percomp    tv    q1
       <dbl> <chr>      <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
1      43146 FrScA-S21… FrScA   S216           2   4.2  3.67     4    3.86     4
2      44638 OcnA-S116… OcnA    S116           1   4    3        3    3.57     4
3      47448 FrScA-S21… FrScA   S216           1   4.2  3        3    3.71     5
4      47979 OcnA-S216… OcnA    S216           1   4    3.67     2.5  3.86     4
5      48797 PhysA-S11… PhysA   S116           1   3.8  3.67     3.5  3.71     4
6      51943 FrScA-S21… FrScA   S216           3   3.8  3.67     3.5  3.71     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

# or the last six entries
tail(survey)

# A tibble: 6 × 26
  student_id course_id  subject semester section   int   val percomp    tv    q1
       <dbl> <chr>      <chr>   <chr>      <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
1         19 AnPhA-S21… AnPhA   S217           2   4.2  5        5    4.5      5
2         42 FrScA-S21… FrScA   S217           1   4    4        4    4        4
3         52 FrScA-S21… FrScA   S217           3   4.4  2.67     3.5  3.75     4
4         57 FrScA-S21… FrScA   S217           1   4.4  2.33     2.5  3.62     5
5         72 FrScA-S21… FrScA   S217           1   5    3        4    4.25     5
6         80 FrScA-S21… FrScA   S217           1   3.6  2.33     3    3.12     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

# view the names of your variables or columns
names(survey)

 [1] "student_id"   "course_id"    "subject"      "semester"     "section"     
 [6] "int"          "val"          "percomp"      "tv"           "q1"          
[11] "q2"           "q3"           "q4"           "q5"           "q6"          
[16] "q7"           "q8"           "q9"           "q10"          "date.x"      
[21] "post_int"     "post_uv"      "post_tv"      "post_percomp" "date.y"      
[26] "date"

# or view in source pane
View(survey)

Yes, the “V” is capitalized—very unusual for an R function (the tidyverse does have it’s own lowercase view() function if that bothers you though). Because this function is a bit atypical in more ways than one, I have two recommendations concerning its use:

Use it strictly in the console. Because it opens a new viewing window, including it in an R Markdown or Quarto document can cause issues when “knitting” or “rendering” an HTML (or PDF) file. Hence I have included the eval: false option in the code chunk so it does not run when you render your document.
Close the viewer window that opens once you have viewed the data. Keeping it open can clutter your work space a bit and can lead to confusion about what data frame it was you viewed.

2b. Tidy Data

Tidy data refers to a consistent way to organize your data in R. In our course text, R for Data Science, Hadley Wickham notes that getting your data into this format requires some work, but that work pays off in the long term:

Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

Not surprisingly, the Tidyverse set of packages including packages like dplyr adhere “tidy” data principles. Tidy data has three interrelated rules which make a dataset tidy:

Each variable is a column
Each observation is a row
Each type of observational unit is a table

Your Turn ⤵

Why would this data be considered “untidy”? Hint: Look at the column names.

The columns (A, AA, AB, AC…etc.) have generic labels that are not informative or meaningful. It would be better to label the columns as the years (e.g., 2004, 2005…).

Process Log Data

Earlier, we imported time_spent, which contains information on the number of minutes that students spent on the course, as well as other variables, particularly course_id.

Information about the course subject, semester, and section are not stored in a “tidy” format, but rather a single variable—course_id. This format of data storage is not ideal. If we instead give each piece of information its own column, we’ll have more opportunities for later analysis. We’ll use a function called separate() to do this.

First, let’s practice with a small data set. We’ll create it directly in R; run the code below to do that (and to assign the name df to the dataset).

df <- tibble(course_var = c("Fall - Chemistry", 
                            "Fall - Earth Science", 
                            "Spring - Forensic Science",
                            "Spring - Earth Science",
                            "Spring - Biology"))

df

# A tibble: 5 × 1
  course_var               
  <chr>                    
1 Fall - Chemistry         
2 Fall - Earth Science     
3 Spring - Forensic Science
4 Spring - Earth Science   
5 Spring - Biology

Print df to the console. You should see a single variable, course_var, with four rows.

In this (very small) data frame, there is information about both the semester and the course are encoded within the same variable. The separate() function has two primary arguments, one each for:

the variable you want to separate
the names of the new variables to create

Below, see using course_var for #1, and c("Semester", "Course") for #2, can be used to separate the semester and course data into two separate variable

df |> 
  separate(course_var, c("semester", "course"))

Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].

# A tibble: 5 × 2
  semester course   
  <chr>    <chr>    
1 Fall     Chemistry
2 Fall     Earth    
3 Spring   Forensic 
4 Spring   Earth    
5 Spring   Biology

Note that we used the |> operator called a pipe that was introduced in our earlier class. Recall that pipes, including the original %>% pipe operator, are a powerful tool for combining a sequence of functions or processes. Here we used it to send our df data frame to the seprate() function.

We could also have written the function as follows, with our data frame as the first argument in the function:

separate(df, course_var, c("semester", "course"))

Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].

# A tibble: 5 × 2
  semester course   
  <chr>    <chr>    
1 Fall     Chemistry
2 Fall     Earth    
3 Spring   Forensic 
4 Spring   Earth    
5 Spring   Biology

All of the {dplyr} functions like and many {tidyverse} functions like separate() take a data frame (or tibble) as the first argument. However, it can be helpful for just for general code readability to separate (no pun intended) your data frame from the functions and starting your first line of code with the data frame to be wrangled. This is a pretty standard convention when writing R code.

Next, let’s try something slightly different. Here, we have a data frame with a variable that encodes three pieces of information within the same variable: the year, semester, and subject. There are also a few other differences.

df2 <- tibble(course_variable = c("19-Fall-Algebra1", 
                                  "20-Fall-Algebra2", 
                                  "20-Spring-Algebra2",
                                  "20-Spring-Algebra2",
                                  "21-Fall-Algebra1"))
df2

# A tibble: 5 × 1
  course_variable   
  <chr>             
1 19-Fall-Algebra1  
2 20-Fall-Algebra2  
3 20-Spring-Algebra2
4 20-Spring-Algebra2
5 21-Fall-Algebra1

Your Turn ⤵

Can you separate the variable in the above data frame not into two, but rather three, new variables?

# YOUR CODE HERE
df2 |>
  separate(course_variable, c("year","semester","course"))

# A tibble: 5 × 3
  year  semester course  
  <chr> <chr>    <chr>   
1 19    Fall     Algebra1
2 20    Fall     Algebra2
3 20    Spring   Algebra2
4 20    Spring   Algebra2
5 21    Fall     Algebra1

Hint: Try to modify the code from above (in which you separated course_var into two variables) based on a) the name of the variable in df2 and b) adding the name for the third new variable you wish to create.

Let’s return back to our time_spent data frame, now. It is often helpful to take a look at the data before writing code.

Below, we will load time_spent and run the separate() function with the course_id variable to split up the subject, semester, and section so we can use them later on. In other words, whereas above we separated the variable course_variable, in the data set we’ll use here, we’ll separate the course_id variable.

time_spent |>
  separate(course_id,
           c("subject", "semester", "section"))

# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>

Your Turn ⤵

There is one last key step – one that is likely to be a bit disorienting at first. Once we’ve processed the data how we would like, we have to assign, or save, the results back to the name for the data with which we have been working. This is done with the assignment operator, or the <- symbol.

Copy the code you successfully ran in the chunk above to follow the assignment operator in the chunk below. In other words, write the code you wrote above, but assign the output back to time_spent so that it now includes the course id separated into three distinct variables.

# YOUR CODE HERE
time_spent2 <- time_spent |>
  separate(course_id, c("subject", "semester", "section"))

Note: It would have been perfectly acceptable, and in some cases preferable, to assign to a new name such as time_spent_tidy. However, to help keep our environment pane from becoming too cluttered and avoid confusion, we will reuse the same object name.

We have made a habit of continually looking at our data after running code to ensure that the step worked as intended. Type the name of the data we have been working with in the code chunk below to ensure that the course_id variable has been separated into three variables that correspond to the subject, semester, and section.

# YOUR CODE HERE
time_spent2

# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>

If something doesn’t look right, consider re-running the code chunks above, perhaps returning all the way to the first code chunk that you ran (to load the data) to ensure that the output is as you intended for it to be.

Next Class Time

We will finish the step 2: wrangle process together in our R tutorial series.

“Mutate” a column to change the time spent variable to represent hours
Process Gradebook Data
Process Survey Data
Joining the data
Change Data Type