knitr::opts_chunk$set(error = TRUE)
During the final week of each unit, we will complete a “case study” to illustrate how Learning Analytics methods and techniques can be applied to address research questions of interest, create useful data products, and conduct reproducible research. Each case study is structured around a basic research workflow modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):
Figure 2.2 Steps of Data-Intensive Research Workflow
For Unit 1, we will focus on online science classes provided through a state-wide online virtual school and conduct an analysis that help product students’ performance in these online courses. This case study is guided by a foundational study in Learning Analytics that illustrates how analyses like these can be used develop an early warning system for educators identify students at risk of failing and intervene before that happens. The Unit 1 case study will cover the following workflow topics:
Our Unit 1 Case Study is guided by a well-cited publication from two authors that have made numerous contributions to the field of Learning Analytics over the years. Although this article is focused on “early warning systems” in higher education, and where adoption of learning management systems (LMS) like Moodle and Canvas gained a quicker foothold, this study is particularly relevant since COVID-19. Many districts across the county have incorporated a LMS into their remote instruction and have set up virtual academies likely to continue post-pandemic. In North Carolina specifically, student disengagement has become a particular concern among districts and the NC Department of Public Instruction has recently established the Office of Learning Recovery & Acceleration.
Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & education, 54(2), 588-599.
Earlier studies have suggested that higher education institutions could harness the predictive power of Learning Management System (LMS) data to develop reporting tools that identify at-risk students and allow for more timely pedagogical interventions. This paper confirms and extends this proposition by providing data from an international research project investigating which student online activities accurately predict academic achievement. Analysis of LMS tracking data from a Blackboard Vista-supported course identified 15 variables demonstrating a significant simple correlation with student final grade… Moreover, network analysis of course discussion forums afforded insight into the development of the student learning community by identifying disconnected students, patterns of student-to-student communication, and instructor positioning within the network. This study affirms that pedagogically meaningful information can be extracted from LMS-generated student tracking data, and discusses how these findings are informing the development of a dashboard-like reporting tool for educators that will extract and visualize real-time data on student engagement and likelihood of success.
The data analyzed in this exploratory research was extracted from the course-based instructor tracking logs and the BB Vista production server. Data collected on each student included ‘whole term’ counts for frequency of usage of course materials and tools supporting content delivery, engagement and discussion, assessment and administration/management. In addition, tracking data indicating total time spent on certain tool-based activities (assessments, assignments, total time online) offered a total measure of individual student time on task.
The authors used scatter plots for identifying potential relationships between variables under investigation, followed by a a simple correlation analysis of each variable to further interrogate the significance of selected variables as indicators of student achievement. Finally, a linear multiple regression analysis was conducted in order to develop a predictive model in which a student final grade was the continuous dependent variable.
Take a quick scan of Table 3 in the article linked above (also located in your “lit” folder in the files pane). In the space below, answer the following question: Of the 13 LMS variables correlated with student final grade, which 2-3 do you think will be found to significantly predict final grades?
Total # discussion messages posted
Total number of online sessions
Total time online
Now take a quick look at sections 3.2 Multiple Regression and 3.3 Logistic Regression and answer the following questions: What factors in the model did ultimately predict final grades? How accurate was this model in identifying “at risk” students?
total number of discussion messages posted, total number of mail messages sent, and total number of assessments completed
73.7%
In this study, exploratory research was undertaken to identify the data variables that would inform the development of a data visualization tool for instructors. This involved the extraction of all LMS tracking variables for selected course sections at The University of British Columbia, Canada. In so doing, the study aimed to address the following research questions:
Which LMS tracking data variables correlate significantly with student achievement?
How accurately can measures of student online activity in an online course site predict student achievement in the course under study?
Can tracking data recording online student communication patterns offer pedagogically meaningful insights into development of a student learning community?
For our case study, we’ll adopt questions 1 & 2 to guide our exploratory analysis and modeling, with a special emphasis on time spent in the LMS. We’ll also use analytical approaches and data similar to those used by Leah P. Macfadyen and Dawson (2010b) to better understand how LMS, gradebook, and survey data might be predictive of student performance.
As noted in our Getting Started activity and tutorials, R uses “packages” and add-ons that enhance its functionality. One of our first steps in any workflow is to load packages necessary for data wrangling, analysis, and reporting. We’ll load the familiar {tidyverse} package in this section and introduce new packages and corresponding functions throughout the case study.
Recall from earlier tutorials that the {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data and which all share an underlying design philosophy, grammar, and data structures. This shared features are sometimes “tidy data principles.”
Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library as well as the {here} package introduced in previous labs.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Don’t worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in these packages you loaded have the same name as functions in other packages and R will default to function from the last loaded package unless you specify otherwise.
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al, 2018). In Part 2, we focus on the the following workflow processes:
Import Data. In this section, we introduce the read_csv() function for working with CSV files and revisit some key functions for inspecting our data.
Tidy Data. We introduce the separate() and clean_names() functions for getting our data nice and tidy, and revisit the mutate() for creating new variables.
Join Data. We conclude our data wrangling by introducing join() functions for merging our processed files into a single data frame for analysis.
Education data are stored in all sorts of different file formats and structures. In this course, we’ll discuss several of these common formats, how to import your data into R, and how to transform you data into other data formats such as network objects required for social network analysis in Unit 3. In this case study, we’ll focus on working with Comma-separated values (CSV) files.
Similar to spreadsheet formats like Excel and Google Sheets, CSVs allow us to store rectangular data frames but in a much simpler plain-text format, where all the important information in the file is represented by text. Note that “text” here refers to numbers, letters, and symbols you can type on your keyboard. In Tidyverse Skills for Data Science, Wright et al. (2021) note that the advantage of CSVs is that:
… there are no workbooks or metadata making it difficult to open these files. CSVs are flexible files and are thus the preferred storage method for tabular data for many data scientists .
Log-trace data is data generated from our interactions with digital technologies, such as archived data from social media postings. In education, an increasingly common source of log-trace data is that generated from interactions with LMS and other digital tools.
The data we will use in this case study and already been “wrangled” quite a bit and is a summary of log-trace data: the number of minutes students spent on the course. While this data type is fairly straightforward, there are even more complex sources of log-trace data out there (e.g., time stamps associated with when students started and stopped accessing the course).
To help us import our data, we’ll be using two packages: {readr} and {here} . The {readr} package provides a “fast and friendly way” to read rectangular data stored in plain-text file formats like csv, tsv, and fwf. If you are new to readr, I highly recommend the data import chapter in R for Data Science. We don’t need to load the {readr} package because it was already loaded as part of the tidyverse package we called earlier.
Let’s use the read_csv() function from {readr} to import our log-data.csv file directly from our data folder and name this data set time_spent, to help us to quickly recollect what function it serves in this analysis:
time_spent <- read_csv("data/log-data.csv")
## Rows: 716 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): course_id, gender, enrollment_reason, enrollment_status
## dbl (2): student_id, time_spent
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that we had to specify the file path in our argument, but this may not consistently work depending on the type of operating system you are using and where your R Project is located. The message also indicated that four of the columns were specified as “character” data types and two of the columns student_id and time_spent as numeric. The student id as numeric is problematic and something we will have to deal with later.
The goal of the {here} package is to enable easy file referencing in project-oriented workflows. In contrast to depending on working directories and file paths for projects like above, which are a fragile and dependent on the way you organize your files, {here} uses the top-level directory of a project to easily build paths to files.
To use the {here} package, we simply insert the here() function as our first argument for read_csv() and include the location of our file by typing the folder or folders where your data is stored along with the file name.
Give it try:
library(here)
## here() starts at /cloud/project
time_spent <- read_csv(here("data", "log-data.csv"))
## Rows: 716 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): course_id, gender, enrollment_reason, enrollment_status
## dbl (2): student_id, time_spent
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s take a look at this data in a couple ways. First, type time_spent into the console (below this window) and then hit return/enter. You should see a printed summary of this data frame.
Next, got your files pane (bottom right corner of R Studio), click on the data folder then click on the file named “log-data.csv” and in the pop-up click “View File.”
What do you notice about this data? What questions do you have? Add a couple of notes (or more—you can type return/enter after a bullet point to add another) on your observations and/or questions here:
1st row names the column headers; each subsequent row contains cell values separated by commas.
What happens if the number of items in a row doesn’t match the number of columns? Is that instance “bad?”
How does it know the data type (numeric vs character)?
Academic achievement data is (obviously) is a very common form of data in education. In this case study, we’ll use both the sum of the points students earned as well as the number of points possible to compute the percentage of points they earned in the course—a measure comparable (but likely a little different based on teachers’ grading policies) to their final grade.
We’ll load the data in the same way as earlier but take a quick peek by including the name of our data frame in the code chunk as well:
gradebook <- read_csv(here("data", "gradebook-summary.csv"))
## Rows: 717 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): course_id
## dbl (3): student_id, total_points_possible, total_points_earned
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gradebook
## # A tibble: 717 × 4
## student_id course_id total_points_possible total_points_earned
## <dbl> <chr> <dbl> <dbl>
## 1 43146 FrScA-S216-02 1217 1150
## 2 44638 OcnA-S116-01 1676 1384.
## 3 47448 FrScA-S216-01 1232 1116
## 4 47979 OcnA-S216-01 1833 1493.
## 5 48797 PhysA-S116-01 2225 1995.
## 6 51943 FrScA-S216-03 1222 70
## 7 52326 AnPhA-S216-01 1775 1519.
## 8 52446 PhysA-S116-01 2225 2198
## 9 53447 FrScA-S116-01 1212 1173
## 10 53475 FrScA-S116-02 1212 0
## # … with 707 more rows
The third data source is a self-report survey. This was data collected before the start of the course. The survey included ten items, each corresponding to one of three motivation measures: interest, utility value, and perceived competence. These were chosen for their alignment with one way to think about students’ motivation, to what extent they expect to do well (corresponding to their perceived competence) and their value for what they are learning (corresponding to their interest and utility value). We’ll use this in the third learning lab.
In the code below, import the survey.csv file located in the data folder. You can use the code above as a template. Assign the output from the read_csv() function to a new object named survey.
survey <- read_csv(here("data", "survey.csv"))
## Rows: 662 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): student_ID, course_ID, subject, semester, section
## dbl (18): int, val, percomp, tv, q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, p...
## dttm (3): date.x, date.y, date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Hint: By asking you to “assign the output from the read_csv() function the name survey, consider how in the code chunk above this”Your Turn" code chunk, we assigned the output from the read_csv() function to the name gradebook.
After reading the data, let’s continue the practice of looking at our data. Type survey into the console to take a look at the data: Does it appear to be the correct file? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peak at the data helps us to begin to formulate answers to these and is an important step in any data analysis, especially as we prepare for what we are going to do.
Add one or more of the things you notice or wonder about the data here:
q is survey question; I assume it’s a scale from 1 to 5, though I don’t understand all the variables
student_ID is now labeled as a character, not numeric (good!)
RStudio Tip: If you happen to run into issues with data import, RStudio has a handy “Import Dataset” feature for a point and click approach to adding data to your environment. If you want to give this a try, be sure to pay attention to the default settings and the name it will give your data frame when imported.
Once your data is in R, there are many different ways you can view it. Give each of the following at try:
# enter the name of your data frame and view directly in the console or a code chunk
survey
## # A tibble: 662 × 26
## student_ID course_ID subject semester section int val percomp tv q1
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S2… FrScA S216 02 4.2 3.67 4 3.86 4
## 2 44638 OcnA-S11… OcnA S116 01 4 3 3 3.57 4
## 3 47448 FrScA-S2… FrScA S216 01 4.2 3 3 3.71 5
## 4 47979 OcnA-S21… OcnA S216 01 4 3.67 2.5 3.86 4
## 5 48797 PhysA-S1… PhysA S116 01 3.8 3.67 3.5 3.71 4
## 6 51943 FrScA-S2… FrScA S216 03 3.8 3.67 3.5 3.71 4
## 7 52326 AnPhA-S2… AnPhA S216 01 3.6 4 3 4 4
## 8 52446 PhysA-S1… PhysA S116 01 4.2 3.67 3 4 4
## 9 53447 FrScA-S1… FrScA S116 01 3.8 2 3 3 5
## 10 53475 FrScA-S2… FrScA S216 01 4.8 3.33 4 4.14 5
## # … with 652 more rows, and 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>,
## # q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>,
## # post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>,
## # date.y <dttm>, date <dttm>
# view your data frame transposed so your can see every column and the first few entries
glimpse(survey)
## Rows: 662
## Columns: 26
## $ student_ID <chr> "43146", "44638", "47448", "47979", "48797", "51943", "52…
## $ course_ID <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01", "OcnA-S…
## $ subject <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrScA", "AnPh…
## $ semester <chr> "S216", "S116", "S216", "S216", "S116", "S216", "S216", "…
## $ section <chr> "02", "01", "01", "01", "01", "03", "01", "01", "01", "01…
## $ int <dbl> 4.2, 4.0, 4.2, 4.0, 3.8, 3.8, 3.6, 4.2, 3.8, 4.8, 4.6, 3.…
## $ val <dbl> 3.666667, 3.000000, 3.000000, 3.666667, 3.666667, 3.66666…
## $ percomp <dbl> 4.0, 3.0, 3.0, 2.5, 3.5, 3.5, 3.0, 3.0, 3.0, 4.0, 4.0, 3.…
## $ tv <dbl> 3.857143, 3.571429, 3.714286, 3.857143, 3.714286, 3.71428…
## $ q1 <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 5, 5, 4, 3, 4, 4, 5, 4, 4, 5, 5, …
## $ q2 <dbl> 4, 2, 3, 3, 4, 4, 4, 4, 2, 4, 5, 1, 3, 2, 5, 4, 4, 4, 3, …
## $ q3 <dbl> 4, 2, 3, 2, 3, 3, 4, 3, 3, 4, 4, 3, 3, 2, 4, 3, 3, 4, 4, …
## $ q4 <dbl> 5, 4, 4, 4, 4, 4, 2, 4, 4, 5, 4, 4, 4, 5, 5, 4, 4, 5, 5, …
## $ q5 <dbl> 4, 4, 4, 4, 4, 3, 4, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, …
## $ q6 <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 2, 3, 5, 3, 3, 3, 5, 4, 4, 4, 5, …
## $ q7 <dbl> 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 5, 4, …
## $ q8 <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 4, 5, 5, 3, 4, 5, 5, 4, 5, 5, 5, …
## $ q9 <dbl> 3, 3, 3, 4, 3, 4, 4, 3, 2, 3, 4, 2, 3, 1, 5, 3, 3, 4, 3, …
## $ q10 <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 2, 4, 5, 3, 4, 3, 5, 4, 5, 5, 5, …
## $ date.x <dttm> 2016-02-02 18:44:00, 2015-09-09 13:41:00, 2016-01-28 14:…
## $ post_int <dbl> NA, NA, NA, NA, NA, NA, NA, 3.50, 3.75, NA, 5.00, NA, NA,…
## $ post_uv <dbl> NA, NA, NA, NA, NA, NA, NA, 3.666667, 2.000000, NA, 4.666…
## $ post_tv <dbl> NA, NA, NA, NA, NA, NA, NA, 3.571429, 3.000000, NA, 4.857…
## $ post_percomp <dbl> NA, NA, NA, NA, NA, NA, NA, 3.5, 3.0, NA, 4.0, NA, NA, 3.…
## $ date.y <dttm> NA, NA, NA, NA, NA, NA, NA, 2016-01-02 00:41:00, 2015-10…
## $ date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# look at just the first six entries
head(survey)
## # A tibble: 6 × 26
## student_ID course_ID subject semester section int val percomp tv q1
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S21… FrScA S216 02 4.2 3.67 4 3.86 4
## 2 44638 OcnA-S116… OcnA S116 01 4 3 3 3.57 4
## 3 47448 FrScA-S21… FrScA S216 01 4.2 3 3 3.71 5
## 4 47979 OcnA-S216… OcnA S216 01 4 3.67 2.5 3.86 4
## 5 48797 PhysA-S11… PhysA S116 01 3.8 3.67 3.5 3.71 4
## 6 51943 FrScA-S21… FrScA S216 03 3.8 3.67 3.5 3.71 4
## # … with 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
## # post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
## # date <dttm>
# or the last six entries
tail(survey)
## # A tibble: 6 × 26
## student_ID course_ID subject semester section int val percomp tv q1
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 AnPhA-S21… AnPhA S217 02 4.2 5 5 4.5 5
## 2 42 FrScA-S21… FrScA S217 01 4 4 4 4 4
## 3 52 FrScA-S21… FrScA S217 03 4.4 2.67 3.5 3.75 4
## 4 57 FrScA-S21… FrScA S217 01 4.4 2.33 2.5 3.62 5
## 5 72 FrScA-S21… FrScA S217 01 5 3 4 4.25 5
## 6 80 FrScA-S21… FrScA S217 01 3.6 2.33 3 3.12 4
## # … with 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
## # post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
## # date <dttm>
# view the names of your variables or columns
names(survey)
## [1] "student_ID" "course_ID" "subject" "semester" "section"
## [6] "int" "val" "percomp" "tv" "q1"
## [11] "q2" "q3" "q4" "q5" "q6"
## [16] "q7" "q8" "q9" "q10" "date.x"
## [21] "post_int" "post_uv" "post_tv" "post_percomp" "date.y"
## [26] "date"
# or view in source pane
View(survey)
Yes, the “V” is capitalized—very unusual for an R function (the tidyverse does have it’s own lowercase view() function if that bothers you though). Because this function is a bit atypical in more ways than one, I have two recommendations concerning its use:
Use it strictly in the console. Because it opens a new viewing window, including it in an R Markdown script can cause issues when “knitting” or rendering an HTML (or PDF) report. Hence I have included the eval = FALSE argument in the code chunk so it it not run when you knit your document.
Close the viewer window that opens once you have viewed the data. Keeping it open can clutter your work space a bit and can lead to confusion about what data frame it was you viewed.
Tidy data refers to a consistent way to organize your data in R. In our course text, R for Data Science, Hadley Wickham notes that getting your data into this format requires some work, but that work pays off in the long term:
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
Not surprisingly, the Tidyverse set of packages including packages like dplyr adhere “tidy” data principles (Wickham 2014). Tidy data has three interrelated rules which make a dataset tidy:
Each variable is a column
Each observation is a row
Each type of observational unit is a table
Why would this data be considered “untidy?” Hint: Look at the column names.
Earlier, we loaded time_spent, which contains information on the number of minutes that students spent on the course, as well as other variables, particularly course_id.
Information about the course subject, semester, and section are not stored in in a “tidy” format, but rather a single variable—course_id. This format of data storage is not ideal. If we instead give each piece of information its own column, we’ll have more opportunities for later analysis. We’ll use a function called separate() to do this.
First, let’s practice with a small data set. We’ll create it directly in R; run the code below to do that (and to assign the name df to the dataset).
df <- tibble(course_var = c("Fall - Chemistry",
"Fall - Earth Science",
"Spring - Forensic Science",
"Spring - Earth Science",
"Spring - Biology"))
df
## # A tibble: 5 × 1
## course_var
## <chr>
## 1 Fall - Chemistry
## 2 Fall - Earth Science
## 3 Spring - Forensic Science
## 4 Spring - Earth Science
## 5 Spring - Biology
Print df to the console. You should see a single variable, course_info, with four rows.
In this (very small) data frame, there is information about both the semester and the course are encoded within the same variable. The separate() function has two primary arguments, one each for:
Below, see using course_var for #1, and c("Semester", "Course") for #2, can be used to separate the semester and course data into two separate variable
df %>%
separate(course_var, c("semester", "course"))
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].
## # A tibble: 5 × 2
## semester course
## <chr> <chr>
## 1 Fall Chemistry
## 2 Fall Earth
## 3 Spring Forensic
## 4 Spring Earth
## 5 Spring Biology
Next, let’s try something slightly different. Here, we have a data frame with a variable that encodes three pieces of information within the same variable: the year, semester, and subject. There are also a few other differences.
df2 <- tibble(course_variable = c("19-Fall-Algebra I",
"20-Fall-Algebra II",
"20-Spring-Algebra I",
"20-Spring-Algebra II",
"21-Fall-Algebra I"))
df2
## # A tibble: 5 × 1
## course_variable
## <chr>
## 1 19-Fall-Algebra I
## 2 20-Fall-Algebra II
## 3 20-Spring-Algebra I
## 4 20-Spring-Algebra II
## 5 21-Fall-Algebra I
Can you separate the variable in the above data frame not into two, but rather three, new variables?
df2 %>%
separate(course_variable, c("year","semester", "course"))
## Warning: Expected 3 pieces. Additional pieces discarded in 5 rows [1, 2, 3, 4,
## 5].
## # A tibble: 5 × 3
## year semester course
## <chr> <chr> <chr>
## 1 19 Fall Algebra
## 2 20 Fall Algebra
## 3 20 Spring Algebra
## 4 20 Spring Algebra
## 5 21 Fall Algebra
Hint: Try to modify the code from above (in which you separated course_var into two variables) based on a) the name of the variable in df2 and b) adding the name for the third new variable you wish to create.
Let’s return back to our time_spent data frame, now. It is often helpful to take a look at the data before writing code.
Below, we will load time_spent and run the separate() function with the course_id variable to split up the subject, semester, and section so we can use them later on. In other words, whereas above we separated the variable course_variable, in the data set we’ll use here, we’ll separate the course_id variable.
time_spent %>%
separate(course_id,
c("subject", "semester", "section"))
## # A tibble: 716 × 8
## student_id subject semester section gender enrollment_reason enrollment_stat…
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 60186 AnPhA S116 01 M Course Unavailab… Approved/Enroll…
## 2 66693 AnPhA S116 01 M Course Unavailab… Approved/Enroll…
## 3 66811 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 4 66862 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 5 67508 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## 6 70532 AnPhA S116 01 F Learning Prefere… Approved/Enroll…
## 7 77010 AnPhA S116 01 F Learning Prefere… Approved/Enroll…
## 8 85249 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 9 85411 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## 10 85583 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## # … with 706 more rows, and 1 more variable: time_spent <dbl>
There is one last key step—one that is likely to be a bit disorienting at first—that we’ll do next. Once we’ve processed the data how we would like, we have to assign, or save, the results back to the name for the data with which we have been working. This is done with the assignment operator, or the <- symbol.
Copy the code you successfully ran in the chunk above to follow the assignment operator in the chunk below. In other words, write the code you wrote above, but assign the output back to time_spent so that it now includes the course id separated into three distinct variables.
time_spent <- time_spent %>%
separate(course_id,
c("subject", "semester", "section"))
We have made a habit of continually looking at our data after running code to ensure that the step worked as intended. Type the name of the data we have been working with in the code chunk below to ensure that the course_id variable has been separated into three variables that correspond to the subject, semester, and section.
time_spent
## # A tibble: 716 × 8
## student_id subject semester section gender enrollment_reason enrollment_stat…
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 60186 AnPhA S116 01 M Course Unavailab… Approved/Enroll…
## 2 66693 AnPhA S116 01 M Course Unavailab… Approved/Enroll…
## 3 66811 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 4 66862 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 5 67508 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## 6 70532 AnPhA S116 01 F Learning Prefere… Approved/Enroll…
## 7 77010 AnPhA S116 01 F Learning Prefere… Approved/Enroll…
## 8 85249 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 9 85411 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## 10 85583 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## # … with 706 more rows, and 1 more variable: time_spent <dbl>
If those look good, let’s proceed to the next step. If something doesn’t look right, consider re-running the code chunks above, perhaps returning all the way to the first code chunk that you ran (to load the data) to ensure that the output is as you intended for it to be.
In the above code, you used separate() to create new variables based on an existing variable. While that function solves a specific problem (when there are effectively multiple variables combined in one), its use represents a pattern that is fairly common: you use a function to solve a problem; figuring out how it works, checking the output, then assigning the output back to the name of the data frame, after which you can proceed to the next step.
There are a lot of other functions like separate that help you to solve specific problems, and we’ll introduce many over the course - and will point you to resources that describe many more.
There are also functions that can serve as general purpose tools that can solve many problems; one of the most useful is mutate(), a function to create new variables in a data set. Specifically, we’ll use mutate() to create a new variable for the percentage of points each student earned; keep in mind as you work through these steps how so many parts of wrangling data involves either changing a variable or creating a new one. For these purposes, mutate can be very helpful.
Let’s begin again with a small data set with two variables, var_a and var_b. Run the chunk below.
df3 <- tibble(var_a = c(30, 50, 30, 10, 30, 40, 40, 30, 20, 50),
var_b = c(100, 90, 60, 70, 60, 80, 70, 50, 30, 20)
)
Next, print df3 to the console. You should see two numeric variables; imagine they represent points that students earned on a 50-point quiz and a 100-point test, respectively. There are a lot of things that you might wish to do with these variables. For instance, you may wish to sum them together. The code below does this.
df3 %>%
mutate(points_sum = var_a + var_b)
## # A tibble: 10 × 3
## var_a var_b points_sum
## <dbl> <dbl> <dbl>
## 1 30 100 130
## 2 50 90 140
## 3 30 60 90
## 4 10 70 80
## 5 30 60 90
## 6 40 80 120
## 7 40 70 110
## 8 30 50 80
## 9 20 30 50
## 10 50 20 70
We can combine many mutate() functions together. Below, complete the following code to create a new variable (let’s call it points_proportion) that represents the proportion of the total points students could, potentially, earn. To do this, you can divide points_sum created in the second line of code by the maximum possible points—150.
df3 %>%
mutate(points_sum = var_a + var_b) %>%
mutate(points_proportion = points_sum / 150)
## # A tibble: 10 × 4
## var_a var_b points_sum points_proportion
## <dbl> <dbl> <dbl> <dbl>
## 1 30 100 130 0.867
## 2 50 90 140 0.933
## 3 30 60 90 0.6
## 4 10 70 80 0.533
## 5 30 60 90 0.6
## 6 40 80 120 0.8
## 7 40 70 110 0.733
## 8 30 50 80 0.533
## 9 20 30 50 0.333
## 10 50 20 70 0.467
Hint: Just like you can use the + symbol to add variables together, you can use the / symbol to divide a variable by another—or by a value, like 150!
After adding the above, you should see output that contains four variables, one each for var_a and var_b, points_sum, which represents the sum of the points students earned, and points_proportion, which represents the percentage of the total points students earned.
Let’s now process the time_spent variable. Specifically, this variable currently represents the number of minutes that students spent on the course LMS. Below, complete the code template to create a new variable, time_spent_hours, that represents the number of hours that students spent on the course LMS.
time_spent %>%
mutate(time_spent_hours = time_spent / 60)
## # A tibble: 716 × 9
## student_id subject semester section gender enrollment_reason enrollment_stat…
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 60186 AnPhA S116 01 M Course Unavailab… Approved/Enroll…
## 2 66693 AnPhA S116 01 M Course Unavailab… Approved/Enroll…
## 3 66811 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 4 66862 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 5 67508 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## 6 70532 AnPhA S116 01 F Learning Prefere… Approved/Enroll…
## 7 77010 AnPhA S116 01 F Learning Prefere… Approved/Enroll…
## 8 85249 AnPhA S116 01 F Course Unavailab… Approved/Enroll…
## 9 85411 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## 10 85583 AnPhA S116 01 F Scheduling Confl… Approved/Enroll…
## # … with 706 more rows, and 2 more variables: time_spent <dbl>,
## # time_spent_hours <dbl>
Hint: Refer to the code you wrote above, being clear about a) what the name of the new variable you are creating is and b) how you will create this variable using division (by the number of minutes in an hour). Also, it’s a bit confusing and poor naming on my part, but you will need to use the time_spent variable in our time_spent data frame.
We used the above as a test bed to ensure that our code worked as intended. Once we are confident that we are creating the variable in the way we intend to, we can assign the output back to the data frame that time_spent refers to. Paste your working code here to save the new time_spent_hours variable to
time_spent <- time_spent %>%
mutate(time_spent_hours = time_spent / 60)
Good work wrangling this dataset!
Now let’s process the gradebook data. In particular, we’ll separate the course_id variable in the same way we separated that variable in the log data, and we’ll also calculate a new variable representing the proportion of points students earned (out of the points possible to earn).
Let’s start with separating the course_id variable. Run the code in the next chunk to do this. If you named the three parts of the course ID differently than they’re named below, be sure that these three variables are named identically; this is the key (literally) to these variables joining correctly in the next section.
gradebook <- gradebook %>%
separate(course_id,
c("subject", "semester", "section"))
Next, we’ll mutate our data set to create a new column, one representing the proportion of points students earned. Let’s consider a data frame with example data, df4 .
df4 <- tibble(var_a = c(8, 8, 7, 8, 9, 6, 8, 8, 7, 8),
var_b = 9)
Note: To create df4, for var_a, we passed a vector that we created with the function c() that contains 10 values. Consider these to be the number of times that learners participated in an outside-of-school STEM club. Instead of passing another vector for var_b, we simply used the value 9, which represents the number of opportunities students had to participate in the outside-of–school STEM club . In this case, the value 9 is repeated for however many rows there are in the data frame. Thus, in the context of creating a data frame, var_b = 9 is the same as var_b = c(9, 9, 9, 9, 9, 9, 9, 9, 9, 9).
Since interpreting proportions when the denominator is nine can be difficult, we may which to create a variable for the proportion.
After running the chunk above, print df4 to the console or view using another method to get a sense of what the data frame consists of.
To create a third variable that represents the proportion of STEM club activities students participated in, divide var_a by var_b.
df4 %>%
mutate(var_c = var_a / var_b)
## # A tibble: 10 × 3
## var_a var_b var_c
## <dbl> <dbl> <dbl>
## 1 8 9 0.889
## 2 8 9 0.889
## 3 7 9 0.778
## 4 8 9 0.889
## 5 9 9 1
## 6 6 9 0.667
## 7 8 9 0.889
## 8 8 9 0.889
## 9 7 9 0.778
## 10 8 9 0.889
What happens if the output is different than you intended? That’s no problem! Re-run the code-chunk above (in which we create df) to have a blank slate with which to try again.
Your turn once more. This time, create a new variable—here, let’s name it proportion_earned—using the gradebook data. This will involve using the mutate function with the gradebook data, creating a new variable (proportion_earned) on the basis of the values of two existing variables:
total_points_possible
total_points_earned
Also, once your code is ready, you’ll need to assign the results back to gradebook. This is challenging as you’re starting from scratch with the code. However, good R programmers use other code (that they or others wrote!) often, so feel free to copy and paste code from other, similar problems to give yourself a head start.
gradebook <- gradebook %>%
mutate(proportion_earned = total_points_earned / total_points_possible)
Once the above step is complete, take another look at gradebook by printing it to the console or viewing it using a preferred method. There should now be seven columns, the six originally in the data and a new, seventh variable you’ve “mutated.”
Finally, let’s process our survey data that we imported earlier. First though, take a quick look again by typing survey into the console or using the code chunk below.
survey
## # A tibble: 662 × 26
## student_ID course_ID subject semester section int val percomp tv q1
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S2… FrScA S216 02 4.2 3.67 4 3.86 4
## 2 44638 OcnA-S11… OcnA S116 01 4 3 3 3.57 4
## 3 47448 FrScA-S2… FrScA S216 01 4.2 3 3 3.71 5
## 4 47979 OcnA-S21… OcnA S216 01 4 3.67 2.5 3.86 4
## 5 48797 PhysA-S1… PhysA S116 01 3.8 3.67 3.5 3.71 4
## 6 51943 FrScA-S2… FrScA S216 03 3.8 3.67 3.5 3.71 4
## 7 52326 AnPhA-S2… AnPhA S216 01 3.6 4 3 4 4
## 8 52446 PhysA-S1… PhysA S116 01 4.2 3.67 3 4 4
## 9 53447 FrScA-S1… FrScA S116 01 3.8 2 3 3 5
## 10 53475 FrScA-S2… FrScA S216 01 4.8 3.33 4 4.14 5
## # … with 652 more rows, and 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>,
## # q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>,
## # post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>,
## # date.y <dttm>, date <dttm>
Does it appear to be the correct data frame? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peak at the data helps us to begin to formulate answers to these and is an important step in any data analysis, especially as we prepare for what we are going to do.
Add one or more of the things you notice or wonder about the data here:
You may have noticed that student_ID is not formatted exactly the same as student_id in our other files. This is important because in the next section when we “join,” or merge, our data files, these variables will need to have identical names.
Fortunately the {janitor} package has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. There is also a handy function called clean_names() in the {janitor} package for standardizing variable names.
Run the following code to load the {janitor} package and clean our column names so both data frames are consistent:
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
survey <- clean_names(survey)
Let’s take one more look at the data by typing its name into the code chunk below to check that the above function appeared to work; if it did, the names should be lower-case, and any symbols or spaces should now be replaced by an underscore (_).
survey
## # A tibble: 662 × 26
## student_id course_id subject semester section int val percomp tv q1
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S2… FrScA S216 02 4.2 3.67 4 3.86 4
## 2 44638 OcnA-S11… OcnA S116 01 4 3 3 3.57 4
## 3 47448 FrScA-S2… FrScA S216 01 4.2 3 3 3.71 5
## 4 47979 OcnA-S21… OcnA S216 01 4 3.67 2.5 3.86 4
## 5 48797 PhysA-S1… PhysA S116 01 3.8 3.67 3.5 3.71 4
## 6 51943 FrScA-S2… FrScA S216 03 3.8 3.67 3.5 3.71 4
## 7 52326 AnPhA-S2… AnPhA S216 01 3.6 4 3 4 4
## 8 52446 PhysA-S1… PhysA S116 01 4.2 3.67 3 4 4
## 9 53447 FrScA-S1… FrScA S116 01 3.8 2 3 3 5
## 10 53475 FrScA-S2… FrScA S216 01 4.8 3.33 4 4.14 5
## # … with 652 more rows, and 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>,
## # q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
## # post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>,
## # date_y <dttm>, date <dttm>
We’re now ready to join! At their core, joins involve operations on two data frames at the same time. This may seem useful only in certain cases, but consider the following data analysis tasks:
You have collected data from students from one of ten classrooms; at the same time, you have data on the teachers of those ten classes (five of which tried out a new curriculum, and five who taught a “business-as-usual” curriculum
You are studying the posts on Twitter and Pinterest of one of around 100 mathematics teachers
After working with a local school district, you collected survey responses from 100s of teachers who teach in one of approximately 25 elementary, middle, and high schools; you received data from the district on the characteristics of the schools, including how many students they serve and how many teachers work in them
In each of these cases—and many others like them—your single analysis involves multiple data files. While in some cases it is possible to analyze each data set individually, it is often useful (or necessary, depending upon your goal) to join these sources of data together. This is especially the case for learning analytics research, in which researchers and analysts often are interested in understanding teaching and learning through the lens of multiple data sources, including digital data, institutional records, and survey data, among other sources. In all of these cases, knowing how to promptly join together files—even files with tens of thousands of hundreds of thousands of rows—can be empowering.
Consider two example data frames. df5 contains a variable with four student names, name and a variable for the number of STEM-related classes they have taken, n_stem_classes.
df6 contains a variable with three student names, name (like in df5), as well as another, different variable, for students’ self-reported interest in STEM topics, interest_in_stem, measured on a one-seven scale, with seven indicating higher levels of interest.
Run the code below and then type df5 and df6 in the console.
df5 <- tibble(name = c("Sheila", "Tayla", "Marcus"),
n_stem_classes = c(4, 5, 6))
df5
## # A tibble: 3 × 2
## name n_stem_classes
## <chr> <dbl>
## 1 Sheila 4
## 2 Tayla 5
## 3 Marcus 6
df6 <- tibble(name = c("Tayla", "Marcus", "Sheila", "Vin"),
interest_in_stem = c(4, 7, 6, 6))
df6
## # A tibble: 4 × 2
## name interest_in_stem
## <chr> <dbl>
## 1 Tayla 4
## 2 Marcus 7
## 3 Sheila 6
## 4 Vin 6
A key (pun intended!) with joins is to consider what variable(s) will serve as the key. This is the variable to join by.
A key must have two characteristics; it is:
a character string— a word (thus, you cannot join on a number unless you “coerce” or change it to be a character string, as well see later in this case study).
present in both of the data frames you are joining.
To join two datasets, it is important that the key (or keys) on which you are joining the data is formatted identically. The key represents an identifier that is present in both of the data sets you are joining. For instance, you may have data collected from (or created about) the same students that are from two very different sources, such as a self-report survey of students and their teacher-assigned grades in class.
While some of the time it takes some thought to determine what the key is (or what the keys are—you can join on multiple keys!), in the above case, there is just one variable that meets both of the above characteristics.
In the code below, enter the name of the variable that is the key within the quotation marks following the by = argument. Then run the code chunk and note the output.
full_join(df5, df6, by = "name")
## # A tibble: 4 × 3
## name n_stem_classes interest_in_stem
## <chr> <dbl> <dbl>
## 1 Sheila 4 6
## 2 Tayla 5 4
## 3 Marcus 6 7
## 4 Vin NA 6
What do you notice about the output of the full_join()? All observations are valid; consider how the output is similar to and different from df5 and df6, particularly in one or more notes following the bullet point.
full_join() is one of a number of joins from which we can choose. full_join() is distinguished from the other joins by how it returns all of the rows in both of the data frames being joined. If a particular key is present in one of the data frames but not the other, the values for the variable in the data set for which the key is not present are simply recorded as missing (like in the above, where there is no value for the number of STEM classes Vin has taken).
There is one other join on which we’ll focus for now. That is left_join(), which differs from full_join() in that it returns all of the rows in the “left” data frame, the data frame named first in the function, but not all of the rows in the “right” data frame: it retains only the rows in the “right” data frame, the data frame named second in the function, that have a matching key. An example is necessary. Before running the code below, add the same key you added above.
left_join(df5, df6, by = "name")
## # A tibble: 3 × 3
## name n_stem_classes interest_in_stem
## <chr> <dbl> <dbl>
## 1 Sheila 4 6
## 2 Tayla 5 4
## 3 Marcus 6 7
Different from the above, left_join() did not return all of the rows from both data frames, instead returning all of the rows in the “left” data frame (and those in the “right” data frame with a match).
For now, we’re going to use a single join function, full_join(). In the code below, join gradebook and time_spent; type the names of those two data frames as arguments to the full_join() function in a similar manner as in the full_join() code above, and then run this code chunk. For now, don’t specify anything for the by = part of the function.
# join together the gradebook and log_wrangled
joined_data <- full_join(gradebook, time_spent)
## Joining, by = c("student_id", "subject", "semester", "section")
joined_data
## # A tibble: 830 × 12
## student_id subject semester section total_points_possible total_points_earned
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 43146 FrScA S216 02 1217 1150
## 2 44638 OcnA S116 01 1676 1384.
## 3 47448 FrScA S216 01 1232 1116
## 4 47979 OcnA S216 01 1833 1493.
## 5 48797 PhysA S116 01 2225 1995.
## 6 51943 FrScA S216 03 1222 70
## 7 52326 AnPhA S216 01 1775 1519.
## 8 52446 PhysA S116 01 2225 2198
## 9 53447 FrScA S116 01 1212 1173
## 10 53475 FrScA S116 02 1212 0
## # … with 820 more rows, and 6 more variables: proportion_earned <dbl>,
## # gender <chr>, enrollment_reason <chr>, enrollment_status <chr>,
## # time_spent <dbl>, time_spent_hours <dbl>
You may notice a message in the console or first output box above that says Joining, by = c("student_id", "course", "subject", "section"). This is telling us that these files are being joined on the basis of all four of these variables matching in both data sets; in other words, for rows to be joined, they must match identically on all four of these variables.
This is related to not specifying anything for the by = part of the function; by default, full_join() (and left_join()) will consider any character strings with identical names that are present in both data sets to be keys. But, it’s generally better practice to specify the variables on which we are joining.
In the code chunk below, write your join like above, but add the by = c("student_id", "course", "subject", "section") part to your code. You may notice the message does not appear. This is generally a better practice because you know precisely on which variables you data sets are joining.
# join together the gradebook and log_wrangled
joined_data <- full_join(gradebook, time_spent, by = c("student_id", "subject", "semester", "section"))
joined_data
## # A tibble: 830 × 12
## student_id subject semester section total_points_possible total_points_earned
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 43146 FrScA S216 02 1217 1150
## 2 44638 OcnA S116 01 1676 1384.
## 3 47448 FrScA S216 01 1232 1116
## 4 47979 OcnA S216 01 1833 1493.
## 5 48797 PhysA S116 01 2225 1995.
## 6 51943 FrScA S216 03 1222 70
## 7 52326 AnPhA S216 01 1775 1519.
## 8 52446 PhysA S116 01 2225 2198
## 9 53447 FrScA S116 01 1212 1173
## 10 53475 FrScA S116 02 1212 0
## # … with 820 more rows, and 6 more variables: proportion_earned <dbl>,
## # gender <chr>, enrollment_reason <chr>, enrollment_status <chr>,
## # time_spent <dbl>, time_spent_hours <dbl>
Hint: If you’re curious about how to format the use of the by = part of your code, look up above at how you used this argument to the full_join() function.
What do you notice about the result—the data you joined? In particular, how does it differ from the two data sets from which it was created? Add one or more notes below.
Now let’s try joining on our final data set so that our LMS, gradebook and survey data are all in a single data frame for analysis called data_to_explore:
# join together the gradebook and log_wrangled
data_to_explore <- full_join(joined_data, survey, by = c("student_id", "subject", "semester", "section"))
data_to_explore
Oops! You probably notices a red error message telling us that even though student_id is consistently named in both files, they are not of the same data type. Something we were warned about earlier.
We can use the mutate() function that we learned about in our tutorials to change this variable in our joined_data from a “double,” i.e. numeric variable, to a character variable using the base R as.character() function:
joined_data <- joined_data %>%
mutate(student_id = as.character(student_id))
In the code chunk below, try joining our two data frames again using the same code from above:
# join together the gradebook and log_wrangled
data_to_explore <- full_join(joined_data, survey, by = c("student_id", "subject", "semester", "section"))
data_to_explore
We’ll revisit joins in our Unit 2 tutorials, but for a quick overview of the different join functions with helpful visuals, visit: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti
Congratulations! We’re done wrangling our data and it is ready to explore!
As highlighted in both DSEIUR and Learning Analytics Goes to School, calculating summary statistics and data visualization are a key part of exploratory data analysis. One goal in this phase is to explore questions that drove the original analysis and develop new questions and hypotheses to test in later stages. In Part 3 we’ll learn to:
Summarize Key Stats. We’ll learn about the {skimr} package for pulling together some quick creating descriptive statistics when your goal is to understand your data internally.
Visualize Data. We’ll introduce the histogram geom for taking a quick peak the distributions of some key variables and put together some scatter plots for examining potential relationships between time spent and student performance.
Correlated Variables. We’ll wrap up our Part 3 by learning to create a correlation matrix for quantifying these relationships between key variables and student performance.
At this point, we have quite the comprehensive data set, including single measures from a) students for the time they spent in the course LMS and other information about them, such as information on why they are enrolled in the course, b) their academic achievement.
An efficient package for creating descriptive statistics when your goal is to understand your data internally (rather than to create a table for an external-to-the-research-team audience, like for a journal article) is the {skimr} package. A key feature of the {skimr} package is that it works well with the {tidyverse} packages we are using: it takes data frames as input, and returns data frames as output, which means we can manipulate them with {tidyverse} functions like select(), filter(), and arrange(), for example.
Let’s load the {skimr} package:
library(skimr)
The challenge here is not the complexity of the skim() function, per se, but will be comprehending the terminology. In the code chunk below:
Pass to the skim() function a single argument (recall from our tutorials last week that functions have names and arguments!)
That single argument is the data frame (aka in tidyverse parlance, a tibble) for which you are aiming to calculate descriptive statistics
Run the following code to “skim” our data_to_explore tibble:
skim(data_to_explore)
## Error in skim(data_to_explore): object 'data_to_explore' not found
Note: If you are having difficult viewing your data in the code chunk, try clicking the icon in the output that looks like a spreadsheet with an arrow on it to expand your output in a separate window.
What do you notice about the output? These observations might pertain to the format of the output or its values (i.e., what the mean for the val variable is). Note one or two observations or questions below:
As we noted earlier, the {skimr} package works nicely with other {tidyverse} functions.
Hint: For help, also consider running ?skim() in the console to view the documentation for the function.
Recall from the Week 3 tutorials and exercises how we isolated data using the select() and filter() functions. In the code chunk below, look at descriptives for just proportion_earned , time_spent and gender, but only for the “OcnA” and “PhysA” subjects.
Can you do this by modifying the code below to do this?
data_to_explore %>%
select(subject, proportion_earned, time_spent, gender) %>%
filter(subject == "OcnA" | subject == "PhysA") %>%
skim()
## Error in select(., subject, proportion_earned, time_spent, gender): object 'data_to_explore' not found
We noted earlier that this output is best for internal use. This is because the output is rich, but not well-suited to exporting to a table that you add, for instance, to a Microsoft Word manuscript. Of course, these values can be entered manually into a table, but we’ll also discuss ways later on to create tables that are ready–or nearly-ready–to be added directly to journal articles or evaluation reports.
If you are curious about doing more with {skimr}, check out: <https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html>
You likely notice that the skim() function created some miniature histograms for each of the numeric variables in our data-to-explore data frame. Since we are especially interested in whether time spent in the online course is related to student performance, let’s take a closer look at this variable.
The code below creates a histogram with 30 bins—the default number for geom_histogram:
data_to_explore %>%
ggplot(aes(x = time_spent_hours)) +
geom_histogram(bins = 30)
## Error in ggplot(., aes(x = time_spent_hours)): object 'data_to_explore' not found
Now change the number of bins using the code chunk below below and note any differences in what you interpret about the data.
data_to_explore %>%
ggplot(aes(x = time_spent_hours)) +
geom_histogram(bins = 25)
## Error in ggplot(., aes(x = time_spent_hours)): object 'data_to_explore' not found
What do you think the ideal number of bins is—with what is ideal being the number of bins that helps you to interpret the overall distribution of the values for how much time students’ spent (note: there is no one right or wrong answer here!)?
We’ll next be using the facet_wrap() function to create small multiples, or plots that are specific to subsets of your data. These subsets are identified based on another variable in your dataset. For example, the code below uses the built-in mpg dataset from the tutorials to plot the relationship between the displacement of a car’s engine and its highway miles per gallon fuel efficiency.
ggplot(mpg, aes(displ, hwy)) +
geom_point()
The code in the next plot creates individual plots for each class—think compact car or SUV.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class)
In the code below, create a faceted histogram based on the subject of the course. To do so, consider both:
What code you used to create the histogram of the time students’ spent on the course
How, in the example above, facet_wrap refers to the variable in that data frame that represents the class of the car—but modifying the code to work for your subject variable
You may also wish to change the color; reflect back to the getting started task for an example of how to do this.
data_to_explore %>%
ggplot(aes(x = time_spent_hours, color = gender)) +
geom_histogram(bins = 30) +
facet_wrap(~subject)
## Error in ggplot(., aes(x = time_spent_hours, color = gender)): object 'data_to_explore' not found
What do you notice about this figure? What questions do you have? Add a note (or a few notes!) below:
Having prepared the data sets we joined together, and worked hard to join those data sets, we’re now ready to use this joined data set in our exploration of how the time students spent on the course LMS relate to the number of points they earned throughout the course.
We’ll be using the {ggplot2} package again, but, this time, will be creating a different type of plot.
Run the code below to create a scatter plot of the proportion of points students earned and the number of hours they spent on the course LMS.
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned)) +
geom_point()
## Error in ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned)): object 'data_to_explore' not found
What do you notice about this graph? And what do you wonder? How about the code—what do you notice about it (and what do you wonder)? Add one or more of what you see as the most important elements.
Using {ggplot2} makes it efficient to iterate through different versions of similar plots. For instance, we can color the points by a third variable, such as the reason for which students enrolled in the course, to begin to explore what was going on for students who spent very little time on the course:
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
geom_point()
## Error in ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, : object 'data_to_explore' not found
We can also additionally create faceted plots, like the one you created in the last learning lab. In the code below, facet the plot by subject.
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
geom_point() +
facet_wrap(~subject)
## Error in ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, : object 'data_to_explore' not found
You may wish to style your plot. A few ways you can do that are as follows; we’ll discuss more throughout the institute. For each of the following, add them to your plot by adding a plus symbol to the line prior to the line you are adding. For instance, the following code styles the x-axis label of a plot:
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
geom_point() +
xlab("Time Spent (Hours)")
## Error in ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, : object 'data_to_explore' not found
Try adding (and modifying, if you’d like) any of the following to the faceted plot you created in the code chunk below:
xlab("Time Spent (Hours)")
ylab("Proportion of Points Earned")
scale_color_brewer("Enrollment Status", type = "qual", palette = 3)
ggtitle("How Time Spent on Course LMS is Related to Points Earned in the Course")
theme(legend.position = "bottom")
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
geom_point() +
facet_wrap(~subject) +
xlab("Time Spent (Hours)") +
ylab("Proportion of Points Earned") +
scale_color_brewer("Enrollment Status", type = "qual", palette = 3) +
ggtitle("How Time Spent on Course LMS is Related to Points Earned in the Course") +
theme(legend.position = "bottom")
## Error in ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, : object 'data_to_explore' not found
Once you have settled on a plot you are happy with (for now!), add a sentence or two interpreting your graph (like you were describing it within a journal article):
As highlighted in Leah P. Macfadyen and Dawson (2010a), the scatter plots are a useful initial approach for identifying potential correlational trends between variables under investigation, but to further interrogate the significance of selected variables as indicators of student achievement, a simple correlation analysis of each variable with student final grade can be conducted.
There are two efficient ways to create correlation matrices, one that is best for internal use, and one that is best for inclusion in a manuscript.
First, the {corrr} package provides a way to create a correlation matrix in a {tidyverse}-friendly way. Like for the {skimr} package, it can take as little as a line of code to create a correlation matrix. If not familiar, a correlation matrix is a table that presents how all of the variables are related to all of the other variables.
Run the following code to load the {corrr} package:
library(corrr)
##
## Attaching package: 'corrr'
## The following object is masked from 'package:skimr':
##
## focus
Since one main goal of this case study is to investigate whether time spent in an online course is predictive of student achievement, let’s first take a look and see if there is a simple correlation between time spent and student achievement.
Run the following code to create a simple correlation matrix using the correlate() function from the {corrr} package:
data_to_explore %>%
select(proportion_earned, time_spent_hours) %>%
correlate()
## Error in select(., proportion_earned, time_spent_hours): object 'data_to_explore' not found
For the purpose of printing, and as a quick aside, the {corrr} package also has several nice functions for converting a correlation data frame into a matrix with the correlations cleanly formatted (leading zeros removed; spaced for signs) and the diagonal (or any NA) left blank.
Run the following code to try it out:
data_to_explore %>%
select(proportion_earned, time_spent_hours, int, val, percomp) %>%
correlate() %>%
rearrange() %>%
shave() %>%
fashion()
## Error in select(., proportion_earned, time_spent_hours, int, val, percomp): object 'data_to_explore' not found
In the code chunk below, select 3-4 numeric variables in addition to time_spent_hours that you think may be related to student achievement, i.e. proportion_earned, and run a simple correlation.
data_to_explore %>%
select(total_points_possible, proportion_earned, time_spent_hours, int, val, tv) %>%
correlate() %>%
rearrange() %>%
shave() %>%
fashion()
## Error in select(., total_points_possible, proportion_earned, time_spent_hours, : object 'data_to_explore' not found
Hint: One key is to correlate only numeric variables. Note that while some numeric variables can technically be used, it is likely not sensible to correlate all of the variables; some—for instance, the section variable—are not very sensible to correlate!
What did you find? Were your selected variables related to time spent in the course? These observations might pertain to the format of the output or its values (i.e., what the mean for the val variable is). Note one or two of these observations or questions below:
If you are interested in learning more about the {corrr} package, visit: <https://corrr.tidymodels.org>
As we noted earlier, the {skimr} package works nicely with other {tidyverse} functions. While {corrr} is a nice package to quickly create a correlation matrix, you may wish to create one that is ready to be added directly to a dissertation or journal article. {apaTables} is great for creating more formal forms of output that can be added directly to an APA-formatted manuscript; it also has functionality for regression and other types of model output. It is not as friendly to {tidyverse} functions; first, we need to select only the variables we wish to correlate.
Then, we can use that subset of the variables as the argument to the apa.cor.table() function.
Run the following code to create a subset of the larger data_to_explore data frame with the variables you wish to correlate, then create a correlation table using apa.cor.table().
library(apaTables)
data_to_explore_subset <- data_to_explore %>%
select(time_spent_hours, proportion_earned, int)
## Error in select(., time_spent_hours, proportion_earned, int): object 'data_to_explore' not found
apa.cor.table(data_to_explore_subset)
## Error in as.data.frame(data): object 'data_to_explore_subset' not found
This may look nice, but how to actually add this into a dissertation or journal article that you might be interested in publishing? Read the documentation for apa.cor.table() by running ?apa.cor.table() in the console. Look through the documentation and examples to understand how to output a file with the formatted correlation table, and then run the code to do that with your subset of the data_to_explore data frame.
apa.cor.table(data_to_explore_subset, filename = "cor-table.doc",
table.number = 1,
show.conf.interval = TRUE,
show.sig.stars = TRUE,
landscape = TRUE)
## Error in as.data.frame(data): object 'data_to_explore_subset' not found
You should now see a new Word document in your project folder called survey-cor-table.doc. Click on that and you’ll be prompted to download from your browser.
As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.
For the purpose of this case study, let’s consider the proportion_earned variable to be our dependent, or the outcome, variable. You may be new to linear regression models, or you may have a lot of experience. In brief, a linear regression model involves estimating the relationships between one or more independent variables with one dependent variable. Mathematically, it can be written like the following.
\[ \operatorname{dependentvar} = \beta_{0} + \beta_{1}(\operatorname{independentvar}) + \epsilon \]
Here, the dependentvar is predicted by two coefficients, or things that help to explain the dependent variable. The first coefficient, \(\beta_0\), is the intercept. This coefficient tells us what the estimated value of the dependent variable is when the independent variable (independentvar) is equal to 0. The other coefficient, \(\beta_1\), or the slope, represents the association of a one-unit change in the independent variable in the value of the dependent variable.
Let’s consider a simple concrete example. We’ll use the lm() function in R to estimate a linear regression model.
The following code estimates a model in which proportion_earned, the proportion of points students earned, is the dependent variable. It is predicted by one independent variable, int, students’ self-reported interest in science.
lm(proportion_earned ~ time_spent_hours,
data = data_to_explore)
## Error in is.data.frame(data): object 'data_to_explore' not found
Let’s take a look at the output.
We can see that the intercept is estimated at 0.53. This tells us that when students’ time spent in the online course is equal to zero, their predicted proportion of points earned is 0.62—not such a great grade, but also not surprising! But, for every one-unit, or hour, increase in time spent in science, their estimate proportion of points earned was 0.0048. So if a student spent, for instance, 40 hours on the course, their estimated final grade would be .62 + (.0048 * 40), or around .82, or 82%. A pretty solid B-!
We can add additional predictor variables by separating variables with a plus symbol. Run the following code to add int, students’ self-reported interest in science, to our linear model:
lm(proportion_earned ~ time_spent_hours + int,
data = data_to_explore)
## Error in is.data.frame(data): object 'data_to_explore' not found
We can see that the intercept is now estimated at 0.44, which tells us that when students’ time spent and interest are equal to zero, they are likely fail the course. Note that that estimate for interest in science is .046, so for every one-unit increase in int, we should expect an 5 percentage point increase in their grade.
We can save the output of the function to an object—let’s say m1, standing for model 1. We can then use the summary() function built into R to view a much more feature-rich summary of the estimated model.
m1 <- lm(proportion_earned ~ time_spent_hours + int, data = data_to_explore)
## Error in is.data.frame(data): object 'data_to_explore' not found
summary(m1)
## Error in summary(m1): object 'm1' not found
There is a lot to unpack in this output, but for now the most important values to look at are those in the Estimate column, which represent the intercept and slopes for your linear regression model.
Note that the estimate for time_spent_hours is now 0.0042 and statistically significant. We see that int, interest in science, is also statistically significant.
Now let’s consider the mean values for each of these predictors. Recall from our tutorials last week the summarize() function from the {dplyr} package that was used to create summary statistics such as the mean, standard deviation, or the minimum or maximum of a value. At its core, think of summarize() as a function that returns a single value (whether it’s a mean, median, standard deviation—whichever!) that summarizes a single column.
Let’s use the summarize function to calculate the means for time spent and interest in science and add the argument na.rm = TRUE to tell R that it can ignore missing, or NA values, and to calculate the summary statistic using the non-missing values.
data_to_explore %>%
summarize(mean_interest = mean(int, na.rm = TRUE),
mean_time = mean(time_spent_hours, na.rm = TRUE))
## Error in summarize(., mean_interest = mean(int, na.rm = TRUE), mean_time = mean(time_spent_hours, : object 'data_to_explore' not found
The mean value for interest is quite high. If we multiply the estimate relationship between interest and proportion of points earned—0.046—by this, the mean interest across all of the students—we can determine that students’ estimated construction to their final grade was 0.046 X 4.3, or 0.197. For hours spent spent, the average students’ estimate was 0.0042 X 30.48, or 0.128.
If we add both 0.197 and 0.128 to the intercept, 0.449, that equals 0.774, or about 77%. In other words, a student with average interest in science who spent an average amount of time in the course earned a pretty average grade.
Finally, similar to our APA formatted correlation table above, we can use the {apaTables} package to create a nice regression table that could be used for later publication:
apa.reg.table(m1, filename = "lm-table.doc")
## Error in apa.reg.table(m1, filename = "lm-table.doc"): object 'm1' not found
Below, estimate a different regression models with at least 2 variables, save as m2, and view a summary() of the results:
m2 <- lm(proportion_earned ~ subject +
time_spent_hours, data = data_to_explore)
## Error in is.data.frame(data): object 'data_to_explore' not found
summary(m2)
## Error in summary(m2): object 'm2' not found
Add a brief note or two interpreting the above model (m2):
every hour studying is worth about half a point (.0053) in grade increase
Bio seems to be a harder class giving a negative correlation
The final(ish) step in our workflow/process is sharing the results of analysis with wider audience. Krumm et al. (2018) have outline the following 3-step process for communicating with education stakeholders what you have learned through analysis:
Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.
For Unit 1 we will keep it simple. In the code chunk below, select a chart, table or model created above (or create an entirely new one based a new analysis) that you think an education stakeholder might find interesting and addresses one of our research questions. Beneath the code chunk, write a very brief narrative to accompany your narrative.
data_to_explore %>%
ggplot(aes(x = time_spent_hours, y = proportion_earned, color = gender)) +
geom_point() +
facet_wrap(~subject)
## Error in ggplot(., aes(x = time_spent_hours, y = proportion_earned, color = gender)): object 'data_to_explore' not found
There are a couple of courses where female students are spending much more time to achieve higher grades (Anatomy/Physiology, Biology, Oceanography) than their male counterparts. This could be an indicator that study groups are not achieving equal participation compared to other classes. May be good to look into who is leading those groups and how they are comprised.
You’ve completed the first case study! To “turn in” your work, you can click the “Knit” icon at the top of the file, or click the dropdown arrow next to it and select “Knit top HTML.” This will create a report in your Files pane that serves as a record of your completed assignment and its output you can open or share.