knitr::include_graphics("Img/research-workflow.png")
Rtutorial
Week 3: Introduction to R Toolkit
This activity is designed to warm up your understanding of the Learning Analytics (LA) workflow. Through this exercise, we will practice the basics of how RStudio works. In all of our practice activities, we will follow the LA workflow:
- Prepare: We will import our data file into R in the “Prepare” section.
- Wrangle: We will prepare and wrangle our data in the “Wrangle” section.
- Explore: We will check the patterns in the data through visualization.
- Model: We will run a regression model in the “Model” section.
- Communicate: We will create a reproducible report of our work that you can share with others in the “Communicate” section.
How to follow this document?
In the R tutorial toolkit and case study weeks, we will be using a Quarto markdown file, which has the extension “.qmd”. Like R Markdown, Quarto documents are fully reproducible, and they support dozens of output formats, such as PDFs, Word files, presentations, and more. In this document, we’re using visual mode to see the general parts.
There are two keys to your use of Quarto for this activity:
- First, you can see the document in the “Visual Editor” mode. You can use this mode by clicking the word “Visual” on the left side of the toolbar above. The visual editor allows you to view formatted headers, text and code chunks as specified by the underlying markdown syntax, or “Source” text. Visual mode is a bit more “human readable” than syntax but definitely take a look at the source text.
- Second, note the specially formatted text box below called a “code chunk.” These chunks allow you to run code from multiple languages, including R, Python, and SQL. This specific code chunk contains a line of R code. If you are curious about other chunk options, you can visit this link to learn more.
The Data-Intensive Research Workflow
Last week in class, we talked about the data-intensive research workflow. As we mentioned a couple of times, we will follow this workflow to present our research approach.
Let’s get started.
1. Prepare
As a first step in the data-intensive research workflow, we need to define our research question(s) and understand where the data come from (Krumm, 2018).
For this activity, we will work with data from an unpublished research study, which used a number of different data sources to understand high school students’ motivation within the context of online courses.
Our research question is:
“Is there a relationship between the time students spend on a course and their final course grade?”
For our analysis, we will need certain packages. One of the most commonly used is the “tidyverse” package. It is actually a collection of R packages designed for wrangling and exploring data, all of which share an underlying design philosophy, grammar, and data structure.
knitr::include_graphics("Img/tidyverse.png")
Throughout this class, we will have a chance to try different libraries in the tidyverse. Let’s install our tidyverse package.
We installed the package; what do we need to do now to use it?
We will also use another package called “skimr”. This is a handy package that provides summary statistics that you can skim quickly to understand your data. We’ll be using this later in the Explore section.
#install.packages("skimr")
#install the {skimr} package below.
install.packages("skimr")
#load the {skimr} package below.
library(skimr)
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.0.4
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sci_online_classes <- read_csv("sci-online-classes.csv")
Rows: 603 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): course_id, subject, semester, section, Gradebook_Item, Gender
dbl (23): student_id, total_points_possible, total_points_earned, percentage...
lgl (1): Grade_Category
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(sci_online_classes)
Your Turn:
Why do you think we included data/ before our sci-online-classes.csv file? Why quotation marks?
Add your responses after the dashes below:
- Quotation marks are used to denote that the file path (“data/sci-online-classes.csv”) must be a text string in R. Functions like read.csv() require the file name and location to be specified as text so R can correctly identify and open the file. Without quotation marks, R would treat the file name as an object or variable instead of a file path, which would cause an error.
Hint: check the files panel.
Viewing or inspecting data
Let’s quickly check our data.
#Print the data to the console
sci_online_classes
# A tibble: 603 × 30
student_id course_id total_points_possible total_points_earned
<dbl> <chr> <dbl> <dbl>
1 43146 FrScA-S216-02 3280 2220
2 44638 OcnA-S116-01 3531 2672
3 47448 FrScA-S216-01 2870 1897
4 47979 OcnA-S216-01 4562 3090
5 48797 PhysA-S116-01 2207 1910
6 51943 FrScA-S216-03 4208 3596
7 52326 AnPhA-S216-01 4325 2255
8 52446 PhysA-S116-01 2086 1719
9 53447 FrScA-S116-01 4655 3149
10 53475 FrScA-S116-02 1710 1402
# ℹ 593 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
# section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
# FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
# Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
# q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
# TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
View(sci_online_classes)
Your Turn:
What do you notice about this dataset? What do you wonder? Add one or two thoughts after the dash below:
- This response critically examines the quality of the datasets.
Observation: Several variables contain missing values, particularly within the survey question columns. This could potentially impact analysis and necessitates data cleaning.
Inquiry: It would be beneficial to investigate why certain students lack survey responses and whether this pattern varies by course or semester.
There are other ways to inspect your data; the glimpse() function provides one such way. Let’s take a glimpse at our data.
glimpse(sci_online_classes)
Rows: 603
Columns: 30
$ student_id <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1 <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2 <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3 <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4 <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5 <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6 <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7 <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8 <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9 <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10 <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
Generally, rows represent “cases”: the units that we measure, or the units on which we collect data. What counts as a “case” (and therefore what is represented as a row) varies by (and within) fields. There may be multiple types or levels of units studied in your field; listing more than one is fine! Also, please consider what columns (which usually represent variables) represent in your area of work and/or research.
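To make the row and column idea concrete, here is a minimal sketch using a small made-up data frame. The object toy_classes and its values are hypothetical and are not part of the course data; the same three functions can be applied to sci_online_classes.

```r
# A tiny, made-up data frame: each row is one student (a "case"),
# and each column is one variable measured on that student.
toy_classes <- data.frame(
  student_id = c(1, 2, 3),
  subject    = c("BioA", "OcnA", "BioA"),
  final      = c(88.5, NA, 72.0)
)

n_cases        <- nrow(toy_classes)   # number of rows (cases)
n_variables    <- ncol(toy_classes)   # number of columns (variables)
variable_names <- names(toy_classes)  # the column (variable) names

print(n_cases)
print(n_variables)
print(variable_names)
```

Note that a missing value (NA) in one column does not remove the case: the second student still counts as a row.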
Your Turn:
How many “cases” or observations are in this dataset?
- There are 603 cases (observations) in the dataset.
Pick two columns (or more) and write what you think it represents:
- percentage_earned: This column denotes the proportion of total course points attained by a student, computed as the ratio of points accumulated to points available. This metric serves as an indicator of the student’s overall academic achievement within the course.
- TimeSpent_hours: This column quantifies the total duration (in hours) a student dedicated to interacting with the online course material. This variable is hypothesised to serve as a proxy for student effort and engagement.
2. Wrangle
By wrangle, we refer to the process of cleaning and processing data, and, in some cases, merging (or joining) data from multiple sources. Often, this part of the process is very time-intensive! Wrangling your data into shape can itself be an important accomplishment! And documenting your code using R scripts or Markdown files will save you and others a great deal of time wrangling data in the future! There are great tools in R for data wrangling, especially the {dplyr} package, which is part of the tidyverse.
Selecting Variables
Remember our research question: what were we interested in finding out from these data?
Let’s practice selecting a few variables by introducing the pipe operator, |>. Pipes are a powerful tool for combining a sequence of functions or processes.
Run the following code chunk to “pipe” our sci_online_classes data to the select() function and include the following two variables as arguments:
- FinalGradeCEMS (i.e., students’ final grades on a 0-100 point scale)
- TimeSpent (i.e., the number of minutes they spent in the course’s learning management system)
#library(dplyr)
sci_online_classes |>
  select(FinalGradeCEMS, TimeSpent)
# A tibble: 603 × 2
FinalGradeCEMS TimeSpent
<dbl> <dbl>
1 93.5 1555.
2 81.7 1383.
3 88.5 860.
4 81.9 1599.
5 84 1482.
6 NA 3.45
7 83.6 1322.
8 97.8 1390.
9 96.1 1479.
10 NA NA
# ℹ 593 more rows
The code chunk produced an error.
Notice how the number of columns (variables) is now different! Let’s check our data with the View() function this time.
I selected percentage_earned and TimeSpent_hours:
sci_online_classes |>
  select(percentage_earned, TimeSpent_hours)
# A tibble: 603 × 2
percentage_earned TimeSpent_hours
<dbl> <dbl>
1 0.677 25.9
2 0.757 23.0
3 0.661 14.3
4 0.677 26.6
5 0.865 24.7
6 0.855 0.0575
7 0.521 22.0
8 0.824 23.2
9 0.676 24.7
10 0.820 NA
# ℹ 593 more rows
A quick footnote about pipes: The original pipe operator, %>%, comes from the {magrittr} package but all packages in the tidyverse load %>% for you automatically, so you don’t usually load magrittr explicitly. The pipe has become such a useful and much used operator in R that it is now baked into R using the new and simpler native pipe |> operator. You can use both fairly interchangeably but there are a few differences between pipe operators.
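As a minimal sketch of the footnote above, the following compares a nested call, the native pipe, and the {magrittr} pipe on a small made-up tibble. The object toy_grades and its values are hypothetical; the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Made-up data for illustration only (not the course data).
toy_grades <- tibble(
  FinalGradeCEMS = c(93.5, 81.7, NA),
  TimeSpent      = c(1555, 1383, 860),
  Gender         = c("M", "F", "M")
)

# Three ways of writing the same selection.
nested        <- select(toy_grades, FinalGradeCEMS, TimeSpent)
native_pipe   <- toy_grades |> select(FinalGradeCEMS, TimeSpent)
magrittr_pipe <- toy_grades %>% select(FinalGradeCEMS, TimeSpent)

# Both comparisons should print TRUE.
print(identical(nested, native_pipe))
print(identical(native_pipe, magrittr_pipe))
```

One of the differences worth knowing: when the piped value should go somewhere other than the first argument, %>% uses the placeholder `.`, while |> uses `_` (R 4.2 or later), and only with a named argument.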
Filtering Variables
Let’s explore filtering. Whereas select() picks columns, filter() picks rows. We will filter our data to view only the rows associated with students who earned a final grade (as a percentage) above 70.
sci_online_classes |>
  filter(FinalGradeCEMS > 70)
# A tibble: 438 × 30
student_id course_id total_points_possible total_points_earned
<dbl> <chr> <dbl> <dbl>
1 43146 FrScA-S216-02 3280 2220
2 44638 OcnA-S116-01 3531 2672
3 47448 FrScA-S216-01 2870 1897
4 47979 OcnA-S216-01 4562 3090
5 48797 PhysA-S116-01 2207 1910
6 52326 AnPhA-S216-01 4325 2255
7 52446 PhysA-S116-01 2086 1719
8 53447 FrScA-S116-01 4655 3149
9 53475 FrScA-S216-01 1209 977
10 54066 OcnA-S116-01 4641 3429
# ℹ 428 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
# section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
# FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
# Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
# q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
# TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
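Two details of filter() are easy to miss: > excludes a grade of exactly 70 while >= includes it, and rows where the condition evaluates to NA are dropped. The sketch below illustrates both on a small made-up tibble (toy_grades and its values are hypothetical) and uses nrow() to count the rows that remain.

```r
library(dplyr)

# Made-up data for illustration only (not the course data).
toy_grades <- tibble(
  student        = c("a", "b", "c", "d"),
  FinalGradeCEMS = c(70, 85.2, NA, 64)
)

# Strictly greater than 70: the student with exactly 70 is excluded.
above_70 <- toy_grades |> filter(FinalGradeCEMS > 70)

# 70 or higher: the student with exactly 70 is included.
at_least_70 <- toy_grades |> filter(FinalGradeCEMS >= 70)

# In both cases the student with a missing grade is dropped.
print(nrow(above_70))
print(nrow(at_least_70))
```

Piping a filtered table into nrow() is one way to answer "how many students" questions without reading the count off the tibble header.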
Your Turn:
In the next code chunk, change the cut-off from 70% to some other value, larger or smaller (maybe much larger or smaller). Feel free to play around with the code a bit!
How many students had a final grade above 85%?
sci_online_classes |>
  filter(FinalGradeCEMS > 85)
# A tibble: 279 × 30
student_id course_id total_points_possible total_points_earned
<dbl> <chr> <dbl> <dbl>
1 43146 FrScA-S216-02 3280 2220
2 47448 FrScA-S216-01 2870 1897
3 52446 PhysA-S116-01 2086 1719
4 53447 FrScA-S116-01 4655 3149
5 54066 OcnA-S116-01 4641 3429
6 54282 OcnA-S116-02 3581 2777
7 54434 PhysA-S116-01 3228 2506
8 55078 FrScA-S216-01 7000 4212
9 56152 AnPhA-S116-02 3323 2468
10 57224 FrScA-S116-03 4546 3772
# ℹ 269 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
# section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
# FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
# Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
# q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
# TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
What happens when you change the cut-off from 70 to something else?
- Increasing the cut-off from 70 to 85 leads to a reduction in the number of observations. A more stringent threshold excludes a greater proportion of students, thereby retaining only those with superior final grades and consequently diminishing the size of the resulting subset.
Arrange
The last function we’ll use for preparing tables is arrange(). We’ll again use the pipe operator to combine it with a function we have already used, select(). We do this so we can view only time spent and final grades.
sci_online_classes |>
  select(FinalGradeCEMS, TimeSpent) |>
  arrange(FinalGradeCEMS)
# A tibble: 603 × 2
FinalGradeCEMS TimeSpent
<dbl> <dbl>
1 0 13.9
2 0.535 306.
3 0.903 88.5
4 1.80 44.7
5 2.93 57.7
6 3.01 571.
7 3.06 0.7
8 3.43 245.
9 5.04 202.
10 5.2 11.0
# ℹ 593 more rows
Note that arrange() sorts values in ascending order (from lowest to highest) by default; you can change this by wrapping the variable in the desc() function inside arrange().
#let's change the order from asc to desc
sci_online_classes |>
  arrange(desc(FinalGradeCEMS))
# A tibble: 603 × 30
student_id course_id total_points_possible total_points_earned
<dbl> <chr> <dbl> <dbl>
1 85650 FrScA-S116-01 8206 4432
2 91067 BioA-S116-01 2672 2249
3 66740 OcnA-S116-01 4171 3639
4 86792 FrScA-S116-01 2316 1927
5 78153 PhysA-S216-01 6530 3702
6 66689 FrScA-S216-01 3390 2738
7 88261 FrScA-S116-01 2419 1624
8 92740 PhysA-S116-01 3347 2308
9 92726 PhysA-S116-01 2739 2356
10 92741 PhysA-S116-01 3070 2163
# ℹ 593 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
# section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
# FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
# Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
# q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
# TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
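One behaviour of arrange() that matters for a column with missing grades: missing values are placed at the end whether the sort is ascending or descending. The sketch below shows this on a small made-up tibble (toy_grades and its values are hypothetical).

```r
library(dplyr)

# Made-up data for illustration only (not the course data).
toy_grades <- tibble(
  student        = c("a", "b", "c", "d"),
  FinalGradeCEMS = c(70, NA, 95, 12)
)

ascending  <- toy_grades |> arrange(FinalGradeCEMS)
descending <- toy_grades |> arrange(desc(FinalGradeCEMS))

# Student "b" (the missing grade) is last in both orderings.
print(ascending$student)
print(descending$student)
```

So the first rows of a descending sort are the highest observed grades, not the students with missing grades.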
Your Turn:
In the next code chunk, replace FinalGradeCEMS that is used with both the select() and arrange() functions with a different variable in the dataset.
#Arrange in ascending order
sci_online_classes |>
  select(percentage_earned, FinalGradeCEMS) |>
  arrange(percentage_earned)
# A tibble: 603 × 2
percentage_earned FinalGradeCEMS
<dbl> <dbl>
1 0.338 NA
2 0.466 81.7
3 0.496 93.5
4 0.498 94.2
5 0.499 89.6
6 0.503 71.5
7 0.515 91.3
8 0.516 73.5
9 0.516 73.8
10 0.521 92.9
# ℹ 593 more rows
#Arrange in descending order
sci_online_classes |>
  select(percentage_earned, FinalGradeCEMS) |>
  arrange(desc(percentage_earned))
# A tibble: 603 × 2
percentage_earned FinalGradeCEMS
<dbl> <dbl>
1 0.911 96.0
2 0.908 87.4
3 0.907 92.9
4 0.904 72.9
5 0.901 92.9
6 0.901 94.2
7 0.899 94.6
8 0.897 87.1
9 0.897 64.8
10 0.896 82.2
# ℹ 593 more rows
3. Explore
Exploratory data analysis, or exploring your data, involves processes of describing your data (such as by calculating the means and standard deviations of numeric variables, or counting the frequency of categorical variables) and, often, visualizing your data. As we’ll learn later in this class, the explore phase can also involve the process of “feature engineering,” or creating new variables within a dataset [@krumm2018]. In this section, we’ll quickly pull together some basic stats using a handy function from the {skimr} package, and introduce you to a basic data visualization “code template” for the {ggplot2} package from the tidyverse.
Summary Statistics
Let’s repurpose what we learned from our wrangle section to select just a few variables and quickly gather some descriptive stats using the skim() function from the {skimr} package.
sci_online_classes |>
  select(FinalGradeCEMS, TimeSpent) |>
  skim()
| Name | select(sci_online_classes… |
| Number of rows | 603 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FinalGradeCEMS | 30 | 0.95 | 77.20 | 22.23 | 0.00 | 71.25 | 84.57 | 92.10 | 100.00 | ▁▁▁▃▇ |
| TimeSpent | 5 | 0.99 | 1799.75 | 1354.93 | 0.45 | 851.90 | 1550.91 | 2426.09 | 8870.88 | ▇▅▁▁▁ |
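To see where a few of the skim() columns come from, here is a minimal sketch that computes them by hand with summarise() on a small made-up tibble. The object toy_grades, its values, and the output column names are hypothetical. For comparison, the FinalGradeCEMS row above reports 30 missing values out of 603 rows, which gives a complete rate of 573/603, or about 0.95.

```r
library(dplyr)

# Made-up data for illustration only (not the course data).
toy_grades <- tibble(FinalGradeCEMS = c(80, 90, NA, 100))

toy_summary <- toy_grades |>
  summarise(
    n_missing     = sum(is.na(FinalGradeCEMS)),        # count of NA values
    complete_rate = mean(!is.na(FinalGradeCEMS)),      # share of non-missing values
    mean_grade    = mean(FinalGradeCEMS, na.rm = TRUE), # mean, ignoring NA
    sd_grade      = sd(FinalGradeCEMS, na.rm = TRUE)    # standard deviation, ignoring NA
  )

print(toy_summary)
```

Without na.rm = TRUE, mean() and sd() return NA as soon as a single value is missing, which is why the argument appears in both calls.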
Your Turn:
Copy the code from the chunk above and use it as a template to explore some other variables of interest from our sci_online_classes data.
Variables
FinalGradeCEMS
percentage_earned
#use skim() to summarize other variables of your choosing.
sci_online_classes |>
  select(FinalGradeCEMS, percentage_earned) |>
  skim()
| Name | select(sci_online_classes… |
| Number of rows | 603 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FinalGradeCEMS | 30 | 0.95 | 77.20 | 22.23 | 0.00 | 71.25 | 84.57 | 92.10 | 100.00 | ▁▁▁▃▇ |
| percentage_earned | 0 | 1.00 | 0.76 | 0.09 | 0.34 | 0.70 | 0.78 | 0.83 | 0.91 | ▁▁▃▇▇ |
What happens if you simply feed the skim() function the entire sci_online_classes object? Give it a try!
#use skim() on the entire data frame
skim(sci_online_classes)
| Name | sci_online_classes |
| Number of rows | 603 |
| Number of columns | 30 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| logical | 1 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| course_id | 0 | 1 | 12 | 13 | 0 | 26 | 0 |
| subject | 0 | 1 | 4 | 5 | 0 | 5 | 0 |
| semester | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
| section | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
| Gradebook_Item | 0 | 1 | 9 | 35 | 0 | 3 | 0 |
| Gender | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| Grade_Category | 603 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| student_id | 0 | 1.00 | 86069.54 | 10548.60 | 43146.00 | 85612.50 | 88340.00 | 92730.50 | 97441.00 | ▁▁▁▃▇ |
| total_points_possible | 0 | 1.00 | 4274.41 | 2312.74 | 840.00 | 2809.50 | 3583.00 | 5069.00 | 15552.00 | ▇▅▂▁▁ |
| total_points_earned | 0 | 1.00 | 3244.69 | 1832.00 | 651.00 | 2050.50 | 2757.00 | 3875.00 | 12208.00 | ▇▅▁▁▁ |
| percentage_earned | 0 | 1.00 | 0.76 | 0.09 | 0.34 | 0.70 | 0.78 | 0.83 | 0.91 | ▁▁▃▇▇ |
| FinalGradeCEMS | 30 | 0.95 | 77.20 | 22.23 | 0.00 | 71.25 | 84.57 | 92.10 | 100.00 | ▁▁▁▃▇ |
| Points_Possible | 0 | 1.00 | 76.87 | 167.51 | 5.00 | 10.00 | 10.00 | 30.00 | 935.00 | ▇▁▁▁▁ |
| Points_Earned | 92 | 0.85 | 68.63 | 145.26 | 0.00 | 7.00 | 10.00 | 26.12 | 828.20 | ▇▁▁▁▁ |
| q1 | 123 | 0.80 | 4.30 | 0.68 | 1.00 | 4.00 | 4.00 | 5.00 | 5.00 | ▁▁▂▇▇ |
| q2 | 126 | 0.79 | 3.63 | 0.93 | 1.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▁▂▆▇▃ |
| q3 | 123 | 0.80 | 3.33 | 0.91 | 1.00 | 3.00 | 3.00 | 4.00 | 5.00 | ▁▃▇▅▂ |
| q4 | 125 | 0.79 | 4.27 | 0.85 | 1.00 | 4.00 | 4.00 | 5.00 | 5.00 | ▁▁▂▇▇ |
| q5 | 127 | 0.79 | 4.19 | 0.68 | 2.00 | 4.00 | 4.00 | 5.00 | 5.00 | ▁▂▁▇▅ |
| q6 | 127 | 0.79 | 4.01 | 0.80 | 1.00 | 4.00 | 4.00 | 5.00 | 5.00 | ▁▁▃▇▅ |
| q7 | 129 | 0.79 | 3.91 | 0.82 | 1.00 | 3.00 | 4.00 | 4.75 | 5.00 | ▁▁▅▇▅ |
| q8 | 129 | 0.79 | 4.29 | 0.68 | 1.00 | 4.00 | 4.00 | 5.00 | 5.00 | ▁▁▂▇▆ |
| q9 | 129 | 0.79 | 3.49 | 0.98 | 1.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▁▃▇▇▃ |
| q10 | 129 | 0.79 | 4.10 | 0.93 | 1.00 | 4.00 | 4.00 | 5.00 | 5.00 | ▁▂▃▇▇ |
| TimeSpent | 5 | 0.99 | 1799.75 | 1354.93 | 0.45 | 851.90 | 1550.91 | 2426.09 | 8870.88 | ▇▅▁▁▁ |
| TimeSpent_hours | 5 | 0.99 | 30.00 | 22.58 | 0.01 | 14.20 | 25.85 | 40.43 | 147.85 | ▇▅▁▁▁ |
| TimeSpent_std | 5 | 0.99 | 0.00 | 1.00 | -1.33 | -0.70 | -0.18 | 0.46 | 5.22 | ▇▅▁▁▁ |
| int | 76 | 0.87 | 4.22 | 0.59 | 2.00 | 3.90 | 4.20 | 4.70 | 5.00 | ▁▁▃▇▇ |
| pc | 75 | 0.88 | 3.61 | 0.64 | 1.50 | 3.00 | 3.50 | 4.00 | 5.00 | ▁▁▇▅▂ |
| uv | 75 | 0.88 | 3.72 | 0.70 | 1.00 | 3.33 | 3.67 | 4.17 | 5.00 | ▁▁▆▇▅ |
When the complete “sci_online_classes” object is submitted to the skim function, an extensive summary is generated, encompassing statistical measures for all variables, categorised by data type. While this offers a comprehensive overview of the dataset, the resulting output may be overwhelming. Consequently, it is often more advantageous to summarise a subset of pertinent variables rather than the complete dataset.
Data Visualization
Data visualization is an extremely common practice in Learning Analytics, especially in the use of data dashboards. Data visualization involves graphically representing one or more variables with the goal of discovering patterns in data. These patterns may help us to answer research questions or generate new questions about our data, to discover relationships between and among variables, and to create or select features for data modeling.
In this section we’ll focus on using a basic code template for the {ggplot2} package from the tidyverse. ggplot2 is a system for declaratively creating graphics, based on the grammar of graphics [@Wickham]. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical elements to use, and it takes care of the details.
The Graphing Workflow
At its core, {ggplot2} lets you create some very simple but attractive graphs with just a couple of lines of code, following a common workflow for making graphs. To make a graph, you simply:
- Start the graph with ggplot() and include your data as an argument;
- “Add” elements to the graph using the + operator and a geom_() function;
- Select variables to graph on each axis with the aes() argument.
Let’s give it a try by creating a simple histogram of our FinalGradeCEMS variable. The code below creates a histogram, or a distribution of the values, in this case for students’ final grades.
ggplot(sci_online_classes) +
  geom_histogram(aes(x = FinalGradeCEMS))
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_bin()`).
We won’t spend a lot of time on it in this case study, but you can also add a wide range of aesthetic arguments to each geom, like changing the color of the histogram bars by adding an argument to specify color. Let’s give that a try using the fill = argument:
ggplot(sci_online_classes) +
  geom_histogram(aes(x = FinalGradeCEMS), fill = "steelblue", color = "black", bins = 30)
Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_bin()`).
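The message and warning printed with the histograms above can both be addressed through arguments to geom_histogram(). The sketch below is one possible approach on made-up data (toy_grades and its values are hypothetical): binwidth sets the width of each bar in the units of the variable, and na.rm = TRUE removes missing values silently instead of with a warning.

```r
library(ggplot2)

# Made-up grades for illustration only (not the course data).
toy_grades <- data.frame(
  FinalGradeCEMS = c(55, 62, 71, 74, 78, 83, 85, 88, 91, 97, NA)
)

p <- ggplot(toy_grades) +
  geom_histogram(
    aes(x = FinalGradeCEMS),
    binwidth = 5,        # each bar covers 5 grade points
    na.rm = TRUE,        # drop missing values without a warning
    fill = "steelblue",
    color = "black"
  )

print(p)
```

A bin width of 5 is only an example; a sensible value depends on the scale of the variable, so minutes in TimeSpent would call for a much larger number than grade points.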
Your Turn:
Now use the code chunk below to visualize the distribution of another variable in the data, specifically TimeSpent. Also, change the color to one of your choosing; consider this list of valid color names here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
#create a histogram of TimeSpent using a different color
ggplot(sci_online_classes) +
  geom_histogram(aes(x = TimeSpent), fill = "darkorange", color = "black", bins = 30)
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
Scatterplots
Let’s create a scatter plot for the relationship between these two variables. Scatterplots use the point geom, i.e., the geom_point() function, and are most useful for displaying the relationship between two continuous variables.
Your Turn:
Complete the code chunk below to create a simple scatterplot with TimeSpent on the x axis and FinalGradeCEMS on the y axis.
Visualizing TimeSpent and FinalGradeCEMS
ggplot(sci_online_classes) +
  geom_point(aes(x = TimeSpent, y = FinalGradeCEMS), alpha = 0.4) +
  geom_smooth(
    aes(x = TimeSpent, y = FinalGradeCEMS),
    method = "lm",
    color = "blue"
  ) +
  labs(
    x = "Time Spent",
    y = "Final Grade (CEMS)",
    title = "Relationship Between Time Spent and Final Grade"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 30 rows containing missing values or values outside the scale range
(`geom_point()`).
What do you think about the relationship between TimeSpent and FinalGradeCEMS?
- A weak positive relationship was identified between TimeSpent and FinalGradeCEMS. Students who dedicated more time to the course generally achieved higher final grades; however, considerable variability was observed in these outcomes. This suggests that while time investment is associated with performance, it is not the sole determinant of final grades.
4. Model
“Model” is one of those terms that has many different meanings. For our purpose, we refer to the process of simplifying and summarizing our data. Thus, models can take many forms; calculating means represents a legitimate form of modeling data, as does estimating more complex models, including linear regressions, and models and algorithms associated with machine learning tasks. For now, we’ll run a base linear regression model to further examine the relationship between TimeSpent and FinalGradeCEMS.
We’ll dive much deeper into modeling in subsequent learning labs, but for now let’s see if there is a statistically significant relationship between students’ final grades, FinalGradeCEMS, and the TimeSpent on the course:
model_1 <- lm(FinalGradeCEMS ~ TimeSpent, data = sci_online_classes)
summary(model_1)
Call:
lm(formula = FinalGradeCEMS ~ TimeSpent, data = sci_online_classes)
Residuals:
Min 1Q Median 3Q Max
-67.136 -7.805 4.723 14.471 30.317
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.581e+01 1.491e+00 44.13 <2e-16 ***
TimeSpent 6.081e-03 6.482e-04 9.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.71 on 571 degrees of freedom
(30 observations deleted due to missingness)
Multiple R-squared: 0.1335, Adjusted R-squared: 0.132
F-statistic: 87.99 on 1 and 571 DF, p-value: < 2.2e-16
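Because TimeSpent is recorded in minutes, the slope in the output above is the expected change in final grade per additional minute, which is hard to read at a glance. The sketch below rescales it to hours and computes one fitted value. The coefficients are copied from the summary(model_1) output; the choice of 25 hours is only an illustrative input, and the variable names are hypothetical.

```r
# Coefficients copied from the summary(model_1) output above.
intercept        <- 65.81      # 6.581e+01
slope_per_minute <- 0.006081   # 6.081e-03, grade points per minute

# Rescale the slope from "per minute" to "per hour".
slope_per_hour <- slope_per_minute * 60

# Fitted final grade for a student who spent 25 hours (1500 minutes).
predicted_25h <- intercept + slope_per_minute * (25 * 60)

print(slope_per_hour)   # about 0.36 grade points per additional hour
print(predicted_25h)    # about 74.9
```

The multiple R-squared of 0.1335 means that time spent accounts for roughly 13% of the variance in final grades, so individual fitted values like this one come with a lot of uncertainty.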
Your Turn:
Now let’s “add” another variable to the regression model. Specifically, use the + operator after TimeSpent to add the course subject variable, or another variable of your choosing, as a predictor of students’ final grade.
model_2 <- lm(FinalGradeCEMS ~ TimeSpent + subject, data = sci_online_classes)
summary(model_2)
Call:
lm(formula = FinalGradeCEMS ~ TimeSpent + subject, data = sci_online_classes)
Residuals:
Min 1Q Median 3Q Max
-70.378 -8.836 4.816 12.855 36.047
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.3931739 2.3382193 24.546 < 2e-16 ***
TimeSpent 0.0071098 0.0006516 10.912 < 2e-16 ***
subjectBioA -1.5596482 3.6053075 -0.433 0.665
subjectFrScA 11.7306546 2.2143847 5.297 1.68e-07 ***
subjectOcnA 1.0974545 2.5771474 0.426 0.670
subjectPhysA 16.0357213 3.0712923 5.221 2.50e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.8 on 567 degrees of freedom
(30 observations deleted due to missingness)
Multiple R-squared: 0.213, Adjusted R-squared: 0.2061
F-statistic: 30.69 on 5 and 567 DF, p-value: < 2.2e-16
What do you notice about the results? Add a comment or two below:
- With course subject included as an additional regressor, the estimated coefficient for TimeSpent remains statistically significant and positive, indicating a continued positive association with FinalGradeCEMS. Certain subject categories (e.g., FrScA and PhysA) are associated with significantly higher final grades when compared to the reference category, as evidenced by their estimated coefficients. The model accounts for approximately 21% of the variance in final grades, suggesting that while TimeSpent and subject categories are significant predictors, much of the variability in student performance remains unexplained by the current model specification, pointing to the influence of other unobserved or unmodelled factors.
5. Communicate
The final step in the workflow is sharing the results of your analysis with a wider audience. Krumm et al. [@krumm2018] have outlined the following 3-step process for communicating findings from an analysis to education stakeholders:
Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question and might be used to inform new analyses or a “change idea” for improving student learning.
Render File
For your course project, you will have an opportunity to create a simple “data product” designed to illustrate some insights gained from your analysis and ideally highlight an action step or change idea that can be used to improve learning or the contexts in which learning occurs. For now, we will wrap up by converting our work into a webpage that can be used to communicate your learning and demonstrate some of your new R skills. To do so, you will need to “render” your document by clicking the Render button in the menu bar at the top of this file. This will do two things; it will:
- check through all your code for any errors; and,
- create a file in your directory that you can use to share your work through Posit Cloud, RPubs, GitHub Pages, Quarto Pub, or other methods.
Now that you’ve finished your first Rtutorial study, scroll back to the very top of this Quarto Document and change the author: “YOUR NAME HERE” to your actual name surrounded by quotation marks like so: author: “Dr. Cansu Tatar”.
Acknowledgement:
Special thanks to Dr. Shaun Kellogg for his support and guidance to create these course materials.