Rtutorial

Author

Dr. Cansu Tatar

Week 3: Introduction to R Toolkit

This activity is designed to warm up your understanding of the Learning Analytics (LA) workflow. Through this exercise, we will practice the basics of how RStudio works. In all of our practice activities, we will follow the LA workflow.

Prepare: We will import our data file into R in the “Prepare” section.
Wrangle: We will prepare and wrangle our data in the “Wrangle” section.
Explore: We will check the patterns in the data through visualization.
Model: We will run a regression model in the “Model” section.
Communicate: We will create a reproducible report of our work that you can share with others in the “Communicate” section.

How to follow this document?

In the R tutorial toolkit and case study week, we will be using a Quarto markdown file, which has the extension “.qmd”. Like R Markdown documents, Quarto documents are fully reproducible and support dozens of output formats, such as PDFs, Word files, presentations, and more. As you will see, we are viewing this document in visual mode to get an overview of its main parts.

There are two keys to your use of Quarto for this activity:

First, you can see the document in the “Visual Editor” mode. You can use this mode by clicking the word “Visual” on the left side of the toolbar above. The visual editor allows you to view formatted headers, text and code chunks as specified by the underlying markdown syntax, or “Source” text. Visual mode is a bit more “human readable” than syntax but definitely take a look at the source text.
Second, note the specially formatted text box below called a “code chunk.” These chunks allow you to run code from multiple languages, including R, Python, and SQL. This specific code chunk contains a line of R code. If you are curious about other chunk options, you can visit this link to learn more.

knitr::include_graphics("Img/research-workflow.png")

The Data-Intensive Research Workflow

Last week in class, we talked about the data-intensive research workflow. As we have mentioned a couple of times, we will follow this workflow to present our research approach.

Let’s get started.

1. Prepare

As a first step in the data-intensive research workflow, we will need to define our research question(s) and need to understand where the data comes from (Krumm, 2018).

For this activity, we will work with data from an unpublished research study, which used a number of different data sources to understand high school students’ motivation within the context of online courses.

Our research question is:

“Is there a relationship between the time students spend on a course and their final course grade?”

For our analysis, we will need certain packages. One of the most commonly used is the “tidyverse” package. It is actually a collection of R packages designed for wrangling and exploring data, all of which share an underlying design philosophy, grammar, and data structures.

knitr::include_graphics("Img/tidyverse.png")

Through this class, we will have a chance to try different libraries in the tidyverse package. Let’s install our tidyverse package.

We installed the package. What do we need to do now to use it?

We will also use another package called “skimr”. This is a handy package that provides summary statistics that you can skim quickly to understand your data. We’ll be using this later in the Explore section.

#install the {skimr} package below.

install.packages("skimr")

#load the {skimr} package below.
library(skimr)

library(readr)

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.4
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

sci_online_classes <- read_csv("sci-online-classes.csv")

Rows: 603 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): course_id, subject, semester, section, Gradebook_Item, Gender
dbl (23): student_id, total_points_possible, total_points_earned, percentage...
lgl  (1): Grade_Category

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(sci_online_classes)

Your Turn:

Why do you think we included data/ before our sci-online-classes.csv file? Why quotation marks?

Add your responses after the dashes below:

Quotation marks are used to denote that the file path (“data/sci-online-classes.csv”) must be a text string in R. Functions like read_csv() require the file name and location to be specified as text so R can correctly identify and open the file. Without quotation marks, R would treat the file name as an object or variable instead of a file path, which would cause an error.

Hint: check the files panel.
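To see how a quoted path behaves in practice, here is a minimal sketch that writes a tiny CSV into a temporary folder and reads it back. The folder layout, the file name demo.csv, and the values are made up for illustration; only read_csv() and the idea of a data/ subfolder come from the exercise above.

```r
library(readr)

# Hypothetical project folder with a data/ subfolder (illustration only).
project_dir <- file.path(tempdir(), "demo-project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Made-up rows, written to data/demo.csv inside that folder.
demo_path <- file.path(project_dir, "data", "demo.csv")
write_csv(data.frame(student_id = c(1, 2), grade = c(90, 75)), demo_path)

# The path is a character string, so it is either quoted or stored in a variable.
demo <- read_csv(demo_path, show_col_types = FALSE)
demo
```

The folder name before the file name tells R where to look relative to the working directory, which is why the files panel is a useful place to check.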

Viewing or inspecting data

Let’s quickly check our data.

#Print the data object to inspect it
sci_online_classes

# A tibble: 603 × 30
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      43146 FrScA-S216-02                  3280                2220
 2      44638 OcnA-S116-01                   3531                2672
 3      47448 FrScA-S216-01                  2870                1897
 4      47979 OcnA-S216-01                   4562                3090
 5      48797 PhysA-S116-01                  2207                1910
 6      51943 FrScA-S216-03                  4208                3596
 7      52326 AnPhA-S216-01                  4325                2255
 8      52446 PhysA-S116-01                  2086                1719
 9      53447 FrScA-S116-01                  4655                3149
10      53475 FrScA-S116-02                  1710                1402
# ℹ 593 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

View(sci_online_classes)

Your Turn:

What do you notice about this dataset? What do you wonder? Add one or two thoughts after the dash below:

This response critically examines the quality of the datasets.
- Observation: Several variables contain missing values, particularly within the survey question columns. This could potentially impact analysis and necessitates data cleaning.
- Inquiry: It would be beneficial to investigate why certain students lack survey responses and whether this pattern varies by course or semester.

There are other ways to inspect your data; the glimpse() function provides one such way. Let’s take a glimpse at our data.

glimpse(sci_online_classes)

Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Rows typically represent “cases”, the units that we measure, or the units on which we collect data. What counts as a “case” (and therefore what is represented as a row) varies by (and within) fields. There may be multiple types or levels of units studied in your field; listing more than one is fine! Also, please consider what columns (which usually represent variables) represent in your area of work and/or research.
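As a small illustration of rows as cases and columns as variables, the sketch below builds a made-up data frame (the values are not from the course data) and asks base R for its dimensions.

```r
# Made-up data: each row is one student (a "case"), each column is a variable.
toy <- data.frame(
  student_id = c(101, 102, 103),
  subject    = c("BioA", "OcnA", "BioA"),
  grade      = c(88.5, 72.0, NA)
)

nrow(toy)   # number of cases (rows)
ncol(toy)   # number of variables (columns)
dim(toy)    # rows and columns together
names(toy)  # the variable names
```

The same functions work on sci_online_classes, so they offer another way to answer the questions that follow.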

Your Turn:

How many “cases” or observations are in this dataset?

There are 603 cases (observations) in the dataset.

Pick two columns (or more) and write what you think it represents:

percentage_earned: This column denotes the proportion of total course points attained by a student, computed as the ratio of points accumulated to points available. This metric serves as an indicator of the student’s overall academic achievement within the course.
TimeSpent_hours: This column quantifies the total duration (in hours) a student dedicated to interacting with the online course material. This variable is hypothesised to serve as a proxy for student effort and engagement.

2. Wrangle

By wrangle, we refer to the process of cleaning and processing data, and, in some cases, merging (or joining) data from multiple sources. Often, this part of the process is very time-intensive! Wrangling your data into shape can itself be an important accomplishment! And documenting your code using R scripts or Markdown files will save you and others a great deal of time wrangling data in the future! There are great tools in R for data wrangling, especially the {dplyr} R package, which is part of the tidyverse.

Selecting Variables

Remember our research question: what were we interested in finding out about this data?

Let’s practice selecting a few variables by introducing the pipe operator, |>. Pipes are a powerful tool for combining a sequence of functions or processes.

Run the following code chunk to “pipe” our sci_online_classes data to the select() function and include the following two variables as arguments:

FinalGradeCEMS (i.e., students’ final grades on a 0-100 point scale)
TimeSpent (i.e., the number of minutes they spent in the course’s learning management system)
```
#library(dplyr)
```

sci_online_classes|>
  select(FinalGradeCEMS, TimeSpent)

# A tibble: 603 × 2
   FinalGradeCEMS TimeSpent
            <dbl>     <dbl>
 1           93.5   1555.  
 2           81.7   1383.  
 3           88.5    860.  
 4           81.9   1599.  
 5           84     1482.  
 6           NA        3.45
 7           83.6   1322.  
 8           97.8   1390.  
 9           96.1   1479.  
10           NA       NA   
# ℹ 593 more rows

The code chunk produced error

Notice how the number of columns (variables) is now different! Let’s check our data with the View() function this time.

I selected percentage_earned and TimeSpent_hours.

sci_online_classes |>
  select(percentage_earned, TimeSpent_hours)

# A tibble: 603 × 2
   percentage_earned TimeSpent_hours
               <dbl>           <dbl>
 1             0.677         25.9   
 2             0.757         23.0   
 3             0.661         14.3   
 4             0.677         26.6   
 5             0.865         24.7   
 6             0.855          0.0575
 7             0.521         22.0   
 8             0.824         23.2   
 9             0.676         24.7   
10             0.820         NA     
# ℹ 593 more rows

A quick footnote about pipes: The original pipe operator, %>%, comes from the {magrittr} package, but all packages in the tidyverse load %>% for you automatically, so you don’t usually load magrittr explicitly. The pipe has become such a useful and widely used operator that it is now built into R as the newer and simpler native pipe operator, |>. You can use the two fairly interchangeably, but there are a few differences between them.
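The sketch below compares an ordinary function call with both pipe operators on a made-up tibble (the column names and values are for illustration only). All three forms should produce the same result.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(grade = c(90, 65, 80), minutes = c(1200, 300, 900))

nested   <- select(toy, grade)     # ordinary function call
magrittr <- toy %>% select(grade)  # {magrittr} pipe, loaded with dplyr
native   <- toy |> select(grade)   # native pipe, available from R 4.1.0

identical(nested, native)
identical(magrittr, native)
```

One of the differences: the {magrittr} pipe uses `.` as a placeholder for the piped value, while the native pipe uses `_` (from R 4.2.0) and requires it to be passed to a named argument.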

Filtering Variables

Let’s explore filtering. We will filter our data to view only the rows associated with students who earned a final grade (as a percentage) above 70%.

sci_online_classes|>
  filter(FinalGradeCEMS>70)

# A tibble: 438 × 30
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      43146 FrScA-S216-02                  3280                2220
 2      44638 OcnA-S116-01                   3531                2672
 3      47448 FrScA-S216-01                  2870                1897
 4      47979 OcnA-S216-01                   4562                3090
 5      48797 PhysA-S116-01                  2207                1910
 6      52326 AnPhA-S216-01                  4325                2255
 7      52446 PhysA-S116-01                  2086                1719
 8      53447 FrScA-S116-01                  4655                3149
 9      53475 FrScA-S216-01                  1209                 977
10      54066 OcnA-S116-01                   4641                3429
# ℹ 428 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Your Turn:

In the next code chunk, change the cut-off from 70% to some other value, larger or smaller (maybe much larger or smaller). Feel free to play around with the code a bit!

How many students had a final grade higher than 85%?

sci_online_classes|>
  filter(FinalGradeCEMS>85)

# A tibble: 279 × 30
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      43146 FrScA-S216-02                  3280                2220
 2      47448 FrScA-S216-01                  2870                1897
 3      52446 PhysA-S116-01                  2086                1719
 4      53447 FrScA-S116-01                  4655                3149
 5      54066 OcnA-S116-01                   4641                3429
 6      54282 OcnA-S116-02                   3581                2777
 7      54434 PhysA-S116-01                  3228                2506
 8      55078 FrScA-S216-01                  7000                4212
 9      56152 AnPhA-S116-02                  3323                2468
10      57224 FrScA-S116-03                  4546                3772
# ℹ 269 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

What happens when you change the cut-off from 70 to something else?

Increasing the cut-off from 70 to 85 leads to a reduction in the number of observations. A more stringent threshold excludes a greater proportion of students, thereby retaining only those with superior final grades and consequently diminishing the size of the resulting subset.
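Two details of filter() are easy to miss: > excludes a value exactly at the cut-off while >= keeps it, and rows where the condition is NA are dropped. The following sketch shows both on made-up grades (not the course data).

```r
library(dplyr)

# Made-up grades, including one exactly at the cut-off and one missing value.
toy <- tibble(student = c("a", "b", "c", "d"), grade = c(70, 85, 60, NA))

above_70    <- toy |> filter(grade > 70)   # strictly greater: the 70 is dropped
at_least_70 <- toy |> filter(grade >= 70)  # the 70 is kept

nrow(above_70)     # 1
nrow(at_least_70)  # 2
```

In both results the student with a missing grade is absent, because a comparison with NA is neither TRUE nor FALSE.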

Arrange

The last function we’ll use for preparing tables is arrange(). We’ll again use the pipe operator to combine it with a function we have used already, select(). We do this so we can view only time spent and final grades.

sci_online_classes|>
  select(FinalGradeCEMS, TimeSpent)|>
  arrange(FinalGradeCEMS)

# A tibble: 603 × 2
   FinalGradeCEMS TimeSpent
            <dbl>     <dbl>
 1          0          13.9
 2          0.535     306. 
 3          0.903      88.5
 4          1.80       44.7
 5          2.93       57.7
 6          3.01      571. 
 7          3.06        0.7
 8          3.43      245. 
 9          5.04      202. 
10          5.2        11.0
# ℹ 593 more rows

Note that arrange() sorts values in ascending order (from lowest to highest) by default; you can change this by wrapping the variable in the desc() function inside arrange().

#let's change the order from asc to desc
sci_online_classes|>
  arrange(desc(FinalGradeCEMS))

# A tibble: 603 × 30
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      85650 FrScA-S116-01                  8206                4432
 2      91067 BioA-S116-01                   2672                2249
 3      66740 OcnA-S116-01                   4171                3639
 4      86792 FrScA-S116-01                  2316                1927
 5      78153 PhysA-S216-01                  6530                3702
 6      66689 FrScA-S216-01                  3390                2738
 7      88261 FrScA-S116-01                  2419                1624
 8      92740 PhysA-S116-01                  3347                2308
 9      92726 PhysA-S116-01                  2739                2356
10      92741 PhysA-S116-01                  3070                2163
# ℹ 593 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Your Turn:

In the next code chunk, replace FinalGradeCEMS that is used with both the select() and arrange() functions with a different variable in the dataset.

#Arrange in Ascending order
sci_online_classes |>
  select(percentage_earned, FinalGradeCEMS) |>
  arrange(percentage_earned)

# A tibble: 603 × 2
   percentage_earned FinalGradeCEMS
               <dbl>          <dbl>
 1             0.338           NA  
 2             0.466           81.7
 3             0.496           93.5
 4             0.498           94.2
 5             0.499           89.6
 6             0.503           71.5
 7             0.515           91.3
 8             0.516           73.5
 9             0.516           73.8
10             0.521           92.9
# ℹ 593 more rows

#Arrange in descending order
sci_online_classes |>
  select(percentage_earned, FinalGradeCEMS) |>
  arrange(desc(percentage_earned))

# A tibble: 603 × 2
   percentage_earned FinalGradeCEMS
               <dbl>          <dbl>
 1             0.911           96.0
 2             0.908           87.4
 3             0.907           92.9
 4             0.904           72.9
 5             0.901           92.9
 6             0.901           94.2
 7             0.899           94.6
 8             0.897           87.1
 9             0.897           64.8
10             0.896           82.2
# ℹ 593 more rows
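The three wrangling verbs from this section can also be chained in a single pipeline. Here is a minimal sketch on a made-up tibble; the column names and values are assumptions for illustration, not the course data.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  student = c("a", "b", "c", "d"),
  grade   = c(92, 68, 81, 75),
  minutes = c(1500, 400, 2100, 900)
)

result <- toy |>
  select(student, grade, minutes) |>  # keep only the columns of interest
  filter(grade > 70) |>               # keep rows above the cut-off
  arrange(desc(minutes))              # sort from most to least time spent

result
```

Each step receives the output of the previous one, so the order matters: a column removed by select() is no longer available to filter() or arrange().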

3. Explore

Exploratory data analysis, or exploring your data, involves processes of describing your data (such as by calculating the means and standard deviations of numeric variables, or counting the frequency of categorical variables) and, often, visualizing your data. As we’ll learn later in this class, the explore phase can also involve the process of “feature engineering,” or creating new variables within a dataset [@krumm2018]. In this section, we’ll quickly pull together some basic stats using a handy function from the {skimr} package, and introduce you to a basic data visualization “code template” for the {ggplot2} package from the tidyverse.
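As a rough sketch of the descriptive steps just mentioned, the code below computes a mean and standard deviation for a numeric variable and counts the frequency of a categorical one with {dplyr}. The tibble is made up for illustration.

```r
library(dplyr)

# Made-up data for illustration.
toy <- tibble(
  subject = c("BioA", "BioA", "OcnA", "OcnA", "OcnA"),
  grade   = c(80, 90, 70, NA, 100)
)

# Mean and standard deviation; na.rm = TRUE ignores the missing grade.
stats <- toy |>
  summarise(
    mean_grade = mean(grade, na.rm = TRUE),
    sd_grade   = sd(grade, na.rm = TRUE)
  )
stats

# Frequency of each category.
counts <- toy |> count(subject)
counts
```

Without na.rm = TRUE, a single missing value would make both the mean and the standard deviation NA.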

Summary Statistics

Let’s repurpose what we learned from our wrangle section to select just a few variables and quickly gather some descriptive stats using the skim() function from the {skimr} package.

sci_online_classes |>
  select(FinalGradeCEMS, TimeSpent) |>
  skim()

Data summary
Name	select(sci_online_classes…
Number of rows	603
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
FinalGradeCEMS	30	0.95	77.20	22.23	0.00	71.25	84.57	92.10	100.00	▁▁▁▃▇
TimeSpent	5	0.99	1799.75	1354.93	0.45	851.90	1550.91	2426.09	8870.88	▇▅▁▁▁

Your Turn:

Copy the code from the chunk above and use it as a template to explore some other variables of interest from our sci_online_classes data.

Variables

FinalGradeCEMS
percentage_earned

#use skim() to summarize other variables of your choosing.
sci_online_classes |>
  select(FinalGradeCEMS, percentage_earned) |>
  skim()

Data summary
Name	select(sci_online_classes…
Number of rows	603
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
FinalGradeCEMS	30	0.95	77.20	22.23	0.00	71.25	84.57	92.10	100.00	▁▁▁▃▇
percentage_earned	0	1.00	0.76	0.09	0.34	0.70	0.78	0.83	0.91	▁▁▃▇▇

What happens if we simply feed the skim() function the entire sci_online_classes object? Give it a try!

#use skim() on the entire data frame
skim(sci_online_classes)

Data summary
Name	sci_online_classes
Number of rows	603
Number of columns	30
_______________________
Column type frequency:
character	6
logical	1
numeric	23
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
course_id	1	12	13	26
subject	1	4	5	5
semester	1	4	4	3
section	1	2	2	4
Gradebook_Item	1	9	35	3
Gender	1	1	1	2

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
Grade_Category	603	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
student_id	0	1.00	86069.54	10548.60	43146.00	85612.50	88340.00	92730.50	97441.00	▁▁▁▃▇
total_points_possible	0	1.00	4274.41	2312.74	840.00	2809.50	3583.00	5069.00	15552.00	▇▅▂▁▁
total_points_earned	0	1.00	3244.69	1832.00	651.00	2050.50	2757.00	3875.00	12208.00	▇▅▁▁▁
percentage_earned	0	1.00	0.76	0.09	0.34	0.70	0.78	0.83	0.91	▁▁▃▇▇
FinalGradeCEMS	30	0.95	77.20	22.23	0.00	71.25	84.57	92.10	100.00	▁▁▁▃▇
Points_Possible	0	1.00	76.87	167.51	5.00	10.00	10.00	30.00	935.00	▇▁▁▁▁
Points_Earned	92	0.85	68.63	145.26	0.00	7.00	10.00	26.12	828.20	▇▁▁▁▁
q1	123	0.80	4.30	0.68	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▇
q2	126	0.79	3.63	0.93	1.00	3.00	4.00	4.00	5.00	▁▂▆▇▃
q3	123	0.80	3.33	0.91	1.00	3.00	3.00	4.00	5.00	▁▃▇▅▂
q4	125	0.79	4.27	0.85	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▇
q5	127	0.79	4.19	0.68	2.00	4.00	4.00	5.00	5.00	▁▂▁▇▅
q6	127	0.79	4.01	0.80	1.00	4.00	4.00	5.00	5.00	▁▁▃▇▅
q7	129	0.79	3.91	0.82	1.00	3.00	4.00	4.75	5.00	▁▁▅▇▅
q8	129	0.79	4.29	0.68	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▆
q9	129	0.79	3.49	0.98	1.00	3.00	4.00	4.00	5.00	▁▃▇▇▃
q10	129	0.79	4.10	0.93	1.00	4.00	4.00	5.00	5.00	▁▂▃▇▇
TimeSpent	5	0.99	1799.75	1354.93	0.45	851.90	1550.91	2426.09	8870.88	▇▅▁▁▁
TimeSpent_hours	5	0.99	30.00	22.58	0.01	14.20	25.85	40.43	147.85	▇▅▁▁▁
TimeSpent_std	5	0.99	0.00	1.00	-1.33	-0.70	-0.18	0.46	5.22	▇▅▁▁▁
int	76	0.87	4.22	0.59	2.00	3.90	4.20	4.70	5.00	▁▁▃▇▇
pc	75	0.88	3.61	0.64	1.50	3.00	3.50	4.00	5.00	▁▁▇▅▂
uv	75	0.88	3.72	0.70	1.00	3.33	3.67	4.17	5.00	▁▁▆▇▅

When the complete “sci_online_classes” object is submitted to the skim function, an extensive summary is generated, encompassing statistical measures for all variables, categorised by data type. While this offers a comprehensive overview of the dataset, the resulting output may be overwhelming. Consequently, it is often more advantageous to summarise a subset of pertinent variables rather than the complete dataset.
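There is more than one way to narrow a skim() summary to a few variables. The sketch below shows two options on a made-up tibble (values are for illustration only): naming columns directly in skim(), or selecting all numeric columns first.

```r
library(dplyr)
library(skimr)

# Made-up data for illustration.
toy <- tibble(
  subject = c("BioA", "OcnA", "BioA", "OcnA"),
  grade   = c(80, NA, 95, 70),
  minutes = c(1200, 300, 900, 2000)
)

# Option 1: name the columns of interest directly in skim().
by_name <- skim(toy, grade, minutes)

# Option 2: keep every numeric column, then skim the result.
numeric_only <- toy |>
  select(where(is.numeric)) |>
  skim()

by_name
```

Both options summarise one row per selected variable, so the output stays short even when the full data frame has many columns.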

Data Visualization

Data visualization is an extremely common practice in Learning Analytics, especially in the use of data dashboards. Data visualization involves graphically representing one or more variables with the goal of discovering patterns in data. These patterns may help us to answer research questions or generate new questions about our data, to discover relationships between and among variables, and to create or select features for data modeling.

In this section we’ll focus on using a basic code template for the {ggplot2} package from the tidyverse. ggplot2 is a system for declaratively creating graphics, based on the grammar of graphics [@Wickham]. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical elements to use, and it takes care of the details.

The Graphing Workflow

At its core, {ggplot2} lets you create some very simple but attractive graphs with just a couple of lines of code, and it follows a common workflow for making graphs. To make a graph, you simply:

Start the graph with ggplot() and include your data as an argument;
“Add” elements to the graph using the + operator and a geom_() function;
Select variables to graph on each axis with the aes() function.
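The three steps above can be sketched as a small template. The data frame and its grade column are made up for illustration; swap in your own data, geom, and variables.

```r
library(ggplot2)

# Made-up data for illustration.
toy <- data.frame(grade = c(55, 62, 70, 71, 78, 80, 84, 88, 91, 95))

p <- ggplot(toy) +                        # 1. start the graph with your data
  geom_histogram(aes(x = grade), bins = 5)  # 2. add a geom; 3. map a variable with aes()

p  # printing the object draws the plot
```

Storing the plot in an object such as p is optional, but it makes it easy to add further layers later with +.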

Let’s give it a try by creating a simple histogram of our FinalGradeCEMS variable. The code below creates a histogram, or a distribution of the values, in this case for students’ final grades.

ggplot(sci_online_classes) +
  geom_histogram(aes(x = FinalGradeCEMS))

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_bin()`).

We won’t spend a lot of time on it in this case study, but you can also add a wide range of aesthetic arguments to each geom, like changing the color of the histogram bars by adding an argument to specify color. Let’s give that a try using the fill = argument:

ggplot(sci_online_classes) +
  geom_histogram(aes(x = FinalGradeCEMS), fill = "steelblue", color = "black",
                 bins = 30)

Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_bin()`).
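The message and warning printed with the histograms can both be addressed with arguments to geom_histogram(). This is a minimal sketch on made-up grades; a bin width of 5 is an arbitrary choice for illustration, not a recommendation for the course data.

```r
library(ggplot2)

# Made-up grades with two missing values, for illustration.
toy <- data.frame(grade = c(55, 62, 70, NA, 78, 80, 84, 88, NA, 95))

p <- ggplot(toy) +
  geom_histogram(
    aes(x = grade),
    binwidth = 5,        # each bar spans 5 grade points instead of using 30 bins
    na.rm = TRUE,        # drop missing values without printing a warning
    fill = "steelblue",
    color = "black"
  )

p
```

Setting na.rm = TRUE only silences the warning; the missing values are removed from the plot either way.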

Your Turn:

Now use the code chunk below to visualize the distribution of another variable in the data, specifically TimeSpent. Also, change the color to one of your choosing; consider this list of valid color names here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

#create a histogram of TimeSpent using a different color
ggplot(sci_online_classes) +
  geom_histogram(aes(x = TimeSpent), fill = "darkorange", color = "black",
                 bins = 30)

Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).

Scatterplots

Let’s create a scatter plot for the relationship between these two variables. Scatterplots use the point geom, i.e., the geom_point() function, and are most useful for displaying the relationship between two continuous variables.

Your Turn:

Complete the code chunk below to create a simple scatterplot with TimeSpent on the x axis and FinalGradeCEMS on the y axis.

Visualizing TimeSpent and FinalGradeCEMS

ggplot(sci_online_classes) +
  geom_point(aes(x = TimeSpent, y = FinalGradeCEMS), alpha = 0.4) +
  geom_smooth(
    aes(x = TimeSpent, y = FinalGradeCEMS),
    method = "lm",
    color = "blue"
  ) +
  labs(
    x = "Time Spent",
    y = "Final Grade (CEMS)",
    title = "Relationship Between Time Spent and Final Grade"
  )

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 30 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 30 rows containing missing values or values outside the scale range
(`geom_point()`).

What do you think about the relationship between TimeSpent and FinalGradeCEMS?

A weak positive relationship was identified between TimeSpent and FinalGradeCEMS. Students who dedicated more time to the course generally achieved higher final grades; however, considerable variability was observed in these outcomes. This suggests that while time investment is associated with performance, it is not the sole determinant of final grades.

4. Model

“Model” is one of those terms that has many different meanings. For our purposes, we use it to refer to the process of simplifying and summarizing our data. Thus, models can take many forms; calculating means represents a legitimate form of modeling data, as does estimating more complex models, including linear regressions, and models and algorithms associated with machine learning tasks. For now, we’ll run a basic linear regression model to further examine the relationship between TimeSpent and FinalGradeCEMS.

We’ll dive much deeper into modeling in subsequent learning labs, but for now let’s see if there is a statistically significant relationship between students’ final grades, FinalGradeCEMS, and the TimeSpent on the course:

model_1 <- lm(FinalGradeCEMS ~ TimeSpent, data = sci_online_classes)
summary(model_1)


Call:
lm(formula = FinalGradeCEMS ~ TimeSpent, data = sci_online_classes)

Residuals:
    Min      1Q  Median      3Q     Max 
-67.136  -7.805   4.723  14.471  30.317 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.581e+01  1.491e+00   44.13   <2e-16 ***
TimeSpent   6.081e-03  6.482e-04    9.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.71 on 571 degrees of freedom
  (30 observations deleted due to missingness)
Multiple R-squared:  0.1335,    Adjusted R-squared:  0.132 
F-statistic: 87.99 on 1 and 571 DF,  p-value: < 2.2e-16
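To make the coefficients above more concrete, the sketch below plugs them into the regression equation by hand. The intercept, slope, and R-squared are copied from the summary output; the 1,000-minute student is a hypothetical example, and TimeSpent is taken to be in minutes as described in the Wrangle section.

```r
# Coefficients copied from the model_1 summary above.
intercept <- 65.81      # 6.581e+01
slope     <- 0.006081   # 6.081e-03 grade points per minute of TimeSpent

# Predicted final grade for a hypothetical student with 1,000 minutes.
predicted_1000 <- intercept + slope * 1000
predicted_1000          # about 71.9

# Expected change in final grade for one additional hour (60 minutes).
slope_per_hour <- slope * 60
slope_per_hour          # about 0.36

# With a single predictor, the square root of R-squared equals the
# absolute value of the correlation between the two variables.
sqrt(0.1335)            # about 0.37
```

So although the relationship is statistically significant, each extra hour is associated with well under half a grade point, which fits the weak positive pattern seen in the scatterplot.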

Your Turn:

Now let’s “add” another variable to the regression model. Specifically, use the + operator after TimeSpent to add the course subject variable, or another variable of your choosing, as a predictor of students’ final grade.

model_2 <- lm(FinalGradeCEMS ~ TimeSpent + subject, data = sci_online_classes)
summary(model_2)


Call:
lm(formula = FinalGradeCEMS ~ TimeSpent + subject, data = sci_online_classes)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.378  -8.836   4.816  12.855  36.047 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  57.3931739  2.3382193  24.546  < 2e-16 ***
TimeSpent     0.0071098  0.0006516  10.912  < 2e-16 ***
subjectBioA  -1.5596482  3.6053075  -0.433    0.665    
subjectFrScA 11.7306546  2.2143847   5.297 1.68e-07 ***
subjectOcnA   1.0974545  2.5771474   0.426    0.670    
subjectPhysA 16.0357213  3.0712923   5.221 2.50e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.8 on 567 degrees of freedom
  (30 observations deleted due to missingness)
Multiple R-squared:  0.213, Adjusted R-squared:  0.2061 
F-statistic: 30.69 on 5 and 567 DF,  p-value: < 2.2e-16

What do you notice about the results? Add a comment or two below:

With course subject included as an additional regressor, the estimated coefficient for TimeSpent remains statistically significant and positive, indicating a continued positive association with FinalGradeCEMS. Certain subject categories (e.g., FrScA and PhysA) are associated with significantly higher final grades when compared to the reference category, as evidenced by their estimated coefficients. The model accounts for approximately 21% of the variance in final grades, suggesting that while TimeSpent and subject categories are significant predictors, much of the variability in student performance remains unexplained by the current model specification, pointing to the influence of other unobserved or unmodelled factors.
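The “reference category” comes from how R codes a categorical predictor: by default, the first level in alphabetical order is left out and the other levels are compared against it. The sketch below assumes the five subject codes are AnPhA, BioA, FrScA, OcnA, and PhysA, which is inferred from the output in this document rather than confirmed; it uses a standalone factor, not the course data.

```r
# Assumed subject codes (inferred from the output above), one value per level.
subject <- factor(c("AnPhA", "BioA", "FrScA", "OcnA", "PhysA"))

# The first level is the default reference category.
levels(subject)[1]

# The design matrix has an intercept plus one column per non-reference level.
default_cols <- colnames(model.matrix(~ subject))
default_cols

# relevel() picks a different reference category, here FrScA.
releveled <- relevel(subject, ref = "FrScA")
releveled_cols <- colnames(model.matrix(~ releveled))
releveled_cols
```

Under that assumption, AnPhA is the category missing from the model_2 coefficient table, and each subject coefficient is a difference from AnPhA at the same TimeSpent.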

5. Communicate

The final step in the workflow is sharing the results of your analysis with a wider audience. Krumm et al. (2018) have outlined the following 3-step process for communicating findings from an analysis to education stakeholders:

Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question and might be used to inform new analyses or a “change idea” for improving student learning.

Render File

For your course project, you will have an opportunity to create a simple “data product” designed to illustrate some insights gained from your analysis and ideally highlight an action step or change idea that can be used to improve learning or the contexts in which learning occurs. For now, we will wrap up by converting our work into a webpage that can be used to communicate your learning and demonstrate some of your new R skills. To do so, you will need to “render” your document by clicking the Render button in the menu bar at the top of this file. This will do two things; it will:

check through all your code for any errors; and,
create a file in your directory that you can use to share your work through Posit Cloud, RPubs, GitHub Pages, Quarto Pub, or other methods.

Now that you’ve finished your first Rtutorial study, scroll back to the very top of this Quarto Document and change the author: “YOUR NAME HERE” to your actual name surrounded by quotation marks like so: author: “Dr. Cansu Tatar”.

Acknowledgement:

Special thanks to Dr. Shaun Kellogg for his support and guidance to create these course materials.