Data Collection & Preparation

Learning Analytics — Data Collection and Preparation (Required)

Author

Jennifer Rutherford

Published

June 3, 2026


Learning objectives

By the end of this file, you will be able to:

  • Load a CSV dataset into R using read_csv()
  • Inspect a data frame using glimpse(), head(), str(), and skim()
  • Handle missing values appropriately — not just with na.omit()
  • Select and filter data using select() and filter()
  • Create and modify variables using mutate() and ifelse()
  • Sort data using arrange()
  • Chain multiple operations using the pipe |>

Dataset: sci-online-classes

This file uses sci-online-classes.csv — a dataset from a real study of science students in online courses. Each row represents one student in one course section.

Data dictionary — know your variables before you code

Variable Type What it means
student_id numeric Unique student identifier
subject character Course subject (e.g., “OcnA”, “PhysA”)
semester character Semester the course was taken
section character Course section identifier
FinalGradeCEMS numeric Final course grade (0–100) — key outcome variable
TimeSpent_hours numeric Total hours spent on the LMS
percentage_earned numeric Percentage of points earned
Gender character Student gender (“M” or “F”)
total_points_possible numeric Total points available in the course
Grade_Category character Pass/fail category
Note

Knowing your variables before writing any code prevents the most common beginner mistake: writing code that references a column that does not exist or is the wrong type.


Part 1 · Setup and loading data

# Load the dataset from the data folder
data <- read_csv("data/sci-online-classes.csv")

# Always inspect immediately after loading
glimpse(data)
Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
# How many rows and columns?
nrow(data)
[1] 603
ncol(data)
[1] 30
# First 6 rows
head(data)

Question: What do you notice from the first look at this dataset? Share one thing that stands out — a variable type that surprised you, an unexpected value range, or anything else.

  • [ On the gradebook column it has one attempted and the others say pointed earned and total course points.On the POINTS EARNED & TOTAL COURSE POINTS one said 438 and points earned 399. These seem are far higher than the other points. On time spend there is one that is only 3 hours for the entire semester and the others are alot more.]

Part 2 · Exploring your data

Summary statistics

# summary() gives min, max, median, mean for numeric columns
summary(data)
   student_id     course_id         total_points_possible total_points_earned
 Min.   :43146   Length:603         Min.   :  840         Min.   :  651      
 1st Qu.:85612   Class :character   1st Qu.: 2810         1st Qu.: 2050      
 Median :88340   Mode  :character   Median : 3583         Median : 2757      
 Mean   :86070                      Mean   : 4274         Mean   : 3245      
 3rd Qu.:92730                      3rd Qu.: 5069         3rd Qu.: 3875      
 Max.   :97441                      Max.   :15552         Max.   :12208      
                                                                             
 percentage_earned   subject            semester           section         
 Min.   :0.3384    Length:603         Length:603         Length:603        
 1st Qu.:0.7047    Class :character   Class :character   Class :character  
 Median :0.7770    Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.7577                                                            
 3rd Qu.:0.8262                                                            
 Max.   :0.9106                                                            
                                                                           
 Gradebook_Item     Grade_Category FinalGradeCEMS   Points_Possible 
 Length:603         Mode:logical   Min.   :  0.00   Min.   :  5.00  
 Class :character   NA's:603       1st Qu.: 71.25   1st Qu.: 10.00  
 Mode  :character                  Median : 84.57   Median : 10.00  
                                   Mean   : 77.20   Mean   : 76.87  
                                   3rd Qu.: 92.10   3rd Qu.: 30.00  
                                   Max.   :100.00   Max.   :935.00  
                                   NA's   :30                       
 Points_Earned       Gender                q1              q2       
 Min.   :  0.00   Length:603         Min.   :1.000   Min.   :1.000  
 1st Qu.:  7.00   Class :character   1st Qu.:4.000   1st Qu.:3.000  
 Median : 10.00   Mode  :character   Median :4.000   Median :4.000  
 Mean   : 68.63                      Mean   :4.296   Mean   :3.629  
 3rd Qu.: 26.12                      3rd Qu.:5.000   3rd Qu.:4.000  
 Max.   :828.20                      Max.   :5.000   Max.   :5.000  
 NA's   :92                          NA's   :123     NA's   :126    
       q3              q4              q5              q6       
 Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
 Median :3.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.327   Mean   :4.268   Mean   :4.191   Mean   :4.008  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :123     NA's   :125     NA's   :127     NA's   :127    
       q7              q8              q9             q10       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000  
 Median :4.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.907   Mean   :4.289   Mean   :3.487   Mean   :4.101  
 3rd Qu.:4.750   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :129     NA's   :129     NA's   :129     NA's   :129    
   TimeSpent       TimeSpent_hours    TimeSpent_std          int       
 Min.   :   0.45   Min.   :  0.0075   Min.   :-1.3280   Min.   :2.000  
 1st Qu.: 851.90   1st Qu.: 14.1983   1st Qu.:-0.6996   1st Qu.:3.900  
 Median :1550.91   Median : 25.8485   Median :-0.1837   Median :4.200  
 Mean   :1799.75   Mean   : 29.9959   Mean   : 0.0000   Mean   :4.219  
 3rd Qu.:2426.09   3rd Qu.: 40.4348   3rd Qu.: 0.4623   3rd Qu.:4.700  
 Max.   :8870.88   Max.   :147.8481   Max.   : 5.2188   Max.   :5.000  
 NA's   :5         NA's   :5          NA's   :5         NA's   :76     
       pc              uv       
 Min.   :1.500   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:3.333  
 Median :3.500   Median :3.667  
 Mean   :3.608   Mean   :3.719  
 3rd Qu.:4.000   3rd Qu.:4.167  
 Max.   :5.000   Max.   :5.000  
 NA's   :75      NA's   :75     
# skim() gives a richer view including missing counts and distributions
skim(data)
Data summary
Name data
Number of rows 603
Number of columns 30
_______________________
Column type frequency:
character 6
logical 1
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
course_id 0 1 12 13 0 26 0
subject 0 1 4 5 0 5 0
semester 0 1 4 4 0 3 0
section 0 1 2 2 0 4 0
Gradebook_Item 0 1 9 35 0 3 0
Gender 0 1 1 1 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Grade_Category 603 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
student_id 0 1.00 86069.54 10548.60 43146.00 85612.50 88340.00 92730.50 97441.00 ▁▁▁▃▇
total_points_possible 0 1.00 4274.41 2312.74 840.00 2809.50 3583.00 5069.00 15552.00 ▇▅▂▁▁
total_points_earned 0 1.00 3244.69 1832.00 651.00 2050.50 2757.00 3875.00 12208.00 ▇▅▁▁▁
percentage_earned 0 1.00 0.76 0.09 0.34 0.70 0.78 0.83 0.91 ▁▁▃▇▇
FinalGradeCEMS 30 0.95 77.20 22.23 0.00 71.25 84.57 92.10 100.00 ▁▁▁▃▇
Points_Possible 0 1.00 76.87 167.51 5.00 10.00 10.00 30.00 935.00 ▇▁▁▁▁
Points_Earned 92 0.85 68.63 145.26 0.00 7.00 10.00 26.12 828.20 ▇▁▁▁▁
q1 123 0.80 4.30 0.68 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q2 126 0.79 3.63 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▃
q3 123 0.80 3.33 0.91 1.00 3.00 3.00 4.00 5.00 ▁▃▇▅▂
q4 125 0.79 4.27 0.85 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q5 127 0.79 4.19 0.68 2.00 4.00 4.00 5.00 5.00 ▁▂▁▇▅
q6 127 0.79 4.01 0.80 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▅
q7 129 0.79 3.91 0.82 1.00 3.00 4.00 4.75 5.00 ▁▁▅▇▅
q8 129 0.79 4.29 0.68 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▆
q9 129 0.79 3.49 0.98 1.00 3.00 4.00 4.00 5.00 ▁▃▇▇▃
q10 129 0.79 4.10 0.93 1.00 4.00 4.00 5.00 5.00 ▁▂▃▇▇
TimeSpent 5 0.99 1799.75 1354.93 0.45 851.90 1550.91 2426.09 8870.88 ▇▅▁▁▁
TimeSpent_hours 5 0.99 30.00 22.58 0.01 14.20 25.85 40.43 147.85 ▇▅▁▁▁
TimeSpent_std 5 0.99 0.00 1.00 -1.33 -0.70 -0.18 0.46 5.22 ▇▅▁▁▁
int 76 0.87 4.22 0.59 2.00 3.90 4.20 4.70 5.00 ▁▁▃▇▇
pc 75 0.88 3.61 0.64 1.50 3.00 3.50 4.00 5.00 ▁▁▇▅▂
uv 75 0.88 3.72 0.70 1.00 3.33 3.67 4.17 5.00 ▁▁▆▇▅

Question: What did you discover from the skim() output? Mention at least two things — for example: variables with missing data, the range of a numeric variable, or anything that caught your attention.

  • [I noticed in the time spent it went from 5 to over a hundred.Under gender it has only number 1 and not any 0’s or it doesn’t say M or F. The grade book item ranges from 1 to 35 with 35 being the highest other than 1, one being 13.Also the semester has number 3, this threw me off because I’m thinking there are two semesters in a year.]

Part 3 · Cleaning data

Handling missing values — the right way

A common mistake is using na.omit() to remove ALL rows with any missing value. Let’s see why this is usually a bad idea:

# This removes every row that has ANY missing value in ANY column
data_naive <- na.omit(data)

nrow(data)        # how many rows originally?
[1] 603
nrow(data_naive)  # how many rows after na.omit?
[1] 0

Question: What happened to the dataset? Why do you think removing all rows with any missing value is not the right approach for this data?

  • [It removed all the columns because there were errors throughout many of them. Removing all the rows with missing values.When I removed all rows with missing values, the dataset became much smaller because many variables had at least some missing data. This means we would lose too many records and remove student information that could be analyzed differently.]

A better approach: target specific variables

# Remove rows where FinalGradeCEMS is missing — our key outcome variable
data_clean <- data |>
  filter(!is.na(FinalGradeCEMS))

nrow(data_clean)
[1] 573
# Confirm FinalGradeCEMS no longer has missing values
sum(is.na(data_clean$FinalGradeCEMS))
[1] 0
# Inspect the cleaned data
head(data_clean)

Standardize column names (optional but good practice)

# clean_names() from the janitor package standardizes column names:
# removes spaces, converts to lowercase, fixes special characters
data_clean <- data_clean |> clean_names()

glimpse(data_clean)
Rows: 573
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 52326, 52446,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4325, 2086, 4655, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 2255, 1719, 3149, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "AnPh…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "01", "01", "01", …
$ gradebook_item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ grade_category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ final_grade_cems      <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ points_possible       <dbl> 5, 10, 10, 5, 438, 10, 10, 443, 12, 10, 5, 10, 2…
$ points_earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, 10.00, 425.00, …
$ gender                <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q2                    <dbl> 4, 4, 4, 5, 3, 5, 3, 3, NA, 5, 3, 3, NA, 2, 4, N…
$ q3                    <dbl> 4, 3, 4, 3, 3, 3, 3, 3, NA, 3, 3, 5, NA, 2, 3, N…
$ q4                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 3, 5, NA, 4, 5, N…
$ q5                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 4, 5, NA, 4, 4, N…
$ q6                    <dbl> 5, 4, 4, 5, 4, 5, 4, 3, NA, 5, 3, 5, NA, 4, 4, N…
$ q7                    <dbl> 5, 4, 4, 4, 4, 4, 3, 3, NA, 5, 3, 5, NA, 4, 5, N…
$ q8                    <dbl> 5, 5, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q9                    <dbl> 4, 4, 3, 5, NA, 5, 3, 2, NA, 5, 2, 2, NA, 2, 4, …
$ q10                   <dbl> 5, 4, 5, 5, 3, 5, 3, 5, NA, 4, 4, 5, NA, 4, 4, N…
$ time_spent            <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours      <dbl> 25.919445, 23.045002, 14.340558, 26.643610, 24.6…
$ time_spent_std        <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 5.0, 3.0, 4.2, NA, 4.4,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 3.50, 3.00, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
Note

After clean_names(), all column names become lowercase_with_underscores. If column names changed, update your code below to match the new names. Check names(data_clean) if you are not sure what the columns are called.

# Always confirm your column names after cleaning
names(data_clean)
 [1] "student_id"            "course_id"             "total_points_possible"
 [4] "total_points_earned"   "percentage_earned"     "subject"              
 [7] "semester"              "section"               "gradebook_item"       
[10] "grade_category"        "final_grade_cems"      "points_possible"      
[13] "points_earned"         "gender"                "q1"                   
[16] "q2"                    "q3"                    "q4"                   
[19] "q5"                    "q6"                    "q7"                   
[22] "q8"                    "q9"                    "q10"                  
[25] "time_spent"            "time_spent_hours"      "time_spent_std"       
[28] "int"                   "pc"                    "uv"                   

Removing unnecessary columns

# Use - to drop columns you will not use
# grade_category is a derived variable we will not need for this analysis

Part 4 · Selecting and filtering

select() — choose columns

# Select specific columns by name
selected_data <- data_clean |>
  select(student_id, subject, semester, final_grade_cems)

head(selected_data)

Task: Select all columns except subject and section. Save the result as reduced_data and inspect it.

# Hint: use - before column names to exclude them.
# Example: select(data_clean, -column_name)
# You have seen select() used once above — try writing this one yourself.

# Inspect the reduced_data dataset to verify excluded columns and structure
# write your code here
select(data_clean, -subject, -section)
reduced_data <- select(data_clean, -subject, -section)
head(reduced_data)
glimpse(reduced_data)
Rows: 573
Columns: 28
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 52326, 52446,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4325, 2086, 4655, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 2255, 1719, 3149, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ gradebook_item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ grade_category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ final_grade_cems      <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ points_possible       <dbl> 5, 10, 10, 5, 438, 10, 10, 443, 12, 10, 5, 10, 2…
$ points_earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, 10.00, 425.00, …
$ gender                <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q2                    <dbl> 4, 4, 4, 5, 3, 5, 3, 3, NA, 5, 3, 3, NA, 2, 4, N…
$ q3                    <dbl> 4, 3, 4, 3, 3, 3, 3, 3, NA, 3, 3, 5, NA, 2, 3, N…
$ q4                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 3, 5, NA, 4, 5, N…
$ q5                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 4, 5, NA, 4, 4, N…
$ q6                    <dbl> 5, 4, 4, 5, 4, 5, 4, 3, NA, 5, 3, 5, NA, 4, 4, N…
$ q7                    <dbl> 5, 4, 4, 4, 4, 4, 3, 3, NA, 5, 3, 5, NA, 4, 5, N…
$ q8                    <dbl> 5, 5, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q9                    <dbl> 4, 4, 3, 5, NA, 5, 3, 2, NA, 5, 2, 2, NA, 2, 4, …
$ q10                   <dbl> 5, 4, 5, 5, 3, 5, 3, 5, NA, 4, 4, 5, NA, 4, 4, N…
$ time_spent            <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours      <dbl> 25.919445, 23.045002, 14.340558, 26.643610, 24.6…
$ time_spent_std        <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 5.0, 3.0, 4.2, NA, 4.4,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 3.50, 3.00, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
colnames(reduced_data)
 [1] "student_id"            "course_id"             "total_points_possible"
 [4] "total_points_earned"   "percentage_earned"     "semester"             
 [7] "gradebook_item"        "grade_category"        "final_grade_cems"     
[10] "points_possible"       "points_earned"         "gender"               
[13] "q1"                    "q2"                    "q3"                   
[16] "q4"                    "q5"                    "q6"                   
[19] "q7"                    "q8"                    "q9"                   
[22] "q10"                   "time_spent"            "time_spent_hours"     
[25] "time_spent_std"        "int"                   "pc"                   
[28] "uv"                   

filter() — choose rows

# Keep only students in "OcnA" courses
ocna_students <- data_clean |>
  filter(subject == "OcnA")

nrow(ocna_students)
[1] 105
head(ocna_students)

Task: Filter to remove rows where total_points_possible is NA. Save as no_na_points and inspect with glimpse().

# Hint: !is.na(column_name) means "where column_name is NOT missing"

no_na_points <- data_clean |>
  filter(!is.na(total_points_possible))

glimpse(no_na_points)
Rows: 573
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 52326, 52446,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4325, 2086, 4655, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 2255, 1719, 3149, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "AnPh…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "01", "01", "01", …
$ gradebook_item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ grade_category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ final_grade_cems      <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ points_possible       <dbl> 5, 10, 10, 5, 438, 10, 10, 443, 12, 10, 5, 10, 2…
$ points_earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, 10.00, 425.00, …
$ gender                <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q2                    <dbl> 4, 4, 4, 5, 3, 5, 3, 3, NA, 5, 3, 3, NA, 2, 4, N…
$ q3                    <dbl> 4, 3, 4, 3, 3, 3, 3, 3, NA, 3, 3, 5, NA, 2, 3, N…
$ q4                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 3, 5, NA, 4, 5, N…
$ q5                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 4, 5, NA, 4, 4, N…
$ q6                    <dbl> 5, 4, 4, 5, 4, 5, 4, 3, NA, 5, 3, 5, NA, 4, 4, N…
$ q7                    <dbl> 5, 4, 4, 4, 4, 4, 3, 3, NA, 5, 3, 5, NA, 4, 5, N…
$ q8                    <dbl> 5, 5, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q9                    <dbl> 4, 4, 3, 5, NA, 5, 3, 2, NA, 5, 2, 2, NA, 2, 4, …
$ q10                   <dbl> 5, 4, 5, 5, 3, 5, 3, 5, NA, 4, 4, 5, NA, 4, 4, N…
$ time_spent            <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours      <dbl> 25.919445, 23.045002, 14.340558, 26.643610, 24.6…
$ time_spent_std        <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 5.0, 3.0, 4.2, NA, 4.4,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 3.50, 3.00, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Question: (In your own words,) What does the ! symbol mean in R?

  • [In the above code it says, where column_name is NOT missing, so I think ! means NOT or the reverse of something. I had to look it up to understand it means the reverse of a code.]

Task: Filter to keep only students with final_grade_cems above 85. Save as data_over85. Then separately filter for time_spent_hours greater than 50. Save as data_timeover50.

data_over85 <- data_clean |>
  filter(final_grade_cems > 85)

nrow(data_over85)
[1] 279
# Now filter for time_spent_hours > 50
data_timeover50 <- data_clean |>
  filter(time_spent_hours > 50)

nrow(data_timeover50)
[1] 96

Part 5 · Creating and modifying variables

mutate() and ifelse()

mutate() adds or changes a column. ifelse() lets you create a variable whose value depends on a condition.

# Create 'added_gender' — spell out "Female" / "Male" from the "F"/"M" codes
data_gender <- data_clean |>
  mutate(added_gender = ifelse(gender == "F", "Female", "Male"))

# Inspect the new column
data_gender |> select(student_id, gender, added_gender) |> head()

Question: In your own words, how does ifelse() work? What are its three parts?

  • [I think ifelse works by showing a condition like if like the variable of a student is female then do this condtion, but student is not female (variable) then do that. Its shows the condition if something happens or does not happen then another condition will occur based on the variable.]

Task: Create a new variable called pass_fail — “Pass” if final_grade_cems is 70 or above, “Fail” if below 70.

# YOUR CODE HERE — use mutate() and ifelse() to create a pass_fail column,
# then use count() to see how many students passed vs. failed.

#| label: mutate-task
 data_clean |>
  mutate(pass_fail = ifelse(final_grade_cems >= 70, "Pass", "Fail")) |>
  head()

Part 6 · Sorting data

arrange()

# Sort by subject, then by final grade descending
arranged_data <- data_clean |>
  arrange(subject, desc(final_grade_cems))

head(arranged_data)

Part 7 · Chaining with the pipe |>

The pipe |> chains operations together so you can do multiple steps in one readable flow. Read each |> as “then.”

# All in one pipeline:
# start with data → select columns → filter subject → arrange by grade
final_data <- data_clean |>
  select(student_id, subject, semester, final_grade_cems) |>
  filter(subject == "OcnA") |>
  arrange(desc(final_grade_cems))

print(final_data)
# A tibble: 105 × 4
   student_id subject semester final_grade_cems
        <dbl> <chr>   <chr>               <dbl>
 1      66740 OcnA    S116                 99.3
 2      91163 OcnA    S216                 97.4
 3      94744 OcnA    S216                 96.8
 4      91818 OcnA    S116                 96.5
 5      90090 OcnA    S116                 96.3
 6      88168 OcnA    S116                 96.0
 7      89114 OcnA    S116                 95.0
 8      86758 OcnA    S116                 94.6
 9      68476 OcnA    S116                 94.6
10      79893 OcnA    T116                 94.5
# ℹ 95 more rows

Task: Build your own pipeline. Start with data_clean, and select any four columns of your choice, filter for gender == "M", and arrange by time_spent_hours descending. Save as my_pipeline and inspect the result.

TipWriting this yourself

By this point you have seen select(), filter(), arrange(), and |> used multiple times. This chunk has no hints — write the full pipeline from memory. If you need to check a column name, run names(data_clean) in the Console first.

# YOUR CODE HERE — write the full pipeline without looking at the examples above

#pipeline task

my_pipeline <- data_clean |>
  select(student_id, final_grade_cems, gender, time_spent_hours) |>
  filter(gender == "M") |>
  arrange(desc(time_spent_hours))

print(my_pipeline)
# A tibble: 172 × 4
   student_id final_grade_cems gender time_spent_hours
        <dbl>            <dbl> <chr>             <dbl>
 1      69743             85.5 M                 118. 
 2      86429             89.5 M                 117. 
 3      91163             97.4 M                 104. 
 4      92185             91.9 M                 103. 
 5      76132             95.3 M                  84.0
 6      88849             90.8 M                  80.0
 7      84641             83.6 M                  76.9
 8      94251             96.9 M                  75.9
 9      95634             76.6 M                  74.4
10      94805             87.3 M                  74.0
# ℹ 162 more rows

Part 8 · First visualization

NotePredict before you run

Before running the scatter plot below, write your prediction in the space provided. Students who commit to a prediction before seeing the output retain the interpretation better than students who just read the result.

Prediction: Before running the next chunk, write one sentence: do you expect students who spend more time on the LMS to earn higher or lower grades? How strong do you expect the relationship to be — weak, moderate, or strong?

  • [I predict that students who spend more time on the LMS to earn a higher grade because they are learning the format of the system. It may not be 100% correct but they are getting practice. I expect the relationship to be stronger than the ones who don’t spend more time on the LMS.]
# Scatter plot of time spent vs. final grade
ggplot(data_clean, aes(x = time_spent_hours, y = final_grade_cems)) +
  geom_point(color = "#378ADD", size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", color = "#0F6E56", se = TRUE) +
  labs(
    title = "Time Spent vs. Final Grade",
    x = "Time Spent on LMS (hours)",
    y = "Final Grade"
  ) +
  theme_minimal()

Relationship between time spent on the LMS and final grade

Question: Based on this scatter plot, what do you expect the relationship between time spent and final grades to be? Why?

  • [The scatterplot shows a positive correlation between time spent and grade. Students who invest more time tend to earn higher grades because the data points are moving up and to the right which shows a positive correlation. ]

Final practice

Use data_clean or create a subset — write your own code for each task.

# Show two different ways to use select() — pick different column sets each time.


selected_data <- data_clean |>
  select(time_spent, final_grade_cems)
head(reduced_data)
glimpse(reduced_data)
Rows: 573
Columns: 28
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 52326, 52446,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4325, 2086, 4655, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 2255, 1719, 3149, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ gradebook_item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ grade_category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ final_grade_cems      <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ points_possible       <dbl> 5, 10, 10, 5, 438, 10, 10, 443, 12, 10, 5, 10, 2…
$ points_earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, 10.00, 425.00, …
$ gender                <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q2                    <dbl> 4, 4, 4, 5, 3, 5, 3, 3, NA, 5, 3, 3, NA, 2, 4, N…
$ q3                    <dbl> 4, 3, 4, 3, 3, 3, 3, 3, NA, 3, 3, 5, NA, 2, 3, N…
$ q4                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 3, 5, NA, 4, 5, N…
$ q5                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 4, 5, NA, 4, 4, N…
$ q6                    <dbl> 5, 4, 4, 5, 4, 5, 4, 3, NA, 5, 3, 5, NA, 4, 4, N…
$ q7                    <dbl> 5, 4, 4, 4, 4, 4, 3, 3, NA, 5, 3, 5, NA, 4, 5, N…
$ q8                    <dbl> 5, 5, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q9                    <dbl> 4, 4, 3, 5, NA, 5, 3, 2, NA, 5, 2, 2, NA, 2, 4, …
$ q10                   <dbl> 5, 4, 5, 5, 3, 5, 3, 5, NA, 4, 4, 5, NA, 4, 4, N…
$ time_spent            <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ time_spent_hours      <dbl> 25.919445, 23.045002, 14.340558, 26.643610, 24.6…
$ time_spent_std        <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 5.0, 3.0, 4.2, NA, 4.4,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 3.50, 3.00, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
colnames(reduced_data)
 [1] "student_id"            "course_id"             "total_points_possible"
 [4] "total_points_earned"   "percentage_earned"     "semester"             
 [7] "gradebook_item"        "grade_category"        "final_grade_cems"     
[10] "points_possible"       "points_earned"         "gender"               
[13] "q1"                    "q2"                    "q3"                   
[16] "q4"                    "q5"                    "q6"                   
[19] "q7"                    "q8"                    "q9"                   
[22] "q10"                   "time_spent"            "time_spent_hours"     
[25] "time_spent_std"        "int"                   "pc"                   
[28] "uv"                   
select(data_clean, -time_spent, -final_grade_cems)
reduced_data <- select(data_clean, -time_spent, -final_grade_cems)
head(reduced_data)
glimpse(reduced_data)
Rows: 573
Columns: 28
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 52326, 52446,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4325, 2086, 4655, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 2255, 1719, 3149, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "AnPh…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "01", "01", "01", …
$ gradebook_item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ grade_category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ points_possible       <dbl> 5, 10, 10, 5, 438, 10, 10, 443, 12, 10, 5, 10, 2…
$ points_earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, 10.00, 425.00, …
$ gender                <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q2                    <dbl> 4, 4, 4, 5, 3, 5, 3, 3, NA, 5, 3, 3, NA, 2, 4, N…
$ q3                    <dbl> 4, 3, 4, 3, 3, 3, 3, 3, NA, 3, 3, 5, NA, 2, 3, N…
$ q4                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 3, 5, NA, 4, 5, N…
$ q5                    <dbl> 5, 4, 5, 5, 4, 5, 3, 4, NA, 5, 4, 5, NA, 4, 4, N…
$ q6                    <dbl> 5, 4, 4, 5, 4, 5, 4, 3, NA, 5, 3, 5, NA, 4, 4, N…
$ q7                    <dbl> 5, 4, 4, 4, 4, 4, 3, 3, NA, 5, 3, 5, NA, 4, 5, N…
$ q8                    <dbl> 5, 5, 5, 5, 4, 5, 3, 4, NA, 4, 3, 5, NA, 4, 4, N…
$ q9                    <dbl> 4, 4, 3, 5, NA, 5, 3, 2, NA, 5, 2, 2, NA, 2, 4, …
$ q10                   <dbl> 5, 4, 5, 5, 3, 5, 3, 5, NA, 4, 4, 5, NA, 4, 4, N…
$ time_spent_hours      <dbl> 25.919445, 23.045002, 14.340558, 26.643610, 24.6…
$ time_spent_std        <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 5.0, 3.0, 4.2, NA, 4.4,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 3.50, 3.00, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
colnames(reduced_data)
 [1] "student_id"            "course_id"             "total_points_possible"
 [4] "total_points_earned"   "percentage_earned"     "subject"              
 [7] "semester"              "section"               "gradebook_item"       
[10] "grade_category"        "points_possible"       "points_earned"        
[13] "gender"                "q1"                    "q2"                   
[16] "q3"                    "q4"                    "q5"                   
[19] "q6"                    "q7"                    "q8"                   
[22] "q9"                    "q10"                   "time_spent_hours"     
[25] "time_spent_std"        "int"                   "pc"                   
[28] "uv"                   
# Use filter() to create one meaningful subset of this data.
# Choose a filter condition that makes instructional sense.

data_clean <- data |>
  filter(!is.na(FinalGradeCEMS))

nrow(data_clean)
[1] 573
# Confirm FinalGradeCEMS no longer has missing values
sum(is.na(data_clean$FinalGradeCEMS))
[1] 0
# Use arrange() to sort the data in a way that would be useful to a teacher
# or instructional designer reviewing student performance.



arranged_data <- data_clean |>
  arrange(TimeSpent, FinalGradeCEMS)

head(arranged_data)

Render & submit

Step 1 — Add your name

Change the author: field in the YAML header at the very top of this file to your name. The YAML header controls the document’s metadata and formatting — it does not display in the final rendered output.

Step 2 — Render

Click the Render button in the toolbar above. A formatted HTML page will appear in your Viewer tab or open in a new browser window. If you see any error messages, check the Console pane below for the exact error text — copy it and post in the course discussion board if you are stuck.

Step 3 — Publish

Share a link to your published document with your instructor. Choose any publishing method below:

Option Best for Link
Posit Cloud Quickest — one click from your workspace Guide
RPubs Free, public, easy to share a link rpubs.com
Quarto Pub Clean public portfolio pages Guide
GitHub Pages Best for a professional portfolio Guide
TipE-portfolio tip

If you are building a professional e-portfolio, this document demonstrates your ability to load, clean, filter, and visualize real educational data in R. Consider publishing to Quarto Pub or GitHub Pages for a shareable, permanent link you can include in a resume or portfolio site.

Once your instructor has reviewed your published link, you will be notified and the module will be marked complete.

If you have any questions or run into technical issues, contact your instructor or post in the course discussion board with the exact error message from the Console.