Unit 2: Rtutorial

Author

Alex Turk

Week 6: Working with Data

In the previous R tutorial, we started to work on science classroom dataset. We applied the data intensive research steps to explore our data and investigate the relationship between students’ grades and time-spent.

Let’s remember which libraries and functions we used!

Your Turn:

Please write down one or two sentence to explain why and how we use the following libraries and functions.

  1. tidyverse: R packages designed for data science, making it easier to import, clean, transform, and visualize data using a consistent and user-friendly syntax.

  2. skimr: provides extended summary statistics for a data frame, helping you quickly understand the structure and missing values in your dataset.

  3. ggplot: used to create high-quality, customizable data visualizations such as scatter plots, histograms, and more.

  4. read_csv(): reads CSV (comma-separated values) files into R as tibbles, allowing for quick data import.

  5. view(): opens your data frame in a spreadsheet-style window, making it easy to browse data interactively.

  6. glimpse(): a compact overview of your data, showing column names, types, and a sample of values.

  7. head(): displays the first 6 rows of your data, which is useful for a quick check of the data structure.

  8. tail(): shows the last 6 rows of your data frame.

  9. select(): is used to choose specific columns from your data frame.

  10. filter(): used to pick out rows based on specified criteria.

  11. arrange(): sorts your data by one or more columns.

  12. desc(): used within arrange() to sort a column in descending order.

  13. geom_histogram(): creates a histogram, which shows the distribution of a numeric variable.

  14. geom_point(): creates a scatter plot, showing the relationship between two numeric variables.

    Load the Tidyverse Package

Let’s start our R code along by loading the tidyverse package.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the online science class data

Now, load the online science class data from the data folder and assign your data to a new object.

library(readr)
sci_classes <- read_csv("Data/sci-online-classes.csv")
Rows: 603 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): course_id, subject, semester, section, Gradebook_Item, Gender
dbl (23): student_id, total_points_possible, total_points_earned, percentage...
lgl  (1): Grade_Category

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You loaded the data, now what should we do?

sci_classes
# A tibble: 603 × 30
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      43146 FrScA-S216-02                  3280                2220
 2      44638 OcnA-S116-01                   3531                2672
 3      47448 FrScA-S216-01                  2870                1897
 4      47979 OcnA-S216-01                   4562                3090
 5      48797 PhysA-S116-01                  2207                1910
 6      51943 FrScA-S216-03                  4208                3596
 7      52326 AnPhA-S216-01                  4325                2255
 8      52446 PhysA-S116-01                  2086                1719
 9      53447 FrScA-S116-01                  4655                3149
10      53475 FrScA-S116-02                  1710                1402
# ℹ 593 more rows
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Your Turn:

Examine the contents of sci_classes in your console. You should type the object name to the console and check that.

Question: Is your object a tibble? How do you know?

Your response here: Yes, the object is a tibble. The console output says “A tibble” at the top and displays only the first 10 rows and the column types.

Hint: Check the output in the console.

Check your data with different functions

You can check your data with different functions. Let’s remember how we use different functions to check our data.

view(sci_classes)
glimpse(sci_classes)
Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…
head(sci_classes)
# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      43146 FrScA-S216-02                  3280                2220
2      44638 OcnA-S116-01                   3531                2672
3      47448 FrScA-S216-01                  2870                1897
4      47979 OcnA-S216-01                   4562                3090
5      48797 PhysA-S116-01                  2207                1910
6      51943 FrScA-S216-03                  4208                3596
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
tail(sci_classes)
# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      97150 PhysA-S216-01                  2710                1803
2      97265 PhysA-S216-01                  3101                2078
3      97272 OcnA-S216-01                   2872                1733
4      97374 BioA-S216-01                   8586                6978
5      97386 BioA-S216-01                   2761                1937
6      97441 FrScA-S216-02                  2607                2205
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Isolating Data with dplyr

We will use select() function to select the following columns from our data.

  • student_id

  • subject

  • semester

  • FinalGradeCEMS

  • After selecting these columns, assign that to a new object with a name of “student_grade”.

student_grade <- sci_classes %>%
  select(student_id, subject, semester, FinalGradeCEMS)

Your Turn:

Examine students’ grades, what did you realize about it?

Your response here: Some students have missing (NA) values in their grades, indicating not all students have a final grade recorded.

Hint: Check the missing data.

Specific select

Now, we will make a specific selection.

  • Select all columns except subject and semester.

  • Assign to a new object with a different name.

  • Examine your data frame.

no_subj_sem <- sci_classes %>%
  select(-subject, -semester)
view(no_subj_sem)

Checking the data frame:

Your Turn:

  • Select all columns except student_id and FinalGradeCEMS.

  • Assign to a new object with a different name.

  • Examine your data frame.

no_id_grade <- sci_classes %>%
  select(-student_id, -FinalGradeCEMS)
view(no_id_grade)

Specific select

  • Select only the columns that start with Time

  • Assign to a new object with a different name.

  • Use view() function to examine your data frame.

time_columns <- sci_classes %>%
  select(starts_with("Time"))
view(time_columns)

Your Turn:

  • Select only the columns that ends with “r”

  • Assign to a new object with a different name.

  • Use view() function to examine your data frame.

end_r_columns <- sci_classes %>%
  select(ends_with("r"))
view(end_r_columns)

Filter Function

  • Filter the sci_classes data frame for just males.

  • Assign to a new object.

  • Use the head() function to examine your data frame.

males <- sci_classes %>%
  filter(Gender == "Male")
head(males)
# A tibble: 0 × 30
# ℹ 30 variables: student_id <dbl>, course_id <chr>,
#   total_points_possible <dbl>, total_points_earned <dbl>,
#   percentage_earned <dbl>, subject <chr>, semester <chr>, section <chr>,
#   Gradebook_Item <chr>, Grade_Category <lgl>, FinalGradeCEMS <dbl>,
#   Points_Possible <dbl>, Points_Earned <dbl>, Gender <chr>, q1 <dbl>,
#   q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
#   q9 <dbl>, q10 <dbl>, TimeSpent <dbl>, TimeSpent_hours <dbl>, …

Your Turn:

  • Filter the sci_classes data frame for just females.

  • Assign to a new object.

  • Use the tail() function to examine your data frame.

females <- sci_classes %>%
  filter(Gender == "Female")
tail(females)
# A tibble: 0 × 30
# ℹ 30 variables: student_id <dbl>, course_id <chr>,
#   total_points_possible <dbl>, total_points_earned <dbl>,
#   percentage_earned <dbl>, subject <chr>, semester <chr>, section <chr>,
#   Gradebook_Item <chr>, Grade_Category <lgl>, FinalGradeCEMS <dbl>,
#   Points_Possible <dbl>, Points_Earned <dbl>, Gender <chr>, q1 <dbl>,
#   q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
#   q9 <dbl>, q10 <dbl>, TimeSpent <dbl>, TimeSpent_hours <dbl>, …

Let’s try filter function with two arguments now.

  • Filter the sci_classes data frame for students whose

  • percentage_earned is greater than 0.8

  • Assign to a new object.

  • Use the tail() function to examine your data frame. 

high_achievers <- sci_classes %>%
  filter(percentage_earned > 0.8)
tail(high_achievers)
# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      96856 OcnA-S216-01                   9630                8110
2      96871 AnPhA-S216-02                  3012                2560
3      96950 BioA-S216-01                   6190                4970
4      97006 OcnA-S216-01                   3922                3209
5      97374 BioA-S216-01                   8586                6978
6      97441 FrScA-S216-02                  2607                2205
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Your Turn:

Filter the sci_classes data frame for students whose

  • percentage_earned is smaller or equal to 0.6

  • Assign to a new object.

  • Use the head() function to examine your data frame. 

low_achievers <- sci_classes %>%
  filter(percentage_earned <= 0.6)
head(low_achievers)
# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      52326 AnPhA-S216-01                  4325                2255
2      65116 FrScA-S216-01                  4320                2326
3      77067 FrScA-S216-02                  3272                1896
4      78153 PhysA-S216-01                  6530                3702
5      84749 AnPhA-T116-01                  4543                2646
6      84794 BioA-T116-01                   2787                1439
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Let’s use filter () function for the missing data.

  • Filter the sci_classes data frame so rows with 

  • NA for points earned are removed.

  • Assign to a new object.

  • Use the glimpse() function to examine your data frame.

no_na_points <- sci_classes %>%
  filter(!is.na(total_points_earned))
glimpse(no_na_points)
Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Filter the sci_classes data for the following subjects:

  • BioA

  • PhysA

  • OcnA

  • Assign to a new object with a different name.

  • Use the summary() function to examine your data frame.

selected_subjects <- sci_classes %>%
  filter(subject %in% c("BioA", "PhysA", "OcnA"))
summary(selected_subjects)
   student_id     course_id         total_points_possible total_points_earned
 Min.   :44638   Length:219         Min.   :  898         Min.   :  721      
 1st Qu.:85446   Class :character   1st Qu.: 2924         1st Qu.: 2196      
 Median :88703   Mode  :character   Median : 3682         Median : 2830      
 Mean   :85846                      Mean   : 4396         Mean   : 3370      
 3rd Qu.:92742                      3rd Qu.: 5051         3rd Qu.: 3892      
 Max.   :97386                      Max.   :15092         Max.   :12208      
                                                                             
 percentage_earned   subject            semester           section         
 Min.   :0.4956    Length:219         Length:219         Length:219        
 1st Qu.:0.7072    Class :character   Class :character   Class :character  
 Median :0.7776    Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.7637                                                            
 3rd Qu.:0.8314                                                            
 Max.   :0.9010                                                            
                                                                           
 Gradebook_Item     Grade_Category FinalGradeCEMS  Points_Possible 
 Length:219         Mode:logical   Min.   : 0.00   Min.   :  5.00  
 Class :character   NA's:219       1st Qu.:68.30   1st Qu.: 10.00  
 Mode  :character                  Median :83.20   Median : 10.00  
                                   Mean   :74.91   Mean   : 77.36  
                                   3rd Qu.:93.03   3rd Qu.: 30.00  
                                   Max.   :99.76   Max.   :935.00  
                                   NA's   :9                       
 Points_Earned       Gender                q1              q2       
 Min.   :  0.00   Length:219         Min.   :1.000   Min.   :2.000  
 1st Qu.:  7.00   Class :character   1st Qu.:4.000   1st Qu.:3.000  
 Median : 10.00   Mode  :character   Median :4.000   Median :4.000  
 Mean   : 72.23                      Mean   :4.261   Mean   :3.623  
 3rd Qu.: 27.38                      3rd Qu.:5.000   3rd Qu.:4.000  
 Max.   :651.20                      Max.   :5.000   Max.   :5.000  
 NA's   :36                          NA's   :35      NA's   :36     
       q3              q4              q5              q6       
 Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
 Median :3.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.332   Mean   :4.262   Mean   :4.187   Mean   :4.033  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :35      NA's   :36      NA's   :37      NA's   :37     
       q7              q8              q9             q10       
 Min.   :2.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000  
 Median :4.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.899   Mean   :4.256   Mean   :3.486   Mean   :4.028  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :40      NA's   :39      NA's   :40      NA's   :38     
   TimeSpent         TimeSpent_hours     TimeSpent_std           int       
 Min.   :   0.5833   Min.   :9.722e-03   Min.   :-1.32787   Min.   :2.000  
 1st Qu.: 647.8042   1st Qu.:1.080e+01   1st Qu.:-0.85019   1st Qu.:3.800  
 Median :1455.0584   Median :2.425e+01   Median :-0.25440   Median :4.200  
 Mean   :1681.1068   Mean   :2.802e+01   Mean   :-0.08757   Mean   :4.202  
 3rd Qu.:2273.3585   3rd Qu.:3.789e+01   3rd Qu.: 0.34954   3rd Qu.:4.700  
 Max.   :8870.8833   Max.   :1.478e+02   Max.   : 5.21882   Max.   :5.000  
 NA's   :1           NA's   :1           NA's   :1          NA's   :19     
       pc              uv       
 Min.   :2.000   Min.   :1.333  
 1st Qu.:3.000   1st Qu.:3.333  
 Median :3.500   Median :3.833  
 Mean   :3.597   Mean   :3.730  
 3rd Qu.:4.000   3rd Qu.:4.028  
 Max.   :5.000   Max.   :5.000  
 NA's   :19      NA's   :19     

Arrange () Function

Let’s recall how we were using the arrange () function for our dataset.

  • Arrange sci_classes by subject subject then 

  • percentage_earned in descending order.

  • Assign to a new object.

  • Use the str() function to examine your data frame.

arranged <- sci_classes %>%
  arrange(subject, desc(percentage_earned))
str(arranged)
spc_tbl_ [603 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ student_id           : num [1:603] 70192 86488 96690 91175 86267 ...
 $ course_id            : chr [1:603] "AnPhA-S116-02" "AnPhA-S116-01" "AnPhA-S216-01" "AnPhA-S116-02" ...
 $ total_points_possible: num [1:603] 1936 3342 4804 3199 3045 ...
 $ total_points_earned  : num [1:603] 1763 3033 4309 2867 2705 ...
 $ percentage_earned    : num [1:603] 0.911 0.908 0.897 0.896 0.888 ...
 $ subject              : chr [1:603] "AnPhA" "AnPhA" "AnPhA" "AnPhA" ...
 $ semester             : chr [1:603] "S116" "S116" "S216" "S116" ...
 $ section              : chr [1:603] "02" "01" "01" "02" ...
 $ Gradebook_Item       : chr [1:603] "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" ...
 $ Grade_Category       : logi [1:603] NA NA NA NA NA NA ...
 $ FinalGradeCEMS       : num [1:603] 96 87.4 64.8 82.2 35.1 ...
 $ Points_Possible      : num [1:603] 10 28 10 5 50 15 10 10 353 460 ...
 $ Points_Earned        : num [1:603] 7 26 3 5 50 11 8 10 330 452 ...
 $ Gender               : chr [1:603] "F" "M" "F" "F" ...
 $ q1                   : num [1:603] 4 4 4 5 5 4 5 4 NA NA ...
 $ q2                   : num [1:603] 3 4 3 3 5 2 4 4 NA NA ...
 $ q3                   : num [1:603] 3 2 2 3 3 3 4 3 NA NA ...
 $ q4                   : num [1:603] 4 3 5 5 5 4 5 4 NA NA ...
 $ q5                   : num [1:603] 4 3 4 5 5 4 5 4 NA NA ...
 $ q6                   : num [1:603] 3 3 4 4 5 3 5 4 NA NA ...
 $ q7                   : num [1:603] 3 3 3 3 4 4 5 4 NA NA ...
 $ q8                   : num [1:603] 5 2 4 5 5 4 4 4 NA NA ...
 $ q9                   : num [1:603] 2 3 3 3 5 1 4 4 NA NA ...
 $ q10                  : num [1:603] 5 3 2 5 5 2 5 4 NA NA ...
 $ TimeSpent            : num [1:603] 1537 3600 1970 1315 406 ...
 $ TimeSpent_hours      : num [1:603] 25.62 60 32.83 21.92 6.77 ...
 $ TimeSpent_std        : num [1:603] -0.194 1.328 0.125 -0.358 -1.029 ...
 $ int                  : num [1:603] 4.4 3 3.8 5 5 3.9 4.6 4 4.8 4.6 ...
 $ pc                   : num [1:603] 3 2.5 2.5 3 3.5 3.5 3.75 3.5 3.5 4.5 ...
 $ uv                   : num [1:603] 2.67 3.33 3.33 3.33 5 ...
 - attr(*, "spec")=
  .. cols(
  ..   student_id = col_double(),
  ..   course_id = col_character(),
  ..   total_points_possible = col_double(),
  ..   total_points_earned = col_double(),
  ..   percentage_earned = col_double(),
  ..   subject = col_character(),
  ..   semester = col_character(),
  ..   section = col_character(),
  ..   Gradebook_Item = col_character(),
  ..   Grade_Category = col_logical(),
  ..   FinalGradeCEMS = col_double(),
  ..   Points_Possible = col_double(),
  ..   Points_Earned = col_double(),
  ..   Gender = col_character(),
  ..   q1 = col_double(),
  ..   q2 = col_double(),
  ..   q3 = col_double(),
  ..   q4 = col_double(),
  ..   q5 = col_double(),
  ..   q6 = col_double(),
  ..   q7 = col_double(),
  ..   q8 = col_double(),
  ..   q9 = col_double(),
  ..   q10 = col_double(),
  ..   TimeSpent = col_double(),
  ..   TimeSpent_hours = col_double(),
  ..   TimeSpent_std = col_double(),
  ..   int = col_double(),
  ..   pc = col_double(),
  ..   uv = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

%>% Pipe Operator

Using sci_classes data and the %>% pipe operator:

  • Select subject, section, time spent in hours and final course grade.

  • Filter for students in OcnA courses with grades greater than or equal to 60.

  • Arrange grades by section in descending order.

  • Assign to a new object.

Examine the contents using a method of your choosing.

ocna_grades <- sci_classes %>%
  select(subject, section, TimeSpent_hours, FinalGradeCEMS) %>%
  filter(subject == "OcnA", FinalGradeCEMS >= 60) %>%
  arrange(desc(section))
view(ocna_grades)

Deriving info with dplyr

We will practice summarise () and group_by () functions now.

Summarise () Function

Using sci_classes data and the summarise() function:

  • Get a distinct count of course ids.

  • Use the %>% operator

sci_classes %>%
  summarise(count = n_distinct(course_id))
# A tibble: 1 × 1
  count
  <int>
1    26
  • Get a distinct count of course ids.

  • Use the |> operator

sci_classes |>
  summarise(count = n_distinct(course_id))
# A tibble: 1 × 1
  count
  <int>
1    26

Group_by () Function

Using the sci_classes data and the pipe operator.

  • Filter final grades to remove NAs.

  • Group your data by subject and gender.

  • Summarise your data to calculate the following stats:

  • total number of students

  • mean final grade

  • mean time spent in the course

  • Assign to a new object

  • Examine the contents using a method of your choosing.

summary_by_group <- sci_classes %>%
  filter(!is.na(FinalGradeCEMS)) %>%
  group_by(subject, Gender) %>%
  summarise(
    total_students = n(),
    mean_final_grade = mean(FinalGradeCEMS, na.rm = TRUE),
    mean_time_spent = mean(TimeSpent_hours, na.rm = TRUE)
  )
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by subject and Gender.
ℹ Output is grouped by subject.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(subject, Gender))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
view(summary_by_group)

Mutate () Function

Replace the dashed lines in the following code to;

  • Create a new variable called score that is the product of percentage earned and 100

  • Create a faceted scatter plot with hours spent in the course on the x-axis, score on the y-axis, and point colored by gender.

  • Include an alpha value to your graph.

sci_classes %>%
  mutate(score = percentage_earned * 100) %>%
  ggplot() +
  geom_point(aes(x = TimeSpent_hours, y = score, color = Gender), alpha = 0.5) +
  facet_wrap(~subject)

Final Step:

You are almost done, all you need to is to render your file and publish it in one of the following platform.

Your Turn:

Render File: For now, we will wrap up this work by converting our work into a webpage that can be used to communicate your learning and demonstrate some of your new R skills. To do so, you will need to “render” your document by clicking the Render button in the menu bar at that the top of this file. This will do two things; it will:

  1. check through all your code for any errors; and,

  2. create a file in your directory that you can use to share you work through Posit Cloud, RPubs , GitHub Pages, Quarto Pub, or other methods.

  3. Submit your link to the Blackboard!

Now that you’ve finished your Rtutorial study, scroll back to the very top of this Quarto Document and change the author: “YOUR NAME HERE” to your actual name surrounded by quotation marks like so: author: “Dr. Cansu Tatar”.