Unit 2: Rtutorial

Author

Sam Fountain

Week 6: Working with Data

In the previous R tutorial, we started working with the science classroom dataset. We applied the data-intensive research steps to explore our data and investigate the relationship between students’ grades and time spent.

Let’s remember which libraries and functions we used!

Your Turn:

Please write one or two sentences to explain why and how we use the following libraries and functions.

  1. tidyverse: A collection of packages that work together, including ggplot2 and readr.

  2. skimr: Provides a more detailed, nicely formatted overview of your data than R’s built-in functions.

  3. ggplot(): The command that initializes a plot. It tells R which dataset you are using and defines what goes on the x and y axes.

  4. read_csv(): Part of the readr package. It imports .csv files into R as a tibble. It is faster and smarter about identifying data types than base R’s read.csv().

  5. view(): Opens a spreadsheet-style data viewer in a new tab in RStudio.

  6. glimpse(): Shows every column name, its data type, and the first few entries, so you can see your data’s structure quickly.

  7. head(): Shows the “top” of your dataset (the first 6 rows by default).

  8. tail(): Shows the “tail” of your dataset (the last 6 rows by default).

  9. select(): Used to pick specific columns based on their names.

  10. filter(): Used to pick specific rows based on conditions.

  11. arrange(): Used to sort your rows. By default, it sorts in ascending order.

  12. desc(): Used inside arrange() to sort in descending order instead.

  13. geom_histogram(): A geometry layer added to ggplot() to create a histogram. It visualizes the distribution of a single continuous variable.

  14. geom_point(): Adds a scatter plot layer. It maps two continuous variables to the x and y axes to show relationships or clusters.

Load the Tidyverse Package

Let’s start our R code along by loading the tidyverse package.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the online science class data

Now, load the online science class data from the data folder and assign your data to a new object.

sci_class <- read.csv("Data/sci-online-classes.csv")

You loaded the data, now what should we do?

View(sci_class)

Your Turn:

Examine the contents of sci_class in your console by typing the object name and checking what is printed.

Question: Is your object a tibble? How do you know?

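Your response here: No. Because the data were loaded with base R’s read.csv(), sci_class is a plain data frame rather than a tibble. Typing sci_class in the console prints it without the “# A tibble” header, and class(sci_class) returns "data.frame". Loading the file with read_csv() would produce a tibble.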

Hint: Check the output in the console.
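One way to check this directly is to compare the two readers on the same file. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file (not the course data) and assumes the readr and tibble packages are installed.

```r
library(readr)   # read_csv()
library(tibble)  # is_tibble()

# Write a tiny, made-up CSV to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("student_id,grade", "1,93.5", "2,81.7"), tmp)

base_df <- read.csv(tmp)                          # base R reader
tidy_df <- read_csv(tmp, show_col_types = FALSE)  # readr reader

class(base_df)      # "data.frame"
is_tibble(base_df)  # FALSE
is_tibble(tidy_df)  # TRUE
```

The same checks, class() and is_tibble(), can be run on sci_class.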

Check your data with different functions

You can check your data with different functions. Let’s remember how we use different functions to check our data.

glimpse(sci_class)
Rows: 603
Columns: 30
$ student_id            <int> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <int> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <int> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <int> 2, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <int> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <int> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <int> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <int> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <int> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <int> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <int> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <int> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <int> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <int> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <int> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Isolating Data with dplyr

We will use the select() function to select the following columns from our data.

  • student_id

  • subject

  • semester

  • FinalGradeCEMS

  • After selecting these columns, assign the result to a new object named “student_grade”.

student_grade <- sci_class |>
  select(student_id, subject, semester, FinalGradeCEMS)

Your Turn:

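Examine the students’ grades. What did you notice about them?

Your response here: The first four grades are 93.45, 81.70, 88.49, and 81.85. They are stored with several decimal places, not rounded to one.

Some students have no final grade recorded (NA), so the missing values need to be handled before calculating statistics such as the mean.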

Hint: Check the missing data.
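Following the hint, missing values can be counted with is.na(). This is a minimal sketch on made-up grades (the values and the `grades` object are illustrative, not the course data):

```r
library(dplyr)

# Made-up grades for illustration; NA marks a student with no final grade
grades <- tibble(
  student_id = c(1, 2, 3, 4),
  FinalGradeCEMS = c(93.45, NA, 88.49, 81.85)
)

n_missing <- sum(is.na(grades$FinalGradeCEMS))          # number of missing grades
mean_all  <- mean(grades$FinalGradeCEMS)                # NA, because one value is missing
mean_obs  <- mean(grades$FinalGradeCEMS, na.rm = TRUE)  # mean of the observed grades only
```

Running sum(is.na(student_grade$FinalGradeCEMS)) applies the same idea to the tutorial data.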

Specific select

Now, we will make a specific selection.

  • Select all columns except subject and semester.

  • Assign to a new object with a different name.

  • Examine your data frame.

sci_df <- select(sci_class, -c(subject, semester))

Checking the data frame:

Your Turn:

  • Select all columns except student_id and FinalGradeCEMS.

  • Assign to a new object with a different name.

  • Examine your data frame.

sci_dataframe <- select(sci_class, -c(student_id, FinalGradeCEMS))

Specific select

  • Select only the columns that start with Time

  • Assign to a new object with a different name.

  • Use view() function to examine your data frame.

time_object <- select(sci_class, starts_with("Time"))
view(time_object)

Your Turn:

  • Select only the columns that end with “r”

  • Assign to a new object with a different name.

  • Use view() function to examine your data frame.

rend_object <- select(sci_class, ends_with("r"))
view(rend_object)
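The selection helpers used above can be tried on a small made-up tibble where the result is easy to verify by eye. The `toy` object and its values are assumptions for illustration only; the column names are borrowed from the dataset.

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(
  student_id = 1:2,
  semester = c("S116", "S216"),
  Gender = c("M", "F"),
  TimeSpent = c(1555.2, 1382.7),
  TimeSpent_hours = c(25.9, 23.0)
)

no_id     <- select(toy, -c(student_id))       # every column except student_id
time_cols <- select(toy, starts_with("Time"))  # TimeSpent, TimeSpent_hours
r_cols    <- select(toy, ends_with("r"))       # semester, Gender

names(time_cols)
names(r_cols)
```

Calling names() on the result is a quick alternative to view() when you only need to confirm which columns were kept.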

Filter Function

  • Filter the sci_class data frame for just males.

  • Assign to a new object.

  • Use the head() function to examine your data frame.

male_data <- sci_class |>
  filter(Gender == "M")
head(male_data)
  student_id     course_id total_points_possible total_points_earned
1      43146 FrScA-S216-02                  3280                2220
2      47448 FrScA-S216-01                  2870                1897
3      47979  OcnA-S216-01                  4562                3090
4      52326 AnPhA-S216-01                  4325                2255
5      53475 FrScA-S116-02                  1710                1402
6      53475 FrScA-S216-01                  1209                 977
  percentage_earned subject semester section
1         0.6768293   FrScA     S216       2
2         0.6609756   FrScA     S216       1
3         0.6773345    OcnA     S216       1
4         0.5213873   AnPhA     S216       1
5         0.8198830   FrScA     S116       2
6         0.8081059   FrScA     S216       1
                       Gradebook_Item Grade_Category FinalGradeCEMS
1 POINTS EARNED & TOTAL COURSE POINTS             NA       93.45372
2 POINTS EARNED & TOTAL COURSE POINTS             NA       88.48758
3 POINTS EARNED & TOTAL COURSE POINTS             NA       81.85260
4 POINTS EARNED & TOTAL COURSE POINTS             NA       83.58827
5 POINTS EARNED & TOTAL COURSE POINTS             NA             NA
6 POINTS EARNED & TOTAL COURSE POINTS             NA       81.03837
  Points_Possible Points_Earned Gender q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 TimeSpent
1               5            NA      M  5  4  4  5  5  5  5  5  4   5 1555.1667
2              10            NA      M  5  4  4  5  5  4  4  5  3   5  860.4335
3               5           4.0      M  5  5  3  5  5  5  4  5  5   5 1598.6166
4              10            NA      M  5  5  3  5  5  5  4  5  5   5 1321.8164
5               5           2.5      M NA NA NA NA NA NA NA NA NA  NA        NA
6              12          12.0      M NA NA NA NA NA NA NA NA NA  NA 1867.4169
  TimeSpent_hours TimeSpent_std int  pc       uv
1        25.91944   -0.18051496   5 4.5 4.333333
2        14.34056   -0.69325954   5 4.0 3.666667
3        26.64361   -0.14844697   5 3.5 5.000000
4        22.03027   -0.35273806   5 3.5 5.000000
5              NA            NA  NA  NA       NA
6        31.12361    0.04993983  NA  NA       NA

Your Turn:

  • Filter the sci_class data frame for just females.

  • Assign to a new object.

  • Use the tail() function to examine your data frame.

female_data <- sci_class |>
  filter(Gender == "F")
tail(female_data)
    student_id     course_id total_points_possible total_points_earned
417      97150 PhysA-S216-01                  2710                1803
418      97265 PhysA-S216-01                  3101                2078
419      97272  OcnA-S216-01                  2872                1733
420      97374  BioA-S216-01                  8586                6978
421      97386  BioA-S216-01                  2761                1937
422      97441 FrScA-S216-02                  2607                2205
    percentage_earned subject semester section
417         0.6653137   PhysA     S216       1
418         0.6701064   PhysA     S216       1
419         0.6034123    OcnA     S216       1
420         0.8127184    BioA     S216       1
421         0.7015574    BioA     S216       1
422         0.8457998   FrScA     S216       2
                         Gradebook_Item Grade_Category FinalGradeCEMS
417 POINTS EARNED & TOTAL COURSE POINTS             NA       62.54861
418 POINTS EARNED & TOTAL COURSE POINTS             NA       84.56944
419 POINTS EARNED & TOTAL COURSE POINTS             NA       84.23953
420 POINTS EARNED & TOTAL COURSE POINTS             NA       12.35294
421 POINTS EARNED & TOTAL COURSE POINTS             NA       54.15829
422 POINTS EARNED & TOTAL COURSE POINTS             NA       23.13770
    Points_Possible Points_Earned Gender q1 q2 q3 q4 q5 q6 q7 q8 q9 q10
417             625        460.38      F NA NA NA NA NA NA NA NA NA  NA
418             438            NA      F  4  4  3  4  3  4  4  4  4   4
419              30         30.00      F  5  5  3  5  4  5  3  5  5   3
420              10          8.50      F NA NA NA NA NA NA NA NA NA  NA
421               5            NA      F  4  4  3  4  4  4  3  4  3   3
422              10         10.00      F  4  2  3  5  4  2  5  5  2   4
    TimeSpent TimeSpent_hours TimeSpent_std int  pc       uv
417  759.3335       12.655558    -0.7678759  NA  NA       NA
418  817.4501       13.624168    -0.7249832 3.8 3.5 4.000000
419 1638.4500       27.307500    -0.1190481 4.4 3.0 5.000000
420  470.8000        7.846667    -0.9808267  NA  NA       NA
421   71.0166        1.183610    -1.2758850 3.8 3.0 3.666667
422  208.6664        3.477773    -1.1742932 4.4 4.0 2.000000

Let’s try the filter() function with two conditions now.

  • Filter the sci_class data frame for students whose percentage_earned is greater than 0.8 and whose subject is BioA.

  • Assign to a new object.

  • Use the tail() function to examine your data frame. 

bio_students <- filter(sci_class, percentage_earned > 0.8 & subject == "BioA")
tail(bio_students)
   student_id    course_id total_points_possible total_points_earned
11      91066 BioA-S116-01                  5766                4820
12      91067 BioA-S116-01                  2672                2249
13      92633 BioA-S116-01                  2954                2495
14      95658 BioA-S216-01                  3362                2775
15      96950 BioA-S216-01                  6190                4970
16      97374 BioA-S216-01                  8586                6978
   percentage_earned subject semester section
11         0.8359348    BioA     S116       1
12         0.8416916    BioA     S116       1
13         0.8446175    BioA     S116       1
14         0.8254015    BioA     S216       1
15         0.8029079    BioA     S216       1
16         0.8127184    BioA     S216       1
                        Gradebook_Item Grade_Category FinalGradeCEMS
11 POINTS EARNED & TOTAL COURSE POINTS             NA      36.344086
12 POINTS EARNED & TOTAL COURSE POINTS             NA      99.758065
13 POINTS EARNED & TOTAL COURSE POINTS             NA       3.010753
14 POINTS EARNED & TOTAL COURSE POINTS             NA      82.299465
15 POINTS EARNED & TOTAL COURSE POINTS             NA      91.818182
16 POINTS EARNED & TOTAL COURSE POINTS             NA      12.352941
   Points_Possible Points_Earned Gender q1 q2 q3 q4 q5 q6 q7 q8 q9 q10
11              24          10.0      F NA NA NA NA NA NA NA NA NA  NA
12              10           9.5      F  4  3  3  4  4  4  3  4  3   3
13             438         345.0      F  5  3  3  4  4  3  4  4  3   3
14              10          10.0      M  3  2  2  2  4  3  3  4  3   4
15               5           5.0      F  4  3  2  4  4  3  4  4  3   2
16              10           8.5      F NA NA NA NA NA NA NA NA NA  NA
   TimeSpent TimeSpent_hours TimeSpent_std      int  pc       uv
11  222.9831        3.716385    -1.1637268 3.000000 2.5 3.333333
12 2920.9838       48.683063     0.8275199 3.800000 3.0 3.333333
13  571.2834        9.521390    -0.9066654 4.000000 3.5 3.000000
14  644.2667       10.737778    -0.8528004 3.400000 2.5 2.666667
15 2264.4834       37.741390     0.3429929 4.222222 3.0 3.500000
16  470.8000        7.846667    -0.9808267       NA  NA       NA
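When filter() is given two conditions, joining them with & and separating them with a comma select the same rows. The sketch below uses a made-up `toy` tibble (illustrative values only) to show this, along with the fact that string comparisons are exact:

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(
  subject = c("BioA", "BioA", "FrScA", "OcnA"),
  percentage_earned = c(0.85, 0.55, 0.90, 0.58)
)

with_and   <- filter(toy, percentage_earned > 0.8 & subject == "BioA")
with_comma <- filter(toy, percentage_earned > 0.8, subject == "BioA")  # same rows

# String comparisons are exact: extra spaces inside the quotes match nothing
no_match <- filter(toy, subject == "  BioA")
nrow(no_match)  # 0
```

If a filter unexpectedly returns zero rows, checking the quoted text for stray spaces or line breaks is a good first step.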

Your Turn:

Filter the sci_class data frame for students whose

  • percentage_earned is smaller than or equal to 0.6 and whose subject is FrScA

  • Assign to a new object.

  • Use the head() function to examine your data frame. 

# The subject must match exactly: no extra spaces or line breaks inside the quotes
fra_students <- filter(sci_class, percentage_earned <= 0.6 & subject == "FrScA")
head(fra_students)

Let’s use the filter() function to handle missing data.

  • Filter the sci_class data frame so that rows with NA for points earned are removed.

  • Assign to a new object.

  • Use the glimpse() function to examine your data frame.

clean_data <- sci_class |>
  filter(!is.na(Points_Earned))
glimpse(clean_data)
Rows: 511
Columns: 30
$ student_id            <int> 44638, 47979, 48797, 52446, 53447, 53475, 53475,…
$ course_id             <chr> "OcnA-S116-01", "OcnA-S216-01", "PhysA-S116-01",…
$ total_points_possible <int> 3531, 4562, 2207, 2086, 4655, 1710, 1209, 4641, …
$ total_points_earned   <int> 2672, 3090, 1910, 1719, 3149, 1402, 977, 3429, 2…
$ percentage_earned     <dbl> 0.7567261, 0.6773345, 0.8654282, 0.8240652, 0.67…
$ subject               <chr> "OcnA", "OcnA", "PhysA", "PhysA", "FrScA", "FrSc…
$ semester              <chr> "S116", "S216", "S116", "S116", "S116", "S116", …
$ section               <int> 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, …
$ Gradebook_Item        <chr> "ATTEMPTED", "POINTS EARNED & TOTAL COURSE POINT…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 81.70184, 81.85260, 84.00000, 97.77778, 96.11872…
$ Points_Possible       <int> 10, 5, 438, 10, 443, 5, 12, 10, 5, 10, 220, 30, …
$ Points_Earned         <dbl> 10.00, 4.00, 399.00, 10.00, 425.00, 2.50, 12.00,…
$ Gender                <chr> "F", "M", "F", "F", "F", "M", "M", "M", "F", "F"…
$ q1                    <int> 4, 5, 4, 3, 4, NA, NA, 4, 3, 5, NA, 4, 4, NA, 4,…
$ q2                    <int> 4, 5, 3, 3, 3, NA, NA, 5, 3, 3, NA, 2, 4, NA, 3,…
$ q3                    <int> 3, 3, 3, 3, 3, NA, NA, 3, 3, 5, NA, 2, 3, NA, 3,…
$ q4                    <int> 4, 5, 4, 3, 4, NA, NA, 5, 3, 5, NA, 4, 5, NA, 4,…
$ q5                    <int> 4, 5, 4, 3, 4, NA, NA, 5, 4, 5, NA, 4, 4, NA, 4,…
$ q6                    <int> 4, 5, 4, 4, 3, NA, NA, 5, 3, 5, NA, 4, 4, NA, 3,…
$ q7                    <int> 4, 4, 4, 3, 3, NA, NA, 5, 3, 5, NA, 4, 5, NA, 3,…
$ q8                    <int> 5, 5, 4, 3, 4, NA, NA, 4, 3, 5, NA, 4, 4, NA, 4,…
$ q9                    <int> 4, 5, NA, 3, 2, NA, NA, 5, 2, 2, NA, 2, 4, NA, 2…
$ q10                   <int> 4, 5, 3, 3, 5, NA, NA, 4, 4, 5, NA, 4, 4, NA, 3,…
$ TimeSpent             <dbl> 1382.7001, 1598.6166, 1481.8000, 1390.2167, 1479…
$ TimeSpent_hours       <dbl> 23.04500167, 26.64361000, 24.69666667, 23.170278…
$ TimeSpent_std         <dbl> -0.30780313, -0.14844697, -0.23466291, -0.302255…
$ int                   <dbl> 4.2, 5.0, 3.8, 3.0, 4.2, NA, NA, 4.4, 3.4, 4.7, …
$ pc                    <dbl> 3.50, 3.50, 3.50, 3.00, 3.00, NA, NA, 4.00, 3.00…
$ uv                    <dbl> 4.000000, 5.000000, 3.500000, 3.333333, 2.666667…
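The reason for writing !is.na() rather than comparing against NA can be seen on a small made-up example (the `toy` tibble is an assumption for illustration). Any comparison with NA evaluates to NA, and filter() keeps only rows where the condition is TRUE.

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(
  student_id = 1:4,
  Points_Earned = c(10, NA, 4, NA)
)

kept    <- filter(toy, !is.na(Points_Earned))  # keeps the two rows with a value
dropped <- nrow(toy) - nrow(kept)              # how many rows were removed

# Comparing with NA gives NA for every row, so nothing is kept
wrong <- filter(toy, Points_Earned != NA)
nrow(wrong)  # 0
```

Comparing nrow() before and after filtering is a simple way to confirm how many rows were removed.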

Filter the sci_class data for the following subjects:

  • BioA

  • PhysA

  • OcnA

  • Assign to a new object with a different name.

  • Use the summary() function to examine your data frame.

my_classes <- filter(sci_class, subject %in% c("BioA", "PhysA", "OcnA"))
summary(my_classes)
   student_id     course_id         total_points_possible total_points_earned
 Min.   :44638   Length:219         Min.   :  898         Min.   :  721      
 1st Qu.:85446   Class :character   1st Qu.: 2924         1st Qu.: 2196      
 Median :88703   Mode  :character   Median : 3682         Median : 2830      
 Mean   :85846                      Mean   : 4396         Mean   : 3370      
 3rd Qu.:92742                      3rd Qu.: 5051         3rd Qu.: 3892      
 Max.   :97386                      Max.   :15092         Max.   :12208      
                                                                             
 percentage_earned   subject            semester            section     
 Min.   :0.4956    Length:219         Length:219         Min.   :1.000  
 1st Qu.:0.7072    Class :character   Class :character   1st Qu.:1.000  
 Median :0.7776    Mode  :character   Mode  :character   Median :1.000  
 Mean   :0.7637                                          Mean   :1.164  
 3rd Qu.:0.8314                                          3rd Qu.:1.000  
 Max.   :0.9010                                          Max.   :3.000  
                                                                        
 Gradebook_Item     Grade_Category FinalGradeCEMS  Points_Possible 
 Length:219         Mode:logical   Min.   : 0.00   Min.   :  5.00  
 Class :character   NA's:219       1st Qu.:68.30   1st Qu.: 10.00  
 Mode  :character                  Median :83.20   Median : 10.00  
                                   Mean   :74.91   Mean   : 77.36  
                                   3rd Qu.:93.03   3rd Qu.: 30.00  
                                   Max.   :99.76   Max.   :935.00  
                                   NA's   :9                       
 Points_Earned       Gender                q1              q2       
 Min.   :  0.00   Length:219         Min.   :1.000   Min.   :2.000  
 1st Qu.:  7.00   Class :character   1st Qu.:4.000   1st Qu.:3.000  
 Median : 10.00   Mode  :character   Median :4.000   Median :4.000  
 Mean   : 72.23                      Mean   :4.261   Mean   :3.623  
 3rd Qu.: 27.38                      3rd Qu.:5.000   3rd Qu.:4.000  
 Max.   :651.20                      Max.   :5.000   Max.   :5.000  
 NA's   :36                          NA's   :35      NA's   :36     
       q3              q4              q5              q6       
 Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
 Median :3.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.332   Mean   :4.262   Mean   :4.187   Mean   :4.033  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :35      NA's   :36      NA's   :37      NA's   :37     
       q7              q8              q9             q10       
 Min.   :2.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000  
 Median :4.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.899   Mean   :4.256   Mean   :3.486   Mean   :4.028  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :40      NA's   :39      NA's   :40      NA's   :38     
   TimeSpent         TimeSpent_hours     TimeSpent_std           int       
 Min.   :   0.5833   Min.   :9.722e-03   Min.   :-1.32787   Min.   :2.000  
 1st Qu.: 647.8042   1st Qu.:1.080e+01   1st Qu.:-0.85019   1st Qu.:3.800  
 Median :1455.0584   Median :2.425e+01   Median :-0.25440   Median :4.200  
 Mean   :1681.1068   Mean   :2.802e+01   Mean   :-0.08757   Mean   :4.202  
 3rd Qu.:2273.3585   3rd Qu.:3.789e+01   3rd Qu.: 0.34954   3rd Qu.:4.700  
 Max.   :8870.8833   Max.   :1.478e+02   Max.   : 5.21882   Max.   :5.000  
 NA's   :1           NA's   :1           NA's   :1          NA's   :19     
       pc              uv       
 Min.   :2.000   Min.   :1.333  
 1st Qu.:3.000   1st Qu.:3.333  
 Median :3.500   Median :3.833  
 Mean   :3.597   Mean   :3.730  
 3rd Qu.:4.000   3rd Qu.:4.028  
 Max.   :5.000   Max.   :5.000  
 NA's   :19      NA's   :19     
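The %in% operator used above is shorthand for a chain of == comparisons joined with |. A minimal sketch on a made-up `toy` tibble (illustrative values only) shows that both forms return the same rows:

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(subject = c("BioA", "FrScA", "PhysA", "AnPhA", "OcnA"))

by_in <- filter(toy, subject %in% c("BioA", "PhysA", "OcnA"))
by_or <- filter(toy, subject == "BioA" | subject == "PhysA" | subject == "OcnA")

by_in$subject  # "BioA" "PhysA" "OcnA"
```

The %in% form stays readable as the list of subjects grows.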

arrange() Function

Let’s recall how we used the arrange() function with our dataset.

  • Arrange sci_class by subject, then by percentage_earned in descending order.

  • Assign to a new object.

  • Use the str() function to examine your data frame.

order_classes <- sci_class |>
  arrange(subject, desc(percentage_earned))

str(order_classes)
'data.frame':   603 obs. of  30 variables:
 $ student_id           : int  70192 86488 96690 91175 86267 86707 88153 96677 88612 85865 ...
 $ course_id            : chr  "AnPhA-S116-02" "AnPhA-S116-01" "AnPhA-S216-01" "AnPhA-S116-02" ...
 $ total_points_possible: int  1936 3342 4804 3199 3045 11355 4640 1427 8490 3050 ...
 $ total_points_earned  : int  1763 3033 4309 2867 2705 10026 4094 1247 7410 2644 ...
 $ percentage_earned    : num  0.911 0.908 0.897 0.896 0.888 ...
 $ subject              : chr  "AnPhA" "AnPhA" "AnPhA" "AnPhA" ...
 $ semester             : chr  "S116" "S116" "S216" "S116" ...
 $ section              : int  2 1 1 2 1 2 1 1 1 1 ...
 $ Gradebook_Item       : chr  "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" ...
 $ Grade_Category       : logi  NA NA NA NA NA NA ...
 $ FinalGradeCEMS       : num  96 87.4 64.8 82.2 35.1 ...
 $ Points_Possible      : int  10 28 10 5 50 15 10 10 353 460 ...
 $ Points_Earned        : num  7 26 3 5 50 11 8 10 330 452 ...
 $ Gender               : chr  "F" "M" "F" "F" ...
 $ q1                   : int  4 4 4 5 5 4 5 4 NA NA ...
 $ q2                   : int  3 4 3 3 5 2 4 4 NA NA ...
 $ q3                   : int  3 2 2 3 3 3 4 3 NA NA ...
 $ q4                   : int  4 3 5 5 5 4 5 4 NA NA ...
 $ q5                   : int  4 3 4 5 5 4 5 4 NA NA ...
 $ q6                   : int  3 3 4 4 5 3 5 4 NA NA ...
 $ q7                   : int  3 3 3 3 4 4 5 4 NA NA ...
 $ q8                   : int  5 2 4 5 5 4 4 4 NA NA ...
 $ q9                   : int  2 3 3 3 5 1 4 4 NA NA ...
 $ q10                  : int  5 3 2 5 5 2 5 4 NA NA ...
 $ TimeSpent            : num  1537 3600 1970 1315 406 ...
 $ TimeSpent_hours      : num  25.62 60 32.83 21.92 6.77 ...
 $ TimeSpent_std        : num  -0.194 1.328 0.125 -0.358 -1.029 ...
 $ int                  : num  4.4 3 3.8 5 5 3.9 4.6 4 4.8 4.6 ...
 $ pc                   : num  3 2.5 2.5 3 3.5 3.5 3.75 3.5 3.5 4.5 ...
 $ uv                   : num  2.67 3.33 3.33 3.33 5 ...
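To see how the two sort keys interact, here is a small sketch on made-up rows (the `toy` tibble is illustrative only). The first key, subject, is sorted in ascending order, and desc() reverses the order of the second key within each subject.

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(
  subject = c("OcnA", "AnPhA", "OcnA", "AnPhA"),
  percentage_earned = c(0.70, 0.91, 0.85, 0.60)
)

ordered <- arrange(toy, subject, desc(percentage_earned))

ordered$subject            # "AnPhA" "AnPhA" "OcnA" "OcnA"
ordered$percentage_earned  # 0.91 0.60 0.85 0.70
```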

%>% Pipe Operator

Using the sci_class data and the %>% pipe operator:

  • Select subject, section, time spent in hours and final course grade.

  • Filter for students in OcnA courses with grades greater than or equal to 60.

  • Arrange grades by section in descending order.

  • Assign to a new object.

Examine the contents using a method of your choosing.

spec_course <- sci_class %>%
  select(subject, section, TimeSpent_hours, FinalGradeCEMS) %>%
  filter(subject == "OcnA" & FinalGradeCEMS >= 60) %>%
  arrange(desc(section))
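A pipe passes the result on its left as the first argument of the function on its right, so a piped chain is equivalent to nested function calls. The sketch below compares the nested form, the magrittr pipe (%>%), and the native pipe (|>, which requires R 4.1 or later) on a made-up `toy` tibble; the data are illustrative only.

```r
library(dplyr)  # also makes %>% available

# Made-up rows for illustration
toy <- tibble(
  subject = c("OcnA", "OcnA", "BioA", "OcnA"),
  section = c(1, 2, 1, 3),
  FinalGradeCEMS = c(75, 55, 90, 62)
)

# Nested calls: read from the inside out
nested <- arrange(filter(toy, subject == "OcnA", FinalGradeCEMS >= 60), desc(section))

# magrittr pipe: read from top to bottom
piped_magrittr <- toy %>%
  filter(subject == "OcnA", FinalGradeCEMS >= 60) %>%
  arrange(desc(section))

# Native pipe: same steps
piped_native <- toy |>
  filter(subject == "OcnA", FinalGradeCEMS >= 60) |>
  arrange(desc(section))

nested$section  # 3 1
```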

Deriving info with dplyr

We will now practice the summarise() and group_by() functions.

summarise() Function

Using the sci_class data and the summarise() function:

  • Get a distinct count of course ids.

  • Use the %>% operator

sci_class %>%
  summarise(courses = n_distinct(course_id))
  courses
1      26

  • Get a distinct count of course ids.

  • Use the |> operator

sci_class |>
  summarise(courses = n_distinct(course_id))
  courses
1      26
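Inside summarise(), a new column gets its name from the left side of an = sign. If no name is given, the column is named after the expression itself. A minimal sketch on made-up course ids (the `toy` tibble is illustrative only):

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(course_id = c("BioA-S116-01", "BioA-S116-01", "OcnA-S216-01"))

named   <- summarise(toy, courses = n_distinct(course_id))  # column named "courses"
unnamed <- summarise(toy, n_distinct(course_id))            # column named after the expression

names(named)    # "courses"
names(unnamed)  # "n_distinct(course_id)"
named$courses   # 2
```

A short, explicit name makes the result easier to reference in later steps.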

group_by() Function

Using the sci_class data and the pipe operator:

  • Filter final grades to remove NAs.

  • Group your data by subject and gender.

  • Summarise your data to calculate the following stats:

  • total number of students

  • mean final grade

  • mean time spent in the course

  • Assign to a new object

  • Examine the contents using a method of your choosing.

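finals <- sci_class %>%
  filter(!is.na(FinalGradeCEMS)) %>%
  group_by(subject, Gender) %>%
  # n() counts the rows in each group; sum(student_id) would add up the ID numbers
  # na.rm = TRUE skips rows with no recorded time when averaging
  summarise(total = n(),
            grade = mean(FinalGradeCEMS),
            time = mean(TimeSpent_hours, na.rm = TRUE))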
`summarise()` has grouped output by 'subject'. You can override using the
`.groups` argument.
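The grouped summary can be checked by hand on a few made-up rows. In this sketch the `toy` tibble and its values are assumptions for illustration; `.groups = "drop"` is used so the result is ungrouped and the message above is not printed.

```r
library(dplyr)

# Made-up rows for illustration
toy <- tibble(
  subject = c("BioA", "BioA", "BioA", "OcnA"),
  Gender = c("F", "F", "M", "F"),
  FinalGradeCEMS = c(90, 80, 70, NA),
  TimeSpent_hours = c(10, 20, 30, 40)
)

stats <- toy |>
  filter(!is.na(FinalGradeCEMS)) |>   # the OcnA row has no grade and is removed
  group_by(subject, Gender) |>
  summarise(
    total = n(),                                  # rows per group
    grade = mean(FinalGradeCEMS),                 # mean final grade
    time  = mean(TimeSpent_hours, na.rm = TRUE),  # mean hours, ignoring NA
    .groups = "drop"
  )

stats$total  # 2 1
stats$grade  # 85 70
```

Note that n() counts rows. If a student can appear more than once within a group, n_distinct(student_id) counts each student only once.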

mutate() Function

Replace the dashed lines in the following code to:

  • Create a new variable called score that is the product of percentage earned and 100

  • Create a faceted scatter plot with hours spent in the course on the x-axis, score on the y-axis, and points colored by gender.

  • Include an alpha value in your graph.

sci_class %>%
  mutate(score = percentage_earned * 100) %>%
  ggplot() +
  geom_point(mapping = aes(x = TimeSpent_hours,
                           y = score,
                           color = Gender),
             alpha = 0.5) +  # semi-transparent points; 0.5 is one reasonable choice
  facet_wrap(~subject)
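The difference between mapping a variable and setting a fixed value is worth a short illustration. In the sketch below, built on a made-up `toy` tibble, color is placed inside aes() because it depends on a column, while alpha is placed outside aes() because it is one fixed value for every point. The data and the value 0.5 are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration
toy <- tibble(
  subject = c("BioA", "BioA", "OcnA", "OcnA"),
  Gender = c("F", "M", "F", "M"),
  TimeSpent_hours = c(25.9, 23.0, 14.3, 26.6),
  percentage_earned = c(0.68, 0.76, 0.87, 0.52)
)

# Convert the proportion to a 0-100 score
scored <- mutate(toy, score = percentage_earned * 100)

p <- ggplot(scored) +
  geom_point(aes(x = TimeSpent_hours, y = score, color = Gender),  # mapped to a column
             alpha = 0.5) +                                        # fixed for all points
  facet_wrap(~subject)                                             # one panel per subject

p  # draws the plot
```

Lower alpha values make overlapping points easier to see in dense plots.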

Final Step:

You are almost done. All you need to do is render your file and publish it on one of the following platforms.

Your Turn:

Render File: For now, we will wrap up this work by converting it into a webpage that can be used to communicate your learning and demonstrate some of your new R skills. To do so, you will need to “render” your document by clicking the Render button in the menu bar at the top of this file. This will do two things; it will:

  1. check through all your code for any errors; and,

  2. create a file in your directory that you can use to share your work through Posit Cloud, RPubs, GitHub Pages, Quarto Pub, or other methods.

Finally, submit your link to Blackboard!

Now that you’ve finished your Rtutorial study, scroll back to the very top of this Quarto Document and change the author: “YOUR NAME HERE” to your actual name surrounded by quotation marks like so: author: “Dr. Cansu Tatar”.