Unit 2: Rtutorial

Author

Jefferson Mwendwa Kiteme

Week 6: Working with Data

In the previous R tutorial, we started working with the science classroom dataset. We applied the data-intensive research steps to explore our data and investigate the relationship between students’ grades and time spent.

Let’s remember which libraries and functions we used!

Your Turn:

Please write down one or two sentences explaining why and how we use the following libraries and functions.

tidyverse: The tidyverse is a collection of R packages designed for data science. We use it to efficiently import, clean, manipulate, visualize, and model data using a consistent and readable syntax.
skimr: The skimr package provides quick, detailed summaries of variables in a dataset. We use it to understand variable types, missing values, and basic descriptive statistics all at once.
ggplot: ggplot2 is a data visualization package that builds graphs in layers, starting from a call to ggplot(). We use it to visualize relationships and distributions in the data, such as grades and time spent.
read_csv(): The read_csv() function imports CSV files into R as data frames. We use it to load our dataset so it can be analyzed in R.
view(): The view() function opens the dataset in a spreadsheet-style window. We use it to visually inspect the full dataset and get an initial sense of its structure.
glimpse(): The glimpse() function provides a compact overview of the dataset, showing the number of rows, columns, variable names, and data types. We use it to quickly understand the structure of the data.
head(): The head() function displays the first few rows of a dataset. We use it to preview the data and confirm it loaded correctly.
tail(): The tail() function displays the last few rows of a dataset. We use it to check the end of the dataset and confirm completeness.
select(): The select() function is used to choose specific variables (columns) from a dataset. We use it to focus on variables relevant to a particular analysis.
filter(): The filter() function is used to keep rows that meet certain conditions. We use it to subset the data, such as selecting students above a certain grade threshold.
arrange(): The arrange() function sorts the dataset based on the values of one or more variables. We use it to order cases from lowest to highest or vice versa.
desc(): The desc() function reverses the default ascending order in arrange(). We use it to sort variables in descending order, such as highest grades first.
geom_histogram(): The geom_histogram() function creates histograms. We use it to visualize the distribution of a single continuous variable, such as time spent or grades.
geom_point(): The geom_point() function creates scatterplots. We use it to examine the relationship between two continuous variables, such as time spent and final grades.

Load the Tidyverse Package

Let’s start our R code-along by loading the tidyverse package.

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.4.3

Warning: package 'ggplot2' was built under R version 4.4.3

Warning: package 'tidyr' was built under R version 4.4.3

Warning: package 'purrr' was built under R version 4.4.3

Warning: package 'stringr' was built under R version 4.4.3

Warning: package 'forcats' was built under R version 4.4.3

Warning: package 'lubridate' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
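The Conflicts note above means that, once the tidyverse is loaded, a bare filter() call resolves to the dplyr version rather than the one in stats. A minimal sketch of calling each version explicitly with the :: operator, assuming a hypothetical toy_scores tibble that is not part of the course data:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy_scores <- tibble(student = c("a", "b", "c"), score = c(55, 72, 91))

# dplyr::filter() keeps the rows that meet a condition
passed <- dplyr::filter(toy_scores, score >= 60)
passed

# stats::filter() is a different function (linear filtering of a series);
# here it applies a two-point moving average to a numeric vector
moving_avg <- stats::filter(c(1, 2, 3, 4), rep(1 / 2, 2))
moving_avg
```

Writing the package name in front of the function is optional in this tutorial, but it removes any doubt about which filter() is being called.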

Load the online science class data

Now, load the online science class data from the data folder and assign your data to a new object.

library(readr)
sci_online_classes <- read_csv("C:/Users/jeffk/OneDrive/Desktop/Rtutorial-2 2/Rtutorial-2/Data/sci-online-classes.csv")

Rows: 603 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): course_id, subject, semester, section, Gradebook_Item, Gender
dbl (23): student_id, total_points_possible, total_points_earned, percentage...
lgl  (1): Grade_Category

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(sci_online_classes)
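The message printed by read_csv() mentions two options: spec() and show_col_types = FALSE. A small sketch of both, assuming a hypothetical three-column CSV written to a temporary file so that the example runs without the course data folder:

```r
library(readr)

# Write a tiny hypothetical CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("student_id,subject,FinalGradeCEMS",
             "1,BioA,93.5",
             "2,OcnA,NA"), tmp)

# show_col_types = FALSE silences the column specification message
toy_classes <- read_csv(tmp, show_col_types = FALSE)
toy_classes

# spec() retrieves the column specification on demand
spec(toy_classes)
```

By default read_csv() treats the string NA as a missing value, which is why the second grade is read as NA rather than as text.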

You loaded the data. Now what should we do?

view(sci_online_classes)
glimpse(sci_online_classes)

Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Your Turn:

Examine the contents of sci_online_classes in your console. Type the object name in the console, press Enter, and check the output.

Question: Is your object a tibble? How do you know?

Your response here: Yes, the object is a tibble. read_csv() imports data as a tibble by default, and printing the object in the console shows a header that begins with “# A tibble:” followed by its dimensions (603 × 30), with each column’s data type listed under its name.

Hint: Check the output in the console.
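Besides reading the printed header, the question can also be answered in code. A minimal sketch using is_tibble() and class(), with a hypothetical two-row tibble standing in for the course data:

```r
library(tibble)

# Hypothetical toy data
toy <- tibble(student_id = c(1, 2), grade = c(90, 85))

is_tibble(toy)      # TRUE
class(toy)          # "tbl_df" "tbl" "data.frame"

# A base data frame is not a tibble
toy_df <- as.data.frame(toy)
is_tibble(toy_df)   # FALSE
```

An object returned by read_csv() carries one extra class, spec_tbl_df, in front of these three.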

Check your data with different functions

You can check your data with different functions. Let’s remember how we use different functions to check our data.

# Call the function on the data; typing glimpse on its own only prints the function definition
glimpse(sci_online_classes)

Isolating Data with dplyr

We will use the select() function to select the following columns from our data.

student_id
subject
semester
FinalGradeCEMS
After selecting these columns, assign the result to a new object named “student_grade”.

student_grade <- sci_online_classes |>
  select(student_id, subject, semester, FinalGradeCEMS)

Your Turn:

Examine the students’ grades. What did you notice?

Your response here: After examining students’ grades, I realized that not all students have a recorded final grade, as indicated by missing values (NA) in the FinalGradeCEMS column. This suggests that some students may not have completed the course, may have withdrawn, or may have grades that were not captured in the dataset. As a result, analyses involving final grades exclude these students, which reduces the number of observations and should be considered when interpreting results.

Hint: Check the missing data.
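One way to follow the hint is to count the missing values directly. A minimal sketch, assuming a hypothetical toy_grades tibble with the same FinalGradeCEMS column name as the course data:

```r
library(dplyr)

# Hypothetical toy data with two missing grades
toy_grades <- tibble(
  student_id = 1:5,
  FinalGradeCEMS = c(93.5, NA, 88.5, NA, 84)
)

missing_summary <- toy_grades |>
  summarise(
    n_total = n(),                                   # all rows
    n_missing = sum(is.na(FinalGradeCEMS)),          # rows without a grade
    mean_grade = mean(FinalGradeCEMS, na.rm = TRUE)  # mean of recorded grades only
  )
missing_summary
```

Without na.rm = TRUE, mean() returns NA as soon as a single grade is missing.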

Specific select

Now, we will make a specific selection.

Select all columns except subject and semester.
Assign to a new object with a different name.
Examine your data frame.

sci_df <- select(sci_online_classes, -c(subject, semester))

Checking the data frame:

Your Turn:

Select all columns except student_id and FinalGradeCEMS.
Assign to a new object with a different name.
Examine your data frame.

science_newdf <- select(sci_online_classes, -c(student_id, FinalGradeCEMS))
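For the “Examine your data frame” step, comparing column names before and after the selection is a quick check. A sketch under the assumption of a hypothetical four-column toy_wide tibble:

```r
library(dplyr)

# Hypothetical toy data
toy_wide <- tibble(
  student_id = 1:2,
  subject = c("BioA", "OcnA"),
  semester = c("S116", "S216"),
  FinalGradeCEMS = c(90, 85)
)

# A minus sign drops the listed columns and keeps everything else
toy_dropped <- select(toy_wide, -c(student_id, FinalGradeCEMS))

names(toy_dropped)                  # "subject" "semester"
ncol(toy_wide) - ncol(toy_dropped)  # 2
```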

Specific select

Select only the columns that start with Time
Assign to a new object with a different name.
Use view() function to examine your data frame.

time_object <- select(sci_online_classes, starts_with("Time"))
view(time_object)

Your Turn:

Select only the columns that end with “r”
Assign to a new object with a different name.
Use view() function to examine your data frame.

r_object <- select(sci_online_classes, ends_with("r"))
view(r_object)
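The starts_with() and ends_with() helpers ignore letter case by default, which can match more columns than expected. A small sketch, assuming a hypothetical toy_cols tibble in which the timer column is invented for illustration:

```r
library(dplyr)

# Hypothetical toy data; "timer" is not a column in the course dataset
toy_cols <- tibble(TimeSpent = 1, TimeSpent_hours = 2, timer = 3, Gender = "F")

# Default: case is ignored, so "Time" also matches "timer"
loose <- names(select(toy_cols, starts_with("Time")))
loose    # "TimeSpent" "TimeSpent_hours" "timer"

# ignore.case = FALSE makes the match case-sensitive
strict <- names(select(toy_cols, starts_with("Time", ignore.case = FALSE)))
strict   # "TimeSpent" "TimeSpent_hours"

ends_r <- names(select(toy_cols, ends_with("r")))
ends_r   # "timer" "Gender"
```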

Filter Function

Filter the sci_online_classes data frame for just males.
Assign to a new object.
Use the head() function to examine your data frame.

male_students <- sci_online_classes |>
  filter(Gender == "M")
head(male_students)

# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      43146 FrScA-S216-02                  3280                2220
2      47448 FrScA-S216-01                  2870                1897
3      47979 OcnA-S216-01                   4562                3090
4      52326 AnPhA-S216-01                  4325                2255
5      53475 FrScA-S116-02                  1710                1402
6      53475 FrScA-S216-01                  1209                 977
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Your Turn:

Filter the sci_online_classes data frame for just females.
Assign to a new object.
Use the tail() function to examine your data frame.

female_students <- sci_online_classes |>
  filter(Gender == "F")
tail(female_students)

# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      97150 PhysA-S216-01                  2710                1803
2      97265 PhysA-S216-01                  3101                2078
3      97272 OcnA-S216-01                   2872                1733
4      97374 BioA-S216-01                   8586                6978
5      97386 BioA-S216-01                   2761                1937
6      97441 FrScA-S216-02                  2607                2205
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Let’s try the filter() function with two conditions now.

Filter the sci_online_classes data frame for students whose
percentage_earned is greater than 0.8
in the class “BioA”
Assign to a new object.
Use the tail() function to examine your data frame.

bio_students <- filter(sci_online_classes, percentage_earned > 0.8 & subject == "BioA")
tail(bio_students)

# A tibble: 6 × 30
  student_id course_id    total_points_possible total_points_earned
       <dbl> <chr>                        <dbl>               <dbl>
1      91066 BioA-S116-01                  5766                4820
2      91067 BioA-S116-01                  2672                2249
3      92633 BioA-S116-01                  2954                2495
4      95658 BioA-S216-01                  3362                2775
5      96950 BioA-S216-01                  6190                4970
6      97374 BioA-S216-01                  8586                6978
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
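Two conditions can be combined with & or passed to filter() as separate arguments; both keep only the rows where every condition is true. A minimal sketch, assuming a hypothetical three-row toy_bio tibble:

```r
library(dplyr)

# Hypothetical toy data
toy_bio <- tibble(
  subject = c("BioA", "BioA", "PhysA"),
  percentage_earned = c(0.85, 0.70, 0.90)
)

# Conditions joined with &
with_and <- filter(toy_bio, percentage_earned > 0.8 & subject == "BioA")

# Conditions separated by a comma
with_comma <- filter(toy_bio, percentage_earned > 0.8, subject == "BioA")

with_and
with_comma
```

Both calls return the single BioA row with percentage_earned of 0.85.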

Your Turn:

Filter the sci_online_classes data frame for students whose

percentage_earned is less than or equal to 0.6
in subject PhysA
Assign to a new object.
Use the head() function to examine your data frame.

phy_students <- filter(sci_online_classes, percentage_earned <= 0.6 & subject == "PhysA")
head(phy_students)

# A tibble: 5 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      78153 PhysA-S216-01                  6530                3702
2      85962 PhysA-S116-01                  3254                1828
3      92725 PhysA-S116-01                  3226                1880
4      92729 PhysA-S116-01                  3556                1770
5      92732 PhysA-S116-01                  1207                 721
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Let’s use the filter() function to handle missing data.

Filter the sci_online_classes data frame so rows with
NA for points earned are removed.
Assign to a new object.
Use the glimpse() function to examine your data frame.

clean_data <- sci_online_classes |>
  filter(!is.na(Points_Earned))
glimpse(clean_data)

Rows: 511
Columns: 30
$ student_id            <dbl> 44638, 47979, 48797, 52446, 53447, 53475, 53475,…
$ course_id             <chr> "OcnA-S116-01", "OcnA-S216-01", "PhysA-S116-01",…
$ total_points_possible <dbl> 3531, 4562, 2207, 2086, 4655, 1710, 1209, 4641, …
$ total_points_earned   <dbl> 2672, 3090, 1910, 1719, 3149, 1402, 977, 3429, 2…
$ percentage_earned     <dbl> 0.7567261, 0.6773345, 0.8654282, 0.8240652, 0.67…
$ subject               <chr> "OcnA", "OcnA", "PhysA", "PhysA", "FrScA", "FrSc…
$ semester              <chr> "S116", "S216", "S116", "S116", "S116", "S116", …
$ section               <chr> "01", "01", "01", "01", "01", "02", "01", "01", …
$ Gradebook_Item        <chr> "ATTEMPTED", "POINTS EARNED & TOTAL COURSE POINT…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 81.70184, 81.85260, 84.00000, 97.77778, 96.11872…
$ Points_Possible       <dbl> 10, 5, 438, 10, 443, 5, 12, 10, 5, 10, 220, 30, …
$ Points_Earned         <dbl> 10.00, 4.00, 399.00, 10.00, 425.00, 2.50, 12.00,…
$ Gender                <chr> "F", "M", "F", "F", "F", "M", "M", "M", "F", "F"…
$ q1                    <dbl> 4, 5, 4, 3, 4, NA, NA, 4, 3, 5, NA, 4, 4, NA, 4,…
$ q2                    <dbl> 4, 5, 3, 3, 3, NA, NA, 5, 3, 3, NA, 2, 4, NA, 3,…
$ q3                    <dbl> 3, 3, 3, 3, 3, NA, NA, 3, 3, 5, NA, 2, 3, NA, 3,…
$ q4                    <dbl> 4, 5, 4, 3, 4, NA, NA, 5, 3, 5, NA, 4, 5, NA, 4,…
$ q5                    <dbl> 4, 5, 4, 3, 4, NA, NA, 5, 4, 5, NA, 4, 4, NA, 4,…
$ q6                    <dbl> 4, 5, 4, 4, 3, NA, NA, 5, 3, 5, NA, 4, 4, NA, 3,…
$ q7                    <dbl> 4, 4, 4, 3, 3, NA, NA, 5, 3, 5, NA, 4, 5, NA, 3,…
$ q8                    <dbl> 5, 5, 4, 3, 4, NA, NA, 4, 3, 5, NA, 4, 4, NA, 4,…
$ q9                    <dbl> 4, 5, NA, 3, 2, NA, NA, 5, 2, 2, NA, 2, 4, NA, 2…
$ q10                   <dbl> 4, 5, 3, 3, 5, NA, NA, 4, 4, 5, NA, 4, 4, NA, 3,…
$ TimeSpent             <dbl> 1382.7001, 1598.6166, 1481.8000, 1390.2167, 1479…
$ TimeSpent_hours       <dbl> 23.04500167, 26.64361000, 24.69666667, 23.170278…
$ TimeSpent_std         <dbl> -0.30780313, -0.14844697, -0.23466291, -0.302255…
$ int                   <dbl> 4.2, 5.0, 3.8, 3.0, 4.2, NA, NA, 4.4, 3.4, 4.7, …
$ pc                    <dbl> 3.50, 3.50, 3.50, 3.00, 3.00, NA, NA, 4.00, 3.00…
$ uv                    <dbl> 4.000000, 5.000000, 3.500000, 3.333333, 2.666667…
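In the output above the row count drops from 603 to 511, so 92 rows had a missing Points_Earned value. The sketch below repeats the pattern on a hypothetical toy_points tibble and compares it with drop_na() from tidyr, which is an alternative and not the approach used in this tutorial:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with two missing values
toy_points <- tibble(student_id = 1:4, Points_Earned = c(10, NA, 4, NA))

# Keep rows where Points_Earned is not missing
kept_filter <- toy_points |> filter(!is.na(Points_Earned))

# drop_na() removes rows with NA in the named column
kept_drop <- toy_points |> drop_na(Points_Earned)

nrow(toy_points) - nrow(kept_filter)  # 2 rows removed
```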

Filter the sci_online_classes data for the following subjects:

BioA
PhysA
OcnA
Assign to a new object with a different name.
Use the summary() function to examine your data frame.

my_Classes <- filter(sci_online_classes, subject %in% c("BioA", "PhysA", "OcnA"))
summary(my_Classes)
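The %in% operator is shorthand for a chain of == comparisons joined with |. A minimal sketch showing that both forms keep the same rows, assuming a hypothetical toy_subjects tibble:

```r
library(dplyr)

# Hypothetical toy data
toy_subjects <- tibble(subject = c("BioA", "FrScA", "OcnA", "PhysA", "AnPhA"))

with_in <- filter(toy_subjects, subject %in% c("BioA", "PhysA", "OcnA"))

with_or <- filter(toy_subjects,
                  subject == "BioA" | subject == "PhysA" | subject == "OcnA")

with_in$subject   # "BioA" "OcnA" "PhysA"
```

Rows come back in their original order, not in the order of the values listed inside c().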

arrange() Function

Let’s recall how we used the arrange() function with our dataset.

Arrange sci_online_classes by subject, then by
percentage_earned in descending order.
Assign to a new object.
Use the str() function to examine your data frame.

order_classes <- sci_online_classes |>
  arrange(subject, desc(percentage_earned))

str(order_classes)

spc_tbl_ [603 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ student_id           : num [1:603] 70192 86488 96690 91175 86267 ...
 $ course_id            : chr [1:603] "AnPhA-S116-02" "AnPhA-S116-01" "AnPhA-S216-01" "AnPhA-S116-02" ...
 $ total_points_possible: num [1:603] 1936 3342 4804 3199 3045 ...
 $ total_points_earned  : num [1:603] 1763 3033 4309 2867 2705 ...
 $ percentage_earned    : num [1:603] 0.911 0.908 0.897 0.896 0.888 ...
 $ subject              : chr [1:603] "AnPhA" "AnPhA" "AnPhA" "AnPhA" ...
 $ semester             : chr [1:603] "S116" "S116" "S216" "S116" ...
 $ section              : chr [1:603] "02" "01" "01" "02" ...
 $ Gradebook_Item       : chr [1:603] "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" ...
 $ Grade_Category       : logi [1:603] NA NA NA NA NA NA ...
 $ FinalGradeCEMS       : num [1:603] 96 87.4 64.8 82.2 35.1 ...
 $ Points_Possible      : num [1:603] 10 28 10 5 50 15 10 10 353 460 ...
 $ Points_Earned        : num [1:603] 7 26 3 5 50 11 8 10 330 452 ...
 $ Gender               : chr [1:603] "F" "M" "F" "F" ...
 $ q1                   : num [1:603] 4 4 4 5 5 4 5 4 NA NA ...
 $ q2                   : num [1:603] 3 4 3 3 5 2 4 4 NA NA ...
 $ q3                   : num [1:603] 3 2 2 3 3 3 4 3 NA NA ...
 $ q4                   : num [1:603] 4 3 5 5 5 4 5 4 NA NA ...
 $ q5                   : num [1:603] 4 3 4 5 5 4 5 4 NA NA ...
 $ q6                   : num [1:603] 3 3 4 4 5 3 5 4 NA NA ...
 $ q7                   : num [1:603] 3 3 3 3 4 4 5 4 NA NA ...
 $ q8                   : num [1:603] 5 2 4 5 5 4 4 4 NA NA ...
 $ q9                   : num [1:603] 2 3 3 3 5 1 4 4 NA NA ...
 $ q10                  : num [1:603] 5 3 2 5 5 2 5 4 NA NA ...
 $ TimeSpent            : num [1:603] 1537 3600 1970 1315 406 ...
 $ TimeSpent_hours      : num [1:603] 25.62 60 32.83 21.92 6.77 ...
 $ TimeSpent_std        : num [1:603] -0.194 1.328 0.125 -0.358 -1.029 ...
 $ int                  : num [1:603] 4.4 3 3.8 5 5 3.9 4.6 4 4.8 4.6 ...
 $ pc                   : num [1:603] 3 2.5 2.5 3 3.5 3.5 3.75 3.5 3.5 4.5 ...
 $ uv                   : num [1:603] 2.67 3.33 3.33 3.33 5 ...
 - attr(*, "spec")=
  .. cols(
  ..   student_id = col_double(),
  ..   course_id = col_character(),
  ..   total_points_possible = col_double(),
  ..   total_points_earned = col_double(),
  ..   percentage_earned = col_double(),
  ..   subject = col_character(),
  ..   semester = col_character(),
  ..   section = col_character(),
  ..   Gradebook_Item = col_character(),
  ..   Grade_Category = col_logical(),
  ..   FinalGradeCEMS = col_double(),
  ..   Points_Possible = col_double(),
  ..   Points_Earned = col_double(),
  ..   Gender = col_character(),
  ..   q1 = col_double(),
  ..   q2 = col_double(),
  ..   q3 = col_double(),
  ..   q4 = col_double(),
  ..   q5 = col_double(),
  ..   q6 = col_double(),
  ..   q7 = col_double(),
  ..   q8 = col_double(),
  ..   q9 = col_double(),
  ..   q10 = col_double(),
  ..   TimeSpent = col_double(),
  ..   TimeSpent_hours = col_double(),
  ..   TimeSpent_std = col_double(),
  ..   int = col_double(),
  ..   pc = col_double(),
  ..   uv = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
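When arrange() receives two sort keys, the second only orders rows that tie on the first, and missing values are placed last even inside desc(). A small sketch with a hypothetical toy_order tibble:

```r
library(dplyr)

# Hypothetical toy data with one missing percentage
toy_order <- tibble(
  subject = c("OcnA", "BioA", "BioA", "OcnA"),
  percentage_earned = c(0.70, 0.60, NA, 0.90)
)

sorted <- toy_order |> arrange(subject, desc(percentage_earned))

sorted$subject            # "BioA" "BioA" "OcnA" "OcnA"
sorted$percentage_earned  # 0.6 NA 0.9 0.7
```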

%>% Pipe Operator

Using sci_online_classes data and the %>% pipe operator:

Select subject, section, time spent in hours and final course grade.
Filter for students in OcnA courses with grades greater than or equal to 60.
Arrange grades by section in descending order.
Assign to a new object.

Examine the contents using a method of your choosing.

specific_course <- sci_online_classes %>%
  select(subject, section, TimeSpent_hours, FinalGradeCEMS) %>%
  filter(subject == "OcnA" & FinalGradeCEMS >= 60) %>%
  arrange(desc(section))
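The pipe passes the result on its left as the first argument of the function on its right, so a pipeline is the same computation as a set of nested calls. A sketch comparing the two styles on a hypothetical toy_courses tibble:

```r
library(dplyr)

# Hypothetical toy data
toy_courses <- tibble(
  subject = c("OcnA", "OcnA", "BioA"),
  section = c("01", "02", "01"),
  FinalGradeCEMS = c(75, 58, 90)
)

# Nested calls read from the inside out
nested <- arrange(
  filter(toy_courses, subject == "OcnA" & FinalGradeCEMS >= 60),
  desc(section)
)

# The same steps with the pipe read from top to bottom
piped <- toy_courses %>%
  filter(subject == "OcnA" & FinalGradeCEMS >= 60) %>%
  arrange(desc(section))

nested
piped
```

Only the OcnA row with a grade of 75 passes both conditions, and it appears in both results.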

Deriving info with dplyr

We will practice the summarise() and group_by() functions now.

summarise() Function

Using sci_online_classes data and the summarise() function:

Get a distinct count of course ids.
Use the %>% operator

sci_online_classes %>%
  summarise(courses = n_distinct(course_id))

# A tibble: 1 × 1
  courses
    <int>
1      26

Get a distinct count of course ids.
Use the |> operator

sci_online_classes |>
  summarise(courses = n_distinct(course_id))

# A tibble: 1 × 1
  courses
    <int>
1      26
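n_distinct() counts unique values, while n() counts rows, and the two differ whenever a value repeats. A minimal sketch, assuming a hypothetical toy_enroll tibble in which one student takes two courses:

```r
library(dplyr)

# Hypothetical toy data: student 1 is enrolled in two courses
toy_enroll <- tibble(
  student_id = c(1, 1, 2, 3),
  course_id = c("BioA-S116-01", "OcnA-S116-01", "BioA-S116-01", "BioA-S116-01")
)

counts <- toy_enroll %>%
  summarise(
    rows = n(),                         # 4 enrollment records
    courses = n_distinct(course_id),    # 2 unique courses
    students = n_distinct(student_id)   # 3 unique students
  )
counts
```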

group_by() Function

Using the sci_online_classes data and the pipe operator:

Filter final grades to remove NAs.
Group your data by subject and gender.
Summarise your data to calculate the following stats:
total number of students
mean final grade
mean time spent in the course
Assign to a new object
Examine the contents using a method of your choosing.

finals <- sci_online_classes %>%
  filter(!is.na(FinalGradeCEMS)) %>%
  group_by(subject, Gender) %>%
  # n() counts the rows (student records) in each group; sum(student_id) would add up the ID numbers
  summarise(total = n(),
            grade = mean(FinalGradeCEMS),
            time = mean(TimeSpent_hours))

`summarise()` has grouped output by 'subject'. You can override using the
`.groups` argument.
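The message above appears because summarise() removes only the last grouping variable, so the result is still grouped by subject. A sketch of setting the .groups argument explicitly, assuming a hypothetical toy_finals tibble:

```r
library(dplyr)

# Hypothetical toy data
toy_finals <- tibble(
  subject = c("BioA", "BioA", "OcnA", "OcnA"),
  Gender = c("F", "M", "F", "F"),
  FinalGradeCEMS = c(90, 80, 70, 60)
)

by_group <- toy_finals %>%
  group_by(subject, Gender) %>%
  summarise(
    total = n(),
    grade = mean(FinalGradeCEMS),
    .groups = "drop"   # return an ungrouped tibble and silence the message
  )
by_group

is_grouped_df(by_group)   # FALSE
```

Dropping the groups is one reasonable choice here; keeping them (the default) is also valid when later steps should continue to work within each subject.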

mutate() Function

Replace the dashed lines in the following code to:

Create a new variable called score that is the product of percentage earned and 100
Create a faceted scatter plot with hours spent in the course on the x-axis, score on the y-axis, and points colored by gender.
Include an alpha value in your graph.

sci_online_classes %>%
  mutate(score = percentage_earned * 100) %>%
  ggplot() +
  geom_point(mapping = aes(x = TimeSpent_hours,
                           y = score,
                           col = Gender),
             alpha = 0.5) +  # alpha sets point transparency; 0.5 is an illustrative value
  facet_wrap(~subject)
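To see the two steps separately, the sketch below first checks the new score column and then builds the plot object. It assumes a hypothetical three-row toy_plot_data tibble, and the alpha value of 0.5 is an arbitrary choice:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same column names as the course dataset
toy_plot_data <- tibble(
  percentage_earned = c(0.68, 0.76, 0.87),
  TimeSpent_hours = c(25.9, 23.0, 14.3),
  Gender = c("M", "F", "M"),
  subject = c("FrScA", "OcnA", "FrScA")
) %>%
  mutate(score = percentage_earned * 100)   # proportion converted to a 0-100 scale

toy_plot_data$score   # 68 76 87

# alpha goes outside aes() because it is a fixed setting, not mapped to a variable
p <- ggplot(toy_plot_data) +
  geom_point(aes(x = TimeSpent_hours, y = score, color = Gender), alpha = 0.5) +
  facet_wrap(~subject)

# print(p) draws the plot
```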

Final Step:

You are almost done. All you need to do is render your file and publish it on one of the following platforms.

Your Turn:

Render File: For now, we will wrap up by converting our work into a webpage that can be used to communicate your learning and demonstrate some of your new R skills. To do so, you will need to “render” your document by clicking the Render button in the menu bar at the top of this file. This will do two things:

check through all your code for any errors; and,
create a file in your directory that you can use to share your work through Posit Cloud, RPubs, GitHub Pages, Quarto Pub, or other methods.
Submit your link on Blackboard!

Now that you’ve finished your Rtutorial study, scroll back to the very top of this Quarto Document and change the author: “YOUR NAME HERE” to your actual name surrounded by quotation marks like so: author: “Dr. Cansu Tatar”.