Unit 2: R Tutorial

Author

Dr. Cansu Tatar

Week 6: Working with Data

In the previous R tutorial, we started working with the science classroom dataset. We applied the data-intensive research steps to explore our data and investigate the relationship between students’ grades and time spent.

Let’s remember which libraries and functions we used!

Your Turn:

Please write one or two sentences to explain why and how we use the following libraries and functions.

tidyverse: The tidyverse package was central to our work because it brings together tools for cleaning, transforming, and visualizing data. It made the workflow more organized and much easier to read.
skimr: With skimr, we were able to quickly generate a detailed summary of the dataset. I found it especially helpful for spotting missing values and getting an overall sense of the data distribution.
ggplot: ggplot allowed us to visually explore patterns in the dataset. Instead of just looking at numbers, we could actually see relationships, like how time spent studying might relate to grades.
read_csv(): Before doing any analysis, we used read_csv() to import the dataset into R. Without loading the file properly, none of the later steps would have been possible.
view(): When I wanted to examine the dataset more closely, view() opened it in a spreadsheet-style format. This made it easier to scroll through and visually inspect the entries.
glimpse(): glimpse() gave us a quick snapshot of the structure of the dataset, including variable types and sample values. It’s a fast way to understand what you’re working with.
head(): To get an initial feel for the data, we looked at the first few rows using head(). That helped confirm that everything was imported correctly.
tail(): On the other end, tail() showed us the last few rows of the dataset. This was useful for checking completeness and consistency.
select(): We used select() to focus on the columns relevant to our question. It helped simplify the dataset.
filter(): filter() allowed us to narrow the data based on specific conditions. This was important when we wanted to analyze particular groups of students.
arrange(): To examine trends more clearly, we sorted the data using arrange(). Organizing values helped reveal patterns that weren’t obvious before.
desc(): When we wanted to see the highest values first, we paired arrange() with desc(). This reversed the order and made comparisons easier.
geom_histogram(): geom_histogram() was used to visualize how a single variable was distributed. It helped us see whether the data was clustered, spread out, or skewed.
geom_point(): For exploring relationships between two variables, geom_point() created a scatterplot. This made it easier to interpret the connection between study time and student performance.
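As a quick refresher on pairing arrange() with desc(), here is a minimal sketch on a small made-up tibble. The object name toy_grades and its values are hypothetical; only the column names mimic the class dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; values are invented for illustration only.
toy_grades <- tibble(
  student_id = c(1, 2, 3, 4),
  TimeSpent_hours = c(12.5, 30.0, 8.0, 21.0),
  FinalGradeCEMS = c(71, 93, 65, 88)
)

# arrange() sorts ascending by default; desc() reverses the order,
# so the highest grade comes first.
top_first <- toy_grades |>
  arrange(desc(FinalGradeCEMS))

top_first$student_id  # 2 4 1 3
```

Dropping desc() would sort the same rows from the lowest grade to the highest.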

Load the Tidyverse Package

Let’s start our R code-along by loading the tidyverse package.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the online science class data

Now, load the online science class data from the data folder and assign your data to a new object.

library(readr)
library(skimr)

sci_online_classes <- read_csv("Data/sci-online-classes.csv")

Rows: 603 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): course_id, subject, semester, section, Gradebook_Item, Gender
dbl (23): student_id, total_points_possible, total_points_earned, percentage...
lgl  (1): Grade_Category

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
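The message above suggests two ways to quiet the column specification output. The sketch below tries both on a tiny inline CSV string (csv_text is a hypothetical stand-in for a file on disk), so it runs without the Data folder. It also hints at why Grade_Category was read as logical: a column that contains only NA is typically guessed as logical.

```r
library(readr)

# Hypothetical inline CSV text standing in for a real file.
csv_text <- "student_id,subject,Grade_Category\n43146,FrScA,NA\n44638,OcnA,NA\n"

# I() tells read_csv() to treat the string as literal data, not a file path.
# Option 1: keep the guessed types but silence the message.
quiet_data <- read_csv(I(csv_text), show_col_types = FALSE)

# Option 2: state the column types yourself. Here student_id is read as
# text (an ID, not a quantity) and the other columns are still guessed.
typed_data <- read_csv(
  I(csv_text),
  col_types = cols(student_id = col_character(), .default = col_guess())
)

# spec() shows the column specification that was used.
spec(quiet_data)
```

Whether student_id should be numeric or character is a judgment call; reading it as character simply prevents accidental arithmetic on IDs.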

You loaded the data. Now what should we do?

view(sci_online_classes)
glimpse(sci_online_classes)

Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Your Turn:

Examine the contents of sci_online_classes in your console. Type the object name into the console and inspect the output.

Question: Is your object a tibble? How do you know?

Your response here: Yes, the object is a tibble. When I print the object in the console, the first line of the output reads “# A tibble: 603 × 30”, and each column is listed with its data type (such as <dbl> and <chr>). A plain data frame would not print this header.

Hint: Check the output in the console.
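Besides reading the console output, the question can also be checked programmatically. The sketch below uses is_tibble() and class() on two small hypothetical objects (plain_df and as_tbl) rather than on the class dataset.

```r
library(tibble)

plain_df <- data.frame(x = 1:3)   # base R data frame
as_tbl   <- as_tibble(plain_df)   # same data, converted to a tibble

is_tibble(plain_df)  # FALSE
is_tibble(as_tbl)    # TRUE

# A tibble is still a data frame; it just carries extra classes.
class(as_tbl)        # "tbl_df" "tbl" "data.frame"
```

Running is_tibble(sci_online_classes) in your own session should answer the question for the real data.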

Check your data with different functions

You can check your data with different functions. Let’s remember how we use different functions to check our data.

view(sci_online_classes)

glimpse(sci_online_classes)

Rows: 603
Columns: 30
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 93.45372, 81.70184, 88.48758, 81.85260, 84.00000…
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

head(sci_online_classes)

# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      43146 FrScA-S216-02                  3280                2220
2      44638 OcnA-S116-01                   3531                2672
3      47448 FrScA-S216-01                  2870                1897
4      47979 OcnA-S216-01                   4562                3090
5      48797 PhysA-S116-01                  2207                1910
6      51943 FrScA-S216-03                  4208                3596
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

tail(sci_online_classes)

# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      97150 PhysA-S216-01                  2710                1803
2      97265 PhysA-S216-01                  3101                2078
3      97272 OcnA-S216-01                   2872                1733
4      97374 BioA-S216-01                   8586                6978
5      97386 BioA-S216-01                   2761                1937
6      97441 FrScA-S216-02                  2607                2205
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

summary(sci_online_classes)

   student_id     course_id         total_points_possible total_points_earned
 Min.   :43146   Length:603         Min.   :  840         Min.   :  651      
 1st Qu.:85613   Class :character   1st Qu.: 2810         1st Qu.: 2050      
 Median :88340   Mode  :character   Median : 3583         Median : 2757      
 Mean   :86070                      Mean   : 4274         Mean   : 3245      
 3rd Qu.:92731                      3rd Qu.: 5069         3rd Qu.: 3875      
 Max.   :97441                      Max.   :15552         Max.   :12208      
                                                                             
 percentage_earned   subject            semester           section         
 Min.   :0.3384    Length:603         Length:603         Length:603        
 1st Qu.:0.7047    Class :character   Class :character   Class :character  
 Median :0.7770    Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.7577                                                            
 3rd Qu.:0.8262                                                            
 Max.   :0.9106                                                            
                                                                           
 Gradebook_Item     Grade_Category FinalGradeCEMS   Points_Possible 
 Length:603         Mode:logical   Min.   :  0.00   Min.   :  5.00  
 Class :character   NA's:603       1st Qu.: 71.25   1st Qu.: 10.00  
 Mode  :character                  Median : 84.57   Median : 10.00  
                                   Mean   : 77.20   Mean   : 76.87  
                                   3rd Qu.: 92.10   3rd Qu.: 30.00  
                                   Max.   :100.00   Max.   :935.00  
                                   NA's   :30                       
 Points_Earned       Gender                q1              q2       
 Min.   :  0.00   Length:603         Min.   :1.000   Min.   :1.000  
 1st Qu.:  7.00   Class :character   1st Qu.:4.000   1st Qu.:3.000  
 Median : 10.00   Mode  :character   Median :4.000   Median :4.000  
 Mean   : 68.63                      Mean   :4.296   Mean   :3.629  
 3rd Qu.: 26.12                      3rd Qu.:5.000   3rd Qu.:4.000  
 Max.   :828.20                      Max.   :5.000   Max.   :5.000  
 NA's   :92                          NA's   :123     NA's   :126    
       q3              q4              q5              q6       
 Min.   :1.000   Min.   :1.000   Min.   :2.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
 Median :3.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.327   Mean   :4.268   Mean   :4.191   Mean   :4.008  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :123     NA's   :125     NA's   :127     NA's   :127    
       q7              q8              q9             q10       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:4.000  
 Median :4.000   Median :4.000   Median :4.000   Median :4.000  
 Mean   :3.907   Mean   :4.289   Mean   :3.487   Mean   :4.101  
 3rd Qu.:4.750   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :129     NA's   :129     NA's   :129     NA's   :129    
   TimeSpent       TimeSpent_hours    TimeSpent_std          int       
 Min.   :   0.45   Min.   :  0.0075   Min.   :-1.3280   Min.   :2.000  
 1st Qu.: 851.90   1st Qu.: 14.1983   1st Qu.:-0.6996   1st Qu.:3.900  
 Median :1550.91   Median : 25.8485   Median :-0.1837   Median :4.200  
 Mean   :1799.75   Mean   : 29.9959   Mean   : 0.0000   Mean   :4.219  
 3rd Qu.:2426.09   3rd Qu.: 40.4348   3rd Qu.: 0.4623   3rd Qu.:4.700  
 Max.   :8870.88   Max.   :147.8481   Max.   : 5.2188   Max.   :5.000  
 NA's   :5         NA's   :5          NA's   :5         NA's   :76     
       pc              uv       
 Min.   :1.500   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:3.333  
 Median :3.500   Median :3.667  
 Mean   :3.608   Mean   :3.719  
 3rd Qu.:4.000   3rd Qu.:4.167  
 Max.   :5.000   Max.   :5.000  
 NA's   :75      NA's   :75

skim(sci_online_classes)

Data summary
Name	sci_online_classes
Number of rows	603
Number of columns	30
_______________________
Column type frequency:
character	6
logical	1
numeric	23
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
course_id	1	12	13	26
subject	1	4	5	5
semester	1	4	4	3
section	1	2	2	4
Gradebook_Item	1	9	35	3
Gender	1	1	1	2

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
Grade_Category	603	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
student_id	0	1.00	86069.54	10548.60	43146.00	85612.50	88340.00	92730.50	97441.00	▁▁▁▃▇
total_points_possible	0	1.00	4274.41	2312.74	840.00	2809.50	3583.00	5069.00	15552.00	▇▅▂▁▁
total_points_earned	0	1.00	3244.69	1832.00	651.00	2050.50	2757.00	3875.00	12208.00	▇▅▁▁▁
percentage_earned	0	1.00	0.76	0.09	0.34	0.70	0.78	0.83	0.91	▁▁▃▇▇
FinalGradeCEMS	30	0.95	77.20	22.23	0.00	71.25	84.57	92.10	100.00	▁▁▁▃▇
Points_Possible	0	1.00	76.87	167.51	5.00	10.00	10.00	30.00	935.00	▇▁▁▁▁
Points_Earned	92	0.85	68.63	145.26	0.00	7.00	10.00	26.12	828.20	▇▁▁▁▁
q1	123	0.80	4.30	0.68	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▇
q2	126	0.79	3.63	0.93	1.00	3.00	4.00	4.00	5.00	▁▂▆▇▃
q3	123	0.80	3.33	0.91	1.00	3.00	3.00	4.00	5.00	▁▃▇▅▂
q4	125	0.79	4.27	0.85	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▇
q5	127	0.79	4.19	0.68	2.00	4.00	4.00	5.00	5.00	▁▂▁▇▅
q6	127	0.79	4.01	0.80	1.00	4.00	4.00	5.00	5.00	▁▁▃▇▅
q7	129	0.79	3.91	0.82	1.00	3.00	4.00	4.75	5.00	▁▁▅▇▅
q8	129	0.79	4.29	0.68	1.00	4.00	4.00	5.00	5.00	▁▁▂▇▆
q9	129	0.79	3.49	0.98	1.00	3.00	4.00	4.00	5.00	▁▃▇▇▃
q10	129	0.79	4.10	0.93	1.00	4.00	4.00	5.00	5.00	▁▂▃▇▇
TimeSpent	5	0.99	1799.75	1354.93	0.45	851.90	1550.91	2426.09	8870.88	▇▅▁▁▁
TimeSpent_hours	5	0.99	30.00	22.58	0.01	14.20	25.85	40.43	147.85	▇▅▁▁▁
TimeSpent_std	5	0.99	0.00	1.00	-1.33	-0.70	-0.18	0.46	5.22	▇▅▁▁▁
int	76	0.87	4.22	0.59	2.00	3.90	4.20	4.70	5.00	▁▁▃▇▇
pc	75	0.88	3.61	0.64	1.50	3.00	3.50	4.00	5.00	▁▁▇▅▂
uv	75	0.88	3.72	0.70	1.00	3.33	3.67	4.17	5.00	▁▁▆▇▅

str(sci_online_classes)

spc_tbl_ [603 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ student_id           : num [1:603] 43146 44638 47448 47979 48797 ...
 $ course_id            : chr [1:603] "FrScA-S216-02" "OcnA-S116-01" "FrScA-S216-01" "OcnA-S216-01" ...
 $ total_points_possible: num [1:603] 3280 3531 2870 4562 2207 ...
 $ total_points_earned  : num [1:603] 2220 2672 1897 3090 1910 ...
 $ percentage_earned    : num [1:603] 0.677 0.757 0.661 0.677 0.865 ...
 $ subject              : chr [1:603] "FrScA" "OcnA" "FrScA" "OcnA" ...
 $ semester             : chr [1:603] "S216" "S116" "S216" "S216" ...
 $ section              : chr [1:603] "02" "01" "01" "01" ...
 $ Gradebook_Item       : chr [1:603] "POINTS EARNED & TOTAL COURSE POINTS" "ATTEMPTED" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" ...
 $ Grade_Category       : logi [1:603] NA NA NA NA NA NA ...
 $ FinalGradeCEMS       : num [1:603] 93.5 81.7 88.5 81.9 84 ...
 $ Points_Possible      : num [1:603] 5 10 10 5 438 5 10 10 443 5 ...
 $ Points_Earned        : num [1:603] NA 10 NA 4 399 NA NA 10 425 2.5 ...
 $ Gender               : chr [1:603] "M" "F" "M" "M" ...
 $ q1                   : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
 $ q2                   : num [1:603] 4 4 4 5 3 NA 5 3 3 NA ...
 $ q3                   : num [1:603] 4 3 4 3 3 NA 3 3 3 NA ...
 $ q4                   : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
 $ q5                   : num [1:603] 5 4 5 5 4 NA 5 3 4 NA ...
 $ q6                   : num [1:603] 5 4 4 5 4 NA 5 4 3 NA ...
 $ q7                   : num [1:603] 5 4 4 4 4 NA 4 3 3 NA ...
 $ q8                   : num [1:603] 5 5 5 5 4 NA 5 3 4 NA ...
 $ q9                   : num [1:603] 4 4 3 5 NA NA 5 3 2 NA ...
 $ q10                  : num [1:603] 5 4 5 5 3 NA 5 3 5 NA ...
 $ TimeSpent            : num [1:603] 1555 1383 860 1599 1482 ...
 $ TimeSpent_hours      : num [1:603] 25.9 23 14.3 26.6 24.7 ...
 $ TimeSpent_std        : num [1:603] -0.181 -0.308 -0.693 -0.148 -0.235 ...
 $ int                  : num [1:603] 5 4.2 5 5 3.8 4.6 5 3 4.2 NA ...
 $ pc                   : num [1:603] 4.5 3.5 4 3.5 3.5 4 3.5 3 3 NA ...
 $ uv                   : num [1:603] 4.33 4 3.67 5 3.5 ...
 - attr(*, "spec")=
  .. cols(
  ..   student_id = col_double(),
  ..   course_id = col_character(),
  ..   total_points_possible = col_double(),
  ..   total_points_earned = col_double(),
  ..   percentage_earned = col_double(),
  ..   subject = col_character(),
  ..   semester = col_character(),
  ..   section = col_character(),
  ..   Gradebook_Item = col_character(),
  ..   Grade_Category = col_logical(),
  ..   FinalGradeCEMS = col_double(),
  ..   Points_Possible = col_double(),
  ..   Points_Earned = col_double(),
  ..   Gender = col_character(),
  ..   q1 = col_double(),
  ..   q2 = col_double(),
  ..   q3 = col_double(),
  ..   q4 = col_double(),
  ..   q5 = col_double(),
  ..   q6 = col_double(),
  ..   q7 = col_double(),
  ..   q8 = col_double(),
  ..   q9 = col_double(),
  ..   q10 = col_double(),
  ..   TimeSpent = col_double(),
  ..   TimeSpent_hours = col_double(),
  ..   TimeSpent_std = col_double(),
  ..   int = col_double(),
  ..   pc = col_double(),
  ..   uv = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Isolating Data with dplyr

We will use the select() function to select the following columns from our data.

student_id
subject
semester
FinalGradeCEMS
After selecting these columns, assign the result to a new object named “student_grade”.

student_grade <- sci_online_classes |>
  select(student_id, subject, semester, FinalGradeCEMS)

Your Turn:

Examine students’ grades. What do you notice?

Your response here: Grades range from 0 to 100, and 30 students have a missing (NA) final grade.

Hint: Check the missing data.
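One way to follow the hint is to count the missing values directly. This is a minimal sketch on a made-up tibble (toy_grade and its values are hypothetical); the same summarise() call should work on student_grade.

```r
library(dplyr)
library(tibble)

# Hypothetical grades, including two missing values.
toy_grade <- tibble(
  student_id = c(1, 2, 3, 4, 5),
  FinalGradeCEMS = c(93.5, NA, 0, 81.7, NA)
)

na_summary <- toy_grade |>
  summarise(
    n_missing = sum(is.na(FinalGradeCEMS)),       # count of NA grades
    lowest    = min(FinalGradeCEMS, na.rm = TRUE), # ignore NA when summarising
    highest   = max(FinalGradeCEMS, na.rm = TRUE)
  )

na_summary
```

Without na.rm = TRUE, min() and max() would return NA as soon as a single grade is missing.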

Specific select

Now, we will make a specific selection.

Select all columns except subject and semester.
Assign to a new object with a different name.
Examine your data frame.

sci_dataframe <- select(sci_online_classes, -c(subject, semester))

sci_dataframe

# A tibble: 603 × 28
   student_id course_id     total_points_possible total_points_earned
        <dbl> <chr>                         <dbl>               <dbl>
 1      43146 FrScA-S216-02                  3280                2220
 2      44638 OcnA-S116-01                   3531                2672
 3      47448 FrScA-S216-01                  2870                1897
 4      47979 OcnA-S216-01                   4562                3090
 5      48797 PhysA-S116-01                  2207                1910
 6      51943 FrScA-S216-03                  4208                3596
 7      52326 AnPhA-S216-01                  4325                2255
 8      52446 PhysA-S116-01                  2086                1719
 9      53447 FrScA-S116-01                  4655                3149
10      53475 FrScA-S116-02                  1710                1402
# ℹ 593 more rows
# ℹ 24 more variables: percentage_earned <dbl>, section <chr>,
#   Gradebook_Item <chr>, Grade_Category <lgl>, FinalGradeCEMS <dbl>,
#   Points_Possible <dbl>, Points_Earned <dbl>, Gender <chr>, q1 <dbl>,
#   q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
#   q9 <dbl>, q10 <dbl>, TimeSpent <dbl>, TimeSpent_hours <dbl>,
#   TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
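To confirm which columns a negative selection removed, comparing column names is often quicker than scanning the printed tibble. The sketch below uses a small hypothetical tibble (toy) and also shows the ! form, which select() accepts as an alternative to the minus sign.

```r
library(dplyr)
library(tibble)

# Hypothetical four-column tibble that mimics part of the class data.
toy <- tibble(
  student_id = 1:2,
  subject = c("BioA", "OcnA"),
  semester = c("S116", "S216"),
  FinalGradeCEMS = c(90, 75)
)

dropped_minus <- select(toy, -c(subject, semester))
dropped_bang  <- select(toy, !c(subject, semester))  # same selection

names(dropped_minus)                        # "student_id" "FinalGradeCEMS"
setdiff(names(toy), names(dropped_minus))   # "subject" "semester"
```

On the real data, setdiff(names(sci_online_classes), names(sci_dataframe)) should list exactly the two dropped columns.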

Checking the data frame:

Your Turn:

Select all columns except student_id and FinalGradeCEMS.
Assign to a new object with a different name.
Examine your data frame.

new_data_frame <- select(sci_online_classes, -c(student_id, FinalGradeCEMS))

new_data_frame

# A tibble: 603 × 28
   course_id total_points_possible total_points_earned percentage_earned subject
   <chr>                     <dbl>               <dbl>             <dbl> <chr>  
 1 FrScA-S2…                  3280                2220             0.677 FrScA  
 2 OcnA-S11…                  3531                2672             0.757 OcnA   
 3 FrScA-S2…                  2870                1897             0.661 FrScA  
 4 OcnA-S21…                  4562                3090             0.677 OcnA   
 5 PhysA-S1…                  2207                1910             0.865 PhysA  
 6 FrScA-S2…                  4208                3596             0.855 FrScA  
 7 AnPhA-S2…                  4325                2255             0.521 AnPhA  
 8 PhysA-S1…                  2086                1719             0.824 PhysA  
 9 FrScA-S1…                  4655                3149             0.676 FrScA  
10 FrScA-S1…                  1710                1402             0.820 FrScA  
# ℹ 593 more rows
# ℹ 23 more variables: semester <chr>, section <chr>, Gradebook_Item <chr>,
#   Grade_Category <lgl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

glimpse(new_data_frame)

Rows: 603
Columns: 28
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 3280, 3531, 2870, 4562, 2207, 4208, 4325, 2086, …
$ total_points_earned   <dbl> 2220, 2672, 1897, 3090, 1910, 3596, 2255, 1719, …
$ percentage_earned     <dbl> 0.6768293, 0.7567261, 0.6609756, 0.6773345, 0.86…
$ subject               <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
$ semester              <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
$ section               <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
$ Gradebook_Item        <chr> "POINTS EARNED & TOTAL COURSE POINTS", "ATTEMPTE…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Points_Possible       <dbl> 5, 10, 10, 5, 438, 5, 10, 10, 443, 5, 12, 10, 5,…
$ Points_Earned         <dbl> NA, 10.00, NA, 4.00, 399.00, NA, NA, 10.00, 425.…
$ Gender                <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M"…
$ q1                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q2                    <dbl> 4, 4, 4, 5, 3, NA, 5, 3, 3, NA, NA, 5, 3, 3, NA,…
$ q3                    <dbl> 4, 3, 4, 3, 3, NA, 3, 3, 3, NA, NA, 3, 3, 5, NA,…
$ q4                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 3, 5, NA,…
$ q5                    <dbl> 5, 4, 5, 5, 4, NA, 5, 3, 4, NA, NA, 5, 4, 5, NA,…
$ q6                    <dbl> 5, 4, 4, 5, 4, NA, 5, 4, 3, NA, NA, 5, 3, 5, NA,…
$ q7                    <dbl> 5, 4, 4, 4, 4, NA, 4, 3, 3, NA, NA, 5, 3, 5, NA,…
$ q8                    <dbl> 5, 5, 5, 5, 4, NA, 5, 3, 4, NA, NA, 4, 3, 5, NA,…
$ q9                    <dbl> 4, 4, 3, 5, NA, NA, 5, 3, 2, NA, NA, 5, 2, 2, NA…
$ q10                   <dbl> 5, 4, 5, 5, 3, NA, 5, 3, 5, NA, NA, 4, 4, 5, NA,…
$ TimeSpent             <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
$ TimeSpent_hours       <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
$ TimeSpent_std         <dbl> -0.18051496, -0.30780313, -0.69325954, -0.148446…
$ int                   <dbl> 5.0, 4.2, 5.0, 5.0, 3.8, 4.6, 5.0, 3.0, 4.2, NA,…
$ pc                    <dbl> 4.50, 3.50, 4.00, 3.50, 3.50, 4.00, 3.50, 3.00, …
$ uv                    <dbl> 4.333333, 4.000000, 3.666667, 5.000000, 3.500000…

Specific select

Select only the columns that start with Time
Assign to a new object with a different name.
Use the view() function to examine your data frame.

time_object <- select(sci_online_classes, starts_with("Time"))
view(time_object)

Your Turn:

Select only the columns that end with “r”.
Assign to a new object with a different name.
Use the view() function to examine your data frame.

r_object <- select(sci_online_classes, ends_with("r"))
view(r_object)
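Since view() opens a separate window, names() can be a handy way to see what a selection helper matched. The sketch below uses a hypothetical tibble (toy) with a few column names borrowed from the class data. Note that starts_with() and ends_with() ignore case by default (ignore.case = TRUE).

```r
library(dplyr)
library(tibble)

# Hypothetical tibble; values are invented for illustration.
toy <- tibble(
  Gender = c("M", "F"),
  semester = c("S216", "S116"),
  TimeSpent = c(1555.2, 1382.7),
  TimeSpent_hours = c(25.9, 23.0)
)

names(select(toy, starts_with("Time")))  # "TimeSpent" "TimeSpent_hours"
names(select(toy, ends_with("r")))       # "Gender" "semester"

# With case-sensitive matching, lowercase "time" matches nothing here.
names(select(toy, starts_with("time", ignore.case = FALSE)))  # character(0)
```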

Filter Function

Filter the sci_online_classes data frame for just males.
Assign to a new object.
Use the head() function to examine your data frame.

male_students <- sci_online_classes |>
  filter(Gender == "M")

head(male_students)

# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      43146 FrScA-S216-02                  3280                2220
2      47448 FrScA-S216-01                  2870                1897
3      47979 OcnA-S216-01                   4562                3090
4      52326 AnPhA-S216-01                  4325                2255
5      53475 FrScA-S116-02                  1710                1402
6      53475 FrScA-S216-01                  1209                 977
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Your Turn:

Filter the sci_online_classes data frame for just females.
Assign to a new object.
Use the tail() function to examine your data frame.

female_students <- sci_online_classes |>
  filter(Gender == "F")

tail(female_students)

# A tibble: 6 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      97150 PhysA-S216-01                  2710                1803
2      97265 PhysA-S216-01                  3101                2078
3      97272 OcnA-S216-01                   2872                1733
4      97374 BioA-S216-01                   8586                6978
5      97386 BioA-S216-01                   2761                1937
6      97441 FrScA-S216-02                  2607                2205
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
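A quick sanity check after filtering is to compare the number of rows kept with a count of each group. This is a sketch on a hypothetical tibble (toy); count() is a dplyr function that tallies rows per value.

```r
library(dplyr)
library(tibble)

# Hypothetical data with three female and two male students.
toy <- tibble(
  student_id = 1:5,
  Gender = c("M", "F", "F", "M", "F")
)

female_only <- toy |>
  filter(Gender == "F")

nrow(female_only)               # 3
gender_counts <- count(toy, Gender)  # one row per Gender value, with column n
gender_counts
```

If the filtered row count and the matching n in the count() output disagree, the filter condition is worth a second look (for example, a typo in "F").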

Let’s try the filter() function with two conditions now.

Filter the sci_online_classes data frame for students whose
percentage_earned is greater than 0.8
subject is “BioA”
Assign to a new object.
Use the tail() function to examine your data frame.

bio_students <- filter(sci_online_classes, percentage_earned > 0.8 & subject == "BioA")
tail(bio_students)

# A tibble: 6 × 30
  student_id course_id    total_points_possible total_points_earned
       <dbl> <chr>                        <dbl>               <dbl>
1      91066 BioA-S116-01                  5766                4820
2      91067 BioA-S116-01                  2672                2249
3      92633 BioA-S116-01                  2954                2495
4      95658 BioA-S216-01                  3362                2775
5      96950 BioA-S216-01                  6190                4970
6      97374 BioA-S216-01                  8586                6978
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>
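The & operator keeps rows where both conditions hold. In filter(), separating conditions with a comma has the same effect, while | keeps rows where at least one condition holds. The sketch below shows the difference on a hypothetical tibble (toy).

```r
library(dplyr)
library(tibble)

# Hypothetical data; values are invented for illustration.
toy <- tibble(
  subject = c("BioA", "BioA", "PhysA", "BioA"),
  percentage_earned = c(0.85, 0.60, 0.90, 0.81)
)

with_and   <- filter(toy, percentage_earned > 0.8 & subject == "BioA")
with_comma <- filter(toy, percentage_earned > 0.8, subject == "BioA")
with_or    <- filter(toy, percentage_earned > 0.8 | subject == "BioA")

nrow(with_and)    # 2: rows 1 and 4 meet both conditions
nrow(with_comma)  # 2: a comma behaves like &
nrow(with_or)     # 4: every row meets at least one condition
```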

Your Turn:

Filter the sci_online_classes data frame for students whose

percentage_earned is smaller than or equal to 0.6
subject is “PhysA”
Assign to a new object.
Use the head() function to examine your data frame.

physics_students <- filter(sci_online_classes, percentage_earned <= 0.6 & subject == "PhysA")
head(physics_students)

# A tibble: 5 × 30
  student_id course_id     total_points_possible total_points_earned
       <dbl> <chr>                         <dbl>               <dbl>
1      78153 PhysA-S216-01                  6530                3702
2      85962 PhysA-S116-01                  3254                1828
3      92725 PhysA-S116-01                  3226                1880
4      92729 PhysA-S116-01                  3556                1770
5      92732 PhysA-S116-01                  1207                 721
# ℹ 26 more variables: percentage_earned <dbl>, subject <chr>, semester <chr>,
#   section <chr>, Gradebook_Item <chr>, Grade_Category <lgl>,
#   FinalGradeCEMS <dbl>, Points_Possible <dbl>, Points_Earned <dbl>,
#   Gender <chr>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, TimeSpent <dbl>,
#   TimeSpent_hours <dbl>, TimeSpent_std <dbl>, int <dbl>, pc <dbl>, uv <dbl>

Let’s use the filter() function to handle missing data.

Filter the sci_classes data frame so rows with
NA for points earned are removed.
Assign to a new object.
Use the glimpse() function to examine your data frame.

clean_data <- sci_online_classes |>
  filter(!is.na(Points_Earned))

glimpse(clean_data)

Rows: 511
Columns: 30
$ student_id            <dbl> 44638, 47979, 48797, 52446, 53447, 53475, 53475,…
$ course_id             <chr> "OcnA-S116-01", "OcnA-S216-01", "PhysA-S116-01",…
$ total_points_possible <dbl> 3531, 4562, 2207, 2086, 4655, 1710, 1209, 4641, …
$ total_points_earned   <dbl> 2672, 3090, 1910, 1719, 3149, 1402, 977, 3429, 2…
$ percentage_earned     <dbl> 0.7567261, 0.6773345, 0.8654282, 0.8240652, 0.67…
$ subject               <chr> "OcnA", "OcnA", "PhysA", "PhysA", "FrScA", "FrSc…
$ semester              <chr> "S116", "S216", "S116", "S116", "S116", "S116", …
$ section               <chr> "01", "01", "01", "01", "01", "02", "01", "01", …
$ Gradebook_Item        <chr> "ATTEMPTED", "POINTS EARNED & TOTAL COURSE POINT…
$ Grade_Category        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ FinalGradeCEMS        <dbl> 81.70184, 81.85260, 84.00000, 97.77778, 96.11872…
$ Points_Possible       <dbl> 10, 5, 438, 10, 443, 5, 12, 10, 5, 10, 220, 30, …
$ Points_Earned         <dbl> 10.00, 4.00, 399.00, 10.00, 425.00, 2.50, 12.00,…
$ Gender                <chr> "F", "M", "F", "F", "F", "M", "M", "M", "F", "F"…
$ q1                    <dbl> 4, 5, 4, 3, 4, NA, NA, 4, 3, 5, NA, 4, 4, NA, 4,…
$ q2                    <dbl> 4, 5, 3, 3, 3, NA, NA, 5, 3, 3, NA, 2, 4, NA, 3,…
$ q3                    <dbl> 3, 3, 3, 3, 3, NA, NA, 3, 3, 5, NA, 2, 3, NA, 3,…
$ q4                    <dbl> 4, 5, 4, 3, 4, NA, NA, 5, 3, 5, NA, 4, 5, NA, 4,…
$ q5                    <dbl> 4, 5, 4, 3, 4, NA, NA, 5, 4, 5, NA, 4, 4, NA, 4,…
$ q6                    <dbl> 4, 5, 4, 4, 3, NA, NA, 5, 3, 5, NA, 4, 4, NA, 3,…
$ q7                    <dbl> 4, 4, 4, 3, 3, NA, NA, 5, 3, 5, NA, 4, 5, NA, 3,…
$ q8                    <dbl> 5, 5, 4, 3, 4, NA, NA, 4, 3, 5, NA, 4, 4, NA, 4,…
$ q9                    <dbl> 4, 5, NA, 3, 2, NA, NA, 5, 2, 2, NA, 2, 4, NA, 2…
$ q10                   <dbl> 4, 5, 3, 3, 5, NA, NA, 4, 4, 5, NA, 4, 4, NA, 3,…
$ TimeSpent             <dbl> 1382.7001, 1598.6166, 1481.8000, 1390.2167, 1479…
$ TimeSpent_hours       <dbl> 23.04500167, 26.64361000, 24.69666667, 23.170278…
$ TimeSpent_std         <dbl> -0.30780313, -0.14844697, -0.23466291, -0.302255…
$ int                   <dbl> 4.2, 5.0, 3.8, 3.0, 4.2, NA, NA, 4.4, 3.4, 4.7, …
$ pc                    <dbl> 3.50, 3.50, 3.50, 3.00, 3.00, NA, NA, 4.00, 3.00…
$ uv                    <dbl> 4.000000, 5.000000, 3.500000, 3.333333, 2.666667…
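
To see why is.na() is used here rather than a comparison with ==, the sketch below works on a made-up tibble (toy_points is hypothetical). It counts the missing values, removes them with !is.na(), and shows that comparing against NA directly keeps no rows at all.

```r
library(dplyr)

# Hypothetical toy data with two missing values in Points_Earned
toy_points <- tibble(
  student_id = 1:5,
  Points_Earned = c(10, NA, 4, NA, 12)
)

# Count the missing values before filtering
n_missing <- sum(is.na(toy_points$Points_Earned))
n_missing

# Keep only the rows where Points_Earned is not missing
toy_clean <- toy_points |>
  filter(!is.na(Points_Earned))
nrow(toy_clean)

# Comparing with == NA always yields NA, so filter() keeps nothing
toy_wrong <- toy_points |>
  filter(Points_Earned == NA)
nrow(toy_wrong)
```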

Filter the sci_classes data for the following subjects:

BioA
PhysA
OcnA
Assign to a new object with a different name.
Use the summary() function to examine your data frame.

my_classes <- filter(sci_online_classes, subject %in% c("BioA", "PhysA", "OcnA"))
summary(my_classes)
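
The %in% operator is shorthand for a chain of "or" comparisons. The sketch below, on a made-up tibble (toy_subjects is hypothetical), compares the two ways of writing the same filter.

```r
library(dplyr)

# Hypothetical toy data with five different subjects
toy_subjects <- tibble(
  student_id = 1:6,
  subject = c("BioA", "PhysA", "OcnA", "FrScA", "AnPhA", "BioA")
)

# Keep rows whose subject appears in the given vector
with_in <- filter(toy_subjects, subject %in% c("BioA", "PhysA", "OcnA"))

# The same filter written with | (or) between three comparisons
with_or <- filter(toy_subjects,
                  subject == "BioA" | subject == "PhysA" | subject == "OcnA")

# Count the rows kept for each subject
count(with_in, subject)
```

The %in% version stays short no matter how many subjects are listed.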

arrange() Function

Let’s recall how we used the arrange() function on our dataset.

Arrange sci_classes by subject, then by
percentage_earned in descending order.
Assign to a new object.
Use the str() function to examine your data frame.

order_classes <- sci_online_classes |>
  arrange(subject, desc(percentage_earned))

str(order_classes)

spc_tbl_ [603 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ student_id           : num [1:603] 70192 86488 96690 91175 86267 ...
 $ course_id            : chr [1:603] "AnPhA-S116-02" "AnPhA-S116-01" "AnPhA-S216-01" "AnPhA-S116-02" ...
 $ total_points_possible: num [1:603] 1936 3342 4804 3199 3045 ...
 $ total_points_earned  : num [1:603] 1763 3033 4309 2867 2705 ...
 $ percentage_earned    : num [1:603] 0.911 0.908 0.897 0.896 0.888 ...
 $ subject              : chr [1:603] "AnPhA" "AnPhA" "AnPhA" "AnPhA" ...
 $ semester             : chr [1:603] "S116" "S116" "S216" "S116" ...
 $ section              : chr [1:603] "02" "01" "01" "02" ...
 $ Gradebook_Item       : chr [1:603] "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" "POINTS EARNED & TOTAL COURSE POINTS" ...
 $ Grade_Category       : logi [1:603] NA NA NA NA NA NA ...
 $ FinalGradeCEMS       : num [1:603] 96 87.4 64.8 82.2 35.1 ...
 $ Points_Possible      : num [1:603] 10 28 10 5 50 15 10 10 353 460 ...
 $ Points_Earned        : num [1:603] 7 26 3 5 50 11 8 10 330 452 ...
 $ Gender               : chr [1:603] "F" "M" "F" "F" ...
 $ q1                   : num [1:603] 4 4 4 5 5 4 5 4 NA NA ...
 $ q2                   : num [1:603] 3 4 3 3 5 2 4 4 NA NA ...
 $ q3                   : num [1:603] 3 2 2 3 3 3 4 3 NA NA ...
 $ q4                   : num [1:603] 4 3 5 5 5 4 5 4 NA NA ...
 $ q5                   : num [1:603] 4 3 4 5 5 4 5 4 NA NA ...
 $ q6                   : num [1:603] 3 3 4 4 5 3 5 4 NA NA ...
 $ q7                   : num [1:603] 3 3 3 3 4 4 5 4 NA NA ...
 $ q8                   : num [1:603] 5 2 4 5 5 4 4 4 NA NA ...
 $ q9                   : num [1:603] 2 3 3 3 5 1 4 4 NA NA ...
 $ q10                  : num [1:603] 5 3 2 5 5 2 5 4 NA NA ...
 $ TimeSpent            : num [1:603] 1537 3600 1970 1315 406 ...
 $ TimeSpent_hours      : num [1:603] 25.62 60 32.83 21.92 6.77 ...
 $ TimeSpent_std        : num [1:603] -0.194 1.328 0.125 -0.358 -1.029 ...
 $ int                  : num [1:603] 4.4 3 3.8 5 5 3.9 4.6 4 4.8 4.6 ...
 $ pc                   : num [1:603] 3 2.5 2.5 3 3.5 3.5 3.75 3.5 3.5 4.5 ...
 $ uv                   : num [1:603] 2.67 3.33 3.33 3.33 5 ...
 - attr(*, "spec")=
  .. cols(
  ..   student_id = col_double(),
  ..   course_id = col_character(),
  ..   total_points_possible = col_double(),
  ..   total_points_earned = col_double(),
  ..   percentage_earned = col_double(),
  ..   subject = col_character(),
  ..   semester = col_character(),
  ..   section = col_character(),
  ..   Gradebook_Item = col_character(),
  ..   Grade_Category = col_logical(),
  ..   FinalGradeCEMS = col_double(),
  ..   Points_Possible = col_double(),
  ..   Points_Earned = col_double(),
  ..   Gender = col_character(),
  ..   q1 = col_double(),
  ..   q2 = col_double(),
  ..   q3 = col_double(),
  ..   q4 = col_double(),
  ..   q5 = col_double(),
  ..   q6 = col_double(),
  ..   q7 = col_double(),
  ..   q8 = col_double(),
  ..   q9 = col_double(),
  ..   q10 = col_double(),
  ..   TimeSpent = col_double(),
  ..   TimeSpent_hours = col_double(),
  ..   TimeSpent_std = col_double(),
  ..   int = col_double(),
  ..   pc = col_double(),
  ..   uv = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
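
Because the str() output is long, the sort order can be easier to check on a small example. The sketch below uses a made-up tibble (toy_grades is hypothetical): rows are ordered by subject alphabetically, and within each subject by percentage_earned from highest to lowest.

```r
library(dplyr)

# Hypothetical toy data in no particular order
toy_grades <- tibble(
  subject = c("PhysA", "BioA", "BioA", "PhysA"),
  percentage_earned = c(0.70, 0.65, 0.92, 0.88)
)

# Sort by subject (ascending), then by percentage_earned (descending)
toy_sorted <- toy_grades |>
  arrange(subject, desc(percentage_earned))

toy_sorted
```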

%>% Pipe Operator

Using sci_classes data and the %>% pipe operator:

Select subject, section, time spent in hours and final course grade.
Filter for students in OcnA courses with grades greater than or equal to 60.
Arrange grades by section in descending order.
Assign to a new object.

Examine the contents using a method of your choosing.

specific_course <- sci_online_classes %>%
  select(subject, section, TimeSpent_hours, FinalGradeCEMS) %>%
  filter(subject == "OcnA" & FinalGradeCEMS >= 60) %>%
  arrange(desc(section))
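
To illustrate what the pipe does, the sketch below writes the same two steps once as nested function calls and once as a pipeline. The tibble toy_course is hypothetical and only stands in for the course data.

```r
library(dplyr)

# Hypothetical toy data
toy_course <- tibble(
  subject = c("OcnA", "OcnA", "BioA", "OcnA"),
  section = c("01", "02", "01", "03"),
  FinalGradeCEMS = c(75, 55, 90, 88)
)

# Nested calls are read from the inside out
nested <- arrange(
  filter(toy_course, subject == "OcnA" & FinalGradeCEMS >= 60),
  desc(section)
)

# The same steps as a pipeline, read from top to bottom
piped <- toy_course %>%
  filter(subject == "OcnA" & FinalGradeCEMS >= 60) %>%
  arrange(desc(section))

piped
```

Both objects hold the same rows; the pipeline simply passes the result of each step as the first argument of the next.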

Deriving info with dplyr

We will now practice the summarise() and group_by() functions.

summarise() Function

Using sci_classes data and the summarise() function:

Get a distinct count of course ids.
Use the %>% operator

sci_online_classes %>%
  summarise(courses = n_distinct(course_id))

# A tibble: 1 × 1
  courses
    <int>
1      26

Get a distinct count of course ids.
Use the |> operator

sci_online_classes |>
  summarise(courses = n_distinct(course_id))

# A tibble: 1 × 1
  courses
    <int>
1      26
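
A short sketch of how summarise() names its output columns, using a made-up tibble (toy_ids is hypothetical). Writing name = expression gives the column that name, while using <- inside the call leaves the whole expression as the column name.

```r
library(dplyr)

# Hypothetical toy data: four rows, three distinct course ids
toy_ids <- tibble(course_id = c("BioA-01", "BioA-01", "PhysA-01", "OcnA-01"))

# `=` names the new column "courses"
named <- toy_ids |>
  summarise(courses = n_distinct(course_id))
names(named)

# `<-` is treated as part of the expression, so the column is not called "courses"
unnamed <- toy_ids |>
  summarise(courses <- n_distinct(course_id))
names(unnamed)
```

The count itself is the same in both cases; only the column name differs, which matters as soon as the column is referred to in a later step.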

group_by() Function

Using the sci_classes data and the pipe operator:

Filter final grades to remove NAs.
Group your data by subject and gender.
Summarise your data to calculate the following stats:
total number of students
mean final grade
mean time spent in the course
Assign to a new object
Examine the contents using a method of your choosing.

# n() counts the rows (students) in each group; sum(student_id) would add up the ID numbers
finals <- sci_online_classes %>%
  filter(!is.na(FinalGradeCEMS)) %>%
  group_by(subject, Gender) %>%
  summarise(total = n(),
            grade = mean(FinalGradeCEMS),
            time = mean(TimeSpent_hours))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by subject and Gender.
ℹ Output is grouped by subject.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(subject, Gender))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
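
The sketch below repeats the same pattern on a made-up tibble (toy_finals is hypothetical) so the group counts can be checked by eye. It assumes that n() is the intended way to count students per group, and it sets .groups = "drop" so that the result is ungrouped and no regrouping message is printed.

```r
library(dplyr)

# Hypothetical toy data: one missing final grade
toy_finals <- tibble(
  student_id = c(101, 102, 103, 104, 105),
  subject = c("BioA", "BioA", "BioA", "PhysA", "PhysA"),
  Gender = c("F", "F", "M", "M", "M"),
  FinalGradeCEMS = c(80, 90, 70, NA, 60)
)

toy_summary <- toy_finals |>
  filter(!is.na(FinalGradeCEMS)) |>   # drop rows without a final grade
  group_by(subject, Gender) |>        # one group per subject/gender pair
  summarise(
    total = n(),                      # number of rows (students) in the group
    grade = mean(FinalGradeCEMS),     # mean final grade in the group
    .groups = "drop"                  # return an ungrouped tibble
  )

toy_summary
```

Here the BioA/F group has two students with a mean grade of 85, while the BioA/M and PhysA/M groups have one student each.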

mutate() Function

Replace the dashed lines in the following code to:

Create a new variable called score that is the product of percentage earned and 100.
Create a faceted scatter plot with hours spent in the course on the x-axis, score on the y-axis, and points colored by gender.
Include an alpha value in your graph.

sci_online_classes %>%
  mutate(score = percentage_earned * 100) %>%
  ggplot() +
  geom_point(mapping = aes(x = TimeSpent_hours,
                           y = score,
                           col = Gender),
             alpha = 0.5) +
  facet_wrap(~subject)
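
A minimal sketch of the same steps on a made-up tibble (toy_scores is hypothetical), with the mutate() step stored separately so the new score column can be inspected before plotting. The alpha value of 0.5 is only an example; any value between 0 (transparent) and 1 (opaque) can be used.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data
toy_scores <- tibble(
  TimeSpent_hours = c(5, 12, 20, 33, 41, 18),
  percentage_earned = c(0.55, 0.72, 0.81, 0.93, 0.88, 0.64),
  Gender = c("F", "M", "F", "M", "F", "M"),
  subject = c("BioA", "BioA", "BioA", "PhysA", "PhysA", "PhysA")
)

# Add score as a percentage on a 0-100 scale
toy_scored <- toy_scores |>
  mutate(score = percentage_earned * 100)

# alpha sits outside aes() because it is a fixed value, not a variable in the data
toy_plot <- ggplot(toy_scored) +
  geom_point(mapping = aes(x = TimeSpent_hours, y = score, col = Gender),
             alpha = 0.5) +
  facet_wrap(~subject)

toy_plot
```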

Final Step:

You are almost done. All you need to do is render your file and publish it on one of the following platforms.

Your Turn:

Render File: For now, we will wrap up by converting our work into a webpage that you can use to communicate your learning and demonstrate some of your new R skills. To do so, you will need to “render” your document by clicking the Render button in the menu bar at the top of this file. This will do two things; it will:

check through all your code for any errors; and,
create a file in your directory that you can use to share your work through Posit Cloud, RPubs, GitHub Pages, Quarto Pub, or other methods.
Submit your link to Blackboard!

Now that you’ve finished your Rtutorial study, scroll back to the very top of this Quarto Document and change the author: “YOUR NAME HERE” to your actual name surrounded by quotation marks like so: author: “Dr. Cansu Tatar”.