RStudio/Quarto Workbook

Author

Oliver Barnes

Quarto Intro

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

1 + 1

[1] 2

You can add options to executable code like this:

[1] 4

The echo: false option disables the printing of code (only output is displayed).
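Because echo: false hides the code, the chunk that produced the [1] 4 output is not visible in the rendered page. A minimal sketch of what such a chunk could contain, assuming the hidden expression was 2 * 2 (the source only shows the result, so the expression and the name `result` are assumptions):

```r
#| echo: false
# Assumed expression: the rendered page only shows the result, not the code.
result <- 2 * 2
result
```

The "#|" line has to sit at the very top of the chunk, before any code.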

Warning

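Make sure the working directory is set up correctly.
Files referenced by a relative path are looked up relative to the working directory, so keep data files in the same folder as the saved .qmd file (or adjust the path).
To set the working directory use: setwd("PATH/directoryname")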

Loading Packages

To load a package within RStudio use the following sequence within a code chunk:

#| label: load-packages
#| include: true
#| warning: false
library(palmerpenguins)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

(Note: each line starting with “#|” at the top of a chunk sets one chunk option, e.g. the chunk's label or whether warnings are shown in the rendered output.)

Palmer Penguins Dataset

Simply starting a new code chunk and typing the name of the dataset, “penguins”, will print a preview table of the Palmer penguins data, like so:

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

%>% is the ‘pipe’ operator. It takes the object on its left (e.g. a dataset) and passes it as the first argument to the function on its right. So penguins %>% glimpse tells R to ‘glimpse’ the penguins dataset, i.e. to list every column with its type and its first few values, see below. (The pipe %>% comes from a package called magrittr and becomes available when tidyverse is loaded.)

penguins %>% glimpse

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
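To make the pipe a little more concrete, here is a minimal sketch showing that the piped call and the ordinary function call do the same thing, and that pipes can be chained. The object name `n_adelie` is just something I made up for illustration:

```r
library(tidyverse)
library(palmerpenguins)

# These two calls are equivalent: the pipe passes `penguins`
# as the first argument of glimpse().
penguins %>% glimpse()
glimpse(penguins)

# Pipes can be chained, so each step reads from left to right:
# take penguins, keep the Adelie rows, then count the rows.
n_adelie <- penguins %>%
  filter(species == "Adelie") %>%
  nrow()
n_adelie
```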

Inserting an image

Inserting an image into a Quarto workbook via RStudio is very easy. Simply switch from ‘Source’ to ‘Visual’ mode, click on “Insert”, then “Figure/Image”, paste the URL of the image and click “OK”. Here’s a nice photograph of Bob Marley smiling:

Click here to view source website

Embedding a video

Embedding a video into Quarto is also very straightforward. Simply right click on the video you wish to embed, click “copy embed code”, paste it into the source pane/script editor and click Render (no need to put it inside a code chunk):

Data Analysis Outputs

There are many ways to display data analysis outputs in RStudio and even more ways to manipulate and scrutinize the data. Below are just two examples of visual representations of the penguins data (a histogram and a scatterplot).

Histogram (body mass)

Scatterplot (bill length vs bill depth)
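The two plots are not reproduced in this text. A minimal sketch of how a histogram and a scatterplot like these could be drawn with ggplot2, assuming the penguins data loaded earlier; the bin count, axis labels, colour mapping and object names are my assumptions, not the original chunks:

```r
library(tidyverse)
library(palmerpenguins)

# Histogram of body mass; rows with a missing body mass are dropped first.
hist_plot <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(bins = 30) +
  labs(x = "Body mass (g)", y = "Count")

# Scatterplot of bill length against bill depth, coloured by species.
scatter_plot <- penguins %>%
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm)) %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point() +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")

# Printing a ggplot object draws it.
hist_plot
scatter_plot
```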

Week 3 - Pre-Session - Tidyverse

The tidyverse package is actually a “package of other packages” (dplyr, ggplot2, etc.), and these are listed when tidyverse is loaded. Packages such as tidyverse must be loaded at the start of every new RStudio session, because loaded packages do not carry over from one session to the next. Loading also reports naming conflicts between packages, e.g. filter() exists both in the dplyr package and in the stats package, and the most recently loaded version takes priority. You can also call an individual function from a package without loading the whole library, e.g. to use the filter function from dplyr you can write “dplyr::filter()”.
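As a small illustration of the package::function form, the sketch below calls both filter() functions explicitly, without any library() call. The object names `cheap` and `smoothed` and the price cut-off of 400 are made up for the example:

```r
# No library() call is needed when the package name is written out.

# dplyr::filter() keeps the rows of a data frame that meet a condition.
cheap <- dplyr::filter(ggplot2::diamonds, price < 400)
nrow(cheap)

# stats::filter() is a completely different function: it applies a
# linear filter (here a 3-point moving average) to a numeric series.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing the package name out like this avoids any doubt about which filter() is being used.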

The “diamonds” dataset

Here we are going to be working with the “diamonds” dataset. This dataset is built into ggplot2 and therefore included in tidyverse.

view(diamonds)

The above command opens the diamonds dataset in a separate tab.

To view the structure/type of each variable in the diamonds dataset run “str(diamonds)” as below:

str(diamonds)

tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

As we can see there are 53,940 entries (individual diamonds) and 10 variables to describe each entry.

With built-in datasets such as “diamonds” running a simple query via (?diamonds) will provide further information on the dataset:

(?diamonds)

starting httpd help server ... done

Prices of over 50,000 round cut diamonds.

Description: A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

Usage: diamonds

Format: A data frame with 53940 rows and 10 variables:

- price: price in US dollars ($326–$18,823)
- carat: weight of the diamond (0.2–5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from D (best) to J (worst)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0–10.74)
- y: width in mm (0–58.9)
- z: depth in mm (0–31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
- table: width of top of diamond relative to widest point (43–95)

Data Management

6.1 mutate():

mutate() adds new columns or modifies current variables in a dataset

E.g. to create three new variables within the diamonds dataset:

Variable name: JustOne, all values: 1
Variable name: Values, all values: something
Variable name: Simple, all values: TRUE

I would use:

diamonds %>% mutate(JustOne = 1, Values = "something", Simple = TRUE)

# A tibble: 53,940 × 13
   carat cut    color clarity depth table price     x     y     z JustOne Values
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl> <chr> 
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43       1 somet…
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31       1 somet…
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31       1 somet…
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63       1 somet…
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75       1 somet…
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48       1 somet…
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47       1 somet…
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53       1 somet…
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49       1 somet…
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39       1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>

mutate() can also be used to create new variables based on existing variables from the dataset.

E.g. to create a new column detailing a 20% reduction in price for each diamond I would use:

diamonds.new <-
diamonds %>%
  mutate(price20percoff = price * 0.80)

Tip

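The result, including the added column, has been saved as a new object called diamonds.new using the assignment operator (<-) above. The original diamonds dataset is left unchanged.

Without the assignment, R only prints the result and the new variable is not kept.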
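A quick way to check where the new column actually lives is to look at the column names of both objects. This is only a small sketch reusing the price20percoff example from above:

```r
library(tidyverse)

diamonds.new <- diamonds %>%
  mutate(price20percoff = price * 0.80)

# The new column exists in the new object ...
"price20percoff" %in% names(diamonds.new)

# ... but not in the built-in diamonds dataset.
"price20percoff" %in% names(diamonds)
```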

6.1.1 Nesting Functions

We can also use other functions inside mutate() to create our new variable(s). For example, we might use mean(), sd() (standard deviation) and/or median() to calculate the mean, standard deviation and median of price across all diamonds in the dataset. This is called nesting, i.e. one function, e.g. mean(), “nests” inside another function, mutate().

E.g.:

diamonds %>% 
  mutate(m = mean(price),     
         sd = sd(price),      
         med = median(price))

# A tibble: 53,940 × 13
   carat cut       color clarity depth table price     x     y     z     m    sd
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933. 3989.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933. 3989.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933. 3989.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933. 3989.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933. 3989.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933. 3989.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933. 3989.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933. 3989.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933. 3989.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>

Note: The values in these new columns will be the same for every row because R takes all of the values in price to calculate the mean/standard deviation/median.

diamonds %>%
  mutate(totweight = sum(carat))

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totweight
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    43041.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    43041.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    43041.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    43041.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    43041.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    43041.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    43041.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    43041.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    43041.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    43041.
# ℹ 53,930 more rows

The above code has calculated the total weight of all diamonds, i.e. the sum of all carat values, and repeated it in every row.

6.1.1.1 Recode

recode() modifies the values within a variable like so:

data %>% mutate(Variable = recode(Variable, "old value" = "new value"))

cut.new <- 
diamonds %>% 
  mutate(cut.new = recode(cut,  "Ideal" = "IDEAL"))

This has simply changed the value “Ideal” within the cut variable to all caps, i.e. “IDEAL”, and stored the result in a new column called cut.new. Of course R can perform far more complex recoding, but this is the basic idea.

Remember to save the changes by assigning the result to an object within the code chunk:

cut.new <- diamonds %>% mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))
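To confirm what recode() actually changed, the levels of the original and the recoded column can be compared. This is a minimal sketch based on the example above (note that recent dplyr versions mark recode() as superseded, although it still works):

```r
library(tidyverse)

cut.new <- diamonds %>%
  mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))

# Levels before and after recoding: only "Ideal" should differ.
levels(diamonds$cut)
levels(cut.new$cut.new)

# The original cut column is kept alongside the new one,
# so the two can be compared value by value.
cut.new %>% count(cut, cut.new)
```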

6.2 Summarize

summarize() collapses all rows and returns a one-row summary. R will recognize both the British and American spelling (summarise/summarize).

diamonds %>%
summarize(avg.price = mean(price))

# A tibble: 1 × 1
  avg.price
      <dbl>
1     3933.

Above we see the average (mean) price of all the diamonds within the dataset.

As with the mutate() function, multiple operations can be performed simultaneously within summarize(), as can nesting, like so:

diamonds %>%
  summarize(avg.price = mean(price),
            random.add = 1 + 2, # math operation without an existing variable
            avg.carat = mean(carat),
            stdev.price = sd(price)) # calculating the standard deviation

# A tibble: 1 × 4
  avg.price random.add avg.carat stdev.price
      <dbl>      <dbl>     <dbl>       <dbl>
1     3933.          3     0.798       3989.

6.3 group_by() and ungroup()

group_by() takes existing data and groups it by the values of one or more variables, so that later operations are performed separately on each group. ungroup() removes the grouping again. As an example, I ran some data on males and females, their ages, and their scores in a separate R script (i.e. not my Quarto workbook .qmd file); the resulting object appeared in the Environment pane.

Thereafter, grouping by sex and summarizing in the same script gave me a small table detailing the mean score, the standard deviation of the score, and n (the number of subjects) for girls and boys separately.

So by grouping we get to see summaries for each group within the dataset!

Now let's group by Sex and Age (note that entering “Sex” first and then “Age” in group_by() will display these variables in the table in that order from left to right, and vice versa).

This gave 27 rows in total, because R considers every unique combination of Sex and Age in the dataset.

The missing standard deviation (s) values occur because those groups contain only 1 observation, while the standard deviation requires at least 2.
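The script and data used for this example are not included here, so the sketch below uses a tiny made-up dataset instead. The tibble `scores`, its values and the column names Sex, Age and Score are all hypothetical and only serve to illustrate the pattern:

```r
library(tidyverse)

# Hypothetical data, invented for illustration only.
scores <- tibble(
  Sex   = c("girl", "girl", "girl", "boy", "boy"),
  Age   = c(10, 10, 11, 10, 11),
  Score = c(12, 14, 9, 11, 15)
)

# One summary row per sex.
by_sex <- scores %>%
  group_by(Sex) %>%
  summarize(m = mean(Score), s = sd(Score), n = n()) %>%
  ungroup()

# One summary row per unique combination of Sex and Age.
# .groups = "drop" returns an ungrouped result without a message.
by_sex_age <- scores %>%
  group_by(Sex, Age) %>%
  summarize(m = mean(Score), s = sd(Score), n = n(), .groups = "drop")

by_sex
by_sex_age  # s is NA wherever a group has only one observation
```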

Warning

If ungroup() is left out at the end of a pipeline that uses group_by(), the grouping stays in place and later calculations on the result will still be carried out per group, which is likely to give unexpected results.

Always use ungroup() after group_by().

6.4 Filter

filter() only retains specific rows of data that meet the specified requirement(s), e.g.

To only display data from the diamonds dataset that have a cut value of Fair:

diamonds %>% filter(cut == "Fair")

# A tibble: 1,610 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 1,600 more rows

Only display data from diamonds that have a cut value of Fair or Good and a price at or under $600 (notice how the or statement is obtained with | while the and statement is achieved through a comma):

diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600)

# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

The following code would require the cut to be both Fair and Good at the same time (no such diamond exists):

diamonds %>%
  filter(cut == "Fair", cut == "Good", price <= 600)

# A tibble: 0 × 10
# ℹ 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
#   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
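As a cross-check of the "or" and "and" logic above, the sketch below writes the Fair-or-Good filter in three ways that should all keep the same rows. The %in% form and the object names are my additions, not part of the original exercise:

```r
library(tidyverse)

# "or" written with |, "and" written with a comma (as above).
with_or <- diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600)

# The same "or" written with %in%, which is shorter for many values.
with_in <- diamonds %>%
  filter(cut %in% c("Fair", "Good"), price <= 600)

# The comma replaced by an explicit & (brackets keep the "or" together).
with_and <- diamonds %>%
  filter((cut == "Fair" | cut == "Good") & price <= 600)

# All three should contain the same number of rows.
nrow(with_or)
nrow(with_in)
nrow(with_and)
```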

Only display data from diamonds that do not have a cut value of Fair:

diamonds %>% filter(cut != "Fair")

# A tibble: 52,330 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
10  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73
# ℹ 52,320 more rows

6.5 Select

Function: Selects only the columns (variables) that you want to see. Gets rid of all other columns. You can refer to the columns by the column position (first column) or by name. The order in which you list the column names/positions is the order that the columns will be displayed.

In the diamonds dataset, only retain the first 4 columns:

diamonds %>% select(1:4)

# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows

Retain all of the columns except for cut and color:

diamonds %>% select(-cut, -color)

# A tibble: 53,940 × 8
   carat clarity depth table price     x     y     z
   <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

You can also retain all of the columns, but rearrange some of the columns to appear at the beginning—this moves the x,y,z variables to the first 3 columns:

diamonds %>% select(x,y,z, everything())

# A tibble: 53,940 × 10
       x     y     z carat cut       color clarity depth table price
   <dbl> <dbl> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  3.95  3.98  2.43  0.23 Ideal     E     SI2      61.5    55   326
 2  3.89  3.84  2.31  0.21 Premium   E     SI1      59.8    61   326
 3  4.05  4.07  2.31  0.23 Good      E     VS1      56.9    65   327
 4  4.2   4.23  2.63  0.29 Premium   I     VS2      62.4    58   334
 5  4.34  4.35  2.75  0.31 Good      J     SI2      63.3    58   335
 6  3.94  3.96  2.48  0.24 Very Good J     VVS2     62.8    57   336
 7  3.95  3.98  2.47  0.24 Very Good I     VVS1     62.3    57   336
 8  4.07  4.11  2.53  0.26 Very Good H     SI1      61.9    55   337
 9  3.87  3.78  2.49  0.22 Fair      E     VS2      65.1    61   337
10  4     4.05  2.39  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

6.6 Arrange

Function: Allows you to arrange rows by the values of a variable in ascending or descending order (if that is applicable to your values). This can apply to both numerical and non-numerical values, e.g. to arrange by colour from D to J (color is an ordered factor, so rows are sorted by its level order, which here matches alphabetical order):

diamonds %>% arrange(color)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Very Good D     VS2      60.5    61   357  3.96  3.97  2.4 
 2  0.23 Very Good D     VS1      61.9    58   402  3.92  3.96  2.44
 3  0.26 Very Good D     VS2      60.8    59   403  4.13  4.16  2.52
 4  0.26 Good      D     VS2      65.2    56   403  3.99  4.02  2.61
 5  0.26 Good      D     VS1      58.4    63   403  4.19  4.24  2.46
 6  0.22 Premium   D     VS2      59.3    62   404  3.91  3.88  2.31
 7  0.3  Premium   D     SI1      62.6    59   552  4.23  4.27  2.66
 8  0.3  Ideal     D     SI1      62.5    57   552  4.29  4.32  2.69
 9  0.3  Ideal     D     SI1      62.1    56   552  4.3   4.33  2.68
10  0.24 Very Good D     VVS1     61.5    60   553  3.97  4     2.45
# ℹ 53,930 more rows

Or, to arrange price by numerical order (lowest to highest):

diamonds %>% arrange(price)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Or, to arrange price in descending numerical order:

diamonds %>% arrange(desc(price))

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

6.6.1 Exercises

Problem A

library(tidyverse)
midwest %>% # use the midwest dataset
  group_by(state) %>% # perform the following calculations separately for each state
  summarize(poptotalmean = mean(poptotal), # mean of poptotal for that state
            popmax = max(poptotal), # maximum value of poptotal for that state
            popmin = min(poptotal), # minimum value of poptotal for that state
            popdistinct = n_distinct(poptotal), # number of unique values in poptotal for that state
            # popfirst = first(poptotal), # first value in poptotal; commented out, so it is not in the output below
            popany = any(poptotal < 5000), # TRUE if any county in that state has poptotal < 5000, otherwise FALSE
            popany2 = any(poptotal > 2000000)) %>% # TRUE if any county in that state has poptotal > 2,000,000, otherwise FALSE
  ungroup() # ungroup the dataset

# A tibble: 5 × 7
  state poptotalmean  popmax popmin popdistinct popany popany2
  <chr>        <dbl>   <int>  <int>       <int> <lgl>  <lgl>  
1 IL         112065. 5105067   4373         101 TRUE   TRUE   
2 IN          60263.  797159   5315          92 FALSE  FALSE  
3 MI         111992. 2111687   1701          83 TRUE   TRUE   
4 OH         123263. 1412140  11098          88 FALSE  FALSE  
5 WI          67941.  959275   3890          72 TRUE   FALSE

Problem B

midwest %>% # do the following to the midwest dataset
  group_by(state) %>% # perform the following calculations separately for each state
  summarize(num5k = sum(poptotal < 5000), # number of counties in that state with poptotal < 5000
            num2mil = sum(poptotal > 2000000), # number of counties in that state with poptotal > 2,000,000
            numrows = n()) %>% # total number of rows (counties) in that state
  ungroup() # ungroup variables

# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem C

midwest %>% 
  group_by(county) %>% 
  summarize(x = n_distinct(state)) %>% # x = number of different states in which each county name appears
  arrange(desc(x)) %>% # sort x in descending order, i.e. county names shared by the most states first
  ungroup()

# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% # x = number of rows (entries) for each county name
  ungroup()

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Problem D

diamonds %>% 
  group_by(clarity) %>% 
  summarize(a = n_distinct(color), # show the no of unique colours
            b = n_distinct(price), # show the no of unique prices
            c = n()) %>% # show the total no of rows (diamonds) in each clarity group
  ungroup()

# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

diamonds %>% 
  group_by(color, cut) %>% # display every unique combination of colour and cut [note the comma between variables]
  summarize(m = mean(price), # summarize the mean/average price of each combination
            s = sd(price)) %>% # show the sd of each combination
  ungroup()

`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.

# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows
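The message printed above appears because summarize() only removes the last grouping variable (cut) and leaves the result grouped by color. A minimal sketch of the alternative the message hints at, using the .groups argument instead of a separate ungroup(); the object name `by_color_cut` is made up:

```r
library(tidyverse)

by_color_cut <- diamonds %>%
  group_by(color, cut) %>%
  summarize(m = mean(price),
            s = sd(price),
            .groups = "drop") # drop all grouping, so no message and no ungroup() needed

# FALSE means the result is no longer grouped.
is_grouped_df(by_color_cut)
```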

Problem F

diamonds %>% 
  group_by(cut) %>% 
  summarize(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>% 
  ungroup()

# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

diamonds %>% 
  group_by(color) %>% # group all unique variations of colour
  summarize(m = mean(price)) %>% # summarize the mean price of each colour group
  mutate(x1 = str_c("Diamond color ", color), 
         x2 = 5) %>% # create 2 new columns - x1 with values of "Diamond color"+ color & x2 with the value 5 in every row
  ungroup()

# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

Problem H

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% # create a new column, x1, equal to half of each diamond's price
  summarize(m = mean(x1)) %>% # summarize the mean of x1 for each colour (m). The summary keeps only color and m, so the x1 column is not shown
  ungroup()

# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% # add a column equal to half of the price called x1
  ungroup() %>%  
  summarize(m = mean(x1)) # summarize the mean values of x1

# A tibble: 1 × 1
      m
  <dbl>
1 1966.

Since the above ungroups before summarizing, the output isn’t separated into individual colours and instead gives the mean of all colours combined.

Why is grouping data necessary?

Grouping data is necessary as it allows calculations and summaries to be carried out separately for specific groups within the dataset.

Why is ungrouping necessary?

Ungrouping is necessary because the grouping stays attached to the data; without ungroup(), later operations on the same object would still be carried out per group.

When should you ungroup data?

You should ungroup data as soon as you no longer need or want the data to be grouped.

If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

No. If the data was never grouped, there is nothing to ungroup.
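To see that the grouping really does stay attached to the data, group_vars() can be used to list the current grouping variables. This is a small sketch based on Problem H; the object names are made up:

```r
library(tidyverse)

still_grouped <- diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5)

# The grouping variable is still attached after mutate().
group_vars(still_grouped)

no_longer_grouped <- still_grouped %>% ungroup()

# After ungroup() there are no grouping variables left.
group_vars(no_longer_grouped)

# The same summarize() call now gives different results:
per_color <- still_grouped %>% summarize(m = mean(x1))     # one row per color
overall   <- no_longer_grouped %>% summarize(m = mean(x1)) # a single row
per_color
overall
```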

6.7 Extra Practice

View(diamonds) # opens the diamonds dataset in a new window

arrange(diamonds, price) # this displays the data with price ascending from lowest to highest

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

diamonds %>% arrange(price) # this does exactly the same as above, written with the pipe

diamonds %>% 
  arrange(price) %>%
  arrange(cut)

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows

The above code arranges the data by cut, from lowest quality (Fair) to highest (Ideal), and within each cut by price from lowest to highest.
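
arrange() also accepts several columns in one call, which states the sort order more directly. A minimal sketch, assuming dplyr and ggplot2 are installed (the name by_cut_then_price is made up for illustration):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

by_cut_then_price <- diamonds %>% 
  arrange(cut, price) # sort by cut first, then by price within each cut

head(by_cut_then_price) # first rows: Fair diamonds, cheapest first
```

The columns are listed in order of priority, so cut comes first here even though it was the last arrange() in the chained version.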

diamonds %>% 
  arrange(desc(price)) %>% 
  arrange(desc(cut))

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.51 Ideal G     IF       61.7  55   18806  7.37  7.41  4.56
 2  2.07 Ideal G     SI2      62.5  55   18804  8.2   8.13  5.11
 3  2.15 Ideal G     SI2      62.6  54   18791  8.29  8.35  5.21
 4  2.05 Ideal G     SI1      61.9  57   18787  8.1   8.16  5.03
 5  1.6  Ideal F     VS1      62    56   18780  7.47  7.52  4.65
 6  2.06 Ideal I     VS2      62.2  55   18779  8.15  8.19  5.08
 7  1.71 Ideal G     VVS2     62.1  55   18768  7.66  7.63  4.75
 8  2.08 Ideal H     SI1      58.7  60   18760  8.36  8.4   4.92
 9  2.03 Ideal G     SI1      60    55.8 18757  8.17  8.3   4.95
10  2.61 Ideal I     SI2      62.1  56   18756  8.85  8.73  5.46
# ℹ 53,930 more rows

The above code does the opposite of the previous example: it starts with the highest quality cut (Ideal) and works downwards and, within each cut, sorts price from highest to lowest.

diamonds %>% 
  arrange(price) %>% 
  arrange(clarity)

# A tibble: 53,940 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 2  0.32 Good    D     I1       64      54   361  4.33  4.36  2.78
 3  0.31 Premium F     I1       62.9    59   394  4.33  4.29  2.71
 4  0.34 Ideal   D     I1       62.5    57   413  4.47  4.49  2.8 
 5  0.34 Ideal   D     I1       61.4    55   413  4.5   4.52  2.77
 6  0.32 Premium E     I1       60.9    58   444  4.42  4.38  2.68
 7  0.43 Premium H     I1       62      59   452  4.78  4.83  2.98
 8  0.41 Good    G     I1       63.8    56   467  4.7   4.74  3.01
 9  0.32 Good    D     I1       64      54   468  4.36  4.33  2.78
10  0.32 Ideal   E     I1       60.7    57   490  4.45  4.41  2.69
# ℹ 53,930 more rows

The above code arranges the diamonds by clarity, starting with the lowest grade (I1) and ascending, and within each clarity grade by price from lowest to highest.

diamonds %>% 
  mutate(saleprice = price - 250)

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z saleprice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

The above code has created a new variable named saleprice that reflects a $250 discount off the original price of each diamond.
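
Note that mutate() returns a new table rather than changing diamonds itself. A short sketch, assuming dplyr and ggplot2 are installed, of keeping the result by assigning it to a new object (the name diamonds_sale is made up for illustration):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

diamonds_sale <- diamonds %>% 
  mutate(saleprice = price - 250) # result saved under a new name

"saleprice" %in% names(diamonds)      # FALSE: the original dataset is unchanged
"saleprice" %in% names(diamonds_sale) # TRUE: the new object has the extra column
```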

diamonds %>% 
  select(-x,-y,-z)

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

The above code removes the x, y and z variables from the output. The diamonds dataset itself is not changed.
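
The same result can be reached in a couple of other ways. The sketch below assumes dplyr and ggplot2 are installed; the names dropped and kept are made up for illustration:

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

dropped <- diamonds %>% 
  select(-c(x, y, z)) # drop x, y and z in one go

kept <- diamonds %>% 
  select(carat:price) # keep every column from carat through to price

identical(names(dropped), names(kept)) # TRUE: both leave the same seven columns
```

Dropping columns is shorter when only a few need to go, while a range such as carat:price is easier to read when the columns to keep sit next to each other.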

diamonds %>% 
  group_by(cut) %>% 
  summarise(number = n()) %>% 
  ungroup()

# A tibble: 5 × 2
  cut       number
  <ord>      <int>
1 Fair        1610
2 Good        4906
3 Very Good  12082
4 Premium    13791
5 Ideal      21551

The above code counts the number of diamonds for each cut value.
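
dplyr also provides count() as a shortcut for this group_by(), summarise(n()) and ungroup() pattern. A minimal sketch, assuming dplyr and ggplot2 are installed (the name cut_counts is made up for illustration):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

cut_counts <- diamonds %>% 
  count(cut, name = "number") # one row per cut, with the count in a column called number

cut_counts
```

Without the name argument the count column would be called n.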

diamonds %>% 
  mutate(totalNum = n()) # n() counts the rows; with no grouping, that is the whole dataset

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalNum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows

The above code has created a new column named totalNum that holds the total number of diamonds. Because the data are not grouped, every row shows the same value (53,940).
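
If only the total itself is needed, rather than a copy of it on every row, it can be returned as a single value. A short sketch, assuming dplyr and ggplot2 are installed (the names total and total_tbl are made up for illustration):

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

total <- nrow(diamonds) # total number of rows as a plain number

total_tbl <- diamonds %>% 
  summarise(totalNum = n()) # the same total as a one-row table

total
total_tbl
```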