R Studio/Quarto Workbook

Author

Oliver Barnes

Quarto Intro

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

1 + 1

[1] 2

You can add options to executable code like this

[1] 4

The echo: false option disables the printing of code (only output is displayed).

Warning

Make sure the working directory is set up correctly.
Files have to be loaded from the same place the file is saved.
To set working directory use: setwd(“PATH/directoryname”)

Loading Packages

To load a package within RStudio use the following sequence within a code chunk:

#| label: load packages
#| include: true
#| warning: false 
library(palmerpenguins)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

(Note: “#|” denotes the header for a chunk and establishes line by line the output functions)

Palmer Penguins Dataset

Simply starting a new code chunk and typing in the word “penguins” will open up a table of the Palmer penguins data, like so:

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

%>% {This is the ‘pipe’ symbol and orders R to perform an action/function based on a specific input e.g. a dataset}. So penguins %>% glimpse commands R to ‘glimpse’/view the penguins dataset giving as much information as possible, see below. (The pipe %>% is contained within tidyverse, specifically in a package called magrittr)

penguins %>% glimpse

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Inserting an image

Inserting an image in to a Quarto workbook via RStudio is very easy. Simply switch from ‘Source’ to ‘Visual’ mode, click on”insert”, “Figure/Image”, copy and paste the URL address of the image then click “OK”. Here’s a nice photograph of Bob Marley smiling:`

Click here to view source website

Embedding a video

Embedding a video in to Quarto is also very straight forward. Simply right click on the video you wish to embed, click “copy embed code”, paste it in the source pane/script editor and click Render (no need to add it in to a code chunk):

Data Analysis Outputs

There are many ways to display data analyses outputs in RStudio and even more ways to manipulate and scrutinize the data. Below are just two examples of visual representations for data analyses (histogram and scatterplot).

Histogram (body mass)

Scatterplot (bill length vs bill depth)

Week 3 - Pre-Session - Tidyverse

The tidyverse package is actually a “package of other packages” (dplyr, ggplot2, etc.) and these are shown when Tidyverse is loaded.Packages such as tidyverse must be loaded at the start of every new RStudio session due to naming conflicts between other packages/package functions e.g., filter(): which has one function within the dplyr package and one within the stats package.You can also choose to load individual functions from a package instead of whole libraries, e.g.to load only the filter function from dplyr you can use “dplyr::filter()”

The “diamonds” dataset

Here we are going to be working with the “diamonds” dataset. This dataset is built into ggplot2 and therefore included in tidyverse.

view(diamonds)

The above command opens the diamonds dataset in a separate tab.

To view the structure/type of each variable in the diamonds dataset run “str(diamonds)” as below:

str(diamonds)

tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

As we can see there are 53,940 entries (individual diamonds) and 10 variables to describe each entry.

With built-in datasets such as “diamonds” running a simple query via (?diamonds) will provide further information on the dataset:

Prices of over 50,000 round cut diamonds.

Description: A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

Usage diamonds Format A data frame with 53940 rows and 10 variables:

-price price in US dollars ($326–$18,823)

-carat weight of the diamond (0.2–5.01)

-cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)

-color diamond colour, from D (best) to J (worst)

-clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

-x length in mm (0–10.74)

-y width in mm (0–58.9)

-z depth in mm (0–31.8)

-depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

-table width of top of diamond relative to widest point (43–95)

Data Management

6.1 mutate():

mutate() adds new columns or modifies current variables in a dataset

E.g. to create three new variables within the diamonds dataset:

Variable name: JustOne, all values: 1
Variable name: Values, all values: something
Variable name: Simple, all values: TRUE

I would use:

diamonds %>% mutate(JustOne = 1, Values = "something", Simple = TRUE)

# A tibble: 53,940 × 13
   carat cut    color clarity depth table price     x     y     z JustOne Values
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl> <chr> 
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43       1 somet…
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31       1 somet…
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31       1 somet…
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63       1 somet…
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75       1 somet…
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48       1 somet…
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47       1 somet…
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53       1 somet…
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49       1 somet…
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39       1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>

mutate() can also be used to create new variables based on existing variables from the dataset.

E.g. to create a new column detailing a 20% reduction in price for each diamond I would use:

diamonds.new <-
diamonds %>%
  mutate(price20percoff = price * 0.80)

Tip

These added columns/variables have now been saved to the original dataset using (<-) above.

This line of code commands R to save new variables as an ‘object’.

6.1.1 Nesting Functions

We can also use other functions inside mutate() to create our new variable(s). For example, we might use the mean(), standard deviation (sd()) and/or median() function(s) to calculate the average price value for all diamonds in the dataset. This is called nesting i.e., where one function e.g. mean() “nests” inside another function mutate().

E.g. diamonds %>% mutate(m = mean(price), sd = sd(price), med = median(price))

diamonds %>% 
  mutate(m = mean(price),     
         sd = sd(price),      
         med = median(price))

# A tibble: 53,940 × 13
   carat cut       color clarity depth table price     x     y     z     m    sd
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933. 3989.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933. 3989.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933. 3989.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933. 3989.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933. 3989.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933. 3989.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933. 3989.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933. 3989.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933. 3989.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>

Note: The values in these new columns will be the same for every row because R takes all of the values in price to calculate the mean/standard deviation/median.

diamonds %>%
  mutate(totweight = sum(carat))

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totweight
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    43041.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    43041.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    43041.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    43041.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    43041.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    43041.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    43041.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    43041.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    43041.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    43041.
# ℹ 53,930 more rows

The above formula has calculated the total weight of all diamonds i.e. sum of all carats

6.1.1.1 Recode

Recode modifies the values within a variable like so:

data %>% mutate(Variable = recode(Variable, “old value” = “new value”))

cut.new <- 
diamonds %>% 
  mutate(cut.new = recode(cut,  "Ideal" = "IDEAL"))

This has simply changed the column “Ideal” to all caps i.e. “IDEAL”. Of course R can perform far more complex recoding functions, but this is the basic idea.

Remember to save the changes within the code chunk:

cut.new <- diamonds %>% mutate(cut.new = recode(cut, “Ideal” = “IDEAL”))

6.2 Summarize

Summarize() collapses all rows and returns a one-row summary. R will recognize both the British and American spelling (summarise/summarize).

diamonds %>%
summarize(avg.price = mean(price))

# A tibble: 1 × 1
  avg.price
      <dbl>
1     3933.

Here above we see the average (mean) price of all the diamonds within the dataset.

As with the mutate() function, multiple operations can be performed simultaneously within sumarize() as can nesting, like so:

diamonds %>%
  summarize(avg.price = mean(price),
            random.add = 1 + 2, # math operation without an existing variable
            avg.carat = mean(carat),
            stdev.price = sd(price)) # calculating the standard deviation

# A tibble: 1 × 4
  avg.price random.add avg.carat stdev.price
      <dbl>      <dbl>     <dbl>       <dbl>
1     3933.          3     0.798       3989.

6.3 group_by() and ungroup()

This function takes existing data and groups specific variables together for future operations. Many operations are performed on groups. After running some data see here based on males and females, their ages, and their scores in a separate R Script (i.e. not my Quarto workbook qmd file), the output came in the Environment Pane.

Thereafter, applying the following formula in the same script gave me a small table detailing mean(score), sd(score), and n(number of subjects) for girls and boys separately:

So by grouping we get to see the summaries of individual variables within the dataset!

Now lets group by Sex and Age (note below that entering “Sex” first and then “Age” in group_by() will display these variables in the table in that order from left to right, and vice versa):

This has given us 27 rows in total. R has considered every unique combination of Sex and Age in the dataset.

The missing std dev (s) values are because there is only 1 observtion for that group, while std dev requires at least 2.

Warning

If ungroup() is left out at the end of a code chunk wishing to group variables i.e. group_by(), this will leave the grouping in place and likely lead to errors in further calculations.

Always use ungroup() after group_by().

6.4 Filter

filter() only retains specific rows of data that meet the specified requirement(s), e.g.

To only display data from the diamonds dataset that have a cut value of Fair:

diamonds %>% filter(cut == "Fair")

# A tibble: 1,610 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 1,600 more rows

Only display data from diamonds that have a cut value of Fair or Good and a price at or under $600 (notice how the or statement is obtained with | while the and statement is achieved through a comma):

diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600)

# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

The following code would require the cut be Fair and Good (for which none exists):

diamonds %>%
  filter(cut == "Fair", cut == "Good", price <= 600)

# A tibble: 0 × 10
# ℹ 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
#   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>

Only display data from diamonds that do not have a cut value of Fair:

diamonds %>% filter(cut != "Fair")

# A tibble: 52,330 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
10  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73
# ℹ 52,320 more rows

6.5 Select

Function: Selects only the columns (variables) that you want to see. Gets rid of all other columns. You can refer to the columns by the column position (first column) or by name. The order in which you list the column names/positions is the order that the columns will be displayed.

In the diamonds dataset, only retain the first 4 columns:

diamonds %>% select(1:4)

# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows

Retain all of the columns except for cut and color:

diamonds %>% select(-cut, -color)

# A tibble: 53,940 × 8
   carat clarity depth table price     x     y     z
   <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

You can also retain all of the columns, but rearrange some of the columns to appear at the beginning—this moves the x,y,z variables to the first 3 columns:

diamonds %>% select(x,y,z, everything())

# A tibble: 53,940 × 10
       x     y     z carat cut       color clarity depth table price
   <dbl> <dbl> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  3.95  3.98  2.43  0.23 Ideal     E     SI2      61.5    55   326
 2  3.89  3.84  2.31  0.21 Premium   E     SI1      59.8    61   326
 3  4.05  4.07  2.31  0.23 Good      E     VS1      56.9    65   327
 4  4.2   4.23  2.63  0.29 Premium   I     VS2      62.4    58   334
 5  4.34  4.35  2.75  0.31 Good      J     SI2      63.3    58   335
 6  3.94  3.96  2.48  0.24 Very Good J     VVS2     62.8    57   336
 7  3.95  3.98  2.47  0.24 Very Good I     VVS1     62.3    57   336
 8  4.07  4.11  2.53  0.26 Very Good H     SI1      61.9    55   337
 9  3.87  3.78  2.49  0.22 Fair      E     VS2      65.1    61   337
10  4     4.05  2.39  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

6.6 Arrange

Function: Allows you to arrange values within a variable in ascending or descending order (if that is applicable to your values). This can apply to both numerical and non-numerical values, e.g.to arrange colour by alphabetical order (A to Z):

diamonds %>% arrange(color)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Very Good D     VS2      60.5    61   357  3.96  3.97  2.4 
 2  0.23 Very Good D     VS1      61.9    58   402  3.92  3.96  2.44
 3  0.26 Very Good D     VS2      60.8    59   403  4.13  4.16  2.52
 4  0.26 Good      D     VS2      65.2    56   403  3.99  4.02  2.61
 5  0.26 Good      D     VS1      58.4    63   403  4.19  4.24  2.46
 6  0.22 Premium   D     VS2      59.3    62   404  3.91  3.88  2.31
 7  0.3  Premium   D     SI1      62.6    59   552  4.23  4.27  2.66
 8  0.3  Ideal     D     SI1      62.5    57   552  4.29  4.32  2.69
 9  0.3  Ideal     D     SI1      62.1    56   552  4.3   4.33  2.68
10  0.24 Very Good D     VVS1     61.5    60   553  3.97  4     2.45
# ℹ 53,930 more rows

Or, to arrange price by numerical order (lowest to highest):

diamonds %>% arrange(price)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Or, to arrange price in descending numerical order:

diamonds %>% arrange(desc(price))

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

6.6.1 Exercises

Problem A

library(tidyverse) 
midwest %>% # utilises the midwest dataset 
 group_by(state) %>% # implies that the following commands are to be grouped by state for that state
  summarize(poptotalmean = mean(poptotal), # display the mean of the poptotal column for that state
popmax = max(poptotal), #display the maximum value of poptotal for that state
popmin = min(poptotal), #display the minimum value of poptotal for that state 
popdistinct = n_distinct(poptotal), # displays the number of unique values in poptotal for each state popfirst = first(poptotal), #display the first value in poptotal 
popany = any(poptotal < 5000), #split the popany column in to true and false values i.e. if less than 5000 = true if more = false 
popany2 = any(poptotal > 2000000)) %>% #same as above i.e. true if more than 2m false if less 
ungroup() #ungroup the dataset

# A tibble: 5 × 7
  state poptotalmean  popmax popmin popdistinct popany popany2
  <chr>        <dbl>   <int>  <int>       <int> <lgl>  <lgl>  
1 IL         112065. 5105067   4373         101 TRUE   TRUE   
2 IN          60263.  797159   5315          92 FALSE  FALSE  
3 MI         111992. 2111687   1701          83 TRUE   TRUE   
4 OH         123263. 1412140  11098          88 FALSE  FALSE  
5 WI          67941.  959275   3890          72 TRUE   FALSE

Problem B

midwest %>% # do the following to the midwest dataset
  group_by(state) %>% # group all other variables according to the state variable 
  summarize(num5k = sum(poptotal < 5000), # summarize the num5k column according to the no of entries with poptotal < 5000
            num2mil = sum(poptotal > 2000000), # summarize the num2mil column according to the no of entries with poptotal > 2,000,000
            numrows = n()) %>% # summarize the total no of rows
  ungroup() # ungroup variables

# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem C

midwest %>% 
  group_by(county) %>% 
  summarize(x = n_distinct(state)) %>% # summarise object x, defined as the number of unique states within the midwest dataset
  arrange(desc(x)) %>% # arrange x in descending order i.e. from most common state to least
  ungroup()

# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% # summarize x by displaying the number of entries for each county
  ungroup()

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Problem D

diamonds %>% 
  group_by(clarity) %>% 
  summarize(a = n_distinct(color), # show the no of unique colours
            b = n_distinct(price), # show the no of unique prices
            c = n()) %>% # show the total no of variables
  ungroup()

# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

diamonds %>% 
  group_by(color, cut) %>% # display every unique combination of colour and cut [note the comma between variables]
  summarize(m = mean(price), # summarize the mean/average price of each combination
            s = sd(price)) %>% # show the sd of each combination
  ungroup()

`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.

# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

Problem F

diamonds %>% 
  group_by(cut) %>% 
  summarize(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>% 
  ungroup()

# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

diamonds %>% 
  group_by(color) %>% # group all unique variations of colour
  summarize(m = mean(price)) %>% # summarize the mean price of each colour group
  mutate(x1 = str_c("Diamond color ", color), 
         x2 = 5) %>% # create 2 new columns - x1 with values of "Diamond color"+ color & x2 with the value 5 in every row
  ungroup()

# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

Problem H

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% # create a new column: x1, equal to the price of each color x 0.5
  summarize(m = mean(x1)) %>% # summarize the mean of x1 for each colour {m}. This summary excludes the original x1 column and replaces it with column m
  ungroup()

# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% # add a column equal to half of the price called x1
  ungroup() %>%  
  summarize(m = mean(x1)) # summarize the mean values of x1

# A tibble: 1 × 1
      m
  <dbl>
1 1966.

Since the above ungroups before printing the summary, the output isn’t separated into individual colors and instead gives the mean of all colours combined.

Why is grouping data neccessary?

Grouping data is necessary as it allows statistical analysis to be carried out on specific and targeted groups/variables within the dataset

Why is ungrouping neccessary?

Ungrouping is necessary as it disables the data from being grouped in the same fashion next time it is used

When should you ungroup data?

You should ungroup data as soon as you no longer need or want the data to be grouped

If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

No, no need

6.7 Extra Practice

View(diamonds) # opens the diamonds dataset in a new window

arrange(diamonds, price) # this displays the data with price ascending from lowest to highest

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

diamonds %>% arrange(price) #this does exactly the same as above

diamonds %>% 
  arrange(price) %>%
  arrange(cut)

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows

The above formula arranges the data according to lowest price (ascending) within each cut (also ascending in terms of quality)

diamonds %>% 
  arrange(desc(price)) %>% 
  arrange(desc(cut))

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.51 Ideal G     IF       61.7  55   18806  7.37  7.41  4.56
 2  2.07 Ideal G     SI2      62.5  55   18804  8.2   8.13  5.11
 3  2.15 Ideal G     SI2      62.6  54   18791  8.29  8.35  5.21
 4  2.05 Ideal G     SI1      61.9  57   18787  8.1   8.16  5.03
 5  1.6  Ideal F     VS1      62    56   18780  7.47  7.52  4.65
 6  2.06 Ideal I     VS2      62.2  55   18779  8.15  8.19  5.08
 7  1.71 Ideal G     VVS2     62.1  55   18768  7.66  7.63  4.75
 8  2.08 Ideal H     SI1      58.7  60   18760  8.36  8.4   4.92
 9  2.03 Ideal G     SI1      60    55.8 18757  8.17  8.3   4.95
10  2.61 Ideal I     SI2      62.1  56   18756  8.85  8.73  5.46
# ℹ 53,930 more rows

The above table displays the opposite of the previous table i.e. highest price (and descending) of the highest quality cut (also descending)

diamonds %>% 
  arrange(price) %>% 
  arrange(clarity)

# A tibble: 53,940 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 2  0.32 Good    D     I1       64      54   361  4.33  4.36  2.78
 3  0.31 Premium F     I1       62.9    59   394  4.33  4.29  2.71
 4  0.34 Ideal   D     I1       62.5    57   413  4.47  4.49  2.8 
 5  0.34 Ideal   D     I1       61.4    55   413  4.5   4.52  2.77
 6  0.32 Premium E     I1       60.9    58   444  4.42  4.38  2.68
 7  0.43 Premium H     I1       62      59   452  4.78  4.83  2.98
 8  0.41 Good    G     I1       63.8    56   467  4.7   4.74  3.01
 9  0.32 Good    D     I1       64      54   468  4.36  4.33  2.78
10  0.32 Ideal   E     I1       60.7    57   490  4.45  4.41  2.69
# ℹ 53,930 more rows

The above arranges the diamonds from lowest to highest price starting with the lowest grade clarity (and ascending)

diamonds %>% 
  mutate(saleprice = price - 250)

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z saleprice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

The above formula has created a new variable named salePprice that reflects a $250 discount off the original cost of each diamond

diamonds %>% 
  select(-x,-y,-z)

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

The above formula removed the x, y and z variables from the dataset

diamonds %>% 
  group_by(cut) %>% 
  summarise(number = n()) %>% 
  ungroup()

# A tibble: 5 × 2
  cut       number
  <ord>      <int>
1 Fair        1610
2 Good        4906
3 Very Good  12082
4 Premium    13791
5 Ideal      21551

The above formula has determined the no of diamonds there are for each cut value

diamonds %>% 
  mutate(totalNum = sum(n()))

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalNum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows

The above formula has created a new column named totalNum that calculates the total number of diamonds.

Week 4 Pre-Session (R Graphics Cookbook) - Plots and Graphs how to

First of all I have viewed/loaded the “mtcars” dataset by typing in data(“mtcars”), see below. “head(mtcars)” then allows me to view/prints the data in tabular form. Well, at least the first 6 rows.

data("mtcars")
  head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Using (?mtcars):

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon [, 2] cyl Number of cylinders [, 3] disp Displacement (cu.in.) [, 4] hp Gross horsepower [, 5] drat Rear axle ratio [, 6] wt Weight (1000 lbs) [, 7] qsec 1/4 mile time [, 8] vs Engine (0 = V-shaped, 1 = straight) [, 9] am Transmission (0 = automatic, 1 = manual) [,10] gear Number of forward gears [,11] carb Number of carburetors

See Below - Scatter Plots:

plot(mtcars$wt, mtcars$mpg)

So above we can see there is a clear negative correlation between miles per gallon and car weight i.e. the heavier the car is the less mpg it provides (and vice versa), generally speaking.

The above scatter plot was created using the plotting functions of base R i.e. plotting functions that come with R Studio.

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

Above is the output from ggplot2. So as you can see a bit neater/more informative and less clutter in the axis descriptions.

See Below - Line Graphs:

Below is an example of a line graph created using ggplot2 for a generic pressure vs temperature gradient.

ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line() +
  geom_point()

See Below - Bar Graphs

Below shows how base R functions output a bar graph

# There are 11 cases of the value 4, 7 cases of 6, and 14 cases of 8

table(mtcars$cyl)


 4  6  8 
11  7 14

#> 
#>  4  6  8 
#> 11  7 14

# Generate a table of counts

barplot(table(mtcars$cyl))

Below shows how ggplot2 produces a bar graph:

library(ggplot2)

# Bar graph of values. This uses the BOD data frame, with the
# "Time" column for x values and the "demand" column for y values.
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_col()

Above treats the Y-Axis (Time) as a continuous variable, while below shows the difference in inputting time as a discrete value:

# Convert the x variable to a factor, so that it is treated as discrete
ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_col()

ggplot2 can also be used to plot the count of the number of data rows in each category by using geom_bar() instead of geom_col(). Once again, notice the difference between a continuous x-axis and a discrete one. For some kinds of data, it may make more sense to convert the continuous x variable to a discrete one, with the factor() function.

# Bar graph of counts This uses the mtcars data frame, with the "cyl" column for
# x position. The y position is calculated by counting the number of rows for
# each value of cyl.
ggplot(mtcars, aes(x = cyl)) +
  geom_bar()

# Bar graph of counts
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

See Below - Histograms

hist(mtcars$mpg)

# Specify approximate number of bins with breaks
hist(mtcars$mpg, breaks = 10)

Histograms with base graphics (left); With more bins. Notice that because the bins are narrower, there are fewer items in each bin. (right).

library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`

# With wider bins
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 4)

When you create a histogram without specifying the bin width, ggplot() prints out a message telling you that it’s defaulting to 30 bins, and to pick a better bin width. This is because it’s important to explore your data using different bin widths; the default of 30 may or may not show you something useful about your data.

See Below - Box Plots:

To make a box plot, use plot() and pass it a factor of x values and a vector of y values. When x is a factor (as opposed to a numeric vector), it will automatically create a box plot:

library(ggplot2)
ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot()

It’s also possible to make box plots for multiple variables, by combining the variables with interaction():

ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
  geom_boxplot()

Week 4 - Online Session

Density Plot

Simple:

penguins%>%
group_by(species)%>%
  ggplot(aes(x=bill_length_mm, colour = species, fill = species))+geom_density()

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

More complicated:

penguins %>% 
  na.omit() %>% 
  pivot_longer(bill_length_mm:body_mass_g, names_to = "trait") %>% 
  ggplot(aes(x=value,
         group=species,
         fill=species,
         color=species))+
  geom_density(alpha=0.7)+
  facet_grid(~trait, scales = "free_x" )+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))+
  theme_minimal()

Box Plot

Bill Length…

data("penguins")
penguins %>% 
group_by(species) %>% 
  ggplot(aes(x=species, 
             y=bill_length_mm, 
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.5)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Histogram

Flipper Length..

data("penguins")
penguins %>% 
group_by(species) %>% 
  ggplot(aes(x=flipper_length_mm, color=species, fill=species))+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

Bar chart

Individuals in each species count…

penguins %>% 
  ggplot(aes(x=species,
             color=species, 
             fill=species))+
  geom_bar(alpha=0.5)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

count by year…

penguins %>% 
  ggplot(aes(x=year,color=species, fill=species))+
  geom_bar(position = "dodge")+
  theme(axis.text = element_text(size=16),axis.title=element_text(size=16))

count per island…

penguins %>% 
  ggplot(aes(x=island,
             color=species, 
             fill=species))+
  geom_bar()+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Note above: for each species count the total count of each bar for each species i.e. Dream Island has ca. 55 Adelie penguins, not 125

Scatter Plot (Geom Point) with line of best fit

penguins %>% 
  ggplot(aes(x=bill_length_mm, y = bill_depth_mm))+
  geom_point()+
  geom_smooth(method = "lm")+
  theme(axis.text=element_text(size=16),axis.title=element_text(size=16))

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

This is for all individuals within all 3 species combined

penguins %>% 
  ggplot(aes(x=bill_length_mm, 
             y = bill_depth_mm,color=species, 
             fill=species))+geom_point()+geom_smooth(method = "lm",se=FALSE)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Separated by species - clear positive relationship between bill depth and bill length i.e. as one increases so does the other.

Box plot

penguins %>% 
  na.omit() %>% 
  ggplot(aes(x=sex, y = body_mass_g,color=species, fill=species))+
  geom_boxplot(alpha=0.7)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Body mass grouped by sex (above).

penguins %>% 
  na.omit() %>% 
  ggplot(aes(x=species, 
             y = body_mass_g,
             color=sex, 
             fill=sex))+
  geom_boxplot(alpha=0.7)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Body mass grouped by species.

QUESTION - Can Body mass predict bill length?

How to: Need to find relationship between body mass and bill length

Scatter-plot with linear regression will help to do this:

library(ggplot2)
penguins%>%
  na.omit() %>% 
  ggplot(aes(x=body_mass_g, y = bill_length_mm,color=species,))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+ theme(axis.text=element_text(size=16),axis.title=element_text(size=16))+
  labs(title = "Body mass vs bill length by Species")

`geom_smooth()` using formula = 'y ~ x'

Positive Correlation: Body mass and bill length are positively correlated for all three species. Bill length tends to rise in tandem with body bulk.

Variations by Species: When compared to the other two species, the Gentoo typically has the largest body mass, although has a relatively similar bill length to Chinstrap penguins.

Adelie (red): The body bulk and length of the bill are typically the lowest amongst all species in this species, although body mass can be compared somewhat to chinstrap penguins.

Conclusion

Body mass can be used to predict bill length, according to the scatterplot with regression lines; the strength of this association varies slightly between species. There may be some variance around the trend lines, though, so this relationship may not be entirely linear.

QUESTION 2:

Does sex explain flipper length?

To answer this we need to find significant difference between flipper length between male and female penguins.

library(ggplot2)
penguins%>%
  na.omit() %>% 
  ggplot(aes(x=sex, y =flipper_length_mm,fill=species))+
  geom_boxplot()+
  theme(axis.text=element_text(size=16),axis.title=element_text(size=16))+
  labs(title = "Flipper length by sex and species")

Discussion

Gentoo penguins (blue) have the longest flipper lengths overall, regardless of sex.

Chinstrap penguins (green) have intermediate flipper lengths.

Adelie penguins (red) have the shortest flipper lengths.

For each species, males tend to have longer flippers compared to females, as shown by the higher median values for males.

According to the box plot, flipper length is highly influenced by both sex and species. Gentoo penguins have significantly longer flippers than other species, regardless of sex, although with males typically having longer flippers than females.

Formative Exercise

library(tidyverse)
library(modeldata)


Attaching package: 'modeldata'

The following object is masked _by_ '.GlobalEnv':

    penguins

The following object is masked from 'package:palmerpenguins':

    penguins

view(crickets)

(?crickets) run in code chunk:

Rates of Cricket Chirps

Description

These data are from from McDonald (2009), by way of Mangiafico (2015), on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species of the genus Oecanthus: O. exclamationis and O. niveus. The data are contained in a data frame called crickets with a total of 31 data points.

Source Mangiafico, S. 2015. “An R Companion for the Handbook of Biological Statistics.” https://rcompanion.org/handbook/.

McDonald, J. 2009. Handbook of Biological Statistics. Sparky House Publishing.

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

# Modifiying basic properties of the plot

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = .3,
             shape = "triangle") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

# Learn more about the options for the geom_abline()
# with ?geom_point

# Adding another layer


ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

`geom_smooth()` using formula = 'y ~ x'

Above is shown the general trend for both species combined.

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

`geom_smooth()` using formula = 'y ~ x'

Here shows trend lines by species (in colour).

# Other plots

ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue")

ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

# faceting

# not great:
ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()

Week 5 Post-Session Formative Exercise

Box Plots - T-tests (Independent and Paired), ANOVAs (One and Two-way), Mann-Whitney U test, Kruskal-Wallis test, Mood’s Median test (all comparing medians), Levene’s test (variance test), and Outlier Detection tests e.g. Tukey’s Fences are all tests associated with box plots. All these tests can be used to determine the significance of some aspect of the data illustrated by a box plot. For example median tests all compare medians across groups which is explicitly shown by a box plot. Meanwhile, variance tests assess the significance of variances of data across different group/variables which is shown through the interquartile range (IQR) and whiskers in a box plot.

Density Plots - Normality tests e.g. Shapiro-Wilk test, Two-Sample tests e.g. Mann-Whitney U test, Goodness-of-Fit tests e.g. Chi-Square and Bandwidth Selection tests e.g. Cross-Validation tests are all tests associated with density plots. As with box plots, density graphs display visual, data distribution information and are often used in more advanced statistical analysis, such as model fitting, hypothesis testing, and comparison of multiple distributions, as they provide a detailed and flexible representation of the data.The statistical tests associated with density plots are used to analyse specific properties of the data distribution e.g. Normality tests assess whether the data follow a normal distribution while Two-Sample tests evaluate whether two distributions differ from each other.

Scatter Plots - Correlation tests e.g. Pearson’s Correlation Coefficient or Spearman’s Rank Correlation, Regression Analysis e.g. Simple Linear Regression or Multiple Linear Regression, Hypothesis Tests for Regression e.g. Regression Coefficient t-test or ANOVA for Regression, Outlier Detection tests e.g. Grubbs’ test and Non-Parametric tests for Relationships e.g. Kruskal-Wallis test are all statistical tests associated with scatter plots. Scatter plots are used to visualise the relationship between at least two continuous variables and the associated statistical tests can quantify and evaluate these relationships in the data in more depth. For example, correlation tests assess and then report the strength and direction of any linear relationship between the two or more variables, while regression analysis is used to model the relationship between variables. It is worth noting here that there are very many variations to correlation tests and linear models.

Bar graphs - Chi-Square Tests e.g. chi-Square Test of Independence, T-tests (independent and paired), ANOVAs (One and Two-Way), G-tests i.e. the Likelihood Ratio Test, Contingency Tables, Proportion tests e.g. Z-test or Fisher’s Exact Test, Log-linear Models e.g. Saturated Model and Regression Analysis are all statistical tests associated with bar graphs. Bar graphs are typically used to display and compare categorical data, often showing the frequency or proportion of different groups, and hence associated stats tests generally focus on comparing group means, proportions, or distributions across categories. For example, Chi-square tests compare categorical data to test the independence or goodness-of-fit between groups, whereas log-linear models are used to analyse relationships (in categorical data) presented in contingency tables. Specifically, log-linear models calculate the expected frequencies of multi-way tables (tables with two or more categorical variables) by using the natural logarithm of expected cell counts. G-tests on the other hand are used to determine if there are significant differences between observed and expected frequencies in categorical data.

data ("iris")
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

(?iris) = This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

iris %>%
  glimpse()

Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…

data(iris)

library(ggplot2)

ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_boxplot(fill = "white") +
  labs(title = "Sepal Length by Species", x = "Species", y = "Sepal Length (cm)") +
  theme(plot.title = element_text(hjust = 0.5))

library(ggplot2)

data(iris)
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) +
  labs(title = "Petal Length by Species", x = "Petal Length (cm)", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5))

data(iris)

library(ggplot2)

# create a scatter graph of petal length vs petal width
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(aes(shape = Species, color = Species), size = 2) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Petal Length vs Petal width",
       x = "Petal Length (cm)",
       y = "Petal Width (cm)") +
  theme(plot.title = element_text(hjust = 0.5))

`geom_smooth()` using formula = 'y ~ x'

data(iris)

library(ggplot2)

iris$Size <-
ifelse(iris$Sepal.Length < 
         median(iris$Sepal.Length),
"small", "big")

ggplot(iris, aes(x = Species, fill = Size)) +
  geom_bar(position = "dodge") +
  labs(title = "Counts of Species by Size (of Sepal Length)",
       x = "Species", y = "Count", fill = "Size") +
  theme(plot.title = element_text(hjust = 0.5))

Week 6 Post-Session Formative Exercise

Let’s load dplyr, tidyr, and ggplot2(for the sample data):

library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr)

# get the total number of cars with each class & cyl combination using group_by() and summarize().
mpg%>%
group_by(class, cyl)%>%
  summarize(n=n())%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	cyl	n
2seater	8	5
compact	4	32
compact	5	2
compact	6	13
midsize	4	16
midsize	6	23
midsize	8	2
minivan	4	1
minivan	6	10
pickup	4	3
pickup	6	10
pickup	8	20
subcompact	4	21
subcompact	5	2
subcompact	6	7
subcompact	8	5
suv	4	8
suv	6	16
suv	8	38

# Turn our summary data into a crosstab or contingency table. We need variable A (class) to be listed by row, and variable B (cyl) to be listed by column.We can achieve this by including the spread() command, to create columns for each cyl value, with n as the crosstab response value.
mpg%>%
  group_by(class, cyl)%>%
  summarise(n=n())%>%
  spread(cyl, n)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	5
compact	32	2	13	NA
midsize	16	NA	23	2
minivan	1	NA	10	NA
pickup	3	NA	10	20
subcompact	21	2	7	5
suv	8	NA	16	38

# Instead of displaying frequencies, we can get the average number of city miles by class & cyl
mpg%>%
  group_by(class, cyl)%>%
  summarise(mean_cty=mean(cty))%>%
  spread(cyl, mean_cty)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	15.40000
compact	21.37500	21	16.92308	NA
midsize	20.50000	NA	17.78261	16.00000
minivan	18.00000	NA	15.60000	NA
pickup	16.00000	NA	14.50000	11.80000
subcompact	22.85714	20	17.00000	14.80000
suv	18.00000	NA	14.50000	12.13158

# Or max number of city miles by class & cyl
mpg%>%
  group_by(class, cyl)%>%
  summarise(max_cty=max(cty))%>%
  spread(cyl, max_cty)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	16
compact	33	21	18	NA
midsize	23	NA	19	16
minivan	18	NA	17	NA
pickup	17	NA	16	14
subcompact	35	20	18	15
suv	20	NA	17	14

# find proportions by creating a new, calculated variable dividing row frequency by table frequency.
mpg%>%
  group_by(class)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))%>%   # our new proportion variable
  kable()

class	n	prop
2seater	5	0.0213675
compact	47	0.2008547
midsize	41	0.1752137
minivan	11	0.0470085
pickup	33	0.1410256
subcompact	35	0.1495726
suv	62	0.2649573

# create a contingency table of proportion values by applying the same spread command as before. Vary the group_by() and spread() arguments to produce proportions of different variables.
mpg%>%
  group_by(class, cyl)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))%>%
  subset(select=c("class","cyl","prop"))%>%   #drop the frequency value
  spread(class, prop)%>%
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

cyl	2seater	compact	midsize	minivan	pickup	subcompact	suv
4	NA	0.6808511	0.3902439	0.0909091	0.0909091	0.6000000	0.1290323
5	NA	0.0425532	NA	NA	NA	0.0571429	NA
6	NA	0.2765957	0.5609756	0.9090909	0.3030303	0.2000000	0.2580645
8	1	NA	0.0487805	NA	0.6060606	0.1428571	0.6129032

#table() is a quick way to pull together row/column frequencies and proportions for categorical variables
# Using the basic table() command, we can get a contingency table of vehicle class by number of cylinders.
table(mpg$class, mpg$cyl)

            
              4  5  6  8
  2seater     0  0  0  5
  compact    32  2 13  0
  midsize    16  0 23  2
  minivan     1  0 10  0
  pickup      3  0 10 20
  subcompact 21  2  7  5
  suv         8  0 16 38

#The table frequency can also be called by using the ftable() command.
mpg_table<- table(mpg$class, mpg$cyl) #define object w/table parameters for simple calling
ftable(mpg_table)

             4  5  6  8
                       
2seater      0  0  0  5
compact     32  2 13  0
midsize     16  0 23  2
minivan      1  0 10  0
pickup       3  0 10 20
subcompact  21  2  7  5
suv          8  0 16 38

# For row frequencies, we use the margin.table() command, with the 1 argument.
margin.table(mpg_table, 1)


   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62

#For column frequencies, we use the margin.table() command, with the 2 argument
margin.table(mpg_table, 2)


 4  5  6  8 
81  4 79 70

#We can get the proportion values for our variable combinations as well.

# For proportion of the entire table, we use the prop.table() command.
prop.table(mpg_table)

            
                       4           5           6           8
  2seater    0.000000000 0.000000000 0.000000000 0.021367521
  compact    0.136752137 0.008547009 0.055555556 0.000000000
  midsize    0.068376068 0.000000000 0.098290598 0.008547009
  minivan    0.004273504 0.000000000 0.042735043 0.000000000
  pickup     0.012820513 0.000000000 0.042735043 0.085470085
  subcompact 0.089743590 0.008547009 0.029914530 0.021367521
  suv        0.034188034 0.000000000 0.068376068 0.162393162

# For row proportions, we use the prop.table() command, with the 1 argument following the table name.
prop.table(mpg_table, 1)

            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 1.00000000
  compact    0.68085106 0.04255319 0.27659574 0.00000000
  midsize    0.39024390 0.00000000 0.56097561 0.04878049
  minivan    0.09090909 0.00000000 0.90909091 0.00000000
  pickup     0.09090909 0.00000000 0.30303030 0.60606061
  subcompact 0.60000000 0.05714286 0.20000000 0.14285714
  suv        0.12903226 0.00000000 0.25806452 0.61290323

# For column proportions, we use the prop.table() command, with the 2 argument following the table name.

prop.table(mpg_table, 2)

            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 0.07142857
  compact    0.39506173 0.50000000 0.16455696 0.00000000
  midsize    0.19753086 0.00000000 0.29113924 0.02857143
  minivan    0.01234568 0.00000000 0.12658228 0.00000000
  pickup     0.03703704 0.00000000 0.12658228 0.28571429
  subcompact 0.25925926 0.50000000 0.08860759 0.07142857
  suv        0.09876543 0.00000000 0.20253165 0.54285714

library(gmodels)
CrossTable(mpg$class, mpg$cyl)


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  234 

 
             | mpg$cyl 
   mpg$class |         4 |         5 |         6 |         8 | Row Total | 
-------------|-----------|-----------|-----------|-----------|-----------|
     2seater |         0 |         0 |         0 |         5 |         5 | 
             |     1.731 |     0.085 |     1.688 |     8.210 |           | 
             |     0.000 |     0.000 |     0.000 |     1.000 |     0.021 | 
             |     0.000 |     0.000 |     0.000 |     0.071 |           | 
             |     0.000 |     0.000 |     0.000 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     compact |        32 |         2 |        13 |         0 |        47 | 
             |    15.210 |     1.782 |     0.518 |    14.060 |           | 
             |     0.681 |     0.043 |     0.277 |     0.000 |     0.201 | 
             |     0.395 |     0.500 |     0.165 |     0.000 |           | 
             |     0.137 |     0.009 |     0.056 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     midsize |        16 |         0 |        23 |         2 |        41 | 
             |     0.230 |     0.701 |     6.059 |     8.591 |           | 
             |     0.390 |     0.000 |     0.561 |     0.049 |     0.175 | 
             |     0.198 |     0.000 |     0.291 |     0.029 |           | 
             |     0.068 |     0.000 |     0.098 |     0.009 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     minivan |         1 |         0 |        10 |         0 |        11 | 
             |     2.070 |     0.188 |    10.641 |     3.291 |           | 
             |     0.091 |     0.000 |     0.909 |     0.000 |     0.047 | 
             |     0.012 |     0.000 |     0.127 |     0.000 |           | 
             |     0.004 |     0.000 |     0.043 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
      pickup |         3 |         0 |        10 |        20 |        33 | 
             |     6.211 |     0.564 |     0.117 |    10.391 |           | 
             |     0.091 |     0.000 |     0.303 |     0.606 |     0.141 | 
             |     0.037 |     0.000 |     0.127 |     0.286 |           | 
             |     0.013 |     0.000 |     0.043 |     0.085 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
  subcompact |        21 |         2 |         7 |         5 |        35 | 
             |     6.515 |     3.284 |     1.963 |     2.858 |           | 
             |     0.600 |     0.057 |     0.200 |     0.143 |     0.150 | 
             |     0.259 |     0.500 |     0.089 |     0.071 |           | 
             |     0.090 |     0.009 |     0.030 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
         suv |         8 |         0 |        16 |        38 |        62 | 
             |     8.444 |     1.060 |     1.162 |    20.403 |           | 
             |     0.129 |     0.000 |     0.258 |     0.613 |     0.265 | 
             |     0.099 |     0.000 |     0.203 |     0.543 |           | 
             |     0.034 |     0.000 |     0.068 |     0.162 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |        81 |         4 |        79 |        70 |       234 | 
             |     0.346 |     0.017 |     0.338 |     0.299 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|

ggplot2:

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Create a summary table of counts for class and cyl
data_summary <- mpg %>%
  count(class, cyl)

# Create the balloon plot
ggplot(data_summary, aes(x = class, y = factor(cyl), fill = n)) +
  geom_tile(aes(width = 0.9, height = 0.9)) +  # Create "balloons" using tiles
  scale_fill_gradient(low = "skyblue", high = "darkblue", name = "Count") +  # Color gradient for count
  labs(title = "Balloon Plot of Car Class vs. Number of Cylinders",
       x = "Car Class", 
       y = "Number of Cylinders") +
  theme_minimal() +
  theme(legend.position = "right")  # Adjust legend position

Corrplot:

# Load necessary libraries
library(ggplot2)  # for mpg dataset
library(corrplot)  # for the balloon plot

corrplot 0.95 loaded

# Create a contingency table for 'class' and 'cyl'
contingency_table <- table(mpg$class, mpg$cyl)

# Convert the table to proportions (values between 0 and 1)
prop_table <- prop.table(contingency_table)

# Set larger margins and reduce title font size (since title is removed, no need to worry about it now)
par(mar = c(7, 4, 7, 2))  # Adjust margins (top margin is set to 7 for space)

# Suppress warnings and plot the graph
suppressWarnings({
  corrplot(prop_table, 
           method = "circle",        # "circle" creates round balloons
           col = colorRampPalette(c("white", "skyblue"))(200), # Color scale
           cl.lim = c(0, 1),         # Color scale limits (proportions between 0 and 1)
           tl.cex = 1.2,             # Text size for axis labels
           cl.cex = 1.2,             # Color legend text size
           addCoef.col = NULL)       # Don't display numbers inside the balloons
})

# Load necessary libraries
library(ggplot2)  # for mpg dataset and plotting
library(dplyr)    # for data manipulation

# Calculate the average number of city miles ('cty') by 'class' and 'cyl'
avg_cty_by_class_cyl <- mpg %>%
  group_by(class, cyl) %>%
  summarise(avg_cty = mean(cty, na.rm = TRUE))

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

# Create a ggplot with circles representing average city miles
ggplot(avg_cty_by_class_cyl, aes(x = class, y = factor(cyl), size = avg_cty, color = avg_cty)) +
  geom_point(alpha = 0.7) +  # Use points (circles) with transparency
  scale_size_continuous(range = c(3, 20)) +  # Adjust circle size range
  scale_color_gradient(low = "white", high = "skyblue") +  # Color scale for fill
  labs(
    x = "Car Class",
    y = "Number of Cylinders",
    size = "Average City Miles",
    color = "Average City Miles",
    title = "Balloon Plot of Average City Miles by Class and Cylinders"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Angle the x-axis labels for readability
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )

Week 7 Post-Session Formative Exercise

if(!require(devtools)) install.packages("devtools")

Loading required package: devtools

Loading required package: usethis

devtools::install_github("kassambara/ggpubr")

Skipping install of 'ggpubr' from a github remote, the SHA1 (6aeb4f70) has not changed since last install.
  Use `force = TRUE` to force installation

One Sample T-Test (parametric) - Comparing one-sample mean to a standard known mean

set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)

head(my_data, 10)

   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2

summary(my_data$weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20

library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())

shapiro.test(my_data$weight) # => p-value = 0.6993


    Shapiro-Wilk normality test

data:  my_data$weight
W = 0.9526, p-value = 0.6993

library("ggpubr")
ggqqplot(my_data$weight, ylab = "Men's weight",
         ggtheme = theme_minimal())

# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
# Printing the results
res


    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 17.8172 20.6828
sample estimates:
mean of x 
    19.25

res$p.value

[1] 7.953383e-06

res$estimate

mean of x 
    19.25

res$conf.int

[1] 17.8172 20.6828
attr(,"conf.level")
[1] 0.95

The p-value of the test is 7.95310^{-6}, which is less than the significance level alpha = 0.05. We can conclude that the mean weight of the mice is significantly different from 25g with a p-value = 7.95310^{-6}

One-sample Wilcoxon test (non-parametric)

set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)

# One-sample wilcoxon test
res <- wilcox.test(my_data$weight, mu = 25)

Warning in wilcox.test.default(my_data$weight, mu = 25): cannot compute exact
p-value with ties

# Printing the results
res


    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25

# print only the p-value
res$p.value

[1] 0.005793045

The p-value of the test is 0.005793, which is less than the significance level alpha = 0.05. We can reject the null hypothesis and conclude that the average weight of the mice is significantly different from 25g with a p-value = 0.005793.

Unpaired two samples t-test (parametric) - Comparing the means of two independent groups

# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 
# Create a data frame
my_data <- data.frame( 
                group = rep(c("Woman", "Man"), each = 9),
                weight = c(women_weight,  men_weight)
                )

print(my_data)

   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4

library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )

# A tibble: 2 × 4
  group count  mean    sd
  <chr> <int> <dbl> <dbl>
1 Man       9  69.0  9.38
2 Woman     9  52.1 15.6

# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
        ylab = "Weight", xlab = "Groups")

Assumption 1: Are the two samples independents?

Yes, since the samples from men and women are not related.

Assumtion 2: Are the data from each of the 2 groups follow a normal distribution?

Use Shapiro-Wilk normality test as described at: Normality Test in R. - Null hypothesis: the data are normally distributed - Alternative hypothesis: the data are not normally distributed

We’ll use the functions with() and shapiro.test() to compute Shapiro-Wilk test for each group of samples.

# Shapiro-Wilk normality test for Men's weights
with(my_data, shapiro.test(weight[group == "Man"]))# p = 0.1


    Shapiro-Wilk normality test

data:  weight[group == "Man"]
W = 0.86425, p-value = 0.1066

# Shapiro-Wilk normality test for Women's weights
with(my_data, shapiro.test(weight[group == "Woman"])) # p = 0.6


    Shapiro-Wilk normality test

data:  weight[group == "Woman"]
W = 0.94266, p-value = 0.6101

From the output, the two p-values are greater than the significance level 0.05 implying that the distribution of the data are not significantly different from the normal distribution. In other words, we can assume the normality.

res.ftest <- var.test(weight ~ group, data = my_data)
res.ftest


    F test to compare two variances

data:  weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.08150656 1.60191315
sample estimates:
ratio of variances 
         0.3613398

The p-value of F-test is p = 0.1713596. It’s greater than the significance level alpha = 0.05. In conclusion, there is no significant difference between the variances of the two sets of data. Therefore, we can use the classic t-test witch assume equality of the two variances.

# Compute t-test
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res


    Two Sample t-test

data:  women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29.748019  -4.029759
sample estimates:
mean of x mean of y 
 52.10000  68.98889

# Compute t-test
res <- t.test(weight ~ group, data = my_data, var.equal = TRUE)
res


    Two Sample t-test

data:  weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
  4.029759 29.748019
sample estimates:
  mean in group Man mean in group Woman 
           68.98889            52.10000

The p-value of the test is 0.01327, which is less than the significance level alpha = 0.05. We can conclude that men’s average weight is significantly different from women’s average weight with a p-value = 0.01327.

Paired samples t-test (parametric) - Comparing the means of paired samples

# Data in two numeric vectors
# ++++++++++++++++++++++++++
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame( 
                group = rep(c("before", "after"), each = 10),
                weight = c(before,  after)
                )

print(my_data)

    group weight
1  before  200.1
2  before  190.9
3  before  192.7
4  before  213.0
5  before  241.4
6  before  196.9
7  before  172.2
8  before  185.5
9  before  205.2
10 before  193.7
11  after  392.9
12  after  393.2
13  after  345.1
14  after  393.0
15  after  434.0
16  after  427.9
17  after  422.0
18  after  383.9
19  after  392.3
20  after  352.2

library("dplyr")
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )

# A tibble: 2 × 4
  group  count  mean    sd
  <chr>  <int> <dbl> <dbl>
1 after     10  394.  29.4
2 before    10  199.  18.5

# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")

# Subset weight data before treatment
before <- subset(my_data,  group == "before", weight,
                 drop = TRUE)
# subset weight data after treatment
after <- subset(my_data,  group == "after", weight,
                 drop = TRUE)
# Plot paired data
library(PairedData)

Loading required package: MASS


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

Loading required package: gld

Loading required package: mvtnorm

Loading required package: lattice


Attaching package: 'PairedData'

The following object is masked from 'package:base':

    summary

pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()

Assumption 1: Are the two samples paired?

Yes, since the data have been collected from measuring twice the weight of the same mice.

Assumption 2: Is this a large sample?

No, because n < 30. Since the sample size is not large enough (less than 30), we need to check whether the differences of the pairs follow a normal distribution.

How to check the normality?

Use Shapiro-Wilk normality test as described at: Normality Test in R.

Null hypothesis: the data are normally distributed Alternative hypothesis: the data are not normally distributed

# compute the difference
d <- with(my_data, 
        weight[group == "before"] - weight[group == "after"])
# Shapiro-Wilk normality test for the differences
shapiro.test(d) # => p-value = 0.6141


    Shapiro-Wilk normality test

data:  d
W = 0.94536, p-value = 0.6141

From the output, the p-value is greater than the significance level 0.05 implying that the distribution of the differences (d) are not significantly different from normal distribution. In other words, we can assume the normality.

# Compute t-test
res <- t.test(before, after, paired = TRUE)
res


    Paired t-test

data:  before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -215.5581 -173.4219
sample estimates:
mean difference 
        -194.49

# Compute t-test
res <- t.test(weight ~ group, data = my_data, paired = TRUE)
res


    Paired t-test

data:  weight by group
t = 20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 173.4219 215.5581
sample estimates:
mean difference 
         194.49

The p-value of the test is 6.210^{-9}, which is less than the significance level alpha = 0.05. We can then reject null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment with a p-value = 6.210^{-9}.

Paired samples Wilcoxon test (non-parametric)

# Data in two numeric vectors
# ++++++++++++++++++++++++++
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame( 
                group = rep(c("before", "after"), each = 10),
                weight = c(before,  after)
                )

library("dplyr")
group_by(my_data, group) %>%
  summarise(
    count = n(),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )

# A tibble: 2 × 4
  group  count median   IQR
  <chr>  <int>  <dbl> <dbl>
1 after     10   393.  28.8
2 before    10   195.  12.6

# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")

# Subset weight data before treatment
before <- subset(my_data,  group == "before", weight,
                 drop = TRUE)
# subset weight data after treatment
after <- subset(my_data,  group == "after", weight,
                 drop = TRUE)
# Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()

Question : Is there any significant changes in the weights of mice before after treatment?

Compute paired Wilcoxon test - Method 1: The data are saved in two different numeric vectors.

res <- wilcox.test(before, after, paired = TRUE)
res


    Wilcoxon signed rank exact test

data:  before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0

The p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment with a p-value = 0.001953.

One-way ANOVA test - Comparing the means of more than two groups

my_data <- PlantGrowth

# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)

   weight group
1    6.15  trt2
2    3.83  trt1
3    5.29  trt2
4    5.12  trt2
5    4.50  ctrl
6    4.17  trt1
7    5.87  trt1
8    5.33  ctrl
9    5.26  trt2
10   4.61  ctrl

# Show the levels
levels(my_data$group)

[1] "ctrl" "trt1" "trt2"

my_data$group <- ordered(my_data$group,
                         levels = c("ctrl", "trt1", "trt2"))

library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )

# A tibble: 3 × 4
  group count  mean    sd
  <ord> <int> <dbl> <dbl>
1 ctrl     10  5.03 0.583
2 trt1     10  4.66 0.794
3 trt2     10  5.53 0.443

# Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")

Skipping install of 'ggpubr' from a github remote, the SHA1 (6aeb4f70) has not changed since last install.
  Use `force = TRUE` to force installation

# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

# Box plot
boxplot(weight ~ group, data = my_data,
        xlab = "Treatment", ylab = "Weight",
        frame = FALSE, col = c("#00AFBB", "#E7B800", "#FC4E07"))

# plotmeans
library("gplots")

Registered S3 method overwritten by 'gplots':
  method         from 
  reorder.factor gdata


Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

plotmeans(weight ~ group, data = my_data, frame = FALSE,
          xlab = "Treatment", ylab = "Weight",
          main="Mean Plot with 95% CI")

Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
graphical parameter

Warning in axis(1, at = 1:length(means), labels = legends, ...): "frame" is not
a graphical parameter

Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
graphical parameter

# Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
# Summary of the analysis
summary(res.aov)

            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups highlighted with “*” in the model summary.

In one-way ANOVA test, a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different.

It’s possible to perform multiple pairwise-comparison, to determine if the mean difference between specific pairs of group are statistically significant.

Tukey multiple pairwise-comparisons As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for performing multiple pairwise-comparison between the means of groups.

The function TukeyHD() takes the fitted ANOVA as an argument.

TukeyHSD(res.aov)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ group, data = my_data)

$group
            diff        lwr       upr     p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1  0.865  0.1737839 1.5562161 0.0120064

It can be seen from the output, that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.012.

library(multcomp)

Loading required package: survival

Loading required package: TH.data


Attaching package: 'TH.data'

The following object is masked from 'package:MASS':

    geyser

summary(glht(res.aov, linfct = mcp(group = "Tukey")))


     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = weight ~ group, data = my_data)

Linear Hypotheses:
                 Estimate Std. Error t value Pr(>|t|)  
trt1 - ctrl == 0  -0.3710     0.2788  -1.331   0.3908  
trt2 - ctrl == 0   0.4940     0.2788   1.772   0.1979  
trt2 - trt1 == 0   0.8650     0.2788   3.103   0.0121 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

pairwise.t.test(my_data$weight, my_data$group,
                 p.adjust.method = "BH")


    Pairwise comparisons using t tests with pooled SD 

data:  my_data$weight and my_data$group 

     ctrl  trt1 
trt1 0.194 -    
trt2 0.132 0.013

P value adjustment method: BH

# 1. Homogeneity of variances
plot(res.aov, 1)

Points 17, 15, 4 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.

library(car)

Loading required package: carData


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

leveneTest(weight ~ group, data = my_data)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  1.1192 0.3412
      27

From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.

oneway.test(weight ~ group, data = my_data)


    One-way analysis of means (not assuming equal variances)

data:  weight and group
F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739

pairwise.t.test(my_data$weight, my_data$group,
                 p.adjust.method = "BH", pool.sd = FALSE)


    Pairwise comparisons using t tests with non-pooled SD 

data:  my_data$weight and my_data$group 

     ctrl  trt1 
trt1 0.250 -    
trt2 0.072 0.028

P value adjustment method: BH

# 2. Normality
plot(res.aov, 2)

# Extract the residuals
aov_residuals <- residuals(object = res.aov )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )


    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.96607, p-value = 0.4379

kruskal.test(weight ~ group, data = my_data)


    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Two-Way ANOVA test

Two-way ANOVA test is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable.

# Store the data in the variable my_data
my_data <- ToothGrowth

# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)

    len supp dose
1  21.5   VC  2.0
2  17.3   VC  1.0
3  27.3   OJ  2.0
4  18.5   VC  2.0
5   8.2   OJ  0.5
6  26.4   OJ  1.0
7  25.8   OJ  1.0
8   5.2   VC  0.5
9   6.4   VC  0.5
10  9.4   OJ  0.5

# Check the structure
str(my_data)

'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

# Convert dose as a factor and recode the levels
# as "D0.5", "D1", "D2"
my_data$dose <- factor(my_data$dose, 
                  levels = c(0.5, 1, 2),
                  labels = c("D0.5", "D1", "D2"))
head(my_data)

   len supp dose
1  4.2   VC D0.5
2 11.5   VC D0.5
3  7.3   VC D0.5
4  5.8   VC D0.5
5  6.4   VC D0.5
6 10.0   VC D0.5

table(my_data$supp, my_data$dose)

    
     D0.5 D1 D2
  OJ   10 10 10
  VC   10 10 10

# Box plot with multiple groups
# +++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
library("ggpubr")
ggboxplot(my_data, x = "dose", y = "len", color = "supp",
          palette = c("#00AFBB", "#E7B800"))

# Line plots with multiple groups
# +++++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))

Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

# Box plot with two factor variables
boxplot(len ~ supp * dose, data=my_data, frame = FALSE, 
        col = c("#00AFBB", "#E7B800"), ylab="Tooth Length")

# Two-way interaction plot
interaction.plot(x.factor = my_data$dose, trace.factor = my_data$supp, 
                 response = my_data$len, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "Dose", ylab="Tooth Length",
                 pch=c(1,19), col = c("#00AFBB", "#E7B800"))

res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)

            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4   14.02 0.000429 ***
dose         2 2426.4  1213.2   82.81  < 2e-16 ***
Residuals   56  820.4    14.7                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table we can conclude that both supp and dose are statistically significant. dose is the most significant factor variable. These results would lead us to believe that changing delivery methods (supp) or the dose of vitamin C, will impact significantly the mean tooth length

# Two-way ANOVA with interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)

            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

require("dplyr")
group_by(my_data, supp, dose) %>%
  summarise(
    count = n(),
    mean = mean(len, na.rm = TRUE),
    sd = sd(len, na.rm = TRUE)
  )

`summarise()` has grouped output by 'supp'. You can override using the
`.groups` argument.

# A tibble: 6 × 5
# Groups:   supp [2]
  supp  dose  count  mean    sd
  <fct> <fct> <int> <dbl> <dbl>
1 OJ    D0.5     10 13.2   4.46
2 OJ    D1       10 22.7   3.91
3 OJ    D2       10 26.1   2.66
4 VC    D0.5     10  7.98  2.75
5 VC    D1       10 16.8   2.52
6 VC    D2       10 26.1   4.80

model.tables(res.aov3, type="means", se = TRUE)

Tables of means
Grand mean
         
18.81333 

 supp 
supp
    OJ     VC 
20.663 16.963 

 dose 
dose
  D0.5     D1     D2 
10.605 19.735 26.100 

 supp:dose 
    dose
supp D0.5  D1    D2   
  OJ 13.23 22.70 26.06
  VC  7.98 16.77 26.14

Standard errors for differences of means
          supp   dose supp:dose
        0.9376 1.1484    1.6240
replic.     30     20        10

TukeyHSD(res.aov3, which = "dose")

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = len ~ supp + dose + supp:dose, data = my_data)

$dose
          diff       lwr       upr   p adj
D1-D0.5  9.130  6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1    6.365  3.597488  9.132512 2.7e-06

library(multcomp)
summary(glht(res.aov2, linfct = mcp(dose = "Tukey")))


     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = len ~ supp + dose, data = my_data)

Linear Hypotheses:
               Estimate Std. Error t value Pr(>|t|)    
D1 - D0.5 == 0    9.130      1.210   7.543   <1e-05 ***
D2 - D0.5 == 0   15.495      1.210  12.802   <1e-05 ***
D2 - D1 == 0      6.365      1.210   5.259   <1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

pairwise.t.test(my_data$len, my_data$dose,
                p.adjust.method = "BH")


    Pairwise comparisons using t tests with pooled SD 

data:  my_data$len and my_data$dose 

   D0.5    D1     
D1 1.0e-08 -      
D2 4.4e-16 1.4e-05

P value adjustment method: BH

# 1. Homogeneity of variances
plot(res.aov3, 1)

library(car)
leveneTest(len ~ supp*dose, data = my_data)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  5  1.7086 0.1484
      54

# 2. Normality
plot(res.aov3, 2)

# Extract the residuals
aov_residuals <- residuals(object = res.aov3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )


    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.98499, p-value = 0.6694

library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")

Anova Table (Type III tests)

Response: len
             Sum Sq Df F value    Pr(>F)    
(Intercept) 1750.33  1 132.730 3.603e-16 ***
supp         137.81  1  10.450  0.002092 ** 
dose         885.26  2  33.565 3.363e-10 ***
supp:dose    108.32  2   4.107  0.021860 *  
Residuals    712.11 54                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MANOVA test: Multivariate analysis of variance

In the situation where there multiple response variables you can test them simultaneously using a multivariate analysis of variance (MANOVA). This article describes how to compute manova in R.

# Store the data in the variable my_data
my_data <- iris

# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  Size
1           5.2         3.5          1.5         0.2     setosa small
2           5.7         2.6          3.5         1.0 versicolor small
3           6.3         3.3          6.0         2.5  virginica   big
4           6.5         3.2          5.1         2.0  virginica   big
5           6.3         3.4          5.6         2.4  virginica   big
6           6.4         2.8          5.6         2.2  virginica   big
7           6.8         3.2          5.9         2.3  virginica   big
8           7.9         3.8          6.4         2.0  virginica   big
9           6.2         2.9          4.3         1.3 versicolor   big
10          7.1         3.0          5.9         2.1  virginica   big

sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)

           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Look to see which differ
summary.aov(res.man)

 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output above, it can be seen that the two variables are highly significantly different among Species.

Kruskal-Wallis test

Kruskal-Wallis test by rank is a non-parametric alternative to one-way ANOVA test, which extends the two-samples Wilcoxon test in the situation where there are more than two groups. It’s recommended when the assumptions of one-way ANOVA test are not met. This tutorial describes how to compute Kruskal-Wallis test in R software.

my_data <- PlantGrowth

library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )

# A tibble: 3 × 6
  group count  mean    sd median   IQR
  <fct> <int> <dbl> <dbl>  <dbl> <dbl>
1 ctrl     10  5.03 0.583   5.15 0.743
2 trt1     10  4.66 0.794   4.55 0.662
3 trt2     10  5.53 0.443   5.44 0.467

kruskal.test(weight ~ group, data = my_data)


    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the treatment groups.

From the output of the Kruskal-Wallis test, we know that there is a significant difference between groups, but we don’t know which pairs of groups are different.

It’s possible to use the function pairwise.wilcox.test() to calculate pairwise comparisons between group levels with corrections for multiple testing.

pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                 p.adjust.method = "BH")

Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute
exact p-value with ties


    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  PlantGrowth$weight and PlantGrowth$group 

     ctrl  trt1 
trt1 0.199 -    
trt2 0.095 0.027

P value adjustment method: BH

The pairwise comparison shows that, only trt1 and trt2 are significantly different (p < 0.05).

Week 8 Post-Session Formative Exercise

Exercise 5 - Correlation

-0.77 (c) represents the strongest association out of the five correlation coefficients. This is because +1 or -1 represent the strongest possible positive or negative associations hence 1.05 (e) is an invalid coefficient.
A person with above average scores for X will probably have above average scores for Y if the coefficient is 0.55 (a)
If we have five representative samples of people aged 15, 20, 30, 45 and 60 years who completed a questionnaire of political conservatism and the average scores of political conservatism were: 60, 85, 80, 70, 65 correlation between age and political conservatism is non-linear.

r = 1.0: graph c
r = 0.5: graph d
r = 0: graph e
r = -0.5: graph a
r = -1.0: graph b

In this case the correlation coefficient wouldn’t change as “car’s age” and “year when was car produced” are essentially the same thing.
Regarding interpretation, between correlations 0.2 and 0.4 and between correlations 0.5 and 0.7 is approximately the same difference.
If IQ scores from test A are consistently higher about 10 points than IQ score from test B the highest possible correlation between test A and test B is 1.0.
If study 1 found correlation coefficient -0.2 and study 2 found correlation between the same variables 0.4 the proportion of shared variance decreased four times between study 1 and study 2 (b).
One study about heart attacks states that people who regularly go to church are in lower risk of heart attack therefore if you regularly go to church, your probability of getting heart attack is lower than in people who don’t go to church.

Week 9 Post-Session Formative Exercise

data(mtcars)

# Fit a linear model where mpg is the dependent variable and wt is the independent variable
model <- lm(mpg ~ wt, data = mtcars)

# Display a summary of the model to interpret the results
summary(model)


Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

anova(model)

Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
wt         1 847.73  847.73  91.375 1.294e-10 ***
Residuals 30 278.32    9.28                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic tests whether the weight (wt) significantly explains the variation in mpg. The above and very small p-value - 1.294e-10 suggests that weight is highly significant

# Create a scatter plot and add the regression line
plot(mtcars$wt, mtcars$mpg, main = "MPG vs. Vehicle Weight", xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
abline(model, col = "blue")

Based on the above we can conclude that there is a clear negative relationship between the weight of vehicles and fuel efficiency i.e. that as weight increases fuel efficiency tends to decrease.

In a linear model, coefficients represent the relationship between each predictor variable (independent variable) and the dependent variable. Specifically, the coefficients measure the change in the dependent variable for a one-unit change in the corresponding predictor variable, while holding other variables constant.

The estimated coefficient (slope interpretation) for weight is -5.3445, which implies that for every 1000lb increase in the weight of a vehicle, the expected decrease in fuel efficiency is 5.34 miles per gallon. Therefore, we can assume that the heavier a car is the more fuel it will consume per mile.

The R-squared value is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model.An R2 of 1 means that the model perfectly explains the variance in the dependent variable (all the data points lie on the regression line). However, A high R2 doesn’t necessarily mean the independent variable(s) cause change in the dependent variable and so should be interpreted alongside other metrics (variables) to fully assess model performance.

The model above explains about 75% of the variation in MPG based on the weight of the vehicle (R-squared = 0.7528). This is a relatively strong relationship, suggesting that weight is a significant factor in determining fuel efficiency, though other factors may also contribute to the variation in MPG.Or in other words, approximately 75% of the variability in mpg can be explained by the weight of the vehicle.

The p-value for the coefficient of weight is very small (approximately 1.29e-10), which indicates that the relationship between weight and MPG is statistically significant. This means that the observed negative effect of weight on fuel efficiency is unlikely to be due to random chance.