R and Quarto

Author

Joss Sibbering

Published

October 1, 2024

Post Session 1 Video Work, 1 Oct 2024

Default Quarto Intro

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Writing in Quarto

Figure 1: R_Markdown_Cheatsheet

I’ve saved Figure 1 in the working directory, should have info for how to write good!

Inserting a photograph!

Warning

Make sure the working directory is setup correctly.

Files have to be loaded form same place as the file is saved.
To set working directory use setwd(“PATH/directoryname”)
! [] (imgdirectory) adds a photo (without spaces).

A macro shot of a red ladybird on a leaf looking at the camera — A ladybird on a leaf

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

1 + 1

[1] 2

1 + 1

[1] 2

It seems that echo: true doesn’t change the output here.

Tip

Hint: use ctrl+alt+i to add a code chunk.

You can add options to executable code like this:

[1] 4

The echo: false option disables the printing of code (only output is displayed).

Loading Tidyverse:

See “Pre Session 3 (Tidyverse) Activities for more info [R for Grad Students by Y. Wendy Huynh; Chapter 4]

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(knitr)
library(gmodels)
library(ggplot2)

‘Output: false’ stops the workflow of loading the package (or any other operation) being displayed.

Warning

Trying to load the same package twice will give an error message!
A blank space between lines is needed to output separate lines and make bullet points work.

Create a plot:

Code

ggplot(diamonds, aes(x = cut)) +
  geom_bar() # geom_bar() makes the outputted graph a bar chart.

This is a reference: Figure 2.

Using LaTeX

I had to install a package for this bit, maybe tinytex?

We can put formulas inline like this: $\alpha$ We indent like this

\[ \pi \in \mathbb{R}\]

Google for more info!

Online Workshop Thursday 3rd Oct 2024

##Test This doesn’t work.

## Test

### Test

##### Test (This is smaller than standard text)

(Just testing out different heading sizes)

Tip

ctrl+alt+i is a very useful shortcut! It inputs the r code block.

“#|” denotes the header for a chunk. Sets the variables you want to happen for that chunk.

%>% is the “pipe” operator, so x%>%y says “do y to x”

Tip

ctrl+shift+m is a shortcut for %>%

Penguin Time!!!

library(palmerpenguins)

Pre Session 3 (Tidyverse) Activities R for Grad Students by Y. Wendy Huynh

Chapter 4

Tidyverse is a package of other packages!
- dplyr
- ggplot2
- etc
These will be shown when tidyverse is loaded.
Packages must be loaded EVERY time a new RStudio session is started.
- This is because otherwise there would be too many naming conflicts between packages.
  - eg filter(): one function from the dplyr package and one from the stats package.
  - R uses the function related to the most recently loaded package.
  - Since stats is inc. in the base R and dplyr is added with tidyverse, the filter() function used will be the dplyr version.
You can also just load individual functions instead of a whole library, eg:

Load only the filter function from dplyr:

dplyr::filter()

Load all functions from dplyr:

library(dplyr)

Example code:

diamonds %>% 
  group_by(clarity) %>% 
  summarize(m = mean(price)) %>% 
  ungroup()

# A tibble: 8 × 2
  clarity     m
  <ord>   <dbl>
1 I1      3924.
2 SI2     5063.
3 SI1     3996.
4 VS2     3925.
5 VS1     3839.
6 VVS2    3284.
7 VVS1    2523.
8 IF      2865.

Not really sure what this bit is about… (as of 3/10/2024)

Chapter 5

The “diamonds” dataset is built into ggplot2 and therefore included in tidyverse.

I guess it’s kinda like those songs you got included with windows media player back in the day!

Let’s view the diamonds dataset!

#View(diamonds)

It gives an excel-like view of the data.

You cannot edit the data cell-by-cell like in excel; that needs to be done through code.

Look at the structure of each variable in diamonds:

str(diamonds)

tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

It details the number of data entries and how many variables for each entry along with their details.

Built in help/discriptions/explanations can be accessed via ?diamonds

Chapter 6 04/10/2024

6.1: mutate()

Use mutate() to:

Add new columns
Modify variables in dataset.

To create new var.s in diamonds;

A var. called JustOne where all values inside the column = 1
A var. called Values where all values are “something”
A var. called Simple where all values = TRUE

diamonds %>% 
  mutate(JustOne = 1,
         Values = "something",
         Simple = TRUE)

# A tibble: 53,940 × 13
   carat cut    color clarity depth table price     x     y     z JustOne Values
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl> <chr> 
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43       1 somet…
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31       1 somet…
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31       1 somet…
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63       1 somet…
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75       1 somet…
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48       1 somet…
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47       1 somet…
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53       1 somet…
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49       1 somet…
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39       1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>

A bit more info on the syntax here:

The pipe %>% is contained within tidyverse, specifically in a package called magrittr.
Pressing enter after a comma or a pipe auto-indents the code to make it more readable.
The mutate code reads as: “Within diamonds, I want to mutate/change things by creating the variables”JustOne”, which equals 1; “Values”, which equals “something”; and “Simple”, which equals TRUE.”
Separate comands within mutate() can be separated by commas (With or without a new line)

To create var.s based on existing var.s:

diamonds %>% 
  mutate(price200 = price - 200)

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows

Which created the column/var “price200” whose values are that of “price” minus 200.

Warning

These changes are not saved!!

Executing str(diamonds) will show that they haven’t been saved.

To save them you must add a line of code to save them as a certain object:

diamonds.new <- THE REST OF THE CODE

Nesting functions

Other functions can be used inside mutate!

eg. using the mean(), sd() and median() func.s to calc. the descriptive stats of the diamonds:

diamonds %>% 
  mutate(m = mean(price),
         sd = sd(price),
         med = median(price))

# A tibble: 53,940 × 13
   carat cut       color clarity depth table price     x     y     z     m    sd
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933. 3989.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933. 3989.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933. 3989.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933. 3989.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933. 3989.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933. 3989.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933. 3989.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933. 3989.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933. 3989.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>

These funcs.s use all the data in the price column, so the output in each of these columns is the same.

(Exercises)

6.1.1.1: recode() (must be used within mutate I think)

This modifies values w/in a var.

Basic Structure:

data %>% mutate(NewVariable = recode(Variable, “old value” = “new value”))

eg:

diamonds %>% 
  mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z cut.new  
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>    
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 IDEAL    
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 Premium  
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 Good     
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 Premium  
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 Good     
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 Very Good
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Very Good
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 Very Good
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 Fair     
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 Very Good
# ℹ 53,930 more rows

Example:

# Creating a dataset w/ 2 var.s (Sex, TestScore)
Sex <- factor(c("male", "m", "M", "Female", "Female", "Female"))
TestScore <- c(10,20,10,25,12,5)
dataset <- tibble(Sex, TestScore)
str(dataset)

tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
 $ Sex      : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
 $ TestScore: num [1:6] 10 20 10 25 12 5

Tip

Remember! c() is used to combine values into a vector or a list!

(I’m not sure what the factor() function does here)

factor() is a type of variable that is categorical with levels.

# Creating a new var Sex.new w/ recd values from original var, but where all male values = Male

dataset <-
  dataset %>% 
  mutate(Sex=recode(Sex,"m"="Male","M"="Male", "male"="Male"))
str(dataset)

tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
 $ Sex      : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1
 $ TestScore: num [1:6] 10 20 10 25 12 5

6.2: summarize() OR summarise()

Returns a 1-row summary.

eg:

diamonds %>% 
  summarise(avg.price = mean(price))

# A tibble: 1 × 1
  avg.price
      <dbl>
1     3933.

6.3: group_by() and ungroup()

Takes existing data and groups specific vars. together for future operations.

eg:

## Creating identification number to represent 50 individual people
ID <- c(1:50)

## Creating sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate

## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33, 
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36, 
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25, 
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25, 
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21) 

## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173, 
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482, 
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752, 
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567, 
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)

## Creating a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)

summarise() and group_by() 07/10/24

data %>% 
  group_by(Sex) %>% 
  summarise(m = mean(Score), # Calculates the mean score
            s = sd(Score),   # Calc. the standard deviation
            n = n())  %>%    # Calc. the tot. observations.
  ungroup()

# A tibble: 2 × 4
  Sex        m     s     n
  <chr>  <dbl> <dbl> <int>
1 female 0.437 0.268    25
2 male   0.487 0.268    25

Compare to the same but without using group_by() and ungroup():

data %>%
  summarise(m = mean(Score), # Calculates the mean score
            s = sd(Score),   # Calc. the standard deviation
            n = n())         # Calc. the tot. observations.

# A tibble: 1 × 3
      m     s     n
  <dbl> <dbl> <int>
1 0.462 0.266    50

So by grouping we get to see the summaries of individual variables within the dataset!

Now lets group by Sex and Age:

data %>% 
  group_by(Sex, Age) %>% 
  summarise(m = mean(Score), # Calculates the mean score
            s = sd(Score),   # Calc. the standard deviation
            n = n())  %>%    # Calc. the tot. observations.
  ungroup()

`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.

# A tibble: 27 × 5
   Sex      Age     m      s     n
   <chr>  <dbl> <dbl>  <dbl> <int>
 1 female    20 0.046 NA         1
 2 female    21 0.740  0.253     3
 3 female    22 0.672  0.253     2
 4 female    23 0.501 NA         1
 5 female    25 0.579  0.167     3
 6 female    26 0.41  NA         1
 7 female    28 0.152 NA         1
 8 female    29 0.426  0.339     2
 9 female    30 0.170  0.238     2
10 female    33 0.173 NA         1
# ℹ 17 more rows

This has given us 27 rows in total. R has considered every unique combination of Sex and Age in the dataset.

The missing std dev values are because there is only 1 observtion for that group, while std dev requires at least 2.

mutate() and group_by()

This combination adds a new column to the dataset based on the group.

data %>% 
  group_by(Sex) %>% 
  mutate(MeanScoreSex = mean(Score)) %>% 
ungroup()

# A tibble: 50 × 5
      ID Sex      Age Score MeanScoreSex
   <int> <chr>  <dbl> <dbl>        <dbl>
 1     1 male      26 0.01         0.487
 2     2 female    25 0.418        0.437
 3     3 male      39 0.014        0.487
 4     4 female    37 0.09         0.437
 5     5 male      31 0.061        0.487
 6     6 female    34 0.328        0.437
 7     7 male      34 0.656        0.487
 8     8 female    30 0.002        0.437
 9     9 male      26 0.639        0.487
10    10 female    33 0.173        0.437
# ℹ 40 more rows

While there are only 2 results for MeanScoreSex, they have been duplicated for every row in the table.

Warning

If ungroup() is left out at the end, this will leave the grouping in place and likely lead to errors in further calculations!

Always use ungroup() after group_by()!!

6.4 filter()

This only retains certain rows that meet the specified requirements.

diamonds %>% filter(cut == "Fair")

# A tibble: 1,610 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 1,600 more rows

Multiple conditions can be included using | or a comma:

Bar denotes OR
Comma denotes AND.

diamonds %>% filter(cut == "Fair" | cut == "Good", price <= 600)

# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

This can also be achieved through:

diamonds %>% filter(cut %in% c("Fair", "Good"), price <= 600)

# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

6.5 select()

Select the columns you want to see.

You can select using the column name or position number.

The order the columns are listed are the order they are displayed.

diamonds %>% 
  select(cut, color)

# A tibble: 53,940 × 2
   cut       color
   <ord>     <ord>
 1 Ideal     E    
 2 Premium   E    
 3 Good      E    
 4 Premium   I    
 5 Good      J    
 6 Very Good J    
 7 Very Good I    
 8 Very Good H    
 9 Fair      E    
10 Very Good H    
# ℹ 53,930 more rows

diamonds %>% 
select(1:5)

# A tibble: 53,940 × 5
   carat cut       color clarity depth
   <dbl> <ord>     <ord> <ord>   <dbl>
 1  0.23 Ideal     E     SI2      61.5
 2  0.21 Premium   E     SI1      59.8
 3  0.23 Good      E     VS1      56.9
 4  0.29 Premium   I     VS2      62.4
 5  0.31 Good      J     SI2      63.3
 6  0.24 Very Good J     VVS2     62.8
 7  0.24 Very Good I     VVS1     62.3
 8  0.26 Very Good H     SI1      61.9
 9  0.22 Fair      E     VS2      65.1
10  0.23 Very Good H     VS1      59.4
# ℹ 53,930 more rows

diamonds %>% 
  select(-cut) # Selects all except "cut"

# A tibble: 53,940 × 9
   carat color clarity depth table price     x     y     z
   <dbl> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

diamonds %>% 
  select(-(5:10))

# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows

You can also retain everything, but more certain columns to the front:

diamonds %>% 
  select(x,y,z,everything())

# A tibble: 53,940 × 10
       x     y     z carat cut       color clarity depth table price
   <dbl> <dbl> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  3.95  3.98  2.43  0.23 Ideal     E     SI2      61.5    55   326
 2  3.89  3.84  2.31  0.21 Premium   E     SI1      59.8    61   326
 3  4.05  4.07  2.31  0.23 Good      E     VS1      56.9    65   327
 4  4.2   4.23  2.63  0.29 Premium   I     VS2      62.4    58   334
 5  4.34  4.35  2.75  0.31 Good      J     SI2      63.3    58   335
 6  3.94  3.96  2.48  0.24 Very Good J     VVS2     62.8    57   336
 7  3.95  3.98  2.47  0.24 Very Good I     VVS1     62.3    57   336
 8  4.07  4.11  2.53  0.26 Very Good H     SI1      61.9    55   337
 9  3.87  3.78  2.49  0.22 Fair      E     VS2      65.1    61   337
10  4     4.05  2.39  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

This could be used to permenantly reorder the dataset:

testdata1 <-
  diamonds %>% 
  select(x,y,z,everything())

6.6 arrange()

Arranges values within a variable by ascending or descending alphabetical or numerical order.

diamonds %>% 
  arrange(cut)

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 53,930 more rows

diamonds %>% arrange(desc(price))

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

6.6.1 Exercises 10/10/2024

Problem A

midwest %>% #call the midwest object; 
  group_by(state) %>% #states that the following commands are only to act on the variable "state"
  summarise(poptotalmean = mean(poptotal), #give a summary of "poptotalmean", defined as the mean of the column poptotal
            poptotalmed = median(poptotal), #give a summary of poptotalmed, defined as the median of poptotal
            popmax = max(poptotal), #give a summary of popmax, defined as the maximum value of poptotal
            popmin = min(poptotal), #give a summary of popmin, defined as the minimum value of poptotal
            popdistinct = n_distinct(poptotal), #give a summary of popdistinct, defined as the number of unique values in poptotal
            popfirst = first(poptotal), #give a summary of popfirst, defined as the first value in poptotal
            popany = any(poptotal < 5000), #give a summary of popany, defined as TRUE if it meets the condition; false otherwise
            popany2 = any(poptotal > 2000000)) %>% #give a summary of popany2, defined as True if it meets the condition; false otherwise
  ungroup() #ungroup the dataset.

# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B

midwest %>% #do what comes next to the midwest dataset
  group_by(state) %>% #group by the state variable
  summarise(num5k = sum(poptotal < 5000), #give a summary of num5k, defined as the number of entries with poptotal less than 5000 
            num2mil = sum(poptotal > 2000000), #give a summary of num2mil, defined as the number of entries with poptotal greater than 2,000,000
            numrows = n()) %>% #give a summary of numrows, defined as the total number of rows.
  ungroup() #ungroup the dataset

# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem C

midwest %>% #call the object midwest
  group_by(county) %>% #just use the variable "county"
  summarise(x = n_distinct(state)) %>% #summarise the object x, defined as the number of unique states within the midwest object
  arrange(desc(x)) %>% #arrange x in descending order, from most common state to least
  ungroup() #ungroup

# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

midwest %>% #call the midwest object
  group_by(county) %>% #group the object by county
  summarise(x = n()) %>% #summarise the object x, defined as the number of entries for each county
  ungroup() #ungroup

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

#n() differs from n_distinct() by just giving the number of each unique entry, while n_distinct gives hte number of unique entries

midwest %>% #call midwest object
  group_by(county) %>% #group by counties
  summarise(x = n_distinct(county)) %>% #define x as the number of distinct counties in each county and print a summary of this
  ungroup() #ungroup

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows

#obviously, there is only 1 county in each county!

Problem D

diamonds %>% #load diamonds
  group_by(clarity) %>% #select the clarity variable
  summarise(a = n_distinct(color), #print a summary of the number of unique colors
            b = n_distinct(price), #print a summary of the number of unique prices
            c = n()) %>% #print a summary of the total number of variables
  ungroup() #ungroup

# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

diamonds %>% #call diamonds
  group_by(color, cut) %>% #give us every unique combo of color and cut
  summarise(m = mean(price), #summarise the mean price of each combo
            s = sd(price)) %>% #summarise the standard deviation of each combo
  ungroup() #ungroup

`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.

# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

diamonds %>% #call diamonds
  group_by(cut, color) %>% #select every unique combo of cut and color
  summarise(m = mean(price), #summarise the mean price of each combo
            s = sd(price)) %>% #summarise the standard deviation of each combo
  ungroup() #ungroup

`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.

# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows

The second part simply changes the printed order from sorted by color to being sorted by cut.

diamonds %>% #call diamonds
  group_by(cut, color, clarity) %>% #select every unique instance of cut, color and clarity
  summarise(m = mean(price), #summarise the mean of each instance
            s = sd(price), #summarise the std dev of each instance
            msale = m* 0.80) %>% #define msale as 80% of the mean price
  ungroup() #ungroup

`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.

# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem F

diamonds %>% 
  group_by(cut) %>% 
  summarise(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple^2,
            peach = n()) %>% 
  ungroup()

# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

diamonds %>% #call diamonds
  group_by(color) %>% #group all unique varieties of color
  summarize(m = mean(price)) %>% #print the mean price of each color
  mutate(x1 = str_c("Diamond color", color), #create a column x1 with values of "Diamond color"+COLOR
         x2 = 5) %>% #create a column x2 of just the value 5 in every row
  ungroup() #ungroup

# A tibble: 7 × 4
  color     m x1                x2
  <ord> <dbl> <chr>          <dbl>
1 D     3170. Diamond colorD     5
2 E     3077. Diamond colorE     5
3 F     3725. Diamond colorF     5
4 G     3999. Diamond colorG     5
5 H     4487. Diamond colorH     5
6 I     5092. Diamond colorI     5
7 J     5324. Diamond colorJ     5

diamonds %>% #call diamonds
  group_by(color) %>% #group all unique vars of color
  summarise(m = mean(price)) %>% #print the mean price of each color group 
  ungroup() %>% #ungroup
  mutate(x1 = str_c("Diamond color ", color), #create a column x1 with values of "Diamond color"+COLOR
         x2 = 5) #create a column x2 of just the value 5 in every row

# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

The position of the ungroup in the prev 2 examples doesn’t change the output as the mutate() function doesn’t rely on the group.

Problem H

diamonds %>% #call diamonds
  group_by(color) %>% #group all unique vars of color
  mutate(x1 = price*0.5) %>% #create a new column, x1, equal to the price of each color x 0.5
  summarise(m = mean(x1)) %>% #print a summary of the mean of x1 called m. this summary excludes all other columns, so the x1 column isn't shown.
  ungroup()

# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

diamonds %>% #call diamonds
  group_by(color) %>% #group by each unique value of color
  mutate(x1 = price*0.5) %>% #add a column equal to half of the price called x1
  ungroup() %>% #ungroup
  summarise(m = mean(x1)) #print a summary of the mean values of x1

# A tibble: 1 × 1
      m
  <dbl>
1 1966.

Since this last one ungroups before printing the summary, the printout isn’t separated into individual colors.

Why is grouping data neccessary?

So that functions can be performed on entire groups of the data

Why is ungrouping neccessary?

To ensure that the data isn’t still grouped when you next go to use it.

When should you ungroup data?

Once you’re finished using it in groups.

6.7 Extra Practice!

#View(diamonds) #make sure it has a capital V! opens the data.frame of diamonds in a new window

arrange(diamonds, price) #this gives ther data with price ascending from lowest to highest

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

diamonds %>% 
  arrange(price) #this does exactly the same as above

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

diamonds %>% 
  arrange(desc(price)) #this sorts data with price descending

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

diamonds %>% 
  arrange(price) %>%
  arrange(cut)

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows

diamonds %>% 
  arrange(desc(price)) %>% 
  arrange(desc(cut))

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.51 Ideal G     IF       61.7  55   18806  7.37  7.41  4.56
 2  2.07 Ideal G     SI2      62.5  55   18804  8.2   8.13  5.11
 3  2.15 Ideal G     SI2      62.6  54   18791  8.29  8.35  5.21
 4  2.05 Ideal G     SI1      61.9  57   18787  8.1   8.16  5.03
 5  1.6  Ideal F     VS1      62    56   18780  7.47  7.52  4.65
 6  2.06 Ideal I     VS2      62.2  55   18779  8.15  8.19  5.08
 7  1.71 Ideal G     VVS2     62.1  55   18768  7.66  7.63  4.75
 8  2.08 Ideal H     SI1      58.7  60   18760  8.36  8.4   4.92
 9  2.03 Ideal G     SI1      60    55.8 18757  8.17  8.3   4.95
10  2.61 Ideal I     SI2      62.1  56   18756  8.85  8.73  5.46
# ℹ 53,930 more rows

diamonds %>% 
  arrange(price) %>% 
  arrange(desc(clarity))

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Very Good H     IF       63.9    55   369  3.89  3.9   2.49
 2  0.24 Very Good H     IF       61.3    56   449  4.04  4.06  2.48
 3  0.26 Ideal     H     IF       61.1    57   468  4.12  4.16  2.53
 4  0.23 Very Good F     IF       61      62   485  3.95  3.99  2.42
 5  0.3  Ideal     J     IF       61.5    56   489  4.32  4.33  2.66
 6  0.3  Ideal     J     IF       61.5    57   489  4.29  4.36  2.66
 7  0.23 Very Good E     IF       59.9    58   492  3.98  4.03  2.4 
 8  0.24 Good      F     IF       65.1    58   492  3.86  3.88  2.52
 9  0.24 Ideal     H     IF       62.5    54   504  3.97  4     2.49
10  0.24 Ideal     H     IF       62.1    57   504  4     4.04  2.5 
# ℹ 53,930 more rows

diamonds %>% 
  mutate(saleprice = price - 250)

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z saleprice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

diamonds %>% 
  select(-x,-y,-z)

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

diamonds %>% 
  group_by(cut) %>% 
  summarise(number = n()) %>% 
  ungroup()

# A tibble: 5 × 2
  cut       number
  <ord>      <int>
1 Fair        1610
2 Good        4906
3 Very Good  12082
4 Premium    13791
5 Ideal      21551

diamonds %>% 
  mutate(totalNum = sum(n()))

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalNum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows

Pre-Session 4: Chapter 2 of R Graphics Cookbook, 14/10/24

Sometimes, for simple things, it’s good to plot in base R language. However, ggplot2 is WAAY better for more complicated things!

Creating a scatter plot

We’ll be using the “mtcars” dataset. Here’s a graph from the inbuilt r code:

plot(mtcars$wt, mtcars$mpg)

Where *mtcars$wt* returns the column names wt from the mtcars data frame (and similar for mtcars$mpg)

And here’s a graph from the ggplot2 code:

## library(ggplot2) ## Loaded this at the start

ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

ggplot(): tells it to create a plot object.
geom_point(): tells it to add a layer of points to the plot.
(mtcars, aes(x = wt, y = mpg)) calls the data frame mtcars before telling it which variables to use for the x and y values.
to use vectors instead of a data frame, you can swap “mtcars” with data = NULL, followed by the same aes(x = vector1, y = vector2):

ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) + geom_point()

15/10/2024 (continuing)

Creating a line graph

Using plot():

#plot(pressure$temperature, pressure$pressure, type = "1")

That didn’t work for some reason.

plot(pressure$temperature, pressure$pressure, type = "l")

I misread the l as a 1 - hence where it went wrong.

To add a new line to the graph using plot():

use plot() for the first one, then use points() or lines() to add corresponding data points to the graph.

plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure/2)
lines(pressure$temperature, pressure$pressure/4, col = "red")

Now using ggplot2:

ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line()

To add data points:

ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line() + geom_point()

Creating a bar graph

Using built in barplot:

barplot(BOD$demand, names.arg = BOD$Time)

Using ggplot2:

ggplot(BOD, aes(x = Time, y = demand)) +
  geom_col() +
  ggtitle("A Random Bar Chart")

To ensure the x variable is treated as a discrete var.:

use factor():

ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_col() +
  ggtitle("A Random Bar Chart")

Creating a histogram

with hist():

specify number of bins with breaks

hist(mtcars$mpg, breaks = 10)

using ggplot2:

## library(tidyverse) ## loaded this at start

ggplot(mtcars, aes(x = mpg)) + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adding binwidth in geom_histogram() allows you to definte the width of the bins:

#library(tidyverse)

ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 4)

Creating a boxplot

The built-in plot() function will automatically create a boxplot is the x values are a factor (as opposed to a numeric vector):

plot(ToothGrowth$supp, ToothGrowth$len)

and using ggplot2:

ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot()

and we can make box plots for multiple variables using interaction()

ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) + 
  geom_boxplot()

Does penguin sex explain flipper length

ggplot(penguins, aes(x = species, y = flipper_length_mm, color = sex)) + geom_boxplot()

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

16/10/2024 Pre Session 4 cont.

Plotting a function curve

Use curve() with an expression with the variable x:

curve(x^3 - 5*x, from = -4, to = 4)

Warning

Don’t forget the comma after the = -4!

Plotting a user-defined function

It needs to take a numeric vector and return a numeric vector as output.

Using add = TRUE will add it to the previously created plot.

myfun <- function(xvar) {
  1 / (1 + exp(-xvar + 10))
}

curve(myfun(x), from = 0, to = 20)

curve(1 - myfun(x), add = TRUE, col = "red")

Post session 4

library(modeldata)


Attaching package: 'modeldata'

The following object is masked from 'package:palmerpenguins':

    penguins

Scatter plot

ggplot(crickets, 
       aes(x = temp, y = rate,
           color = species)) + 
  geom_point(color = "red",     #specifies the color of the datapoints 
             size = 2,          #specifies the size
             alpha = .3,        # makes transparent
             shape = "square") +
  geom_smooth(method = "lm",    #adds trendline; lm makes it a linear model
              se = FALSE) +     #turns off standard error shading
  labs(x = "Temperature",       # labs = labels
       y = "Chirp rate",
       color = "Species",       # changes title of legend
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

`geom_smooth()` using formula = 'y ~ x'

Other plots

# histogram - one quantitative variable
ggplot(crickets, aes(x = rate)) +
  geom_histogram(bins = 15) #can also use binwidth to change the bins

#frequency polygon
ggplot(crickets, aes(x = rate)) +
  geom_freqpoly(bins = 15)

#bar chart
ggplot(crickets, aes(x = species)) +
  geom_bar(color = "black",           #make boundary black
           fill = "lightblue")        #make fill light blue

ggplot(crickets, aes(x = species, fill = species)) + # fill = species makes species represented by a different fill
  geom_bar(show.legend = FALSE)

#box plot

ggplot(crickets, aes(x = species,
                     y = rate,
                     color = species)) +
  geom_boxplot() + 
  theme_minimal()

# faceting

#Not great example:

ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15, show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2") +
  facet_wrap(~species) +                           # facet_wrap puts the data on two different graphs, side by side, with the same axes scales
  theme_minimal()

18/10/2024 Experimenting With Tree Data and ggplot2

Uploading my own data:

View("Tree_Phenology_Data")

That didn’t work. How do we get the data loaded into R?

Maybe because it wasn’t saved as .csv

View("Tree_Phenology_Data_P1")

Nope again!

TreesP1 <- read_csv("Tree_Phenology_Data_P1.csv")

Rows: 951 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Date observed, Species
dbl (9): Obs. no., Latitude, Longitude, Altitude, Tree diameter, Urbanisatio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

TreesP2 <- read_csv("Tree_Phenology_Data_P2.csv")

Rows: 828 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Date observed, Species
dbl (9): Obs. no., Latitude, Longitude, Altitude, Tree diameter, Urbanisatio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This worked!

Tip

read.csv and read_csv are DIFFERENT functions!

read.csv is base R language and returns a data frame with factors as default
read_csv is tidyverse language and returns a tibble with characters as default

Use read_csv!!!

summary(TreesP1)

    Obs. no.     Date observed         Latitude        Longitude        
 Min.   :  1.0   Length:951         Min.   :-35.73   Min.   :-115.9786  
 1st Qu.:239.5   Class :character   1st Qu.: 51.37   1st Qu.:  -3.2639  
 Median :477.0   Mode  :character   Median : 52.08   Median :  -1.9382  
 Mean   :477.3                      Mean   : 52.07   Mean   :  -1.9622  
 3rd Qu.:715.5                      3rd Qu.: 53.50   3rd Qu.:  -0.3468  
 Max.   :953.0                      Max.   : 57.70   Max.   : 174.3406  
    Altitude         Species          Tree diameter    Urbanisation index
 Min.   :   0.00   Length:951         Min.   :  0.34   Min.   :1.000     
 1st Qu.:  16.50   Class :character   1st Qu.: 47.72   1st Qu.:2.000     
 Median :  48.00   Mode  :character   Median : 75.00   Median :3.000     
 Mean   :  86.92                      Mean   :104.84   Mean   :2.905     
 3rd Qu.: 105.00                      3rd Qu.:119.85   3rd Qu.:4.000     
 Max.   :1076.00                      Max.   :632.00   Max.   :4.000     
 Stand density index  Canopy index   Phenological index
 Min.   :1.000       Min.   : 5.00   Min.   :1.000     
 1st Qu.:2.000       1st Qu.:75.00   1st Qu.:2.000     
 Median :3.000       Median :85.00   Median :2.000     
 Mean   :2.836       Mean   :78.56   Mean   :2.168     
 3rd Qu.:4.000       3rd Qu.:85.00   3rd Qu.:2.000     
 Max.   :4.000       Max.   :95.00   Max.   :4.000

Here, we can see that the “Date observed” column/variable is read as a character variable, which is usually used for text to be treaded as strings.

Is there a difference between summary() and summarise()?

summarise(TreesP1)

# A tibble: 1 × 0

#this doesn't work, you need to specify what exactly you're creating/summarising.

How can we get “Date observed” to be read as a date?

Our dates are in the form: “dd/mm/yyyy”

We can use the function as.Date(DATA$DATE_COLUMN, format = “%d/%m/%Y”):

TreesP1$"Date observed" <- as.Date(TreesP1$"Date observed", format = "%d/%m/%Y")

TreesP2$"Date observed" <- as.Date(TreesP2$"Date observed", format = "%d/%m/%Y")

summary(TreesP1$"Date observed")

        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"1994-10-18" "2022-10-16" "2022-10-18" "2022-08-03" "2022-10-21" "2022-11-30"

summary(TreesP2$"Date observed")

        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2022-01-30" "2022-11-27" "2022-11-30" "2022-11-27" "2022-12-03" "2022-12-31"

Here we can see that all the date entries have been correctly loaded.

But we can also see some outliers! for example, 18/10/1994 and 30/01/2022.

Now we’re ready to start examining the data properly.

Data Wrangling:

Let’s have a look at the structure of these datasets:

str(TreesP1)

spc_tbl_ [951 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Obs. no.           : num [1:951] 1 2 3 4 5 6 7 8 9 10 ...
 $ Date observed      : Date[1:951], format: "2022-10-18" "2022-10-18" ...
 $ Latitude           : num [1:951] 53 53 53 53 51.7 ...
 $ Longitude          : num [1:951] -1.1 -1.1 -1.1 -1.1 -1.57 ...
 $ Altitude           : num [1:951] 69 69 69 69 106 106 106 119 119 119 ...
 $ Species            : chr [1:951] "Quercus robur" "Quercus robur" "Quercus robur" "Quercus robur" ...
 $ Tree diameter      : num [1:951] 137 182.5 200 259.5 99.7 ...
 $ Urbanisation index : num [1:951] 2 2 2 2 4 4 4 4 4 4 ...
 $ Stand density index: num [1:951] 4 4 4 4 2 2 2 4 4 4 ...
 $ Canopy index       : num [1:951] 85 65 85 85 65 65 75 95 95 95 ...
 $ Phenological index : num [1:951] 2 2 2 2 3 3 2 1 1 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   `Obs. no.` = col_double(),
  ..   `Date observed` = col_character(),
  ..   Latitude = col_double(),
  ..   Longitude = col_double(),
  ..   Altitude = col_double(),
  ..   Species = col_character(),
  ..   `Tree diameter` = col_double(),
  ..   `Urbanisation index` = col_double(),
  ..   `Stand density index` = col_double(),
  ..   `Canopy index` = col_double(),
  ..   `Phenological index` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

str(TreesP2)

spc_tbl_ [828 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Obs. no.           : num [1:828] 1 2 3 4 5 6 7 8 9 10 ...
 $ Date observed      : Date[1:828], format: "2022-11-30" "2022-11-30" ...
 $ Latitude           : num [1:828] 53 53 53 53 51.7 ...
 $ Longitude          : num [1:828] -1.1 -1.1 -1.1 -1.1 -1.57 ...
 $ Altitude           : num [1:828] 0 0 0 0 106 106 106 24 24 24 ...
 $ Species            : chr [1:828] "Quercus robur" "Quercus robur" "Quercus robur" "Quercus robur" ...
 $ Tree diameter      : num [1:828] 137 182.5 200 259.5 99.7 ...
 $ Urbanisation index : num [1:828] 2 2 2 2 4 4 4 4 4 4 ...
 $ Stand density index: num [1:828] 4 4 4 4 2 2 2 2 2 2 ...
 $ Canopy index       : num [1:828] 65 55 45 35 15 5 25 55 25 45 ...
 $ Phenological index : num [1:828] 3 4 4 4 4 4 4 4 4 4 ...
 - attr(*, "spec")=
  .. cols(
  ..   `Obs. no.` = col_double(),
  ..   `Date observed` = col_character(),
  ..   Latitude = col_double(),
  ..   Longitude = col_double(),
  ..   Altitude = col_double(),
  ..   Species = col_character(),
  ..   `Tree diameter` = col_double(),
  ..   `Urbanisation index` = col_double(),
  ..   `Stand density index` = col_double(),
  ..   `Canopy index` = col_double(),
  ..   `Phenological index` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Let’s now have a closer look at the “Species” variable to see what we’ve got in there:

summary(TreesP1$Species)

   Length     Class      Mode 
      951 character character

summary(TreesP2$Species)

   Length     Class      Mode 
      828 character character

OK that really wasn’t useful now was it R?!

How can we get a list of the different entries into species?

Before, in the Wendy Book, an example was given to do similar. Let’s try to adapt it!

TreesP1 %>% 
  group_by(Species) %>% 
  count(Species) %>% 
  ungroup()

# A tibble: 4 × 2
  Species                  n
  <chr>                <int>
1 Other deciduous oak     27
2 Other deciduous tree    34
3 Quercus petraea        179
4 Quercus robur          711

TreesP2 %>% 
  group_by(Species) %>% 
  count(Species) %>% 
  ungroup()

# A tibble: 4 × 2
  Species                  n
  <chr>                <int>
1 Other deciduous oak     25
2 Other deciduous tree    29
3 Quercus petraea        150
4 Quercus robur          624

Nice! This tells us exactly what we wanted to know: The range of entries in Species and how many of each occurred. We can also see that there weren’t any slightly different or wrong inputs here.

Tip

If there were wrongly inputted things, we could use:

mutate(CorrectSpecies = recode(Species, “Quercus” = “Other deciduous oak”))

I wonder if the Group_by bit was really neccessary:

#TreesP1 %>% 
#  count(Species)

#TreesP2 %>% 
#  count(species)

Ok that didn’t work, let’s stick with what we did before.

What else could we look at?

Lets see if there is any meaningful variation in variables between each species.

For this, we will need to use group_by(Species) and then summarise()

TreesP1 %>% 
  group_by(Species) %>% 
  summarise("MeanAlt" = mean(Altitude)) %>% 
  ungroup()

# A tibble: 4 × 2
  Species              MeanAlt
  <chr>                  <dbl>
1 Other deciduous oak    272. 
2 Other deciduous tree    55.1
3 Quercus petraea         90.3
4 Quercus robur           80.6

This is quite interesting, I wonder if a box plot would show any relation here!

Code

ggplot(TreesP1, aes(x = Species, y = Altitude, fill = Species)) +
  geom_boxplot(show.legend = FALSE) +
  theme_minimal() +
  labs(y = "Altitude/m")

A box plot of the altitude for each species of tree observed

Lets look at the same for latitude and longitude:

Code

ggplot(TreesP1, aes(x = Species, y = Latitude, fill = Species)) +
  geom_boxplot(show.legend = FALSE) +
  theme_minimal()

Box plots of the latitude and longitude for each species of tree observed

Code

ggplot(TreesP1, aes(x = Species, y = Longitude, fill = Species)) +
  geom_boxplot(show.legend = FALSE) +
  theme_minimal()

Code

ggplot(TreesP1, aes(x = Species, y = Longitude, color = Species)) +
  geom_point(show.legend = FALSE) +
  theme_minimal()

A scatter plot of the longitude for each species of tree observed to check if our hypothesis that there would be no observations between -5 and -70 due to the Atlantic Ocean is true

Code

ggplot(TreesP1, aes(x = Longitude, y = Latitude, color = Species)) +
  geom_point() +
  theme_minimal()

A scatter plot of the latitude against longitude to produce a representation of the world map

Code

ggplot(TreesP1, aes(x = Longitude, y = Latitude, color = Altitude)) +
  geom_point(size = 2) +
  theme_minimal()

A scatter plot of the latitude against longitude to produce a representation of the world map with Altitude represented by colour of the points

Code

TreesP1LOG <- 
  TreesP1 %>%  
  mutate(logAlt = log(Altitude)) #defining a new column equal to log(Altutide)

ggplot(TreesP1LOG, aes(x = Longitude, y = Latitude, color = logAlt)) +
  geom_point(size = 2) +
  theme_minimal() +
  xlim(-125, 25) + #set x and y limits to concentrate on europe and N 
  ylim(40, 60) +   #america
  labs(color = "ln(Altitude)")

Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_point()`).

A scatter plot of the latitude against longitude to produce a representation of Europe and N America with log(Altitude) represented by colour of the points to provide better clarity of the scale

22/10/2024 Week 5 Post Session

Task: Reproduce the graphs and identify which tests can be associated with them. Explain why!

data(iris)

summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Code

ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_boxplot() +
  labs(y = "Sepal Length/mm")

A box plot of the distribution of sepal length for three Iris species.

The associated test with this one would be a **mean**-based test where more than 2 levels can be used. The data seems to be normally distributed as well, so **repeated measures ANOVA** could be used.

Code

ggplot(iris, aes(x = Petal.Length, fill = Species)) + 
  geom_density(alpha = 0.3) +
  labs(x = "Petal Length/mm", y = "Density")

A density plot showing the distribution of petal length for each species of Iris.

Since this plot deals with densities, the associated test would be a frequency based test. We have one nominal variable and one continuous variable. I don’t know what you would use here, but maybe **Chi-Square** since it has >2 groups (ruling out Z test, McNemar and Cochrane), the groups are independent and we’re checking the association between the categories.

Code

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(mapping = aes(color = Species, shape = Species)) + # mapping = aes() being in geom_point instead of ggplot ensures that the calculation of the line of best fit covers all data points instead of having one for each species.
  geom_smooth(method = lm, color = "black") + 
  labs(x = "Petal Length/mm", y = "Petal Width/mm")

`geom_smooth()` using formula = 'y ~ x'

A scatter plot of petal length against petal width with a line of best fit.

We have two continuous variables with three treatments (the species). If the data is normally distributed, **multivariate-ANOVA** would be suitable, otherwise **ANOSIM** is the non-parametric alternative

Code

#Create a variable for big/small sepal length

iris2 <-
  iris %>%
  mutate(Size=ifelse(Sepal.Length < median(Sepal.Length),
    "small", "big"))

ggplot(iris2, aes(x = Species, fill = Size)) +
  geom_bar(position = "dodge") +
  labs(y = "Count")

A bar chart comparing the relative sizes of sepal lengths for each species of Iris.

We are comparing two categorical variables here, one nominal and one binary. The groups are likely unpaired and so the **Pearson Chi-Squared** or **Fisher Exact Test** could be used.

Code

# Position = "dodge" ensures the columns for each size are stacked next to each other instead of on top of each other.

28/10/2024 Contingency Tables

Contingency tables are used to quantify values corresponding to categorical variables.
If you have a dataset with two categorical variables, and want to see how many occurrences of each combination of their categories there is, you use a contingency table.
Use the command: table(dataset$var1, dataset$var2)
This returns a table

library(ISLR2)

example <- table(Carseats$Urban, Carseats$ShelveLoc)

example

     
      Bad Good Medium
  No   22   28     68
  Yes  74   57    151

Lets try it with the tree data:

table(TreesP1$"Urbanisation index", TreesP1$"Stand density index")

   
      1   2   3   4
  1  10  72  26   5
  2  12  93  62  78
  3  22  92  59  39
  4  19  96  65 201

The first variable is on the left and the second on the top.

Urbanisation index:

1 = Urban
4 = Rural

Stand density index:

1 = Alone
4 = Woodland

This tells us that most trees were in rural woodlands, 5 were in urban woodlands, very few were standing alone (first column) and disproportionately many were with a few close trees (second column).

Now let’s see if there was any relation between urban index and phenology index!

pheno <- table(TreesP1$"Urbanisation index", TreesP1$"Phenological index")

Cool!

So these are essentially matrices

addmargins() adds a column/row for sum

proportions() gives the relative sizes of each cell

addmargins(pheno)

     
        1   2   3   4 Sum
  1    20  56  30   7 113
  2    19 171  45  10 245
  3    23 127  49  13 212
  4    49 254  69   9 381
  Sum 111 608 193  39 951

proportions(pheno)

   
              1           2           3           4
  1 0.021030494 0.058885384 0.031545741 0.007360673
  2 0.019978970 0.179810726 0.047318612 0.010515247
  3 0.024185068 0.133543638 0.051524711 0.013669821
  4 0.051524711 0.267087277 0.072555205 0.009463722

The number of decimal places here is waaay too high!

04/11/2024: Post-session 6

1-proportion Z-test

Compares an observed proportion to a theoretical one.

Used for only 2 categories.

use binom.test() (n<30) or prop.test() (n>30)

x = # successes
n = tot # of trials
p = expected probability
correct = TRUE/FALSE (false removes yates continuity correction)
alternative = “less”/“greater” - used to define one-tailed tests

eg for cases of cancer in mice, where 160 developed cancer, of which 95 were male. Are males more likely to get cancer in mice?

ztest <- prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "greater")

ztest


    1-sample proportions test without continuity correction

data:  95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.008853
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
 0.5288397 1.0000000
sample estimates:
      p 
0.59375

ztest$p.value ## Returns the p-value

[1] 0.008853033

So this is significant and yes males are more likely to get cancer.

Two-proportions z-test

Compares two observed proportions

Exactly the same as for 1-proportion z-test, but x and n are vectors with two values, one for category A and one for category B.

eg. does smoking affect rates of cases of lung cancer?

# x = number of smokers with lung cancer, without lung cancer
# n = number of cases with lung cancer,cases without

atest <- prop.test(x = c(490, 400), n=c(500,500))

atest


    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1408536 0.2191464
sample estimates:
prop 1 prop 2 
  0.98   0.80

atest$p.value

[1] 2.363439e-19

Chi-squared goodness of fit test

Compares an observed distribution with an expected distribution.

chi <- chisq.test(x=c(81,50,27), p=c(1/3,1/3,1/3))

chi


    Chi-squared test for given probabilities

data:  c(81, 50, 27)
X-squared = 27.886, df = 2, p-value = 8.803e-07

Chi-squared test of independence

analyses a frequenct table aka contingency table formed by 2 categorical variables

calculated by inputting a contingency table into chisq.test()

To make a contingency table, input two vectors into table()

Using crosstables to do the chi-squared test of independence

Using the mpg dataset:

str(mpg)

tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
 $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
 $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr [1:234] "f" "f" "f" "f" ...
 $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr [1:234] "p" "p" "p" "p" ...
 $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

mpg %>% 
  group_by(class, cyl) %>% ## show only the columns class and cylinder 
  summarize(n=n()) %>% ## create a new column "n" stating the number of occurrences
  kable() ## Kable is part of Knitr and just makes neat looking tables

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	cyl	n
2seater	8	5
compact	4	32
compact	5	2
compact	6	13
midsize	4	16
midsize	6	23
midsize	8	2
minivan	4	1
minivan	6	10
pickup	4	3
pickup	6	10
pickup	8	20
subcompact	4	21
subcompact	5	2
subcompact	6	7
subcompact	8	5
suv	4	8
suv	6	16
suv	8	38

mpg %>% 
  group_by(class, cyl) %>% 
  summarise(n=n()) %>% 
  spread(cyl, n) %>% ## This creates columns for each cyl value with n as the crosstab response
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	5
compact	32	2	13	NA
midsize	16	NA	23	2
minivan	1	NA	10	NA
pickup	3	NA	10	20
subcompact	21	2	7	5
suv	8	NA	16	38

## How different is this to the table() variant?

classmpg <- table(mpg$class, mpg$cyl)

classmpg ##This is exactly the same except that N/A values are replaced with 0, which is more useful.

            
              4  5  6  8
  2seater     0  0  0  5
  compact    32  2 13  0
  midsize    16  0 23  2
  minivan     1  0 10  0
  pickup      3  0 10 20
  subcompact 21  2  7  5
  suv         8  0 16 38

kable(classmpg) ## This makes it neat like the other one.

	4	5	6	8
2seater	0	0	0	5
compact	32	2	13	0
midsize	16	0	23	2
minivan	1	0	10	0
pickup	3	0	10	20
subcompact	21	2	7	5
suv	8	0	16	38

Summary statistics other than frequency

mpg %>% 
  group_by(class, cyl) %>% 
  summarise(mean_cty_miles = mean(cty)) %>%  ## Here compared to the last one we've changed the frequency to the mean city miles
  spread(cyl, mean_cty_miles) %>% 
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

class	4	5	6	8
2seater	NA	NA	NA	15.40000
compact	21.37500	21	16.92308	NA
midsize	20.50000	NA	17.78261	16.00000
minivan	18.00000	NA	15.60000	NA
pickup	16.00000	NA	14.50000	11.80000
subcompact	22.85714	20	17.00000	14.80000
suv	18.00000	NA	14.50000	12.13158

Proportions in a crosstable

mpg %>% 
  group_by(class, cyl) %>% 
  summarise(n=n()) %>% 
  mutate(prop = n/sum(n)) %>% ## Create a column for the proportions
  subset(select = c("class", "cyl", "prop")) %>% ## Select all but "n" column
  spread(class, prop) %>% 
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

cyl	2seater	compact	midsize	minivan	pickup	subcompact	suv
4	NA	0.6808511	0.3902439	0.0909091	0.0909091	0.6000000	0.1290323
5	NA	0.0425532	NA	NA	NA	0.0571429	NA
6	NA	0.2765957	0.5609756	0.9090909	0.3030303	0.2000000	0.2580645
8	1	NA	0.0487805	NA	0.6060606	0.1428571	0.6129032

## I wonder if just using select would work?

mpg %>% 
  group_by(class, cyl) %>% 
  summarise(n=n()) %>% 
  mutate(prop = n/sum(n)) %>%
  select(-n) %>% ## Selects all columns except the "n" column
  spread(class, prop) %>% 
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

cyl	2seater	compact	midsize	minivan	pickup	subcompact	suv
4	NA	0.6808511	0.3902439	0.0909091	0.0909091	0.6000000	0.1290323
5	NA	0.0425532	NA	NA	NA	0.0571429	NA
6	NA	0.2765957	0.5609756	0.9090909	0.3030303	0.2000000	0.2580645
8	1	NA	0.0487805	NA	0.6060606	0.1428571	0.6129032

## Yep, that works exactly the same

CrossTable() also gives frequencies and proportions:

## Make sure gmodels is installed and loaded for CrossTable to work!!

CrossTable(mpg$class, mpg$cyl)


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  234 

 
             | mpg$cyl 
   mpg$class |         4 |         5 |         6 |         8 | Row Total | 
-------------|-----------|-----------|-----------|-----------|-----------|
     2seater |         0 |         0 |         0 |         5 |         5 | 
             |     1.731 |     0.085 |     1.688 |     8.210 |           | 
             |     0.000 |     0.000 |     0.000 |     1.000 |     0.021 | 
             |     0.000 |     0.000 |     0.000 |     0.071 |           | 
             |     0.000 |     0.000 |     0.000 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     compact |        32 |         2 |        13 |         0 |        47 | 
             |    15.210 |     1.782 |     0.518 |    14.060 |           | 
             |     0.681 |     0.043 |     0.277 |     0.000 |     0.201 | 
             |     0.395 |     0.500 |     0.165 |     0.000 |           | 
             |     0.137 |     0.009 |     0.056 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     midsize |        16 |         0 |        23 |         2 |        41 | 
             |     0.230 |     0.701 |     6.059 |     8.591 |           | 
             |     0.390 |     0.000 |     0.561 |     0.049 |     0.175 | 
             |     0.198 |     0.000 |     0.291 |     0.029 |           | 
             |     0.068 |     0.000 |     0.098 |     0.009 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     minivan |         1 |         0 |        10 |         0 |        11 | 
             |     2.070 |     0.188 |    10.641 |     3.291 |           | 
             |     0.091 |     0.000 |     0.909 |     0.000 |     0.047 | 
             |     0.012 |     0.000 |     0.127 |     0.000 |           | 
             |     0.004 |     0.000 |     0.043 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
      pickup |         3 |         0 |        10 |        20 |        33 | 
             |     6.211 |     0.564 |     0.117 |    10.391 |           | 
             |     0.091 |     0.000 |     0.303 |     0.606 |     0.141 | 
             |     0.037 |     0.000 |     0.127 |     0.286 |           | 
             |     0.013 |     0.000 |     0.043 |     0.085 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
  subcompact |        21 |         2 |         7 |         5 |        35 | 
             |     6.515 |     3.284 |     1.963 |     2.858 |           | 
             |     0.600 |     0.057 |     0.200 |     0.143 |     0.150 | 
             |     0.259 |     0.500 |     0.089 |     0.071 |           | 
             |     0.090 |     0.009 |     0.030 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
         suv |         8 |         0 |        16 |        38 |        62 | 
             |     8.444 |     1.060 |     1.162 |    20.403 |           | 
             |     0.129 |     0.000 |     0.258 |     0.613 |     0.265 | 
             |     0.099 |     0.000 |     0.203 |     0.543 |           | 
             |     0.034 |     0.000 |     0.068 |     0.162 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |        81 |         4 |        79 |        70 |       234 | 
             |     0.346 |     0.017 |     0.338 |     0.299 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|

Performing Chi-squared test of independence:

classcyl <- mpg %>% 
  group_by(class, cyl) %>% 
  summarise(n=n()) %>% 
  spread(class, n, fill = 0) %>% ## Fill missing values with 0
  ungroup() %>% 
  select(-cyl) %>% 
  kable() %>%
  as.matrix()  ## Since chisq.test() requires a matrix

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

classcyl <- mpg %>% 
  group_by(class, cyl) %>% 
  summarise(n=n()) %>% 
  spread(cyl, n) %>% ## This creates columns for each cyl value with n as the crosstab response
  kable()

`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.

#classcyl2 <- as.matrix(classcyl)

#chisq.test(classcyl2)

#chiresult <- chisq.test(classcyl)

#chiresult

chisq.test(mpg$class, mpg$cyl)

Warning in chisq.test(mpg$class, mpg$cyl): Chi-squared approximation may be
incorrect


    Pearson's Chi-squared test

data:  mpg$class and mpg$cyl
X-squared = 138.03, df = 18, p-value < 2.2e-16