1 + 1[1] 2
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
I’ve saved Figure 1 in the working directory, should have info for how to write good!
Make sure the working directory is setup correctly.
Files have to be loaded form same place as the file is saved.
To set working directory use setwd(“PATH/directoryname”)
! [] (imgdirectory) adds a photo (without spaces).
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1[1] 2
1 + 1[1] 2
It seems that echo: true doesn’t change the output here.
Hint: use ctrl+alt+i to add a code chunk.
You can add options to executable code like this:
[1] 4
The echo: false option disables the printing of code (only output is displayed).
See “Pre Session 3 (Tidyverse) Activities for more info [R for Grad Students by Y. Wendy Huynh; Chapter 4]
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
‘Output: false’ stops the workflow of loading the package (or any other operation) being displayed.
Trying to load the same package twice will give an error message!
A blank space between lines is needed to output separate lines and make bullet points work.
ggplot(diamonds, aes(x = cut)) +
geom_bar() # geom_bar() makes the outputted graph a bar chart. This is a reference: Figure 2.
I had to install a package for this bit, maybe tinytex?
We can put formulas inline like this: \(\alpha\) We indent like this
\[ \pi \in \mathbb{R}\]
Google for more info!
##Test This doesn’t work.
(Just testing out different heading sizes)
ctrl+alt+i is a very useful shortcut! It inputs the r code block.
“#|” denotes the header for a chunk. Sets the variables you want to happen for that chunk.
%>% is the “pipe” operator, so x%>%y says “do y to x”
ctrl+shift+m is a shortcut for %>%
library(palmerpenguins)dplyr::filter()
library(dplyr)
diamonds %>%
group_by(clarity) %>%
summarize(m = mean(price)) %>%
ungroup() # A tibble: 8 × 2
clarity m
<ord> <dbl>
1 I1 3924.
2 SI2 5063.
3 SI1 3996.
4 VS2 3925.
5 VS1 3839.
6 VVS2 3284.
7 VVS1 2523.
8 IF 2865.
Not really sure what this bit is about… (as of 3/10/2024)
The “diamonds” dataset is built into ggplot2 and therefore included in tidyverse.
#View(diamonds)It gives an excel-like view of the data.
You cannot edit the data cell-by-cell like in excel; that needs to be done through code.
str(diamonds)tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
It details the number of data entries and how many variables for each entry along with their details.
Use mutate() to:
Add new columns
Modify variables in dataset.
To create new var.s in diamonds;
A var. called JustOne where all values inside the column = 1
A var. called Values where all values are “something”
A var. called Simple where all values = TRUE
diamonds %>%
mutate(JustOne = 1,
Values = "something",
Simple = TRUE)# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z JustOne Values
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 somet…
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 1 somet…
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 somet…
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 1 somet…
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 somet…
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 1 somet…
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 1 somet…
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 1 somet…
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 somet…
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
A bit more info on the syntax here:
The pipe %>% is contained within tidyverse, specifically in a package called magrittr.
Pressing enter after a comma or a pipe auto-indents the code to make it more readable.
The mutate code reads as: “Within diamonds, I want to mutate/change things by creating the variables”JustOne”, which equals 1; “Values”, which equals “something”; and “Simple”, which equals TRUE.”
Separate comands within mutate() can be separated by commas (With or without a new line)
To create var.s based on existing var.s:
diamonds %>%
mutate(price200 = price - 200)# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
Which created the column/var “price200” whose values are that of “price” minus 200.
These changes are not saved!!
Executing str(diamonds) will show that they haven’t been saved.
To save them you must add a line of code to save them as a certain object:
diamonds.new <- THE REST OF THE CODE
Other functions can be used inside mutate!
eg. using the mean(), sd() and median() func.s to calc. the descriptive stats of the diamonds:
diamonds %>%
mutate(m = mean(price),
sd = sd(price),
med = median(price))# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z m sd
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3933. 3989.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3933. 3989.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3933. 3989.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 3933. 3989.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 3933. 3989.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 3933. 3989.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 3933. 3989.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 3933. 3989.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 3933. 3989.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
These funcs.s use all the data in the price column, so the output in each of these columns is the same.
This modifies values w/in a var.
Basic Structure:
data %>% mutate(NewVariable = recode(Variable, “old value” = “new value”))
eg:
diamonds %>%
mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z cut.new
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 IDEAL
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Premium
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 Good
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Premium
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 Good
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very Good
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very Good
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very Good
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Fair
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very Good
# ℹ 53,930 more rows
Example:
# Creating a dataset w/ 2 var.s (Sex, TestScore)
Sex <- factor(c("male", "m", "M", "Female", "Female", "Female"))
TestScore <- c(10,20,10,25,12,5)
dataset <- tibble(Sex, TestScore)
str(dataset)tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
$ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
$ TestScore: num [1:6] 10 20 10 25 12 5
Remember! c() is used to combine values into a vector or a list!
(I’m not sure what the factor() function does here)
factor() is a type of variable that is categorical with levels.
# Creating a new var Sex.new w/ recd values from original var, but where all male values = Male
dataset <-
dataset %>%
mutate(Sex=recode(Sex,"m"="Male","M"="Male", "male"="Male"))
str(dataset)tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1
$ TestScore: num [1:6] 10 20 10 25 12 5
Returns a 1-row summary.
eg:
diamonds %>%
summarise(avg.price = mean(price))# A tibble: 1 × 1
avg.price
<dbl>
1 3933.
Takes existing data and groups specific vars. together for future operations.
eg:
## Creating identification number to represent 50 individual people
ID <- c(1:50)
## Creating sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate
## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
39, 28, 26, 29, 33, 22, 35, 23, 26, 36,
21, 20, 31, 21, 35, 39, 36, 22, 22, 25,
27, 30, 26, 34, 38, 39, 30, 29, 26, 25,
26, 36, 23, 21, 21, 39, 26, 26, 27, 21)
## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482,
0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752,
0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567,
0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)
## Creating a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)data %>%
group_by(Sex) %>%
summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) %>% # Calc. the tot. observations.
ungroup()# A tibble: 2 × 4
Sex m s n
<chr> <dbl> <dbl> <int>
1 female 0.437 0.268 25
2 male 0.487 0.268 25
Compare to the same but without using group_by() and ungroup():
data %>%
summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) # Calc. the tot. observations.# A tibble: 1 × 3
m s n
<dbl> <dbl> <int>
1 0.462 0.266 50
So by grouping we get to see the summaries of individual variables within the dataset!
Now lets group by Sex and Age:
data %>%
group_by(Sex, Age) %>%
summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) %>% # Calc. the tot. observations.
ungroup()`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
# A tibble: 27 × 5
Sex Age m s n
<chr> <dbl> <dbl> <dbl> <int>
1 female 20 0.046 NA 1
2 female 21 0.740 0.253 3
3 female 22 0.672 0.253 2
4 female 23 0.501 NA 1
5 female 25 0.579 0.167 3
6 female 26 0.41 NA 1
7 female 28 0.152 NA 1
8 female 29 0.426 0.339 2
9 female 30 0.170 0.238 2
10 female 33 0.173 NA 1
# ℹ 17 more rows
This has given us 27 rows in total. R has considered every unique combination of Sex and Age in the dataset.
The missing std dev values are because there is only 1 observtion for that group, while std dev requires at least 2.
This combination adds a new column to the dataset based on the group.
data %>%
group_by(Sex) %>%
mutate(MeanScoreSex = mean(Score)) %>%
ungroup()# A tibble: 50 × 5
ID Sex Age Score MeanScoreSex
<int> <chr> <dbl> <dbl> <dbl>
1 1 male 26 0.01 0.487
2 2 female 25 0.418 0.437
3 3 male 39 0.014 0.487
4 4 female 37 0.09 0.437
5 5 male 31 0.061 0.487
6 6 female 34 0.328 0.437
7 7 male 34 0.656 0.487
8 8 female 30 0.002 0.437
9 9 male 26 0.639 0.487
10 10 female 33 0.173 0.437
# ℹ 40 more rows
While there are only 2 results for MeanScoreSex, they have been duplicated for every row in the table.
If ungroup() is left out at the end, this will leave the grouping in place and likely lead to errors in further calculations!
Always use ungroup() after group_by()!!
This only retains certain rows that meet the specified requirements.
diamonds %>% filter(cut == "Fair")# A tibble: 1,610 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 1,600 more rows
Multiple conditions can be included using | or a comma:
Bar denotes OR
Comma denotes AND.
diamonds %>% filter(cut == "Fair" | cut == "Good", price <= 600)# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
This can also be achieved through:
diamonds %>% filter(cut %in% c("Fair", "Good"), price <= 600)# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
Select the columns you want to see.
You can select using the column name or position number.
The order the columns are listed are the order they are displayed.
diamonds %>%
select(cut, color)# A tibble: 53,940 × 2
cut color
<ord> <ord>
1 Ideal E
2 Premium E
3 Good E
4 Premium I
5 Good J
6 Very Good J
7 Very Good I
8 Very Good H
9 Fair E
10 Very Good H
# ℹ 53,930 more rows
diamonds %>%
select(1:5)# A tibble: 53,940 × 5
carat cut color clarity depth
<dbl> <ord> <ord> <ord> <dbl>
1 0.23 Ideal E SI2 61.5
2 0.21 Premium E SI1 59.8
3 0.23 Good E VS1 56.9
4 0.29 Premium I VS2 62.4
5 0.31 Good J SI2 63.3
6 0.24 Very Good J VVS2 62.8
7 0.24 Very Good I VVS1 62.3
8 0.26 Very Good H SI1 61.9
9 0.22 Fair E VS2 65.1
10 0.23 Very Good H VS1 59.4
# ℹ 53,930 more rows
diamonds %>%
select(-cut) # Selects all except "cut"# A tibble: 53,940 × 9
carat color clarity depth table price x y z
<dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
select(-(5:10))# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
You can also retain everything, but more certain columns to the front:
diamonds %>%
select(x,y,z,everything())# A tibble: 53,940 × 10
x y z carat cut color clarity depth table price
<dbl> <dbl> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 3.95 3.98 2.43 0.23 Ideal E SI2 61.5 55 326
2 3.89 3.84 2.31 0.21 Premium E SI1 59.8 61 326
3 4.05 4.07 2.31 0.23 Good E VS1 56.9 65 327
4 4.2 4.23 2.63 0.29 Premium I VS2 62.4 58 334
5 4.34 4.35 2.75 0.31 Good J SI2 63.3 58 335
6 3.94 3.96 2.48 0.24 Very Good J VVS2 62.8 57 336
7 3.95 3.98 2.47 0.24 Very Good I VVS1 62.3 57 336
8 4.07 4.11 2.53 0.26 Very Good H SI1 61.9 55 337
9 3.87 3.78 2.49 0.22 Fair E VS2 65.1 61 337
10 4 4.05 2.39 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
This could be used to permenantly reorder the dataset:
testdata1 <-
diamonds %>%
select(x,y,z,everything())Arranges values within a variable by ascending or descending alphabetical or numerical order.
diamonds %>%
arrange(cut)# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
diamonds %>% arrange(desc(price))# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
midwest %>% #call the midwest object;
group_by(state) %>% #states that the following commands are only to act on the variable "state"
summarise(poptotalmean = mean(poptotal), #give a summary of "poptotalmean", defined as the mean of the column poptotal
poptotalmed = median(poptotal), #give a summary of poptotalmed, defined as the median of poptotal
popmax = max(poptotal), #give a summary of popmax, defined as the maximum value of poptotal
popmin = min(poptotal), #give a summary of popmin, defined as the minimum value of poptotal
popdistinct = n_distinct(poptotal), #give a summary of popdistinct, defined as the number of unique values in poptotal
popfirst = first(poptotal), #give a summary of popfirst, defined as the first value in poptotal
popany = any(poptotal < 5000), #give a summary of popany, defined as TRUE if it meets the condition; false otherwise
popany2 = any(poptotal > 2000000)) %>% #give a summary of popany2, defined as True if it meets the condition; false otherwise
ungroup() #ungroup the dataset.# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
midwest %>% #do what comes next to the midwest dataset
group_by(state) %>% #group by the state variable
summarise(num5k = sum(poptotal < 5000), #give a summary of num5k, defined as the number of entries with poptotal less than 5000
num2mil = sum(poptotal > 2000000), #give a summary of num2mil, defined as the number of entries with poptotal greater than 2,000,000
numrows = n()) %>% #give a summary of numrows, defined as the total number of rows.
ungroup() #ungroup the dataset# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
midwest %>% #call the object midwest
group_by(county) %>% #just use the variable "county"
summarise(x = n_distinct(state)) %>% #summarise the object x, defined as the number of unique states within the midwest object
arrange(desc(x)) %>% #arrange x in descending order, from most common state to least
ungroup() #ungroup# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
midwest %>% #call the midwest object
group_by(county) %>% #group the object by county
summarise(x = n()) %>% #summarise the object x, defined as the number of entries for each county
ungroup() #ungroup# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
#n() differs from n_distinct() by just giving the number of each unique entry, while n_distinct gives hte number of unique entriesmidwest %>% #call midwest object
group_by(county) %>% #group by counties
summarise(x = n_distinct(county)) %>% #define x as the number of distinct counties in each county and print a summary of this
ungroup() #ungroup# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
#obviously, there is only 1 county in each county!diamonds %>% #load diamonds
group_by(clarity) %>% #select the clarity variable
summarise(a = n_distinct(color), #print a summary of the number of unique colors
b = n_distinct(price), #print a summary of the number of unique prices
c = n()) %>% #print a summary of the total number of variables
ungroup() #ungroup# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
diamonds %>% #call diamonds
group_by(color, cut) %>% #give us every unique combo of color and cut
summarise(m = mean(price), #summarise the mean price of each combo
s = sd(price)) %>% #summarise the standard deviation of each combo
ungroup() #ungroup`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
diamonds %>% #call diamonds
group_by(cut, color) %>% #select every unique combo of cut and color
summarise(m = mean(price), #summarise the mean price of each combo
s = sd(price)) %>% #summarise the standard deviation of each combo
ungroup() #ungroup`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
The second part simply changes the printed order from sorted by color to being sorted by cut.
diamonds %>% #call diamonds
group_by(cut, color, clarity) %>% #select every unique instance of cut, color and clarity
summarise(m = mean(price), #summarise the mean of each instance
s = sd(price), #summarise the std dev of each instance
msale = m* 0.80) %>% #define msale as 80% of the mean price
ungroup() #ungroup`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
diamonds %>%
group_by(cut) %>%
summarise(potato = mean(depth),
pizza = mean(price),
popcorn = median(y),
pineapple = potato - pizza,
papaya = pineapple^2,
peach = n()) %>%
ungroup()# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
diamonds %>% #call diamonds
group_by(color) %>% #group all unique varieties of color
summarize(m = mean(price)) %>% #print the mean price of each color
mutate(x1 = str_c("Diamond color", color), #create a column x1 with values of "Diamond color"+COLOR
x2 = 5) %>% #create a column x2 of just the value 5 in every row
ungroup() #ungroup# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond colorD 5
2 E 3077. Diamond colorE 5
3 F 3725. Diamond colorF 5
4 G 3999. Diamond colorG 5
5 H 4487. Diamond colorH 5
6 I 5092. Diamond colorI 5
7 J 5324. Diamond colorJ 5
diamonds %>% #call diamonds
group_by(color) %>% #group all unique vars of color
summarise(m = mean(price)) %>% #print the mean price of each color group
ungroup() %>% #ungroup
mutate(x1 = str_c("Diamond color ", color), #create a column x1 with values of "Diamond color"+COLOR
x2 = 5) #create a column x2 of just the value 5 in every row# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
The position of the ungroup in the prev 2 examples doesn’t change the output as the mutate() function doesn’t rely on the group.
diamonds %>% #call diamonds
group_by(color) %>% #group all unique vars of color
mutate(x1 = price*0.5) %>% #create a new column, x1, equal to the price of each color x 0.5
summarise(m = mean(x1)) %>% #print a summary of the mean of x1 called m. this summary excludes all other columns, so the x1 column isn't shown.
ungroup()# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
diamonds %>% #call diamonds
group_by(color) %>% #group by each unique value of color
mutate(x1 = price*0.5) %>% #add a column equal to half of the price called x1
ungroup() %>% #ungroup
summarise(m = mean(x1)) #print a summary of the mean values of x1# A tibble: 1 × 1
m
<dbl>
1 1966.
Since this last one ungroups before printing the summary, the printout isn’t separated into individual colors.
Why is grouping data neccessary?
So that functions can be performed on entire groups of the data
Why is ungrouping neccessary?
To ensure that the data isn’t still grouped when you next go to use it.
When should you ungroup data?
Once you’re finished using it in groups.
#View(diamonds) #make sure it has a capital V! opens the data.frame of diamonds in a new window
arrange(diamonds, price) #this gives ther data with price ascending from lowest to highest# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
arrange(price) #this does exactly the same as above# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
arrange(desc(price)) #this sorts data with price descending# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
diamonds %>%
arrange(price) %>%
arrange(cut)# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.25 Fair E VS1 55.2 64 361 4.21 4.23 2.33
3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91 2.39
4 0.27 Fair E VS1 66.4 58 371 3.99 4.02 2.66
5 0.3 Fair J VS2 64.8 58 416 4.24 4.16 2.72
6 0.3 Fair F SI1 63.1 58 496 4.3 4.22 2.69
7 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
8 0.37 Fair F SI1 65.3 56 527 4.53 4.47 2.94
9 0.3 Fair D SI2 64.6 54 536 4.29 4.25 2.76
10 0.25 Fair D VS1 61.2 55 563 4.09 4.11 2.51
# ℹ 53,930 more rows
diamonds %>%
arrange(desc(price)) %>%
arrange(desc(cut))# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
2 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
3 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
4 2.05 Ideal G SI1 61.9 57 18787 8.1 8.16 5.03
5 1.6 Ideal F VS1 62 56 18780 7.47 7.52 4.65
6 2.06 Ideal I VS2 62.2 55 18779 8.15 8.19 5.08
7 1.71 Ideal G VVS2 62.1 55 18768 7.66 7.63 4.75
8 2.08 Ideal H SI1 58.7 60 18760 8.36 8.4 4.92
9 2.03 Ideal G SI1 60 55.8 18757 8.17 8.3 4.95
10 2.61 Ideal I SI2 62.1 56 18756 8.85 8.73 5.46
# ℹ 53,930 more rows
diamonds %>%
arrange(price) %>%
arrange(desc(clarity))# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Very Good H IF 63.9 55 369 3.89 3.9 2.49
2 0.24 Very Good H IF 61.3 56 449 4.04 4.06 2.48
3 0.26 Ideal H IF 61.1 57 468 4.12 4.16 2.53
4 0.23 Very Good F IF 61 62 485 3.95 3.99 2.42
5 0.3 Ideal J IF 61.5 56 489 4.32 4.33 2.66
6 0.3 Ideal J IF 61.5 57 489 4.29 4.36 2.66
7 0.23 Very Good E IF 59.9 58 492 3.98 4.03 2.4
8 0.24 Good F IF 65.1 58 492 3.86 3.88 2.52
9 0.24 Ideal H IF 62.5 54 504 3.97 4 2.49
10 0.24 Ideal H IF 62.1 57 504 4 4.04 2.5
# ℹ 53,930 more rows
diamonds %>%
mutate(saleprice = price - 250)# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z saleprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
diamonds %>%
select(-x,-y,-z)# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
diamonds %>%
group_by(cut) %>%
summarise(number = n()) %>%
ungroup()# A tibble: 5 × 2
cut number
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
diamonds %>%
mutate(totalNum = sum(n()))# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Sometimes, for simple things, it’s good to plot in base R language. However, ggplot2 is WAAY better for more complicated things!
We’ll be using the “mtcars” dataset. Here’s a graph from the inbuilt r code:
plot(mtcars$wt, mtcars$mpg)Where *mtcars\(wt* returns the column names wt from the mtcars data frame (and similar for mtcars\)mpg)
And here’s a graph from the ggplot2 code:
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()ggplot(): tells it to create a plot object.
geom_point(): tells it to add a layer of points to the plot.
(mtcars, aes(x = wt, y = mpg)) calls the data frame mtcars before telling it which variables to use for the x and y values.
to use vectors instead of a data frame, you can swap “mtcars” with data = NULL, followed by the same aes(x = vector1, y = vector2):
ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) + geom_point()Using plot():
#plot(pressure$temperature, pressure$pressure, type = "1")That didn’t work for some reason.
plot(pressure$temperature, pressure$pressure, type = "l")I misread the l as a 1 - hence where it went wrong.
To add a new line to the graph using plot():
plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure/2)
lines(pressure$temperature, pressure$pressure/4, col = "red")Now using ggplot2:
ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line()To add data points:
ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line() + geom_point()Using built in barplot:
barplot(BOD$demand, names.arg = BOD$Time)Using ggplot2:
ggplot(BOD, aes(x = Time, y = demand)) +
geom_col() +
ggtitle("A Random Bar Chart")To ensure the x variable is treated as a discrete var.:
ggplot(BOD, aes(x = factor(Time), y = demand)) +
geom_col() +
ggtitle("A Random Bar Chart")with hist():
hist(mtcars$mpg, breaks = 10)using ggplot2:
library(tidyverse)
ggplot(mtcars, aes(x = mpg)) + geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Adding binwidth in geom_histogram() allows you to definte the width of the bins:
#library(tidyverse)
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 4)The built-in plot() function will automatically create a boxplot is the x values are a factor (as opposed to a numeric vector):
plot(ToothGrowth$supp, ToothGrowth$len)and using ggplot2:
ggplot(ToothGrowth, aes(x = supp, y = len)) +
geom_boxplot()and we can make box plots for multiple variables using interaction()
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
geom_boxplot()library(palmerpenguins)
ggplot(penguins, aes(x = species, y = flipper_length_mm, color = sex)) + geom_boxplot()Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Use curve() with an expression with the variable x:
curve(x^3 - 5*x, from = -4, to = 4)Don’t forget the comma after the = -4!
It needs to take a numeric vector and return a numeric vector as output.
Using add = TRUE will add it to the previously created plot.
myfun <- function(xvar) {
1 / (1 + exp(-xvar + 10))
}
curve(myfun(x), from = 0, to = 20)
curve(1 - myfun(x), add = TRUE, col = "red")library(modeldata)
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
ggplot(crickets,
aes(x = temp, y = rate,
color = species)) +
geom_point(color = "red", #specifies the color of the datapoints
size = 2, #specifies the size
alpha = .3, # makes transparent
shape = "square") +
geom_smooth(method = "lm", #adds trendline; lm makes it a linear model
se = FALSE) + #turns off standard error shading
labs(x = "Temperature", # labs = labels
y = "Chirp rate",
color = "Species", # changes title of legend
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")`geom_smooth()` using formula = 'y ~ x'
# histogram - one quantitative variable
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) #can also use binwidth to change the bins#frequency polygon
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)#bar chart
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black", #make boundary black
fill = "lightblue") #make fill light blueggplot(crickets, aes(x = species, fill = species)) + # fill = species makes species represented by a different fill
geom_bar(show.legend = FALSE)#box plot
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot() +
theme_minimal()# faceting
#Not great example:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15, show.legend = FALSE) +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(~species) + # facet_wrap puts the data on two different graphs, side by side, with the same axes scales
theme_minimal()View("Tree_Phenology_Data")That didn’t work. How do we get the data loaded into R?
Maybe because it wasn’t saved as .csv
View("Tree_Phenology_Data_P1")Nope again!
TreesP1 <- read_csv("Tree_Phenology_Data_P1.csv")Rows: 951 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Date observed, Species
dbl (9): Obs. no., Latitude, Longitude, Altitude, Tree diameter, Urbanisatio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
TreesP2 <- read_csv("Tree_Phenology_Data_P2.csv")Rows: 828 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Date observed, Species
dbl (9): Obs. no., Latitude, Longitude, Altitude, Tree diameter, Urbanisatio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This worked!
read.csv and read_csv are DIFFERENT functions!
read.csv is base R language and returns a data frame with factors as default
read_csv is tidyverse language and returns a tibble with characters as default
Use read_csv!!!
summary(TreesP1) Obs. no. Date observed Latitude Longitude
Min. : 1.0 Length:951 Min. :-35.73 Min. :-115.9786
1st Qu.:239.5 Class :character 1st Qu.: 51.37 1st Qu.: -3.2639
Median :477.0 Mode :character Median : 52.08 Median : -1.9382
Mean :477.3 Mean : 52.07 Mean : -1.9622
3rd Qu.:715.5 3rd Qu.: 53.50 3rd Qu.: -0.3468
Max. :953.0 Max. : 57.70 Max. : 174.3406
Altitude Species Tree diameter Urbanisation index
Min. : 0.00 Length:951 Min. : 0.34 Min. :1.000
1st Qu.: 16.50 Class :character 1st Qu.: 47.72 1st Qu.:2.000
Median : 48.00 Mode :character Median : 75.00 Median :3.000
Mean : 86.92 Mean :104.84 Mean :2.905
3rd Qu.: 105.00 3rd Qu.:119.85 3rd Qu.:4.000
Max. :1076.00 Max. :632.00 Max. :4.000
Stand density index Canopy index Phenological index
Min. :1.000 Min. : 5.00 Min. :1.000
1st Qu.:2.000 1st Qu.:75.00 1st Qu.:2.000
Median :3.000 Median :85.00 Median :2.000
Mean :2.836 Mean :78.56 Mean :2.168
3rd Qu.:4.000 3rd Qu.:85.00 3rd Qu.:2.000
Max. :4.000 Max. :95.00 Max. :4.000
Here, we can see that the “Date observed” column/variable is read as a character variable, which is usually used for text to be treaded as strings.
Is there a difference between summary() and summarise()?
summarise(TreesP1)# A tibble: 1 × 0
#this doesn't work, you need to specify what exactly you're creating/summarising.How can we get “Date observed” to be read as a date?
Our dates are in the form: “dd/mm/yyyy”
We can use the function as.Date(DATA$DATE_COLUMN, format = “%d/%m/%Y”):
TreesP1$"Date observed" <- as.Date(TreesP1$"Date observed", format = "%d/%m/%Y")
TreesP2$"Date observed" <- as.Date(TreesP2$"Date observed", format = "%d/%m/%Y")
summary(TreesP1$"Date observed") Min. 1st Qu. Median Mean 3rd Qu. Max.
"1994-10-18" "2022-10-16" "2022-10-18" "2022-08-03" "2022-10-21" "2022-11-30"
summary(TreesP2$"Date observed") Min. 1st Qu. Median Mean 3rd Qu. Max.
"2022-01-30" "2022-11-27" "2022-11-30" "2022-11-27" "2022-12-03" "2022-12-31"
Here we can see that all the date entries have been correctly loaded.
But we can also see some outliers! for example, 18/10/1994 and 30/01/2022.
Now we’re ready to start examining the data properly.
Let’s have a look at the structure of these datasets:
str(TreesP1)spc_tbl_ [951 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Obs. no. : num [1:951] 1 2 3 4 5 6 7 8 9 10 ...
$ Date observed : Date[1:951], format: "2022-10-18" "2022-10-18" ...
$ Latitude : num [1:951] 53 53 53 53 51.7 ...
$ Longitude : num [1:951] -1.1 -1.1 -1.1 -1.1 -1.57 ...
$ Altitude : num [1:951] 69 69 69 69 106 106 106 119 119 119 ...
$ Species : chr [1:951] "Quercus robur" "Quercus robur" "Quercus robur" "Quercus robur" ...
$ Tree diameter : num [1:951] 137 182.5 200 259.5 99.7 ...
$ Urbanisation index : num [1:951] 2 2 2 2 4 4 4 4 4 4 ...
$ Stand density index: num [1:951] 4 4 4 4 2 2 2 4 4 4 ...
$ Canopy index : num [1:951] 85 65 85 85 65 65 75 95 95 95 ...
$ Phenological index : num [1:951] 2 2 2 2 3 3 2 1 1 1 ...
- attr(*, "spec")=
.. cols(
.. `Obs. no.` = col_double(),
.. `Date observed` = col_character(),
.. Latitude = col_double(),
.. Longitude = col_double(),
.. Altitude = col_double(),
.. Species = col_character(),
.. `Tree diameter` = col_double(),
.. `Urbanisation index` = col_double(),
.. `Stand density index` = col_double(),
.. `Canopy index` = col_double(),
.. `Phenological index` = col_double()
.. )
- attr(*, "problems")=<externalptr>
str(TreesP2)spc_tbl_ [828 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Obs. no. : num [1:828] 1 2 3 4 5 6 7 8 9 10 ...
$ Date observed : Date[1:828], format: "2022-11-30" "2022-11-30" ...
$ Latitude : num [1:828] 53 53 53 53 51.7 ...
$ Longitude : num [1:828] -1.1 -1.1 -1.1 -1.1 -1.57 ...
$ Altitude : num [1:828] 0 0 0 0 106 106 106 24 24 24 ...
$ Species : chr [1:828] "Quercus robur" "Quercus robur" "Quercus robur" "Quercus robur" ...
$ Tree diameter : num [1:828] 137 182.5 200 259.5 99.7 ...
$ Urbanisation index : num [1:828] 2 2 2 2 4 4 4 4 4 4 ...
$ Stand density index: num [1:828] 4 4 4 4 2 2 2 2 2 2 ...
$ Canopy index : num [1:828] 65 55 45 35 15 5 25 55 25 45 ...
$ Phenological index : num [1:828] 3 4 4 4 4 4 4 4 4 4 ...
- attr(*, "spec")=
.. cols(
.. `Obs. no.` = col_double(),
.. `Date observed` = col_character(),
.. Latitude = col_double(),
.. Longitude = col_double(),
.. Altitude = col_double(),
.. Species = col_character(),
.. `Tree diameter` = col_double(),
.. `Urbanisation index` = col_double(),
.. `Stand density index` = col_double(),
.. `Canopy index` = col_double(),
.. `Phenological index` = col_double()
.. )
- attr(*, "problems")=<externalptr>
Let’s now have a closer look at the “Species” variable to see what we’ve got in there:
summary(TreesP1$Species) Length Class Mode
951 character character
summary(TreesP2$Species) Length Class Mode
828 character character
OK that really wasn’t useful now was it R?!
How can we get a list of the different entries into species?
Before, in the Wendy Book, an example was given to do similar. Let’s try to adapt it!
TreesP1 %>%
group_by(Species) %>%
count(Species) %>%
ungroup()# A tibble: 4 × 2
Species n
<chr> <int>
1 Other deciduous oak 27
2 Other deciduous tree 34
3 Quercus petraea 179
4 Quercus robur 711
TreesP2 %>%
group_by(Species) %>%
count(Species) %>%
ungroup()# A tibble: 4 × 2
Species n
<chr> <int>
1 Other deciduous oak 25
2 Other deciduous tree 29
3 Quercus petraea 150
4 Quercus robur 624
Nice! This tells us exactly what we wanted to know: The range of entries in Species and how many of each occurred. We can also see that there weren’t any slightly different or wrong inputs here.
If there were wrongly inputted things, we could use:
I wonder if the Group_by bit was really neccessary:
#TreesP1 %>%
# count(Species)
#TreesP2 %>%
# count(species)Ok that didn’t work, let’s stick with what we did before.
What else could we look at?
Lets see if there is any meaningful variation in variables between each species.
For this, we will need to use group_by(Species) and then summarise()
TreesP1 %>%
group_by(Species) %>%
summarise("MeanAlt" = mean(Altitude)) %>%
ungroup()# A tibble: 4 × 2
Species MeanAlt
<chr> <dbl>
1 Other deciduous oak 272.
2 Other deciduous tree 55.1
3 Quercus petraea 90.3
4 Quercus robur 80.6
This is quite interesting, I wonder if a box plot would show any relation here!
ggplot(TreesP1, aes(x = Species, y = Altitude, fill = Species)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal() +
labs(y = "Altitude/m")Lets look at the same for latitude and longitude:
ggplot(TreesP1, aes(x = Species, y = Latitude, fill = Species)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal()ggplot(TreesP1, aes(x = Species, y = Longitude, fill = Species)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal()ggplot(TreesP1, aes(x = Species, y = Longitude, fill = Species)) +
geom_point(show.legend = FALSE) +
theme_minimal()ggplot(TreesP1, aes(x = Longitude, y = Latitude, fill = Species)) +
geom_point(show.legend = FALSE) +
theme_minimal()ggplot(TreesP1, aes(x = Longitude, y = Latitude, color = Altitude)) +
geom_point() +
theme_minimal()