1 + 1
[1] 2
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
I’ve saved Figure 1 in the working directory, should have info for how to write good!
Make sure the working directory is setup correctly.
Files have to be loaded form same place as the file is saved.
To set working directory use setwd(“PATH/directoryname”)
! [] (imgdirectory) adds a photo (without spaces).
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1
[1] 2
1 + 1
[1] 2
It seems that echo: true doesn’t change the output here.
Hint: use ctrl+alt+i to add a code chunk.
You can add options to executable code like this:
[1] 4
The echo: false
option disables the printing of code (only output is displayed).
See “Pre Session 3 (Tidyverse) Activities for more info [R for Grad Students by Y. Wendy Huynh; Chapter 4]
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
‘Output: false’ stops the workflow of loading the package (or any other operation) being displayed.
Trying to load the same package twice will give an error message!
A blank space between lines is needed to output separate lines and make bullet points work.
ggplot(diamonds, aes(x = cut)) +
geom_bar() # geom_bar() makes the outputted graph a bar chart.
This is a reference: Figure 2.
I had to install a package for this bit, maybe tinytex?
We can put formulas inline like this: \(\alpha\) We indent like this
\[ \pi \in \mathbb{R}\]
Google for more info!
##Test This doesn’t work.
(Just testing out different heading sizes)
ctrl+alt+i is a very useful shortcut! It inputs the r code block.
“#|” denotes the header for a chunk. Sets the variables you want to happen for that chunk.
%>% is the “pipe” operator, so x%>%y says “do y to x”
ctrl+shift+m is a shortcut for %>%
library(palmerpenguins)
dplyr::filter()
library(dplyr)
%>%
diamonds group_by(clarity) %>%
summarize(m = mean(price)) %>%
ungroup()
# A tibble: 8 × 2
clarity m
<ord> <dbl>
1 I1 3924.
2 SI2 5063.
3 SI1 3996.
4 VS2 3925.
5 VS1 3839.
6 VVS2 3284.
7 VVS1 2523.
8 IF 2865.
Not really sure what this bit is about… (as of 3/10/2024)
The “diamonds” dataset is built into ggplot2 and therefore included in tidyverse.
View(diamonds)
Ok, that didn’t work. But it DOES work if typed direct into the console - then it opens a new RStudio tab.
It gives an excel-like view of the data.
You cannot edit the data cell-by-cell like in excel; that needs to be done through code.
str(diamonds)
Again, type this into the console to get the actual readout.
It details the number of data entries and how many variables for each entry along with their details.
Use mutate() to:
Add new columns
Modify variables in dataset.
To create new var.s in diamonds;
A var. called JustOne where all values inside the column = 1
A var. called Values where all values are “something”
A var. called Simple where all values = TRUE
%>%
diamonds mutate(JustOne = 1,
Values = "something",
Simple = TRUE)
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z JustOne Values
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 somet…
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 1 somet…
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 somet…
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 1 somet…
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 somet…
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 1 somet…
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 1 somet…
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 1 somet…
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 somet…
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
A bit more info on the syntax here:
The pipe %>% is contained within tidyverse, specifically in a package called magrittr.
Pressing enter after a comma or a pipe auto-indents the code to make it more readable.
The mutate code reads as: “Within diamonds, I want to mutate/change things by creating the variables”JustOne”, which equals 1; “Values”, which equals “something”; and “Simple”, which equals TRUE.”
Separate comands within mutate() can be separated by commas (With or without a new line)
To create var.s based on existing var.s:
%>%
diamonds mutate(price200 = price - 200)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
Which created the column/var “price200” whose values are that of “price” minus 200.
These changes are not saved!!
Executing str(diamonds) will show that they haven’t been saved.
To save them you must add a line of code to save them as a certain object:
diamonds.new <- THE REST OF THE CODE
Other functions can be used inside mutate!
eg. using the mean(), sd() and median() func.s to calc. the descriptive stats of the diamonds:
%>%
diamonds mutate(m = mean(price),
sd = sd(price),
med = median(price))
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z m sd
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3933. 3989.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3933. 3989.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3933. 3989.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 3933. 3989.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 3933. 3989.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 3933. 3989.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 3933. 3989.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 3933. 3989.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 3933. 3989.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
These funcs.s use all the data in the price column, so the output in each of these columns is the same.
This modifies values w/in a var.
Basic Structure:
data %>% mutate(NewVariable = recode(Variable, “old value” = “new value”))
eg:
%>%
diamonds mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z cut.new
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 IDEAL
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Premium
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 Good
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Premium
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 Good
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very Good
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very Good
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very Good
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Fair
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very Good
# ℹ 53,930 more rows
Example:
# Creating a dataset w/ 2 var.s (Sex, TestScore)
<- factor(c("male", "m", "M", "Female", "Female", "Female"))
Sex <- c(10,20,10,25,12,5)
TestScore <- tibble(Sex, TestScore)
dataset str(dataset)
tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
$ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
$ TestScore: num [1:6] 10 20 10 25 12 5
Remember! c() is used to combine values into a vector or a list!
(I’m not sure what the factor() function does here)
# Creating a new var Sex.new w/ recd values from original var, but where all male values = Male
<-
dataset %>%
dataset mutate(Sex=recode(Sex,"m"="Male","M"="Male", "male"="Male"))
str(dataset)
tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1
$ TestScore: num [1:6] 10 20 10 25 12 5
Returns a 1-row summary.
eg:
%>%
diamonds summarise(avg.price = mean(price))
# A tibble: 1 × 1
avg.price
<dbl>
1 3933.
Takes existing data and groups specific vars. together for future operations.
eg:
## Creating identification number to represent 50 individual people
<- c(1:50)
ID
## Creating sex variable (25 males/25 females)
<- rep(c("male", "female"), 25) # rep stands for replicate
Sex
## Creating age variable (20-39 year olds)
<- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
Age 39, 28, 26, 29, 33, 22, 35, 23, 26, 36,
21, 20, 31, 21, 35, 39, 36, 22, 22, 25,
27, 30, 26, 34, 38, 39, 30, 29, 26, 25,
26, 36, 23, 21, 21, 39, 26, 26, 27, 21)
## Creating a dependent variable called Score
<- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
Score 0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482,
0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752,
0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567,
0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)
## Creating a unified dataset that puts together all variables
<- tibble(ID, Sex, Age, Score) data
%>%
data group_by(Sex) %>%
summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) %>% # Calc. the tot. observations.
ungroup()
# A tibble: 2 × 4
Sex m s n
<chr> <dbl> <dbl> <int>
1 female 0.437 0.268 25
2 male 0.487 0.268 25
Compare to the same but without using group_by() and ungroup():
%>%
data summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) # Calc. the tot. observations.
# A tibble: 1 × 3
m s n
<dbl> <dbl> <int>
1 0.462 0.266 50
So by grouping we get to see the summaries of individual variables within the dataset!
Now lets group by Sex and Age:
%>%
data group_by(Sex, Age) %>%
summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) %>% # Calc. the tot. observations.
ungroup()
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
# A tibble: 27 × 5
Sex Age m s n
<chr> <dbl> <dbl> <dbl> <int>
1 female 20 0.046 NA 1
2 female 21 0.740 0.253 3
3 female 22 0.672 0.253 2
4 female 23 0.501 NA 1
5 female 25 0.579 0.167 3
6 female 26 0.41 NA 1
7 female 28 0.152 NA 1
8 female 29 0.426 0.339 2
9 female 30 0.170 0.238 2
10 female 33 0.173 NA 1
# ℹ 17 more rows
This has given us 27 rows in total. R has considered every unique combination of Sex and Age in the dataset.
The missing std dev values are because there is only 1 observtion for that group, while std dev requires at least 2.
This combination adds a new column to the dataset based on the group.
%>%
data group_by(Sex) %>%
mutate(MeanScoreSex = mean(Score)) %>%
ungroup()
# A tibble: 50 × 5
ID Sex Age Score MeanScoreSex
<int> <chr> <dbl> <dbl> <dbl>
1 1 male 26 0.01 0.487
2 2 female 25 0.418 0.437
3 3 male 39 0.014 0.487
4 4 female 37 0.09 0.437
5 5 male 31 0.061 0.487
6 6 female 34 0.328 0.437
7 7 male 34 0.656 0.487
8 8 female 30 0.002 0.437
9 9 male 26 0.639 0.487
10 10 female 33 0.173 0.437
# ℹ 40 more rows
While there are only 2 results for MeanScoreSex, they have been duplicated for every row in the table.
If ungroup() is left out at the end, this will leave the grouping in place and likely lead to errors in further calculations!
Always use ungroup() after group_by()!!
This only retains certain rows that meet the specified requirements.
%>% filter(cut == "Fair") diamonds
# A tibble: 1,610 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 1,600 more rows
Multiple conditions can be included using | or a comma:
Bar denotes OR
Comma denotes AND.
%>% filter(cut == "Fair" | cut == "Good", price <= 600) diamonds
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
This can also be achieved through:
%>% filter(cut %in% c("Fair", "Good"), price <= 600) diamonds
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
Select the columns you want to see.
You can select using the column name or position number.
The order the columns are listed are the order they are displayed.
%>%
diamonds select(cut, color)
# A tibble: 53,940 × 2
cut color
<ord> <ord>
1 Ideal E
2 Premium E
3 Good E
4 Premium I
5 Good J
6 Very Good J
7 Very Good I
8 Very Good H
9 Fair E
10 Very Good H
# ℹ 53,930 more rows
%>%
diamonds select(1:5)
# A tibble: 53,940 × 5
carat cut color clarity depth
<dbl> <ord> <ord> <ord> <dbl>
1 0.23 Ideal E SI2 61.5
2 0.21 Premium E SI1 59.8
3 0.23 Good E VS1 56.9
4 0.29 Premium I VS2 62.4
5 0.31 Good J SI2 63.3
6 0.24 Very Good J VVS2 62.8
7 0.24 Very Good I VVS1 62.3
8 0.26 Very Good H SI1 61.9
9 0.22 Fair E VS2 65.1
10 0.23 Very Good H VS1 59.4
# ℹ 53,930 more rows
%>%
diamonds select(-cut) # Selects all except "cut"
# A tibble: 53,940 × 9
carat color clarity depth table price x y z
<dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
%>%
diamonds select(-(5:10))
# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
You can also retain everything, but more certain columns to the front:
%>%
diamonds select(x,y,z,everything())
# A tibble: 53,940 × 10
x y z carat cut color clarity depth table price
<dbl> <dbl> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 3.95 3.98 2.43 0.23 Ideal E SI2 61.5 55 326
2 3.89 3.84 2.31 0.21 Premium E SI1 59.8 61 326
3 4.05 4.07 2.31 0.23 Good E VS1 56.9 65 327
4 4.2 4.23 2.63 0.29 Premium I VS2 62.4 58 334
5 4.34 4.35 2.75 0.31 Good J SI2 63.3 58 335
6 3.94 3.96 2.48 0.24 Very Good J VVS2 62.8 57 336
7 3.95 3.98 2.47 0.24 Very Good I VVS1 62.3 57 336
8 4.07 4.11 2.53 0.26 Very Good H SI1 61.9 55 337
9 3.87 3.78 2.49 0.22 Fair E VS2 65.1 61 337
10 4 4.05 2.39 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
This could be used to permenantly reorder the dataset:
<-
testdata1 %>%
diamonds select(x,y,z,everything())
Arranges values within a variable by ascending or descending alphabetical or numerical order.
%>%
diamonds arrange(cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
%>% arrange(desc(price)) diamonds
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
%>% #call the midwest object;
midwest group_by(state) %>% #states that the following commands are only to act on the variable "state"
summarise(poptotalmean = mean(poptotal), #give a summary of "poptotalmean", defined as the mean of the column poptotal
poptotalmed = median(poptotal), #give a summary of poptotalmed, defined as the median of poptotal
popmax = max(poptotal), #give a summary of popmax, defined as the maximum value of poptotal
popmin = min(poptotal), #give a summary of popmin, defined as the minimum value of poptotal
popdistinct = n_distinct(poptotal), #give a summary of popdistinct, defined as the number of unique values in poptotal
popfirst = first(poptotal), #give a summary of popfirst, defined as the first value in poptotal
popany = any(poptotal < 5000), #give a summary of popany, defined as TRUE if it meets the condition; false otherwise
popany2 = any(poptotal > 2000000)) %>% #give a summary of popany2, defined as True if it meets the condition; false otherwise
ungroup() #ungroup the dataset.
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
%>% #do what comes next to the midwest dataset
midwest group_by(state) %>% #group by the state variable
summarise(num5k = sum(poptotal < 5000), #give a summary of num5k, defined as the number of entries with poptotal less than 5000
num2mil = sum(poptotal > 2000000), #give a summary of num2mil, defined as the number of entries with poptotal greater than 2,000,000
numrows = n()) %>% #give a summary of numrows, defined as the total number of rows.
ungroup() #ungroup the dataset
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
%>% #call the object midwest
midwest group_by(county) %>% #just use the variable "county"
summarise(x = n_distinct(state)) %>% #summarise the object x, defined as the number of unique states within the midwest object
arrange(desc(x)) %>% #arrange x in descending order, from most common state to least
ungroup() #ungroup
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
%>% #call the midwest object
midwest group_by(county) %>% #group the object by county
summarise(x = n()) %>% #summarise the object x, defined as the number of entries for each county
ungroup() #ungroup
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
#n() differs from n_distinct() by just giving the number of each unique entry, while n_distinct gives hte number of unique entries
%>% #call midwest object
midwest group_by(county) %>% #group by counties
summarise(x = n_distinct(county)) %>% #define x as the number of distinct counties in each county and print a summary of this
ungroup() #ungroup
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
#obviously, there is only 1 county in each county!
%>% #load diamonds
diamonds group_by(clarity) %>% #select the clarity variable
summarise(a = n_distinct(color), #print a summary of the number of unique colors
b = n_distinct(price), #print a summary of the number of unique prices
c = n()) %>% #print a summary of the total number of variables
ungroup() #ungroup
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
%>% #call diamonds
diamonds group_by(color, cut) %>% #give us every unique combo of color and cut
summarise(m = mean(price), #summarise the mean price of each combo
s = sd(price)) %>% #summarise the standard deviation of each combo
ungroup() #ungroup
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
%>% #call diamonds
diamonds group_by(cut, color) %>% #select every unique combo of cut and color
summarise(m = mean(price), #summarise the mean price of each combo
s = sd(price)) %>% #summarise the standard deviation of each combo
ungroup() #ungroup
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
The second part simply changes the printed order from sorted by color to being sorted by cut.
%>% #call diamonds
diamonds group_by(cut, color, clarity) %>% #select every unique instance of cut, color and clarity
summarise(m = mean(price), #summarise the mean of each instance
s = sd(price), #summarise the std dev of each instance
msale = m* 0.80) %>% #define msale as 80% of the mean price
ungroup() #ungroup
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
%>%
diamonds group_by(cut) %>%
summarise(potato = mean(depth),
pizza = mean(price),
popcorn = median(y),
pineapple = potato - pizza,
papaya = pineapple^2,
peach = n()) %>%
ungroup()
# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
%>% #call diamonds
diamonds group_by(color) %>% #group all unique varieties of color
summarize(m = mean(price)) %>% #print the mean price of each color
mutate(x1 = str_c("Diamond color", color), #create a column x1 with values of "Diamond color"+COLOR
x2 = 5) %>% #create a column x2 of just the value 5 in every row
ungroup() #ungroup
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond colorD 5
2 E 3077. Diamond colorE 5
3 F 3725. Diamond colorF 5
4 G 3999. Diamond colorG 5
5 H 4487. Diamond colorH 5
6 I 5092. Diamond colorI 5
7 J 5324. Diamond colorJ 5
%>% #call diamonds
diamonds group_by(color) %>% #group all unique vars of color
summarise(m = mean(price)) %>% #print the mean price of each color group
ungroup() %>% #ungroup
mutate(x1 = str_c("Diamond color ", color), #create a column x1 with values of "Diamond color"+COLOR
x2 = 5) #create a column x2 of just the value 5 in every row
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
The position of the ungroup in the prev 2 examples doesn’t change the output as the mutate() function doesn’t rely on the group.
%>% #call diamonds
diamonds group_by(color) %>% #group all unique vars of color
mutate(x1 = price*0.5) %>% #create a new column, x1, equal to the price of each color x 0.5
summarise(m = mean(x1)) %>% #print a summary of the mean of x1 called m. this summary excludes all other columns, so the x1 column isn't shown.
ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
%>% #call diamonds
diamonds group_by(color) %>% #group by each unique value of color
mutate(x1 = price*0.5) %>% #add a column equal to half of the price called x1
ungroup() %>% #ungroup
summarise(m = mean(x1)) #print a summary of the mean values of x1
# A tibble: 1 × 1
m
<dbl>
1 1966.
Since this last one ungroups before printing the summary, the printout isn’t separated into individual colors.
Why is grouping data neccessary?
So that functions can be performed on entire groups of the data
Why is ungrouping neccessary?
To ensure that the data isn’t still grouped when you next go to use it.
When should you ungroup data?
Once you’re finished using it in groups.
#View(diamonds) #make sure it has a capital V! opens the data.frame of diamonds in a new window
arrange(diamonds, price) #this gives ther data with price ascending from lowest to highest
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
%>%
diamonds arrange(price) #this does exactly the same as above
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
%>%
diamonds arrange(desc(price)) #this sorts data with price descending
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
%>%
diamonds arrange(price) %>%
arrange(cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.25 Fair E VS1 55.2 64 361 4.21 4.23 2.33
3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91 2.39
4 0.27 Fair E VS1 66.4 58 371 3.99 4.02 2.66
5 0.3 Fair J VS2 64.8 58 416 4.24 4.16 2.72
6 0.3 Fair F SI1 63.1 58 496 4.3 4.22 2.69
7 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
8 0.37 Fair F SI1 65.3 56 527 4.53 4.47 2.94
9 0.3 Fair D SI2 64.6 54 536 4.29 4.25 2.76
10 0.25 Fair D VS1 61.2 55 563 4.09 4.11 2.51
# ℹ 53,930 more rows
%>%
diamonds arrange(desc(price)) %>%
arrange(desc(cut))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
2 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
3 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
4 2.05 Ideal G SI1 61.9 57 18787 8.1 8.16 5.03
5 1.6 Ideal F VS1 62 56 18780 7.47 7.52 4.65
6 2.06 Ideal I VS2 62.2 55 18779 8.15 8.19 5.08
7 1.71 Ideal G VVS2 62.1 55 18768 7.66 7.63 4.75
8 2.08 Ideal H SI1 58.7 60 18760 8.36 8.4 4.92
9 2.03 Ideal G SI1 60 55.8 18757 8.17 8.3 4.95
10 2.61 Ideal I SI2 62.1 56 18756 8.85 8.73 5.46
# ℹ 53,930 more rows
%>%
diamonds arrange(price) %>%
arrange(desc(clarity))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Very Good H IF 63.9 55 369 3.89 3.9 2.49
2 0.24 Very Good H IF 61.3 56 449 4.04 4.06 2.48
3 0.26 Ideal H IF 61.1 57 468 4.12 4.16 2.53
4 0.23 Very Good F IF 61 62 485 3.95 3.99 2.42
5 0.3 Ideal J IF 61.5 56 489 4.32 4.33 2.66
6 0.3 Ideal J IF 61.5 57 489 4.29 4.36 2.66
7 0.23 Very Good E IF 59.9 58 492 3.98 4.03 2.4
8 0.24 Good F IF 65.1 58 492 3.86 3.88 2.52
9 0.24 Ideal H IF 62.5 54 504 3.97 4 2.49
10 0.24 Ideal H IF 62.1 57 504 4 4.04 2.5
# ℹ 53,930 more rows
%>%
diamonds mutate(saleprice = price - 250)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z saleprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
%>%
diamonds select(-x,-y,-z)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
%>%
diamonds group_by(cut) %>%
summarise(number = n()) %>%
ungroup()
# A tibble: 5 × 2
cut number
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
%>%
diamonds mutate(totalNum = sum(n()))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Sometimes, for simple things, it’s good to plot in base R language. However, ggplot2 is WAAY better for more complicated things!
We’ll be using the “mtcars” dataset. Here’s a graph from the inbuilt r code:
plot(mtcars$wt, mtcars$mpg)
Where *mtcars\(wt* returns the column names wt from the mtcars data frame (and similar for mtcars\)mpg)
And here’s a graph from the ggplot2 code:
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
ggplot(): tells it to create a plot object.
geom_point(): tells it to add a layer of points to the plot.
(mtcars, aes(x = wt, y = mpg)) calls the data frame mtcars before telling it which variables to use for the x and y values.
to use vectors instead of a data frame, you can swap “mtcars” with data = NULL, followed by the same aes(x = vector1, y = vector2):
ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) + geom_point()
Using plot():
#plot(pressure$temperature, pressure$pressure, type = "1")
That didn’t work for some reason.
plot(pressure$temperature, pressure$pressure, type = "l")
I misread the l as a 1 - hence where it went wrong.
To add a new line to the graph using plot():
plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure/2)
lines(pressure$temperature, pressure$pressure/4, col = "red")
Now using ggplot2:
ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line()
To add data points:
ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line() + geom_point()
Using built in barplot:
barplot(BOD$demand, names.arg = BOD$Time)
Using ggplot2:
ggplot(BOD, aes(x = Time, y = demand)) +
geom_col() +
ggtitle("A Random Bar Chart")
To ensure the x variable is treated as a discrete var.:
ggplot(BOD, aes(x = factor(Time), y = demand)) +
geom_col() +
ggtitle("A Random Bar Chart")
with hist():
hist(mtcars$mpg, breaks = 10)
using ggplot2:
library(tidyverse)
ggplot(mtcars, aes(x = mpg)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Adding binwidth in geom_histogram() allows you to definte the width of the bins:
library(tidyverse)
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 4)
The built-in plot() function will automatically create a boxplot is the x values are a factor (as opposed to a numeric vector):
plot(ToothGrowth$supp, ToothGrowth$len)
and using ggplot2:
ggplot(ToothGrowth, aes(x = supp, y = len)) +
geom_boxplot()
and we can make box plots for multiple variables using interaction()
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
geom_boxplot()
library(palmerpenguins)
ggplot(penguins, aes(x = species, y = flipper_length_mm, color = sex)) + geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Use curve() with an expression with the variable x:
curve(x^3 - 5*x, from = -4, to = 4)
Don’t forget the comma after the = -4!
It needs to take a numeric vector and return a numeric vector as output.
Using add = TRUE will add it to the previously created plot.
<- function(xvar) {
myfun 1 / (1 + exp(-xvar + 10))
}
curve(myfun(x), from = 0, to = 20)
curve(1 - myfun(x), add = TRUE, col = "red")
library(modeldata)
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
ggplot(crickets,
aes(x = temp, y = rate,
color = species)) +
geom_point(color = "red", #specifies the color of the datapoints
size = 2, #specifies the size
alpha = .3, # makes transparent
shape = "square") +
geom_smooth(method = "lm", #adds trendline; lm makes it a linear model
se = FALSE) + #turns off standard error shading
labs(x = "Temperature", # labs = labels
y = "Chirp rate",
color = "Species", # changes title of legend
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
# histogram - one quantitative variable
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) #can also use binwidth to change the bins
#frequency polygon
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)
#bar chart
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black", #make boundary black
fill = "lightblue") #make fill light blue
ggplot(crickets, aes(x = species, fill = species)) + # fill = species makes species represented by a different fill
geom_bar(show.legend = FALSE)
#box plot
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot() +
theme_minimal()
# faceting
#Not great example:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15, show.legend = FALSE) +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(~species) + # facet_wrap puts the data on two different graphs, side by side, with the same axes scales
theme_minimal()