1 + 1
[1] 2
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
I’ve saved Figure 1 in the working directory, should have info for how to write good!
Make sure the working directory is setup correctly.
Files have to be loaded form same place as the file is saved.
To set working directory use setwd(“PATH/directoryname”)
! [] (imgdirectory) adds a photo (without spaces).
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1
[1] 2
1 + 1
[1] 2
It seems that echo: true doesn’t change the output here.
Hint: use ctrl+alt+i to add a code chunk.
You can add options to executable code like this:
[1] 4
The echo: false
option disables the printing of code (only output is displayed).
See “Pre Session 3 (Tidyverse) Activities for more info [R for Grad Students by Y. Wendy Huynh; Chapter 4]
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(gmodels)
library(ggplot2)
‘Output: false’ stops the workflow of loading the package (or any other operation) being displayed.
Trying to load the same package twice will give an error message!
A blank space between lines is needed to output separate lines and make bullet points work.
ggplot(diamonds, aes(x = cut)) +
geom_bar() # geom_bar() makes the outputted graph a bar chart.
This is a reference: Figure 2.
I had to install a package for this bit, maybe tinytex?
We can put formulas inline like this: \(\alpha\) We indent like this
\[ \pi \in \mathbb{R}\]
Google for more info!
##Test This doesn’t work.
(Just testing out different heading sizes)
ctrl+alt+i is a very useful shortcut! It inputs the r code block.
“#|” denotes the header for a chunk. Sets the variables you want to happen for that chunk.
%>% is the “pipe” operator, so x%>%y says “do y to x”
ctrl+shift+m is a shortcut for %>%
library(palmerpenguins)
dplyr::filter()
library(dplyr)
%>%
diamonds group_by(clarity) %>%
summarize(m = mean(price)) %>%
ungroup()
# A tibble: 8 × 2
clarity m
<ord> <dbl>
1 I1 3924.
2 SI2 5063.
3 SI1 3996.
4 VS2 3925.
5 VS1 3839.
6 VVS2 3284.
7 VVS1 2523.
8 IF 2865.
Not really sure what this bit is about… (as of 3/10/2024)
The “diamonds” dataset is built into ggplot2 and therefore included in tidyverse.
#View(diamonds)
It gives an excel-like view of the data.
You cannot edit the data cell-by-cell like in excel; that needs to be done through code.
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
It details the number of data entries and how many variables for each entry along with their details.
Use mutate() to:
Add new columns
Modify variables in dataset.
To create new var.s in diamonds;
A var. called JustOne where all values inside the column = 1
A var. called Values where all values are “something”
A var. called Simple where all values = TRUE
%>%
diamonds mutate(JustOne = 1,
Values = "something",
Simple = TRUE)
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z JustOne Values
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 somet…
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 1 somet…
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 somet…
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 1 somet…
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 somet…
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 1 somet…
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 1 somet…
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 1 somet…
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 somet…
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
A bit more info on the syntax here:
The pipe %>% is contained within tidyverse, specifically in a package called magrittr.
Pressing enter after a comma or a pipe auto-indents the code to make it more readable.
The mutate code reads as: “Within diamonds, I want to mutate/change things by creating the variables”JustOne”, which equals 1; “Values”, which equals “something”; and “Simple”, which equals TRUE.”
Separate comands within mutate() can be separated by commas (With or without a new line)
To create var.s based on existing var.s:
%>%
diamonds mutate(price200 = price - 200)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
Which created the column/var “price200” whose values are that of “price” minus 200.
These changes are not saved!!
Executing str(diamonds) will show that they haven’t been saved.
To save them you must add a line of code to save them as a certain object:
diamonds.new <- THE REST OF THE CODE
Other functions can be used inside mutate!
eg. using the mean(), sd() and median() func.s to calc. the descriptive stats of the diamonds:
%>%
diamonds mutate(m = mean(price),
sd = sd(price),
med = median(price))
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z m sd
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3933. 3989.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3933. 3989.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3933. 3989.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 3933. 3989.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 3933. 3989.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 3933. 3989.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 3933. 3989.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 3933. 3989.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 3933. 3989.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
These funcs.s use all the data in the price column, so the output in each of these columns is the same.
This modifies values w/in a var.
Basic Structure:
data %>% mutate(NewVariable = recode(Variable, “old value” = “new value”))
eg:
%>%
diamonds mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z cut.new
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 IDEAL
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Premium
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 Good
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Premium
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 Good
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very Good
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very Good
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very Good
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Fair
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very Good
# ℹ 53,930 more rows
Example:
# Creating a dataset w/ 2 var.s (Sex, TestScore)
<- factor(c("male", "m", "M", "Female", "Female", "Female"))
Sex <- c(10,20,10,25,12,5)
TestScore <- tibble(Sex, TestScore)
dataset str(dataset)
tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
$ Sex : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
$ TestScore: num [1:6] 10 20 10 25 12 5
Remember! c() is used to combine values into a vector or a list!
(I’m not sure what the factor() function does here)
factor() is a type of variable that is categorical with levels.
# Creating a new var Sex.new w/ recd values from original var, but where all male values = Male
<-
dataset %>%
dataset mutate(Sex=recode(Sex,"m"="Male","M"="Male", "male"="Male"))
str(dataset)
tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1
$ TestScore: num [1:6] 10 20 10 25 12 5
Returns a 1-row summary.
eg:
%>%
diamonds summarise(avg.price = mean(price))
# A tibble: 1 × 1
avg.price
<dbl>
1 3933.
Takes existing data and groups specific vars. together for future operations.
eg:
## Creating identification number to represent 50 individual people
<- c(1:50)
ID
## Creating sex variable (25 males/25 females)
<- rep(c("male", "female"), 25) # rep stands for replicate
Sex
## Creating age variable (20-39 year olds)
<- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33,
Age 39, 28, 26, 29, 33, 22, 35, 23, 26, 36,
21, 20, 31, 21, 35, 39, 36, 22, 22, 25,
27, 30, 26, 34, 38, 39, 30, 29, 26, 25,
26, 36, 23, 21, 21, 39, 26, 26, 27, 21)
## Creating a dependent variable called Score
<- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173,
Score 0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482,
0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752,
0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567,
0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)
## Creating a unified dataset that puts together all variables
<- tibble(ID, Sex, Age, Score) data
%>%
data group_by(Sex) %>%
summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) %>% # Calc. the tot. observations.
ungroup()
# A tibble: 2 × 4
Sex m s n
<chr> <dbl> <dbl> <int>
1 female 0.437 0.268 25
2 male 0.487 0.268 25
Compare to the same but without using group_by() and ungroup():
%>%
data summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) # Calc. the tot. observations.
# A tibble: 1 × 3
m s n
<dbl> <dbl> <int>
1 0.462 0.266 50
So by grouping we get to see the summaries of individual variables within the dataset!
Now lets group by Sex and Age:
%>%
data group_by(Sex, Age) %>%
summarise(m = mean(Score), # Calculates the mean score
s = sd(Score), # Calc. the standard deviation
n = n()) %>% # Calc. the tot. observations.
ungroup()
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
# A tibble: 27 × 5
Sex Age m s n
<chr> <dbl> <dbl> <dbl> <int>
1 female 20 0.046 NA 1
2 female 21 0.740 0.253 3
3 female 22 0.672 0.253 2
4 female 23 0.501 NA 1
5 female 25 0.579 0.167 3
6 female 26 0.41 NA 1
7 female 28 0.152 NA 1
8 female 29 0.426 0.339 2
9 female 30 0.170 0.238 2
10 female 33 0.173 NA 1
# ℹ 17 more rows
This has given us 27 rows in total. R has considered every unique combination of Sex and Age in the dataset.
The missing std dev values are because there is only 1 observtion for that group, while std dev requires at least 2.
This combination adds a new column to the dataset based on the group.
%>%
data group_by(Sex) %>%
mutate(MeanScoreSex = mean(Score)) %>%
ungroup()
# A tibble: 50 × 5
ID Sex Age Score MeanScoreSex
<int> <chr> <dbl> <dbl> <dbl>
1 1 male 26 0.01 0.487
2 2 female 25 0.418 0.437
3 3 male 39 0.014 0.487
4 4 female 37 0.09 0.437
5 5 male 31 0.061 0.487
6 6 female 34 0.328 0.437
7 7 male 34 0.656 0.487
8 8 female 30 0.002 0.437
9 9 male 26 0.639 0.487
10 10 female 33 0.173 0.437
# ℹ 40 more rows
While there are only 2 results for MeanScoreSex, they have been duplicated for every row in the table.
If ungroup() is left out at the end, this will leave the grouping in place and likely lead to errors in further calculations!
Always use ungroup() after group_by()!!
This only retains certain rows that meet the specified requirements.
%>% filter(cut == "Fair") diamonds
# A tibble: 1,610 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 1,600 more rows
Multiple conditions can be included using | or a comma:
Bar denotes OR
Comma denotes AND.
%>% filter(cut == "Fair" | cut == "Good", price <= 600) diamonds
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
This can also be achieved through:
%>% filter(cut %in% c("Fair", "Good"), price <= 600) diamonds
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
Select the columns you want to see.
You can select using the column name or position number.
The order the columns are listed are the order they are displayed.
%>%
diamonds select(cut, color)
# A tibble: 53,940 × 2
cut color
<ord> <ord>
1 Ideal E
2 Premium E
3 Good E
4 Premium I
5 Good J
6 Very Good J
7 Very Good I
8 Very Good H
9 Fair E
10 Very Good H
# ℹ 53,930 more rows
%>%
diamonds select(1:5)
# A tibble: 53,940 × 5
carat cut color clarity depth
<dbl> <ord> <ord> <ord> <dbl>
1 0.23 Ideal E SI2 61.5
2 0.21 Premium E SI1 59.8
3 0.23 Good E VS1 56.9
4 0.29 Premium I VS2 62.4
5 0.31 Good J SI2 63.3
6 0.24 Very Good J VVS2 62.8
7 0.24 Very Good I VVS1 62.3
8 0.26 Very Good H SI1 61.9
9 0.22 Fair E VS2 65.1
10 0.23 Very Good H VS1 59.4
# ℹ 53,930 more rows
%>%
diamonds select(-cut) # Selects all except "cut"
# A tibble: 53,940 × 9
carat color clarity depth table price x y z
<dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
%>%
diamonds select(-(5:10))
# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
You can also retain everything, but more certain columns to the front:
%>%
diamonds select(x,y,z,everything())
# A tibble: 53,940 × 10
x y z carat cut color clarity depth table price
<dbl> <dbl> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 3.95 3.98 2.43 0.23 Ideal E SI2 61.5 55 326
2 3.89 3.84 2.31 0.21 Premium E SI1 59.8 61 326
3 4.05 4.07 2.31 0.23 Good E VS1 56.9 65 327
4 4.2 4.23 2.63 0.29 Premium I VS2 62.4 58 334
5 4.34 4.35 2.75 0.31 Good J SI2 63.3 58 335
6 3.94 3.96 2.48 0.24 Very Good J VVS2 62.8 57 336
7 3.95 3.98 2.47 0.24 Very Good I VVS1 62.3 57 336
8 4.07 4.11 2.53 0.26 Very Good H SI1 61.9 55 337
9 3.87 3.78 2.49 0.22 Fair E VS2 65.1 61 337
10 4 4.05 2.39 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
This could be used to permenantly reorder the dataset:
<-
testdata1 %>%
diamonds select(x,y,z,everything())
Arranges values within a variable by ascending or descending alphabetical or numerical order.
%>%
diamonds arrange(cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
%>% arrange(desc(price)) diamonds
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
%>% #call the midwest object;
midwest group_by(state) %>% #states that the following commands are only to act on the variable "state"
summarise(poptotalmean = mean(poptotal), #give a summary of "poptotalmean", defined as the mean of the column poptotal
poptotalmed = median(poptotal), #give a summary of poptotalmed, defined as the median of poptotal
popmax = max(poptotal), #give a summary of popmax, defined as the maximum value of poptotal
popmin = min(poptotal), #give a summary of popmin, defined as the minimum value of poptotal
popdistinct = n_distinct(poptotal), #give a summary of popdistinct, defined as the number of unique values in poptotal
popfirst = first(poptotal), #give a summary of popfirst, defined as the first value in poptotal
popany = any(poptotal < 5000), #give a summary of popany, defined as TRUE if it meets the condition; false otherwise
popany2 = any(poptotal > 2000000)) %>% #give a summary of popany2, defined as True if it meets the condition; false otherwise
ungroup() #ungroup the dataset.
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
%>% #do what comes next to the midwest dataset
midwest group_by(state) %>% #group by the state variable
summarise(num5k = sum(poptotal < 5000), #give a summary of num5k, defined as the number of entries with poptotal less than 5000
num2mil = sum(poptotal > 2000000), #give a summary of num2mil, defined as the number of entries with poptotal greater than 2,000,000
numrows = n()) %>% #give a summary of numrows, defined as the total number of rows.
ungroup() #ungroup the dataset
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
%>% #call the object midwest
midwest group_by(county) %>% #just use the variable "county"
summarise(x = n_distinct(state)) %>% #summarise the object x, defined as the number of unique states within the midwest object
arrange(desc(x)) %>% #arrange x in descending order, from most common state to least
ungroup() #ungroup
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
%>% #call the midwest object
midwest group_by(county) %>% #group the object by county
summarise(x = n()) %>% #summarise the object x, defined as the number of entries for each county
ungroup() #ungroup
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
#n() differs from n_distinct() by just giving the number of each unique entry, while n_distinct gives hte number of unique entries
%>% #call midwest object
midwest group_by(county) %>% #group by counties
summarise(x = n_distinct(county)) %>% #define x as the number of distinct counties in each county and print a summary of this
ungroup() #ungroup
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
#obviously, there is only 1 county in each county!
%>% #load diamonds
diamonds group_by(clarity) %>% #select the clarity variable
summarise(a = n_distinct(color), #print a summary of the number of unique colors
b = n_distinct(price), #print a summary of the number of unique prices
c = n()) %>% #print a summary of the total number of variables
ungroup() #ungroup
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
%>% #call diamonds
diamonds group_by(color, cut) %>% #give us every unique combo of color and cut
summarise(m = mean(price), #summarise the mean price of each combo
s = sd(price)) %>% #summarise the standard deviation of each combo
ungroup() #ungroup
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
%>% #call diamonds
diamonds group_by(cut, color) %>% #select every unique combo of cut and color
summarise(m = mean(price), #summarise the mean price of each combo
s = sd(price)) %>% #summarise the standard deviation of each combo
ungroup() #ungroup
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
The second part simply changes the printed order from sorted by color to being sorted by cut.
%>% #call diamonds
diamonds group_by(cut, color, clarity) %>% #select every unique instance of cut, color and clarity
summarise(m = mean(price), #summarise the mean of each instance
s = sd(price), #summarise the std dev of each instance
msale = m* 0.80) %>% #define msale as 80% of the mean price
ungroup() #ungroup
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
%>%
diamonds group_by(cut) %>%
summarise(potato = mean(depth),
pizza = mean(price),
popcorn = median(y),
pineapple = potato - pizza,
papaya = pineapple^2,
peach = n()) %>%
ungroup()
# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
%>% #call diamonds
diamonds group_by(color) %>% #group all unique varieties of color
summarize(m = mean(price)) %>% #print the mean price of each color
mutate(x1 = str_c("Diamond color", color), #create a column x1 with values of "Diamond color"+COLOR
x2 = 5) %>% #create a column x2 of just the value 5 in every row
ungroup() #ungroup
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond colorD 5
2 E 3077. Diamond colorE 5
3 F 3725. Diamond colorF 5
4 G 3999. Diamond colorG 5
5 H 4487. Diamond colorH 5
6 I 5092. Diamond colorI 5
7 J 5324. Diamond colorJ 5
%>% #call diamonds
diamonds group_by(color) %>% #group all unique vars of color
summarise(m = mean(price)) %>% #print the mean price of each color group
ungroup() %>% #ungroup
mutate(x1 = str_c("Diamond color ", color), #create a column x1 with values of "Diamond color"+COLOR
x2 = 5) #create a column x2 of just the value 5 in every row
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
The position of the ungroup in the prev 2 examples doesn’t change the output as the mutate() function doesn’t rely on the group.
%>% #call diamonds
diamonds group_by(color) %>% #group all unique vars of color
mutate(x1 = price*0.5) %>% #create a new column, x1, equal to the price of each color x 0.5
summarise(m = mean(x1)) %>% #print a summary of the mean of x1 called m. this summary excludes all other columns, so the x1 column isn't shown.
ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
%>% #call diamonds
diamonds group_by(color) %>% #group by each unique value of color
mutate(x1 = price*0.5) %>% #add a column equal to half of the price called x1
ungroup() %>% #ungroup
summarise(m = mean(x1)) #print a summary of the mean values of x1
# A tibble: 1 × 1
m
<dbl>
1 1966.
Since this last one ungroups before printing the summary, the printout isn’t separated into individual colors.
Why is grouping data neccessary?
So that functions can be performed on entire groups of the data
Why is ungrouping neccessary?
To ensure that the data isn’t still grouped when you next go to use it.
When should you ungroup data?
Once you’re finished using it in groups.
#View(diamonds) #make sure it has a capital V! opens the data.frame of diamonds in a new window
arrange(diamonds, price) #this gives ther data with price ascending from lowest to highest
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
%>%
diamonds arrange(price) #this does exactly the same as above
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
%>%
diamonds arrange(desc(price)) #this sorts data with price descending
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
%>%
diamonds arrange(price) %>%
arrange(cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.25 Fair E VS1 55.2 64 361 4.21 4.23 2.33
3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91 2.39
4 0.27 Fair E VS1 66.4 58 371 3.99 4.02 2.66
5 0.3 Fair J VS2 64.8 58 416 4.24 4.16 2.72
6 0.3 Fair F SI1 63.1 58 496 4.3 4.22 2.69
7 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
8 0.37 Fair F SI1 65.3 56 527 4.53 4.47 2.94
9 0.3 Fair D SI2 64.6 54 536 4.29 4.25 2.76
10 0.25 Fair D VS1 61.2 55 563 4.09 4.11 2.51
# ℹ 53,930 more rows
%>%
diamonds arrange(desc(price)) %>%
arrange(desc(cut))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
2 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
3 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
4 2.05 Ideal G SI1 61.9 57 18787 8.1 8.16 5.03
5 1.6 Ideal F VS1 62 56 18780 7.47 7.52 4.65
6 2.06 Ideal I VS2 62.2 55 18779 8.15 8.19 5.08
7 1.71 Ideal G VVS2 62.1 55 18768 7.66 7.63 4.75
8 2.08 Ideal H SI1 58.7 60 18760 8.36 8.4 4.92
9 2.03 Ideal G SI1 60 55.8 18757 8.17 8.3 4.95
10 2.61 Ideal I SI2 62.1 56 18756 8.85 8.73 5.46
# ℹ 53,930 more rows
%>%
diamonds arrange(price) %>%
arrange(desc(clarity))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Very Good H IF 63.9 55 369 3.89 3.9 2.49
2 0.24 Very Good H IF 61.3 56 449 4.04 4.06 2.48
3 0.26 Ideal H IF 61.1 57 468 4.12 4.16 2.53
4 0.23 Very Good F IF 61 62 485 3.95 3.99 2.42
5 0.3 Ideal J IF 61.5 56 489 4.32 4.33 2.66
6 0.3 Ideal J IF 61.5 57 489 4.29 4.36 2.66
7 0.23 Very Good E IF 59.9 58 492 3.98 4.03 2.4
8 0.24 Good F IF 65.1 58 492 3.86 3.88 2.52
9 0.24 Ideal H IF 62.5 54 504 3.97 4 2.49
10 0.24 Ideal H IF 62.1 57 504 4 4.04 2.5
# ℹ 53,930 more rows
%>%
diamonds mutate(saleprice = price - 250)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z saleprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
%>%
diamonds select(-x,-y,-z)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
%>%
diamonds group_by(cut) %>%
summarise(number = n()) %>%
ungroup()
# A tibble: 5 × 2
cut number
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
%>%
diamonds mutate(totalNum = sum(n()))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Sometimes, for simple things, it’s good to plot in base R language. However, ggplot2 is WAAY better for more complicated things!
We’ll be using the “mtcars” dataset. Here’s a graph from the inbuilt r code:
plot(mtcars$wt, mtcars$mpg)
Where *mtcars\(wt* returns the column names wt from the mtcars data frame (and similar for mtcars\)mpg)
And here’s a graph from the ggplot2 code:
## library(ggplot2) ## Loaded this at the start
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
ggplot(): tells it to create a plot object.
geom_point(): tells it to add a layer of points to the plot.
(mtcars, aes(x = wt, y = mpg)) calls the data frame mtcars before telling it which variables to use for the x and y values.
to use vectors instead of a data frame, you can swap “mtcars” with data = NULL, followed by the same aes(x = vector1, y = vector2):
ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) + geom_point()
Using plot():
#plot(pressure$temperature, pressure$pressure, type = "1")
That didn’t work for some reason.
plot(pressure$temperature, pressure$pressure, type = "l")
I misread the l as a 1 - hence where it went wrong.
To add a new line to the graph using plot():
plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure/2)
lines(pressure$temperature, pressure$pressure/4, col = "red")
Now using ggplot2:
ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line()
To add data points:
ggplot(pressure, aes(x = temperature, y = pressure)) + geom_line() + geom_point()
Using built in barplot:
barplot(BOD$demand, names.arg = BOD$Time)
Using ggplot2:
ggplot(BOD, aes(x = Time, y = demand)) +
geom_col() +
ggtitle("A Random Bar Chart")
To ensure the x variable is treated as a discrete var.:
ggplot(BOD, aes(x = factor(Time), y = demand)) +
geom_col() +
ggtitle("A Random Bar Chart")
with hist():
hist(mtcars$mpg, breaks = 10)
using ggplot2:
## library(tidyverse) ## loaded this at start
ggplot(mtcars, aes(x = mpg)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Adding binwidth in geom_histogram() allows you to definte the width of the bins:
#library(tidyverse)
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 4)
The built-in plot() function will automatically create a boxplot is the x values are a factor (as opposed to a numeric vector):
plot(ToothGrowth$supp, ToothGrowth$len)
and using ggplot2:
ggplot(ToothGrowth, aes(x = supp, y = len)) +
geom_boxplot()
and we can make box plots for multiple variables using interaction()
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
geom_boxplot()
ggplot(penguins, aes(x = species, y = flipper_length_mm, color = sex)) + geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Use curve() with an expression with the variable x:
curve(x^3 - 5*x, from = -4, to = 4)
Don’t forget the comma after the = -4!
It needs to take a numeric vector and return a numeric vector as output.
Using add = TRUE will add it to the previously created plot.
<- function(xvar) {
myfun 1 / (1 + exp(-xvar + 10))
}
curve(myfun(x), from = 0, to = 20)
curve(1 - myfun(x), add = TRUE, col = "red")
library(modeldata)
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
ggplot(crickets,
aes(x = temp, y = rate,
color = species)) +
geom_point(color = "red", #specifies the color of the datapoints
size = 2, #specifies the size
alpha = .3, # makes transparent
shape = "square") +
geom_smooth(method = "lm", #adds trendline; lm makes it a linear model
se = FALSE) + #turns off standard error shading
labs(x = "Temperature", # labs = labels
y = "Chirp rate",
color = "Species", # changes title of legend
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
# histogram - one quantitative variable
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) #can also use binwidth to change the bins
#frequency polygon
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)
#bar chart
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black", #make boundary black
fill = "lightblue") #make fill light blue
ggplot(crickets, aes(x = species, fill = species)) + # fill = species makes species represented by a different fill
geom_bar(show.legend = FALSE)
#box plot
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot() +
theme_minimal()
# faceting
#Not great example:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15, show.legend = FALSE) +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(~species) + # facet_wrap puts the data on two different graphs, side by side, with the same axes scales
theme_minimal()
View("Tree_Phenology_Data")
That didn’t work. How do we get the data loaded into R?
Maybe because it wasn’t saved as .csv
View("Tree_Phenology_Data_P1")
Nope again!
<- read_csv("Tree_Phenology_Data_P1.csv") TreesP1
Rows: 951 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Date observed, Species
dbl (9): Obs. no., Latitude, Longitude, Altitude, Tree diameter, Urbanisatio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("Tree_Phenology_Data_P2.csv") TreesP2
Rows: 828 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Date observed, Species
dbl (9): Obs. no., Latitude, Longitude, Altitude, Tree diameter, Urbanisatio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This worked!
read.csv and read_csv are DIFFERENT functions!
read.csv is base R language and returns a data frame with factors as default
read_csv is tidyverse language and returns a tibble with characters as default
Use read_csv!!!
summary(TreesP1)
Obs. no. Date observed Latitude Longitude
Min. : 1.0 Length:951 Min. :-35.73 Min. :-115.9786
1st Qu.:239.5 Class :character 1st Qu.: 51.37 1st Qu.: -3.2639
Median :477.0 Mode :character Median : 52.08 Median : -1.9382
Mean :477.3 Mean : 52.07 Mean : -1.9622
3rd Qu.:715.5 3rd Qu.: 53.50 3rd Qu.: -0.3468
Max. :953.0 Max. : 57.70 Max. : 174.3406
Altitude Species Tree diameter Urbanisation index
Min. : 0.00 Length:951 Min. : 0.34 Min. :1.000
1st Qu.: 16.50 Class :character 1st Qu.: 47.72 1st Qu.:2.000
Median : 48.00 Mode :character Median : 75.00 Median :3.000
Mean : 86.92 Mean :104.84 Mean :2.905
3rd Qu.: 105.00 3rd Qu.:119.85 3rd Qu.:4.000
Max. :1076.00 Max. :632.00 Max. :4.000
Stand density index Canopy index Phenological index
Min. :1.000 Min. : 5.00 Min. :1.000
1st Qu.:2.000 1st Qu.:75.00 1st Qu.:2.000
Median :3.000 Median :85.00 Median :2.000
Mean :2.836 Mean :78.56 Mean :2.168
3rd Qu.:4.000 3rd Qu.:85.00 3rd Qu.:2.000
Max. :4.000 Max. :95.00 Max. :4.000
Here, we can see that the “Date observed” column/variable is read as a character variable, which is usually used for text to be treaded as strings.
Is there a difference between summary() and summarise()?
summarise(TreesP1)
# A tibble: 1 × 0
#this doesn't work, you need to specify what exactly you're creating/summarising.
How can we get “Date observed” to be read as a date?
Our dates are in the form: “dd/mm/yyyy”
We can use the function as.Date(DATA$DATE_COLUMN, format = “%d/%m/%Y”):
$"Date observed" <- as.Date(TreesP1$"Date observed", format = "%d/%m/%Y")
TreesP1
$"Date observed" <- as.Date(TreesP2$"Date observed", format = "%d/%m/%Y")
TreesP2
summary(TreesP1$"Date observed")
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1994-10-18" "2022-10-16" "2022-10-18" "2022-08-03" "2022-10-21" "2022-11-30"
summary(TreesP2$"Date observed")
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2022-01-30" "2022-11-27" "2022-11-30" "2022-11-27" "2022-12-03" "2022-12-31"
Here we can see that all the date entries have been correctly loaded.
But we can also see some outliers! for example, 18/10/1994 and 30/01/2022.
Now we’re ready to start examining the data properly.
Let’s have a look at the structure of these datasets:
str(TreesP1)
spc_tbl_ [951 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Obs. no. : num [1:951] 1 2 3 4 5 6 7 8 9 10 ...
$ Date observed : Date[1:951], format: "2022-10-18" "2022-10-18" ...
$ Latitude : num [1:951] 53 53 53 53 51.7 ...
$ Longitude : num [1:951] -1.1 -1.1 -1.1 -1.1 -1.57 ...
$ Altitude : num [1:951] 69 69 69 69 106 106 106 119 119 119 ...
$ Species : chr [1:951] "Quercus robur" "Quercus robur" "Quercus robur" "Quercus robur" ...
$ Tree diameter : num [1:951] 137 182.5 200 259.5 99.7 ...
$ Urbanisation index : num [1:951] 2 2 2 2 4 4 4 4 4 4 ...
$ Stand density index: num [1:951] 4 4 4 4 2 2 2 4 4 4 ...
$ Canopy index : num [1:951] 85 65 85 85 65 65 75 95 95 95 ...
$ Phenological index : num [1:951] 2 2 2 2 3 3 2 1 1 1 ...
- attr(*, "spec")=
.. cols(
.. `Obs. no.` = col_double(),
.. `Date observed` = col_character(),
.. Latitude = col_double(),
.. Longitude = col_double(),
.. Altitude = col_double(),
.. Species = col_character(),
.. `Tree diameter` = col_double(),
.. `Urbanisation index` = col_double(),
.. `Stand density index` = col_double(),
.. `Canopy index` = col_double(),
.. `Phenological index` = col_double()
.. )
- attr(*, "problems")=<externalptr>
str(TreesP2)
spc_tbl_ [828 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Obs. no. : num [1:828] 1 2 3 4 5 6 7 8 9 10 ...
$ Date observed : Date[1:828], format: "2022-11-30" "2022-11-30" ...
$ Latitude : num [1:828] 53 53 53 53 51.7 ...
$ Longitude : num [1:828] -1.1 -1.1 -1.1 -1.1 -1.57 ...
$ Altitude : num [1:828] 0 0 0 0 106 106 106 24 24 24 ...
$ Species : chr [1:828] "Quercus robur" "Quercus robur" "Quercus robur" "Quercus robur" ...
$ Tree diameter : num [1:828] 137 182.5 200 259.5 99.7 ...
$ Urbanisation index : num [1:828] 2 2 2 2 4 4 4 4 4 4 ...
$ Stand density index: num [1:828] 4 4 4 4 2 2 2 2 2 2 ...
$ Canopy index : num [1:828] 65 55 45 35 15 5 25 55 25 45 ...
$ Phenological index : num [1:828] 3 4 4 4 4 4 4 4 4 4 ...
- attr(*, "spec")=
.. cols(
.. `Obs. no.` = col_double(),
.. `Date observed` = col_character(),
.. Latitude = col_double(),
.. Longitude = col_double(),
.. Altitude = col_double(),
.. Species = col_character(),
.. `Tree diameter` = col_double(),
.. `Urbanisation index` = col_double(),
.. `Stand density index` = col_double(),
.. `Canopy index` = col_double(),
.. `Phenological index` = col_double()
.. )
- attr(*, "problems")=<externalptr>
Let’s now have a closer look at the “Species” variable to see what we’ve got in there:
summary(TreesP1$Species)
Length Class Mode
951 character character
summary(TreesP2$Species)
Length Class Mode
828 character character
OK that really wasn’t useful now was it R?!
How can we get a list of the different entries into species?
Before, in the Wendy Book, an example was given to do similar. Let’s try to adapt it!
%>%
TreesP1 group_by(Species) %>%
count(Species) %>%
ungroup()
# A tibble: 4 × 2
Species n
<chr> <int>
1 Other deciduous oak 27
2 Other deciduous tree 34
3 Quercus petraea 179
4 Quercus robur 711
%>%
TreesP2 group_by(Species) %>%
count(Species) %>%
ungroup()
# A tibble: 4 × 2
Species n
<chr> <int>
1 Other deciduous oak 25
2 Other deciduous tree 29
3 Quercus petraea 150
4 Quercus robur 624
Nice! This tells us exactly what we wanted to know: The range of entries in Species and how many of each occurred. We can also see that there weren’t any slightly different or wrong inputs here.
If there were wrongly inputted things, we could use:
I wonder if the Group_by bit was really neccessary:
#TreesP1 %>%
# count(Species)
#TreesP2 %>%
# count(species)
Ok that didn’t work, let’s stick with what we did before.
What else could we look at?
Lets see if there is any meaningful variation in variables between each species.
For this, we will need to use group_by(Species) and then summarise()
%>%
TreesP1 group_by(Species) %>%
summarise("MeanAlt" = mean(Altitude)) %>%
ungroup()
# A tibble: 4 × 2
Species MeanAlt
<chr> <dbl>
1 Other deciduous oak 272.
2 Other deciduous tree 55.1
3 Quercus petraea 90.3
4 Quercus robur 80.6
This is quite interesting, I wonder if a box plot would show any relation here!
ggplot(TreesP1, aes(x = Species, y = Altitude, fill = Species)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal() +
labs(y = "Altitude/m")
Lets look at the same for latitude and longitude:
ggplot(TreesP1, aes(x = Species, y = Latitude, fill = Species)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal()
ggplot(TreesP1, aes(x = Species, y = Longitude, fill = Species)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal()
ggplot(TreesP1, aes(x = Species, y = Longitude, color = Species)) +
geom_point(show.legend = FALSE) +
theme_minimal()
ggplot(TreesP1, aes(x = Longitude, y = Latitude, color = Species)) +
geom_point() +
theme_minimal()
ggplot(TreesP1, aes(x = Longitude, y = Latitude, color = Altitude)) +
geom_point(size = 2) +
theme_minimal()
<-
TreesP1LOG %>%
TreesP1 mutate(logAlt = log(Altitude)) #defining a new column equal to log(Altutide)
ggplot(TreesP1LOG, aes(x = Longitude, y = Latitude, color = logAlt)) +
geom_point(size = 2) +
theme_minimal() +
xlim(-125, 25) + #set x and y limits to concentrate on europe and N
ylim(40, 60) + #america
labs(color = "ln(Altitude)")
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_point()`).
Task: Reproduce the graphs and identify which tests can be associated with them. Explain why!
data(iris)
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
geom_boxplot() +
labs(y = "Sepal Length/mm")
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
geom_density(alpha = 0.3) +
labs(x = "Petal Length/mm", y = "Density")
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
geom_point(mapping = aes(color = Species, shape = Species)) + # mapping = aes() being in geom_point instead of ggplot ensures that the calculation of the line of best fit covers all data points instead of having one for each species.
geom_smooth(method = lm, color = "black") +
labs(x = "Petal Length/mm", y = "Petal Width/mm")
`geom_smooth()` using formula = 'y ~ x'
#Create a variable for big/small sepal length
<-
iris2 %>%
iris mutate(Size=ifelse(Sepal.Length < median(Sepal.Length),
"small", "big"))
ggplot(iris2, aes(x = Species, fill = Size)) +
geom_bar(position = "dodge") +
labs(y = "Count")
# Position = "dodge" ensures the columns for each size are stacked next to each other instead of on top of each other.
Contingency tables are used to quantify values corresponding to categorical variables.
If you have a dataset with two categorical variables, and want to see how many occurrences of each combination of their categories there is, you use a contingency table.
Use the command: table(dataset\(var1, dataset\)var2)
This returns a table
library(ISLR2)
<- table(Carseats$Urban, Carseats$ShelveLoc)
example
example
Bad Good Medium
No 22 28 68
Yes 74 57 151
Lets try it with the tree data:
table(TreesP1$"Urbanisation index", TreesP1$"Stand density index")
1 2 3 4
1 10 72 26 5
2 12 93 62 78
3 22 92 59 39
4 19 96 65 201
Urbanisation index:
1 = Urban
4 = Rural
Stand density index:
1 = Alone
4 = Woodland
This tells us that most trees were in rural woodlands, 5 were in urban woodlands, very few were standing alone (first column) and disproportionately many were with a few close trees (second column).
Now let’s see if there was any relation between urban index and phenology index!
<- table(TreesP1$"Urbanisation index", TreesP1$"Phenological index") pheno
Cool!
So these are essentially matrices
addmargins() adds a column/row for sum
proportions() gives the relative sizes of each cell
addmargins(pheno)
1 2 3 4 Sum
1 20 56 30 7 113
2 19 171 45 10 245
3 23 127 49 13 212
4 49 254 69 9 381
Sum 111 608 193 39 951
proportions(pheno)
1 2 3 4
1 0.021030494 0.058885384 0.031545741 0.007360673
2 0.019978970 0.179810726 0.047318612 0.010515247
3 0.024185068 0.133543638 0.051524711 0.013669821
4 0.051524711 0.267087277 0.072555205 0.009463722
The number of decimal places here is waaay too high!
Compares an observed proportion to a theoretical one.
Used for only 2 categories.
use binom.test() (n<30) or prop.test() (n>30)
x = # successes
n = tot # of trials
p = expected probability
correct = TRUE/FALSE (false removes yates continuity correction)
alternative = “less”/“greater” - used to define one-tailed tests
eg for cases of cancer in mice, where 160 developed cancer, of which 95 were male. Are males more likely to get cancer in mice?
<- prop.test(x = 95, n = 160, p = 0.5, correct = FALSE, alternative = "greater")
ztest
ztest
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.008853
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
0.5288397 1.0000000
sample estimates:
p
0.59375
$p.value ## Returns the p-value ztest
[1] 0.008853033
So this is significant and yes males are more likely to get cancer.
Compares two observed proportions
Exactly the same as for 1-proportion z-test, but x and n are vectors with two values, one for category A and one for category B.
eg. does smoking affect rates of cases of lung cancer?
# x = number of smokers with lung cancer, without lung cancer
# n = number of cases with lung cancer,cases without
<- prop.test(x = c(490, 400), n=c(500,500))
atest
atest
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
$p.value atest
[1] 2.363439e-19
Compares an observed distribution with an expected distribution.
<- chisq.test(x=c(81,50,27), p=c(1/3,1/3,1/3))
chi
chi
Chi-squared test for given probabilities
data: c(81, 50, 27)
X-squared = 27.886, df = 2, p-value = 8.803e-07
calculated by inputting a contingency table into chisq.test()
To make a contingency table, input two vectors into table()
Using the mpg dataset:
str(mpg)
tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
$ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
$ model : chr [1:234] "a4" "a4" "a4" "a4" ...
$ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
$ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
$ drv : chr [1:234] "f" "f" "f" "f" ...
$ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
$ fl : chr [1:234] "p" "p" "p" "p" ...
$ class : chr [1:234] "compact" "compact" "compact" "compact" ...
%>%
mpg group_by(class, cyl) %>% ## show only the columns class and cylinder
summarize(n=n()) %>% ## create a new column "n" stating the number of occurrences
kable() ## Kable is part of Knitr and just makes neat looking tables
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | cyl | n |
---|---|---|
2seater | 8 | 5 |
compact | 4 | 32 |
compact | 5 | 2 |
compact | 6 | 13 |
midsize | 4 | 16 |
midsize | 6 | 23 |
midsize | 8 | 2 |
minivan | 4 | 1 |
minivan | 6 | 10 |
pickup | 4 | 3 |
pickup | 6 | 10 |
pickup | 8 | 20 |
subcompact | 4 | 21 |
subcompact | 5 | 2 |
subcompact | 6 | 7 |
subcompact | 8 | 5 |
suv | 4 | 8 |
suv | 6 | 16 |
suv | 8 | 38 |
%>%
mpg group_by(class, cyl) %>%
summarise(n=n()) %>%
spread(cyl, n) %>% ## This creates columns for each cyl value with n as the crosstab response
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 5 |
compact | 32 | 2 | 13 | NA |
midsize | 16 | NA | 23 | 2 |
minivan | 1 | NA | 10 | NA |
pickup | 3 | NA | 10 | 20 |
subcompact | 21 | 2 | 7 | 5 |
suv | 8 | NA | 16 | 38 |
## How different is this to the table() variant?
<- table(mpg$class, mpg$cyl)
classmpg
##This is exactly the same except that N/A values are replaced with 0, which is more useful. classmpg
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
kable(classmpg) ## This makes it neat like the other one.
4 | 5 | 6 | 8 | |
---|---|---|---|---|
2seater | 0 | 0 | 0 | 5 |
compact | 32 | 2 | 13 | 0 |
midsize | 16 | 0 | 23 | 2 |
minivan | 1 | 0 | 10 | 0 |
pickup | 3 | 0 | 10 | 20 |
subcompact | 21 | 2 | 7 | 5 |
suv | 8 | 0 | 16 | 38 |
Summary statistics other than frequency
%>%
mpg group_by(class, cyl) %>%
summarise(mean_cty_miles = mean(cty)) %>% ## Here compared to the last one we've changed the frequency to the mean city miles
spread(cyl, mean_cty_miles) %>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 15.40000 |
compact | 21.37500 | 21 | 16.92308 | NA |
midsize | 20.50000 | NA | 17.78261 | 16.00000 |
minivan | 18.00000 | NA | 15.60000 | NA |
pickup | 16.00000 | NA | 14.50000 | 11.80000 |
subcompact | 22.85714 | 20 | 17.00000 | 14.80000 |
suv | 18.00000 | NA | 14.50000 | 12.13158 |
Proportions in a crosstable
%>%
mpg group_by(class, cyl) %>%
summarise(n=n()) %>%
mutate(prop = n/sum(n)) %>% ## Create a column for the proportions
subset(select = c("class", "cyl", "prop")) %>% ## Select all but "n" column
spread(class, prop) %>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
cyl | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
---|---|---|---|---|---|---|---|
4 | NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000 | 0.1290323 |
5 | NA | 0.0425532 | NA | NA | NA | 0.0571429 | NA |
6 | NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000 | 0.2580645 |
8 | 1 | NA | 0.0487805 | NA | 0.6060606 | 0.1428571 | 0.6129032 |
## I wonder if just using select would work?
%>%
mpg group_by(class, cyl) %>%
summarise(n=n()) %>%
mutate(prop = n/sum(n)) %>%
select(-n) %>% ## Selects all columns except the "n" column
spread(class, prop) %>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
cyl | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
---|---|---|---|---|---|---|---|
4 | NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000 | 0.1290323 |
5 | NA | 0.0425532 | NA | NA | NA | 0.0571429 | NA |
6 | NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000 | 0.2580645 |
8 | 1 | NA | 0.0487805 | NA | 0.6060606 | 0.1428571 | 0.6129032 |
## Yep, that works exactly the same
CrossTable() also gives frequencies and proportions:
## Make sure gmodels is installed and loaded for CrossTable to work!!
CrossTable(mpg$class, mpg$cyl)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 234
| mpg$cyl
mpg$class | 4 | 5 | 6 | 8 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
2seater | 0 | 0 | 0 | 5 | 5 |
| 1.731 | 0.085 | 1.688 | 8.210 | |
| 0.000 | 0.000 | 0.000 | 1.000 | 0.021 |
| 0.000 | 0.000 | 0.000 | 0.071 | |
| 0.000 | 0.000 | 0.000 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
compact | 32 | 2 | 13 | 0 | 47 |
| 15.210 | 1.782 | 0.518 | 14.060 | |
| 0.681 | 0.043 | 0.277 | 0.000 | 0.201 |
| 0.395 | 0.500 | 0.165 | 0.000 | |
| 0.137 | 0.009 | 0.056 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
midsize | 16 | 0 | 23 | 2 | 41 |
| 0.230 | 0.701 | 6.059 | 8.591 | |
| 0.390 | 0.000 | 0.561 | 0.049 | 0.175 |
| 0.198 | 0.000 | 0.291 | 0.029 | |
| 0.068 | 0.000 | 0.098 | 0.009 | |
-------------|-----------|-----------|-----------|-----------|-----------|
minivan | 1 | 0 | 10 | 0 | 11 |
| 2.070 | 0.188 | 10.641 | 3.291 | |
| 0.091 | 0.000 | 0.909 | 0.000 | 0.047 |
| 0.012 | 0.000 | 0.127 | 0.000 | |
| 0.004 | 0.000 | 0.043 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
pickup | 3 | 0 | 10 | 20 | 33 |
| 6.211 | 0.564 | 0.117 | 10.391 | |
| 0.091 | 0.000 | 0.303 | 0.606 | 0.141 |
| 0.037 | 0.000 | 0.127 | 0.286 | |
| 0.013 | 0.000 | 0.043 | 0.085 | |
-------------|-----------|-----------|-----------|-----------|-----------|
subcompact | 21 | 2 | 7 | 5 | 35 |
| 6.515 | 3.284 | 1.963 | 2.858 | |
| 0.600 | 0.057 | 0.200 | 0.143 | 0.150 |
| 0.259 | 0.500 | 0.089 | 0.071 | |
| 0.090 | 0.009 | 0.030 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
suv | 8 | 0 | 16 | 38 | 62 |
| 8.444 | 1.060 | 1.162 | 20.403 | |
| 0.129 | 0.000 | 0.258 | 0.613 | 0.265 |
| 0.099 | 0.000 | 0.203 | 0.543 | |
| 0.034 | 0.000 | 0.068 | 0.162 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 81 | 4 | 79 | 70 | 234 |
| 0.346 | 0.017 | 0.338 | 0.299 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Performing Chi-squared test of independence:
<- mpg %>%
classcyl group_by(class, cyl) %>%
summarise(n=n()) %>%
spread(class, n, fill = 0) %>% ## Fill missing values with 0
ungroup() %>%
select(-cyl) %>%
kable() %>%
as.matrix() ## Since chisq.test() requires a matrix
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
<- mpg %>%
classcyl group_by(class, cyl) %>%
summarise(n=n()) %>%
spread(cyl, n) %>% ## This creates columns for each cyl value with n as the crosstab response
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
#classcyl2 <- as.matrix(classcyl)
#chisq.test(classcyl2)
#chiresult <- chisq.test(classcyl)
#chiresult
chisq.test(mpg$class, mpg$cyl)
Warning in chisq.test(mpg$class, mpg$cyl): Chi-squared approximation may be
incorrect
Pearson's Chi-squared test
data: mpg$class and mpg$cyl
X-squared = 138.03, df = 18, p-value < 2.2e-16