1 + 1
[1] 2
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1
[1] 2
You can add options to executable code like this
[1] 4
The echo: false
option disables the printing of code (only output is displayed).
Make sure the working directory is set up correctly.
Files have to be loaded from the same place the file is saved.
To set working directory use: setwd(“PATH/directoryname”)
To load a package within RStudio use the following sequence within a code chunk:
#| label: load packages
#| include: true
#| warning: false
library(palmerpenguins)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.1 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
(Note: “#|” denotes the header for a chunk and establishes line by line the output functions)
Simply starting a new code chunk and typing in the word “penguins” will open up a table of the Palmer penguins data, like so:
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
%>% {This is the ‘pipe’ symbol and orders R to perform an action/function based on a specific input e.g. a dataset}. So penguins %>% glimpse commands R to ‘glimpse’/view the penguins dataset giving as much information as possible, see below. (The pipe %>% is contained within tidyverse, specifically in a package called magrittr)
%>% glimpse penguins
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Inserting an image in to a Quarto workbook via RStudio is very easy. Simply switch from ‘Source’ to ‘Visual’ mode, click on”insert”, “Figure/Image”, copy and paste the URL address of the image then click “OK”. Here’s a nice photograph of Bob Marley smiling:`
Embedding a video in to Quarto is also very straight forward. Simply right click on the video you wish to embed, click “copy embed code”, paste it in the source pane/script editor and click Render (no need to add it in to a code chunk):
There are many ways to display data analyses outputs in RStudio and even more ways to manipulate and scrutinize the data. Below are just two examples of visual representations for data analyses (histogram and scatterplot).
The tidyverse package is actually a “package of other packages” (dplyr, ggplot2, etc.) and these are shown when Tidyverse is loaded.Packages such as tidyverse must be loaded at the start of every new RStudio session due to naming conflicts between other packages/package functions e.g., filter(): which has one function within the dplyr package and one within the stats package.You can also choose to load individual functions from a package instead of whole libraries, e.g.to load only the filter function from dplyr you can use “dplyr::filter()”
Here we are going to be working with the “diamonds” dataset. This dataset is built into ggplot2 and therefore included in tidyverse.
view(diamonds)
The above command opens the diamonds dataset in a separate tab.
To view the structure/type of each variable in the diamonds dataset run “str(diamonds)” as below:
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
As we can see there are 53,940 entries (individual diamonds) and 10 variables to describe each entry.
With built-in datasets such as “diamonds” running a simple query via (?diamonds) will provide further information on the dataset:
Prices of over 50,000 round cut diamonds.
Description: A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
Usage diamonds Format A data frame with 53940 rows and 10 variables:
-price price in US dollars ($326–$18,823)
-carat weight of the diamond (0.2–5.01)
-cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)
-color diamond colour, from D (best) to J (worst)
-clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
-x length in mm (0–10.74)
-y width in mm (0–58.9)
-z depth in mm (0–31.8)
-depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
-table width of top of diamond relative to widest point (43–95)
mutate() adds new columns or modifies current variables in a dataset
E.g. to create three new variables within the diamonds dataset:
I would use:
%>% mutate(JustOne = 1, Values = "something", Simple = TRUE) diamonds
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z JustOne Values
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 somet…
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 1 somet…
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 somet…
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 1 somet…
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 somet…
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 1 somet…
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 1 somet…
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 1 somet…
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 somet…
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
mutate() can also be used to create new variables based on existing variables from the dataset.
E.g. to create a new column detailing a 20% reduction in price for each diamond I would use:
<-
diamonds.new %>%
diamonds mutate(price20percoff = price * 0.80)
These added columns/variables have now been saved to the original dataset using (<-) above.
This line of code commands R to save new variables as an ‘object’.
We can also use other functions inside mutate() to create our new variable(s). For example, we might use the mean(), standard deviation (sd()) and/or median() function(s) to calculate the average price value for all diamonds in the dataset. This is called nesting i.e., where one function e.g. mean() “nests” inside another function mutate().
E.g. diamonds %>% mutate(m = mean(price), sd = sd(price), med = median(price))
%>%
diamonds mutate(m = mean(price),
sd = sd(price),
med = median(price))
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z m sd
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3933. 3989.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3933. 3989.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 3933. 3989.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 3933. 3989.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 3933. 3989.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 3933. 3989.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 3933. 3989.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 3933. 3989.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 3933. 3989.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
Note: The values in these new columns will be the same for every row because R takes all of the values in price to calculate the mean/standard deviation/median.
%>%
diamonds mutate(totweight = sum(carat))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totweight
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 43041.
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 43041.
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 43041.
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 43041.
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 43041.
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 43041.
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 43041.
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 43041.
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 43041.
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 43041.
# ℹ 53,930 more rows
The above formula has calculated the total weight of all diamonds i.e. sum of all carats
Recode modifies the values within a variable like so:
data %>% mutate(Variable = recode(Variable, “old value” = “new value”))
<-
cut.new %>%
diamonds mutate(cut.new = recode(cut, "Ideal" = "IDEAL"))
This has simply changed the column “Ideal” to all caps i.e. “IDEAL”. Of course R can perform far more complex recoding functions, but this is the basic idea.
Remember to save the changes within the code chunk:
cut.new <- diamonds %>% mutate(cut.new = recode(cut, “Ideal” = “IDEAL”))
Summarize() collapses all rows and returns a one-row summary. R will recognize both the British and American spelling (summarise/summarize).
%>%
diamonds summarize(avg.price = mean(price))
# A tibble: 1 × 1
avg.price
<dbl>
1 3933.
Here above we see the average (mean) price of all the diamonds within the dataset.
As with the mutate() function, multiple operations can be performed simultaneously within sumarize() as can nesting, like so:
%>%
diamonds summarize(avg.price = mean(price),
random.add = 1 + 2, # math operation without an existing variable
avg.carat = mean(carat),
stdev.price = sd(price)) # calculating the standard deviation
# A tibble: 1 × 4
avg.price random.add avg.carat stdev.price
<dbl> <dbl> <dbl> <dbl>
1 3933. 3 0.798 3989.
This function takes existing data and groups specific variables together for future operations. Many operations are performed on groups. After running some data see here based on males and females, their ages, and their scores in a separate R Script (i.e. not my Quarto workbook qmd file), the output came in the Environment Pane.
Thereafter, applying the following formula in the same script gave me a small table detailing mean(score), sd(score), and n(number of subjects) for girls and boys separately:
So by grouping we get to see the summaries of individual variables within the dataset!
Now lets group by Sex and Age (note below that entering “Sex” first and then “Age” in group_by() will display these variables in the table in that order from left to right, and vice versa):
This has given us 27 rows in total. R has considered every unique combination of Sex and Age in the dataset.
The missing std dev (s) values are because there is only 1 observtion for that group, while std dev requires at least 2.
If ungroup() is left out at the end of a code chunk wishing to group variables i.e. group_by(), this will leave the grouping in place and likely lead to errors in further calculations.
Always use ungroup() after group_by().
filter() only retains specific rows of data that meet the specified requirement(s), e.g.
To only display data from the diamonds dataset that have a cut value of Fair:
%>% filter(cut == "Fair") diamonds
# A tibble: 1,610 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 1,600 more rows
Only display data from diamonds that have a cut value of Fair or Good and a price at or under $600 (notice how the or statement is obtained with | while the and statement is achieved through a comma):
%>%
diamonds filter(cut == "Fair" | cut == "Good", price <= 600)
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
The following code would require the cut be Fair and Good (for which none exists):
%>%
diamonds filter(cut == "Fair", cut == "Good", price <= 600)
# A tibble: 0 × 10
# ℹ 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
# depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>
Only display data from diamonds that do not have a cut value of Fair:
%>% filter(cut != "Fair") diamonds
# A tibble: 52,330 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
# ℹ 52,320 more rows
Function: Selects only the columns (variables) that you want to see. Gets rid of all other columns. You can refer to the columns by the column position (first column) or by name. The order in which you list the column names/positions is the order that the columns will be displayed.
In the diamonds dataset, only retain the first 4 columns:
%>% select(1:4) diamonds
# A tibble: 53,940 × 4
carat cut color clarity
<dbl> <ord> <ord> <ord>
1 0.23 Ideal E SI2
2 0.21 Premium E SI1
3 0.23 Good E VS1
4 0.29 Premium I VS2
5 0.31 Good J SI2
6 0.24 Very Good J VVS2
7 0.24 Very Good I VVS1
8 0.26 Very Good H SI1
9 0.22 Fair E VS2
10 0.23 Very Good H VS1
# ℹ 53,930 more rows
Retain all of the columns except for cut and color:
%>% select(-cut, -color) diamonds
# A tibble: 53,940 × 8
carat clarity depth table price x y z
<dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
You can also retain all of the columns, but rearrange some of the columns to appear at the beginning—this moves the x,y,z variables to the first 3 columns:
%>% select(x,y,z, everything()) diamonds
# A tibble: 53,940 × 10
x y z carat cut color clarity depth table price
<dbl> <dbl> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 3.95 3.98 2.43 0.23 Ideal E SI2 61.5 55 326
2 3.89 3.84 2.31 0.21 Premium E SI1 59.8 61 326
3 4.05 4.07 2.31 0.23 Good E VS1 56.9 65 327
4 4.2 4.23 2.63 0.29 Premium I VS2 62.4 58 334
5 4.34 4.35 2.75 0.31 Good J SI2 63.3 58 335
6 3.94 3.96 2.48 0.24 Very Good J VVS2 62.8 57 336
7 3.95 3.98 2.47 0.24 Very Good I VVS1 62.3 57 336
8 4.07 4.11 2.53 0.26 Very Good H SI1 61.9 55 337
9 3.87 3.78 2.49 0.22 Fair E VS2 65.1 61 337
10 4 4.05 2.39 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
Function: Allows you to arrange values within a variable in ascending or descending order (if that is applicable to your values). This can apply to both numerical and non-numerical values, e.g.to arrange colour by alphabetical order (A to Z):
%>% arrange(color) diamonds
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Very Good D VS2 60.5 61 357 3.96 3.97 2.4
2 0.23 Very Good D VS1 61.9 58 402 3.92 3.96 2.44
3 0.26 Very Good D VS2 60.8 59 403 4.13 4.16 2.52
4 0.26 Good D VS2 65.2 56 403 3.99 4.02 2.61
5 0.26 Good D VS1 58.4 63 403 4.19 4.24 2.46
6 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
7 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
8 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
9 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
10 0.24 Very Good D VVS1 61.5 60 553 3.97 4 2.45
# ℹ 53,930 more rows
Or, to arrange price by numerical order (lowest to highest):
%>% arrange(price) diamonds
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Or, to arrange price in descending numerical order:
%>% arrange(desc(price)) diamonds
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
Problem A
library(tidyverse)
%>% # utilises the midwest dataset
midwest group_by(state) %>% # implies that the following commands are to be grouped by state for that state
summarize(poptotalmean = mean(poptotal), # display the mean of the poptotal column for that state
popmax = max(poptotal), #display the maximum value of poptotal for that state
popmin = min(poptotal), #display the minimum value of poptotal for that state
popdistinct = n_distinct(poptotal), # displays the number of unique values in poptotal for each state popfirst = first(poptotal), #display the first value in poptotal
popany = any(poptotal < 5000), #split the popany column in to true and false values i.e. if less than 5000 = true if more = false
popany2 = any(poptotal > 2000000)) %>% #same as above i.e. true if more than 2m false if less
ungroup() #ungroup the dataset
# A tibble: 5 × 7
state poptotalmean popmax popmin popdistinct popany popany2
<chr> <dbl> <int> <int> <int> <lgl> <lgl>
1 IL 112065. 5105067 4373 101 TRUE TRUE
2 IN 60263. 797159 5315 92 FALSE FALSE
3 MI 111992. 2111687 1701 83 TRUE TRUE
4 OH 123263. 1412140 11098 88 FALSE FALSE
5 WI 67941. 959275 3890 72 TRUE FALSE
Problem B
%>% # do the following to the midwest dataset
midwest group_by(state) %>% # group all other variables according to the state variable
summarize(num5k = sum(poptotal < 5000), # summarize the num5k column according to the no of entries with poptotal < 5000
num2mil = sum(poptotal > 2000000), # summarize the num2mil column according to the no of entries with poptotal > 2,000,000
numrows = n()) %>% # summarize the total no of rows
ungroup() # ungroup variables
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C
%>%
midwest group_by(county) %>%
summarize(x = n_distinct(state)) %>% # summarise object x, defined as the number of unique states within the midwest dataset
arrange(desc(x)) %>% # arrange x in descending order i.e. from most common state to least
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
%>%
midwest group_by(county) %>%
summarize(x = n()) %>% # summarize x by displaying the number of entries for each county
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
Problem D
%>%
diamonds group_by(clarity) %>%
summarize(a = n_distinct(color), # show the no of unique colours
b = n_distinct(price), # show the no of unique prices
c = n()) %>% # show the total no of variables
ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
Problem E
%>%
diamonds group_by(color, cut) %>% # display every unique combination of colour and cut [note the comma between variables]
summarize(m = mean(price), # summarize the mean/average price of each combination
s = sd(price)) %>% # show the sd of each combination
ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
Problem F
%>%
diamonds group_by(cut) %>%
summarize(potato = mean(depth),
pizza = mean(price),
popcorn = median(y),
pineapple = potato - pizza,
papaya = pineapple ^ 2,
peach = n()) %>%
ungroup()
# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
Problem G
%>%
diamonds group_by(color) %>% # group all unique variations of colour
summarize(m = mean(price)) %>% # summarize the mean price of each colour group
mutate(x1 = str_c("Diamond color ", color),
x2 = 5) %>% # create 2 new columns - x1 with values of "Diamond color"+ color & x2 with the value 5 in every row
ungroup()
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Problem H
%>%
diamonds group_by(color) %>%
mutate(x1 = price * 0.5) %>% # create a new column: x1, equal to the price of each color x 0.5
summarize(m = mean(x1)) %>% # summarize the mean of x1 for each colour {m}. This summary excludes the original x1 column and replaces it with column m
ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
%>%
diamonds group_by(color) %>%
mutate(x1 = price * 0.5) %>% # add a column equal to half of the price called x1
ungroup() %>%
summarize(m = mean(x1)) # summarize the mean values of x1
# A tibble: 1 × 1
m
<dbl>
1 1966.
Since the above ungroups before printing the summary, the output isn’t separated into individual colors and instead gives the mean of all colours combined.
Grouping data is necessary as it allows statistical analysis to be carried out on specific and targeted groups/variables within the dataset
Ungrouping is necessary as it disables the data from being grouped in the same fashion next time it is used
You should ungroup data as soon as you no longer need or want the data to be grouped
No, no need
View(diamonds) # opens the diamonds dataset in a new window
arrange(diamonds, price) # this displays the data with price ascending from lowest to highest
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>% arrange(price) #this does exactly the same as above
%>%
diamonds arrange(price) %>%
arrange(cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.25 Fair E VS1 55.2 64 361 4.21 4.23 2.33
3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91 2.39
4 0.27 Fair E VS1 66.4 58 371 3.99 4.02 2.66
5 0.3 Fair J VS2 64.8 58 416 4.24 4.16 2.72
6 0.3 Fair F SI1 63.1 58 496 4.3 4.22 2.69
7 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
8 0.37 Fair F SI1 65.3 56 527 4.53 4.47 2.94
9 0.3 Fair D SI2 64.6 54 536 4.29 4.25 2.76
10 0.25 Fair D VS1 61.2 55 563 4.09 4.11 2.51
# ℹ 53,930 more rows
The above formula arranges the data according to lowest price (ascending) within each cut (also ascending in terms of quality)
%>%
diamonds arrange(desc(price)) %>%
arrange(desc(cut))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
2 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
3 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
4 2.05 Ideal G SI1 61.9 57 18787 8.1 8.16 5.03
5 1.6 Ideal F VS1 62 56 18780 7.47 7.52 4.65
6 2.06 Ideal I VS2 62.2 55 18779 8.15 8.19 5.08
7 1.71 Ideal G VVS2 62.1 55 18768 7.66 7.63 4.75
8 2.08 Ideal H SI1 58.7 60 18760 8.36 8.4 4.92
9 2.03 Ideal G SI1 60 55.8 18757 8.17 8.3 4.95
10 2.61 Ideal I SI2 62.1 56 18756 8.85 8.73 5.46
# ℹ 53,930 more rows
The above table displays the opposite of the previous table i.e. highest price (and descending) of the highest quality cut (also descending)
%>%
diamonds arrange(price) %>%
arrange(clarity)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
2 0.32 Good D I1 64 54 361 4.33 4.36 2.78
3 0.31 Premium F I1 62.9 59 394 4.33 4.29 2.71
4 0.34 Ideal D I1 62.5 57 413 4.47 4.49 2.8
5 0.34 Ideal D I1 61.4 55 413 4.5 4.52 2.77
6 0.32 Premium E I1 60.9 58 444 4.42 4.38 2.68
7 0.43 Premium H I1 62 59 452 4.78 4.83 2.98
8 0.41 Good G I1 63.8 56 467 4.7 4.74 3.01
9 0.32 Good D I1 64 54 468 4.36 4.33 2.78
10 0.32 Ideal E I1 60.7 57 490 4.45 4.41 2.69
# ℹ 53,930 more rows
The above arranges the diamonds from lowest to highest price starting with the lowest grade clarity (and ascending)
%>%
diamonds mutate(saleprice = price - 250)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z saleprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
The above formula has created a new variable named salePprice that reflects a $250 discount off the original cost of each diamond
%>%
diamonds select(-x,-y,-z)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
The above formula removed the x, y and z variables from the dataset
%>%
diamonds group_by(cut) %>%
summarise(number = n()) %>%
ungroup()
# A tibble: 5 × 2
cut number
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
The above formula has determined the no of diamonds there are for each cut value
%>%
diamonds mutate(totalNum = sum(n()))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
The above formula has created a new column named totalNum that calculates the total number of diamonds.
First of all I have viewed/loaded the “mtcars” dataset by typing in data(“mtcars”), see below. “head(mtcars)” then allows me to view/prints the data in tabular form. Well, at least the first 6 rows.
data("mtcars")
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Using (?mtcars):
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon [, 2] cyl Number of cylinders [, 3] disp Displacement (cu.in.) [, 4] hp Gross horsepower [, 5] drat Rear axle ratio [, 6] wt Weight (1000 lbs) [, 7] qsec 1/4 mile time [, 8] vs Engine (0 = V-shaped, 1 = straight) [, 9] am Transmission (0 = automatic, 1 = manual) [,10] gear Number of forward gears [,11] carb Number of carburetors
See Below - Scatter Plots:
plot(mtcars$wt, mtcars$mpg)
So above we can see there is a clear negative correlation between miles per gallon and car weight i.e. the heavier the car is the less mpg it provides (and vice versa), generally speaking.
The above scatter plot was created using the plotting functions of base R i.e. plotting functions that come with R Studio.
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
Above is the output from ggplot2. So as you can see a bit neater/more informative and less clutter in the axis descriptions.
See Below - Line Graphs:
Below is an example of a line graph created using ggplot2 for a generic pressure vs temperature gradient.
ggplot(pressure, aes(x = temperature, y = pressure)) +
geom_line() +
geom_point()
See Below - Bar Graphs
Below shows how base R functions output a bar graph
# There are 11 cases of the value 4, 7 cases of 6, and 14 cases of 8
table(mtcars$cyl)
4 6 8
11 7 14
#>
#> 4 6 8
#> 11 7 14
# Generate a table of counts
barplot(table(mtcars$cyl))
Below shows how ggplot2 produces a bar graph:
library(ggplot2)
# Bar graph of values. This uses the BOD data frame, with the
# "Time" column for x values and the "demand" column for y values.
ggplot(BOD, aes(x = Time, y = demand)) +
geom_col()
Above treats the Y-Axis (Time) as a continuous variable, while below shows the difference in inputting time as a discrete value:
# Convert the x variable to a factor, so that it is treated as discrete
ggplot(BOD, aes(x = factor(Time), y = demand)) +
geom_col()
ggplot2 can also be used to plot the count of the number of data rows in each category by using geom_bar() instead of geom_col(). Once again, notice the difference between a continuous x-axis and a discrete one. For some kinds of data, it may make more sense to convert the continuous x variable to a discrete one, with the factor() function.
# Bar graph of counts This uses the mtcars data frame, with the "cyl" column for
# x position. The y position is calculated by counting the number of rows for
# each value of cyl.
ggplot(mtcars, aes(x = cyl)) +
geom_bar()
# Bar graph of counts
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar()
See Below - Histograms
hist(mtcars$mpg)
# Specify approximate number of bins with breaks
hist(mtcars$mpg, breaks = 10)
Histograms with base graphics (left); With more bins. Notice that because the bins are narrower, there are fewer items in each bin. (right).
library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`
# With wider bins
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 4)
When you create a histogram without specifying the bin width, ggplot() prints out a message telling you that it’s defaulting to 30 bins, and to pick a better bin width. This is because it’s important to explore your data using different bin widths; the default of 30 may or may not show you something useful about your data.
See Below - Box Plots:
To make a box plot, use plot() and pass it a factor of x values and a vector of y values. When x is a factor (as opposed to a numeric vector), it will automatically create a box plot:
library(ggplot2)
ggplot(ToothGrowth, aes(x = supp, y = len)) +
geom_boxplot()
It’s also possible to make box plots for multiple variables, by combining the variables with interaction():
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
geom_boxplot()
Simple:
%>%
penguinsgroup_by(species)%>%
ggplot(aes(x=bill_length_mm, colour = species, fill = species))+geom_density()
Warning: Removed 2 rows containing non-finite values (`stat_density()`).
More complicated:
%>%
penguins na.omit() %>%
pivot_longer(bill_length_mm:body_mass_g, names_to = "trait") %>%
ggplot(aes(x=value,
group=species,
fill=species,
color=species))+
geom_density(alpha=0.7)+
facet_grid(~trait, scales = "free_x" )+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))+
theme_minimal()
Bill Length…
data("penguins")
%>%
penguins group_by(species) %>%
ggplot(aes(x=species,
y=bill_length_mm,
color=species,
fill=species))+
geom_boxplot(alpha=0.5)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Flipper Length..
data("penguins")
%>%
penguins group_by(species) %>%
ggplot(aes(x=flipper_length_mm, color=species, fill=species))+
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
Individuals in each species count…
%>%
penguins ggplot(aes(x=species,
color=species,
fill=species))+
geom_bar(alpha=0.5)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
count by year…
%>%
penguins ggplot(aes(x=year,color=species, fill=species))+
geom_bar(position = "dodge")+
theme(axis.text = element_text(size=16),axis.title=element_text(size=16))
count per island…
%>%
penguins ggplot(aes(x=island,
color=species,
fill=species))+
geom_bar()+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Note above: for each species count the total count of each bar for each species i.e. Dream Island has ca. 55 Adelie penguins, not 125
%>%
penguins ggplot(aes(x=bill_length_mm, y = bill_depth_mm))+
geom_point()+
geom_smooth(method = "lm")+
theme(axis.text=element_text(size=16),axis.title=element_text(size=16))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
This is for all individuals within all 3 species combined
%>%
penguins ggplot(aes(x=bill_length_mm,
y = bill_depth_mm,color=species,
fill=species))+geom_point()+geom_smooth(method = "lm",se=FALSE)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
Separated by species - clear positive relationship between bill depth and bill length i.e. as one increases so does the other.
%>%
penguins na.omit() %>%
ggplot(aes(x=sex, y = body_mass_g,color=species, fill=species))+
geom_boxplot(alpha=0.7)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Body mass grouped by sex (above).
%>%
penguins na.omit() %>%
ggplot(aes(x=species,
y = body_mass_g,
color=sex,
fill=sex))+
geom_boxplot(alpha=0.7)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Body mass grouped by species.
QUESTION - Can Body mass predict bill length?
How to: Need to find relationship between body mass and bill length
Scatter-plot with linear regression will help to do this:
library(ggplot2)
%>%
penguinsna.omit() %>%
ggplot(aes(x=body_mass_g, y = bill_length_mm,color=species,))+
geom_point()+
geom_smooth(method = "lm", se=FALSE)+ theme(axis.text=element_text(size=16),axis.title=element_text(size=16))+
labs(title = "Body mass vs bill length by Species")
`geom_smooth()` using formula = 'y ~ x'
Positive Correlation: Body mass and bill length are positively correlated for all three species. Bill length tends to rise in tandem with body bulk.
Variations by Species: When compared to the other two species, the Gentoo typically has the largest body mass, although has a relatively similar bill length to Chinstrap penguins.
Adelie (red): The body bulk and length of the bill are typically the lowest amongst all species in this species, although body mass can be compared somewhat to chinstrap penguins.
Conclusion
Body mass can be used to predict bill length, according to the scatterplot with regression lines; the strength of this association varies slightly between species. There may be some variance around the trend lines, though, so this relationship may not be entirely linear.
QUESTION 2:
Does sex explain flipper length?
To answer this we need to find significant difference between flipper length between male and female penguins.
library(ggplot2)
%>%
penguinsna.omit() %>%
ggplot(aes(x=sex, y =flipper_length_mm,fill=species))+
geom_boxplot()+
theme(axis.text=element_text(size=16),axis.title=element_text(size=16))+
labs(title = "Flipper length by sex and species")
Discussion
Gentoo penguins (blue) have the longest flipper lengths overall, regardless of sex.
Chinstrap penguins (green) have intermediate flipper lengths.
Adelie penguins (red) have the shortest flipper lengths.
For each species, males tend to have longer flippers compared to females, as shown by the higher median values for males.
According to the box plot, flipper length is highly influenced by both sex and species. Gentoo penguins have significantly longer flippers than other species, regardless of sex, although with males typically having longer flippers than females.
library(tidyverse)
library(modeldata)
Attaching package: 'modeldata'
The following object is masked _by_ '.GlobalEnv':
penguins
The following object is masked from 'package:palmerpenguins':
penguins
view(crickets)
(?crickets) run in code chunk:
Rates of Cricket Chirps
Description
These data are from from McDonald (2009), by way of Mangiafico (2015), on the relationship between the ambient temperature and the rate of cricket chirps per minute. Data were collected for two species of the genus Oecanthus: O. exclamationis and O. niveus. The data are contained in a data frame called crickets with a total of 31 data points.
Source Mangiafico, S. 2015. “An R Companion for the Handbook of Biological Statistics.” https://rcompanion.org/handbook/.
McDonald, J. 2009. Handbook of Biological Statistics. Sparky House Publishing.
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
# Modifiying basic properties of the plot
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point(color = "red",
size = 2,
alpha = .3,
shape = "triangle") +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
# Learn more about the options for the geom_abline()
# with ?geom_point
# Adding another layer
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'
Above is shown the general trend for both species combined.
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
Here shows trend lines by species (in colour).
# Other plots
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) # one quantitative variable
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black",
fill = "lightblue")
ggplot(crickets, aes(x = species,
fill = species)) +
geom_bar(show.legend = FALSE) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot(show.legend = FALSE) +
scale_color_brewer(palette = "Dark2") +
theme_minimal()
# faceting
# not great:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species,
ncol = 1) +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Box Plots - T-tests (Independent and Paired), ANOVAs (One and Two-way), Mann-Whitney U test, Kruskal-Wallis test, Mood’s Median test (all comparing medians), Levene’s test (variance test), and Outlier Detection tests e.g. Tukey’s Fences are all tests associated with box plots. All these tests can be used to determine the significance of some aspect of the data illustrated by a box plot. For example median tests all compare medians across groups which is explicitly shown by a box plot. Meanwhile, variance tests assess the significance of variances of data across different group/variables which is shown through the interquartile range (IQR) and whiskers in a box plot.
Density Plots - Normality tests e.g. Shapiro-Wilk test, Two-Sample tests e.g. Mann-Whitney U test, Goodness-of-Fit tests e.g. Chi-Square and Bandwidth Selection tests e.g. Cross-Validation tests are all tests associated with density plots. As with box plots, density graphs display visual, data distribution information and are often used in more advanced statistical analysis, such as model fitting, hypothesis testing, and comparison of multiple distributions, as they provide a detailed and flexible representation of the data.The statistical tests associated with density plots are used to analyse specific properties of the data distribution e.g. Normality tests assess whether the data follow a normal distribution while Two-Sample tests evaluate whether two distributions differ from each other.
Scatter Plots - Correlation tests e.g. Pearson’s Correlation Coefficient or Spearman’s Rank Correlation, Regression Analysis e.g. Simple Linear Regression or Multiple Linear Regression, Hypothesis Tests for Regression e.g. Regression Coefficient t-test or ANOVA for Regression, Outlier Detection tests e.g. Grubbs’ test and Non-Parametric tests for Relationships e.g. Kruskal-Wallis test are all statistical tests associated with scatter plots. Scatter plots are used to visualise the relationship between at least two continuous variables and the associated statistical tests can quantify and evaluate these relationships in the data in more depth. For example, correlation tests assess and then report the strength and direction of any linear relationship between the two or more variables, while regression analysis is used to model the relationship between variables. It is worth noting here that there are very many variations to correlation tests and linear models.
Bar graphs - Chi-Square Tests e.g. chi-Square Test of Independence, T-tests (independent and paired), ANOVAs (One and Two-Way), G-tests i.e. the Likelihood Ratio Test, Contingency Tables, Proportion tests e.g. Z-test or Fisher’s Exact Test, Log-linear Models e.g. Saturated Model and Regression Analysis are all statistical tests associated with bar graphs. Bar graphs are typically used to display and compare categorical data, often showing the frequency or proportion of different groups, and hence associated stats tests generally focus on comparing group means, proportions, or distributions across categories. For example, Chi-square tests compare categorical data to test the independence or goodness-of-fit between groups, whereas log-linear models are used to analyse relationships (in categorical data) presented in contingency tables. Specifically, log-linear models calculate the expected frequencies of multi-way tables (tables with two or more categorical variables) by using the natural logarithm of expected cell counts. G-tests on the other hand are used to determine if there are significant differences between observed and expected frequencies in categorical data.
data ("iris")
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
(?iris) = This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
%>%
iris glimpse()
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
data(iris)
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
geom_boxplot(fill = "white") +
labs(title = "Sepal Length by Species", x = "Species", y = "Sepal Length (cm)") +
theme(plot.title = element_text(hjust = 0.5))
library(ggplot2)
data(iris)
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
geom_density(alpha = 0.5) +
labs(title = "Petal Length by Species", x = "Petal Length (cm)", y = "Density") +
theme(plot.title = element_text(hjust = 0.5))
data(iris)
library(ggplot2)
# create a scatter graph of petal length vs petal width
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
geom_point(aes(shape = Species, color = Species), size = 2) +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Petal Length vs Petal width",
x = "Petal Length (cm)",
y = "Petal Width (cm)") +
theme(plot.title = element_text(hjust = 0.5))
`geom_smooth()` using formula = 'y ~ x'
data(iris)
library(ggplot2)
$Size <-
irisifelse(iris$Sepal.Length <
median(iris$Sepal.Length),
"small", "big")
ggplot(iris, aes(x = Species, fill = Size)) +
geom_bar(position = "dodge") +
labs(title = "Counts of Species by Size (of Sepal Length)",
x = "Species", y = "Count", fill = "Size") +
theme(plot.title = element_text(hjust = 0.5))
Let’s load dplyr, tidyr, and ggplot2(for the sample data):
library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr)
# get the total number of cars with each class & cyl combination using group_by() and summarize().
%>%
mpggroup_by(class, cyl)%>%
summarize(n=n())%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | cyl | n |
---|---|---|
2seater | 8 | 5 |
compact | 4 | 32 |
compact | 5 | 2 |
compact | 6 | 13 |
midsize | 4 | 16 |
midsize | 6 | 23 |
midsize | 8 | 2 |
minivan | 4 | 1 |
minivan | 6 | 10 |
pickup | 4 | 3 |
pickup | 6 | 10 |
pickup | 8 | 20 |
subcompact | 4 | 21 |
subcompact | 5 | 2 |
subcompact | 6 | 7 |
subcompact | 8 | 5 |
suv | 4 | 8 |
suv | 6 | 16 |
suv | 8 | 38 |
# Turn our summary data into a crosstab or contingency table. We need variable A (class) to be listed by row, and variable B (cyl) to be listed by column.We can achieve this by including the spread() command, to create columns for each cyl value, with n as the crosstab response value.
%>%
mpggroup_by(class, cyl)%>%
summarise(n=n())%>%
spread(cyl, n)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 5 |
compact | 32 | 2 | 13 | NA |
midsize | 16 | NA | 23 | 2 |
minivan | 1 | NA | 10 | NA |
pickup | 3 | NA | 10 | 20 |
subcompact | 21 | 2 | 7 | 5 |
suv | 8 | NA | 16 | 38 |
# Instead of displaying frequencies, we can get the average number of city miles by class & cyl
%>%
mpggroup_by(class, cyl)%>%
summarise(mean_cty=mean(cty))%>%
spread(cyl, mean_cty)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 15.40000 |
compact | 21.37500 | 21 | 16.92308 | NA |
midsize | 20.50000 | NA | 17.78261 | 16.00000 |
minivan | 18.00000 | NA | 15.60000 | NA |
pickup | 16.00000 | NA | 14.50000 | 11.80000 |
subcompact | 22.85714 | 20 | 17.00000 | 14.80000 |
suv | 18.00000 | NA | 14.50000 | 12.13158 |
# Or max number of city miles by class & cyl
%>%
mpggroup_by(class, cyl)%>%
summarise(max_cty=max(cty))%>%
spread(cyl, max_cty)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 16 |
compact | 33 | 21 | 18 | NA |
midsize | 23 | NA | 19 | 16 |
minivan | 18 | NA | 17 | NA |
pickup | 17 | NA | 16 | 14 |
subcompact | 35 | 20 | 18 | 15 |
suv | 20 | NA | 17 | 14 |
# find proportions by creating a new, calculated variable dividing row frequency by table frequency.
%>%
mpggroup_by(class)%>%
summarize(n=n())%>%
mutate(prop=n/sum(n))%>% # our new proportion variable
kable()
class | n | prop |
---|---|---|
2seater | 5 | 0.0213675 |
compact | 47 | 0.2008547 |
midsize | 41 | 0.1752137 |
minivan | 11 | 0.0470085 |
pickup | 33 | 0.1410256 |
subcompact | 35 | 0.1495726 |
suv | 62 | 0.2649573 |
# create a contingency table of proportion values by applying the same spread command as before. Vary the group_by() and spread() arguments to produce proportions of different variables.
%>%
mpggroup_by(class, cyl)%>%
summarize(n=n())%>%
mutate(prop=n/sum(n))%>%
subset(select=c("class","cyl","prop"))%>% #drop the frequency value
spread(class, prop)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
cyl | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
---|---|---|---|---|---|---|---|
4 | NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000 | 0.1290323 |
5 | NA | 0.0425532 | NA | NA | NA | 0.0571429 | NA |
6 | NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000 | 0.2580645 |
8 | 1 | NA | 0.0487805 | NA | 0.6060606 | 0.1428571 | 0.6129032 |
#table() is a quick way to pull together row/column frequencies and proportions for categorical variables
# Using the basic table() command, we can get a contingency table of vehicle class by number of cylinders.
table(mpg$class, mpg$cyl)
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
#The table frequency can also be called by using the ftable() command.
<- table(mpg$class, mpg$cyl) #define object w/table parameters for simple calling
mpg_tableftable(mpg_table)
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
# For row frequencies, we use the margin.table() command, with the 1 argument.
margin.table(mpg_table, 1)
2seater compact midsize minivan pickup subcompact suv
5 47 41 11 33 35 62
#For column frequencies, we use the margin.table() command, with the 2 argument
margin.table(mpg_table, 2)
4 5 6 8
81 4 79 70
#We can get the proportion values for our variable combinations as well.
# For proportion of the entire table, we use the prop.table() command.
prop.table(mpg_table)
4 5 6 8
2seater 0.000000000 0.000000000 0.000000000 0.021367521
compact 0.136752137 0.008547009 0.055555556 0.000000000
midsize 0.068376068 0.000000000 0.098290598 0.008547009
minivan 0.004273504 0.000000000 0.042735043 0.000000000
pickup 0.012820513 0.000000000 0.042735043 0.085470085
subcompact 0.089743590 0.008547009 0.029914530 0.021367521
suv 0.034188034 0.000000000 0.068376068 0.162393162
# For row proportions, we use the prop.table() command, with the 1 argument following the table name.
prop.table(mpg_table, 1)
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 1.00000000
compact 0.68085106 0.04255319 0.27659574 0.00000000
midsize 0.39024390 0.00000000 0.56097561 0.04878049
minivan 0.09090909 0.00000000 0.90909091 0.00000000
pickup 0.09090909 0.00000000 0.30303030 0.60606061
subcompact 0.60000000 0.05714286 0.20000000 0.14285714
suv 0.12903226 0.00000000 0.25806452 0.61290323
# For column proportions, we use the prop.table() command, with the 2 argument following the table name.
prop.table(mpg_table, 2)
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 0.07142857
compact 0.39506173 0.50000000 0.16455696 0.00000000
midsize 0.19753086 0.00000000 0.29113924 0.02857143
minivan 0.01234568 0.00000000 0.12658228 0.00000000
pickup 0.03703704 0.00000000 0.12658228 0.28571429
subcompact 0.25925926 0.50000000 0.08860759 0.07142857
suv 0.09876543 0.00000000 0.20253165 0.54285714
library(gmodels)
CrossTable(mpg$class, mpg$cyl)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 234
| mpg$cyl
mpg$class | 4 | 5 | 6 | 8 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
2seater | 0 | 0 | 0 | 5 | 5 |
| 1.731 | 0.085 | 1.688 | 8.210 | |
| 0.000 | 0.000 | 0.000 | 1.000 | 0.021 |
| 0.000 | 0.000 | 0.000 | 0.071 | |
| 0.000 | 0.000 | 0.000 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
compact | 32 | 2 | 13 | 0 | 47 |
| 15.210 | 1.782 | 0.518 | 14.060 | |
| 0.681 | 0.043 | 0.277 | 0.000 | 0.201 |
| 0.395 | 0.500 | 0.165 | 0.000 | |
| 0.137 | 0.009 | 0.056 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
midsize | 16 | 0 | 23 | 2 | 41 |
| 0.230 | 0.701 | 6.059 | 8.591 | |
| 0.390 | 0.000 | 0.561 | 0.049 | 0.175 |
| 0.198 | 0.000 | 0.291 | 0.029 | |
| 0.068 | 0.000 | 0.098 | 0.009 | |
-------------|-----------|-----------|-----------|-----------|-----------|
minivan | 1 | 0 | 10 | 0 | 11 |
| 2.070 | 0.188 | 10.641 | 3.291 | |
| 0.091 | 0.000 | 0.909 | 0.000 | 0.047 |
| 0.012 | 0.000 | 0.127 | 0.000 | |
| 0.004 | 0.000 | 0.043 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
pickup | 3 | 0 | 10 | 20 | 33 |
| 6.211 | 0.564 | 0.117 | 10.391 | |
| 0.091 | 0.000 | 0.303 | 0.606 | 0.141 |
| 0.037 | 0.000 | 0.127 | 0.286 | |
| 0.013 | 0.000 | 0.043 | 0.085 | |
-------------|-----------|-----------|-----------|-----------|-----------|
subcompact | 21 | 2 | 7 | 5 | 35 |
| 6.515 | 3.284 | 1.963 | 2.858 | |
| 0.600 | 0.057 | 0.200 | 0.143 | 0.150 |
| 0.259 | 0.500 | 0.089 | 0.071 | |
| 0.090 | 0.009 | 0.030 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
suv | 8 | 0 | 16 | 38 | 62 |
| 8.444 | 1.060 | 1.162 | 20.403 | |
| 0.129 | 0.000 | 0.258 | 0.613 | 0.265 |
| 0.099 | 0.000 | 0.203 | 0.543 | |
| 0.034 | 0.000 | 0.068 | 0.162 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 81 | 4 | 79 | 70 | 234 |
| 0.346 | 0.017 | 0.338 | 0.299 | |
-------------|-----------|-----------|-----------|-----------|-----------|
ggplot2:
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Create a summary table of counts for class and cyl
<- mpg %>%
data_summary count(class, cyl)
# Create the balloon plot
ggplot(data_summary, aes(x = class, y = factor(cyl), fill = n)) +
geom_tile(aes(width = 0.9, height = 0.9)) + # Create "balloons" using tiles
scale_fill_gradient(low = "skyblue", high = "darkblue", name = "Count") + # Color gradient for count
labs(title = "Balloon Plot of Car Class vs. Number of Cylinders",
x = "Car Class",
y = "Number of Cylinders") +
theme_minimal() +
theme(legend.position = "right") # Adjust legend position
Corrplot:
# Load necessary libraries
library(ggplot2) # for mpg dataset
library(corrplot) # for the balloon plot
corrplot 0.95 loaded
# Create a contingency table for 'class' and 'cyl'
<- table(mpg$class, mpg$cyl)
contingency_table
# Convert the table to proportions (values between 0 and 1)
<- prop.table(contingency_table)
prop_table
# Set larger margins and reduce title font size (since title is removed, no need to worry about it now)
par(mar = c(7, 4, 7, 2)) # Adjust margins (top margin is set to 7 for space)
# Suppress warnings and plot the graph
suppressWarnings({
corrplot(prop_table,
method = "circle", # "circle" creates round balloons
col = colorRampPalette(c("white", "skyblue"))(200), # Color scale
cl.lim = c(0, 1), # Color scale limits (proportions between 0 and 1)
tl.cex = 1.2, # Text size for axis labels
cl.cex = 1.2, # Color legend text size
addCoef.col = NULL) # Don't display numbers inside the balloons
})
# Load necessary libraries
library(ggplot2) # for mpg dataset and plotting
library(dplyr) # for data manipulation
# Calculate the average number of city miles ('cty') by 'class' and 'cyl'
<- mpg %>%
avg_cty_by_class_cyl group_by(class, cyl) %>%
summarise(avg_cty = mean(cty, na.rm = TRUE))
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
# Create a ggplot with circles representing average city miles
ggplot(avg_cty_by_class_cyl, aes(x = class, y = factor(cyl), size = avg_cty, color = avg_cty)) +
geom_point(alpha = 0.7) + # Use points (circles) with transparency
scale_size_continuous(range = c(3, 20)) + # Adjust circle size range
scale_color_gradient(low = "white", high = "skyblue") + # Color scale for fill
labs(
x = "Car Class",
y = "Number of Cylinders",
size = "Average City Miles",
color = "Average City Miles",
title = "Balloon Plot of Average City Miles by Class and Cylinders"
+
) theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Angle the x-axis labels for readability
plot.title = element_text(hjust = 0.5) # Center the plot title
)