Week 1
Importing a data set.
Meet the penguins

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
Penguin Raw Data
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
The plot above displays the relationship between flipper length and bill length across the three penguin species.
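A minimal sketch of how the listing and plot above could be produced. The axis assignment (flipper length on x, bill length on y) and the colouring by species are assumptions, since the original plotting code is not shown.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Column-wise overview of the data (rows, columns, types, first values)
glimpse(penguins)

# Assumption: flipper length on the x-axis, bill length on the y-axis,
# one colour per species
p <- ggplot(penguins,
            aes(x = flipper_length_mm, y = bill_length_mm, colour = species)) +
  geom_point() +
  labs(x = "Flipper length (mm)", y = "Bill length (mm)", colour = "Species")

# Rows with a missing measurement are dropped by geom_point() with a warning
n_missing <- sum(is.na(penguins$flipper_length_mm) |
                 is.na(penguins$bill_length_mm))

# Run print(p) to draw the plot
```

Counting the missing rows first explains the `geom_point()` warning: the rows it removes are the ones with `NA` measurements.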
Despite being a bird enthusiast, Marissa is actually terrified of penguins as a result of being hissed at by one at a zoo when she was 5.
Week 3
Data wrangling - The Diamonds Data Set.
Mutate
To add new columns or to modify current variables.
Below, three new variables have been added: “JustOne”, “Values” and “Simple”.
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z JustOne Values
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 somet…
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 1 somet…
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 somet…
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 1 somet…
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 somet…
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 1 somet…
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 1 somet…
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 1 somet…
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 somet…
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
Below, multiple variables (columns) have been created from existing variables.
# A tibble: 53,940 × 15
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
# pricepercarat <dbl>, sqdep <dbl>
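A sketch of a `mutate()` call that would produce columns with these names. Only `price200` can be checked against the displayed values (326 becomes 126). The formulas for the other four columns are assumptions inferred from their names, not the original code.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_new <- diamonds %>%
  mutate(
    price200       = price - 200,    # matches the displayed output (326 -> 126)
    price20perc    = price * 0.20,   # assumed: 20% of the price
    price20percoff = price * 0.80,   # assumed: price after a 20% discount
    pricepercarat  = price / carat,  # assumed: price per carat
    sqdep          = depth^2         # assumed: depth squared
  )

diamonds_new
```

Each new column is computed row by row from the existing columns, so the number of rows stays the same and only the number of columns grows.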
Mutate - additional exercise - “Midwest”.
# A tibble: 437 × 34
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAN… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
8 568 CARROLL IL 0.027 16805 622. 16519 111 30
9 569 CASS IL 0.024 13437 560. 13384 16 8
10 570 CHAMPA… IL 0.058 173025 2983. 146506 16559 331
# ℹ 427 more rows
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, …
Summarize
To collapse rows and return a one-row summary.
# A tibble: 1 × 5
avg.price dbl.price random.add avg.carat stdev.price
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3933. 7866. 3 0.798 3989.
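A sketch of a `summarize()` call consistent with the one-row output above. The expression behind `random.add` is an assumption (any expression that evaluates to 3 would give the same result), and `dbl.price` is assumed to be twice the mean price.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

price_summary <- diamonds %>%
  summarize(
    avg.price   = mean(price),
    dbl.price   = 2 * mean(price),  # assumed: double the average price
    random.add  = 1 + 2,            # assumed: a constant expression equal to 3
    avg.carat   = mean(carat),
    stdev.price = sd(price)
  )

price_summary
```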
Group by and ungroup
To group the rows of existing data by the values of one or more variables, so that subsequent operations are performed within each group.
summarize() and group_by() - To compare the averages of two groups separately:
# A tibble: 2 × 4
Sex m s n
<chr> <dbl> <dbl> <int>
1 female 0.437 0.268 25
2 male 0.487 0.268 25
The data have been grouped by sex so that the calculations are performed for males and females separately.
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
# A tibble: 27 × 5
Sex Age m s n
<chr> <dbl> <dbl> <dbl> <int>
1 female 20 0.046 NA 1
2 female 21 0.740 0.253 3
3 female 22 0.672 0.253 2
4 female 23 0.501 NA 1
5 female 25 0.579 0.167 3
6 female 26 0.41 NA 1
7 female 28 0.152 NA 1
8 female 29 0.426 0.339 2
9 female 30 0.170 0.238 2
10 female 33 0.173 NA 1
# ℹ 17 more rows
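The data have been grouped by both sex and age, resulting in more rows, one for each sex-age combination. Where a group contains a single observation (n = 1), the standard deviation s is NA.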
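The two grouped summaries can be sketched as follows. The original data set is not shown, so `scores` is a small hypothetical stand-in that reuses the column names from the output (`Sex`, `Age`, `Score`) with made-up values.

```r
library(dplyr)

# Hypothetical stand-in for the original data (made-up values)
scores <- tibble(
  ID    = 1:6,
  Sex   = c("male", "female", "male", "female", "male", "female"),
  Age   = c(26, 25, 26, 25, 31, 34),
  Score = c(0.1, 0.4, 0.3, 0.2, 0.5, 0.6)
)

# One row per sex: mean, standard deviation and group size
by_sex <- scores %>%
  group_by(Sex) %>%
  summarize(m = mean(Score), s = sd(Score), n = n())

# One row per sex-age combination; .groups = "drop" returns an ungrouped
# result and suppresses the message about the grouping of the output
by_sex_age <- scores %>%
  group_by(Sex, Age) %>%
  summarize(m = mean(Score), s = sd(Score), n = n(), .groups = "drop")
```

In `by_sex_age`, groups with a single observation have `s` equal to `NA`, because a standard deviation cannot be computed from one value.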
mutate() and group_by() - To add new columns based on the existing groups:
# A tibble: 50 × 5
ID Sex Age Score m
<int> <chr> <dbl> <dbl> <dbl>
1 1 male 26 0.01 0.487
2 2 female 25 0.418 0.437
3 3 male 39 0.014 0.487
4 4 female 37 0.09 0.437
5 5 male 31 0.061 0.487
6 6 female 34 0.328 0.437
7 7 male 34 0.656 0.487
8 8 female 30 0.002 0.437
9 9 male 26 0.639 0.487
10 10 female 33 0.173 0.437
# ℹ 40 more rows
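A sketch of the grouped `mutate()` pattern, again using a hypothetical stand-in data set with made-up values. Unlike `summarize()`, it keeps every row and repeats the group mean in a new column `m`.

```r
library(dplyr)

# Hypothetical stand-in for the original data (made-up values)
scores <- tibble(
  ID    = 1:6,
  Sex   = c("male", "female", "male", "female", "male", "female"),
  Age   = c(26, 25, 26, 25, 31, 34),
  Score = c(0.1, 0.4, 0.3, 0.2, 0.5, 0.6)
)

with_group_mean <- scores %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>%  # mean of Score within each sex
  ungroup()                    # remove the grouping for later operations

with_group_mean
```

Calling `ungroup()` at the end is a design choice: it prevents later steps from being applied per group by accident.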
Filter
To retain specific rows of data that meet specified requirements.
Below, only diamonds with a cut of Fair or Good and a price at or under $600 are displayed:
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
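A sketch of a `filter()` call matching the description above. It assumes the two conditions were combined with a logical AND, which is what separating them with a comma does.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

cheap_fair_good <- diamonds %>%
  filter(
    cut %in% c("Fair", "Good"),  # cut is Fair or Good
    price <= 600                 # and the price is at or under $600
  )

cheap_fair_good
```

If this matches the original conditions, the result should have the 505 rows reported in the output above.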
Select
To select only the variables (columns) desired. The order in which variable names are listed is the order in which they will be displayed.
To retain only cut and color:
# A tibble: 53,940 × 2
cut color
<ord> <ord>
1 Ideal E
2 Premium E
3 Good E
4 Premium I
5 Good J
6 Very Good J
7 Very Good I
8 Very Good H
9 Fair E
10 Very Good H
# ℹ 53,930 more rows
To retain all except cut and color:
# A tibble: 53,940 × 8
carat clarity depth table price x y z
<dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
To retain only the first five columns:
# A tibble: 53,940 × 5
carat cut color clarity depth
<dbl> <ord> <ord> <ord> <dbl>
1 0.23 Ideal E SI2 61.5
2 0.21 Premium E SI1 59.8
3 0.23 Good E VS1 56.9
4 0.29 Premium I VS2 62.4
5 0.31 Good J SI2 63.3
6 0.24 Very Good J VVS2 62.8
7 0.24 Very Good I VVS1 62.3
8 0.26 Very Good H SI1 61.9
9 0.22 Fair E VS2 65.1
10 0.23 Very Good H VS1 59.4
# ℹ 53,930 more rows
To retain all except the first five columns:
# A tibble: 53,940 × 5
table price x y z
<dbl> <int> <dbl> <dbl> <dbl>
1 55 326 3.95 3.98 2.43
2 61 326 3.89 3.84 2.31
3 65 327 4.05 4.07 2.31
4 58 334 4.2 4.23 2.63
5 58 335 4.34 4.35 2.75
6 57 336 3.94 3.96 2.48
7 57 336 3.95 3.98 2.47
8 55 337 4.07 4.11 2.53
9 61 337 3.87 3.78 2.49
10 61 338 4 4.05 2.39
# ℹ 53,930 more rows
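The four selections above can be sketched as follows. The object names are illustrative, and the original code may have used a different but equivalent syntax.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep only cut and color
only_cut_color <- diamonds %>% select(cut, color)

# Keep everything except cut and color
without_cut_color <- diamonds %>% select(-cut, -color)

# Keep the first five columns, selected by position
first_five <- diamonds %>% select(1:5)

# Keep everything except the first five columns
without_first_five <- diamonds %>% select(-(1:5))
```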
Arrange
To arrange values within a variable in either ascending or descending order, applicable to both numerical and non-numerical values.
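To arrange cut in ascending order (cut is an ordered factor, so rows are sorted by factor level, from Fair up to Ideal, rather than alphabetically):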
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
To arrange cut in descending order of factor level (Ideal first):
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
# ℹ 53,930 more rows
To arrange price in ascending numerical order:
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
To arrange price in descending numerical order:
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
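A sketch of the four orderings above using `arrange()` and `desc()`. Because `cut` is an ordered factor, `arrange()` follows its level order, which can be inspected with `levels()`.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Level order of the cut factor: this is the order arrange() follows
levels(diamonds$cut)

by_cut        <- diamonds %>% arrange(cut)          # lowest level (Fair) first
by_cut_desc   <- diamonds %>% arrange(desc(cut))    # highest level (Ideal) first
by_price      <- diamonds %>% arrange(price)        # cheapest first
by_price_desc <- diamonds %>% arrange(desc(price))  # most expensive first
```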
Week 4
Data Exploration - The Crickets Data Set.
Not exactly the biggest fan of bugs either, but certainly an improvement from penguins!
Basic Scatter Plot
For Two Quantitative Variables.
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
Modifying Plot Properties
Additional Layers
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
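A sketch of how a scatter plot of the crickets data can be built up in layers. The choice of variables (`temp` on x, `rate` on y), the colouring by `species`, the point styling and the linear smoother are assumptions, since the original plots are not shown.

```r
library(ggplot2)

# Load only the crickets data, without attaching modeldata
# (this avoids masking the penguins object)
data(crickets, package = "modeldata")

# Basic scatter plot for two quantitative variables
p_basic <- ggplot(crickets, aes(x = temp, y = rate)) +
  geom_point()

# Modified plot properties: colour by species, custom point size and labels
p_styled <- ggplot(crickets, aes(x = temp, y = rate, colour = species)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(x = "Temperature", y = "Chirp rate", colour = "Species")

# Additional layer: a fitted line per species (assumed linear, no ribbon)
p_smooth <- p_styled +
  geom_smooth(method = "lm", se = FALSE)

# Run print(p_smooth) to draw the final plot
```

Building the plot as separate objects makes it easy to see what each added layer contributes.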
Additional Plots
Histogram - For A Single Quantitative Variable With Multiple Frequencies.
Frequency Polygon.
Bar Chart Non-Colour Specified - For A Single Categorical Variable.
Bar Chart Colour Specified.
Box Plot - For One Quantitative And One Categorical Variable.
Faceting
To Create Individual Plots For Each Value Of A Categorical Variable Specified
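The plot types listed above can be sketched with the crickets data as follows. Which variable was used in each original plot is not shown, so `rate` as the quantitative variable, `species` as the categorical variable and the number of bins are all assumptions.

```r
library(ggplot2)
data(crickets, package = "modeldata")

# Histogram and frequency polygon: one quantitative variable
p_hist <- ggplot(crickets, aes(x = rate)) + geom_histogram(bins = 15)
p_poly <- ggplot(crickets, aes(x = rate)) + geom_freqpoly(bins = 15)

# Bar charts: one categorical variable, without and with fill colour
p_bar      <- ggplot(crickets, aes(x = species)) + geom_bar()
p_bar_fill <- ggplot(crickets, aes(x = species, fill = species)) + geom_bar()

# Box plot: one quantitative and one categorical variable
p_box <- ggplot(crickets, aes(x = species, y = rate)) + geom_boxplot()

# Faceting: one panel per value of the categorical variable
p_facet <- p_hist + facet_wrap(~species)
```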
A Good Hypothesis
A hypothesis is a key element of the scientific research process and can be described as a theoretical or hypothetical explanation for the observations, measurements and phenomena that occur during a research project or experiment. Often expressed as a mathematical model, a “good” hypothesis should be testable, objective, clear and relevant, as this gives the research a defined objective to work towards, avoids excessive description and keeps the work relevant to the area of knowledge being investigated.
Week 5
Choosing The Correct Type Of Analysis
Graph 1 - Box Plot
Contains one continuous quantitative variable (Sepal Length) and one nominal categorical variable (Species) with three levels. A test of means is required to test for differences across the three group means; a One-Way ANOVA would be applicable because it compares the mean sepal lengths across the three species.
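A minimal sketch of the One-Way ANOVA described above. It assumes the graph was drawn from R's built-in `iris` data, which the variable names suggest but the text does not state.

```r
# Assumption: the box plot is based on the built-in iris data set
fit <- aov(Sepal.Length ~ Species, data = iris)

# ANOVA table: degrees of freedom, sums of squares, F value and p-value
anova_table <- summary(fit)[[1]]
anova_table

# p-value for the Species effect (first row of the table)
p_value <- anova_table[["Pr(>F)"]][1]
```

A small p-value indicates that at least one species mean differs from the others. It does not say which one, so a post-hoc comparison would be needed for that.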
Graph 2 - Density Plot
I have not seen or used this type of graph before; therefore, I do not feel confident in assigning it to a statistical test “family”, as I’m unsure what it is actually displaying.
Contains one continuous quantitative variable (Petal Length)?
Graph 3 - Scatter Plot
`geom_smooth()` using formula = 'y ~ x'
Contains two continuous quantitative variables (Petal Length and Petal Width). A correlation test is required. The relationship appears linear, so, provided the variables are also approximately normally distributed, the Pearson correlation coefficient would be applicable to determine the association between the two variables.
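A minimal sketch of the Pearson correlation test described above, again assuming the built-in `iris` data.

```r
# Assumption: the scatter plot is based on the built-in iris data set
result <- cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")

# Correlation coefficient r and the p-value of the test
r       <- unname(result$estimate)
p_value <- result$p.value

result
```

If the normality assumption is doubtful, `method = "spearman"` gives a rank-based alternative.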
Graph 4 - Grouped Bar Plot
Contains two categorical variables, one of which has the levels big and small. A frequency test is required to test for an association between the two categorical variables. A chi-square test of independence would be suitable here, as it tests whether the two categorical variables are related.
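A minimal sketch of a chi-square test of independence. The definition of the big/small variable is not given in the text, so the `size` variable below is hypothetical: it splits `iris` flowers at the median sepal length, and Species is assumed to be the second categorical variable.

```r
# Hypothetical size variable: "big" if sepal length is above the median
size <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length), "big", "small")

# Contingency table of counts: species (rows) by size (columns)
counts <- table(iris$Species, size)
counts

# Chi-square test of independence on the table of counts
result  <- chisq.test(counts)
p_value <- result$p.value
```

The test works on the table of counts rather than the raw measurements, which is why it suits a grouped bar plot of frequencies.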