Marissa’s attempt at a Quarto workbook

Week 1

Importing a data set.

Meet the penguins

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

Penguin Raw Data

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The above plot displays the relationship between flipper length and bill length in the three penguin species.
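A minimal sketch of how the data overview and plot above could be produced, assuming the palmerpenguins, dplyr and ggplot2 packages are installed. The original chunk is not shown, so the choice of axes and the colour mapping are assumptions.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

# Column-wise overview of the data (344 rows, 8 columns)
glimpse(penguins)

# Scatter plot of flipper length against bill length, coloured by species.
# Which variable goes on which axis is an assumption.
penguin_plot <- ggplot(penguins,
                       aes(x = flipper_length_mm, y = bill_length_mm,
                           colour = species)) +
  geom_point()

penguin_plot
```

Printing the plot gives the warning shown above because two penguins have missing measurements.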

Penguin Phobia

Despite being a bird enthusiast, Marissa is actually terrified of penguins as a result of being hissed at by one at a zoo when she was 5.

Week 3

Data wrangling - The Diamonds Data Set.

Mutate

To add new columns or to modify existing variables.

Below, three new variables have been added: “JustOne”, “Values” and “Simple”.

# A tibble: 53,940 × 13
   carat cut    color clarity depth table price     x     y     z JustOne Values
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl> <chr> 
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43       1 somet…
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31       1 somet…
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31       1 somet…
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63       1 somet…
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75       1 somet…
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48       1 somet…
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47       1 somet…
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53       1 somet…
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49       1 somet…
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39       1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
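A sketch of a mutate() call that would add three such columns, assuming dplyr and ggplot2 (which provides diamonds) are installed. The text stored in Values is truncated in the output and the value of Simple is not shown, so both are placeholders here.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_new <- diamonds %>%
  mutate(
    JustOne = 1,              # constant numeric column
    Values  = "placeholder",  # hypothetical text; the real value is truncated above
    Simple  = TRUE            # assumed logical value
  )

diamonds_new
```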

Below, multiple new variables (columns) have been created from existing variables.

# A tibble: 53,940 × 15
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
#   pricepercarat <dbl>, sqdep <dbl>
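A sketch of how columns like these could be derived from existing ones. Only price200 can be checked against the output (326 becomes 126); the other formulas are guesses based on the column names.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_prices <- diamonds %>%
  mutate(
    price200       = price - 200,    # consistent with the output above
    price20perc    = price * 0.20,   # assumed: 20% of the price
    price20percoff = price * 0.80,   # assumed: price after a 20% discount
    pricepercarat  = price / carat,  # assumed: price per carat
    sqdep          = depth^2         # assumed: depth squared
  )

diamonds_prices
```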

Mutate - additional exercise - “Midwest”.

# A tibble: 437 × 34
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

Summarize

To collapse rows and return a one-row summary.

# A tibble: 1 × 5
  avg.price dbl.price random.add avg.carat stdev.price
      <dbl>     <dbl>      <dbl>     <dbl>       <dbl>
1     3933.     7866.          3     0.798       3989.
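A sketch of a summarise() call that would return a one-row table with these column names. The formulas are assumptions inferred from the names and the values shown.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_summary <- diamonds %>%
  summarise(
    avg.price   = mean(price),      # mean price
    dbl.price   = 2 * mean(price),  # assumed: twice the mean price
    random.add  = 1 + 2,            # assumed: a constant, unrelated to the data
    avg.carat   = mean(carat),      # mean carat
    stdev.price = sd(price)         # standard deviation of price
  )

diamonds_summary
```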

Group by and ungroup

To group the rows of existing data by the values of one or more variables, so that later operations are performed separately for each group. Ungroup removes the grouping again.

summarise() and group_by() - To compare the averages of two groups separately:

# A tibble: 2 × 4
  Sex        m     s     n
  <chr>  <dbl> <dbl> <int>
1 female 0.437 0.268    25
2 male   0.487 0.268    25

The data have been grouped by sex so that the calculations are performed for males and females separately.

`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.

# A tibble: 27 × 5
   Sex      Age     m      s     n
   <chr>  <dbl> <dbl>  <dbl> <int>
 1 female    20 0.046 NA         1
 2 female    21 0.740  0.253     3
 3 female    22 0.672  0.253     2
 4 female    23 0.501 NA         1
 5 female    25 0.579  0.167     3
 6 female    26 0.41  NA         1
 7 female    28 0.152 NA         1
 8 female    29 0.426  0.339     2
 9 female    30 0.170  0.238     2
10 female    33 0.173 NA         1
# ℹ 17 more rows

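Here the data have been grouped by both sex and age, resulting in more rows (one per combination of sex and age). The standard deviation s is NA wherever a group contains a single observation.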
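A sketch of the two grouped summaries, using a small made-up data set in place of the 50-row data used in the workbook (the data values are hypothetical; the column names follow the output above).

```r
library(dplyr)

# Hypothetical scores standing in for the workbook's data
scores <- tibble(
  ID    = 1:6,
  Sex   = c("male", "female", "male", "female", "male", "female"),
  Age   = c(26, 25, 26, 25, 31, 30),
  Score = c(0.2, 0.4, 0.6, 0.8, 0.5, 0.3)
)

# One row per sex: mean, standard deviation and group size
by_sex <- scores %>%
  group_by(Sex) %>%
  summarise(m = mean(Score), s = sd(Score), n = n())

# One row per sex-and-age combination; .groups = "drop" returns an
# ungrouped result and avoids the regrouping message
by_sex_age <- scores %>%
  group_by(Sex, Age) %>%
  summarise(m = mean(Score), s = sd(Score), n = n(), .groups = "drop")

by_sex
by_sex_age
```

Groups with only one row get NA for s, because a standard deviation needs at least two values.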

mutate() and group_by() - To add new columns based on the existing group:

# A tibble: 50 × 5
      ID Sex      Age Score     m
   <int> <chr>  <dbl> <dbl> <dbl>
 1     1 male      26 0.01  0.487
 2     2 female    25 0.418 0.437
 3     3 male      39 0.014 0.487
 4     4 female    37 0.09  0.437
 5     5 male      31 0.061 0.487
 6     6 female    34 0.328 0.437
 7     7 male      34 0.656 0.487
 8     8 female    30 0.002 0.437
 9     9 male      26 0.639 0.487
10    10 female    33 0.173 0.437
# ℹ 40 more rows
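A sketch of a grouped mutate(), again with made-up data. Unlike summarise(), every row is kept and each row receives the mean of its own group.

```r
library(dplyr)

# Hypothetical scores; column names follow the output above
scores <- tibble(
  ID    = 1:4,
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.2, 0.4, 0.6, 0.8)
)

with_group_mean <- scores %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>%  # group mean repeated on every row of the group
  ungroup()                    # remove the grouping for later operations

with_group_mean
```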

Filter

To retain specific rows of data that meet specified requirements.

Below, only diamonds with a cut of Fair or Good and a price at or under $600 are displayed:

# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows
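One way to write this filter, as a sketch assuming dplyr and ggplot2 are installed. Conditions separated by a comma must all be true for a row to be kept.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

cheap_fair_good <- diamonds %>%
  filter(cut %in% c("Fair", "Good"),  # cut is Fair or Good
         price <= 600)                # and price is at or under 600

cheap_fair_good
```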

Select

To select only the variables (columns) desired. The order in which the variable names are listed is the order in which they will be displayed.

To retain only cut and color:

# A tibble: 53,940 × 2
   cut       color
   <ord>     <ord>
 1 Ideal     E    
 2 Premium   E    
 3 Good      E    
 4 Premium   I    
 5 Good      J    
 6 Very Good J    
 7 Very Good I    
 8 Very Good H    
 9 Fair      E    
10 Very Good H    
# ℹ 53,930 more rows

To retain all except cut and color:

# A tibble: 53,940 × 8
   carat clarity depth table price     x     y     z
   <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

To retain only the first five columns:

# A tibble: 53,940 × 5
   carat cut       color clarity depth
   <dbl> <ord>     <ord> <ord>   <dbl>
 1  0.23 Ideal     E     SI2      61.5
 2  0.21 Premium   E     SI1      59.8
 3  0.23 Good      E     VS1      56.9
 4  0.29 Premium   I     VS2      62.4
 5  0.31 Good      J     SI2      63.3
 6  0.24 Very Good J     VVS2     62.8
 7  0.24 Very Good I     VVS1     62.3
 8  0.26 Very Good H     SI1      61.9
 9  0.22 Fair      E     VS2      65.1
10  0.23 Very Good H     VS1      59.4
# ℹ 53,930 more rows

To retain all except the first five columns:

# A tibble: 53,940 × 5
   table price     x     y     z
   <dbl> <int> <dbl> <dbl> <dbl>
 1    55   326  3.95  3.98  2.43
 2    61   326  3.89  3.84  2.31
 3    65   327  4.05  4.07  2.31
 4    58   334  4.2   4.23  2.63
 5    58   335  4.34  4.35  2.75
 6    57   336  3.94  3.96  2.48
 7    57   336  3.95  3.98  2.47
 8    55   337  4.07  4.11  2.53
 9    61   337  3.87  3.78  2.49
10    61   338  4     4.05  2.39
# ℹ 53,930 more rows
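The four selections above can be sketched as follows, assuming dplyr and ggplot2 are installed. The object names are only for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

only_cut_color     <- diamonds %>% select(cut, color)    # keep two named columns
without_cut_color  <- diamonds %>% select(-cut, -color)  # drop two named columns
first_five         <- diamonds %>% select(1:5)           # keep columns by position
without_first_five <- diamonds %>% select(-(1:5))        # drop columns by position

names(first_five)
names(without_first_five)
```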

Arrange

To sort the rows by the values of a variable in either ascending or descending order. Applicable to both numerical and non-numerical values.

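To arrange cut in ascending order (cut is an ordered factor, so rows are sorted by level order, from Fair up to Ideal, rather than alphabetically):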

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 53,930 more rows

To arrange cut in descending order (again by level order, from Ideal down to Fair):

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
 9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
# ℹ 53,930 more rows

To arrange price in ascending numerical order:

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

To arrange price in descending numerical order:

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
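The four orderings above can be sketched as follows, assuming dplyr and ggplot2 are installed. The object names are only for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# cut is an ordered factor, so sorting follows its level order
levels(diamonds$cut)

by_cut        <- diamonds %>% arrange(cut)          # lowest level first
by_cut_desc   <- diamonds %>% arrange(desc(cut))    # highest level first
by_price      <- diamonds %>% arrange(price)        # cheapest first
by_price_desc <- diamonds %>% arrange(desc(price))  # most expensive first
```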

Week 4

Data Exploration - The Crickets Data Set.

Cricket Verdict

Not exactly the biggest fan of bugs either, but certainly an improvement from penguins!

Basic Scatter Plot

For Two Quantitative Variables.


Modifying Plot Properties

Additional Layers

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'
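A sketch of a scatter plot with modified properties and an added trend-line layer, assuming ggplot2 and modeldata are installed. The crickets data has the columns species, temp and rate; the axis labels, the colour mapping and the use of a linear model for the smoother are assumptions.

```r
library(ggplot2)
data(crickets, package = "modeldata")  # columns: species, temp, rate

cricket_plot <- ggplot(crickets, aes(x = temp, y = rate, colour = species)) +
  geom_point() +                              # basic scatter plot
  geom_smooth(method = "lm", se = FALSE) +    # additional layer: one trend line per species
  labs(x = "Temperature", y = "Chirp rate",   # modified plot properties
       colour = "Species")

cricket_plot
```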

Additional Plots

Histogram - For A Single Quantitative Variable With Multiple Frequencies.

Frequency Polygon.

Bar Chart Non-Colour Specified - For A Single Categorical Variable.

Bar Chart Colour Specified.

Box Plot - For One Quantitative And One Categorical Variable.
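The plot types listed above can be sketched as follows for the crickets data, assuming ggplot2 and modeldata are installed. The choice of variables and the number of bins are assumptions.

```r
library(ggplot2)
data(crickets, package = "modeldata")  # columns: species, temp, rate

# Histogram and frequency polygon: one quantitative variable
hist_plot <- ggplot(crickets, aes(x = rate)) + geom_histogram(bins = 15)
freq_plot <- ggplot(crickets, aes(x = rate)) + geom_freqpoly(bins = 15)

# Bar charts: one categorical variable, without and with colour
bar_plot        <- ggplot(crickets, aes(x = species)) + geom_bar()
bar_colour_plot <- ggplot(crickets, aes(x = species, fill = species)) + geom_bar()

# Box plot: one quantitative and one categorical variable
box_plot <- ggplot(crickets, aes(x = species, y = rate)) + geom_boxplot()
```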

Faceting

To Create Individual Plots For Each Value Of A Categorical Variable Specified
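A sketch of faceting with the crickets data, assuming ggplot2 and modeldata are installed. Each species gets its own panel; the variables plotted are an assumption.

```r
library(ggplot2)
data(crickets, package = "modeldata")  # columns: species, temp, rate

facet_plot <- ggplot(crickets, aes(x = temp, y = rate)) +
  geom_point() +
  facet_wrap(~species)  # one panel per value of species

facet_plot
```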

A Good Hypothesis

A hypothesis is a key element of the scientific research process, and can be described as either a theoretical or hypothetical explanation for observations, measurements and any phenomena that occur during a research project or experiment. Typically expressed as a mathematical model, a “good” hypothesis should be testable, objective, clear and relevant, as this gives the research an objective to work towards while avoiding excessive description and keeping it relevant to the area of knowledge desired.

Week 5

Choosing The Correct Type Of Analysis

Graph 1 - Box Plot

Contains one continuous quantitative variable (Sepal Length) and one nominal categorical variable (Species) with three levels. A test of means is required to test for differences across the three group means. A One-Way ANOVA would be applicable, as it compares the mean sepal lengths across the three species.
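A sketch of the One-Way ANOVA, assuming the graph is based on the iris data set that ships with R (the workbook does not name the data set).

```r
# iris is built into R; assumed to be the data behind the box plot
anova_fit <- aov(Sepal.Length ~ Species, data = iris)

# F test for a difference in mean sepal length across the three species
summary(anova_fit)
```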

Graph 2 - Density Plot

I have not seen or used this type of graph before, so I do not feel confident in assigning it to a statistical test “family”, as I’m unsure what it is actually displaying.

Contains one continuous quantitative variable (Petal Length)?
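For reference, a density plot is a smoothed version of a histogram: it shows how the values of one continuous variable are distributed. A sketch, assuming the iris data set and ggplot2; splitting the curves by Species is an assumption about the original graph.

```r
library(ggplot2)

# One smoothed distribution curve of petal length per species
density_plot <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5)  # alpha makes overlapping curves visible

density_plot
```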

Graph 3 - Scatter Plot

`geom_smooth()` using formula = 'y ~ x'

Contains two continuous quantitative variables (Petal Length and Petal Width). A correlation test is required. The relationship appears linear, so, assuming both variables are approximately normally distributed, the Pearson correlation coefficient would be applicable to determine the association between the two variables.
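A sketch of the Pearson correlation test, assuming the graph is based on the iris data set that ships with R.

```r
# Pearson correlation between petal length and petal width
pearson <- cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")

pearson$estimate  # correlation coefficient r
pearson$p.value   # p-value for the test that r differs from zero
```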

Graph 4 - Grouped Bar Plot

Contains two ordinal categorical variables (big and small). A frequency test is required to test for an association between the two categorical variables. A chi-square test would be suitable here, as it tests for a relationship (and therefore an association) between the two categorical variables.
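A sketch of a chi-square test of association. The data behind the bar plot are not shown, so the counts and the group labels A and B below are made up purely for illustration.

```r
# Hypothetical counts: rows are two groups, columns are the size categories
counts <- matrix(c(30, 10,
                   15, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 size  = c("big", "small")))

chi <- chisq.test(counts)

chi$expected  # expected counts if there were no association
chi$p.value   # p-value for the test of association
```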