Marissa’s attempt at a Quarto workbook

Week 1/2

Importing a data set.

Meet the penguins

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

Penguin Raw Data

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The plot above displays the relationship between flipper length and bill length for the three penguin species.
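A minimal sketch of how the data overview and the scatter plot could be produced. The choice of axes, the colour mapping and the object name `penguin_plot` are assumptions, since the original code chunk is not shown.

```r
library(dplyr)
library(ggplot2)

# Take the data explicitly from palmerpenguins to avoid masking issues
penguins <- palmerpenguins::penguins

# Compact overview: one line per column, with its type and first values
glimpse(penguins)

# Flipper length against bill length, coloured by species (assumed layout)
penguin_plot <- ggplot(penguins,
                       aes(x = flipper_length_mm,
                           y = bill_length_mm,
                           colour = species)) +
  geom_point()
penguin_plot
```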

Penguin Phobia

Despite being a bird enthusiast, Marissa is actually terrified of penguins as a result of being hissed at by one at a zoo when she was 5.

Week 3

Data wrangling - The Diamonds Data Set.

Mutate

To add new columns or to modify current variables.

Below, three new variables have been added: “JustOne”, “Values” and “Simple”.

# A tibble: 53,940 × 13
   carat cut    color clarity depth table price     x     y     z JustOne Values
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl> <chr> 
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43       1 somet…
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31       1 somet…
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31       1 somet…
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63       1 somet…
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75       1 somet…
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48       1 somet…
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47       1 somet…
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53       1 somet…
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49       1 somet…
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39       1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
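A sketch of a `mutate()` call that would add three columns of these types. Only `JustOne = 1` can be read from the output; the text used for `Values` and the logical used for `Simple` are placeholders, and `diamonds_new` is a hypothetical name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_new <- diamonds %>%
  mutate(
    JustOne = 1,              # constant numeric column, as in the output
    Values  = "placeholder",  # hypothetical text; the real value is truncated
    Simple  = TRUE            # hypothetical logical; the real value is not shown
  )
diamonds_new
```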

Below, multiple new variables (columns) have been created from existing variables.

# A tibble: 53,940 × 15
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
#   pricepercarat <dbl>, sqdep <dbl>
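A sketch of how five derived columns like these might be built. `price200 = price - 200` agrees with the values shown (326 becomes 126); the other four formulas are guesses from the column names and should be read as assumptions.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_priced <- diamonds %>%
  mutate(
    price200       = price - 200,    # consistent with the output above
    price20perc    = price * 0.20,   # assumed: 20% of the price
    price20percoff = price * 0.80,   # assumed: price after a 20% discount
    pricepercarat  = price / carat,  # assumed: price per carat
    sqdep          = depth^2         # assumed: squared depth
  )
diamonds_priced
```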

Mutate - additional exercise - “Midwest”.

# A tibble: 437 × 34
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

Summarize

To collapse rows and return a one-row summary.

# A tibble: 1 × 5
  avg.price dbl.price random.add avg.carat stdev.price
      <dbl>     <dbl>      <dbl>     <dbl>       <dbl>
1     3933.     7866.          3     0.798       3989.
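A sketch that would give a one-row summary with these column names. The expressions for `dbl.price` and `random.add` are assumptions chosen to agree with the values shown (twice the mean price, and the constant 3); `price_summary` is a hypothetical name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

price_summary <- diamonds %>%
  summarize(
    avg.price   = mean(price),
    dbl.price   = 2 * mean(price),  # assumed: double the average price
    random.add  = 1 + 2,            # assumed: any constant expression works here
    avg.carat   = mean(carat),
    stdev.price = sd(price)
  )
price_summary
```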

Group by and ungroup

To group rows by one or more variables so that later operations are carried out within each group; ungroup() removes the grouping again.

summarize() and group_by() - To compare the averages of two groups separately:

# A tibble: 2 × 4
  Sex        m     s     n
  <chr>  <dbl> <dbl> <int>
1 female 0.437 0.268    25
2 male   0.487 0.268    25

The data have been grouped by sex so that the calculations are carried out for males and females separately.

`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
# A tibble: 27 × 5
   Sex      Age     m      s     n
   <chr>  <dbl> <dbl>  <dbl> <int>
 1 female    20 0.046 NA         1
 2 female    21 0.740  0.253     3
 3 female    22 0.672  0.253     2
 4 female    23 0.501 NA         1
 5 female    25 0.579  0.167     3
 6 female    26 0.41  NA         1
 7 female    28 0.152 NA         1
 8 female    29 0.426  0.339     2
 9 female    30 0.170  0.238     2
10 female    33 0.173 NA         1
# ℹ 17 more rows

This code has been grouped by both sex and age, resulting in more rows.
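A sketch of both groupings on a small made-up data set. The `scores` tibble and its values are hypothetical; the author's 50-row data set is not reproduced here. Groups with a single row give `NA` for the standard deviation, as in the output above.

```r
library(dplyr)

# Hypothetical scores (assumption), standing in for the author's data
scores <- tibble(
  ID    = 1:6,
  Sex   = c("male", "female", "male", "female", "male", "female"),
  Age   = c(26, 25, 26, 25, 31, 34),
  Score = c(0.10, 0.42, 0.64, 0.58, 0.06, 0.33)
)

# One row per sex
by_sex <- scores %>%
  group_by(Sex) %>%
  summarize(m = mean(Score), s = sd(Score), n = n())

# One row per sex and age combination; .groups = "drop" returns an ungrouped result
by_sex_age <- scores %>%
  group_by(Sex, Age) %>%
  summarize(m = mean(Score), s = sd(Score), n = n(), .groups = "drop")

by_sex
by_sex_age
```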

mutate() and group_by() - To add new columns based on the existing groups:

# A tibble: 50 × 5
      ID Sex      Age Score     m
   <int> <chr>  <dbl> <dbl> <dbl>
 1     1 male      26 0.01  0.487
 2     2 female    25 0.418 0.437
 3     3 male      39 0.014 0.487
 4     4 female    37 0.09  0.437
 5     5 male      31 0.061 0.487
 6     6 female    34 0.328 0.437
 7     7 male      34 0.656 0.487
 8     8 female    30 0.002 0.437
 9     9 male      26 0.639 0.487
10    10 female    33 0.173 0.437
# ℹ 40 more rows
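A sketch of a grouped `mutate()`, again on hypothetical data. Unlike `summarize()`, every original row is kept and the group mean is repeated on each row of its group.

```r
library(dplyr)

# Hypothetical scores (assumption), not the author's data
scores <- tibble(
  ID    = 1:4,
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.10, 0.40, 0.60, 0.50)
)

scores_with_mean <- scores %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>%  # group mean added as a new column
  ungroup()                    # remove the grouping once it is no longer needed
scores_with_mean
```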

Filter

To retain specific rows of data that meet specified requirements.

Below - code to only display data from diamonds that have a cut value of fair or good, and a price at or under $600:

# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows
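A sketch of a filter matching the description above (cut of Fair or Good, price at or under 600). The object name `cheap_fair_good` is hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

cheap_fair_good <- diamonds %>%
  filter(cut %in% c("Fair", "Good"),  # either of the two cut levels
         price <= 600)                # conditions separated by a comma are combined with AND
cheap_fair_good
```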

Select

To keep only the desired variables (columns). The order in which the variable names are listed is the order in which they are displayed.

To retain only cut and color:

# A tibble: 53,940 × 2
   cut       color
   <ord>     <ord>
 1 Ideal     E    
 2 Premium   E    
 3 Good      E    
 4 Premium   I    
 5 Good      J    
 6 Very Good J    
 7 Very Good I    
 8 Very Good H    
 9 Fair      E    
10 Very Good H    
# ℹ 53,930 more rows

To retain all except cut and color:

# A tibble: 53,940 × 8
   carat clarity depth table price     x     y     z
   <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

To retain only the first five columns:

# A tibble: 53,940 × 5
   carat cut       color clarity depth
   <dbl> <ord>     <ord> <ord>   <dbl>
 1  0.23 Ideal     E     SI2      61.5
 2  0.21 Premium   E     SI1      59.8
 3  0.23 Good      E     VS1      56.9
 4  0.29 Premium   I     VS2      62.4
 5  0.31 Good      J     SI2      63.3
 6  0.24 Very Good J     VVS2     62.8
 7  0.24 Very Good I     VVS1     62.3
 8  0.26 Very Good H     SI1      61.9
 9  0.22 Fair      E     VS2      65.1
10  0.23 Very Good H     VS1      59.4
# ℹ 53,930 more rows

To retain all except the first five columns:

# A tibble: 53,940 × 5
   table price     x     y     z
   <dbl> <int> <dbl> <dbl> <dbl>
 1    55   326  3.95  3.98  2.43
 2    61   326  3.89  3.84  2.31
 3    65   327  4.05  4.07  2.31
 4    58   334  4.2   4.23  2.63
 5    58   335  4.34  4.35  2.75
 6    57   336  3.94  3.96  2.48
 7    57   336  3.95  3.98  2.47
 8    55   337  4.07  4.11  2.53
 9    61   337  3.87  3.78  2.49
10    61   338  4     4.05  2.39
# ℹ 53,930 more rows
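The four selections above can be sketched as follows. The object names are hypothetical, and there are other equivalent spellings (for example `select(-c(cut, color))`).

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

only_cut_color <- diamonds %>% select(cut, color)    # keep two named columns
no_cut_color   <- diamonds %>% select(-cut, -color)  # drop two named columns
first_five     <- diamonds %>% select(1:5)           # keep by position
without_five   <- diamonds %>% select(-(1:5))        # drop by position
```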

Arrange

To arrange values within a variable in either ascending or descending order, applicable to both numerical and non-numerical values.

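To arrange cut in ascending order (cut is an ordered factor, so rows are sorted by level, from Fair up to Ideal, rather than alphabetically):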

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 53,930 more rows

To arrange cut in descending order (by factor level, starting from Ideal):

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
 9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
# ℹ 53,930 more rows

To arrange price in numerical order:

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

To arrange price in descending numerical order:

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
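A sketch of the four orderings shown in this section. The object names are hypothetical; `desc()` reverses the sort order for a column.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

cut_ascending    <- diamonds %>% arrange(cut)          # Fair first (lowest level)
cut_descending   <- diamonds %>% arrange(desc(cut))    # Ideal first (highest level)
price_ascending  <- diamonds %>% arrange(price)        # cheapest first
price_descending <- diamonds %>% arrange(desc(price))  # most expensive first
```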

Week 4

Data Exploration - The Crickets Data Set.

Cricket Verdict

Not exactly the biggest fan of bugs either, but certainly an improvement from penguins!

Basic Scatter Plot

For Two Quantitative Variables.


Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':

    penguins

Modifying Plot Properties

Additional Layers

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'
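A sketch of a scatter plot with modified properties and an added trend-line layer, since the plots themselves are not reproduced here. It assumes the `crickets` data from the modeldata package (columns `species`, `temp`, `rate`); the labels, colours and object name are illustrative choices.

```r
library(ggplot2)

crickets <- modeldata::crickets

cricket_plot <- ggplot(crickets, aes(x = temp, y = rate, colour = species)) +
  geom_point() +                            # basic scatter plot layer
  geom_smooth(method = "lm", se = FALSE) +  # additional layer: one straight line per species
  labs(x = "Temperature",
       y = "Chirp rate",
       colour = "Species",
       title = "Cricket chirps")            # modified plot properties (assumed wording)
cricket_plot
```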

Additional Plots

Histogram - For A Single Quantitative Variable With Multiple Frequencies.

Frequency Polygon.

Bar Chart Non-Colour Specified - For A Single Categorical Variable.

Bar Chart Colour Specified.

Box Plot - For One Quantitative And One Categorical Variable.

Faceting

To Create Individual Plots For Each Value Of A Categorical Variable Specified
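A sketch of the plot types listed above, ending with a faceted version. It assumes the `crickets` data from modeldata; the author may have used other data sets or settings for some of these, so the variables, bin count and object names are assumptions.

```r
library(ggplot2)

crickets <- modeldata::crickets

# Histogram and frequency polygon: one quantitative variable
hist_plot <- ggplot(crickets, aes(x = rate)) + geom_histogram(bins = 15)
freq_plot <- ggplot(crickets, aes(x = rate)) + geom_freqpoly(bins = 15)

# Bar charts: one categorical variable, without and with a colour mapping
bar_plain  <- ggplot(crickets, aes(x = species)) + geom_bar()
bar_colour <- ggplot(crickets, aes(x = species, fill = species)) + geom_bar()

# Box plot: one quantitative and one categorical variable
box_plot <- ggplot(crickets, aes(x = species, y = rate)) + geom_boxplot()

# Faceting: one panel per value of the categorical variable
facet_plot <- hist_plot + facet_wrap(~species)
```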

A Good Hypothesis

A hypothesis is a key element of the scientific research process. It can be described as a theoretical or hypothetical explanation for the observations, measurements and phenomena that occur during a research project or experiment. Typically expressed as a mathematical model, a “good” hypothesis should be testable, objective, clear and relevant. This gives the research an objective to work towards, avoids excessive description and keeps the work relevant to the area of knowledge of interest.

Week 5

Choosing The Correct Type Of Analysis

Graph 1 - Box Plot

Contains one continuous quantitative variable (Sepal Length) and one categorical variable (Species) with three levels. A mean test is required to test for differences across the three group means; a One-Way ANOVA would be applicable because it compares the mean sepal lengths across the three species.
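A sketch of the one-way ANOVA described above, assuming the graph was drawn from R's built-in `iris` data. The object name `sepal_aov` is hypothetical.

```r
# One-way ANOVA: does mean sepal length differ between the three species?
sepal_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(sepal_aov)
```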

Graph 2 - Density Plot

I have not seen or used this type of graph before, therefore I do not feel confident in assigning it to a statistical test “family” as I’m unsure what it is actually displaying.

Contains one continuous quantitative variable (Petal Length)?

Graph 3 - Scatter Plot

`geom_smooth()` using formula = 'y ~ x'

Contains two continuous quantitative variables (Petal Length and Petal Width). A correlation test is required. The relationship appears linear and the data are assumed to be normally distributed, therefore Pearson’s correlation test would be applicable to determine the association between the two variables.
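A sketch of the Pearson test described above, assuming the built-in `iris` data. Pearson is the default method of `cor.test()`; `petal_cor` is a hypothetical name.

```r
# Pearson correlation between petal length and petal width
petal_cor <- cor.test(iris$Petal.Length, iris$Petal.Width)
petal_cor
```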

Graph 4 - Grouped Bar Plot

Contains two ordinal categorical variables (big and small). A frequency test is required to test for an association between the two categorical variables. A chi-square test would be suitable here, as it would test for a relationship (and therefore an association) between them.
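A sketch of a chi-square test on two categorical variables. How the big/small grouping was built is not shown, so the `size` variable below (split at the median sepal length of `iris`) is purely an assumption for illustration.

```r
# Hypothetical size category: below the median sepal length is "small", otherwise "big"
iris_size <- transform(
  iris,
  size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big")
)

# Contingency table of species by size, then the chi-square test of association
size_table <- table(iris_size$Species, iris_size$size)
size_test  <- chisq.test(size_table)
size_test
```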

Week 6

Week 7

Types of mean tests

Parametric: Suitable in cases where the data are normally distributed. These tests typically possess greater statistical power and are more likely to detect an effect.

Non-Parametric: Suitable in cases where the data are not normally distributed or the sample size is small. These tests are based on differences in the median as opposed to the mean, and are therefore distribution free.

Comparing One Sample Mean to a Standard Known Mean

One Sample T-Test (Parametric)

Rows: 10 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): name
dbl (1): weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

T-Test Formula Abbreviations

H0 = Null Hypothesis

Ha = Alternative Hypothesis

m = Mean

μ = Theoretical Value/Mean

n = Sample Size

s = Standard Deviation

T-Test Code Format

t.test(x, mu = 0, alternative = "two.sided")

x = numeric vector

mu = theoretical mean (0 is default)

alternative = alternative hypothesis (two.sided is default but can be greater or less)


    One Sample t-test

data:  Mice_Weights_$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 17.8172 20.6828
sample estimates:
mean of x 
    19.25 
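A sketch of the one-sample t-test call. The author's file is not shown, so the ten weights below are illustrative values (an assumption) chosen because they give the same mean (19.25) and t statistic as the output above.

```r
# Illustrative mouse weights (assumption), not read from the author's file
mouse_weights <- c(17.6, 20.6, 22.2, 15.3, 20.9, 21.0, 18.9, 18.9, 18.9, 18.2)

# Test whether the mean weight differs from the theoretical mean of 25
one_sample <- t.test(mouse_weights, mu = 25, alternative = "two.sided")
one_sample
```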

One Sample Wilcoxon Test (Non-Parametric)

Wilcoxon Test Formula Abbreviations

H0 = Null Hypothesis

Ha = Alternative Hypothesis

m = Median

M0 = Theoretical Value/Median

Wilcoxon Test Code Format

wilcox.test(x, mu = 0, alternative = "two.sided")

x = numeric vector

mu = theoretical mean/median value (0 is default)

alternative = alternative hypothesis (two.sided is default but can be greater or less)

Warning in wilcox.test.default(Mice_Weights_$weight, mu = 25): cannot compute
exact p-value with ties

    Wilcoxon signed rank test with continuity correction

data:  Mice_Weights_$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
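A sketch of the one-sample Wilcoxon call on the same kind of data. The weights are illustrative values (an assumption), not the author's file. Setting `exact = FALSE` asks for the normal approximation directly, which avoids the warning about ties.

```r
# Illustrative mouse weights (assumption); three values are tied at 18.9
mouse_weights <- c(17.6, 20.6, 22.2, 15.3, 20.9, 21.0, 18.9, 18.9, 18.9, 18.2)

# Test whether the median weight differs from the theoretical value of 25
one_sample_wilcox <- wilcox.test(mouse_weights, mu = 25, exact = FALSE)
one_sample_wilcox
```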

Comparing the Means of Two Independent Groups

Unpaired Two Samples T-Test (Parametric)

Rows: 18 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Group
dbl (1): Weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Unpaired Two Samples T-Test Formula Abbreviations

H0 = Null Hypothesis

Ha = Alternative Hypothesis

mA = Group A Mean

mB = Group B Mean

nA = Group A Size

nB = Group B Size

S² = Pooled Variance Estimator of the Two Groups

Unpaired Two Samples T-Test Code Format

t.test(x, y, alternative = "two.sided", var.equal = FALSE)

x & y = numeric vectors

alternative = alternative hypothesis (two.sided is default but can be greater or less)

var.equal = logical value that indicates whether to treat the two variances as equal. If TRUE, the pooled variance is used to estimate the variance; otherwise the Welch approximation is used.


    Two Sample t-test

data:  Weight by Group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
  4.029759 29.748019
sample estimates:
  mean in group Man mean in group Woman 
           68.98889            52.10000 
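A sketch of the unpaired two-sample t-test using the formula interface. The author's file is not shown, so the weights below are illustrative values (an assumption) chosen because they give the same two group means as the output above.

```r
# Illustrative weights (assumption), not read from the author's file
weights <- data.frame(
  Group  = rep(c("Woman", "Man"), each = 9),
  Weight = c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5,  # women
             67.8, 60.0, 63.4, 76.0, 89.4, 73.3, 67.3, 61.3, 62.4)  # men
)

# var.equal = TRUE uses the pooled variance (classic two-sample t-test)
two_sample <- t.test(Weight ~ Group, data = weights, var.equal = TRUE)
two_sample
```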

Unpaired Two Samples Wilcoxon Test (Non-Parametric)

Unpaired Two Samples Wilcoxon Test Code Format

wilcox.test(x, y, alternative = "two.sided")

x & y = numeric vectors

alternative = alternative hypothesis (two.sided is default but can be greater or less)
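A sketch of the unpaired two-sample Wilcoxon test. The two vectors are illustrative values (an assumption), not the author's data. Because some values are tied, `exact = FALSE` is set so the normal approximation is used without a warning.

```r
# Illustrative weights (assumption)
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight   <- c(67.8, 60.0, 63.4, 76.0, 89.4, 73.3, 67.3, 61.3, 62.4)

two_sample_wilcox <- wilcox.test(men_weight, women_weight,
                                 alternative = "two.sided", exact = FALSE)
two_sample_wilcox
```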

Comparing the Means of Paired Samples

Paired Samples T-Test (Parametric)

Rows: 20 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Group
dbl (1): Weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Paired Samples T-Test Formula Abbreviations

H0 = Null Hypothesis

Ha = Alternative Hypothesis

M = Mean of the Differences (d)

N = Sample Size

S = Standard Deviation of d

Paired Samples T-Test Code Format

t.test(x, y, paired = TRUE, alternative = "two.sided")

x & y = numeric vectors

paired = logical value specifying that a paired t-test should be computed

alternative = alternative hypothesis (two.sided is default but can be greater or less)
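A sketch of the paired t-test. No output is shown for this test, so the before/after weights below are hypothetical values used only to make the call runnable; each position in the two vectors is the same individual.

```r
# Hypothetical before/after weights for the same 10 individuals (assumption)
before <- c(200.1, 190.9, 192.7, 213.0, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
after  <- c(392.9, 393.2, 345.1, 393.0, 434.0, 427.9, 422.0, 383.9, 392.3, 352.2)

# The test is carried out on the differences within each pair
paired_result <- t.test(after, before, paired = TRUE, alternative = "two.sided")
paired_result
```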

Paired Samples Wilcoxon Test (Non-Parametric)

Comparing the Means of More Than Two Groups

Analysis of Variance - ANOVA (Parametric)

One Way ANOVA

Two Way ANOVA

MANOVA Multivariate Analysis of Variance

Kruskal-Wallis Test (Non-Parametric)

Week 8

Correlation Tests

Correlation tests can only be used for numerical variables.

To be used when effects are either unexpected or inexplicable.

Pearson’s Correlation Test

For assumed normal distribution (i.e. parametric)

Example:

cor.test(dataset$variable1, dataset$variable2)

Spearman’s Correlation Test

For assumed non-normal distribution (i.e. non-parametric)

Example:

cor.test(dataset$variable1, dataset$variable2, method = "spearman")
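A runnable sketch of both tests on the built-in `mtcars` data (car weight against fuel efficiency). The choice of data set and the object names are assumptions for illustration. For Spearman, `exact = FALSE` avoids the warning R gives when there are tied values.

```r
# Pearson (default method): assumes normally distributed data
pearson_result <- cor.test(mtcars$wt, mtcars$mpg)

# Spearman: rank based, for data that are not normally distributed
spearman_result <- cor.test(mtcars$wt, mtcars$mpg,
                            method = "spearman", exact = FALSE)

pearson_result$estimate   # r
spearman_result$estimate  # rho
```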

Understanding Correlation Magnitudes

Absolute Value of r: |r|    Strength
0 ≤ |r| < 0.10              Very Weak
0.10 ≤ |r| < 0.20           Weak
0.20 ≤ |r| < 0.30           Moderate
|r| ≥ 0.30                  Strong

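r = Correlation coefficient, a number between -1 and 1. |r| = The absolute value of r (between 0 and 1), which indicates the strength of the correlation regardless of its direction.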

≤ = Less than or equal to.

≥ = Greater than or equal to.
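The thresholds in the table above can be turned into a small helper. `correlation_strength()` is a hypothetical function written for this workbook, not part of any package.

```r
# Label the strength of a correlation from its absolute value,
# using the cut-offs listed in the table above
correlation_strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.10, 0.20, 0.30, Inf),
      labels = c("Very Weak", "Weak", "Moderate", "Strong"),
      right = FALSE)  # intervals are closed on the left: [0, 0.10), [0.10, 0.20), ...
}

as.character(correlation_strength(c(0.05, -0.15, 0.25, -0.87)))
```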

GGally & ggplot

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).
Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 2 rows containing missing values
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Additional Information

Theoretical Possibilities When Considering Correlations:

  • Was A caused by B?

  • Was B caused by A?

  • Were A and B caused by something else? (C)

  • Are A and B completely unrelated and their correlation is entirely coincidence?

Correlation and Causation are not the same thing.

e.g. Ice cream sales and sunburns both increase during periods of hot weather, yet neither causes the other. The hot weather is the common cause of both, and the measurable correlation is between the two outcomes.

Where the relationship is non-linear (the line is not straight), correlation tests should not be used, as they are inappropriate.

Week 9

Linear Models & Methods

Can only be used in cases where data is numerical vs numerical.

When should linear models be used?

  • When effects are expected

  • When effects are able to be explained

  • When predictive values are needed

Linear Method Components

response = factor + error

For use in R

lm(response~factor)

Response

y = a + bx + error

Types of Probability Distributions

Discrete

Possess a countable number of distinct possible outcomes.

  • Bernoulli Distribution

  • Binomial Distribution

  • Uniform Distribution

  • Poisson Distribution

Continuous

Possess infinitely many possible values within a continuous range.

  • Normal Distribution

  • Chi-squared Distribution

  • Exponential Distribution

  • Logistic Distribution

  • Student’s t Distribution

Linear Models

Mean

R Syntax: lm(y~1) / lm(formula = qsec ~ 1, data = mtcars)

Effect

R Syntax: lm(y~x) / lm(formula = qsec ~ hp, data = mtcars)

Error

R Syntax: lm(y~x) (R will calculate the error here)
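A runnable sketch of the three pieces above using the built-in `mtcars` data. The object names are hypothetical. The intercept-only model estimates the mean, the second model adds the effect of `hp`, and the residuals are the error left over for each car.

```r
# Mean: intercept-only model, the single coefficient is the mean of qsec
mean_model <- lm(qsec ~ 1, data = mtcars)
coef(mean_model)

# Effect: qsec modelled as a + b * hp + error
effect_model <- lm(qsec ~ hp, data = mtcars)
coef(effect_model)

# Error: R calculates the residuals (observed minus fitted) for each row
head(residuals(effect_model))
```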

Residuals

    Shapiro-Wilk normality test

data:  m1$residuals
W = 0.94395, p-value = 0.09698
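A sketch of how a normality check on model residuals can be run. Which model the author stored as `m1` is not shown, so `lm(qsec ~ hp)` below is an assumption and the result need not match the output above.

```r
# Assumed model; the author's m1 may have used a different formula
m1 <- lm(qsec ~ hp, data = mtcars)

# Shapiro-Wilk test on the residuals: a p-value above 0.05 gives
# no evidence against normally distributed residuals
residual_check <- shapiro.test(m1$residuals)
residual_check
```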

MT Car Task

What is the effect of vehicle weight on fuel efficiency?

Effect Graph

`geom_smooth()` using formula = 'y ~ x'

Fuel efficiency appears to decrease the heavier the vehicle is.

Summary


Call:
lm(formula = wt ~ mpg, data = mtcars)

Coefficients:
(Intercept)          mpg  
     6.0473      -0.1409  
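The call shown above has wt as the response and mpg as the predictor. To address the question as worded (the effect of weight on fuel efficiency), the formula would arguably be the other way round. A sketch under that assumption, with a hypothetical object name, including the `anova()` call for the same model:

```r
# Fuel efficiency (mpg) as the response, vehicle weight (wt) as the predictor
mpg_model <- lm(mpg ~ wt, data = mtcars)

coef(mpg_model)   # intercept and slope
anova(mpg_model)  # ANOVA table for the fitted model
```

A negative slope for `wt` agrees with the graph: heavier vehicles have lower fuel efficiency.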

Anova

I gave up - sorry.

Week 10

Logistic Regression & Models

Logistic Regression

Can be used for all types of categorical data.

Belongs to the Generalized Linear Model family (GLM)

Used to predict the category of individuals based on one or multiple predictor variables.

Used to model a binary outcome, that is, a variable that can have only two potential values (0 or 1/ yes or no/ diseased or non-diseased etc.)

Displayed as an S shaped curve (a ‘sharper’ S is a good indicator), which can be written as p = 1/[1 + exp(-y)].

x = predictor variable

y = b0 + b1*x

exp() = exponential

p = probability of event to occur

Multiple ‘family’ options in GLM - ‘binomial’ needs to be specified in order to fit logistic regression.

The most commonly used ‘families’ are:

  • Binomial

  • Gaussian

  • Gamma

  • Poisson
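A sketch of specifying the family in `glm()`. The diabetes data used later is not loaded here, so the built-in `mtcars` data stands in (an assumption): `am` is a 0/1 variable, which makes it a valid binary outcome.

```r
# family = binomial fits a logistic regression;
# leaving it out would fit an ordinary (gaussian) linear model instead
transmission_model <- glm(am ~ wt, data = mtcars, family = binomial)

summary(transmission_model)$coefficients
```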

Data Preparation

    pregnant glucose pressure triceps insulin mass pedigree age diabetes
28         1      97       66      15     140 23.2    0.487  22      neg
714        0     134       58      20     291 26.4    0.352  21      neg
569        4     154       72      29     126 31.3    0.338  37      neg

Simple Logistic Regression

Used to predict probability based on one predictor variable

To predict the probability of being diabetes positive based on plasma-glucose concentration.

               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -6.09552139 0.629787038 -9.678703 3.713993e-22
glucose      0.04242099 0.004760623  8.910805 5.066328e-19
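A sketch of turning the two coefficients above into predicted probabilities by hand. The glucose values (20 and 180) and the 0.5 cut-off are assumptions for illustration; `plogis(y)` is base R's logistic function, equal to 1/[1 + exp(-y)].

```r
# Coefficients copied from the table above
b0 <- -6.09552139  # intercept
b1 <-  0.04242099  # glucose

# Hypothetical glucose concentrations for two new individuals
glucose <- c(20, 180)

y <- b0 + b1 * glucose  # linear predictor
p <- plogis(y)          # probability of being diabetes positive

# Classify with an assumed cut-off of 0.5
predicted_class <- ifelse(p > 0.5, "pos", "neg")
predicted_class
```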
    1     2 
"neg" "pos" 

Multiple Logistic Regression

Used to predict probability based on multiple predictor variables.

               Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -8.85569796 0.96972059 -9.132216 6.711123e-20
glucose      0.03824376 0.00480179  7.964480 1.659204e-15
mass         0.08144044 0.02029394  4.013043 5.994104e-05
pregnant     0.14922221 0.04031645  3.701273 2.145202e-04

To include all predictor variables.

                 Estimate  Std. Error    z value     Pr(>|z|)
(Intercept) -1.004074e+01 1.217674335 -8.2458330 1.640136e-16
glucose      3.826952e-02 0.005767709  6.6351344 3.242069e-11
mass         7.053758e-02 0.027342138  2.5798122 9.885405e-03
pregnant     8.215942e-02 0.055425546  1.4823385 1.382502e-01
pressure    -1.420290e-03 0.011833396 -0.1200239 9.044642e-01
triceps      1.122139e-02 0.017083709  0.6568474 5.112790e-01
insulin     -8.253128e-04 0.001306439 -0.6317270 5.275653e-01
pedigree     1.140909e+00 0.427433723  2.6692059 7.603082e-03
age          3.395162e-02 0.018381721  1.8470318 6.474254e-02

Estimate = the intercept (b0) and the beta coefficient estimates associated with each predictor variable.

Std. Error = the standard error of the coefficient estimates - represents coefficient accuracy - the larger the error, the less confident we can be in the estimate.

Z Value = the z statistic - the coefficient estimate (column 2) divided by the standard error (column 3)

Pr (>|z|) = the p value corresponding to the z statistic - the smaller the value the more significant.

Interpretation

An important concept to understand, for interpreting the logistic beta coefficients, is the odds ratio. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0).
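For a fitted logistic model, the odds ratio for a predictor is obtained by exponentiating its beta coefficient. A sketch using the glucose estimate from the simple logistic regression earlier in this section; for a continuous predictor such as glucose, the result is the odds ratio per one-unit increase.

```r
# Glucose coefficient copied from the simple logistic regression output
b_glucose <- 0.04242099

# Odds ratio for a one-unit increase in glucose
odds_ratio <- exp(b_glucose)

# Odds ratio for a ten-unit increase
odds_ratio_10 <- exp(10 * b_glucose)

c(per_unit = odds_ratio, per_ten_units = odds_ratio_10)
```

An odds ratio above 1 means the odds of the event increase as the predictor increases; below 1 means they decrease.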