Week 1/2
Importing a data set.
Meet the penguins

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
Penguin Raw Data
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
The above plot displays the relationship between flipper and bill length in the three named penguin species.
Despite being a bird enthusiast Marissa is actually terrified of penguins as a result of being hissed at by one at a zoo when she was 5.
Week 3
Data wrangling - The Diamonds Data Set.
Mutate
To add new columns or to modify current variables.
Below, three new variables have been added: “JustOne”, “Values” and “Simple”.
# A tibble: 53,940 × 13
carat cut color clarity depth table price x y z JustOne Values
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 somet…
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 1 somet…
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 somet…
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 1 somet…
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 somet…
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 1 somet…
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 1 somet…
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 1 somet…
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 somet…
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
Below, multiple variables (columns) have been created from existing variables.
# A tibble: 53,940 × 15
carat cut color clarity depth table price x y z price200
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 126
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 126
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 127
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 134
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 135
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 136
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 136
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 137
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 137
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
# pricepercarat <dbl>, sqdep <dbl>
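A minimal sketch of how columns like these could be produced with mutate(). Only price200 can be checked against the output above (each value equals price minus 200); the formulas for the other four columns are assumptions inferred from their names, not the original code.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_new <- diamonds %>%
  mutate(
    price200       = price - 200,          # matches the output above (326 - 200 = 126)
    price20perc    = price * 0.20,         # assumed: 20% of the price
    price20percoff = price - price20perc,  # assumed: price after a 20% discount
    pricepercarat  = price / carat,        # assumed: price per carat
    sqdep          = depth^2               # assumed: depth squared
  )
```

A column created earlier in the same mutate() call (here price20perc) can be reused by a later one.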
Mutate - additional exercise - “Midwest”.
# A tibble: 437 × 34
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAN… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
8 568 CARROLL IL 0.027 16805 622. 16519 111 30
9 569 CASS IL 0.024 13437 560. 13384 16 8
10 570 CHAMPA… IL 0.058 173025 2983. 146506 16559 331
# ℹ 427 more rows
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, …
Summarize
To collapse rows and return a one-row summary.
# A tibble: 1 × 5
avg.price dbl.price random.add avg.carat stdev.price
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3933. 7866. 3 0.798 3989.
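A sketch of a summarize() call that would return a one-row table with these column names. The expressions are assumptions inferred from the names; in particular, how random.add was computed is unknown, so a constant expression is used as a stand-in.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_summary <- diamonds %>%
  summarize(
    avg.price   = mean(price),
    dbl.price   = 2 * mean(price),  # assumed: double the average price
    random.add  = 1 + 2,            # assumed: any constant expression yields one value
    avg.carat   = mean(carat),
    stdev.price = sd(price)
  )
```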
Group by and ungroup
To group the rows of existing data by the values of one or more variables, so that later operations are applied within each group.
summarize() and group_by() - To compare the averages of two groups separately:
# A tibble: 2 × 4
Sex m s n
<chr> <dbl> <dbl> <int>
1 female 0.437 0.268 25
2 male 0.487 0.268 25
The data has been grouped by sex so that the calculations are performed for males and females separately.
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
# A tibble: 27 × 5
Sex Age m s n
<chr> <dbl> <dbl> <dbl> <int>
1 female 20 0.046 NA 1
2 female 21 0.740 0.253 3
3 female 22 0.672 0.253 2
4 female 23 0.501 NA 1
5 female 25 0.579 0.167 3
6 female 26 0.41 NA 1
7 female 28 0.152 NA 1
8 female 29 0.426 0.339 2
9 female 30 0.170 0.238 2
10 female 33 0.173 NA 1
# ℹ 17 more rows
This code has been grouped by both sex and age, resulting in more rows.
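The two grouped summaries above could be produced along these lines. The `scores` tibble is hypothetical and only mirrors the column names in the output; it is not the original data.

```r
library(dplyr)

# Hypothetical data standing in for the original data set
scores <- tibble(
  ID    = 1:6,
  Sex   = c("male", "female", "male", "female", "male", "female"),
  Age   = c(26, 25, 26, 25, 31, 34),
  Score = c(0.10, 0.40, 0.30, 0.60, 0.50, 0.20)
)

# One row per sex
by_sex <- scores %>%
  group_by(Sex) %>%
  summarize(m = mean(Score), s = sd(Score), n = n())

# One row per sex/age combination; .groups = "drop" returns an ungrouped result
by_sex_age <- scores %>%
  group_by(Sex, Age) %>%
  summarize(m = mean(Score), s = sd(Score), n = n(), .groups = "drop")
```

Groups containing a single row have no standard deviation, which is why s is NA wherever n is 1.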
mutate() and group_by() - To add new columns based on the existing groups:
# A tibble: 50 × 5
ID Sex Age Score m
<int> <chr> <dbl> <dbl> <dbl>
1 1 male 26 0.01 0.487
2 2 female 25 0.418 0.437
3 3 male 39 0.014 0.487
4 4 female 37 0.09 0.437
5 5 male 31 0.061 0.487
6 6 female 34 0.328 0.437
7 7 male 34 0.656 0.487
8 8 female 30 0.002 0.437
9 9 male 26 0.639 0.487
10 10 female 33 0.173 0.437
# ℹ 40 more rows
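A sketch of the grouped mutate, again with a hypothetical `scores` tibble. Unlike summarize(), mutate() keeps every row and repeats the group mean on each row of that group.

```r
library(dplyr)

# Hypothetical data standing in for the original data set
scores <- tibble(
  ID    = 1:4,
  Sex   = c("male", "female", "male", "female"),
  Score = c(0.10, 0.40, 0.30, 0.60)
)

with_group_mean <- scores %>%
  group_by(Sex) %>%
  mutate(m = mean(Score)) %>%  # group mean, repeated on each row
  ungroup()                    # remove grouping so later steps use all rows
```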
Filter
To retain specific rows of data that meet specified requirements.
Below, only diamonds with a cut of Fair or Good and a price at or under $600 are displayed:
# A tibble: 505 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
6 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
7 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
8 0.23 Good F VS1 58.2 59 402 4.06 4.08 2.37
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
# ℹ 495 more rows
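A filter of this kind could be written as follows (a sketch; the original code is not shown). Conditions separated by commas must all be true for a row to be kept.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

fair_good_cheap <- diamonds %>%
  filter(cut %in% c("Fair", "Good"), price <= 600)
```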
Select
To select only the variables (columns) desired. The order in which the variable names are listed is the order in which they will be displayed.
To retain only cut and color:
# A tibble: 53,940 × 2
cut color
<ord> <ord>
1 Ideal E
2 Premium E
3 Good E
4 Premium I
5 Good J
6 Very Good J
7 Very Good I
8 Very Good H
9 Fair E
10 Very Good H
# ℹ 53,930 more rows
To retain all except cut and color:
# A tibble: 53,940 × 8
carat clarity depth table price x y z
<dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
To retain only the first five columns:
# A tibble: 53,940 × 5
carat cut color clarity depth
<dbl> <ord> <ord> <ord> <dbl>
1 0.23 Ideal E SI2 61.5
2 0.21 Premium E SI1 59.8
3 0.23 Good E VS1 56.9
4 0.29 Premium I VS2 62.4
5 0.31 Good J SI2 63.3
6 0.24 Very Good J VVS2 62.8
7 0.24 Very Good I VVS1 62.3
8 0.26 Very Good H SI1 61.9
9 0.22 Fair E VS2 65.1
10 0.23 Very Good H VS1 59.4
# ℹ 53,930 more rows
To retain all except the first five columns:
# A tibble: 53,940 × 5
table price x y z
<dbl> <int> <dbl> <dbl> <dbl>
1 55 326 3.95 3.98 2.43
2 61 326 3.89 3.84 2.31
3 65 327 4.05 4.07 2.31
4 58 334 4.2 4.23 2.63
5 58 335 4.34 4.35 2.75
6 57 336 3.94 3.96 2.48
7 57 336 3.95 3.98 2.47
8 55 337 4.07 4.11 2.53
9 61 337 3.87 3.78 2.49
10 61 338 4 4.05 2.39
# ℹ 53,930 more rows
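The four selections above could be sketched as follows; the object names are illustrative only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

only_cut_color     <- diamonds %>% select(cut, color)    # keep two columns
without_cut_color  <- diamonds %>% select(-cut, -color)  # drop two columns
first_five         <- diamonds %>% select(1:5)           # keep by position
without_first_five <- diamonds %>% select(-(1:5))        # drop by position
```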
Arrange
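To arrange rows by the values of a variable in either ascending or descending order; applicable to both numerical and non-numerical values.
To arrange cut in ascending order (cut is an ordered factor, so rows sort by level from Fair to Ideal rather than alphabetically):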
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
To arrange cut in descending order (from Ideal down to Fair):
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
# ℹ 53,930 more rows
To arrange price in numerical order:
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
To arrange price in descending numerical order:
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
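A sketch of the four orderings shown above. Because cut is an ordered factor, arrange() sorts it by factor level, which is why Ideal (the highest level) comes first under desc().

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

by_cut        <- diamonds %>% arrange(cut)          # lowest level (Fair) first
by_cut_desc   <- diamonds %>% arrange(desc(cut))    # highest level (Ideal) first
by_price      <- diamonds %>% arrange(price)        # cheapest first
by_price_desc <- diamonds %>% arrange(desc(price))  # most expensive first
```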
Week 4
Data Exploration - The Crickets Data Set.
Not exactly the biggest fan of bugs either, but certainly an improvement from penguins!
Basic Scatter Plot
For Two Quantitative Variables.
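A minimal sketch of a basic scatter plot. It assumes the crickets data from the modeldata package with the numeric columns temp and rate; the original plotting code is not shown.

```r
library(ggplot2)
library(modeldata)  # provides the crickets data set

p_scatter <- ggplot(crickets, aes(x = temp, y = rate)) +
  geom_point()

p_scatter
```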
Modifying Plot Properties
Additional Layers
Additional Plots
Histogram - For A Single Quantitative Variable With Multiple Frequencies.
Frequency Polygon.
Bar Chart Non-Colour Specified - For A Single Categorical Variable.
Bar Chart Colour Specified.
Box Plot - For One Quantitative And One Categorical Variable.
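The plot types listed above could be sketched as follows, assuming the crickets data (columns species, temp and rate). The bin count of 15 is an arbitrary choice for illustration.

```r
library(ggplot2)
library(modeldata)  # provides the crickets data set

# Histogram and frequency polygon: one quantitative variable
p_hist <- ggplot(crickets, aes(x = rate)) + geom_histogram(bins = 15)
p_freq <- ggplot(crickets, aes(x = rate)) + geom_freqpoly(bins = 15)

# Bar charts: one categorical variable, without and with a fill colour
p_bar     <- ggplot(crickets, aes(x = species)) + geom_bar()
p_bar_col <- ggplot(crickets, aes(x = species, fill = species)) + geom_bar()

# Box plot: one quantitative and one categorical variable
p_box <- ggplot(crickets, aes(x = species, y = rate)) + geom_boxplot()
```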
Faceting
To Create Individual Plots For Each Value Of A Categorical Variable Specified
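A sketch of faceting, assuming the crickets data: facet_wrap() draws one panel per value of the categorical variable named after the tilde.

```r
library(ggplot2)
library(modeldata)  # provides the crickets data set

p_facet <- ggplot(crickets, aes(x = rate)) +
  geom_histogram(bins = 15) +
  facet_wrap(~species)  # one histogram per species
```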
A Good Hypothesis
A hypothesis is a key element of the scientific research process, and can be described as a theoretical or hypothetical explanation for the observations, measurements and phenomena that occur during a research project or experiment. Often expressed as a mathematical model, a “good” hypothesis should be testable, objective, clear and relevant. This gives the research an objective to work towards while avoiding excessive description and keeping it relevant to the area of knowledge desired.
Week 5
Choosing The Correct Type Of Analysis
Graph 1 - Box Plot
Contains one continuous quantitative variable (Sepal Length) and one nominal categorical variable with three levels (Species). A mean test is required to test for differences across the three group means; a One-Way ANOVA would be applicable because it compares the mean sepal lengths across the three species.
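A sketch of the One-Way ANOVA described above. It assumes the graph was drawn from R's built-in iris data, which has the columns Sepal.Length and Species; the source does not name the data set.

```r
# One-way ANOVA: does mean sepal length differ between species?
fit_aov <- aov(Sepal.Length ~ Species, data = iris)
aov_table <- summary(fit_aov)
aov_table
```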
Graph 2 - Density Plot
I have not seen or used this type of graph before, therefore I do not feel confident in assigning it to a statistical test “family” as I’m unsure what it is actually displaying.
Contains one continuous quantitative variable (Petal Length)?
Graph 3 - Scatter Plot
Contains two continuous quantitative variables (Petal Length and Petal Width). A correlation test is required. The relationship appears linear and, assuming both variables are approximately normally distributed, the Pearson correlation test would be applicable to determine the association between the two variables.
Graph 4 - Grouped Bar Plot
Contains two ordinal categorical variables (big and small). A frequency test is required to test for an association between the two categorical variables. A chi-square test would be suitable here, as it tests for a relationship (and therefore an association) between the two categorical variables.
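A sketch of a chi-square test of association. The counts and the variable names `size` and `group` are hypothetical, since the data behind the bar plot is not shown.

```r
# Hypothetical 2 x 2 table of counts for two categorical variables
counts <- matrix(
  c(30, 20,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(size = c("big", "small"), group = c("A", "B"))
)

chi_res <- chisq.test(counts)
chi_res
```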
Week 7
Types of mean tests
Parametric: Suitable where data is normally distributed. These tests typically possess greater statistical power and are more likely to detect an effect.
Non-Parametric: Suitable where data is not normally distributed or the sample size is small. These tests are based on ranks or medians rather than means, and are therefore distribution free.
Comparing One Sample Mean to a Standard Known Mean
One Sample T-Test (Parametric)
Rows: 10 Columns: 2 (name <chr>, weight <dbl>)
T-Test Formula Abbreviations
H0 = Null Hypothesis
Ha = Alternative Hypothesis
m = Mean
μ = Theoretical Value/Mean
n = Sample Size
s = Standard Deviation
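Using the abbreviations above, the one-sample t statistic can be written as follows. This is the standard textbook form, given here as a reminder rather than taken from the original notes.

```latex
H_0: m = \mu \qquad H_a: m \neq \mu \ \text{(two-sided)}

t = \frac{m - \mu}{s / \sqrt{n}}, \qquad \text{df} = n - 1
```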
T-Test Code Format
t.test(x, mu = 0, alternative = "two.sided")
x = numeric vector
mu = theoretical mean (0 is default)
alternative = alternative hypothesis ("two.sided" is the default; can also be "greater" or "less")
One Sample t-test
data: Mice_Weights_$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
17.8172 20.6828
sample estimates:
mean of x
19.25
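A runnable sketch of a one-sample t-test against a theoretical mean of 25. The `mouse_weights` vector is hypothetical; the original Mice_Weights_ file is not available here.

```r
# Hypothetical weights standing in for Mice_Weights_$weight
mouse_weights <- c(18.2, 19.5, 21.0, 17.8, 20.3, 19.1, 22.4, 18.7, 20.9, 19.6)

t_res <- t.test(mouse_weights, mu = 25, alternative = "two.sided")
t_res
```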
One Sample Wilcoxon Test (Non-Parametric)
Wilcoxon Test Formula Abbreviations
H0 = Null Hypothesis
Ha = Alternative Hypothesis
m = Median
M0 = Theoretical Value/Median
Wilcoxon Test Code Format
wilcox.test(x, mu = 0, alternative = "two.sided")
x = numeric vector
mu = theoretical mean/median value (0 is default)
alternative = alternative hypothesis ("two.sided" is the default; can also be "greater" or "less")
Warning in wilcox.test.default(Mice_Weights_$weight, mu = 25): cannot compute
exact p-value with ties
Wilcoxon signed rank test with continuity correction
data: Mice_Weights_$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
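A sketch of the one-sample Wilcoxon test with a hypothetical vector. The values are chosen so that no two distances from 25 are equal, which avoids the ties warning shown above.

```r
# Hypothetical weights standing in for Mice_Weights_$weight
mouse_weights <- c(18.2, 19.5, 21.0, 17.8, 20.3, 19.1, 22.4, 18.7, 20.9, 19.6)

w_res <- wilcox.test(mouse_weights, mu = 25, alternative = "two.sided")
w_res
```

Every value here lies below 25, so the statistic V (the sum of ranks of the positive differences) is 0, as it is in the output above.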
Comparing the Means of Two Independent Groups
Unpaired Two Samples T-Test (Parametric)
Rows: 18 Columns: 2 (Group <chr>, Weight <dbl>)
Unpaired Two Samples T-Test Formula Abbreviations
H0 = Null Hypothesis
Ha = Alternative Hypothesis
MA = Group A Mean
MB = Group B Mean
NA = Group A Size
NB = Group B Size
S² = Pooled Variance Estimator of the Two Groups
Unpaired Two Samples T-Test Code Format
t.test(x, y, alternative = "two.sided", var.equal = FALSE)
x & y = numeric vectors
alternative = alternative hypothesis ("two.sided" is the default; can also be "greater" or "less")
var.equal = logical value indicating whether to treat the two variances as equal. If TRUE, the pooled variance is used to estimate the variance; otherwise the Welch approximation is used.
Two Sample t-test
data: Weight by Group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
4.029759 29.748019
sample estimates:
mean in group Man mean in group Woman
68.98889 52.10000
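The output above ("Weight by Group", df = 16) is consistent with the formula interface and var.equal = TRUE. A sketch with a hypothetical data frame that mirrors the column names:

```r
# Hypothetical data standing in for the original weights file
weights <- data.frame(
  Group  = rep(c("Man", "Woman"), each = 5),
  Weight = c(70.1, 65.3, 80.2, 72.5, 68.9,
             55.0, 60.2, 52.3, 58.7, 49.9)
)

# var.equal = TRUE uses the pooled variance (classic two-sample t-test)
two_res <- t.test(Weight ~ Group, data = weights, var.equal = TRUE)
two_res
```

With the pooled variance the degrees of freedom are the total sample size minus 2.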
Unpaired Two Samples Wilcoxon Test (Non-Parametric)
Unpaired Two Samples Wilcoxon Test Code Format
wilcox.test(x, y, alternative = "two.sided")
x & y = numeric vectors
alternative = alternative hypothesis ("two.sided" is the default; can also be "greater" or "less")
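A sketch of the unpaired two-sample Wilcoxon (rank-sum) test, using two hypothetical numeric vectors:

```r
# Hypothetical weights for two independent groups
men   <- c(70.1, 65.3, 80.2, 72.5, 68.9)
women <- c(55.0, 60.2, 52.3, 58.7, 49.9)

rank_res <- wilcox.test(men, women, alternative = "two.sided")
rank_res
```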
Comparing the Means of Paired Samples
Paired Samples T-Test (Parametric)
Rows: 20 Columns: 2 (Group <chr>, Weight <dbl>)
Paired Samples T-Test Formula Abbreviations
H0 = Null Hypothesis
Ha = Alternative Hypothesis
M = Mean of the Differences (d)
N = Sample Size
S = Standard Deviation of the Differences (d)
Paired Samples T-Test Code Format
t.test(x, y, paired = TRUE, alternative = "two.sided")
x & y = numeric vectors
paired = logical value specifying whether to compute a paired t-test
alternative = alternative hypothesis ("two.sided" is the default; can also be "greater" or "less")
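A sketch of a paired t-test. The `before` and `after` vectors are hypothetical; each position holds the two measurements for the same subject.

```r
# Hypothetical paired measurements (same subjects, two time points)
before <- c(200.1, 190.9, 192.7, 213.0, 241.4)
after  <- c(392.9, 393.2, 345.1, 393.0, 434.0)

paired_res <- t.test(before, after, paired = TRUE, alternative = "two.sided")
paired_res
```

The test works on the differences between pairs, so the degrees of freedom are the number of pairs minus 1.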
Paired Samples Wilcoxon Test (Non-Parametric)
Comparing the Means of More Than Two Groups
Analysis of Variance - ANOVA (Parametric)
One Way ANOVA
Two Way ANOVA
MANOVA Multivariate Analysis of Variance
Kruskal-Wallis Test (Non-Parametric)
Week 8
Correlation Tests
Correlation tests can only be used for numerical variables.
To be used when effects are either unexpected or inexplicable.
Pearson’s Correlation Test
For assumed normal distribution (i.e. parametric)
Example:
cor.test(dataset$variable1, dataset$variable2)
Spearman’s Correlation Test
For assumed non-normal distribution (i.e. non-parametric)
Example:
cor.test(dataset$variable1, dataset$variable2, method = "spearman")
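A runnable sketch of both tests, using the built-in iris data as a stand-in for `dataset` (an assumption; any two numeric columns would do).

```r
# Pearson (parametric) correlation test
pearson_res <- cor.test(iris$Petal.Length, iris$Petal.Width)

# Spearman (non-parametric, rank-based) correlation test;
# exact = FALSE avoids a warning when the data contain ties
spearman_res <- cor.test(iris$Petal.Length, iris$Petal.Width,
                         method = "spearman", exact = FALSE)

pearson_res$estimate   # Pearson's r
spearman_res$estimate  # Spearman's rho
```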
Understanding Correlation Magnitudes
| Absolute Value of r: |r| | Strength |
| 0 ≤ |r| < 0.10 | Very Weak |
| 0.10 ≤ |r| < 0.20 | Weak |
| 0.20 ≤ |r| < 0.30 | Moderate |
| |r| ≥ 0.30 | Strong |
r = Correlation coefficient, a number between -1 and 1.
|r| = Absolute value of r (between 0 and 1), which indicates the strength of the correlation.
≤ = Less than or equal to.
≥ = Greater than or equal to.
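The thresholds in the table can be applied with a small helper. `correlation_strength()` is a hypothetical name introduced for illustration; it is not part of any package.

```r
# Classify a correlation coefficient using the thresholds in the table above
correlation_strength <- function(r) {
  labels <- c("Very Weak", "Weak", "Moderate", "Strong")
  bins <- cut(abs(r),
              breaks = c(0, 0.10, 0.20, 0.30, Inf),
              labels = labels,
              right = FALSE,          # intervals are [lower, upper)
              include.lowest = TRUE)
  as.character(bins)
}

correlation_strength(c(0.05, -0.15, 0.25, -0.90))
```

The sign of r is ignored because strength depends only on |r|.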
GGally & ggplot
[Pairs plot not shown. R warned that 2 rows containing missing values were removed from each panel, and `stat_bin()` used its default of `bins = 30`.]
Week 9
Linear Models & Methods
Can only be used in cases where data is numerical vs numerical.
When should linear models be used?
When effects are expected
When effects are able to be explained
When predictive values are needed
Linear Method Components
response = factor + error
For use in R
lm(response~factor)
Response
y = a + bx + error
Types of Probability Distributions
Discrete
Possess a finite or countable number of different possible outcomes.
Bernoulli Distribution
Binomial Distribution
Uniform Distribution
Poisson Distribution
Continuous
Possess infinitely many possible values within a continuous range.
Normal Distribution
Chi-squared Distribution
Exponential Distribution
Logistic Distribution
Student’s t Distribution
Linear Models
Mean
R Syntax: lm(y~1) / lm(formula = qsec ~ 1, data = mtcars)
Effect
R Syntax: lm(y~x) / lm(formula = qsec ~ hp, data = mtcars)
Error
R Syntax: lm(y~x) (R will calculate the error here)
Residuals
Shapiro-Wilk normality test
data: m1$residuals
W = 0.94395, p-value = 0.09698
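A sketch of the mean-only and effect models from the syntax above, followed by a normality check on the residuals. Which model the original `m1` refers to is not shown, so treating it as qsec ~ hp is an assumption.

```r
# Mean-only model: the intercept is simply the mean of qsec
m0 <- lm(qsec ~ 1, data = mtcars)

# Effect model: qsec explained by hp (assumed to be the m1 tested above)
m1 <- lm(qsec ~ hp, data = mtcars)

# The residuals are the error term: observed minus fitted values
res <- residuals(m1)

# Shapiro-Wilk test: a p-value above 0.05 gives no evidence against normality
shapiro.test(res)
```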
MT Car Task
What is the effect of vehicle weight on fuel efficiency?
Effect Graph
Fuel efficiency appears to decrease the heavier the vehicle is.
Summary
Call:
lm(formula = wt ~ mpg, data = mtcars)
Coefficients:
(Intercept) mpg
6.0473 -0.1409
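The call above fits wt as the response and mpg as the predictor. Since the question asks how weight affects fuel efficiency, a model with mpg as the response and wt as the predictor may be closer to what was intended. The sketch below shows that alternative; it is an assumption about intent, not the original code.

```r
# Fuel efficiency (mpg) as the response, vehicle weight (wt) as the predictor
m_fuel <- lm(mpg ~ wt, data = mtcars)

coef(m_fuel)     # intercept and slope
summary(m_fuel)  # full model summary
```

A negative slope for wt agrees with the observation that fuel efficiency decreases as vehicles get heavier.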
Anova
I gave up - sorry.
Week 10
Logistic Regression & Models
Logistic Regression
Can be used for all types of categorical data.
Belongs to the Generalized Linear Model family (GLM)
Used to predict the category of individuals based on one or multiple predictor variables.
Used to model a binary outcome, that is, a variable that can have only two possible values (0 or 1, yes or no, diseased or non-diseased, etc.).
Displayed as an S-shaped curve (a ‘sharper’ S is a good indicator), which can be written as p = 1/[1 + exp(-y)].
x = predictor variable
y = b0 + b1*x
exp() = exponential
p = probability of event to occur
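A small sketch of the S-shaped curve using the definitions above. It uses the standard logistic form p = 1/[1 + exp(-y)], with the minus sign inside the exponential. `logistic()` is a hypothetical helper name, and base R's plogis() computes the same function.

```r
# Probability of the event for predictor x, intercept b0 and slope b1
logistic <- function(x, b0, b1) {
  y <- b0 + b1 * x    # linear predictor
  1 / (1 + exp(-y))   # maps y onto the interval (0, 1)
}

x_values <- seq(-10, 10, by = 0.5)
p_values <- logistic(x_values, b0 = 0, b1 = 1)

# Plot the S-shaped curve
plot(x_values, p_values, type = "l", xlab = "x", ylab = "p")
```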
GLM offers multiple ‘family’ options; ‘binomial’ needs to be specified in order to fit a logistic regression.
The most commonly used ‘families’ are:
Binomial
Gaussian
Gamma
Poisson
Data Preparation
pregnant glucose pressure triceps insulin mass pedigree age diabetes
28 1 97 66 15 140 23.2 0.487 22 neg
714 0 134 58 20 291 26.4 0.352 21 neg
569 4 154 72 29 126 31.3 0.338 37 neg
Simple Logistic Regression
Used to predict probability based on one predictor variable
To predict the probability of being diabetes positive based on plasma-glucose concentration.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.09552139 0.629787038 -9.678703 3.713993e-22
glucose 0.04242099 0.004760623 8.910805 5.066328e-19
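The coefficients reported above can be turned into predicted probabilities with p = 1/[1 + exp(-(b0 + b1*x))]. The glucose values below are hypothetical inputs chosen for illustration.

```r
# Coefficients copied from the output above
b0 <- -6.09552139  # intercept
b1 <- 0.04242099   # glucose

# Hypothetical plasma-glucose concentrations
glucose <- c(100, 150, 200)

# Predicted probability of being diabetes positive
p <- 1 / (1 + exp(-(b0 + b1 * glucose)))
round(p, 3)
```

For a glucose value of 150 the predicted probability is roughly 0.57, and it increases with glucose because b1 is positive.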
Multiple Logistic Regression
Used to predict probability based on multiple predictor variables.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.85569796 0.96972059 -9.132216 6.711123e-20
glucose 0.03824376 0.00480179 7.964480 1.659204e-15
mass 0.08144044 0.02029394 4.013043 5.994104e-05
pregnant 0.14922221 0.04031645 3.701273 2.145202e-04
To include all predictor variables.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.004074e+01 1.217674335 -8.2458330 1.640136e-16
glucose 3.826952e-02 0.005767709 6.6351344 3.242069e-11
mass 7.053758e-02 0.027342138 2.5798122 9.885405e-03
pregnant 8.215942e-02 0.055425546 1.4823385 1.382502e-01
pressure -1.420290e-03 0.011833396 -0.1200239 9.044642e-01
triceps 1.122139e-02 0.017083709 0.6568474 5.112790e-01
insulin -8.253128e-04 0.001306439 -0.6317270 5.275653e-01
pedigree 1.140909e+00 0.427433723 2.6692059 7.603082e-03
age 3.395162e-02 0.018381721 1.8470318 6.474254e-02
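Coefficient tables like those above come from glm() with family = binomial. The sketch below uses simulated data with hypothetical columns glucose, mass and diabetes, because the original data set is not included here; the simulated estimates will not match the tables.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical values)
sim <- data.frame(
  glucose = rnorm(n, mean = 120, sd = 30),
  mass    = rnorm(n, mean = 32, sd = 6)
)

# Simulated binary outcome drawn from a logistic model
linear_part <- -9 + 0.04 * sim$glucose + 0.08 * sim$mass
sim$diabetes <- factor(ifelse(runif(n) < plogis(linear_part), "pos", "neg"),
                       levels = c("neg", "pos"))

# Simple, multiple, and all-predictor logistic regressions
simple_fit <- glm(diabetes ~ glucose, data = sim, family = binomial)
multi_fit  <- glm(diabetes ~ glucose + mass, data = sim, family = binomial)
full_fit   <- glm(diabetes ~ ., data = sim, family = binomial)

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|)
summary(multi_fit)$coefficients
```

The formula `diabetes ~ .` uses every other column in the data frame as a predictor.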
Estimate = the intercept (b0) and the beta coefficient estimates associated with each predictor variable.
Std. Error = the standard error of the coefficient estimates. It represents the accuracy of the coefficients: the larger the standard error, the less confident we can be in the estimate.
Z Value = the z statistic, which is the coefficient estimate (column 2) divided by the standard error (column 3).
Pr (>|z|) = the p value corresponding to the z statistic. The smaller the p value, the more significant the estimate.
Interpretation
An important concept to understand, for interpreting the logistic beta coefficients, is the odds ratio. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0).
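For a logistic regression coefficient, the odds ratio is obtained by exponentiating the coefficient. A sketch using the glucose estimate from the simple logistic regression output earlier in this section:

```r
# Glucose coefficient copied from the simple logistic regression output
b1 <- 0.04242099

# Odds ratio for a one-unit increase in glucose
odds_ratio <- exp(b1)
odds_ratio
```

The result is about 1.04, meaning each additional unit of glucose multiplies the odds of being diabetes positive by roughly 1.04 under that model.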