We will work through this script in-class to discuss how to deal with missing data in PCA.
Missing data is VERY common in SNP data. SNP datasets are usually HUGE, so we will talk about the methods for dealing with NAs using a smaller dataset - penguins
When you are done with this assignment you will render it to RPubs and submit it.
na.omit() can’t always be used to deal
with NAs# packages
library(ggplot2)
library(ggpubr)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
# data
library(palmerpenguins)
data("penguins")
The penguins package comes with 2 datasets
Which function tells us what we have in memory in our workspace?
# TASK:
# determine what is loaded into your workspace
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] palmerpenguins_0.1.1 vegan_2.6-4 lattice_0.20-45
## [4] permute_0.9-7 ggpubr_0.4.0 ggplot2_3.3.6
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.32 bslib_0.4.0 purrr_0.3.5
## [5] splines_4.2.1 carData_3.0-5 colorspace_2.0-3 vctrs_0.4.1
## [9] generics_0.1.3 htmltools_0.5.3 mgcv_1.8-40 yaml_2.3.5
## [13] utf8_1.2.2 rlang_1.0.6 jquerylib_0.1.4 pillar_1.8.1
## [17] glue_1.6.2 withr_2.5.0 lifecycle_1.0.3 stringr_1.4.1
## [21] munsell_0.5.0 ggsignif_0.6.4 gtable_0.3.1 evaluate_0.16
## [25] knitr_1.40 fastmap_1.1.0 parallel_4.2.1 fansi_1.0.3
## [29] broom_1.0.1 scales_1.2.1 backports_1.4.1 cachem_1.0.6
## [33] jsonlite_1.8.0 abind_1.4-5 digest_0.6.29 stringi_1.7.8
## [37] rstatix_0.7.0 dplyr_1.0.10 grid_4.2.1 cli_3.4.1
## [41] tools_4.2.1 magrittr_2.0.3 sass_0.4.2 tibble_3.1.8
## [45] cluster_2.1.4 tidyr_1.2.1 car_3.1-0 pkgconfig_2.0.3
## [49] Matrix_1.5-1 MASS_7.3-58.1 rmarkdown_2.16 rstudioapi_0.14
## [53] R6_2.5.1 nlme_3.1-159 compiler_4.2.1
The penguins_raw data has variables we’ve used before
and some new ones:
# subset columns
penguins2 <- penguins_raw[,c("Species",
"Culmen Length (mm)",
"Culmen Depth (mm)",
"Flipper Length (mm)",
"Body Mass (g)",
"Delta 15 N (o/oo)",
"Delta 13 C (o/oo)")]
# rename
names(penguins2) <- c("Species",
"bill_length",
"bill_depth",
"flipper_length",
"mass",
"dietN",
"dietC"
)
# why am I doing this?
penguins2 <- data.frame(penguins2)
Complete eradication of NAs.
How many samples removed?
## TASK
# look original data - how many NAs?
sum(is.na(penguins2))
## [1] 35
# TASK
# remove NAs
penguins_noNA <- na.omit(penguins2)
# confirm removal of NAs
sum(is.na(penguins_noNA))
## [1] 0
# species labels
species <- penguins_noNA$Species
Data almost always scaled
# what does -1 do?
penguins_noNA_scale <- scale(penguins_noNA[,-1])
In true R fashion, there are 3 major ways to do PCA
Make PCA
penguin_pca <- vegan::rda(penguins_noNA_scale)
Plot PCA
biplot(penguin_pca,
display = c("sites",
"species"),
type = c("text",
"points"))
ordihull(penguin_pca,
group = factor(species))
# TASK: use function screeplot()
screeplot(penguin_pca)
Simple solution - swap mean for a feature whenever there is NA
# TASK
# run is.na() to see
# TRUE/FALSE , what is an NA
is.na(penguins2$dietC)
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [13] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# which() returns index value
# TASK: use which() to determine which value = 9.46180
which(penguins2$dietC == 9.46180)
## integer(0)
# save index
i.datapoint <- which(penguins2$dietC == 9.46180)
# look at value
# TASK: add the vector containing the index
penguins2$dietC[i.datapoint]
## numeric(0)
# Indices of missing values
# TASK: call which() to get the indices
which(is.na(penguins2$dietC) == TRUE)
## [1] 1 4 9 12 13 14 16 40 42 47 48 183 272
# save indices
i.NA <- which(is.na(penguins2$dietC) == TRUE)
# look at NAs
# TASK: add the vector containing the indices of the NAs
penguins2$dietC[i.NA]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA
# calculate mean
mean_C <- mean(penguins2$dietC, na.rm = T)
mean_C
## [1] -25.68629
# Assign mean to NAs
penguins2$dietC[i.NA] <- mean_C
# Look at results
## TASK - add index to look at data
penguins2$dietC[i.NA]
## [1] -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629
## [8] -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629
Look at data
# raw data
## TASK - make histogram with gghistogram
## add mean with add = "mean
gghistogram(data = penguins2,
x = "dietC",
add = "mean")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.
# with density smoother
## TASK - make histogram with gghistogram
## add mean with add = "mean
## add density smoother to show shape
gghistogram(data = penguins2,
x = "dietC",
add = "mean",
add_density = TRUE)
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.
Simulate data from estimated distribution of data
the bell-curve of the normal distribution is described by 2 parameters
# parameters
## parameter 1 - mean
mean_C <- mean(penguins2$dietC, na.rm = T)
## parameter 2 - SD
sd_C <- sd(penguins2$dietC, na.rm = T)
#simulate 1 random value w/ parameters
# TASK: simulate 1 value with rnorm()
rnorm(n = 1, # number to simulate
mean = mean_C, # parameter 1
sd = sd_C) # parameter 2
## [1] -26.4481
# simulate 2 random values w/ parameters
# TASK: simulate 2 values with rnorm()
rnorm(n = 2,
mean = mean_C,
sd = sd_C)
## [1] -25.55838 -24.50364
# simulate 3 random values
# TASK: simulate 3 value2 with rnorm()
rnorm(n = 3,
mean = mean_C,
sd = sd_C)
## [1] -24.20456 -25.88602 -25.95271
# how many NAs in C
# Indices of missing values
## TASK: use which() to find where the NAs are
which(is.na(penguins2$dietC) == TRUE)
## integer(0)
# save indices of NAs
i.C.NA <- which(is.na(penguins2$dietC) == TRUE)
# save a copy (for later)
i.C.NAcopy <- i.C.NA
# which values are NOT NA?
## (we'll need this later)
i.C.notNA <- which(is.na(penguins2$dietC) == FALSE) # == FALSE
# check data
# TASK: look where the NAs are using the index
penguins2$dietC[i.C.NA]
## numeric(0)
# how many total NAs?
# TASK: add the index
length(i.C.NA)
## [1] 0
# save number of NAs
N.C.NA <- length(i.C.NA)
# simulate N random values
C.random.param <- rnorm(n = N.C.NA, # number to simulate
mean = mean_C,
sd = sd_C)
# assign random data to missing values
penguins2$dietC[i.C.NA] <- C.random.param
# Look at results
# TASK: use index to look at simulated values
penguins2$dietC[i.C.NA]
## numeric(0)
# raw data
# TASK: make histogram
gghistogram(data = penguins2,
x = "dietC",
add = "mean")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.
# with density smoother
# TASK: make histogram with density smoother
ggpubr::gghistogram(data = penguins2,
x = "dietC",
add = "mean",
add_density = TRUE)
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.
What are the locations where there were NOT NAs?
# Note - we already filled in the NAs
which(is.na(penguins2$dietC) == TRUE)
## integer(0)
# original locations of NAs
i.C.NAcopy
## integer(0)
length(i.C.NAcopy)
## [1] 0
length(i.C.notNA)
## [1] 344
nrow(penguins2)
## [1] 344
length(i.C.NAcopy) + length(i.C.notNA)
## [1] 344
Select random values from the observed data
dietC_not_NA <- penguins2$dietC[i.C.notNA]
# are there any NAs?
(dietC_not_NA)
## [1] -25.68629 -24.69454 -25.33302 -25.68629 -25.32426 -25.29805 -25.21799
## [8] -24.89958 -25.68629 -25.09368 -25.21315 -25.68629 -25.68629 -25.68629
## [15] -25.22588 -25.68629 -25.06691 -25.13993 -25.23319 -24.77227 -25.09383
## [22] -25.06390 -25.03474 -25.22664 -25.29856 -24.36130 -25.36288 -25.49448
## [29] -25.19837 -25.11609 -25.11223 -25.01020 -25.06020 -25.14591 -25.23061
## [36] -25.03469 -25.12255 -25.49523 -25.04169 -25.68629 -24.42280 -25.68629
## [43] -25.03492 -24.52698 -25.01745 -24.10255 -25.68629 -25.68629 -25.07683
## [50] -25.18543 -26.12989 -26.55602 -26.17213 -26.31460 -26.54718 -26.27853
## [57] -26.07188 -25.98843 -25.88156 -25.89677 -26.40943 -26.37809 -26.21569
## [64] -26.60023 -26.11650 -26.09294 -25.95541 -25.81012 -26.07821 -26.06209
## [71] -26.63085 -25.23453 -26.55351 -26.45978 -26.15003 -26.38986 -26.02002
## [78] -26.44787 -26.51382 -25.81513 -26.53870 -26.23027 -26.24837 -26.46254
## [85] -26.38396 -26.09635 -26.22227 -26.08165 -26.12417 -26.11199 -26.05621
## [92] -25.83352 -26.69543 -26.42406 -26.30037 -26.57941 -26.50086 -26.01152
## [99] -26.15775 -26.03679 -26.28055 -26.38085 -26.22530 -25.69199 -25.86482
## [106] -25.80208 -26.48973 -25.70711 -25.27385 -25.45171 -26.57563 -26.32909
## [113] -26.06203 -26.41218 -26.78958 -26.42018 -25.77264 -26.36678 -23.90309
## [120] -26.09989 -26.19444 -26.42563 -25.61039 -25.42621 -25.95399 -25.60826
## [127] -25.89741 -26.02450 -25.73722 -26.01363 -25.57956 -26.13960 -25.57647
## [134] -26.49288 -25.77951 -26.06943 -25.97696 -26.32601 -26.07021 -25.88798
## [141] -25.54976 -26.01549 -26.61075 -25.79529 -25.85203 -26.03442 -25.79549
## [148] -25.83060 -25.79189 -26.03495 -26.07081 -26.06967 -25.51390 -25.39369
## [155] -25.46172 -25.40075 -25.54456 -25.32829 -25.46782 -25.42760 -25.32176
## [162] -25.38017 -25.39330 -25.50562 -25.41680 -25.48025 -25.62618 -25.52473
## [169] -25.52627 -25.00169 -25.37899 -25.39587 -25.37746 -25.46569 -25.39470
## [176] -25.38157 -25.39181 -25.42826 -25.69327 -25.48383 -25.50811 -25.19017
## [183] -25.68629 -25.46327 -25.53768 -25.68210 -26.63405 -26.86127 -26.70968
## [190] -26.79093 -26.68311 -26.84506 -26.59467 -26.84272 -26.72791 -26.76990
## [197] -26.60436 -26.77650 -26.61788 -26.57585 -26.89644 -26.67799 -26.86352
## [204] -26.79358 -27.01854 -26.61414 -26.83006 -26.75621 -26.84415 -26.74890
## [211] -26.86485 -26.79846 -26.79053 -26.84374 -26.60484 -26.84166 -26.61601
## [218] -26.69166 -26.81540 -26.74809 -26.68867 -26.74249 -26.72751 -26.70783
## [225] -26.66958 -26.59290 -26.95470 -26.71199 -26.78733 -26.76821 -26.65359
## [232] -26.65931 -26.27660 -26.27573 -26.24369 -26.39677 -26.20372 -26.30019
## [239] -26.18599 -26.34330 -26.23613 -26.11657 -26.08547 -25.89834 -26.22848
## [246] -25.69195 -26.16524 -26.11244 -26.06594 -26.04726 -26.13971 -26.36863
## [253] -26.18763 -26.35425 -26.21651 -25.79203 -26.23886 -26.05756 -26.44815
## [260] -26.33867 -26.38092 -26.22664 -26.18466 -26.21019 -26.11046 -25.76147
## [267] -25.96013 -26.18161 -26.18444 -25.88547 -26.20538 -25.68629 -26.13832
## [274] -26.04117 -26.11969 -26.15531 -24.30229 -24.23592 -24.75570 -24.62717
## [281] -24.61867 -24.49433 -24.55644 -24.84059 -24.29229 -24.36088 -24.59897
## [288] -24.38751 -24.80526 -24.38933 -24.90024 -24.72940 -24.54704 -24.57994
## [295] -24.64162 -24.64335 -24.68790 -24.26375 -25.01185 -24.97134 -24.59467
## [302] -24.47142 -23.89017 -24.34684 -24.52896 -24.69638 -24.54903 -24.59996
## [309] -24.45195 -24.17282 -24.31198 -25.14550 -24.65859 -24.68440 -24.73735
## [316] -24.90816 -24.66867 -24.66188 -24.86594 -24.66259 -24.16566 -24.35575
## [323] -24.45721 -24.45189 -24.43062 -24.41562 -24.48403 -24.36202 -24.80500
## [330] -24.59066 -24.60882 -24.56481 -24.78984 -24.59513 -24.40400 -24.65786
## [337] -23.78767 -24.48153 -24.31912 -24.53494 -24.40753 -24.70615 -24.68741
## [344] -24.25255
# how many values am I working with?
(i.C.NAcopy)
## integer(0)
N.NAs.to.replace <- length(i.C.NAcopy)
# TASK
# pick 1 random value from the observed data
# using sample()
sample(x = dietC_not_NA,
size = 1,
replace = T)
## [1] -26.20538
# TASK
# pick as many random values from the
# observed data as there are missing values
sample(x = dietC_not_NA,
size = N.NAs.to.replace,
replace = T)
## numeric(0)
# Make vector of random values
C.random.nonparm <- sample(x = dietC_not_NA,
size = N.NAs.to.replace,
replace = T)
# assign random data to missing values
penguins2$dietC[i.C.NAcopy] <- C.random.nonparm
# check if there are any NAs
(penguins2$dietC)
## [1] -25.68629 -24.69454 -25.33302 -25.68629 -25.32426 -25.29805 -25.21799
## [8] -24.89958 -25.68629 -25.09368 -25.21315 -25.68629 -25.68629 -25.68629
## [15] -25.22588 -25.68629 -25.06691 -25.13993 -25.23319 -24.77227 -25.09383
## [22] -25.06390 -25.03474 -25.22664 -25.29856 -24.36130 -25.36288 -25.49448
## [29] -25.19837 -25.11609 -25.11223 -25.01020 -25.06020 -25.14591 -25.23061
## [36] -25.03469 -25.12255 -25.49523 -25.04169 -25.68629 -24.42280 -25.68629
## [43] -25.03492 -24.52698 -25.01745 -24.10255 -25.68629 -25.68629 -25.07683
## [50] -25.18543 -26.12989 -26.55602 -26.17213 -26.31460 -26.54718 -26.27853
## [57] -26.07188 -25.98843 -25.88156 -25.89677 -26.40943 -26.37809 -26.21569
## [64] -26.60023 -26.11650 -26.09294 -25.95541 -25.81012 -26.07821 -26.06209
## [71] -26.63085 -25.23453 -26.55351 -26.45978 -26.15003 -26.38986 -26.02002
## [78] -26.44787 -26.51382 -25.81513 -26.53870 -26.23027 -26.24837 -26.46254
## [85] -26.38396 -26.09635 -26.22227 -26.08165 -26.12417 -26.11199 -26.05621
## [92] -25.83352 -26.69543 -26.42406 -26.30037 -26.57941 -26.50086 -26.01152
## [99] -26.15775 -26.03679 -26.28055 -26.38085 -26.22530 -25.69199 -25.86482
## [106] -25.80208 -26.48973 -25.70711 -25.27385 -25.45171 -26.57563 -26.32909
## [113] -26.06203 -26.41218 -26.78958 -26.42018 -25.77264 -26.36678 -23.90309
## [120] -26.09989 -26.19444 -26.42563 -25.61039 -25.42621 -25.95399 -25.60826
## [127] -25.89741 -26.02450 -25.73722 -26.01363 -25.57956 -26.13960 -25.57647
## [134] -26.49288 -25.77951 -26.06943 -25.97696 -26.32601 -26.07021 -25.88798
## [141] -25.54976 -26.01549 -26.61075 -25.79529 -25.85203 -26.03442 -25.79549
## [148] -25.83060 -25.79189 -26.03495 -26.07081 -26.06967 -25.51390 -25.39369
## [155] -25.46172 -25.40075 -25.54456 -25.32829 -25.46782 -25.42760 -25.32176
## [162] -25.38017 -25.39330 -25.50562 -25.41680 -25.48025 -25.62618 -25.52473
## [169] -25.52627 -25.00169 -25.37899 -25.39587 -25.37746 -25.46569 -25.39470
## [176] -25.38157 -25.39181 -25.42826 -25.69327 -25.48383 -25.50811 -25.19017
## [183] -25.68629 -25.46327 -25.53768 -25.68210 -26.63405 -26.86127 -26.70968
## [190] -26.79093 -26.68311 -26.84506 -26.59467 -26.84272 -26.72791 -26.76990
## [197] -26.60436 -26.77650 -26.61788 -26.57585 -26.89644 -26.67799 -26.86352
## [204] -26.79358 -27.01854 -26.61414 -26.83006 -26.75621 -26.84415 -26.74890
## [211] -26.86485 -26.79846 -26.79053 -26.84374 -26.60484 -26.84166 -26.61601
## [218] -26.69166 -26.81540 -26.74809 -26.68867 -26.74249 -26.72751 -26.70783
## [225] -26.66958 -26.59290 -26.95470 -26.71199 -26.78733 -26.76821 -26.65359
## [232] -26.65931 -26.27660 -26.27573 -26.24369 -26.39677 -26.20372 -26.30019
## [239] -26.18599 -26.34330 -26.23613 -26.11657 -26.08547 -25.89834 -26.22848
## [246] -25.69195 -26.16524 -26.11244 -26.06594 -26.04726 -26.13971 -26.36863
## [253] -26.18763 -26.35425 -26.21651 -25.79203 -26.23886 -26.05756 -26.44815
## [260] -26.33867 -26.38092 -26.22664 -26.18466 -26.21019 -26.11046 -25.76147
## [267] -25.96013 -26.18161 -26.18444 -25.88547 -26.20538 -25.68629 -26.13832
## [274] -26.04117 -26.11969 -26.15531 -24.30229 -24.23592 -24.75570 -24.62717
## [281] -24.61867 -24.49433 -24.55644 -24.84059 -24.29229 -24.36088 -24.59897
## [288] -24.38751 -24.80526 -24.38933 -24.90024 -24.72940 -24.54704 -24.57994
## [295] -24.64162 -24.64335 -24.68790 -24.26375 -25.01185 -24.97134 -24.59467
## [302] -24.47142 -23.89017 -24.34684 -24.52896 -24.69638 -24.54903 -24.59996
## [309] -24.45195 -24.17282 -24.31198 -25.14550 -24.65859 -24.68440 -24.73735
## [316] -24.90816 -24.66867 -24.66188 -24.86594 -24.66259 -24.16566 -24.35575
## [323] -24.45721 -24.45189 -24.43062 -24.41562 -24.48403 -24.36202 -24.80500
## [330] -24.59066 -24.60882 -24.56481 -24.78984 -24.59513 -24.40400 -24.65786
## [337] -23.78767 -24.48153 -24.31912 -24.53494 -24.40753 -24.70615 -24.68741
## [344] -24.25255
Some variables are correlated with each other. How can we use this information to do better missing data prediction?
# TASK - make pairs plot
pairs(penguins2[,-1],
panel = panel.smooth)
# TASK - calculate correlations
cor(penguins2$mass,
penguins2$flipper_length,
use = "complete.obs")
## [1] 0.8712018
cor(penguins2$mass,
penguins2$bill_length,
use = "complete.obs")
## [1] 0.5951098
Note - ggpubr does a fair bit of rounding
#TASK - make scatterplot
# note "+" sign - this is ggplot2 syntax (good exam question)
ggscatter(data = penguins2,
y = "flipper_length",
x = "mass",
add = "reg.line",
xlab = "Known values = mass",
ylab = "Want to predict = flipper length",
cor.coef = F) + stat_regline_equation(label.y.npc = "top")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing non-finite values (stat_regline_equation).
## Warning: Removed 2 rows containing missing values (geom_point).
# TASK
# make lm model with lm()
lm_out <- lm(flipper_length ~ mass,
data = penguins2)
# TASK use coef() to
# get coefficients/ parameters for equation
coef(lm_out)
## (Intercept) mass
## 136.72955927 0.01527592
Equation:
intercept = 136.72955927 slope = 0.01527592
mass = intercept + slope X mass mass = 136.72955927 + 0.01527592 X mass
If mass of penguin with unknown flipper length = 5000 grams, what is predicted flipper
136.72955927 + 0.01527592 * 5000
## [1] 213.1092
Make scatterplot and visualize prediction with horiz and vert lines
Note ggplot2 “+” syntax
# TASK - make scatterplot
ggscatter(data = penguins2,
y = "flipper_length",
x = "mass",
add = "reg.line",
xlab = "Known values = mass",
ylab = "Want to predict = flipper length",
cor.coef = F) +
stat_regline_equation(label.y.npc = "top") +
geom_vline(xintercept = 5000) +
geom_hline(yintercept = 136.72 + 0.0153 * 5000)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing non-finite values (stat_regline_equation).
## Warning: Removed 2 rows containing missing values (geom_point).