Introduction

We will work through this script in-class to discuss how to deal with missing data in PCA.

Missing data is VERY common in SNP data. SNP datasets are usually HUGE, so we will talk about the methods for dealing with NAs using a smaller dataset - penguins

In-class assignment

When you are done with this assignment you will render it to RPubs and submit it.

Learning objectives

Preliminaries

Packages and data

# packages
library(ggplot2)
library(ggpubr)
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
# data
library(palmerpenguins)
data("penguins")

The penguins package comes with 2 datasets

Which function tells us what we have in memory in our workspace?

# TASK:
# determine what is loaded into your workspace
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.0
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] palmerpenguins_0.1.1 vegan_2.6-4          lattice_0.20-45     
## [4] permute_0.9-7        ggpubr_0.4.0         ggplot2_3.3.6       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0 xfun_0.32        bslib_0.4.0      purrr_0.3.5     
##  [5] splines_4.2.1    carData_3.0-5    colorspace_2.0-3 vctrs_0.4.1     
##  [9] generics_0.1.3   htmltools_0.5.3  mgcv_1.8-40      yaml_2.3.5      
## [13] utf8_1.2.2       rlang_1.0.6      jquerylib_0.1.4  pillar_1.8.1    
## [17] glue_1.6.2       withr_2.5.0      lifecycle_1.0.3  stringr_1.4.1   
## [21] munsell_0.5.0    ggsignif_0.6.4   gtable_0.3.1     evaluate_0.16   
## [25] knitr_1.40       fastmap_1.1.0    parallel_4.2.1   fansi_1.0.3     
## [29] broom_1.0.1      scales_1.2.1     backports_1.4.1  cachem_1.0.6    
## [33] jsonlite_1.8.0   abind_1.4-5      digest_0.6.29    stringi_1.7.8   
## [37] rstatix_0.7.0    dplyr_1.0.10     grid_4.2.1       cli_3.4.1       
## [41] tools_4.2.1      magrittr_2.0.3   sass_0.4.2       tibble_3.1.8    
## [45] cluster_2.1.4    tidyr_1.2.1      car_3.1-0        pkgconfig_2.0.3 
## [49] Matrix_1.5-1     MASS_7.3-58.1    rmarkdown_2.16   rstudioapi_0.14 
## [53] R6_2.5.1         nlme_3.1-159     compiler_4.2.1

Data preparation

The penguins_raw data has variables we’ve used before and some new ones:

  • Familiar: Bill (“culmen”; length and depth), flipper, and size data
  • New: Data on N and C content of diet
# subset columns
penguins2 <- penguins_raw[,c("Species",
                         "Culmen Length (mm)", 
                             "Culmen Depth (mm)",
                         "Flipper Length (mm)",
                         "Body Mass (g)",
                         "Delta 15 N (o/oo)",
                         "Delta 13 C (o/oo)")]

# rename
names(penguins2) <- c("Species",
                      "bill_length",
                      "bill_depth",
                      "flipper_length",
                      "mass",
                      "dietN",
                      "dietC"
                      )

# why am I doing this?
penguins2 <- data.frame(penguins2)

Dealing with NAs

na.omit()

Complete eradication of NAs.

How many samples removed?

## TASK
# look original data - how many NAs?
sum(is.na(penguins2))
## [1] 35
# TASK
# remove NAs
penguins_noNA <- na.omit(penguins2)

# confirm removal of NAs
sum(is.na(penguins_noNA))
## [1] 0
# species labels
species <- penguins_noNA$Species

PCA on data w/ no NAs

Data almost always scaled

# what does -1 do?
penguins_noNA_scale <- scale(penguins_noNA[,-1])

In true R fashion, there are 3 major ways to do PCA

  1. prcomp()
  2. princomp()
  3. vegan::rda()

Make PCA

penguin_pca <- vegan::rda(penguins_noNA_scale)

Plot PCA

biplot(penguin_pca, 
       display = c("sites", 
                   "species"),
       type = c("text",
                "points"))
ordihull(penguin_pca,
         group = factor(species))

Scree plots help you know how many PCs need to be plotted

  • Scree plots help you if a 3rd or higher PC is useful
  • Scree plot use is more of an art than science
  • This class: If PC3 is very close to PC2 on scree plot, it may be worthwhile to look at PC3
  • If PC3 is much lower than PC2, then a scatterplot / biplot of just the PC1 and PC2 is sufficient to represent data (e.g. show any separate groups)
# TASK: use function screeplot()
screeplot(penguin_pca)

NAs and the curse of dimensionality

  • “curse of dimensionality” - suite of issues that arise with many features
  • “more data, more problems”
  • As you collect more features of data, it more likely that something will be missing for some samples
  • High dimensional data often has more features than samples (more columns than rows)
  • e.g 1000 genomes project: 1000 people = 1000 rows; but 640,000 SNPs!
  • Eventually, ever row of data (sample) will end up with at least 1 or more missing values when you have 1000s of features
  • na.omit() does row-wise deletion - have an NA, remove row
  • with high-dim data, you end up with no data!

Imputation

  • fill in data for missing values

Mean imputation

Simple solution - swap mean for a feature whenever there is NA

  • is.na() returns TRUE/FALSE if an NA is present
  • Several ways to work with this
  • Easiest to understand: which(is.na(df$column) == TRUE )

Find the NAs

# TASK
# run is.na() to see
# TRUE/FALSE , what is an NA
is.na(penguins2$dietC)
##   [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [13]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# which() returns index value 
# TASK: use which() to determine which value = 9.46180
which(penguins2$dietC == 9.46180)
## integer(0)
# save index
i.datapoint <- which(penguins2$dietC == 9.46180)

# look at value
# TASK: add the vector containing the index
penguins2$dietC[i.datapoint]
## numeric(0)
# Indices of missing values
# TASK: call which() to get the indices
which(is.na(penguins2$dietC) == TRUE)
##  [1]   1   4   9  12  13  14  16  40  42  47  48 183 272
# save indices
i.NA <- which(is.na(penguins2$dietC) == TRUE)

# look at NAs
# TASK: add the vector containing the indices of the NAs
penguins2$dietC[i.NA]
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA

Fill in NAs with mean

# calculate mean
mean_C <- mean(penguins2$dietC, na.rm = T)
mean_C
## [1] -25.68629
# Assign mean to NAs
penguins2$dietC[i.NA] <- mean_C

# Look at results
## TASK - add index to look at data
penguins2$dietC[i.NA] 
##  [1] -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629
##  [8] -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629

Random imputation - parametric imputation

  • Some more complex method integrate random variation into imputation
  • We’ll look at a very basic first step of these approaches

Look at data

# raw data
## TASK - make histogram with gghistogram
##        add mean with add = "mean
gghistogram(data = penguins2,
            x = "dietC",
            add = "mean")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

# with density smoother
## TASK - make histogram with gghistogram
##        add mean with add = "mean
##        add density smoother to show shape
gghistogram(data = penguins2,
            x = "dietC",
            add = "mean",
            add_density = TRUE)
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

Simulate data from estimated distribution of data

the bell-curve of the normal distribution is described by 2 parameters

  • mean
  • sd
# parameters
## parameter 1 - mean
mean_C <- mean(penguins2$dietC, na.rm = T)

## parameter 2 - SD
sd_C   <- sd(penguins2$dietC, na.rm = T)


#simulate 1 random value w/ parameters
# TASK: simulate 1 value with rnorm()
rnorm(n = 1,         # number to simulate
      mean = mean_C, # parameter 1
      sd = sd_C)     # parameter 2
## [1] -26.4481
# simulate 2 random values w/ parameters
# TASK: simulate 2 values with rnorm()
rnorm(n = 2, 
      mean = mean_C, 
      sd = sd_C)
## [1] -25.55838 -24.50364
# simulate 3 random values
# TASK: simulate 3 value2 with rnorm()
rnorm(n = 3, 
      mean = mean_C, 
      sd = sd_C)
## [1] -24.20456 -25.88602 -25.95271
# how many NAs in C
# Indices of missing values
## TASK: use which() to find where the NAs are
which(is.na(penguins2$dietC) == TRUE)
## integer(0)
# save indices of NAs
i.C.NA <- which(is.na(penguins2$dietC) == TRUE)

# save a copy (for later)
i.C.NAcopy <- i.C.NA

# which values are NOT NA?
## (we'll need this later)
i.C.notNA <- which(is.na(penguins2$dietC) == FALSE) # == FALSE

# check data
# TASK: look where the NAs are using the index
penguins2$dietC[i.C.NA]
## numeric(0)
# how many total NAs?
# TASK: add the index
length(i.C.NA)
## [1] 0
# save number of NAs
N.C.NA <- length(i.C.NA)

# simulate N random values
C.random.param <- rnorm(n = N.C.NA,  # number to simulate
                        mean = mean_C, 
                        sd = sd_C)

# assign random data to missing values
penguins2$dietC[i.C.NA] <- C.random.param

# Look at results
# TASK: use index to look at simulated values
penguins2$dietC[i.C.NA]
## numeric(0)

Random imputation - non-parametric imputation

  • Look at data - they aren’t very normal.
  • Normal data would have a bell-shaped “density” smoother
# raw data
# TASK: make histogram
gghistogram(data = penguins2,
            x = "dietC",
            add = "mean")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

# with density smoother
# TASK: make histogram with density smoother
ggpubr::gghistogram(data = penguins2,
                    x = "dietC",
                    add = "mean",
                    add_density = TRUE)
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

What are the locations where there were NOT NAs?

# Note - we already filled in the NAs
which(is.na(penguins2$dietC) == TRUE)
## integer(0)
# original locations of NAs
i.C.NAcopy
## integer(0)
length(i.C.NAcopy)
## [1] 0
length(i.C.notNA)
## [1] 344
nrow(penguins2)
## [1] 344
length(i.C.NAcopy) + length(i.C.notNA)
## [1] 344

Select random values from the observed data

dietC_not_NA <- penguins2$dietC[i.C.notNA]

# are there any NAs?
(dietC_not_NA)
##   [1] -25.68629 -24.69454 -25.33302 -25.68629 -25.32426 -25.29805 -25.21799
##   [8] -24.89958 -25.68629 -25.09368 -25.21315 -25.68629 -25.68629 -25.68629
##  [15] -25.22588 -25.68629 -25.06691 -25.13993 -25.23319 -24.77227 -25.09383
##  [22] -25.06390 -25.03474 -25.22664 -25.29856 -24.36130 -25.36288 -25.49448
##  [29] -25.19837 -25.11609 -25.11223 -25.01020 -25.06020 -25.14591 -25.23061
##  [36] -25.03469 -25.12255 -25.49523 -25.04169 -25.68629 -24.42280 -25.68629
##  [43] -25.03492 -24.52698 -25.01745 -24.10255 -25.68629 -25.68629 -25.07683
##  [50] -25.18543 -26.12989 -26.55602 -26.17213 -26.31460 -26.54718 -26.27853
##  [57] -26.07188 -25.98843 -25.88156 -25.89677 -26.40943 -26.37809 -26.21569
##  [64] -26.60023 -26.11650 -26.09294 -25.95541 -25.81012 -26.07821 -26.06209
##  [71] -26.63085 -25.23453 -26.55351 -26.45978 -26.15003 -26.38986 -26.02002
##  [78] -26.44787 -26.51382 -25.81513 -26.53870 -26.23027 -26.24837 -26.46254
##  [85] -26.38396 -26.09635 -26.22227 -26.08165 -26.12417 -26.11199 -26.05621
##  [92] -25.83352 -26.69543 -26.42406 -26.30037 -26.57941 -26.50086 -26.01152
##  [99] -26.15775 -26.03679 -26.28055 -26.38085 -26.22530 -25.69199 -25.86482
## [106] -25.80208 -26.48973 -25.70711 -25.27385 -25.45171 -26.57563 -26.32909
## [113] -26.06203 -26.41218 -26.78958 -26.42018 -25.77264 -26.36678 -23.90309
## [120] -26.09989 -26.19444 -26.42563 -25.61039 -25.42621 -25.95399 -25.60826
## [127] -25.89741 -26.02450 -25.73722 -26.01363 -25.57956 -26.13960 -25.57647
## [134] -26.49288 -25.77951 -26.06943 -25.97696 -26.32601 -26.07021 -25.88798
## [141] -25.54976 -26.01549 -26.61075 -25.79529 -25.85203 -26.03442 -25.79549
## [148] -25.83060 -25.79189 -26.03495 -26.07081 -26.06967 -25.51390 -25.39369
## [155] -25.46172 -25.40075 -25.54456 -25.32829 -25.46782 -25.42760 -25.32176
## [162] -25.38017 -25.39330 -25.50562 -25.41680 -25.48025 -25.62618 -25.52473
## [169] -25.52627 -25.00169 -25.37899 -25.39587 -25.37746 -25.46569 -25.39470
## [176] -25.38157 -25.39181 -25.42826 -25.69327 -25.48383 -25.50811 -25.19017
## [183] -25.68629 -25.46327 -25.53768 -25.68210 -26.63405 -26.86127 -26.70968
## [190] -26.79093 -26.68311 -26.84506 -26.59467 -26.84272 -26.72791 -26.76990
## [197] -26.60436 -26.77650 -26.61788 -26.57585 -26.89644 -26.67799 -26.86352
## [204] -26.79358 -27.01854 -26.61414 -26.83006 -26.75621 -26.84415 -26.74890
## [211] -26.86485 -26.79846 -26.79053 -26.84374 -26.60484 -26.84166 -26.61601
## [218] -26.69166 -26.81540 -26.74809 -26.68867 -26.74249 -26.72751 -26.70783
## [225] -26.66958 -26.59290 -26.95470 -26.71199 -26.78733 -26.76821 -26.65359
## [232] -26.65931 -26.27660 -26.27573 -26.24369 -26.39677 -26.20372 -26.30019
## [239] -26.18599 -26.34330 -26.23613 -26.11657 -26.08547 -25.89834 -26.22848
## [246] -25.69195 -26.16524 -26.11244 -26.06594 -26.04726 -26.13971 -26.36863
## [253] -26.18763 -26.35425 -26.21651 -25.79203 -26.23886 -26.05756 -26.44815
## [260] -26.33867 -26.38092 -26.22664 -26.18466 -26.21019 -26.11046 -25.76147
## [267] -25.96013 -26.18161 -26.18444 -25.88547 -26.20538 -25.68629 -26.13832
## [274] -26.04117 -26.11969 -26.15531 -24.30229 -24.23592 -24.75570 -24.62717
## [281] -24.61867 -24.49433 -24.55644 -24.84059 -24.29229 -24.36088 -24.59897
## [288] -24.38751 -24.80526 -24.38933 -24.90024 -24.72940 -24.54704 -24.57994
## [295] -24.64162 -24.64335 -24.68790 -24.26375 -25.01185 -24.97134 -24.59467
## [302] -24.47142 -23.89017 -24.34684 -24.52896 -24.69638 -24.54903 -24.59996
## [309] -24.45195 -24.17282 -24.31198 -25.14550 -24.65859 -24.68440 -24.73735
## [316] -24.90816 -24.66867 -24.66188 -24.86594 -24.66259 -24.16566 -24.35575
## [323] -24.45721 -24.45189 -24.43062 -24.41562 -24.48403 -24.36202 -24.80500
## [330] -24.59066 -24.60882 -24.56481 -24.78984 -24.59513 -24.40400 -24.65786
## [337] -23.78767 -24.48153 -24.31912 -24.53494 -24.40753 -24.70615 -24.68741
## [344] -24.25255
# how many values am I working with?
(i.C.NAcopy)
## integer(0)
N.NAs.to.replace <- length(i.C.NAcopy)

# TASK
# pick 1 random value from the observed data
# using sample()
sample(x = dietC_not_NA,
       size = 1,
       replace = T)
## [1] -26.20538
# TASK
# pick as many random values from the
# observed data as there are missing values
sample(x = dietC_not_NA,
       size = N.NAs.to.replace,
       replace = T)
## numeric(0)
# Make vector of random values
C.random.nonparm <- sample(x = dietC_not_NA,
       size = N.NAs.to.replace,
       replace = T)


# assign random data to missing values
penguins2$dietC[i.C.NAcopy] <- C.random.nonparm

# check if there are any NAs
(penguins2$dietC)
##   [1] -25.68629 -24.69454 -25.33302 -25.68629 -25.32426 -25.29805 -25.21799
##   [8] -24.89958 -25.68629 -25.09368 -25.21315 -25.68629 -25.68629 -25.68629
##  [15] -25.22588 -25.68629 -25.06691 -25.13993 -25.23319 -24.77227 -25.09383
##  [22] -25.06390 -25.03474 -25.22664 -25.29856 -24.36130 -25.36288 -25.49448
##  [29] -25.19837 -25.11609 -25.11223 -25.01020 -25.06020 -25.14591 -25.23061
##  [36] -25.03469 -25.12255 -25.49523 -25.04169 -25.68629 -24.42280 -25.68629
##  [43] -25.03492 -24.52698 -25.01745 -24.10255 -25.68629 -25.68629 -25.07683
##  [50] -25.18543 -26.12989 -26.55602 -26.17213 -26.31460 -26.54718 -26.27853
##  [57] -26.07188 -25.98843 -25.88156 -25.89677 -26.40943 -26.37809 -26.21569
##  [64] -26.60023 -26.11650 -26.09294 -25.95541 -25.81012 -26.07821 -26.06209
##  [71] -26.63085 -25.23453 -26.55351 -26.45978 -26.15003 -26.38986 -26.02002
##  [78] -26.44787 -26.51382 -25.81513 -26.53870 -26.23027 -26.24837 -26.46254
##  [85] -26.38396 -26.09635 -26.22227 -26.08165 -26.12417 -26.11199 -26.05621
##  [92] -25.83352 -26.69543 -26.42406 -26.30037 -26.57941 -26.50086 -26.01152
##  [99] -26.15775 -26.03679 -26.28055 -26.38085 -26.22530 -25.69199 -25.86482
## [106] -25.80208 -26.48973 -25.70711 -25.27385 -25.45171 -26.57563 -26.32909
## [113] -26.06203 -26.41218 -26.78958 -26.42018 -25.77264 -26.36678 -23.90309
## [120] -26.09989 -26.19444 -26.42563 -25.61039 -25.42621 -25.95399 -25.60826
## [127] -25.89741 -26.02450 -25.73722 -26.01363 -25.57956 -26.13960 -25.57647
## [134] -26.49288 -25.77951 -26.06943 -25.97696 -26.32601 -26.07021 -25.88798
## [141] -25.54976 -26.01549 -26.61075 -25.79529 -25.85203 -26.03442 -25.79549
## [148] -25.83060 -25.79189 -26.03495 -26.07081 -26.06967 -25.51390 -25.39369
## [155] -25.46172 -25.40075 -25.54456 -25.32829 -25.46782 -25.42760 -25.32176
## [162] -25.38017 -25.39330 -25.50562 -25.41680 -25.48025 -25.62618 -25.52473
## [169] -25.52627 -25.00169 -25.37899 -25.39587 -25.37746 -25.46569 -25.39470
## [176] -25.38157 -25.39181 -25.42826 -25.69327 -25.48383 -25.50811 -25.19017
## [183] -25.68629 -25.46327 -25.53768 -25.68210 -26.63405 -26.86127 -26.70968
## [190] -26.79093 -26.68311 -26.84506 -26.59467 -26.84272 -26.72791 -26.76990
## [197] -26.60436 -26.77650 -26.61788 -26.57585 -26.89644 -26.67799 -26.86352
## [204] -26.79358 -27.01854 -26.61414 -26.83006 -26.75621 -26.84415 -26.74890
## [211] -26.86485 -26.79846 -26.79053 -26.84374 -26.60484 -26.84166 -26.61601
## [218] -26.69166 -26.81540 -26.74809 -26.68867 -26.74249 -26.72751 -26.70783
## [225] -26.66958 -26.59290 -26.95470 -26.71199 -26.78733 -26.76821 -26.65359
## [232] -26.65931 -26.27660 -26.27573 -26.24369 -26.39677 -26.20372 -26.30019
## [239] -26.18599 -26.34330 -26.23613 -26.11657 -26.08547 -25.89834 -26.22848
## [246] -25.69195 -26.16524 -26.11244 -26.06594 -26.04726 -26.13971 -26.36863
## [253] -26.18763 -26.35425 -26.21651 -25.79203 -26.23886 -26.05756 -26.44815
## [260] -26.33867 -26.38092 -26.22664 -26.18466 -26.21019 -26.11046 -25.76147
## [267] -25.96013 -26.18161 -26.18444 -25.88547 -26.20538 -25.68629 -26.13832
## [274] -26.04117 -26.11969 -26.15531 -24.30229 -24.23592 -24.75570 -24.62717
## [281] -24.61867 -24.49433 -24.55644 -24.84059 -24.29229 -24.36088 -24.59897
## [288] -24.38751 -24.80526 -24.38933 -24.90024 -24.72940 -24.54704 -24.57994
## [295] -24.64162 -24.64335 -24.68790 -24.26375 -25.01185 -24.97134 -24.59467
## [302] -24.47142 -23.89017 -24.34684 -24.52896 -24.69638 -24.54903 -24.59996
## [309] -24.45195 -24.17282 -24.31198 -25.14550 -24.65859 -24.68440 -24.73735
## [316] -24.90816 -24.66867 -24.66188 -24.86594 -24.66259 -24.16566 -24.35575
## [323] -24.45721 -24.45189 -24.43062 -24.41562 -24.48403 -24.36202 -24.80500
## [330] -24.59066 -24.60882 -24.56481 -24.78984 -24.59513 -24.40400 -24.65786
## [337] -23.78767 -24.48153 -24.31912 -24.53494 -24.40753 -24.70615 -24.68741
## [344] -24.25255

Regression imputation

Some variables are correlated with each other. How can we use this information to do better missing data prediction?

Pairs plot with smoother

# TASK - make pairs plot
pairs(penguins2[,-1], 
      panel = panel.smooth)

Correlations

# TASK - calculate correlations
cor(penguins2$mass, 
    penguins2$flipper_length, 
    use = "complete.obs")
## [1] 0.8712018
cor(penguins2$mass, 
    penguins2$bill_length, 
    use = "complete.obs")
## [1] 0.5951098

Regression line

ggpubr::ggscatter

Note - ggpubr does a fair bit of rounding

#TASK - make scatterplot
#      note "+" sign - this is ggplot2 syntax (good exam question)
ggscatter(data = penguins2,
          y = "flipper_length",
          x = "mass", 
          add = "reg.line", 
          xlab = "Known values = mass",
          ylab = "Want to predict = flipper length",
          cor.coef = F) + stat_regline_equation(label.y.npc = "top")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing non-finite values (stat_regline_equation).
## Warning: Removed 2 rows containing missing values (geom_point).

lm()

# TASK
# make lm model with lm()
lm_out <- lm(flipper_length ~ mass,
             data = penguins2)

# TASK use coef() to
# get coefficients/ parameters for equation
coef(lm_out)
##  (Intercept)         mass 
## 136.72955927   0.01527592

Equation:

intercept = 136.72955927 slope = 0.01527592

mass = intercept + slope X mass mass = 136.72955927 + 0.01527592 X mass

If mass of penguin with unknown flipper length = 5000 grams, what is predicted flipper

136.72955927 + 0.01527592 * 5000
## [1] 213.1092

Make scatterplot and visualize prediction with horiz and vert lines

Note ggplot2 “+” syntax

# TASK - make scatterplot
ggscatter(data = penguins2,
          y = "flipper_length",
          x = "mass", 
          add = "reg.line", 
          xlab = "Known values = mass",
          ylab = "Want to predict = flipper length",
          cor.coef = F) +
            stat_regline_equation(label.y.npc = "top") +
            geom_vline(xintercept = 5000) +
            geom_hline(yintercept = 136.72 + 0.0153 * 5000)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing non-finite values (stat_regline_equation).
## Warning: Removed 2 rows containing missing values (geom_point).

Additional data exploration