Introduction

We will work through this script in-class to discuss how to deal with missing data in PCA.

Missing data is VERY common in SNP data. SNP datasets are usually HUGE, so we will talk about the methods for dealing with NAs using a smaller dataset - penguins

In-class assignment

When you are done with this assignment you will render it to RPubs and submit it.

Learning objectives

appreciate the curse of dimensionality with regard to missing data
learn the basics of how to use scree plots
understand why na.omit() can’t always be used to deal with NAs
be able to identify different approaches to dealing with NAs
know the difference between parametric and non-parametric sampling
review interpretation of biplots
review making scatterplots with ggpubr
review regression, regression coefficients, and prediction with regression

Preliminaries

Packages and data

# packages
library(ggplot2)
library(ggpubr)
library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

# data
library(palmerpenguins)
data("penguins")

The penguins package comes with 2 datasets

Which function tells us what we have in memory in our workspace?

# TASK:
# determine what is loaded into your workspace
sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.0
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] palmerpenguins_0.1.1 vegan_2.6-4          lattice_0.20-45     
## [4] permute_0.9-7        ggpubr_0.4.0         ggplot2_3.3.6       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0 xfun_0.32        bslib_0.4.0      purrr_0.3.5     
##  [5] splines_4.2.1    carData_3.0-5    colorspace_2.0-3 vctrs_0.4.1     
##  [9] generics_0.1.3   htmltools_0.5.3  mgcv_1.8-40      yaml_2.3.5      
## [13] utf8_1.2.2       rlang_1.0.6      jquerylib_0.1.4  pillar_1.8.1    
## [17] glue_1.6.2       withr_2.5.0      lifecycle_1.0.3  stringr_1.4.1   
## [21] munsell_0.5.0    ggsignif_0.6.4   gtable_0.3.1     evaluate_0.16   
## [25] knitr_1.40       fastmap_1.1.0    parallel_4.2.1   fansi_1.0.3     
## [29] broom_1.0.1      scales_1.2.1     backports_1.4.1  cachem_1.0.6    
## [33] jsonlite_1.8.0   abind_1.4-5      digest_0.6.29    stringi_1.7.8   
## [37] rstatix_0.7.0    dplyr_1.0.10     grid_4.2.1       cli_3.4.1       
## [41] tools_4.2.1      magrittr_2.0.3   sass_0.4.2       tibble_3.1.8    
## [45] cluster_2.1.4    tidyr_1.2.1      car_3.1-0        pkgconfig_2.0.3 
## [49] Matrix_1.5-1     MASS_7.3-58.1    rmarkdown_2.16   rstudioapi_0.14 
## [53] R6_2.5.1         nlme_3.1-159     compiler_4.2.1

Data preparation

The penguins_raw data has variables we’ve used before and some new ones:

Familiar: Bill (“culmen”; length and depth), flipper, and size data
New: Data on N and C content of diet

# subset columns
penguins2 <- penguins_raw[,c("Species",
                         "Culmen Length (mm)", 
                             "Culmen Depth (mm)",
                         "Flipper Length (mm)",
                         "Body Mass (g)",
                         "Delta 15 N (o/oo)",
                         "Delta 13 C (o/oo)")]

# rename
names(penguins2) <- c("Species",
                      "bill_length",
                      "bill_depth",
                      "flipper_length",
                      "mass",
                      "dietN",
                      "dietC"
                      )

# why am I doing this?
penguins2 <- data.frame(penguins2)

Dealing with NAs

na.omit()

Complete eradication of NAs.

How many samples removed?

## TASK
# look original data - how many NAs?
sum(is.na(penguins2))

## [1] 35

# TASK
# remove NAs
penguins_noNA <- na.omit(penguins2)

# confirm removal of NAs
sum(is.na(penguins_noNA))

## [1] 0

# species labels
species <- penguins_noNA$Species

PCA on data w/ no NAs

Data almost always scaled

# what does -1 do?
penguins_noNA_scale <- scale(penguins_noNA[,-1])

In true R fashion, there are 3 major ways to do PCA

prcomp()
princomp()
vegan::rda()

Make PCA

penguin_pca <- vegan::rda(penguins_noNA_scale)

Plot PCA

biplot(penguin_pca, 
       display = c("sites", 
                   "species"),
       type = c("text",
                "points"))
ordihull(penguin_pca,
         group = factor(species))

Scree plots help you know how many PCs need to be plotted

Scree plots help you if a 3rd or higher PC is useful
Scree plot use is more of an art than science
This class: If PC3 is very close to PC2 on scree plot, it may be worthwhile to look at PC3
If PC3 is much lower than PC2, then a scatterplot / biplot of just the PC1 and PC2 is sufficient to represent data (e.g. show any separate groups)

# TASK: use function screeplot()
screeplot(penguin_pca)

NAs and the curse of dimensionality

“curse of dimensionality” - suite of issues that arise with many features
“more data, more problems”
As you collect more features of data, it more likely that something will be missing for some samples
High dimensional data often has more features than samples (more columns than rows)
e.g 1000 genomes project: 1000 people = 1000 rows; but 640,000 SNPs!
Eventually, ever row of data (sample) will end up with at least 1 or more missing values when you have 1000s of features
na.omit() does row-wise deletion - have an NA, remove row
with high-dim data, you end up with no data!

Imputation

fill in data for missing values

Mean imputation

Simple solution - swap mean for a feature whenever there is NA

is.na() returns TRUE/FALSE if an NA is present
Several ways to work with this
Easiest to understand: which(is.na(df$column) == TRUE )

Find the NAs

# TASK
# run is.na() to see
# TRUE/FALSE , what is an NA
is.na(penguins2$dietC)

##   [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [13]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# which() returns index value 
# TASK: use which() to determine which value = 9.46180
which(penguins2$dietC == 9.46180)

## integer(0)

# save index
i.datapoint <- which(penguins2$dietC == 9.46180)

# look at value
# TASK: add the vector containing the index
penguins2$dietC[i.datapoint]

## numeric(0)

# Indices of missing values
# TASK: call which() to get the indices
which(is.na(penguins2$dietC) == TRUE)

##  [1]   1   4   9  12  13  14  16  40  42  47  48 183 272

# save indices
i.NA <- which(is.na(penguins2$dietC) == TRUE)

# look at NAs
# TASK: add the vector containing the indices of the NAs
penguins2$dietC[i.NA]

##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA

Fill in NAs with mean

# calculate mean
mean_C <- mean(penguins2$dietC, na.rm = T)
mean_C

## [1] -25.68629

# Assign mean to NAs
penguins2$dietC[i.NA] <- mean_C

# Look at results
## TASK - add index to look at data
penguins2$dietC[i.NA]

##  [1] -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629
##  [8] -25.68629 -25.68629 -25.68629 -25.68629 -25.68629 -25.68629

Random imputation - parametric imputation

Some more complex method integrate random variation into imputation
We’ll look at a very basic first step of these approaches

Look at data

# raw data
## TASK - make histogram with gghistogram
##        add mean with add = "mean
gghistogram(data = penguins2,
            x = "dietC",
            add = "mean")

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.

## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

# with density smoother
## TASK - make histogram with gghistogram
##        add mean with add = "mean
##        add density smoother to show shape
gghistogram(data = penguins2,
            x = "dietC",
            add = "mean",
            add_density = TRUE)

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.

## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

Simulate data from estimated distribution of data

the bell-curve of the normal distribution is described by 2 parameters

mean
sd

# parameters
## parameter 1 - mean
mean_C <- mean(penguins2$dietC, na.rm = T)

## parameter 2 - SD
sd_C   <- sd(penguins2$dietC, na.rm = T)


#simulate 1 random value w/ parameters
# TASK: simulate 1 value with rnorm()
rnorm(n = 1,         # number to simulate
      mean = mean_C, # parameter 1
      sd = sd_C)     # parameter 2

## [1] -26.4481

# simulate 2 random values w/ parameters
# TASK: simulate 2 values with rnorm()
rnorm(n = 2, 
      mean = mean_C, 
      sd = sd_C)

## [1] -25.55838 -24.50364

# simulate 3 random values
# TASK: simulate 3 value2 with rnorm()
rnorm(n = 3, 
      mean = mean_C, 
      sd = sd_C)

## [1] -24.20456 -25.88602 -25.95271

# how many NAs in C
# Indices of missing values
## TASK: use which() to find where the NAs are
which(is.na(penguins2$dietC) == TRUE)

## integer(0)

# save indices of NAs
i.C.NA <- which(is.na(penguins2$dietC) == TRUE)

# save a copy (for later)
i.C.NAcopy <- i.C.NA

# which values are NOT NA?
## (we'll need this later)
i.C.notNA <- which(is.na(penguins2$dietC) == FALSE) # == FALSE

# check data
# TASK: look where the NAs are using the index
penguins2$dietC[i.C.NA]

## numeric(0)

# how many total NAs?
# TASK: add the index
length(i.C.NA)

## [1] 0

# save number of NAs
N.C.NA <- length(i.C.NA)

# simulate N random values
C.random.param <- rnorm(n = N.C.NA,  # number to simulate
                        mean = mean_C, 
                        sd = sd_C)

# assign random data to missing values
penguins2$dietC[i.C.NA] <- C.random.param

# Look at results
# TASK: use index to look at simulated values
penguins2$dietC[i.C.NA]

## numeric(0)

Random imputation - non-parametric imputation

Look at data - they aren’t very normal.
Normal data would have a bell-shaped “density” smoother

# raw data
# TASK: make histogram
gghistogram(data = penguins2,
            x = "dietC",
            add = "mean")

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.

## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

# with density smoother
# TASK: make histogram with density smoother
ggpubr::gghistogram(data = penguins2,
                    x = "dietC",
                    add = "mean",
                    add_density = TRUE)

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.

## Warning: geom_vline(): Ignoring `data` because `xintercept` was provided.

What are the locations where there were NOT NAs?

# Note - we already filled in the NAs
which(is.na(penguins2$dietC) == TRUE)

## integer(0)

# original locations of NAs
i.C.NAcopy

## integer(0)

length(i.C.NAcopy)

## [1] 0

length(i.C.notNA)

## [1] 344

nrow(penguins2)

## [1] 344

length(i.C.NAcopy) + length(i.C.notNA)

## [1] 344

Select random values from the observed data

dietC_not_NA <- penguins2$dietC[i.C.notNA]

# are there any NAs?
(dietC_not_NA)

##   [1] -25.68629 -24.69454 -25.33302 -25.68629 -25.32426 -25.29805 -25.21799
##   [8] -24.89958 -25.68629 -25.09368 -25.21315 -25.68629 -25.68629 -25.68629
##  [15] -25.22588 -25.68629 -25.06691 -25.13993 -25.23319 -24.77227 -25.09383
##  [22] -25.06390 -25.03474 -25.22664 -25.29856 -24.36130 -25.36288 -25.49448
##  [29] -25.19837 -25.11609 -25.11223 -25.01020 -25.06020 -25.14591 -25.23061
##  [36] -25.03469 -25.12255 -25.49523 -25.04169 -25.68629 -24.42280 -25.68629
##  [43] -25.03492 -24.52698 -25.01745 -24.10255 -25.68629 -25.68629 -25.07683
##  [50] -25.18543 -26.12989 -26.55602 -26.17213 -26.31460 -26.54718 -26.27853
##  [57] -26.07188 -25.98843 -25.88156 -25.89677 -26.40943 -26.37809 -26.21569
##  [64] -26.60023 -26.11650 -26.09294 -25.95541 -25.81012 -26.07821 -26.06209
##  [71] -26.63085 -25.23453 -26.55351 -26.45978 -26.15003 -26.38986 -26.02002
##  [78] -26.44787 -26.51382 -25.81513 -26.53870 -26.23027 -26.24837 -26.46254
##  [85] -26.38396 -26.09635 -26.22227 -26.08165 -26.12417 -26.11199 -26.05621
##  [92] -25.83352 -26.69543 -26.42406 -26.30037 -26.57941 -26.50086 -26.01152
##  [99] -26.15775 -26.03679 -26.28055 -26.38085 -26.22530 -25.69199 -25.86482
## [106] -25.80208 -26.48973 -25.70711 -25.27385 -25.45171 -26.57563 -26.32909
## [113] -26.06203 -26.41218 -26.78958 -26.42018 -25.77264 -26.36678 -23.90309
## [120] -26.09989 -26.19444 -26.42563 -25.61039 -25.42621 -25.95399 -25.60826
## [127] -25.89741 -26.02450 -25.73722 -26.01363 -25.57956 -26.13960 -25.57647
## [134] -26.49288 -25.77951 -26.06943 -25.97696 -26.32601 -26.07021 -25.88798
## [141] -25.54976 -26.01549 -26.61075 -25.79529 -25.85203 -26.03442 -25.79549
## [148] -25.83060 -25.79189 -26.03495 -26.07081 -26.06967 -25.51390 -25.39369
## [155] -25.46172 -25.40075 -25.54456 -25.32829 -25.46782 -25.42760 -25.32176
## [162] -25.38017 -25.39330 -25.50562 -25.41680 -25.48025 -25.62618 -25.52473
## [169] -25.52627 -25.00169 -25.37899 -25.39587 -25.37746 -25.46569 -25.39470
## [176] -25.38157 -25.39181 -25.42826 -25.69327 -25.48383 -25.50811 -25.19017
## [183] -25.68629 -25.46327 -25.53768 -25.68210 -26.63405 -26.86127 -26.70968
## [190] -26.79093 -26.68311 -26.84506 -26.59467 -26.84272 -26.72791 -26.76990
## [197] -26.60436 -26.77650 -26.61788 -26.57585 -26.89644 -26.67799 -26.86352
## [204] -26.79358 -27.01854 -26.61414 -26.83006 -26.75621 -26.84415 -26.74890
## [211] -26.86485 -26.79846 -26.79053 -26.84374 -26.60484 -26.84166 -26.61601
## [218] -26.69166 -26.81540 -26.74809 -26.68867 -26.74249 -26.72751 -26.70783
## [225] -26.66958 -26.59290 -26.95470 -26.71199 -26.78733 -26.76821 -26.65359
## [232] -26.65931 -26.27660 -26.27573 -26.24369 -26.39677 -26.20372 -26.30019
## [239] -26.18599 -26.34330 -26.23613 -26.11657 -26.08547 -25.89834 -26.22848
## [246] -25.69195 -26.16524 -26.11244 -26.06594 -26.04726 -26.13971 -26.36863
## [253] -26.18763 -26.35425 -26.21651 -25.79203 -26.23886 -26.05756 -26.44815
## [260] -26.33867 -26.38092 -26.22664 -26.18466 -26.21019 -26.11046 -25.76147
## [267] -25.96013 -26.18161 -26.18444 -25.88547 -26.20538 -25.68629 -26.13832
## [274] -26.04117 -26.11969 -26.15531 -24.30229 -24.23592 -24.75570 -24.62717
## [281] -24.61867 -24.49433 -24.55644 -24.84059 -24.29229 -24.36088 -24.59897
## [288] -24.38751 -24.80526 -24.38933 -24.90024 -24.72940 -24.54704 -24.57994
## [295] -24.64162 -24.64335 -24.68790 -24.26375 -25.01185 -24.97134 -24.59467
## [302] -24.47142 -23.89017 -24.34684 -24.52896 -24.69638 -24.54903 -24.59996
## [309] -24.45195 -24.17282 -24.31198 -25.14550 -24.65859 -24.68440 -24.73735
## [316] -24.90816 -24.66867 -24.66188 -24.86594 -24.66259 -24.16566 -24.35575
## [323] -24.45721 -24.45189 -24.43062 -24.41562 -24.48403 -24.36202 -24.80500
## [330] -24.59066 -24.60882 -24.56481 -24.78984 -24.59513 -24.40400 -24.65786
## [337] -23.78767 -24.48153 -24.31912 -24.53494 -24.40753 -24.70615 -24.68741
## [344] -24.25255

# how many values am I working with?
(i.C.NAcopy)

## integer(0)

N.NAs.to.replace <- length(i.C.NAcopy)

# TASK
# pick 1 random value from the observed data
# using sample()
sample(x = dietC_not_NA,
       size = 1,
       replace = T)

## [1] -26.20538

# TASK
# pick as many random values from the
# observed data as there are missing values
sample(x = dietC_not_NA,
       size = N.NAs.to.replace,
       replace = T)

## numeric(0)

# Make vector of random values
C.random.nonparm <- sample(x = dietC_not_NA,
       size = N.NAs.to.replace,
       replace = T)


# assign random data to missing values
penguins2$dietC[i.C.NAcopy] <- C.random.nonparm

# check if there are any NAs
(penguins2$dietC)

##   [1] -25.68629 -24.69454 -25.33302 -25.68629 -25.32426 -25.29805 -25.21799
##   [8] -24.89958 -25.68629 -25.09368 -25.21315 -25.68629 -25.68629 -25.68629
##  [15] -25.22588 -25.68629 -25.06691 -25.13993 -25.23319 -24.77227 -25.09383
##  [22] -25.06390 -25.03474 -25.22664 -25.29856 -24.36130 -25.36288 -25.49448
##  [29] -25.19837 -25.11609 -25.11223 -25.01020 -25.06020 -25.14591 -25.23061
##  [36] -25.03469 -25.12255 -25.49523 -25.04169 -25.68629 -24.42280 -25.68629
##  [43] -25.03492 -24.52698 -25.01745 -24.10255 -25.68629 -25.68629 -25.07683
##  [50] -25.18543 -26.12989 -26.55602 -26.17213 -26.31460 -26.54718 -26.27853
##  [57] -26.07188 -25.98843 -25.88156 -25.89677 -26.40943 -26.37809 -26.21569
##  [64] -26.60023 -26.11650 -26.09294 -25.95541 -25.81012 -26.07821 -26.06209
##  [71] -26.63085 -25.23453 -26.55351 -26.45978 -26.15003 -26.38986 -26.02002
##  [78] -26.44787 -26.51382 -25.81513 -26.53870 -26.23027 -26.24837 -26.46254
##  [85] -26.38396 -26.09635 -26.22227 -26.08165 -26.12417 -26.11199 -26.05621
##  [92] -25.83352 -26.69543 -26.42406 -26.30037 -26.57941 -26.50086 -26.01152
##  [99] -26.15775 -26.03679 -26.28055 -26.38085 -26.22530 -25.69199 -25.86482
## [106] -25.80208 -26.48973 -25.70711 -25.27385 -25.45171 -26.57563 -26.32909
## [113] -26.06203 -26.41218 -26.78958 -26.42018 -25.77264 -26.36678 -23.90309
## [120] -26.09989 -26.19444 -26.42563 -25.61039 -25.42621 -25.95399 -25.60826
## [127] -25.89741 -26.02450 -25.73722 -26.01363 -25.57956 -26.13960 -25.57647
## [134] -26.49288 -25.77951 -26.06943 -25.97696 -26.32601 -26.07021 -25.88798
## [141] -25.54976 -26.01549 -26.61075 -25.79529 -25.85203 -26.03442 -25.79549
## [148] -25.83060 -25.79189 -26.03495 -26.07081 -26.06967 -25.51390 -25.39369
## [155] -25.46172 -25.40075 -25.54456 -25.32829 -25.46782 -25.42760 -25.32176
## [162] -25.38017 -25.39330 -25.50562 -25.41680 -25.48025 -25.62618 -25.52473
## [169] -25.52627 -25.00169 -25.37899 -25.39587 -25.37746 -25.46569 -25.39470
## [176] -25.38157 -25.39181 -25.42826 -25.69327 -25.48383 -25.50811 -25.19017
## [183] -25.68629 -25.46327 -25.53768 -25.68210 -26.63405 -26.86127 -26.70968
## [190] -26.79093 -26.68311 -26.84506 -26.59467 -26.84272 -26.72791 -26.76990
## [197] -26.60436 -26.77650 -26.61788 -26.57585 -26.89644 -26.67799 -26.86352
## [204] -26.79358 -27.01854 -26.61414 -26.83006 -26.75621 -26.84415 -26.74890
## [211] -26.86485 -26.79846 -26.79053 -26.84374 -26.60484 -26.84166 -26.61601
## [218] -26.69166 -26.81540 -26.74809 -26.68867 -26.74249 -26.72751 -26.70783
## [225] -26.66958 -26.59290 -26.95470 -26.71199 -26.78733 -26.76821 -26.65359
## [232] -26.65931 -26.27660 -26.27573 -26.24369 -26.39677 -26.20372 -26.30019
## [239] -26.18599 -26.34330 -26.23613 -26.11657 -26.08547 -25.89834 -26.22848
## [246] -25.69195 -26.16524 -26.11244 -26.06594 -26.04726 -26.13971 -26.36863
## [253] -26.18763 -26.35425 -26.21651 -25.79203 -26.23886 -26.05756 -26.44815
## [260] -26.33867 -26.38092 -26.22664 -26.18466 -26.21019 -26.11046 -25.76147
## [267] -25.96013 -26.18161 -26.18444 -25.88547 -26.20538 -25.68629 -26.13832
## [274] -26.04117 -26.11969 -26.15531 -24.30229 -24.23592 -24.75570 -24.62717
## [281] -24.61867 -24.49433 -24.55644 -24.84059 -24.29229 -24.36088 -24.59897
## [288] -24.38751 -24.80526 -24.38933 -24.90024 -24.72940 -24.54704 -24.57994
## [295] -24.64162 -24.64335 -24.68790 -24.26375 -25.01185 -24.97134 -24.59467
## [302] -24.47142 -23.89017 -24.34684 -24.52896 -24.69638 -24.54903 -24.59996
## [309] -24.45195 -24.17282 -24.31198 -25.14550 -24.65859 -24.68440 -24.73735
## [316] -24.90816 -24.66867 -24.66188 -24.86594 -24.66259 -24.16566 -24.35575
## [323] -24.45721 -24.45189 -24.43062 -24.41562 -24.48403 -24.36202 -24.80500
## [330] -24.59066 -24.60882 -24.56481 -24.78984 -24.59513 -24.40400 -24.65786
## [337] -23.78767 -24.48153 -24.31912 -24.53494 -24.40753 -24.70615 -24.68741
## [344] -24.25255

Regression imputation

Some variables are correlated with each other. How can we use this information to do better missing data prediction?

Pairs plot with smoother

# TASK - make pairs plot
pairs(penguins2[,-1], 
      panel = panel.smooth)

Correlations

# TASK - calculate correlations
cor(penguins2$mass, 
    penguins2$flipper_length, 
    use = "complete.obs")

## [1] 0.8712018

cor(penguins2$mass, 
    penguins2$bill_length, 
    use = "complete.obs")

## [1] 0.5951098

Regression line

ggpubr::ggscatter

Note - ggpubr does a fair bit of rounding

#TASK - make scatterplot
#      note "+" sign - this is ggplot2 syntax (good exam question)
ggscatter(data = penguins2,
          y = "flipper_length",
          x = "mass", 
          add = "reg.line", 
          xlab = "Known values = mass",
          ylab = "Want to predict = flipper length",
          cor.coef = F) + stat_regline_equation(label.y.npc = "top")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing non-finite values (stat_regline_equation).

## Warning: Removed 2 rows containing missing values (geom_point).

lm()

# TASK
# make lm model with lm()
lm_out <- lm(flipper_length ~ mass,
             data = penguins2)

# TASK use coef() to
# get coefficients/ parameters for equation
coef(lm_out)

##  (Intercept)         mass 
## 136.72955927   0.01527592

Equation:

intercept = 136.72955927 slope = 0.01527592

mass = intercept + slope X mass mass = 136.72955927 + 0.01527592 X mass

If mass of penguin with unknown flipper length = 5000 grams, what is predicted flipper

136.72955927 + 0.01527592 * 5000

## [1] 213.1092

Make scatterplot and visualize prediction with horiz and vert lines

Note ggplot2 “+” syntax

# TASK - make scatterplot
ggscatter(data = penguins2,
          y = "flipper_length",
          x = "mass", 
          add = "reg.line", 
          xlab = "Known values = mass",
          ylab = "Want to predict = flipper length",
          cor.coef = F) +
            stat_regline_equation(label.y.npc = "top") +
            geom_vline(xintercept = 5000) +
            geom_hline(yintercept = 136.72 + 0.0153 * 5000)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing non-finite values (stat_regline_equation).

## Warning: Removed 2 rows containing missing values (geom_point).

In-class Assignment: PCA and missing data

Nathan Brouwer

2022-11-07