Initial Examination of the data

There are a few ways to initial examine the data in R:

# Determining what type of object R is storing the data as using is()
is(bones)
## [1] "tbl_df"     "tbl"        "data.frame" "list"       "oldClass"  
## [6] "vector"
# Looking at the different types of data in the data frame using str() for structure
str(bones)
## tibble [1,538 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Sex     : chr [1:1538] "0" "1" "0" "0" ...
##  $ age     : chr [1:1538] "40-49" "50+" "30-39" "20-29" ...
##  $ Location: chr [1:1538] "Alaska, United States" "Alaska, United States" "Alaska, United States" "Alaska, United States" ...
##  $ Humerus : num [1:1538] 308 NA 311 289 295 ...
##  $ Radius  : num [1:1538] 229 NA NA 224 208 ...
##  $ Femur   : num [1:1538] 443 386 415 398 395 ...
##  $ Tibia   : num [1:1538] NA 283 330 312 306 298 351 361 338 311 ...
##  $ Iliac   : num [1:1538] 274 266 NA 240 286 ...
# Looking at the first 6 rows of the data
head(bones)
## # A tibble: 6 × 8
##   Sex   age   Location              Humerus Radius Femur Tibia Iliac
##   <chr> <chr> <chr>                   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 0     40-49 Alaska, United States    308.   229    443    NA  274 
## 2 1     50+   Alaska, United States     NA     NA    386   283  266 
## 3 0     30-39 Alaska, United States    311     NA    415   330   NA 
## 4 0     20-29 Alaska, United States    289    224.   398   312  240.
## 5 1     50+   Alaska, United States    295    208    395   306  286.
## 6 1     30-39 Alaska, United States    270.   196.    NA   298   NA
# skim() in the skimr package will give a quick overview (summary) of the data:
skim(bones)
Data summary
Name bones
Number of rows 1538
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sex 0 1 1 2 0 4 0
age 7 1 3 5 0 4 0
Location 0 1 4 39 0 45 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Humerus 162 0.89 304 23 230 287 304 319 376 ▁▅▇▅▁
Radius 217 0.86 233 19 179 219 233 247 290 ▁▆▇▅▁
Femur 117 0.92 427 32 345 404 428 448 531 ▂▆▇▃▁
Tibia 135 0.91 353 28 276 333 353 372 446 ▁▆▇▃▁
Iliac 69 0.96 262 18 184 251 263 274 324 ▁▂▇▆▁
# skim() also let's use see how much of the data is missing!

Cleaning the data

Since there are missing values for each of the variables (denoted by NA), we can remove them using filter() and complete.cases() in the tidyverse packages:

# Since some values are missing, denoted by NA, we want to only keep the rows with the complete data
# We can use the complete.data() function to do that. Complete.cases() will return a vector of True and False, 
# and we want to keep the rows that have all information for recorded (indicated by TRUE)
bones1 <- 
  bones %>% 
  
  filter(complete.cases(.)) |> 
  
  filter(Sex %in% c("0","1")) |> 
  
  dplyr::select(where(is.numeric))

skim(bones1)
Data summary
Name bones1
Number of rows 1049
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Humerus 0 1 304 23 230 288 304 320 376 ▁▅▇▅▁
Radius 0 1 233 19 179 220 233 247 290 ▁▆▇▅▁
Femur 0 1 428 31 345 406 428 448 531 ▁▆▇▃▁
Tibia 0 1 353 28 276 333 354 372 446 ▁▆▇▃▁
Iliac 0 1 263 18 184 251 263 274 324 ▁▂▇▆▁

Getting the number of variables (number of columns) and sample size (number of rows)

# Variables in columns
p <- ncol(bones1) 

# Each row is a different skeleton
n <- nrow(bones1) 

Density plots and scatter plots

In addition to the summary statistics, it’s recommended to examine the shape each variable takes and the relationship between any 2 pairs of variables.

This can be accomplished with:

# The best way to display the density plots for all the variables is to change it into the "long" format using pivot_longer():
# ?pivot_longer
bones1 |> 
  
  pivot_longer(cols = Humerus:Iliac,     # Choosing the columns in the data to "lengthen"
               names_to = "Bone",        # Naming the column that indicates the bone type
               values_to = "Length") |>  # Naming the column that contains the bone measurement
  
  ggplot(mapping = aes(x = Length)) +
  
  geom_density(mapping = aes(fill = Bone), 
               show.legend = F) +
  
  # A separate density plot for each bone
  facet_wrap(facets = ~ Bone, 
             scales = "free") + 
  
  # Removing the extra space between the density plot and the x-axis
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

All the density plots are roughly bellshaped, indicating the measurements are about Normal for all 5 bones

Up next, create a scatterplot matrix using ggpairs in the GGally package

ggpairs(bones1) 

From the scatterplot matrix, we can see:

  1. All the pairs appear to be linearly associated (correlated), but some are stronger than others.
  2. The Illiac appears to be the weakest related to the other variables (0.40 < r < 0.50)
  3. The arm and leg bones have strong correlations (0.87 < r < 0.91)
  4. There appears to be 1 clear outlier with a short Humerus but typical lengths for the other 4 bones

ggpairs also displays the individual density plots, so we don’t have to make them ourselves!

Calculating the Summary Statistics

While we can calculate the 5 number summary and mean for each variable with summary, we often need the mean vector \(\bar{y}\), covariance matrix \(S\), and correlation matrix \(R\).

# 5 number summary per variable
summary(bones1)
##     Humerus        Radius        Femur         Tibia         Iliac    
##  Min.   :230   Min.   :179   Min.   :345   Min.   :276   Min.   :184  
##  1st Qu.:288   1st Qu.:220   1st Qu.:406   1st Qu.:333   1st Qu.:251  
##  Median :304   Median :233   Median :428   Median :354   Median :263  
##  Mean   :304   Mean   :233   Mean   :428   Mean   :353   Mean   :263  
##  3rd Qu.:320   3rd Qu.:247   3rd Qu.:448   3rd Qu.:372   3rd Qu.:274  
##  Max.   :376   Max.   :290   Max.   :531   Max.   :446   Max.   :324
# Mean vector of bone measurements can be calculated using colMeans()
meanvec <- colMeans(bones1)
meanvec
## Humerus  Radius   Femur   Tibia   Iliac 
##     304     233     428     353     263
# Covariance matrix can be calculated using cov() or var()
(covmat <- var(bones1) )
##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330
cov(bones1) 
##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330
# Correlation matrix using cor():
(R_mat <- cor(bones1))
##         Humerus Radius Femur Tibia Iliac
## Humerus    1.00   0.87  0.91  0.85  0.49
## Radius     0.87   1.00  0.86  0.90  0.41
## Femur      0.91   0.86  1.00  0.91  0.48
## Tibia      0.85   0.90  0.91  1.00  0.42
## Iliac      0.49   0.41  0.48  0.42  1.00

You can also use var, cov, and cor to calculate specific pairings of covariances and correlations instead of the entire matrix by using 2 arguments instead of 1

# If you give cov() 2 variables, just calculates the covariances between them
cov(x = bones1$Humerus, 
    y = bones1$Radius)
## [1] 374
# Remember, cov(Y,Y) = var(Y)!
cov(x = bones1$Humerus,
    y = bones1$Humerus)
## [1] 526
var(bones1$Humerus)
## [1] 526

Also works if you give cov() 2 matrices.

# For instance, if we want the covariance of the arm bones with the leg bones, but not the correlation of the pairs in the same limb:
cov(x = bones1 |> dplyr::select(Humerus, Radius),
    y = bones1 |> dplyr::select(Femur, Tibia))
##         Femur Tibia
## Humerus   650   549
## Radius    501   475

Spectral Decomposition: Eigenvalue and Eigenvectors

Many methods in the course will use the Eigenvalue and Eigenvectors of \(S\) or \(R\). We can use the eigen function to get both from the respective matrix!

eigen_bones <- 
  bones1 |> 
  cov() |> 
  eigen()
  
# We can use $values to just have the eigenvalues or $vectors to return just the eigenvectors:
eigen_bones$values
## [1] 2515  254   94   75   34
eigen_bones$vectors
##       [,1]   [,2]  [,3]    [,4]   [,5]
## [1,] -0.43 -0.039  0.57  0.4924  0.494
## [2,] -0.35  0.132 -0.22  0.6218 -0.653
## [3,] -0.61  0.046  0.38 -0.5959 -0.366
## [4,] -0.54  0.234 -0.67 -0.1253  0.442
## [5,] -0.19 -0.961 -0.20  0.0063 -0.019

Rebuilding the covariance matrix from the eigen values and vectors

Next, we’ll demonstrate we can return the original matrix and the square root matrix after \(S\) has been spectrally decomposed:

# Creating the Lambda matrix using the eigenvalues and diag() function:
(Lambda <- diag(eigen_bones$values))
##      [,1] [,2] [,3] [,4] [,5]
## [1,] 2515    0    0    0    0
## [2,]    0  254    0    0    0
## [3,]    0    0   94    0    0
## [4,]    0    0    0   75    0
## [5,]    0    0    0    0   34
# Saving just the vectors in a matrix called bones_E
bones_E <- eigen_bones$vectors


# Showing spectral decomposition gives us back our original matrix
bones_E %*% Lambda %*% t(bones_E)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  526  374  650  549  206
## [2,]  374  354  501  475  138
## [3,]  650  501  970  800  272
## [4,]  549  475  800  793  213
## [5,]  206  138  272  213  330
cov(bones1)
##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330

Next, we’ll show that the covariance matrix can be created using the square root matrix:

sqrt_bones <- bones_E %*% sqrt(Lambda) %*% t(bones_E)

# Showing that the sqrt matrix times itself gives back the original covariance matrix
sqrt_bones %*% sqrt_bones
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  526  374  650  549  206
## [2,]  374  354  501  475  138
## [3,]  650  501  970  800  272
## [4,]  549  475  800  793  213
## [5,]  206  138  272  213  330
cov(bones1)
##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330