Stat 223: Chapter 3 - Bones Example

Initial Examination of the data

There are a few ways to initial examine the data in R:

is(data) will tell you what type of object R is storing the data as
str(data) will report the type of variable for each column
head(data) will show the first 6 rows
skim (in the skimr package) shows the descriptive statistics for the data

# Determining what type of object R is storing the data as using is()
is(bones)

## [1] "tbl_df"     "tbl"        "data.frame" "list"       "oldClass"  
## [6] "vector"

# Looking at the different types of data in the data frame using str() for structure
str(bones)

## tibble [1,538 × 8] (S3: tbl_df/tbl/data.frame)
##  $ Sex     : chr [1:1538] "0" "1" "0" "0" ...
##  $ age     : chr [1:1538] "40-49" "50+" "30-39" "20-29" ...
##  $ Location: chr [1:1538] "Alaska, United States" "Alaska, United States" "Alaska, United States" "Alaska, United States" ...
##  $ Humerus : num [1:1538] 308 NA 311 289 295 ...
##  $ Radius  : num [1:1538] 229 NA NA 224 208 ...
##  $ Femur   : num [1:1538] 443 386 415 398 395 ...
##  $ Tibia   : num [1:1538] NA 283 330 312 306 298 351 361 338 311 ...
##  $ Iliac   : num [1:1538] 274 266 NA 240 286 ...

# Looking at the first 6 rows of the data
head(bones)

## # A tibble: 6 × 8
##   Sex   age   Location              Humerus Radius Femur Tibia Iliac
##   <chr> <chr> <chr>                   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 0     40-49 Alaska, United States    308.   229    443    NA  274 
## 2 1     50+   Alaska, United States     NA     NA    386   283  266 
## 3 0     30-39 Alaska, United States    311     NA    415   330   NA 
## 4 0     20-29 Alaska, United States    289    224.   398   312  240.
## 5 1     50+   Alaska, United States    295    208    395   306  286.
## 6 1     30-39 Alaska, United States    270.   196.    NA   298   NA

# skim() in the skimr package will give a quick overview (summary) of the data:
skim(bones)

Data summary
Name	bones
Number of rows	1538
Number of columns	8
_______________________
Column type frequency:
character	3
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Sex	0	1	1	2	4
age	7	1	3	5	4
Location	0	1	4	39	45

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Humerus	162	0.89	304	23	230	287	304	319	376	▁▅▇▅▁
Radius	217	0.86	233	19	179	219	233	247	290	▁▆▇▅▁
Femur	117	0.92	427	32	345	404	428	448	531	▂▆▇▃▁
Tibia	135	0.91	353	28	276	333	353	372	446	▁▆▇▃▁
Iliac	69	0.96	262	18	184	251	263	274	324	▁▂▇▆▁

# skim() also let's use see how much of the data is missing!

Cleaning the data

Since there are missing values for each of the variables (denoted by NA), we can remove them using filter() and complete.cases() in the tidyverse packages:

# Since some values are missing, denoted by NA, we want to only keep the rows with the complete data
# We can use the complete.data() function to do that. Complete.cases() will return a vector of True and False, 
# and we want to keep the rows that have all information for recorded (indicated by TRUE)
bones1 <- 
  bones %>% 
  
  filter(complete.cases(.)) |> 
  
  filter(Sex %in% c("0","1")) |> 
  
  dplyr::select(where(is.numeric))

skim(bones1)

Data summary
Name	bones1
Number of rows	1049
Number of columns	5
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Humerus	1	304	23	230	288	304	320	376	▁▅▇▅▁
Radius	1	233	19	179	220	233	247	290	▁▆▇▅▁
Femur	1	428	31	345	406	428	448	531	▁▆▇▃▁
Tibia	1	353	28	276	333	354	372	446	▁▆▇▃▁
Iliac	1	263	18	184	251	263	274	324	▁▂▇▆▁

Getting the number of variables (number of columns) and sample size (number of rows)

# Variables in columns
p <- ncol(bones1) 

# Each row is a different skeleton
n <- nrow(bones1)

Density plots and scatter plots

In addition to the summary statistics, it’s recommended to examine the shape each variable takes and the relationship between any 2 pairs of variables.

This can be accomplished with:

Density plots for the shape of the individual variables
Scatter plots for the relationship between 2 pairs of variables

# The best way to display the density plots for all the variables is to change it into the "long" format using pivot_longer():
# ?pivot_longer
bones1 |> 
  
  pivot_longer(cols = Humerus:Iliac,     # Choosing the columns in the data to "lengthen"
               names_to = "Bone",        # Naming the column that indicates the bone type
               values_to = "Length") |>  # Naming the column that contains the bone measurement
  
  ggplot(mapping = aes(x = Length)) +
  
  geom_density(mapping = aes(fill = Bone), 
               show.legend = F) +
  
  # A separate density plot for each bone
  facet_wrap(facets = ~ Bone, 
             scales = "free") + 
  
  # Removing the extra space between the density plot and the x-axis
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

All the density plots are roughly bellshaped, indicating the measurements are about Normal for all 5 bones

Up next, create a scatterplot matrix using ggpairs in the GGally package

ggpairs(bones1)

From the scatterplot matrix, we can see:

All the pairs appear to be linearly associated (correlated), but some are stronger than others.
The Illiac appears to be the weakest related to the other variables (0.40 < r < 0.50)
The arm and leg bones have strong correlations (0.87 < r < 0.91)
There appears to be 1 clear outlier with a short Humerus but typical lengths for the other 4 bones

ggpairs also displays the individual density plots, so we don’t have to make them ourselves!

Calculating the Summary Statistics

While we can calculate the 5 number summary and mean for each variable with summary, we often need the mean vector \(\bar{y}\), covariance matrix \(S\), and correlation matrix \(R\).

# 5 number summary per variable
summary(bones1)

##     Humerus        Radius        Femur         Tibia         Iliac    
##  Min.   :230   Min.   :179   Min.   :345   Min.   :276   Min.   :184  
##  1st Qu.:288   1st Qu.:220   1st Qu.:406   1st Qu.:333   1st Qu.:251  
##  Median :304   Median :233   Median :428   Median :354   Median :263  
##  Mean   :304   Mean   :233   Mean   :428   Mean   :353   Mean   :263  
##  3rd Qu.:320   3rd Qu.:247   3rd Qu.:448   3rd Qu.:372   3rd Qu.:274  
##  Max.   :376   Max.   :290   Max.   :531   Max.   :446   Max.   :324

# Mean vector of bone measurements can be calculated using colMeans()
meanvec <- colMeans(bones1)
meanvec

## Humerus  Radius   Femur   Tibia   Iliac 
##     304     233     428     353     263

# Covariance matrix can be calculated using cov() or var()
(covmat <- var(bones1) )

##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330

cov(bones1)

##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330

# Correlation matrix using cor():
(R_mat <- cor(bones1))

##         Humerus Radius Femur Tibia Iliac
## Humerus    1.00   0.87  0.91  0.85  0.49
## Radius     0.87   1.00  0.86  0.90  0.41
## Femur      0.91   0.86  1.00  0.91  0.48
## Tibia      0.85   0.90  0.91  1.00  0.42
## Iliac      0.49   0.41  0.48  0.42  1.00

You can also use var, cov, and cor to calculate specific pairings of covariances and correlations instead of the entire matrix by using 2 arguments instead of 1

# If you give cov() 2 variables, just calculates the covariances between them
cov(x = bones1$Humerus, 
    y = bones1$Radius)

## [1] 374

# Remember, cov(Y,Y) = var(Y)!
cov(x = bones1$Humerus,
    y = bones1$Humerus)

## [1] 526

var(bones1$Humerus)

## [1] 526

Also works if you give cov() 2 matrices.

# For instance, if we want the covariance of the arm bones with the leg bones, but not the correlation of the pairs in the same limb:
cov(x = bones1 |> dplyr::select(Humerus, Radius),
    y = bones1 |> dplyr::select(Femur, Tibia))

##         Femur Tibia
## Humerus   650   549
## Radius    501   475

Spectral Decomposition: Eigenvalue and Eigenvectors

Many methods in the course will use the Eigenvalue and Eigenvectors of \(S\) or \(R\). We can use the eigen function to get both from the respective matrix!

eigen_bones <- 
  bones1 |> 
  cov() |> 
  eigen()
  
# We can use $values to just have the eigenvalues or $vectors to return just the eigenvectors:
eigen_bones$values

## [1] 2515  254   94   75   34

eigen_bones$vectors

##       [,1]   [,2]  [,3]    [,4]   [,5]
## [1,] -0.43 -0.039  0.57  0.4924  0.494
## [2,] -0.35  0.132 -0.22  0.6218 -0.653
## [3,] -0.61  0.046  0.38 -0.5959 -0.366
## [4,] -0.54  0.234 -0.67 -0.1253  0.442
## [5,] -0.19 -0.961 -0.20  0.0063 -0.019

Rebuilding the covariance matrix from the eigen values and vectors

Next, we’ll demonstrate we can return the original matrix and the square root matrix after \(S\) has been spectrally decomposed:

# Creating the Lambda matrix using the eigenvalues and diag() function:
(Lambda <- diag(eigen_bones$values))

##      [,1] [,2] [,3] [,4] [,5]
## [1,] 2515    0    0    0    0
## [2,]    0  254    0    0    0
## [3,]    0    0   94    0    0
## [4,]    0    0    0   75    0
## [5,]    0    0    0    0   34

# Saving just the vectors in a matrix called bones_E
bones_E <- eigen_bones$vectors


# Showing spectral decomposition gives us back our original matrix
bones_E %*% Lambda %*% t(bones_E)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]  526  374  650  549  206
## [2,]  374  354  501  475  138
## [3,]  650  501  970  800  272
## [4,]  549  475  800  793  213
## [5,]  206  138  272  213  330

cov(bones1)

##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330

Next, we’ll show that the covariance matrix can be created using the square root matrix:

sqrt_bones <- bones_E %*% sqrt(Lambda) %*% t(bones_E)

# Showing that the sqrt matrix times itself gives back the original covariance matrix
sqrt_bones %*% sqrt_bones

##      [,1] [,2] [,3] [,4] [,5]
## [1,]  526  374  650  549  206
## [2,]  374  354  501  475  138
## [3,]  650  501  970  800  272
## [4,]  549  475  800  793  213
## [5,]  206  138  272  213  330

cov(bones1)

##         Humerus Radius Femur Tibia Iliac
## Humerus     526    374   650   549   206
## Radius      374    354   501   475   138
## Femur       650    501   970   800   272
## Tibia       549    475   800   793   213
## Iliac       206    138   272   213   330