There are a few ways to initial examine the data in R:
# Determining what type of object R is storing the data as using is()
is(bones)
## [1] "tbl_df" "tbl" "data.frame" "list" "oldClass"
## [6] "vector"
# Looking at the different types of data in the data frame using str() for structure
str(bones)
## tibble [1,538 × 8] (S3: tbl_df/tbl/data.frame)
## $ Sex : chr [1:1538] "0" "1" "0" "0" ...
## $ age : chr [1:1538] "40-49" "50+" "30-39" "20-29" ...
## $ Location: chr [1:1538] "Alaska, United States" "Alaska, United States" "Alaska, United States" "Alaska, United States" ...
## $ Humerus : num [1:1538] 308 NA 311 289 295 ...
## $ Radius : num [1:1538] 229 NA NA 224 208 ...
## $ Femur : num [1:1538] 443 386 415 398 395 ...
## $ Tibia : num [1:1538] NA 283 330 312 306 298 351 361 338 311 ...
## $ Iliac : num [1:1538] 274 266 NA 240 286 ...
# Looking at the first 6 rows of the data
head(bones)
## # A tibble: 6 × 8
## Sex age Location Humerus Radius Femur Tibia Iliac
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 40-49 Alaska, United States 308. 229 443 NA 274
## 2 1 50+ Alaska, United States NA NA 386 283 266
## 3 0 30-39 Alaska, United States 311 NA 415 330 NA
## 4 0 20-29 Alaska, United States 289 224. 398 312 240.
## 5 1 50+ Alaska, United States 295 208 395 306 286.
## 6 1 30-39 Alaska, United States 270. 196. NA 298 NA
# skim() in the skimr package will give a quick overview (summary) of the data:
skim(bones)
Name | bones |
Number of rows | 1538 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sex | 0 | 1 | 1 | 2 | 0 | 4 | 0 |
age | 7 | 1 | 3 | 5 | 0 | 4 | 0 |
Location | 0 | 1 | 4 | 39 | 0 | 45 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Humerus | 162 | 0.89 | 304 | 23 | 230 | 287 | 304 | 319 | 376 | ▁▅▇▅▁ |
Radius | 217 | 0.86 | 233 | 19 | 179 | 219 | 233 | 247 | 290 | ▁▆▇▅▁ |
Femur | 117 | 0.92 | 427 | 32 | 345 | 404 | 428 | 448 | 531 | ▂▆▇▃▁ |
Tibia | 135 | 0.91 | 353 | 28 | 276 | 333 | 353 | 372 | 446 | ▁▆▇▃▁ |
Iliac | 69 | 0.96 | 262 | 18 | 184 | 251 | 263 | 274 | 324 | ▁▂▇▆▁ |
# skim() also let's use see how much of the data is missing!
Since there are missing values for each of the variables (denoted by NA), we can remove them using filter() and complete.cases() in the tidyverse packages:
# Since some values are missing, denoted by NA, we want to only keep the rows with the complete data
# We can use the complete.data() function to do that. Complete.cases() will return a vector of True and False,
# and we want to keep the rows that have all information for recorded (indicated by TRUE)
bones1 <-
bones %>%
filter(complete.cases(.)) |>
filter(Sex %in% c("0","1")) |>
dplyr::select(where(is.numeric))
skim(bones1)
Name | bones1 |
Number of rows | 1049 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Humerus | 0 | 1 | 304 | 23 | 230 | 288 | 304 | 320 | 376 | ▁▅▇▅▁ |
Radius | 0 | 1 | 233 | 19 | 179 | 220 | 233 | 247 | 290 | ▁▆▇▅▁ |
Femur | 0 | 1 | 428 | 31 | 345 | 406 | 428 | 448 | 531 | ▁▆▇▃▁ |
Tibia | 0 | 1 | 353 | 28 | 276 | 333 | 354 | 372 | 446 | ▁▆▇▃▁ |
Iliac | 0 | 1 | 263 | 18 | 184 | 251 | 263 | 274 | 324 | ▁▂▇▆▁ |
Getting the number of variables (number of columns) and sample size (number of rows)
# Variables in columns
p <- ncol(bones1)
# Each row is a different skeleton
n <- nrow(bones1)
In addition to the summary statistics, it’s recommended to examine the shape each variable takes and the relationship between any 2 pairs of variables.
This can be accomplished with:
# The best way to display the density plots for all the variables is to change it into the "long" format using pivot_longer():
# ?pivot_longer
bones1 |>
pivot_longer(cols = Humerus:Iliac, # Choosing the columns in the data to "lengthen"
names_to = "Bone", # Naming the column that indicates the bone type
values_to = "Length") |> # Naming the column that contains the bone measurement
ggplot(mapping = aes(x = Length)) +
geom_density(mapping = aes(fill = Bone),
show.legend = F) +
# A separate density plot for each bone
facet_wrap(facets = ~ Bone,
scales = "free") +
# Removing the extra space between the density plot and the x-axis
scale_y_continuous(expand = c(0, 0, 0.05, 0))
All the density plots are roughly bellshaped, indicating the measurements are about Normal for all 5 bones
Up next, create a scatterplot matrix using ggpairs in the GGally package
ggpairs(bones1)
From the scatterplot matrix, we can see:
ggpairs also displays the individual density plots, so we don’t have to make them ourselves!
While we can calculate the 5 number summary and mean for each variable with summary, we often need the mean vector \(\bar{y}\), covariance matrix \(S\), and correlation matrix \(R\).
# 5 number summary per variable
summary(bones1)
## Humerus Radius Femur Tibia Iliac
## Min. :230 Min. :179 Min. :345 Min. :276 Min. :184
## 1st Qu.:288 1st Qu.:220 1st Qu.:406 1st Qu.:333 1st Qu.:251
## Median :304 Median :233 Median :428 Median :354 Median :263
## Mean :304 Mean :233 Mean :428 Mean :353 Mean :263
## 3rd Qu.:320 3rd Qu.:247 3rd Qu.:448 3rd Qu.:372 3rd Qu.:274
## Max. :376 Max. :290 Max. :531 Max. :446 Max. :324
# Mean vector of bone measurements can be calculated using colMeans()
meanvec <- colMeans(bones1)
meanvec
## Humerus Radius Femur Tibia Iliac
## 304 233 428 353 263
# Covariance matrix can be calculated using cov() or var()
(covmat <- var(bones1) )
## Humerus Radius Femur Tibia Iliac
## Humerus 526 374 650 549 206
## Radius 374 354 501 475 138
## Femur 650 501 970 800 272
## Tibia 549 475 800 793 213
## Iliac 206 138 272 213 330
cov(bones1)
## Humerus Radius Femur Tibia Iliac
## Humerus 526 374 650 549 206
## Radius 374 354 501 475 138
## Femur 650 501 970 800 272
## Tibia 549 475 800 793 213
## Iliac 206 138 272 213 330
# Correlation matrix using cor():
(R_mat <- cor(bones1))
## Humerus Radius Femur Tibia Iliac
## Humerus 1.00 0.87 0.91 0.85 0.49
## Radius 0.87 1.00 0.86 0.90 0.41
## Femur 0.91 0.86 1.00 0.91 0.48
## Tibia 0.85 0.90 0.91 1.00 0.42
## Iliac 0.49 0.41 0.48 0.42 1.00
You can also use var, cov, and cor to calculate specific pairings of covariances and correlations instead of the entire matrix by using 2 arguments instead of 1
# If you give cov() 2 variables, just calculates the covariances between them
cov(x = bones1$Humerus,
y = bones1$Radius)
## [1] 374
# Remember, cov(Y,Y) = var(Y)!
cov(x = bones1$Humerus,
y = bones1$Humerus)
## [1] 526
var(bones1$Humerus)
## [1] 526
Also works if you give cov() 2 matrices.
# For instance, if we want the covariance of the arm bones with the leg bones, but not the correlation of the pairs in the same limb:
cov(x = bones1 |> dplyr::select(Humerus, Radius),
y = bones1 |> dplyr::select(Femur, Tibia))
## Femur Tibia
## Humerus 650 549
## Radius 501 475
Many methods in the course will use the Eigenvalue and Eigenvectors of \(S\) or \(R\). We can use the eigen function to get both from the respective matrix!
eigen_bones <-
bones1 |>
cov() |>
eigen()
# We can use $values to just have the eigenvalues or $vectors to return just the eigenvectors:
eigen_bones$values
## [1] 2515 254 94 75 34
eigen_bones$vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.43 -0.039 0.57 0.4924 0.494
## [2,] -0.35 0.132 -0.22 0.6218 -0.653
## [3,] -0.61 0.046 0.38 -0.5959 -0.366
## [4,] -0.54 0.234 -0.67 -0.1253 0.442
## [5,] -0.19 -0.961 -0.20 0.0063 -0.019
Next, we’ll demonstrate we can return the original matrix and the square root matrix after \(S\) has been spectrally decomposed:
# Creating the Lambda matrix using the eigenvalues and diag() function:
(Lambda <- diag(eigen_bones$values))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2515 0 0 0 0
## [2,] 0 254 0 0 0
## [3,] 0 0 94 0 0
## [4,] 0 0 0 75 0
## [5,] 0 0 0 0 34
# Saving just the vectors in a matrix called bones_E
bones_E <- eigen_bones$vectors
# Showing spectral decomposition gives us back our original matrix
bones_E %*% Lambda %*% t(bones_E)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 526 374 650 549 206
## [2,] 374 354 501 475 138
## [3,] 650 501 970 800 272
## [4,] 549 475 800 793 213
## [5,] 206 138 272 213 330
cov(bones1)
## Humerus Radius Femur Tibia Iliac
## Humerus 526 374 650 549 206
## Radius 374 354 501 475 138
## Femur 650 501 970 800 272
## Tibia 549 475 800 793 213
## Iliac 206 138 272 213 330
Next, we’ll show that the covariance matrix can be created using the square root matrix:
sqrt_bones <- bones_E %*% sqrt(Lambda) %*% t(bones_E)
# Showing that the sqrt matrix times itself gives back the original covariance matrix
sqrt_bones %*% sqrt_bones
## [,1] [,2] [,3] [,4] [,5]
## [1,] 526 374 650 549 206
## [2,] 374 354 501 475 138
## [3,] 650 501 970 800 272
## [4,] 549 475 800 793 213
## [5,] 206 138 272 213 330
cov(bones1)
## Humerus Radius Femur Tibia Iliac
## Humerus 526 374 650 549 206
## Radius 374 354 501 475 138
## Femur 650 501 970 800 272
## Tibia 549 475 800 793 213
## Iliac 206 138 272 213 330