Executive Summary

This report demonstrates several methods for dimension reduction using the dimension reduction (dr) package in R: sliced inverse regression (SIR), principal Hessian directions (pHd), a graphical method, and a weighted variant. The SIR, pHd, and graphical methods all indicated that reducing the data from four dimensions to two was optimal for this analysis, while the weighted method, which re-weights the observations so the sample is closer to normally distributed than the original, reduced the problem further to a single dimension. The purpose of dimension reduction is to lower the number of dimensions, d, within a dataset by projecting the original data onto a smaller subspace of dimension k, where k is less than d. This reduction increases computing efficiency when creating the model or performing the analysis while retaining most of the information, because the retained directions are chosen to explain most of the variance in the data. In the same spirit as principal component analysis (PCA), the dr package determines these directions through an eigen-decomposition: the data are standardized and decomposed into their associated Eigenvalues and Eigenvectors, the Eigenvalues are sorted into descending order, and the top k Eigenvectors are selected according to how much of the variance they explain. The selected Eigenvectors form the projection matrix that transforms the original dataset. In our example, two dimensions were sufficient to explain the variance, or one dimension when weighted values were used. A follow-on analysis would be needed to determine whether the dimension reduction makes practical sense; the purpose of this exercise is to present the various methods the dr package offers for determining the number of dimensions.

Background and Objectives

The study of athletes’ health helps sports medicine caregivers monitor health trends and ensure each athlete is performing at their peak. The data was collected by the Australian Institute of Sport (AIS), an organization that supports athletes and their careers. AIS helps train athletes, improve performance, and create more opportunities for athletes to advance. AIS works with tennis, basketball, football (soccer for the Americans), and six other sports that are internationally recognized at the Olympics. The objective of this analysis is to perform dimension reduction on the data and determine which variables explain most of the variance within the data. We perform this analysis using the dimension reduction (dr) package in R. The dr package is modeled after the lm function used for linear regression analysis. The purpose of dimension reduction is to reduce the number of computing cycles required and improve model performance.

Key Measures and Data Collection

The data was collected by the AIS in 1994 for 202 athletes: 102 male and 100 female, drawn from ten different sports. The health measures included height, weight, body mass index, lean body mass, percentage of body fat, sum of skin folds, red and white blood cell counts, hemoglobin, hematocrit, and ferritin. Each of these health attributes is important in assessing the performance of the athletes. The data was measured in a lab setting at the AIS once for each athlete.

Model Specification & Fitting

As stated earlier, the purpose of dimension reduction is to reduce the number of computing cycles required and improve model performance. The dr package uses eigen-decomposition methods closely related to principal component analysis (PCA) to extract the pertinent directions from the dataset. PCA converts data described by many variables into a smaller set of linearly uncorrelated components, and the components to keep are determined by investigating the Eigenvalues and Eigenvectors. PCA and dr help data scientists reduce a complex dataset into a more manageable form that is easier to analyze. All statistical tests are performed at the 0.05 significance level.
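
To make the eigen-decomposition steps concrete, the following minimal sketch applies them to a few numeric ais measurements using base R. It illustrates the generic idea of keeping the top-k eigenvectors and projecting the data; it is not the dr fitting procedure itself, and the choice of columns and k = 2 here is purely illustrative.

library(dr)
data(ais)
# Standardize a few numeric measurements, then eigen-decompose their covariance
Z <- scale(ais[, c("Ht", "Wt", "RCC", "WCC")])
decomp <- eigen(cov(Z))
decomp$values / sum(decomp$values)   # proportion of variance per component
k <- 2                               # keep the top-k eigenvectors (illustrative choice)
proj <- decomp$vectors[, 1:k]        # projection matrix
Z_reduced <- Z %*% proj              # transformed (reduced) data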

Exploratory Data Analysis

We begin this analysis by taking a quick look at the data to see whether there are any irregularities, missing values, or abnormal distributions. Our first plot is a correlation plot. As you can see in the correlation plot, there are strong positive and negative linear relationships within the data. There is also no missing data in the dataset. The following set of histograms shows the spread of data for each variable; viewing the data this way shows whether each variable is normally distributed or bi-modal. As you can see in the histograms below, the data is roughly normally distributed. The health factors are separated by the sex of the athlete, and the males have predominantly higher values than the females.

library(tidyverse)
library(corrplot)
library(reshape2)
library(dr)
data(ais)
attach(ais)

# Numeric health measures (first 12 columns) and their correlation matrix
df <- ais[1:12]
res <- cor(df)
corrplot(res, type = "upper", method = "ellipse")
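
# Sanity check for the claim above that there is no missing data in the dataset
sum(is.na(df))   # 0 indicates no missing values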

data_long <- df %>%
  pivot_longer(!Sex) %>% 
  as.data.frame()

ggplot(data_long, aes(value, fill = factor(Sex))) +
  geom_histogram(alpha = 0.6, position = 'identity') +
  scale_fill_manual(name = "Sex", labels = c('Male', 'Female'),
                    values = c("#E69F00", "#56B4E9")) +
  facet_wrap(~name, scales = "free") +
  ggtitle("Histogram of Athlete Health Information")

Sliced Inverse Regression

The first method of dimension reduction we will investigate is sliced inverse regression (SIR). SIR slices the response into groups and works with the within-slice means of the predictors, ignoring higher-order dependencies. Let’s say we want to determine whether lean body mass is a function of the athlete’s height, weight, and blood cell counts. We pass those variables to the dr function using the SIR method. The model summary reports the Eigenvectors, Eigenvalues, and the large-sample marginal dimension tests. The large-sample marginal dimension tests indicate how many directions are needed to explain the variation in the dependent variable, lean body mass. The null hypothesis for each row is that the dimension equals x, against the alternative that it is greater than x. In the 0D and 1D rows, the p-values are well below 0.05, so we reject the null hypothesis in both cases and conclude that the dimension is at least two. For the 2D test, however, the p-value is greater than 0.05, so we fail to reject the null hypothesis and conclude that two directions (linear combinations of the predictors) are sufficient to explain lean body mass. The first two Eigenvectors would be used for any follow-on analysis, so the dimensions are reduced from four to two.

i1<-dr(LBM~Ht+Wt+log(RCC)+WCC,method="sir",nslices=8)
summary(i1)
## 
## Call:
## dr(formula = LBM ~ Ht + Wt + log(RCC) + WCC, method = "sir", 
##     nslices = 8)
## 
## Method:
## sir with 8 slices, n = 202.
## 
## Slice Sizes:
## 25 25 25 25 27 27 30 18 
## 
## Estimated Basis Vectors for Central Subspace:
##              Dir1      Dir2    Dir3     Dir4
## Ht       -0.01053  0.001086  0.3068 -0.04158
## Wt       -0.02383  0.003352 -0.1896  0.01005
## log(RCC) -0.99960 -0.999976 -0.8965  0.51527
## WCC       0.01097  0.005932  0.2574  0.85596
## 
##               Dir1   Dir2    Dir3    Dir4
## Eigenvalues 0.8774 0.1614 0.04244 0.01313
## R^2(OLS|dr) 0.9987 0.9988 0.99997 1.00000
## 
## Large-sample Marginal Dimension Tests:
##                Stat df   p.value
## 0D vs >= 1D 221.057 28 0.0000000
## 1D vs >= 2D  43.822 18 0.0006117
## 2D vs >= 3D  11.224 10 0.3403385
## 3D vs >= 4D   2.651  4 0.6177515
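
As a sketch of the follow-on step, the per-observation values of the first two estimated directions can be pulled from the fitted object (this assumes the dr.direction helper from the dr package) and used as reduced predictors; the lm fit below is purely illustrative.

# Extract the estimated directions for each observation and keep the first two
dirs <- dr.direction(i1)[, 1:2]
head(dirs)
# Illustrative follow-on fit of LBM on the two retained directions
fit2 <- lm(ais$LBM ~ dirs)
summary(fit2)$r.squared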

Principal Hessian Directions

Next, we examine another method for dimension reduction, known as principal Hessian directions (pHd). The residual-based pHd method eigen-decomposes a Hessian-like matrix formed by averaging the outer products of the centered predictors weighted by the OLS residuals. The output below is similar to the SIR results, but pHd performs asymptotic Chi-squared tests for dimension, with p-values given under the normal-theory and general-theory columns. Using the general-theory p-values, the recommended number of dimensions is again two rather than four (the 2D vs >= 3D test is not rejected at the 0.05 level), consistent with the SIR results.

i2<-update(i1,method="phdres")
summary(i2)
## 
## Call:
## dr(formula = LBM ~ Ht + Wt + log(RCC) + WCC, method = "phdres", 
##     nslices = 8)
## 
## Method:
## phdres, n = 202.
## 
## Estimated Basis Vectors for Central Subspace:
##              Dir1       Dir2      Dir3     Dir4
## Ht       -0.12764  0.0003378 -0.005550  0.02549
## Wt        0.02163 -0.0326138  0.007342 -0.01343
## log(RCC)  0.74348 -0.9816463 -0.999930 -0.99909
## WCC      -0.65611  0.1879008  0.007408 -0.03157
## 
##               Dir1   Dir2    Dir3    Dir4
## Eigenvalues 1.4303 1.1750 -1.1244 -0.3999
## R^2(OLS|dr) 0.2781 0.9642  0.9642  1.0000
## 
## Large-sample Marginal Dimension Tests:
##               Stat df Normal theory Indep. test General theory
## 0D vs >= 1D 35.015 10     0.0001241    0.003676        0.01632
## 1D vs >= 2D 20.248  6     0.0025012          NA        0.03152
## 2D vs >= 3D 10.281  3     0.0163211          NA        0.05819
## 3D vs >= 4D  1.155  1     0.2825955          NA        0.26625
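
For intuition about what pHd computes, the residual-based matrix can be sketched by hand: take the OLS residuals and eigen-decompose the residual-weighted average of outer products of the centered predictors. This is a simplified illustration of the idea; the scaling and standardization used internally by dr may differ.

# Illustrative construction of the residual-based pHd matrix (simplified scaling)
X  <- with(ais, cbind(Ht, Wt, logRCC = log(RCC), WCC))
e  <- residuals(lm(ais$LBM ~ X))              # OLS residuals
Xc <- scale(X, center = TRUE, scale = FALSE)  # centered predictors
M  <- crossprod(Xc * e, Xc) / nrow(Xc)        # (1/n) sum_i e_i (x_i - xbar)(x_i - xbar)'
eigen(M)$values                               # eigenvalues of the Hessian-like matrix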

Permutation Tests

The next method is the permutation test. The permutation test takes the specification of the initial model, i1, repeatedly permutes the data, and re-fits the model to obtain a reference distribution for the dimension test statistics, from which p-values are computed. As the results below show, the outcome is similar: two dimensions are recommended instead of four.

dr.permutation.test(i1,npermute=499)
## $summary
##                   Stat p.value
## 0D vs >= 1D 221.056620   0.000
## 1D vs >= 2D  43.821844   0.000
## 2D vs >= 3D  11.223952   0.342
## 3D vs >= 4D   2.651365   0.550
## 
## $npermute
## [1] 499
## 
## attr(,"class")
## [1] "dr.permutation.test"

Visualization of the Number of Dimensions

The final method for choosing the number of dimensions is to examine the data graphically. Below is a scatter plot matrix of the estimated directions from the dr model we created, with the points colored by the response because mark.by.y = TRUE. As you can see from the plots, Dir2 shows a clearer visual separation of the observations than the later directions; the clustering visible along the first two directions supports retaining two dimensions.

plot(i1,mark.by.y=TRUE)
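
To focus on the two retained dimensions, the first two directions can also be plotted against each other with the points colored by the response; this is an illustrative ggplot2 sketch that again assumes the dr.direction helper from the dr package.

# Plot the two retained directions against each other, colored by LBM
dir12 <- as.data.frame(dr.direction(i1)[, 1:2])
names(dir12) <- c("Dir1", "Dir2")
dir12$LBM <- ais$LBM
ggplot(dir12, aes(Dir1, Dir2, color = LBM)) +
  geom_point() +
  ggtitle("First Two SIR Directions Colored by LBM")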

Weights

Our final method for choosing the number of dimensions uses weighted calculations. The weights are chosen to make the weighted predictor sample closer to a normal distribution than the original sample; they are computed from the centered data and its covariance matrix. The weighted pHd results further reduce the number of dimensions from four to one: the 0D vs >= 1D test is rejected (p < 0.05), while the 1D vs >= 2D test is not, so a single direction suffices.

wts <- dr.weights(LBM~Ht+Wt+RCC+WCC)
i3 <- dr(LBM~Ht+Wt+RCC+WCC,weights=wts,method="phdres")
summary(i3)
## 
## Call:
## dr(formula = LBM ~ Ht + Wt + RCC + WCC, weights = wts, method = "phdres")
## 
## Method:
## phdres, n = 192, using weights.
## 
## Estimated Basis Vectors for Central Subspace:
##          Dir1     Dir2      Dir3     Dir4
## Ht  -0.022481 -0.05184 -0.004853 -0.04053
## Wt  -0.009106  0.06558 -0.036104 -0.01896
## RCC  0.993042  0.99619  0.997433  0.04488
## WCC -0.115232  0.02499  0.061651  0.99799
## 
##                 Dir1   Dir2    Dir3   Dir4
## Eigenvalues -4.03074 0.9601 -0.7611 0.1149
## R^2(OLS|dr)  0.01624 0.7298  0.8272 0.8482
## 
## Large-sample Marginal Dimension Tests:
##                Stat df Normal theory Indep. test General theory
## 0D vs >= 1D 58.5609 10     6.777e-09    0.000561        0.05665
## 1D vs >= 2D  4.9925  6     5.448e-01          NA        0.73170
## 2D vs >= 3D  1.9533  3     5.822e-01          NA        0.75968
## 3D vs >= 4D  0.0435  1     8.348e-01          NA        0.84220
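
As a quick, illustrative check on the weighting step, the estimated weights themselves can be summarized; observations whose weights differ substantially from the bulk of the others are the ones being re-weighted to make the predictor sample look more nearly normal.

# Inspect the distribution of the estimated weights
summary(wts)
hist(wts, main = "Estimated dr.weights", xlab = "weight")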

Conclusions

In closing, we have demonstrated several methods for dimension reduction using the dimension reduction (dr) package in R: SIR, pHd, a graphical method, and a weighted variant. The SIR, pHd, and graphical methods all indicated that reducing the number of dimensions from four to two was the optimal solution for analyzing this data, while the weighted method, which re-weights the observations so the sample is closer to normally distributed than the original, further reduced the number of dimensions to one. Dimension reduction lowers the number of dimensions, d, within a dataset by projecting the original data onto a smaller subspace of dimension k, where k is less than d. This reduction increases computing efficiency when creating a model or performing an analysis while retaining most of the information, since the retained directions explain most of the variance in the data. As described in the Executive Summary, the dr package finds these directions through an eigen-decomposition in the same spirit as principal component analysis: the data are standardized, the Eigenvalues and Eigenvectors are computed, the Eigenvalues are sorted into descending order, and the top k Eigenvectors form the projection matrix that transforms the original dataset. In our example, two dimensions were optimal for explaining the variance, or one dimension when weighted values were used. A follow-on analysis would be needed to determine whether the dimension reduction makes practical sense; the purpose of this exercise was to present the various methods the dr package offers for determining the number of dimensions.

Technical Notes

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dr_3.0.10       MASS_7.3-55     reshape2_1.4.4  corrplot_0.92  
##  [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.8     purrr_0.3.4    
##  [9] readr_2.1.2     tidyr_1.2.0     tibble_3.1.6    ggplot2_3.3.5  
## [13] tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8       lubridate_1.8.0  assertthat_0.2.1 digest_0.6.29   
##  [5] utf8_1.2.2       R6_2.5.1         cellranger_1.1.0 plyr_1.8.6      
##  [9] backports_1.4.1  reprex_2.0.1     evaluate_0.15    highr_0.9       
## [13] httr_1.4.2       pillar_1.7.0     rlang_1.0.2      readxl_1.3.1    
## [17] rstudioapi_0.13  jquerylib_0.1.4  rmarkdown_2.12   labeling_0.4.2  
## [21] munsell_0.5.0    broom_0.7.12     compiler_4.1.1   modelr_0.1.8    
## [25] xfun_0.30        pkgconfig_2.0.3  htmltools_0.5.2  tidyselect_1.1.2
## [29] fansi_1.0.2      crayon_1.5.0     tzdb_0.2.0       dbplyr_2.1.1    
## [33] withr_2.5.0      grid_4.1.1       jsonlite_1.8.0   gtable_0.3.0    
## [37] lifecycle_1.0.1  DBI_1.1.2        magrittr_2.0.2   scales_1.1.1    
## [41] cli_3.2.0        stringi_1.7.6    farver_2.1.0     fs_1.5.2        
## [45] xml2_1.3.3       bslib_0.3.1      ellipsis_0.3.2   generics_0.1.2  
## [49] vctrs_0.3.8      tools_4.1.1      glue_1.6.2       hms_1.1.1       
## [53] fastmap_1.1.0    yaml_2.3.5       colorspace_2.0-3 rvest_1.0.2     
## [57] knitr_1.37       haven_2.4.3      sass_0.4.0