Introduction

The Civil Service List dataset was selected for the analysis in this assignment. The dataset is housed on the NYC OpenData website, and the data is provided by the New York City Department of Citywide Administrative Services (DCAS) (NYC OpenData, 2016 – 2023). Per the website, the dataset consists of approximately 609,000 rows, one for each candidate who passed a Civil Service exam, with 20 columns of attributes collected for each row. A data dictionary on the website explains the type of data contained in each column. The dataset was created on June 14, 2016 and has been updated daily through July 21, 2023. A comma-separated values (CSV) version of the dataset was downloaded from the NYC OpenData website on July 25, 2023 for use in the analysis.

The data.table (Dowle & Srinivasan, 2023), naniar (Tierney & Cook, 2023), Amelia (Honaker et al., 2011), tidyverse (Wickham et al., 2019), factoextra (Kassambara & Mundt, 2020), ggplot2 (Wickham, 2016), clusterSim (Walesiak & Dudek, 2020), NbClust (Charrad et al., 2014), and clValid (Brock et al., 2008) packages were loaded for use in the analysis.

Then, the CSV file was read into RStudio using fread() with stringsAsFactors = TRUE.

#Load packages
library(data.table)
library(naniar)
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()     masks data.table::between()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::first()       masks data.table::first()
## ✖ lubridate::hour()    masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ dplyr::last()        masks data.table::last()
## ✖ lubridate::mday()    masks data.table::mday()
## ✖ lubridate::minute()  masks data.table::minute()
## ✖ lubridate::month()   masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second()  masks data.table::second()
## ✖ purrr::transpose()   masks data.table::transpose()
## ✖ lubridate::wday()    masks data.table::wday()
## ✖ lubridate::week()    masks data.table::week()
## ✖ lubridate::yday()    masks data.table::yday()
## ✖ lubridate::year()    masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(clusterSim)
## Loading required package: cluster
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(NbClust)
library(clValid)
#Upload data set
CSL <- fread("CivilServiceList.csv", stringsAsFactors = T)

Variable Types

The dimensions of the dataset were confirmed using the str() function (608,574 rows of 20 variables). This function also provided the variable types: 13 of the variables are factors (i.e., categorical), 6 are numeric or integer, and 1 is logical. In reviewing the data dictionary and the dataset itself, the only two variables stored as type num are List No and Adj. FA. List No is an assigned identifier, so it is actually categorical, but Adj. FA is the score a candidate received on their civil service exam, so this variable is a ratio-type numeric value.

str(CSL)
## Classes 'data.table' and 'data.frame':   608574 obs. of  20 variables:
##  $ Exam No           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ List No           : num  897 898 899 900 901 902 903 904 905 906 ...
##  $ First Name        : Factor w/ 60239 levels "","#NAME?",".BRANDON",..: 9874 59444 49586 8080 56111 36966 20359 9032 23624 40960 ...
##  $ MI                : Factor w/ 39 levels "","*",",","-",..: 30 1 1 1 26 1 1 22 1 1 ...
##  $ Last Name         : Factor w/ 85831 levels "","A BARI","A'GARD",..: 14673 64122 84009 82445 26136 5129 57601 60832 82445 72915 ...
##  $ Adj. FA           : num  75 75 75 75 75 75 75 75 75 75 ...
##  $ List Title Code   : int  10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
##  $ List Title Desc   : Factor w/ 509 levels "ACCOUNTANT","ADMINISTRATIVE ACCOUNTANT",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Group No          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ List Agency Code  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ List Agency Desc  : Factor w/ 63 levels "ADMINISTRATION FOR CHILDREN'S SERVICES",..: 57 57 57 57 57 57 57 57 57 57 ...
##  $ List Div Code     : logi  NA NA NA NA NA NA ...
##  $ Published Date    : Factor w/ 158 levels "","1/11/2023",..: 104 104 104 104 104 104 104 104 104 104 ...
##  $ Established Date  : Factor w/ 308 levels "","1/10/2018",..: 243 243 243 243 243 243 243 243 243 243 ...
##  $ Anniversary Date  : Factor w/ 312 levels "","1/10/2022",..: 246 246 246 246 246 246 246 246 246 246 ...
##  $ Extension Date    : Factor w/ 131 levels "","1/10/2024",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Veteran Credit    : Factor w/ 3 levels "","Disabled Veteran's Credit",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Parent Lgy Credit : Factor w/ 2 levels "","Parent Legacy Credit": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sibling Lgy Credit: Factor w/ 2 levels "","Sibling Legacy Credit": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Residency Credit  : Factor w/ 2 levels "","Residency Credit": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
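
As a quick check of these counts, the column classes reported by str() can be tallied directly. This is a small convenience sketch, not part of the required workflow.

#Tally column classes (should show 13 factor, 4 integer, 2 numeric, 1 logical)
table(sapply(CSL, function(col) class(col)[1]))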

Missing Data

Next, the extent of missing data was investigated. A cursory review of the dataset showed that many values are blank. Blank cells were not registering as missing, so all such values were converted to NA. The base is.na() function then revealed that 3,811,083 values are missing from the dataset (either blank or natively designated as NA). The pct_miss() function (Tierney & Cook, 2023) showed that missing data accounts for 31.3% of the dataset. The gg_miss_var() and vis_miss() functions (Tierney & Cook, 2023) were used to create visualizations of the missing data, which reveal that data is missing in half of the variables. A missingness map was created using the missmap() function (Honaker et al., 2011), which shows missing data by row and column; this map shows that the missing data is dispersed throughout the dataset. A function provided by the instructor determined that 548,203 rows, roughly 90% of the dataset, are missing more than 20% of their values. Before proceeding with further analysis, the decision was made to remove all of these rows, as they could skew any further analysis. This reduced the dataset to 60,371 rows.

CSL[CSL==""] <- NA
#Count missing (NA) values

sum(is.na(CSL))
## [1] 3811083
#Find percent of rows with a missing value 

pct_miss(CSL)
## [1] 31.31158
#Create missingness maps

gg_miss_var(CSL, show_pct = TRUE)

vis_miss(CSL, warn_large_data = FALSE)

missmap(CSL)

#Determine how many rows have more than 20% missing values

myr = function (x){
  temp=rep(0,nrow(x))
  for (i in 1:nrow(x)){
    temp[i]=sum(is.na(x[i, 1:ncol(x)]))/ncol(x)}
  x$ROWMISS=temp
  x=x[order(-x$ROWMISS), ]
  return(x)
}
CSL=myr(CSL)

length(CSL$ROWMISS[CSL$ROWMISS>.2])
## [1] 548203
#Delete rows missing more than 20% of their data

CSL <- CSL[CSL$ROWMISS <= 0.2,]
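
For reference, the instructor-provided loop can also be expressed as a vectorized calculation: rowMeans(is.na(x)) returns the proportion of missing values in each row directly. The sketch below assumes a copy of the data taken before filtering (here called CSL_raw, a hypothetical name).

#Vectorized equivalent of myr() (sketch; CSL_raw is a hypothetical pre-filter copy)
row_miss <- rowMeans(is.na(CSL_raw))
sum(row_miss > 0.2)                    #rows missing more than 20% of their values
CSL_alt <- CSL_raw[row_miss <= 0.2, ]  #same subset as above, without the loop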

Research Questions and Hypotheses

The following research questions and associated hypotheses were formulated to guide the analysis:

  • Business Problem 1: The City of New York wants to determine if certain service titles perform better than others on the civil service exam.

  • RQ1: Is there a difference between the mean adjusted final average test score (Adj. FA) of each civil service title (List Title Desc)?

    • HO1: There is no difference between the mean Adj. FA of each List Title Desc.
    • HA1: There is a difference between the mean Adj. FA of each List Title Desc.
  • Business Problem 2: The City of New York wants to determine if certain positions are more likely to be listed as open and competitive rather than internal hires.

  • RQ2: Is there a significant relationship between civil service title (List Title Desc) and the name of an appointing Agency (List Agency Desc)?

    • HO2: List Title Desc is independent of the List Agency Desc.
    • HA2: List Title Desc is not independent of the List Agency Desc.

Principal Component Analysis (PCA)

Before initiating cluster analysis, dimensionality reduction was performed to simplify the dataset by removing redundant data, reducing complexity, and decreasing the memory required for machine learning and processing (Fandango et al., 2021). The most common form of dimensionality reduction is Principal Component Analysis (PCA). PCA creates a new set of variables, called principal components, that are linear combinations of the original variables (Ciaburro, 2018). The number of principal components equals the number of observed variables, and the components themselves are uncorrelated with one another. The researcher performed PCA on the Civil Service List dataset following the process outlined on the Week 6 assignment page. After converting categorical variables to dummy numeric variables, the data was standardized using the scale() function.
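
Before applying this to the assignment data, a toy illustration of prcomp() on the built-in iris measurements (independent of the Civil Service data) shows the general mechanics: the number of components equals the number of input variables, and each component explains a decreasing share of the variance.

#Toy PCA illustration on a built-in dataset (not the assignment data)
toy_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(toy_pca)   #four components, with decreasing proportions of variance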

#Convert categorical variables to dummy/numeric variables

CSL_numeric <- CSL %>%
  mutate_if(is.factor, as.numeric) %>%
  mutate_if(is.character, as.numeric)

#Standardize the data

CSL_std <- scale(CSL_numeric)
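
Note that scale() divides each column by its standard deviation, so a column with a single repeated value becomes NaN, and any column still containing NA values carries those NAs forward; either situation produces the NA rows and columns visible in the covariance matrix below. A quick diagnostic sketch to list the affected columns:

#Identify standardized columns containing non-finite values (NaN or NA)
colnames(CSL_std)[colSums(!is.finite(CSL_std)) > 0]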

The covariance matrix was calculated, showing the covariances between the variables in the dataset.

CM <- cov(CSL_std)
CM
##                         Exam No       List No    First Name MI     Last Name
## Exam No             1.000000000 -0.2071865531 -0.0031460953 NA  0.0027646361
## List No            -0.207186553  1.0000000000  0.0037014121 NA -0.0003931556
## First Name         -0.003146095  0.0037014121  1.0000000000 NA -0.0026091044
## MI                           NA            NA            NA NA            NA
## Last Name           0.002764636 -0.0003931556 -0.0026091044 NA  1.0000000000
## Adj. FA            -0.059685914 -0.6204438787 -0.0043181540 NA  0.0005135206
## List Title Code     0.053300733  0.0340223425  0.0041798864 NA -0.0022935020
## List Title Desc    -0.797935517  0.1535134132  0.0048298757 NA -0.0005119026
## Group No                     NA            NA            NA NA            NA
## List Agency Code    0.002038474 -0.0292024054 -0.0001370981 NA  0.0023356522
## List Agency Desc   -0.010193877  0.0263417211 -0.0020896364 NA -0.0027308421
## List Div Code                NA            NA            NA NA            NA
## Published Date               NA            NA            NA NA            NA
## Established Date   -0.355064581 -0.1808459988 -0.0008538785 NA -0.0052700467
## Anniversary Date   -0.353600148 -0.1816577546 -0.0008584530 NA -0.0052778975
## Extension Date               NA            NA            NA NA            NA
## Veteran Credit               NA            NA            NA NA            NA
## Parent Lgy Credit            NA            NA            NA NA            NA
## Sibling Lgy Credit           NA            NA            NA NA            NA
## Residency Credit             NA            NA            NA NA            NA
## ROWMISS            -0.061133728  0.1361137758  0.0097213780 NA -0.0049053981
##                          Adj. FA List Title Code List Title Desc Group No
## Exam No            -0.0596859139     0.053300733   -0.7979355171       NA
## List No            -0.6204438787     0.034022343    0.1535134132       NA
## First Name         -0.0043181540     0.004179886    0.0048298757       NA
## MI                            NA              NA              NA       NA
## Last Name           0.0005135206    -0.002293502   -0.0005119026       NA
## Adj. FA             1.0000000000    -0.004317786    0.0304335503       NA
## List Title Code    -0.0043177855     1.000000000    0.0121841123       NA
## List Title Desc     0.0304335503     0.012184112    1.0000000000       NA
## Group No                      NA              NA              NA       NA
## List Agency Code   -0.0332860740    -0.049452302   -0.0269734926       NA
## List Agency Desc    0.0333247135     0.029945020    0.0338396652       NA
## List Div Code                 NA              NA              NA       NA
## Published Date                NA              NA              NA       NA
## Established Date   -0.0619768056     0.031309310    0.1860499339       NA
## Anniversary Date   -0.0622259869     0.032023496    0.1849106853       NA
## Extension Date                NA              NA              NA       NA
## Veteran Credit                NA              NA              NA       NA
## Parent Lgy Credit             NA              NA              NA       NA
## Sibling Lgy Credit            NA              NA              NA       NA
## Residency Credit              NA              NA              NA       NA
## ROWMISS            -0.1633637997    -0.011706350    0.0689629116       NA
##                    List Agency Code List Agency Desc List Div Code
## Exam No                0.0020384740     -0.010193877            NA
## List No               -0.0292024054      0.026341721            NA
## First Name            -0.0001370981     -0.002089636            NA
## MI                               NA               NA            NA
## Last Name              0.0023356522     -0.002730842            NA
## Adj. FA               -0.0332860740      0.033324713            NA
## List Title Code       -0.0494523019      0.029945020            NA
## List Title Desc       -0.0269734926      0.033839665            NA
## Group No                         NA               NA            NA
## List Agency Code       1.0000000000     -0.928994432            NA
## List Agency Desc      -0.9289944324      1.000000000            NA
## List Div Code                    NA               NA            NA
## Published Date                   NA               NA            NA
## Established Date       0.0867537963     -0.089740031            NA
## Anniversary Date       0.0869566960     -0.089965996            NA
## Extension Date                   NA               NA            NA
## Veteran Credit                   NA               NA            NA
## Parent Lgy Credit                NA               NA            NA
## Sibling Lgy Credit               NA               NA            NA
## Residency Credit                 NA               NA            NA
## ROWMISS                0.0052330844     -0.004705626            NA
##                    Published Date Established Date Anniversary Date
## Exam No                        NA    -0.3550645810     -0.353600148
## List No                        NA    -0.1808459988     -0.181657755
## First Name                     NA    -0.0008538785     -0.000858453
## MI                             NA               NA               NA
## Last Name                      NA    -0.0052700467     -0.005277897
## Adj. FA                        NA    -0.0619768056     -0.062225987
## List Title Code                NA     0.0313093101      0.032023496
## List Title Desc                NA     0.1860499339      0.184910685
## Group No                       NA               NA               NA
## List Agency Code               NA     0.0867537963      0.086956696
## List Agency Desc               NA    -0.0897400312     -0.089965996
## List Div Code                  NA               NA               NA
## Published Date                 NA               NA               NA
## Established Date               NA     1.0000000000      0.999990058
## Anniversary Date               NA     0.9999900583      1.000000000
## Extension Date                 NA               NA               NA
## Veteran Credit                 NA               NA               NA
## Parent Lgy Credit              NA               NA               NA
## Sibling Lgy Credit             NA               NA               NA
## Residency Credit               NA               NA               NA
## ROWMISS                        NA     0.0015901582      0.001406520
##                    Extension Date Veteran Credit Parent Lgy Credit
## Exam No                        NA             NA                NA
## List No                        NA             NA                NA
## First Name                     NA             NA                NA
## MI                             NA             NA                NA
## Last Name                      NA             NA                NA
## Adj. FA                        NA             NA                NA
## List Title Code                NA             NA                NA
## List Title Desc                NA             NA                NA
## Group No                       NA             NA                NA
## List Agency Code               NA             NA                NA
## List Agency Desc               NA             NA                NA
## List Div Code                  NA             NA                NA
## Published Date                 NA             NA                NA
## Established Date               NA             NA                NA
## Anniversary Date               NA             NA                NA
## Extension Date                 NA             NA                NA
## Veteran Credit                 NA             NA                NA
## Parent Lgy Credit              NA             NA                NA
## Sibling Lgy Credit             NA             NA                NA
## Residency Credit               NA             NA                NA
## ROWMISS                        NA             NA                NA
##                    Sibling Lgy Credit Residency Credit      ROWMISS
## Exam No                            NA               NA -0.061133728
## List No                            NA               NA  0.136113776
## First Name                         NA               NA  0.009721378
## MI                                 NA               NA           NA
## Last Name                          NA               NA -0.004905398
## Adj. FA                            NA               NA -0.163363800
## List Title Code                    NA               NA -0.011706350
## List Title Desc                    NA               NA  0.068962912
## Group No                           NA               NA           NA
## List Agency Code                   NA               NA  0.005233084
## List Agency Desc                   NA               NA -0.004705626
## List Div Code                      NA               NA           NA
## Published Date                     NA               NA           NA
## Established Date                   NA               NA  0.001590158
## Anniversary Date                   NA               NA  0.001406520
## Extension Date                     NA               NA           NA
## Veteran Credit                     NA               NA           NA
## Parent Lgy Credit                  NA               NA           NA
## Sibling Lgy Credit                 NA               NA           NA
## Residency Credit                   NA               NA           NA
## ROWMISS                            NA               NA  1.000000000

Because several entries of the covariance matrix are not finite (NA), a matrix flagging the non-finite entries was created, and its eigenvectors and eigenvalues were then calculated to determine the principal components. The eigenvalues are shown in decreasing order, with the highest (and most relevant) value in the first position. Looking at the output, only the first eigenvalue is of meaningful size.

#Flag non-finite (NA) entries in the covariance matrix
IV <- !is.finite(CM)

#Compute eigenvalues and eigenvectors 
eigen(IV)
## eigen() decomposition
## $values
##  [1]  1.582475e+01  7.105427e-15  1.289856e-15  4.242455e-17  2.290481e-17
##  [6]  3.266109e-31  1.476411e-31  2.960189e-33 -2.138212e-50 -2.963138e-35
## [11] -1.725434e-34 -2.677969e-34 -4.213676e-34 -5.452279e-34 -1.912041e-33
## [16] -2.345255e-33 -2.684481e-33 -1.527345e-32 -1.625985e-32 -5.992561e-17
## [21] -6.824752e+00
## 
## $vectors
##             [,1]          [,2]        [,3]        [,4]         [,5]
##  [1,] -0.1584614  9.574271e-01  0.00000000  0.00000000  0.000000000
##  [2,] -0.1584614 -8.703883e-02  0.95162788 -0.03086802 -0.001264224
##  [3,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
##  [4,] -0.2786236  1.769418e-16  0.02832989 -0.20578680  0.656112886
##  [5,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
##  [6,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
##  [7,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
##  [8,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
##  [9,] -0.2786236  1.873501e-16  0.02117221  0.33440355 -0.263312067
## [10,] -0.1584614 -8.703883e-02 -0.09313446  0.44758630  0.378508586
## [11,] -0.1584614 -8.703883e-02 -0.09313446  0.44758630  0.378508586
## [12,] -0.2786236  1.873501e-16  0.02117221  0.33440355 -0.263312067
## [13,] -0.2786236  1.873501e-16  0.02117221  0.33440355 -0.263312067
## [14,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [15,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [16,] -0.2786236  1.769418e-16 -0.01836930 -0.15948477  0.026764663
## [17,] -0.2786236  1.769418e-16 -0.01836930 -0.15948477  0.026764663
## [18,] -0.2786236  1.769418e-16 -0.01836930 -0.15948477  0.026764663
## [19,] -0.2786236  1.769418e-16 -0.01836930 -0.15948477  0.026764663
## [20,] -0.2786236  1.769418e-16 -0.01836930 -0.15948477  0.026764663
## [21,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
##                [,6]          [,7]          [,8]          [,9]         [,10]
##  [1,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##  [2,]  2.081668e-17  2.428613e-17  4.336809e-18 -1.432589e-19  0.000000e+00
##  [3,] -3.621829e-01 -1.278052e-01 -1.162316e-01 -3.144580e-16  2.174025e-02
##  [4,] -8.510553e-15  1.262300e-15 -1.598837e-16  2.105475e-17  2.818926e-18
##  [5,]  7.475435e-01  4.730938e-01 -1.017728e-01 -5.654966e-17  3.936494e-03
##  [6,] -4.846581e-02 -3.046925e-02 -3.234329e-01 -2.073408e-16  9.785553e-02
##  [7,] -1.353645e-01  4.919788e-02  3.398435e-01 -1.669158e-16 -2.855844e-03
##  [8,] -1.343773e-01  4.487242e-02  2.154209e-01  9.378936e-16 -1.786311e-01
##  [9,]  2.930594e-01 -5.414785e-01 -2.960161e-01  1.458759e-16 -3.380964e-02
## [10,] -7.940418e-02  4.826569e-02 -3.584971e-01 -1.319789e-16 -9.185617e-02
## [11,]  7.940418e-02 -4.826569e-02  3.584971e-01  1.176262e-16  9.185617e-02
## [12,] -2.782668e-01  5.476191e-01  6.095945e-02  5.671192e-17  7.716847e-03
## [13,] -1.479265e-02 -6.140549e-03  2.350567e-01 -1.708329e-16  2.609279e-02
## [14,] -5.348429e-02 -1.026274e-01  5.024846e-04 -2.932878e-15 -6.489960e-01
## [15,] -5.447684e-02 -1.010107e-01  6.300200e-03  2.906006e-15  7.165993e-01
## [16,] -1.878405e-01  2.689667e-01 -4.704553e-01 -8.402258e-17  3.883036e-03
## [17,]  1.932995e-01 -1.177718e-01  2.107077e-01 -6.969119e-17 -1.878714e-02
## [18,]  4.813878e-02 -4.937525e-02  1.402936e-01 -7.071068e-01 -2.089964e-02
## [19,]  4.813878e-02 -4.937525e-02  1.402936e-01  7.071068e-01 -2.089964e-02
## [20,] -1.017366e-01 -5.244446e-02 -2.083950e-02  8.364875e-17  5.670338e-02
## [21,]  4.080817e-02 -2.052515e-01 -2.062976e-02 -1.483033e-16 -9.648549e-03
##               [,11]         [,12]         [,13]         [,14]         [,15]
##  [1,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##  [2,]  0.000000e+00  4.857226e-17  6.938894e-18  6.938894e-18  1.387779e-17
##  [3,] -1.957526e-01 -9.229729e-02 -7.566238e-02 -3.362651e-02 -1.673929e-01
##  [4,] -2.341877e-16  1.454277e-16 -4.625929e-17 -9.092842e-17  3.411623e-17
##  [5,]  4.557656e-02  5.437650e-03 -1.400016e-02 -7.676336e-03 -1.669763e-01
##  [6,]  2.200152e-01  2.466286e-01 -3.221114e-01  1.100734e-01  6.420648e-01
##  [7,] -1.932778e-01  5.029186e-01 -1.214690e-01  2.088319e-01 -6.126425e-03
##  [8,]  1.283180e-01  6.208259e-03  7.344530e-01 -5.019543e-03  3.401539e-02
##  [9,] -1.509983e-01  2.005477e-01  1.183857e-01  6.036524e-02 -1.403571e-01
## [10,]  1.929889e-01  9.980747e-02  2.862150e-01 -1.501833e-01  1.141746e-01
## [11,] -1.929889e-01 -9.980747e-02 -2.862150e-01  1.501833e-01 -1.141746e-01
## [12,] -1.360076e-02 -7.233859e-02 -1.941096e-02  7.518755e-02  7.629088e-02
## [13,]  1.645991e-01 -1.282091e-01 -9.897479e-02 -1.355528e-01  6.406620e-02
## [14,]  2.153423e-02 -1.408302e-01 -3.211033e-01 -4.412032e-01 -1.590690e-01
## [15,]  2.334661e-03 -1.034009e-01  3.988182e-02 -4.582360e-01 -1.607409e-01
## [16,] -4.466598e-01 -1.080047e-01 -6.489938e-03  7.187592e-02 -1.625901e-01
## [17,] -1.876594e-01 -4.078727e-01  6.269199e-02 -1.838886e-01  5.475305e-01
## [18,] -3.441606e-02  3.123440e-01  5.429172e-02 -3.280974e-02 -3.891494e-02
## [19,] -3.441606e-02  3.123440e-01  5.429172e-02 -3.280974e-02 -3.891494e-02
## [20,]  7.031513e-01 -1.088105e-01 -1.647855e-01  1.776322e-01 -3.071105e-01
## [21,] -2.874824e-02 -4.246647e-01  8.001149e-02  6.268562e-01 -1.577477e-02
##               [,16]         [,17]         [,18]         [,19]       [,20]
##  [1,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.00000000
##  [2,] -4.163336e-17 -5.551115e-17 -6.245005e-17  2.775558e-17  0.05040684
##  [3,] -1.160134e-01  3.383423e-02  7.283654e-01  2.944031e-01 -0.06589041
##  [4,] -2.168404e-16  3.019864e-16 -4.857226e-16  5.678328e-16 -0.64440203
##  [5,] -3.078561e-03  1.611531e-02  1.854308e-01  1.307335e-01 -0.06589041
##  [6,] -2.821067e-01  1.779864e-01 -9.332400e-02 -1.462595e-02 -0.06589041
##  [7,]  5.922298e-01 -1.533947e-02 -1.327624e-01  1.465281e-01 -0.06589041
##  [8,] -2.168958e-01  3.848809e-01 -1.461515e-01  5.448951e-02 -0.06589041
##  [9,]  1.148483e-01  2.497182e-01  3.802186e-02 -1.464278e-01 -0.20153158
## [10,]  2.648683e-01 -3.350035e-01  6.204837e-02  5.881283e-02  0.23835822
## [11,] -2.648683e-01  3.350035e-01 -6.204837e-02 -5.881283e-02  0.23835822
## [12,]  4.290728e-02  5.185503e-03  1.760967e-01 -4.852071e-01 -0.20153158
## [13,] -1.577556e-01 -2.549037e-01 -2.141186e-01  6.316350e-01 -0.20153158
## [14,]  6.263552e-03 -5.785291e-02 -2.027772e-01 -2.292135e-01 -0.06589041
## [15,]  4.910551e-02 -8.718508e-02 -2.013332e-01 -2.227591e-01 -0.06589041
## [16,] -1.693246e-02  1.567279e-01 -3.907922e-01  2.250485e-01  0.24979936
## [17,]  3.447310e-01  1.160106e-01  1.805678e-01  8.098752e-04  0.24979936
## [18,] -2.779247e-01 -2.680493e-01  8.162968e-02 -1.228333e-01  0.24979936
## [19,] -2.779247e-01 -2.680493e-01  8.162968e-02 -1.228333e-01  0.24979936
## [20,]  2.280510e-01  2.633601e-01  4.696502e-02  1.980819e-02  0.24979936
## [21,] -2.950436e-02 -4.524394e-01 -1.374478e-01 -1.595557e-01 -0.06589041
##            [,21]
##  [1,] -0.2412951
##  [2,] -0.2412951
##  [3,] -0.2412951
##  [4,]  0.1829755
##  [5,] -0.2412951
##  [6,] -0.2412951
##  [7,] -0.2412951
##  [8,] -0.2412951
##  [9,]  0.1829755
## [10,] -0.2412951
## [11,] -0.2412951
## [12,]  0.1829755
## [13,]  0.1829755
## [14,] -0.2412951
## [15,] -0.2412951
## [16,]  0.1829755
## [17,]  0.1829755
## [18,]  0.1829755
## [19,]  0.1829755
## [20,]  0.1829755
## [21,] -0.2412951
#Perform PCA on the flagged matrix to reduce dimensionality
PCA <- prcomp(IV, retx = TRUE, center = TRUE, scale. = FALSE,
         tol = NULL, rank. = NULL)
summary(PCA)
## Importance of components:
##                          PC1       PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## Standard deviation     1.757 8.147e-17   0   0   0   0   0   0   0    0    0
## Proportion of Variance 1.000 0.000e+00   0   0   0   0   0   0   0    0    0
## Cumulative Proportion  1.000 1.000e+00   1   1   1   1   1   1   1    1    1
##                        PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation        0    0    0    0    0    0    0    0    0    0
## Proportion of Variance    0    0    0    0    0    0    0    0    0    0
## Cumulative Proportion     1    1    1    1    1    1    1    1    1    1

Calculating the total variance explained by each principal component shows that the first principal component accounts for essentially all of the variance.

var <- PCA$sdev^2 / sum(PCA$sdev^2)
print(var)
##  [1] 1.000000e+00 2.151216e-33 0.000000e+00 0.000000e+00 0.000000e+00
##  [6] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [16] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [21] 0.000000e+00

A scree plot was created to depict the proportion of variance explained by each principal component.

Scree_data <- data.frame(Principal_Component = 1:length(var), Variance_Explained = var)

ggplot(Scree_data, aes(x = Principal_Component, y = Variance_Explained)) +
  geom_line() +
  xlab("Principal Component") +
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  scale_y_continuous(limits = c(0, 1))

A biplot was also created to show the distribution of data points and variables for the first and second principal components.

biplot(PCA, scale=0, cex=1)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

The last step in PCA is to determine which components to retain for analysis. Although PC1 accounts for nearly all of the variance in the dataset, at least two dimensions are needed for the clustering analysis and its visualization, so both PC1 and PC2 were retained.

PCA_transform <- as.matrix(-PCA$x[,1:2])

Clustering

Clustering groups similar items together based on shared characteristics or features of the data in order to discover hidden patterns in the dataset. Clustering is a type of unsupervised learning, meaning that information is gleaned from observation rather than from labeled examples (Fandango et al., 2021).

Another important aspect of clustering is choosing the similarity measure. The most common similarity measures are the Euclidean, Manhattan, Minkowski, Hamming, and Pearson correlation distances (Harmouch, 2021). For numeric data, the Euclidean distance is typically used; since the principal component scores are numeric, this similarity measure was selected.
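
As a brief illustration on hypothetical points (not the assignment data), base R's dist() function computes several of the measures named above; Euclidean distance is its default.

#Distance measures between two hypothetical 2-D points
toy <- rbind(c(1, 4), c(2, 6))
dist(toy, method = "euclidean")          #sqrt((1-2)^2 + (4-6)^2) = sqrt(5) ~ 2.236
dist(toy, method = "manhattan")          #|1-2| + |4-6| = 3
dist(toy, method = "minkowski", p = 3)   #generalizes both (p = 1 Manhattan, p = 2 Euclidean)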

Once a similarity measure has been selected, the appropriate number of clusters should be determined. The primary methods for this are the elbow method, the silhouette method, and the gap statistic. The elbow method plots the within-cluster sum of squares against the number of clusters and looks for the "elbow" where additional clusters yield diminishing returns. The silhouette method assigns each candidate number of clusters an average silhouette score between -1 and 1, with the highest score indicating the best choice. Both methods are most easily assessed from their corresponding visualizations.
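
The elbow and silhouette methods are both applied later in this report; the gap statistic is not, so a minimal sketch on the PCA scores is included here for completeness. It assumes clusGap() from the cluster package (loaded above as a dependency of clusterSim); fviz_gap_stat() from factoextra can plot the result.

#Gap statistic sketch (k limited to 2 to match the rest of the analysis)
gap <- clusGap(PCA_transform[, 1:2], FUNcluster = kmeans, K.max = 2, B = 50, nstart = 25)
#fviz_gap_stat(gap)   #optional factoextra visualization of the gap statistic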

The grouping method is a key decision point in clustering analysis. The three main grouping methods are partitional, hierarchical, and density-based. Partitional grouping is most commonly accomplished with the k-means algorithm, which seeks to minimize the within-cluster sum of squared errors; this method is only applicable to continuous variables. Hierarchical clustering groups data in either a bottom-up (agglomerative) or top-down (divisive) manner. Both partitional and hierarchical clustering require a number of clusters to be chosen, either up front for partitioning or by cutting the resulting tree for hierarchical clustering. Unlike the other two methods, density-based clustering does not require the number of clusters to be identified in advance; it instead considers the density of data points and the minimum number of points within a specified radius (Fandango et al., 2021).
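
For comparison with the k-means approach used below, a minimal sketch of the hierarchical alternative on the PCA scores is shown here; hclust() and cutree() are base R, and Ward linkage is one assumed choice among several.

#Hierarchical (agglomerative) clustering sketch on the PCA scores
hc <- hclust(dist(PCA_transform[, 1:2]), method = "ward.D2")
plot(hc, main = "Hierarchical Clustering Dendrogram")
hc_groups <- cutree(hc, k = 2)   #cut the dendrogram into two groups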

The researcher decided to use k-means clustering for this analysis. The within-groups sum of squares (WSS) was used to determine the appropriate number of clusters, and this analysis indicated that 2 clusters should be used. Using the kmeans() function with 2 centers specified, the data was separated into 2 clusters, and the ggplot() function (Wickham, 2016) was used to visualize the k-means clustering.

#Use within groups sum of squares (WSS) to determine number of clusters

wss <- (nrow(PCA_transform)-1)*sum(apply(PCA_transform, 2, var))
for (i in 2:2) wss[i] <- sum(kmeans(PCA_transform,
                                     centers=i)$withinss)
plot(1:2, wss, type="b", xlab="Number of Clusters", 
     ylab="Within groups sum of squares")
title("Number of Clusters")

#K-Means Cluster Analysis

fit <- kmeans(PCA_transform, 2)
aggregate(PCA_transform, by=list(fit$cluster), FUN=mean)
##   Group.1       PC1           PC2
## 1       1  1.484615  1.804112e-16
## 2       2 -1.979487 -3.191891e-16
PCA_transform <- data.frame(PCA_transform, fit$cluster)
PCA_transform
##                          PC1           PC2 fit.cluster
## Exam No             1.484615  1.804112e-16           1
## List No             1.484615  1.804112e-16           1
## First Name          1.484615  1.804112e-16           1
## MI                 -1.979487 -3.191891e-16           2
## Last Name           1.484615  1.804112e-16           1
## Adj. FA             1.484615  1.804112e-16           1
## List Title Code     1.484615  1.804112e-16           1
## List Title Desc     1.484615  1.804112e-16           1
## Group No           -1.979487 -3.191891e-16           2
## List Agency Code    1.484615  1.804112e-16           1
## List Agency Desc    1.484615  1.804112e-16           1
## List Div Code      -1.979487 -3.191891e-16           2
## Published Date     -1.979487 -3.191891e-16           2
## Established Date    1.484615  1.804112e-16           1
## Anniversary Date    1.484615  1.804112e-16           1
## Extension Date     -1.979487 -3.191891e-16           2
## Veteran Credit     -1.979487 -3.191891e-16           2
## Parent Lgy Credit  -1.979487 -3.191891e-16           2
## Sibling Lgy Credit -1.979487 -3.191891e-16           2
## Residency Credit   -1.979487 -3.191891e-16           2
## ROWMISS             1.484615  1.804112e-16           1
PCA_transform$fit.cluster <- as.factor(PCA_transform$fit.cluster)
ggplot(PCA_transform, aes(x=PC1, y=PC2, color=fit.cluster)) + geom_point() + ggtitle("K-Means Clustering")

#Elbow method for k-means 

fviz_nbclust(PCA_transform[, c("PC1", "PC2")], kmeans, method = "wss", k.max =2) +
  geom_vline(xintercept = 2, linetype = 2)

#Silhouette method for k-means

fviz_nbclust(PCA_transform[, c("PC1", "PC2")], kmeans, method = 'silhouette', k.max =2)

Cluster Validation

After the clustering process is completed, it is important to assess how well the algorithm fits the dataset. This can be done through evaluation of either internal factors (based on the feature data) or external factors (based on known cluster labels). The researcher selected internal validation for this assignment. The most common internal validation methods are the Dunn index and the silhouette coefficient. The silhouette coefficient ranges from -1 to 1, with higher positive values indicating better clustering. The Dunn index ranges from 0 to infinity and is the ratio of the smallest between-cluster distance to the largest within-cluster distance, so larger values indicate better-separated clusters. The R output provided a silhouette value of 1 and a Dunn value of infinity, the highest values possible for these measures; the infinite Dunn value arises because every point within each cluster has identical principal component scores, making the largest within-cluster distance zero. This indicates that each point is well matched to its own cluster and poorly matched to the neighboring cluster.

kms <- list()
for (k in 1:2) {
  kms[[k]] <- kmeans(x = PCA_transform[, c("PC1", "PC2")], centers = k)
}
kms[[1]] <- NULL
#Silhouette index 

ds <- dist(PCA_transform[, c("PC1", "PC2")], method = "euclidean")
sapply(kms, function(km) {
  clusters <- km$cluster
  silhouette_result <- silhouette(clusters, ds)
  silhouette_width <- mean(silhouette_result[,"sil_width"])
  return(silhouette_width)
})
## [1] 1
#Dunn index

sapply(kms, function(km) dunn(ds, km$cluster))
## [1] Inf

References

Brock, G., Pihur, V., Datta, S., & Datta, S. (2008). clValid: An R package for cluster validation. Journal of Statistical Software, 25(4), 1–22. https://www.jstatsoft.org/v25/i04/

Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1–36. https://www.jstatsoft.org/v61/i06/

Ciaburro, G. (2018). Regression analysis with R: Design and develop statistical nodes to identify unique relationships within data. Packt Publishing.

Dowle, M., & Srinivasan, A. (2023). data.table: Extension of 'data.frame'. https://r-datatable.com, https://Rdatatable.gitlab.io/data.table, https://github.com/Rdatatable/data.table

Fandango, A., Idris, I., & Navlani, A. (2021). Python data analysis (3rd ed.). Packt Publishing.

Harmouch, M. (2021). 17 types of similarity and dissimilarity measures used in data science. Towards Data Science. Medium.

Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7), 1–47. doi:10.18637/jss.v045.i07

Kassambara, A., & Mundt, F. (2020). factoextra: Extract and visualize the results of multivariate data analyses. R package version 1.0.7. https://CRAN.R-project.org/package=factoextra

Tierney, N., & Cook, D. (2023). Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. Journal of Statistical Software, 105(7), 1–31. doi:10.18637/jss.v105.i07

Walesiak, M., & Dudek, A. (2020). The choice of variable normalization method in cluster analysis. In K. S. Soliman (Ed.), Education excellence and innovation management: A 2025 vision to sustain economic development during global challenges (pp. 325–340). ISBN 978-0-9998551-4-1.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V.,… Yutani, H. (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.