The Civil Service List dataset was selected for the analysis in this assignment. The dataset is housed on the NYC OpenData website, and the data is provided by the New York City Department of Citywide Administrative Services (DCAS) (NYC OpenData, 2016–2023). Per the website, the dataset consists of roughly 609,000 rows, one for each candidate who passed a civil service exam, with 20 columns of attributes collected for each row. A data dictionary on the website explains the type of data in each column. The dataset was created on June 14, 2016, and has been updated daily through July 21, 2023. A comma-separated values (CSV) version of the dataset was downloaded from the NYC OpenData website on July 25, 2023, for use in the analysis.
The data.table (Dowle & Srinivasan, 2023), naniar (Tierney & Cook, 2023), Amelia (Honaker et al., 2011), tidyverse (Wickham et al., 2019), factoextra (Kassambara & Mundt, 2020), ggplot2 (Wickham, 2016), clusterSim (Walesiak & Dudek, 2020), NbClust (Charrad et al., 2014), and clValid (Brock et al., 2008) packages were loaded for use in the analysis process.
Then, the CSV file was loaded into RStudio using the fread() function, with stringsAsFactors = TRUE so that character columns are read as factors.
#Load packages
library(data.table)
library(naniar)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ lubridate::second() masks data.table::second()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(clusterSim)
## Loading required package: cluster
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(NbClust)
library(clValid)
#Upload data set
CSL <- fread("CivilServiceList.csv", stringsAsFactors = T)
The dimensions of the dataset were confirmed using the str() function (608,574 rows of 20 variables). This function also provided the variable types: 13 of the variables are factors (i.e., categorical), 6 are numeric or integer, and 1 is logical. Reviewing the data dictionary and the dataset itself shows that the only two variables typed as numeric are List No and Adj. FA. List No is an assigned identifier, so it is actually categorical; Adj. FA is the score a candidate received on the civil service exam, making it a ratio-level numeric variable.
str(CSL)
## Classes 'data.table' and 'data.frame': 608574 obs. of 20 variables:
## $ Exam No : int 162 162 162 162 162 162 162 162 162 162 ...
## $ List No : num 897 898 899 900 901 902 903 904 905 906 ...
## $ First Name : Factor w/ 60239 levels "","#NAME?",".BRANDON",..: 9874 59444 49586 8080 56111 36966 20359 9032 23624 40960 ...
## $ MI : Factor w/ 39 levels "","*",",","-",..: 30 1 1 1 26 1 1 22 1 1 ...
## $ Last Name : Factor w/ 85831 levels "","A BARI","A'GARD",..: 14673 64122 84009 82445 26136 5129 57601 60832 82445 72915 ...
## $ Adj. FA : num 75 75 75 75 75 75 75 75 75 75 ...
## $ List Title Code : int 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
## $ List Title Desc : Factor w/ 509 levels "ACCOUNTANT","ADMINISTRATIVE ACCOUNTANT",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Group No : int 0 0 0 0 0 0 0 0 0 0 ...
## $ List Agency Code : int 0 0 0 0 0 0 0 0 0 0 ...
## $ List Agency Desc : Factor w/ 63 levels "ADMINISTRATION FOR CHILDREN'S SERVICES",..: 57 57 57 57 57 57 57 57 57 57 ...
## $ List Div Code : logi NA NA NA NA NA NA ...
## $ Published Date : Factor w/ 158 levels "","1/11/2023",..: 104 104 104 104 104 104 104 104 104 104 ...
## $ Established Date : Factor w/ 308 levels "","1/10/2018",..: 243 243 243 243 243 243 243 243 243 243 ...
## $ Anniversary Date : Factor w/ 312 levels "","1/10/2022",..: 246 246 246 246 246 246 246 246 246 246 ...
## $ Extension Date : Factor w/ 131 levels "","1/10/2024",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Veteran Credit : Factor w/ 3 levels "","Disabled Veteran's Credit",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Parent Lgy Credit : Factor w/ 2 levels "","Parent Legacy Credit": 1 1 1 1 1 1 1 1 1 1 ...
## $ Sibling Lgy Credit: Factor w/ 2 levels "","Sibling Legacy Credit": 1 1 1 1 1 1 1 1 1 1 ...
## $ Residency Credit : Factor w/ 2 levels "","Residency Credit": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Next, the extent of missing data was investigated. A cursory review of the dataset showed that many values are blank. Because blank cells were not registering as missing, all such values were converted to NA. The sum(is.na()) count revealed that 3,811,083 values are missing from the dataset (either blank or natively coded as NA), and the pct_miss() function (Tierney & Cook, 2023) showed that missing values account for 31.3% of all cells. The gg_miss_var() and vis_miss() functions (Tierney & Cook, 2023) were used to visualize the missing data, revealing that half of the variables contain missing values. A missingness map created with the missmap() function (Honaker et al., 2011) shows missing data by row and column and confirms that the missing values are dispersed throughout the dataset. A function provided by the instructor determined that 548,203 rows, about 90% of the dataset, are missing more than 20% of their values. Because these rows could skew any further analysis, they were removed before proceeding, reducing the dataset to 60,371 rows.
#Convert blank cells to NA
CSL[CSL==""] <- NA
#Count missing (NA) values
sum(is.na(CSL))
## [1] 3811083
#Find percent of rows with a missing value
pct_miss(CSL)
## [1] 31.31158
#Create missingness maps
gg_miss_var(CSL, show_pct = TRUE)
vis_miss(CSL, warn_large_data = FALSE)
missmap(CSL)
#Determine how many rows have more than 20% missing values
myr = function (x){
  temp = rep(0, nrow(x))
  for (i in 1:nrow(x)){
    # proportion of the row's values that are NA
    temp[i] = sum(is.na(x[i, 1:ncol(x)])) / ncol(x)
  }
  x$ROWMISS = temp               # append row-missingness as a new column
  x = x[order(-x$ROWMISS), ]     # sort rows by descending missingness
  return(x)
}
CSL=myr(CSL)
length(CSL$ROWMISS[CSL$ROWMISS>.2])
## [1] 548203
#Delete rows missing more than 20% of their data
CSL <- CSL[CSL$ROWMISS <= 0.2,]
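The same row-level missingness can be obtained without an explicit loop; below is a minimal vectorized sketch, assuming it is applied to the dataset before the ROWMISS column is appended:
#Vectorized alternative to myr(): is.na() returns a logical matrix,
#and rowMeans() gives each row's share of missing cells
row_miss <- rowMeans(is.na(CSL))
sum(row_miss > 0.2)   #count of rows missing more than 20% of their values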
The following research questions and associated hypotheses were formulated to guide the analysis:
Business Problem 1: The City of New York wants to determine if certain service titles perform better than others on the civil service exam.
RQ1: Is there a difference between the mean adjusted final average test score (Adj. FA) of each civil service title (List Title Desc)?
Business Problem 2: The City of New York wants to determine if certain positions are more likely to be listed as open and competitive rather than internal hires.
RQ2: Is there a significant relationship between civil service title (List Title Desc) and the name of an appointing Agency (List Agency Desc)?
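These research questions map onto standard procedures: a difference in group means suggests a one-way ANOVA, and a relationship between two categorical variables suggests a chi-square test of independence. A minimal sketch of how the tests could be run is below (illustrative only; the tests are not performed in this section, and the chi-square expected-count assumptions would need checking given the many title and agency levels):
#RQ1: one-way ANOVA of mean Adj. FA across civil service titles
rq1 <- aov(`Adj. FA` ~ `List Title Desc`, data = CSL)
summary(rq1)
#RQ2: chi-square test of independence between title and appointing agency
rq2 <- chisq.test(table(CSL$`List Title Desc`, CSL$`List Agency Desc`))
rq2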
Before initiating cluster analysis, dimensionality reduction was performed to simplify the dataset, remove redundant data, reduce complexity, and decrease the memory requirements for machine learning and processing (Fandango et al., 2021). The most common form of dimensionality reduction is Principal Component Analysis (PCA). PCA creates a new set of variables, called principal components, that are linear combinations of the original variables (Ciaburro, 2018). The number of principal components equals the number of observed variables, and the components are constructed to be uncorrelated with one another. The researcher performed PCA on the Civil Service List dataset using the process described on the Week 6 assignment page. After converting the factor variables to numeric codes, the data was standardized using the scale() function.
#Convert factor and character variables to numeric codes
CSL_numeric <- CSL %>%
mutate_if(is.factor, as.numeric) %>%
mutate_if(is.character, as.numeric)
#Standardize the data
CSL_std <- scale(CSL_numeric)
The covariance matrix of the standardized data was then calculated, showing the pairwise covariances between the variables (for standardized data, these equal the correlations).
CM <- cov(CSL_std)
CM
## Exam No List No First Name MI Last Name
## Exam No 1.000000000 -0.2071865531 -0.0031460953 NA 0.0027646361
## List No -0.207186553 1.0000000000 0.0037014121 NA -0.0003931556
## First Name -0.003146095 0.0037014121 1.0000000000 NA -0.0026091044
## MI NA NA NA NA NA
## Last Name 0.002764636 -0.0003931556 -0.0026091044 NA 1.0000000000
## Adj. FA -0.059685914 -0.6204438787 -0.0043181540 NA 0.0005135206
## List Title Code 0.053300733 0.0340223425 0.0041798864 NA -0.0022935020
## List Title Desc -0.797935517 0.1535134132 0.0048298757 NA -0.0005119026
## Group No NA NA NA NA NA
## List Agency Code 0.002038474 -0.0292024054 -0.0001370981 NA 0.0023356522
## List Agency Desc -0.010193877 0.0263417211 -0.0020896364 NA -0.0027308421
## List Div Code NA NA NA NA NA
## Published Date NA NA NA NA NA
## Established Date -0.355064581 -0.1808459988 -0.0008538785 NA -0.0052700467
## Anniversary Date -0.353600148 -0.1816577546 -0.0008584530 NA -0.0052778975
## Extension Date NA NA NA NA NA
## Veteran Credit NA NA NA NA NA
## Parent Lgy Credit NA NA NA NA NA
## Sibling Lgy Credit NA NA NA NA NA
## Residency Credit NA NA NA NA NA
## ROWMISS -0.061133728 0.1361137758 0.0097213780 NA -0.0049053981
## Adj. FA List Title Code List Title Desc Group No
## Exam No -0.0596859139 0.053300733 -0.7979355171 NA
## List No -0.6204438787 0.034022343 0.1535134132 NA
## First Name -0.0043181540 0.004179886 0.0048298757 NA
## MI NA NA NA NA
## Last Name 0.0005135206 -0.002293502 -0.0005119026 NA
## Adj. FA 1.0000000000 -0.004317786 0.0304335503 NA
## List Title Code -0.0043177855 1.000000000 0.0121841123 NA
## List Title Desc 0.0304335503 0.012184112 1.0000000000 NA
## Group No NA NA NA NA
## List Agency Code -0.0332860740 -0.049452302 -0.0269734926 NA
## List Agency Desc 0.0333247135 0.029945020 0.0338396652 NA
## List Div Code NA NA NA NA
## Published Date NA NA NA NA
## Established Date -0.0619768056 0.031309310 0.1860499339 NA
## Anniversary Date -0.0622259869 0.032023496 0.1849106853 NA
## Extension Date NA NA NA NA
## Veteran Credit NA NA NA NA
## Parent Lgy Credit NA NA NA NA
## Sibling Lgy Credit NA NA NA NA
## Residency Credit NA NA NA NA
## ROWMISS -0.1633637997 -0.011706350 0.0689629116 NA
## List Agency Code List Agency Desc List Div Code
## Exam No 0.0020384740 -0.010193877 NA
## List No -0.0292024054 0.026341721 NA
## First Name -0.0001370981 -0.002089636 NA
## MI NA NA NA
## Last Name 0.0023356522 -0.002730842 NA
## Adj. FA -0.0332860740 0.033324713 NA
## List Title Code -0.0494523019 0.029945020 NA
## List Title Desc -0.0269734926 0.033839665 NA
## Group No NA NA NA
## List Agency Code 1.0000000000 -0.928994432 NA
## List Agency Desc -0.9289944324 1.000000000 NA
## List Div Code NA NA NA
## Published Date NA NA NA
## Established Date 0.0867537963 -0.089740031 NA
## Anniversary Date 0.0869566960 -0.089965996 NA
## Extension Date NA NA NA
## Veteran Credit NA NA NA
## Parent Lgy Credit NA NA NA
## Sibling Lgy Credit NA NA NA
## Residency Credit NA NA NA
## ROWMISS 0.0052330844 -0.004705626 NA
## Published Date Established Date Anniversary Date
## Exam No NA -0.3550645810 -0.353600148
## List No NA -0.1808459988 -0.181657755
## First Name NA -0.0008538785 -0.000858453
## MI NA NA NA
## Last Name NA -0.0052700467 -0.005277897
## Adj. FA NA -0.0619768056 -0.062225987
## List Title Code NA 0.0313093101 0.032023496
## List Title Desc NA 0.1860499339 0.184910685
## Group No NA NA NA
## List Agency Code NA 0.0867537963 0.086956696
## List Agency Desc NA -0.0897400312 -0.089965996
## List Div Code NA NA NA
## Published Date NA NA NA
## Established Date NA 1.0000000000 0.999990058
## Anniversary Date NA 0.9999900583 1.000000000
## Extension Date NA NA NA
## Veteran Credit NA NA NA
## Parent Lgy Credit NA NA NA
## Sibling Lgy Credit NA NA NA
## Residency Credit NA NA NA
## ROWMISS NA 0.0015901582 0.001406520
## Extension Date Veteran Credit Parent Lgy Credit
## Exam No NA NA NA
## List No NA NA NA
## First Name NA NA NA
## MI NA NA NA
## Last Name NA NA NA
## Adj. FA NA NA NA
## List Title Code NA NA NA
## List Title Desc NA NA NA
## Group No NA NA NA
## List Agency Code NA NA NA
## List Agency Desc NA NA NA
## List Div Code NA NA NA
## Published Date NA NA NA
## Established Date NA NA NA
## Anniversary Date NA NA NA
## Extension Date NA NA NA
## Veteran Credit NA NA NA
## Parent Lgy Credit NA NA NA
## Sibling Lgy Credit NA NA NA
## Residency Credit NA NA NA
## ROWMISS NA NA NA
## Sibling Lgy Credit Residency Credit ROWMISS
## Exam No NA NA -0.061133728
## List No NA NA 0.136113776
## First Name NA NA 0.009721378
## MI NA NA NA
## Last Name NA NA -0.004905398
## Adj. FA NA NA -0.163363800
## List Title Code NA NA -0.011706350
## List Title Desc NA NA 0.068962912
## Group No NA NA NA
## List Agency Code NA NA 0.005233084
## List Agency Desc NA NA -0.004705626
## List Div Code NA NA NA
## Published Date NA NA NA
## Established Date NA NA 0.001590158
## Anniversary Date NA NA 0.001406520
## Extension Date NA NA NA
## Veteran Credit NA NA NA
## Parent Lgy Credit NA NA NA
## Sibling Lgy Credit NA NA NA
## Residency Credit NA NA NA
## ROWMISS NA NA 1.000000000
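The NA entries in the matrix trace back to columns that are constant after filtering (zero variance, so scale() divides by zero) or that still contain missing values. A short diagnostic sketch, assuming CSL_numeric as created above:
#Columns with zero or undefined variance (scale() turns these into NaN)
sapply(CSL_numeric, function(col) sd(col, na.rm = TRUE))
#Remaining missing values per column
colSums(is.na(CSL_numeric))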
Next, the non-finite entries of the covariance matrix were flagged in a logical indicator matrix, and the eigenvalues and eigenvectors of that matrix were computed to determine the principal components. The eigenvalues are returned in decreasing order, with the largest (and most informative) value in the first position. Looking at the output, only the first eigenvalue is of any real magnitude.
#Flag non-finite values in a logical indicator matrix
IV <- !is.finite(CM)
#Compute eigenvalues and eigenvectors
eigen(IV)
## eigen() decomposition
## $values
## [1] 1.582475e+01 7.105427e-15 1.289856e-15 4.242455e-17 2.290481e-17
## [6] 3.266109e-31 1.476411e-31 2.960189e-33 -2.138212e-50 -2.963138e-35
## [11] -1.725434e-34 -2.677969e-34 -4.213676e-34 -5.452279e-34 -1.912041e-33
## [16] -2.345255e-33 -2.684481e-33 -1.527345e-32 -1.625985e-32 -5.992561e-17
## [21] -6.824752e+00
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.1584614 9.574271e-01 0.00000000 0.00000000 0.000000000
## [2,] -0.1584614 -8.703883e-02 0.95162788 -0.03086802 -0.001264224
## [3,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [4,] -0.2786236 1.769418e-16 0.02832989 -0.20578680 0.656112886
## [5,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [6,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [7,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [8,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [9,] -0.2786236 1.873501e-16 0.02117221 0.33440355 -0.263312067
## [10,] -0.1584614 -8.703883e-02 -0.09313446 0.44758630 0.378508586
## [11,] -0.1584614 -8.703883e-02 -0.09313446 0.44758630 0.378508586
## [12,] -0.2786236 1.873501e-16 0.02117221 0.33440355 -0.263312067
## [13,] -0.2786236 1.873501e-16 0.02117221 0.33440355 -0.263312067
## [14,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [15,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [16,] -0.2786236 1.769418e-16 -0.01836930 -0.15948477 0.026764663
## [17,] -0.2786236 1.769418e-16 -0.01836930 -0.15948477 0.026764663
## [18,] -0.2786236 1.769418e-16 -0.01836930 -0.15948477 0.026764663
## [19,] -0.2786236 1.769418e-16 -0.01836930 -0.15948477 0.026764663
## [20,] -0.2786236 1.769418e-16 -0.01836930 -0.15948477 0.026764663
## [21,] -0.1584614 -8.703883e-02 -0.09566987 -0.10803807 -0.094469118
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [2,] 2.081668e-17 2.428613e-17 4.336809e-18 -1.432589e-19 0.000000e+00
## [3,] -3.621829e-01 -1.278052e-01 -1.162316e-01 -3.144580e-16 2.174025e-02
## [4,] -8.510553e-15 1.262300e-15 -1.598837e-16 2.105475e-17 2.818926e-18
## [5,] 7.475435e-01 4.730938e-01 -1.017728e-01 -5.654966e-17 3.936494e-03
## [6,] -4.846581e-02 -3.046925e-02 -3.234329e-01 -2.073408e-16 9.785553e-02
## [7,] -1.353645e-01 4.919788e-02 3.398435e-01 -1.669158e-16 -2.855844e-03
## [8,] -1.343773e-01 4.487242e-02 2.154209e-01 9.378936e-16 -1.786311e-01
## [9,] 2.930594e-01 -5.414785e-01 -2.960161e-01 1.458759e-16 -3.380964e-02
## [10,] -7.940418e-02 4.826569e-02 -3.584971e-01 -1.319789e-16 -9.185617e-02
## [11,] 7.940418e-02 -4.826569e-02 3.584971e-01 1.176262e-16 9.185617e-02
## [12,] -2.782668e-01 5.476191e-01 6.095945e-02 5.671192e-17 7.716847e-03
## [13,] -1.479265e-02 -6.140549e-03 2.350567e-01 -1.708329e-16 2.609279e-02
## [14,] -5.348429e-02 -1.026274e-01 5.024846e-04 -2.932878e-15 -6.489960e-01
## [15,] -5.447684e-02 -1.010107e-01 6.300200e-03 2.906006e-15 7.165993e-01
## [16,] -1.878405e-01 2.689667e-01 -4.704553e-01 -8.402258e-17 3.883036e-03
## [17,] 1.932995e-01 -1.177718e-01 2.107077e-01 -6.969119e-17 -1.878714e-02
## [18,] 4.813878e-02 -4.937525e-02 1.402936e-01 -7.071068e-01 -2.089964e-02
## [19,] 4.813878e-02 -4.937525e-02 1.402936e-01 7.071068e-01 -2.089964e-02
## [20,] -1.017366e-01 -5.244446e-02 -2.083950e-02 8.364875e-17 5.670338e-02
## [21,] 4.080817e-02 -2.052515e-01 -2.062976e-02 -1.483033e-16 -9.648549e-03
## [,11] [,12] [,13] [,14] [,15]
## [1,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [2,] 0.000000e+00 4.857226e-17 6.938894e-18 6.938894e-18 1.387779e-17
## [3,] -1.957526e-01 -9.229729e-02 -7.566238e-02 -3.362651e-02 -1.673929e-01
## [4,] -2.341877e-16 1.454277e-16 -4.625929e-17 -9.092842e-17 3.411623e-17
## [5,] 4.557656e-02 5.437650e-03 -1.400016e-02 -7.676336e-03 -1.669763e-01
## [6,] 2.200152e-01 2.466286e-01 -3.221114e-01 1.100734e-01 6.420648e-01
## [7,] -1.932778e-01 5.029186e-01 -1.214690e-01 2.088319e-01 -6.126425e-03
## [8,] 1.283180e-01 6.208259e-03 7.344530e-01 -5.019543e-03 3.401539e-02
## [9,] -1.509983e-01 2.005477e-01 1.183857e-01 6.036524e-02 -1.403571e-01
## [10,] 1.929889e-01 9.980747e-02 2.862150e-01 -1.501833e-01 1.141746e-01
## [11,] -1.929889e-01 -9.980747e-02 -2.862150e-01 1.501833e-01 -1.141746e-01
## [12,] -1.360076e-02 -7.233859e-02 -1.941096e-02 7.518755e-02 7.629088e-02
## [13,] 1.645991e-01 -1.282091e-01 -9.897479e-02 -1.355528e-01 6.406620e-02
## [14,] 2.153423e-02 -1.408302e-01 -3.211033e-01 -4.412032e-01 -1.590690e-01
## [15,] 2.334661e-03 -1.034009e-01 3.988182e-02 -4.582360e-01 -1.607409e-01
## [16,] -4.466598e-01 -1.080047e-01 -6.489938e-03 7.187592e-02 -1.625901e-01
## [17,] -1.876594e-01 -4.078727e-01 6.269199e-02 -1.838886e-01 5.475305e-01
## [18,] -3.441606e-02 3.123440e-01 5.429172e-02 -3.280974e-02 -3.891494e-02
## [19,] -3.441606e-02 3.123440e-01 5.429172e-02 -3.280974e-02 -3.891494e-02
## [20,] 7.031513e-01 -1.088105e-01 -1.647855e-01 1.776322e-01 -3.071105e-01
## [21,] -2.874824e-02 -4.246647e-01 8.001149e-02 6.268562e-01 -1.577477e-02
## [,16] [,17] [,18] [,19] [,20]
## [1,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.00000000
## [2,] -4.163336e-17 -5.551115e-17 -6.245005e-17 2.775558e-17 0.05040684
## [3,] -1.160134e-01 3.383423e-02 7.283654e-01 2.944031e-01 -0.06589041
## [4,] -2.168404e-16 3.019864e-16 -4.857226e-16 5.678328e-16 -0.64440203
## [5,] -3.078561e-03 1.611531e-02 1.854308e-01 1.307335e-01 -0.06589041
## [6,] -2.821067e-01 1.779864e-01 -9.332400e-02 -1.462595e-02 -0.06589041
## [7,] 5.922298e-01 -1.533947e-02 -1.327624e-01 1.465281e-01 -0.06589041
## [8,] -2.168958e-01 3.848809e-01 -1.461515e-01 5.448951e-02 -0.06589041
## [9,] 1.148483e-01 2.497182e-01 3.802186e-02 -1.464278e-01 -0.20153158
## [10,] 2.648683e-01 -3.350035e-01 6.204837e-02 5.881283e-02 0.23835822
## [11,] -2.648683e-01 3.350035e-01 -6.204837e-02 -5.881283e-02 0.23835822
## [12,] 4.290728e-02 5.185503e-03 1.760967e-01 -4.852071e-01 -0.20153158
## [13,] -1.577556e-01 -2.549037e-01 -2.141186e-01 6.316350e-01 -0.20153158
## [14,] 6.263552e-03 -5.785291e-02 -2.027772e-01 -2.292135e-01 -0.06589041
## [15,] 4.910551e-02 -8.718508e-02 -2.013332e-01 -2.227591e-01 -0.06589041
## [16,] -1.693246e-02 1.567279e-01 -3.907922e-01 2.250485e-01 0.24979936
## [17,] 3.447310e-01 1.160106e-01 1.805678e-01 8.098752e-04 0.24979936
## [18,] -2.779247e-01 -2.680493e-01 8.162968e-02 -1.228333e-01 0.24979936
## [19,] -2.779247e-01 -2.680493e-01 8.162968e-02 -1.228333e-01 0.24979936
## [20,] 2.280510e-01 2.633601e-01 4.696502e-02 1.980819e-02 0.24979936
## [21,] -2.950436e-02 -4.524394e-01 -1.374478e-01 -1.595557e-01 -0.06589041
## [,21]
## [1,] -0.2412951
## [2,] -0.2412951
## [3,] -0.2412951
## [4,] 0.1829755
## [5,] -0.2412951
## [6,] -0.2412951
## [7,] -0.2412951
## [8,] -0.2412951
## [9,] 0.1829755
## [10,] -0.2412951
## [11,] -0.2412951
## [12,] 0.1829755
## [13,] 0.1829755
## [14,] -0.2412951
## [15,] -0.2412951
## [16,] 0.1829755
## [17,] 0.1829755
## [18,] 0.1829755
## [19,] 0.1829755
## [20,] 0.1829755
## [21,] -0.2412951
#Perform PCA to reduce dimensionality
PCA <- prcomp(IV, retx = TRUE, center = TRUE, scale. = FALSE,
tol = NULL, rank. = NULL)
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## Standard deviation 1.757 8.147e-17 0 0 0 0 0 0 0 0 0
## Proportion of Variance 1.000 0.000e+00 0 0 0 0 0 0 0 0 0
## Cumulative Proportion 1.000 1.000e+00 1 1 1 1 1 1 1 1 1
## PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0 0 0 0 0 0 0 0 0 0
## Proportion of Variance 0 0 0 0 0 0 0 0 0 0
## Cumulative Proportion 1 1 1 1 1 1 1 1 1 1
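Note that IV <- !is.finite(CM) produces a logical indicator matrix of the non-finite entries rather than a covariance matrix with those entries removed, so the eigendecomposition and PCA above effectively summarize the missingness pattern of CM. A hedged alternative sketch, assuming the intent was to run PCA on the complete, non-constant numeric columns:
#Sketch: keep only complete, non-constant columns, then run PCA directly
#(the bookkeeping ROWMISS column could also be dropped before this step)
clean <- as.data.frame(CSL_numeric)
ok <- sapply(clean, function(col) !anyNA(col) && sd(col) > 0)
PCA_clean <- prcomp(clean[, ok], center = TRUE, scale. = TRUE)
summary(PCA_clean)   #variance explained by each component of the cleaned data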
Calculating the proportion of the total variance explained by each principal component shows that the first component accounts for essentially all of the variance.
var <- PCA$sdev^2 / sum(PCA$sdev^2)
print(var)
## [1] 1.000000e+00 2.151216e-33 0.000000e+00 0.000000e+00 0.000000e+00
## [6] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [16] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [21] 0.000000e+00
A scree plot was created to depict the variance explained by each principal component.
Scree_data <- data.frame(Principal_Component = 1:length(var), Variance_Explained = var)
ggplot(Scree_data, aes(x = Principal_Component, y = Variance_Explained)) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot") +
scale_y_continuous(limits = c(0, 1))
A biplot was also created to show the distribution of the data points and the variable loadings on the first two principal components.
biplot(PCA, scale=0, cex=1)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## (identical warning emitted 9 times)
The last step in PCA is to determine which components to retain for analysis. Although PC1 accounts for essentially all of the variance in the dataset, at least two dimensions are needed for the two-dimensional cluster visualization that follows, so both PC1 and PC2 were retained.
PCA_transform <- as.matrix(-PCA$x[,1:2])
Clustering is a method used to group similar items together, usually by assessing shared characteristics or features of the data, in order to discover hidden patterns in the dataset. Clustering is a type of unsupervised learning, meaning that structure is gleaned from observation rather than from labeled examples (Fandango et al., 2021).
Another important aspect of clustering is the choice of similarity measure. The most common similarity measures are the Euclidean, Manhattan, Minkowski, Hamming, and Pearson correlation distances. For numeric data, the Euclidean distance is typically used; since the principal component scores are numeric, this measure is appropriate here.
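For illustration, base R's dist() function implements several of these metrics; a minimal sketch on the first few rows of the component scores:
#Euclidean (default) and Manhattan distances between the first five score rows
head_scores <- PCA_transform[1:5, ]
dist(head_scores, method = "euclidean")
dist(head_scores, method = "manhattan")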
Once a similarity measure has been selected, the appropriate number of clusters should be determined. The primary methods for this are the elbow method, the silhouette method, and the gap statistic. The elbow method examines the within-cluster sum of squares as the number of clusters increases, typically by inspecting the visualization for the point where improvement levels off. The silhouette method assigns each candidate number of clusters an average silhouette score between -1 and 1, with higher values indicating better-separated clusters; this is also best judged from the corresponding visualization.
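The gap statistic is not computed later in this section, but the cluster package (already loaded as a dependency of clusterSim) provides it; a hedged sketch, with K.max and B chosen purely for illustration:
#Gap statistic: compares observed within-cluster dispersion to that expected
#under a null reference distribution, for k = 1..K.max
set.seed(123)   #the bootstrap reference sets are random
gap <- clusGap(PCA_transform, FUNcluster = kmeans, K.max = 5, B = 50)
fviz_gap_stat(gap)   #factoextra visualization of the gap curve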
The grouping method is a key decision point in clustering analysis. The three main grouping methods are partitional, hierarchical, and density-based. Partitional grouping is most commonly accomplished with the k-means algorithm, which seeks to minimize the sum of squared errors within clusters; this method is only applicable to continuous variables. Hierarchical clustering groups data in either a bottom-up (agglomerative) or top-down (divisive) manner. Both partitional and hierarchical clustering require the number of clusters to be specified before grouping begins. Unlike the other two methods, density-based clustering does not require the number of clusters up front; it instead examines the density of data points, using the minimum number of points within a specified radius (Fandango et al., 2021). A sketch of the hierarchical approach follows.
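As a point of comparison with the k-means approach used below, agglomerative hierarchical clustering needs only a distance matrix and is available in base R; a minimal sketch (density-based clustering would require an additional package such as dbscan, which is not loaded here):
#Agglomerative (bottom-up) hierarchical clustering with Ward linkage
hc <- hclust(dist(PCA_transform, method = "euclidean"), method = "ward.D2")
plot(hc, main = "Hierarchical Clustering Dendrogram")   #inspect merge heights
hc_groups <- cutree(hc, k = 2)   #cut the dendrogram into 2 groups
table(hc_groups)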
The researcher decided to use k-means clustering for this analysis. The within groups sum of squares (WSS) was used to determine the appropriate number of clusters, and this analysis indicated that 2 clusters should be used. Using the kmeans() function in R with 2 clusters specified, the data was partitioned into the 2 clusters, and the ggplot() function (Wickham, 2016) was used to visualize the result.
#Use within groups sum of squares (WSS) to determine number of clusters
wss <- (nrow(PCA_transform)-1)*sum(apply(PCA_transform, 2, var))   #WSS for k = 1
for (i in 2:2) wss[i] <- sum(kmeans(PCA_transform,
                                    centers=i)$withinss)           #WSS for k = 2
plot(1:2, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
title("Number of Clusters")
#K-Means Cluster Analysis
fit <- kmeans(PCA_transform, 2)
aggregate(PCA_transform, by=list(fit$cluster), FUN=mean)
## Group.1 PC1 PC2
## 1 1 1.484615 1.804112e-16
## 2 2 -1.979487 -3.191891e-16
PCA_transform <- data.frame(PCA_transform, fit$cluster)
PCA_transform
## PC1 PC2 fit.cluster
## Exam No 1.484615 1.804112e-16 1
## List No 1.484615 1.804112e-16 1
## First Name 1.484615 1.804112e-16 1
## MI -1.979487 -3.191891e-16 2
## Last Name 1.484615 1.804112e-16 1
## Adj. FA 1.484615 1.804112e-16 1
## List Title Code 1.484615 1.804112e-16 1
## List Title Desc 1.484615 1.804112e-16 1
## Group No -1.979487 -3.191891e-16 2
## List Agency Code 1.484615 1.804112e-16 1
## List Agency Desc 1.484615 1.804112e-16 1
## List Div Code -1.979487 -3.191891e-16 2
## Published Date -1.979487 -3.191891e-16 2
## Established Date 1.484615 1.804112e-16 1
## Anniversary Date 1.484615 1.804112e-16 1
## Extension Date -1.979487 -3.191891e-16 2
## Veteran Credit -1.979487 -3.191891e-16 2
## Parent Lgy Credit -1.979487 -3.191891e-16 2
## Sibling Lgy Credit -1.979487 -3.191891e-16 2
## Residency Credit -1.979487 -3.191891e-16 2
## ROWMISS 1.484615 1.804112e-16 1
PCA_transform$fit.cluster <- as.factor(PCA_transform$fit.cluster)
ggplot(PCA_transform, aes(x = PC1, y = PC2, color = fit.cluster)) +
  geom_point() +
  ggtitle("K-Means Clustering")
#Elbow method for k-means (numeric columns only, since fit.cluster is a factor)
fviz_nbclust(PCA_transform[, c("PC1", "PC2")], kmeans, method = "wss", k.max = 2) +
  geom_vline(xintercept = 2, linetype = 2)
#Silhouette method for k-means
fviz_nbclust(PCA_transform[, c("PC1", "PC2")], kmeans, method = "silhouette", k.max = 2)
After the clustering process is completed, it is important for a researcher to assess the fit of the algorithm to the dataset. This can be done by evaluating either internal factors (based on the feature data) or external factors (based on known cluster labels). The researcher selected internal validation for this assignment. The most common internal validation methods are the Dunn index and the silhouette coefficient. The silhouette coefficient ranges from -1 to 1, with higher positive values designating better clustering. The Dunn index ranges from 0 to infinity, with larger values indicating compact, well-separated clusters. The R output provided a silhouette value of 1 and a Dunn value of infinity, the highest values possible for these measures. This indicates that each value in a cluster is well matched to its own cluster and poorly matched to any neighboring cluster.
kms <- list()
for (k in 1:2) {
  kms[[k]] <- kmeans(x = PCA_transform[, c("PC1", "PC2")], centers = k)
}
kms[[1]] <- NULL   #drop the k = 1 solution; only k = 2 is validated
#Silhouette index
ds <- dist(PCA_transform[, c("PC1", "PC2")], method = "euclidean")
sapply(kms, function(km) {
clusters <- km$cluster
silhouette_result <- silhouette(clusters, ds)
silhouette_width <- mean(silhouette_result[,"sil_width"])
return(silhouette_width)
})
## [1] 1
#Dunn index
sapply(kms, function(km) dunn(ds, km$cluster))
## [1] Inf
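The clValid package also provides a wrapper that computes several internal validation measures (connectivity, Dunn index, silhouette width) across methods and cluster counts in a single call; a hedged sketch, with the methods and cluster range chosen for illustration:
#Internal validation across k-means and hierarchical solutions for 2-3 clusters;
#clValid requires a numeric matrix with row names, which PCA_transform provides
cv <- clValid(as.matrix(PCA_transform[, c("PC1", "PC2")]), nClust = 2:3,
              clMethods = c("kmeans", "hierarchical"), validation = "internal")
summary(cv)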
Brock, G., Pihur, V., Datta, S., & Datta, S. (2008). clValid: An R package for cluster validation. Journal of Statistical Software, 25(4), 1–22. https://www.jstatsoft.org/v25/i04/
Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1–36. https://www.jstatsoft.org/v61/i06/
Ciaburro, G. (2018). Regression analysis with R: Design and develop statistical nodes to identify unique relationships within data. Packt Publishing.
Dowle, M., & Srinivasan, A. (2023). data.table: Extension of 'data.frame'. https://r-datatable.com, https://Rdatatable.gitlab.io/data.table, https://github.com/Rdatatable/data.table
Fandango, A., Idris, I., & Navlani, A. (2021). Python data analysis (3rd ed.). Packt Publishing.
Harmouch, M. (2021). 17 types of similarity and dissimilarity measures used in data science. Towards Data Science, Medium.
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7), 1–47. doi:10.18637/jss.v045.i07
Kassambara, A., & Mundt, F. (2020). factoextra: Extract and visualize the results of multivariate data analyses. R package version 1.0.7. https://CRAN.R-project.org/package=factoextra
Tierney, N., & Cook, D. (2023). Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. Journal of Statistical Software, 105(7), 1–31. doi:10.18637/jss.v105.i07
Walesiak, M., & Dudek, A. (2020). The choice of variable normalization method in cluster analysis. In K. S. Soliman (Ed.), Education excellence and innovation management: A 2025 vision to sustain economic development during global challenges (pp. 325–340). ISBN 978-0-9998551-4-1.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4. https://ggplot2.tidyverse.org