Dimension Reduction on Higly Dimensional Crime Data

In this unsupervised learning project, my aim is to explore the intricacies of community-level crime data using dimension reduction techniques. With a dataset from a public repository (source) comprising 125 predictive variables related to community characteristics and law enforcement, we seek to uncover patterns and relationships among these variables. The primary objective is to investigate whether dimension reduction methods, such as Principal Component Analysis (PCA) and Factor Analysis, can render the data more interpretable. By distilling the wealth of information into a reduced feature space, we aim to gain valuable insights into the factors influencing crime rates and foster a deeper understanding of community dynamics.

Libraries

library(tidyverse)
library(corrplot)
library(psych)
library(factoextra)
library(psy)
library(gridExtra)
library(rrcov)
library(GPArotation)
library(TIMP)
library(ggplot2)
library(cowplot)
library(ggpubr)
set.seed(77)

DATA SET

Reading the dataset as well as all modifications needed to use the data set from the repository and have appropriate column labels.

data<-read.csv("comm.txt",sep=",",header = FALSE)
names_text <- readLines("communities.names")
# Extract variable names
variable_names <- scan("communities.names", what = character(), sep = "\n")
names <- strsplit(variable_names, "\n")
names1 <- names[58:204]

After reading the dataset into R, the next step involves data preprocessing and exploration. We have excluded non-predictive variables such as ‘countyCode,’ ‘communityCode,’ ‘State,’ ‘communityname,’ and ‘fold.’ Subsequently, we have replaced any missing or “?” values with NA and converted the entire dataset to numeric format. The following R code provides a brief overview of the dataset using the str() function:

data1 <- data[, !names(data) %in% c("countyCode","communityCode","State","communityname","fold")]#deleting strings and nominal values
df <- mutate_all(data1, ~ifelse(. == "?", NA, .)) 
df <- mutate_all(df, as.numeric)
str(df)

## 'data.frame':    2215 obs. of  142 variables:
##  $ pop                : num  11980 23123 29344 16656 11245 ...
##  $ perHoush           : num  3.1 2.82 2.43 2.4 2.76 2.45 2.6 2.45 2.46 2.62 ...
##  $ pctBlack           : num  1.37 0.8 0.74 1.7 0.53 ...
##  $ pctWhite           : num  91.8 95.6 94.3 97.3 89.2 ...
##  $ pctAsian           : num  6.5 3.44 3.43 0.5 1.17 0.9 1.47 0.4 1.25 0.92 ...
##  $ pctHisp            : num  1.88 0.85 2.35 0.7 0.52 ...
##  $ pct12-21           : num  12.5 11 11.4 12.6 24.5 ...
##  $ pct12-29           : num  21.4 21.3 25.9 25.2 40.5 ...
##  $ pct16-24           : num  10.9 10.5 11 12.2 28.7 ...
##  $ pct65up            : num  11.3 17.2 10.3 17.6 12.6 ...
##  $ persUrban          : num  11980 23123 29344 0 0 ...
##  $ pctUrban           : num  100 100 100 0 0 100 100 100 100 100 ...
##  $ medIncome          : num  75122 47917 35669 20580 17390 ...
##  $ pctWwage           : num  89.2 79 82 68.2 69.3 ...
##  $ pctWfarm           : num  1.55 1.11 1.15 0.24 0.55 1 0.39 0.67 2.93 0.86 ...
##  $ pctWdiv            : num  70.2 64.1 55.7 39 42.8 ...
##  $ pctWsocsec         : num  23.6 35.5 22.2 39.5 32.2 ...
##  $ pctPubAsst         : num  1.03 2.75 2.94 11.71 11.21 ...
##  $ pctRetire          : num  18.4 22.9 14.6 18.3 14.4 ...
##  $ medFamIncome       : num  79584 55323 42112 26501 24018 ...
##  $ perCapInc          : num  29711 20148 16946 10810 8483 ...
##  $ whitePerCap        : num  30233 20191 17103 10909 9009 ...
##  $ blackPerCap        : num  13600 18137 16644 9984 887 ...
##  $ NAperCap           : num  5725 0 21606 4941 4425 ...
##  $ asianPerCap        : num  27101 20074 15528 3541 3352 ...
##  $ otherPerCap        : num  5115 5250 5954 2451 3000 ...
##  $ hispPerCap         : num  22838 12222 8405 4391 1328 ...
##  $ persPoverty        : num  227 885 1389 2831 2855 ...
##  $ pctPoverty         : num  1.96 3.98 4.75 17.23 29.99 ...
##  $ pctLowEdu          : num  5.81 5.61 2.8 11.05 12.15 ...
##  $ pctNotHSgrad       : num  9.9 13.72 9.09 33.68 23.06 ...
##  $ pctCollGrad        : num  48.2 29.9 30.1 10.8 25.3 ...
##  $ pctUnemploy        : num  2.7 2.43 4.01 9.86 9.08 5.72 4.85 8.19 4.18 8.39 ...
##  $ pctEmploy          : num  64.5 62 69.8 54.7 52.4 ...
##  $ pctEmployMfg       : num  14.65 12.26 15.95 31.22 6.89 ...
##  $ pctEmployProfServ  : num  28.8 29.3 21.5 27.4 36.5 ...
##  $ pctOccupManu       : num  5.49 6.39 8.79 26.76 10.94 ...
##  $ pctOccupMgmt       : num  50.7 37.6 32.5 22.7 27.8 ...
##  $ pctMaleDivorc      : num  3.67 4.23 10.1 10.98 7.51 ...
##  $ pctMaleNevMar      : num  26.4 28 25.8 28.1 50.7 ...
##  $ pctFemDivorc       : num  5.22 6.45 14.76 14.47 11.64 ...
##  $ pctAllDivorc       : num  4.47 5.42 12.55 12.91 9.73 ...
##  $ persPerFam         : num  3.22 3.11 2.95 2.98 2.98 2.89 3.14 2.95 3 3.11 ...
##  $ pct2Par            : num  91.4 86.9 78.5 64 58.6 ...
##  $ pctKids2Par        : num  90.2 85.3 78.8 62.4 55.2 ...
##  $ pctKids-4w2Par     : num  95.8 96.8 92.4 65.4 66.5 ...
##  $ pct12-17w2Par      : num  95.8 86.5 75.7 67.4 79.2 ...
##  $ pctWorkMom-6       : num  44.6 51.1 66.1 59.6 61.2 ...
##  $ pctWorkMom-18      : num  58.9 62.4 74.2 70.3 68.9 ...
##  $ kidsBornNevrMarr   : num  31 43 164 561 402 ...
##  $ pctKidsBornNevrMarr: num  0.36 0.24 0.88 3.84 4.7 1.58 1.18 4.66 1.64 4.71 ...
##  $ numForeignBorn     : num  1277 1920 1468 339 196 ...
##  $ pctFgnImmig-3      : num  8.69 5.21 16.42 13.86 46.94 ...
##  $ pctFgnImmig-5      : num  13 8.65 23.98 13.86 56.12 ...
##  $ pctFgnImmig-8      : num  21 13.3 32.1 15.3 67.9 ...
##  $ pctFgnImmig-10     : num  30.9 22.5 35.6 15.3 69.9 ...
##  $ pctImmig-3         : num  0.93 0.43 0.82 0.28 0.82 0.32 1.05 0.11 0.47 0.72 ...
##  $ pctImmig-5         : num  1.39 0.72 1.2 0.28 0.98 0.45 1.49 0.2 0.67 1.07 ...
##  $ pctImmig-8         : num  2.24 1.11 1.61 0.31 1.18 0.57 2.2 0.25 0.93 1.63 ...
##  $ pctImmig-10        : num  3.3 1.87 1.78 0.31 1.22 0.68 2.55 0.29 1.07 2.31 ...
##  $ pctSpeakOnlyEng    : num  85.7 87.8 93.1 95 94.6 ...
##  $ pctNotSpeakEng     : num  1.37 1.81 1.14 0.56 0.39 0.6 0.6 0.28 0.43 2.51 ...
##  $ pctLargHousFam     : num  4.81 4.25 2.97 3.93 5.23 3.08 5.08 3.85 2.59 6.7 ...
##  $ pctLargHous        : num  4.17 3.34 2.05 2.56 3.11 1.92 3.46 2.55 1.54 4.1 ...
##  $ persPerOccupHous   : num  2.99 2.7 2.42 2.37 2.35 2.28 2.55 2.36 2.32 2.45 ...
##  $ persPerOwnOccup    : num  3 2.83 2.69 2.51 2.55 2.37 2.89 2.42 2.77 2.47 ...
##  $ persPerRenterOccup : num  2.84 1.96 2.06 2.2 2.12 2.16 2.09 2.27 1.91 2.44 ...
##  $ pctPersOwnOccup    : num  91.5 89 64.2 58.2 58.1 ...
##  $ pctPopDenseHous    : num  0.39 1.01 2.03 1.21 2.94 2.11 1.47 1.9 1.67 6.14 ...
##  $ pctSmallHousUnits  : num  11.1 23.6 47.5 45.7 55.6 ...
##  $ medNumBedrm        : num  3 3 3 3 2 2 3 2 2 2 ...
##  $ houseVacant        : num  64 240 544 669 333 ...
##  $ pctHousOccup       : num  98.4 97.2 95.7 91.2 92.5 ...
##  $ pctHousOwnerOccup  : num  91 84.9 57.8 54.9 53.6 ...
##  $ pctVacantBoarded   : num  3.12 0 0.92 2.54 3.9 2.09 1.41 6.39 0.45 5.64 ...
##  $ pctVacant6up       : num  37.5 18.33 7.54 57.85 42.64 ...
##  $ medYrHousBuilt     : num  1959 1958 1976 1939 1958 ...
##  $ pctHousWOphone     : num  0 0.31 1.55 7 7.45 ...
##  $ pctHousWOplumb     : num  0.28 0.14 0.12 0.87 0.82 0.31 0.28 0.49 0.19 0.33 ...
##  $ ownHousLowQ        : num  215900 136300 74700 36400 30600 ...
##  $ ownHousMed         : num  262600 164200 90400 49600 43200 ...
##  $ ownHousUperQ       : num  326900 199900 112000 66500 59500 ...
##  $ ownHousQrange      : num  111000 63600 37300 30100 28900 35400 60400 26100 39200 38800 ...
##  $ rentLowQ           : num  685 467 370 195 202 215 463 186 241 192 ...
##  $ rentMed            : num  1001 560 428 250 283 ...
##  $ rentUpperQ         : num  1001 672 520 309 362 ...
##  $ rentQrange         : num  316 205 150 114 160 134 361 139 146 177 ...
##  $ medGrossRent       : num  1001 627 484 333 332 ...
##  $ medRentpctHousInc  : num  23.8 27.6 24.1 28.7 32.2 26.4 24.4 26.3 25.2 29.6 ...
##  $ medOwnCostpct      : num  21.1 20.7 21.7 20.6 23.2 17.3 20.8 15.1 20.7 19.4 ...
##  $ medOwnCostPctWO    : num  14 12.5 11.6 14.5 12.9 11.7 12.5 12.2 12.8 13 ...
##  $ persEmergShelt     : num  11 0 16 0 2 327 0 21 125 43 ...
##  $ persHomeless       : num  0 0 0 0 0 4 0 0 15 4 ...
##  $ pctForeignBorn     : num  10.66 8.3 5 2.04 1.74 ...
##  $ pctBornStateResid  : num  53.7 77.2 44.8 88.7 73.8 ...
##  $ pctSameHouse-5     : num  65.3 71.3 36.6 56.7 42.2 ...
##  $ pctSameCounty-5    : num  78.1 90.2 61.3 90.2 60.3 ...
##  $ pctSameState-5     : num  89.1 96.1 82.8 96.2 89 ...
##  $ numPolice          : num  NA NA NA NA NA NA NA NA NA 198 ...
##   [list output truncated]

Population and Demographics: Variables related to population size, age distribution, race percentages, household size, and income. Social and Economic Factors: Employment rates, education levels, poverty metrics, immigration statistics, and language-related attributes. Housing Information: Housing-related metrics, such as median rent, ownership costs, number of bedrooms, and vacancy details. Law Enforcement: Police force characteristics, such as the number of police officers, budget, and deployment of special units. Crime Attributes (Potential Goals for Prediction): murders, murdPerPop, rapes, rapesPerPop, robberies, robbbPerPop, assaults, assaultPerPop, burglaries, burglPerPop, larcenies, larcPerPop, autoTheft, autoTheftPerPop, arsons, arsonsPerPop, ViolentCrimesPerPop, nonViolPerPop

By executing the provided R code, we aim to gain insights into the data structure, identify any missing values, and ensure that the variables are in numeric format for further analysis. This initial exploration sets the stage for subsequent steps in the project.

length(colnames(df)[colSums(is.na(df)) > 0])

## [1] 39

sum(apply(df, 1, anyNA))

## [1] 1913

filtered_data <- df %>%   select_if(~sum(is.na(.))/length(.) <= 0.2)
colSums(is.na(filtered_data))[colSums(is.na(filtered_data))>0]

##     otherPerCap           rapes     rapesPerPop 
##               1             208             208 
##       robberies     robbbPerPop        assaults 
##               1               1              13 
##   assaultPerPop      burglaries     burglPerPop 
##              13               3               3 
##       larcenies      larcPerPop       autoTheft 
##               3               3               3 
## autoTheftPerPop          arsons    arsonsPerPop 
##               3              91              91 
##   violentPerPop   nonViolPerPop 
##             221              97

length(colnames(filtered_data)[colSums(is.na(filtered_data)) > 0])

## [1] 17

sum(apply(filtered_data, 1, anyNA))

## [1] 314

In the provided R code, we perform several data preprocessing steps and examine the dataset’s missing values. First we calculate the number of columns (variables) and then rows that have missing values. The output indicates that there are 39 columns and 1913 rows with missing values. Then I filter the dataset to include only columns with less than or equal to 20% missing values, with the amount of information we have in this data set. The filtering step aims to retain columns with relatively low missing values, reducing the impact of missing data on subsequent analyses. The output provides a clear overview of the remaining missing values in the filtered dataset, allowing for further decision-making regarding imputation or exclusion of specific variables. We are left with 17 columns with missing data but only 314 rows. Now, let’s explore the summary statistics of the columns with missing values to determine an appropriate method for replacing the missing data:

summary(filtered_data[,colnames(filtered_data)[colSums(is.na(filtered_data)) > 0]])

##   otherPerCap         rapes          rapesPerPop    
##  Min.   :     0   Min.   :   0.00   Min.   :  0.00  
##  1st Qu.:  5528   1st Qu.:   2.00   1st Qu.: 11.54  
##  Median :  8186   Median :   7.00   Median : 26.92  
##  Mean   :  9443   Mean   :  28.05   Mean   : 36.26  
##  3rd Qu.: 11526   3rd Qu.:  19.00   3rd Qu.: 51.47  
##  Max.   :137000   Max.   :2818.00   Max.   :401.35  
##  NA's   :1        NA's   :208       NA's   :208     
##    robberies        robbbPerPop         assaults      
##  Min.   :    0.0   Min.   :   0.00   Min.   :    0.0  
##  1st Qu.:    5.0   1st Qu.:  27.65   1st Qu.:   18.0  
##  Median :   19.0   Median :  74.80   Median :   56.0  
##  Mean   :  237.9   Mean   : 162.61   Mean   :  326.5  
##  3rd Qu.:   70.0   3rd Qu.: 187.16   3rd Qu.:  180.0  
##  Max.   :86001.0   Max.   :2264.13   Max.   :62778.0  
##  NA's   :1         NA's   :1         NA's   :13       
##  assaultPerPop       burglaries       burglPerPop      
##  Min.   :   0.00   Min.   :    2.0   Min.   :   16.92  
##  1st Qu.:  94.19   1st Qu.:   95.0   1st Qu.:  511.69  
##  Median : 226.53   Median :  205.0   Median :  822.72  
##  Mean   : 378.00   Mean   :  761.2   Mean   : 1033.43  
##  3rd Qu.: 504.39   3rd Qu.:  508.0   3rd Qu.: 1350.23  
##  Max.   :4932.50   Max.   :99207.0   Max.   :11881.02  
##  NA's   :13        NA's   :3         NA's   :3         
##    larcenies        larcPerPop         autoTheft       
##  Min.   :    10   Min.   :   77.86   Min.   :     1.0  
##  1st Qu.:   392   1st Qu.: 2040.08   1st Qu.:    30.0  
##  Median :   747   Median : 3079.51   Median :    75.0  
##  Mean   :  2138   Mean   : 3372.98   Mean   :   516.7  
##  3rd Qu.:  1675   3rd Qu.: 4335.41   3rd Qu.:   232.5  
##  Max.   :235132   Max.   :25910.55   Max.   :112464.0  
##  NA's   :3        NA's   :3          NA's   :3         
##  autoTheftPerPop       arsons         arsonsPerPop   
##  Min.   :   6.55   Min.   :   0.00   Min.   :  0.00  
##  1st Qu.: 156.95   1st Qu.:   1.00   1st Qu.:  7.67  
##  Median : 302.36   Median :   5.00   Median : 21.08  
##  Mean   : 473.97   Mean   :  30.91   Mean   : 32.15  
##  3rd Qu.: 589.77   3rd Qu.:  16.00   3rd Qu.: 42.85  
##  Max.   :4968.59   Max.   :5119.00   Max.   :436.37  
##  NA's   :3         NA's   :91        NA's   :91      
##  violentPerPop    nonViolPerPop    
##  Min.   :   0.0   Min.   :  116.8  
##  1st Qu.: 161.7   1st Qu.: 2918.1  
##  Median : 374.1   Median : 4425.4  
##  Mean   : 589.1   Mean   : 4908.2  
##  3rd Qu.: 794.4   3rd Qu.: 6229.3  
##  Max.   :4877.1   Max.   :27119.8  
##  NA's   :221      NA's   :97

The output of this code provides summary statistics for the columns with missing values in the filtered dataset. I examine key statistics such as mean, median, minimum, 1st quartile, 3rd quartile, and maximum for each variable.

The reason we analyze this summary is to make an informed decision on the method for replacing the missing values. Given that the variables in the dataset exhibit a wide range of values with both very low and very high minimum and maximum values, it is advisable to use the median instead of the mean for imputation. The median is less sensitive to extreme values (outliers) and provides a more robust measure of central tendency, making it a suitable choice for replacing missing values in this context.

filtered_data <- filtered_data %>%
  mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
s_data <- scale(filtered_data, center = TRUE, scale = TRUE)

After addressing the remaining missing data by replacing it with the median values, the next step involves standardizing the dataset. By standardizing the data, we prepare it for subsequent unsupervised learning tasks, ensuring that the algorithms can effectively uncover patterns and relationships in the dataset without being influenced by differences in variable scales.

Relationship between the variables

As the idea for this project is to techniques like Principal Component Analysis (PCA) or other dimensionality reduction methods, understanding the correlation structure of the variables is crucial. PCA identifies the directions of maximum variance in the data, and correlation analysis can help us interpret which variables contribute most to these principal components.

#Correlation Matrix
corrplot(cor(s_data), type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.2)

Note: While the correlation plot provides a visual representation of relationships between variables, the readability might still be challenging with a high-dimensional dataset.

In the plot we can see the patches of highly correlated variables both those positively correlated and negatively correlated. Unfortunately reading the labels of the plot is difficult. Because of the low readability of the plot we improve further analysis of the corralation.

cor_matrix <- cor(s_data)
# Exclude self-correlation by setting diagonals to 0
diag(cor_matrix) <- 0
# Set a threshold for correlation strength
threshold <- 0.9
# Find variable pairs with correlation above the threshold
highly_correlated_pairs <- which(cor_matrix > threshold, arr.ind = TRUE)
# Create a data frame to store results
correlation_results <- data.frame(
  Column = character(),
  NumCorrelatedPairs = integer(),
  CorrelatedPairs = character(),
  stringsAsFactors = FALSE
)

# Iterate over columns and count highly correlated pairs
for (col in colnames(cor_matrix)) {
  correlated_pairs <- which(cor_matrix[, col] > threshold)
  num_correlated_pairs <- length(correlated_pairs)
  if (num_correlated_pairs > 0) {
    correlation_results <- rbind(correlation_results, 
                                 data.frame(Column = col, 
                                            NumCorrelatedPairs = num_correlated_pairs,
                                            CorrelatedPairs = paste(colnames(cor_matrix)[correlated_pairs], collapse = ", "),
                                            stringsAsFactors = FALSE))
  }
}

# Print the results
cat("Number of columns with at least one highly correlated pair:", nrow(correlation_results), "\n")

## Number of columns with at least one highly correlated pair: 62

print(correlation_results)

##               Column NumCorrelatedPairs
## 1                pop                 12
## 2           pct12-21                  1
## 3           pct12-29                  1
## 4           pct16-24                  2
## 5            pct65up                  1
## 6          persUrban                 12
## 7          medIncome                  1
## 8         pctWsocsec                  1
## 9       medFamIncome                  2
## 10         perCapInc                  2
## 11       whitePerCap                  1
## 12       persPoverty                 13
## 13         pctLowEdu                  1
## 14      pctNotHSgrad                  1
## 15       pctCollGrad                  1
## 16      pctOccupMgmt                  1
## 17     pctMaleDivorc                  2
## 18      pctFemDivorc                  2
## 19      pctAllDivorc                  2
## 20        persPerFam                  1
## 21           pct2Par                  3
## 22       pctKids2Par                  3
## 23    pctKids-4w2Par                  2
## 24     pct12-17w2Par                  2
## 25      pctWorkMom-6                  1
## 26     pctWorkMom-18                  1
## 27  kidsBornNevrMarr                 10
## 28    numForeignBorn                  6
## 29     pctFgnImmig-3                  1
## 30     pctFgnImmig-5                  2
## 31     pctFgnImmig-8                  2
## 32    pctFgnImmig-10                  1
## 33        pctImmig-3                  4
## 34        pctImmig-5                  4
## 35        pctImmig-8                  4
## 36       pctImmig-10                  4
## 37    pctLargHousFam                  1
## 38       pctLargHous                  1
## 39  persPerOccupHous                  2
## 40   persPerOwnOccup                  1
## 41   pctPersOwnOccup                  1
## 42       houseVacant                  6
## 43 pctHousOwnerOccup                  1
## 44       ownHousLowQ                  2
## 45        ownHousMed                  2
## 46      ownHousUperQ                  2
## 47          rentLowQ                  3
## 48           rentMed                  3
## 49        rentUpperQ                  3
## 50      medGrossRent                  3
## 51    persEmergShelt                  7
## 52      persHomeless                  3
## 53    pctForeignBorn                  4
## 54           murders                  9
## 55         robberies                 12
## 56          assaults                  9
## 57        burglaries                 10
## 58         larcenies                 10
## 59        larcPerPop                  1
## 60         autoTheft                 13
## 61            arsons                  2
## 62     nonViolPerPop                  1
##                                                                                                                                            CorrelatedPairs
## 1    persUrban, persPoverty, kidsBornNevrMarr, numForeignBorn, houseVacant, persEmergShelt, murders, robberies, assaults, burglaries, larcenies, autoTheft
## 2                                                                                                                                                 pct16-24
## 3                                                                                                                                                 pct16-24
## 4                                                                                                                                       pct12-21, pct12-29
## 5                                                                                                                                               pctWsocsec
## 6          pop, persPoverty, kidsBornNevrMarr, numForeignBorn, houseVacant, persEmergShelt, murders, robberies, assaults, burglaries, larcenies, autoTheft
## 7                                                                                                                                             medFamIncome
## 8                                                                                                                                                  pct65up
## 9                                                                                                                                     medIncome, perCapInc
## 10                                                                                                                               medFamIncome, whitePerCap
## 11                                                                                                                                               perCapInc
## 12   pop, persUrban, kidsBornNevrMarr, numForeignBorn, houseVacant, persEmergShelt, murders, robberies, assaults, burglaries, larcenies, autoTheft, arsons
## 13                                                                                                                                            pctNotHSgrad
## 14                                                                                                                                               pctLowEdu
## 15                                                                                                                                            pctOccupMgmt
## 16                                                                                                                                             pctCollGrad
## 17                                                                                                                              pctFemDivorc, pctAllDivorc
## 18                                                                                                                             pctMaleDivorc, pctAllDivorc
## 19                                                                                                                             pctMaleDivorc, pctFemDivorc
## 20                                                                                                                                        persPerOccupHous
## 21                                                                                                              pctKids2Par, pctKids-4w2Par, pct12-17w2Par
## 22                                                                                                                  pct2Par, pctKids-4w2Par, pct12-17w2Par
## 23                                                                                                                                    pct2Par, pctKids2Par
## 24                                                                                                                                    pct2Par, pctKids2Par
## 25                                                                                                                                           pctWorkMom-18
## 26                                                                                                                                            pctWorkMom-6
## 27                                             pop, persUrban, persPoverty, persEmergShelt, murders, robberies, assaults, burglaries, larcenies, autoTheft
## 28                                                                                         pop, persUrban, persPoverty, persHomeless, robberies, autoTheft
## 29                                                                                                                                           pctFgnImmig-5
## 30                                                                                                                            pctFgnImmig-3, pctFgnImmig-8
## 31                                                                                                                           pctFgnImmig-5, pctFgnImmig-10
## 32                                                                                                                                           pctFgnImmig-8
## 33                                                                                                     pctImmig-5, pctImmig-8, pctImmig-10, pctForeignBorn
## 34                                                                                                     pctImmig-3, pctImmig-8, pctImmig-10, pctForeignBorn
## 35                                                                                                     pctImmig-3, pctImmig-5, pctImmig-10, pctForeignBorn
## 36                                                                                                      pctImmig-3, pctImmig-5, pctImmig-8, pctForeignBorn
## 37                                                                                                                                             pctLargHous
## 38                                                                                                                                          pctLargHousFam
## 39                                                                                                                             persPerFam, persPerOwnOccup
## 40                                                                                                                                        persPerOccupHous
## 41                                                                                                                                       pctHousOwnerOccup
## 42                                                                                           pop, persUrban, persPoverty, burglaries, larcenies, autoTheft
## 43                                                                                                                                         pctPersOwnOccup
## 44                                                                                                                                ownHousMed, ownHousUperQ
## 45                                                                                                                               ownHousLowQ, ownHousUperQ
## 46                                                                                                                                 ownHousLowQ, ownHousMed
## 47                                                                                                                       rentMed, rentUpperQ, medGrossRent
## 48                                                                                                                      rentLowQ, rentUpperQ, medGrossRent
## 49                                                                                                                         rentLowQ, rentMed, medGrossRent
## 50                                                                                                                           rentLowQ, rentMed, rentUpperQ
## 51                                                                       pop, persUrban, persPoverty, kidsBornNevrMarr, persHomeless, robberies, autoTheft
## 52                                                                                                               numForeignBorn, persEmergShelt, robberies
## 53                                                                                                         pctImmig-3, pctImmig-5, pctImmig-8, pctImmig-10
## 54                                                    pop, persUrban, persPoverty, kidsBornNevrMarr, robberies, assaults, burglaries, larcenies, autoTheft
## 55        pop, persUrban, persPoverty, kidsBornNevrMarr, numForeignBorn, persEmergShelt, persHomeless, murders, assaults, burglaries, larcenies, autoTheft
## 56                                                     pop, persUrban, persPoverty, kidsBornNevrMarr, murders, robberies, burglaries, larcenies, autoTheft
## 57                                          pop, persUrban, persPoverty, kidsBornNevrMarr, houseVacant, murders, robberies, assaults, larcenies, autoTheft
## 58                                         pop, persUrban, persPoverty, kidsBornNevrMarr, houseVacant, murders, robberies, assaults, burglaries, autoTheft
## 59                                                                                                                                           nonViolPerPop
## 60 pop, persUrban, persPoverty, kidsBornNevrMarr, numForeignBorn, houseVacant, persEmergShelt, murders, robberies, assaults, burglaries, larcenies, arsons
## 61                                                                                                                                  persPoverty, autoTheft
## 62                                                                                                                                              larcPerPop

The output indicates that there are 63 variables with at least one highly correlated pair. The listed variables are those that are part of highly correlated pairs, suggesting strong associations between them. This information can guide further analysis, variable selection, or model building by considering the redundancy introduced by highly correlated variables.

selected_columns <- c("persUrban", "persPoverty", "kidsBornNevrMarr", "autoTheft", "murders", "rapes", "robberies", "assaults","burglaries","larcenies","pctImmig-5","pctImmig-8","rentLowQ","rentUpperQ","ownHousUperQ","ownHousLowQ","pctAllDivorc","pct16-24","pctKids2Par")
s_data <- s_data[, !colnames(s_data) %in% selected_columns]

After analyzing the correlation structure, we observed that certain columns are highly correlated with each other. These variables are essentially capturing similar information and might lead to multicollinearity issues in our analysis. Moreover, many of these variables are already included in the dataset by the ‘pop’ variable and percentage variables for each, or variables that are included inside each other.

PCA

Principal Component Analysis (PCA) is employed to reduce the dimensionality of the dataset while retaining as much variability as possible. This is supposed to aid in simplifying the interpretation of the data and capturing its essential features.

data.pca <- prcomp(s_data, center = TRUE, scale = TRUE)
summary(data.pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     4.9472 3.7017 2.97719 2.70391
## Proportion of Variance 0.2423 0.1357 0.08776 0.07239
## Cumulative Proportion  0.2423 0.3780 0.46576 0.53815
##                            PC5     PC6     PC7     PC8
## Standard deviation     2.21962 2.04075 1.97232 1.73013
## Proportion of Variance 0.04878 0.04123 0.03852 0.02964
## Cumulative Proportion  0.58693 0.62816 0.66668 0.69631
##                            PC9    PC10    PC11    PC12
## Standard deviation     1.43651 1.31315 1.25313 1.23293
## Proportion of Variance 0.02043 0.01707 0.01555 0.01505
## Cumulative Proportion  0.71674 0.73382 0.74936 0.76442
##                           PC13   PC14   PC15   PC16    PC17
## Standard deviation     1.14882 1.0823 1.0345 0.9998 0.97899
## Proportion of Variance 0.01307 0.0116 0.0106 0.0099 0.00949
## Cumulative Proportion  0.77748 0.7891 0.7997 0.8096 0.81906
##                           PC18    PC19    PC20    PC21
## Standard deviation     0.96711 0.94097 0.90708 0.89173
## Proportion of Variance 0.00926 0.00877 0.00815 0.00787
## Cumulative Proportion  0.82832 0.83709 0.84524 0.85311
##                           PC22    PC23    PC24    PC25
## Standard deviation     0.85789 0.84550 0.84231 0.80955
## Proportion of Variance 0.00729 0.00708 0.00702 0.00649
## Cumulative Proportion  0.86040 0.86747 0.87450 0.88099
##                           PC26    PC27   PC28    PC29
## Standard deviation     0.80034 0.76773 0.7316 0.70729
## Proportion of Variance 0.00634 0.00584 0.0053 0.00495
## Cumulative Proportion  0.88733 0.89317 0.8985 0.90342
##                           PC30    PC31   PC32    PC33
## Standard deviation     0.69782 0.68282 0.6742 0.66970
## Proportion of Variance 0.00482 0.00462 0.0045 0.00444
## Cumulative Proportion  0.90824 0.91286 0.9174 0.92180
##                           PC34    PC35    PC36    PC37
## Standard deviation     0.65819 0.62200 0.61227 0.60794
## Proportion of Variance 0.00429 0.00383 0.00371 0.00366
## Cumulative Proportion  0.92609 0.92992 0.93363 0.93729
##                           PC38    PC39    PC40   PC41
## Standard deviation     0.58693 0.57053 0.56689 0.5509
## Proportion of Variance 0.00341 0.00322 0.00318 0.0030
## Cumulative Proportion  0.94070 0.94392 0.94710 0.9501
##                           PC42   PC43    PC44    PC45
## Standard deviation     0.54253 0.5220 0.52122 0.49716
## Proportion of Variance 0.00291 0.0027 0.00269 0.00245
## Cumulative Proportion  0.95302 0.9557 0.95841 0.96086
##                          PC46    PC47    PC48    PC49
## Standard deviation     0.4815 0.47878 0.47281 0.45594
## Proportion of Variance 0.0023 0.00227 0.00221 0.00206
## Cumulative Proportion  0.9631 0.96542 0.96764 0.96969
##                           PC50    PC51    PC52    PC53
## Standard deviation     0.44365 0.42510 0.40675 0.40550
## Proportion of Variance 0.00195 0.00179 0.00164 0.00163
## Cumulative Proportion  0.97164 0.97343 0.97507 0.97670
##                           PC54    PC55    PC56    PC57
## Standard deviation     0.39429 0.38820 0.36966 0.35965
## Proportion of Variance 0.00154 0.00149 0.00135 0.00128
## Cumulative Proportion  0.97824 0.97973 0.98108 0.98236
##                           PC58    PC59    PC60    PC61
## Standard deviation     0.35512 0.33626 0.33067 0.32075
## Proportion of Variance 0.00125 0.00112 0.00108 0.00102
## Cumulative Proportion  0.98361 0.98473 0.98581 0.98683
##                           PC62    PC63    PC64    PC65
## Standard deviation     0.30791 0.29186 0.28551 0.27428
## Proportion of Variance 0.00094 0.00084 0.00081 0.00074
## Cumulative Proportion  0.98777 0.98861 0.98942 0.99017
##                           PC66    PC67    PC68    PC69
## Standard deviation     0.25991 0.25016 0.24775 0.24162
## Proportion of Variance 0.00067 0.00062 0.00061 0.00058
## Cumulative Proportion  0.99083 0.99145 0.99206 0.99264
##                           PC70    PC71    PC72    PC73
## Standard deviation     0.23809 0.22853 0.22072 0.21807
## Proportion of Variance 0.00056 0.00052 0.00048 0.00047
## Cumulative Proportion  0.99320 0.99372 0.99420 0.99467
##                           PC74    PC75    PC76    PC77
## Standard deviation     0.21325 0.19829 0.19461 0.18822
## Proportion of Variance 0.00045 0.00039 0.00037 0.00035
## Cumulative Proportion  0.99512 0.99551 0.99589 0.99624
##                           PC78   PC79    PC80    PC81
## Standard deviation     0.17644 0.1726 0.17048 0.16648
## Proportion of Variance 0.00031 0.0003 0.00029 0.00027
## Cumulative Proportion  0.99654 0.9968 0.99713 0.99740
##                           PC82    PC83    PC84    PC85
## Standard deviation     0.16328 0.15784 0.15153 0.14631
## Proportion of Variance 0.00026 0.00025 0.00023 0.00021
## Cumulative Proportion  0.99767 0.99791 0.99814 0.99835
##                           PC86    PC87    PC88    PC89
## Standard deviation     0.14398 0.13137 0.13036 0.12707
## Proportion of Variance 0.00021 0.00017 0.00017 0.00016
## Cumulative Proportion  0.99856 0.99873 0.99890 0.99906
##                           PC90    PC91    PC92    PC93
## Standard deviation     0.11998 0.11783 0.11019 0.10866
## Proportion of Variance 0.00014 0.00014 0.00012 0.00012
## Cumulative Proportion  0.99920 0.99934 0.99946 0.99957
##                           PC94    PC95    PC96    PC97
## Standard deviation     0.10544 0.08813 0.08365 0.07998
## Proportion of Variance 0.00011 0.00008 0.00007 0.00006
## Cumulative Proportion  0.99968 0.99976 0.99983 0.99989
##                           PC98    PC99   PC100   PC101
## Standard deviation     0.06884 0.05902 0.04566 0.02219
## Proportion of Variance 0.00005 0.00003 0.00002 0.00000
## Cumulative Proportion  0.99994 0.99997 1.00000 1.00000

# Plot cumulative proportion of variance explained
eig<- get_eig(data.pca)
cum_var <- data.frame(Components=1:length(eig$eigenvalue), Cumulative_Variance = cumsum(eig$eigenvalue)/sum(eig$eigenvalue))
par(mfrow = c(1, 2))
ggplot(cum_var,aes(x=Components,y=Cumulative_Variance))+ geom_line(color="lightpink")+geom_point(color="lightpink")+labs(x="Number of Principal Components", y= "Cumulative Proportion of Variance Explained", title = "Cumulative Variance Explained") + theme_minimal()

fviz_eig(data.pca, choice = "eigenvalue", addlabels=TRUE, ncp=20)

Upon analyzing the output of get_eig, it becomes evident that 64 dimensions yield a cumulative variance percentage exceeding 99%. Notably, the eigenvalues of dimensions up to the 15th are above 1, collectively explaining around 80% of the dataset’s variance. This information guides our decision on the number of principal components to retain for subsequent analyses.

Interpratetion of principal components

PC1 <- fviz_contrib(data.pca, choice = "var", axes = 1,fill = "brown4",color = "brown4",top=10)+
labs(title= "Contr of var to Dim1")
PC2 <- fviz_contrib(data.pca, choice = "var", axes = 2,fill = "brown4",color = "brown4",top=10) +
labs(title= "Contr of var to Dim2")
PC3 <- fviz_contrib(data.pca, choice = "var", axes = 3,fill = "brown4",color = "brown4",top=10) +
labs(title= "Contr of var to Dim3")
grid.arrange(PC1, PC2, PC3,   ncol=3)

The grid of plots aims to simplify the understanding of how individual variables impact each principal component (PC). This visual representation enables the identification of variables that play a significant role in shaping the variation captured by each PC, shedding light on the underlying data structure. Such insights are valuable for pinpointing key features that contribute prominently to observed patterns. However, it’s important to note that, in our dataset, individual variable contributions to dimensions do not exceed 6%. Additionally, the abundance of variables with contributions above 1% poses a challenge to the interpretation process. This complexity challenges the efficacy of PCA for this dataset, making the identification of crucial variables for interpretation challenging amidst the multitude of contributors.

It’s essential to highlight that the contributions are specific to each main component, as expected. Moreover, emphasizing that the three axes and their contributions align intuitively, such as economic factors (unemployment, income, poverty) predominantly influencing the first axis, while social factors (foreign status, language proficiency) play a major role in the second axis. In the third chart we can see that the Dim3 first 10 contributions are connected to the size of the households.

Factor Analysis

When Principal Component Analysis (PCA) falls short in capturing the underlying structure of high-dimensional data, Factor Analysis (FA) can serve as an alternative approach. PCA falls short in capturing the underlying structure of high-dimensional data for our dataset because of the reasons we mentioned before - individual variable contributions to dimensions are limited to a maximum of 6%, and the sheer number of variables with contributions above 1% complicates the interpretation process, hindering the identification of crucial variables for analysis. Factor Analysis explores latent factors that contribute to the observed variables, aiming to reveal hidden patterns and relationships within the data.

Before performing Factor Analysis, it’s essential to assess the adequacy of the data for this technique. The Kaiser-Meyer-Olkin (KMO) measure is a statistical method used to evaluate the sampling adequacy for factor analysis.

KMO(s_data)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = s_data)
## Overall MSA =  0.92
## MSA for each item = 
##                 pop            perHoush            pctBlack 
##                0.82                0.88                0.85 
##            pctWhite            pctAsian             pctHisp 
##                0.89                0.75                0.89 
##            pct12-21            pct12-29             pct65up 
##                0.86                0.86                0.90 
##            pctUrban           medIncome            pctWwage 
##                0.93                0.94                0.94 
##            pctWfarm             pctWdiv          pctWsocsec 
##                0.84                0.97                0.92 
##          pctPubAsst           pctRetire        medFamIncome 
##                0.96                0.89                0.94 
##           perCapInc         whitePerCap         blackPerCap 
##                0.93                0.92                0.97 
##            NAperCap         asianPerCap         otherPerCap 
##                0.96                0.97                0.95 
##          hispPerCap          pctPoverty           pctLowEdu 
##                0.98                0.96                0.93 
##        pctNotHSgrad         pctCollGrad         pctUnemploy 
##                0.94                0.94                0.96 
##           pctEmploy        pctEmployMfg   pctEmployProfServ 
##                0.91                0.74                0.88 
##        pctOccupManu        pctOccupMgmt       pctMaleDivorc 
##                0.93                0.94                0.95 
##       pctMaleNevMar        pctFemDivorc          persPerFam 
##                0.91                0.94                0.88 
##             pct2Par      pctKids-4w2Par       pct12-17w2Par 
##                0.95                0.98                0.98 
##        pctWorkMom-6       pctWorkMom-18 pctKidsBornNevrMarr 
##                0.77                0.80                0.97 
##      numForeignBorn       pctFgnImmig-3       pctFgnImmig-5 
##                0.84                0.90                0.90 
##       pctFgnImmig-8      pctFgnImmig-10          pctImmig-3 
##                0.90                0.91                0.92 
##         pctImmig-10     pctSpeakOnlyEng      pctNotSpeakEng 
##                0.88                0.90                0.96 
##      pctLargHousFam         pctLargHous    persPerOccupHous 
##                0.87                0.86                0.79 
##     persPerOwnOccup  persPerRenterOccup     pctPersOwnOccup 
##                0.79                0.89                0.84 
##     pctPopDenseHous   pctSmallHousUnits         medNumBedrm 
##                0.95                0.95                0.96 
##         houseVacant        pctHousOccup   pctHousOwnerOccup 
##                0.83                0.89                0.82 
##    pctVacantBoarded        pctVacant6up      medYrHousBuilt 
##                0.97                0.91                0.87 
##      pctHousWOphone      pctHousWOplumb          ownHousMed 
##                0.97                0.98                0.95 
##       ownHousQrange             rentMed          rentQrange 
##                0.97                0.93                0.96 
##        medGrossRent   medRentpctHousInc       medOwnCostpct 
##                0.93                0.94                0.91 
##     medOwnCostPctWO      persEmergShelt        persHomeless 
##                0.81                0.85                0.85 
##      pctForeignBorn   pctBornStateResid      pctSameHouse-5 
##                0.91                0.84                0.89 
##     pctSameCounty-5      pctSameState-5            landArea 
##                0.94                0.79                0.68 
##          popDensity      pctUsePubTrans    pctOfficDrugUnit 
##                0.94                0.91                0.95 
##          murdPerPop         rapesPerPop         robbbPerPop 
##                0.99                0.95                0.94 
##       assaultPerPop         burglPerPop          larcPerPop 
##                0.92                0.92                0.85 
##     autoTheftPerPop              arsons        arsonsPerPop 
##                0.92                0.87                0.94 
##       violentPerPop       nonViolPerPop 
##                0.91                0.87

The KMO statistic ranges from 0 to 1, with higher values indicating better suitability for factor analysis. A value close to 1 suggests that the variables are well-suited for this technique. Here the KMO value of 0.89 indicates a high level of sampling adequacy, suggesting that the dataset is suitable for Factor Analysis. This implies that the observed variables share a substantial amount of common variance, making them appropriate for extracting underlying factors through this dimension reduction technique.

# compute the covariance matrix
s_data_cov <- cov(s_data)

factor_model <- fa(s_data_cov, nfactors = 15, rotate = "promax", max.iter = 1000, scores = "Bartlett")

In this code snippet, we apply Factor Analysis with the factanal function. We specify the number of factors to extract as 15, based on the information obtained from the earlier exploration of cumulative variance percentage in PCA. The choice of rotation method is set to “promax” for oblique rotation, allowing factors to be correlated.

# Check the summary of the cleaned factor model
summary(factor_model)

## 
## Factor analysis with Call: fa(r = s_data_cov, nfactors = 15, rotate = "promax", scores = "Bartlett", 
##     max.iter = 1000)
## 
## Test of the hypothesis that 15 factors are sufficient.
## The degrees of freedom for the model is 3640  and the objective function was  43.05 
## 
## The root mean square of the residuals (RMSA) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
##  With factor correlations of 
##        MR6   MR1   MR2   MR3   MR5   MR4   MR7   MR9   MR8
## MR6   1.00 -0.57  0.48 -0.34  0.19  0.04  0.12  0.15  0.14
## MR1  -0.57  1.00 -0.19  0.30  0.00 -0.03 -0.03  0.01 -0.14
## MR2   0.48 -0.19  1.00 -0.09  0.17  0.20  0.19  0.27 -0.05
## MR3  -0.34  0.30 -0.09  1.00 -0.05  0.01 -0.15 -0.17  0.11
## MR5   0.19  0.00  0.17 -0.05  1.00  0.14 -0.02  0.08 -0.02
## MR4   0.04 -0.03  0.20  0.01  0.14  1.00 -0.03  0.36 -0.43
## MR7   0.12 -0.03  0.19 -0.15 -0.02 -0.03  1.00  0.41  0.00
## MR9   0.15  0.01  0.27 -0.17  0.08  0.36  0.41  1.00 -0.43
## MR8   0.14 -0.14 -0.05  0.11 -0.02 -0.43  0.00 -0.43  1.00
## MR12  0.36 -0.60  0.05 -0.03 -0.14 -0.19  0.04 -0.03  0.12
## MR15  0.13 -0.36  0.22  0.03 -0.22 -0.01 -0.11 -0.15  0.26
## MR10 -0.55  0.43 -0.37  0.11 -0.10  0.13 -0.04  0.07 -0.24
## MR11  0.36 -0.39  0.42 -0.14  0.04  0.33 -0.21  0.14 -0.33
## MR13  0.24 -0.03  0.12  0.00  0.14 -0.32  0.00 -0.25  0.34
## MR14 -0.33  0.11 -0.27  0.09 -0.01  0.15 -0.40 -0.04 -0.27
##       MR12  MR15  MR10  MR11  MR13  MR14
## MR6   0.36  0.13 -0.55  0.36  0.24 -0.33
## MR1  -0.60 -0.36  0.43 -0.39 -0.03  0.11
## MR2   0.05  0.22 -0.37  0.42  0.12 -0.27
## MR3  -0.03  0.03  0.11 -0.14  0.00  0.09
## MR5  -0.14 -0.22 -0.10  0.04  0.14 -0.01
## MR4  -0.19 -0.01  0.13  0.33 -0.32  0.15
## MR7   0.04 -0.11 -0.04 -0.21  0.00 -0.40
## MR9  -0.03 -0.15  0.07  0.14 -0.25 -0.04
## MR8   0.12  0.26 -0.24 -0.33  0.34 -0.27
## MR12  1.00  0.18 -0.44  0.19  0.06 -0.19
## MR15  0.18  1.00 -0.07  0.19 -0.02 -0.21
## MR10 -0.44 -0.07  1.00 -0.33 -0.17  0.23
## MR11  0.19  0.19 -0.33  1.00 -0.16  0.11
## MR13  0.06 -0.02 -0.17 -0.16  1.00 -0.25
## MR14 -0.19 -0.21  0.23  0.11 -0.25  1.00

factor_correlations <- cor(factor_model$Phi)
corrplot(factor_correlations, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)

The factor analysis was performed with the hypothesis that 15 factors are sufficient. The degrees of freedom for the model are 3640, and the objective function reached a value of 42.39. The root mean square of the residuals (RMSA) is 0.02, and the degree-of-freedom-corrected RMSA is also 0.02.

The factor correlation matrix provides insights into the relationships between the retained factors (MR1 to MR15):

Factor correlations range from -0.61 to 1.00. Negative correlations suggest an inverse relationship between factors, while positive correlations indicate a positive relationship. We can see it also in the plot, we have factors like MR6 that is relatively higly corralated with MR2, MR12, MR11, or MR10 with MR1. We can also see some high negative corralation betwen for example MR10 and MR6 or MR12 and MR1. These correlations can guide further interpretation of the factor structure and inform decisions about the number of factors to retain. It’s essential to consider the magnitude and pattern of factor loadings along with these correlations for a comprehensive understanding of the underlying structure.

Interpratetion of factor loadings

“The factor loading is the correlation between the item and the factor; a factor loading of more than 0.30 usually indicates a moderate correlation between the item and the factor.” [ Tavakol M, Wetzel A. 2020]

# Extract factor loadings from the factor model
factor_loadings <- factor_model$loadings

# Display the factor loadings for the first factor
first_factor_loadings <- factor_loadings[, 1]
print(first_factor_loadings)

##                 pop            perHoush            pctBlack 
##         0.013279236         0.020555512         1.231802153 
##            pctWhite            pctAsian             pctHisp 
##        -1.040540865         0.061918004        -0.163534866 
##            pct12-21            pct12-29             pct65up 
##        -0.001195710        -0.053368080        -0.056259219 
##            pctUrban           medIncome            pctWwage 
##         0.214313793        -0.041885682        -0.121221263 
##            pctWfarm             pctWdiv          pctWsocsec 
##        -0.249604027        -0.394960439        -0.021218459 
##          pctPubAsst           pctRetire        medFamIncome 
##         0.477004912        -0.082350811        -0.066311079 
##           perCapInc         whitePerCap         blackPerCap 
##         0.042009471         0.256205825        -0.085084825 
##            NAperCap         asianPerCap         otherPerCap 
##         0.014710129         0.059055235         0.058432962 
##          hispPerCap          pctPoverty           pctLowEdu 
##         0.113561725         0.379902603         0.054157496 
##        pctNotHSgrad         pctCollGrad         pctUnemploy 
##         0.209478949         0.011350182         0.326605701 
##           pctEmploy        pctEmployMfg   pctEmployProfServ 
##        -0.164934838         0.026352972         0.073738252 
##        pctOccupManu        pctOccupMgmt       pctMaleDivorc 
##         0.132472236         0.020690027         0.528377184 
##       pctMaleNevMar        pctFemDivorc          persPerFam 
##         0.209150888         0.566718201         0.109858356 
##             pct2Par      pctKids-4w2Par       pct12-17w2Par 
##        -0.794368059        -0.768488500        -0.813864778 
##        pctWorkMom-6       pctWorkMom-18 pctKidsBornNevrMarr 
##         0.233627342         0.109756581         1.022345673 
##      numForeignBorn       pctFgnImmig-3       pctFgnImmig-5 
##        -0.089312908        -0.030553558        -0.026902595 
##       pctFgnImmig-8      pctFgnImmig-10          pctImmig-3 
##         0.043434255         0.094047502        -0.111107003 
##         pctImmig-10     pctSpeakOnlyEng      pctNotSpeakEng 
##        -0.053098206         0.197672496        -0.146561075 
##      pctLargHousFam         pctLargHous    persPerOccupHous 
##         0.215232069         0.142429880         0.028592211 
##     persPerOwnOccup  persPerRenterOccup     pctPersOwnOccup 
##        -0.094443503         0.195788169        -0.222787848 
##     pctPopDenseHous   pctSmallHousUnits         medNumBedrm 
##         0.129770260         0.003950578         0.012868540 
##         houseVacant        pctHousOccup   pctHousOwnerOccup 
##         0.092298685        -0.128192180        -0.156867575 
##    pctVacantBoarded        pctVacant6up      medYrHousBuilt 
##         0.616881708         0.019495486         0.014597817 
##      pctHousWOphone      pctHousWOplumb          ownHousMed 
##         0.429542480         0.128896351        -0.021917944 
##       ownHousQrange             rentMed          rentQrange 
##         0.059940279        -0.085138031         0.005063219 
##        medGrossRent   medRentpctHousInc       medOwnCostpct 
##        -0.049432617         0.211640758        -0.034135167 
##     medOwnCostPctWO      persEmergShelt        persHomeless 
##        -0.003248375         0.014220949        -0.063706532 
##      pctForeignBorn   pctBornStateResid      pctSameHouse-5 
##        -0.099439902        -0.094919755        -0.034948168 
##     pctSameCounty-5      pctSameState-5            landArea 
##         0.137098035        -0.044856567         0.038219090 
##          popDensity      pctUsePubTrans    pctOfficDrugUnit 
##        -0.025715035         0.134467097         0.369657978 
##          murdPerPop         rapesPerPop         robbbPerPop 
##         0.911117334         0.638842555         1.029879415 
##       assaultPerPop         burglPerPop          larcPerPop 
##         0.890023486         0.921161136         0.725047683 
##     autoTheftPerPop              arsons        arsonsPerPop 
##         0.695338925         0.064020333         0.477672031 
##       violentPerPop       nonViolPerPop 
##         1.007523452         0.920379938

# Sort the variables based on their loadings for the first factor
sorted_loadings <- sort(first_factor_loadings, decreasing = TRUE)
print(sorted_loadings)

##            pctBlack         robbbPerPop pctKidsBornNevrMarr 
##         1.231802153         1.029879415         1.022345673 
##       violentPerPop         burglPerPop       nonViolPerPop 
##         1.007523452         0.921161136         0.920379938 
##          murdPerPop       assaultPerPop          larcPerPop 
##         0.911117334         0.890023486         0.725047683 
##     autoTheftPerPop         rapesPerPop    pctVacantBoarded 
##         0.695338925         0.638842555         0.616881708 
##        pctFemDivorc       pctMaleDivorc        arsonsPerPop 
##         0.566718201         0.528377184         0.477672031 
##          pctPubAsst      pctHousWOphone          pctPoverty 
##         0.477004912         0.429542480         0.379902603 
##    pctOfficDrugUnit         pctUnemploy         whitePerCap 
##         0.369657978         0.326605701         0.256205825 
##        pctWorkMom-6      pctLargHousFam            pctUrban 
##         0.233627342         0.215232069         0.214313793 
##   medRentpctHousInc        pctNotHSgrad       pctMaleNevMar 
##         0.211640758         0.209478949         0.209150888 
##     pctSpeakOnlyEng  persPerRenterOccup         pctLargHous 
##         0.197672496         0.195788169         0.142429880 
##     pctSameCounty-5      pctUsePubTrans        pctOccupManu 
##         0.137098035         0.134467097         0.132472236 
##     pctPopDenseHous      pctHousWOplumb          hispPerCap 
##         0.129770260         0.128896351         0.113561725 
##          persPerFam       pctWorkMom-18      pctFgnImmig-10 
##         0.109858356         0.109756581         0.094047502 
##         houseVacant   pctEmployProfServ              arsons 
##         0.092298685         0.073738252         0.064020333 
##            pctAsian       ownHousQrange         asianPerCap 
##         0.061918004         0.059940279         0.059055235 
##         otherPerCap           pctLowEdu       pctFgnImmig-8 
##         0.058432962         0.054157496         0.043434255 
##           perCapInc            landArea    persPerOccupHous 
##         0.042009471         0.038219090         0.028592211 
##        pctEmployMfg        pctOccupMgmt            perHoush 
##         0.026352972         0.020690027         0.020555512 
##        pctVacant6up            NAperCap      medYrHousBuilt 
##         0.019495486         0.014710129         0.014597817 
##      persEmergShelt                 pop         medNumBedrm 
##         0.014220949         0.013279236         0.012868540 
##         pctCollGrad          rentQrange   pctSmallHousUnits 
##         0.011350182         0.005063219         0.003950578 
##            pct12-21     medOwnCostPctWO          pctWsocsec 
##        -0.001195710        -0.003248375        -0.021218459 
##          ownHousMed          popDensity       pctFgnImmig-5 
##        -0.021917944        -0.025715035        -0.026902595 
##       pctFgnImmig-3       medOwnCostpct      pctSameHouse-5 
##        -0.030553558        -0.034135167        -0.034948168 
##           medIncome      pctSameState-5        medGrossRent 
##        -0.041885682        -0.044856567        -0.049432617 
##         pctImmig-10            pct12-29             pct65up 
##        -0.053098206        -0.053368080        -0.056259219 
##        persHomeless        medFamIncome           pctRetire 
##        -0.063706532        -0.066311079        -0.082350811 
##         blackPerCap             rentMed      numForeignBorn 
##        -0.085084825        -0.085138031        -0.089312908 
##     persPerOwnOccup   pctBornStateResid      pctForeignBorn 
##        -0.094443503        -0.094919755        -0.099439902 
##          pctImmig-3            pctWwage        pctHousOccup 
##        -0.111107003        -0.121221263        -0.128192180 
##      pctNotSpeakEng   pctHousOwnerOccup             pctHisp 
##        -0.146561075        -0.156867575        -0.163534866 
##           pctEmploy     pctPersOwnOccup            pctWfarm 
##        -0.164934838        -0.222787848        -0.249604027 
##             pctWdiv      pctKids-4w2Par             pct2Par 
##        -0.394960439        -0.768488500        -0.794368059 
##       pct12-17w2Par            pctWhite 
##        -0.813864778        -1.040540865

# Plot the factor loadings for the first factor
barplot(sorted_loadings, names.arg = names(sorted_loadings), col = "skyblue", main = "Factor Loadings for Factor 1", cex.names=0.5, las=2)
abline(h = c(0.3,-0.3),col="red")

As we can see most variables do not have a significant corralation with the Factor 1, when we use the guidlines from the article, in order to interpret two of the factors I excluded those items for better visual interpretation.

# Filter loadings between -0.3 and 0.3
filtered_loadings <- sorted_loadings[sorted_loadings > 0.3 | sorted_loadings < -0.3]

# Plot the factor loadings for the first factor
par(mar = c(7, 4, 2, 2) + 0.2)
barplot(filtered_loadings, names.arg = names(filtered_loadings), col = "skyblue", main = "Factor Loadings for Factor 1", cex.names=0.6, las=2)

secfacload <- factor_loadings[, 2]
sorted_loadings2 <- sort(secfacload, decreasing = TRUE)
# Filter loadings between -0.3 and 0.3
filtered_loadings2 <- sorted_loadings2[sorted_loadings2 > 0.3 | sorted_loadings2 < -0.3]

# Plot the factor loadings for the first factor
par(mar = c(7, 4, 2, 2) + 0.2)
barplot(filtered_loadings2, names.arg = names(filtered_loadings2), col = "skyblue", main = "Factor Loadings for Factor 2", cex.names=0.6, las=2)

Looking at the first two factors we can see the relevant factor loadings. We can see that the items in Factor 1 are related mostly related to crime but interestingly we also see the modarate corralation between variables about divorce and percentege of children born to non married couples. The second factor is related to factors about the income and ethnicity.

CONCLUSIONS

The project emphasizes the importance of careful variable selection, exploring various dimensionality reduction techniques, and adjusting the analysis based on the nature of the data to enhance interpretability.

Sources:

R documentatation <www.rdocumentation.org>
Repository https://archive.ics.uci.edu/dataset/211/communities+and+crime+unnormalized
Article about reading a specific format into R https://www.datafiles.samhsa.gov/get-help/format-specific-issues/how-do-i-read-data-r
Article about the culmulative variance graph https://medium.com/@averymarcell/project-3-analyzing-cumulative-variance-explained-with-a-line-plot-a28ba3fbd120
Tavakol M, Wetzel A. Factor Analysis: a means for theory and instrument development in support of construct validity.(2020) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7883798/#:~:text=The%20factor%20loading%20is%20the,and%20R%2C%20provide%20factor%20loadings.

Annotation: This RMarkdown was prepared with a help of study materials for a course “Unsuperviesed Learning” at WNE University of Warsaw

Dimension Reduction

Zuzanna Kowalczyk