First, store the data in a variable. I stored the dataset as happy:
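The original chunk is not shown, so here is a minimal sketch; the file name is a placeholder for wherever the World Happiness data lives:
happy <- read.csv("world_happiness.csv")   # hypothetical file name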
Then, get the Doing Business data for the 5 new variables.
require("readxl")
Merge 5 variables from biz into happy. Rename the first column of biz to “Country” to create a primary key, then pick the 5 variables from biz to merge with happy. The variables picked were:
The new merged table is stored in happybiz. Column “X” is dropped from happy, because it is irrelevant to the dimension reduction later.
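A sketch of the merge; only Starting a Business and Trading Across Borders are confirmed picks in the biplot discussion later, so the other three column names below are placeholders:
names(biz)[1] <- "Country"                      # create the primary key
pick <- c("Country", "Starting a Business", "Trading Across Borders",
          "Var3", "Var4", "Var5")               # last three are hypothetical
happybiz <- merge(happy, biz[, pick], by = "Country")
happybiz$X <- NULL                              # drop the irrelevant index column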
Chosen reduction method: principal component analysis (PCA). Standardize the numeric columns and store them in a new dataframe, stdhappybiz. Then, run PCA:
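A sketch of the standardization and PCA, with pcahb as an assumed name for the fitted object:
stdhappybiz <- as.data.frame(scale(happybiz[sapply(happybiz, is.numeric)]))
pcahb <- princomp(stdhappybiz)   # PCA on the standardized data
summary(pcahb)                   # produces the table below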
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 3.2186644 1.34833591 1.21160446 1.11166942
## Proportion of Variance 0.5219746 0.09159972 0.07396388 0.06226576
## Cumulative Proportion 0.5219746 0.61357430 0.68753818 0.74980393
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.95950427 0.88433365 0.74606223 0.71492322
## Proportion of Variance 0.04638652 0.03940309 0.02804452 0.02575234
## Cumulative Proportion 0.79619045 0.83559354 0.86363806 0.88939041
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.6962408 0.68079349 0.54912356 0.52246769
## Proportion of Variance 0.0244240 0.02335225 0.01519281 0.01375361
## Cumulative Proportion 0.9138144 0.93716666 0.95235947 0.96611308
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.50638163 0.43659514 0.346564188 0.248663742
## Proportion of Variance 0.01291974 0.00960408 0.006051532 0.003115465
## Cumulative Proportion 0.97903283 0.98863691 0.994688437 0.997803902
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.192518007 0.0709625358 3.85693e-02 4.721377e-04
## Proportion of Variance 0.001867414 0.0002537209 7.49517e-05 1.123143e-08
## Cumulative Proportion 0.999671316 0.9999250371 1.00000e+00 1.000000e+00
After 8 principal components, the cumulative proportion of variance reaches ~0.9, meaning that only 8 components are needed to explain most of the variance in World Happiness and Ease of Doing Business in a country.
Verify this claim by examining the variances:
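The eigenvalues are the squared component standard deviations; assuming the pcahb object from above:
pcahb$sdev^2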
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.035980e+01 1.818010e+00 1.467985e+00 1.235809e+00 9.206485e-01
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## 7.820460e-01 5.566089e-01 5.111152e-01 4.847512e-01 4.634798e-01
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## 3.015367e-01 2.729725e-01 2.564224e-01 1.906153e-01 1.201067e-01
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## 6.183366e-02 3.706318e-02 5.035681e-03 1.487591e-03 2.229140e-07
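A sketch of the barplot, with a dashed line marking Kaiser's cutoff at variance = 1:
barplot(pcahb$sdev^2, las = 2, ylab = "Variance")
abline(h = 1, lty = 2)   # Kaiser's criterion cutoff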
The barplot of variances shows that the principal components’ explained variance flattens out after component 8. Examining the actual values of the variances reveals that the components can actually be cut down to 4, using Kaiser’s criterion of keeping only components with variances >= 1.
For fun: examine visual biplots to see which variables had similar response profiles, to find variables that could be dropped:
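A sketch of how the biplots can be drawn; the zoom limits are illustrative:
biplot(pcahb)                                          # full biplot
biplot(pcahb, xlim = c(-0.3, 0), ylim = c(-0.3, 0))    # negative vectors
biplot(pcahb, xlim = c(0, 0.3), ylim = c(0, 0.3))      # positive vectors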
Three biplots were produced: the full biplot, then zoomed views of the negative and positive vectors.
Two variables’ vectors immediately stand out: freedom, which overlaps with the longer GNIperCap, and Starting a Business, which overlaps with the longer Trading Across Borders. The longer the vector, the higher the correlation between the variable and its fitted values. Therefore, the variables with the shorter vectors (freedom and Starting a Business), having lower correlation, could be dropped.
Create a distance matrix containing the pairwise distances between all rows; the default distance is Euclidean.
Since the output of the distance matrix for 131 rows is difficult to read, reorganize it into a long table (stored in tb_shb). Only the first few rows of the resulting table are displayed:
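One way to build tb_shb, assuming the object names above (column names approximate the output below):
dhb <- dist(stdhappybiz)                     # Euclidean by default
m <- as.matrix(dhb)
idx <- which(upper.tri(m), arr.ind = TRUE)   # each pair of rows once
tb_shb <- data.frame(row1 = idx[, "row"], row2 = idx[, "col"], Distance = m[idx])
tb_shb <- tb_shb[order(tb_shb$row1, tb_shb$row2), ]
head(tb_shb)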
## Row 1 Row 2 Distance
## 1 1 2 6.503393
## 2 1 3 6.691797
## 3 1 4 5.055541
## 4 1 5 7.489941
## 5 1 6 6.706064
## 6 1 7 11.110892
First, get the Chicago crimes data and store it in the variable Chicago.
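A minimal sketch of the import; the file name is hypothetical:
Chicago <- read.csv("Chicago_Crimes.csv")   # hypothetical file name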
Transform the data to ward level with tidyr and dplyr. The first few rows of the result are displayed:
require(tidyr)
require(dplyr)
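A sketch of the aggregation, with wardcrime as an assumed name for the long table:
wardcrime <- Chicago %>%
  group_by(Ward, Primary.Type) %>%
  summarise(Freq = n()) %>%       # crimes per ward and type
  ungroup() %>%
  arrange(Primary.Type, Ward)     # matches the ordering shown below
head(as.data.frame(wardcrime))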
## Ward Primary.Type Freq
## 1 1 ARSON 197
## 2 2 ARSON 137
## 3 3 ARSON 186
## 4 4 ARSON 92
## 5 5 ARSON 155
## 6 6 ARSON 270
Spread out the frequency counts of each Primary.Type crime into columns. The first few rows and columns of the result are displayed:
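A sketch using tidyr’s spread(), with SCW as an assumed name for the wide table:
SCW <- wardcrime %>% spread(Primary.Type, Freq, fill = 0)
head(SCW)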
## Ward ARSON ASSAULT BATTERY BURGLARY CONCEALED CARRY LICENSE VIOLATION
## 1 1 197 5335 14907 8480 0
## 2 2 137 12238 33261 6078 3
## 3 3 186 12191 37918 7263 1
## 4 4 92 7283 19945 5841 0
## 5 5 155 9523 27453 10603 2
## 6 6 270 13098 37741 11993 5
## CRIM SEXUAL ASSAULT CRIMINAL DAMAGE CRIMINAL TRESPASS DECEPTIVE PRACTICE
## 1 324 14371 3648 4774
## 2 723 18170 13054 12165
## 3 673 15943 13398 4998
## 4 424 11018 4161 4368
## 5 654 16142 4169 5102
## 6 858 20574 5154 6925
Standardizing the data reveals that the column “DOMESTIC VIOLENCE” is full of NaNs: the column has zero variance across wards, so scaling divides by a standard deviation of zero:
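A sketch of the standardization, assuming the Ward column is the first column of SCW:
sSCW <- as.data.frame(scale(SCW[, -1]))   # drop Ward before scaling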
## CRIMINAL DAMAGE CRIMINAL TRESPASS DECEPTIVE PRACTICE DOMESTIC VIOLENCE
## 1 0.2029457 0.0964204 0.08632237 NaN
## 2 1.0045915 3.3945885 2.10815221 NaN
## 3 0.5346612 3.5152105 0.14759823 NaN
## 4 -0.5045874 0.2763014 -0.02474013 NaN
## 5 0.5766532 0.2791065 0.17604774 NaN
## 6 1.5118714 0.6244920 0.67473476 NaN
Eliminate NAs and run PCA:
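A sketch of dropping the NaN column and rerunning PCA (is.na() treats NaN as missing); pcachi is an assumed name:
sSCW <- sSCW[, colSums(is.na(sSCW)) == 0]   # remove all-NaN columns
pcachi <- princomp(sSCW)
summary(pcachi)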
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 4.1293226 2.2457506 1.36443026 1.23393930
## Proportion of Variance 0.5117439 0.1513624 0.05587245 0.04569646
## Cumulative Proportion 0.5117439 0.6631063 0.71897872 0.76467519
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 1.16577344 1.02142776 0.96769188 0.89515372
## Proportion of Variance 0.04078715 0.03131196 0.02810407 0.02404862
## Cumulative Proportion 0.80546233 0.83677430 0.86487837 0.88892699
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.81067051 0.71545241 0.64160923 0.62088386
## Proportion of Variance 0.01972349 0.01536231 0.01235481 0.01156953
## Cumulative Proportion 0.90865048 0.92401279 0.93636761 0.94793714
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.56435852 0.549134875 0.499487827 0.41368166
## Proportion of Variance 0.00955884 0.009050093 0.007487638 0.00513603
## Cumulative Proportion 0.95749598 0.966546069 0.974033707 0.97916974
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.372161530 0.348926318 0.319265837 0.272991561
## Proportion of Variance 0.004156789 0.003653949 0.003059144 0.002236626
## Cumulative Proportion 0.983326526 0.986980475 0.990039619 0.992276245
## Comp.21 Comp.22 Comp.23 Comp.24
## Standard deviation 0.261863320 0.202885283 0.19431733 0.1672872441
## Proportion of Variance 0.002057995 0.001235367 0.00113323 0.0008398866
## Cumulative Proportion 0.994334240 0.995569607 0.99670284 0.9975427240
## Comp.25 Comp.26 Comp.27 Comp.28
## Standard deviation 0.1360293249 0.1295298854 0.1072695201 0.1063105726
## Proportion of Variance 0.0005553415 0.0005035412 0.0003453406 0.0003391938
## Cumulative Proportion 0.9980980655 0.9986016066 0.9989469473 0.9992861411
## Comp.29 Comp.30 Comp.31 Comp.32
## Standard deviation 0.0936325408 0.0824644881 0.0608241256 5.365655e-02
## Proportion of Variance 0.0002631168 0.0002040934 0.0001110316 8.640533e-05
## Cumulative Proportion 0.9995492579 0.9997533513 0.9998643829 9.999508e-01
## Comp.33 Comp.34
## Standard deviation 3.386827e-02 2.219629e-02
## Proportion of Variance 3.442556e-05 1.478617e-05
## Cumulative Proportion 9.999852e-01 1.000000e+00
After 9 principal components, the cumulative proportion of variance reaches ~0.9, meaning that only 9 components are needed to explain most of the variance in crime rates across Chicago’s wards.
Verify this claim by examining the variances:
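Assuming the pcachi object from above:
pcachi$sdev^2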
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.705131e+01 5.043396e+00 1.861670e+00 1.522606e+00 1.359028e+00
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## 1.043315e+00 9.364276e-01 8.013002e-01 6.571867e-01 5.118722e-01
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## 4.116624e-01 3.854968e-01 3.185005e-01 3.015491e-01 2.494881e-01
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## 1.711325e-01 1.385042e-01 1.217496e-01 1.019307e-01 7.452439e-02
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
## 6.857240e-02 4.116244e-02 3.775922e-02 2.798502e-02 1.850398e-02
## Comp.26 Comp.27 Comp.28 Comp.29 Comp.30
## 1.677799e-02 1.150675e-02 1.130194e-02 8.767053e-03 6.800392e-03
## Comp.31 Comp.32 Comp.33 Comp.34
## 3.699574e-03 2.879026e-03 1.147060e-03 4.926753e-04
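As before, a sketch of the barplot with the Kaiser cutoff marked:
barplot(pcachi$sdev^2, las = 2, ylab = "Variance")
abline(h = 1, lty = 2)   # Kaiser's criterion cutoff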
The barplot of variances shows that the principal components’ explained variance flattens out after component 9. Examining the actual values of the variances reveals that the components can actually be cut down to 6, using Kaiser’s criterion of keeping only components with variances >= 1.
For fun: run cluster analysis on the attributes to see which variables could be dropped due to similarity. The default distance, Euclidean, is used.
Because cluster analysis in R works by row, convert the dataframe of wrangled Chicago ward data into a matrix and transpose it. Transposing the matrix flips the Primary.Type categories to rows, allowing attribute-based cluster analysis. This transposed matrix is stored in tSCW. The first few rows and columns of the resulting matrix are shown.
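A sketch of the transposition, assuming the standardized dataframe sSCW from above:
tSCW <- t(as.matrix(sSCW))   # crime types become rows
tSCW[1:6, 1:3]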
## 1 2 3
## ARSON 0.07463137 -0.48791666 -0.02850243
## ASSAULT -0.43885396 1.22778091 1.21643341
## BATTERY -0.47937961 0.91073462 1.26345130
## BURGLARY 0.56961395 -0.27403482 0.14216992
## CONCEALED CARRY LICENSE VIOLATION -0.56894446 0.04947343 -0.36280516
## CRIM SEXUAL ASSAULT -0.53685155 1.01131098 0.81730565
Next, acquire the distance matrix for the attributes; the default distance, Euclidean, is used. The distance matrix shows the similarity between the attributes, but is difficult to interpret quickly because of the sheer number of values. To make the distances easier to interpret, create a dendrogram to visually show the hierarchical clusters:
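A sketch of the clustering, using complete linkage (hclust’s default):
dattr <- dist(tSCW)    # Euclidean by default
plot(hclust(dattr))    # dendrogram of attribute clusters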
Both cutoffs, Kaiser’s criterion and ~90% cumulative variance, are illustrated by the lines above. An analysis of the pros and cons of both cutoffs is listed below.
Kaiser’s Criterion
Pros:
Cons:
~90% Cumulative Variance
The ~90% cumulative variance cutoff leads to keeping more variables than Kaiser’s criterion, which means that some of the benefits of Kaiser’s criterion remain.
Pros:
Cons:
Create a distance matrix of the wards, to find how similar the wards are. Use the distance matrix to plot a dendrogram, which helps visually analyze the distances among wards:
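A sketch, assuming the names from above and labeling the leaves with ward numbers:
dward <- dist(sSCW)                       # Euclidean by default
plot(hclust(dward), labels = SCW$Ward)    # dendrogram of wards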
There appear to be two major clusters of wards (see the leftmost and rightmost branches along the horizontal line at height = 14), while ward 42 seems to be in a category of its own. Why?
Some Googling reveals that Ward 42 is located downtown, making it more likely to be the site of crimes. Indeed, its current (2017) representative, Alderman Brendan Reilly, has made an ‘urgent call’ for increased overnight police patrols.
What about the two other major clusters? Examining a map of the wards reveals that the wards in the leftmost cluster are either near downtown or in the South Side, which has historically had higher crime rates due to white flight.
require(imager)
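A sketch of displaying the ward map with imager; the file name is hypothetical:
wardmap <- load.image("chicago_ward_map.png")   # hypothetical file name
plot(wardmap)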
Therefore, it is reasonable to infer that the leftmost cluster includes wards more prone to violence, while the rightmost cluster includes wards less prone to violence.