First, store the data in a variable. I stored the dataset as happy:
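The original chunk is not shown, so here is a minimal sketch; the file name is a placeholder for wherever the World Happiness data lives:
happy <- read.csv("world_happiness.csv")   # hypothetical file name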
Then, get the Doing Business data for the 5 new variables.
require("readxl")
Merge 5 variables from biz into happy. Rename the first column of biz to “Country” to create a primary key, then pick the 5 variables from biz to merge with happy. The variables picked were:
The new merged table is stored in happybiz. Column “X” is dropped from happy, because it is irrelevant to the dimension reduction later.
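A sketch of the merge; only Starting a Business and Trading Across Borders are confirmed picks in the biplot discussion later, so the other three column names below are placeholders:
names(biz)[1] <- "Country"                      # create the primary key
pick <- c("Country", "Starting a Business", "Trading Across Borders",
          "Var3", "Var4", "Var5")               # last three are hypothetical
happybiz <- merge(happy, biz[, pick], by = "Country")
happybiz$X <- NULL                              # drop the irrelevant index column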
Chosen reduction method: principal component analysis (PCA). Standardize the numeric columns and store them in a new dataframe, stdhappybiz. Then, run PCA:
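A sketch of the standardization and PCA, with pcahb as an assumed name for the fitted object:
stdhappybiz <- as.data.frame(scale(happybiz[sapply(happybiz, is.numeric)]))
pcahb <- princomp(stdhappybiz)   # PCA on the standardized data
summary(pcahb)                   # produces the table below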
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 3.2186644 1.34833591 1.21160446 1.11166942
## Proportion of Variance 0.5219746 0.09159972 0.07396388 0.06226576
## Cumulative Proportion 0.5219746 0.61357430 0.68753818 0.74980393
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.95950427 0.88433365 0.74606223 0.71492322
## Proportion of Variance 0.04638652 0.03940309 0.02804452 0.02575234
## Cumulative Proportion 0.79619045 0.83559354 0.86363806 0.88939041
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.6962408 0.68079349 0.54912356 0.52246769
## Proportion of Variance 0.0244240 0.02335225 0.01519281 0.01375361
## Cumulative Proportion 0.9138144 0.93716666 0.95235947 0.96611308
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.50638163 0.43659514 0.346564188 0.248663742
## Proportion of Variance 0.01291974 0.00960408 0.006051532 0.003115465
## Cumulative Proportion 0.97903283 0.98863691 0.994688437 0.997803902
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.192518007 0.0709625358 3.85693e-02 4.721377e-04
## Proportion of Variance 0.001867414 0.0002537209 7.49517e-05 1.123143e-08
## Cumulative Proportion 0.999671316 0.9999250371 1.00000e+00 1.000000e+00
After 8 principal components, the cumulative proportion of variance reaches ~0.9, meaning that only 8 components are needed to explain most of the variance in World Happiness and Ease of Doing Business in a country.
Verify this claim by examining the variances:
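The eigenvalues are the squared component standard deviations; assuming the pcahb object from above:
pcahb$sdev^2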
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.035980e+01 1.818010e+00 1.467985e+00 1.235809e+00 9.206485e-01
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## 7.820460e-01 5.566089e-01 5.111152e-01 4.847512e-01 4.634798e-01
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## 3.015367e-01 2.729725e-01 2.564224e-01 1.906153e-01 1.201067e-01
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## 6.183366e-02 3.706318e-02 5.035681e-03 1.487591e-03 2.229140e-07
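A sketch of the barplot, with a dashed line marking Kaiser's cutoff at variance = 1:
barplot(pcahb$sdev^2, las = 2, ylab = "Variance")
abline(h = 1, lty = 2)   # Kaiser's criterion cutoff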
The barplot of variances shows that the principal components’ explained variance flattens out after component 8. Examining the actual values of the variances reveals that the components can actually be cut down to 4, using Kaiser’s criterion of keeping only components with variances >= 1.
For fun: examine visual biplots to see which variables had similar response profiles, to find variables that could be dropped:
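A sketch of how the biplots can be drawn; the zoom limits are illustrative:
biplot(pcahb)                                          # full biplot
biplot(pcahb, xlim = c(-0.3, 0), ylim = c(-0.3, 0))    # negative vectors
biplot(pcahb, xlim = c(0, 0.3), ylim = c(0, 0.3))      # positive vectors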
Three biplots were produced: the full biplot, then zoomed views of the negative and positive vectors.
Two variables’ vectors immediately stand out: freedom, which overlaps with the longer GNIperCap, and Starting a Business, which overlaps with the longer Trading Across Borders. The longer the vector, the higher the correlation between the variable and its fitted values. Therefore, the variables with the shorter vectors (freedom and Starting a Business), having lower correlation, could be dropped.
Create a distance matrix containing the pairwise distances between all rows; the default distance is Euclidean.
Since the output of the distance matrix for 131 rows is difficult to read, reorganize it into a long table (stored in tb_shb). Only the first few rows of the resulting table are displayed:
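One way to build tb_shb, assuming the object names above (column names approximate the output below):
dhb <- dist(stdhappybiz)                     # Euclidean by default
m <- as.matrix(dhb)
idx <- which(upper.tri(m), arr.ind = TRUE)   # each pair of rows once
tb_shb <- data.frame(row1 = idx[, "row"], row2 = idx[, "col"], Distance = m[idx])
tb_shb <- tb_shb[order(tb_shb$row1, tb_shb$row2), ]
head(tb_shb)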
## Row 1 Row 2 Distance
## 1 1 2 6.503393
## 2 1 3 6.691797
## 3 1 4 5.055541
## 4 1 5 7.489941
## 5 1 6 6.706064
## 6 1 7 11.110892
First, get the Chicago crimes data and store it in the variable Chicago.
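A minimal sketch of the import; the file name is hypothetical:
Chicago <- read.csv("Chicago_Crimes.csv")   # hypothetical file name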
Transform the data to ward level with tidyr and dplyr. The first few rows of the result are displayed:
require(tidyr)
require(dplyr)
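A sketch of the aggregation, with wardcrime as an assumed name for the long table:
wardcrime <- Chicago %>%
  group_by(Ward, Primary.Type) %>%
  summarise(Freq = n()) %>%       # crimes per ward and type
  ungroup() %>%
  arrange(Primary.Type, Ward)     # matches the ordering shown below
head(as.data.frame(wardcrime))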
## Ward Primary.Type Freq
## 1 1 ARSON 197
## 2 2 ARSON 137
## 3 3 ARSON 186
## 4 4 ARSON 92
## 5 5 ARSON 155
## 6 6 ARSON 270
Spread out the frequency counts of each Primary.Type crime into columns. The first few rows and columns of the result are displayed:
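A sketch using tidyr’s spread(), with SCW as an assumed name for the wide table:
SCW <- wardcrime %>% spread(Primary.Type, Freq, fill = 0)
head(SCW)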
## Ward ARSON ASSAULT BATTERY BURGLARY CONCEALED CARRY LICENSE VIOLATION
## 1 1 197 5335 14907 8480 0
## 2 2 137 12238 33261 6078 3
## 3 3 186 12191 37918 7263 1
## 4 4 92 7283 19945 5841 0
## 5 5 155 9523 27453 10603 2
## 6 6 270 13098 37741 11993 5
## CRIM SEXUAL ASSAULT CRIMINAL DAMAGE CRIMINAL TRESPASS DECEPTIVE PRACTICE
## 1 324 14371 3648 4774
## 2 723 18170 13054 12165
## 3 673 15943 13398 4998
## 4 424 11018 4161 4368
## 5 654 16142 4169 5102
## 6 858 20574 5154 6925
Standardizing the data reveals that the column “DOMESTIC VIOLENCE” is full of NaNs: the column has zero variance across wards, so scaling divides by a standard deviation of zero:
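A sketch of the standardization, assuming the Ward column is the first column of SCW:
sSCW <- as.data.frame(scale(SCW[, -1]))   # drop Ward before scaling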
## CRIMINAL DAMAGE CRIMINAL TRESPASS DECEPTIVE PRACTICE DOMESTIC VIOLENCE
## 1 0.2029457 0.0964204 0.08632237 NaN
## 2 1.0045915 3.3945885 2.10815221 NaN
## 3 0.5346612 3.5152105 0.14759823 NaN
## 4 -0.5045874 0.2763014 -0.02474013 NaN
## 5 0.5766532 0.2791065 0.17604774 NaN
## 6 1.5118714 0.6244920 0.67473476 NaN
Eliminate NAs and run PCA:
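A sketch of dropping the NaN column and rerunning PCA (is.na() treats NaN as missing); pcachi is an assumed name:
sSCW <- sSCW[, colSums(is.na(sSCW)) == 0]   # remove all-NaN columns
pcachi <- princomp(sSCW)
summary(pcachi)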
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 4.1293226 2.2457506 1.36443026 1.23393930
## Proportion of Variance 0.5117439 0.1513624 0.05587245 0.04569646
## Cumulative Proportion 0.5117439 0.6631063 0.71897872 0.76467519
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 1.16577344 1.02142776 0.96769188 0.89515372
## Proportion of Variance 0.04078715 0.03131196 0.02810407 0.02404862
## Cumulative Proportion 0.80546233 0.83677430 0.86487837 0.88892699
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.81067051 0.71545241 0.64160923 0.62088386
## Proportion of Variance 0.01972349 0.01536231 0.01235481 0.01156953
## Cumulative Proportion 0.90865048 0.92401279 0.93636761 0.94793714
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.56435852 0.549134875 0.499487827 0.41368166
## Proportion of Variance 0.00955884 0.009050093 0.007487638 0.00513603
## Cumulative Proportion 0.95749598 0.966546069 0.974033707 0.97916974
## Comp.17 Comp.18 Comp.19 Comp.20
## Standard deviation 0.372161530 0.348926318 0.319265837 0.272991561
## Proportion of Variance 0.004156789 0.003653949 0.003059144 0.002236626
## Cumulative Proportion 0.983326526 0.986980475 0.990039619 0.992276245
## Comp.21 Comp.22 Comp.23 Comp.24
## Standard deviation 0.261863320 0.202885283 0.19431733 0.1672872441
## Proportion of Variance 0.002057995 0.001235367 0.00113323 0.0008398866
## Cumulative Proportion 0.994334240 0.995569607 0.99670284 0.9975427240
## Comp.25 Comp.26 Comp.27 Comp.28
## Standard deviation 0.1360293249 0.1295298854 0.1072695201 0.1063105726
## Proportion of Variance 0.0005553415 0.0005035412 0.0003453406 0.0003391938
## Cumulative Proportion 0.9980980655 0.9986016066 0.9989469473 0.9992861411
## Comp.29 Comp.30 Comp.31 Comp.32
## Standard deviation 0.0936325408 0.0824644881 0.0608241256 5.365655e-02
## Proportion of Variance 0.0002631168 0.0002040934 0.0001110316 8.640533e-05
## Cumulative Proportion 0.9995492579 0.9997533513 0.9998643829 9.999508e-01
## Comp.33 Comp.34
## Standard deviation 3.386827e-02 2.219629e-02
## Proportion of Variance 3.442556e-05 1.478617e-05
## Cumulative Proportion 9.999852e-01 1.000000e+00
After 9 principal components, the cumulative proportion of variance reaches ~0.9, meaning that only 9 components are needed to explain most of the variance in crime rates across Chicago’s wards.
Verify this claim by examining the variances:
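Assuming the pcachi object from above:
pcachi$sdev^2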
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.705131e+01 5.043396e+00 1.861670e+00 1.522606e+00 1.359028e+00
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## 1.043315e+00 9.364276e-01 8.013002e-01 6.571867e-01 5.118722e-01
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## 4.116624e-01 3.854968e-01 3.185005e-01 3.015491e-01 2.494881e-01
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20
## 1.711325e-01 1.385042e-01 1.217496e-01 1.019307e-01 7.452439e-02
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
## 6.857240e-02 4.116244e-02 3.775922e-02 2.798502e-02 1.850398e-02
## Comp.26 Comp.27 Comp.28 Comp.29 Comp.30
## 1.677799e-02 1.150675e-02 1.130194e-02 8.767053e-03 6.800392e-03
## Comp.31 Comp.32 Comp.33 Comp.34
## 3.699574e-03 2.879026e-03 1.147060e-03 4.926753e-04
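As before, a sketch of the barplot with the Kaiser cutoff marked:
barplot(pcachi$sdev^2, las = 2, ylab = "Variance")
abline(h = 1, lty = 2)   # Kaiser's criterion cutoff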
The barplot of variances shows that the principal components’ explained variance flattens out after component 9. Examining the actual values of the variances reveals that the components can actually be cut down to 6, using Kaiser’s criterion of keeping only components with variances >= 1.
For fun: run cluster analysis on the attributes to see which variables could be dropped due to similarity. The default distance, Euclidean, is used.
Because cluster analysis in R works by row, convert the dataframe of wrangled Chicago ward data into a matrix and transpose it. Transposing the matrix flips the Primary.Type categories to rows, allowing attribute-based cluster analysis. This transposed matrix is stored in tSCW. The first few rows and columns of the resulting matrix are shown.
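A sketch of the transposition, assuming the standardized dataframe sSCW from above:
tSCW <- t(as.matrix(sSCW))   # crime types become rows
tSCW[1:6, 1:3]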
## 1 2 3
## ARSON 0.07463137 -0.48791666 -0.02850243
## ASSAULT -0.43885396 1.22778091 1.21643341
## BATTERY -0.47937961 0.91073462 1.26345130
## BURGLARY 0.56961395 -0.27403482 0.14216992
## CONCEALED CARRY LICENSE VIOLATION -0.56894446 0.04947343 -0.36280516
## CRIM SEXUAL ASSAULT -0.53685155 1.01131098 0.81730565
Next, acquire the distance matrix for the attributes; the default distance, Euclidean, is used. The distance matrix shows the similarity between the attributes, but is difficult to interpret quickly because of the sheer number of values. To make the distances easier to interpret, create a dendrogram to visually show the hierarchical clusters:
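A sketch of the clustering, using complete linkage (hclust’s default):
dattr <- dist(tSCW)    # Euclidean by default
plot(hclust(dattr))    # dendrogram of attribute clusters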
Both cutoffs, Kaiser’s criterion and ~90% cumulative variance, are illustrated by the lines above. An analysis of the pros and cons of both cutoffs is listed below.
Kaiser’s Criterion
Pros:
Cons:
~90% Cumulative Variance
The ~90% cumulative variance cutoff leads to keeping more variables than Kaiser’s criterion, which means that some of the benefits of Kaiser’s criterion remain.
Pros:
Cons:
Create a distance matrix of the wards, to find how similar the wards are. Use the distance matrix to plot a dendrogram, which helps visually analyze the distances among wards:
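A sketch, assuming the names from above and labeling the leaves with ward numbers:
dward <- dist(sSCW)                       # Euclidean by default
plot(hclust(dward), labels = SCW$Ward)    # dendrogram of wards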
There appear to be two major clusters of wards (see the leftmost and rightmost branches along the horizontal line at height = 14), while ward 42 seems to be in a category of its own. Why?
Some Googling reveals that Ward 42 is located downtown, making it more likely to be the site of crimes. Indeed, its current (2017) representative, Alderman Brendan Reilly, has made an ‘urgent call’ for increased overnight police patrols.
What about the two other major clusters? Examining a map of the wards reveals that the wards in the leftmost cluster are either near downtown or in the South Side, which has historically had higher crime rates due to white flight.
require(imager)
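A sketch of displaying the ward map with imager; the file name is hypothetical:
wardmap <- load.image("chicago_ward_map.png")   # hypothetical file name
plot(wardmap)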
Therefore, it is reasonable to infer that the leftmost cluster includes wards more prone to violence, while the rightmost cluster includes wards less prone to violence.