Analysis of the 2016 U.S. election results using clustering
Introduction
According to the popular phrase, "knowledge is power". The truth of that simple statement is easy to verify in business: a better understanding of the market, e.g. knowing the needs of customers or anticipating the decisions of other firms, may lead to higher profit. The same holds in politics, where knowledge about the future moves of competitors may turn out to be a real game-changer, or at least ensure a significant advantage. The temptation to gain such knowledge is sometimes so strong that it leads to morally and legally questionable actions. A notorious example is the Watergate scandal, in which the administration of U.S. President Richard Nixon, as well as his campaign committee, was involved in the break-in at its political opponent's headquarters.
Unlike the politicians involved in the Watergate scandal, we propose a perfectly legal and acceptable source of information: investigating the socio-economic factors that shape voters' political decisions. One should not expect revelations as dramatic as those obtained with a wiretapping device, yet the insights are still valuable, not only for gaining a slight advantage over opponents but for purely scientific or non-political purposes as well. Whatever the motivation, we present below the application of a few clustering methods to extract information from the 2016 U.S. election results by county, enriched with facts about the demography and economic well-being of those local government units.
library("factoextra")
library("ClusterR")
library("cluster")
library("ggplot2")
library("fossil")
library("flexclust")
library("rsample")Dataset details
In November 2016, the presidential election was held in the United States. The Republican candidate Donald Trump defeated the Democrat Hillary Clinton and became the 45th president. Under the following link, one may find the number of votes per candidate grouped by county. In addition, the dataset contains several dozen columns with information about demographics and some economic indicators; they are presented below. The dataset also stores the results of the 2012 U.S. election (Mitt Romney vs Barack Obama), but for the majority of this work those will be skipped, as we focus mainly on the 2016 election.
data_raw <- read.csv("votes.csv", sep = ",")
colnames(data_raw)
## [1] "X.1" "X" "combined_fips"
## [4] "votes_dem_2016" "votes_gop_2016" "total_votes_2016"
## [7] "Clinton" "Trump" "diff_2016"
## [10] "per_point_diff_2016" "state_abbr" "county_name"
## [13] "FIPS" "total_votes_2012" "votes_dem_2012"
## [16] "votes_gop_2012" "county_fips" "state_fips"
## [19] "Obama" "Romney" "diff_2012"
## [22] "per_point_diff_2012" "fips" "area_name"
## [25] "state_abbreviation" "population2014" "population2010"
## [28] "population_change" "POP010210" "AGE135214"
## [31] "AGE295214" "age65plus" "SEX255214"
## [34] "White" "Black" "RHI325214"
## [37] "RHI425214" "RHI525214" "RHI625214"
## [40] "Hispanic" "RHI825214" "POP715213"
## [43] "POP645213" "NonEnglish" "Edu_highschool"
## [46] "Edu_batchelors" "VET605213" "LFE305213"
## [49] "HSG010214" "HSG445213" "HSG096213"
## [52] "HSG495213" "HSD410213" "HSD310213"
## [55] "Income" "INC110213" "Poverty"
## [58] "BZA010213" "BZA110213" "BZA115213"
## [61] "NES010213" "SBO001207" "SBO315207"
## [64] "SBO115207" "SBO215207" "SBO515207"
## [67] "SBO415207" "SBO015207" "MAN450207"
## [70] "WTN220207" "RTN130207" "RTN131207"
## [73] "AFN120207" "BPS030214" "LND110210"
## [76] "Density" "Clinton_Obama" "Trump_Romney"
## [79] "Trump_Prediction" "Clinton_Prediction" "Trump_Deviation"
## [82] "Clinton_Deviation"
head(data_raw)
## X.1 X combined_fips votes_dem_2016 votes_gop_2016 total_votes_2016
## 1 30 29 1001 5908 18110 24661
## 2 31 30 1003 18409 72780 94090
## 3 32 31 1005 4848 5431 10390
## 4 33 32 1007 1874 6733 8748
## 5 34 33 1009 2150 22808 25384
## 6 35 34 1011 3530 1139 4701
## Clinton Trump diff_2016 per_point_diff_2016 state_abbr county_name
## 1 0.23956855 0.7343579 12202 -0.49478934 AL Autauga County
## 2 0.19565310 0.7735147 54371 -0.57786162 AL Baldwin County
## 3 0.46660250 0.5227141 583 -0.05611165 AL Barbour County
## 4 0.21422039 0.7696616 4859 -0.55544124 AL Bibb County
## 5 0.08469902 0.8985188 20658 -0.81381973 AL Blount County
## 6 0.75090406 0.2422889 2391 0.50861519 AL Bullock County
## FIPS total_votes_2012 votes_dem_2012 votes_gop_2012 county_fips state_fips
## 1 1001 23909 6354 17366 1001 1
## 2 1003 84988 18329 65772 1003 1
## 3 1005 11459 5873 5539 1005 1
## 4 1007 8391 2200 6131 1007 1
## 5 1009 23980 2961 20741 1009 1
## 6 1011 5318 4058 1250 1011 1
## Obama Romney diff_2012 per_point_diff_2012 fips area_name
## 1 0.2657577 0.7263374 11012 -0.4605797 1001 Autauga County
## 2 0.2156657 0.7738975 47443 -0.5582318 1003 Baldwin County
## 3 0.5125229 0.4833755 334 0.0291474 1005 Barbour County
## 4 0.2621857 0.7306638 3931 -0.4684781 1007 Bibb County
## 5 0.1234779 0.8649291 17780 -0.7414512 1009 Blount County
## 6 0.7630688 0.2350508 2808 0.5280181 1011 Bullock County
## state_abbreviation population2014 population2010 population_change POP010210
## 1 AL 55395 54571 1.5 54571
## 2 AL 200111 182265 9.8 182265
## 3 AL 26887 27457 -2.1 27457
## 4 AL 22506 22919 -1.8 22915
## 5 AL 57719 57322 0.7 57322
## 6 AL 10764 10915 -1.4 10914
## AGE135214 AGE295214 age65plus SEX255214 White Black RHI325214 RHI425214
## 1 6.0 25.2 13.8 51.4 0.779 0.187 0.5 1.1
## 2 5.6 22.2 18.7 51.2 0.871 0.096 0.7 0.9
## 3 5.7 21.2 16.5 46.6 0.502 0.476 0.6 0.5
## 4 5.3 21.0 14.8 45.9 0.763 0.221 0.4 0.2
## 5 6.1 23.6 17.0 50.5 0.960 0.018 0.6 0.3
## 6 6.3 21.4 14.9 45.3 0.269 0.701 0.8 0.3
## RHI525214 RHI625214 Hispanic RHI825214 POP715213 POP645213 NonEnglish
## 1 0.1 1.8 0.027 75.6 85.0 1.6 3.5
## 2 0.1 1.6 0.046 83.0 82.1 3.6 5.5
## 3 0.2 0.9 0.045 46.6 84.8 2.9 5.0
## 4 0.1 0.9 0.021 74.5 86.6 1.2 2.1
## 5 0.1 1.2 0.087 87.8 88.7 4.3 7.3
## 6 0.7 1.1 0.075 22.1 84.7 5.4 5.2
## Edu_highschool Edu_batchelors VET605213 LFE305213 HSG010214 HSG445213
## 1 85.6 20.9 5922 26.2 22751 76.8
## 2 89.1 27.7 19346 25.9 107374 72.6
## 3 73.7 13.4 2120 24.6 11799 67.7
## 4 77.5 12.1 1327 27.6 8978 79.0
## 5 77.0 12.1 4540 33.9 23826 81.0
## 6 67.8 12.5 636 26.9 4461 74.3
## HSG096213 HSG495213 HSD410213 HSD310213 Income INC110213 Poverty BZA010213
## 1 8.3 136200 20071 2.71 24571 53682 12.1 817
## 2 24.4 168600 73283 2.52 26766 50221 13.9 4871
## 3 10.6 89200 9200 2.66 16829 32911 26.7 464
## 4 7.3 90500 7091 3.03 17427 36447 18.1 275
## 5 4.5 117100 21108 2.70 20730 44145 15.8 660
## 6 8.7 70600 3741 2.73 18628 32033 21.6 112
## BZA110213 BZA115213 NES010213 SBO001207 SBO315207 SBO115207 SBO215207
## 1 10120 2.1 2947 4067 15.2 0.0 1.3
## 2 54988 3.7 16508 19035 2.7 0.4 1.0
## 3 6611 -5.6 1546 1667 0.0 0.0 0.0
## 4 3145 7.5 1126 1385 14.9 0.0 0.0
## 5 6798 3.4 3563 4458 0.0 0.0 0.0
## 6 0 0.0 470 417 0.0 0.0 0.0
## SBO515207 SBO415207 SBO015207 MAN450207 WTN220207 RTN130207 RTN131207
## 1 0 0.7 31.7 0 0 598175 12003
## 2 0 1.3 27.3 1410273 0 2966489 17166
## 3 0 0.0 27.0 0 0 188337 6334
## 4 0 0.0 0.0 0 0 124707 5804
## 5 0 0.0 23.2 341544 0 319700 5622
## 6 0 0.0 38.8 0 0 43810 3995
## AFN120207 BPS030214 LND110210 Density Clinton_Obama Trump_Romney
## 1 88157 131 594.44 91.8 -0.02618911 0.0080205305
## 2 436955 1384 1589.78 114.6 -0.02001264 -0.0003827715
## 3 0 8 884.88 31.0 -0.04592041 0.0393386355
## 4 10757 19 622.58 36.8 -0.04796528 0.0389978305
## 5 20941 3 644.78 88.9 -0.03877888 0.0335896444
## 6 3670 1 622.81 17.5 -0.01216476 0.0072381037
## Trump_Prediction Clinton_Prediction Trump_Deviation Clinton_Deviation
## 1 0.6208587 0.3404929 -0.113499208 0.10092439
## 2 0.5867486 0.3595024 -0.186766123 0.16384929
## 3 0.5178320 0.4746928 -0.004882142 0.00809032
## 4 0.6922274 0.2860314 -0.077434258 0.07181099
## 5 0.7896486 0.1774903 -0.108870165 0.09279126
## 6 0.4452689 0.5722160 0.202980053 -0.17868806
sum(is.na(data_raw)) # Checking whether some observations are missing
## [1] 0
dim(data_raw)
## [1] 3112 82
One may notice that many variables have rather cryptic names that are not self-explanatory at all. It is therefore very useful to have an additional dictionary containing their definitions. What is more, the whole dataset is organised in an untidy way. Some columns containing identical data appear more than once under different names, and the column names follow two mutually exclusive formats: in some cases abbreviations are used, while in others combinations of letters and numbers are applied. This has to be made uniform. The dataset also contains many redundant columns. For instance, there is no need to keep separate columns with the numbers of votes for the Republican and the Democratic candidates, because the total number of votes and the support rate of one candidate already provide all the necessary information.
Data preparation
names <- read.csv("county_facts_dictionary.csv", sep = ",")
names
## column_name
## 1 PST045214
## 2 PST040210
## 3 PST120214
## 4 POP010210
## 5 AGE135214
## 6 AGE295214
## 7 AGE775214
## 8 SEX255214
## 9 RHI125214
## 10 RHI225214
## 11 RHI325214
## 12 RHI425214
## 13 RHI525214
## 14 RHI625214
## 15 RHI725214
## 16 RHI825214
## 17 POP715213
## 18 POP645213
## 19 POP815213
## 20 EDU635213
## 21 EDU685213
## 22 VET605213
## 23 LFE305213
## 24 HSG010214
## 25 HSG445213
## 26 HSG096213
## 27 HSG495213
## 28 HSD410213
## 29 HSD310213
## 30 INC910213
## 31 INC110213
## 32 PVY020213
## 33 BZA010213
## 34 BZA110213
## 35 BZA115213
## 36 NES010213
## 37 SBO001207
## 38 SBO315207
## 39 SBO115207
## 40 SBO215207
## 41 SBO515207
## 42 SBO415207
## 43 SBO015207
## 44 MAN450207
## 45 WTN220207
## 46 RTN130207
## 47 RTN131207
## 48 AFN120207
## 49 BPS030214
## 50 LND110210
## 51 POP060210
## description
## 1 Population, 2014 estimate
## 2 Population, 2010 (April 1) estimates base
## 3 Population, percent change - April 1, 2010 to July 1, 2014
## 4 Population, 2010
## 5 Persons under 5 years, percent, 2014
## 6 Persons under 18 years, percent, 2014
## 7 Persons 65 years and over, percent, 2014
## 8 Female persons, percent, 2014
## 9 White alone, percent, 2014
## 10 Black or African American alone, percent, 2014
## 11 American Indian and Alaska Native alone, percent, 2014
## 12 Asian alone, percent, 2014
## 13 Native Hawaiian and Other Pacific Islander alone, percent, 2014
## 14 Two or More Races, percent, 2014
## 15 Hispanic or Latino, percent, 2014
## 16 White alone, not Hispanic or Latino, percent, 2014
## 17 Living in same house 1 year & over, percent, 2009-2013
## 18 Foreign born persons, percent, 2009-2013
## 19 Language other than English spoken at home, pct age 5+, 2009-2013
## 20 High school graduate or higher, percent of persons age 25+, 2009-2013
## 21 Bachelor's degree or higher, percent of persons age 25+, 2009-2013
## 22 Veterans, 2009-2013
## 23 Mean travel time to work (minutes), workers age 16+, 2009-2013
## 24 Housing units, 2014
## 25 Homeownership rate, 2009-2013
## 26 Housing units in multi-unit structures, percent, 2009-2013
## 27 Median value of owner-occupied housing units, 2009-2013
## 28 Households, 2009-2013
## 29 Persons per household, 2009-2013
## 30 Per capita money income in past 12 months (2013 dollars), 2009-2013
## 31 Median household income, 2009-2013
## 32 Persons below poverty level, percent, 2009-2013
## 33 Private nonfarm establishments, 2013
## 34 Private nonfarm employment, 2013
## 35 Private nonfarm employment, percent change, 2012-2013
## 36 Nonemployer establishments, 2013
## 37 Total number of firms, 2007
## 38 Black-owned firms, percent, 2007
## 39 American Indian- and Alaska Native-owned firms, percent, 2007
## 40 Asian-owned firms, percent, 2007
## 41 Native Hawaiian- and Other Pacific Islander-owned firms, percent, 2007
## 42 Hispanic-owned firms, percent, 2007
## 43 Women-owned firms, percent, 2007
## 44 Manufacturers shipments, 2007 ($1,000)
## 45 Merchant wholesaler sales, 2007 ($1,000)
## 46 Retail sales, 2007 ($1,000)
## 47 Retail sales per capita, 2007
## 48 Accommodation and food services sales, 2007 ($1,000)
## 49 Building permits, 2014
## 50 Land area in square miles, 2010
## 51 Population per square mile, 2010
Firstly, we need to remove redundant columns.
data_raw[, c('X.1', 'X', 'combined_fips', 'state_abbr', 'FIPS', 'county_fips', 'state_fips', 'fips', 'area_name', 'state_abbreviation', 'population2010', 'population_change', 'POP010210')] <- list(NULL)
Secondly, let us focus on unifying the format of the column names. Using the dictionary displayed above, we may rename certain columns. We decided to use whole words, or their abbreviations, as the default format.
names_to_change <- c('AGE135214', 'AGE295214', 'SEX255214', 'RHI325214', 'RHI425214',
'RHI525214', 'RHI625214', 'RHI825214', 'POP715213', 'POP645213',
'VET605213', 'LFE305213', 'HSG010214', 'HSG445213', 'HSG096213', 'HSG495213', 'HSD410213', 'HSD310213', 'INC110213', 'BZA010213', 'BZA110213', 'BZA115213', 'NES010213', 'SBO001207', 'SBO315207', 'SBO115207', 'SBO215207', 'SBO515207', 'SBO415207', 'SBO015207', 'MAN450207', 'WTN220207', 'RTN130207', 'RTN131207', 'AFN120207', 'BPS030214', 'LND110210')
colnames(data_raw)[colnames(data_raw) %in% names_to_change] <- c("age_under5", "age_under18", "%female", "%Naitive", "%Asian", "%Hawaiian", "%Mixed", "%White", "living_same_house", "foreign", "%veterans", "timetravel_to_work", "housing_units", "homeownership", "housing_units_multi-unit", "owner-occupied_housing", "households", "persons_per_household", "median_income", "nonfarm_establishments", "nonfarm_employment", "nonfarm_employment_change", "nonemployer_establishments", "number_firms", "black-owned_firms", "native-owned_firms", "asian-owned_firms", "hawaiian-owned_firms", "hispanic-owned_firms", "women-owned_firms", "manufacturers_shipments", "merchant_sales", "retail_sales", "retail_per_capita", "accommodation_food_services", "building_permits", "land_area")
colnames(data_raw)
## [1] "votes_dem_2016" "votes_gop_2016"
## [3] "total_votes_2016" "Clinton"
## [5] "Trump" "diff_2016"
## [7] "per_point_diff_2016" "county_name"
## [9] "total_votes_2012" "votes_dem_2012"
## [11] "votes_gop_2012" "Obama"
## [13] "Romney" "diff_2012"
## [15] "per_point_diff_2012" "population2014"
## [17] "age_under5" "age_under18"
## [19] "age65plus" "%female"
## [21] "White" "Black"
## [23] "%Naitive" "%Asian"
## [25] "%Hawaiian" "%Mixed"
## [27] "Hispanic" "%White"
## [29] "living_same_house" "foreign"
## [31] "NonEnglish" "Edu_highschool"
## [33] "Edu_batchelors" "%veterans"
## [35] "timetravel_to_work" "housing_units"
## [37] "homeownership" "housing_units_multi-unit"
## [39] "owner-occupied_housing" "households"
## [41] "persons_per_household" "Income"
## [43] "median_income" "Poverty"
## [45] "nonfarm_establishments" "nonfarm_employment"
## [47] "nonfarm_employment_change" "nonemployer_establishments"
## [49] "number_firms" "black-owned_firms"
## [51] "native-owned_firms" "asian-owned_firms"
## [53] "hawaiian-owned_firms" "hispanic-owned_firms"
## [55] "women-owned_firms" "manufacturers_shipments"
## [57] "merchant_sales" "retail_sales"
## [59] "retail_per_capita" "accommodation_food_services"
## [61] "building_permits" "land_area"
## [63] "Density" "Clinton_Obama"
## [65] "Trump_Romney" "Trump_Prediction"
## [67] "Clinton_Prediction" "Trump_Deviation"
## [69] "Clinton_Deviation"
We stated above that for the purpose of this analysis the results of the 2012 election will be neglected. The following figure provides the motivation for that choice. The majority of counties are fairly stable in their political affiliation over time: those that voted for Barack Obama in 2012 supported Hillary Clinton in 2016 as well, and the correlation between the results is high. Therefore, we may limit our considerations to the 2016 election and remove the data describing the 2012 confrontation between Mitt Romney and Barack Obama. What is more, the 2016 election seems more interesting simply because it is more up-to-date.
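As a quick sanity check (a sketch, not part of the original workflow), the claimed correlation can be quantified directly on the two vote-share columns:
# Correlation between the Democratic vote shares in 2016 and 2012 across counties
cor(data_raw$Clinton, data_raw$Obama)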
data <- data_raw
ggplot(data) + geom_point(aes(x = Clinton*100, y = Obama*100)) + labs(title = "Votes for democratic candidates in 2012 and 2016", x = "Percentage of votes for H. Clinton [%]", y = "Percentage of votes for B. Obama [%]")
We may set aside some observations, 25% of their total number, for prediction purposes. In other words, we need to split the dataset into two parts. The first one contains 75% of the observations and will be used to build the models. The second one consists of the remaining observations; its purpose is to demonstrate the models' ability to predict which cluster a given observation should be assigned to.
set.seed(123456)
data_split <- initial_split(data, prop = .75)
data1 <- training(data_split)
data2 <- testing(data_split)
head(data1)
## votes_dem_2016 votes_gop_2016 total_votes_2016 Clinton Trump
## 234 67045 100178 182918 0.3665304 0.5476662
## 1066 3295 5174 8829 0.3732019 0.5860233
## 1905 3721 3179 7975 0.4665831 0.3986207
## 711 23392 24152 50815 0.4603365 0.4752927
## 2285 141597 90321 241818 0.5855519 0.3735082
## 2230 4256 12545 17481 0.2434643 0.7176363
## diff_2016 per_point_diff_2016 county_name total_votes_2012
## 234 33133 -0.18113581 Douglas County 150160
## 1066 1879 -0.21282138 Rowan County 7641
## 1905 542 0.06796238 Cibola County 8154
## 711 760 -0.01495621 Kendall County 45042
## 2285 51276 0.21204377 Providence County 234092
## 2230 8289 -0.47417196 Clarion County 15227
## votes_dem_2012 votes_gop_2012 Obama Romney diff_2012
## 234 54093 93930 0.3602357 0.6255328 39837
## 1066 3438 4035 0.4499411 0.5280722 597
## 1905 4907 2966 0.6017905 0.3637479 1941
## 711 21219 23076 0.4710936 0.5123218 1857
## 2285 156022 74344 0.6664986 0.3175845 81678
## 2230 4731 10228 0.3106981 0.6717016 5497
## per_point_diff_2012 population2014 age_under5 age_under18 age65plus
## 234 -0.26529702 314638 6.3 28.3 9.9
## 1066 -0.07813113 23655 5.6 19.3 13.4
## 1905 0.23804268 27349 7.4 24.5 14.2
## 711 -0.04122819 121350 7.3 29.8 9.0
## 2285 0.34891410 631974 5.8 21.1 14.2
## 2230 -0.36100348 38821 5.0 18.9 18.0
## %female White Black %Naitive %Asian %Hawaiian %Mixed Hispanic %White
## 234 50.3 0.913 0.014 0.5 4.3 0.1 2.4 0.083 83.9
## 1066 51.3 0.960 0.019 0.2 0.8 0.0 1.2 0.015 94.6
## 1905 48.9 0.540 0.014 41.7 0.6 0.1 2.1 0.378 20.8
## 711 50.6 0.876 0.067 0.5 3.3 0.0 1.9 0.169 71.9
## 2285 51.5 0.797 0.114 1.2 4.4 0.2 3.0 0.209 64.1
## 2230 51.2 0.970 0.011 0.2 0.7 0.0 0.9 0.007 96.4
## living_same_house foreign NonEnglish Edu_highschool Edu_batchelors
## 234 85.4 6.4 8.8 97.5 55.8
## 1066 79.5 1.0 2.1 77.1 23.8
## 1905 88.2 5.2 44.3 79.9 10.4
## 711 88.9 9.3 17.3 92.5 34.2
## 2285 85.7 17.9 29.7 80.7 26.4
## 2230 84.3 1.0 3.3 87.9 18.5
## %veterans timetravel_to_work housing_units homeownership
## 234 19635 27.7 113777 80.9
## 1066 1231 21.0 10126 65.2
## 1905 1858 22.8 11086 72.4
## 711 5821 33.3 41183 83.6
## 2285 35075 23.1 263473 54.3
## 2230 3011 22.8 19879 70.9
## housing_units_multi-unit owner-occupied_housing households
## 234 15.8 335600 103780
## 1066 11.9 96700 8305
## 1905 7.9 83800 8016
## 711 9.0 210800 38075
## 2285 51.7 224100 237800
## 2230 14.0 106200 15776
## persons_per_household Income median_income Poverty nonfarm_establishments
## 234 2.82 43634 101591 3.9 8040
## 1066 2.49 17094 35236 28.6 489
## 1905 3.15 16129 37237 27.6 329
## 711 3.05 31276 81765 4.8 2008
## 2285 2.52 26308 49297 17.1 15524
## 2230 2.40 21902 42389 18.3 937
## nonfarm_employment nonfarm_employment_change nonemployer_establishments
## 234 91934 5.2 28689
## 1066 7148 -4.4 1405
## 1905 5067 -2.2 1008
## 711 21251 0.0 7348
## 2285 254490 1.2 39945
## 2230 10918 0.5 2443
## number_firms black-owned_firms native-owned_firms asian-owned_firms
## 234 32992 0.0 0.0 2.4
## 1066 1823 0.0 0.0 0.0
## 1905 1513 0.0 14.4 1.7
## 711 8311 3.1 0.0 0.0
## 2285 51806 5.5 0.5 2.6
## 2230 3460 0.0 0.0 0.0
## hawaiian-owned_firms hispanic-owned_firms women-owned_firms
## 234 0 3.1 27.7
## 1066 0 0.0 18.0
## 1905 0 22.1 26.6
## 711 0 9.3 32.8
## 2285 0 9.7 28.1
## 2230 0 0.0 0.0
## manufacturers_shipments merchant_sales retail_sales retail_per_capita
## 234 1950842 0 3765450 13846
## 1066 259812 54111 272076 12039
## 1905 0 15859 217448 7992
## 711 685730 1299755 1064021 10957
## 2285 6422854 5856937 5924945 9430
## 2230 586557 287384 421097 10563
## accommodation_food_services building_permits land_area Density
## 234 435099 3230 840.25 339.7
## 1066 36058 14 279.80 83.4
## 1905 117318 0 4539.48 6.0
## 711 99906 283 320.34 358.2
## 2285 1065946 342 409.50 1530.3
## 2230 51159 422 600.83 66.6
## Clinton_Obama Trump_Romney Trump_Prediction Clinton_Prediction
## 234 0.006294609 -0.07786660 0.3939749 0.5275057
## 1066 -0.076739159 0.05795109 0.6673069 0.2917867
## 1905 -0.135207460 0.03487284 0.4844554 0.4468122
## 711 -0.010757131 -0.03702910 0.5272468 0.4092330
## 2285 -0.080946698 0.05592364 0.5341748 0.4206900
## 2230 -0.067233769 0.04593471 0.7052056 0.2407079
## Trump_Deviation Clinton_Deviation
## 234 -0.15369132 0.160975333
## 1066 0.08128355 -0.081415221
## 1905 0.08583473 -0.019770904
## 711 0.05195406 -0.051103467
## 2285 0.16066662 -0.164861976
## 2230 -0.01243071 -0.002756465
head(data2)
## votes_dem_2016 votes_gop_2016 total_votes_2016 Clinton Trump diff_2016
## 2 18409 72780 94090 0.1956531 0.7735147 54371
## 7 3716 4891 8685 0.4278641 0.5631549 1175
## 11 2909 15068 18255 0.1593536 0.8254177 12159
## 13 5712 7109 12936 0.4415584 0.5495516 1397
## 16 4194 15825 20513 0.2044557 0.7714620 11631
## 18 3069 3413 6543 0.4690509 0.5216262 344
## per_point_diff_2016 county_name total_votes_2012 votes_dem_2012
## 2 -0.57786162 Baldwin County 84988 18329
## 7 -0.13529073 Butler County 9483 4367
## 11 -0.66606409 Chilton County 17434 3391
## 13 -0.10799320 Clarke County 13827 6317
## 16 -0.56700629 Coffee County 19715 4899
## 18 -0.05257527 Conecuh County 7013 3551
## votes_gop_2012 Obama Romney diff_2012 per_point_diff_2012
## 2 65772 0.2156657 0.7738975 47443 -0.55823175
## 7 5081 0.4605083 0.5358009 714 -0.07529263
## 11 13910 0.1945050 0.7978662 10519 -0.60336125
## 13 7463 0.4568598 0.5397411 1146 -0.08288132
## 16 14638 0.2484910 0.7424803 9739 -0.49398935
## 18 3434 0.5063454 0.4896621 117 0.01668330
## population2014 age_under5 age_under18 age65plus %female White Black %Naitive
## 2 200111 5.6 22.2 18.7 51.2 0.871 0.096 0.7
## 7 20296 6.1 23.6 18.0 53.6 0.539 0.440 0.4
## 11 43931 6.4 24.2 15.2 50.8 0.871 0.106 0.5
## 13 24945 5.6 22.6 18.0 52.8 0.540 0.442 0.4
## 16 50909 6.1 23.7 15.9 50.5 0.764 0.177 1.4
## 18 12670 5.5 21.4 20.4 51.9 0.521 0.462 0.4
## %Asian %Hawaiian %Mixed Hispanic %White living_same_house foreign NonEnglish
## 2 0.9 0.1 1.6 0.046 83.0 82.1 3.6 5.5
## 7 0.9 0.0 0.8 0.012 53.1 94.6 0.8 1.7
## 11 0.4 0.2 1.2 0.077 80.3 88.8 5.2 7.4
## 13 0.5 0.0 0.8 0.013 53.1 92.5 0.3 1.3
## 16 1.6 0.4 2.5 0.064 71.3 83.3 5.1 7.5
## 18 0.1 0.0 1.2 0.018 51.1 96.9 0.6 2.2
## Edu_highschool Edu_batchelors %veterans timetravel_to_work housing_units
## 2 89.1 27.7 19346 25.9 107374
## 7 76.3 14.0 1497 24.0 9916
## 11 76.0 12.9 3308 31.8 19221
## 13 78.3 11.2 1583 25.5 12583
## 16 82.4 22.5 6066 21.3 22852
## 18 77.3 8.8 1101 27.2 7055
## homeownership housing_units_multi-unit owner-occupied_housing households
## 2 72.6 24.4 168600 73283
## 7 70.3 13.3 74700 8235
## 11 74.5 4.1 102600 16232
## 13 72.9 6.8 85600 9631
## 16 69.9 13.5 131000 18820
## 18 79.1 7.2 69600 4994
## persons_per_household Income median_income Poverty nonfarm_establishments
## 2 2.52 26766 50221 13.9 4871
## 7 2.47 17403 29918 28.4 393
## 11 2.67 20701 41250 19.5 703
## 13 2.63 18979 29357 29.3 586
## 16 2.65 23380 43768 18.6 954
## 18 2.61 15605 24658 32.7 194
## nonfarm_employment nonfarm_employment_change nonemployer_establishments
## 2 54988 3.7 16508
## 7 5711 2.7 1095
## 11 7396 8.8 2719
## 13 6648 3.7 1640
## 16 12762 0.3 2602
## 18 2197 -1.5 756
## number_firms black-owned_firms native-owned_firms asian-owned_firms
## 2 19035 2.7 0.4 1.0
## 7 1769 0.0 0.0 3.3
## 11 0 0.0 0.0 0.0
## 13 2374 0.0 0.0 0.0
## 16 4032 6.0 0.0 1.3
## 18 980 0.0 0.0 0.0
## hawaiian-owned_firms hispanic-owned_firms women-owned_firms
## 2 0 1.3 27.3
## 7 0 0.0 0.0
## 11 0 0.0 0.0
## 13 0 0.0 28.6
## 16 0 0.7 29.3
## 18 0 0.0 23.0
## manufacturers_shipments merchant_sales retail_sales retail_per_capita
## 2 1410273 0 2966489 17166
## 7 399132 56712 229277 11326
## 11 0 155139 359910 8496
## 13 571454 85803 344311 13034
## 16 613758 131594 639623 13665
## 18 183544 76736 71232 5430
## accommodation_food_services building_permits land_area Density Clinton_Obama
## 2 436955 1384 1589.78 114.6 -0.02001264
## 7 28427 2 776.83 27.0 -0.03264414
## 11 34073 78 692.85 63.0 -0.03515139
## 13 23596 11 1238.47 20.9 -0.01530133
## 16 50080 58 678.97 73.6 -0.04403529
## 18 8667 3 850.16 15.6 -0.03729446
## Trump_Romney Trump_Prediction Clinton_Prediction Trump_Deviation
## 2 -0.0003827715 0.5867486 0.3595024 -0.186766123
## 7 0.0273539578 0.5224916 0.4633133 -0.040663239
## 11 0.0275514554 0.7283339 0.2433266 -0.097083797
## 13 0.0098105526 0.5289034 0.4534988 -0.020648222
## 16 0.0289816548 0.5804720 0.3800510 -0.190990039
## 18 0.0319641092 0.5249156 0.4589741 0.003289476
## Clinton_Deviation
## 2 0.16384929
## 7 0.03544912
## 11 0.08397298
## 13 0.01194032
## 16 0.17559525
## 18 -0.01007684
We have to normalise the dataset to ensure that no single variable dominates the others, i.e. that its contribution to the cost function minimised by the clustering algorithms does not dwarf the remaining contributions. Otherwise the data would be clustered incorrectly, because the centroid/medoid locations would be appropriate in only some of the dimensions.
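The standardisation applied below is the usual z-score transformation, which is what scale() computes by default:

$$ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, $$

where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$.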
data_train <- as.data.frame( lapply(data1[, -8], scale) ) # We need to exclude the categorical variable, i.e. name of the county
data_predict <- as.data.frame( lapply(data2[, -8], scale) )
county_names_train <- data1[, 8] # We will later use it while reviewing clustering results
county_names_predict <- data2[, 8]
Analysis
Since the dataset has been prepared, we may start building our clustering models. Before that, however, we need to assess whether it is rational to apply clustering algorithms at all. In other words, we want to check whether our data are clusterable. The Hopkins statistic and the ordered dissimilarity matrix (ODM) are the tools we need.
Assessing clustering tendency
access <- get_clust_tendency(data[, -8], n = nrow(data_raw)-1, gradient = list(low="red", mid="white", high="blue") )
access$hopkins_stat
## [1] 0.993888
access$plot
The Hopkins statistic compares nearest-neighbour distances in the actual data with the analogous distances obtained for points generated from a uniform distribution over the same space. In that sense it is a ratio expressing how far the dataset departs from uniformly scattered points.
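Roughly, in the convention used by get_clust_tendency (where values near 1 indicate a clusterable dataset), the statistic can be sketched as

$$ H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i}, $$

where $u_i$ is the distance from the $i$-th artificially generated (uniform) point to its nearest real observation and $w_i$ is the distance from the $i$-th sampled real observation to its nearest neighbour. For clustered data the $w_i$ are small relative to the $u_i$, so $H$ approaches 1; a value around 0.5 suggests no clustering structure.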
The obtained results suggest that we may proceed with clustering. The value of the Hopkins statistic is (while not exactly equal to one) very close to it, which in this setting means that we may reject the null hypothesis of there being no visible clusters in the dataset. A similar conclusion can be drawn from the ordered dissimilarity matrix, where the blue colour indicates higher dissimilarity in the data.
Optimal number of clusters
After checking that our work may lead to a satisfying outcome, we may continue preparations for applying the clustering algorithms. The next task is to find the optimal number of clusters. A few quantities might be used here; we have decided on two of them: the silhouette and the total within-cluster sum of squares (WSS). These are popular and widely used measures, and their advantage is that they are available for both the k-means and PAM methods, which will turn out to be important later.
Let's start with the silhouette. For each observation it is the difference between the average distance to the nearest neighbouring cluster and the average distance to the other objects in its own cluster, divided by the greater of those two values. The optimal situation is a silhouette equal to one; a value of minus one is the worst possible scenario.
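Written out, with $a(i)$ the mean distance from observation $i$ to the other members of its own cluster and $b(i)$ the mean distance to the members of the nearest other cluster:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1]. $$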
fviz_nbclust(data[, -8], FUNcluster = kmeans, method = "silhouette")fviz_nbclust(data[, -8], FUNcluster = cluster::pam, method = "silhouette")Finally, we may compute the total within sum of squares. The third measure informs about the total variance within data explained thanks to applying clustering (as in regression models developed for econometrics analysis).
fviz_nbclust(data[, -8], FUNcluster = kmeans, method = "wss")Choosing an optimal number of clusters is always a demanding task. Various measures may lead to different conclusions. While for the PAM method three clusters seem to be a rational choice (similar value of silhouette as for two clusters and more obvious choice from wss perspective), the best number of clusters for k-means is uncertain. Therefore, it might be helpful to increase the number of indices the decision is based on and introduce two additional ones. For instance, we may use Akaike information criterion (AIC).
Optimal_Clusters_KMeans(data[, -8], max_clusters = 10, criterion = "AIC")
## [1] 3.262875e+17 1.797354e+17 8.636668e+16 5.083518e+16 4.031187e+16
## [6] 3.576490e+16 3.303212e+16 2.870489e+16 2.788313e+16 2.492855e+16
Let us choose two clusters as the optimal number for k-means.
The above analysis was carried out with what follows in mind: we now build two models, one using PAM with three clusters and a second one with two clusters found by the k-means algorithm. We will then compare their results by computing the Rand index.
PAM and k-means models
model1 <- eclust(data_train, FUNcluster = "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(model1, data_train, geom = "point", ellipse.type = "convex")
clustered_counties1 <- data.frame(model1$cluster, county_names_train)
colnames(clustered_counties1) <- c("Cluster", "County Name")
tail(clustered_counties1)
## Cluster County Name
## 2329 1 Power County
## 2330 1 Marion County
## 2331 1 Tazewell County
## 2332 1 Otero County
## 2333 1 Cottle County
## 2334 1 Sherburne County
sil1 <- silhouette(model1$cluster, dist(data_train) )
fviz_silhouette(sil1)
## cluster size ave.sil.width
## 1 1 2209 0.60
## 2 2 125 0.01
While the average silhouette for the first cluster is fairly satisfying, the second one is far worse. About half of the observations assigned to that cluster are closer to the centre of the other cluster (as may be seen in the cluster plot above). There is also a clear disproportion in the number of observations assigned to each cluster.
model2 <- eclust(data_train, FUNcluster = "pam", k = 3, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(model2, data_train, geom = "point", ellipse.type = "convex")
sil2 <- silhouette(model2$cluster, dist(data_train) )
fviz_silhouette(sil2)
## cluster size ave.sil.width
## 1 1 365 -0.09
## 2 2 1449 0.29
## 3 3 520 0.00
clustered_counties2 <- data.frame(model2$cluster, county_names_train)
colnames(clustered_counties2) <- c("Cluster", "County Name")
tail(clustered_counties2)
## Cluster County Name
## 2329 2 Power County
## 2330 2 Marion County
## 2331 2 Tazewell County
## 2332 3 Otero County
## 2333 2 Cottle County
## 2334 2 Sherburne County
Well, the results are not perfect. However, as promised, we may compare the cluster allocations returned by both models using the Rand index. It is defined as the number of pairs of observations that are treated consistently by both partitions (i.e. placed in the same cluster in both, or in different clusters in both) divided by the total number of pairs. The Rand index can thus be interpreted as the relative frequency of agreement between the two allocations.
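In symbols, with $a$ the number of pairs grouped together by both partitions, $b$ the number of pairs separated by both, and $n$ observations:

$$ RI = \frac{a + b}{\binom{n}{2}} \in [0, 1]. $$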
rand.index(model1$cluster, model2$cluster) # Not using the adjusted version
## [1] 0.5386634
Prediction
The model with two clusters found using the k-means algorithm turned out to be better in terms of the average silhouette. Therefore, this approach seems more suitable for prediction purposes.
kcca <- as.kcca(model1, data_train) # Convert the k-means result into a flexclust object so that predict() can be used on new data
model_predict <- predict(kcca, data_predict) # Assign each held-out county to the nearest cluster centre
predict_result <- data.frame(model_predict, county_names_predict)
tail(predict_result)
## model_predict county_names_predict
## 773 1 Wirt County
## 774 1 Wood County
## 775 1 Wyoming County
## 776 1 Goshen County
## 777 1 Platte County
## 778 1 Uinta County
Hierarchical clustering
After applying the k-means and PAM algorithms, we may present another approach to grouping unlabelled observations: hierarchical clustering. This method builds a characteristic dendrogram representing a hierarchy of clusters, i.e. the connections between them based on their similarity. As in the other methods, we have to choose a metric that defines how the distance between observations is calculated. The next step is to translate those distances between individual data points into distances between the sets they might be assigned to; in other words, we need a so-called linkage criterion. Of course, a few options are available to choose from here.
Hierarchical clustering can be performed in a few different ways. We will present the Agnes algorithm (AGglomerative NESting), which starts with as many clusters as there are observations in the dataset and then merges them based on their similarity; in effect, the number of clusters decreases as they absorb more observations. The complementary approach is applied in the second popular hierarchical clustering algorithm, Diana (DIvisive ANAlysis): initially, all observations are grouped in one cluster, which is then divided into smaller structures.
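Before committing to a linkage criterion, one can quickly compare the options; the following is only a sketch (not part of the original workflow) that scores each method by its agglomerative coefficient, where values closer to 1 indicate a stronger clustering structure. It uses the same 30-county subsample introduced just below.
# Sketch only: compare linkage criteria by the agglomerative coefficient
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(data_train[1:30, ], method = m)$ac)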
We decided to use only 30 observations for the hierarchical clustering analysis in order to keep the visualisation of the results readable (which would be difficult for the whole dataset).
data_hier <- data_train[1:30, ]
model3 <- agnes(data_hier, method = "ward")
pltree(model3, main = "Dendrogram", labels = county_names_train[1:30])
model3$ac
## [1] 0.7826415
We may now cut our tree into, let's say, two groups.
model3b <- cutree(model3, k = 2)
clustered_counties3 <- data.frame(model3b, county_names_train[1:30])
colnames(clustered_counties3) <- c("Cluster", "County name")
clustered_counties3
## Cluster County name
## 1 1 Douglas County
## 2 2 Rowan County
## 3 1 Cibola County
## 4 1 Kendall County
## 5 1 Providence County
## 6 2 Clarion County
## 7 2 Bristol city
## 8 2 Holmes County
## 9 1 Orange County
## 10 1 Chattahoochee County
## 11 2 Atchison County
## 12 1 Garfield County
## 13 2 Valley County
## 14 2 Stoddard County
## 15 2 Boone County
## 16 2 Schuylkill County
## 17 2 Hancock County
## 18 2 Union County
## 19 2 Putnam County
## 20 2 Limestone County
## 21 2 Warrick County
## 22 2 Indian River County
## 23 1 Monterey County
## 24 2 Swisher County
## 25 2 Red River County
## 26 2 George County
## 27 1 Norfolk city
## 28 1 St. Clair County
## 29 1 Calvert County
## 30 2 Bracken County
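For comparison, a minimal sketch of the divisive counterpart (Diana) on the same 30 counties might look as follows; it is not part of the analysis above, and the divisive coefficient plays a role analogous to the agglomerative coefficient reported for Agnes.
# Sketch only: the divisive approach (Diana) on the same 30-county subsample
model4 <- diana(data_hier)
pltree(model4, main = "Dendrogram (Diana)", labels = county_names_train[1:30])
model4$dc # divisive coefficient, analogous to the agglomerative coefficient above
cutree(model4, k = 2) # the same two-group cut applied to the Diana tree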
Summary
A few examples from the wide range of clustering methods have been presented using a dataset of 2016 United States election results. We applied k-means, PAM, and hierarchical clustering (in the form of the Agnes algorithm). While the results were not remarkable, so to say, the way they were obtained made it possible to review a variety of measures and quantities important in clustering analysis.