Analysis of the 2016 U.S. election results using clustering
Introduction
According to the popular phrase, "knowledge is power". The truth of that simple statement is easy to verify in business: a better understanding of the market, e.g. knowing the needs of customers or anticipating the decisions of other firms, may lead to higher profit. The same holds in politics, where knowledge about the future moves of competitors may turn out to be a real game-changer, or at least ensure a significant advantage. The temptation to gain such knowledge is sometimes so strong that it leads to morally and legally questionable actions. A notorious example is the Watergate scandal, in which the administration of U.S. President Richard Nixon, as well as his campaign committee, was involved in the break-in at its political opponent's headquarters.
Unlike the politicians involved in the Watergate scandal, we propose a perfectly legal and acceptable source of information: investigating the socio-economic factors that shape voters' political decisions. One should not expect revelations as dramatic as those obtained with a wiretapping device, yet the insights are still valuable, not only for gaining a slight advantage over opponents but for purely scientific or non-political purposes as well. Whatever the motivation, we present below the application of a few clustering methods to extract information from the 2016 U.S. election results by county, enriched with facts about the demography and economic well-being of those local government units.
library("factoextra")
library("ClusterR")
library("cluster")
library("ggplot2")
library("fossil")
library("flexclust")
library("rsample")Dataset details
In November 2016, the presidential election was held in the United States. The Republican candidate Donald Trump defeated the Democrat Hillary Clinton and became the 45th president. Under the following link, one may find the number of votes per candidate grouped by county. In addition, the dataset contains several dozen columns with information about demographics and some economic indicators; they are presented below. The dataset also stores the results of the 2012 U.S. election (Mitt Romney vs Barack Obama), but for the majority of this work those will be skipped, as we focus mainly on the 2016 election.
data_raw <- read.csv("votes.csv", sep = ",")
colnames(data_raw)
## [1] "X.1" "X" "combined_fips"
## [4] "votes_dem_2016" "votes_gop_2016" "total_votes_2016"
## [7] "Clinton" "Trump" "diff_2016"
## [10] "per_point_diff_2016" "state_abbr" "county_name"
## [13] "FIPS" "total_votes_2012" "votes_dem_2012"
## [16] "votes_gop_2012" "county_fips" "state_fips"
## [19] "Obama" "Romney" "diff_2012"
## [22] "per_point_diff_2012" "fips" "area_name"
## [25] "state_abbreviation" "population2014" "population2010"
## [28] "population_change" "POP010210" "AGE135214"
## [31] "AGE295214" "age65plus" "SEX255214"
## [34] "White" "Black" "RHI325214"
## [37] "RHI425214" "RHI525214" "RHI625214"
## [40] "Hispanic" "RHI825214" "POP715213"
## [43] "POP645213" "NonEnglish" "Edu_highschool"
## [46] "Edu_batchelors" "VET605213" "LFE305213"
## [49] "HSG010214" "HSG445213" "HSG096213"
## [52] "HSG495213" "HSD410213" "HSD310213"
## [55] "Income" "INC110213" "Poverty"
## [58] "BZA010213" "BZA110213" "BZA115213"
## [61] "NES010213" "SBO001207" "SBO315207"
## [64] "SBO115207" "SBO215207" "SBO515207"
## [67] "SBO415207" "SBO015207" "MAN450207"
## [70] "WTN220207" "RTN130207" "RTN131207"
## [73] "AFN120207" "BPS030214" "LND110210"
## [76] "Density" "Clinton_Obama" "Trump_Romney"
## [79] "Trump_Prediction" "Clinton_Prediction" "Trump_Deviation"
## [82] "Clinton_Deviation"
head(data_raw)
## X.1 X combined_fips votes_dem_2016 votes_gop_2016 total_votes_2016
## 1 30 29 1001 5908 18110 24661
## 2 31 30 1003 18409 72780 94090
## 3 32 31 1005 4848 5431 10390
## 4 33 32 1007 1874 6733 8748
## 5 34 33 1009 2150 22808 25384
## 6 35 34 1011 3530 1139 4701
## Clinton Trump diff_2016 per_point_diff_2016 state_abbr county_name
## 1 0.23956855 0.7343579 12202 -0.49478934 AL Autauga County
## 2 0.19565310 0.7735147 54371 -0.57786162 AL Baldwin County
## 3 0.46660250 0.5227141 583 -0.05611165 AL Barbour County
## 4 0.21422039 0.7696616 4859 -0.55544124 AL Bibb County
## 5 0.08469902 0.8985188 20658 -0.81381973 AL Blount County
## 6 0.75090406 0.2422889 2391 0.50861519 AL Bullock County
## FIPS total_votes_2012 votes_dem_2012 votes_gop_2012 county_fips state_fips
## 1 1001 23909 6354 17366 1001 1
## 2 1003 84988 18329 65772 1003 1
## 3 1005 11459 5873 5539 1005 1
## 4 1007 8391 2200 6131 1007 1
## 5 1009 23980 2961 20741 1009 1
## 6 1011 5318 4058 1250 1011 1
## Obama Romney diff_2012 per_point_diff_2012 fips area_name
## 1 0.2657577 0.7263374 11012 -0.4605797 1001 Autauga County
## 2 0.2156657 0.7738975 47443 -0.5582318 1003 Baldwin County
## 3 0.5125229 0.4833755 334 0.0291474 1005 Barbour County
## 4 0.2621857 0.7306638 3931 -0.4684781 1007 Bibb County
## 5 0.1234779 0.8649291 17780 -0.7414512 1009 Blount County
## 6 0.7630688 0.2350508 2808 0.5280181 1011 Bullock County
## state_abbreviation population2014 population2010 population_change POP010210
## 1 AL 55395 54571 1.5 54571
## 2 AL 200111 182265 9.8 182265
## 3 AL 26887 27457 -2.1 27457
## 4 AL 22506 22919 -1.8 22915
## 5 AL 57719 57322 0.7 57322
## 6 AL 10764 10915 -1.4 10914
## AGE135214 AGE295214 age65plus SEX255214 White Black RHI325214 RHI425214
## 1 6.0 25.2 13.8 51.4 0.779 0.187 0.5 1.1
## 2 5.6 22.2 18.7 51.2 0.871 0.096 0.7 0.9
## 3 5.7 21.2 16.5 46.6 0.502 0.476 0.6 0.5
## 4 5.3 21.0 14.8 45.9 0.763 0.221 0.4 0.2
## 5 6.1 23.6 17.0 50.5 0.960 0.018 0.6 0.3
## 6 6.3 21.4 14.9 45.3 0.269 0.701 0.8 0.3
## RHI525214 RHI625214 Hispanic RHI825214 POP715213 POP645213 NonEnglish
## 1 0.1 1.8 0.027 75.6 85.0 1.6 3.5
## 2 0.1 1.6 0.046 83.0 82.1 3.6 5.5
## 3 0.2 0.9 0.045 46.6 84.8 2.9 5.0
## 4 0.1 0.9 0.021 74.5 86.6 1.2 2.1
## 5 0.1 1.2 0.087 87.8 88.7 4.3 7.3
## 6 0.7 1.1 0.075 22.1 84.7 5.4 5.2
## Edu_highschool Edu_batchelors VET605213 LFE305213 HSG010214 HSG445213
## 1 85.6 20.9 5922 26.2 22751 76.8
## 2 89.1 27.7 19346 25.9 107374 72.6
## 3 73.7 13.4 2120 24.6 11799 67.7
## 4 77.5 12.1 1327 27.6 8978 79.0
## 5 77.0 12.1 4540 33.9 23826 81.0
## 6 67.8 12.5 636 26.9 4461 74.3
## HSG096213 HSG495213 HSD410213 HSD310213 Income INC110213 Poverty BZA010213
## 1 8.3 136200 20071 2.71 24571 53682 12.1 817
## 2 24.4 168600 73283 2.52 26766 50221 13.9 4871
## 3 10.6 89200 9200 2.66 16829 32911 26.7 464
## 4 7.3 90500 7091 3.03 17427 36447 18.1 275
## 5 4.5 117100 21108 2.70 20730 44145 15.8 660
## 6 8.7 70600 3741 2.73 18628 32033 21.6 112
## BZA110213 BZA115213 NES010213 SBO001207 SBO315207 SBO115207 SBO215207
## 1 10120 2.1 2947 4067 15.2 0.0 1.3
## 2 54988 3.7 16508 19035 2.7 0.4 1.0
## 3 6611 -5.6 1546 1667 0.0 0.0 0.0
## 4 3145 7.5 1126 1385 14.9 0.0 0.0
## 5 6798 3.4 3563 4458 0.0 0.0 0.0
## 6 0 0.0 470 417 0.0 0.0 0.0
## SBO515207 SBO415207 SBO015207 MAN450207 WTN220207 RTN130207 RTN131207
## 1 0 0.7 31.7 0 0 598175 12003
## 2 0 1.3 27.3 1410273 0 2966489 17166
## 3 0 0.0 27.0 0 0 188337 6334
## 4 0 0.0 0.0 0 0 124707 5804
## 5 0 0.0 23.2 341544 0 319700 5622
## 6 0 0.0 38.8 0 0 43810 3995
## AFN120207 BPS030214 LND110210 Density Clinton_Obama Trump_Romney
## 1 88157 131 594.44 91.8 -0.02618911 0.0080205305
## 2 436955 1384 1589.78 114.6 -0.02001264 -0.0003827715
## 3 0 8 884.88 31.0 -0.04592041 0.0393386355
## 4 10757 19 622.58 36.8 -0.04796528 0.0389978305
## 5 20941 3 644.78 88.9 -0.03877888 0.0335896444
## 6 3670 1 622.81 17.5 -0.01216476 0.0072381037
## Trump_Prediction Clinton_Prediction Trump_Deviation Clinton_Deviation
## 1 0.6208587 0.3404929 -0.113499208 0.10092439
## 2 0.5867486 0.3595024 -0.186766123 0.16384929
## 3 0.5178320 0.4746928 -0.004882142 0.00809032
## 4 0.6922274 0.2860314 -0.077434258 0.07181099
## 5 0.7896486 0.1774903 -0.108870165 0.09279126
## 6 0.4452689 0.5722160 0.202980053 -0.17868806
sum(is.na(data_raw)) # Checking whether some observations are missing
## [1] 0
dim(data_raw)
## [1] 3112 82
One may notice that many variables have rather cryptic names that are not self-explanatory at all. It is therefore very useful to have an additional dictionary containing their definitions. What is more, the whole dataset is organised in an untidy way. Some columns containing identical data appear more than once under different names, and the column names follow two mutually exclusive formats: in some cases abbreviations are used, while in others combinations of letters and numbers are applied. This has to be made uniform. The dataset also contains many redundant columns. For instance, there is no need to keep separate columns with the numbers of votes for the Republican and the Democratic candidates, because the total number of votes and the support rate of one candidate already provide all the necessary information.
Data preparation
names <- read.csv("county_facts_dictionary.csv", sep = ",")
names
## column_name
## 1 PST045214
## 2 PST040210
## 3 PST120214
## 4 POP010210
## 5 AGE135214
## 6 AGE295214
## 7 AGE775214
## 8 SEX255214
## 9 RHI125214
## 10 RHI225214
## 11 RHI325214
## 12 RHI425214
## 13 RHI525214
## 14 RHI625214
## 15 RHI725214
## 16 RHI825214
## 17 POP715213
## 18 POP645213
## 19 POP815213
## 20 EDU635213
## 21 EDU685213
## 22 VET605213
## 23 LFE305213
## 24 HSG010214
## 25 HSG445213
## 26 HSG096213
## 27 HSG495213
## 28 HSD410213
## 29 HSD310213
## 30 INC910213
## 31 INC110213
## 32 PVY020213
## 33 BZA010213
## 34 BZA110213
## 35 BZA115213
## 36 NES010213
## 37 SBO001207
## 38 SBO315207
## 39 SBO115207
## 40 SBO215207
## 41 SBO515207
## 42 SBO415207
## 43 SBO015207
## 44 MAN450207
## 45 WTN220207
## 46 RTN130207
## 47 RTN131207
## 48 AFN120207
## 49 BPS030214
## 50 LND110210
## 51 POP060210
## description
## 1 Population, 2014 estimate
## 2 Population, 2010 (April 1) estimates base
## 3 Population, percent change - April 1, 2010 to July 1, 2014
## 4 Population, 2010
## 5 Persons under 5 years, percent, 2014
## 6 Persons under 18 years, percent, 2014
## 7 Persons 65 years and over, percent, 2014
## 8 Female persons, percent, 2014
## 9 White alone, percent, 2014
## 10 Black or African American alone, percent, 2014
## 11 American Indian and Alaska Native alone, percent, 2014
## 12 Asian alone, percent, 2014
## 13 Native Hawaiian and Other Pacific Islander alone, percent, 2014
## 14 Two or More Races, percent, 2014
## 15 Hispanic or Latino, percent, 2014
## 16 White alone, not Hispanic or Latino, percent, 2014
## 17 Living in same house 1 year & over, percent, 2009-2013
## 18 Foreign born persons, percent, 2009-2013
## 19 Language other than English spoken at home, pct age 5+, 2009-2013
## 20 High school graduate or higher, percent of persons age 25+, 2009-2013
## 21 Bachelor's degree or higher, percent of persons age 25+, 2009-2013
## 22 Veterans, 2009-2013
## 23 Mean travel time to work (minutes), workers age 16+, 2009-2013
## 24 Housing units, 2014
## 25 Homeownership rate, 2009-2013
## 26 Housing units in multi-unit structures, percent, 2009-2013
## 27 Median value of owner-occupied housing units, 2009-2013
## 28 Households, 2009-2013
## 29 Persons per household, 2009-2013
## 30 Per capita money income in past 12 months (2013 dollars), 2009-2013
## 31 Median household income, 2009-2013
## 32 Persons below poverty level, percent, 2009-2013
## 33 Private nonfarm establishments, 2013
## 34 Private nonfarm employment, 2013
## 35 Private nonfarm employment, percent change, 2012-2013
## 36 Nonemployer establishments, 2013
## 37 Total number of firms, 2007
## 38 Black-owned firms, percent, 2007
## 39 American Indian- and Alaska Native-owned firms, percent, 2007
## 40 Asian-owned firms, percent, 2007
## 41 Native Hawaiian- and Other Pacific Islander-owned firms, percent, 2007
## 42 Hispanic-owned firms, percent, 2007
## 43 Women-owned firms, percent, 2007
## 44 Manufacturers shipments, 2007 ($1,000)
## 45 Merchant wholesaler sales, 2007 ($1,000)
## 46 Retail sales, 2007 ($1,000)
## 47 Retail sales per capita, 2007
## 48 Accommodation and food services sales, 2007 ($1,000)
## 49 Building permits, 2014
## 50 Land area in square miles, 2010
## 51 Population per square mile, 2010
Firstly, we need to remove redundant columns.
data_raw[, c('X.1', 'X', 'combined_fips', 'state_abbr', 'FIPS', 'county_fips', 'state_fips', 'fips', 'area_name', 'state_abbreviation', 'population2010', 'population_change', 'POP010210')] <- list(NULL)
Secondly, let us focus on unifying the format of the column names. Using the dictionary displayed above, we may rename certain columns. We decided to use whole words, or their abbreviations, as the default format.
names_to_change <- c('AGE135214', 'AGE295214', 'SEX255214', 'RHI325214', 'RHI425214',
'RHI525214', 'RHI625214', 'RHI825214', 'POP715213', 'POP645213',
'VET605213', 'LFE305213', 'HSG010214', 'HSG445213', 'HSG096213', 'HSG495213', 'HSD410213', 'HSD310213', 'INC110213', 'BZA010213', 'BZA110213', 'BZA115213', 'NES010213', 'SBO001207', 'SBO315207', 'SBO115207', 'SBO215207', 'SBO515207', 'SBO415207', 'SBO015207', 'MAN450207', 'WTN220207', 'RTN130207', 'RTN131207', 'AFN120207', 'BPS030214', 'LND110210')
colnames(data_raw)[colnames(data_raw) %in% names_to_change] <- c("age_under5", "age_under18", "%female", "%Naitive", "%Asian", "%Hawaiian", "%Mixed", "%White", "living_same_house", "foreign", "%veterans", "timetravel_to_work", "housing_units", "homeownership", "housing_units_multi-unit", "owner-occupied_housing", "households", "persons_per_household", "median_income", "nonfarm_establishments", "nonfarm_employment", "nonfarm_employment_change", "nonemployer_establishments", "number_firms", "black-owned_firms", "native-owned_firms", "asian-owned_firms", "hawaiian-owned_firms", "hispanic-owned_firms", "women-owned_firms", "manufacturers_shipments", "merchant_sales", "retail_sales", "retail_per_capita", "accommodation_food_services", "building_permits", "land_area")
colnames(data_raw)
## [1] "votes_dem_2016" "votes_gop_2016"
## [3] "total_votes_2016" "Clinton"
## [5] "Trump" "diff_2016"
## [7] "per_point_diff_2016" "county_name"
## [9] "total_votes_2012" "votes_dem_2012"
## [11] "votes_gop_2012" "Obama"
## [13] "Romney" "diff_2012"
## [15] "per_point_diff_2012" "population2014"
## [17] "age_under5" "age_under18"
## [19] "age65plus" "%female"
## [21] "White" "Black"
## [23] "%Naitive" "%Asian"
## [25] "%Hawaiian" "%Mixed"
## [27] "Hispanic" "%White"
## [29] "living_same_house" "foreign"
## [31] "NonEnglish" "Edu_highschool"
## [33] "Edu_batchelors" "%veterans"
## [35] "timetravel_to_work" "housing_units"
## [37] "homeownership" "housing_units_multi-unit"
## [39] "owner-occupied_housing" "households"
## [41] "persons_per_household" "Income"
## [43] "median_income" "Poverty"
## [45] "nonfarm_establishments" "nonfarm_employment"
## [47] "nonfarm_employment_change" "nonemployer_establishments"
## [49] "number_firms" "black-owned_firms"
## [51] "native-owned_firms" "asian-owned_firms"
## [53] "hawaiian-owned_firms" "hispanic-owned_firms"
## [55] "women-owned_firms" "manufacturers_shipments"
## [57] "merchant_sales" "retail_sales"
## [59] "retail_per_capita" "accommodation_food_services"
## [61] "building_permits" "land_area"
## [63] "Density" "Clinton_Obama"
## [65] "Trump_Romney" "Trump_Prediction"
## [67] "Clinton_Prediction" "Trump_Deviation"
## [69] "Clinton_Deviation"
We stated above that for the purpose of this analysis the results of the 2012 election will be neglected. The following figure provides the motivation for that choice. The majority of counties are fairly stable in their political affiliation over time: those that voted for Barack Obama in 2012 supported Hillary Clinton in 2016 as well, and the correlation between the results is high. Therefore, we may limit our considerations to the 2016 election and remove the data describing the 2012 confrontation between Mitt Romney and Barack Obama. What is more, the 2016 election seems more interesting simply because it is more up-to-date.
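As a quick sanity check (a sketch, not part of the original workflow), the claimed correlation can be quantified directly on the two vote-share columns:
# Correlation between the Democratic vote shares in 2016 and 2012 across counties
cor(data_raw$Clinton, data_raw$Obama)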
data <- data_raw
ggplot(data) + geom_point(aes(x = Clinton*100, y = Obama*100)) + labs(title = "Votes for democratic candidates in 2012 and 2016", x = "Percentage of votes for H. Clinton [%]", y = "Percentage of votes for B. Obama [%]")
We may set aside some observations, 25% of their total number, for prediction purposes. In other words, we need to split the dataset into two parts. The first one contains 75% of the observations and will be used to build the models. The second one consists of the remaining observations; its purpose is to demonstrate the models' ability to predict which cluster a given observation should be assigned to.
set.seed(123456)
data_split <- initial_split(data, prop = .75)
data1 <- training(data_split)
data2 <- testing(data_split)
head(data1)
## votes_dem_2016 votes_gop_2016 total_votes_2016 Clinton Trump
## 234 67045 100178 182918 0.3665304 0.5476662
## 1066 3295 5174 8829 0.3732019 0.5860233
## 1905 3721 3179 7975 0.4665831 0.3986207
## 711 23392 24152 50815 0.4603365 0.4752927
## 2285 141597 90321 241818 0.5855519 0.3735082
## 2230 4256 12545 17481 0.2434643 0.7176363
## diff_2016 per_point_diff_2016 county_name total_votes_2012
## 234 33133 -0.18113581 Douglas County 150160
## 1066 1879 -0.21282138 Rowan County 7641
## 1905 542 0.06796238 Cibola County 8154
## 711 760 -0.01495621 Kendall County 45042
## 2285 51276 0.21204377 Providence County 234092
## 2230 8289 -0.47417196 Clarion County 15227
## votes_dem_2012 votes_gop_2012 Obama Romney diff_2012
## 234 54093 93930 0.3602357 0.6255328 39837
## 1066 3438 4035 0.4499411 0.5280722 597
## 1905 4907 2966 0.6017905 0.3637479 1941
## 711 21219 23076 0.4710936 0.5123218 1857
## 2285 156022 74344 0.6664986 0.3175845 81678
## 2230 4731 10228 0.3106981 0.6717016 5497
## per_point_diff_2012 population2014 age_under5 age_under18 age65plus
## 234 -0.26529702 314638 6.3 28.3 9.9
## 1066 -0.07813113 23655 5.6 19.3 13.4
## 1905 0.23804268 27349 7.4 24.5 14.2
## 711 -0.04122819 121350 7.3 29.8 9.0
## 2285 0.34891410 631974 5.8 21.1 14.2
## 2230 -0.36100348 38821 5.0 18.9 18.0
## %female White Black %Naitive %Asian %Hawaiian %Mixed Hispanic %White
## 234 50.3 0.913 0.014 0.5 4.3 0.1 2.4 0.083 83.9
## 1066 51.3 0.960 0.019 0.2 0.8 0.0 1.2 0.015 94.6
## 1905 48.9 0.540 0.014 41.7 0.6 0.1 2.1 0.378 20.8
## 711 50.6 0.876 0.067 0.5 3.3 0.0 1.9 0.169 71.9
## 2285 51.5 0.797 0.114 1.2 4.4 0.2 3.0 0.209 64.1
## 2230 51.2 0.970 0.011 0.2 0.7 0.0 0.9 0.007 96.4
## living_same_house foreign NonEnglish Edu_highschool Edu_batchelors
## 234 85.4 6.4 8.8 97.5 55.8
## 1066 79.5 1.0 2.1 77.1 23.8
## 1905 88.2 5.2 44.3 79.9 10.4
## 711 88.9 9.3 17.3 92.5 34.2
## 2285 85.7 17.9 29.7 80.7 26.4
## 2230 84.3 1.0 3.3 87.9 18.5
## %veterans timetravel_to_work housing_units homeownership
## 234 19635 27.7 113777 80.9
## 1066 1231 21.0 10126 65.2
## 1905 1858 22.8 11086 72.4
## 711 5821 33.3 41183 83.6
## 2285 35075 23.1 263473 54.3
## 2230 3011 22.8 19879 70.9
## housing_units_multi-unit owner-occupied_housing households
## 234 15.8 335600 103780
## 1066 11.9 96700 8305
## 1905 7.9 83800 8016
## 711 9.0 210800 38075
## 2285 51.7 224100 237800
## 2230 14.0 106200 15776
## persons_per_household Income median_income Poverty nonfarm_establishments
## 234 2.82 43634 101591 3.9 8040
## 1066 2.49 17094 35236 28.6 489
## 1905 3.15 16129 37237 27.6 329
## 711 3.05 31276 81765 4.8 2008
## 2285 2.52 26308 49297 17.1 15524
## 2230 2.40 21902 42389 18.3 937
## nonfarm_employment nonfarm_employment_change nonemployer_establishments
## 234 91934 5.2 28689
## 1066 7148 -4.4 1405
## 1905 5067 -2.2 1008
## 711 21251 0.0 7348
## 2285 254490 1.2 39945
## 2230 10918 0.5 2443
## number_firms black-owned_firms native-owned_firms asian-owned_firms
## 234 32992 0.0 0.0 2.4
## 1066 1823 0.0 0.0 0.0
## 1905 1513 0.0 14.4 1.7
## 711 8311 3.1 0.0 0.0
## 2285 51806 5.5 0.5 2.6
## 2230 3460 0.0 0.0 0.0
## hawaiian-owned_firms hispanic-owned_firms women-owned_firms
## 234 0 3.1 27.7
## 1066 0 0.0 18.0
## 1905 0 22.1 26.6
## 711 0 9.3 32.8
## 2285 0 9.7 28.1
## 2230 0 0.0 0.0
## manufacturers_shipments merchant_sales retail_sales retail_per_capita
## 234 1950842 0 3765450 13846
## 1066 259812 54111 272076 12039
## 1905 0 15859 217448 7992
## 711 685730 1299755 1064021 10957
## 2285 6422854 5856937 5924945 9430
## 2230 586557 287384 421097 10563
## accommodation_food_services building_permits land_area Density
## 234 435099 3230 840.25 339.7
## 1066 36058 14 279.80 83.4
## 1905 117318 0 4539.48 6.0
## 711 99906 283 320.34 358.2
## 2285 1065946 342 409.50 1530.3
## 2230 51159 422 600.83 66.6
## Clinton_Obama Trump_Romney Trump_Prediction Clinton_Prediction
## 234 0.006294609 -0.07786660 0.3939749 0.5275057
## 1066 -0.076739159 0.05795109 0.6673069 0.2917867
## 1905 -0.135207460 0.03487284 0.4844554 0.4468122
## 711 -0.010757131 -0.03702910 0.5272468 0.4092330
## 2285 -0.080946698 0.05592364 0.5341748 0.4206900
## 2230 -0.067233769 0.04593471 0.7052056 0.2407079
## Trump_Deviation Clinton_Deviation
## 234 -0.15369132 0.160975333
## 1066 0.08128355 -0.081415221
## 1905 0.08583473 -0.019770904
## 711 0.05195406 -0.051103467
## 2285 0.16066662 -0.164861976
## 2230 -0.01243071 -0.002756465
head(data2)
## votes_dem_2016 votes_gop_2016 total_votes_2016 Clinton Trump diff_2016
## 2 18409 72780 94090 0.1956531 0.7735147 54371
## 7 3716 4891 8685 0.4278641 0.5631549 1175
## 11 2909 15068 18255 0.1593536 0.8254177 12159
## 13 5712 7109 12936 0.4415584 0.5495516 1397
## 16 4194 15825 20513 0.2044557 0.7714620 11631
## 18 3069 3413 6543 0.4690509 0.5216262 344
## per_point_diff_2016 county_name total_votes_2012 votes_dem_2012
## 2 -0.57786162 Baldwin County 84988 18329
## 7 -0.13529073 Butler County 9483 4367
## 11 -0.66606409 Chilton County 17434 3391
## 13 -0.10799320 Clarke County 13827 6317
## 16 -0.56700629 Coffee County 19715 4899
## 18 -0.05257527 Conecuh County 7013 3551
## votes_gop_2012 Obama Romney diff_2012 per_point_diff_2012
## 2 65772 0.2156657 0.7738975 47443 -0.55823175
## 7 5081 0.4605083 0.5358009 714 -0.07529263
## 11 13910 0.1945050 0.7978662 10519 -0.60336125
## 13 7463 0.4568598 0.5397411 1146 -0.08288132
## 16 14638 0.2484910 0.7424803 9739 -0.49398935
## 18 3434 0.5063454 0.4896621 117 0.01668330
## population2014 age_under5 age_under18 age65plus %female White Black %Naitive
## 2 200111 5.6 22.2 18.7 51.2 0.871 0.096 0.7
## 7 20296 6.1 23.6 18.0 53.6 0.539 0.440 0.4
## 11 43931 6.4 24.2 15.2 50.8 0.871 0.106 0.5
## 13 24945 5.6 22.6 18.0 52.8 0.540 0.442 0.4
## 16 50909 6.1 23.7 15.9 50.5 0.764 0.177 1.4
## 18 12670 5.5 21.4 20.4 51.9 0.521 0.462 0.4
## %Asian %Hawaiian %Mixed Hispanic %White living_same_house foreign NonEnglish
## 2 0.9 0.1 1.6 0.046 83.0 82.1 3.6 5.5
## 7 0.9 0.0 0.8 0.012 53.1 94.6 0.8 1.7
## 11 0.4 0.2 1.2 0.077 80.3 88.8 5.2 7.4
## 13 0.5 0.0 0.8 0.013 53.1 92.5 0.3 1.3
## 16 1.6 0.4 2.5 0.064 71.3 83.3 5.1 7.5
## 18 0.1 0.0 1.2 0.018 51.1 96.9 0.6 2.2
## Edu_highschool Edu_batchelors %veterans timetravel_to_work housing_units
## 2 89.1 27.7 19346 25.9 107374
## 7 76.3 14.0 1497 24.0 9916
## 11 76.0 12.9 3308 31.8 19221
## 13 78.3 11.2 1583 25.5 12583
## 16 82.4 22.5 6066 21.3 22852
## 18 77.3 8.8 1101 27.2 7055
## homeownership housing_units_multi-unit owner-occupied_housing households
## 2 72.6 24.4 168600 73283
## 7 70.3 13.3 74700 8235
## 11 74.5 4.1 102600 16232
## 13 72.9 6.8 85600 9631
## 16 69.9 13.5 131000 18820
## 18 79.1 7.2 69600 4994
## persons_per_household Income median_income Poverty nonfarm_establishments
## 2 2.52 26766 50221 13.9 4871
## 7 2.47 17403 29918 28.4 393
## 11 2.67 20701 41250 19.5 703
## 13 2.63 18979 29357 29.3 586
## 16 2.65 23380 43768 18.6 954
## 18 2.61 15605 24658 32.7 194
## nonfarm_employment nonfarm_employment_change nonemployer_establishments
## 2 54988 3.7 16508
## 7 5711 2.7 1095
## 11 7396 8.8 2719
## 13 6648 3.7 1640
## 16 12762 0.3 2602
## 18 2197 -1.5 756
## number_firms black-owned_firms native-owned_firms asian-owned_firms
## 2 19035 2.7 0.4 1.0
## 7 1769 0.0 0.0 3.3
## 11 0 0.0 0.0 0.0
## 13 2374 0.0 0.0 0.0
## 16 4032 6.0 0.0 1.3
## 18 980 0.0 0.0 0.0
## hawaiian-owned_firms hispanic-owned_firms women-owned_firms
## 2 0 1.3 27.3
## 7 0 0.0 0.0
## 11 0 0.0 0.0
## 13 0 0.0 28.6
## 16 0 0.7 29.3
## 18 0 0.0 23.0
## manufacturers_shipments merchant_sales retail_sales retail_per_capita
## 2 1410273 0 2966489 17166
## 7 399132 56712 229277 11326
## 11 0 155139 359910 8496
## 13 571454 85803 344311 13034
## 16 613758 131594 639623 13665
## 18 183544 76736 71232 5430
## accommodation_food_services building_permits land_area Density Clinton_Obama
## 2 436955 1384 1589.78 114.6 -0.02001264
## 7 28427 2 776.83 27.0 -0.03264414
## 11 34073 78 692.85 63.0 -0.03515139
## 13 23596 11 1238.47 20.9 -0.01530133
## 16 50080 58 678.97 73.6 -0.04403529
## 18 8667 3 850.16 15.6 -0.03729446
## Trump_Romney Trump_Prediction Clinton_Prediction Trump_Deviation
## 2 -0.0003827715 0.5867486 0.3595024 -0.186766123
## 7 0.0273539578 0.5224916 0.4633133 -0.040663239
## 11 0.0275514554 0.7283339 0.2433266 -0.097083797
## 13 0.0098105526 0.5289034 0.4534988 -0.020648222
## 16 0.0289816548 0.5804720 0.3800510 -0.190990039
## 18 0.0319641092 0.5249156 0.4589741 0.003289476
## Clinton_Deviation
## 2 0.16384929
## 7 0.03544912
## 11 0.08397298
## 13 0.01194032
## 16 0.17559525
## 18 -0.01007684
We have to normalise the dataset to ensure that no single variable dominates the others, i.e. that its contribution to the cost function minimised by the clustering algorithms does not dwarf the remaining contributions. Otherwise the data would be clustered incorrectly, because the centroid/medoid locations would be appropriate in only some of the dimensions.
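The standardisation applied below is the usual z-score transformation, which is what scale() computes by default:

$$ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, $$

where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$.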
data_train <- as.data.frame( lapply(data1[, -8], scale) ) # We need to exclude the categorical variable, i.e. name of the county
data_predict <- as.data.frame( lapply(data2[, -8], scale) )
county_names_train <- data1[, 8] # We will later use it while reviewing clustering results
county_names_predict <- data2[, 8]
Analysis
Since the dataset has been prepared, we may start building our clustering models. Before that, however, we need to assess whether it is rational to apply clustering algorithms at all. In other words, we want to check whether our data are clusterable. The Hopkins statistic and the ordered dissimilarity matrix (ODM) are the tools we need.
Assessing clustering tendency
access <- get_clust_tendency(data[, -8], n = nrow(data_raw)-1, gradient = list(low="red", mid="white", high="blue") )
access$hopkins_stat
## [1] 0.993888
access$plot
The Hopkins statistic compares nearest-neighbour distances in the actual data with the analogous distances obtained for points generated from a uniform distribution over the same space. In that sense it is a ratio expressing how far the dataset departs from uniformly scattered points.
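Roughly, in the convention used by get_clust_tendency (where values near 1 indicate a clusterable dataset), the statistic can be sketched as

$$ H = \frac{\sum_{i=1}^{m} u_i}{\sum_{i=1}^{m} u_i + \sum_{i=1}^{m} w_i}, $$

where $u_i$ is the distance from the $i$-th artificially generated (uniform) point to its nearest real observation and $w_i$ is the distance from the $i$-th sampled real observation to its nearest neighbour. For clustered data the $w_i$ are small relative to the $u_i$, so $H$ approaches 1; a value around 0.5 suggests no clustering structure.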
The obtained results suggest that we may proceed with clustering. The value of the Hopkins statistic is (while not exactly equal to one) very close to it, which in this setting means that we may reject the null hypothesis of there being no visible clusters in the dataset. A similar conclusion can be drawn from the ordered dissimilarity matrix, where the blue colour indicates higher dissimilarity in the data.
Optimal number of clusters
After checking that our work may lead to a satisfying outcome, we may continue preparations for applying the clustering algorithms. The next task is to find the optimal number of clusters. A few quantities might be used here; we have decided on two of them: the silhouette and the total within-cluster sum of squares (WSS). These are popular and widely used measures, and their advantage is that they are available for both the k-means and PAM methods, which will turn out to be important later.
Let's start with the silhouette. For each observation it is the difference between the average distance to the nearest neighbouring cluster and the average distance to the other objects in its own cluster, divided by the greater of those two values. The optimal situation is a silhouette equal to one; a value of minus one is the worst possible scenario.
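Written out, with $a(i)$ the mean distance from observation $i$ to the other members of its own cluster and $b(i)$ the mean distance to the members of the nearest other cluster:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \in [-1, 1]. $$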
fviz_nbclust(data[, -8], FUNcluster = kmeans, method = "silhouette")fviz_nbclust(data[, -8], FUNcluster = cluster::pam, method = "silhouette")Finally, we may compute the total within sum of squares. The third measure informs about the total variance within data explained thanks to applying clustering (as in regression models developed for econometrics analysis).
fviz_nbclust(data[, -8], FUNcluster = kmeans, method = "wss")Choosing an optimal number of clusters is always a demanding task. Various measures may lead to different conclusions. While for the PAM method three clusters seem to be a rational choice (similar value of silhouette as for two clusters and more obvious choice from wss perspective), the best number of clusters for k-means is uncertain. Therefore, it might be helpful to increase the number of indices the decision is based on and introduce two additional ones. For instance, we may use Akaike information criterion (AIC).
Optimal_Clusters_KMeans(data[, -8], max_clusters = 10, criterion = "AIC")
## [1] 3.262875e+17 1.797354e+17 8.636668e+16 5.083518e+16 4.031187e+16
## [6] 3.576490e+16 3.303212e+16 2.870489e+16 2.788313e+16 2.492855e+16
Let us choose two clusters as the optimal number for k-means.
The above analysis was carried out with what follows in mind: we now build two models, one using PAM with three clusters and a second one with two clusters found by the k-means algorithm. We will then compare their results by computing the Rand index.
PAM and k-means models
model1 <- eclust(data_train, FUNcluster = "kmeans", k = 2, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(model1, data_train, geom = "point", ellipse.type = "convex")
clustered_counties1 <- data.frame(model1$cluster, county_names_train)
colnames(clustered_counties1) <- c("Cluster", "County Name")
tail(clustered_counties1)
## Cluster County Name
## 2329 1 Power County
## 2330 1 Marion County
## 2331 1 Tazewell County
## 2332 1 Otero County
## 2333 1 Cottle County
## 2334 1 Sherburne County
sil1 <- silhouette(model1$cluster, dist(data_train) )
fviz_silhouette(sil1)
## cluster size ave.sil.width
## 1 1 2209 0.60
## 2 2 125 0.01
While the average silhouette for the first cluster is fairly satisfying, the second one is far worse. About half of the observations assigned to that cluster are closer to the centre of the other cluster (as may be seen in the cluster plot above). There is also a clear disproportion in the number of observations assigned to each cluster.
model2 <- eclust(data_train, FUNcluster = "pam", k = 3, hc_metric = "euclidean", graph = FALSE)
fviz_cluster(model2, data_train, geom = "point", ellipse.type = "convex")
sil2 <- silhouette(model2$cluster, dist(data_train) )
fviz_silhouette(sil2)
## cluster size ave.sil.width
## 1 1 365 -0.09
## 2 2 1449 0.29
## 3 3 520 0.00
clustered_counties2 <- data.frame(model2$cluster, county_names_train)
colnames(clustered_counties2) <- c("Cluster", "County Name")
tail(clustered_counties2)
## Cluster County Name
## 2329 2 Power County
## 2330 2 Marion County
## 2331 2 Tazewell County
## 2332 3 Otero County
## 2333 2 Cottle County
## 2334 2 Sherburne County
Well, the results are not perfect. However, as promised, we may compare the cluster allocations returned by both models using the Rand index. It is defined as the number of pairs of observations that are treated consistently by both partitions (i.e. placed in the same cluster in both, or in different clusters in both) divided by the total number of pairs. The Rand index can thus be interpreted as the relative frequency of agreement between the two allocations.
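In symbols, with $a$ the number of pairs grouped together by both partitions, $b$ the number of pairs separated by both, and $n$ observations:

$$ RI = \frac{a + b}{\binom{n}{2}} \in [0, 1]. $$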
rand.index(model1$cluster, model2$cluster) # Not using the adjusted version
## [1] 0.5386634
Prediction
The model with two clusters found using the k-means algorithm turned out to be better in terms of the average silhouette. Therefore, this approach seems more suitable for prediction purposes.
kcca <- as.kcca(model1, data_train) # Convert the k-means result into a flexclust object so that predict() can be used on new data
model_predict <- predict(kcca, data_predict) # Assign each held-out county to the nearest cluster centre
predict_result <- data.frame(model_predict, county_names_predict)
tail(predict_result)
## model_predict county_names_predict
## 773 1 Wirt County
## 774 1 Wood County
## 775 1 Wyoming County
## 776 1 Goshen County
## 777 1 Platte County
## 778 1 Uinta County
Hierarchical clustering
After applying the k-means and PAM algorithms, we may present another approach to grouping unlabelled observations: hierarchical clustering. This method builds a characteristic dendrogram representing a hierarchy of clusters, i.e. the connections between them based on their similarity. As in the other methods, we have to choose a metric that defines how the distance between observations is calculated. The next step is to translate those distances between individual data points into distances between the sets they might be assigned to; in other words, we need a so-called linkage criterion. Of course, a few options are available to choose from here.
Hierarchical clustering can be performed in a few different ways. We will present the Agnes algorithm (AGglomerative NESting), which starts with as many clusters as there are observations in the dataset and then merges them based on their similarity; in effect, the number of clusters decreases as they absorb more observations. The complementary approach is applied in the second popular hierarchical clustering algorithm, Diana (DIvisive ANAlysis): initially, all observations are grouped in one cluster, which is then divided into smaller structures.
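Before committing to a linkage criterion, one can quickly compare the options; the following is only a sketch (not part of the original workflow) that scores each method by its agglomerative coefficient, where values closer to 1 indicate a stronger clustering structure. It uses the same 30-county subsample introduced just below.
# Sketch only: compare linkage criteria by the agglomerative coefficient
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(data_train[1:30, ], method = m)$ac)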
We decided to use only 30 observations for the hierarchical clustering analysis in order to keep the visualisation of the results readable (which would be difficult for the whole dataset).
data_hier <- data_train[1:30, ]
model3 <- agnes(data_hier, method = "ward")
pltree(model3, main = "Dendrogram", labels = county_names_train[1:30])
model3$ac
## [1] 0.7826415
We may now cut our tree into, let's say, two groups.
model3b <- cutree(model3, k = 2)
clustered_counties3 <- data.frame(model3b, county_names_train[1:30])
colnames(clustered_counties3) <- c("Cluster", "County name")
clustered_counties3
## Cluster County name
## 1 1 Douglas County
## 2 2 Rowan County
## 3 1 Cibola County
## 4 1 Kendall County
## 5 1 Providence County
## 6 2 Clarion County
## 7 2 Bristol city
## 8 2 Holmes County
## 9 1 Orange County
## 10 1 Chattahoochee County
## 11 2 Atchison County
## 12 1 Garfield County
## 13 2 Valley County
## 14 2 Stoddard County
## 15 2 Boone County
## 16 2 Schuylkill County
## 17 2 Hancock County
## 18 2 Union County
## 19 2 Putnam County
## 20 2 Limestone County
## 21 2 Warrick County
## 22 2 Indian River County
## 23 1 Monterey County
## 24 2 Swisher County
## 25 2 Red River County
## 26 2 George County
## 27 1 Norfolk city
## 28 1 St. Clair County
## 29 1 Calvert County
## 30 2 Bracken County
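For comparison, a minimal sketch of the divisive counterpart (Diana) on the same 30 counties might look as follows; it is not part of the analysis above, and the divisive coefficient plays a role analogous to the agglomerative coefficient reported for Agnes.
# Sketch only: the divisive approach (Diana) on the same 30-county subsample
model4 <- diana(data_hier)
pltree(model4, main = "Dendrogram (Diana)", labels = county_names_train[1:30])
model4$dc # divisive coefficient, analogous to the agglomerative coefficient above
cutree(model4, k = 2) # the same two-group cut applied to the Diana tree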
Summary
A few examples from the wide range of clustering methods have been presented using a dataset of 2016 United States election results. We applied k-means, PAM, and hierarchical clustering (in the form of the Agnes algorithm). While the results were not remarkable, so to say, the way they were obtained made it possible to review a variety of measures and quantities important in clustering analysis.