The aim of this analysis is to try and predict populous tendencies in Europe.
The dataset has been acquired from kaggle https://www.kaggle.com/daliaresearch/trump-effect
The study has not been published on Kaggle because its engine was not capable of running this low intensity code from beginning to end uninterrupted.
In our possession are two sets of data, one for the US and one for EU.
The US data contains a question about the candidate elected in the Nov 2016 election whether that was Donald Trump, Hilary Clinton, another candidate or abstinence from voting.
In the EU dataset the subset of the GB Great Britain referred to as UK_set (with name variations) contains a question about the 2016 EU referendum whether an individual voted to leave, stay or abstain from voting.
The analysis is based on a number of assumption that individuals who voted for Donald Trump are similar or mimic those who voted for leaving EU.
The study is carried out in 6 steps overall with the first 4 steps pertaining to data homogenization and cleaning and the last two steps pertaining to the actual analysis and results.
Important Notes:
1- There are three major sets throughout the analysis. The UK_set, US_set and EU_set (containing EU countries except the UK).
2- all sets are subsetted so that only individuals who reported being residents are included in the study.
3- In the US set individuals who voted for Hilary are lumped with individuals who voted “Other candidate” for two reasons: a- we only care if the vote is for a populist or not.
b- this makes the number levels match the UK levels of the vote on the referendum. 4- Variable selection: At step 4 we perform variable selection where we remove variables that 1) contain many NA values, 2) have different levels between the US_set and the EU_set or 3) that we thought are redundant. This reduces the set from 68 variable to 44 and 48 (explained in item 5, next).
5- At step 4 we chose to divided the EU subest into two groups to be analyzed separately. The first containing only the 5 largest countries (in terms of number of observations). Those are ‘DE’,‘FR’,‘ES’,‘PL’,‘IT’. The second set contains all EU countries with more than 100 observations in the survey. The analysis is carried out ONLY for the 5 large countries however the datasets are processed and ready for anyone who wishes to carry out further analysis.
It is important to understand the difference the difference between the sets with suffix set_5c and suffix set_all. The variables (columns, or questions) between those two sets are different. sets US_set_5c, UK_set_5c, and EU_set_5c contain 48 variables and are to be studied together, whereas sets US_set_all, UK_set_all and EU_set_all contain 44 variables are are to be studied together.
The reason for this dichotomy is due the fact that many questions from the survey remain unanswered for many of the smaller countries and those are reported as NAs. This is not true for the larger countries which have answered those questions. Hence the 5 countries have 48 columns vs 44 with the larger EU_set_all.
6- In step 5 The analysis is carried out ONLY for the 5 large countries using “US_set_5c”, “UK_set_5c” and “EU_set_5c”. Multinomial regression is estimated for the US and UK set and the predictive power of each set is assessed internally. The US model based on election is used to predict UK results of referendum and vice versa. Afterwards both the UK and US model are used to predict the populous tendencies in the 5 countries ‘DE’,‘FR’,‘ES’,‘PL’,‘IT’. Details of model validation are included in the comments along with the code chunks.
In addition in this step we split the votes who abstained from voting into two groups to add a level of granularity that shows whether a no-vote has the tendency to vote populist or non-populist. The results are displayed in the same proportion tables.
7- In step 6 we carry out Classificataion tree models for the US set and the UK in an attempt to take a closer look at the factors that affected the vote selection.
Step 1: Data Homogenization
Summary: The code in this file will take the raw datasets of EU (90 variables) and US (68 variables) and do the following: 1- Remove the column name prefixes (X.meta.., X.dem.. X.question.. etc) 2- investigate the common variables between the EU_set and US_set. 3- modify any discrepancies in column names that refer to the same variable (only 1 case: ethnicity). 4- isolate -into a new dataset- the variables from EU dataset that are not common with the US (those are per country per party specific question). 5- remove from the EU_set the columns from step 4. 6- save the homogenized EU_set.csv, US_set.csv. 7- save UK_set.csv (a subset of EU_set) which contains a response and may be used for a prediction result 8- save the EU_set_uncommon.csv set (of step 4). 9- finally we subset the datasets by observations that claim citizenship or who can vote. Those are the observations that matter and they constitute the majority. 10- voter_US_set.csv, voter_UK_set.csv and voter_EU_set.csv are subsets of the US_set.csv, UK_set.csv and EU_set.csv where all observtions (individuals) are residents. Those are the people who voted or can vote.
—————————————————————————————————— The final EU_set is 67 columns wide.
The final US_set is 68 columns wide.
The US_set has one extra variable vote_for_in_us_election which serves as the response.
The UK_set has one extra variable vote_referrendum which is specific for the UK.
All other 67 columns are shared between the two sets.
——————————————————————————————————
data_Path<-'d:/uchicago/msca31008/09 Project/Kaggle/data_files'
save_Path<-'d:/uchicago/msca31008/09 Project/Kaggle'
EU_set<-read.csv(paste(data_Path, 'data_coded_e28.csv - data_coded_e28.csv',sep='/'), header=TRUE, sep=',')
US_set<-read.csv(paste(data_Path, 'data_coded_US.csv - data_coded_US.csv',sep='/'), header=TRUE, sep=',')
dir.create(paste(save_Path,'results',sep='/'))
## Warning in dir.create(paste(save_Path, "results", sep = "/")): 'd:\uchicago
## \msca31008\09 Project\Kaggle\results' already exists
result_Path<-paste(save_Path,'results',sep='/')
#Changing the name of the ethniticy variable in the `EU_set` from **X.dem..ethnic_background** to **X.dem..ethnicity**
names(EU_set)[which(names(EU_set)=='X.dem..ethnic_background')]<-'X.dem..ethnicity'
cbind(dim(EU_set), dim(US_set))
## [,1] [,2]
## [1,] 11283 1052
## [2,] 90 68
#name clean up: in order to remove the prefixes
trim_names<-function(dataset){
n<-dim(dataset)[2]
for(i in 1:n){
if (grepl('X.question..',names(dataset)[i])){
newname<-gsub('X.question..','',names(dataset)[i])
names(dataset)[i]<-newname
}
else if (grepl('X.dem..',names(dataset)[i])){
newname<-gsub('X.dem..','',names(dataset)[i])
names(dataset)[i]<-newname
}
else if (grepl('X.aud..',names(dataset)[i])){
newname<-gsub('X.aud..','',names(dataset)[i])
names(dataset)[i]<-newname
}
else if (grepl('X.meta..',names(dataset)[i])){
newname<-gsub('X.meta..','',names(dataset)[i])
names(dataset)[i]<-newname
}
}
return(dataset)
}
#Creating (extracting) UK set:
UK_set<-EU_set[EU_set$X.dem..country_code=='GB',]
#trimming all names:
EU_set<-trim_names(EU_set)
US_set<-trim_names(US_set)
UK_set<-trim_names(UK_set)
#Columns shared between the EU_set and US_set. 67 columns. Which is 1 column less than the entire US_set.
intersect(names(EU_set), names(US_set))
## [1] "uuid"
## [2] "weight"
## [3] "country_code"
## [4] "age"
## [5] "gender"
## [6] "education_level"
## [7] "degree_of_urbanisation"
## [8] "residency"
## [9] "household_size"
## [10] "immigration"
## [11] "origin"
## [12] "settlement_size"
## [13] "hometown"
## [14] "ethnicity"
## [15] "religion"
## [16] "employment_status"
## [17] "employment_status_in_education"
## [18] "work_type_routine"
## [19] "work_type_manual"
## [20] "lgbtq"
## [21] "income_net_monthly"
## [22] "disposable_income"
## [23] "household_finances_past12months"
## [24] "financial_security"
## [25] "change_household_finances_next12months"
## [26] "job_security"
## [27] "status_national_economy"
## [28] "change_economy_country_past12months"
## [29] "economy_country_next12months"
## [30] "social_networks_regularly_used"
## [31] "social_media_activity_rank"
## [32] "online_sharing_frequency"
## [33] "sharing_network_size"
## [34] "member_organization"
## [35] "organization_activities_timeperweek"
## [36] "media_tv_hours"
## [37] "media_radio_hours"
## [38] "media_print_hours"
## [39] "opinion_government"
## [40] "political_view"
## [41] "likelihood_to_demonstrate"
## [42] "frequency_of_voting"
## [43] "vote_next_national_election"
## [44] "important_issues_when_voting"
## [45] "ranking_importance_of_issues_when_voting"
## [46] "preferred_type_of_political_leader"
## [47] "frequent_sharing_of_politicalviews"
## [48] "independence_or_respect"
## [49] "obedience_or_selfreliance"
## [50] "consideration_or_good_behaviour"
## [51] "curiosity_or_good_manners"
## [52] "democracy_own_country_satisfaction"
## [53] "next_generation_opportunities"
## [54] "currentplace_change_past5years"
## [55] "hometown_change_past5years"
## [56] "country_direction_past5years"
## [57] "financial_situation_change_past5years"
## [58] "international_trade_gain_or_loss"
## [59] "country_comes_first"
## [60] "improving_life_by_hardwork"
## [61] "conspiracy"
## [62] "government_controlled_by_elite"
## [63] "trust_in_own_judgment"
## [64] "perceived_effect_of_diversity"
## [65] "gender_discrimination_importance"
## [66] "family_friends_highereducation"
## [67] "worldview"
#Columns in US_set not in EU_set. This it the regressor in the US_set.
setdiff(names(US_set),names(EU_set))
## [1] "vote_for_in_us_election"
#Columns in EU_set and not in US_set. Those need to be removed from the main EU_set.
setdiff(names(EU_set),names(US_set))
## [1] "opinion_eu" "vote_referendum"
## [3] "ranking_party_de" "voted_party_last_election_de"
## [5] "ranking_party_fr" "voted_party_last_election_fr"
## [7] "ranking_party_es" "voted_party_last_election_es"
## [9] "ranking_party_pl" "voted_party_last_election_pl"
## [11] "ranking_party_it" "voted_party_last_election_it"
## [13] "ranking_party_gb" "voted_party_last_election_gb"
## [15] "when_vote_decision_was_made" "vote_nextelection_de"
## [17] "vote_nextelection_fr" "vote_nextelection_es"
## [19] "vote_nextelection_pl" "vote_nextelection_it"
## [21] "vote_nextelection_gb" "certainty_party_to_vote"
## [23] "friends_family_same_vote"
#Columns in UK_set and not in US_set.
setdiff(names(UK_set),names(US_set))
## [1] "opinion_eu" "vote_referendum"
## [3] "ranking_party_de" "voted_party_last_election_de"
## [5] "ranking_party_fr" "voted_party_last_election_fr"
## [7] "ranking_party_es" "voted_party_last_election_es"
## [9] "ranking_party_pl" "voted_party_last_election_pl"
## [11] "ranking_party_it" "voted_party_last_election_it"
## [13] "ranking_party_gb" "voted_party_last_election_gb"
## [15] "when_vote_decision_was_made" "vote_nextelection_de"
## [17] "vote_nextelection_fr" "vote_nextelection_es"
## [19] "vote_nextelection_pl" "vote_nextelection_it"
## [21] "vote_nextelection_gb" "certainty_party_to_vote"
## [23] "friends_family_same_vote"
EU_set_uncommon<-data.frame(EU_set[,which(names(EU_set) %in% setdiff(names(EU_set),names(US_set)))])
EU_set_uncommon<-data.frame(cbind(uuid=EU_set$uuid,EU_set_uncommon))
EU_set_uncommon$vote_referendum<-NULL #this is part of the UK set and only applies to UK
#choosing
EU_set[,which(names(EU_set) %in% setdiff(names(EU_set),names(US_set)))]<-NULL
vote_referendum<-UK_set$vote_referendum
UK_set[,which(names(UK_set) %in% setdiff(names(UK_set),names(US_set)))]<-NULL
UK_set$vote_referendum<-vote_referendum
#remove UK observarions from EU dataset
EU_set<-EU_set[-which(EU_set$country_code=='GB'),]
EU_set$country_code<-factor(EU_set$country_code)
#saving data at this step
write.csv(EU_set, paste(save_Path,'EU_set.csv',sep='/'))
write.csv(US_set, paste(save_Path,'US_set.csv',sep='/'))
write.csv(UK_set, paste(save_Path,'UK_set.csv',sep='/'))
write.csv(EU_set_uncommon, paste(save_Path,'EU_set_uncommon.csv',sep='/'))
#subset the data with observations who claim to be citizens of their respective counties (i.e those who can vote).
voter_UK_set<-UK_set[UK_set$residency==1,]
voter_EU_set<-EU_set[EU_set$residency==1,]
voter_US_set<-US_set[US_set$residency==1,]
#remove the residence columns from the corresponding datasets
#remove country code from the UK and the US data.
voter_US_set$residency<-NULL
voter_US_set$country_code<-NULL
voter_UK_set$residency<-NULL
voter_UK_set$country_code<-NULL
voter_EU_set$residency<-NULL
#save the voter sets
write.csv(voter_EU_set, paste(save_Path,'voter_EU_set.csv',sep='/'))
write.csv(voter_US_set, paste(save_Path,'voter_US_set.csv',sep='/'))
write.csv(voter_UK_set, paste(save_Path,'voter_UK_set.csv',sep='/'))
End of Step 1
Step 2: Variable Modification
Summary:
-Many questions have answers that contains multiple classes such as social_networks_regularly_used. Those question have answers where individuals included multipls answers which appear as “3 | 4 | 12 | 9 | 5”.
-We proceed in this file by picking only the first value that appears in these observations.
-Starting at this point we are only concenred with the voter files voter_US_set.csv, voter_UK_set.csv and voter_EU_set.csv.
-At the end we prodice three files clean_US_set.csv, clean_UK_set.csv and clean_EU_set.csv.
library(stringr)
#The following function will return a vector of column indexes that contain values of the form "3 | 4 | 6 | 1 | 0".
find_cols<-function(dataset){
n_col<-dim(dataset)[2]
n_row<-dim(dataset)[1]
target_col<-c()
#finding the columns where the piper character appears
for (i in 1:n_col){
for (j in 1:n_row){
if (grepl('\\|',as.character(dataset[j,i]))!=TRUE){
}
else{
target_col<-c(target_col, i)
break}
}
}
return(target_col)
}
#This function will extract the first value from observations containing multiple levels or multiple ranks in the following form 3|4|6|1|0.
extract_obs<-function(my_col){
my_col<-as.character(my_col)
new_col<-c()
n<-length(my_col)
for (i in 1:n){
if (isTRUE(my_col[i]=="")){
my_col[i]<-NA}
else{
obs<-str_replace_all(my_col[i],' ','')
obs<-strsplit(obs,'\\|')
my_col[i]<-obs[[1]][1]
}
}
return(as.factor(my_col))
}
#This function returns a transformed dataset.
transform_set<-function(col_index,dataset){
n_col_index<-length(col_index)
for (i in 1:n_col_index){
dataset[,col_index[i]]<-extract_obs(dataset[,col_index[i]])
}
return(dataset)
}
#apply the find_cols() function
col_index_eu<-find_cols(voter_EU_set)
col_index_us<-find_cols(voter_US_set)
col_index_uk<-find_cols(voter_UK_set)
#applying the transform_set() function and producing new files.
clean_EU_set<-transform_set(col_index_eu, voter_EU_set)
clean_US_set<-transform_set(col_index_us, voter_US_set)
clean_UK_set<-transform_set(col_index_uk, voter_UK_set)
write.csv(clean_EU_set, paste(save_Path, 'clean_EU_set.csv', sep='/'))
write.csv(clean_UK_set, paste(save_Path, 'clean_UK_set.csv', sep='/'))
write.csv(clean_US_set, paste(save_Path, 'clean_US_set.csv', sep='/'))
End of Step 2
Step 3: Variable Selection
Summary:
-In this step we choose our variables that will undergo further analysis. It is important to understand what is done here to understand what countries the analysis and the results pertain to.
-We split the EU data clean_EU_set into two wide sets:
-the first set contins 5 countries with the largest number of observations ‘DE’,‘FR’,‘ES’,‘PL’,‘IT’.
-the second set contins countries that have > 100 observations registered.
chosen_count_all<-table(clean_EU_set$country_code)
#top 5 countries with the largest number of observations.
chosen_count_5c<-c('DE','FR','ES','PL','IT')
#countries with more than 100 observations in the dataset.
chosen_count_all<-chosen_count_all[chosen_count_all>=100]
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
EU_set_all<-filter(clean_EU_set,country_code %in% names(chosen_count_all))
EU_set_5c<-filter(clean_EU_set,country_code %in% chosen_count_5c)
vec_discard_all<-c('origin',
'employment_status_in_education',
'organization_activities_timeperweek',
'settlement_size',
'hometown',
'ethnicity',
'hometown_change_past5years',
'sharing_network_size',
'social_networks_regularly_used',
'social_media_activity_rank',
'online_sharing_frequency',
'media_print_hours',
'media_tv_hours',
'media_radio_hours',
'frequency_of_voting',
'frequent_sharing_of_politicalviews',
'ranking_importance_of_issues_when_voting',
'preferred_type_of_political_leader',
'member_organization',
'important_issues_when_voting',
'vote_next_national_election')
vec_discard_5c<-c('origin',
'employment_status_in_education',
'organization_activities_timeperweek',
'settlement_size',
'hometown',
'ethnicity',
'hometown_change_past5years',
'sharing_network_size',
#'social_networks_regularly_used',
'social_media_activity_rank',
'online_sharing_frequency',
'media_print_hours',
#'media_tv_hours',
'media_radio_hours',
'frequency_of_voting',
'frequent_sharing_of_politicalviews',
'ranking_importance_of_issues_when_voting',
#'preferred_type_of_political_leader',
'member_organization',
#'important_issues_when_voting'
'vote_next_national_election')
EU_set_all<-clean_EU_set[,-which(names(clean_EU_set) %in% vec_discard_all)]
table(EU_set_all$country_code,complete.cases(EU_set_all))
##
## TRUE
## AT 140
## BE 197
## BG 139
## CY 21
## CZ 212
## DE 1798
## DK 110
## EE 26
## ES 1000
## FI 100
## FR 1431
## GR 213
## HR 92
## HU 199
## IE 90
## IT 1153
## LT 61
## LU 15
## LV 34
## MT 11
## NL 343
## PL 978
## PT 245
## RO 420
## SE 195
## SI 41
## SK 137
EU_set_5c<-clean_EU_set[,-which(names(clean_EU_set) %in% vec_discard_5c)]
table(EU_set_5c$country_code,complete.cases(EU_set_5c))
##
## FALSE TRUE
## AT 140 0
## BE 197 0
## BG 139 0
## CY 21 0
## CZ 212 0
## DE 0 1798
## DK 110 0
## EE 26 0
## ES 0 1000
## FI 100 0
## FR 0 1431
## GR 213 0
## HR 92 0
## HU 199 0
## IE 90 0
## IT 0 1153
## LT 61 0
## LU 15 0
## LV 34 0
## MT 11 0
## NL 343 0
## PL 0 978
## PT 245 0
## RO 420 0
## SE 195 0
## SI 41 0
## SK 137 0
US_set_5c<-clean_US_set[,-which(names(clean_US_set) %in% vec_discard_5c)]
table(complete.cases(US_set_5c))
##
## TRUE
## 1008
US_set_all<-clean_US_set[,-which(names(clean_US_set) %in% vec_discard_all)]
table(complete.cases(US_set_all))
##
## TRUE
## 1008
UK_set_5c<-clean_UK_set[,-which(names(clean_UK_set) %in% vec_discard_5c)]
table(complete.cases(UK_set_5c))
##
## TRUE
## 1358
UK_set_all<-clean_UK_set[,-which(names(clean_UK_set) %in% vec_discard_all)]
table(complete.cases(UK_set_all))
##
## TRUE
## 1358
EU_list<-list(EU_set_5c=EU_set_5c, EU_set_all=EU_set_all)
US_list<-list(US_set_5c=US_set_5c, US_set_all=US_set_all)
UK_list<-list(UK_set_5c=UK_set_5c, UK_set_all=UK_set_all)
saveRDS(EU_list, paste(save_Path, 'EU_list.rds', sep='/'))
saveRDS(US_list, paste(save_Path, 'US_list.rds', sep='/'))
saveRDS(UK_list, paste(save_Path, 'UK_list.rds', sep='/'))
End of Step 3
Step 4: Factor relabeling
EU_set_5c<-EU_list$EU_set_5c
EU_set_all<-EU_list$EU_set_all
US_set_5c<-US_list$US_set_5c
US_set_all<-US_list$US_set_all
UK_set_5c<-UK_list$UK_set_5c
UK_set_all<-UK_list$UK_set_all
suppressMessages(library(dplyr))
relabel_column <- function(col_name, dat, metadata) {
label_names <- as.character(filter(metadata, q_id_reduced == col_name)$value)
return(factor(dat[,col_name], labels = label_names))
}
#The function would be applied on a loop to the selected columns.
relabel_frame <- function(col_range, dat, metadata) {
res <- dat
col_names <- colnames(dat)[col_range]
for (col_name in col_names) {res[,col_name] <- relabel_column(col_name, dat,metadata)}
return(res)
}
metadata<-read.csv(paste(data_Path,'codebook.csv',sep='/'))
metadata$q_id_reduced <- sapply(strsplit(as.character(metadata$question_id_mapped)," "),'[',2)
new_frame<-relabel_frame(col_range = 7:49,EU_set_5c,metadata)
new_frame[,1:2]<-NULL
EU_set_5c<-new_frame
new_frame<-relabel_frame(col_range = 7:45,EU_set_all,metadata)
new_frame[,1:2]<-NULL
EU_set_all<-new_frame
EU_list$EU_set_5c<-EU_set_5c[,-c(1:2)]
EU_list$EU_set_all<-EU_set_all[,-c(1:2)]
#In the question **vote_for_in_us_election** we absorb the level of "Hilary Clinton" into the "Other candidate".
#The final levels for this question will be three "Donald Trump", "Other candidate" and "I did not vote".
#We also reorder the levels in the vector for convenience.
metadata<-read.csv(paste(data_Path,'codebook.csv',sep='/'))
metadata$q_id_reduced <- sapply(strsplit(as.character(metadata$question_id_mapped)," "),'[',2)
new_frame<-relabel_frame(col_range = 6:49,US_set_5c,metadata)
new_frame[,1:2]<-NULL
US_set_5c<-new_frame
#for the 5 country US set
US_set_5c$vote_for_in_us_election[US_set_5c$vote_for_in_us_election=='Hillary Clinton']<-'Other candidate'
US_set_5c$vote_for_in_us_election<-factor(US_set_5c$vote_for_in_us_election, levels=c('Donald Trump','I did not vote','Other candidate' ))
new_frame<-relabel_frame(col_range = 6:45,US_set_all,metadata)
new_frame[,1:2]<-NULL
US_set_all<-new_frame
#for the all country US set.
US_set_all$vote_for_in_us_election[US_set_all$vote_for_in_us_election=='Hillary Clinton']<-'Other candidate'
US_set_all$vote_for_in_us_election<-factor(US_set_all$vote_for_in_us_election, levels=c('Donald Trump','I did not vote','Other candidate' ))
US_list$US_set_5c<-US_set_5c[,-c(1:2)]
US_list$US_set_all<-US_set_all[,-c(1:2)]
#we also reorder the levels of the UK referrendum
metadata<-read.csv(paste(data_Path,'codebook.csv',sep='/'))
metadata$value<-trimws(metadata$value)
metadata$q_id_reduced<-sapply(strsplit(as.character(metadata$question_id_mapped)," "),'[',2)
new_frame<-relabel_frame(col_range = 6:49,UK_set_5c,metadata)
new_frame[,1:2]<-NULL
UK_set_5c<-new_frame
UK_set_5c$vote_referendum<-factor(UK_set_5c$vote_referendum, levels=c("For the UK to leave the EU","I did not vote","For the UK to stay in the EU"))
new_frame<-relabel_frame(col_range = 6:45,UK_set_all,metadata)
new_frame[,1:2]<-NULL
UK_set_all<-new_frame
UK_set_all$vote_referendum<-factor(UK_set_all$vote_referendum, levels=c("For the UK to leave the EU","I did not vote","For the UK to stay in the EU"))
UK_list$UK_set_5c<-UK_set_5c[,-c(1:2)]
UK_list$UK_set_all<-UK_set_all[,-c(1:2)]
saveRDS(US_list, paste(save_Path, 'ready_US_list.rds', sep='/'))
saveRDS(UK_list, paste(save_Path, 'ready_UK_list.rds', sep='/'))
saveRDS(EU_list, paste(save_Path, 'ready_EU_list.rds', sep='/'))
data_Path<-'d:/uchicago/msca31008/09 Project/Kaggle/data_files'
save_Path<-'d:/uchicago/msca31008/09 Project/Kaggle'
result_Path<-paste(save_Path,'results',sep='/')
End of Step 4
Step 5: Multinomial regression
In this step we perform the following operations: US set
1- Create 10 different pairs of training and holdout subsets divided 80/20
2- Run a forward step function on each training subset
3- Using the resulting models we predict the training subset and the adjoining testing subset and plot the results.
4- Create an overall forward model using all 1008 datapoints of the US_set.
UK set
1- Create 10 different pairs of training and holdout subsets divided 70/30 (larger number of observations with UK set)
2- Run a forward step function and both [forward and backward] step function on each training subset
3- Using the resulting models we predict the training subset and the adjoining testing subset and plot the results.
4- Create an overall forward model(only) using all 1358 datapoints of the UK_set.
Perform a transAtlantic prediction by attempting to predict US election results using Brexit forward model and Brexit results using US election model.
Finally, extract the observations of 5 countries from the the EU_set_5c and attempt to predict populous vs. non-populous election results based on the US election model and the UK election model. Note that we also split the non-voters into those who swing to the populous vote and those who swing to a non-populous vote for a more complete picture.
set.seed(49851254)
US_set_5c<-ready_US_list$US_set_5c
list_names<-c(1:10)
t_sets<-vector('list', length=10)
h_sets<-vector('list', length=10)
for (i in 1:10){
N<-dim(US_set_5c)[1]
k<-dim(US_set_5c)[2]
selector<-(sample(c(1,0),N,replace=TRUE, prob=c(.8,.2)))
count<-0
while (count<100000){
temp<-sample(selector,N,replace=FALSE)
count<-count+1
}
selector<-temp
US_set_5c$selector<-selector
US_set_5c_h<-US_set_5c[US_set_5c$selector==FALSE,]
US_set_5c_t<-US_set_5c[US_set_5c$selector==TRUE,]
#remove selector column
US_set_5c_t$selector<-NULL
US_set_5c_h$selector<-NULL
US_set_5c$selector<-NULL
t_sets[[i]]<-US_set_5c_t
h_sets[[i]]<-US_set_5c_h
}
saveRDS(t_sets,paste(result_Path,'US_t_sets.rds',sep='/'))
saveRDS(h_sets,paste(result_Path,'US_h_sets.rds',sep='/'))
US_predicted_results_t_fwd<-vector('list', length=10)
US_predicted_results_h_fwd<-vector('list', length=10)
US_saved_model_fwd<-vector('list',length=10)
US_saved_model_null<-vector('list',length=10)
US_saved_model_full<-vector('list', length=10)
for (i in 1:10){
null_model<-nnet::multinom(vote_for_in_us_election~1, data=t_sets[[i]], MaxNWts=1000)
full_model<-nnet::multinom(vote_for_in_us_election~., data=t_sets[[i]], MaxNWts=1000, maxit=1000)
fwd_step<-step(null_model, scop=list(lower=null_model, upper=full_model), direction='forward', trace=0)
predict_training<-predict(fwd_step, type='probs')
index_t<-apply(predict_training, 1, function(x) which.max(x))
US_predicted_results_t_fwd[[i]]<-data.frame(predicted_t=index_t, actual=t_sets[[i]]$vote_for_in_us_election)
predict_holdout<-predict(fwd_step, newdata=h_sets[[i]],type='probs')
index_h<-apply(predict_holdout, 1, function(x) which.max(x))
US_predicted_results_h_fwd[[i]]<-data.frame(predicted_h=index_h, actual=h_sets[[i]]$vote_for_in_us_election)
US_saved_model_null[[i]]<-null_model
US_saved_model_full[[i]]<-full_model
US_saved_model_fwd[[i]]<-fwd_step
}
saveRDS(US_predicted_results_t_fwd, paste(result_Path, 'US_predicted_results_t_fwd.rds', sep='/'))
saveRDS(US_predicted_results_h_fwd, paste(result_Path, 'US_predicted_results_h_fwd.rds', sep='/'))
saveRDS(US_saved_model_null, paste(result_Path,'US_saved_model_null.rds',sep='/'))
saveRDS(US_saved_model_full, paste(result_Path,'US_saved_model_full.rds',sep='/'))
saveRDS(US_saved_model_fwd, paste(result_Path,'US_saved_model_fwd.rds',sep='/'))
#US training set predictive power
##################################
Predicting_Trump_t<-c()
Predicting_Trump_h<-c()
for (i in 1:10){
Predicting_Trump_t<-c(Predicting_Trump_t, round(prop.table(table(US_predicted_results_t_fwd[[i]]),2),3)[1,1])
Predicting_Trump_h<-c(Predicting_Trump_h, round(prop.table(table(US_predicted_results_h_fwd[[i]]),2),3)[1,1])
}
plot(c(1:10), Predicting_Trump_t, col='red',ylim=c(0.2,1), type='l',ylab='Predicting Trump Win',xlab='Random set', sub='Figure 1', main='US set holdout validation')
lines(c(1:10), Predicting_Trump_h, col='blue')
points(c(1:10), Predicting_Trump_t, col='red',pch=20)
points(c(1:10), Predicting_Trump_h, col='blue',pch=20)
legend('bottomright', col=c('red','blue'), c('training','holdout'), pch=c(20,20))
US_predicted_results_t_fwd<-readRDS(paste(result_Path))
#Building an overall US model based on the forward step model
############################################################
overall_null_model_US<-nnet::multinom(vote_for_in_us_election~1, US_set_5c, MaxNWts=1000)
overall_full_model_US<-nnet::multinom(vote_for_in_us_election~., US_set_5c, MaxNWts=1000, maxit=1000)
overall_fwd_step_US<-step(overall_null_model_US, scop=list(lower=overall_null_model_US, upper=overall_full_model_US), direction='forward', trace=0)
saveRDS(overall_fwd_step_US,paste(result_Path,'overall_fwd_step_US.rds',sep='/'))
#UK Study
########
set.seed(49851254)
UK_set_5c<-ready_UK_list$UK_set_5c
list_names<-c(1:10)
UK_t_sets<-vector('list', length=10)
UK_h_sets<-vector('list', length=10)
for (i in 1:10){
N<-dim(UK_set_5c)[1]
k<-dim(UK_set_5c)[2]
selector<-(sample(c(1,0),N,replace=TRUE, prob=c(.7,.3)))
count<-0
while (count<100000){
temp<-sample(selector,N,replace=FALSE)
count<-count+1
}
selector<-temp
UK_set_5c$selector<-selector
UK_set_5c_h<-UK_set_5c[UK_set_5c$selector==FALSE,]
UK_set_5c_t<-UK_set_5c[UK_set_5c$selector==TRUE,]
#remove selector column
UK_set_5c_t$selector<-NULL
UK_set_5c_h$selector<-NULL
UK_set_5c$selector<-NULL
UK_t_sets[[i]]<-UK_set_5c_t
UK_h_sets[[i]]<-UK_set_5c_h
}
saveRDS(UK_t_sets,paste(result_Path,'UK_t_sets.rds',sep='/'))
saveRDS(UK_h_sets,paste(result_Path,'UK_h_sets.rds',sep='/'))
UK_predicted_results_t_fwd<-vector('list', length=10)
UK_predicted_results_h_fwd<-vector('list', length=10)
UK_predicted_results_t_both<-vector('list', length=10)
UK_predicted_results_h_both<-vector('list', length=10)
UK_saved_model_null<-vector('list',length=10)
UK_saved_model_full<-vector('list',length=10)
UK_saved_model_fwd<-vector('list',length=10)
UK_saved_model_both<-vector('list',length=10)
for (i in 1:10){
UK_null_model<-nnet::multinom(vote_referendum~1, data=UK_t_sets[[i]], MaxNWts=1000)
UK_full_model<-nnet::multinom(vote_referendum~., data=UK_t_sets[[i]], MaxNWts=1000, maxit=1000)
######################### Fowrard Model
fwd_step<-step(UK_null_model,scop=list(lower=UK_null_model, upper=UK_full_model), direction='forward', trace=0)
UK_predict_training_fwd<-predict(fwd_step, type='probs')
UK_index_t_fwd<-apply(UK_predict_training_fwd, 1, function(x) which.max(x))
UK_predicted_results_t_fwd[[i]]<-data.frame(UK_predicted_t=UK_index_t_fwd, actual=UK_t_sets[[i]]$vote_referendum)
UK_predict_holdout_fwd<-predict(fwd_step, newdata=UK_h_sets[[i]],type='probs')
UK_index_h_fwd<-apply(UK_predict_holdout_fwd, 1, function(x) which.max(x))
UK_predicted_results_h_fwd[[i]]<-data.frame(UK_predicted_h=UK_index_h_fwd, actual=UK_h_sets[[i]]$vote_referendum)
UK_saved_model_fwd[[i]]<-fwd_step
######################### Both Model (forward and backward)
both_step<-step(UK_null_model, scop=list(upper=UK_full_model), direction='both', trace=0)
UK_predict_training_both<-predict(both_step, type='probs')
UK_index_t_both<-apply(UK_predict_training_both, 1, function(x) which.max(x))
UK_predicted_results_t_both[[i]]<-data.frame(UK_predicted_t=UK_index_t_both, actual=UK_t_sets[[i]]$vote_referendum)
UK_predict_holdout_both<-predict(both_step, newdata=UK_h_sets[[i]],type='probs')
UK_index_h_both<-apply(UK_predict_holdout_both, 1, function(x) which.max(x))
UK_predicted_results_h_both[[i]]<-data.frame(UK_predicted_h=UK_index_h_both, actual=UK_h_sets[[i]]$vote_referendum)
UK_saved_model_both[[i]]<-both_step
UK_saved_model_null[[i]]<-UK_null_model
UK_saved_model_full[[i]]<-UK_full_model
}
saveRDS(UK_predicted_results_t_fwd, paste(save_Path, 'UK_predicted_results_t_fwd.rds', sep='/'))
saveRDS(UK_predicted_results_h_fwd, paste(save_Path, 'UK_predicted_results_h_fwd.rds', sep='/'))
saveRDS(UK_predicted_results_t_both, paste(save_Path, 'UK_predicted_results_t_both.rds', sep='/'))
saveRDS(UK_predicted_results_h_both, paste(save_Path, 'UK_predicted_results_h_both.rds', sep='/'))
saveRDS(UK_saved_model_null, paste(result_Path,'UK_saved_model_null.rds',sep='/'))
saveRDS(UK_saved_model_full, paste(result_Path,'UK_saved_model_full.rds',sep='/'))
saveRDS(UK_saved_model_fwd, paste(result_Path,'UK_saved_model_fwd.rds',sep='/'))
saveRDS(UK_saved_model_both, paste(result_Path,'UK_saved_model_both.rds',sep='/'))
Predicting_Brexit_t_fwd<-c()
Predicting_Brexit_h_fwd<-c()
for (i in 1:10){
Predicting_Brexit_t_fwd<-c(Predicting_Brexit_t_fwd,round(prop.table(table(UK_predicted_results_t_fwd[[i]]),2),3)[1,1])
Predicting_Brexit_h_fwd<-c(Predicting_Brexit_h_fwd,round(prop.table(table(UK_predicted_results_h_fwd[[i]]),2),3)[1,1])
}
plot(c(1:10), Predicting_Brexit_t_fwd, col='red',ylim=c(0.2,1), type='l',ylab='Predicting Brexit',xlab='Random set', sub='Figure 2', main='UK set holdout validation Forward Model')
lines(c(1:10), Predicting_Brexit_h_fwd, col='blue')
points(c(1:10), Predicting_Brexit_t_fwd, col='red',pch=20)
points(c(1:10), Predicting_Brexit_h_fwd, col='blue',pch=20)
legend('bottomright', col=c('red','blue'), c('training','holdout'), pch=c(20,20))
Predicting_Brexit_t_both<-c()
Predicting_Brexit_h_both<-c()
for (i in 1:10){
Predicting_Brexit_t_both<-c(Predicting_Brexit_t_both,round(prop.table(table(UK_predicted_results_t_both[[i]]),2),3)[1,1])
Predicting_Brexit_h_both<-c(Predicting_Brexit_h_both,round(prop.table(table(UK_predicted_results_h_both[[i]]),2),3)[1,1])
}
plot(c(1:10), Predicting_Brexit_t_both, col='red',ylim=c(0.2,1), type='l',ylab='Predicting Brexit',xlab='Random set', sub='Figure 3', main='UK set holdout validation Both Model')
lines(c(1:10), Predicting_Brexit_h_both, col='blue')
points(c(1:10), Predicting_Brexit_t_both, col='red',pch=20)
points(c(1:10), Predicting_Brexit_h_both, col='blue',pch=20)
legend('bottomright', col=c('red','blue'), c('training','holdout'), pch=c(20,20))
#Building an overall UK model based on the forward step model
############################################################
overall_null_model_UK<-nnet::multinom(vote_referendum~1, UK_set_5c, MaxNWts=1000)
overall_full_model_UK<-nnet::multinom(vote_referendum~., UK_set_5c, MaxNWts=1000, maxit=1000)
overall_fwd_step_UK<-step(overall_null_model_UK, scop=list(lower=overall_null_model_UK, upper=overall_full_model_UK), direction='forward', trace=0)
saveRDS(overall_fwd_step_UK, paste(save_Path, 'overall_fwd_step_UK.rds', sep='/'))
#Predicting the UK referendum from the US election model
############################################################
library(nnet)
predicting_referrendum<-predict(overall_fwd_step_US, newdata=UK_set_5c, type='probs')
index_referrendum<-apply(predicting_referrendum, 1, function(x) which.max(x))
actual_referrendum<-UK_set_5c$vote_referendum
prop.table(table(predicted=index_referrendum,actual=actual_referrendum),2)
## actual
## predicted For the UK to leave the EU I did not vote
## 1 0.3243243 0.1525424
## 2 0.2776413 0.6392252
## 3 0.3980344 0.2082324
## actual
## predicted For the UK to stay in the EU
## 1 0.1394052
## 2 0.3420074
## 3 0.5185874
#Predicting the US election from the UK referendum model
############################################################
library(nnet)
predicting_election<-predict(overall_fwd_step_UK, newdata=US_set_5c, type='probs')
index_election<-apply(predicting_election, 1, function(x) which.max(x))
actual_election<-US_set_5c$vote_for_in_us_election
prop.table(table(predicted=index_election,actual=actual_election),2)
#EU 5 countries results prediction
##################################
table(EU_set_5c$country_code)
##
## AT BE BG CY CZ DE DK EE ES FI FR GR HR HU IE
## 140 197 139 21 212 1798 110 26 1000 100 1431 213 92 199 90
## IT LT LU LV MT NL PL PT RO SE SI SK
## 1153 61 15 34 11 343 978 245 420 195 41 137
#EU_set_5c$household_finances_past12months<-factor(EU_set_5c$household_finances_past12months, levels=c("a lot worse","a little better","a little worse","a lot better","Don't know","the same"))
#Germany
DE_set<-EU_set_5c[EU_set_5c$country_code=='DE',]
DE_set$country_code<-NULL
#France
FR_set<-EU_set_5c[EU_set_5c$country_code=='FR',]
FR_set$country_code<-NULL
#Italy
IT_set<-EU_set_5c[EU_set_5c$country_code=='IT',]
IT_set$country_code<-NULL
#Poland
PL_set<-EU_set_5c[EU_set_5c$country_code=='PL',]
PL_set$country_code<-NULL
#Spain
ES_set<-EU_set_5c[EU_set_5c$country_code=='ES',]
ES_set$country_code<-NULL
#US election Predicting UK Brexit
#################################
round(prop.table(table(predicted_UK_ref=predict(overall_fwd_step_US, newdata = UK_set_5c),actual=UK_set_5c$vote_referendum),2),3)
## actual
## predicted_UK_ref For the UK to leave the EU I did not vote
## Donald Trump 0.324 0.153
## I did not vote 0.278 0.639
## Other candidate 0.398 0.208
## actual
## predicted_UK_ref For the UK to stay in the EU
## Donald Trump 0.139
## I did not vote 0.342
## Other candidate 0.519
#UK Brexit Predicting US election
#################################
round(prop.table(table(predicted_US_elec=predict(overall_fwd_step_UK, newdata = US_set_5c),actual=US_set_5c$vote_for_in_us_election),2),3)
## actual
## predicted_US_elec Donald Trump I did not vote Other candidate
## For the UK to leave the EU 0.616 0.250 0.287
## I did not vote 0.121 0.432 0.114
## For the UK to stay in the EU 0.263 0.318 0.599
#Using both US election and UK Brexit model as predictors:
#DE Predictions
##############
data.frame(rbind(
DE_UK_predictor=round(prop.table(table(factor(apply(predict(overall_fwd_step_UK,newdata=DE_set,type='probs'),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3),
DE_US_predictor=round(prop.table(table(factor(apply(na.omit(predict(overall_fwd_step_US,newdata=DE_set,type='probs')),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3)
))
## populist no.vote non.populist
## DE_UK_predictor 0.335 0.219 0.446
## DE_US_predictor 0.144 0.414 0.442
#swing populist and swing non-populist
result_1<-apply(predict(overall_fwd_step_UK,newdata=DE_set,type='probs'), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
result_2<-apply(na.omit(predict(overall_fwd_step_US,newdata=DE_set,type='probs')), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
data.frame(rbind(
DE_UK_predictor=round(prop.table(table(factor(result_1,levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3),
DE_US_predictor=round(prop.table(table(factor(result_2, levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3)
))
## populist swing.populist swing.non.populist non.populist
## DE_UK_predictor 0.335 0.062 0.157 0.446
## DE_US_predictor 0.144 0.086 0.328 0.442
#FR Predictions
###############
data.frame(rbind(
FR_UK_predictor=round(prop.table(table(factor(apply(predict(overall_fwd_step_UK,newdata=FR_set,type='probs'),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3),
FR_US_predictor=round(prop.table(table(factor(apply(na.omit(predict(overall_fwd_step_US,newdata=FR_set,type='probs')),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3)
))
## populist no.vote non.populist
## FR_UK_predictor 0.291 0.220 0.489
## FR_US_predictor 0.189 0.409 0.402
#swing populist and swing non-populist
result_1<-apply(predict(overall_fwd_step_UK,newdata=FR_set,type='probs'), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
result_2<-apply(na.omit(predict(overall_fwd_step_US,newdata=FR_set,type='probs')), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
data.frame(rbind(
FR_UK_predictor=round(prop.table(table(factor(result_1,levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3),
FR_US_predictor=round(prop.table(table(factor(result_2, levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3)
))
## populist swing.populist swing.non.populist non.populist
## FR_UK_predictor 0.291 0.068 0.152 0.489
## FR_US_predictor 0.189 0.105 0.304 0.402
#PL Predictions
###############
data.frame(rbind(
PL_UK_predictor=round(prop.table(table(factor(apply(predict(overall_fwd_step_UK,newdata=PL_set,type='probs'),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3),
PL_US_predictor=round(prop.table(table(factor(apply(na.omit(predict(overall_fwd_step_US,newdata=PL_set,type='probs')),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3)
))
## populist no.vote non.populist
## PL_UK_predictor 0.326 0.179 0.495
## PL_US_predictor 0.231 0.269 0.500
#swing populist and swing non-populist
result_1<-apply(predict(overall_fwd_step_UK,newdata=PL_set,type='probs'), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
result_2<-apply(na.omit(predict(overall_fwd_step_US,newdata=PL_set,type='probs')), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
data.frame(rbind(
PL_UK_predictor=round(prop.table(table(factor(result_1,levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3),
PL_US_predictor=round(prop.table(table(factor(result_2, levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3)
))
## populist swing.populist swing.non.populist non.populist
## PL_UK_predictor 0.326 0.060 0.119 0.495
## PL_US_predictor 0.231 0.071 0.198 0.500
#ES Predictions
###############
data.frame(rbind(
ES_UK_predictor=round(prop.table(table(factor(apply(predict(overall_fwd_step_UK,newdata=ES_set,type='probs'),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3),
ES_US_predictor=round(prop.table(table(factor(apply(na.omit(predict(overall_fwd_step_US,newdata=ES_set,type='probs')),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3)
))
## populist no.vote non.populist
## ES_UK_predictor 0.302 0.141 0.557
## ES_US_predictor 0.115 0.350 0.535
#swing populist and swing non-populist
result_1<-apply(predict(overall_fwd_step_UK,newdata=ES_set,type='probs'), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
result_2<-apply(na.omit(predict(overall_fwd_step_US,newdata=ES_set,type='probs')), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
data.frame(rbind(
ES_UK_predictor=round(prop.table(table(factor(result_1,levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3),
ES_US_predictor=round(prop.table(table(factor(result_2, levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3)
))
## populist swing.populist swing.non.populist non.populist
## ES_UK_predictor 0.302 0.051 0.090 0.557
## ES_US_predictor 0.115 0.063 0.287 0.535
#IT Predictions
###############
data.frame(rbind(
IT_UK_predictor=round(prop.table(table(factor(apply(predict(overall_fwd_step_UK,newdata=IT_set,type='probs'),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3),
IT_US_predictor=round(prop.table(table(factor(apply(na.omit(predict(overall_fwd_step_US,newdata=IT_set,type='probs')),1, function(x) which.max(x)),labels=c('populist','no vote','non-populist')))),3)
))
## populist no.vote non.populist
## IT_UK_predictor 0.388 0.150 0.462
## IT_US_predictor 0.193 0.358 0.449
#swing populist and swing non-populist
result_1<-apply(predict(overall_fwd_step_UK,newdata=IT_set,type='probs'), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
result_2<-apply(na.omit(predict(overall_fwd_step_US,newdata=IT_set,type='probs')), 1,function(voter) {
res<-which.max(voter)
if(res == 2){ res<-10*which.max(voter[-2])}
return(res)
})
data.frame(rbind(
IT_UK_predictor=round(prop.table(table(factor(result_1,levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3),
IT_US_predictor=round(prop.table(table(factor(result_2, levels=c(1,10,20,3),labels=c("populist","swing populist","swing non-populist","non-populist")))),3)
))
## populist swing.populist swing.non.populist non.populist
## IT_UK_predictor 0.388 0.049 0.101 0.462
## IT_US_predictor 0.193 0.088 0.271 0.449
End of Step 5
Step 6: Classification Tree Models
#Tree Model for US set:
library(rpart)
overall_US_tree<-rpart(vote_for_in_us_election~., data=US_set_5c,control=rpart.control(cp=0,minsplit=30,xval=10, maxsurrogate=0))
overall_UK_tree<-rpart(vote_referendum~., data=UK_set_5c,control=rpart.control(cp=0,minsplit=30,xval=10, maxsurrogate=0))
round(prop.table(table(predict_UK_ref=predict(overall_US_tree, newdata = UK_set_5c, type='class'), actual=UK_set_5c$vote_referendum),2),3)
## actual
## predict_UK_ref For the UK to leave the EU I did not vote
## Donald Trump 0.376 0.162
## I did not vote 0.292 0.593
## Other candidate 0.332 0.245
## actual
## predict_UK_ref For the UK to stay in the EU
## Donald Trump 0.188
## I did not vote 0.286
## Other candidate 0.526
round(prop.table(table(predict_US_elec=predict(overall_UK_tree, newdata = US_set_5c, type='class'), actual=US_set_5c$vote_for_in_us_election),2),3)
## actual
## predict_US_elec Donald Trump I did not vote Other candidate
## For the UK to leave the EU 0.422 0.272 0.306
## I did not vote 0.162 0.451 0.220
## For the UK to stay in the EU 0.416 0.278 0.474
printcp(overall_US_tree)
##
## Classification tree:
## rpart(formula = vote_for_in_us_election ~ ., data = US_set_5c,
## control = rpart.control(cp = 0, minsplit = 30, xval = 10,
## maxsurrogate = 0))
##
## Variables actually used in tree construction:
## [1] age
## [2] change_household_finances_next12months
## [3] country_direction_past5years
## [4] currentplace_change_past5years
## [5] economy_country_next12months
## [6] employment_status
## [7] family_friends_highereducation
## [8] gender
## [9] gender_discrimination_importance
## [10] household_finances_past12months
## [11] important_issues_when_voting
## [12] income_net_monthly
## [13] media_tv_hours
## [14] religion
## [15] social_networks_regularly_used
## [16] status_national_economy
## [17] work_type_manual
## [18] work_type_routine
## [19] worldview
##
## Root node error: 639/1008 = 0.63393
##
## n= 1008
##
## CP nsplit rel error xerror xstd
## 1 0.1737089 0 1.00000 1.00000 0.023935
## 2 0.0798122 1 0.82629 0.86385 0.024730
## 3 0.0266041 2 0.74648 0.80125 0.024840
## 4 0.0219092 3 0.71987 0.77934 0.024841
## 5 0.0203443 4 0.69797 0.78873 0.024843
## 6 0.0187793 5 0.67762 0.79030 0.024843
## 7 0.0152582 6 0.65884 0.80125 0.024840
## 8 0.0104330 12 0.56338 0.77465 0.024839
## 9 0.0093897 15 0.53208 0.75587 0.024821
## 10 0.0078247 17 0.51330 0.75430 0.024819
## 11 0.0062598 19 0.49765 0.76056 0.024827
## 12 0.0046948 21 0.48513 0.75274 0.024817
## 13 0.0039124 26 0.46166 0.73865 0.024793
## 14 0.0031299 28 0.45383 0.73709 0.024789
## 15 0.0015649 32 0.44131 0.76995 0.024836
## 16 0.0000000 33 0.43975 0.78717 0.024843
plotcp(overall_US_tree)
pruned_overall_US_tree<-rpart(vote_for_in_us_election~., data=US_set_5c,control=rpart.control(cp=0.0104330,minsplit=30,xval=10, maxsurrogate=0))
rpart.plot::prp(pruned_overall_US_tree,2,cex=1.3,varlen=0, faclen=0,facsep ="\n", compress=TRUE, leaf.round=1, fallen.leaves = FALSE,type=3,box.palette = 'auto' )
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
snip_right<-rpart::snip.rpart(pruned_overall_US_tree,3)
snip_left<-rpart::snip.rpart(pruned_overall_US_tree,2)
rpart.plot::prp(snip_right,2,cex=1.3,varlen=0, faclen=0,facsep ="\n", compress=TRUE, leaf.round=1, fallen.leaves = FALSE,type=3,box.palette = 'auto' )
rpart.plot::prp(snip_left,2,cex=1.3,varlen=0, faclen=0,facsep ="\n", compress=TRUE, leaf.round=1, fallen.leaves = FALSE,type=3,box.palette = 'auto' )
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
printcp(overall_UK_tree)
##
## Classification tree:
## rpart(formula = vote_referendum ~ ., data = UK_set_5c, control = rpart.control(cp = 0,
## minsplit = 30, xval = 10, maxsurrogate = 0))
##
## Variables actually used in tree construction:
## [1] age
## [2] change_economy_country_past12months
## [3] change_household_finances_next12months
## [4] country_comes_first
## [5] currentplace_change_past5years
## [6] disposable_income
## [7] economy_country_next12months
## [8] education_level
## [9] employment_status
## [10] family_friends_highereducation
## [11] financial_security
## [12] financial_situation_change_past5years
## [13] household_size
## [14] important_issues_when_voting
## [15] income_net_monthly
## [16] international_trade_gain_or_loss
## [17] media_tv_hours
## [18] perceived_effect_of_diversity
## [19] social_networks_regularly_used
## [20] work_type_routine
##
## Root node error: 820/1358 = 0.60383
##
## n= 1358
##
## CP nsplit rel error xerror xstd
## 1 0.1109756 0 1.00000 1.00000 0.021980
## 2 0.0951220 1 0.88902 0.90244 0.022379
## 3 0.0256098 2 0.79390 0.82927 0.022470
## 4 0.0115854 6 0.68415 0.78049 0.022433
## 5 0.0091463 8 0.66098 0.78537 0.022440
## 6 0.0085366 10 0.64268 0.78171 0.022435
## 7 0.0073171 12 0.62561 0.77561 0.022425
## 8 0.0065041 13 0.61829 0.76951 0.022414
## 9 0.0060976 21 0.55732 0.77195 0.022419
## 10 0.0048780 23 0.54512 0.78659 0.022442
## 11 0.0042683 27 0.52561 0.78537 0.022440
## 12 0.0036585 29 0.51707 0.78293 0.022437
## 13 0.0024390 35 0.49512 0.78780 0.022444
## 14 0.0012195 36 0.49268 0.78415 0.022439
## 15 0.0000000 39 0.48902 0.78780 0.022444
plotcp(overall_UK_tree)
pruned_overall_UK_tree<-rpart(vote_referendum~., data=UK_set_5c,control=rpart.control(cp=0.01158537,minsplit=30,xval=10, maxsurrogate=0))
rpart.plot::prp(pruned_overall_UK_tree,2,cex=1.3,varlen=0, faclen=0,facsep ="\n", compress=TRUE, leaf.round=1, fallen.leaves = FALSE, type=3, box.palette = 'auto' )
snip_right_UK<-rpart::snip.rpart(pruned_overall_UK_tree,3)
snip_left_UK<-rpart::snip.rpart(pruned_overall_UK_tree,2)
snip_left_UK_a<-rpart::snip.rpart(pruned_overall_UK_tree,6)
snip_left_UK_b<-rpart::snip.rpart(pruned_overall_UK_tree,7)
rpart.plot::prp(snip_right_UK,2,cex=1.3,varlen=0, faclen=0,facsep ="\n", compress=TRUE, leaf.round=1, fallen.leaves = FALSE, type=3, box.palette = 'auto' )
rpart.plot::prp(snip_left_UK_a,2,cex=1.3,varlen=0, faclen=0,facsep ="\n", compress=TRUE, leaf.round=1, fallen.leaves = FALSE, type=3, box.palette = 'auto' )
rpart.plot::prp(snip_left_UK_b,2,cex=1.3,varlen=0, faclen=0,facsep ="\n", compress=TRUE, leaf.round=1, fallen.leaves = FALSE, type=3, box.palette = 'auto' )