Logistics

  • DEADLINE: Sunday, March 15th at midnight.
  • You can do this homework in groups of two at most.
  • This homework has a total of 20 points. The number of points for each section are clearly indicated in the section and/or subsection headings.
  • You need to hand in this .Rmd file on Moodle and send me the .html file that is produced when you click on the “Knit” button through Slack. If you are having issues knitting your file do get in touch (and knit regularly to make sure everything is working properly).
  • We left code blocks in this .Rmd file for you to fill out. Just put your code within the marked region, and click the Knit button in RStudio, just above this script, to see the generated HTML output. You can actually knit it before you add any code to see how nice the output looks like!
  • You can find a nice intro to Rmarkdown (.Rmd) here.

2017 French Presidential Elections - Analysis of the First Round Results

This midterm project consists in one long analysis of data from the 1st round of the 2017 French presidential elections.

You will be asked to download the data from its original source, just like you would if you were to undertake this task by yourself. You will use everything we’ve learned so far: importing (un-tidy) data; summarizing, visualizing and tidying it; running regressions for different variables; and thinking about whether these associations can be interpreted causally. In the process, you will also learn 2 useful operations: merging 2 (or more) datasets together, and “reshaping” data from wide to long format. Don’t worry we will work you through how to do all of this.

I hope you find this quiz interesting and useful. Now let’s get going! 💪

Loading the Data [0.5 points]

Download the data from the 1st round of the 2017 French presidential elections from here:

https://www.data.gouv.fr/s/resources/election-presidentielle-des-23-avril-et-7-mai-2017-resultats-definitifs-du-1er-tour-par-communes/20170427-100544/Presidentielle_2017_Resultats_Communes_Tour_1_c.xls

This is straight from the French Ministry of the Interior which is in charge of collecting electoral data. This dataset contains election results aggregated at the municipality level.

1. Load it into RStudio in an object called elections_2017. Since the file is an Excel spreadsheet (ends in .xls), you need to use the readxl package.

To import this file you need to:

  • Find out which argument of the read_excel() function (which comes from the readxl package) allows you to skip the first 3 rows of the spreadsheet, which do not contain data (you can see that by yourself by opening the file in Excel).

  • Specify the argument guess_max = 35723 to ensure the entire spreadsheet is analysed before the data types of each columns are guessed by R (if you do not add this argument the column for departement codes will be numeric and therefore replace character values with NAs, which we do not want).

  • Add the argument .name_repair = ~ janitor::make_clean_names, which requires the janitor package, which will make it easier to manipulate the data later on. This argument does what it says: it renames the variables if need be so that every column names are “unique and consist only of the _ character, numbers, and letters.”

You can safely ignore the warning message In is.na(name) : is.na() applied to non-(list or vector) of type 'closure'.

library("tidyverse")
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("readxl", "janitor")
 elections_2017 = read_excel("/home/joe/R/Midterm/Presidentielle_2017_Resultats_Communes_Tour_1_c.xls", skip = 3, guess_max = 35723, .name_repair = ~ janitor::make_clean_names)
## Warning in is.na(name): is.na() applied to non-(list or vector) of type
## 'closure'

Now that the data is loaded let’s get a feel for what it contains

Understanding what the dataset contains [0.5 points]

1. Use head to display the first 6 rows of the dataset.

head(elections_2017, n=6)

2. What’s the unit of observation?

Voters

3. How many variables are there?

ncol(elections_2017)
## [1] 95

4. How many rows are there?

nrow(elections_2017)
## [1] 35719

Cleaning and tidying the data [3 points]

1. All the variable names are in French, so let’s rename some relevant variables and give them an English name. Make sure you rename the variables and do not create new variables. This is important for the rest of the cleaning. Create a new object elections_2017_clean for this task.

Rename the following variables as follows: (old variable name -> new variable name)

  1. departement code: code_du_departement -> code_departement
  2. departement name: libelle_du_departement -> name_departement
  3. municipality code: code_de_la_commune -> code_municipality
  4. municipality namelibelle_de_la_commune -> name_municipality
  5. number of registered voters: inscrits -> registered
  6. number of actual votes: votants -> voted
elections_2017 = elections_2017 %>% 
rename(
    code_departement = code_du_departement,
    name_departement = libelle_du_departement,
    code_municipality = code_de_la_commune,
    name_municipality = libelle_de_la_commune,
    registered = inscrits,
    voted = votants
    )

2. Display the names of the variables to make sure everything was done correctly.

colnames(elections_2017)
##  [1] "code_departement"    "name_departement"    "code_municipality"  
##  [4] "name_municipality"   "registered"          "abstentions"        
##  [7] "percent_abs_ins"     "voted"               "percent_vot_ins"    
## [10] "blancs"              "percent_blancs_ins"  "percent_blancs_vot" 
## [13] "nuls"                "percent_nuls_ins"    "percent_nuls_vot"   
## [16] "exprimes"            "percent_exp_ins"     "percent_exp_vot"    
## [19] "n_panneau"           "sexe"                "nom"                
## [22] "prenom"              "voix"                "percent_voix_ins"   
## [25] "percent_voix_exp"    "n_panneau_2"         "sexe_2"             
## [28] "nom_2"               "prenom_2"            "voix_2"             
## [31] "percent_voix_ins_2"  "percent_voix_exp_2"  "n_panneau_3"        
## [34] "sexe_3"              "nom_3"               "prenom_3"           
## [37] "voix_3"              "percent_voix_ins_3"  "percent_voix_exp_3" 
## [40] "n_panneau_4"         "sexe_4"              "nom_4"              
## [43] "prenom_4"            "voix_4"              "percent_voix_ins_4" 
## [46] "percent_voix_exp_4"  "n_panneau_5"         "sexe_5"             
## [49] "nom_5"               "prenom_5"            "voix_5"             
## [52] "percent_voix_ins_5"  "percent_voix_exp_5"  "n_panneau_6"        
## [55] "sexe_6"              "nom_6"               "prenom_6"           
## [58] "voix_6"              "percent_voix_ins_6"  "percent_voix_exp_6" 
## [61] "n_panneau_7"         "sexe_7"              "nom_7"              
## [64] "prenom_7"            "voix_7"              "percent_voix_ins_7" 
## [67] "percent_voix_exp_7"  "n_panneau_8"         "sexe_8"             
## [70] "nom_8"               "prenom_8"            "voix_8"             
## [73] "percent_voix_ins_8"  "percent_voix_exp_8"  "n_panneau_9"        
## [76] "sexe_9"              "nom_9"               "prenom_9"           
## [79] "voix_9"              "percent_voix_ins_9"  "percent_voix_exp_9" 
## [82] "n_panneau_10"        "sexe_10"             "nom_10"             
## [85] "prenom_10"           "voix_10"             "percent_voix_ins_10"
## [88] "percent_voix_exp_10" "n_panneau_11"        "sexe_11"            
## [91] "nom_11"              "prenom_11"           "voix_11"            
## [94] "percent_voix_ins_11" "percent_voix_exp_11"

3. Create a variable blank_votes equal to the sum of blancs and nuls. This will be useful to compute the share of votes for each candidates since in France, candidate vote shares are computed as a share of the number of votes minus the number of blank votes.

elections_2017 = elections_2017 %>%
mutate(blank_votes = blancs + nuls)

4. Let’s only keep a subset of the variables to make our life easier. Keep only the following variables:

code_departement, name_departement, code_municipality, name_municipality, registered, voted, blank_votes, sexe, nom, prenom, voix, sexe_2, nom_2, prenom_2, voix_2, nom_3, prenom_3, voix_3, nom_4, prenom_4, voix_4, nom_5, prenom_5, voix_5, nom_6, prenom_6, voix_6, nom_7, prenom_7, voix_7, nom_8, prenom_8, voix_8, nom_9, prenom_9, voix_9, nom_10, prenom_10, voix_10, nom_11, prenom_11, voix_11

To faciliate this selection you can use the following code in the select() function:

matches("departement$|municipality$"), registered, voted, blank_votes, matches("^sexe|^nom|^prenom|^voix")

What this code does is it selects columns whose names end (hence the $s in the first “matches”) with department or (hence the |) municipality, and start with (hence the ^s in the second “matches”) sexe or nom or prenom or voix, as well as the registered, voted, blank_votes variables. The $ and ^ signs in the code are called “regular expressions”. You don’t need to know more about them but if you’re curious, here’s a cheatsheet on regular expressions: https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf.

elections_2017 = select(elections_2017, matches("departement$|municipality$"), registered, voted, blank_votes, matches("^sexe|^nom|^prenom|^voix"))

5. Again you should display the names of the variables to make sure the selection was done as expected.

colnames(elections_2017)
##  [1] "code_departement"  "name_departement"  "code_municipality"
##  [4] "name_municipality" "registered"        "voted"            
##  [7] "blank_votes"       "sexe"              "nom"              
## [10] "prenom"            "voix"              "sexe_2"           
## [13] "nom_2"             "prenom_2"          "voix_2"           
## [16] "sexe_3"            "nom_3"             "prenom_3"         
## [19] "voix_3"            "sexe_4"            "nom_4"            
## [22] "prenom_4"          "voix_4"            "sexe_5"           
## [25] "nom_5"             "prenom_5"          "voix_5"           
## [28] "sexe_6"            "nom_6"             "prenom_6"         
## [31] "voix_6"            "sexe_7"            "nom_7"            
## [34] "prenom_7"          "voix_7"            "sexe_8"           
## [37] "nom_8"             "prenom_8"          "voix_8"           
## [40] "sexe_9"            "nom_9"             "prenom_9"         
## [43] "voix_9"            "sexe_10"           "nom_10"           
## [46] "prenom_10"         "voix_10"           "sexe_11"          
## [49] "nom_11"            "prenom_11"         "voix_11"

6. Let’s create the following variable:

  • pct_voted = voted/registered * 100: the percentage of voters among those registered.
elections_2017 = elections_2017 %>%
mutate(pct_voted= voted/registered*100)

7. For how many municipalities is pct_voted missing?

sum(is.na(elections_2017$pct_voted))
## [1] 1

8. Why is the share of voters missing?

Douchanbe has zero registered voters

Analysis of candidates’ vote shares

Some more tidying [4.5 points]

Before we actually perform some analysis on the candidates’ vote shares, we need to do some more data tidying.

Notice how inconveniently the data is formatted: each sexe_#, nom_#, prenom_# and voix_# variables correspond to details for the candidate that was ranked # in a given municipality.
For example: in the first row, municipality called “L’Abergement-Clémenciat”, nom_2 is MACRON which means that in that municipality Macron obtained the second highest number of votes. But in the third row, it was MÉLENCHON who arrived second.

We would like to “gather” all these variables into only 4 variables: rank (containing the # value), candidate_gender (sexe_#), candidate_name (a combination of nom_# and prenom_#), and votes (voix_#). In data parlance we want to “reshape” the data from “wide” format to “long” format.

1. To do this we first need to rename the variables sexe, nom, prenom and voix to sexe_1, nom_1, prenom_1 and voix_1, respectively, to match the pattern of the other such variables. Ensure you are renaming the variables and not creating new ones.

 elections_2017 = elections_2017 %>% 
  rename(
    sexe_1 = sexe,
    nom_1 = nom,
    prenom_1 = prenom,
    voix_1 = voix,
    )

2. Now that these variables have been renamed we are going to use the pivot_longer() function from the tidyverse package to reshape the data. You can find examples of how the pivot_longer() function is used by typing ?pivot_longer in your console.

For our purposes we will use the following 4 arguments in pivot_longer():

  • data = elections_2017_clean (notice that if you use the %>% pipe you don’t need to specify the data argument);
  • cols = matches("^sexe|^nom|^prenom|^voix"). These are the columns to be reshaped. We only want to reshape variables starting with sexe, nom, prenom or voix;
  • names_to = c(".value", "rank"). These are the names of the new columns to be created. The first element of the vector, ".value", says that the name of the reshaped variables should be their name before the "_", e.g. all the name_# values will be stored in name. The second element says that the # value should be stored in a variable called rank;
  • names_sep = "_" Tells the function what separates the variable names (name, voix, etc.) from the rank (1, 2, etc.).

If this is somewhat unclear to you, simply run the code and check how it differs from the previous dataset. This should clear things up.

Create a new object elections_2017_long for this task.

elections_2017_long = pivot_longer(
   data = elections_2017,
   cols = matches("^sexe|^nom|^prenom|^voix"),
   names_to = c(".value", "rank"),
   names_sep = "_"
)

3. As usual, display the names of the variables to ensure that everything has gone according to plan.

colnames(elections_2017_long)
##  [1] "code_departement"  "name_departement"  "code_municipality"
##  [4] "name_municipality" "registered"        "voted"            
##  [7] "blank_votes"       "pct_voted"         "rank"             
## [10] "sexe"              "nom"               "prenom"           
## [13] "voix"

4. How many rows do you have now?

nrow(elections_2017_long)
## [1] 392909

5. Look at the data. Why do you have so many more observations now?

Each ‘name-municipality’ now appears once for every candidate

6. Before we move on to some analysis, rename the new variables to give them an English name.

  • candidate gender: sexe -> candidate_gender
  • nomber of votes: voix -> votes
elections_2017_long = elections_2017_long %>% 
rename(
    candidate_gender = sexe,
    votes = voix,
    )

7. Create a candidate_name variable which combines the prenom (first name) and nom (last name) variables. You can use the paste() function for this, and have as arguments the two variables prenom and nom (in that order).
The paste function works as follows: paste("a","b", sep = "_") will return "a_b". Notice that the value given to the sep argument will be added between the two values being “pasted” together. Use " " as the value for the sep argument, this will add a space between the candidate’s first and last names.

elections_2017_long = elections_2017_long %>%
  mutate(
   candidate_name = paste(prenom, nom, sep = " ")
  )

8. Tabulate the new candidate_name variable to make sure the candidate names make sense.

unique(elections_2017_long$candidate_name)
##  [1] "Marine LE PEN"         "Emmanuel MACRON"       "François FILLON"      
##  [4] "Jean-Luc MÉLENCHON"    "Nicolas DUPONT-AIGNAN" "Benoît HAMON"         
##  [7] "François ASSELINEAU"   "Nathalie ARTHAUD"      "Philippe POUTOU"      
## [10] "Jacques CHEMINADE"     "Jean LASSALLE"

9. Let’s create a variable containing the percentage of votes obtained by each candidate. (Recall that in France the votes obtained by candidates are calculated as the percentage of votes out of the total number of votes minus blank votes.)

  • pct_votes = votes/(voted - blank_votes) * 100
elections_2017_long = elections_2017_long %>%
  mutate(pct_votes=(votes/(voted - blank_votes)* 100))

The last thing we will do in this cleaning section is to filter our data to only keep observations for mainland France.

10. Tabulate the departement codes in elections_2017_long.

unique(elections_2017_long$code_departement)
##   [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
##  [16] "16" "17" "18" "19" "2A" "2B" "21" "22" "23" "24" "25" "26" "27" "28" "29"
##  [31] "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
##  [46] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
##  [61] "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74"
##  [76] "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
##  [91] "90" "91" "92" "93" "94" "95" "ZA" "ZB" "ZC" "ZD" "ZM" "ZN" "ZP" "ZS" "ZW"
## [106] "ZX" "ZZ"

You’ll notice that several departement codes start with “Z”. These correspond to departements outside of mainland France, known as overseas France, as well as a departement for the French living outside of France (“Français établis hors de France”). You can tabulate the departement names to see this for yourself.

For our purposes we are only interested in analysing mainland French elections_2017_long_metrop = elections_2017_long %>% filter(!str_detect(code_departement, “Z”, “2A”, “2B”)departements, so we will get rid of departements’ whose code starts with “Z” as well as Corsica (“2A” and “2B”).

11. To do so you need to use the str_detect() function from the stringr package in your filter() function. Since you want to exclude departement codes containing a “Z” you will need to use !str_detect(), the ! implying that you want to exclude whatever comes after it. Remember to also exclude Corsica departements “2A” and “2B”.

Create a new object elections_2017_long_metrop for this task.

elections_2017_long_metrop = elections_2017_long %>%
filter(!str_detect(code_departement, "Z|2A|2B"))

If you’re filtering was done properly you should be left with 94 departements.

Ok, great the data is now (finally!) ready to be used for analysis 🙌 Yes, everything we did before was just to get the data ready for analysis and get a sense of what it contained. And this is data coming from an official French Ministry!

Maps [4.5 points]

Election data are probably better visualized using maps.

In this section we will produce two maps displaying:

  1. Macron, Le Pen, Fillon, and Mélenchon’s vote share in each departement,
  2. The leading candidate in each departement.

Maps are tricky and very complicated objects. But to boil it down to the essential ingredients, you need a “shapefile” which is basically a file that contains the map onto which data will be plotted.

We will download the shapefile for French departements from here:

https://www.data.gouv.fr/en/datasets/r/3096e551-c68d-40ce-8972-a228c94c0ad1.

Unzip the file once it’s downloaded.

To produce the maps you will need to install and load the sf and tmap packages.

library("sf", "tmap")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tmaptools)

1. Let’s first load the file departements-20140306-100m.shp (in the departements-20140306-100m-shp folder you’ve just downloaded) using the read_sf() function as such:

dep_shfl <- read_sf("/home/joe/R/Midterm/departements-20140306-100m-shp/departements-20140306-100m.shp.shp")
# where `path` corresponds to where the `departements-20140306-100m-shp` folder is located.
dep_shfl = read_sf("/home/joe/R/Midterm/departements-20140306-100m-shp/departements-20140306-100m.shp")

2. Now that you’ve loaded the shapefile, you want to filter out the following code_insee (departement codes) values: “2A”,“2B”,“971”,“972”,“973”,“974”,“976”.

dep_shfl = dep_shfl %>%
filter(!str_detect(code_insee, "2A|2B|971|972|973|974|976"))

3. Tabulate the department codes from the dep_shfl dataset. Then tabulate the department codes from the elections_2017_long_metrop dataset.

unique(dep_shfl$code_insee)
##  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [31] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46"
## [46] "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60" "61"
## [61] "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74" "75" "76"
## [76] "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89" "90" "91"
## [91] "92" "93" "94" "95"
unique(elections_2017_long_metrop$code_departement)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31"
## [31] "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46"
## [46] "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60" "61"
## [61] "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74" "75" "76"
## [76] "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89" "90" "91"
## [91] "92" "93" "94" "95"

Notice that in dep_shfl the department codes between 1 and 9 have a zero in front of them. So, let’s add a “0” in front of department codes between 1 and 9 in elections_2017_long_metrop.

4. To do so we first need to convert the code_department variable to a character type (use the as.character() function). Then we need to change the code_department variable by using the conditional statement: ifelse(nchar(code_departement) == 1, paste0("0", code_departement), code_departement). What this conditional statement says is: if the number of characters (nchar) of the variable code_departement is equal to 1, then add a “0” in front of this variables (paste0() function); otherwise do nothing.

elections_2017_long_metrop = elections_2017_long_metrop %>%
  mutate(
    code_departement = as.character(code_departement),
    code_departement = ifelse(nchar(code_departement) == 1, paste0("0", code_departement), code_departement)
      )

5. Create a new variable containing the percentage of votes received by each candidate by departement. Recall that vote share of each candidate = sum of the votes they received / (sum of actual voters - sum of blank votes).
Once that’s done, create a variable ranking the vote share of each candidate by department. You can use the following code: rank = order(order(vote_share, decreasing = TRUE)). Lastly, create a variable called code_insee equal to code_departement. This variable will enable us to merge the data and the departement shapefile. You can pipe all these commands together.

Create a new object called elections_2017_long_metrop_dep for this ask.

elections_2017_long_metrop_dep = elections_2017_long_metrop %>% 
   group_by(code_departement, candidate_name, name_departement) %>% 
  summarise(vote_share=(sum(votes)/(sum(voted)-sum(blank_votes)))*100)%>%
   group_by(code_departement) %>%
     mutate(rank=order(order(vote_share, decreasing = TRUE))) %>%
   mutate(code_insee = code_departement)

6. Now let’s merge elections_2017_long_metrop_dep with dep_shfl (which contains the “map”) using the left_join() function from the tidyverse package.

This is how left_join() works: left_join(x,y) will return all rows from x (very important!) and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. Rows in y with no match in x will be discarded. You don’t need to specify which variable we are merging by, left_join() will automatically detect that code_insee is the variable that uniquely identifies each departement.

Don’t forget that you need to assign left_join() to an object otherwise it will just display the merged dataset in your console but you won’t be able to use it. Create a new object called elections_2017_map for this task.

elections_2017_map = left_join(dep_shfl, elections_2017_long_metrop_dep)
## Joining, by = "code_insee"

Congratulations, you have merged your first datasets! Easy right? 👍

Map 1: Candidates’ vote percentage

7. Create a heat map of Macron, Le Pen, Fillon and Mélenchon’s percentage of the vote by departement, using the following code (change eval = FALSE to eval = TRUE):

library("tmap")
tmap_mode("view") # we want an interactive map
## tmap mode set to interactive viewing
tm_shape(elections_2017_map %>%
             # only keep Macron, Le Pen, Fillon and Mélenchon
             filter(candidate_name %in% c("Emmanuel MACRON", "Marine LE PEN", "François FILLON", "Jean-Luc MÉLENCHON"))) +
    tm_borders(col="white", lwd = 0.3) + # white and thin (line width) borders
    tm_fill(
        col = "vote_share",   # variable to be mapped
        title = "% of votes",   # legend title
        id = "name_departement",   # information to display when mouse hovers over a departement
        popup.vars = c("Vote share (%):" = "vote_share")) + # variable to display in popup window
    tm_facets(by = "candidate_name") # create one map per selected candidate

Wasn’t that easy (well, sort of 😉)! And now you have 4 comparable maps of the vote shares received by Macron, Le Pen, Fillon and Mélenchon! Hover over the departments with your mouse. The name of the department will show up. If you click on a department you will get the vote share received by the candidate. Isn’t that cool!

Map 2: Leading candidate

8. Create a map displaying the candidate who received the highest percentage of votes in each departement. Try to create this map by yourself using the code above (and the comments in it). You don’t need to create any new variables. Don’t hesitate to use Google (or your favorite search engine) to help you out. Don’t forget to change the title of the legend.

Add the following argument in tm_fill() to have a nice color palette for your map:

palette = c("#FFD75D","#78BCDF","#DC2A1C","#83726B") # color palette
tmap_mode("view") # we want an interactive map
## tmap mode set to interactive viewing
tm_shape(elections_2017_map %>%
             filter(rank %in% c("1"))) +
    tm_borders(col="white", lwd = 0.3) + # white and thin (line width) borders
    tm_fill(
      palette = c("#FFD75D","#78BCDF","#DC2A1C","#83726B"), 
        col = "candidate_name",   # variable to be mapped
        title = "Winning candidate",   # legend title
        id = "candidate_name")   # information to display when mouse hovers over a departement

Regression analysis of Le Pen’s votes [7 points]

In this last section, we’ll attempt a simple regression analysis of Marine Le Pen’s vote to better understand its determinants. In case you are unfamiliar with French politics, Marine Le Pen was the candidate for France’s far-right party. We will investigate the relationship between her vote share at the municipal level and 3 other variables:

  • income inequality,
  • median income,
  • share of immigrants.

So first we need to import data containing income inequality indicators, median income and immigration share at the municipality level.

The income inequality/median income data for 2015 can be downloaded here:

https://insee.fr/fr/statistiques/fichier/3560121/filo-revenu-pauvrete-menage-2015.zip

Download this zip file and open it. It contains a single Excel spreadsheet.

The immigration data for 2015 can be downloaded here:

https://www.insee.fr/fr/statistiques/fichier/3561125/BTX_TD_IMG1A_2015.zip

Download this zip file and open it. It contains a single Excel spreadsheet.

In France, the definition of an immigrant is as follows: someone born a foreigner (i.e. not French) outside of France and currently living in France.

1. Use the readxl package to import these Excel spreadsheets in objects called inc_2015 and immig_2015 respectively. (Hint: you will need to use the skip argument as before.)

inc_2015 = read_excel("/home/joe/R/Midterm/filo-revenu-pauvrete-menage-2015/base-cc-filosofi-2015.xls", skip = 5)
immig_2015 = read_excel("/home/joe/R/Midterm/BTX_TD_IMG1A_2015.xls", skip = 10)

2. The immigration data is not formatted properly. It needs reshaping and some summarizing. Here’s the code that does everything needed (change eval = FALSE to eval = TRUE):

# add "_" just after "AGE" in column names that contain "AGE"
names(immig_2015)[names(immig_2015) %in% grep("^AGE", names(immig_2015), value = TRUE)] <- paste(
    substr(grep("^AGE", names(immig_2015), value = TRUE), 1, 3),
    substr(grep("^AGE", names(immig_2015), value = TRUE), 4, 18), sep = "_")

# print new column names
names(immig_2015)
##  [1] "CODGEO"              "LIBGEO"              "AGE_400_IMMI1_SEXE1"
##  [4] "AGE_400_IMMI1_SEXE2" "AGE_400_IMMI2_SEXE1" "AGE_400_IMMI2_SEXE2"
##  [7] "AGE_415_IMMI1_SEXE1" "AGE_415_IMMI1_SEXE2" "AGE_415_IMMI2_SEXE1"
## [10] "AGE_415_IMMI2_SEXE2" "AGE_425_IMMI1_SEXE1" "AGE_425_IMMI1_SEXE2"
## [13] "AGE_425_IMMI2_SEXE1" "AGE_425_IMMI2_SEXE2" "AGE_455_IMMI1_SEXE1"
## [16] "AGE_455_IMMI1_SEXE2" "AGE_455_IMMI2_SEXE1" "AGE_455_IMMI2_SEXE2"
immig_2015_clean <- immig_2015 %>%
    # reshape data to long format
    pivot_longer(
        cols = -c("CODGEO","LIBGEO"),
        names_to = c("age", "age_cat", "immigration_status", "gender"),
        names_sep = "_",
        values_to = "n") %>%
    # group by municipality and immigration status
    group_by(CODGEO, LIBGEO, immigration_status) %>%
    # compute total number of individuals
    summarise(
        total_people = sum(n)
    ) %>%
    # reshape to wide format
    pivot_wider(
        names_from = immigration_status,
        values_from = total_people
    ) %>%
    # compute percentage of immigrants
    summarise(
        pct_immigrant = (IMMI1 / (IMMI1 + IMMI2)) * 100
    )

names(immig_2015_clean)
## [1] "CODGEO"        "LIBGEO"        "pct_immigrant"

3. The inc_2015 dataset contains lots of variables that do not interest us directly. Only keep the following 3 variables.

  • CODGEO => municipality code
  • MED15 => median income (actually to be more specific, it’s the median of total household incomes (after taxes) divided by the number of “consumption units”, which essentially accounts for household size.)
  • RD15 => a measure of income inequality equal to the ratio of the income of the top decile to the income of the bottom decile, usually referred to as D9/D1. For example, if D9/D1 equals 3 that means that the top decile (top 10%) earns 3 times more than the bottom decile (bottom 10%).

Rename the following variables as follows:

  • MED15 -> median_income
  • RD15 -> income_inequality

Create a new object inc_2015_clean for this task.

inc_2015_clean = inc_2015 %>%
select(CODGEO, MED15, RD15) %>%
rename( 
median_income = MED15,
income_inequality = RD15)

4. Now we can merge immig_2015_clean and inc_2015_clean together. We will do so using the left_join() function from the tidyverse package, as we did for creating the maps. You don’t need to specify which variable we are merging by, left_join() will automatically detect that CODGEO is the variable that uniquely identifies each municipality.

Create a new object called covariates for this task.

covariates = left_join(inc_2015_clean, immig_2015_clean)
## Joining, by = "CODGEO"

We now need to merge the covariates data with the elections_2017_long_metrop data. However, to do that we need to ensure that we have a common “identifier” on which to merge, that is we need to have a variable in both datasets, with the same name, which uniquely identifies each municipality (in our case). We could potentially use the municipality name variables but merging on character variables is not good practice as “Paris” and “Paris” or “Paris-Seine” and “Paris Seine” wouldn’t merge because they are different even though they most likely refer to the same thing. So we will merge on a variable that combines departement and municipality codes, which is called CODGEO in covariates.

Notice that in covariates the municipality code variable (CODGEO) is different from the municipality code in elections_2017_long_metrop. This is because in our covariates data, the municipality code is a combination of the departement code (the first 2 digits) and the municipality code (the last 3 digits).

Let’s create a CODGEO variable in elections_2017_long_metrop so that it matches the CODGEO variable in covariates, after which we will be able to merge the two datasets together.

Earlier on we already added a “0” in front of departement codes between 1 and 9. We only need to add a “00” in front of municipality codes between 1 and 9 in elections_2017_long_metrop, and a “0” in front of municipality codes between 10 and 99.

5. To do so we first need to convert the code_municipality variable to a character type (in the same way as you’ve done before using the as.character() function). Then we need to change the code_municipality variable by using the conditional statement:

ifelse(nchar(code_municipality) == 1, paste0("00", code_municipality), ifelse(nchar(code_municipality) == 2, paste0("0", code_municipality), code_municipality))

What this conditional statement says is: if the number of characters of the variable code_municipality is equal to 1, then add a “00” in front of this variable; if the number of characters of the variable code_municipality is equal to 2, then add a “0” in front of this variable; otherwise do nothing.

elections_2017_long_metrop = elections_2017_long_metrop %>%
  mutate(
    code_municipality = as.character(code_municipality),
    code_municipality = ifelse(nchar(code_municipality) == 1, paste0("00", code_municipality), ifelse(nchar(code_municipality) == 2, paste0("0", code_municipality), code_municipality))
  )

6. Create the variable CODGEO which is a combination of code_departement and code_municipality, in that order. Use the paste0() function which doesn’t take a sep argument, it concatenates strings together without adding anything between the two strings. Example: paste0("a","b") = "ab".

elections_2017_long_metrop$CODGEO = paste0(elections_2017_long_metrop$code_departement, elections_2017_long_metrop$code_municipality)

7. Merge elections_2017_long_metrop and covariates together. Again, use the left_join() function from the tidyverse package.

Create a new object called elections_2017_long_metrop_covariates for this task.

elections_2017_long_metrop_covariates = left_join(covariates, elections_2017_long_metrop)
## Joining, by = "CODGEO"

8. Only keep observations for Marine Le Pen.

Create a new object elections_2017_long_metrop_covariates_lepen for this task.

elections_2017_long_metrop_covariates_lepen = elections_2017_long_metrop_covariates %>%
filter(nom == "LE PEN") 

9. Use the skim() function from the skimr package to obtain summary statistics for median_income, income_inequality, pct_immigrant and share_votes.

# typo in instructions, corrected below (share_votes should be pct_votes)
library(skimr)
elections_2017_long_metrop_covariates_lepen %>%
  skim(median_income,income_inequality,pct_immigrant,pct_votes)
Data summary
Name Piped data
Number of rows 34916
Number of columns 20
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
median_income 3397 0.90 20658.17 2910.69 10157.08 18823.96 20219.00 21959.53 46156.00 ▁▇▁▁▁
income_inequality 29794 0.15 3.00 0.54 1.98 2.66 2.90 3.20 9.92 ▇▁▁▁▁
pct_immigrant 0 1.00 4.13 4.27 0.00 1.54 2.90 5.24 73.05 ▇▁▁▁▁
pct_votes 6 1.00 26.38 9.10 0.00 19.90 25.95 32.47 83.72 ▂▇▂▁▁

Notice that the number of observations missing for each variable differs significantly. Also notice that there are 6 missing values for the percentage of votes received by Marine Le Pen. This is strange so let’s take a look at what’s happening.

10. Why are these values missing? Replace the missing values by 0 (as they should be) by using the following ifelse statement: ifelse(is.na(share_votes), 0, share_votes).

elections_2017_long_metrop_covariates_lepen = elections_2017_long_metrop_covariates_lepen %>%
  mutate(
   pct_votes = ifelse(is.na(pct_votes), 0, pct_votes)
  )

To facilitate plotting we will reshape our data in long format.

11. Re-use pivot_longer() to reshape the data appropriately. Here’s the code you should run (change eval = FALSE to eval = TRUE):

elections_2017_long_metrop_covariates_lepen_long <- elections_2017_long_metrop_covariates_lepen %>%
    pivot_longer(
        cols = c("pct_immigrant","median_income","income_inequality"),
        names_to = "covariates",
        values_to = "value"
    )

12. Produce 3 graphs displaying the relationship between the vote share for Marine Le Pen and our 3 covariate variables. Make sure that your graph has a meaningful y-axis label and that the y-axis is between 0 and 100. (Hint: use the facet_wrap layer in ggplot to make these 3 plots in the same graph. Look at the facet_wrap arguments to ensure that the scales of each plot change freely.)

ggplot(elections_2017_long_metrop_covariates_lepen_long,
 aes(x = value,
 y = pct_votes)) +
 geom_point(shape=1) +
   ylim(0, 100) +
 labs(x = "Covariates",
 y = "Vote share",
 title = "Vote share analysis") +
 theme_bw(base_size = 16) +
  facet_grid(rows = vars(covariates)) + 
  facet_wrap(vars(covariates), scales = "free")
## Warning: Removed 33191 rows containing missing values (geom_point).

13. Add a regression line on these plots using the geom_smooth layer to better see the relationships’ sign. Set the argument method to "lm" and se to FALSE.

ggplot(elections_2017_long_metrop_covariates_lepen_long,
 aes(x = value,
 y = pct_votes)) +
  geom_point(shape=1) +
   ylim(0, 100) +
 labs(x = "Covariates",
 y = "% Vote share",
 title = "Vote share analysis") +
 theme_bw(base_size = 16) +
  facet_grid(rows = vars(covariates)) + 
  facet_wrap(vars(covariates), scales = "free", nrow=3) + 
    geom_smooth(method=lm,se=FALSE)
## Warning: Removed 33191 rows containing non-finite values (stat_smooth).
## Warning: Removed 33191 rows containing missing values (geom_point).
## Warning: Removed 11 rows containing missing values (geom_smooth).

14. From the graphs, is the relationship between the different covariates and Marine Le Pen’s vote share positive or negative? Rather strong or rather weak?

They’re all weak

15. Run a simple linear regression of each of our covariate variables, separately, on Marine Le Pen’s vote share. For this question you should use elections_2017_long_metrop_covariates as the data source.
Use the export_summs() function from the jtools package to see the results from the three regressions side by side.

library(jtools)
modelincome = lm(pct_votes ~ median_income, data = elections_2017_long_metrop_covariates_lepen)
modelinequality = lm(pct_votes ~ income_inequality, data = elections_2017_long_metrop_covariates_lepen)
modelimmigrant = lm(pct_votes ~ pct_immigrant, data = elections_2017_long_metrop_covariates_lepen)
export_summs(modelincome, modelinequality, modelimmigrant)
Model 1 Model 2 Model 3
(Intercept) 37.18 *** 33.87 *** 28.13 ***
(0.35)    (0.63)    (0.07)   
median_income -0.00 ***                
(0.00)                   
income_inequality         -3.52 ***        
        (0.21)           
pct_immigrant                 -0.42 ***
                (0.01)   
N 31519        5122        34916       
R2 0.03     0.05     0.04    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Don’t pay attention to the stars next to the coefficients and to the number in parenthesis below the coefficients, we’ll get to them later on in the course.

16. What is the interpretation of the coefficients for median_income, income_inequality and pct_immigrant? Why is the coefficient for median income so small? Why does the number of observations vary for each regression? Which variable’s variance explains more of the variance in Marine Le Pen’s vote?

  • All three have a weak negative correlation with vote share (for median income the real number is -0.0005169, however in rounding this is displayed as -0.00 which would be equal to no relationship)
  • The median income coefficient is so small because of the much larger scaled (from 0 to more than 40,000, compared with just 0-10 for income inequality and 0 - 100 for % immigrants)
  • The number of observations varies for each regression because not every area collects information on income inequality.
  • The variable’s variance which exlains more of the variance in Le Pen’s vote is income inequality, which has an R2 of 0.05

17. Comment on the \(R^2\) of the three regressions.

-For all three regressions \(R^2\) is extremely small. This means that the data fits very poorly with the regression line. Stated another way, this means that the models given here explain very little of the observed variation

18. Standardize the regressors and re-run the regressions. You can do this either by demeaning the variables and dividing by their standard deviation or by adding scale = TRUE in your export_summs() function. How do you interpret the (slope) coefficients now? Why has the coefficient for median income changed so much? What can you say about the magnitude of the association of each of these variables with Marine Le Pen’s vote share?

export_summs(modelincome, modelinequality, modelimmigrant, scale=TRUE)
Model 1 Model 2 Model 3
(Intercept) 26.51 *** 23.33 *** 26.37 ***
(0.05)    (0.11)    (0.05)   
median_income -1.50 ***                
(0.05)                   
income_inequality         -1.90 ***        
        (0.11)           
pct_immigrant                 -1.81 ***
                (0.05)   
N 31519        5122        34916       
R2 0.03     0.05     0.04    
All continuous predictors are mean-centered and scaled by 1 standard deviation. *** p < 0.001; ** p < 0.01; * p < 0.05.

-All three coefficients remain small, negative numbers, meaning weak negative relation for all three. They are now all also much closer to one another. -The coefficient for median income has changed so much because the three values have now been adjusted for scale, so the fact that the spread of results for median_income is so much larger has been taken into account - The magnitude of the assoication of median income and % of immigrants is now stronger, whilst the correlation between vote share and income inequality is weaker. Nevertheless, the correlation remains weak for all three.

19. Can you interpret these estimates causally? Why not? How would you go about trying to assess whether income inequality at the municipality level has a causal impact on the vote percentage for Marine Le Pen?

No, because there is potentially other confounding factors that we are not controlling for. We might expect areas with lower incomes to be slightly more likely to vote for Le Pen, but it might be the case that rural areas have lower incomes on average, and there may be a causal link between living rurally and voting for Le Pen instead that this model does not account for. - To more reliably make a causal claim about income inequality we would need to carry out a multiple regression analysis controlling for other potentially confounding factors. This would allow us to isolate the effect of income inequality on vote share.

Well done!

And that’s it for this midterm! I hope you had fun, or at least some fun, doing it and that you improved your R skills. I also hope you got a sense of what “real” data looks like and how it can be tamed by using the great tools that R provides.