Cleaning and Recoding Data Prior to Analysis

Anytime you import survey data, or really any kind of data, into R, you should ask yourself what needs to be done to prepare it for analysis. With survey data specifically, there are a few things that should always be checked, and usually changed, prior to conducting any type of analysis.

We will cover several of them here, including the following:

  • Reordering a survey scale so higher values = more of something
  • Reordering incorrectly coded ordinal scales
  • Combining multiple questions into one response
  • Changing the name of a variable to indicate what it measures
  • Combining multiple options into fewer groups
  • Setting specific values to NA
  • Applying survey weights

Importing Data

We will be importing data from a Stata file downloaded from the American National Election Studies (ANES) website for the 2020 presidential election in the United States. This survey interviewed over 8,000 US residents, asking over 1,000 questions combined across the pre- and post-election waves. For faster processing, the ANES file used here for data preparation is a pared-down version with 3,000 randomly selected respondents from the full sample.

You should always start a new analysis with a clean data set, meaning one that has not been previously altered. This allows for easier replication by other researchers and ensures that all of the changes made to variables and cases in your data set can be followed.

Survey data differs from other types of data because most variables are a combination of numbers and value labels. Because survey questions ask respondents to answer with pre-written scale options, each with a corresponding number and value label, there is an additional complication beyond simply importing data. The ‘haven’ package handles this well by reading both the number and the value label into R. This gives us more information about our data that we can use to make informed decisions about cleaning prior to analysis. In practice, this means R will treat our survey variables as haven labelled, a specific format type separate from things like a factor or numeric.
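To see what this looks like in practice, here is a small illustrative sketch (not part of the ANES data) that builds a haven labelled vector by hand and shows common ways to work with it:

library(haven) #Provides labelled(), as_factor(), and zap_labels()

toy_approval <- labelled(c(1, 4, 2, -2), #Illustrative values only
  labels = c(`1. Approve strongly` = 1, `2. Approve not strongly` = 2,
             `3. Disapprove not strongly` = 3, `4. Disapprove strongly` = 4,
             `-2. DK/RF` = -2))

class(toy_approval)      #"haven_labelled" "vctrs_vctr" "double"
as_factor(toy_approval)  #Converts the numbers to a factor using the value labels
zap_labels(toy_approval) #Strips the labels, leaving a plain numeric vector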

You should always download the code book for any available survey data set so that you have access to how each variable was asked along with how the variables are coded.

# Load the data directly from the GitHub page
url <- "https://github.com/drCES/course_data/raw/main/anes_2020.dta"
anes_raw <- read_dta(url)

One useful R package is the janitor package, which cleans up messy variable names and makes all letters lowercase. This follows general coding best practices, as all-lowercase names are easier to type consistently and less error-prone than mixed case. Here, we use the clean_names function from the janitor package to clean up the variable names.

anes <- clean_names(anes_raw)
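If you are curious what clean_names actually changes, here is a small illustrative example on a toy data frame (not part of the ANES data): mixed case, spaces, and periods are all converted to lowercase snake_case.

messy <- data.frame(`Case ID` = 1, `Percent.Approve` = 51.9, check.names = FALSE) #Toy data frame with messy names
clean_names(messy) #Names become case_id and percent_approve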

Recoding Variables

Naming Conventions

It is important when you start creating new variables to keep a few best practices in mind. First, names should always be lowercase. It is okay to have multiple words in a variable name, but you should use ‘snake_case’, which means connecting words with an underscore ‘_’. Next, keep names short but as informative as possible. The name should reflect what the new variable is measuring. So if the new variable measures presidential approval, the name should convey that with something like ‘pres_app’, where pres = presidential and app = approval. This follows all of the best practices: lowercase, snake_case, and short but informative.

Keep this in mind as we work through this code.

Flipping Order of Scale

To simplify, we will focus only on a few variables in the 2020 ANES data. Let’s start by examining how this survey coded approval ratings of the president of the United States. When we examine the codebook, we find that the approval rating question is ‘V201129x’, and it is coded so that 1 = Strongly Approve and 4 = Strongly Disapprove. This is a classic example of when we would want to flip the scale so that higher values equal approval rather than disapproval. This makes it easier to discuss the variable and any regression results associated with it. It allows us to talk about the approval of something rather than its negation, disapproval.

First, let’s examine the attributes of the variable so that we can learn more about it before the transformations.

attributes(anes$v201129x) #This gives you the variable label, class, and value labels for a haven labelled survey object
## $label
## [1] "PRE: SUMMARY: Approve or disapprove President handling job"
## 
## $format.stata
## [1] "%12.0g"
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"        
## 
## $labels
## -2. DK/RF in V201127 or V201128             1. Approve strongly 
##                              -2                               1 
##         2. Approve not strongly      3. Disapprove not strongly 
##                               2                               3 
##          4. Disapprove strongly 
##                               4

Here, we confirm that the variable we are analyzing is the one we want per the codebook. The label tells us this is the presidential job approval question, the class tells us it is in ‘haven_labelled’ format, and the labels tell us the value label for each specific value. We also see that -2 values are non-substantive responses that should be removed from the analysis.

Next, we will get a frequency distribution for the presidential approval rating question to understand its distribution, using two approaches: first, tidyverse language; second, the freq command from the descr package.

anes %>%     #Data we are using             
  count(v201129x) %>%                            # Variable we want the distribution for 
  mutate(percent = scales::percent(n / sum(n))) # calculate percent 
## # A tibble: 5 × 3
##   v201129x                                 n percent
##   <dbl+lbl>                            <int> <chr>  
## 1 -2 [-2. DK/RF in V201127 or V201128]    21 0.7%   
## 2  1 [1. Approve strongly]               968 32.3%  
## 3  2 [2. Approve not strongly]           266 8.9%   
## 4  3 [3. Disapprove not strongly]        189 6.3%   
## 5  4 [4. Disapprove strongly]           1556 51.9%
freq(anes$v201129x, plot=F) #More efficient code than tidyverse for same information 
## PRE: SUMMARY: Approve or disapprove President handling job 
##       Frequency Percent
## -2           21   0.700
## 1           968  32.267
## 2           266   8.867
## 3           189   6.300
## 4          1556  51.867
## Total      3000 100.000

From this output, we see an important fact: missing data is currently coded as -2, and if we analyzed this data without cleaning it, that -2 would be included in all calculations. That would bias our results and cause us to draw incorrect conclusions. This is why you should always check the frequency distribution of your data prior to analysis.

We also know from above that 1 = strongly approve while 4 = strongly disapprove. Next, we will flip the scale order while removing the missing data and saving a new variable called ‘pres_app’ for presidential approval.

Note that we will use case_when here for our recodes rather than the recode function. This is because case_when is more flexible at handling labelled variables like our survey data, though it requires slightly more code. We also need case_when for more complex transformations such as combining two variables into one.

anes <- anes %>% #This creates new variable called 'pres_app' where 4 = strongly approve and 1 = strongly disapprove while setting -2 to NA 
  mutate(pres_app = case_when(
    v201129x ==1 ~ 4,
    v201129x ==2 ~ 3,
    v201129x ==3 ~ 2,
    v201129x ==4 ~ 1, 
  v201129x ==-2 ~ NA_real_), #This code makes all values of -2 = NA for analysis purposes. 
  pres_app = labelled(pres_app, c(`Str Dis` = 1, `Some Dis` = 2, `Some App` = 3, `Strong App` = 4)))

anes %>%     #Data we are using             
  count(pres_app) %>%                            # Variable we want the distribution for 
  mutate(percent = scales::percent(n / sum(n))) # calculate percent 
## # A tibble: 5 × 3
##   pres_app            n percent
##   <dbl+lbl>       <int> <chr>  
## 1  1 [Str Dis]     1556 51.9%  
## 2  2 [Some Dis]     189 6.3%   
## 3  3 [Some App]     266 8.9%   
## 4  4 [Strong App]   968 32.3%  
## 5 NA                 21 0.7%
CrossTable(anes$v201129x, anes$pres_app, expected = FALSE, chisq=FALSE,  prop.c=TRUE, prop.r=FALSE, prop.t=FALSE, prop.chisq = FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Col Total | 
## |-------------------------|
## 
## ======================================================
##                  anes$pres_app
## anes$v201129x        1       2       3       4   Total
## ------------------------------------------------------
## -2                   0       0       0       0       0
##                      0       0       0       0        
## ------------------------------------------------------
## 1                    0       0       0     968     968
##                      0       0       0       1        
## ------------------------------------------------------
## 2                    0       0     266       0     266
##                      0       0       1       0        
## ------------------------------------------------------
## 3                    0     189       0       0     189
##                      0       1       0       0        
## ------------------------------------------------------
## 4                 1556       0       0       0    1556
##                      1       0       0       0        
## ------------------------------------------------------
## Total             1556     189     266     968    2979
##                  0.522   0.063   0.089   0.325        
## ======================================================

This example showed us how to flip the order of a question scale so that higher values indicate more of whatever the question measures, in this case presidential approval. We named the new variable pres_app to keep the label short but informative of what the question measures, then added value labels to ensure we know what each value represents.

It is important to examine the distribution of your new variable to ensure nothing went wrong in the recoding process. We did two checks: first, we checked the frequency distribution of the new variable to ensure we see what we expect to see; next, we ran a crosstab between our new variable and the original. We expect to see a mirror image in the crosstab, which we do. This means our recode was successful and 4 = strongly approve while 1 = strongly disapprove.

We then used the ‘haven’ package’s labelled function to give the new ‘pres_app’ variable value labels so that we remember what the variable is measuring.

Next, we will look at one additional way to flip your scale. This requires the least amount of code but also offers the greatest possibility of error. Here we will use the if_else command along with mutate to create a new presidential approval variable, pres_app3, which is the inverse of the original presidential approval variable V201129x. We multiply the original value by -1 and then add one more than the total number of scale points. Here, we had 4 scale points (strongly approve, approve, disapprove, strongly disapprove), so we add 4 + 1, or 5. This mathematically flips the scale so the original values map as 1=4, 2=3, 3=2, and 4=1. It will always work provided you add the appropriate number of points.

#Additional way to flip your scale 
anes <- anes %>% 
  mutate(pres_app3 = if_else(v201129x >= 1, (v201129x*-1)+5, NA)) #When the original variable is >= 1, multiply it by -1 and add 1 more than the total number of scale points; since the original scale had 4 points, we add 5. Anything below 1 (the -2 codes) becomes NA since those reflect non-substantive responses. 

#1 becomes 4 b/c (1*-1)=-1+5=4
#2 becomes 3 b/c (2*-1)=-2+5=3
#3 becomes 2 b/c (3*-1)=-3+5=2
#4 becomes 1 b/c (4*-1)=-4+5=1

anes %>%     #Data we are using             
  group_by(v201129x)  %>% #Original variable
  count(pres_app3)         # New variable 
## # A tibble: 5 × 3
## # Groups:   v201129x [5]
##   v201129x                             pres_app3     n
##   <dbl+lbl>                                <dbl> <int>
## 1 -2 [-2. DK/RF in V201127 or V201128]        NA    21
## 2  1 [1. Approve strongly]                     4   968
## 3  2 [2. Approve not strongly]                 3   266
## 4  3 [3. Disapprove not strongly]              2   189
## 5  4 [4. Disapprove strongly]                  1  1556
#Can also add value labels after the fact by converting to a factor. Note that this makes it a factor variable, not a haven labelled variable, which will change how we have to write our code later on. Because of that, I prefer the previous approach to adding labels. 
value_labels <- c("Strongly Disapprove", "Disapprove", "Approve", "Strongly Approve")

anes$pres_app3 <- factor(anes$pres_app3, levels = 1:4, labels = value_labels)

freq(anes$pres_app3, plot=F)
## anes$pres_app3 
##                     Frequency Percent Valid Percent
## Strongly Disapprove      1556  51.867        52.232
## Disapprove                189   6.300         6.344
## Approve                   266   8.867         8.929
## Strongly Approve          968  32.267        32.494
## NA's                       21   0.700              
## Total                    3000 100.000       100.000
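As the comment above notes, converting with factor() changes the variable’s class, which matters for later code. A quick, hedged check of both approaches:

class(anes$pres_app3) #Now a factor because of the factor() call above
class(anes$pres_app)  #Still haven_labelled because we used labelled() to add its value labels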

Combining Two Variables into One

Now, let’s combine two variables into one. Oftentimes, surveys will ask branching questions that need to be combined for analysis purposes. In fact, the presidential approval measure we just analyzed is the combination of two questions. Most survey firms will not pre-combine these two questions into one so it is important to learn how to do so.

There are two variables to combine:

  • V201127 (1= approve & 2=disapprove) #Approve/Disapprove of performance

  • V201128 (1=strongly & 2 = not strongly) #How strongly approve/disapprove of performance

Remember, higher values should equal more approval so we want 4 = strongly approve and 1 = strongly disapprove.

#First run a crosstab between your two existing variables to get the distribution across cells. 

anes %>%     #Data we are using             
  group_by(v201128 )  %>% #X Variable in Crosstab 
  count(v201127)         # Y Variable in Crosstab 
## # A tibble: 6 × 3
## # Groups:   v201128 [3]
##   v201128               v201127                 n
##   <dbl+lbl>             <dbl+lbl>           <int>
## 1 -1 [-1. Inapplicable] -9 [-9. Refused]       16
## 2 -1 [-1. Inapplicable] -8 [-8. Don't know]     5
## 3  1 [1. Strongly]       1 [1. Approve]       968
## 4  1 [1. Strongly]       2 [2. Disapprove]   1556
## 5  2 [2. Not strongly]   1 [1. Approve]       266
## 6  2 [2. Not strongly]   2 [2. Disapprove]    189
#Next, create your new variable based on the 4 possible combinations using a series of & statements. Be careful.

anes <- anes %>% #This creates new variable called 'pres_app2' which is a numeric version of the presidential approval combined questions  
  mutate(pres_app2 = case_when(
    v201127==1 & v201128 ==1 ~ 4,
    v201127==1 & v201128 ==2 ~ 3,
    v201127==2 & v201128 ==2 ~ 2,
    v201127==2 & v201128 ==1 ~ 1)) #pres_app2 stays numeric; a labelled/factor version is created next

#Now, if we want we can make a new factor variable that saves the labels. 

anes <- anes %>% #This creates new variable called 'pres_app2_f' which is a factorized version of the above pres_app2 measure.  
  mutate(pres_app2_f = case_when(
    v201127==1 & v201128 ==1 ~ 'Strong Approve',
    v201127==1 & v201128 ==2 ~ 'Approve',
    v201127==2 & v201128 ==2 ~ 'Disapprove',
    v201127==2 & v201128 ==1 ~ 'Strong Disapprove'))

#Check our work against the original variables.

anes %>%     #Data we are using             
  group_by(v201128, v201127)  %>% #X Variable in Crosstab 
  count(pres_app2)         # Y Variable in Crosstab  
## # A tibble: 6 × 4
## # Groups:   v201128, v201127 [6]
##   v201128               v201127             pres_app2     n
##   <dbl+lbl>             <dbl+lbl>               <dbl> <int>
## 1 -1 [-1. Inapplicable] -9 [-9. Refused]           NA    16
## 2 -1 [-1. Inapplicable] -8 [-8. Don't know]        NA     5
## 3  1 [1. Strongly]       1 [1. Approve]             4   968
## 4  1 [1. Strongly]       2 [2. Disapprove]          1  1556
## 5  2 [2. Not strongly]   1 [1. Approve]             3   266
## 6  2 [2. Not strongly]   2 [2. Disapprove]          2   189
#Check our work against our previously created variable.

anes %>%     #Data we are using             
  group_by(pres_app)  %>% #X Variable in Crosstab 
  count(pres_app2)         # Y Variable in Crosstab  
## # A tibble: 5 × 3
## # Groups:   pres_app [5]
##   pres_app        pres_app2     n
##   <dbl+lbl>           <dbl> <int>
## 1  1 [Str Dis]            1  1556
## 2  2 [Some Dis]           2   189
## 3  3 [Some App]           3   266
## 4  4 [Strong App]         4   968
## 5 NA                     NA    21

Once we run the above code, we see that we successfully created our new variable combining two separate variables. This is a flexible approach that can be applied to any two or more variables, provided you are careful in coding the correct values. For more complex combinations, it can be useful to map them out on a whiteboard or piece of paper first.

Collapsing Variables into Smaller Groups

Oftentimes, we want to collapse the number of groups in a variable into fewer groups or even into a dummy, otherwise known as dichotomous, variable. Using education (V201511x) as the example, let’s look at several ways to collapse groups into fewer options. Remember, since we are doing data transformations, we want to keep the original variable untransformed and save a new one.

Start by examining the frequency distribution for the variable you want to transform. For education, we see that ‘less than HS degree’ is selected only 143 times out of the 3,000 cases in our subsample. Because it is selected so infrequently, it should be combined with the next option, HS degree. We also see that the ‘some college but no degree’ option is the modal response for the scale. This suggests that, depending on our theory and planned analysis, we should create a few different educational attainment variables.

First, we will keep the original categories the same but combine the ‘less than HS degree’ with ‘HS degree’ options.

Then, we will create a dummy variable for college degree or not (0 or 1):

  • College Degree = People with Bachelor’s degree or graduate degree
  • No College Degree = People with some college but no degree, HS degree, or no HS degree

Lastly, we will create a 3-point scale splitting the no college degree group into some college or no college at all:

  • College Degree = People with Bachelor’s degree or graduate degree
  • Some College = People with some college but no degree
  • High school degree or less = HS degree or no HS degree

#First look at the education distribution
anes %>%     #Data we are using             
  count(v201511x)         # Variable to analyze 
## # A tibble: 7 × 2
##   v201511x                                                              n
##   <dbl+lbl>                                                         <int>
## 1 -9 [-9. Refused]                                                     11
## 2 -2 [-2. Missing, other specify not coded for preliminary release]    33
## 3  1 [1. Less than high school credential]                            143
## 4  2 [2. High school credential]                                      499
## 5  3 [3. Some post-high school, no bachelor's degree]                 986
## 6  4 [4. Bachelor's degree]                                           743
## 7  5 [5. Graduate degree]                                             585
freq(anes$v201511x, plot=F)
## PRE: SUMMARY: Respondent 5 Category level of education 
##       Frequency  Percent
## -9           11   0.3667
## -2           33   1.1000
## 1           143   4.7667
## 2           499  16.6333
## 3           986  32.8667
## 4           743  24.7667
## 5           585  19.5000
## Total      3000 100.0000
#Collapsing Less than HS Degree (1) with HS Degree (2)
anes <- anes %>%
  mutate(college = ifelse(v201511x %in% c(1, 2), 1, v201511x-1), #Recodes values of 1 or 2 from the original education variable to 1 and sets all other values to their original value minus 1 to keep the order intact as 1, 2, 3, 4
         college = ifelse(college < 0, NA_real_, college), #The negative missing-data codes become NA
         college = labelled(college, c(`HS Degree or less` = 1, `Some College` = 2, `BA/BS` = 3, `Advanced Degree` = 4))) #Adds value labels to the new college variable

##Second way to recode our college variable; this time to create a dummy variable. 
anes <- anes %>%
  mutate(college2 = case_when(
    v201511x == 1 ~ 0,
    v201511x == 2 ~ 0,
    v201511x == 3 ~ 0,
    v201511x == 4 ~ 1,
    v201511x == 5 ~ 1),
  college2 = labelled(college2, c(`No College Degree` = 0, `College Degree` = 1)))

attributes(anes$college2)
## $labels
## No College Degree    College Degree 
##                 0                 1 
## 
## $class
## [1] "haven_labelled" "vctrs_vctr"     "double"
freq(anes$college2, plot=F)
## anes$college2 
##       Frequency Percent Valid Percent
## 0          1628  54.267         55.07
## 1          1328  44.267         44.93
## NA's         44   1.467              
## Total      3000 100.000        100.00
class(anes$college2)
## [1] "haven_labelled" "vctrs_vctr"     "double"
#Checks distribution of new variable 
anes %>%     #Data we are using             
  count(college)         #  New Variable 
## # A tibble: 5 × 2
##   college     n
##     <dbl> <int>
## 1       1   642
## 2       2   986
## 3       3   743
## 4       4   585
## 5      NA    44
#Crosstab between original and new variable to ensure recode success
anes %>%     #Data we are using    
  group_by(v201511x) %>% #Original Variable 
  count(college)         # New Variable 
## # A tibble: 7 × 3
## # Groups:   v201511x [7]
##   v201511x                                                         college     n
##   <dbl+lbl>                                                          <dbl> <int>
## 1 -9 [-9. Refused]                                                      NA    11
## 2 -2 [-2. Missing, other specify not coded for preliminary releas…      NA    33
## 3  1 [1. Less than high school credential]                               1   143
## 4  2 [2. High school credential]                                         1   499
## 5  3 [3. Some post-high school, no bachelor's degree]                    2   986
## 6  4 [4. Bachelor's degree]                                              3   743
## 7  5 [5. Graduate degree]                                                4   585
###Different approach to creating a dichotomous variable. Here, you have to explicitly make missing data NA 
anes <- anes %>% 
  mutate(college2 = if_else(v201511x>3, 1, 0))

anes <- anes %>% #Explicitly making the negative missing-data codes (-9/-2) NA 
  mutate(college2 = replace(college2, v201511x <= -1, NA)) #Sets those cases to NA in the new college2 variable

#Crosstab between original and new variable to ensure recode success
anes %>%     #Data we are using    
  group_by(v201511x) %>% #Original Variable 
  count(college2)         # New Variable 
## # A tibble: 7 × 3
## # Groups:   v201511x [7]
##   v201511x                                                        college2     n
##   <dbl+lbl>                                                          <dbl> <int>
## 1 -9 [-9. Refused]                                                      NA    11
## 2 -2 [-2. Missing, other specify not coded for preliminary relea…       NA    33
## 3  1 [1. Less than high school credential]                               0   143
## 4  2 [2. High school credential]                                         0   499
## 5  3 [3. Some post-high school, no bachelor's degree]                    0   986
## 6  4 [4. Bachelor's degree]                                              1   743
## 7  5 [5. Graduate degree]                                                1   585
anes <- anes %>%
  mutate(college3 = case_when(
    v201511x == 1 ~ 1,
    v201511x == 2 ~ 1,
    v201511x == 3 ~ 2,
    v201511x == 4 ~ 3,
    v201511x == 5 ~ 3),
  college3 = labelled(college3, c(`No College Degree` = 1, `Some College` = 2, `College Degree or More` = 3)))

freq(anes$college3, plot=F)
## anes$college3 
##       Frequency Percent Valid Percent
## 1           642  21.400         21.72
## 2           986  32.867         33.36
## 3          1328  44.267         44.93
## NA's         44   1.467              
## Total      3000 100.000        100.00
anes %>%     #Data we are using             
  group_by(v201511x)  %>% #X Variable in Crosstab 
  count(college3)         # Y Variable in Crosstab  
## # A tibble: 7 × 3
## # Groups:   v201511x [7]
##   v201511x                                                        college3     n
##   <dbl+lbl>                                                       <dbl+lb> <int>
## 1 -9 [-9. Refused]                                                NA          11
## 2 -2 [-2. Missing, other specify not coded for preliminary relea… NA          33
## 3  1 [1. Less than high school credential]                         1 [No …   143
## 4  2 [2. High school credential]                                   1 [No …   499
## 5  3 [3. Some post-high school, no bachelor's degree]              2 [Som…   986
## 6  4 [4. Bachelor's degree]                                        3 [Col…   743
## 7  5 [5. Graduate degree]                                          3 [Col…   585

Working with Missing Data/Non-Substantive Responses

Nearly all survey data will include missing data that should be investigated. Oftentimes, missing data is coded as a real value in your survey, such as ‘-2’ or ‘99’. If this is the case, you must ensure that R knows which values should not be included in your analysis; otherwise, you will introduce bias into your calculations.
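Before recoding anything, it can help to scan for those sentinel codes. As a hedged sketch (assuming, as the codebook indicates, that valid feeling thermometer answers run from 0 to 100), the following counts how many out-of-range values appear in the Biden and Trump thermometer variables used below:

anes %>%
  summarise(across(c(v201151, v201152), ~ sum(.x < 0 | .x > 100, na.rm = TRUE))) #Counts values outside the valid 0-100 range for each variable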

If we only want to change a specific value or values to NA without recoding the entire variable, we can use the mutate function from the tidyverse along with the na_if command. We still want to save a new variable since we are transforming the original variable in some way, but we will not need to do anything else to it. Here we recode ‘-9’ to system missing for a feeling thermometer measure rating how much Americans liked Donald Trump in 2020.

If there is a more complex NA coding scheme, you will likely need a slightly more complex approach. The following code handles this easily: the second and third recodes change any values <= -1 or >= 101 to system missing, as the codebook identifies all of these values as non-valid answers. This is another reminder to always work from an up-to-date codebook so that you can catch any potential issues. The code below shows two slightly different ways to do this, but both lead to the same outcome.

anes <- anes %>%
  mutate(trump_feel = na_if(v201152, -9)) #Create new variable called 'trump_feel' which equals the original variable 'V201152' but with all -9 values replaced by NA

anes <- anes %>%
  mutate(biden_feel = replace(v201151, (v201151 <= -1 | v201151 >= 101), NA)) #Create new variable called 'biden_feel' which equals the original variable 'V201151' but with all values <= -1 or >= 101 replaced by NA

#OR could use this approach

anes <- anes %>%
  mutate(biden_feel = if_else(v201151 <= -1 | v201151 >= 101, NA_real_, v201151))

Remember, we already looked at how to make a value system missing using ‘case_when’ when recoding the entire variable.

anes <- anes %>% #This creates new variable called 'pres_app' where 4 = strongly approve and 1 = strongly disapprove while setting -2 to NA 
  mutate(pres_app = case_when(
    v201129x ==1 ~ 4,
    v201129x ==2 ~ 3,
    v201129x ==3 ~ 2,
    v201129x ==4 ~ 1, 
  v201129x ==-2 ~ NA_real_)) #This code makes all values of -2 = NA for analysis purposes. 

Merging External Data into Survey Data

Depending on the sample, sometimes it is possible to link external data - i.e. data not collected in the survey but from some source external to the interview - to individual survey records. This can include things like a student’s grades if you work for a school system or voting records for political poll respondents. In these cases, researchers can utilize these external files to do a variety of interesting analyses. However, the first step is to merge the two files, which is not always easy to do.

Here, we are going to merge the original 2020 ANES survey with validated voting information collected by a third-party vendor. Validated vote data is essentially a public record of whether an individual voted, tied to their survey responses, all while keeping their identity confidential to researchers. By merging these files, researchers can then answer important questions about who actually turns out to vote versus who says they turn out to vote.

Let’s go through that process here.

Step 1: identify the ‘case_id’, the column that provides a unique identification number for each individual survey response. This will be the critical variable that needs to be matched between your main file (for us, the 2020 ANES) and the file to be merged (the validated vote file). We need to examine both files and ensure they have the same structure for their ‘case_id’ variable and, ideally, that the variable names match. If these two columns do not match exactly, you will get only partial matches or none at all. Always closely examine what is in your columns before trying to merge.

url2 <- "https://github.com/drCES/course_data/raw/main/anes_timeseries_2020_stata_VoterValidation.dta"
vvote <- read_dta(url2)
vvote <- clean_names(vvote)

head(anes$v200001)
## [1] 400525 345215 311212 227263 461599 211060
head(vvote$v200001)
## [1] 200015 200022 200039 200046 200053 200060
anes_vv <- merge(anes, vvote, by = "v200001", all = FALSE) #all = FALSE keeps only cases whose v200001 appears in both data frames (an inner join), so the combined data set contains no cases that are missing from either file  

We’ll start by looking at the codebook for the validated vote data, which tells us that ‘V200001’ is the name of the ‘case_id’ variable and that it should match the original 2020 ANES name exactly. Using the ‘head’ function, we can quickly verify that the two variables share the same structure: both are plain numeric case IDs in the same format (the specific values shown differ because our ‘anes’ file is a random subsample). This indicates that it is safe to move forward with merging the two files.

Note, there are 18,430 cases in the ‘vvote’ data frame while there are only 3,000 in the ‘anes’ subsample we are using. That means the vvote data has cases that do not exist in the ‘anes’ data, which is fine; we do not want to keep data that is not included in the ANES file. By including ‘all = FALSE’, we tell R to keep only cases that appear in both files. This is one reason why it is important to pay attention to which file you are treating as the main file, which will be listed first in the merge code.
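A quick way to sanity check the merge, sketched here under the assumption that case IDs are unique within each file, is to count how many ‘anes’ IDs appear in the ‘vvote’ file and confirm the merged file has exactly that many rows:

sum(anes$v200001 %in% vvote$v200001) #Number of anes respondents that have a validated vote record
nrow(anes_vv) #Should equal the count above if every match was kept exactly once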

#Distribution of responses
anes_vv %>%     #Data we are using             
  count(val1_turnout20)         # Y Variable in Crosstab 
##   val1_turnout20    n
## 1              2   93
## 2              3  583
## 3              4 1060
## 4              5  347
## 5              6  917
anes_vv %>%     #Data we are using             
  count(val2_turnout20)         # Y Variable in Crosstab 
##   val2_turnout20    n
## 1              2  155
## 2              3  364
## 3              7 2481
CrossTable(anes_vv$val1_turnout20, anes_vv$val2_turnout20, expected = FALSE, chisq=TRUE,  prop.c=TRUE, prop.r=FALSE, prop.t=FALSE, prop.chisq = FALSE)
## Warning in chisq.test(tab, correct = FALSE, ...): Chi-squared approximation may
## be incorrect
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Col Total | 
## |-------------------------|
## 
## =======================================================
##                           anes_vv$val2_turnout20
## anes_vv$val1_turnout20        2       3       7   Total
## -------------------------------------------------------
## 2                            29      18      46      93
##                           0.187   0.049   0.019        
## -------------------------------------------------------
## 3                            63     254     266     583
##                           0.406   0.698   0.107        
## -------------------------------------------------------
## 4                            27      37     996    1060
##                           0.174   0.102   0.401        
## -------------------------------------------------------
## 5                            10      13     324     347
##                           0.065   0.036   0.131        
## -------------------------------------------------------
## 6                            26      42     849     917
##                           0.168   0.115   0.342        
## -------------------------------------------------------
## Total                       155     364    2481    3000
##                           0.052   0.121   0.827        
## =======================================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 933.4123      d.f. = 8      p <2e-16

Now that the files are merged, we can quickly see that in the original ‘anes’ file we had 1775 variables while in the new ‘anes_vv’ merged file we have 1797 variables, or 22 additional. A quick visual inspection of the end of the ‘anes_vv’ file reveals that the variables that were in the ‘vvote’ file are now appended to the end of our original ‘anes’ file. This means our merge was successful, and we can now start to analyze the newly created file.

Applying Survey Weights to Analysis

Survey weights make the survey sample’s demographic profile look more like its population while also sometimes accounting for things such as unequal likelihood of being selected to participate. Many publicly accessible large-n surveys will come with associated survey weights that should be applied when conducting analysis. For the most part, provided weights should always be used when analyzing survey data.
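To build intuition for what a weight is doing, here is a purely illustrative calculation (not how the ANES weights were actually constructed, which involves much more): a basic post-stratification weight is a group’s population share divided by its sample share.

pop_share    <- c(group_a = 0.50, group_b = 0.50) #Hypothetical population shares
sample_share <- c(group_a = 0.60, group_b = 0.40) #Hypothetical (unbalanced) sample shares
pop_share / sample_share #Weights of about 0.83 for the over-represented group and 1.25 for the under-represented one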

Step 1 in using survey weights is to review the codebook and find what it has to say about them. While some survey data sets come with only one survey weight, many, like the ANES with its 14 unique weights, have multiple, and consulting the codebook is imperative in those cases.

By reviewing the 2020 ANES codebook, we see that we want to use ‘V200010b’ as our primary weight since we are analyzing data collected in the post-election survey wave. In this case, we also want to include a ‘strata’ weight per the instructions. Not all surveys will include both types of weights but when they do each should be utilized.

Using the post-election survey weight poses an additional problem: not every respondent who took the pre-election survey returned for the post-election wave. The svydesign code requires no missing data in the weighting variable, otherwise it will not run, so we first remove the cases that do not have values in the post-election weight variable.

We will use the ‘svydesign’ function from the ‘survey’ package to create new weighted data sets for analysis purposes. The new weighted file can then be used in analysis.

anes_post <- anes_vv[complete.cases(anes_vv[, c("v200010b")]), ]
anes_weighted <- svydesign(ids = ~1, weights =~v200010b, data = anes_post) #Creates new weighted data for analysis using the population weights only 
anes_weighted2 <- svydesign(ids = ~v200010c, weights =~v200010b, strata=~v200010d, nest=TRUE, data = anes_post) #For more complex survey designs that include additional design information such as strata and PSUs. The PSU variable goes to 'ids' while the strata variable goes to 'strata', as specified in the codebook   

#Creates new variable for use in analysis 
anes_post <- anes_post %>%
  mutate(trump_feel = replace(v201152, v201152 == -9, NA)) #Create new variable called 'trump_feel' which equals the original variable 'V201152' but replace all -9 values as NA

###Analyze the mean of the Trump feeling thermometer measure for the different files 
svymean(~trump_feel, anes_weighted, na.rm = TRUE) #Get weighted mean for simple weighting scheme
##              mean     SE
## trump_feel 42.099 1.0573
svymean(~trump_feel, anes_weighted2, na.rm = TRUE) #Get weighted mean for complex weighting scheme
##              mean     SE
## trump_feel 42.099 0.9729
mean(anes_post$trump_feel, na.rm=TRUE) #Get unweighted mean for sample 
## [1] 40.32393
###Also use for regression analysis 
# Running the weighted regression
regression_model <- svyglm(trump_feel ~ factor(college), design = anes_weighted2)

# Summarizing the regression model
summary(regression_model)
## 
## Call:
## svyglm(formula = trump_feel ~ factor(college), design = anes_weighted2)
## 
## Survey design:
## svydesign(ids = ~v200010c, weights = ~v200010b, strata = ~v200010d, 
##     nest = TRUE, data = anes_post)
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        47.958      2.087  22.982  < 2e-16 ***
## factor(college)2   -2.346      2.376  -0.988    0.328    
## factor(college)3  -12.488      2.497  -5.002 8.02e-06 ***
## factor(college)4  -16.809      2.869  -5.858 4.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1563.585)
## 
## Number of Fisher Scoring iterations: 2

By examining the means across the weighted and unweighted files, we see that the two weighted designs return identical means with slightly different standard errors, while the unweighted mean is about 1.8 points lower. This difference between the weighted and unweighted means is why we want to use the survey weights: the weighted data should provide more accurate point estimates.
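The same design objects work for categorical variables too. As a hedged example, svytable() from the ‘survey’ package produces a weighted frequency table that can be compared with the unweighted version, here using the college dummy created earlier:

svytable(~college2, design = anes_weighted2) %>% prop.table() #Weighted distribution of the college degree dummy
prop.table(table(anes_post$college2)) #Unweighted distribution for comparison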