PSCI 9590A - Introduction to Quantitative Methods

Evelyne Brie

Fall 2022

Missing Data

In this laboratory, we will: (1) identify missing values across columns, (2) look for patterns of missing data and (3) perform row-wise deletion.

Relevant functions: missing_plot(), missing_pairs(), na.omit(), drop_na(), TukeyHSD().

1. Importing Data

We start by importing a dataset containing data from the World Values Survey (WVS), a survey conducted in successive waves in more than 100 countries since 1981. It includes questions about socio-economic status and about attitudes towards various issues.

# Set work directory
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Loading data
wvs <- read.csv("WVS_subset.csv")

# Printing out the column names
colnames(wvs)

##  [1] "X"               "version"         "doi"             "A_WAVE"         
##  [5] "A_YEAR"          "A_STUDY"         "B_COUNTRY"       "B_COUNTRY_ALPHA"
##  [9] "Q46"             "Q47"             "Q48"             "Q49"            
## [13] "Q50"             "Q51"             "Q52"             "Q53"            
## [17] "Q54"             "Q55"             "Q56"             "X003R"          
## [21] "X003R2"          "Q263"            "Q264"            "Q265"           
## [25] "Q266"            "Q267"            "Q268"            "Q269"           
## [29] "Q270"            "Q271"            "Q272"            "Q273"           
## [33] "Q274"            "Q275"            "Q275A"           "Q275R"          
## [37] "Q276"            "Q276A"           "Q276R"           "Q277"           
## [41] "Q277A"           "Q277R"           "Q278"            "Q278A"          
## [45] "Q278R"           "Q279"            "Q280"            "Q281"           
## [49] "Q282"            "Q283"            "Q284"            "Q285"           
## [53] "Q286"            "Q287"            "Q288"            "Q288R"          
## [57] "Q289"            "Q289CS9"         "Q290"

The codebook is available on OWL. You’ll notice that the first 8 columns hold a row identifier and information about the dataset version, wave, year and country of the study. Questions Q46 to Q56 relate to happiness and well-being. The other questions offer insights into the socio-economic status of the respondents. This is a subset of the original WVS dataset that I created for the purposes of this lab; the full data is freely available online if you’re interested.

# Printing out the first observation
head(wvs,1)

##      X            version                       doi A_WAVE A_YEAR A_STUDY
## 1 1659 4-0-0 (2022-05-23) doi.org/10.14281/18241.18      7   2018       2
##   B_COUNTRY B_COUNTRY_ALPHA Q46 Q47 Q48 Q49 Q50 Q51 Q52 Q53 Q54 Q55 Q56 X003R
## 1       156             CHN   2   2   5   5   5   4   4   4   4   4   1     3
##   X003R2 Q263 Q264 Q265 Q266 Q267 Q268 Q269 Q270 Q271 Q272 Q273 Q274 Q275 Q275A
## 1      2    1   NA   NA   NA   NA   NA   NA    3    1 2870    1    1   NA    NA
##   Q275R Q276 Q276A Q276R Q277 Q277A Q277R Q278 Q278A Q278R Q279 Q280 Q281 Q282
## 1    NA   NA    NA    NA   NA    NA    NA   NA    NA    NA   NA   NA   NA   NA
##   Q283 Q284 Q285 Q286 Q287 Q288 Q288R Q289   Q289CS9   Q290
## 1   NA   NA    2    1    3    4     2    0 100000020 156001

We already see that we’ll have to deal with a bunch of NAs in this dataset. For this first respondent, for instance, questions Q264 to Q269 are missing. These questions relate to citizenship status and parental immigration background. We also see that questions Q275 to Q284 are missing. These questions relate to educational and occupational background.

Let’s look at the first three respondents, who are all from China and have all answered the survey in 2018.

# Printing out the first three observations
head(wvs,3)

##      X            version                       doi A_WAVE A_YEAR A_STUDY
## 1 1659 4-0-0 (2022-05-23) doi.org/10.14281/18241.18      7   2018       2
## 2 1660 4-0-0 (2022-05-23) doi.org/10.14281/18241.18      7   2018       2
## 3 1661 4-0-0 (2022-05-23) doi.org/10.14281/18241.18      7   2018       2
##   B_COUNTRY B_COUNTRY_ALPHA Q46 Q47 Q48 Q49 Q50 Q51 Q52 Q53 Q54 Q55 Q56 X003R
## 1       156             CHN   2   2   5   5   5   4   4   4   4   4   1     3
## 2       156             CHN   1   1  10  10  10   4   4   4   4   4   1     4
## 3       156             CHN   1   1  10  10  10   4   4   4   4   4   1     6
##   X003R2 Q263 Q264 Q265 Q266 Q267 Q268 Q269 Q270 Q271 Q272 Q273 Q274 Q275 Q275A
## 1      2    1   NA   NA   NA   NA   NA   NA    3    1 2870    1    1   NA    NA
## 2      3    1   NA   NA   NA   NA   NA   NA    6    1 2870    1    2    1    NA
## 3      3    1   NA   NA   NA   NA   NA   NA    1    1 2870    5    2    3    NA
##   Q275R Q276 Q276A Q276R Q277 Q277A Q277R Q278 Q278A Q278R Q279 Q280 Q281 Q282
## 1    NA   NA    NA    NA   NA    NA    NA   NA    NA    NA   NA   NA   NA   NA
## 2     1    1    NA     1    0    NA     1    0    NA     1    1    3    8    9
## 3     2   NA    NA    NA    0    NA     1    0    NA     1    4   NA   NA   NA
##   Q283 Q284 Q285 Q286 Q287 Q288 Q288R Q289   Q289CS9   Q290
## 1   NA   NA    2    1    3    4     2    0 100000020 156001
## 2    9   NA    2    3   NA    3     1    0 100000020 156001
## 3    0    1    2    1    3    6     2    0 100000020 156001

Here, it seems like questions Q264 to Q269 are also missing for the second and third respondents, but that questions Q275 to Q284 are (mostly) non-missing. This might mean that, for various reasons, surveyors didn’t ask questions Q264 to Q269 in that particular country. This is just an intuition: the test we’ll do below will allow us to verify it.
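Before turning to plots, a quick numeric check can tell us how many values are missing in each column. The sketch below is only an illustration: it uses a small made-up data frame called toy (not the WVS file), but the same two calls should work on wvs.

```r
# Toy data frame (hypothetical values, not taken from the WVS file)
toy <- data.frame(
  country = c("CHN", "CHN", "CAN", "CAN"),
  Q264    = c(NA, NA, 1, 2),
  Q287    = c(3, NA, 2, 4)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
print(na_share)
```

Replacing toy with wvs would give one count (or share) per WVS column, which is a handy complement to the plots used in the next section.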

2. Identify Missing Values

The next thing we want to do is to display the distribution of the missing values across columns. First, let’s load the necessary packages.

# Loading the necessary packages
library(tidyverse)
library(finalfit)

Why are we loading the tidyverse package? Because we’ll be using %>% in the following steps.

Quick Note on %>%

Ceci n’est pas une pipe… or is it? The %>% symbol is actually the pipe operator from the magrittr package, which is included in the tidyverse (a sort of meta-package).

Here is how it works:

# Printing the summary of variable Q46 using base R
summary(wvs$Q46)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   1.877   2.000   4.000     142

# Printing the summary of variable Q46 using the pipe operator
wvs$Q46 %>%
  summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   1.877   2.000   4.000     142

Here, the pipe operator “sends” wvs$Q46 into the summary() function.

We can also add other functions, as shown below:

# Printing the summary of variable Q46 using the pipe operator 
# while keeping only Chinese respondents
wvs %>%
  subset(B_COUNTRY_ALPHA=="CHN", select="Q46") %>% 
  summary()

##       Q46       
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.851  
##  3rd Qu.:2.000  
##  Max.   :4.000  
##  NA's   :1

All in all, it’s a really powerful operator that can make your code much easier to read and to write.
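To make the logic of the pipe concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The vector x is made up for illustration, and the example loads magrittr directly (the package that provides %>%) rather than the whole tidyverse.

```r
library(magrittr)  # provides %>%; also attached by library(tidyverse)

# Toy vector (hypothetical values)
x <- c(1, 2, 4, NA)

# Nested call: read from the inside out
nested <- round(mean(x, na.rm = TRUE), 1)

# Piped call: read from left to right
piped <- x %>%
  mean(na.rm = TRUE) %>%
  round(1)

print(nested)
print(piped)
```

Both versions compute the same thing; the piped one simply lists the steps in the order in which they happen.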

Let’s use the missing_plot() function to display the location of the NAs in our dataset. Here, we’ll display the NAs per question, for each observation. The light blue stripes represent missing values.

# Displaying the missing values per column, for each observation
wvs %>%
  missing_plot()

I have a suspicion that questions Q264 to Q269 might be missing for all Chinese respondents. Let’s test this by subsetting the dataset to only the respondents for whom the B_COUNTRY_ALPHA variable is equal to CHN.

# Loading the necessary packages
library(tidyverse)
library(finalfit)

# Displaying the missing values per column, for each observation
wvs %>%
  subset(B_COUNTRY_ALPHA=="CHN") %>% 
  missing_plot()

We can therefore conclude that these questions were not asked in the Chinese sample. We also notice that four other questions later in the dataset are missing for all Chinese respondents. This is really useful to know!
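If reading the plot by eye feels imprecise, the same check can be done numerically. The sketch below is an illustration on a made-up data frame (toy and its values are assumptions, not WVS data): it lists the columns in which every value is NA for one country.

```r
# Toy data frame (hypothetical values, not taken from the WVS file)
toy <- data.frame(
  B_COUNTRY_ALPHA = c("CHN", "CHN", "CAN", "CAN"),
  Q264 = c(NA, NA, 1, 2),
  Q287 = c(3, NA, 2, NA)
)

# Keep only the rows for one country
chn <- subset(toy, B_COUNTRY_ALPHA == "CHN")

# Names of the columns in which every value is NA for that country
all_missing <- names(chn)[colSums(!is.na(chn)) == 0]
print(all_missing)
```

Here Q264 is flagged because no Chinese row has a value for it, while Q287 is not, since one Chinese respondent answered it.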

3. Look for Patterns

Let’s now check whether some variables can help us find patterns of missing values for questions that respondents in every country could have answered, but that some preferred not to. We can take a look at a question that people might typically dislike answering, for instance this one:

Q287: People sometimes describe themselves as belonging to the working class, the middle class, or the upper or lower class. Would you describe yourself as belonging to the:

1. Upper class

2. Upper middle class

3. Lower middle class

4. Working class

5. Lower class

The Income Gap

Let’s see what the patterns of missing responses per country are.

# Loading the necessary packages
library(tidyverse)
library(finalfit)

# Displaying a diagnostics test of country as a determinant
# for missing values for Q287
explanatory <- "B_COUNTRY_ALPHA"
dependent <- "Q287"
wvs %>% 
  missing_pairs(dependent, explanatory)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## Warning: Removed 656 rows containing non-finite values (`stat_boxplot()`).
## Removed 656 rows containing non-finite values (`stat_boxplot()`).

Here, we see that respondents from the Netherlands have many more missing values than respondents from other countries. It’s important to know what missing values actually are; you’ll find this information in the metadata for each dataset. Here, the variable report specifies that missing WVS values might be either: don’t know, no answer, not asked or missing/not available. That is to say, we can’t tell from this graph why we observe more missing data in the Netherlands, we just know that we do.
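The pattern in the graph can also be summarised as a single number per country: the share of respondents with a missing answer. The sketch below shows one way to compute it in base R, using a made-up data frame (toy and its values are assumptions for illustration only).

```r
# Toy data frame (hypothetical values, not taken from the WVS file)
toy <- data.frame(
  B_COUNTRY_ALPHA = c("CAN", "CAN", "NLD", "NLD", "NLD", "USA"),
  Q287 = c(2, 3, NA, NA, 4, 1)
)

# Share of respondents with a missing Q287, by country:
# is.na() gives TRUE/FALSE, and the mean of a logical vector is a proportion
miss_by_country <- tapply(is.na(toy$Q287), toy$B_COUNTRY_ALPHA, mean)
print(round(miss_by_country, 2))
```

Like the graph, this tells us where values are missing, not why they are missing.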

4. Perform Row-Wise Deletion

There are lots of ways to deal with NAs in R. The most basic way (and the only one we’ll discuss in this lab) assumes that values are missing completely at random (MCAR). Note: if you are interested in what to do under the MNAR and the MAR assumptions, I recommend the following resource: https://argoshare.is.ed.ac.uk/healthyr_book/handling-missing-data-mar.html

Let’s say we want to remove ALL rows containing any NAs (for any variable) from our dataset. In other words, if any column is empty for a given respondent, we remove that respondent from the dataset. That’s very easy to do using the na.omit() function.

# Performing row-wise deletion
wvs_noNas <- na.omit(wvs)

# Looking at the dimensions of our new dataset
dim(wvs_noNas)

## [1] 1218   59

Well… that doesn’t look too good. It seems that we removed the overwhelming majority of our dataset (i.e. 13,867 respondents out of 15,085 in total). Perhaps missing values in some of these variables would have had no impact on our research project anyway. Let’s be a bit more conservative and only remove respondents who have a missing value for one variable, let’s say Q287.

# Performing row-wise deletion 
newWVS <- wvs %>%
          drop_na("Q287") # Only removing respondents who have an NA for Q287

# Looking at the dimensions of our new dataset
dim(newWVS)

## [1] 14429    59

That’s a bit more reasonable. But remember: what matters most is the theoretical justification for removing the NAs.
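The difference between the two deletion rules is easy to see on a tiny example. The sketch below uses a made-up data frame (toy is an assumption, not WVS data) and a base-R subsetting step that mimics what drop_na("Q287") does, so it runs without any package.

```r
# Toy data frame (hypothetical values, not taken from the WVS file)
toy <- data.frame(
  Q46  = c(1, 2, NA, 3),
  Q264 = c(NA, NA, NA, 1),
  Q287 = c(3, NA, 2, 4)
)

# Row-wise deletion across all columns (what na.omit() does)
all_cols <- na.omit(toy)

# Deletion based on a single column (base-R analogue of drop_na("Q287"))
one_col <- toy[!is.na(toy$Q287), ]

# Rows kept under each rule
print(nrow(toy))       # rows before any deletion
print(nrow(all_cols))  # rows with no NA in any column
print(nrow(one_col))   # rows with a non-missing Q287
```

A single column that is mostly empty (here Q264) is enough to make na.omit() discard almost everything, which is the same mechanism at work in the WVS example.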

Exercises

Practice Session

Using the WVS dataset and the associated codebook…

Create an indicator variable called Happy that takes a value of 1 when respondents report being either very happy or rather happy at Q46, and 0 otherwise (missing answers stay NA). Looking at the distribution of this variable, would you say that most people are happy, or not?

# Creating a new indicator variable
wvs$Happy <- NA
wvs$Happy[wvs$Q46==1 | wvs$Q46==2] <- 1
wvs$Happy[wvs$Q46==3 | wvs$Q46==4] <- 0

# Checking the distribution of Happy as a sanity check
table(wvs$Happy)

## 
##     0     1 
##  1710 13233
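The same recoding can be written in one statement with ifelse(). This is only an alternative sketch on a made-up vector (the values below are not WVS data), assuming the same coding as above: 1 and 2 count as happy, 3 and 4 as not happy.

```r
# Toy answers to Q46 (hypothetical values), including one missing answer
Q46 <- c(1, 2, 3, 4, NA)

# 1 if the answer is 1 or 2, 0 if it is 3 or 4, NA for anything else
Happy <- ifelse(Q46 %in% c(1, 2), 1,
         ifelse(Q46 %in% c(3, 4), 0, NA))

# useNA = "ifany" also shows how many respondents ended up as NA
print(table(Happy, useNA = "ifany"))
```

Using %in% rather than == means a missing Q46 falls through both tests and ends up as NA, instead of silently becoming a 0.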

Using the prop.table() and table() functions, create a table with the distribution of the Happy variable per country, in percentage per country (wvs$B_COUNTRY_ALPHA). Which country is the happiest? Which country is the unhappiest?

# Creating a prop.table with proportions computed by row (that's what the 1 argument does)
prop.table(table(wvs$B_COUNTRY_ALPHA,wvs$Happy),1)

##      
##                0          1
##   BRA 0.09914040 0.90085960
##   CAN 0.13613738 0.86386262
##   CHN 0.10774300 0.89225700
##   DEU 0.10848126 0.89151874
##   NLD 0.08546169 0.91453831
##   USA 0.12519320 0.87480680
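If the role of the second argument of prop.table() is unclear, the sketch below contrasts the two options on made-up vectors (country and happy are illustrative assumptions, not WVS data).

```r
# Toy vectors (hypothetical values)
country <- c("CAN", "CAN", "CAN", "USA", "USA")
happy   <- c(1, 1, 0, 1, 0)

tab <- table(country, happy)

# Margin 1: proportions computed within each row (each country sums to 1)
by_row <- prop.table(tab, 1)
print(by_row)

# Margin 2: proportions computed within each column (each answer sums to 1)
by_col <- prop.table(tab, 2)
print(by_col)
```

To compare happiness across countries we want margin 1, because each country then adds up to 100% regardless of its sample size.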

Create a continuous variable called Safe that takes the average value of a respondent’s response to questions Q51 to Q55.

# Creating the Safe continuous variable
# (an NA in any of Q51 to Q55 makes Safe NA for that respondent)
wvs$Safe <- (wvs$Q51+wvs$Q52+wvs$Q53+wvs$Q54+wvs$Q55)/5

# Checking the summary as a sanity check
summary(wvs$Safe)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.600   4.000   3.668   4.000   4.000     234
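Note that summing the items and dividing by their number returns NA as soon as one item is missing. The sketch below, on a made-up data frame with three items (toy is an assumption, not WVS data), shows that rowMeans() gives the same result by default, and that na.rm = TRUE instead averages whatever items are available. Whether that second option is appropriate is an analytical choice, not a technical one.

```r
# Toy items (hypothetical values); the third respondent skipped Q51
toy <- data.frame(
  Q51 = c(4, 3, NA),
  Q52 = c(4, 2, 4),
  Q53 = c(4, 1, 4)
)

# Sum and divide: NA if any item is missing
strict <- (toy$Q51 + toy$Q52 + toy$Q53) / 3

# rowMeans() behaves the same way by default
same <- rowMeans(toy)

# With na.rm = TRUE, the mean is taken over the non-missing items only
lenient <- rowMeans(toy, na.rm = TRUE)

print(strict)
print(lenient)
```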

Using the t.test() function, run a t-test that compares the mean value of Safe for respondents from Canada and from the United States.

# Running a t.test()
t.test(wvs$Safe[wvs$B_COUNTRY_ALPHA=="CAN"],wvs$Safe[wvs$B_COUNTRY_ALPHA=="USA"])

## 
##  Welch Two Sample t-test
## 
## data:  wvs$Safe[wvs$B_COUNTRY_ALPHA == "CAN"] and wvs$Safe[wvs$B_COUNTRY_ALPHA == "USA"]
## t = 5.1626, df = 5456.9, p-value = 2.522e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0462386 0.1028538
## sample estimates:
## mean of x mean of y 
##  3.617670  3.543124
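To see where the t value in this output comes from, the sketch below computes the Welch statistic by hand, as the difference in means divided by sqrt(var(x)/n_x + var(y)/n_y), and compares it with t.test(). The two vectors are made up for illustration and are not WVS data.

```r
# Toy samples (hypothetical values)
x <- c(3.6, 4.0, 3.8, 3.2, 4.0)
y <- c(3.4, 3.0, 3.8, 3.6)

# Standard error of the difference in means, without assuming equal variances
se <- sqrt(var(x) / length(x) + var(y) / length(y))

# Welch t statistic computed by hand
t_manual <- (mean(x) - mean(y)) / se

# Same statistic from t.test(), whose default (var.equal = FALSE) is the Welch test
t_builtin <- unname(t.test(x, y)$statistic)

print(t_manual)
print(t_builtin)
```

The two numbers should match; t.test() additionally computes the degrees of freedom, the p-value and the confidence interval.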