MD. Rodrigo Gómez

Cleaning Data

How data are cleaned varies from analyst to analyst, but in general you can apply the following steps to your data sets.

Unlike with a cross-sectional data set, merging and matching are fundamental when working with time series data. For that reason, start the cleaning process with the merge and matching code.

  1. Merge and match the data sets from your time series data.
  2. Explore your variables and the number of observations.
  3. Select your study variables.
  4. Explore and remove the missing data.
  5. Recode your study variables.
  6. Create your study variables.
  7. Save your cleaned data.

For this exercise, you need to import the a_indresp and b_indresp data sets into RStudio. Don't forget to load the libraries you need and to set your working directory to your own project-specific folder.

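The most typical and useful libraries for analysis in RStudio include:

- Tidyverse: to clean, explore and analyze data sets.
- Naniar: provides data structures and functions that facilitate the plotting of missing values and the examination of imputations.
- Haven: imports foreign statistical formats into R.
- Ggplot2: creates elegant data visualisations using the grammar of graphics.
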
Exercise 1: Create a new script and save it with the name “clean_ab_indresp”. Inspect the data at waves 1 and 2.

# ---; Set a name for your project
  
# ---; Don't forget the date

# Load the libraries for your work.
---(haven)
---()
---()
---()

# Change the working directory to your own project-specific folder:
setwd("---")

# Import your data sets with haven, naming them "w1_indresp" and "w2_indresp":
--- <- read_dta("")
--- <- ---
  
# Inspect the data; use "glimpse" to explore the number of variables and observations
---(---)
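
The import and inspection pattern can be tried out without the survey files. The sketch below is only an illustration under assumptions: `toy_wave1`, `toy_path` and `toy_read` are hypothetical names, and the toy values are invented rather than taken from a_indresp. It writes a small data frame to a temporary Stata file, reads it back with haven, and inspects it with glimpse.

```r
library(haven)  # read_dta() and write_dta()
library(dplyr)  # glimpse()

# Hypothetical toy data standing in for a wave file (not real survey data)
toy_wave1 <- data.frame(
  pidp     = c(101, 102, 103),
  a_sex_dv = c(1, 2, 2)
)

# Write the toy data to a temporary .dta file, then import it with haven
toy_path <- tempfile(fileext = ".dta")
write_dta(toy_wave1, toy_path)
toy_read <- read_dta(toy_path)

# glimpse() prints the number of rows and columns plus a preview of each variable
glimpse(toy_read)
```

With the real files, the same two calls (read_dta, then glimpse) would be applied to each wave in turn.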

Step 1 - Matching Data from 2 waves

After exploring your variables and the number of subjects, the next step is to merge your time series data. First, though, it is easier to reduce the size of each data frame (data set).
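
As a minimal sketch of reducing a data frame before merging, the example below keeps only a few columns with select(). The data frame `toy_wave1` and the column `a_unused` are hypothetical and the values are invented; only the pattern is meant to carry over to the real wave files.

```r
library(dplyr)

# Hypothetical toy data; a_unused stands for any variable not needed in the study
toy_wave1 <- data.frame(
  a_hidp   = c(1, 1, 2),
  pidp     = c(101, 102, 103),
  a_sex_dv = c(1, 2, 2),
  a_unused = c(9, 9, 9)
)

# Overwrite the data frame with a version that holds only the study variables
toy_wave1 <- toy_wave1 %>%
  select(pidp, a_hidp, a_sex_dv)

names(toy_wave1)
```

Dropping unused columns first keeps the merged data frame small and makes later inspection easier.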

Exercise 2:

# Replace each data frame with one containing only the variables: w_hidp, pidp, w_istrtdaty, w_sex_dv, w_mastat_dv, w_julkjb, w_sclfsato, w_paygu_dv
--- <- --- %>% # for wave 1
  select(---)
--- <- --- %>% # for wave 2
  ---(---)

# Hint: use tidyverse functions

# The variables have value labels attached to each numeric value. Look at the labels applied to a variable using "attr()":

# Look at the labels for the variable "w_mastat_dv" in wave 1
--- (w1_indresp$---, "labels") # for one variable

# Look at the labels for all variables in wave 2, using "sapply"
sapply(---, attr, "labels") # for all variables

# Combine the data into a "wide" format using full_join(), but first create variables that flag all cases
w1_indresp$wave1 <- 1
---$--- <- 1

indresp_all <- full_join(---, ---, by = "pidp")

# Examine who was included in which wave by cross-tabulating the two flag variables
table()
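
The value labels mentioned in the exercise comments can be explored on a toy variable. This is a sketch under assumptions: `toy_wave` is a hypothetical data frame, and the labels "Single" and "Married" are invented for illustration and are not the actual coding of w_mastat_dv. The labelled vector is built with haven's labelled(), which is the class read_dta() produces for labelled Stata variables.

```r
library(haven)

# Hypothetical toy data with one labelled variable (labels are invented)
toy_wave <- data.frame(pidp = c(101, 102, 103))
toy_wave$b_mastat_dv <- labelled(
  c(1, 2, 1),
  labels = c("Single" = 1, "Married" = 2)
)

# Labels of a single variable
attr(toy_wave$b_mastat_dv, "labels")

# Labels of every variable; unlabelled variables such as pidp return NULL
toy_labels <- sapply(toy_wave, attr, "labels")
toy_labels
```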

You may have noticed that the person identifier pidp is the only variable that does not include a wave prefix. That is because a respondent's pidp does not change across waves. Note also that not all individuals give an interview in every wave, and new respondents may join in any wave. To be able to tell which individuals were included in which wave, we first create variables that flag all cases in the wave-specific data frames.
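
The flag-and-join logic described above can be sketched with two small toy data frames. Everything here is assumed for illustration: `toy_w1`, `toy_w2`, `toy_all` and `flag_table` are hypothetical names and the values are invented. Respondent 101 appears only in wave 1 and respondent 104 only in wave 2, so their missing flag after the join shows which wave they were absent from.

```r
library(dplyr)

# Hypothetical toy waves sharing the identifier pidp
toy_w1 <- data.frame(pidp = c(101, 102, 103), a_sex_dv = c(1, 2, 2))
toy_w2 <- data.frame(pidp = c(102, 103, 104), b_sex_dv = c(2, 2, 1))

# Flag every case in each wave-specific data frame
toy_w1$wave1 <- 1
toy_w2$wave2 <- 1

# full_join() keeps respondents from both waves; unmatched rows get NA
toy_all <- full_join(toy_w1, toy_w2, by = "pidp")

# Cross-tabulate the flags, counting NA so absent respondents are visible
flag_table <- table(wave1 = toy_all$wave1, wave2 = toy_all$wave2, useNA = "ifany")
flag_table
```

A full join is used in this sketch, rather than an inner join, because an inner join would silently drop anyone who was not interviewed in both waves.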
