You have 8 small tasks. You can work in groups to complete them in fagcafe. The answers provided here are not the only correct ones; you are free to code the solutions your own way. Some information about the pairfam data:

  1. id: ID of respondent

  2. age: age of respondent

  3. sex_gen: gender of respondent

  4. cohort: birth cohort of respondent

  5. homosex_new: whether the respondent has a homosexual partner

  6. relstat: relationship status of the respondent

  7. val1i7: opinion about the statement “Marriage is a lifelong union that should not be broken.” 1 = Disagree completely, 5 = Agree completely

  8. sat6: general life satisfaction, ranging from 0 (Very dissatisfied) to 10 (Very satisfied)

Make sure you load the packages tidyverse, haven and janitor with library().

No. 1

Question

Import the anchor1_50percent_Eng data and name it “wave1”.

Answer

#Question: Import anchor1_50percent_Eng data
library(tidyverse) #load the tidyverse package
library(haven) #load the haven package
library(janitor) #load the janitor package
#you don't need to load the three packages again if they are already loaded in this session.
wave1 <- read_dta("anchor1_50percent_Eng.dta") #import the data
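
If read_dta() stops with a "file does not exist" error, the file is usually not in R's working directory. A minimal sketch of a guarded import, assuming the .dta file was saved in the working directory (the name `data_file` is only illustrative and not part of the exercise):

```r
# Hypothetical helper variable; adjust the path to where you saved the file
data_file <- "anchor1_50percent_Eng.dta"

if (file.exists(data_file)) {
  wave1 <- haven::read_dta(data_file)  # same import as in the answer
  print(dim(wave1))                    # number of rows and columns
} else {
  # Show where R is looking so you can move the file or change the path
  message("File not found. Current working directory: ", getwd())
}
```

The check does not change the import itself; it only tells you where R is looking when the file cannot be found.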

No. 2

Question

Keep variables of 1)id, 2)age, 3)sex_gen, 4)cohort, 5)homosex_new, 6)relstat, as well as 7)val1i7, and 8)sat6. The function that allows you to do this job is called “______”. Make these as a new dataset named “wave1a.”

Answer

We show the data in this tab.

#Question: Keep the variables id, age, sex_gen, cohort, homosex_new, relstat, val1i7 and sat6. The function that allows you to do this job is called "______". Make these a new dataset named "wave1a".
(wave1a <- select(wave1, 
                  id, 
                  age, 
                  sex_gen, 
                  cohort, 
                  homosex_new, 
                  relstat, 
                  val1i7,
                  sat6)) #select the variables to create a new dataset named wave1a.
## # A tibble: 6,201 × 8
##           id age       sex_gen      cohort    homosex_new relstat  val1i7  sat6 
##        <dbl> <dbl+lbl> <dbl+lbl>    <dbl+lbl> <dbl+lbl>   <dbl+lb> <dbl+l> <dbl>
##  1 267206000 16        2 [2 Female] 1 [1 199… -1 [-1 No …  1 [1 N… 3       7    
##  2 112963000 35        1 [1 Male]   3 [3 197… -1 [-1 No …  1 [1 N… 2       6    
##  3 327937000 16        2 [2 Female] 1 [1 199…  0 [0 Hete… -7 [-7 … 4       8    
##  4 318656000 27        2 [2 Female] 2 [2 198…  0 [0 Hete…  4 [4 M… 5 [5 A… 9    
##  5 717889000 37        1 [1 Male]   3 [3 197…  0 [0 Hete…  4 [4 M… 4       7    
##  6 222517000 15        1 [1 Male]   1 [1 199… -1 [-1 No …  1 [1 N… 5 [5 A… 9    
##  7 144712000 16        2 [2 Female] 1 [1 199… -1 [-1 No …  1 [1 N… 4       8    
##  8 659357000 17        2 [2 Female] 1 [1 199…  0 [0 Hete…  2 [2 N… 5 [5 A… 7    
##  9 506367000 37        1 [1 Male]   3 [3 197…  0 [0 Hete…  4 [4 M… 1 [1 D… 9    
## 10  64044000 15        2 [2 Female] 1 [1 199… -1 [-1 No …  1 [1 N… 1 [1 D… 7    
## # ℹ 6,191 more rows
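
To see what select() does without loading pairfam, here is a small sketch on made-up data (the data frame `toy` and its columns are invented for illustration):

```r
library(dplyr)

# Toy data, not pairfam: three columns, two rows
toy <- data.frame(id = c(1, 2), age = c(16, 35), extra = c("a", "b"))

toy_keep <- select(toy, id, age)  # keep the columns you name
toy_drop <- select(toy, -extra)   # or drop the columns you do not want

names(toy_keep)
names(toy_drop)
```

Both calls return the same two columns here. Naming the columns to keep is usually easier to read when the original dataset has many variables, as wave1 does.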

No. 3

Question

a) Change the variables “sex_gen”, “cohort”, “homosex_new”, “relstat” and “val1i7” into factors. Write and run your code for this task.

b) Find out how many birth cohorts this wave of pairfam covers.

Answer 3a

#Question: Change the variables "sex_gen", "cohort", "homosex_new", "relstat" and "val1i7" into factors. Write and run your code for this task. How many cohorts does this wave of pairfam cover?
wave1a <- mutate(wave1a,
                 sex_gen=as_factor(sex_gen), #make sex variable as factor
                 cohort=as_factor(cohort), #make cohort variable as factor
                 homosex_new=as_factor(homosex_new), #make homosex_new variable as factor
                 relstat=as_factor(relstat), #make relstat variable as factor
                 val1i7=as_factor(val1i7) #make val1i7 as factor
)
wave1a
## # A tibble: 6,201 × 8
##           id age       sex_gen  cohort      homosex_new   relstat   val1i7 sat6 
##        <dbl> <dbl+lbl> <fct>    <fct>       <fct>         <fct>     <fct>  <dbl>
##  1 267206000 16        2 Female 1 1991-1993 -1 No partner 1 Never … 3      7    
##  2 112963000 35        1 Male   3 1971-1973 -1 No partner 1 Never … 2      6    
##  3 327937000 16        2 Female 1 1991-1993 0 Hetero      -7 Incom… 4      8    
##  4 318656000 27        2 Female 2 1981-1983 0 Hetero      4 Marrie… 5 Agr… 9    
##  5 717889000 37        1 Male   3 1971-1973 0 Hetero      4 Marrie… 4      7    
##  6 222517000 15        1 Male   1 1991-1993 -1 No partner 1 Never … 5 Agr… 9    
##  7 144712000 16        2 Female 1 1991-1993 -1 No partner 1 Never … 4      8    
##  8 659357000 17        2 Female 1 1991-1993 0 Hetero      2 Never … 5 Agr… 7    
##  9 506367000 37        1 Male   3 1971-1973 0 Hetero      4 Marrie… 1 Dis… 9    
## 10  64044000 15        2 Female 1 1991-1993 -1 No partner 1 Never … 1 Dis… 7    
## # ℹ 6,191 more rows

Answer 3b

table(wave1a$cohort)
## 
##               -7 Incomplete data 0 former capikid first interview 
##                                0                                0 
##                      1 1991-1993                      2 1981-1983 
##                             2173                             2013 
##                      3 1971-1973                      4 2001-2003 
##                             2015                                0 
##    9 former capikid re-interview 
##                                0
#or
tabyl(wave1a$cohort)
##                     wave1a$cohort    n   percent
##                -7 Incomplete data    0 0.0000000
##  0 former capikid first interview    0 0.0000000
##                       1 1991-1993 2173 0.3504274
##                       2 1981-1983 2013 0.3246251
##                       3 1971-1973 2015 0.3249476
##                       4 2001-2003    0 0.0000000
##     9 former capikid re-interview    0 0.0000000
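
as_factor() keeps every labelled value as a level, including values that nobody in this wave has, so the number of levels is not the number of cohorts. In the tables above, only three cohorts have respondents (1991-1993, 1981-1983 and 1971-1973). A small sketch on a made-up factor shows how to count only the levels that actually occur (`cohort_toy` is invented for illustration):

```r
# Toy factor, not pairfam: four declared levels, only two observed
cohort_toy <- factor(c("a", "a", "b"), levels = c("a", "b", "c", "d"))

n_declared  <- nlevels(cohort_toy)              # declared levels, empty ones included
n_observed  <- nlevels(droplevels(cohort_toy))  # only levels that occur in the data
n_fromtable <- sum(table(cohort_toy) > 0)       # same count, read off the frequency table

c(n_declared, n_observed, n_fromtable)
```

On the toy factor this gives 4 declared levels but only 2 observed ones.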

No. 4

Question

Please center the “sat6” and “age” variables, create “c_sat6” for centered sat6 and “c_age” for centered age. Remember, \(c_x = x - \bar{x}\).

Some hints for you

# First, check whether values on those variables make sense.
# Second, be careful when a variable has missing values.

Answer

#Question: center sat6 and age.
# First. check whether values on those variables make sense.
tabyl(wave1a$sat6) #tabulate the sat6 variable to see if there are missing cases
##  wave1a$sat6    n      percent
##           -2    2 0.0003225286
##           -1    3 0.0004837929
##            0   26 0.0041928721
##            1   18 0.0029027576
##            2   45 0.0072568940
##            3  110 0.0177390743
##            4  133 0.0214481535
##            5  395 0.0636994033
##            6  508 0.0819222706
##            7 1178 0.1899693598
##            8 1877 0.3026931140
##            9 1157 0.1865828092
##           10  749 0.1207869698
#5 respondents gave -2 (No answer) or -1 (Don't know). These are missing-value codes and should be set to NA before centering.

#or 
tabyl(as_factor(wave1a$sat6)) #tabulate the sat6 variable to see if there are missing cases
##             as_factor(wave1a$sat6)    n      percent
##              -5 Inconsistent value    0 0.0000000000
##  -4 Filter error / Incorrect entry    0 0.0000000000
##                  -3 Does not apply    0 0.0000000000
##                       -2 No answer    2 0.0003225286
##                      -1 Don't know    3 0.0004837929
##                0 Very dissatisfied   26 0.0041928721
##                                  1   18 0.0029027576
##                                  2   45 0.0072568940
##                                  3  110 0.0177390743
##                                  4  133 0.0214481535
##                                  5  395 0.0636994033
##                                  6  508 0.0819222706
##                                  7 1178 0.1899693598
##                                  8 1877 0.3026931140
##                                  9 1157 0.1865828092
##                  10 Very satisfied  749 0.1207869698
tabyl(wave1a$age) #tabulate the age variable to see if there are missing cases
##  wave1a$age   n     percent
##          14  41 0.006611837
##          15 708 0.114175133
##          16 722 0.116432833
##          17 667 0.107563296
##          18  35 0.005644251
##          24  24 0.003870343
##          25 577 0.093049508
##          26 678 0.109337204
##          27 647 0.104338010
##          28  87 0.014029995
##          34  22 0.003547815
##          35 502 0.080954685
##          36 618 0.099661345
##          37 772 0.124496049
##          38 101 0.016287696
#no missing in the age variable
wave1a <- mutate(wave1a,
                 sat6=case_when(
                   sat6<0 ~ NA_real_, #when sat6 < 0, set it to NA
                   TRUE ~ as.numeric(sat6) #keep the remaining values as they are, stored as numeric
                                ),
                 c_sat6=sat6- mean(sat6,na.rm=TRUE), #center sat6; na.rm=TRUE makes mean() use only the non-missing values
                 c_age=age- mean(age) #center age
                 )
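
The two hints can be tried out on a handful of made-up numbers. The sketch below uses invented values, with -1 standing in for a "don't know" code, to show why the negative codes must become NA first and why na.rm = TRUE is needed afterwards:

```r
# Toy values, not pairfam: -1 stands for a "don't know" code
x <- c(-1, 4, 6, 8)

x_clean <- ifelse(x < 0, NA, x)              # turn the negative code into NA
mean_raw   <- mean(x)                        # 4.25: the -1 code drags the mean down
mean_na    <- mean(x_clean)                  # NA: one missing value makes the mean missing
mean_valid <- mean(x_clean, na.rm = TRUE)    # 6: mean of 4, 6 and 8

c_x <- x_clean - mean_valid                  # centered values: NA -2 0 2
mean(c_x, na.rm = TRUE)                      # 0, as expected for a centered variable
```

A quick check after centering is that the mean of the new variable is 0 (up to rounding).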

No. 5

Question

a) Please standardize the “sat6” and “age” variables, create “z_sat6” for the standardized sat6 and “z_age” for the standardized age.

b) Summarize z_sat6 and z_age to see the minimum, maximum and mean of the two variables. Remember, \(z_x = \frac{x - \bar{x}}{\text{SD}(x)}\).

Answer 5a

#Question: Z-standardize sat6 and age.
wave1a <- mutate(wave1a,
                 z_sat6=c_sat6/sd(sat6,na.rm = TRUE), #z-standardization of sat6, don't forget to include na.rm=TRUE so that the z-standardization will only focus on non-missings
                 z_age=c_age/sd(age) #z-standardization of age
                 
                 )

Answer 5b

summary(wave1a$z_sat6) #summary the standardized sat6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -4.3432 -0.3487  0.2220  0.0000  0.7926  1.3632       5
summary(wave1a$z_age) #summary the standardized age
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.41586 -1.05703  0.01946  0.00000  1.09595  1.45478
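
One way to check a hand-made z-score is to compare it with base R's scale(), which centers a variable and divides it by its standard deviation. A minimal sketch on made-up values (not pairfam data):

```r
# Toy values with one missing case
x <- c(2, 4, 6, 8, NA)

# Manual z-standardization, following z_x = (x - mean) / SD
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() returns a one-column matrix, so convert it back to a plain vector
z_scale <- as.numeric(scale(x))

same    <- all.equal(z_manual, z_scale)  # TRUE: both approaches agree
z_mean  <- mean(z_manual, na.rm = TRUE)  # 0
z_sd    <- sd(z_manual, na.rm = TRUE)    # 1
```

A standardized variable always has mean 0 and standard deviation 1, which is why the Mean column in the summaries above shows 0.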

No. 6

Question

a) Show the frequency of relstat.

b) Recode relstat to “no partner” when the person is never married, regardless of cohabiting status, and to “with partner” when the person is married, regardless of cohabiting status. Hint: you can use %in% (the “belongs to” operator).

c) Show the frequency of relstat again. What is the frequency for “no partner” and the frequency for “with partner”?

Answer 6a

tabyl(wave1a$relstat)
##               wave1a$relstat    n      percent
##           -7 Incomplete data   34 0.0054829866
##       1 Never married single 2448 0.3947750363
##          2 Never married LAT 1012 0.1631994840
##        3 Never married COHAB  660 0.1064344461
##              4 Married COHAB 1735 0.2797935817
##      5 Married noncohabiting   23 0.0037090792
##  6 Divorced/separated single  146 0.0235445896
##     7 Divorced/separated LAT   63 0.0101596517
##   8 Divorced/separated COHAB   76 0.0122560877
##             9 Widowed single    3 0.0004837929
##               10 Widowed LAT    1 0.0001612643
##             11 Widowed COHAB    0 0.0000000000

Answer 6b

wave1a <- mutate(wave1a,
                relstat_a = case_when( 
                    relstat %in% c("1 Never married single" ,"2 Never married LAT", "3 Never married COHAB")  ~ "no partner", #when relstat is one of the listed categories, replace it with 'no partner'
                    relstat %in% c("4 Married COHAB" ,"5 Married noncohabiting")  ~ "with partner",                 #when relstat is one of the listed categories, replace it with 'with partner'
                    TRUE ~ as.character(relstat) #all other categories stay as they are, stored as character
              ),
                relstat_a = factor(relstat_a) #convert relstat_a back to a factor
)
#Remember, %in% is an operator that means "belongs to"


#or 
wave1a <- mutate(wave1a,
                relstat_b = case_when( 
                relstat=="1 Never married single"~ "no partner",
                relstat=="2 Never married LAT"~ "no partner",
                relstat=="3 Never married COHAB" ~ "no partner",
                relstat=="4 Married COHAB"~ "with partner",
                relstat=="5 Married noncohabiting"~ "with partner",
                TRUE ~ as.character(relstat) #all other categories stay as they are, stored as character
              ),
              relstat_b = factor(relstat_b) #convert relstat_b back to a factor
)

Answer 6c

table(wave1a$relstat_a) #get the frequency table of relstat_a
## 
##          -7 Incomplete data              10 Widowed LAT 
##                          34                           1 
## 6 Divorced/separated single    7 Divorced/separated LAT 
##                         146                          63 
##  8 Divorced/separated COHAB            9 Widowed single 
##                          76                           3 
##                  no partner                with partner 
##                        4120                        1758
#or
tabyl(wave1a$relstat_a)
##             wave1a$relstat_a    n      percent
##           -7 Incomplete data   34 0.0054829866
##               10 Widowed LAT    1 0.0001612643
##  6 Divorced/separated single  146 0.0235445896
##     7 Divorced/separated LAT   63 0.0101596517
##   8 Divorced/separated COHAB   76 0.0122560877
##             9 Widowed single    3 0.0004837929
##                   no partner 4120 0.6644089663
##                 with partner 1758 0.2835026609
table(wave1a$relstat_b) #get the frequency table of relstat_b
## 
##          -7 Incomplete data              10 Widowed LAT 
##                          34                           1 
## 6 Divorced/separated single    7 Divorced/separated LAT 
##                         146                          63 
##  8 Divorced/separated COHAB            9 Widowed single 
##                          76                           3 
##                  no partner                with partner 
##                        4120                        1758
#or
tabyl(wave1a$relstat_b)
##             wave1a$relstat_b    n      percent
##           -7 Incomplete data   34 0.0054829866
##               10 Widowed LAT    1 0.0001612643
##  6 Divorced/separated single  146 0.0235445896
##     7 Divorced/separated LAT   63 0.0101596517
##   8 Divorced/separated COHAB   76 0.0122560877
##             9 Widowed single    3 0.0004837929
##                   no partner 4120 0.6644089663
##                 with partner 1758 0.2835026609
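
The two solutions give the same result because `x %in% c("a", "b")` is shorthand for `x == "a" | x == "b"`. A small sketch on made-up labels (not the pairfam categories) shows this, along with a cross-tabulation to check a recode:

```r
# Toy labels, invented for illustration
status <- c("single", "LAT", "married", "widowed")

with_in <- status %in% c("single", "LAT")          # TRUE TRUE FALSE FALSE
with_eq <- status == "single" | status == "LAT"    # the same comparison written out

# Recode the matching categories and leave the rest unchanged
recoded <- ifelse(with_in, "no partner", status)

# Cross-tabulate old against new values to confirm every category landed where intended
table(status, recoded)
```

Cross-tabulating the original variable against the recoded one, for example `table(wave1a$relstat, wave1a$relstat_a)`, is a quick way to confirm that no category was recoded by mistake.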

No. 7

Question

1. Rewrite the following code chunks using the %>% operator.

  • log(sd(c(5, 13, 89)))

  • as.numeric(scale(c(100, 32, 45)))

Answer

c(5, 13, 89) %>% sd() %>% log()
## [1] 3.836456
c(100, 32, 45) %>% scale() %>% as.numeric()
## [1]  1.1358256 -0.7479827 -0.3878429
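
The pipe only changes the reading order, not the result. The sketch below compares the nested call with a piped version written with base R's native pipe `|>`, assuming R 4.1 or newer. The exercise itself asks for `%>%`, which works the same way in this case.

```r
x <- c(5, 13, 89)

nested <- log(sd(x))           # read from the inside out: sd first, then log
piped  <- x |> sd() |> log()   # read from left to right: take x, then sd, then log

all.equal(nested, piped)       # TRUE: both forms compute the same number
```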

No. 8

Question

Remember that in Question 1 you imported the data as wave1. Now please use the %>% operator to do the following tasks.

a) keep only the variables “cohort” and “val1i7” in your dataset

b) treat them as factors in your dataset

c) create a variable named “mar_att”: when val1i7 is agree or completely agree, assign mar_att a value of 1; when val1i7 is neutral, disagree or completely disagree, assign mar_att a value of 0; when val1i7 has an invalid answer, make it NA

d) keep only respondents whose mar_att==1

Hint

#just check what the val1i7 looks like
tabyl(as_factor(wave1$val1i7))
##            as_factor(wave1$val1i7)    n     percent
##              -5 Inconsistent value    0 0.000000000
##  -4 Filter error / Incorrect entry    0 0.000000000
##                  -3 Does not apply    0 0.000000000
##                       -2 No answer   10 0.001612643
##                      -1 Don't know   53 0.008547009
##              1 Disagree completely  913 0.147234317
##                                  2  800 0.129011450
##                                  3 1328 0.214159007
##                                  4 1152 0.185776488
##                 5 Agree completely 1945 0.313659087

Answer

wave1a <- wave1 %>% 
  transmute(cohort=as_factor(cohort),
            val1i7=as_factor(val1i7),
            mar_att=case_when(val1i7 %in% c("4","5 Agree completely") ~ 1, #agree (4) or agree completely (5)
                              val1i7 %in% c("1 Disagree completely","2", "3") ~ 0, #disagree completely (1), disagree (2) or neutral (3)
                              TRUE ~ NA_real_) #invalid answers become NA
            )%>%
  filter(mar_att==1)
tabyl(wave1a$mar_att) #double check if wave1a only contains respondents whose mar_att==1
##  wave1a$mar_att    n percent
##               1 3097       1
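
`%in%` compares strings exactly, so the labels inside `c(...)` must match the factor levels character for character. A stray leading or trailing space is enough to make a category silently drop out of the recode. A minimal sketch with two made-up levels:

```r
# Two levels as they would appear after as_factor()
lev <- c("4", "5 Agree completely")

bad_match  <- lev %in% c("4", " 5 Agree completely")          # TRUE FALSE: leading space breaks the match
good_match <- lev %in% c("4", "5 Agree completely")           # TRUE TRUE
trimmed    <- lev %in% trimws(c("4", " 5 Agree completely"))  # TRUE TRUE after removing the space
```

To avoid this, print the levels and copy the labels from there, for example from `levels(as_factor(wave1$val1i7))`. Then compare the final count with the hint table: levels 4 and 5 together have 1152 + 1945 = 3097 respondents.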