You have 8 small tasks. You can work in groups to complete them in class. The answers provided here may not be the only correct ones; you may have your own way of coding. Some information about the pairfam data:
id: ID of respondent
age: age of respondent
sex_gen: gender of respondent
cohort: birth cohort of respondent
homosex_new: whether the respondent has a homosexual partner
relstat: relationship status of the respondent
val1i7: the opinion about the statement “Marriage is a lifelong union that should not be broken.” 1=Disagree completely, 5=Agree completely
sat6: general life satisfaction, ranging from 0 (Very dissatisfied) to 10 (Very satisfied)
Import anchor1_50percent_Eng data
#Question: Import anchor1_50percent_Eng data
library(tidyverse) #use the tidyverse package
library(haven) #use the haven package
library(janitor) #use the janitor package
#you don't need to load the three packages again if you have already loaded them in this session.
wave1 <- read_dta("anchor1_50percent_Eng.dta") #import the data
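After importing, it can help to confirm that the file was read with the expected size and that the Stata value labels came along. The sketch below is only an illustration: it writes a small made-up dataset to a temporary .dta file and reads it back, so `toy`, `toy_path`, and `toy_in` are hypothetical names and the values are not pairfam data.

```r
library(haven)
library(tibble)

# Toy data standing in for the real file (hypothetical values)
toy <- tibble(
  id = c(1, 2, 3),
  sex_gen = labelled(c(1, 2, 2), labels = c("1 Male" = 1, "2 Female" = 2))
)

toy_path <- tempfile(fileext = ".dta") # temporary file, not the course data
write_dta(toy, toy_path)               # save as a Stata file
toy_in <- read_dta(toy_path)           # import it the same way as wave1

dim(toy_in)                 # number of rows and columns
is.labelled(toy_in$sex_gen) # checks whether the value labels were imported
```

With the real data, the same two checks would be run on `wave1`.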
Keep the variables 1) id, 2) age, 3) sex_gen, 4) cohort, 5) homosex_new, 6) relstat, 7) val1i7, and 8) sat6. The function that allows you to do this job is called “______”. Save them as a new dataset named “wave1a”.
We show the data in this tab.
#Question: Keep the variables id, age, sex_gen, cohort, homosex_new, relstat, val1i7 (an attitude towards family), and sat6 (subjective wellbeing). The function that allows you to do this job is called "______". Save them as a new dataset named wave1a.
(wave1a <- select(wave1,
id,
age,
sex_gen,
cohort,
homosex_new,
relstat,
val1i7,
sat6)) #select the variables to create a new dataset named wave1a.
## # A tibble: 6,201 × 8
## id age sex_gen cohort homosex_new relstat val1i7 sat6
## <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+l> <dbl>
## 1 267206000 16 2 [2 Female] 1 [1 199… -1 [-1 No … 1 [1 N… 3 7
## 2 112963000 35 1 [1 Male] 3 [3 197… -1 [-1 No … 1 [1 N… 2 6
## 3 327937000 16 2 [2 Female] 1 [1 199… 0 [0 Hete… -7 [-7 … 4 8
## 4 318656000 27 2 [2 Female] 2 [2 198… 0 [0 Hete… 4 [4 M… 5 [5 A… 9
## 5 717889000 37 1 [1 Male] 3 [3 197… 0 [0 Hete… 4 [4 M… 4 7
## 6 222517000 15 1 [1 Male] 1 [1 199… -1 [-1 No … 1 [1 N… 5 [5 A… 9
## 7 144712000 16 2 [2 Female] 1 [1 199… -1 [-1 No … 1 [1 N… 4 8
## 8 659357000 17 2 [2 Female] 1 [1 199… 0 [0 Hete… 2 [2 N… 5 [5 A… 7
## 9 506367000 37 1 [1 Male] 3 [3 197… 0 [0 Hete… 4 [4 M… 1 [1 D… 9
## 10 64044000 15 2 [2 Female] 1 [1 199… -1 [-1 No … 1 [1 N… 1 [1 D… 7
## # ℹ 6,191 more rows
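The same selection can also be driven by a character vector of column names, which is handy when the list of variables is long or reused. This is a minimal sketch on made-up data; `toy`, `keep`, `toy_a`, and `toy_b` are hypothetical names, not part of the exercise.

```r
library(dplyr)

# Made-up data with one column we do not want to keep
toy <- tibble(
  id    = 1:3,
  age   = c(16, 35, 27),
  sat6  = c(7, 6, 9),
  extra = c("a", "b", "c")
)

keep <- c("id", "age", "sat6")     # names of the variables to keep

toy_a <- select(toy, id, age, sat6) # list the variables one by one
toy_b <- select(toy, all_of(keep))  # or pass them as a character vector

identical(toy_a, toy_b)             # both approaches give the same dataset
```

`all_of()` raises an error if one of the names is missing from the data, which makes typos in variable names easy to spot.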
a) Change the variables “sex_gen”, “cohort”, “homosex_new”, “relstat” and “val1i7” into factors. Write and run your code for this task. b) Find out how many birth cohorts this wave of pairfam covers.
#Question: Change the variables "sex_gen", "cohort", "homosex_new", "relstat" and "val1i7" into factors. Write and run your code for this task. How many birth cohorts does this wave of pairfam cover?
wave1a <- mutate(wave1a,
sex_gen=as_factor(sex_gen), #make sex variable as factor
cohort=as_factor(cohort), #make cohort variable as factor
homosex_new=as_factor(homosex_new), #make homosex_new variable as factor
relstat=as_factor(relstat), #make relstat variable as factor
val1i7=as_factor(val1i7) #make val1i7 as factor
)
wave1a
## # A tibble: 6,201 × 8
## id age sex_gen cohort homosex_new relstat val1i7 sat6
## <dbl> <dbl+lbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 267206000 16 2 Female 1 1991-1993 -1 No partner 1 Never … 3 7
## 2 112963000 35 1 Male 3 1971-1973 -1 No partner 1 Never … 2 6
## 3 327937000 16 2 Female 1 1991-1993 0 Hetero -7 Incom… 4 8
## 4 318656000 27 2 Female 2 1981-1983 0 Hetero 4 Marrie… 5 Agr… 9
## 5 717889000 37 1 Male 3 1971-1973 0 Hetero 4 Marrie… 4 7
## 6 222517000 15 1 Male 1 1991-1993 -1 No partner 1 Never … 5 Agr… 9
## 7 144712000 16 2 Female 1 1991-1993 -1 No partner 1 Never … 4 8
## 8 659357000 17 2 Female 1 1991-1993 0 Hetero 2 Never … 5 Agr… 7
## 9 506367000 37 1 Male 3 1971-1973 0 Hetero 4 Marrie… 1 Dis… 9
## 10 64044000 15 2 Female 1 1991-1993 -1 No partner 1 Never … 1 Dis… 7
## # ℹ 6,191 more rows
table(wave1a$cohort)
##
## -7 Incomplete data 0 former capikid first interview
## 0 0
## 1 1991-1993 2 1981-1983
## 2173 2013
## 3 1971-1973 4 2001-2003
## 2015 0
## 9 former capikid re-interview
## 0
#or
tabyl(wave1a$cohort)
## wave1a$cohort n percent
## -7 Incomplete data 0 0.0000000
## 0 former capikid first interview 0 0.0000000
## 1 1991-1993 2173 0.3504274
## 2 1981-1983 2013 0.3246251
## 3 1971-1973 2015 0.3249476
## 4 2001-2003 0 0.0000000
## 9 former capikid re-interview 0 0.0000000
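The tables above list seven labelled categories, but only three of them contain respondents, so this wave covers three birth cohorts. One way to count only the categories that actually occur is to drop the empty factor levels first. The sketch below uses a short made-up labelled vector; `cohort_toy` and `cohort_f` are hypothetical names.

```r
library(haven)

# Made-up labelled vector: five labels defined, only three used
cohort_toy <- labelled(
  c(1, 2, 3, 1, 2),
  labels = c("-7 Incomplete data" = -7,
             "1 1991-1993" = 1,
             "2 1981-1983" = 2,
             "3 1971-1973" = 3,
             "4 2001-2003" = 4)
)

cohort_f <- as_factor(cohort_toy) # labels become factor levels

nlevels(cohort_f)                 # all labelled categories, including empty ones
nlevels(droplevels(cohort_f))     # only the categories that occur in the data
```

Applied to the exercise data, the second line would be `nlevels(droplevels(wave1a$cohort))`.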
#Question: center sat6 and age.
# First, check whether the values of those variables make sense.
tabyl(wave1a$sat6) #tabulate the sat6 variable to see if there are missing cases
## wave1a$sat6 n percent
## -2 2 0.0003225286
## -1 3 0.0004837929
## 0 26 0.0041928721
## 1 18 0.0029027576
## 2 45 0.0072568940
## 3 110 0.0177390743
## 4 133 0.0214481535
## 5 395 0.0636994033
## 6 508 0.0819222706
## 7 1178 0.1899693598
## 8 1877 0.3026931140
## 9 1157 0.1865828092
## 10 749 0.1207869698
#5 respondents gave -2 or -1. These are missing-value codes, so they should be set to NA before centering.
#or
tabyl(as_factor(wave1a$sat6)) #tabulate the sat6 variable to see if there are missing cases
## as_factor(wave1a$sat6) n percent
## -5 Inconsistent value 0 0.0000000000
## -4 Filter error / Incorrect entry 0 0.0000000000
## -3 Does not apply 0 0.0000000000
## -2 No answer 2 0.0003225286
## -1 Don't know 3 0.0004837929
## 0 Very dissatisfied 26 0.0041928721
## 1 18 0.0029027576
## 2 45 0.0072568940
## 3 110 0.0177390743
## 4 133 0.0214481535
## 5 395 0.0636994033
## 6 508 0.0819222706
## 7 1178 0.1899693598
## 8 1877 0.3026931140
## 9 1157 0.1865828092
## 10 Very satisfied 749 0.1207869698
tabyl(wave1a$age) #tabulate the age variable to see if there are missing cases
## wave1a$age n percent
## 14 41 0.006611837
## 15 708 0.114175133
## 16 722 0.116432833
## 17 667 0.107563296
## 18 35 0.005644251
## 24 24 0.003870343
## 25 577 0.093049508
## 26 678 0.109337204
## 27 647 0.104338010
## 28 87 0.014029995
## 34 22 0.003547815
## 35 502 0.080954685
## 36 618 0.099661345
## 37 772 0.124496049
## 38 101 0.016287696
#no missing values in the age variable
wave1a <- mutate(wave1a,
sat6=case_when(
sat6<0 ~ as.numeric(NA), #when sat6 < 0, set it to NA
TRUE ~ as.numeric(sat6) #keep the remaining values of sat6 as they are, stored as numeric
),
c_sat6=sat6- mean(sat6,na.rm=TRUE), #center sat6; include na.rm=TRUE so that the mean is calculated from non-missing values only
c_age=age- mean(age) #center age
)
#Question: Z-standardize sat6 and age.
wave1a <- mutate(wave1a,
z_sat6=c_sat6/sd(sat6,na.rm = TRUE), #z-standardization of sat6, don't forget to include na.rm=TRUE so that the z-standardization will only focus on non-missings
z_age=c_age/sd(age) #z-standardization of age
)
summary(wave1a$z_sat6) #summarize the standardized sat6
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -4.3432 -0.3487 0.2220 0.0000 0.7926 1.3632 5
summary(wave1a$z_age) #summarize the standardized age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.41586 -1.05703 0.01946 0.00000 1.09595 1.45478
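Both summaries show a mean of 0, which is what centering should produce. As a cross-check, base R's `scale()` performs the same centering and division by the standard deviation in one step. The sketch below compares the two approaches on a made-up vector with one missing value; `x`, `c_x`, `z_x`, and `z_scale` are hypothetical names.

```r
# Made-up satisfaction scores with one missing value
x <- c(7, 6, 8, 9, NA, 7)

c_x <- x - mean(x, na.rm = TRUE) # centering: subtract the mean
z_x <- c_x / sd(x, na.rm = TRUE) # z-standardization: divide by the SD

z_scale <- as.numeric(scale(x))  # scale() returns a matrix, so convert it

all.equal(z_x, z_scale)          # the manual and scale() versions agree
mean(z_x, na.rm = TRUE)          # approximately 0
sd(z_x, na.rm = TRUE)            # approximately 1
```

Doing the two steps by hand, as in the answer above, keeps the centered variable available separately.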
a) Show the frequency of relstat.
b) Recode relstat to “no partner” when the person is never married regardless of the cohabiting status.
c) Recode relstat to “with partner” when the person is married regardless of the cohabiting status.
d) Show the frequency of relstat again.
What is the frequency for “no partner” and the frequency for “with partner”?
#Question: Show the frequency of relstat. Recode relstat to "no partner" when the person is 'never married' regardless of the cohabiting status; recode relstat to "with partner" when the person is 'married' regardless of the cohabiting status. Show the frequency of relstat again. What is the frequency for "no partner" and the frequency for "with partner"?
tabyl(wave1a$relstat)
## wave1a$relstat n percent
## -7 Incomplete data 34 0.0054829866
## 1 Never married single 2448 0.3947750363
## 2 Never married LAT 1012 0.1631994840
## 3 Never married COHAB 660 0.1064344461
## 4 Married COHAB 1735 0.2797935817
## 5 Married noncohabiting 23 0.0037090792
## 6 Divorced/separated single 146 0.0235445896
## 7 Divorced/separated LAT 63 0.0101596517
## 8 Divorced/separated COHAB 76 0.0122560877
## 9 Widowed single 3 0.0004837929
## 10 Widowed LAT 1 0.0001612643
## 11 Widowed COHAB 0 0.0000000000
wave1a <- mutate(wave1a,
relstat_a = case_when(
relstat %in% c("1 Never married single" ,"2 Never married LAT", "3 Never married COHAB") ~ "no partner", #when the value of relstat belongs to one of the listed categories, replace it with 'no partner'
relstat %in% c("4 Married COHAB" ,"5 Married noncohabiting") ~ "with partner", #when the value of relstat belongs to one of the listed categories, replace it with 'with partner'
TRUE ~ as.character(relstat) #the rest remain as they are, converted to character
),
relstat_a = factor(relstat_a) #change relstat_a back to a factor
)
#Remember, %in% is an operator that means "belongs to"
#or
wave1a <- mutate(wave1a,
relstat_b = case_when(
relstat=="1 Never married single"~ "no partner",
relstat=="2 Never married LAT"~ "no partner",
relstat=="3 Never married COHAB" ~ "no partner",
relstat=="4 Married COHAB"~ "with partner",
relstat=="5 Married noncohabiting"~ "with partner",
TRUE ~ as.character(relstat) #the rest remain as they are, converted to character
),
relstat_b = factor(relstat_b) #change relstat_b back to a factor
)
table(wave1a$relstat_a) #get the frequency table of relstat_a
##
## -7 Incomplete data 10 Widowed LAT
## 34 1
## 6 Divorced/separated single 7 Divorced/separated LAT
## 146 63
## 8 Divorced/separated COHAB 9 Widowed single
## 76 3
## no partner with partner
## 4120 1758
#or
tabyl(wave1a$relstat_a)
## wave1a$relstat_a n percent
## -7 Incomplete data 34 0.0054829866
## 10 Widowed LAT 1 0.0001612643
## 6 Divorced/separated single 146 0.0235445896
## 7 Divorced/separated LAT 63 0.0101596517
## 8 Divorced/separated COHAB 76 0.0122560877
## 9 Widowed single 3 0.0004837929
## no partner 4120 0.6644089663
## with partner 1758 0.2835026609
table(wave1a$relstat_b) #get the frequency table of relstat_b
##
## -7 Incomplete data 10 Widowed LAT
## 34 1
## 6 Divorced/separated single 7 Divorced/separated LAT
## 146 63
## 8 Divorced/separated COHAB 9 Widowed single
## 76 3
## no partner with partner
## 4120 1758
#or
tabyl(wave1a$relstat_b)
## wave1a$relstat_b n percent
## -7 Incomplete data 34 0.0054829866
## 10 Widowed LAT 1 0.0001612643
## 6 Divorced/separated single 146 0.0235445896
## 7 Divorced/separated LAT 63 0.0101596517
## 8 Divorced/separated COHAB 76 0.0122560877
## 9 Widowed single 3 0.0004837929
## no partner 4120 0.6644089663
## with partner 1758 0.2835026609
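Both versions give 4120 for "no partner" and 1758 for "with partner". A further option, not used in the answers above, is `fct_collapse()` from the forcats package (loaded with the tidyverse), which merges factor levels directly and leaves unmentioned levels untouched. The sketch below runs on a made-up factor; `relstat_toy` and `relstat_c` are hypothetical names.

```r
library(forcats)

# Made-up factor using four of the relstat categories
relstat_toy <- factor(c("1 Never married single",
                        "2 Never married LAT",
                        "4 Married COHAB",
                        "6 Divorced/separated single"))

relstat_c <- fct_collapse(
  relstat_toy,
  "no partner"   = c("1 Never married single", "2 Never married LAT"),
  "with partner" = "4 Married COHAB"
) # levels that are not listed keep their original name

table(relstat_c) # frequency table of the collapsed factor
```

Because the result is already a factor, no conversion to character and back is needed.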
Now check out prop.table() and find out how to use it. Check which cohort has the highest proportion of people reporting a homosexual orientation.
#Question: Now check out prop.table() and find out how to use it. Check which cohort has the highest proportion of people reporting a homosexual orientation.
table(wave1a$homosex_new, wave1a$cohort)
##
## -7 Incomplete data 0 former capikid first interview
## -7 Incomplete data 0 0
## -1 No partner 0 0
## 0 Hetero 0 0
## 1 Gay 0 0
## 2 Lesbian 0 0
##
## 1 1991-1993 2 1981-1983 3 1971-1973 4 2001-2003
## -7 Incomplete data 0 0 0 0
## -1 No partner 1625 628 354 0
## 0 Hetero 538 1357 1643 0
## 1 Gay 3 13 9 0
## 2 Lesbian 7 15 9 0
##
## 9 former capikid re-interview
## -7 Incomplete data 0
## -1 No partner 0
## 0 Hetero 0
## 1 Gay 0
## 2 Lesbian 0
prop.table(table(wave1a$homosex_new, wave1a$cohort), margin = 2) #margin = 2 calculates proportions within each column (here, within each cohort)
##
## -7 Incomplete data 0 former capikid first interview
## -7 Incomplete data
## -1 No partner
## 0 Hetero
## 1 Gay
## 2 Lesbian
##
## 1 1991-1993 2 1981-1983 3 1971-1973 4 2001-2003
## -7 Incomplete data 0.000000000 0.000000000 0.000000000
## -1 No partner 0.747814082 0.311972181 0.175682382
## 0 Hetero 0.247583985 0.674118231 0.815384615
## 1 Gay 0.001380580 0.006458023 0.004466501
## 2 Lesbian 0.003221353 0.007451565 0.004466501
##
## 9 former capikid re-interview
## -7 Incomplete data
## -1 No partner
## 0 Hetero
## 1 Gay
## 2 Lesbian
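The effect of the `margin` argument is easier to see on a small table. In the sketch below, built from made-up values, `margin = 2` makes every column sum to 1 and `margin = 1` makes every row sum to 1; the "Gay" and "Lesbian" rows are then added up per cohort to find the column with the largest share. The names `tab`, `p_col`, `p_row`, and `share` are hypothetical.

```r
# Made-up cross-table of partner type by cohort
tab <- table(
  partner = c("Hetero", "Hetero", "Gay", "Hetero", "Lesbian", "Hetero"),
  cohort  = c("1991-1993", "1991-1993", "1981-1983",
              "1981-1983", "1981-1983", "1971-1973")
)

p_col <- prop.table(tab, margin = 2) # proportions within each column (cohort)
colSums(p_col)                       # every column sums to 1

p_row <- prop.table(tab, margin = 1) # proportions within each row (partner type)
rowSums(p_row)                       # every row sums to 1

# Combined share of the "Gay" and "Lesbian" rows in each cohort
share <- colSums(p_col[c("Gay", "Lesbian"), ])
names(which.max(share))              # cohort with the highest combined share
```

Columns with no observations, such as the empty cohort categories in the output above, have a total of zero, so their proportions are undefined and print as blanks.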
a) Show a proportion table for the variable “val1i7” for the whole sample, and b) then show a proportion table for “val1i7” only among people who are aged >30.
#Question: Show a proportion table for variable "val1i7" for the whole sample, and then show a proportion table for people who are aged >30
prop.table(table(wave1a$val1i7)) #proportion table for variable "val1i7" based on the whole sample
##
## -5 Inconsistent value -4 Filter error / Incorrect entry
## 0.000000000 0.000000000
## -3 Does not apply -2 No answer
## 0.000000000 0.001612643
## -1 Don't know 1 Disagree completely
## 0.008547009 0.147234317
## 2 3
## 0.129011450 0.214159007
## 4 5 Agree completely
## 0.185776488 0.313659087
prop.table(table(wave1a$val1i7[wave1a$age > 30])) #proportion table for variable "val1i7" based on those aged>30
##
## -5 Inconsistent value -4 Filter error / Incorrect entry
## 0.000000000 0.000000000
## -3 Does not apply -2 No answer
## 0.000000000 0.002481390
## -1 Don't know 1 Disagree completely
## 0.006947891 0.211910670
## 2 3
## 0.133995037 0.216377171
## 4 5 Agree completely
## 0.151364764 0.276923077
#or you can
wave1b <- filter(wave1a,age > 30 ) #generate a new dataset containing only those aged >30
prop.table(table(wave1b$val1i7)) #then generate a proportion table
##
## -5 Inconsistent value -4 Filter error / Incorrect entry
## 0.000000000 0.000000000
## -3 Does not apply -2 No answer
## 0.000000000 0.002481390
## -1 Don't know 1 Disagree completely
## 0.006947891 0.211910670
## 2 3
## 0.133995037 0.216377171
## 4 5 Agree completely
## 0.151364764 0.276923077
#or you can
tabyl(wave1a$val1i7[wave1a$age > 30])
## wave1a$val1i7[wave1a$age > 30] n percent
## -5 Inconsistent value 0 0.000000000
## -4 Filter error / Incorrect entry 0 0.000000000
## -3 Does not apply 0 0.000000000
## -2 No answer 5 0.002481390
## -1 Don't know 14 0.006947891
## 1 Disagree completely 427 0.211910670
## 2 270 0.133995037
## 3 436 0.216377171
## 4 305 0.151364764
## 5 Agree completely 558 0.276923077
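The subgroup proportions can also be computed in a single dplyr pipeline, which returns the counts and proportions as a tibble that can be processed further. This is a minimal sketch on made-up data, not the answer used above; `toy` and `toy_prop` are hypothetical names.

```r
library(dplyr)

# Made-up data: six respondents with age and an attitude item
toy <- tibble(
  age    = c(16, 35, 37, 27, 36, 34),
  val1i7 = factor(c(5, 1, 5, 3, 1, 5))
)

toy_prop <- toy %>%
  filter(age > 30) %>%        # keep only respondents older than 30
  count(val1i7) %>%           # frequency of each answer category
  mutate(prop = n / sum(n))   # convert the frequencies to proportions

toy_prop
```

Unlike `table()`, `count()` omits factor levels with no observations by default, so empty categories do not appear in the result.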