You have 7 small tasks. You can work in groups to complete them in class. The answers provided here may not be the only correct ones; you are free to code in your own way.

No. 1

Question

Import anchor1_50percent_Eng data

Answer

#Question: Import anchor1_50percent_Eng data
library(tidyverse) #use the tidyverse package
library(haven) #use the haven package
wave1 <- read_dta("anchor1_50percent_Eng.dta") #import the data

No. 2

Question

Keep the variables id, age, sex_gen, cohort, homosex_new, yeduc, relstat, as well as one variable that reflects the attitude towards family (e.g. val1i7), and one variable that reflects subjective wellbeing (e.g. sat6). The function that allows you to do this job is called “______”. Save the result as a new dataset.

Answer

The resulting dataset is printed below.

#Question: Keep the variables id, age, sex_gen, cohort, homosex_new, yeduc, relstat, as well as one variable that reflects the attitude towards family, and one variable that reflects subjective wellbeing. The function that allows you to do this job is called "______". Save the result as a new dataset.
(wave1a <- select(wave1, id, age, sex_gen, cohort, homosex_new, yeduc, 
               relstat, val1i7, sat6)) #select the variables to create a new dataset named wave1a; the outer parentheses also print it
## # A tibble: 6,201 × 9
##           id age       sex_gen  cohort  homose…¹ yeduc    relstat  val1i7  sat6 
##        <dbl> <dbl+lbl> <dbl+lb> <dbl+l> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+l> <dbl>
##  1 267206000 16        2 [2 Fe… 1 [1 1… -1 [-1 …  0 [0 c…  1 [1 N… 3       7    
##  2 112963000 35        1 [1 Ma… 3 [3 1… -1 [-1 … 10.5      1 [1 N… 2       6    
##  3 327937000 16        2 [2 Fe… 1 [1 1…  0 [0 H…  0 [0 c… -7 [-7 … 4       8    
##  4 318656000 27        2 [2 Fe… 2 [2 1…  0 [0 H… 11.5      4 [4 M… 5 [5 A… 9    
##  5 717889000 37        1 [1 Ma… 3 [3 1…  0 [0 H… 11.5      4 [4 M… 4       7    
##  6 222517000 15        1 [1 Ma… 1 [1 1… -1 [-1 …  0 [0 c…  1 [1 N… 5 [5 A… 9    
##  7 144712000 16        2 [2 Fe… 1 [1 1… -1 [-1 …  0 [0 c…  1 [1 N… 4       8    
##  8 659357000 17        2 [2 Fe… 1 [1 1…  0 [0 H…  0 [0 c…  2 [2 N… 5 [5 A… 7    
##  9 506367000 37        1 [1 Ma… 3 [3 1…  0 [0 H… 10.5      4 [4 M… 1 [1 D… 9    
## 10  64044000 15        2 [2 Fe… 1 [1 1… -1 [-1 …  0 [0 c…  1 [1 N… 1 [1 D… 7    
## # … with 6,191 more rows, and abbreviated variable name ¹​homosex_new

No. 3

Question

Convert the variables to numeric or factor as appropriate: 1) remove the labels from continuous variables; 2) turn categorical variables into factors. Think about which function handles numeric variables and which handles categorical variables. Write and run the code for this task.

Answer

#Question: Convert the variables to numeric or factor as appropriate: 1) remove the labels from continuous variables; 2) turn categorical variables into factors. For that, you need two functions: "______" for numeric and "______" for categorical variables.
wave1a <- mutate(wave1a,
                 id=zap_labels(id), #remove label for id
                 age=zap_labels(age), #remove label of age
                 yeduc=zap_labels(yeduc), #remove label of education years
                 sat6=zap_labels(sat6), #remove label for life satisfaction, it is good to view it as continuous.
                 sex_gen=as_factor(sex_gen), #make sex variable as factor
                 cohort=as_factor(cohort), #make cohort variable as factor
                 homosex_new=as_factor(homosex_new), #make homosex_new variable as factor
                 relstat=as_factor(relstat), #make relstat variable as factor
                 val1i7=as_factor(val1i7) #make val1i7 as factor
)
wave1a
## # A tibble: 6,201 × 9
##           id   age sex_gen  cohort      homosex_new   yeduc relstat val1i7  sat6
##        <dbl> <dbl> <fct>    <fct>       <fct>         <dbl> <fct>   <fct>  <dbl>
##  1 267206000    16 2 Female 1 1991-1993 -1 No partner   0   1 Neve… 3          7
##  2 112963000    35 1 Male   3 1971-1973 -1 No partner  10.5 1 Neve… 2          6
##  3 327937000    16 2 Female 1 1991-1993 0 Hetero        0   -7 Inc… 4          8
##  4 318656000    27 2 Female 2 1981-1983 0 Hetero       11.5 4 Marr… 5 Agr…     9
##  5 717889000    37 1 Male   3 1971-1973 0 Hetero       11.5 4 Marr… 4          7
##  6 222517000    15 1 Male   1 1991-1993 -1 No partner   0   1 Neve… 5 Agr…     9
##  7 144712000    16 2 Female 1 1991-1993 -1 No partner   0   1 Neve… 4          8
##  8 659357000    17 2 Female 1 1991-1993 0 Hetero        0   2 Neve… 5 Agr…     7
##  9 506367000    37 1 Male   3 1971-1973 0 Hetero       10.5 4 Marr… 1 Dis…     9
## 10  64044000    15 2 Female 1 1991-1993 -1 No partner   0   1 Neve… 1 Dis…     7
## # … with 6,191 more rows
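
To see what the two functions do in isolation, here is a minimal sketch on a made-up labelled vector. The object `sex_toy` and its values are hypothetical and not taken from the anchor data; only `labelled()`, `zap_labels()` and `as_factor()` from haven are used.

```r
library(haven)  # provides labelled(), zap_labels(), as_factor()

# Hypothetical toy vector that mimics how read_dta() stores a labelled variable
sex_toy <- labelled(c(1, 2, 2, 1), labels = c("1 Male" = 1, "2 Female" = 2))

sex_num <- zap_labels(sex_toy)  # drops the value labels, keeps the numbers
sex_fct <- as_factor(sex_toy)   # turns the value labels into factor levels

class(sex_num)   # "numeric"
levels(sex_fct)  # "1 Male" "2 Female"
```

Use `zap_labels()` when the numbers themselves carry the meaning (age, years of education), and `as_factor()` when the numbers are only codes for categories.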

No. 4

Question

Z-standardize yeduc and age. Remember, \(z_x = \frac{x - \bar{x}}{\text{SD(x)}}\).

Answer

#Question: Z-standardize yeduc and age.
# First, check whether the values of these variables make sense.
summary(wave1a$yeduc) #summarize yeduc to check for implausible values, such as negative codes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -7.000   0.000  11.000   8.933  13.000  20.000
summary(wave1a$age) #summarize age to check for implausible or missing values
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   17.00   26.00   25.84   35.00   38.00
wave1a <- mutate(wave1a,
                 yeduc=case_when(
                   yeduc<0 ~ NA_real_, #negative values (e.g. -7) are not valid years of education, so set them to NA
                   TRUE ~ as.numeric(yeduc) #keep all other values of yeduc as they are, as numeric
                 ),
                 z_yeduc=(yeduc- mean(yeduc,na.rm=TRUE))/sd(yeduc,na.rm = TRUE), #z-standardization of yeduc, ignoring NA
                 z_age=(age- mean(age))/sd(age) #z-standardization of age (the summary above shows no NA)
                 )
summary(wave1a$z_yeduc) #summarize the standardized yeduc
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -1.4513 -1.4513  0.3226  0.0000  0.6451  1.7739      26
summary(wave1a$z_age) #summarize the standardized age
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.41586 -1.05703  0.01946  0.00000  1.09595  1.45478
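
As a quick check of the formula, the sketch below standardizes a made-up vector by hand and compares the result with base R's `scale()`. The values in `x` are hypothetical; they only illustrate that both approaches agree and that the result has mean 0 and standard deviation 1.

```r
# Hypothetical toy values with one missing case
x <- c(9, 10.5, 11.5, 13, NA, 20)

# Manual z-standardization, following the formula in the question
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() centers and scales a vector, ignoring NA; it returns a one-column matrix
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)   # TRUE
mean(z_manual, na.rm = TRUE)   # approximately 0
sd(z_manual, na.rm = TRUE)     # 1
```

Writing the formula out in `mutate()` keeps the steps visible, while `scale()` is a convenient shortcut once the logic is clear.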

No. 5

Question

Show the frequency of relstat. Recode relstat to “no partner” when the person is never married, regardless of cohabiting status, and to “with partner” when the person is married, regardless of cohabiting status. Show the frequency of relstat again. What are the frequencies for “no partner” and “with partner”?

Answer

#Question: Show the frequency of relstat. Recode relstat to "no partner" when the person is never married, regardless of cohabiting status, and to "with partner" when the person is married, regardless of cohabiting status. Show the frequency of relstat again. What are the frequencies for "no partner" and "with partner"?

table(wave1a$relstat)
## 
##          -7 Incomplete data      1 Never married single 
##                          34                        2448 
##         2 Never married LAT       3 Never married COHAB 
##                        1012                         660 
##             4 Married COHAB     5 Married noncohabiting 
##                        1735                          23 
## 6 Divorced/separated single    7 Divorced/separated LAT 
##                         146                          63 
##  8 Divorced/separated COHAB            9 Widowed single 
##                          76                           3 
##              10 Widowed LAT            11 Widowed COHAB 
##                           1                           0
wave1a <- mutate(wave1a,
              relstat = case_when( 
                relstat %in% c("1 Never married single" ,"2 Never married LAT", "3 Never married COHAB")  ~ "no partner", #when relstat is one of the listed categories, replace it with "no partner"
                relstat %in% c("4 Married COHAB" ,"5 Married noncohabiting")  ~ "with partner", 
                #when relstat is one of the listed categories, replace it with "with partner"
                TRUE ~ as.character(relstat) #all other values stay as they are, converted to character
              ),
              relstat = factor(relstat) #convert relstat back to a factor
)
table(wave1a$relstat) #get the frequency table of relstat
## 
##          -7 Incomplete data              10 Widowed LAT 
##                          34                           1 
## 6 Divorced/separated single    7 Divorced/separated LAT 
##                         146                          63 
##  8 Divorced/separated COHAB            9 Widowed single 
##                          76                           3 
##                  no partner                with partner 
##                        4120                        1758
#Remember, %in% is an operator that means "belongs to"
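
The sketch below shows `%in%` on its own and then one possible alternative to the `case_when()` recode: `fct_collapse()` from forcats, which merges factor levels without a detour through character. The factor `relstat_toy` is a hypothetical example, not the real data, and this alternative is only an illustration, not the answer expected in class.

```r
library(forcats)  # part of the tidyverse; provides fct_collapse()

# Hypothetical toy factor that mimics a few relstat categories
relstat_toy <- factor(c("1 Never married single", "4 Married COHAB",
                        "2 Never married LAT", "9 Widowed single"))

# %in% returns TRUE for every element that belongs to the set on the right
relstat_toy %in% c("1 Never married single", "2 Never married LAT")
# TRUE FALSE  TRUE FALSE

# Merge the listed levels into new ones; levels not mentioned are left unchanged
relstat_new <- fct_collapse(
  relstat_toy,
  "no partner"   = c("1 Never married single", "2 Never married LAT"),
  "with partner" = "4 Married COHAB"
)
table(relstat_new)
```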

No. 6 (Optional)

Question

Now look up prop.table() and find out how to use it. Check which cohort has the highest proportion of people reporting a homosexual orientation.

Answer

#Question: Now look up prop.table() and find out how to use it. Check which cohort has the highest proportion of people reporting a homosexual orientation.
prop.table(table(wave1a$homosex_new, wave1a$cohort), margin = 2) # margin = 2 calculates proportions within each column (here: within each cohort)
##                     
##                      -7 Incomplete data 0 former capikid first interview
##   -7 Incomplete data                                                    
##   -1 No partner                                                         
##   0 Hetero                                                              
##   1 Gay                                                                 
##   2 Lesbian                                                             
##                     
##                      1 1991-1993 2 1981-1983 3 1971-1973 4 2001-2003
##   -7 Incomplete data 0.000000000 0.000000000 0.000000000            
##   -1 No partner      0.747814082 0.311972181 0.175682382            
##   0 Hetero           0.247583985 0.674118231 0.815384615            
##   1 Gay              0.001380580 0.006458023 0.004466501            
##   2 Lesbian          0.003221353 0.007451565 0.004466501            
##                     
##                      9 former capikid re-interview
##   -7 Incomplete data                              
##   -1 No partner                                   
##   0 Hetero                                        
##   1 Gay                                           
##   2 Lesbian
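
The effect of the `margin` argument is easiest to see on a small table. The counts below are made up for illustration; the row and column names are hypothetical placeholders.

```r
# Hypothetical toy counts: rows are groups, columns are cohorts
tab <- matrix(c(10, 30,
                40, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("a", "b"),
                              cohort = c("c1", "c2")))

prop.table(tab)              # each cell divided by the grand total; all cells sum to 1
prop.table(tab, margin = 1)  # proportions within each row; every row sums to 1
prop.table(tab, margin = 2)  # proportions within each column; every column sums to 1
```

To compare cohorts, the column version (`margin = 2`) is the relevant one, because each cohort is a column and its proportions then add up to 1.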

No. 7 (Optional)

Question

Show a proportion table for variable “val1i7” for the whole sample, and then show a proportion table for “val1i7” only among people who are aged >30

Answer

#Question: Show a proportion table for variable "val1i7" for the whole sample, and then show a proportion table for "val1i7" only among people who are aged >30

prop.table(table(wave1a$val1i7)) #proportion table for variable "val1i7" based on the whole sample
## 
##             -5 Inconsistent value -4 Filter error / Incorrect entry 
##                       0.000000000                       0.000000000 
##                 -3 Does not apply                      -2 No answer 
##                       0.000000000                       0.001612643 
##                     -1 Don't know             1 Disagree completely 
##                       0.008547009                       0.147234317 
##                                 2                                 3 
##                       0.129011450                       0.214159007 
##                                 4                5 Agree completely 
##                       0.185776488                       0.313659087
prop.table(table(wave1a$val1i7[wave1a$age > 30])) #proportion table for variable "val1i7" based on those aged>30
## 
##             -5 Inconsistent value -4 Filter error / Incorrect entry 
##                       0.000000000                       0.000000000 
##                 -3 Does not apply                      -2 No answer 
##                       0.000000000                       0.002481390 
##                     -1 Don't know             1 Disagree completely 
##                       0.006947891                       0.211910670 
##                                 2                                 3 
##                       0.133995037                       0.216377171 
##                                 4                5 Agree completely 
##                       0.151364764                       0.276923077
#Alternatively, filter the data first:
wave1b <- filter(wave1a, age > 30) #generate a new dataset containing only those aged >30
prop.table(table(wave1b$val1i7)) #then generate a proportion table
## 
##             -5 Inconsistent value -4 Filter error / Incorrect entry 
##                       0.000000000                       0.000000000 
##                 -3 Does not apply                      -2 No answer 
##                       0.000000000                       0.002481390 
##                     -1 Don't know             1 Disagree completely 
##                       0.006947891                       0.211910670 
##                                 2                                 3 
##                       0.133995037                       0.216377171 
##                                 4                5 Agree completely 
##                       0.151364764                       0.276923077
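
The tables above still list categories that nobody chose, with a proportion of 0. If you prefer to hide them, one option is base R's `droplevels()`, sketched here on made-up data. The objects `val_toy` and `age_toy` are hypothetical and only mimic the structure of val1i7 and age.

```r
# Hypothetical toy data: a factor with one level that is never observed
val_toy <- factor(c("1 Disagree completely", "3", "3", "5 Agree completely"),
                  levels = c("-2 No answer", "1 Disagree completely",
                             "3", "5 Agree completely"))
age_toy <- c(25, 31, 35, 40)

older <- val_toy[age_toy > 30]     # subset to those aged >30

prop.table(table(older))              # keeps all four levels, including empty ones
prop.table(table(droplevels(older)))  # keeps only the levels that actually occur
```

Whether to drop empty categories is a presentation choice: keeping them shows which answers were possible, dropping them makes the table shorter.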