class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Data Frames & Tibbles
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---
#Packages today

- Packages used in this session
    - tidyverse
    - haven

```r
# We installed "tidyverse" last week.
# Now we also install "haven".
install.packages("tidyverse")
install.packages("haven")
```

```r
# Let us load the two packages.
library(tidyverse)
library(haven) # Read and handle SPSS, Stata & SAS data (installed along with tidyverse, but loaded separately)
```

---
#Some vectors

```r
(age <- c(34, 22, 42, 12, 76))
(conti <- factor(x = c("Europe", "Africa", "Africa", "Asia", "S. America"),
                 levels = c("Africa", "Asia", "Australia", "Europe", "N. America", "S. America")))
(employed <- c(FALSE, TRUE, TRUE, TRUE, TRUE))
(name <- c("Agnes", "Martin", "Hakan", "Tu", "Thais"))
(nr_kids <- c(1, 0, 3, 0, 4))
```

---
#Data frames

Data frames organize vectors of **equal length** along their indices.

```r
# Bind our 5 vectors along their index into a data frame.
# Assign that data frame to object "Dat".
(Dat <- data.frame(name, age, conti, employed, nr_kids))
```

```
##     name age      conti employed nr_kids
## 1  Agnes  34     Europe    FALSE       1
## 2 Martin  22     Africa     TRUE       0
## 3  Hakan  42     Africa     TRUE       3
## 4     Tu  12       Asia     TRUE       0
## 5  Thais  76 S. America     TRUE       4
```

---
#Data frames

.center[Data frames are the typical "rectangular" way to organize data:]

<img src="https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" width="80%" style="display: block; margin: auto;" >

---
#Tibbles

Tibbles are data frames with some improved features (for example, the printout shows the type of each column), so we will work with them.

.pull-left[

```r
Dat
```

```
##     name age      conti employed nr_kids
## 1  Agnes  34     Europe    FALSE       1
## 2 Martin  22     Africa     TRUE       0
## 3  Hakan  42     Africa     TRUE       3
## 4     Tu  12       Asia     TRUE       0
## 5  Thais  76 S. America     TRUE       4
```

]

.pull-right[

```r
# Make Dat a tibble and assign it to object "Dat"
# (effectively overwriting Dat with its tibble version).
(Dat <- as_tibble(Dat))
```

```
## # A tibble: 5 × 5
##   name     age conti      employed nr_kids
##   <chr>  <dbl> <fct>      <lgl>      <dbl>
## 1 Agnes     34 Europe     FALSE          1
## 2 Martin    22 Africa     TRUE           0
## 3 Hakan     42 Africa     TRUE           3
## 4 Tu        12 Asia       TRUE           0
## 5 Thais     76 S. America TRUE           4
```

]

---
#Address a single variable using **$**

--

```r
# Return variable "conti" contained in tibble Dat.
# (R for Data Science mentions two further commands.)
Dat$conti
```

```
## [1] Europe     Africa     Africa     Asia       S. America
## Levels: Africa Asia Australia Europe N. America S. America
```

--

```r
# Give a summary of numeric vector age contained in Dat.
summary(Dat$age)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    12.0    22.0    34.0    37.2    42.0    76.0
```

--

```r
# Give a summary of factor vector conti contained in Dat.
summary(Dat$conti)
```

```
##     Africa       Asia  Australia     Europe N. America S. America 
##          2          1          0          1          0          1
```

---
#`select()` several variables

For more features of `select()`, see `?select`.

--

.pull-left[

```r
# Select from tibble Dat the variables name and nr_kids,
# and assign them to the new object Dat_small1.
(Dat_small1 <- select(Dat, name, nr_kids))
```

```
## # A tibble: 5 × 2
##   name   nr_kids
##   <chr>    <dbl>
## 1 Agnes        1
## 2 Martin       0
## 3 Hakan        3
## 4 Tu           0
## 5 Thais        4
```

]

--

.pull-right[

```r
# Select from object Dat all variables that start with n,
# and assign the result to a new object Dat_small.
(Dat_small <- select(Dat, starts_with("n")))
```

```
## # A tibble: 5 × 2
##   name   nr_kids
##   <chr>    <dbl>
## 1 Agnes        1
## 2 Martin       0
## 3 Hakan        3
## 4 Tu           0
## 5 Thais        4
```

]

---
#Use `select()` to drop variables

.pull-left[

Drop variable "age":

```r
# Remove variable "age", similar to "drop" in Stata.
(Dat_small2 <- select(Dat, -age))
```

```
## # A tibble: 5 × 4
##   name   conti      employed nr_kids
##   <chr>  <fct>      <lgl>      <dbl>
## 1 Agnes  Europe     FALSE          1
## 2 Martin Africa     TRUE           0
## 3 Hakan  Africa     TRUE           3
## 4 Tu     Asia       TRUE           0
## 5 Thais  S. America TRUE           4
```

```r
# To drop several variables at once:
(Dat_small2 <- select(Dat, -c(name, age)))
```

]

.pull-right[

Drop variables "name" and "nr_kids" (number of children):

```r
(Dat_small3 <- select(Dat, -starts_with("n")))
```

```
## # A tibble: 5 × 3
##     age conti      employed
##   <dbl> <fct>      <lgl>   
## 1    34 Europe     FALSE   
## 2    22 Africa     TRUE    
## 3    42 Africa     TRUE    
## 4    12 Asia       TRUE    
## 5    76 S. America TRUE    
```

]

---
# `filter()` cases based on values in certain variables

```r
# Either use dplyr's (part of tidyverse) filter() function
# to return all cases contained in Dat with value "Africa" in conti.
(Dat_small1 <- dplyr::filter(Dat, conti == "Africa"))
```

```
## # A tibble: 2 × 5
##   name     age conti  employed nr_kids
##   <chr>  <dbl> <fct>  <lgl>      <dbl>
## 1 Martin    22 Africa TRUE           0
## 2 Hakan     42 Africa TRUE           3
```

```r
# Or, with tidyverse loaded, without the "dplyr::" prefix:
(Dat_small1 <- filter(Dat, conti == "Africa"))
```

---
#Select cases using `[]`

```r
Dat
```

```
## # A tibble: 5 × 5
##   name     age conti      employed nr_kids
##   <chr>  <dbl> <fct>      <lgl>      <dbl>
## 1 Agnes     34 Europe     FALSE          1
## 2 Martin    22 Africa     TRUE           0
## 3 Hakan     42 Africa     TRUE           3
## 4 Tu        12 Asia       TRUE           0
## 5 Thais     76 S. America TRUE           4
```

```r
# Or use the index to achieve the same as filter().
(Dat_small2 <- Dat[Dat$conti == "Africa", ])
```

```
## # A tibble: 2 × 5
##   name     age conti  employed nr_kids
##   <chr>  <dbl> <fct>  <lgl>      <dbl>
## 1 Martin    22 Africa TRUE           0
## 2 Hakan     42 Africa TRUE           3
```

---
#Select cases using `[]`

Compare three pieces of code to understand `[row, column]`.
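
A side note for reading the index results, as a minimal sketch on a hypothetical toy table (`df` and `tbl` are made-up names, not part of the session data): when a single column is requested with `[ , ]`, a base data frame and a tibble return different kinds of objects.

```r
library(tibble)

# Hypothetical toy data, used only for this illustration.
df  <- data.frame(name = c("Agnes", "Martin", "Hakan"), age = c(34, 22, 42))
tbl <- as_tibble(df)

df[, 2]    # base data frame: a single column is simplified to a plain vector
tbl[, 2]   # tibble: a single column stays a 3 x 1 tibble
tbl[[2]]   # double brackets (or tbl$age) extract the column as a vector
tbl[2, ]   # row 2, all columns
tbl[2, 1]  # row 2, column 1: a 1 x 1 tibble
```

This is why the results on this slide are printed as tibbles, even when only one cell is requested.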
--

.pull-left[

```r
Dat
```

```
## # A tibble: 5 × 5
##   name     age conti      employed nr_kids
##   <chr>  <dbl> <fct>      <lgl>      <dbl>
## 1 Agnes     34 Europe     FALSE          1
## 2 Martin    22 Africa     TRUE           0
## 3 Hakan     42 Africa     TRUE           3
## 4 Tu        12 Asia       TRUE           0
## 5 Thais     76 S. America TRUE           4
```

```r
# Row 1, column 3: a single cell.
(Dat_small3 <- Dat[1, 3])
```

```
## # A tibble: 1 × 1
##   conti 
##   <fct> 
## 1 Europe
```

]

.pull-right[

```r
# Row 1, all columns.
(Dat_small3 <- Dat[1, ])
```

```
## # A tibble: 1 × 5
##   name    age conti  employed nr_kids
##   <chr> <dbl> <fct>  <lgl>      <dbl>
## 1 Agnes    34 Europe FALSE          1
```

```r
# All rows, column 2.
(Dat_small3 <- Dat[, 2])
```

```
## # A tibble: 5 × 1
##     age
##   <dbl>
## 1    34
## 2    22
## 3    42
## 4    12
## 5    76
```

]

---
#Transform, recode & generate variables of tibbles

To transform or recode a variable, use `$` to state which tibble the variable belongs to.

```r
# Center age around the average age.
(Dat$age <- Dat$age - mean(Dat$age))
```

```
## [1]  -3.2 -15.2   4.8 -25.2  38.8
```

```r
# Recode "Africa" to "Afrika!".
(Dat$conti <- fct_recode(Dat$conti, "Afrika!" = "Africa"))
```

```
## [1] Europe     Afrika!    Afrika!    Asia       S. America
## Levels: Afrika! Asia Australia Europe N. America S. America
```

If you would rather generate a new variable, just give it a new name.

```r
# Divide the (already centered) age by its standard deviation;
# now it is z-standardized (mean = 0, sd = 1).
(Dat$z_age <- Dat$age / sd(Dat$age))
```

```
## [1] -0.1305090 -0.6199178  0.1957635 -1.0277584  1.5824217
```

---
#Transform, recode & generate several variables of a tibble

If you want to transform and recode several variables from the same tibble, get used to using `mutate()`.

```r
(Dat <- mutate(Dat, # Use the Dat tibble.
  nr_kids = nr_kids - mean(nr_kids), # Transform to deviation from average.
  z_nr_kids = nr_kids / sd(nr_kids), # z-standardize nr_kids.
  conti = fct_recode(conti, # Recode conti.
    "Europa!" = "Europe", # "Europe" to "Europa!".
    "Asien!" = "Asia") # "Asia" to "Asien!".
  ) # Don't forget to close mutate's bracket ")"
)
```

```
## # A tibble: 5 × 7
##   name      age conti      employed nr_kids  z_age z_nr_kids
##   <chr>   <dbl> <fct>      <lgl>      <dbl>  <dbl>     <dbl>
## 1 Agnes   -3.20 Europa!    FALSE       -0.6 -0.131    -0.330
## 2 Martin -15.2  Afrika!    TRUE        -1.6 -0.620    -0.881
## 3 Hakan    4.8  Afrika!    TRUE         1.4  0.196     0.771
## 4 Tu     -25.2  Asien!     TRUE        -1.6 -1.03     -0.881
## 5 Thais   38.8  S. America TRUE         2.4  1.58      1.32 
```

**Attention: look at how RStudio structures the brackets!**

---
#Conditional transform & recode (i.e., for filtered cases)

To transform/recode only among certain cases, use `case_when()`.

```r
(Dat1 <- mutate(
  Dat, # Mutate variables contained in Dat.
  conti = case_when( # Start conditional recode of conti,
    employed == FALSE ~ "Atlantis", # 1. complex condition ~ new value "Atlantis",
    age < 0 & nr_kids < -1 ~ "Antarctica" # 2. complex condition ~ new value "Antarctica".
  ) # close case_when's bracket.
))
```

```
## # A tibble: 5 × 7
##   name      age conti      employed nr_kids  z_age z_nr_kids
##   <chr>   <dbl> <chr>      <lgl>      <dbl>  <dbl>     <dbl>
## 1 Agnes   -3.20 Atlantis   FALSE       -0.6 -0.131    -0.330
## 2 Martin -15.2  Antarctica TRUE        -1.6 -0.620    -0.881
## 3 Hakan    4.8  <NA>       TRUE         1.4  0.196     0.771
## 4 Tu     -25.2  Antarctica TRUE        -1.6 -1.03     -0.881
## 5 Thais   38.8  <NA>       TRUE         2.4  1.58      1.32 
```

**Why does "conti" become a character variable?**

**Why does "conti" have two missing values (`NA`)?**

---
#Conditional transform & recode (i.e., for filtered cases)

Let us compare the following with the previous slide:

```r
(Dat2 <- mutate(
  Dat, # Mutate variables contained in Dat.
  conti = case_when( # Start conditional recode of conti,
    employed == FALSE ~ "Atlantis", # 1. complex condition ~ new value "Atlantis",
    age < 0 & nr_kids < -1 ~ "Antarctica", # 2. complex condition ~ new value "Antarctica",
    TRUE ~ as.character(conti) # 3. and all others ~ keep as is (as character).
  ) # close case_when's bracket.
))
```

```
## # A tibble: 5 × 7
##   name      age conti      employed nr_kids  z_age z_nr_kids
##   <chr>   <dbl> <chr>      <lgl>      <dbl>  <dbl>     <dbl>
## 1 Agnes   -3.20 Atlantis   FALSE       -0.6 -0.131    -0.330
## 2 Martin -15.2  Antarctica TRUE        -1.6 -0.620    -0.881
## 3 Hakan    4.8  Afrika!    TRUE         1.4  0.196     0.771
## 4 Tu     -25.2  Antarctica TRUE        -1.6 -1.03     -0.881
## 5 Thais   38.8  S. America TRUE         2.4  1.58      1.32 
```

---
#Conditional transform & recode (i.e., for filtered cases)

Let us see what the three lines of the `case_when()` call do:

1. For all non-employed ~ recode conti to "Atlantis".
2. For all those whose (centered) age is below 0 and whose (centered) number of kids is below -1 ~ recode conti to "Antarctica".
3. All remaining cases ~ use their original conti values, but transform them to a character vector.

Remember, `case_when()` is type sensitive! It will not combine values of different types into one vector. Because we give `case_when()` "Atlantis" and "Antarctica" as new values, the result has to be a character vector. Thus we cannot supply the former factor values (which are stored as integer codes) in line three; we first convert them with `as.character()`.

```r
(Dat2 <- mutate(
  Dat, # Mutate variables contained in Dat.
  conti = case_when( # Start conditional recode of conti,
    employed == FALSE ~ "Atlantis", # 1. complex condition ~ new value "Atlantis",
    age < 0 & nr_kids < -1 ~ "Antarctica", # 2. complex condition ~ new value "Antarctica",
    TRUE ~ as.character(conti) # 3. and all others ~ keep as is (as character).
  ) # close case_when's bracket.
))
```

---
#`case_when()` and `NA`

```r
(Dat2 <- mutate(
  Dat,
  conti = case_when( # Start conditional recode of conti,
    employed == FALSE ~ "Atlantis", # 1. complex condition ~ new value "Atlantis",
    age < 0 & nr_kids < -1 ~ "Antarctica", # 2. complex condition ~ new value "Antarctica",
    TRUE ~ NA # 3. and all others ~ NA
  ) # close case_when's bracket.
))
```

```
## Error in `mutate()`:
## ! Problem while computing `conti = case_when(...)`.
## Caused by error in `` names(message) <- `*vtmp*` ``:
## ! 'names' attribute [1] must be the same length as the vector [0]
```

Why can we not set all other values of "conti" to `NA`?

In R, vectors are type sensitive; the elements of one vector must all be of one type. If you type `?NA`, you will learn:

"NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw."

A plain `NA` is therefore logical, which clashes with the character values in lines one and two. (This is the behaviour of the dplyr version used for these slides; newer versions, from dplyr 1.1.0 on, accept a plain `NA` here.)

---
#`case_when()` and `NA`

`NA`s are type sensitive, too! Therefore, coerce `NA` to character/numeric with `as.character()`/`as.numeric()`.

```r
(Dat3 <- mutate(
  Dat,
  conti = case_when( # Start conditional recode of conti,
    employed == FALSE ~ "Atlantis", # 1. complex condition ~ new value "Atlantis",
    age < 0 & nr_kids < -1 ~ "Antarctica", # 2. complex condition ~ new value "Antarctica",
    TRUE ~ as.character(NA) # 3. and all others ~ NA
  ) # close case_when's bracket.
))
```

```
## # A tibble: 5 × 7
##   name      age conti      employed nr_kids  z_age z_nr_kids
##   <chr>   <dbl> <chr>      <lgl>      <dbl>  <dbl>     <dbl>
## 1 Agnes   -3.20 Atlantis   FALSE       -0.6 -0.131    -0.330
## 2 Martin -15.2  Antarctica TRUE        -1.6 -0.620    -0.881
## 3 Hakan    4.8  <NA>       TRUE         1.4  0.196     0.771
## 4 Tu     -25.2  Antarctica TRUE        -1.6 -1.03     -0.881
## 5 Thais   38.8  <NA>       TRUE         2.4  1.58      1.32 
```

```r
# You can also use NA_character_ instead of as.character(NA).
```

---
#About pairfam

---
#Import downloaded data

- Import Stata, SPSS, SAS files: ["haven" package](https://haven.tidyverse.org/)
- Import csv, tsv, fwf files: ["readr" package](https://readr.tidyverse.org/)
- Import Excel's xlsx files: ["readxl" package](https://readxl.tidyverse.org/)

```r
# Create an object pairfam and assign the
# imported anchor1_50percent_Eng.dta to it, if you have
# downloaded it into your intro_r folder.
# library(haven) # Make sure that you have loaded the "haven" package.
pairfam <- read_dta("anchor1_50percent_Eng.dta")
pairfam
```

```
## # A tibble: 6,201 × 1,458
##           id demodiff    wave    sample      pid paren…¹ paren…² paren…³ paren…⁴
##        <dbl> <dbl+lbl>   <dbl+l> <dbl+l>   <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l>
##  1 267206000 0 [0 non-d… 1 [1 2… 1 [1 p…      NA      NA      NA      NA      NA
##  2 112963000 0 [0 non-d… 1 [1 2… 1 [1 p…      NA      NA      NA      NA      NA
##  3 327937000 0 [0 non-d… 1 [1 2… 1 [1 p…  3.28e8      NA      NA      NA      NA
##  4 318656000 0 [0 non-d… 1 [1 2… 1 [1 p…  3.19e8  3.19e8      NA      NA      NA
##  5 717889000 0 [0 non-d… 1 [1 2… 1 [1 p…  7.18e8  7.18e8  7.18e8      NA      NA
##  6 222517000 0 [0 non-d… 1 [1 2… 1 [1 p…      NA      NA      NA      NA      NA
##  7 144712000 0 [0 non-d… 1 [1 2… 1 [1 p…      NA      NA      NA      NA      NA
##  8 659357000 0 [0 non-d… 1 [1 2… 1 [1 p…  6.59e8      NA      NA      NA      NA
##  9 506367000 0 [0 non-d… 1 [1 2… 1 [1 p…  5.06e8  5.06e8      NA      NA      NA
## 10  64044000 0 [0 non-d… 1 [1 2… 1 [1 p…      NA      NA      NA      NA      NA
## # … with 6,191 more rows, 1,449 more variables: parentidk5 <dbl+lbl>,
## #   parentidk6 <dbl+lbl>, parentidk7 <dbl+lbl>, parentidk8 <dbl+lbl>,
## #   parentidk9 <dbl+lbl>, parentidk10 <dbl+lbl>, parentidk11 <dbl+lbl>,
## #   parentidk12 <dbl+lbl>, parentidk13 <dbl+lbl>, parentidk14 <dbl+lbl>,
## #   parentidk15 <dbl+lbl>, sex_gen <dbl+lbl>, psex_gen <dbl+lbl>,
## #   k1sex_gen <dbl+lbl>, k2sex_gen <dbl+lbl>, k3sex_gen <dbl+lbl>,
## #   k4sex_gen <dbl+lbl>, k5sex_gen <dbl+lbl>, k6sex_gen <dbl+lbl>, …
```

---
#First steps with pairfam

.pull-left[

```r
table(pairfam$sex_gen)
```

```
## 
##    1    2 
## 3029 3172
```

]

.pull-right[

<img src="https://github.com/fancycmn/Slide3/blob/main/Pic2.PNG?raw=true" width="100%" style="display: block; margin: auto;">

]

The labels have disappeared!

---
#First steps with pairfam

The labels have disappeared.

```r
# Let us get the labels back.
class(pairfam$sex_gen)
```

```
## [1] "haven_labelled" "vctrs_vctr"     "double"
```

```r
pairfam$sex_new <- as_factor(pairfam$sex_gen)
table(pairfam$sex_new)
```

```
## 
##                -10 not in demodiff                 -7 Incomplete data 
##                                  0                                  0 
## -4 Filter error / Incorrect entry                  -3 Does not apply 
##                                  0                                  0 
##                             1 Male                           2 Female 
##                               3029                               3172
```

```r
class(pairfam$sex_new)
```

```
## [1] "factor"
```

---
#Importing labelled data

R imports Stata and SPSS labels, but most functions cannot make use of them.
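
What this means in practice can be tried out without the pairfam file. The following is a minimal sketch on a hand-made labelled vector (`sex`, `sex_fct`, and `sex_num` are hypothetical names chosen for illustration); it assumes only that the haven package is installed.

```r
library(haven)

# Hypothetical labelled vector, built by hand to mimic an imported Stata variable.
sex <- labelled(
  c(1, 2, 2, 1, 2),
  labels = c("Male" = 1, "Female" = 2)
)

class(sex)                  # a "haven_labelled" vector, not a factor

sex_fct <- as_factor(sex)   # value labels become factor levels
table(sex_fct)              # counts by "Male" and "Female"

sex_num <- zap_labels(sex)  # value labels dropped; a plain numeric vector remains
class(sex_num)
```

The two conversions serve different purposes: `as_factor()` suits categorical variables such as sex, while `zap_labels()` suits variables that should be treated as numbers, such as age.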
We need to convert each labelled variable to numeric or factor at the outset of the analysis!

After importing pairfam into R, "sex_gen" is a labelled numeric variable. `as_factor()` can be used to turn "sex_gen" from a numeric variable into a categorical variable (a factor).

---
#Importing labelled data

Remove labels:

```r
# zap_labels() removes the value labels and returns a plain numeric vector.
# (zap_label(), singular, would only remove the variable label.)
pairfam$agea <- zap_labels(pairfam$age)
summary(pairfam$agea)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   17.00   26.00   25.84   35.00   38.00
```

---
#Take home

1. Data frames organize equally-sized vectors along their indices; tibbles are modernized data frames.
2. `$` identifies a variable, and `[row, column]` identifies a cell in a table.
3. How to select information from a table
    - select() to select several variables
    - filter() to select cases based on values in certain variables
    - [] can also be used to select
4. Transform, recode, and generate variables of a tibble
5. Import a dataset into R

---
# Important code

- data.frame(): Organize several vectors of equal length by their index.
- as_tibble(): Take a data frame and make it a tibble.
- summary(): Give a summary of an object.
- select(): Select several variables from a data frame/tibble.
- filter(): Filter cases based on values in certain variables.
- mutate(): Add new variables and preserve existing ones. Good for recoding several variables.
- case_when(): Conditional recode for cases filtered in complex ways.
- as_factor(): Make a labelled Stata/SPSS variable a factor.
- zap_labels(): Make a labelled Stata/SPSS variable numeric.

---
class: center, middle

#[Exercise](https://rpubs.com/fancycmn/942601)