class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## R basics I
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---
#Vector and Data Frame

- Vector
  - numeric vector
  - factor (categorical) vector
  - other types of vector
- How to use packages
- How to code missing values in R
- Data frame: how to build a dataset from vectors
- Import data: dta format, csv format, xls format

---
#Vector

- The most fundamental data type in R is a vector: a sequence/chain/series of information. You can "concatenate" information into a vector by using **c()**

```r
# Concatenate the sequence 5 4 3 2 4 to a vector, assign it to object x1, and print it.
(x1 <- c(5, 4, 3, 2, 4))
```

```
## [1] 5 4 3 2 4
```

- You can access a specific element of a vector via its index, vector[index]:

```r
# Print the second element of x1.
x1[2]
```

```
## [1] 4
```

- If you want to access several elements, you need to supply another vector as the index:

```r
# Print the second and fourth element of x1.
x1[c(2, 4)]
```

```
## [1] 4 2
```

---
#Types of commonly used vectors

- Numeric vector
- Factor vector*
- Character vector
- Logical vector
- Date vector

---
#Numeric vectors

- Numeric vectors are sequences of numbers.

```r
x1
```

```
## [1] 5 4 3 2 4
```

```r
class(x1) # What class is this object?
```

```
## [1] "numeric"
```

---
#Transform numeric vectors

- arithmetic operators **"+ - * / ^"**
- a myriad of different functions
- combine arithmetic operators and several functions

```r
# Arithmetic operators
(x1 <- x1 + 5) # Add 5 to each element of x1 and assign the result back to x1.
```

```
## [1] 10  9  8  7  9
```

```r
(x1 <- log2(x1)) # Assign the logarithm with base 2 of x1 to x1.
```

```
## [1] 3.321928 3.169925 3.000000 2.807355 3.169925
```

```r
(x1 <- (x1 - mean(x1)) / sd(x1)) # z-standardize x1.
```

```
## [1]  1.1606991  0.3872282 -0.4774386 -1.4577170  0.3872282
```

---
#Factor vectors

Factors are for categorical variables that make a distinction but whose values cannot be compared on a common scale. They are composed of a sequence of categorical values (i.e., argument x) and an (ideally comprehensive) list of potential levels (i.e., theoretically possible values).

```r
# Concatenate argument "x" to a factor and give it a
# comprehensive list "levels" of all potential categories.
intimate <- factor(
  x = c("Yes", "Yes", "No", "No", "No"),
  levels = c("Yes", "No", "Don't know", "No answer"))
# Print a frequency table of our new factor vector.
table(intimate)
```

```
## intimate
##        Yes         No Don't know  No answer 
##          2          3          0          0
```

---
#Factor

R forces you to decide whether a variable is continuous (numeric) or categorical (factor).

- Numeric variables have a scale, such as cm, years, or DKK. Hence there is no need for labels.
- Categorical variables, by contrast, have no actual representation in numbers.
- Because factors are categorical, they cannot be numerically transformed.
- If you nevertheless try to numerically transform a factor, the result is **NA** (i.e., "not available").

```r
intimate * 2
```

```
## Warning in Ops.factor(intimate, 2): '*' not meaningful for factors
```

```
## [1] NA NA NA NA NA
```

Next, we will learn to recode factors with functions from the **forcats** package, which is part of the **tidyverse**.

---
#Wait, what's a package?

.left-column[
<img src="http://cdn.osxdaily.com/wp-content/uploads/2016/03/package-file-check.jpg" width="60%" style="display: block; margin: auto;" >
]

.right-column[
- A package is a collection of functions and their documentation (sometimes also data).
- Some packages come pre-installed with R (for example, base and stats). In addition, there are currently more than 17,000 user-written packages on the Comprehensive R Archive Network (CRAN).
  The <span style="color:blue">tidyverse</span> and its <span style="color:blue">forcats</span> package are such user-written packages.
- How to install a package
  - with code
  - with a click
]

---
#Install package

```r
install.packages("name_of_the_package")
# For example, install the package called "tidyverse":
install.packages("tidyverse") # Don't forget the quotation marks ""
```

<span style="color:red">**Do NOT put install.packages() in your R scripts! You need to install a package only once, not every time you run your script. To update installed packages, use update.packages().**</span>

---
#How to use a package

- Use library() to load a package.
- Because there are so many user-written packages, they often contain functions with *conflicting names*. To avoid conflicts, you need to specify, for each R session, which packages you want to work with. You do that by adding the packages to your current R session's **library**.
- It is good practice to add all packages to the library at the very top of an R script. Please add the tidyverse to your library by writing the following as the very first line in your R script.

```r
# Use library() to load the tidyverse package.
library(tidyverse)
# If you want to load several packages at the same time:
# install.packages("pacman")
pacman::p_load(tidyverse, ggplot2)
# p_load() checks whether a package is installed;
# if it is not installed, it installs it; then it loads it (like library()).
```

Some tidyverse functions conflict with functions in base R's stats package. You can always address a function from a specific package by prefixing it with the package name: `package::function()`. For example, <span style="color:red">`pacman::p_load()`</span>

---
#Back to recoding factors

Factors can best be **recoded** with `forcats::fct_recode()`.

```r
# Recode intimate to make Y for Yes and N for No. Watch out: first the new, then the old value...
table(intimate)
```

```
## intimate
##        Yes         No Don't know  No answer 
##          2          3          0          0
```

```r
intimate1 <- fct_recode(intimate, "Y" = "Yes", "N" = "No") # new coding = old coding
# Frequency table of intimate1.
table(intimate1)
```

```
## intimate1
##          Y          N Don't know  No answer 
##          2          3          0          0
```

---
#Back to recoding factors

forcats contains many more useful functions to handle factors! Also check out [Chapter 15 Factors](https://r4ds.had.co.nz/factors.html) of Wickham and Grolemund (2017).

```r
intimate2 <- fct_drop(intimate) # Drop unused levels
table(intimate2)
```

```
## intimate2
## Yes  No 
##   2   3
```

```r
intimate3 <- fct_relevel(intimate, "Don't know", "No answer", "No", "Yes") # Re-order levels
table(intimate3)
```

```
## intimate3
## Don't know  No answer         No        Yes 
##          0          0          3          2
```

---
#Other types of vectors

- Character/string vectors are sequences of text. (an extended example)

```r
# Concatenate these five strings as one vector
# and assign it to object "x2".
x2 <- c("I", "love", "this", "course", "!")
```

- Logical vectors are sequences of TRUE and FALSE values:

```r
x3 <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
```

- Date vectors hold dates in the Year-Month-Day (and sometimes -Time) format.

```r
Sys.Date() # Tell me the date
```

```
## [1] "2025-09-10"
```

```r
class(Sys.Date())
```

```
## [1] "Date"
```

---
#No mixed vectors!!!

```r
# Create the sequence 1 to 4 as a numeric (integer) vector
(x4 <- seq(1, 4))
```

```
## [1] 1 2 3 4
```

```r
class(x4)
```

```
## [1] "integer"
```

```r
# Replace the third element with the word "test"
x4[3] <- "test"
x4
```

```
## [1] "1"    "2"    "test" "4"
```

```r
class(x4)
```

```
## [1] "character"
```

---
#No mixed vectors!!!

```r
# What type of object is x4?
class(x4)
```

```
## [1] "character"
```

--

```r
# Make x4 a numeric vector again
as.numeric(x4)
```

```
## Warning: NAs introduced by coercion
```

```
## [1]  1  2 NA  4
```

---
#NA: Not Available

In R, missing values are coded as `NA`.
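
A minimal sketch, not part of the original slides, of how `NA` behaves: it propagates through arithmetic and comparisons, which is why missing values are detected with `is.na()` rather than `==`. The object names are made up for illustration.

```r
# Illustrative object names (na_sum, na_equal, ...) are assumptions.
na_sum   <- NA + 1      # arithmetic with NA gives NA
na_equal <- NA == NA    # comparing with NA also gives NA, not TRUE
na_check <- is.na(NA)   # is.na() is the reliable test: TRUE

# A missing value does not change the class of a vector.
num_class <- class(c(1, NA))     # "numeric"
chr_class <- class(c("a", NA))   # "character"

print(na_check)
print(c(num_class, chr_class))
```

Because `NA == NA` is itself `NA`, a test such as `x == NA` can never find missing values.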
```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)
```

--

Many functions do not ignore NA by default and therefore return NA.

```r
# Estimate mean of x5
mean(x5)
```

```
## [1] NA
```

```r
# Estimate mean of x5 ignoring the NA (i.e., casewise deletion)
mean(x5, na.rm = TRUE) # na.rm is an argument that tells R to remove NA before calculating
```

```
## [1] 3.6
```

The argument <span style="color:blue">na.rm</span> in `mean(x5, na.rm = TRUE)` controls whether NA values are removed before the calculation.

---
#NA: Not Available

is.na() generates logical vectors that identify missing values.

```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)
# Which elements are not missing?
!is.na(x5)
```

```
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
```

---
#NA: Not Available

```r
# How many are missing?
table(is.na(x5))
```

```
## 
## FALSE  TRUE 
##     5     2
```

```r
# Print only non-missing values of x5
x5[!is.na(x5)]
```

```
## [1] 1 2 3 5 7
```

---
#Some vectors

```r
age <- c(34, 22, 42, 12, 76)
conti <- factor(x = c("Europe", "Africa", "Africa", "Asia", "S. America"),
                levels = c("Africa", "Asia", "Australia", "Europe", "N. America", "S. America"))
employed <- c(FALSE, TRUE, TRUE, TRUE, TRUE)
name <- c("Agnes", "Martin", "Hakan", "Tu", "Thais")
nr_kids <- c(1, 0, 3, 0, 4)
```

---
#Data frames

Data frames organize vectors of **equal length** along their indices.

.pull-left[

```r
# Bind our 5 vectors along their index into a data frame.
# Assign that data frame to object "Dat".
(Dat <- data.frame(name, age, conti, employed, nr_kids))
```

```
##     name age      conti employed nr_kids
## 1  Agnes  34     Europe    FALSE       1
## 2 Martin  22     Africa     TRUE       0
## 3  Hakan  42     Africa     TRUE       3
## 4     Tu  12       Asia     TRUE       0
## 5  Thais  76 S. America     TRUE       4
```
]

.pull-right[

```r
age <- c(34, 22, 42, 12, NA)
name <- c("Agnes", "Martin", "Hakan", "Tu", "Thais")
(Dat_wNA <- data.frame(name, age))
```

```
##     name age
## 1  Agnes  34
## 2 Martin  22
## 3  Hakan  42
## 4     Tu  12
## 5  Thais  NA
```
]

---
#Data frames

.center[Data frames are the typical "rectangular" way to organize data:]

<img src="https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" width="80%" style="display: block; margin: auto;" >

---
#Import "pairfam" data

- Import Stata, SPSS, SAS files: ["haven" package](https://haven.tidyverse.org/)
- Import csv, tsv, fwf files: ["readr" package](https://readr.tidyverse.org/)
- Import Excel's xlsx files: ["readxl" package](https://readxl.tidyverse.org/)

---
#Import "pairfam" data

```r
# Create an object pairfam and assign the imported anchor1_50percent_Eng.dta to it.
# If you have downloaded the file, please put it into your current working directory.
# If you are not sure what your current working directory is:
getwd()
```

```
## [1] "C:/Users/rxv320/OneDrive - University of Copenhagen/Documents/My docs/My teaching/2025/Advanced quant/2 R basics-I/R basics-1"
```

```r
# I put the data in the folder:
# "C:/Users/rxv320/Documents/My docs/My teaching/2025/Advanced quant/2 R basics-I/R basics-1"
# Make sure that you load the "haven" package first.
library(haven)
pairfam <- read_dta("anchor1_50percent_Eng.dta")
pairfam
```

```
## # A tibble: 6,201 × 1,458
##           id demodiff   wave    sample       pid parentidk1 parentidk2 parentidk3
##        <dbl> <dbl+lbl>  <dbl+l> <dbl+l>    <dbl> <dbl+lbl>  <dbl+lbl>  <dbl+lbl> 
##  1 267206000 0 [0 non-… 1 [1 2… 1 [1 p…       NA         NA         NA         NA
##  2 112963000 0 [0 non-… 1 [1 2… 1 [1 p…       NA         NA         NA         NA
##  3 327937000 0 [0 non-… 1 [1 2… 1 [1 p…   3.28e8         NA         NA         NA
##  4 318656000 0 [0 non-… 1 [1 2… 1 [1 p…   3.19e8  318656101         NA         NA
##  5 717889000 0 [0 non-… 1 [1 2… 1 [1 p…   7.18e8  717889101  717889101         NA
##  6 222517000 0 [0 non-… 1 [1 2… 1 [1 p…       NA         NA         NA         NA
##  7 144712000 0 [0 non-… 1 [1 2… 1 [1 p…       NA         NA         NA         NA
##  8 659357000 0 [0 non-… 1 [1 2… 1 [1 p…   6.59e8         NA         NA         NA
##  9 506367000 0 [0 non-… 1 [1 2… 1 [1 p…   5.06e8  506367101         NA         NA
## 10  64044000 0 [0 non-… 1 [1 2… 1 [1 p…       NA         NA         NA         NA
## # ℹ 6,191 more rows
## # ℹ 1,450 more variables: parentidk4 <dbl+lbl>, parentidk5 <dbl+lbl>,
## #   parentidk6 <dbl+lbl>, parentidk7 <dbl+lbl>, parentidk8 <dbl+lbl>,
## #   parentidk9 <dbl+lbl>, parentidk10 <dbl+lbl>, parentidk11 <dbl+lbl>,
## #   parentidk12 <dbl+lbl>, parentidk13 <dbl+lbl>, parentidk14 <dbl+lbl>,
## #   parentidk15 <dbl+lbl>, sex_gen <dbl+lbl>, psex_gen <dbl+lbl>,
## #   k1sex_gen <dbl+lbl>, k2sex_gen <dbl+lbl>, k3sex_gen <dbl+lbl>, …
```

---
#Take home

1. Vector: a sequence/chain/series of information. Elements of a vector can be addressed via its index [i].
2. Classes of vectors: numeric, factor, date, character, and logical vectors.
3. Numeric vs. categorical variables: numeric variables have a scale (e.g., cm, years, DKK), while categorical variables have no true representation in numbers.
4. Packages: bundles of functions along with their documentation. You need to install a package once and load it with `library()` in each session.
5. `NA`: stands for "Not Available" and is the code for missing values in R.
6. Data frames organize vectors of equal length along their indices.

---
#Important codes

- `install.packages()`: install packages from CRAN.
- `library()`: add a package to the library for the current session.
- `c()`: concatenate a sequence to a vector.
- `factor()`: make a vector categorical.
- `fct_recode()`: recode values of a factor.
- `fct_relevel()`: change the order of a factor's levels.
- `as.numeric()`: make a vector numeric.
- `table()`: simple frequency or cross table.
- `is.na()`: generate a logical vector that identifies missing values.
- `data.frame()`: combine vectors to make a dataset.
- `read_dta()`: read a dataset in Stata format (dta).
- `getwd()`: get the current working directory.

---
class: center, middle

#[Exercise](https://rpubs.com/fancycmn/1341250)

You can use ChatGPT for help.
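
---
#Putting it together

A minimal recap sketch, not part of the original slides, that combines the take-home points: vectors, a factor, a data frame, and `NA` handling. It reuses the example values from the data frame slides. The object names `dat`, `n_missing`, `mean_age`, and `complete` are made up for illustration, and the `$` column accessor is base R syntax that these slides do not cover.

```r
# Vectors of equal length (values reused from the data frame slides).
name <- c("Agnes", "Martin", "Hakan", "Tu", "Thais")
age  <- c(34, 22, 42, 12, NA)   # the last age is missing
conti <- factor(
  x = c("Europe", "Africa", "Africa", "Asia", "S. America"),
  levels = c("Africa", "Asia", "Australia", "Europe", "N. America", "S. America"))

# Combine the vectors into a data frame; "dat" is a hypothetical name.
dat <- data.frame(name, age, conti)

# dat$age extracts the age column as a vector.
n_missing <- sum(is.na(dat$age))          # number of missing ages: 1
mean_age  <- mean(dat$age, na.rm = TRUE)  # mean of the 4 observed ages: 27.5

# Keep only the rows where age is not missing.
complete <- dat[!is.na(dat$age), ]

table(dat$conti)   # frequency table, including unused levels
nrow(complete)     # 4
```

Subsetting rows with `!is.na()` is the data frame counterpart of `x5[!is.na(x5)]` from the NA slides.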