Advanced quantitative data analysis

class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## R basics I
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Vector and Data Frame
- Vector
  - numeric vector
  - factor (categorical) vector
  - other types of vector
- How to use packages
- How to code missing values in R
- Dataframe: how to build a dataset from vectors
- Import data: dta format, csv format, xls format

---
#Vector 
- The most fundamental data type in R is a vector: a sequence/chain/series of information. You can "concatenate" information into a vector by using **c()**.

```r
# Concatenate the sequence 5 4 3 2 4 to a vector, assign it to object x1, and print it.
(x1 <- c(5, 4, 3, 2, 4))
```

```
## [1] 5 4 3 2 4
```

- You can access a specific element of a vector via its index, written as vector[index]:

```r
# Print the second element of x1.
x1[2]
```

```
## [1] 4
```
- If you want to access several elements, you need to supply another vector to the vector's index:

```r
# Print the second and fourth element of x1.
x1[c(2, 4)]
```

```
## [1] 4 2
```
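
As a small supplementary sketch (not part of the original slides), the same bracket notation also accepts negative indices, which drop elements, and it can be used on the left-hand side of an assignment to replace elements. The object name `v` is made up for illustration so that `x1` stays unchanged.

```r
# Hypothetical vector for illustration (same values as x1).
v <- c(5, 4, 3, 2, 4)

# A negative index drops that element instead of selecting it.
v[-1]

# Indexing on the left-hand side replaces the selected elements.
v[c(2, 4)] <- c(40, 20)
v
```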
---
#Commonly used types of vectors
- Numeric vector
- Factor vector*
- Character vector
- Logical vector
- Date vectors

---
#Numeric vectors
- Numeric vectors are sequences of numbers.

```r
x1
```

```
## [1] 5 4 3 2 4
```

```r
class(x1) # What class is this object?
```

```
## [1] "numeric"
```
---
#Transform numeric vectors
  - arithmetic operators **"+ - * / ^"**
  - a myriad of different functions
  - combine arithmetic operators and several functions

```r
#arithmetic operators
(x1 <- x1 + 5) # Add 5 to each element of x1 and assign the result back to x1.
```

```
## [1] 10  9  8  7  9
```

```r
(x1 <- log2(x1)) # Assign the logarithm with base 2 of x1 to x1.
```

```
## [1] 3.321928 3.169925 3.000000 2.807355 3.169925
```

```r
(x1 <- (x1 - mean(x1)) / sd(x1)) # z-standardize x1.
```

```
## [1]  1.1606991  0.3872282 -0.4774386 -1.4577170  0.3872282
```
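
A quick way to check a z-standardization is to confirm that the result has mean 0 and standard deviation 1. The sketch below does this on a made-up vector `z` (an illustrative name, not from the slides) and compares the manual formula with base R's `scale()`.

```r
# Hypothetical vector for illustration.
z <- c(10, 9, 8, 7, 9)
z_std <- (z - mean(z)) / sd(z)

# A z-standardized vector has mean 0 and standard deviation 1
# (up to floating-point rounding).
round(mean(z_std), 10)
sd(z_std)

# scale() performs the same centering and scaling. It returns a matrix,
# so as.numeric() turns the result back into a plain vector.
all.equal(z_std, as.numeric(scale(z)))
```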
---
#Factor vectors
Factors are for categorical variables: variables whose values distinguish groups but cannot be compared on a common scale. A factor is composed of a sequence of categorical values (i.e., argument x) and an (ideally comprehensive) list of potential levels (i.e., theoretically possible values).

```r
# Concatenate argument "x" to a factor and give it a
# comprehensive list "levels" of all potential categories.

intimate <- factor(
 x = c("Yes", "Yes", "No", "No", "No"),
 levels = c("Yes", "No", "Don't know",
 "No answer"))

# Print a frequency table of our new factor vector.
table(intimate)
```

```
## intimate
##        Yes         No Don't know  No answer 
##          2          3          0          0
```
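
To see why the list of levels should be comprehensive, here is a minimal sketch with a deliberately incomplete list. The factor `answer` and its values are made up for illustration.

```r
# Hypothetical factor whose levels do not cover every observed value.
answer <- factor(
  x = c("Yes", "No", "Maybe"),
  levels = c("Yes", "No"))

levels(answer)   # the levels as defined
nlevels(answer)  # number of levels
answer           # "Maybe" is not a defined level and is stored as NA
is.na(answer)
```

Any value that is not listed in `levels` silently becomes `NA`, so it is worth checking `table()` or `is.na()` after creating a factor.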

---
#Factor
R forces you to decide whether a variable is continuous (numeric) or categorical (factor)

- Numeric variables have a scale, such as cm, years, or DKK. Hence there is no need for labels.
- Categorical variables, by contrast, have no actual representation in numbers.
- Because factors are categorical, they cannot be numerically transformed.
- If you nevertheless try to numerically transform a factor,
the result is **NA** (i.e., "not available").

```r
intimate * 2
```

```
## Warning in Ops.factor(intimate, 2): '*' not meaningful for factors
```

```
## [1] NA NA NA NA NA
```
To recode factors, we will use functions from the **forcats** package, which is part of the **tidyverse**.
---
#Wait, what's a package?
.left-column[
<img src="http://cdn.osxdaily.com/wp-content/uploads/2016/03/package-file-check.jpg" width="60%" style="display: block; margin: auto;" >
]

.right-column[- A package is a collection of functions and their documentation (sometimes also data). 
- Some packages (such as base and stats) come pre-installed with R. In addition, there are currently more than 17,000 user-written packages on the Comprehensive R Archive Network (CRAN). The tidyverse and its forcats package are such user-written packages.

- How to install a package
  - with code
  - by point and click]

---
#Install package

```r
install.packages("name_of_the_package")

# For example, install the package called "tidyverse".
install.packages("tidyverse") # Don't forget the quotation marks ""
```

**Do NOT run install.packages() every time you run your R script! You need to install a package only once. If you want to update your installed packages, use update.packages().**
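
If you still want an installation step that is safe to leave in a script, one possible pattern is to install only when the package is missing. The helper name `install_if_missing()` is hypothetical and only meant as a sketch; it is built on base R's `requireNamespace()`.

```r
# Hypothetical helper: install a package only if it is not installed yet.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Return (invisibly) whether the package is available now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing.
install_if_missing("stats")

# For this course you could call, for example:
# install_if_missing("tidyverse")
```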
 
---
#How to use package
- Use library() to load a package
 - Because there are so many user-written packages, they often contain functions with *conflicting names*. To avoid conflicts, you need to specify for each R session which packages you want to work with. You do that by adding the packages to your current R session's **library**.

- It is good practice to add all packages to the library at the very top of an R script. Please add the tidyverse to your library by writing the following as the very first line of your R script.

```r
# Use library() to load the tidyverse package.
library(tidyverse)

# If you want to load several packages at the same time:
# install.packages("pacman")
pacman::p_load(tidyverse, ggplot2)
# p_load() checks whether a package is installed;
# if it is not installed, it installs it; then it loads it (like library()).
```
Some tidyverse functions conflict with functions from base R's stats package (for example, `dplyr::filter()` masks `stats::filter()`). You can always call a function from a specific package by prefixing it with the package name: `package::function()`.
For example, `pacman::p_load()`.
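
The `::` notation can be tried without installing anything, because it also works for the packages that ship with R. A minimal sketch:

```r
# Call functions through their package name with `::`.
stats::sd(c(1, 2, 3))    # standard deviation from the stats package
utils::head(letters, 3)  # first three letters, via the utils package

# As long as no other loaded package masks sd(), the prefixed and
# unprefixed names refer to the same function.
identical(stats::sd, sd)
```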

---
#Back to recoding factors

Factors can best be **recoded** with `forcats::fct_recode()`.

```r
# Recode intimate to make Y for Yes and N for No. Watch out: first the new, then the old value...
table(intimate)
```

```
## intimate
##        Yes         No Don't know  No answer 
##          2          3          0          0
```

```r
intimate1 <- fct_recode(intimate, 
 "Y" = "Yes", "N" = "No") # new coding=old coding

# Frequency table of intimate1.
table(intimate1) 
```

```
## intimate1
##          Y          N Don't know  No answer 
##          2          3          0          0
```
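
For comparison only, the same recode can be sketched in base R by renaming the levels directly. This is not the approach used in the slides, and the object name `intimate_base` is made up for illustration.

```r
# Re-create the factor so that the example is self-contained.
intimate_base <- factor(
  x = c("Yes", "Yes", "No", "No", "No"),
  levels = c("Yes", "No", "Don't know", "No answer"))

# Rename single levels by matching on the old label.
levels(intimate_base)[levels(intimate_base) == "Yes"] <- "Y"
levels(intimate_base)[levels(intimate_base) == "No"] <- "N"

table(intimate_base)
```

The forcats version is usually easier to read, because the old and new labels appear side by side in one call.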

---
#Back to recoding factors
forcats contains many more useful functions to handle factors!
Also check out [Chapter 15 Factors](https://r4ds.had.co.nz/factors.html) of Wickham and Grolemund (2017).

```r
intimate2 <- fct_drop(intimate) # drop unused levels
table(intimate2)
```

```
## intimate2
## Yes  No 
##   2   3
```

```r
intimate3 <- fct_relevel(intimate, "Don't know", "No answer", "No", "Yes") # re-order levels
table(intimate3)
```

```
## intimate3
## Don't know  No answer         No        Yes 
##          0          0          3          2
```

---
#Other types of vectors
- Character/string vectors are sequences of text. (an extended example)

```r
# Concatenate these five strings as one vector 
# and assign it to object "x2".
x2 <- c("I","love", "this", "course", "!")
```

- Logical vectors are sequences of TRUE and FALSE values:

```r
x3 <- c(TRUE, TRUE, FALSE, TRUE,TRUE)
```

- Date vectors are sequences of dates in the Year-Month-Day (and sometimes -Time) format.

```r
Sys.Date() # Tell me the date
```

```
## [1] "2025-09-10"
```

```r
class(Sys.Date())
```

```
## [1] "Date"
```
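
Because dates have their own class, R can do date arithmetic with them. The following sketch uses two made-up dates, `start` and `end`, which are illustrative names only.

```r
# Hypothetical dates for illustration.
start <- as.Date("2025-09-01")
end   <- as.Date("2025-09-10")

class(start)             # "Date"
end - start              # time difference in days
as.numeric(end - start)  # the same difference as a plain number
format(end, "%Y")        # extract the year as text
```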

---
#No mixed vectors!!!

```r
# Create the sequence 1 to 4 (seq() returns an integer vector here)
(x4 <- seq(1, 4)) 
```

```
## [1] 1 2 3 4
```

```r
class(x4)
```

```
## [1] "integer"
```

```r
# Replace the third element with the word "test"
x4[3] <- "test"
x4
```

```
## [1] "1"    "2"    "test" "4"
```

```r
class(x4)
```

```
## [1] "character"
```
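
The same silent conversion happens whenever `c()` receives elements of different types. As a rule of thumb, R converts everything to the most flexible type present (logical, then integer, then numeric, then character). A short sketch:

```r
class(c(TRUE, 2))   # logical + numeric -> "numeric"
class(c(1L, 2.5))   # integer + decimal -> "numeric"
class(c(1, "a"))    # numeric + text    -> "character"

c(TRUE, 2)          # TRUE is converted to 1
c(1, "a")           # 1 is converted to the text "1"
```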

---
#No mixed vectors!!!

```r
# What type of object is x4?
class(x4)
```

```
## [1] "character"
```

```r
# Convert x4 back to numeric; "test" cannot be converted and becomes NA
as.numeric(x4)
```

```
## Warning: NAs introduced by coercion
```

```
## [1]  1  2 NA  4
```

---
#NA: Not Available
In general, missing values in R are NA.

```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)
```
--
Many functions will not ignore NA by default and thus return NA.

```r
# Estimate mean of x5
mean(x5)
```

```
## [1] NA
```

```r
# Compute the mean of x5, ignoring NA values (i.e., casewise deletion)
mean(x5, na.rm = TRUE) # na.rm is an argument that tells R to remove NA before calculating
```

```
## [1] 3.6
```
The `na.rm` argument in `mean(x5, na.rm = TRUE)` controls whether NA values are removed before the calculation.
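
Several other base R summary functions accept the same `na.rm` argument. The sketch below uses a copy of the vector under the made-up name `x_na`.

```r
# Hypothetical copy of x5 for illustration.
x_na <- c(1, 2, 3, NA, 5, NA, 7)

sum(x_na)                   # NA, because NA is not ignored by default
sum(x_na, na.rm = TRUE)     # sum of the five non-missing values
median(x_na, na.rm = TRUE)  # median of the non-missing values
max(x_na, na.rm = TRUE)     # largest non-missing value
```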
---
#NA: Not Available
is.na() generates logical vectors that identify missing values.

```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)
# Which elements are not missing?
!is.na(x5)
```

```
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
```
---
#NA: Not Available

```r
# How many values are missing?
table(is.na(x5))
```

```
## 
## FALSE  TRUE 
##     5     2
```

```r
# Print only non-missing values of x5
x5[!is.na(x5)] 
```

```
## [1] 1 2 3 5 7
```
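
Because `TRUE` counts as 1 and `FALSE` as 0 in arithmetic, the logical vector from `is.na()` can also be summarized directly. A minimal sketch, again with a made-up copy called `x_na`:

```r
# Hypothetical copy of x5 for illustration.
x_na <- c(1, 2, 3, NA, 5, NA, 7)

sum(is.na(x_na))    # number of missing values
which(is.na(x_na))  # positions of the missing values
mean(is.na(x_na))   # share of missing values
```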

---
#Some vectors

```r
age <- c(34, 22, 42, 12, 76)
conti <- factor(x = c("Europe", "Africa", "Africa", "Asia", "S. America"),
 levels = c("Africa", "Asia", "Australia", "Europe", "N. America", "S. America"))
employed <- c(FALSE, TRUE, TRUE, TRUE, TRUE)
name <- c("Agnes", "Martin", "Hakan", "Tu", "Thais")
nr_kids <- c(1, 0, 3, 0, 4)
```

---
#Data frames
Data frames organize vectors of **equal length** along their indices.
.pull-left[

```r
# Bind our 5 vectors along their index into a data frame.
# Assign that data frame to object "Dat".
(Dat <- data.frame(name, age, conti, employed, nr_kids))
```

```
##     name age      conti employed nr_kids
## 1  Agnes  34     Europe    FALSE       1
## 2 Martin  22     Africa     TRUE       0
## 3  Hakan  42     Africa     TRUE       3
## 4     Tu  12       Asia     TRUE       0
## 5  Thais  76 S. America     TRUE       4
```
]

.pull-right[

```r
age <- c(34, 22, 42, 12, NA)
name <- c("Agnes", "Martin", "Hakan", "Tu", "Thais")
(Dat_wNA <- data.frame(name, age))
```

```
##     name age
## 1  Agnes  34
## 2 Martin  22
## 3  Hakan  42
## 4     Tu  12
## 5  Thais  NA
```
]
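
Once vectors are bound into a data frame, each column can be pulled out again as a vector with `$`. The sketch below rebuilds a small data frame under the made-up name `people` so that it runs on its own.

```r
# Hypothetical data frame for illustration (same values as Dat_wNA).
people <- data.frame(
  name = c("Agnes", "Martin", "Hakan", "Tu", "Thais"),
  age  = c(34, 22, 42, 12, NA))

nrow(people)                    # number of rows (observations)
ncol(people)                    # number of columns (variables)
people$age                      # extract one column as a vector
mean(people$age, na.rm = TRUE)  # vector functions work on columns
people[2, "name"]               # row 2 of the column "name"
```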
---
#Data frames

.center[Data frames are the typical "rectangular" way to organize data:] 
<img src="https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" width="80%" style="display: block; margin: auto;" >

---
#Import "pairfam" data
- Import Stata, SPSS, SAS files: ["haven" package](https://haven.tidyverse.org/)
- Import csv, tsv, fwf files: ["readr" package](https://readr.tidyverse.org/)
- Import Excel's xls and xlsx files: ["readxl" package](https://readxl.tidyverse.org/)
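
To try a csv import without downloading anything, the following sketch writes a tiny made-up data set to a temporary file and reads it back. It uses base R's `read.csv()` so that it runs without extra packages; `readr::read_csv()` is called in the same way with the file path as its first argument. The object names `tmp` and `dat_csv` are illustrative.

```r
# Write a small hypothetical data set to a temporary csv file.
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(name = c("Agnes", "Martin"), age = c(34, 22)),
  file = tmp, row.names = FALSE)

# Read the file back into a data frame.
dat_csv <- read.csv(tmp)
dat_csv

# With readr loaded, the equivalent call would be:
# dat_csv <- readr::read_csv(tmp)
```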
---
#Import "pairfam" data

```r
# Create an object pairfam and assign the imported anchor1_50percent_Eng.dta to it.
# If you have downloaded the file, please put it into your current working directory.
# If you are not sure what your current working directory is, run:
getwd()
```

```
## [1] "C:/Users/rxv320/OneDrive - University of Copenhagen/Documents/My docs/My teaching/2025/Advanced quant/2 R basics-I/R basics-1"
```

```r
# I put the data in the folder:
# "C:/Users/rxv320/Documents/My docs/My teaching/2025/Advanced quant/2 R basics-I/R basics-1"

# Load the haven package, which provides read_dta().
library(haven)
pairfam <- read_dta("anchor1_50percent_Eng.dta")
pairfam
```

```
## # A tibble: 6,201 × 1,458
## id demodiff wave sample pid parentidk1 parentidk2 parentidk3
## <dbl> <dbl+lbl> <dbl+l> <dbl+l> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> 
## 1 267206000 0 [0 non-… 1 [1 2… 1 [1 p… NA NA NA NA 
## 2 112963000 0 [0 non-… 1 [1 2… 1 [1 p… NA NA NA NA 
## 3 327937000 0 [0 non-… 1 [1 2… 1 [1 p… 3.28e8 NA NA NA 
## 4 318656000 0 [0 non-… 1 [1 2… 1 [1 p… 3.19e8 318656101 NA NA 
## 5 717889000 0 [0 non-… 1 [1 2… 1 [1 p… 7.18e8 717889101 717889101 NA 
## 6 222517000 0 [0 non-… 1 [1 2… 1 [1 p… NA NA NA NA 
## 7 144712000 0 [0 non-… 1 [1 2… 1 [1 p… NA NA NA NA 
## 8 659357000 0 [0 non-… 1 [1 2… 1 [1 p… 6.59e8 NA NA NA 
## 9 506367000 0 [0 non-… 1 [1 2… 1 [1 p… 5.06e8 506367101 NA NA 
## 10 64044000 0 [0 non-… 1 [1 2… 1 [1 p… NA NA NA NA 
## # ℹ 6,191 more rows
## # ℹ 1,450 more variables: parentidk4 <dbl+lbl>, parentidk5 <dbl+lbl>,
## # parentidk6 <dbl+lbl>, parentidk7 <dbl+lbl>, parentidk8 <dbl+lbl>,
## # parentidk9 <dbl+lbl>, parentidk10 <dbl+lbl>, parentidk11 <dbl+lbl>,
## # parentidk12 <dbl+lbl>, parentidk13 <dbl+lbl>, parentidk14 <dbl+lbl>,
## # parentidk15 <dbl+lbl>, sex_gen <dbl+lbl>, psex_gen <dbl+lbl>,
## # k1sex_gen <dbl+lbl>, k2sex_gen <dbl+lbl>, k3sex_gen <dbl+lbl>, …
```
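
If `read_dta()` reports that the file does not exist, the file is usually not in the working directory. One possible way to check before importing is sketched below; it assumes the file name used above and only calls haven when the file is actually found.

```r
# File name as used in the slides; adjust it if your file is named differently.
path <- "anchor1_50percent_Eng.dta"

if (file.exists(path)) {
  pairfam <- haven::read_dta(path)
  dim(pairfam)  # number of rows and columns
} else {
  # Show where R is looking and which Stata files it can see there.
  message("File not found in: ", getwd())
  list.files(pattern = "\\.dta$")
}
```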

---
#Take home
1. Vector: a sequence/chain/series of information. Elements of a vector can be addressed via its index [i].
2. Classes of vectors: numeric, factor, date, character, and logical vectors
3. Numerical vs categorical variables: numeric variables have a scale (e.g., cm, years, DKK), while categorical variables have no true representation in numbers
4. Packages: bundles of functions along with their documentation. You need to install a package once and then load it with `library()` in each session.
5. `NA`: is "Not Available" and thus the code for missing values in R.
6. Data frames organize vectors of equal length along their indices.

---
#Important functions
- `install.packages()`: install packages from CRAN.
- `library()`: add a package to the library for the current session.
- `c()`: concatenate a sequence to a vector.
- `factor()`: make a vector categorical.
- `fct_recode()`: recode values of a factor.
- `fct_relevel()`: change the order of a factor's levels.
- `as.numeric()`: make a vector numeric.
- `table()`: simple frequency or cross table.
- `is.na()`: generate a logical vector that identifies missing values.
- `data.frame()`: combine vectors to make a dataset.
- `read_dta()`: read data in Stata format (dta).
- `getwd()`: get the current working directory.

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1341250)
You can use ChatGPT for help.