class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## R basics I
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---
#Vector

- The most fundamental data type in R is a vector: a sequence/chain/series of information. You can "concatenate" information into a vector by using **c()**.

```r
# Concatenate the sequence 5 4 3 2 4 to a vector, assign it to object x1, and print it.
(x1 <- c(5, 4, 3, 2, 4))
```

```
## [1] 5 4 3 2 4
```

- You can access a specific element of a vector via its index: vector[index].

```r
# Print the second element of x1.
x1[2]
```

```
## [1] 4
```

- If you want to access several elements, you need to supply another vector as the index:

```r
# Print the second and fourth element of x1.
x1[c(2, 4)]
```

```
## [1] 4 2
```

---
#Types of commonly used vectors

- Numeric vector
- Factor vector*
- Character vector
- Logical vector
- Date vector

---
#Numeric vectors

- Numeric vectors are sequences of numbers.

```r
x1
```

```
## [1] 5 4 3 2 4
```

```r
class(x1) # What class is this object?
```

```
## [1] "numeric"
```

---
#Transform numeric vectors

- arithmetic operators **"+ - * / ^"**
- a myriad of different functions
- combinations of arithmetic operators and several functions

```r
# Arithmetic operators
(x1 <- x1 + 5) # Add 5 to each element of x1 and assign the result to x1.
```

```
## [1] 10  9  8  7  9
```

```r
(x1 <- log2(x1)) # Assign the logarithm with base 2 of x1 to x1.
```

```
## [1] 3.321928 3.169925 3.000000 2.807355 3.169925
```

```r
(x1 <- (x1 - mean(x1)) / sd(x1)) # z-standardize x1.
```

```
## [1]  1.1606991  0.3872282 -0.4774386 -1.4577170  0.3872282
```

---
#Factor vectors

Factors are for categorical variables that make a distinction but whose values cannot be compared on a common scale.
They are composed of a sequence of categorical values (i.e., argument x) and an (ideally comprehensive) list of potential levels (i.e., theoretically possible values).

```r
# Concatenate argument "x" to a factor and give it a
# comprehensive list "levels" of all potential categories.
intimate <- factor(
  x = c("Yes", "Yes", "No", "No", "No"),
  levels = c("Yes", "No", "Don't know", "No answer"))

# Print a frequency table of our new factor vector.
table(intimate)
```

```
## intimate
##        Yes         No Don't know  No answer
##          2          3          0          0
```

---
#Factor

R forces you to decide whether a variable is continuous (numeric) or categorical (factor).

- Numeric variables have a scale, such as cm, years, or DKK. Hence there is no need for labels.
- Categorical variables, by contrast, have no actual representation in numbers.
- Because factors are categorical, they cannot be numerically transformed.
- If you nevertheless try to numerically transform a factor, the result is **NA** (i.e., "not available").

```r
intimate * 2
```

```
## Warning in Ops.factor(intimate, 2): '*' not meaningful for factors
```

```
## [1] NA NA NA NA NA
```

Next, we will learn to use the recode functions from the **forcats** package, which is part of the **tidyverse**.

---
#Wait, what's a package?

.left-column[
<img src="http://cdn.osxdaily.com/wp-content/uploads/2016/03/package-file-check.jpg" width="60%" style="display: block; margin: auto;" >
]

.right-column[
- A package is a collection of functions and their documentation (sometimes also data).
- Some packages come pre-installed with R (for example, base and stats). In addition, there are currently more than 17,000 user-written packages on the Comprehensive R Archive Network (CRAN). The <span style="color:blue">tidyverse</span> and its <span style="color:blue">forcats</span> package are such user-written packages.
- How to install a package
  - with code
  - with a click]

---
#Install package

```r
# General pattern: install.packages("name_of_the_package")
# For example, install the package called "tidyverse".
install.packages("tidyverse") # Don't forget the quotation marks "".
```

<span style="color:red">**Do NOT add install.packages() to your R scripts! You need to install a package only once, not every time you run your script. If you want to update your packages, use update.packages().**</span>

---
#How to use package

- Use library() to load a package.
- Because there are so many user-written packages, they often contain functions with *conflicting names*. To avoid conflicts, you need to specify for each R session which packages you want to work with. You do that by adding the packages to your current R session's **library**.
- It is good practice to add all packages to the library at the very top of an R script. Please add the tidyverse to your library by writing the following as the very first line in your R script.

```r
# Use library() to load the tidyverse package.
library(tidyverse)

# If you want to load several packages at the same time:
# install.packages("pacman")
pacman::p_load(tidyverse, ggplot2)
```

Some tidyverse functions conflict with functions from base R's stats package (for example, `dplyr::filter()` masks `stats::filter()`). You can always call a function from a specific package by prefixing it with the package name: `package::function()`. For example, <span style="color:red">`pacman::p_load()`</span>.

---
#Back to recoding factors

Factors can best be **recoded** with `forcats::fct_recode()`.

```r
# Frequency table of intimate before recoding.
table(intimate)
```

```
## intimate
##        Yes         No Don't know  No answer
##          2          3          0          0
```

```r
# Recode intimate to make Y for Yes and N for No.
# Watch out: first the new, then the old value (new coding = old coding).
intimate1 <- fct_recode(intimate, "Y" = "Yes", "N" = "No")
# Frequency table of intimate1.
table(intimate1)
```

```
## intimate1
##          Y          N Don't know  No answer
##          2          3          0          0
```

---
#Back to recoding factors

forcats contains many more useful functions to handle factors!
Also check out [Chapter 15 Factors](https://r4ds.had.co.nz/factors.html) of Wickham and Grolemund (2017).

```r
intimate2 <- fct_drop(intimate) # Drop unused levels.
table(intimate2)
```

```
## intimate2
## Yes  No
##   2   3
```

```r
intimate3 <- fct_relevel(intimate, "Don't know", "No answer", "No", "Yes") # Re-order levels.
table(intimate3)
```

```
## intimate3
## Don't know  No answer         No        Yes
##          0          0          3          2
```

---
#Character vectors

Character/string vectors are sequences of text (an extended example).

```r
# Concatenate these five strings as one vector
# and assign it to object "x2".
x2 <- c("I", "love", "this", "course", "!")
# Return only the first to third character
# of each string-element of x2.
# str_sub() comes from the stringr package, which is loaded with the tidyverse.
(x2a <- str_sub(x2, start = 1, end = 3))
```

```
## [1] "I"   "lov" "thi" "cou" "!"
```

```r
# Return only the last two characters
# of each string-element of x2.
(x2b <- str_sub(x2, -2))
```

```
## [1] "I"  "ve" "is" "se" "!"
```

---
#Logical vectors

Logical vectors are sequences of TRUE and FALSE statements:

```r
x3 <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
```

Internally, R uses logical vectors for case selection. They are thus very important!

```r
x2 <- c("I", "love", "this", "course", "!")
x2[x3] # Keep only the elements for which x3 is TRUE.
```

```
## [1] "I"      "love"   "course" "!"
```

---
#Date vectors

Dates are vectors of the Year-Month-Day (and sometimes -Time) format.

```r
Sys.Date() # Tell me the date
```

```
## [1] "2024-08-29"
```

```r
class(Sys.Date())
```

```
## [1] "Date"
```

```r
# Evaluate the logical statement that
# today is earlier than (i.e., before)
# the date 2020-05-25.
Sys.Date() < "2020-05-25"
```

```
## [1] FALSE
```

Date vectors are complex because time is not metric, which is a topic for an advanced R course.

---
#No mixed vectors!!!
```r
# Create the sequence 1 to 4 (an integer vector).
(x4 <- seq(1, 4))
```

```
## [1] 1 2 3 4
```

```r
class(x4)
```

```
## [1] "integer"
```

```r
# Replace the third element with the word "test".
# All elements of a vector must share one type, so R converts everything to character.
x4[3] <- "test"
x4
```

```
## [1] "1"    "2"    "test" "4"
```

```r
class(x4)
```

```
## [1] "character"
```

---
#No mixed vectors!!!

```r
# What type of object is x4?
class(x4)
```

```
## [1] "character"
```

--

```r
# Convert x4 back to numeric; "test" cannot be converted and becomes NA.
as.numeric(x4)
```

```
## Warning: NAs introduced by coercion
```

```
## [1]  1  2 NA  4
```

---
#NA: Not Available

In general, missing values in R are NA.

```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)
```

--

Many functions do not ignore NA by default and thus return NA.

```r
# Estimate mean of x5
mean(x5)
```

```
## [1] NA
```

```r
# Estimate mean of x5 ignoring the NA (i.e., casewise deletion)
mean(x5, na.rm = TRUE) # na.rm is an argument that tells R to remove NA before the calculation
```

```
## [1] 3.6
```

The argument <span style="color:blue">na.rm</span> in `mean(x5, na.rm = TRUE)` controls whether NA values are removed before the calculation.

---
#NA: Not Available

is.na() generates logical vectors that identify missing values.

```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)
# Which elements are missing?
is.na(x5)
```

```
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
```

```r
# Which elements are not missing?
!is.na(x5)
```

```
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
```

---
#NA: Not Available

```r
# How many are missing?
table(is.na(x5))
```

```
##
## FALSE  TRUE
##     5     2
```

```r
# Print only non-missing values of x5
x5[!is.na(x5)]
```

```
## [1] 1 2 3 5 7
```

---
#Take home

1. Vector: a sequence/chain/series of information. Elements of a vector can be addressed via its index [i].
2. Classes of vectors: numeric, factor, date, character, and logical vectors.
3. Numerical vs categorical variables: numeric variables have a scale (e.g., cm, years, DKK), while categorical variables have no true representation in numbers.
4. 
Packages: bundles of functions along with their documentation. You need to install a package once and then load it with `library()` in each session.
5. `NA`: stands for "Not Available" and is the code for missing values in R.

---
#Important functions

- `install.packages()`: Install packages from CRAN.
- `library()`: Load a package for the current session.
- `c()`: Concatenate a sequence to a vector.
- `factor()`: Make a vector categorical.
- `fct_recode()`: Recode values of a factor.
- `fct_relevel()`: Change the order of the levels of a factor.
- `as.numeric()`: Make a vector numeric.
- `table()`: Simple frequency or cross table.
- `is.na()`: Generate a logical vector that identifies missing values.

---
class: center, middle

#[Exercise](https://rpubs.com/fancycmn/1214709)

You can use ChatGPT for help.
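---
#Extra: logical vectors from comparisons

The slides state that R uses logical vectors for case selection. As a minimal sketch of how that usually looks in practice, the example below creates the logical vector with a comparison instead of typing TRUE and FALSE by hand. The vector `ages` and the name `is_over_30` are hypothetical and only serve as illustration; they are not part of the course data.

```r
# Hypothetical ages of five respondents (illustrative data only).
ages <- c(23, 35, 41, 19, 52)

# A comparison returns a logical vector with one TRUE/FALSE per element.
(is_over_30 <- ages > 30) # FALSE TRUE TRUE FALSE TRUE

# Use the logical vector as an index to keep only the TRUE cases.
ages[is_over_30] # 35 41 52

# TRUE counts as 1 and FALSE as 0, so sum() counts the selected cases.
sum(is_over_30) # 3
```

The same pattern is what `x5[!is.na(x5)]` does: `!is.na(x5)` is the logical vector, and the square brackets select the cases where it is TRUE.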