class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Vector
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Vector

- The most fundamental data type in R is a vector: a sequence/chain/series of information. You can "concatenate" information into a vector by using **c()**.

```r
# Concatenate the sequence 5 4 3 2 4 to a vector, assign it to object x1, and print it.
(x1 <- c(5, 4, 3, 2, 4))
```

```
## [1] 5 4 3 2 4
```

- You can access a specific element of a vector via its index, vector[index]:

```r
# Print the second element of x1.
x1[2]
```

```
## [1] 4
```

- If you want to access several elements, you need to supply another vector as the index:

```r
# Print the second and fourth element of x1.
x1[c(2, 4)]
```

```
## [1] 4 2
```

---
background-image: url(https://pbs.twimg.com/media/CbiFppsWIAUFKLm?format=jpg&name=large)
background-size: contain
class: center

---

#Numeric vectors

- Numeric vectors are sequences of numbers.

```r
x1
```

```
## [1] 5 4 3 2 4
```

```r
class(x1) # What class is this object?
```

```
## [1] "numeric"
```

---

#Transform numeric vectors

- arithmetic operators **"+ - * / ^"**
- a myriad of different functions
- combinations of arithmetic operators and functions

```r
# Arithmetic operators
(x1 <- x1 + 5) # Add 5 to each element of x1 and assign the result back to x1.
```

```
## [1] 10  9  8  7  9
```

```r
(x1 <- log2(x1)) # Assign the logarithm with base 2 of x1 to x1.
```

```
## [1] 3.321928 3.169925 3.000000 2.807355 3.169925
```

```r
(x1 <- (x1 - mean(x1)) / sd(x1)) # z-standardize x1.
```

```
## [1]  1.1606991  0.3872282 -0.4774386 -1.4577170  0.3872282
```

---
background-image: url(https://static.wixstatic.com/media/625cd8_d1c68c0affe543feacce3f487d5210dd~mv2.jpg/v1/fit/w_1000%2Ch_637%2Cal_c%2Cq_80/file.jpg)
background-size: contain
class: center

---

#Factor vectors

Factors are for categorical variables that make a distinction but whose values cannot be compared on a common scale. They are composed of a sequence of categorical values (i.e., argument x) and an (ideally comprehensive) list of potential levels (i.e., theoretically possible values).

```r
# Turn the values in argument "x" into a factor and give it a
# comprehensive list "levels" of all potential categories.
conti <- factor(
  x = c("Europe", "Africa", "Africa", "Asia", "S.America"),
  levels = c("Africa", "Asia", "Australia", "Europe", "N.America", "S.America"))

# Print a frequency table of our new factor vector.
table(conti)
```

```
## conti
##    Africa      Asia Australia    Europe N.America S.America
##         2         1         0         1         0         1
```

---

#Factor

R forces you to decide whether a variable is continuous (numeric) or categorical (factor).

- Numeric variables have a scale, such as cm, years, or DKK. Hence there is no need for labels.
- Categorical variables, by contrast, have no actual representation in numbers.
- Because factors are categorical, they cannot be numerically transformed.
- If you nevertheless try to numerically transform a factor, the result is **NA** (i.e., "not available").

```r
conti * 2
```

```
## Warning in Ops.factor(conti, 2): '*' not meaningful for factors
```

```
## [1] NA NA NA NA NA
```

Next, we will learn to recode factors with functions from the forcats package, which is part of the tidyverse.

---

#Wait, what's a package?

.left-column[
<img src="http://cdn.osxdaily.com/wp-content/uploads/2016/03/package-file-check.jpg" width="60%" style="display: block; margin: auto;" >
]

.right-column[
- A package is a collection of functions and their documentation (sometimes also data).
- Some packages, such as base and stats, come pre-installed with R. In addition, there are currently more than 17,000 user-written packages on the Comprehensive R Archive Network (CRAN). The tidyverse and its forcats package are such user-written packages.
- How to install a package:
  - with code
  - with a click
]

---

#Install package

```r
install.packages("name_of_the_package")

# For example, install the package called "tidyverse".
install.packages("tidyverse")
```

**Do NOT put install.packages() in your R scripts! You need to install a package only once, not every time you run your script. If you want to update your installed packages, use update.packages().**

---

#How to use a package

.pull-left[
Because there are so many user-written packages, they often contain functions with *conflicting names*. To avoid conflicts, you need to specify for each R session which packages you want to work with. You do that by adding the packages to your current R session's **library**. It is good practice to add all packages to the library at the very top of an R script.

Please add the tidyverse to your library by writing the following as the very first line of your R script.

```r
# Add tidyverse package to library
library(tidyverse)
```

Some tidyverse functions conflict with functions from base R's stats package (e.g., `dplyr::filter()` masks `stats::filter()`). You can always address a function from a specific package by prefixing it with the package name: `package::function()`.
]

.pull-right[
<img src="https://bucket.trending.com/trending/reddit/2016-12-11/the-incredible-library-at-the-university-of-copenhagen-in-denmark_preview.jpg" width="100%" style="display: block; margin: auto;">
]

---

#Back to recoding factors

Factors can best be **recoded** with `forcats::fct_recode()`.

```r
# Recode conti to Danish. Watch out: first the new, then the old value...
conti
```

```
## [1] Europe    Africa    Africa    Asia      S.America
## Levels: Africa Asia Australia Europe N.America S.America
```

```r
conti <- fct_recode(conti,
  "Europa"      = "Europe",
  "Afrika"      = "Africa",
  "Asien"       = "Asia",
  "Suedamerika" = "S.America",
  "Nordamerika" = "N.America",
  "Australien"  = "Australia")

# Frequency table of conti.
table(conti)
```

```
## conti
##      Afrika       Asien  Australien      Europa Nordamerika Suedamerika
##           2           1           0           1           0           1
```

---

#Back to recoding factors

forcats contains many more useful functions to handle factors! Also check out [Chapter 15 Factors](https://r4ds.had.co.nz/factors.html) of Wickham and Grolemund (2017).

```r
conti
```

```
## [1] Europa      Afrika      Afrika      Asien       Suedamerika
## Levels: Afrika Asien Australien Europa Nordamerika Suedamerika
```

```r
conti <- fct_drop(conti) # Drop unused levels.
table(conti)
```

```
## conti
##      Afrika       Asien      Europa Suedamerika
##           2           1           1           1
```

```r
conti <- fct_relevel(conti, "Suedamerika", "Europa", "Asien", "Afrika") # Re-order levels.
table(conti)
```

```
## conti
## Suedamerika      Europa       Asien      Afrika
##           1           1           1           2
```

---

#Date vectors

Dates are vectors of the Year-Month-Day (and sometimes -Time) format.

```r
Sys.Date() # Print today's date.
```

```
## [1] "2022-09-14"
```

```r
class(Sys.Date())
```

```
## [1] "Date"
```

```r
# Evaluate the logical statement that
# today is smaller (i.e., earlier)
# than the deadline for the exam.
Sys.Date() < "2020-05-25"
```

```
## [1] FALSE
```

Date vectors are complex because time is not metric (e.g., months differ in length). They are covered in an advanced R course.

---

#Character vectors

Character/string vectors are sequences of text.

```r
# Concatenate these four strings as one vector
# and assign it to object "x2".
x2 <- c("Thisis", "a", "goodidea", "really")

# Return only the first to third character
# of each string-element of x2.
(x2 <- str_sub(x2, start = 1, end = 3))
```

```
## [1] "Thi" "a"   "goo" "rea"
```

```r
# Return only the last two characters
# of each string-element of x2.
x2 <- c("Thisis", "a", "goodidea", "really")
(x2 <- str_sub(x2, -2))
```

```
## [1] "is" "a"  "ea" "ly"
```

---

#Logical vectors

Logical vectors are sequences of TRUE and FALSE statements:

```r
x3 <- c(TRUE, TRUE, FALSE, TRUE)
```

Internally, R uses logical vectors for case selection. They are thus very important!

```r
x2 <- c("Thisis", "a", "goodidea", "really")
x2[x3]
```

```
## [1] "Thisis" "a"      "really"
```

---
class: center, middle

<img src="https://merlin-intro-r.netlify.app/2-vectors/img/VectorTypesLogic.png" width="80%" style="display: block; margin: auto;">

---

#No mixed vectors!!!

```r
# Create the sequence 1 to 4 as a numeric vector
(x4 <- seq(1, 4))
```

```
## [1] 1 2 3 4
```

```r
# Replace the third element with the word "test"
x4[3] <- "test"
x4
```

```
## [1] "1"    "2"    "test" "4"
```

---

#No mixed vectors!!!

```r
# What type of object is x4?
class(x4)
```

```
## [1] "character"
```

--

```r
# Make x4 a numeric vector again
as.numeric(x4)
```

```
## Warning: NAs introduced by coercion
```

```
## [1]  1  2 NA  4
```

---

#NA: Not Available

In general, missing values in R are NA, while undefined results of a calculation (e.g., 0/0) are NaN (Not a Number).

```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)
```

--

Many functions do not ignore NA by default and thus return NA.

```r
# Estimate mean of x5
mean(x5)
```

```
## [1] NA
```

```r
# Estimate mean of x5 ignoring
# the NA (i.e., casewise deletion)
mean(x5, na.rm = TRUE)
```

```
## [1] 3.6
```

---

#NA: Not Available

is.na() generates logical vectors that identify missing values.

```r
x5 <- c(1, 2, 3, NA, 5, NA, 7)

# Which elements are missing?
is.na(x5)
```

```
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
```

```r
# Which elements are not missing?
!is.na(x5)
```

```
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
```

---

#NA: Not Available

```r
# How many are missing?
table(is.na(x5))
```

```
##
## FALSE  TRUE
##     5     2
```

```r
# Print only non-missing values of x5
x5[!is.na(x5)]
```

```
## [1] 1 2 3 5 7
```

---

#Take home

1. Vector: a sequence/chain/series of information. Elements of a vector can be addressed via their index, vector[i].
2. Classes of vectors: numeric, factor, date, character, and logical vectors.
3. Numeric vs. categorical variables: numeric variables have a scale (e.g., cm, years, DKK), while categorical variables have no true representation in numbers.
4. Packages: bundles of functions along with their documentation. You install a package once and then load it with `library()` in every session in which you want to use it.
5. `NA`: "Not Available", the code for missing values in R.

---

#Important functions

- `install.packages()`: install packages from CRAN.
- `library()`: add a package to the library for the current session.
- `c()`: concatenate a sequence to a vector.
- `factor()`: make a vector categorical.
- `fct_recode()`: recode values of a factor.
- `as.numeric()`: make a vector numeric.
- `table()`: simple frequency or cross table.
- `is.na()`: generate a logical vector that identifies missing values.

---
class: center, middle

#[Exercise](https://merlin-intro-r.netlify.app/exercises/exercise2.html)
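
---

#Appendix: logical selection with missing values

The slides on logical vectors and on `NA` can be combined into one small example. This is a minimal sketch, not part of the original material: the objects `x6` and `keep` are made up for illustration. It shows that a comparison creates the logical vector used for case selection, and how missing values behave along the way.

```r
# Hypothetical example vector with one missing value.
x6 <- c(4, 8, NA, 15, 16)

# A comparison returns a logical vector; the missing element stays NA.
x6 > 5

# Combine the comparison with !is.na() so that the missing
# element is dropped instead of being selected as NA.
keep <- !is.na(x6) & x6 > 5
x6[keep]

# Missing versus undefined values.
0 / 0        # NaN: the result is undefined
1 / 0        # Inf: a non-zero number divided by zero
is.na(0 / 0) # TRUE: is.na() also flags NaN
```

Selecting with `x6[x6 > 5]` alone would keep an `NA` in the result, which is why the sketch adds `!is.na(x6)` to the condition.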