class: title-slide background-image: url(Figures/McCourtTitle.png) .bg-text[ <hr /> ### Topic 2 ## Intro to R <hr /> Michael Bailey ] --- ## Overview of this session -- 1. R and RStudio -- 2. R Markdown -- 3. Loading data (sneaky tricky) -- 4. Three data types (boring) -- 5. Four data structures (exciting!) -- 6. Describing data -- 7. Create new data objects -- 8. Loops -- 9. _If_ statements -- --- ## Looking ahead -- 1. Labs a. We'll start Lab 1 in class next week. It is due Friday September 19 at 5 pm via Canvas. You will use R to describe and analyze exit polling data. You'll submit your code (in a .Rmd file) and output (in a .html file). -- 2. Quizzes a. There will be a practice quiz on Chapters 1 and 2 in class on Thursday September 11. b. There will be a quiz on Chapter 3 on Tuesday September 16. --- class: left, top, inverse, hide-logo background-position: center background-size: cover background-image: url(Figures/peace-sea.jpg) ## Goal 1: Get set up in R and RStudio --- ## R is <!-- .center[<img src="Figures/Rlogo.png", width=700>] --> -- - Powerful -- - Flexible -- - Eases workflow -- - The lingua franca of data science (with Python) -- - Free --- ## Installing R and RStudio -- - R installation: https://cran.case.edu/ -- - RStudio is an *integrated development environment* that simplifies many tasks. -- + Useful, but not necessary. -- + Installation: https://rstudio.com/products/rstudio/download/ --- ### RStudio -- .center[<img src="Figures/rstudio.png", width=700>] --- ### RStudio .center[<img src="Figures/rstudio_1_4.png", width=700>] --- ### RStudio .center[<img src="Figures/rstudio_2_4.png", width=700>] --- ### RStudio .center[<img src="Figures/rstudio_3_4.png", width=700>] --- ### RStudio .center[<img src="Figures/rstudio_4_4.png", width=700>] --- ### Scripts -- - A script is a text file where we write and run our code. 
-- - The basic R script is a `.R` file that has code and comments -- - We'll also use "RMarkdown" files `.Rmd` that combine code and text and figures into pdf, Word or html files. --- #### Scripts (continued) -- When we write a line of code in the script window, we can run it in the console by highlighting the text and... -- - pressing **`command + enter`** (mac) -- - pressing **`control + enter`** (windows) -- - clicking **`run`** in the banner at the top of the script quadrant in RStudio -- - Always use scripts! -- + Save the recipe, not the bread! <br> <br> -- - Use lots of comments! (use "#" at the start of a line) -- - What does this line of code do? ``` r library(car) ``` -- - Aha! ``` r # Load package for F-tests library(car) ``` --- ### Console -- The console is where calculations "`R`" occur. We can do interactive commands here. .center[<img src="Figures/console.png">] --- #### Console (continued) -- All commands are processed through the console directly (that is, one can type commands directly into it) or via a **script**. .center[<img src="Figures/console_with_calc.png">] --- ### Directories -- - Stay on top of where you store data -- - Path names are fussy: On Windows, you have to replace every \ in a path with / or with \\\\ (double backslash) -- - Option 1: use *setwd* ("set working directory") ``` r setwd("C:\\Documents\\MyFiles") dta <- read.csv("data.csv", header=TRUE) ``` -- - Option 2: use absolute paths ``` r dta <- read.csv("C:\\Documents\\MyFiles\\data.csv", header=TRUE) ``` -- - Option 3: use .Rproj to produce relative paths -- + In RStudio, use File/New Project and create the directory where you will store all material related to this project -- + Always open this .Rproj first when working on this project. 
-- + Will create paths relative to the directory you create -- + Is portable to another machine <br> <br> -- Also: use the *here* package and its *here()* function to manage directories --- ### Directories - "Assignment" #### Please do something like this for all classes -- - Set up a directory/folder for this class (e.g., PPOL5200) -- - Create subdirectories/subfolders (e.g., data, labs, problemSets, slides) -- - Create an R project (e.g., PPOL5200.Rproj) and open it first every time you work on the class -- - Get used to using relative paths + Suppose you are in the labs folder and want to use a file in data; + It will look something like `CountryCode = read.csv("../data/CountryCodes.csv")` -- - Clean, well-organized folder structures/practices are **very** important when working in teams --- ## "Cheat" sheets -- - R [Code summary](https://michaelbailey.georgetown.domains/wp-content/uploads/2020/06/R-commands_Real-Econometrics-Bailey.pdf) -- - Stata [Code summary](https://michaelbailey.georgetown.domains/wp-content/uploads/2020/06/Stata-commands_Real-Econometrics-Bailey.pdf) -- - Both from [Real Stats](https://global.oup.com/ushe/product/real-econometrics-9780190857462) --- class: left, top, inverse background-position: center background-size: cover background-image: url(Figures/020419-oceans-turning-color.jpg) ## Goal 2: Use R Markdown --- ### Overview of RMarkdown -- - RMarkdown compiles R code into formats such as pdf, html and Microsoft Word. -- - Example: Use R and RMarkdown to produce analysis and figures together with a formatted paper. + [Covid testing and sampling](https://michaelbailey.georgetown.domains/wp-content/uploads/2020/05/Corona_Paper_May2020_ver2.pdf) --- ### R Markdown Steps -- - Create .Rmd file (e.g., File/New File/R Markdown for starting template) -- - Compile a new document in the indicated format by "knitting" the Rmd file via the Knit button on the RStudio command bar -- - RStudio compiles the .Rmd document in **a new session of R**. 
-- + **This is a major source of headaches for new users.** An R data object you can see in the interactive console may not exist in the Rmd file. -- + Implication: **Your Rmd code must be self-contained**: create all objects and do all processing inside the Rmd document -- - For more details on R Markdown - Xie et al RMarkdown book: https://bookdown.org/yihui/rmarkdown/yihui-xie.html - http://rmarkdown.rstudio.com - https://www.rstudio.com/resources/cheatsheets --- ### Three RMarkdown Elements -- 1. Frontmatter (referred to as YAML frontmatter) -- 1. Text -- 1. Code --- ### Simple Example <div class="figure" style="text-align: center"> <img src="Figures/RmdExample.png" alt="Basic RMarkdown document" width="80%" /> <p class="caption">Basic RMarkdown document</p> </div> --- ### RMarkdown Element 1: Frontmatter #### YAML header -- - Basic metadata: title, author, date, output -- - YAML stands for "YAML Ain't Markup Language" (yes, it's a silly name) -- - Indents and spaces matter here! (**Another common source of headaches**) -- #### Setup chunk -- - Immediately following the YAML frontmatter is a _setup chunk_ of R code that loads libraries and sets settings etc. (More on R code chunks below.) -- ``` r library(tidyverse) library(readxl) ``` --- ### RMarkdown Element 2: Text -- - Text typically organized under headers (One hash for large font, 2 hashes for smaller down to 4 hashes) -- - For bullets, use a single dash. Put two dashes on line above, to reveal the bullets step by step in a presentation. -- `\(Y_{it} = \beta_0 + \beta_1X_{it} + \epsilon_{it}\)` - For math, use LaTeX format inside of dollar signs. -- ``` r # $Y_{it} = \beta_0 + \beta_1X_{it} + \epsilon_{it}$ ``` --- ### RMarkdown Element 3: R Code -- - R Code in RMarkdown can be in **chunks** or **inline**. 
For example, see earlier slide called "Simple Example" -- ##### Chunks -- - Chunks are sections of code separated by 3 backticks followed by {r, eval=TRUE} (see options below) -- - Code chunk options include -- + `include = FALSE` R runs the code, but code and results do not appear in finished file. -- + `echo = FALSE` prevents code, but not the results from appearing in finished file. -- + `eval = FALSE` R will not evaluate the code in the chunk. -- + `message = FALSE` prevents messages generated by code from appearing in finished file. -- ##### Inline -- - Inline R code expressions can be used in text sections. They start with backtick r and end with backtick. --- #### Quick example .center[<img src="Figures/rmarkdown_screenshot.jpg", width=700>] --- #### Quick example -- First, we show output with echo = FALSE (we do not see R code - just output) ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.758132e+01 1.3275714260 50.905975 1.979131e-43 ## Income 7.433343e-04 0.0002965107 2.506939 1.561728e-02 ``` -- Second, we show code and output with echo= TRUE ``` r # Code chunk with echo = TRUE ols.1 = lm(LifeExp ~ Income, data = statedata) summary(ols.1)$coefficients ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.758132e+01 1.3275714260 50.905975 1.979131e-43 ## Income 7.433343e-04 0.0002965107 2.506939 1.561728e-02 ``` -- Third, we use some code and output in text: The effect of 1,000 dollars of income on life expectancy is 0.74 years of life expectancy. 
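---
#### Quick example (continued)

The third step above relies on an inline expression. Here is a minimal sketch of how that number could be produced. It assumes that `statedata` was built from the `state.x77` data set that ships with R; that is a guess suggested by the regression output, not something the slides state. The name `income.effect` is made up for this illustration.

``` r
# Assumption: statedata comes from the built-in state.x77 matrix
statedata <- data.frame(
  LifeExp = state.x77[, "Life Exp"],
  Income  = state.x77[, "Income"]
)

# Same regression as in the chunk with echo = TRUE
ols.1 <- lm(LifeExp ~ Income, data = statedata)

# Effect of 1,000 dollars of income, rounded for use in a sentence
income.effect <- round(1000 * coef(ols.1)[["Income"]], 2)
income.effect

# In the text of the .Rmd file, the sentence would then contain
# a backtick, the letter r, the object name and a closing backtick:
#   ... on life expectancy is `r income.effect` years ...
```

Because the number is computed when the file is knit, the sentence updates automatically if the data change.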
--- class: left, top, inverse background-position: center background-size: cover <!-- background-image: url(https://michaelbailey.georgetown.domains/wp-content/uploads/2020/06/peace-sea-scaled.jpg) --> background-image: url(Figures/peace-sea.jpg) ## Goal 3: Load Data into R --- ## Loading data: 5 common examples -- - [More information on loading data in R](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf) -- - Data sets in base R: see list by typing *data()* ``` r state.x77[1:5,] # First 5 rows of "US State Facts" data provided inside R ## Population Income Illiteracy Life Exp Murder HS Grad Frost Area ## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 ## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 ## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 ## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 ## California 21198 5114 1.1 71.71 10.3 62.6 20 156361 ``` -- - read.csv ``` r CountryCode = read.csv("Data/Country codes for WVS wave 6.csv") CountryCode[1:4,] # First 4 rows of CountryCode data ## ccode countryName ## 1 8 Albania ## 2 12 Algeria ## 3 16 American Samoa ## 4 20 Andorra ``` --- #### Loading data (continued): R data formats -- - Save multiple objects with .RData format (often abbreviated to .Rda) ``` r object1 = CountryCode[1:10,] object2 = state.x77[1:5,] save(object1, object2, file = "Data/data.RData") rm(list=ls()) # Clear memory load(file = "Data/data.RData") # Load data objects() # Look at objects in memory ## [1] "object1" "object2" ``` -- - Save one object to a file with .rds format ``` r saveRDS(object1, file = "Data/data.rds") rm(list=ls()) # Clear memory my_data <- readRDS(file = "Data/data.rds") # Load object into new object name objects() ## [1] "my_data" my_data[1:4,] ## ccode countryName ## 1 8 Albania ## 2 12 Algeria ## 3 16 American Samoa ## 4 20 Andorra ``` --- #### Loading data (continued): Using packages ("libraries") -- - We need *packages* to load certain types of data sets (such as Excel and Stata data files) -- - 
Packages are collections of code/data/documentation to do thousands of specialized tasks -- - The first time you use a package you need to save the files to your computer ``` r install.packages("readxl") ``` -- - After that you need to ''pull it off the shelf'' every time you want to use it ``` r library(readxl) ``` --- #### Loading data (continued): Using packages ("libraries") -- ``` r library("readxl") # Package to read Excel data ## Warning: package 'readxl' was built under R version 4.3.3 Poll.1 <- readxl::read_excel("Data/Battleground-65-Final-Dataset.xlsx", sheet = "Final Dataset") # We do not necessarily need to identify package (see readxl:: in above command). However, # (a) it can help future us find what package we need to install and # (b) can avoid problems when multiple packages use same function name # Example: "select" is function in multiple packages; if you want standard "tidyverse" # version (more on this later), you'll need to write dplyr::select(...) Poll.1[1:2, 1:10] ## First 2 rows and first 10 variables ## # A tibble: 2 × 10 ## INT REGION STATE COUNTY CD SAMPGEN ACTAGE DTID JBID PBID ## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 430 3 1 3 1 F 64 1 4 4 ## 2 720 3 1 53 1 F 77 4 2 2 ``` -- ``` r # Read in a different sheet in an Excel file Poll.2 <- readxl::read_excel("Data/Battleground-65-Final-Dataset.xlsx", sheet = "NotData") Poll.2[1:2, 1:8] ## # A tibble: 2 × 8 ## x1 x2 x3 x4 x5 x6 x7 x8 ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 this 1 1 1 1 1 1 ## 2 2 is 2 2 2 2 2 2 ``` --- #### Loading data (continued): Stata data ``` r library(haven) # Package to read Stata data dta = haven::read_dta("Data/Ch5_Exercise6_Global_education.dta") ## First 4 rows and first 10 variables dta[1:4, 1:10] ## # A tibble: 4 × 10 ## code name open ed60 ypc60 ypcgr testavg proprts edavg region ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 ARG Argentina 0 6.13 7.39 0.996 3.92 6.5 7.22 LATAM ## 2 AUS Australia 
0.870 9.81 10.6 2.22 5.09 9.32 11.5 COMM ## 3 AUT Austria 1 8.28 7.36 2.96 5.09 9.74 9.85 C-EUR ## 4 BEL Belgium 1 7.39 7.76 2.84 5.04 9.69 9.11 C-EUR ``` --- class: left, top, inverse background-position: center background-size: cover <!-- background-image: url(https://michaelbailey.georgetown.domains/wp-content/uploads/2020/06/peace-sea-scaled.jpg) --> background-image: url(Figures/419662.jpg) ## Goal 4: Identify three major data types 1. Numeric 2. Logical 3. String --- ### 1. Numeric ``` r # Variable named x.numeric contains the numbers 7 and 10 x.numeric <- c(7, 10) x.numeric ## [1] 7 10 ``` -- ### 2. Logical -- ``` r # Variable named x.logical is true/false # Based on condition in parentheses x.logical <- (x.numeric > 8) x.logical ## [1] FALSE TRUE # Notice double equal sign in condition x.logical2 <- (x.numeric == 7) x.logical2 ## [1] TRUE FALSE ``` --- -- ### 3. String -- ``` r # Variable named x.string has words x.string <- c("Hello", "World", "how are you doing?", 7) x.string ## [1] "Hello" "World" "how are you doing?" ## [4] "7" ``` --- #### Data types (continued) If you want to know what type of data a variable is, use *class()* -- ``` r x.numeric ## [1] 7 10 class(x.numeric) ## [1] "numeric" x.string ## [1] "Hello" "World" "how are you doing?" ## [4] "7" class(x.string) ## [1] "character" ``` --- #### Data types (continued) #### Other data types -- - Factors - nominal variables that take on one of a specified number of values (e.g. "U.S." or "Mexico" or "Canada") -- - Dates -- - We'll leave getting comfortable with these data types for future work <!-- gender <- c(rep("male",20), rep("female", 30)) --> <!-- str(gender) --> <!-- gender <- factor(gender) --> <!-- str(gender) --> <!-- summary(gender) --> --- ### Missing data #### Often a big deal! ``` r # Create a new variable. NA is the specific term for missing data in R x.new = c(7, 10, NA, 3, 1) x.new ## [1] 7 10 NA 3 1 ``` -- ``` r # What happens here? 
x.dot = c(7, 10, ".", 3, 1) ``` -- ``` r x.dot ## [1] "7" "10" "." "3" "1" class(x.dot) ## [1] "character" ``` --- #### Missing data (continued) -- Check for missing data with the *is.na()* function ``` r x.new ## [1] 7 10 NA 3 1 ``` -- ``` r Missing.indicator = is.na(x.new) ``` -- What data type is the variable *Missing.indicator*? -- ``` r Missing.indicator ## [1] FALSE FALSE TRUE FALSE FALSE class(Missing.indicator) ## [1] "logical" ``` --- class: left, top, inverse background-position: center background-size: cover <!-- background-image: url(https://michaelbailey.georgetown.domains/wp-content/uploads/2020/06/peace-sea-scaled.jpg) --> background-image: url(Figures/328161_2560_1600.jpg) ## Goal 5: Identify four data structures 1. Vectors 2. Matrices 3. Data frames 4. Lists --- ### 1. Vector: a column of numbers More technically, a vector is a sequence of data elements of same type -- ``` r # Create a vector called x1 x1 <- c(1, 4, -1, 4, 1, 5) x1 ## [1] 1 4 -1 4 1 5 ``` -- ``` r # x2 is simply 2 x x1 for each element x2 <- 2 * x1 x2 ## [1] 2 8 -2 8 2 10 ``` --- #### 1. Vector (continued) -- Reference vector elements with single brackets ``` r x2 ## [1] 2 8 -2 8 2 10 # Set the third element of x2 to be missing x2[3] = NA x2 ## [1] 2 8 NA 8 2 10 ``` -- ``` r x1 ## [1] 1 4 -1 4 1 5 # Set all elements of x1 that are < 2 to equal 200 x1[x1 < 2] = 200 x1 ## [1] 200 4 200 4 200 5 ``` --- #### 1. Vector (continued) -- Use logical and brackets to select subset of data. -- ``` r x.new ## [1] 7 10 NA 3 1 # the is.na() command indicates if an element is NA is.na(x.new) ## [1] FALSE FALSE TRUE FALSE FALSE ``` -- ``` r # Show observations that are not missing # Equivalent ways to represent the same information x.new[is.na(x.new) == FALSE] ## [1] 7 10 3 1 x.new[is.na(x.new) == 0] ## [1] 7 10 3 1 x.new[is.na(x.new) != 1] ## [1] 7 10 3 1 ``` --- ### 2. 
Matrix: multiple columns of variables of same type and length -- Combine vectors into matrix with *cbind()* ("column bind") ``` r Matrix.1 = cbind(x1, x2) Matrix.1 ## x1 x2 ## [1,] 200 2 ## [2,] 4 8 ## [3,] 200 NA ## [4,] 4 8 ## [5,] 200 2 ## [6,] 5 10 ``` -- Check number of rows and columns with *dim()* ``` r dim(Matrix.1) ## [1] 6 2 ``` --- #### 2. Matrix (continued) .pull-left[ ``` r Matrix.1 ## x1 x2 ## [1,] 200 2 ## [2,] 4 8 ## [3,] 200 NA ## [4,] 4 8 ## [5,] 200 2 ## [6,] 5 10 ``` ] -- .pull-right[ Use brackets to identify subsets of a matrix ``` r # First column Matrix.1[, 1] ## [1] 200 4 200 4 200 5 ``` ``` r # Third row Matrix.1[3, ] ## x1 x2 ## 200 NA ``` ``` r # Rows of Matrix.1 where first # column is greater than 10 Matrix.1[Matrix.1[,1] > 10, ] ## x1 x2 ## [1,] 200 2 ## [2,] 200 NA ## [3,] 200 2 ``` ] --- ### 3. Data frame: multiple variables of same length in matrix form -- .pull-left[ - Very important data structure in R! - Columns can be different data types - Think of data frames as Excel spreadsheets - You may see *tibble*s; just a slightly modified version of data frames used in *tidyverse* coding (more on the tidyverse later) ] -- .pull-right[ ``` r df <- data.frame( st = c("UT", "NV", "OR", "TX", "NY", NA), Wages = x1, Spend = x2) df ## st Wages Spend ## 1 UT 200 2 ## 2 NV 4 8 ## 3 OR 200 NA ## 4 TX 4 8 ## 5 NY 200 2 ## 6 <NA> 5 10 # Variable names names(df) ## [1] "st" "Wages" "Spend" ``` ] -- --- #### Data frames (continued) -- Use brackets to identify subsets of a data frame ``` r # 2nd column df[, 2] ## [1] 200 4 200 4 200 5 ``` -- ``` r # 4th row df[4, ] ## st Wages Spend ## 4 TX 4 8 ``` -- Dollar sign notation -- ``` r df$Wages ## [1] 200 4 200 4 200 5 ``` -- ``` r # Rows of df with wages < 10 df[df$Wages < 10, ] ## st Wages Spend ## 2 NV 4 8 ## 4 TX 4 8 ## 6 <NA> 5 10 ``` --- ### 4. 
List: an object containing other objects ``` r # Create some objects of different types and lengths number.vec = c(2, 3, 5) string.vec = c("aa", "bb", "cc", "dd", "ee") logical.vec = c(TRUE, FALSE, TRUE, FALSE, FALSE) ``` -- ``` r # x.list contains the objects nn, ss, ll and an unnamed fourth element x.list = list(nn = number.vec, ss = string.vec, ll = logical.vec, 3) x.list ## $nn ## [1] 2 3 5 ## ## $ss ## [1] "aa" "bb" "cc" "dd" "ee" ## ## $ll ## [1] TRUE FALSE TRUE FALSE FALSE ## ## [[4]] ## [1] 3 ``` -- Think of lists as folders --- #### Lists (continued) -- ``` r x.list ## $nn ## [1] 2 3 5 ## ## $ss ## [1] "aa" "bb" "cc" "dd" "ee" ## ## $ll ## [1] TRUE FALSE TRUE FALSE FALSE ## ## [[4]] ## [1] 3 ``` -- - Referencing objects in lists can be a bit tricky -- + Rule of thumb: double brackets are for lists -- + Double brackets pull out an object from a list -- an object may itself have many elements! -- + Double brackets return an element from a list as an object -- ``` r x.list[[1]] # First object in the x.list list ## [1] 2 3 5 x.list[[4]] # Fourth object in the x.list list ## [1] 3 ``` --- #### Lists (continued) We can combine single and double brackets .pull-left[ ``` r x.list ## $nn ## [1] 2 3 5 ## ## $ss ## [1] "aa" "bb" "cc" "dd" "ee" ## ## $ll ## [1] TRUE FALSE TRUE FALSE FALSE ## ## [[4]] ## [1] 3 ``` ``` r # What will this return? x.list[[1]][3] ``` ] -- .pull-right[ ``` r x.list[[1]][3] ## [1] 5 ``` ``` r # 1st element of 4th object in x.list x.list[[4]][1] ## [1] 3 # 2nd element of "nn" object in x.list x.list[["nn"]][2] ## [1] 3 # 3rd element of ss object in x.list x.list$ss[3] ## [1] "cc" ``` ] --- ### Data structure -- .pull-left[ Figure out the data structure of an object: ``` r str(x1) ## num [1:6] 200 4 200 4 200 5 str(Matrix.1) ## num [1:6, 1:2] 200 4 200 4 200 5 2 8 NA 8 ... ## - attr(*, "dimnames")=List of 2 ## ..$ : NULL ## ..$ : chr [1:2] "x1" "x2" str(df) ## 'data.frame': 6 obs. of 3 variables: ## $ st : chr "UT" "NV" "OR" "TX" ... 
## $ Wages: num 200 4 200 4 200 5 ## $ Spend: num 2 8 NA 8 2 10 str(x.list) ## List of 4 ## $ nn: num [1:3] 2 3 5 ## $ ss: chr [1:5] "aa" "bb" "cc" "dd" ... ## $ ll: logi [1:5] TRUE FALSE TRUE FALSE FALSE ## $ : num 3 ``` ] -- .pull-right[ Compare to knowing the data type ``` r class(x1) ## [1] "numeric" ``` ] --- class: left, top, inverse background-position: center background-size: cover background-image: url(Figures/020419-oceans-turning-color.jpg) ## Goal 6: Describe data --- #### Descriptive statistics List the data objects in memory ``` r objects() ## [1] "df" "dta" "logical.vec" ## [4] "Matrix.1" "Missing.indicator" "my_data" ## [7] "number.vec" "Poll.1" "Poll.2" ## [10] "string.vec" "x.dot" "x.list" ## [13] "x.logical" "x.logical2" "x.new" ## [16] "x.numeric" "x.string" "x1" ## [19] "x2" ``` --- #### Descriptive statistics (continued) Many functions available ``` r # mean() is basic function -- as are sum(), min() etc mean(x1) ## [1] 102.1667 ``` -- ``` r # table() provides the frequency distribution of each value table(x1) ## x1 ## 4 5 200 ## 2 1 3 ``` -- ``` r # summary() provides basic descriptive stats summary(x1) ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 4.00 4.25 102.50 102.17 200.00 200.00 ``` --- #### Descriptive statistics (continued) -- We often need to deal with missing data ``` r x2 ## [1] 2 8 NA 8 2 10 mean(x2) ## [1] NA ``` -- For many functions, remove NA observations with ", na.rm=TRUE" -- ``` r mean(x2, na.rm = TRUE) ## [1] 6 median(x1, na.rm = TRUE) ## [1] 102.5 max(x.new, na.rm = TRUE) ## [1] 10 ``` --- #### Descriptive statistics (continued) ``` r # For matrix or data frame - need to specify what mean we're looking for Matrix.1 ## x1 x2 ## [1,] 200 2 ## [2,] 4 8 ## [3,] 200 NA ## [4,] 4 8 ## [5,] 200 2 ## [6,] 5 10 ``` -- ``` r # Mean of all data in matrix (seldom useful) mean(Matrix.1, na.rm = TRUE) ## [1] 58.45455 ``` -- ``` r # Mean of column 1 mean(Matrix.1[,1]) ## [1] 102.1667 ``` --- #### Descriptive statistics (continued) -- *apply()* function applies a function to specified dimension of a matrix -- ``` r Matrix.1 ## x1 x2 ## [1,] 200 2 ## [2,] 4 8 ## [3,] 200 NA ## [4,] 4 8 ## [5,] 200 2 ## [6,] 5 10 apply(Matrix.1, 1, mean) # Means of rows ## [1] 101.0 6.0 NA 6.0 101.0 7.5 ``` -- ``` r apply(Matrix.1, 2, mean) # Means of columns ## x1 x2 ## 102.1667 NA ``` -- ``` r # Add additional arguments for the function (such as dealing with NA) apply(Matrix.1, 2, mean, na.rm=TRUE) ## x1 x2 ## 102.1667 6.0000 ``` --- class: right, top, inverse background-position: center background-size: cover background-image: url(Figures/blue-2807728_960_720.jpg) ## Goal 7: Create new data objects --- #### Creating new data objects -- Sequences ``` r # Sequence of numbers 1 thru 4 x.seq = seq(1:4) x.seq ## [1] 1 2 3 4 ``` -- ``` r # Sequence from 0 to 100 by 20 x.seq2 = seq(0, 100, by = 20) x.seq2 ## [1] 0 20 40 60 80 100 ``` -- Random numbers ``` r # 6 draws from a random normal distribution x.rand = rnorm(6) x.rand ## [1] 1.0680748 -1.4616957 0.1676797 0.9629749 1.4979996 -0.3886791 # 4 draws from a random uniform distribution x.rand.unif = runif(4) x.rand.unif ## [1] 0.2516914 0.1401574 0.5151736 0.4227422 ``` 
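---
#### Creating new data objects (continued)

Random draws like the ones above change every time the code runs, so your numbers will not match the slide. A minimal sketch, assuming you want draws that can be reproduced: call `set.seed()` before drawing. The seed value 5200 and the object names `draws.first` and `draws.second` are arbitrary choices for this illustration.

``` r
# Fix the random number generator's starting point (5200 is arbitrary)
set.seed(5200)
draws.first <- rnorm(6)

# Resetting the same seed reproduces exactly the same draws
set.seed(5200)
draws.second <- rnorm(6)

# Check that the two sets of draws are the same
identical(draws.first, draws.second)
## [1] TRUE
```

This is especially useful in an Rmd file, where every knit starts a new R session.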
--- #### Creating new data objects (continued) #### Can put data into existing objects ``` r # Add a variable called x3 in the df dataframe df$x3 = seq(-4, 1, by = 1) ``` -- ``` r # What is the difference? Wages.SQ = df$Wages^2 df$Wages.squared = df$Wages^2 ``` -- ``` r Wages.SQ ## [1] 40000 16 40000 16 40000 25 ``` -- ``` r names(df) ## [1] "st" "Wages" "Spend" "x3" ## [5] "Wages.squared" objects()[1:15] ## [1] "df" "dta" "logical.vec" ## [4] "Matrix.1" "Missing.indicator" "my_data" ## [7] "number.vec" "Poll.1" "Poll.2" ## [10] "string.vec" "Wages.SQ" "x.dot" ## [13] "x.list" "x.logical" "x.logical2" ``` --- class: left, top, inverse, hide-logo background-position: center background-size: cover background-image: url("Figures/020419-oceans-turning-color.jpg") ## Goal 8: Work with loops --- ### Loops Use the **_`for()` function_** to create loops. ```{} for( i in 1:5 ) ^ ^ | |___ the values of i that will be evaluated in the loop | i is the 'counter' ``` -- **_brackets `{}`_** house the code that is going to happen each iteration. ```{} for( i in 1:5 ){ |~~~~~~~~~~~~~~~~| |~~~~~~~~~~~~~~~~| |~~~~~~~~~~~~~~~~| code performed for each iteration. 
|~~~~~~~~~~~~~~~~| } ``` --- #### Loops (continued) Example ``` r for( ii in 1:4 ){ cat("The value of ii is", ii, "\n") } ## The value of ii is 1 ## The value of ii is 2 ## The value of ii is 3 ## The value of ii is 4 ``` -- - We can name the looping counter anything we want (_ii_ in this case) -- - We use the `cat` function (for _concatenate_) to print cleanly -- - We use "\n" to start a new line each time --- class: left, top, inverse, hide-logo background-position: center background-size: cover background-image: url("Figures/peace-sea.jpg") ## Goal 9: Understand if statements --- ### If statements -- - Structure: if(condition) { true.expression } else {false.expression} -- ``` r # Simple example using nchar function (that counts characters in string object) if(nchar("Algeria") > 8) cat("Algeria is a long name\n") if(nchar("Algeria") < 8) cat("Algeria is a short name\n") ## Algeria is a short name ``` -- ``` r # Now in a loop for(cc in 2:7){ if(nchar(CountryCode$country[cc]) > 8) { cat(CountryCode$country[cc], "is a long name\n") } if(nchar(CountryCode$country[cc]) < 8) { cat(CountryCode$country[cc], "is a short name\n") } } ## Algeria is a short name ## American Samoa is a long name ## Andorra is a short name ## Angola is a short name ## Antigua is a short name ## Azerbaijan is a long name ``` --- #### If statements (continued) -- - *ifelse* Structure: ifelse(condition, expression1, expression2) ``` r # Create a new variable with the length of each country name # But some rows are not countries CountryCode[CountryCode$V2<0,] ## V2 country ## 187 -5 Missing; Unknown ## 188 -4 Not asked in survey ## 189 -3 Not applicable ## 190 -2 No answer ## 191 -1 Don't know ``` -- ``` r # Use ifelse to compute name length only for rows that are countries CountryCode$nameLength = ifelse(CountryCode$V2 > 0, nchar(CountryCode$country), NA) CountryCode[c(1:4, dim(CountryCode)[1]),] ## V2 country nameLength ## 1 8 Albania 7 ## 2 12 Algeria 7 ## 3 16 American Samoa 14 ## 4 20 Andorra 7 ## 191 -1 
Don't know NA ``` <!-- # search for "grepl" for more information on regular expressions --> <!-- grepl(" ", CountryCode$country), --> <!-- #### Additional RMarkdown Capabilities --> <!-- -- --> <!-- - Customize with CSS files --> <!-- -- --> <!-- - Create dashboard (see Xie et al Chapter 5) --> <!-- -- --> <!-- - Create slides: PowerPoint or HTML with (for example) `xargingan` package --> <!-- -- --> <!-- - Create interactive web apps via Shiny (Xie et al p. 42) --> <!-- -- --> <!-- - Automated customized reports including emails (Xie et al, p. 10 and Chapter 15) and templates (Xie et al Chapter 17) --> <!-- -- --> <!-- - Academic papers (with automated references) --> <!-- -- --> <!-- - Conference posters (`posterdown` package) --> <!-- -- --> <!-- - Websites (see Xie et al, p. 13) --> <!-- --- --> <!-- ## Example 1: World Values Survey --> <!-- -- --> <!-- ```{r echo = FALSE} --> <!-- CountryCode = read.csv("C:\\Users\\baileyma\\Dropbox\\Textbook_drop\\TextbookData\\WorldValues\\Country codes for WVS wave 6.csv") --> <!-- WV = readRDS("C:\\Users\\baileyma\\Dropbox\\Textbook_drop\\TextbookData\\WorldValues\\F00007762-WV6_Data_R_v20180912.rds") --> <!-- ## PICK OUT SELECT VARIABLES --> <!-- WV$Satisfied = ifelse(WV$V23 >0, WV$V23, NA) --> <!-- WV$Income = ifelse(WV$V239 >0, WV$V239, NA) --> <!-- WV$Education = ifelse(WV$V248 >0, WV$V248, NA) --> <!-- WV$Conserv = ifelse(WV$V95 >0, WV$V95, NA) --> <!-- WV$Male = ifelse(WV$V240 == 2, 1, 0) --> <!-- WV$Marital = ifelse(WV$V57 >0, WV$V57, NA) --> <!-- WV$BirthYear = ifelse(WV$V241 >0, WV$V241, NA) --> <!-- ## Age is interview year minus year of birth --> <!-- WV$Age = WV$V262 - WV$BirthYear --> <!-- #WV$Religious = recode(WV$V145, "1=7; 2=6; 3=5; 4=4; 5=3; 6=2; 7=1; else=NA") --> <!-- ## USE LISTS TO CREATE OLS MODEL FOR EACH COUNTRY (or perhaps subset) --> <!-- Country.vector = CountryCode %>% --> <!-- filter(V2>0) %>% --> <!-- filter(V2 %in% unique(WV$V2)) %>% --> <!-- filter(V2 != 332) --> <!-- #Country.vector --> 
<!-- ``` --> <!-- ```{r collapse=TRUE, tidy=FALSE} --> <!-- # Run regressions on a list of countries --> <!-- WV.ols.list <- --> <!-- lapply(Country.vector$V2, --> <!-- function(cc){ --> <!-- lm(Satisfied ~ Income + Male + Age, data = WV[WV$V2 == cc,]) --> <!-- } ) --> <!-- # Name each regression output --> <!-- names(WV.ols.list) = Country.vector$country --> <!-- ``` --> <!-- --- --> <!-- #### World Values Survey (continued) --> <!-- We can pick any country and easily display regression results. --> <!-- ```{r collapse=TRUE, tidy=FALSE} --> <!-- summary(WV.ols.list[["United States"]]) --> <!-- ``` --> <!-- --- --> <!-- #### World Values Survey (continued) --> <!-- #### Top fifteen countries by income coefficient --> <!-- ```{r collapse=TRUE, tidy=FALSE, echo=FALSE} --> <!-- # Extract the slope from all models --> <!-- slopes <- round(sapply(WV.ols.list, function(x) x$coefficients["Income"]) , 2) --> <!-- t.stats <- round(sapply(WV.ols.list, function(x) { --> <!-- summary(x)$coefficients["Income",3]}) , 2) --> <!-- ``` --> <!-- -- --> <!-- ```{r collapse=TRUE, tidy=FALSE} --> <!-- Results = data.frame(Country = names(WV.ols.list), slopes, t.stats) %>% --> <!-- arrange(desc(slopes)) --> <!-- Results[1:15,] --> <!-- ``` -->
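---
#### If statements (continued)

The loop example earlier uses two separate `if` statements, so a name with exactly 8 characters prints nothing. A minimal sketch of the `if ... else` structure that covers every case is below. The vector `country.names` and the function `name.length.label()` are made up for this illustration and are not part of the course data.

``` r
# Hypothetical names chosen to hit all three branches
country.names <- c("Algeria", "American Samoa", "Barbados")

# Label a name as long, short or exactly 8 characters
name.length.label <- function(nm) {
  if (nchar(nm) > 8) {
    "a long name"
  } else if (nchar(nm) < 8) {
    "a short name"
  } else {
    "exactly 8 characters long"
  }
}

# Print one line per name
for (nm in country.names) {
  cat(nm, "is", name.length.label(nm), "\n")
}
```

Only one branch runs for each name: R checks the conditions in order and stops at the first one that is true.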