Introduction to R (Basic - Intermediate)

Abraham Azar (abraham.azar@impact-initiatives.org)

1 An Introduction to R

1.1 Course Introduction

Welcome to this training course. During this course, I hope to introduce you to R, an interactive environment for statistical computing. R is not difficult to learn, but just like any new language, the initial learning curve can be a little steep, and you will need to use it frequently or you will forget it.

I have tried to simplify the content of this course as much as possible and tailor it to IMPACT's needs. My aim is to help you climb the initial learning curve and give you the basic skills you need to keep building your experience with this language.

This course is split into intro, basic, and intermediate levels, with exercises and a final project. I encourage you to complete the exercises and to watch out for details.

Buckle up and Enjoy the Ride!!

1.2 FAQ

1.2.1 What is RStudio?

RStudio is an integrated development environment (IDE) for R which works with the standard version of R available from CRAN. RStudio includes a wide range of productivity enhancing features and runs on all major platforms. There are several related IDE products: RStudio Workbench (previously RStudio Server Pro), RStudio Server Open Source, and RStudio Desktop. RStudio Workbench and RStudio Server enable you to provide a browser-based interface to a server running on a remote Linux system.

1.2.2 What versions of R is RStudio compatible with?

RStudio requires an installation of R 4.2.3 or higher. You can download the most recent version of R for your environment from CRAN.

1.2.3 What is the difference between RStudio Desktop, RStudio Server, and RStudio Workbench?

RStudio Desktop is an IDE that works with the version of R you have installed on your local Windows, Mac OS X, or Linux workstation. Version 1.3 and later of RStudio Desktop can also be used as a client of RStudio Workbench (previously RStudio Server Pro).

RStudio Workbench (previously RStudio Server Pro) and RStudio Server Open Source are Linux server applications that provide a web-browser-based interface to a server running on a remote Linux system. For more on why you might want to deploy an RStudio Server instead of RStudio Desktop, see the server documentation.

Check POSIT to see more on FAQs.

2 Setup Instruction

2.1 Core Software

During this course, we will be using RStudio. To get your computer ready for this course, please follow the instructions below.

First, install R. For this training, you will need R version 4.2.3 or higher. Download and install R for Windows, Mac or Linux.

Second, install RStudio. Download and install the free RStudio Desktop version.

These two programs are downloaded and installed separately. R is the statistical computing environment, and RStudio is the IDE that makes R easier and more pleasant to use.

2.2 RStudio Interface

The RStudio interface is composed of quadrants, each of which fulfills a unique purpose:

The Console window,
The Source window,
The Environment / History / Connections / Tutorial window,
The Files / Plots / Packages / Help / Viewer window

Sometimes only three windows might be showing, and you will be wondering where the Source window has gone. In order to use it, you have to open a file. You can create a new file by selecting File -> New File -> R Script.

2.3 Installing first package

You will realize as you work with R that most of the cool features and tools come from third-party packages. They are super easy to install. There are different ways to install packages in R, the main one being the install.packages() command.

Try installing the ggplot2 package from the console window.

install.packages("ggplot2")

The process should be straightforward. R will automatically install any other packages that ggplot2 might need. Throughout this course, we will focus on a couple of packages that are widely used within IMPACT. Remember that with R, it is literally to infinity and beyond, and hopefully you will be building your own packages one day.

3 Tips before starting our journey

3.1 RStudio Projects

RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents. More information on how to create a project is available from POSIT.

3.2 R Scripts

Try to put all the steps of your work in an R script rather than running everything in the console window. This way you will be able to share the work with others and get the same results every time.

The larger the project, the more complex the work becomes. From experience, try to split your tasks into different scripts so you are not overwhelmed by thousands of lines of code.

3.3 Good Layouts

Try to be consistent in the layout of your scripts.

Load libraries first
Load the data you are working with
Change or analyse your data
Output and save your work.
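
As a rough illustration of this layout, a script could be organised as below. This is only a sketch: it uses the built-in mtcars dataset as a stand-in for real project data, and the output file name is a made-up placeholder.

```r
# 1. Load libraries first
library(stats)   # ships with R; replace with the packages your project needs

# 2. Load the data you are working with
#    (mtcars is a built-in example dataset, used here as a stand-in)
cars_data <- mtcars

# 3. Change or analyse your data
cars_data$weight_kg <- cars_data$wt * 1000 * 0.4536  # wt is recorded in 1000 lbs
average_mpg <- mean(cars_data$mpg)

# 4. Output and save your work
output_path <- file.path(tempdir(), "cars_summary.csv")  # hypothetical output location
write.csv(cars_data, output_path, row.names = TRUE)
```

Keeping the four blocks in the same order in every script makes it easier to find where data comes in and where results go out.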

3.4 Write readable code

For many obvious reasons:

Other people might need to use or see your code
You might need to use your code in the future

You might write perfectly functional code, yet not understand a single line of it later. Below are some examples.

3.4.1 Basic formatting

Use the following simple rules for writing readable code:

Use spaces between variables and operators
Break up long lines of code. You can display a margin line as a guide by going to Tools -> Global Options -> Code -> Display -> Show Margin and setting it to 110
Use meaningful variable names, keeping in mind that R is case sensitive.

See the difference between:

malDiff1=lm(y~grp+grpTim,df,subset=sext1=="m")

and:

male_difference = lm(score ~ group + group_time_interaction,
                     data = interview_data,
                     subset = gender == "male")

Both snippets do the same thing, but the second is better 😊.

3.4.2 Be Consistent

Make sure to stay consistent across your whole script, as you and the next person using the script will get used to the format in use and will be able to identify variables and data frames.

3.4.3 The most important: Include Comments

It is the best feeling in the world when I open someone else’s code and find comments in it. Comments are used either to explain what you are trying to do or to show how to use the script the right way. Comments in R are written using #.

# Here I am adding two variables

variable1 <- 1 # first variable
variable2 <- 4 # second variable

total <- variable1 + variable2 # named total to avoid masking the built-in sum()
total # print the result

Output:

## [1] 5

4 Quiz 1 (Intro to R)

If you would like to test your knowledge, please click on the link below and complete the Intro to R questions.

LINK

5 Basics of R

There are a couple of things to know about programming fundamentals before jumping in and building your first script in R. R is just another programming language that is available to serve you at your command.

First, we start from the basics of data types.

5.1 Basic Data Types

5.1.1 Numbers

Numbers in R are stored as either numeric or integer values. Adding the suffix L to a whole number makes it an integer.

## Numeric variable
x <- 28.5
class(x)

Output:

## [1] "numeric"

## Integer variable
y <- 28L
class(y)

Output:

## [1] "integer"

5.1.2 Logical

A logical is a Boolean value that can only be TRUE or FALSE.

## Logical variable
z <- TRUE
class(z)

Output:

## [1] "logical"

5.1.2.1 or:

## Logical variable
logical <- 2 < 1
class(logical)

Output:

## [1] "logical"

5.1.3 Characters

Anything that is put inside " " or ' ' is considered text (a string).

## String variable
my_string <- "I love this training"
class(my_string)

Output:

## [1] "character"

5.2 Operators in R

Operators in R are used to perform operations on variables and values.

They are divided into the following groups:

Arithmetic operators
Assignment operators
Comparison operators
Logical operators
Miscellaneous operators

5.2.1 Arithmetic

Mainly used to perform common arithmetic operations.

Operator	Name	Example
`+`	Addition	x + y
`-`	Subtraction	x - y
`*`	Multiplication	x * y
`/`	Division	x / y
`^`	Exponent	x ^ y
`%%`	Modulus	x %% y
`%/%`	Integer Division	x %/% y
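
To make the table concrete, here is a small sketch applying each operator to two example values. The values 17 and 5 are arbitrary choices for illustration.

```r
x <- 17
y <- 5

x + y    # 22
x - y    # 12
x * y    # 85
x / y    # 3.4
x ^ 2    # 289
x %% y   # 2  (remainder of 17 divided by 5)
x %/% y  # 3  (how many whole times 5 fits into 17)
```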

5.2.2 Assignment

These operators are used to assign values to variables.

my_var <- "value"
my_var <<- "value"
my_var = "value"
"value" -> my_var
"value" ->> my_var
my_var ## to print my_var

Output:

## [1] "value"
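
At the top level of a script, all five forms above give the same result, and <- is the form you will see most often. The difference with <<- only shows up inside a function, where it modifies a variable that already exists outside the function. A minimal sketch, with made-up function names:

```r
counter <- 0

add_local <- function() {
  counter <- counter + 1   # creates a new local variable; the outer one is untouched
  counter
}

add_global <- function() {
  counter <<- counter + 1  # modifies the variable defined outside the function
  counter
}

add_local()   # returns 1
counter       # still 0

add_global()  # returns 1
counter       # now 1
```

Because <<- changes things outside the function, it can make scripts harder to follow, so it is usually best kept for rare cases.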

5.2.3 Comparison

Mainly used to compare two values.

Operator	Name	Example
`==`	Equal	x == y
`!=`	Not equal	x != y
`>`	Greater than	x > y
`<`	Less than	x < y
`>=`	Greater than or equal to	x >= y
`<=`	Less than or equal to	x <= y

5.2.4 Logical

Mainly used to combine conditional statements

Operator	Description
`&`	Element-wise logical AND. Return TRUE if both elements are TRUE
`&&`	Logical AND. Return TRUE if both statements are TRUE
`|`	Element-wise logical OR. Return TRUE if one of the elements is TRUE
`||`	Logical OR. Return TRUE if one of the statements is TRUE
`!`	Logical NOT. Return FALSE if statement is TRUE
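
The difference between the element-wise and the single-value forms is easier to see on an example. The sketch below uses made-up ages:

```r
ages <- c(15, 22, 70)

# Element-wise: one result per element of the vector
ages > 18 & ages < 65    # FALSE  TRUE FALSE
ages < 18 | ages > 65    #  TRUE FALSE  TRUE

# Single values: one result overall, typically used inside if ()
age <- 22
age > 18 && age < 65     # TRUE
!(age > 18)              # FALSE
```

As a rule of thumb, & and | are the ones to use on columns of data, while && and || fit conditions that must give a single TRUE or FALSE.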

5.2.5 Miscellaneous

Mainly used to manipulate data. We will see them a lot in use in coming chapters.

Operator	Description	Example
`:`	Create series of numbers in a sequence	x <- 1:10
`%in%`	Find an element inside a vector (to be explained later)	x %in% y

5.3 Data Structures in R

5.3.1 Variables

Variables are objects for storing data values. There is no command to declare a variable; it is created the moment a value is assigned to it. We saw an example of assigning a value to a variable in the operators section above.

name <- "Abraham"
age <- 30
name
age

Output:

## [1] "Abraham"

## [1] 30

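Variables can be concatenated using the paste() function. Use a comma (,) between the variables inside the function. Note that paste() inserts a space between the items by default, which is why the output below contains double spaces.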

name <- "Abraham"
age <- 30
paste(name, " is ", age, " years old.")

Output:

## [1] "Abraham  is  30  years old."
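
To control the separator, paste() takes a sep argument, and paste0() joins the items with no separator at all. A short sketch:

```r
name <- "Abraham"
age <- 30

paste(name, "is", age, "years old.")       # "Abraham is 30 years old."
paste0(name, " is ", age, " years old.")   # "Abraham is 30 years old."
paste(name, age, sep = "-")                # "Abraham-30"
```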

For numbers, you can use the arithmetic operator (+), which performs the actual addition. If you try to use it between a numeric and a character variable, R will give you an error.

name <- "Abraham"
age <- 30
name + age

Output:

## Error in name + age: non-numeric argument to binary operator

5.3.2 Vectors

Vectors are ordered collections of items of the same type. You will be using them a lot to define sets of values to iterate over or to compare with other vectors.

To combine items into a vector, you can use the c() function and separate the items with commas.

names <- c("Abraham","Maksym","Oleksandr","Karyna")
names

Output:

## [1] "Abraham"   "Maksym"    "Oleksandr" "Karyna"

You can use the : operator if you want to create a vector containing a sequence of numbers.

numeric_list <- c(1:10)
numeric_list

Output:

##  [1]  1  2  3  4  5  6  7  8  9 10

The following are some common operations on vectors.

5.3.2.1 Access Vectors

You can access an item inside a vector by putting its position inside brackets []. Please note that in R, unlike many other programming languages, indexing starts at 1.

names <- c("Abraham","Maksym","Oleksandr","Karyna")
names[1]

Output:

## [1] "Abraham"

5.3.2.2 Replace items in Vectors

You can assign a new value targeting the item you want to change using the index.

names <- c("Abraham","Maksym","Oleksandr","Karyna")
names[1] <- "Anastasiia"
names

Output:

## [1] "Anastasiia" "Maksym"     "Oleksandr"  "Karyna"

5.3.2.3 Vector Length

You can check the vector’s length using the length() function

names <- c("Abraham","Maksym","Oleksandr","Karyna")
length(names)

Output:

## [1] 4

5.3.3 Factors

Factors exist to categorize data, and the categories inside a factor are called levels in R. To create a factor, you can use the factor() function.

units <- factor(c("ISU","CEU","Sectors","ISU","ISU","CASH","CEU"))
units

Output:

## [1] ISU     CEU     Sectors ISU     ISU     CASH    CEU    
## Levels: CASH CEU ISU Sectors

You can set the levels inside the factor function by calling the parameter levels.

As with vectors, you can access items, change them, and check the length.
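
A minimal sketch of both points, setting the levels explicitly and then treating the factor like a vector. The variable name units_ordered is chosen here so that the units factor defined above is left unchanged.

```r
units_ordered <- factor(
  c("ISU", "CEU", "Sectors", "ISU", "ISU", "CASH", "CEU"),
  levels = c("ISU", "CEU", "CASH", "Sectors")   # custom order instead of alphabetical
)

levels(units_ordered)    # "ISU" "CEU" "CASH" "Sectors"
units_ordered[3]         # access the third item: Sectors
length(units_ordered)    # 7
table(units_ordered)     # number of items per level

units_ordered[3] <- "CASH"   # replace an item; the new value must be an existing level
```

Assigning a value that is not one of the levels produces a warning and stores NA instead, which is a common source of surprise with factors.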

5.3.4 Data frames

Data frames are the most commonly used data structure in R. A data frame is data displayed in the format of a table. Different columns can store different types of data: character, numeric, or logical. Remember, all values within a single column must have the same type.

You can use data.frame() function to create a data frame.

units_df <- data.frame(
  units = c("ISU","CEU","CASH","Sectors"),
  num_of_employess = c(10,45,20,1),
  success = c(T,T,T,T)
)
units_df

Output:

##     units num_of_employess success
## 1     ISU               10    TRUE
## 2     CEU               45    TRUE
## 3    CASH               20    TRUE
## 4 Sectors                1    TRUE

You can use the summary() function to summarise your data frame.

summary(units_df)

Output:

##     units           num_of_employess success       
##  Length:4           Min.   : 1.00    Mode:logical  
##  Class :character   1st Qu.: 7.75    TRUE:4        
##  Mode  :character   Median :15.00                  
##                     Mean   :19.00                  
##                     3rd Qu.:26.25                  
##                     Max.   :45.00

5.3.4.1 Access items in df

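You can use single brackets [ ], double brackets [[ ]], or the $ sign to access a column. Single brackets return a data frame with one column, while double brackets and $ return the column as a vector.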

units_df[1]

units_df[["units"]]

units_df$units

Output:

##     units
## 1     ISU
## 2     CEU
## 3    CASH
## 4 Sectors

## [1] "ISU"     "CEU"     "CASH"    "Sectors"

## [1] "ISU"     "CEU"     "CASH"    "Sectors"
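
Rows and single cells can be reached in a similar way with the [row, column] form. The sketch below rebuilds the same units_df so that it can be run on its own:

```r
units_df <- data.frame(
  units = c("ISU", "CEU", "CASH", "Sectors"),
  num_of_employess = c(10, 45, 20, 1),
  success = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

units_df[2, ]                                # second row, all columns
units_df[2, "num_of_employess"]              # a single cell: 45
units_df[units_df$num_of_employess > 15, ]   # rows where the condition is TRUE
class(units_df["units"])                     # "data.frame"
class(units_df[["units"]])                   # "character"
```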

5.3.4.2 Combine rows to df or df to df

You can use the rbind() function to combine a row to an existing df, or two dfs together.

new_units_df <- data.frame(
  units = c("ISU","CEU","Sectors"),
  num_of_employess = c(10,45,1),
  success = c(T,T,T)
)

combined_df <- rbind(units_df, new_units_df)

combined_df

Output:

##     units num_of_employess success
## 1     ISU               10    TRUE
## 2     CEU               45    TRUE
## 3    CASH               20    TRUE
## 4 Sectors                1    TRUE
## 5     ISU               10    TRUE
## 6     CEU               45    TRUE
## 7 Sectors                1    TRUE

5.3.4.3 Amount of Rows and Columns

You can use ncol() and nrow() to check the number of columns and rows of a df.

ncol(combined_df)
nrow(combined_df)

Output:

## [1] 3

## [1] 7

5.4 Conditions and statements

We saw in the Operators section some of the operators that can be used to compare values. R also supports conditional statements, such as if statements, and loops.

5.4.1 If Statement

The “if statement” uses the if keyword and has the following syntax.

num_one <- 20
num_two <- 30

if (num_one < num_two) {
  print("num_two is greater than num_one")
}

Output:

## [1] "num_two is greater than num_one"

5.4.2 Else If

This statement uses the else if keyword. Its condition is checked only if the previous conditions were not true. You can use as many else if statements as you want.

num_one <- 20
num_two <- 20

if (num_one < num_two) {
  print("num_two is greater than num_one")
} else if (num_two == num_one){
  print("num_one and num_two are equal")
}

Output:

## [1] "num_one and num_two are equal"

5.4.3 If Else

else is a keyword that is used to catch anything that is not meeting any of the previous conditions.

num_one <- 30
num_two <- 20

if (num_one < num_two) {
  print("num_two is greater than num_one")
} else if (num_two == num_one){
  print("num_one and num_two are equal")
} else {
  print("num_one is greater than num_two")
}

Output:

## [1] "num_one is greater than num_two"

5.4.4 Use of AND or OR in IF

Recall the logical operators from the operators section. They can be used to combine two conditions.

num_one <- 30
num_two <- 20
num_three <- 100

if (num_one > num_two & num_three > num_two) {
  print("Both conditions TRUE.")
}

Output:

## [1] "Both conditions TRUE."

num_one <- 30
num_two <- 20
num_three <- 100

if (num_one > num_two | num_two > num_three) {
  print("At least one condition is TRUE.")
}

Output:

## [1] "At least one condition is TRUE."

5.4.5 For Loop

for loops are used to iterate over a sequence, vector, list, or rows/columns of data frames.

for(i in 1:10){
  print(i)
}

Output:

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Another example with the units list from before:

for(i in units){
  print(i)
}

Output:

## [1] "ISU"
## [1] "CEU"
## [1] "Sectors"
## [1] "ISU"
## [1] "ISU"
## [1] "CASH"
## [1] "CEU"
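
Looping over the rows or columns of a data frame follows the same idea. The sketch below uses a small made-up data frame called staff_df:

```r
staff_df <- data.frame(
  units = c("ISU", "CEU", "CASH", "Sectors"),
  num_of_employess = c(10, 45, 20, 1)
)

# Loop over row numbers; seq_len() is safe even when the data frame is empty
for (i in seq_len(nrow(staff_df))) {
  print(paste(staff_df$units[i], "has", staff_df$num_of_employess[i], "employees"))
}

# Loop over column names
for (column_name in names(staff_df)) {
  print(paste(column_name, "is of class", class(staff_df[[column_name]])))
}
```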

6 Quiz 2 (Basics of R)

If you would like to test your knowledge, please click on the link below and complete the Basics of R questions.

LINK

7 Intermediate level

Congratulations!! You made it to the next level.

In this level, we will dive more into specific packages that are mostly used within IMPACT as part of data manipulation and data wrangling in R. Please make sure that you have completed the Basic levels of R before diving deeper. Almost all concepts mentioned are crucial for this part.

7.0.1 Data to be used

In this course, I will be using a part of the 2022 MSNA Ukraine data in the examples below. You will find the dataset here to download so you can work at the same time and practice.

7.1 R Packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.

In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. As of March 2023, there were over 19,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.

The goal of this training is to get you acquainted with a couple of packages and prepare you to tackle the main data manipulation tasks in your everyday work with IMPACT.

You can install the packages from CRAN with install.packages("x")
You can use them in R with library("x") or library(x).

The main packages that you might come across while you start working in R regularly are:

dplyr: A Grammar for Data Manipulation
tidyr: For tidying messy data
srvyr: For analyzing survey data
ggplot2: For making graphs
leaflet: For creating maps
And many others

In this course, we will focus mainly on the dplyr package, using an example dataset from the 2022 MSNA in Ukraine.

7.2 dplyr

7.2.1 What is dplyr?

dplyr is a powerful R package to manipulate, clean and summarize tabular data. It makes data manipulation very fast. It comprises many functions that perform the most commonly used data manipulation operations, such as filtering, selecting specific columns, sorting data, adding or deleting columns, and aggregating data.

To install and load the dplyr package in RStudio, follow these steps:

install.packages("dplyr")

library(dplyr)

7.2.2 Most important functions

The most commonly used functions in the dplyr package are:

select(): picks columns based on their names or type
filter(): picks rows based on their values
group_by(): group the rows together depending on conditions
summarise(): summarize or aggregate the data together
arrange(): sort and order the rows by values of the column
mutate(): create new variables by mutating existing ones
left_join(), inner_join(), full_join(): join data frames together
rename(): rename column names in your dataset

7.2.3 Pipes

You will most likely come across this symbol: %>%. It is called the pipe. It passes the result on its left as the first argument of the function on its right, which lets you chain together actions in the order they are performed.

For example:

We will filter our data to keep only ages above 18
then, we want to mutate the sum of income in last month and this month
then, we want to group_by the hromada together
finally, summarise the average income per hromada.

Here is some documentation on %>%
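
The four steps listed above could be chained as in the sketch below. The data frame and its column names (hromada, age, income_last_month, income_this_month) are invented for illustration and are not the actual MSNA column names.

```r
library(dplyr)

# Made-up example data; the column names are assumptions for illustration only
income_data <- data.frame(
  hromada = c("A", "A", "B", "B", "B"),
  age = c(17, 35, 42, 60, 25),
  income_last_month = c(0, 100, 200, 150, 50),
  income_this_month = c(0, 120, 180, 150, 70)
)

average_income <- income_data %>%
  filter(age > 18) %>%                                              # keep only ages above 18
  mutate(total_income = income_last_month + income_this_month) %>%  # sum the two months
  group_by(hromada) %>%                                             # group rows by hromada
  summarise(average_income = mean(total_income))                    # average per hromada

average_income
```

Read from top to bottom, each line takes the result of the previous one, which is what makes piped code easy to follow.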

7.3 Jump to action

Before jumping into action, it is good to learn how to import your own data into R, for example the data you just downloaded above. In R, you can import many kinds of data: XLSX, CSV, JSON, geodata, etc. Today, we will learn how to import an Excel file.

You will need to install and load the readxl package. You should know by now how to install new packages and use them in a script. Take a moment to do so.

Then, you can use the read_excel() function and assign the result to data, the name used in the examples below:

data <- readxl::read_excel(path = "data/your_data.xlsx", col_types = "text", na = c("NA",""))

The path parameter is the location of your data file, relative to your working directory.

The col_types parameter is a crucial one. We should always try to force reading all columns as character data type and transform them in our actions to numeric if needed to not face any possible issues that might arise.

The na parameter lists the strings that should be read as missing values (NA). By default, only empty strings ("") are treated as NA, but you can also force other strings, such as "NA", to be treated as NA.
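
To illustrate what reading everything as text implies, the sketch below builds a small stand-in data frame by hand (not read from a real file) and converts one column only when a calculation needs it. The column name respondent_age is made up.

```r
# Stand-in for data read with col_types = "text": every column is character
survey <- data.frame(
  uuid = c("a1", "b2", "c3"),
  respondent_age = c("34", "61", NA),
  stringsAsFactors = FALSE
)

class(survey$respondent_age)                # "character"

# Convert only when a calculation needs numbers
survey$respondent_age <- as.numeric(survey$respondent_age)

class(survey$respondent_age)                # "numeric"
mean(survey$respondent_age, na.rm = TRUE)   # 47.5; na.rm = TRUE skips the missing value
```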

7.3.1 `select()` function

Let’s say we would like to select only the uuid and A_2_respondent_sex columns.

age_respondent_df <- data %>% 
  select(uuid, A_2_respondent_sex)

head(age_respondent_df)

Output:

## # A tibble: 6 × 2
##   uuid                                 A_2_respondent_sex
##   <chr>                                <chr>             
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female            
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female            
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male              
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female            
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female            
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 female

Let’s break down the code above. We started by assigning the result to a new data frame, age_respondent_df. If you start with data <- data %>%, you will overwrite the original data with the result of the new actions. We also used a new function called head(), which shows only the first few rows of your data.

You can perform many different actions with the select() function.

To select all columns except specific ones, you can use the - (minus) operator.

df_without_uuid <- data %>% 
  select(-uuid)

To select the first five columns, you can use the : colon operator with indexes or the names of the columns.

first_five_col_df <- data %>% 
  select(1:5)

To select all columns that start with a specific character string like "A_", you can use the function starts_with().

all_A_columns_df <- data %>% 
  select(starts_with("A_"))

Here are some additional options to select columns based on specific criteria:

ends_with(): Select columns that end with a specific character string
contains(): Select columns that contain a specific character string
matches(): Select columns that match a regular expression (REGEX)

Here is some documentation on select()
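
A brief sketch of these helpers on a made-up data frame whose columns are named in the same style as the survey data:

```r
library(dplyr)

# Made-up columns, named in the same style as the survey data
toy_data <- data.frame(
  uuid = c("a1", "b2"),
  A_1_respondent_age = c("34", "61"),
  A_2_respondent_sex = c("female", "male"),
  D_1_shelter_type = c("apartment", "house")
)

toy_data %>% select(ends_with("_age")) %>% names()        # "A_1_respondent_age"
toy_data %>% select(contains("respondent")) %>% names()   # "A_1_respondent_age" "A_2_respondent_sex"
toy_data %>% select(matches("^[A-D]_1_")) %>% names()     # "A_1_respondent_age" "D_1_shelter_type"
toy_data %>% select(uuid:A_2_respondent_sex) %>% names()  # a range of columns by name
```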

7.3.2 `filter()` function

filter() is used to select specific rows from a dataset. Let’s say we would like to see only the HHs whose respondent is older than 60.

elderly_df <- data %>% 
  select(uuid,A_1_respondent_age) %>% 
  filter(as.numeric(A_1_respondent_age) > 60)

head(elderly_df)

Output:

## # A tibble: 6 × 2
##   uuid                                 A_1_respondent_age
##   <chr>                                <chr>             
## 1 5656517f-c389-4d20-8f6d-c0f02b588257 65                
## 2 571f52a6-6370-46d3-b1a7-f27ca5658d0d 68                
## 3 d42104cc-8d0b-4024-b230-45ece957c687 61                
## 4 b8aec8d4-fb84-47d6-9104-16d8f6a9c797 63                
## 5 00bb7e1a-7ff2-4e80-9077-d24f7b1fdb06 61                
## 6 c80714a5-e964-488c-ac68-ddf8428b6771 64

You can see that we also used a new function called as.numeric(). If you recall, when we imported the data into R, we read all columns as characters. That is why we convert the column to numeric before doing any mathematical comparison.

You can use all the logical operators here for combining conditions.

elderly_male_df <- data %>% 
  select(uuid,A_1_respondent_age, A_2_respondent_sex) %>% 
  filter(as.numeric(A_1_respondent_age) > 60 & A_2_respondent_sex =="male")

head(elderly_male_df)

Output:

## # A tibble: 6 × 3
##   uuid                                 A_1_respondent_age A_2_respondent_sex
##   <chr>                                <chr>              <chr>             
## 1 5656517f-c389-4d20-8f6d-c0f02b588257 65                 male              
## 2 d42104cc-8d0b-4024-b230-45ece957c687 61                 male              
## 3 00bb7e1a-7ff2-4e80-9077-d24f7b1fdb06 61                 male              
## 4 def21a5c-e1d8-4487-9b83-427a8c261bbd 82                 male              
## 5 80f9b3dd-bd1d-4e03-9119-04151d1edb06 70                 male              
## 6 6b1b34bf-b537-4ba4-a066-40688780c6b3 71                 male

You can use the %in% operator to filter for specific group of values in a column.

shelter_type <- data %>% 
  select(uuid,D_1_shelter_type) %>% 
  filter(D_1_shelter_type %in% c("detached_house","apartment_in_apartment_block"))

head(shelter_type)

Output:

## # A tibble: 6 × 2
##   uuid                                 D_1_shelter_type            
##   <chr>                                <chr>                       
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 apartment_in_apartment_block
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 detached_house              
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 apartment_in_apartment_block
## 4 c0b55200-d706-46a8-bf13-7d274f03ac4e apartment_in_apartment_block
## 5 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 apartment_in_apartment_block
## 6 04ffcbcb-b936-4cd0-a24e-67643346674f apartment_in_apartment_block

Here is some documentation on filter()

7.3.3 `arrange()` function

arrange() is used to re-order rows by the values of one or more columns. You only need to provide the name of the column. If it is of character type, the rows are sorted alphabetically. If it is numeric, they are sorted in ascending order by default, or in descending order if you ask for it.

age <- data %>% 
  select(uuid, A_1_respondent_age) %>%
  arrange(A_1_respondent_age)

head(age)

Output:

## # A tibble: 6 × 2
##   uuid                                 A_1_respondent_age
##   <chr>                                <chr>             
## 1 021e6806-d708-4dd0-aae3-99bd10a14ac4 18                
## 2 ce753961-14ff-427c-b0f3-8b0c9b7dc534 18                
## 3 8e8970ac-86b3-48c3-ace5-91f0bbf785ac 18                
## 4 24165246-800a-471e-a2b7-1cfe7ad2506a 18                
## 5 ec942b56-39ce-4feb-8590-eb0b4dc53475 18                
## 6 ef05a1ee-8bb7-4fbe-a3f2-b406032d4e09 18

You can also pass multiple columns to arrange(); rows are sorted by the first column, then by the next ones to break ties. Wrap a column in desc() inside arrange() if you want descending order.

age_desc <- data %>% 
  select(uuid, A_1_respondent_age) %>%
  arrange(desc(A_1_respondent_age))

head(age_desc)

Output:

## # A tibble: 6 × 2
##   uuid                                 A_1_respondent_age
##   <chr>                                <chr>             
## 1 a7ed3939-589c-4700-b64a-6b9ce1ff6161 92                
## 2 c9615d0a-0d24-4376-9b6a-4531eafbfe08 87                
## 3 20a5ecb8-ecc1-4ddc-8de7-c7eb5a67dd07 85                
## 4 0027d58b-b67c-4e7b-87e5-f9da508dd3a7 85                
## 5 075042a7-370f-420b-8fd9-1d76eb4ec261 85                
## 6 ff08380c-6aa7-426a-9777-c735595e5de7 85

Here is some documentation on arrange()
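
One point to watch: in the examples above, A_1_respondent_age is a text column, so arrange() sorts it as text. That gives the expected order only as long as all ages have the same number of digits. The sketch below, on made-up ages, shows the difference and how as.numeric() avoids it:

```r
library(dplyr)

# Made-up ages stored as text, as they would be after reading with col_types = "text"
ages_df <- data.frame(
  uuid = c("a1", "b2", "c3", "d4"),
  age = c("9", "100", "25", "61")
)

ages_df %>% arrange(age) %>% pull(age)               # "100" "25" "61" "9"   (text order)
ages_df %>% arrange(as.numeric(age)) %>% pull(age)   # "9" "25" "61" "100"   (numeric order)

# Descending numeric order, with uuid as a second column to break ties
ages_df %>% arrange(desc(as.numeric(age)), uuid) %>% pull(age)   # "100" "61" "25" "9"
```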

7.3.4 `mutate()` function

This function adds a new column to your data frame or modifies an existing one. Here is where the fun begins, and the sky is the limit. I will give many examples here of what you can use inside a mutate() function.

If I want to fill a new column with a single string:

single_string_col <- data %>% 
  mutate(test_column = "test") %>% 
  select(uuid,test_column)


head(single_string_col)

Output:

## # A tibble: 6 × 2
##   uuid                                 test_column
##   <chr>                                <chr>      
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 test       
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 test       
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 test       
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 test       
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e test       
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 test

We can also do some calculations:

# creating the set of columns with only shelter_issues
shelter_columns <- data %>% 
  select(starts_with("D_7_shelter_issues/"),
         -c("D_7_shelter_issues/none",
            "D_7_shelter_issues/dont_know",
            "D_7_shelter_issues/prefer_not_to_answer")) %>% 
  colnames()

## the actual new data frame
shelter_issues <- data %>% 
  # all_of() tells across() that shelter_columns is a vector of column names
  mutate(new_column_one = rowSums(across(all_of(shelter_columns), .fns = as.numeric), na.rm = TRUE)) %>% 
  select(uuid,new_column_one) %>% 
  arrange(desc(new_column_one))

head(shelter_issues)

Output:

## # A tibble: 6 × 2
##   uuid                                 new_column_one
##   <chr>                                         <dbl>
## 1 2c2d1c15-9eb6-4533-aa89-7f365354d931              7
## 2 a1c5d05a-4464-4a97-b09c-42799e6f5d59              6
## 3 db2602b7-8298-4513-8aad-6d7292330a1c              6
## 4 50bd09f7-f001-4a46-8fab-0689a3547735              6
## 5 65d71b90-26c9-4e82-90d0-6091d2d4eaa9              5
## 6 630f4e79-0382-45a4-ad13-47cd27b580df              5

Here we targeted all shelter issue columns and calculated, for each HH, the total number of shelter issues it is facing. There is a new function called rowSums(), a base R function that adds up the values of multiple columns row by row. across() is a dplyr function that applies to many columns at the same time; its .fns parameter applies a function, here as.numeric(), to each of those columns. starts_with() is a selection helper available through dplyr; it is used inside selecting functions such as select() to target columns that start with a specific pattern.

We can include some conditions with ifelse() or case_when()

# male HoHH or female HoHH
hoHH_genderifelse <- data %>% 
  mutate(gender_head_household = ifelse(A_3_respondent_hohh == "yes" & A_2_respondent_sex == "male", "male_hoHH",
                                        ifelse(A_3_respondent_hohh == "yes" & A_2_respondent_sex == "female", "female_hoHH", "other"))) %>% 
  select(uuid, gender_head_household)

head(hoHH_genderifelse)

Output:

## # A tibble: 6 × 2
##   uuid                                 gender_head_household
##   <chr>                                <chr>                
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female_hoHH          
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female_hoHH          
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male_hoHH            
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female_hoHH          
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female_hoHH          
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 other

As you can see, I nested ifelse() several times to cover every option in my conditions. If there are NAs in the data, it is better to handle them first, with is.na(...) as the first condition returning NA. The same logic can be written more readably with case_when():

# male HoHH or female HoHH
hoHH_gendercasewhen <- data %>% 
  mutate(gender_head_household = case_when(A_3_respondent_hohh == "yes" & A_2_respondent_sex == "male" ~ "male_hoHH",
                                           A_3_respondent_hohh == "yes" & A_2_respondent_sex == "female" ~ "female_hoHH",
                                           TRUE ~ "other")) %>% 
  select(uuid, gender_head_household)

head(hoHH_gendercasewhen)

Output:

## # A tibble: 6 × 2
##   uuid                                 gender_head_household
##   <chr>                                <chr>                
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female_hoHH          
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female_hoHH          
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male_hoHH            
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female_hoHH          
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female_hoHH          
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 other
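
The advice about handling NA first can be sketched as follows, on made-up data with shortened, hypothetical column names (is_hohh, sex). Without the is.na() line, the TRUE catch-all in case_when() would label the missing answer as "other".

```r
library(dplyr)

# Made-up data with a missing answer; column names are shortened for illustration
hh_data <- data.frame(
  is_hohh = c("yes", "yes", "no", NA),
  sex = c("male", "female", "male", "female")
)

hh_labelled <- hh_data %>%
  mutate(
    with_ifelse = ifelse(is.na(is_hohh), NA,
                         ifelse(is_hohh == "yes" & sex == "male", "male_hoHH",
                                ifelse(is_hohh == "yes" & sex == "female", "female_hoHH", "other"))),
    with_case_when = case_when(
      is.na(is_hohh) ~ NA_character_,                      # handle missing values first
      is_hohh == "yes" & sex == "male" ~ "male_hoHH",
      is_hohh == "yes" & sex == "female" ~ "female_hoHH",
      TRUE ~ "other"
    )
  )

hh_labelled
```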

Here is some documentation on mutate()

7.3.5 `summarise()` function

The summarise() function is used to create a summary of a column in the data frame, such as its mean, min, or max.

# average age
age_average <- data %>% 
  summarise(average_age = mean(as.numeric(A_1_respondent_age)))

age_average

Output:

## # A tibble: 1 × 1
##   average_age
##         <dbl>
## 1        50.4

You can also compute several summaries in a single summarise() call:

# age stats
age_stats <- data %>% 
  summarise(average_age = mean(as.numeric(A_1_respondent_age)),
            min_age = min(as.numeric(A_1_respondent_age)),
            max_age = max(as.numeric(A_1_respondent_age)),
            total_num_submission = n())

age_stats

Output:

## # A tibble: 1 × 4
##   average_age min_age max_age total_num_submission
##         <dbl>   <dbl>   <dbl>                <int>
## 1        50.4      18      92                 1000

Here are some of the summary functions you can use inside summarise():

sd(): Standard deviation
min(): Minimum value
max(): Maximum value
median(): Median
mean(): Mean
sum(): Sum
n(): Number of rows in the current group (all rows if the data is not grouped)
first(): First value in the vector
last(): Last value in the vector
n_distinct(): Number of distinct values in vector
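As a minimal sketch of a few of the functions listed above, the example below uses a small made-up data frame (`toy_ages` and its values are assumptions for illustration, not the course data). It also shows the na.rm = TRUE argument, which is needed when a column contains NA; without it, functions such as mean() and median() return NA.

```r
library(dplyr)

# Hypothetical toy data: ages stored as text, with one missing value
toy_ages <- tibble(age = c("25", "40", "40", NA, "61"))

toy_summary <- toy_ages %>%
  mutate(age = as.numeric(age)) %>%
  summarise(
    mean_age     = mean(age, na.rm = TRUE),        # average, ignoring NA
    median_age   = median(age, na.rm = TRUE),      # middle value, ignoring NA
    sd_age       = sd(age, na.rm = TRUE),          # standard deviation, ignoring NA
    first_age    = first(age),                     # first value in the column
    last_age     = last(age),                      # last value in the column
    distinct_age = n_distinct(age, na.rm = TRUE),  # distinct non-missing ages
    rows         = n()                             # number of rows, NA row included
  )

toy_summary
```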

Here is some documentation on summarise()

7.3.6 `group_by()` function

This function is essential if you want to create disaggregations in your data and combine groups/strata. It is almost always followed by summarise(), because the main goal is to split the data into groups and apply a computation to each one.

Let’s say we want to know the number of males and females in our data.

# gender disaggregation
gender <- data %>% 
  group_by(A_2_respondent_sex) %>% 
  summarise(count = n())

gender

Output:

## # A tibble: 2 × 2
##   A_2_respondent_sex count
##   <chr>              <int>
## 1 female               626
## 2 male                 374

We can also compute a group-level value and then remove the grouping with dplyr's ungroup() function. The new column stays in the data and holds, for each row, the value of that row's group. To do this, use mutate() instead of summarise().

# gender age average ungrouped
gender_age_average_ungrouped <- data %>% 
  group_by(A_2_respondent_sex) %>% 
  mutate(age_average = mean(as.numeric(A_1_respondent_age))) %>% 
  ungroup() %>% 
  select(uuid, A_2_respondent_sex, age_average)

head(gender_age_average_ungrouped)

Output:

## # A tibble: 6 × 3
##   uuid                                 A_2_respondent_sex age_average
##   <chr>                                <chr>                    <dbl>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female                    51.8
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female                    51.8
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male                      48.2
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female                    51.8
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female                    51.8
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 female                    51.8
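To make the difference between the two approaches concrete, here is a minimal sketch on a made-up data frame (`toy_hh` and its values are assumptions for illustration only). After group_by(), summarise() returns one row per group, while mutate() keeps every original row and repeats the group value on each of them.

```r
library(dplyr)

# Hypothetical toy data: five respondents
toy_hh <- tibble(
  id  = 1:5,
  sex = c("female", "male", "female", "male", "female"),
  age = c(30, 20, 50, 40, 70)
)

# summarise(): collapses the data to one row per group
per_group <- toy_hh %>%
  group_by(sex) %>%
  summarise(age_average = mean(age))

# mutate(): keeps all five rows and adds the group average to each one
per_row <- toy_hh %>%
  group_by(sex) %>%
  mutate(age_average = mean(age)) %>%
  ungroup()

nrow(per_group)  # 2 rows, one per sex
nrow(per_row)    # 5 rows, same as the input
```

Calling ungroup() at the end is a good habit: it ensures that later steps in the pipe are applied to the whole data frame rather than group by group.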

Here is some documentation on group_by()

7.3.7 `join()` function

We can use dplyr's join functions to merge two data frames on one or more shared key columns. The one you will use most often is left_join().

Here are the different join functions you might encounter as well:

inner_join(): returns only the rows from dataframe_1 that have matching values in dataframe_2, with all the columns from both dataframes. All matches are returned.
left_join(): returns all rows from dataframe_1 and all columns from both dataframes. Rows that are in dataframe_1 but not in dataframe_2 will have NA values in the columns from dataframe_2. All matches are returned.
right_join(): returns all rows from dataframe_2 and all columns from both dataframes. Rows that are in dataframe_2 but not in dataframe_1 will have NA values in the columns from dataframe_1. All matches are returned.
full_join(): returns all rows and columns from both dataframes. Where there is no match, the missing values are filled with NA.
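The four joins described above can be compared on two tiny made-up data frames. The contents of `dataframe_1` and `dataframe_2` below are assumptions for illustration only: uuid "a" exists only in the first, and uuid "d" only in the second.

```r
library(dplyr)

# Hypothetical toy data sharing the key column uuid
dataframe_1 <- tibble(uuid = c("a", "b", "c"), sex  = c("female", "male", "female"))
dataframe_2 <- tibble(uuid = c("b", "c", "d"), days = c(7, 3, 5))

inner <- inner_join(dataframe_1, dataframe_2, by = "uuid")  # only b and c
left  <- left_join(dataframe_1, dataframe_2, by = "uuid")   # a, b, c; days is NA for a
right <- right_join(dataframe_1, dataframe_2, by = "uuid")  # b, c, d; sex is NA for d
full  <- full_join(dataframe_1, dataframe_2, by = "uuid")   # a, b, c, d

# Compare how many rows each join keeps
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

Checking the number of rows before and after a join is a quick way to confirm that it behaved as expected.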

Here is another Excel file with data for all 1000 submissions on the total number of days with piped water disruptions. Below is the button to upload the data.

# join pipe_water_data with original data
# (assumes pipe_water_data has already been read into R from the Excel file)
joined_data <- data %>% 
  left_join(pipe_water_data, by = "uuid") %>% 
  select(uuid, A_2_respondent_sex, F_2_piped_water_disruptions)

head(joined_data)

Output:

## # A tibble: 6 × 3
##   uuid                                 A_2_respondent_sex F_2_piped_water_disr…¹
##   <chr>                                <chr>              <chr>                 
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female             <NA>                  
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female             7                     
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male               7                     
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female             <NA>                  
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female             <NA>                  
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 female             7                     
## # ℹ abbreviated name: ¹F_2_piped_water_disruptions

Here is some documentation on join()

8 Final Push (Exercise/Test)

You have finally reached the last part of this course: the final test to kick off your hopefully long journey with R.

You will be given two data sets from the Poland MSNA: one with the household (HH) level questions and one with the individual (Ind) level questions, which are usually asked through the loop system in ODK. Both data sets include the UUID column as a unique identifier that links the two data sets.

You should create an R script that will include all the answers to the requirements below.

Here is what you need to work on:

Create a new column in the HH dataframe that contains the number of HH members.
Create a new column in the HH dataframe that contains the number of children (below 18) in each HH.
Create a new data frame to count how many HHs have animals and how many don’t (hh_animal column).
Create a new data frame to calculate the average HH monthly rent and utilities payment in both Outside and Inside CCs. (Hint: you must use all of select(), mutate(), rowSums(), group_by(), and summarise().)

Once you have finished your task, please share it with the senior data team in your mission for review.

I hope you enjoyed the journey.

9 Extra Self Learning

If you would like to take your R learning further on your own, swirl is a fantastic package that teaches you interactively, at your own pace, and directly in the R console how to use R and do data research.

Testimony: Md. Mehedi Khan (Data Specialist in HQ) learned R using this package, so you can be next.

install.packages("swirl")

library(swirl)

Once installed, go to the console, type swirl(), and begin your learning journey.
