1 An Introduction to R
1.1 Course Introduction
Welcome to this training course. During this course, I hope to introduce you to R, an interactive environment for statistical computing. R is not difficult to learn, but as with any new language, the initial learning curve can be a little steep, and you will need to use it frequently or you will forget it.
I have tried to simplify the content of this course as much as possible and tailor it to IMPACT's needs. My aim is to help you climb the initial learning curve and give you the basic skills you need to keep building your experience with this language.
This course is split into intro, basic, and intermediate levels, with exercises and a final project. I encourage you to complete the exercises and to watch out for the details.
Buckle up and Enjoy the Ride!!
1.2 FAQ
1.2.1 What is RStudio?
RStudio is an integrated development environment (IDE) for R which works with the standard version of R available from CRAN. RStudio includes a wide range of productivity enhancing features and runs on all major platforms. There are several related IDE products: RStudio Workbench (previously RStudio Server Pro), RStudio Server Open Source, and RStudio Desktop. RStudio Workbench and RStudio Server enable you to provide a browser-based interface to a server running on a remote Linux system.
1.2.2 What versions of R is RStudio compatible with?
RStudio requires an installation of R 4.2.3 or higher. You can download the most recent version of R for your environment from CRAN.
1.2.3 What is the difference between RStudio Desktop, RStudio Server, and RStudio Workbench?
RStudio Desktop is an IDE that works with the version of R you have installed on your local Windows, Mac OS X, or Linux workstation. Version 1.3 and later of RStudio Desktop can also be used as a client of RStudio Workbench (previously RStudio Server Pro).
RStudio Workbench (previously RStudio Server Pro) and RStudio Server Open Source are Linux server applications that provide a web-browser-based interface to a server running on a remote Linux system. For more on why you might want to deploy an RStudio Server instead of RStudio Desktop, see the server documentation.
See the Posit website for more FAQs.
2 Setup Instruction
2.1 Core Software
During this course, we will be using RStudio. To get your computer ready for this course, please follow the instructions below.
First, install R. For this training, you will need R version 4.2.3 or higher. Download and install R for Windows, Mac, or Linux.
Second, install RStudio. Download and install the free RStudio Desktop version.
These two pieces of software are downloaded and installed separately. R is the statistical computing environment, and RStudio is the IDE that makes working with R easier.
2.2 RStudio Interface
The RStudio interface is composed of quadrants, each of which fulfills a unique purpose:
- The Console window,
- The Source window,
- The Environment / History / Connections / Tutorial window,
- The Files / Plots / Packages / Help / Viewer window
Sometimes only three windows are showing, and you may wonder where the Source window has gone. To use it, you have to create a new file by selecting File -> New File -> R Script.
2.3 Installing first package
As you work with R, you will find that most of the cool features and tools come from third-party packages. They are super easy to install. There are different ways to install packages in R, but the main one is the install.packages() command.
Try installing the ggplot2 package from the console window.
install.packages("ggplot2")
The process should be straightforward. R will automatically install any other packages that ggplot2 needs. Throughout this course, we will focus on a couple of packages that are widely used within IMPACT. Remember that with R, it is to infinity and beyond, and hopefully you will be building your own packages one day.
3 Tips before starting our journey
3.1 RStudio Projects
RStudio projects make it straightforward to divide your work into multiple contexts, each with its own working directory, workspace, history, and source documents. More information on how to create a project is available on the Posit website.
3.2 R Scripts
Try to put all the steps of your work in an R script rather than running everything in the console window. This way you will be able to share the work with others and get the same results every time.
The larger the project, the more complex the work becomes. From experience, try to split your tasks into different scripts so that you are not overwhelmed by thousands of lines of code.
3.3 Good Layouts
Try to be consistent in the layout of your scripts:
- Load libraries first
- Load the data you are working with
- Change or analyse your data
- Output and save your work.
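As a rough illustration, a script following this layout could look like the sketch below. It assumes the built-in mtcars dataset as a stand-in for your own data and a temporary file as a stand-in for your real output path; the object names are made up for the example.

```r
# 1. Load libraries (utils is attached by default; it is shown here as a placeholder)
library(utils)

# 2. Load the data you are working with (mtcars stands in for your own data)
cars_data <- mtcars

# 3. Change or analyse your data
cars_data$km_per_litre <- cars_data$mpg * 0.425144  # miles per US gallon to km per litre
average_consumption <- mean(cars_data$km_per_litre)

# 4. Output and save your work (a temporary file stands in for a real path)
output_path <- tempfile(fileext = ".csv")
write.csv(cars_data, output_path, row.names = FALSE)
```

Keeping the four steps in the same order in every script makes it easier to find where something is loaded, changed, or saved.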
3.4 Write readable code
For many obvious reasons:
- Other people might need to use or see your code
- You might need to use your code in the future
You might write perfectly functional code and still not understand a single line of it later. Below are some examples.
3.4.1 Basic formatting
Use the following simple rules for writing readable code:
- Use spaces between variables and operators
- Break up long lines of code. You can display a margin marking the maximum line length by going to Tools -> Global Options -> Code -> Display -> Show Margin and setting it to 110
- Use meaningful variable names, keeping in mind that R is case sensitive.
See the difference between:
malDiff1=lm(y~grp+grpTim,df,subset=sext1=="m")
and:
male_difference = lm(score ~ group + group_time_interaction,
data = interview_data,
subset = gender == "male")
Both versions do the same thing, but the second is much easier to read 😊.
3.4.2 Be Consistent
Make sure to stay consistent across your whole script, as you and the next person using the script will get used to the format in use and will be able to identify variables and data frames.
3.4.3 The most important: Include Comments
It is the best feeling in the world when I open someone else's code and find comments in it. Comments are used either to explain what you are trying to do or to show how to use the script the right way. Comments in R are written using #.
# Here I am adding two variables
variable1 <- 1 # first variable
variable2 <- 4 # second variable
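total <- variable1 + variable2 # named total rather than sum to avoid confusion with the built-in sum() function
total # print the result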
Output:
## [1] 5
4 Quiz 1 (Intro to R)
If you would like to test your knowledge, please click on the link below and complete the Intro to R questions.
5 Basics of R
There are a couple of things to know about programming fundamentals before jumping in and building your first script in R. R is just another programming language, available to serve you at your command.
We start with the basics: data types.
5.1 Basic Data Types
5.1.1 Numbers
Numbers in R are stored either as numerics (decimal numbers) or as integers (whole numbers, written with the L suffix).
## Numeric variable
x <- 28.5
class(x)
Output:
## [1] "numeric"
## Integer variable
y <- 28L
class(y)
Output:
## [1] "integer"
5.1.2 Logical
A logical is a Boolean value that can be either TRUE or FALSE.
## Logical variable
z <- TRUE
class(z)
Output:
## [1] "logical"
5.1.2.1 or:
## Logical variable
logical <- 2 < 1
class(logical)
Output:
## [1] "logical"
5.1.3 Characters
Anything placed inside double quotes " " or single quotes ' ' is treated as text (a string).
## String variable
my_string <- "I love this training"
class(my_string)
Output:
## [1] "character"
5.2 Operators in R
Operators in R are used to perform operations on variables and values.
They are divided into the following groups:
- Arithmetic operators
- Assignment operators
- Comparison operators
- Logical operators
- Miscellaneous operators
5.2.1 Arithmetic
Mainly used to perform common arithmetic operations.
| Operator | Name | Example |
|---|---|---|
| + | Addition | x + y |
| - | Subtraction | x - y |
| * | Multiplication | x * y |
| / | Division | x / y |
| ^ | Exponent | x ^ y |
| %% | Modulus | x %% y |
| %/% | Integer Division | x %/% y |
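The modulus and integer division operators are probably the least familiar ones in the table. Here is a small sketch with made-up values showing how they relate to each other:

```r
x <- 17
y <- 5

remainder <- x %% y    # what is left after dividing 17 by 5
quotient  <- x %/% y   # how many whole times 5 fits into 17
power     <- y ^ 2     # 5 raised to the power of 2

print(remainder)  # 2
print(quotient)   # 3
print(power)      # 25

# Together, the quotient and the remainder rebuild the original number
print(quotient * y + remainder == x)  # TRUE
```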
5.2.2 Assignment
These operators are used to assign values to variables.
my_var <- "value"
my_var <<- "value"
my_var = "value"
"value" -> my_var
"value" ->> my_var
my_var ## to print my_var
Output:
## [1] "value"
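The difference between <- and <<- only shows up inside functions. The sketch below uses hypothetical function names (add_local, add_global) to illustrate it: <- creates a variable inside the function, while <<- changes a variable that already exists outside of it.

```r
counter <- 0

add_local <- function() {
  counter <- counter + 1   # creates a local copy; the outer counter is untouched
  counter
}

add_global <- function() {
  counter <<- counter + 1  # modifies the counter defined outside the function
  counter
}

local_result <- add_local()
after_local  <- counter    # still 0

global_result <- add_global()
after_global  <- counter   # now 1
```

In everyday scripts, <- is the operator you will use almost all the time; <<- is best kept for the rare cases where you really need to change something outside a function.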
5.2.3 Comparison
Mainly used to compare two values.
| Operator | Name | Example |
|---|---|---|
| == | Equal | x == y |
| != | Not equal | x != y |
| > | Greater than | x > y |
| < | Less than | x < y |
| >= | Greater than or equal to | x >= y |
| <= | Less than or equal to | x <= y |
5.2.4 Logical
Mainly used to combine conditional statements.
| Operator | Description |
|---|---|
| & | Element-wise logical AND. Returns TRUE if both elements are TRUE |
| && | Logical AND for single values. Returns TRUE if both statements are TRUE |
| \| | Element-wise logical OR. Returns TRUE if one of the elements is TRUE |
| \|\| | Logical OR for single values. Returns TRUE if one of the statements is TRUE |
| ! | Logical NOT. Returns FALSE if the statement is TRUE |
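To make the difference between the element-wise and the single-value operators more concrete, here is a short sketch with made-up values:

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

element_wise_and <- a & b   # compares position by position
element_wise_or  <- a | b

print(element_wise_and)  # TRUE FALSE FALSE
print(element_wise_or)   # TRUE TRUE TRUE

# && and || expect a single TRUE/FALSE on each side,
# which is what an if statement needs
age <- 25
is_adult_under_60 <- age >= 18 && age < 60
print(is_adult_under_60)   # TRUE
print(!is_adult_under_60)  # FALSE
```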
5.2.5 Miscellaneous
Mainly used to manipulate data. We will use them a lot in the coming chapters.
| Operator | Description | Example |
|---|---|---|
| : | Create series of numbers in a sequence | x <- 1:10 |
| %in% | Find an element inside a vector (to be explained later) | x %in% y |
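As a quick preview, the sketch below (with made-up values) shows that %in% answers the question "is each value on the left found in the vector on the right?" with one TRUE or FALSE per value:

```r
ids <- 1:10
selected <- c(2, 4, 11)

is_known <- selected %in% ids
print(is_known)  # TRUE TRUE FALSE

# The result can be used to keep only the values that were found
known_values <- selected[selected %in% ids]
print(known_values)  # 2 4
```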
5.3 Data Structures in R
5.3.1 Variables
Variables are objects for storing data values. There is no command to create a variable; it is created the moment a value is assigned to it. We saw examples of assigning a value to a variable in the previous section on operators.
name <- "Abraham"
age <- 30
name
age
Output:
## [1] "Abraham"
## [1] 30
Variables can be concatenated using the paste() function. Separate the variables with commas (,) inside the function; paste() inserts a space between them by default.
name <- "Abraham"
age <- 30
paste(name, "is", age, "years old.")
Output:
## [1] "Abraham is 30 years old."
For numbers, you can use the arithmetic operator (+), which performs an actual addition. If you try to use it between a numeric and a character variable, R will give you an error.
name <- "Abraham"
age <- 30
name + age
Output:
## Error in name + age: non-numeric argument to binary operator
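To join text and numbers, paste() is the right tool because it converts the number to text for you. The sketch below shows the sep argument and the paste0() variant, which joins without any separator:

```r
name <- "Abraham"
age <- 30

with_spaces  <- paste(name, "is", age, "years old.")  # sep defaults to one space
with_dash    <- paste(name, age, sep = "-")           # custom separator
no_separator <- paste0(name, "_", age)                # no separator at all

print(with_spaces)   # "Abraham is 30 years old."
print(with_dash)     # "Abraham-30"
print(no_separator)  # "Abraham_30"

# Arithmetic still works on the numeric variable itself
age_next_year <- age + 1
print(age_next_year)  # 31
```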
5.3.2 Vectors
Vectors are ordered collections of items of the same type. You will use them a lot to hold values that you iterate over or compare with other vectors.
To combine items into a vector, use the c() function and separate the items with commas.
names <- c("Abraham","Maksym","Oleksandr","Karyna")
names
Output:
## [1] "Abraham" "Maksym" "Oleksandr" "Karyna"
You can use the : operator to create a vector containing a sequence of numbers.
numeric_list <- c(1:10)
numeric_list
Output:
## [1] 1 2 3 4 5 6 7 8 9 10
The following are some common operations on vectors.
5.3.2.1 Access Vectors
You can access an item inside a vector by adding a number inside brackets [].
Please note that in R, unlike many other programming languages, indexing starts at 1.
names <- c("Abraham","Maksym","Oleksandr","Karyna")
names[1]
Output:
## [1] "Abraham"
5.3.2.2 Replace items in Vectors
You can assign a new value targeting the item you want to change using the index.
names <- c("Abraham","Maksym","Oleksandr","Karyna")
names[1] <- "Anastasiia"
names
Output:
## [1] "Anastasiia" "Maksym" "Oleksandr" "Karyna"
5.3.2.3 Vector Length
You can check the vector’s length using the length() function
names <- c("Abraham","Maksym","Oleksandr","Karyna")
length(names)
Output:
## [1] 4
5.3.3 Factors
Factors exist to categorize data; the categories inside a factor are called levels in R. To create a factor, use the factor() function.
units <- factor(c("ISU","CEU","Sectors","ISU","ISU","CASH","CEU"))
units
Output:
## [1] ISU CEU Sectors ISU ISU CASH CEU
## Levels: CASH CEU ISU Sectors
You can set the levels yourself by passing the levels parameter to the factor() function.
As with vectors, you can access items, change them, and check the length.
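Here is a small sketch of both points, using the same unit names as above. The object name units_ordered is made up for this example so that it does not overwrite the units factor.

```r
units_ordered <- factor(
  c("ISU", "CEU", "Sectors", "ISU", "ISU", "CASH", "CEU"),
  levels = c("ISU", "CEU", "CASH", "Sectors")   # order chosen by us, not alphabetical
)

print(levels(units_ordered))   # "ISU" "CEU" "CASH" "Sectors"
print(units_ordered[2])        # access the second item
print(length(units_ordered))   # 7
print(table(units_ordered))    # counts per level, in the level order set above

# Change an item; the new value must be one of the existing levels
units_ordered[2] <- "CASH"
```

Setting the levels explicitly is mostly useful when the order of the categories matters, for example in tables and graphs.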
5.3.4 Data frames
Data frames are the most commonly used data structure in R. A data frame is data displayed in the format of a table. Different columns can store different types of data, such as character, numeric, or logical, but all values within a single column must have the same type.
You can use the data.frame() function to create a data frame.
units_df <- data.frame(
units = c("ISU","CEU","CASH","Sectors"),
num_of_employess = c(10,45,20,1),
success = c(T,T,T,T)
)
units_df
Output:
## units num_of_employess success
## 1 ISU 10 TRUE
## 2 CEU 45 TRUE
## 3 CASH 20 TRUE
## 4 Sectors 1 TRUE
You can use the summary() function to summarise your data frame.
summary(units_df)
Output:
## units num_of_employess success
## Length:4 Min. : 1.00 Mode:logical
## Class :character 1st Qu.: 7.75 TRUE:4
## Mode :character Median :15.00
## Mean :19.00
## 3rd Qu.:26.25
## Max. :45.00
5.3.4.1 Access items in df
You can use single brackets [ ], double brackets [[ ]], or $ sign to access a column.
units_df[1]
units_df[["units"]]
units_df$units
Output:
## units
## 1 ISU
## 2 CEU
## 3 CASH
## 4 Sectors
## [1] "ISU" "CEU" "CASH" "Sectors"
## [1] "ISU" "CEU" "CASH" "Sectors"
5.3.4.2 Combine rows to df or df to df
You can use the rbind() function to combine a row to an existing df, or two dfs together.
new_units_df <- data.frame(
units = c("ISU","CEU","Sectors"),
num_of_employess = c(10,45,1),
success = c(T,T,T)
)
combined_df <- rbind(units_df, new_units_df)
combined_df
Output:
## units num_of_employess success
## 1 ISU 10 TRUE
## 2 CEU 45 TRUE
## 3 CASH 20 TRUE
## 4 Sectors 1 TRUE
## 5 ISU 10 TRUE
## 6 CEU 45 TRUE
## 7 Sectors 1 TRUE
5.3.4.3 Number of Rows and Columns
You can use ncol() and nrow() to check the number of columns and rows of a df.
ncol(combined_df)
nrow(combined_df)
Output:
## [1] 3
## [1] 7
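Besides whole columns, base R also lets you pick rows and single cells with the df[row, column] notation. A minimal sketch, rebuilding the units_df data frame from above so that the block runs on its own:

```r
units_df <- data.frame(
  units = c("ISU", "CEU", "CASH", "Sectors"),
  num_of_employess = c(10, 45, 20, 1),
  success = c(TRUE, TRUE, TRUE, TRUE)
)

second_row  <- units_df[2, ]                               # row 2, all columns
single_cell <- units_df[3, "num_of_employess"]             # row 3, one column
large_units <- units_df[units_df$num_of_employess > 15, ]  # rows matching a condition

print(single_cell)        # 20
print(large_units$units)  # "CEU" "CASH"
print(dim(units_df))      # 4 3 (rows, then columns)
```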
5.4 Conditions and statements
We saw in the Operators section some of the conditions that can be used to compare values. R also supports conditional statements such as if statements.
5.4.1 If Statement
The “if statement” uses the if keyword and has the following syntax.
num_one <- 20
num_two <- 30
if (num_one < num_two) {
print("num_two is greater than num_one")
}
Output:
## [1] "num_two is greater than num_one"
5.4.2 Else If
This statement uses the else if keywords: if the previous condition is not true, R checks this condition instead. You can use as many else if branches as you want.
num_one <- 20
num_two <- 20
if (num_one < num_two) {
print("num_two is greater than num_one")
} else if (num_two == num_one){
print("num_one and num_two are equal")
}
Output:
## [1] "num_one and num_two are equal"
5.4.3 If Else
else is a keyword used to catch anything that does not meet any of the previous conditions.
num_one <- 30
num_two <- 20
if (num_one < num_two) {
print("num_two is greater than num_one")
} else if (num_two == num_one){
print("num_one and num_two are equal")
} else {
print("num_one is greater than num_two")
}
Output:
## [1] "num_one is greater than num_two"
5.4.4 Use of AND or OR in IF
Recall the logical operators from the Operators section. They can be used to combine two conditions.
num_one <- 30
num_two <- 20
num_three <- 100
if (num_one > num_two & num_three > num_two) {
print("Both conditions TRUE.")
}
Output:
## [1] "Both conditions TRUE."
num_one <- 30
num_two <- 20
num_three <- 100
if (num_one > num_two | num_two > num_three) {
print("At least one condition is TRUE.")
}
Output:
## [1] "At least one condition is TRUE."
5.4.5 For Loop
for loops are used to iterate over a sequence, vector, list, or rows/columns of data frames.
for(i in 1:10){
print(i)
}
Output:
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Another example with the units factor from before:
for(i in units){
print(i)
}
Output:
## [1] "ISU"
## [1] "CEU"
## [1] "Sectors"
## [1] "ISU"
## [1] "ISU"
## [1] "CASH"
## [1] "CEU"
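Loops and conditions are often combined. The sketch below walks through the rows of a small, made-up data frame and builds one message per row; seq_len(nrow(...)) produces the row numbers 1, 2, 3 to iterate over.

```r
units_df <- data.frame(
  units = c("ISU", "CEU", "CASH"),
  num_of_employess = c(10, 45, 20)
)

messages <- character(0)
for (i in seq_len(nrow(units_df))) {
  # decide on a label for the current row
  if (units_df$num_of_employess[i] > 15) {
    size <- "large"
  } else {
    size <- "small"
  }
  messages[i] <- paste(units_df$units[i], "is a", size, "unit")
}

print(messages)
```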
6 Quiz 2 (Basics of R)
If you would like to test your knowledge, please click on the link below and complete the Basics of R questions.
7 Intermediate level
Congratulations!! You made it to the next level.
In this level, we will dive deeper into specific packages that are widely used within IMPACT for data manipulation and data wrangling in R. Please make sure that you have completed the basic level before diving deeper. Almost all of the concepts covered there are crucial for this part.
7.0.1 Data to be used
In this course, I will be using a part of the 2022 MSNA Ukraine data in the examples below. You will find the dataset here to download so you can work at the same time and practice.
7.1 R Packages
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. As of March 2023, there were over 19,000 packages available on the Comprehensive R Archive Network, or CRAN, the public clearing house for R packages. This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.
The goal of this training is to get you acquainted with a couple of packages and prepare you to tackle the main data manipulation tasks needed in your everyday work with IMPACT.
- You can install the packages from CRAN with install.packages("x")
- You can use them in R with library("x") or library(x).
The main packages that you might come across while you start working in R regularly are:
- dplyr: A Grammar for Data Manipulation
- tidyr: For tidying messy data
- srvyr: For analyzing survey data
- ggplot2: For making graphs
- leaflet: For creating maps
- And many others
In this course, we will focus mainly on the dplyr package, using an example dataset from the 2022 MSNA in Ukraine.
7.2 dplyr
7.2.1 What is dplyr?
dplyr is a powerful R package to manipulate, clean, and summarize data. It makes data manipulation very fast. It comprises many functions that perform the most common data manipulation operations, such as filtering, selecting specific columns, sorting data, adding or deleting columns, and aggregating data.
To install and load the dplyr package, run the following:
install.packages("dplyr")
library(dplyr)
7.2.2 Most important functions
The most commonly used functions in the dplyr package are:
- select(): picks columns based on their names or type
- filter(): picks rows based on their values
- group_by(): groups the rows together depending on conditions
- summarise(): summarizes or aggregates the data
- arrange(): sorts and orders the rows by the values of a column
- mutate(): creates new variables by mutating existing ones
- join functions such as left_join(): join data frames together
- rename(): renames columns in your dataset
7.2.3 Pipes
You will most likely come across this symbol %>%. It is used to emphasise the sequence of actions and chain together different actions that are performed together.
For example:
- We will filter our data to keep only ages above 18
- then, we want to mutate the sum of income in last month and this month
- then, we want to group_by the hromada together
- finally, summarise the average income per hromada.
Here is some documentation on %>%
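The four steps from the example above can be sketched as one pipeline. The data frame and its column names (hromada, age, income_last_month, income_this_month) are made up for this illustration and do not come from the MSNA dataset.

```r
library(dplyr)

# Hypothetical toy data; the column names are assumptions for this sketch
survey <- data.frame(
  hromada = c("A", "A", "B", "B", "B"),
  age = c(17, 30, 45, 22, 60),
  income_last_month = c(0, 100, 200, 150, 50),
  income_this_month = c(0, 120, 180, 150, 70)
)

income_per_hromada <- survey %>%
  filter(age > 18) %>%                                            # keep ages above 18
  mutate(total_income = income_last_month + income_this_month) %>% # sum both months
  group_by(hromada) %>%                                           # one group per hromada
  summarise(average_income = mean(total_income))                  # average per group

print(income_per_hromada)
```

Reading the pipe as "then" helps: take the survey, then filter, then mutate, then group, then summarise.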
7.3 Jump to action
Before jumping into action, it is good to learn how to import your own data into R, for example the data you just downloaded above. R can import many kinds of data: XLSX, CSV, JSON, geodata, etc. Today, we will learn how to import an Excel file.
You will need the readxl package. You should know by now how to install a new package and load it in a script. Take a moment to do so.
Then, you can use the read_excel() function and store the result in an object called data, which is the name used in the examples below:
data <- readxl::read_excel(path = "data/your_data.xlsx", col_types = "text", na = c("NA",""))
The path parameter is where your data file is located.
The col_types parameter is a crucial one. We should always try to read all columns as the character data type and convert them to numeric later where needed; this avoids issues that can arise when column types are guessed.
The na parameter controls which strings are read as missing values (NA). By default, only empty strings "" are treated as NA, but you can make other strings, such as "NA", count as NA too.
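Once everything has been read as text, the columns that hold numbers have to be converted before you can calculate with them. A minimal sketch, assuming a small made-up data frame that stands in for a sheet read with col_types = "text" (the column names are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for data read with col_types = "text"
raw_data <- data.frame(
  uuid = c("id-1", "id-2", "id-3"),
  respondent_age = c("34", "61", NA),
  hh_size = c("4", "2", "5")
)

# Convert only the columns that should be numeric; uuid stays as text
clean_data <- raw_data %>%
  mutate(across(c(respondent_age, hh_size), as.numeric))

str(clean_data)
```

Missing values stay missing after the conversion, so the NA in respondent_age is still NA in clean_data.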
7.3.1 select() function
Let’s say we would like to select only the uuid and A_2_respondent_sex columns.
age_respondent_df <- data %>%
select(uuid,A_2_respondent_sex)
head(age_respondent_df)
Output:
## # A tibble: 6 × 2
## uuid A_2_respondent_sex
## <chr> <chr>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 female
Let’s break down the code above a bit. We started by assigning the result to a new data frame called age_respondent_df. If you start with data <- data %>% instead, you will overwrite the original data with the result.
We also used a new function called head(), which shows only the first few rows of your data.
You can perform many different actions with the select() function.
To select all columns except specific ones, you can use the - (subtraction) operator.
df_without_uuid <- data %>%
select(-uuid)
To select the first five columns, you can use the : colon operator with indexes or the names of the columns.
first_five_col_df <- data %>%
select(1:5)
To select all columns that start with a specific character string like "A_", you can use the function starts_with().
all_A_columns_df <- data %>%
select(starts_with("A_"))
Here are some additional options to select columns based on specific criteria:
- ends_with(): Select columns that end with a specific character string
- contains(): Select columns that contain a specific character string
- matches(): Select columns that match a specific character string or a REGEX
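The sketch below tries these helpers on a tiny, made-up data frame whose column names only imitate the style of the MSNA dataset:

```r
library(dplyr)

# Hypothetical toy data with MSNA-style column names
toy <- data.frame(
  uuid = c("id-1", "id-2"),
  A_1_respondent_age = c("34", "61"),
  A_2_respondent_sex = c("female", "male"),
  D_1_shelter_type = c("detached_house", "apartment")
)

age_cols        <- names(select(toy, ends_with("_age")))
respondent_cols <- names(select(toy, contains("respondent")))
first_questions <- names(select(toy, matches("^[A-Z]_1_")))  # REGEX: a letter, then "_1_"

print(age_cols)         # "A_1_respondent_age"
print(respondent_cols)  # "A_1_respondent_age" "A_2_respondent_sex"
print(first_questions)  # "A_1_respondent_age" "D_1_shelter_type"
```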
Here is some documentation on select()
7.3.2 filter() function
filter() is used to select specific rows from a dataset. Let’s say we would like to see only the respondents who are above 60 years old.
elderly_df <- data %>%
select(uuid,A_1_respondent_age) %>%
filter(as.numeric(A_1_respondent_age) > 60)
head(elderly_df)
Output:
## # A tibble: 6 × 2
## uuid A_1_respondent_age
## <chr> <chr>
## 1 5656517f-c389-4d20-8f6d-c0f02b588257 65
## 2 571f52a6-6370-46d3-b1a7-f27ca5658d0d 68
## 3 d42104cc-8d0b-4024-b230-45ece957c687 61
## 4 b8aec8d4-fb84-47d6-9104-16d8f6a9c797 63
## 5 00bb7e1a-7ff2-4e80-9077-d24f7b1fdb06 61
## 6 c80714a5-e964-488c-ac68-ddf8428b6771 64
You can see that we also used a new function called as.numeric(). If you recall, when we imported the data into R, we read all columns as characters. That is why we convert the column to numeric here, so that we can run mathematical comparisons on it.
You can use all the logical operators here for combining conditions.
elderly_male_df <- data %>%
select(uuid,A_1_respondent_age, A_2_respondent_sex) %>%
filter(as.numeric(A_1_respondent_age) > 60 & A_2_respondent_sex =="male")
head(elderly_male_df)
Output:
## # A tibble: 6 × 3
## uuid A_1_respondent_age A_2_respondent_sex
## <chr> <chr> <chr>
## 1 5656517f-c389-4d20-8f6d-c0f02b588257 65 male
## 2 d42104cc-8d0b-4024-b230-45ece957c687 61 male
## 3 00bb7e1a-7ff2-4e80-9077-d24f7b1fdb06 61 male
## 4 def21a5c-e1d8-4487-9b83-427a8c261bbd 82 male
## 5 80f9b3dd-bd1d-4e03-9119-04151d1edb06 70 male
## 6 6b1b34bf-b537-4ba4-a066-40688780c6b3 71 male
You can use the %in% operator to filter for a specific group of values in a column.
shelter_type <- data %>%
select(uuid,D_1_shelter_type) %>%
filter(D_1_shelter_type %in% c("detached_house","apartment_in_apartment_block"))
head(shelter_type)
Output:
## # A tibble: 6 × 2
## uuid D_1_shelter_type
## <chr> <chr>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 apartment_in_apartment_block
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 detached_house
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 apartment_in_apartment_block
## 4 c0b55200-d706-46a8-bf13-7d274f03ac4e apartment_in_apartment_block
## 5 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 apartment_in_apartment_block
## 6 04ffcbcb-b936-4cd0-a24e-67643346674f apartment_in_apartment_block
Here is some documentation on filter()
7.3.3 arrange() function
arrange() is used to re-order rows by the values of one or more columns. You only need to provide the name of the column. A character column is sorted alphabetically, and a numeric column is sorted in ascending order unless you ask for descending order. Because we imported every column as text, the age column is converted with as.numeric() inside arrange() so that it is sorted as a number.
age <- data %>%
select(uuid, A_1_respondent_age) %>%
arrange(as.numeric(A_1_respondent_age))
head(age)
Output:
## # A tibble: 6 × 2
## uuid A_1_respondent_age
## <chr> <chr>
## 1 021e6806-d708-4dd0-aae3-99bd10a14ac4 18
## 2 ce753961-14ff-427c-b0f3-8b0c9b7dc534 18
## 3 8e8970ac-86b3-48c3-ace5-91f0bbf785ac 18
## 4 24165246-800a-471e-a2b7-1cfe7ad2506a 18
## 5 ec942b56-39ce-4feb-8590-eb0b4dc53475 18
## 6 ef05a1ee-8bb7-4fbe-a3f2-b406032d4e09 18
You can also pass multiple columns to arrange(); rows are sorted by the first column, and the next columns are used to break ties. Wrap a column in desc() inside arrange() if you are aiming for descending order.
age_desc <- data %>%
select(uuid, A_1_respondent_age) %>%
arrange(desc(as.numeric(A_1_respondent_age)))
head(age_desc)
Output:
## # A tibble: 6 × 2
## uuid A_1_respondent_age
## <chr> <chr>
## 1 a7ed3939-589c-4700-b64a-6b9ce1ff6161 92
## 2 c9615d0a-0d24-4376-9b6a-4531eafbfe08 87
## 3 20a5ecb8-ecc1-4ddc-8de7-c7eb5a67dd07 85
## 4 0027d58b-b67c-4e7b-87e5-f9da508dd3a7 85
## 5 075042a7-370f-420b-8fd9-1d76eb4ec261 85
## 6 ff08380c-6aa7-426a-9777-c735595e5de7 85
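Here is a small sketch of sorting by more than one column, and of why numbers stored as text should be converted before sorting. The data frame and its column names are made up for the illustration.

```r
library(dplyr)

# Hypothetical toy data; age is stored as text, as in an all-text import
toy <- data.frame(
  uuid = c("id-1", "id-2", "id-3", "id-4"),
  sex = c("male", "female", "female", "male"),
  age = c("9", "61", "34", "100")
)

# Sorting the text column compares character by character, so "100" comes first
text_sorted <- toy %>%
  arrange(age)

# Convert to numeric, then sort by sex and, within each sex, by age descending
sorted <- toy %>%
  mutate(age = as.numeric(age)) %>%
  arrange(sex, desc(age))

print(text_sorted$age)  # "100" "34" "61" "9"
print(sorted)
```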
Here is some documentation on arrange()
7.3.4 mutate() function
This function adds new columns to your data frame or modifies existing ones. Here is where the fun begins and the sky is the limit. I will give several examples of what you can do inside a mutate() function.
If I want to fill a new column with a single string:
single_string_col <- data %>%
mutate(test_column = "test") %>%
select(uuid,test_column)
head(single_string_col)
Output:
## # A tibble: 6 × 2
## uuid test_column
## <chr> <chr>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 test
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 test
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 test
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 test
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e test
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 test
We can also do some calculations:
# collect the names of the shelter issue columns, excluding the non-issue options
shelter_columns <- data %>%
select(starts_with("D_7_shelter_issues/"),-c("D_7_shelter_issues/none","D_7_shelter_issues/dont_know","D_7_shelter_issues/prefer_not_to_answer")) %>%
colnames()
## the actual new data frame
## all_of() tells across() that shelter_columns is a vector of column names
shelter_issues <- data %>%
mutate(new_column_one = rowSums(across(all_of(shelter_columns), .fns = as.numeric), na.rm = TRUE)) %>%
select(uuid,new_column_one) %>%
arrange(desc(new_column_one))
head(shelter_issues)
Output:
## # A tibble: 6 × 2
## uuid new_column_one
## <chr> <dbl>
## 1 2c2d1c15-9eb6-4533-aa89-7f365354d931 7
## 2 a1c5d05a-4464-4a97-b09c-42799e6f5d59 6
## 3 db2602b7-8298-4513-8aad-6d7292330a1c 6
## 4 50bd09f7-f001-4a46-8fab-0689a3547735 6
## 5 65d71b90-26c9-4e82-90d0-6091d2d4eaa9 5
## 6 630f4e79-0382-45a4-ad13-47cd27b580df 5
Here we targeted all shelter issues and calculated the number of shelter issues each HH is facing. There is a new function called rowSums(), a base R function that adds up the values of multiple columns row by row. across() is a dplyr function that targets many columns at the same time; its .fns parameter applies a function, here as.numeric(), to all of those columns.
starts_with() is a selection helper available through dplyr. It is used inside select(), and inside other functions that select columns such as across(), to target columns whose names start with a specific pattern.
We can include some conditions with ifelse() or case_when():
# male HoHH or female HoHH
hoHH_genderifelse <- data %>%
mutate(gender_head_household = ifelse(A_3_respondent_hohh == "yes" & A_2_respondent_sex == "male", "male_hoHH",
ifelse(A_3_respondent_hohh == "yes" & A_2_respondent_sex == "female", "female_hoHH", "other"))) %>%
select(uuid, gender_head_household)
head(hoHH_genderifelse)
Output:
## # A tibble: 6 × 2
## uuid gender_head_household
## <chr> <chr>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female_hoHH
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female_hoHH
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male_hoHH
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female_hoHH
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female_hoHH
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 other
As you can see, I nested ifelse() to cover every option in my conditions. If there are NA values in the data, it is better to handle them explicitly by making the first condition ifelse(is.na(...), NA, ...). The same logic can be written more readably with case_when():
# male HoHH or female HoHH
hoHH_gendercasewhen <- data %>%
mutate(gender_head_household = case_when(A_3_respondent_hohh == "yes" & A_2_respondent_sex == "male" ~ "male_hoHH",
A_3_respondent_hohh == "yes" & A_2_respondent_sex == "female" ~ "female_hoHH",
TRUE ~ "other")) %>%
select(uuid, gender_head_household)
head(hoHH_gendercasewhen)
Output:
## # A tibble: 6 × 2
## uuid gender_head_household
## <chr> <chr>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female_hoHH
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female_hoHH
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male_hoHH
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female_hoHH
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female_hoHH
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 other
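Missing values deserve extra care with case_when(): a condition that evaluates to NA does not match, so the row falls through to the final TRUE ~ "other" branch. The sketch below, on a made-up data frame with hypothetical column names (hohh, sex), handles NA first so that it stays NA:

```r
library(dplyr)

# Hypothetical toy data with a missing answer in the third row
toy <- data.frame(
  uuid = c("id-1", "id-2", "id-3", "id-4"),
  hohh = c("yes", "yes", NA, "no"),
  sex = c("male", "female", "male", "female")
)

result <- toy %>%
  mutate(
    gender_hohh = case_when(
      is.na(hohh) | is.na(sex) ~ NA_character_,        # keep missing answers missing
      hohh == "yes" & sex == "male" ~ "male_hoHH",
      hohh == "yes" & sex == "female" ~ "female_hoHH",
      TRUE ~ "other"                                   # everything else
    )
  )

print(result)
```

The conditions are checked from top to bottom and the first one that matches wins, which is why the is.na() check comes first.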
Here is some documentation on mutate()
7.3.5 summarise() function
summarise() is used to create a summary of a column in the data frame, such as the mean, minimum, or maximum.
# average age
age_average <- data %>%
summarise(average_age = mean(as.numeric(A_1_respondent_age)))
age_average
Output:
## # A tibble: 1 × 1
## average_age
## <dbl>
## 1 50.4
You can also compute multiple summaries at once:
# age stats
age_stats <- data %>%
summarise(average_age = mean(as.numeric(A_1_respondent_age)),
min_age = min(as.numeric(A_1_respondent_age)),
max_age = max(as.numeric(A_1_respondent_age)),
total_num_submission = n())
age_stats
Output:
## # A tibble: 1 × 4
## average_age min_age max_age total_num_submission
## <dbl> <dbl> <dbl> <int>
## 1 50.4 18 92 1000
Here are the different summary statistics you can perform:
- sd(): Standard deviation
- min(): Minimum value
- max(): Maximum value
- median(): Median
- mean(): Mean
- sum(): Sum
- n(): Length of the vector (count of all rows)
- first(): First value in the vector
- last(): Last value in the vector
- n_distinct(): Number of distinct values in vector
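A few of these functions are sketched below on a small, made-up data frame. Note the na.rm = TRUE argument: without it, median() and sd() return NA as soon as a single value is missing.

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- data.frame(
  sex = c("female", "male", "female", "male", "female"),
  age = c(20, 35, NA, 50, 41)
)

age_summary <- toy %>%
  summarise(
    median_age = median(age, na.rm = TRUE),  # ignore the missing value
    sd_age = sd(age, na.rm = TRUE),
    missing_ages = sum(is.na(age)),          # how many ages are missing
    distinct_sexes = n_distinct(sex),
    submissions = n()
  )

print(age_summary)
```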
Here is some documentation on summarise()
7.3.6 group_by() function
This function is very important if you want to create disaggregations in your data and combine groups/strata. It is almost always followed by summarise(), as the main goal is to split the data into groups and apply a computation to each.
Let’s say we want to know the number of males vs. females in our data.
# gender disaggregation
gender <- data %>%
group_by(A_2_respondent_sex) %>%
summarise(count = n())
gender
Output:
## # A tibble: 2 × 2
## A_2_respondent_sex count
## <chr> <int>
## 1 female 626
## 2 male 374
We can also calculate a value per group and keep every row: use mutate() instead of summarise(), then call the dplyr function ungroup(). The new column stays in the data, holding the value of each row's group.
# gender age average ungrouped
gender_age_average_ungrouped <- data %>%
group_by(A_2_respondent_sex) %>%
mutate(age_average = mean(as.numeric(A_1_respondent_age))) %>%
ungroup() %>%
select(uuid, A_2_respondent_sex, age_average)
head(gender_age_average_ungrouped)
Output:
## # A tibble: 6 × 3
## uuid A_2_respondent_sex age_average
## <chr> <chr> <dbl>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female 51.8
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female 51.8
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male 48.2
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female 51.8
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female 51.8
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 female 51.8
Here is some documentation on group_by()
7.3.7 join() function
We can use the join functions to merge multiple data frames together. The one you will come across most often is left_join().
Here are the different join functions you might encounter as well:
- inner_join(): return all rows from dataframe_1 where there are matching values in dataframe_2, and all the columns in both dataframes. All matches are returned.
- left_join(): return all rows from dataframe_1 and all columns from both dataframes. Rows that are in dataframe_1 and not in dataframe_2 will include NA values in the columns from dataframe_2. All matches are returned.
- right_join(): return all rows from dataframe_2 and all columns from both dataframes. Rows that are in dataframe_2 and not in dataframe_1 will include NA values in the columns from dataframe_1. All matches are returned.
- full_join(): return all rows and columns from both dataframes. If there is no match, NA is filled in for the missing values.
Here is another Excel file with data for all 1000 submissions about the total number of days with piped water disruptions. You can get the data using the button below.
# join pipe_water_data with original data
joined_data <- data %>%
left_join(pipe_water_data, by=c("uuid")) %>%
select(uuid, A_2_respondent_sex, F_2_piped_water_disruptions)
head(joined_data)
Output:
## # A tibble: 6 × 3
## uuid A_2_respondent_sex F_2_piped_water_disr…¹
## <chr> <chr> <chr>
## 1 af204813-5e1f-4040-bc71-6b0061bf4773 female <NA>
## 2 6a5fe753-984c-42ec-8fbe-28d79f6d0ca3 female 7
## 3 acc9eac0-bf27-4216-b6d3-1c48919f5999 male 7
## 4 e516a220-ba49-4736-ab1f-bd409459d1a0 female <NA>
## 5 c0b55200-d706-46a8-bf13-7d274f03ac4e female <NA>
## 6 9c61d1f0-edc2-4558-ae4d-e56321ca58b2 female 7
## # ℹ abbreviated name: ¹F_2_piped_water_disruptions
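The difference between the join types is easiest to see on two tiny data frames that only partly overlap. The data frames and column names below are made up for this sketch:

```r
library(dplyr)

# Hypothetical toy data: id-1 only exists in households, id-4 only in water
households <- data.frame(
  uuid = c("id-1", "id-2", "id-3"),
  sex = c("female", "male", "female")
)
water <- data.frame(
  uuid = c("id-2", "id-3", "id-4"),
  disruption_days = c(7, 3, 1)
)

inner <- inner_join(households, water, by = "uuid")  # only uuids found in both
left  <- left_join(households, water, by = "uuid")   # every household, NA where no match
full  <- full_join(households, water, by = "uuid")   # every uuid from either side

print(nrow(inner))  # 2
print(nrow(left))   # 3
print(nrow(full))   # 4
```

Checking the number of rows before and after a join is a quick way to confirm that it did what you expected.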
Here is some documentation on join()
8 Final Push (Exercise/Test)
You have finally reached the last part of this course: the final test to kick off your hopefully long journey with R.
You will be given two data sets from the Poland MSNA: one contains the Household (HH) level questions and the other the Individual (Ind) level questions, which are usually asked via the loop system in ODK. Both data sets include the UUID column as a unique identifier that links the two data sets.
You should create an R script that will include all the answers to the requirements below.
Here is what you need to work on:
- Create a new column in the HH dataframe that includes the number of HH members.
- Create a new column in the HH dataframe that includes the number of children (below 18) in each HH.
- Create a new data frame to count how many HHs have animals and how many don’t (hh_animal column).
- Create a new data frame to calculate the average HH monthly rent and utilities payment both outside and inside CCs. (Hint: you must use all of select(), mutate(), rowSums(), group_by(), and summarise().)
Once you have finished your task, please share it with the senior data team in your mission for review.
I hope you enjoyed the journey.
9 Extra Self Learning
If you are interested in taking your self-learning of R to another level, swirl is a fantastic package that teaches you how to use R and do data research interactively, at your own pace, and directly in the R console.
Testimony: Md. Mehedi Khan (Data Specialist in HQ) learned R using this package, so you can be next.
install.packages("swirl")
library(swirl)
Once installed, go to the console, type swirl(), and begin your learning journey.