Run the chunks below by clicking on > button on every chunk(gray boxes with r commands)
It is important to learn some basic syntax of the R programming language before launching in to more sophisticated functions Below is a compilation of some of the basic features of R that will get you going and help you to understand the R language.
Data can be entered directly into R or loaded from external sources. To get a sense of how R handles different data inputs, we will begin with entering data on our own.
First note that, R can be used as a very fancy calculator without creating any “objects” at all.
The default R prompt is the greater-than sign (>)
2*3
## [1] 6
sqrt(144)
## [1] 12
10^2
## [1] 100
log(10)
## [1] 2.302585
If a line is not syntactically complete, a continuation prompt (+) appears
2*
3
## [1] 6
As our analyses increase in complexity, however, we may want to store values from particular operations into “objects,” which we can then call later for additional analysis.
R uses ‘commands’ that are typed into the console window
Commands in R are generally made up of two parts: objects and functions :example
Object <- function
The assignment operator is the left arrow (<-) or equal sign(=) and assigns the value of the object on the right to the object on the left, which roughly means “store 2*4 in object value”
value <- 2 * 4
value
## [1] 8
Names for R objects can be any combination of letters, numbers and periods (.) but they must not start with a number. R is also case sensitive
The type of data we enter can differ, however. We can think about the different data types that R handles in terms of their dimensionality. Single dimension data types will be (atomic) vectors or lists, whereas two-dimensional types will be matrices or dataframes.Within each of these dimensionalities, data may still be of different types.
We have 2 broad range of variables in R; categorical and numeric variables
where categorical variables can further be classified into;
Norminal- categorical variable with two or more levels/ groups that cannot be ranked (character)
Binary- nominal variable with only two levels (logic)
Ordinal-categorical variable with two or more levels/ groups that can be ranked or ordered (factor)
Numeric varibles are also further classified into:
Discreate - whole numerical variable e.g count of people (Integer)
Continous - can take values within any range e.h height (numeric)
Knowing what type of data you are using and how it is organized is critical to using functions and conducting your analysis.It also guides in choosing the correct type of statistical analysis.hence the need to change the types to the format that will help in achieving your goal.
The simplest way to create a vector is through the concatenation function, c. This function binds elements together, whether they are of character form, numeric or logical.
First_name=c("Naomi","Steve","Michael","Humphrey","Betty")
First_name
## [1] "Naomi" "Steve" "Michael" "Humphrey" "Betty"
ID=c(3206756,29200883,33062552,45727229,1272892)
ID
## [1] 3206756 29200883 33062552 45727229 1272892
The rep function replicates elements of vectors
Gender=c("Female",rep("Male",3),"Female")
Gender
## [1] "Female" "Male" "Male" "Male" "Female"
The function data.frame converts a matrix or collection of vectors into a data frame
attendance_list=data.frame(First_name,Gender,ID)
head(attendance_list)
## First_name Gender ID
## 1 Naomi Female 3206756
## 2 Steve Male 29200883
## 3 Michael Male 33062552
## 4 Humphrey Male 45727229
## 5 Betty Female 1272892
# check class of the object
class(attendance_list)
## [1] "data.frame"
List all the objects in your local workspace using ls() List all the files in your working directory using list.files() or dir()
ls()
## [1] "attendance_list" "First_name" "Gender" "ID"
## [5] "value"
dir()
## [1] "kenya-accidents-database.csv" "kenya-accidents-database.xlsx"
## [3] "R_data_manipulation_part1.Rmd"
The functions rm() or remove() are used to remove objects from the working directory
rm(attendance_list)
# Remove everything in the workspace
rm(list=ls())
R can seem complex, but it also provides many built-in help functions, particularly when you are using new packages. To get help in R on a specific function or an object or alternatively an operator, one of the following commands can be issued:
?function
help(function)
or click on the Help menu within R.
***********************************Break out Exercise****************************************
Create a data frame containing information about your family members, names,age, favorite food, sport, status etc
Now in most cases you will have a dataset ready.
You will learn how to easily perform data manipulation using R software. We’ll cover the following data manipulation techniques:
filtering and ordering rows
renaming and adding columns
computing summary statistics
We’ll use mainly the popular dplyr R package , which contains important R functions to carry out easily your data manipulation. In the final section, we’ll show you how to group your data by a grouping variable, and then compute some summary statitistics on each subset. You will also learn how to chain your data manipulation operations.
R works best if you have a dedicated folder for each separate project. This is referred to as the working folder. This makes R sessions more manageable and it avoids objects getting messed up or mistakenly deleted.
To Determine which directory your R session is using as its current working directory use
getwd()
## [1] "/home/kn/Documents/R_data science/Data_manuplation"
setwd("/home/kn/Documents/R_data science/Data_manuplation")
We recommend to install the tidyverse packages, which include the dplyr package (for data manipulation) and additional R packages for easily reading (readr), transforming (tidyr) and visualizing (ggplot2) datasets.
#install.packages("tidyverse")
# Once you have installed the package, you will need to call it in each additional script file where you want to use it.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#R comes with several packages pre-loaded, but you can always check which packages are currently installed.
#library()
Also note that sometimes packages depend on each other. If you load one package, it may also load “dependencies” that allow it to function. By the same token, several packages might have functions using the same name. This means that whichever package you load last will “mask” that function from other packages. R will notify you that it has masked a function so that you know which package’s function is currently in use. If you want to use two functions of the same name in different packages that you’ve loaded, you can do that by using packagename::functionname.
There are 8 fundamental data manipulation verbs that you will use to do most of your data manipulations. These functions are included in the dplyr package:
filter(): Pick rows (observations/samples) based on their values.
distinct(): Remove duplicate rows.
arrange(): Reorder the rows.
select(): Select columns (variables) by their names.
rename(): Rename columns.
mutate() and transmutate(): Add/create new variables.
summarise(): Compute statistical summaries (e.g., computing the mean or the sum)
We shall use Kenya accident data from the humanitarian data exchange platform
accidents_2016 <- read.csv("/home/kn/Documents/R_data science/Data_manuplation/kenya-accidents-database.csv",stringsAsFactors = F)
#To read in other excel format files you will need to use readxl package
accidents_2017 <- readxl::read_xlsx("/home/kn/Documents/R_data science/Data_manuplation/kenya-accidents-database.xlsx",sheet = "2017")
#To read in other data formats such as stata files you will use the foreign package
It is always a good idea to get a sense of the data. Key R commands for doing so include, names(), head(), tail(),View(), edit() and summary()
## add a column that indicate that this is kenya's data
accidents_2017 <- mutate(accidents_2017,Country="Kenya")
## Prac.. generate a variable that indicates that the data is owned by HDX
## generating new variable from an existing one
accidents_2017<- mutate(accidents_2017,Gender1=ifelse(GENDER=="F","Female","Male"))
accidents_2017<- mutate(accidents_2017,location=paste(ROAD,PLACE,sep=","))
## selecting only relevant variables and store them in another object
accidents_2017 <-rename(accidents_2017,Details=`BRIEF ACCIDENT DETAILS`)
accidents_2017 <-rename(accidents_2017,"Date"="Date DD/MM/YYYY")
## Prac..rename date column and Base/ sub base to more intuitive names e.g Date,Base_subBase
## selecting only relevant variables and storing them in another object
acc_summ <-select(accidents_2017,`BASE/SUB BASE`,Details,Month,Hour,`MV INVOLVED`,VICTIM,`CAUSE CODE`)
## use select() to reorder variables
#You can specify a particular ordering of columns with select(). To place the remaining columns in their current ordering, use everything(). You can also rename variables with select.
# prac.. arrange your dataframe to start with Month, time and car involved
## selecting only relevant variables and store them in another object
acc_summ <-arrange(acc_summ,Month)
# To display the data in descending order, surround a sorting variable with desc() to sort
# prac.. sort Victim type in descending order
We can select a particular row and column of data frame x, with the syntax x[rows, columns], where rows and columns are one of the following: a number, a vector of numbers, the name of a variable that stores numbers, or omitted.
Omitting the row value x[, columns] selects all rows, and omitting the column value, x[rows, ], selects all columns. (Using this syntax to extract a row or column will result in a data.frame, not a vector)
The syntax x$colname extracts the column with name colname as a vector.
Here are some operators and functions to help with selection:
==: equality
,>=: greater than, greater than or equal to
!: not
&: AND
|: OR
%in%: matches any of (2 %in% c(1,2,3) = TRUE)
is.na(): equality to NA
## select only relevant variables and store them in another object
acc_summ1 <-filter(acc_summ,Details=="HIT & RUN")
# prac.. filter out if time is UNKNOWN
## removing duplicated records in terms of cause code, MV involved and base
dups<- distinct(acc_summ,`BASE/SUB BASE`,`MV INVOLVED`,`CAUSE CODE`,.keep_all = T)
## check duplicated records
dups<- acc_summ[duplicated(acc_summ),]
## check duplicated records in terms of cause code, MV involved and base
dups <- acc_summ[duplicated(acc_summ[,c('BASE/SUB BASE','MV INVOLVED','CAUSE CODE')]) | duplicated(acc_summ[,c('BASE/SUB BASE','MV INVOLVED','CAUSE CODE')], fromLast=TRUE),]
Some times you might want to chain/combine your operations, we use piping %>%
you start with your main object (me) and pipe (%>%) through this object by filtering, selecting, renaming, … parts of it
## so combining the above commands
acc_summ <- accidents_2017 %>%
mutate(Country = "Kenya") %>%
select(`BASE/SUB BASE`,Details,Month) %>%
rename(Base = `BASE/SUB BASE`) %>%
filter(Details == "HIT & RUN") %>%
arrange(Month)
For many datasets we want to calculate statistics by group, rather than over the whole dataset.
The dplyr functions group_by() and summarise() together greatly simplify group operations.
First, we group our dataset by a grouping variable. Next we use summarise() and statistical or numerical functions to summarize variables by group.
Example of useful functions for summarise():
min(), max(), mean(), sum(), var(), sd() n(): number of observations in the group n_distinct(x): number of distinct values in variable x
### calculate number of accidents across months
summ <- accidents_2017%>%
group_by(Month) %>%
summarise(accidents=n())
## Identfy most frequent accident type
summ <- accidents_2017%>%
group_by(Details) %>%
summarise(accidents=n())
## affected gender by accidents
summ <- accidents_2017%>%
group_by(Month,Gender1) %>%
summarise(accidents=n())
## `summarise()` has grouped output by 'Month'. You can override using the
## `.groups` argument.
# prac.. Accident hotspot counties
# prac.. which months do accidents occur mostly in each county
1: remove all objects in your working except for the initial 2017 data
2: rename column names that have spaces to have underscores instead of spaces
3: Subset the data to have data for Nairobi metropolitan counties only (Nairobi,Kiambu, Machakos, Kajiado,Muranga) and cal it “NMS_data”
4: Remove irrelevant variables from the dataset such as No., name of victim, places
5: Using NMS_data find out which county has experienced the highest number of accidents
6: Using NMS_data find out which month do accidents occurs more rampantly and the counties