A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.
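As a minimal sketch of that take-in, process, return pattern, the snippet below calls the built-in mean() and then defines a small function of its own. The name rescale01 and the sample values are illustrative assumptions, not part of the original material.

```r
# Built-in function: takes a numeric vector and returns a single value
scores <- c(2, 4, 6, 8)
mean(scores)
## [1] 5

# User-defined function (hypothetical name): rescale values to the 0-1 range
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
rescaled <- rescale01(scores)
rescaled
## [1] 0.0000000 0.3333333 0.6666667 1.0000000
```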
The character class is your typical string: a sequence of characters (letters, digits, spaces or symbols) enclosed in quotes.
# Assign the text "Hello R!" to the variable myString
myString <- "Hello R!"
# Print the myString variable
print(myString)
## [1] "Hello R!"
# Check the type/class of the myString variable
class(myString)
## [1] "character"
The numeric class corresponds to “float” or “double” in other languages and holds real numbers such as 10, 15.6 and -48792.5498982749879.
# Assign the number 10.4343 to the variable myNum
myNum <- 10.4343
# Print the myNum variable
print(myNum)
## [1] 10.4343
# Check the type/class of the myNum variable
class(myNum)
## [1] "numeric"
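# Dividing a non-zero number by zero gives Inf (infinity)
myInf <- 1/0
print(myInf)
## [1] Inf
# Dividing zero by zero gives NaN (not a number)
myNaN <- 0/0
myNaN
## [1] NaN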
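Integers are whole numbers. However, R stores any number typed without the L suffix as a numeric (double), even if it has no decimal part. To get an integer, convert the value with as.integer() or append L to the number.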
myInt <- 209173987
class(myInt)
## [1] "numeric"
myInt <- as.integer(myInt)
class(myInt)
## [1] "integer"
myInt <- 209173987L
class(myInt)
## [1] "integer"
Logical types (booleans) work as in most other languages and take one of two values: true or false. True is written TRUE or T, and false is written FALSE or F. Prefer the full forms, because T and F are ordinary variables that can be overwritten.
Logical_1 <- TRUE
class(Logical_1)
## [1] "logical"
Logical_2 <- 2 < 1
class(Logical_2)
## [1] "logical"
A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It’s just a collection of values.
myVector <- c(1, 50, 9, 42)
myVector
## [1] 1 50 9 42
Exercise
hint: You can use the c() function to create the vectors and also to combine them. (cat() only prints its arguments; it does not create a vector.)
A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level).
Strlevels <- c("low", "high", "medium", "high", "low", "medium", "high")
Factorlevels <- factor(Strlevels)
Factorlevels
## [1] low    high   medium high   low    medium high
## Levels: high low medium
A matrix in R is a collection of vectors of same length and identical datatype. Vectors can be combined as columns in the matrix or by row, to create a 2-dimensional structure.
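The column-wise and row-wise combinations described above can be sketched with cbind() and rbind(), plus matrix() for building a matrix from a single vector. The vector names and values are made up for illustration.

```r
# Two numeric vectors of the same length (illustrative values)
heights <- c(150, 160, 170)
weights <- c(55, 65, 75)

# Combine them as columns ...
byColumn <- cbind(heights, weights)
byColumn
##      heights weights
## [1,]     150      55
## [2,]     160      65
## [3,]     170      75

# ... or as rows
byRow <- rbind(heights, weights)
dim(byRow)
## [1] 2 3

# matrix() fills a single vector column by column unless byrow = TRUE
myMatrix <- matrix(1:6, nrow = 2)
myMatrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```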
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.
We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function.
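A small sketch of building a data frame from vectors follows. The column names and values are invented for the example.

```r
# Vectors of equal length become the columns (names and values are made up)
name <- c("Ana", "Ben", "Cleo")
age <- c(34, 28, 45)
smoker <- c(FALSE, TRUE, FALSE)

# Each vector becomes one column, named after the vector
my_df <- data.frame(name, age, smoker)
my_df
##   name age smoker
## 1  Ana  34  FALSE
## 2  Ben  28   TRUE
## 3 Cleo  45  FALSE
```

Unlike a matrix, each column of a data frame can hold a different data type.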
Exercise
hint: You can use the data.frame() function to create a data frame.
Importing data in R
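# read.csv() is part of base R
#df <- read.csv("myproject/subfolder/fileName.csv")
# read_excel() comes from the readxl package, which must be loaded first
#df <- read_excel("myproject/subfolder/fileName.xlsx")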
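To try an import without a file of your own, the sketch below writes a tiny made-up table to a temporary CSV file and reads it back with read.csv(). The object names csvPath and df, and the data, are assumptions for illustration only.

```r
# Write a tiny example table to a temporary CSV file (illustrative data)
csvPath <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(9.5, 7.2, 8.8)),
          csvPath, row.names = FALSE)

# Read it back; read.csv() returns a data frame
df <- read.csv(csvPath)
class(df)
## [1] "data.frame"
dim(df)
## [1] 3 2
```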
Inspecting your data
All data structures
Vector and factor variables:
Dataframe and matrix variables:
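The function lists for these headings are not shown here, so the selection below is an assumption: a sketch of base R functions commonly used to inspect each kind of object, run on small made-up data.

```r
# Hypothetical objects to inspect
myVector <- c(1, 50, 9, 42)
myFactor <- factor(c("low", "high", "medium", "high"))
my_df <- data.frame(id = 1:4, level = myFactor)

# All data structures
class(my_df)       # kind of object
str(myVector)      # compact description of the structure
summary(myVector)  # summary statistics

# Vector and factor variables
length(myVector)   # number of elements
levels(myFactor)   # categories of a factor

# Data frame and matrix variables
dim(my_df)         # number of rows and columns
nrow(my_df)        # number of rows
ncol(my_df)        # number of columns
head(my_df, 2)     # first two rows
names(my_df)       # column names (data frames)
```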
Exercise 2
hint: your dataset should be in csv format.
Exercise 3 - renaming columns
It is common that the data we import contains column names that we need to change, e.g. names that are too long or contain strange characters.
The syntax for changing column names is as follows:
# colnames(my_df)[colnames(my_df)=="old_column_name"] <- "new_column_name"
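Applied to a concrete case, the renaming syntax might look like the sketch below. The data frame and both column names are made up for the example.

```r
# A data frame with an overly long column name (illustrative)
my_df <- data.frame(Participant.ID.Number = 1:3, age = c(34, 28, 45))

# Find the column whose name matches, then overwrite that name
colnames(my_df)[colnames(my_df) == "Participant.ID.Number"] <- "id"
colnames(my_df)
## [1] "id"  "age"
```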
In R, the command “subset” is used to filter the data in a data frame based on criteria you set.
When we subset data, it is recommended that we assign this to a new object / data frame so that data is not lost.
The subset command takes the following form:
# subset_df <- subset(my_df, criteria)
Here, criteria is a logical condition on a numeric or categorical variable. Category values are always written in quotation marks.
Examples:
Numeric variables
# subset_df <- subset(my_df, numeric_variable>=3)
OR
# subset_df <- subset(my_df, numeric_variable==10)
Categorical variable
# subset_df <- subset(my_df, categorical_variable=="category")
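The templates above can be tried on a small made-up data frame. The column names sex and visits, and the names of the resulting data frames, are assumptions for the example.

```r
# Illustrative data
my_df <- data.frame(
  sex = c("male", "female", "male", "female"),
  visits = c(2, 3, 10, 5)
)

# Numeric criterion: keep rows with 3 or more visits
frequent_df <- subset(my_df, visits >= 3)
nrow(frequent_df)
## [1] 3

# Categorical criterion: the category value goes in quotes
male_df <- subset(my_df, sex == "male")
male_df
##    sex visits
## 1 male      2
## 3 male     10
```

Note that the original row numbers are kept, and that my_df itself is unchanged because each result was assigned to a new object.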
Exercise 4
At least one of these subsets should be of a categorical variable (e.g. all “male”) and at least one should be of a numeric variable (e.g. more or less than a number).
Subsetting, Extracting and Bracket Notation
R’s bracket notation, or bracket operators, is a frequent source of confusion for new users, but it is very useful for subsetting data.
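As a rough sketch of the most common bracket forms, the example below subsets a vector and a data frame. The objects and values are made up for illustration.

```r
# Vectors: one index inside single brackets
myVector <- c(1, 50, 9, 42)
myVector[2]             # second element
## [1] 50
myVector[myVector > 5]  # elements that meet a condition
## [1] 50  9 42

# Data frames: [rows, columns]; leave one side empty to keep everything
my_df <- data.frame(id = 1:3, age = c(34, 28, 45))
my_df[2, "age"]         # row 2 of the column "age"
## [1] 28
over30 <- my_df[my_df$age > 30, ]  # rows where age > 30, all columns
over30
##   id age
## 1  1  34
## 3  3  45

# $ and [[ ]] extract a single column as a vector
my_df$age
## [1] 34 28 45
my_df[["age"]]
## [1] 34 28 45
```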
Exercise 4
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.
To use a package you need to install it once and then load it in every session in which you use it.
To install and load a package:
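A minimal sketch, using dplyr as the example package because it is covered next. The install line is commented out on the assumption that the package is already installed.

```r
# Install once per computer (downloads the package from CRAN)
# install.packages("dplyr")

# Load in every new R session before using the package's functions
library(dplyr)
```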
dplyr is a package that provides a grammar of data manipulation: a consistent set of verbs (functions) that help you solve the most common data manipulation challenges.
Important dplyr Functions
To select a set of columns we can use:
# select(my_df, column_name,...)
To select all the columns except one or more specific columns:
# select(my_df, -column_name, ...)
To select a range of columns by name, use the “:” operator:
# select(my_df, column_name_1:column_name_n)
To filter the rows based on a condition, you can use:
# filter(my_df, condition)
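The select() and filter() templates can be sketched on a small made-up data frame. The column names and values are assumptions, and dplyr is assumed to be installed.

```r
library(dplyr)

# Illustrative data
my_df <- data.frame(
  id = 1:4,
  sex = c("male", "female", "male", "female"),
  age = c(34, 28, 45, 52),
  visits = c(2, 3, 10, 5)
)

two_cols <- select(my_df, id, age)       # keep two columns
no_visits <- select(my_df, -visits)      # drop one column
col_range <- select(my_df, sex:visits)   # range of adjacent columns
names(col_range)
## [1] "sex"    "age"    "visits"

# Keep only the rows that satisfy the condition
older_df <- filter(my_df, age > 40)
older_df
##   id    sex age visits
## 1  3   male  45     10
## 2  4 female  52      5
```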
Exercise 4
Perform 3 subsets of your my_df, assigning each to a new data frame.
At least one of these subsets should be of a categorical variable (e.g. all “male”) and at least one should be of a numeric variable (e.g. more or less than a number).
The pipe operator, %>%, lets you “pipe” the output of one function into the input of the next. dplyr imports this operator from another package (magrittr). Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.
# select(my_df, column_name_1:column_name_n) %>%
#   filter(column_name_2 > 10)
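To make the left-to-right reading concrete, the sketch below writes the same operation in piped and nested form. The data frame and its columns are made up, and dplyr is assumed to be installed.

```r
library(dplyr)

# Illustrative data
my_df <- data.frame(
  id = 1:4,
  age = c(34, 28, 45, 52),
  visits = c(2, 3, 10, 5)
)

# Piped: start from my_df, then select columns, then filter rows
piped <- my_df %>%
  select(id, visits) %>%
  filter(visits > 2)

# Nested: the same steps, read from the inside out
nested <- filter(select(my_df, id, visits), visits > 2)

# Both forms keep the rows with id 2, 3 and 4
piped$id
## [1] 2 3 4
```

The pipe passes the result on its left as the first argument of the function on its right, which is why my_df is not repeated inside select() and filter().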
Exercise 5
Select at least 3 columns from my_df and filter the result, using the pipe operator.
To order rows by values of a column (low-high) we can use:
# arrange(my_df, column_name)
To order rows by values of a column in descending order (high-low) we can use:
# arrange(my_df, desc(column_name))
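Both orderings can be checked on a small made-up data frame. The column names are assumptions, and dplyr is assumed to be installed.

```r
library(dplyr)

# Illustrative data
my_df <- data.frame(id = 1:4, age = c(34, 28, 45, 52))

# Ascending order (low to high)
ascending <- arrange(my_df, age)
ascending$age
## [1] 28 34 45 52

# Descending order (high to low)
descending <- arrange(my_df, desc(age))
descending$age
## [1] 52 45 34 28
```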
Mutate allows us to create new columns/variables and is commonly used within a dplyr pipe. The syntax is:
# mutate(my_df, new_column_name = expression)
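As a sketch of mutate() inside a pipe, the example below derives two new columns from an existing one. The column names and the 1.55 m cut-off are invented for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Illustrative data
my_df <- data.frame(id = 1:3, height_cm = c(150, 160, 175))

# New columns are computed from existing ones; a column created earlier
# in the same mutate() call (height_m) can be used by a later one (tall)
my_df <- my_df %>%
  mutate(height_m = height_cm / 100,
         tall = height_m > 1.55)
my_df
##   id height_cm height_m  tall
## 1  1       150     1.50 FALSE
## 2  2       160     1.60  TRUE
## 3  3       175     1.75  TRUE
```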