Introduction

Workshop Scope

  • Comfortably use RStudio (a graphical interface for R)
  • Fluently interact with R using RStudio
  • Become familiar with R syntax
  • Understand data structures in R
  • Inspect and manipulate data structures
  • Install packages and use functions in R
  • Manipulate data using dplyr

What is R?

R is a powerful, extensible environment. It has a wide range of statistics and general data analysis and visualization capabilities.

  • Data handling, wrangling, and storage
  • Wide array of statistical methods and graphical techniques available
  • Easy to install on any platform and use (and it’s free!)
  • Open source with a large and growing community of peers

Why Use R?

Transitioning from Excel/SPSS to R

Learning R

R User

What is RStudio?

  • Graphical user interface, not just a command prompt
  • Great learning tool
  • Free for academic use
  • Platform agnostic
  • Open source

RStudio Interface

  • Console
  • Script editor
  • Environment/History
  • Files/Plots/Packages/Help

R Basic terms

  • Dataframe (df) - Excel/SPSS dataset
  • Variable - Excel column
  • Observation - Excel row
  • Datapoint - Excel cell value
  • Vectors - contiguous cells containing data of same type

R Operators - Arithmetic

R Operators - Relational

R Operators - Logical

R Operators - Assignment
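
The four operator families named in the slide titles above can be tried directly in the console. This is a minimal sketch; the variables x, y and z are illustrative and do not come from the slides.

```r
# Illustrative values (the names x, y, z are assumptions, not from the slides)
x <- 7
y <- 2

# Arithmetic operators
x + y     # addition
x - y     # subtraction
x * y     # multiplication
x / y     # division
x ^ y     # exponentiation
x %% y    # remainder (modulo)
x %/% y   # integer division

# Relational operators return logical values
x > y
x == y
x != y

# Logical operators combine logical values
(x > 5) & (y > 5)   # AND
(x > 5) | (y > 5)   # OR
!(x > 5)            # NOT

# Assignment: <- is the conventional assignment operator in R
z <- x + y
```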

Functions

A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.

  • sum(1, 2) adds two or more numbers
  • sum(a) sums the elements of a numeric vector a
  • sum(b, na.rm = TRUE) takes an extra argument: na.rm = TRUE ignores missing values (NA)
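
The three calls above can be run as follows. The slides do not define a and b, so the vectors here are assumed examples.

```r
# a and b are assumed example vectors; the slides do not define them
a <- c(2, 4, 6)
b <- c(1, NA, 3)

total_two  <- sum(1, 2)             # adds the two numbers
total_a    <- sum(a)                # adds all elements of a
total_b_na <- sum(b)                # NA, because b contains a missing value
total_b    <- sum(b, na.rm = TRUE)  # the NA is dropped before summing

print(total_b)
```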

Data types (Atomics) - Character

The character class is your typical string: a sequence of characters (letters, digits, spaces, symbols) enclosed in quotes.

# Assign "Hello R!" text to myString Variable
myString <- "Hello R!"
# Print myString Variable
print(myString)
## [1] "Hello R!"
# Check type/class of myString variable
class(myString)
## [1] "character"

Data types (Atomics) - Numeric

Corresponds to a double-precision “float” (“double”) in other languages. It holds numeric values like 10, 15.6, -48792.5498982749879 and so on.

# Assign 10.4343 number to myNum Variable
myNum <- 10.4343
# Print myNum Variable
print(myNum)
## [1] 10.4343
# Check type/class of myNum variable
class(myNum)
## [1] "numeric"

Data types (Atomics) - Numeric Cont…

  • Inf is a special number that represents infinity
myInf <-  1/0
print(myInf)
## [1] Inf
  • NaN is also a special number which stands for “Not a Number”
myNaN <- 0/0
myNaN
## [1] NaN

Data types (Atomics) - Integer

Integers are whole numbers. R stores a number typed without a suffix as numeric (double), even when it is whole, so use as.integer() or the L suffix to get an integer.

myInt <- 209173987
class(myInt)
## [1] "numeric"
myInt <- as.integer(myInt)
class(myInt)
## [1] "integer"
myInt <- 209173987L
class(myInt)
## [1] "integer"

Data types (Atomics) - Logical

Logical types (booleans) are the same as in most other languages and can take two values: true or false. True is written TRUE and false is written FALSE. The abbreviations T and F also work, but they are ordinary variables that can be overwritten, so TRUE and FALSE are safer.

Logical_1 <- TRUE 
class(Logical_1)
## [1] "logical"
Logical_2 <- 2 < 1 
class(Logical_2)
## [1] "logical"

Data Structures - Vectors

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It is an ordered collection of values of the same type.

myVector <- c(1,50,9,42)
myVector
## [1]  1 50  9 42
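
A short sketch of what can be done with a vector once it exists. The variable names other than myVector are illustrative. The last lines show that a vector holds a single type, so c() converts mixed values to a common one.

```r
myVector <- c(1, 50, 9, 42)

n_values    <- length(myVector)  # number of elements
first_value <- myVector[1]       # R indexes from 1, not 0
doubled     <- myVector * 2      # arithmetic is applied element by element

# Mixing types: every element is converted to character
mixed <- c(1, "a", TRUE)
class(mixed)
```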

Data Structures - Vectors

Exercise

  1. Create a numeric vector that contains (65, 70.6, 88, 50, 80, 5) and assign it to a variable called marks
  2. Create a character vector that contains (Faryab, Balkh, Herat, Faryab, Jawzjan, Kabul) and assign it to a variable called names
  3. Combine the marks and names vectors into a single vector.
  4. Assign this combined vector to a new variable called combined

hint: You can use the c() function to create the vectors and also to combine them.

Data Structures - Factors

A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level).

Strlevels <- c("low", "high", "medium", "high", "low", "medium", "high")
Factorlevels <- factor(Strlevels)
Factorlevels
## [1] low    high   medium high   low    medium high  
## Levels: high low medium
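
By default the levels are sorted alphabetically, as the output above shows. The sketch below, which assumes that low < medium < high is the intended order, sets the level order explicitly. Orderedlevels is an illustrative name.

```r
Strlevels <- c("low", "high", "medium", "high", "low", "medium", "high")

# Default: levels are sorted alphabetically
Factorlevels <- factor(Strlevels)
levels(Factorlevels)
table(Factorlevels)   # number of observations per level

# Set the level order explicitly (assumed order: low, medium, high)
Orderedlevels <- factor(Strlevels, levels = c("low", "medium", "high"))
levels(Orderedlevels)

# Each value is stored internally as the integer position of its level
codes <- as.integer(Orderedlevels)
```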

Data Structures - Matrix

A matrix in R is a collection of vectors of the same length and identical data type. Vectors can be combined as columns or as rows to create a 2-dimensional structure.
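
A minimal sketch of building a matrix by columns, by rows, and with matrix(). The vectors v1 and v2 are made up for illustration.

```r
# Example vectors (illustrative names)
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

by_col <- cbind(v1, v2)   # vectors become columns: 3 rows, 2 columns
by_row <- rbind(v1, v2)   # vectors become rows: 2 rows, 3 columns

# matrix() fills column by column unless byrow = TRUE is given
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)
value <- m[2, 3]          # element in row 2, column 3
```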

Data Structures - Data Frame

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.

We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function.
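
For example, three vectors of equal length can be combined into a data frame like this. The vectors and the name example_df are invented for illustration.

```r
# Illustrative vectors; each one becomes a column
student <- c("A", "B", "C")
score   <- c(65, 70.6, 88)
passed  <- score >= 70

example_df <- data.frame(student, score, passed)
example_df

class(example_df)   # "data.frame"
example_df$score    # a single column, returned as a vector
```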

Data Structures - Data Frame Cont…

Exercise

  1. Using the marks and names vectors you created earlier, create a data frame and name it myfirst_df
  2. Use the class() function to check the type of myfirst_df.

hint: You can use the data.frame() function to create a data frame.

Data Structures - Data Frame Cont…

Importing data in R

# read.csv() is part of base R
#df <- read.csv("myproject/subfolder/fileName.csv")
# read_excel() requires the readxl package: library(readxl)
#df <- read_excel("myproject/subfolder/fileName.xlsx")
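
To try read.csv() without a file of your own, the sketch below first writes a tiny CSV to a temporary location and then reads it back. The file contents and the name csv_path are made up for illustration.

```r
# Create a small CSV in a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ali,34", "Sara,29"), csv_path)

# The first line of the file is used as the column names
df <- read.csv(csv_path)
df
```

With your own data, replace csv_path with the path to your file, relative to the working directory shown by getwd().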

Data Structures - Data Frame Cont…

Inspecting your data

All data structures

  • str(): compact display of an object's structure and contents (similar to what the Environment pane shows)
  • class(): data type (e.g. character, numeric, etc.) of vectors and data structure of dataframes, matrices, and lists.
  • head(): prints the first entries of the variable (6 by default)
  • tail(): prints the last entries of the variable (6 by default)

Vector and factor variables:

  • length(): returns the number of elements in the vector or factor

Data Structures - Data Frame Cont…

Inspecting your data

Dataframe and matrix variables:

  • dim(): returns dimensions of the dataset
  • nrow(): returns the number of rows in the dataset
  • ncol(): returns the number of columns in the dataset
  • rownames(): returns the row names in the dataset
  • colnames(): returns the column names in the dataset
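
The inspection functions listed on these slides can be tried on iris, a dataset that ships with R. It is used here only as a stand-in for your own data.

```r
# iris is a built-in example dataset (not part of the workshop data)
str(iris)              # structure: column names, types, first values
class(iris)            # "data.frame"
head(iris)             # first 6 rows
tail(iris, 3)          # last 3 rows

length(iris$Species)   # number of elements in one column

size <- dim(iris)      # rows and columns together
nrow(iris)             # number of rows
ncol(iris)             # number of columns
colnames(iris)         # column names
```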

Data Structures - Data Frame Cont…

Exercise 2

  1. Import a dataset of your choice into R using the read.csv() function and call it my_df.
  2. Inspect my_df using the functions in the previous slides.

hint: your dataset should be in csv format.

Data Structures - Data Frame Cont…

Exercise 3 - renaming columns

The data we import often contains column names that we need to change, e.g. because the names are too long or contain unusual characters.

The syntax for changing column names is as follows:

# colnames(my_df)[colnames(my_df)=="old_column_name"] <- "new_column_name"
  1. Rename 3 columns in your my_df dataset
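
As a worked illustration of the renaming syntax, here is a sketch on a small made-up data frame. The column names are invented, so substitute the ones from your own my_df.

```r
# Made-up data frame with awkward column names
example_df <- data.frame(
  Respondent.Age.In.Years = c(34, 29),
  gndr = c("male", "female")
)

# Find the column whose name matches, then overwrite that name
colnames(example_df)[colnames(example_df) == "Respondent.Age.In.Years"] <- "age"
colnames(example_df)[colnames(example_df) == "gndr"] <- "gender"

colnames(example_df)
```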

Data Frames - Subsetting Data

In R, the subset() function is used to filter the rows of a data frame based on criteria you set.

When we subset data, it is recommended that we assign this to a new object / data frame so that data is not lost.

The subset command takes the following form:

# subset_df <- subset(my_df, criteria)

Here, criteria is a logical condition on a numeric or categorical variable. Category values always appear in quotation marks.

Data Frames - Subsetting Data ctnd

Examples:

Numeric variables

# subset_df <- subset(my_df, numeric_variable>=3) 

OR

# subset_df <- subset(my_df, numeric_variable==10) 

Categorical variable

# subset_df <- subset(my_df, categorical_variable=="category")
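
The same patterns on a small made-up data frame. The columns gender and visits are illustrative. The last line shows that conditions can be combined with & (and) or | (or).

```r
# Made-up data for illustration
example_df <- data.frame(
  gender = c("male", "female", "female", "male"),
  visits = c(2, 5, 10, 3)
)

# Numeric criteria
at_least_3 <- subset(example_df, visits >= 3)
exactly_10 <- subset(example_df, visits == 10)

# Categorical criterion: the category value is quoted
females <- subset(example_df, gender == "female")

# Two conditions combined with &
frequent_males <- subset(example_df, gender == "male" & visits >= 3)
```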

Data Frames - Subsetting Data ctnd

Exercise 4

  1. Perform 3 subsets of your my_df, assigning each to a new data frame.

At least one of these subsets should be of a categorical variable (e.g. all “male”) and at least one should be of a numeric variable (e.g. more or less than a number).

Data Frames - More Subsetting

Subsetting, Extracting and Bracket Notation

R’s bracket notation, or bracket operators, is a frequent source of confusion for new users, but it is very useful for subsetting data

  • my_df[1:3] (no comma) will subset my_df, returning the first three columns as a data frame.
  • my_df[1:3, ] (with comma, numbers to left of the comma) will subset my_df and return the first three rows as a data frame.
  • my_df[, 1:3] (with comma, numbers to right of the comma) will subset my_df and return the first three columns as a data frame, the same as my_df[1:3].
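
A sketch of the three forms on a made-up data frame (all names are illustrative), plus two common variations: selecting rows by a condition and columns by name, and extracting a single column.

```r
# Made-up data for illustration
example_df <- data.frame(
  id    = 1:5,
  age   = c(21, 35, 42, 19, 58),
  city  = c("Kabul", "Herat", "Balkh", "Kabul", "Herat"),
  score = c(65, 70.6, 88, 50, 80)
)

cols_a     <- example_df[1:3]     # no comma: first three columns
first_rows <- example_df[1:3, ]   # left of the comma: first three rows
cols_b     <- example_df[, 1:3]   # right of the comma: first three columns

all.equal(cols_a, cols_b)         # both forms select the same columns

# Rows by condition, columns by name
over_30 <- example_df[example_df$age > 30, c("id", "age")]

# A single column selected with a comma comes back as a vector
ages <- example_df[, 2]
```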

Data Frames - More Subsetting ctnd

Exercise 4

  1. Perform 2 subsets of your my_df, assigning each to a new data frame. Use Bracket Notation [ ]
  2. Write the new subsets to your output folder (e.g. with write.csv()).

R Packages

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.

To use a package you need to:

  1. Install the package, if it is not included in the R installation (you need to do this only once).
  2. Load the package into your R session (you need to do this in every new session).

R Packages ctnd

To install and load a package:

  1. Choose Install Packages from Tools menu or Packages panel.
  2. Search/Select a package. (e.g. dplyr)
  3. Click Install
  4. Then use the library(packageName) function to load it for use. (e.g. library(boot))
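
The same steps can be done from the console instead of the menus. This is a sketch that assumes an internet connection and uses the CRAN cloud mirror. The requireNamespace() check is there only to skip the installation when dplyr is already present.

```r
# Install dplyr only if it is not already available (done once per computer)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

# Load it for the current session (repeat in every new session)
library(dplyr)

# Check that the package is attached
is_attached <- "package:dplyr" %in% search()
```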

dplyr

dplyr is a package that provides a grammar of data manipulation: a consistent set of verbs (functions) that help you solve the most common data manipulation challenges.

Important dplyr Functions

  • select() = Select columns
  • filter() = Filter rows
  • arrange() = Re-order or arrange rows
  • mutate() = Create new columns
  • summarize() = Summarize values
  • group_by() = Allows for group operations
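
Most of these verbs are covered one by one in the slides that follow. As a sketch of the last two, the example below groups a made-up data frame by city and summarizes each group. The data and column names are illustrative.

```r
library(dplyr)

# Made-up data for illustration
example_df <- data.frame(
  city  = c("Kabul", "Herat", "Kabul", "Herat"),
  score = c(60, 70, 80, 90)
)

# One output row per city, with the mean score and the number of rows
city_means <- example_df %>%
  group_by(city) %>%
  summarize(mean_score = mean(score), n = n())

city_means
```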

dplyr - select()

To select a set of columns we can use:

# select(my_df, column_name,...)

To select all the columns except one or more specific columns:

# select(my_df, -column_name, ...)

To select a range of columns by name, use the “:” operator

# select(my_df, column_name_1:column_name_n)
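
The three forms of select() on a made-up data frame (column names are illustrative):

```r
library(dplyr)

# Made-up data for illustration
example_df <- data.frame(
  id    = 1:3,
  age   = c(21, 35, 42),
  city  = c("Kabul", "Herat", "Balkh"),
  score = c(65, 70.6, 88)
)

picked  <- select(example_df, id, score)    # keep only these columns
dropped <- select(example_df, -city)        # keep everything except city
ranged  <- select(example_df, age:score)    # keep age, score and the columns between them

names(ranged)
```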

dplyr - filter()

To filter the rows based on a condition, you can use:

# filter(my_df, condition)
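
A sketch of filter() with a numeric condition, a categorical condition, and two conditions at once, on made-up data. Conditions separated by commas are combined with AND.

```r
library(dplyr)

# Made-up data for illustration
example_df <- data.frame(
  id    = 1:3,
  age   = c(21, 35, 42),
  city  = c("Kabul", "Herat", "Balkh"),
  score = c(65, 70.6, 88)
)

high_scores <- filter(example_df, score > 66)            # numeric condition
herat_only  <- filter(example_df, city == "Herat")       # categorical condition
both        <- filter(example_df, score > 66, age < 40)  # both must be true
```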

Exercise 4

Perform 3 subsets of your my_df using filter(), assigning each to a new data frame.

At least one of these subsets should be of a categorical variable (e.g. all “male”) and at least one should be of a numeric variable (e.g. more or less than a number).

dplyr - Pipe operator: %>%

dplyr imports this operator from another package (magrittr). This operator allows you to “pipe” the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right.

# select(my_df, column_name_1:column_name_n) %>%
#   filter(column_name_2 > 10)
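
To make the left-to-right reading concrete, the sketch below writes the same operation twice on made-up data: once with the pipe and once nested. Both give the same rows and columns.

```r
library(dplyr)

# Made-up data for illustration
example_df <- data.frame(
  id    = 1:3,
  age   = c(21, 35, 42),
  city  = c("Kabul", "Herat", "Balkh"),
  score = c(65, 70.6, 88)
)

# Piped: read top to bottom
piped <- example_df %>%
  select(id, age, score) %>%
  filter(score > 66)

# Nested: read from the inside out
nested <- filter(select(example_df, id, age, score), score > 66)

piped
```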

Exercise 5

Select at least 3 columns from my_df and filter the result, using the pipe operator.

dplyr - arrange()

To order rows by values of a column (low-high) we can use:

# arrange(my_df, column_name)

To order rows by values of a column in descending order (high-low) we can use:

#  arrange(my_df, desc(column_name))
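
Both orderings on a made-up data frame (names are illustrative):

```r
library(dplyr)

# Made-up data, deliberately unsorted
example_df <- data.frame(
  id    = 1:3,
  score = c(70.6, 65, 88)
)

ascending  <- arrange(example_df, score)        # low to high
descending <- arrange(example_df, desc(score))  # high to low

descending
```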

dplyr - mutate()

Mutate allows us to create new columns/variables and is commonly used within a dplyr pipe. The syntax is:

# mutate(my_df, new_column_name = expression)
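
A sketch of mutate() on made-up data. The new columns score_fraction and passed, and the pass mark of 70, are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up data for illustration
example_df <- data.frame(
  id    = 1:3,
  score = c(70.6, 65, 88)
)

# Each new column is computed from existing columns, row by row
with_new_cols <- mutate(
  example_df,
  score_fraction = score / 100,
  passed = score >= 70   # assumed pass mark
)

with_new_cols
```

The original columns are kept and the new ones are added on the right. mutate() returns a new data frame, so assign the result if you want to keep it.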
