Introduction

This short tutorial covers ‘Essential R,’ or the most important R constructions that you will encounter most frequently. I’ve tried to condense the main R phrases that are used the most. It is possible to build pretty sophisticated code with these statements. Once you’ve mastered these statements, you can move on to expanding your R repertoire.

Install R and RStudio

Install the latest R version from CRAN (https://cran.r-project.org/).

Also install RStudio (https://www.rstudio.com/products/rstudio/download/#download).

RStudio is one of the best IDEs for developing in R.

Because R has so many commands and options, it is easy to get lost in R code, so we will take a short tour to get an overview. Typically, a line of R code has the following shape:

object <- function(argument_name = argument_value, options)

For example, if we want to run a linear regression model, we typically write the code as

model <- lm(formula = dependent_variable ~ independent_variable, data = dataset_name)

R is case-sensitive, so the argument name must be written as lowercase data. We can say that R is a function-based programming language, and most R code follows the syntax pattern above.
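As a concrete instance of this pattern, here is a minimal sketch that fits a regression on R's built-in mtcars data set. The choice of mtcars and of the variables mpg and wt is an assumption made only for illustration.

```r
# Pattern: object <- function(argument name = argument value, ...)
# mtcars ships with base R; here mpg (miles per gallon) is regressed on wt (weight).
model <- lm(formula = mpg ~ wt, data = mtcars)

# The fitted object can be printed or queried for its coefficients.
print(model)
slope <- coef(model)["wt"]
```

The object model now stores the whole fit, so later commands can reuse it without refitting.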

One more simple example

sales<-c(34,45,65)
sales
## [1] 34 45 65

This creates a vector called sales; the c() function combines (concatenates) its arguments into a vector.

Installing and Loading Packages

Packages extend R’s capabilities. We need to install a package once, before we can load it. We use the following code to install the tidyverse package:

install.packages("tidyverse")

We load the tidyverse package when we need to use it:

library(tidyverse)

A package such as tidyverse needs to be installed only once with install.packages(). After that, we load it with library(tidyverse) in every new R session in which we want to use its functions. These steps come first, before we run any command that depends on the package.
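One way to avoid reinstalling a package on every run is to check whether it is already available. The sketch below uses a hypothetical helper, ensure_installed(), which is not part of R or of the tidyverse; it is demonstrated on the stats package, which always ships with R, so nothing is downloaded here.

```r
# Hypothetical helper: install a package only if it is missing.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" is always present, so this call installs nothing.
ok <- ensure_installed("stats")

# Loading is still needed in every new session.
library(stats)
```

Calling ensure_installed("tidyverse") followed by library(tidyverse) would follow the same install-once, load-every-session routine.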

Quick Tips

  • Always create a project in a new directory (or in an existing directory) before starting a new R session, so that all output and files are saved in the same place.

  • Work in a script rather than typing code directly into the RStudio console. In RStudio, click on File, then New File, and then R Script, and save the file (for example as ‘Script’). Type in 2 + 3 and click on Run; RStudio prints the result in the Console window.

    2 + 3

    ## [1] 5

You can watch my Bangla R programming tutorial on my YouTube channel: https://youtube.com/playlist?list=PLPNtMcgVLNVQAA0apA2DQy2QCPaPSX2mq

For the English tutorial: https://youtube.com/playlist?list=PLPNtMcgVLNVQMQOdjrUBlzaOU8p1q1Ba-

Quick Tips

To get help on any topic, use ? or help() in the RStudio console, for example ?mean or help(mean).

Key Objects/Data Types in R

We work with objects of various sorts in R. Let’s look at four important objects using basic examples: vector, matrix, list, and data frame.

Vectors

The vector is the most important data type (object) in R programming; the other data structures in R are built from vectors. For example, we set up a vector called price, consisting of three prices. We type the following in the script window and then click on Run, which runs that line. The line then appears in the console window.

price<-c(10,2,15)
print(price)
## [1] 10  2 15

The three prices are equal to 10, 2 and 15. We use the c() function which stands for concatenate, and parentheses enclose the values that are separated by commas.

Another example,

myvector <- c(1,2,4,5,8,4, 10,12)

A vector is similar to the values in one column (variable) of an Excel spreadsheet. In an Excel file, the values entered under a variable name usually share the same type and unit; likewise, all elements of an R vector have the same type.

So we start with a vector every time we want to create a variable for a data frame in R.
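Because a vector holds values of one type, R can operate on all of its elements at once. A minimal sketch using the price vector from above (the variable names doubled, n_prices, avg_price and price_class are chosen here for illustration):

```r
price <- c(10, 2, 15)

# Arithmetic is applied element by element.
doubled <- price * 2          # 20 4 30

# Summary functions take the whole vector at once.
n_prices <- length(price)     # 3
avg_price <- mean(price)      # 9

# All elements share one type; here the type is numeric.
price_class <- class(price)   # "numeric"
```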

Next, Matrices

A matrix is like a vector, but arranged in a specified number of rows and columns. By default, R fills a matrix column by column.

For example, we create a 5 x 4 numeric matrix:

y<-matrix(1:20, nrow=5,ncol=4)
print(y)
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
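Elements of a matrix are addressed as [row, column]. The sketch below reuses the matrix y from above; the second matrix z is an extra illustration of the byrow argument and is not part of the original example.

```r
y <- matrix(1:20, nrow = 5, ncol = 4)

dim(y)     # 5 rows, 4 columns
y[2, 3]    # element in row 2, column 3: 12
y[, 1]     # the whole first column: 1 2 3 4 5

# byrow = TRUE fills the matrix row by row instead of column by column.
z <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
z[1, ]     # first row: 1 2 3 4
```

Leaving the row or the column position empty, as in y[, 1], selects everything along that dimension.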

By the way, if you want to learn basic statistics and research methods, you can follow this playlist on my YouTube channel:

https://youtube.com/playlist?list=PLPNtMcgVLNVTJTvs_0cOHprthsaufsdai

List

The second most important data object in R programming is the list. A list is an object in which we can combine elements of different types, such as character strings and numeric values.

The elements of a list can be dissimilar. The list below contains the following objects: a) a character value, b) a character vector, c) a matrix and d) a numeric value.

w <-list(name="Fred", mynumbers="charVector", mymatrix=y, age=5.3)
print(w)
## $name
## [1] "Fred"
## 
## $mynumbers
## [1] "charVector"
## 
## $mymatrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
## 
## $age
## [1] 5.3

This shows how a list can hold objects of different types.
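Individual elements of a list can be pulled out by name. A short sketch, rebuilding the list w from above so that the block runs on its own:

```r
y <- matrix(1:20, nrow = 5, ncol = 4)
w <- list(name = "Fred", mynumbers = "charVector", mymatrix = y, age = 5.3)

w$name              # "Fred"
w[["age"]]          # 5.3
w$mymatrix[1, 2]    # row 1, column 2 of the stored matrix: 6
names(w)            # the names of the four elements
length(w)           # 4
```

Both $ and [[ ]] return the element itself, so the usual indexing for that element (here the matrix) can follow directly.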

Finally, Data Frame

A data frame is the most commonly used data structure in R, and most data analysis is done on data frames. A data frame can be thought of as an Excel sheet with rows and columns. Each column can be of a different type, but all elements within a column must be of the same type.

d <-c(1,2,3,4)
e <-c("red", "white", "red", NA)
f <-c(TRUE,TRUE,TRUE,FALSE)
mydataframe <-data.frame(d,e,f)
names(mydataframe) <- c("ID","Color","Passed")
mydataframe
##   ID Color Passed
## 1  1   red   TRUE
## 2  2 white   TRUE
## 3  3   red   TRUE
## 4  4  <NA>  FALSE

This is how we create a data frame in R programming.
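Once a data frame exists, its columns and rows can be inspected directly. The sketch below rebuilds mydataframe in a single call (an equivalent shortcut, not the author's exact steps); the names colors, passed_rows and n_missing are chosen here for illustration.

```r
mydataframe <- data.frame(
  ID = c(1, 2, 3, 4),
  Color = c("red", "white", "red", NA),
  Passed = c(TRUE, TRUE, TRUE, FALSE)
)

# One column, returned as a vector.
colors <- mydataframe$Color

# Only the rows where Passed is TRUE.
passed_rows <- mydataframe[mydataframe$Passed, ]

# Number of missing values in the Color column: 1
n_missing <- sum(is.na(mydataframe$Color))
```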

Do it by Yourself

Create a data frame containing the vectors Price and Quantity with hypothetical values. Try to print the first column of the data frame. Hint: use the $ sign to refer to a variable inside the data frame.

Learn R Free Software

You can learn about free R-based software on my channel:

https://youtu.be/OJnAuNmZVwQ

https://youtu.be/hmQJqIfmj5I

https://youtu.be/iQS7T0517L0

Quick Tips

  • Get a range of elements, from the 3rd to 7th in a vector
myvector <-c(1,2,4,5,8,4, 10,12)
s <-myvector[3:7]
s
## [1]  4  5  8  4 10
  • Exclude elements from vector

# Remove the 2nd element from myvector

myvector <-c(1,2,4,5,8,4, 10,12)
myvector[-2]
## [1]  1  4  5  8  4 10 12

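Use square brackets [ ] to select or exclude elements of a vector; a negative index drops that element. Note that myvector[-2] returns a new vector and leaves myvector itself unchanged.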

  • For loops

A ‘for’ loop repeats a block of code once for each value in a sequence. Important note: ‘for’ loops in R can be slow on large data, so try to use vectorized operations instead of ‘for’ loops wherever possible.

for (i in 1:5) {
  print(i * 5)
}

Here i is the name of the loop variable; it takes the values 1 to 5 in turn.
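To make the advice about vectors concrete, the sketch below computes the same five values twice, once with a loop and once with a single vectorized expression. The names loop_result and vector_result are chosen here for illustration.

```r
# Loop version: compute the values one at a time.
loop_result <- numeric(5)
for (i in 1:5) {
  loop_result[i] <- i * 5
}

# Vectorized version: one expression handles all five values.
vector_result <- (1:5) * 5

loop_result     # 5 10 15 20 25
vector_result   # 5 10 15 20 25
```

The vectorized form is shorter and, on long vectors, usually faster.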

  • Check the size of the data frame

    data("iris")
    dim(iris)
    ## [1] 150   5

A very useful command to check the class of all columns in a data frame is the ‘str’ command. The ‘str’ command lists the columns in the data frame, their type and also displays a few representative values from each column

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

To use any of R’s built-in data sets, load it with the data() function, as we did with iris above. R ships with many built-in data sets that are handy for practice.

  • Check the 1st few rows of the data frame

    head(iris)
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1          5.1         3.5          1.4         0.2  setosa
    ## 2          4.9         3.0          1.4         0.2  setosa
    ## 3          4.7         3.2          1.3         0.2  setosa
    ## 4          4.6         3.1          1.5         0.2  setosa
    ## 5          5.0         3.6          1.4         0.2  setosa
    ## 6          5.4         3.9          1.7         0.4  setosa
  • Final quick tip for data investigation: the summary() function

    summary(iris)
    ##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
    ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
    ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
    ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
    ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
    ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
    ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
    ##        Species  
    ##  setosa    :50  
    ##  versicolor:50  
    ##  virginica :50  
    ##                 
    ##                 
    ## 

To subset a data frame, for example to display rows 1 to 6 and columns 2 to 4, we use df[row, column] or df[row vector, column vector]. This works the same way as subsetting a matrix.

iris[1:6,2:4]
##   Sepal.Width Petal.Length Petal.Width
## 1         3.5          1.4         0.2
## 2         3.0          1.4         0.2
## 3         3.2          1.3         0.2
## 4         3.1          1.5         0.2
## 5         3.6          1.4         0.2
## 6         3.9          1.7         0.4
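The same bracket notation also accepts column names and logical conditions. A small sketch on iris, where the object names widths and setosa are chosen here for illustration:

```r
data("iris")

# Select columns by name instead of by position.
widths <- iris[1:6, c("Sepal.Width", "Petal.Width")]

# Select rows with a logical condition: only the setosa flowers.
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)   # 50
```

Selecting by name keeps working even if the column order of the data changes.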

You can inspect a data set with the commands above before starting any data analysis. One last thing before we move on to the tidyverse, the most important collection of R packages for data science, is how to load your own Excel data into RStudio. It is very simple: save the Excel sheet as a CSV file, put the file in your project directory, and then read it with read.csv("data frame.csv").
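The sketch below shows a full round trip with read.csv(). Since no real file is available here, it first writes a small hypothetical data frame (sales_data, invented for this illustration) to a temporary CSV file and then reads it back.

```r
# Hypothetical data, written to a temporary CSV file for the demonstration.
sales_data <- data.frame(price = c(10, 2, 15), quantity = c(3, 8, 1))
csv_path <- file.path(tempdir(), "sales_data.csv")
write.csv(sales_data, csv_path, row.names = FALSE)

# Read the CSV file back into a data frame and inspect it.
sales_from_csv <- read.csv(csv_path)
str(sales_from_csv)
```

With your own data, csv_path would simply be the name of the CSV file in your project directory.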

We now present Hadley Wickham’s approach to the data analysis workflow, which is one of the most popular data science workflows in R programming. This approach is set out in Grolemund and Wickham (2017). Wickham has designed several packages to work on different parts of the workflow.

The Tidyverse Approach

Typically, a data analyst follows a workflow consisting of these parts:

  • get data into R,

  • tidy and transform,

  • visualize and model, and

  • communicate.

The ‘tidyverse’ (Wickham, 2017) is a collection of Hadley Wickham’s tools that assist us with various parts of the data analysis workflow. We will now use the tidyverse package created by Wickham. This requires us to install the package beforehand. A package only has to be installed once. It can be installed with a command or through the RStudio interface. We can then load the package using the library() function.

We enter our data in R with each variable as a separate vector. This is hypothetical data of a survey of six persons, and the variables are: payment they received, hours worked, their gender and age.

surv_id <-c(1,2,3,4,5,6)
payment <-c(1000,700,600,1200,800,500)
hours <-c(7,5,3,6,7,4)
gender <-c("F","M","F","M","M","M")
age <-c(28,52,37,35,59,43)

In R, data is usually stored in a data frame. Wickham has designed a ‘tibble’ to improve on R’s data frame. A tibble, which is similar to a data frame, stores our data. We call our tibble labour. We create a tibble with the tibble() function.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
labour <- tibble(surv_id,
payment, hours,
gender,age)

We print labour. Note that R would otherwise only create the tibble, and would do so quietly.

labour
## # A tibble: 6 x 5
##   surv_id payment hours gender   age
##     <dbl>   <dbl> <dbl> <chr>  <dbl>
## 1       1    1000     7 F         28
## 2       2     700     5 M         52
## 3       3     600     3 F         37
## 4       4    1200     6 M         35
## 5       5     800     7 M         59
## 6       6     500     4 M         43

This example is taken from Vikram Dayal’s book, via his GitHub account.

We get a glimpse of our data with the glimpse function.

glimpse(labour)
## Rows: 6
## Columns: 5
## $ surv_id <dbl> 1, 2, 3, 4, 5, 6
## $ payment <dbl> 1000, 700, 600, 1200, 800, 500
## $ hours   <dbl> 7, 5, 3, 6, 7, 4
## $ gender  <chr> "F", "M", "F", "M", "M", "M"
## $ age     <dbl> 28, 52, 37, 35, 59, 43

The glimpse() function is similar to the str() function in base R.

A tibble consists of several vectors. We extract the gender column from the labour dataset:

labour$gender
## [1] "F" "M" "F" "M" "M" "M"
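The second and third elements of the gender column can be extracted by indexing the column like any other vector. The sketch below rebuilds a reduced copy of labour with only two of its columns so that the block runs on its own; it assumes the tibble package is installed.

```r
library(tibble)

# Reduced copy of the labour tibble (two columns only).
labour <- tibble(
  surv_id = c(1, 2, 3, 4, 5, 6),
  gender  = c("F", "M", "F", "M", "M", "M")
)

# Index the extracted column like any other vector.
second_third <- labour$gender[2:3]
second_third   # "M" "F"
```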

We extract the first row and then the second column.

labour[1,]
## # A tibble: 1 x 5
##   surv_id payment hours gender   age
##     <dbl>   <dbl> <dbl> <chr>  <dbl>
## 1       1    1000     7 F         28
labour[,2]
## # A tibble: 6 x 1
##   payment
##     <dbl>
## 1    1000
## 2     700
## 3     600
## 4    1200
## 5     800
## 6     500

Five Data Verbs

These five data verbs are the most important functions for data science, especially for data cleaning and manipulation. It is often said that about 80% of data science work is data cleaning. Thus, if you know these five verbs (functions) in the tidyverse package, you know enough to start your own project.

The general format of the data verb syntax is

verb(dataset, a = b + c)

We start with the verb function, then name the data set as the first argument, then supply the remaining arguments.

Five data verbs help us do a lot with data. We use them through the dplyr package written by Wickham, which is part of the tidyverse package.

The five data verbs are filter(), mutate(), arrange(), select() and summarize().

All of these can be used with the group_by() function. The group_by() function groups a data set by the categories of a variable so that we can summarize each group. For example, if you want to summarize your data set by gender, then group_by() will group your data by gender.

You can find all the basic statistical information on my channel:

https://youtube.com/playlist?list=PLPNtMcgVLNVQAA0apA2DQy2QCPaPSX2mq

Quick Tips

These five verbs, along with the pipe symbol, help us accomplish a lot when working with data. With the pipe symbol %>%,

x %>% f(y)

is equivalent to

f(x, y)

The pipe symbol pipes x into the function f of x and y. So what is on the left of the pipe gets piped in as the first argument of the function on the right. This might appear strange at first and takes some getting used to, but it greatly helps us carry out actions on data that build on each other and makes code easier to understand.
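The equivalence can be checked directly. The sketch below assumes the dplyr package is installed (loading it makes %>% available); the numbers and the names piped, nested and total are invented for illustration.

```r
library(dplyr)  # makes the %>% pipe available

x <- c(1, 4, 9)

# These two lines are equivalent: the left-hand side becomes
# the first argument of round().
piped  <- (x / 3) %>% round(1)
nested <- round(x / 3, 1)

# Pipes can be chained; read this as "take x, then sqrt, then sum".
total <- x %>% sqrt() %>% sum()   # 1 + 2 + 3 = 6
```

Reading a chain from left to right follows the order in which the steps are carried out, which is why piped code is often easier to follow than nested calls.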

Filter Verb

We select rows by using filter. We filter for rows where the gender is female. We create a new tibble by piping labour into the filter function, asking it to give us the rows with female gender (F). Note the use of two equals signs (==) to test for equality.

labour_filter <- labour %>%
filter(gender == "F")
labour_filter
## # A tibble: 2 x 5
##   surv_id payment hours gender   age
##     <dbl>   <dbl> <dbl> <chr>  <dbl>
## 1       1    1000     7 F         28
## 2       3     600     3 F         37
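filter() also accepts several conditions at once. The sketch below rebuilds labour so that it runs on its own and assumes dplyr is installed; the conditions on age and hours, and the name labour_older_long, are chosen only for illustration.

```r
library(dplyr)

labour <- tibble(
  surv_id = c(1, 2, 3, 4, 5, 6),
  payment = c(1000, 700, 600, 1200, 800, 500),
  hours   = c(7, 5, 3, 6, 7, 4),
  gender  = c("F", "M", "F", "M", "M", "M"),
  age     = c(28, 52, 37, 35, 59, 43)
)

# Conditions separated by commas must all hold (logical AND).
labour_older_long <- labour %>%
  filter(age > 40, hours >= 5)
labour_older_long$surv_id   # 2 5
```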

Do it by Yourself

Filter for rows where the gender is male, calling the new tibble labour_filter2. Check that you have done this correctly by printing labour_filter2.

Mutate

We create new variables with mutate; here we calculate the wage rate as payment divided by hours.

labour_mutate <- labour %>%
mutate(wage = payment /
hours)
labour_mutate
## # A tibble: 6 x 6
##   surv_id payment hours gender   age  wage
##     <dbl>   <dbl> <dbl> <chr>  <dbl> <dbl>
## 1       1    1000     7 F         28  143.
## 2       2     700     5 M         52  140 
## 3       3     600     3 F         37  200 
## 4       4    1200     6 M         35  200 
## 5       5     800     7 M         59  114.
## 6       6     500     4 M         43  125

Do it by Yourself

Create a new variable called minutes with mutate: minutes = hours * 60.

We arrange the data by hours worked with arrange.

labour_arrange <- labour %>%
arrange(hours)
labour_arrange
## # A tibble: 6 x 5
##   surv_id payment hours gender   age
##     <dbl>   <dbl> <dbl> <chr>  <dbl>
## 1       3     600     3 F         37
## 2       6     500     4 M         43
## 3       2     700     5 M         52
## 4       4    1200     6 M         35
## 5       1    1000     7 F         28
## 6       5     800     7 M         59
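arrange() sorts in ascending order by default. To sort from largest to smallest, the column can be wrapped in desc(), as in this sketch (it assumes dplyr is installed and rebuilds a reduced copy of labour with three columns; the name labour_by_payment is chosen here):

```r
library(dplyr)

# Reduced copy of the labour tibble (three columns only).
labour <- tibble(
  surv_id = c(1, 2, 3, 4, 5, 6),
  payment = c(1000, 700, 600, 1200, 800, 500),
  hours   = c(7, 5, 3, 6, 7, 4)
)

# desc() reverses the sort order: highest payment first.
labour_by_payment <- labour %>%
  arrange(desc(payment))
labour_by_payment$surv_id   # 4 1 5 2 3 6
```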

We select the columns hours worked and gender.

labour_select <- labour %>%
select(hours, gender)
labour_select
## # A tibble: 6 x 2
##   hours gender
##   <dbl> <chr> 
## 1     7 F     
## 2     5 M     
## 3     3 F     
## 4     6 M     
## 5     7 M     
## 6     4 M

We now summarize the data; grouping by gender. The group by here groups by gender; we get the mean hours worked by females and males:

labour_summary <- labour %>%
group_by(gender) %>%
summarize(mean = mean(hours))
labour_summary
## # A tibble: 2 x 2
##   gender  mean
##   <chr>  <dbl>
## 1 F        5  
## 2 M        5.5

Here the general form is: take the data set, pipe it with %>% into the group_by() verb, and then pipe the result into summarize(). All the verbs follow this pattern. You can instead pass the data set as the first argument of a verb without the pipe operator; however, we strongly advise you to use the pipe operator and follow the common syntax above for every verb.
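Because every verb takes a data set and returns a data set, several verbs can be chained in one pipeline. The sketch below is an illustration under assumptions: the age cut-off of 30 and the column names mean_wage and n are invented here, and dplyr is assumed to be installed.

```r
library(dplyr)

labour <- tibble(
  surv_id = c(1, 2, 3, 4, 5, 6),
  payment = c(1000, 700, 600, 1200, 800, 500),
  hours   = c(7, 5, 3, 6, 7, 4),
  gender  = c("F", "M", "F", "M", "M", "M"),
  age     = c(28, 52, 37, 35, 59, 43)
)

# Keep persons older than 30, compute the wage rate,
# then report the mean wage and group size per gender.
wage_by_gender <- labour %>%
  filter(age > 30) %>%
  mutate(wage = payment / hours) %>%
  group_by(gender) %>%
  summarize(mean_wage = mean(wage), n = n())
wage_by_gender
```

Here n() counts the rows in each group, so the result has one row for F (one person) and one row for M (four persons).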

Do it by Yourself

Find the median hours worked by gender.

Graphs

The ggplot2 package in the tidyverse can produce wonderful visualizations. In ggplot2, gg stands for the grammar of graphics. Here we ask for a plot; we specify the x and y variables as aesthetics that are mapped onto the axes. At this stage the plot shows only the axes, because we have not yet added a layer.

gg1 <- ggplot(data = labour_mutate,
aes(x = age, y = wage))
gg1

We use geom_point() to tell R that we want points plotted. The different components are added with a plus sign at the end of the line of code, which helps us build plots in layers.

gg2 <- gg1 +
geom_point()
gg2

The colour aesthetic is used to distinguish the gender of the cases.

gg3 <- gg1 +
geom_point(
aes(colour = gender))
gg3

Finally, after the data cleaning and visualization, we fit a linear model.

We can fit a linear model with the lm command. This creates an object which we can print or extract information from.

age_wage_fit <- lm(wage~ age, data = labour_mutate)
age_wage_fit
## 
## Call:
## lm(formula = wage ~ age, data = labour_mutate)
## 
## Coefficients:
## (Intercept)          age  
##      233.28        -1.88

Finally, we can plot the fitted line along with the scatter plot. We add geom_smooth() and ask for it to be based on a linear model.

gg5 <- gg2 +
geom_smooth(method = "lm")
gg5
## `geom_smooth()` using formula 'y ~ x'

This is how you can start with R basics and then use the tidyverse package for your data science project. Thanks a lot for reading to the end of my small effort. It is free to use for everyone who wants to follow my steps. If you like my effort, please subscribe to my YouTube channel and share the videos and these notes, so that others can learn R programming the easy way.

https://www.youtube.com/user/munshi1231/featured

You can also contact me for help with research article writing, data analysis or data science tutoring:

munshinaser-eco@sust.edu

munshi.naser@gmail.com

You can donate to encourage me to create more notes of this kind for you every month for free (only if you wish; it is not compulsory).

Bkash account-+8801877532192 (For Bangladesh)

Nogod account-+8801794666824 (For Bangladesh)

Whatsapp- +8801794666824

For International Bank Transfer,

Account Name- Munshi Naser Ibne Afzal

Account number- A/C-6307102810365001, BRAC Bank Limited, Address- Sylhet Uposhahar Branch, 3/3 Upashahar, Main Road Shahjalal Upashahar Main Rd, Sylhet 3100, Bangladesh.

Bank Details

BANK BRAC BANK LIMITED
BANK CODE 060
DISTRICT SYLHET
BRANCH UPOSHAHAR BRANCH
ROUTING NUMBER 060913737
BRANCH CODE 373
SWIFT CODE BRAKBDDH
ADDRESS 3/3 SHAHJALAL UPOSHAHAR, MAIN ROAD, BLOCK D, SYLHET

Thanks & Cheers!!