This short tutorial covers ‘Essential R,’ or the most important R constructions that you will encounter most frequently. I’ve tried to condense the main R phrases that are used the most. It is possible to build pretty sophisticated code with these statements. Once you’ve mastered these statements, you can move on to expanding your R repertoire.
You can install R and RStudio as follows:
Install the latest R version from CRAN (https://cran.r-project.org/).
Also install RStudio (https://www.rstudio.com/products/rstudio/download/#download).
RStudio is one of the best IDEs for developing in R.
Because there are so many commands and options, it is easy to get lost in R code, so we will take a short tour to get an overview. Typically, R code has the following form:
Object <- function(argument name = argument value, options)
For example, if we want to run a linear regression model, we typically write the code as
Model <- lm(formula = dependent variable ~ independent variable, data = data set name)
Here, Model is an object,
lm is a function (the linear model function, which runs a linear regression),
formula and data are the two argument names, and
the two argument values are whatever we supply after each = sign.
Note that R is case sensitive: the argument is called data, not Data.
Thus most R code is built from function calls that follow the syntax style above.
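To make the pattern concrete, here is a minimal sketch of the same call on real data. It assumes the built-in cars data set (stopping distance dist against speed), which is not part of the original example and is used only because it ships with R.

```r
# Object <- function(argument name = argument value, ...)
# 'cars' is a built-in data set, chosen here only for illustration.
model <- lm(formula = dist ~ speed, data = cars)

# 'model' is the object; lm() is the function;
# 'formula' and 'data' are the argument names.
print(class(model))
print(coef(model))
```

Any other data frame with two numeric columns could be used in the same way.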
One more simple example
sales<-c(34,45,65)
sales
## [1] 34 45 65
This creates a vector called sales; the c() function combines (concatenates) its arguments into a vector.
Packages extend R’s capabilities. We need to install a package once, before we can load it. We use the following code to install the tidyverse package:
install.packages("tidyverse")
We load the tidyverse package when we need to use it:
library(tidyverse)
To use functions from an external package, in this case tidyverse, we install the package once and then call library(tidyverse) in every new R session to make its functions available. This is how you should initially learn the fundamentals of R programming: these steps come before any other command.
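The install-once, load-every-session pattern can be sketched as a small helper. The function name load_or_install is hypothetical (it is not part of R or of any package); it only wraps the real functions requireNamespace(), install.packages() and library().

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needed only once per computer
  }
  library(pkg, character.only = TRUE)  # needed in every new session
}

# 'stats' ships with R, so this call runs without downloading anything.
load_or_install("stats")
```

In practice you would call load_or_install("tidyverse") instead; the first call downloads the package and later calls only load it.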
Always create a project in its own directory (a new or an existing one) before starting a new R session. All output and files are then saved in that directory.
Rather than typing code directly into the RStudio console, we can work with a script. In RStudio we click on File, then New File, and then R Script. We type in 2 + 3 and click on Run; RStudio prints the result in the Console window. We can then save the script, for example as ‘Script’.
2 + 3
## [1] 5
You can watch my Bangla R programming tutorial on my YouTube channel: https://youtube.com/playlist?list=PLPNtMcgVLNVQAA0apA2DQy2QCPaPSX2mq
For the English tutorial: https://youtube.com/playlist?list=PLPNtMcgVLNVQMQOdjrUBlzaOU8p1q1Ba-
You can find all my R programming videos on my YouTube channel.
To get help on any topic, use ‘?’ or help() in the RStudio console, for example ?mean or help("lm").
We work with objects of various sorts in R. Let’s look at four important objects using a basic example: vector, matrix, data frame, and list.
The vector is the most basic data type/object in R; the other objects we look at are built from vectors. For example, we set up a vector called price, consisting of three prices. We type the following in the script window and then click on Run, which runs that line. The line then appears in the console window.
price<-c(10,2,15)
print(price)
## [1] 10 2 15
The three prices are equal to 10, 2 and 15. We use the c() function which stands for concatenate, and parentheses enclose the values that are separated by commas.
Another example,
myvector <- c(1,2,4,5,8,4, 10,12)
A vector is similar to the values under a variable (a column) in an Excel spreadsheet. Usually, when we create a variable in an Excel file, values of the same type are put under that variable name.
So we start with a vector every time we want to create a variable for a data frame in R.
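A few basic operations on the price vector show what "values of the same type" means in practice. This is a small sketch; the vector named mixed is an extra illustration and is not part of the original example.

```r
price <- c(10, 2, 15)

length(price)   # number of elements
class(price)    # type of the elements
price * 2       # arithmetic is applied element by element
mean(price)     # 9
price[2]        # second element

# A vector holds one type only: mixing types converts everything to text.
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
```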
A matrix is like a vector arranged into a specified number of rows and columns.
For example, we create a 5 x 4 numeric matrix:
y<-matrix(1:20, nrow=5,ncol=4)
print(y)
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
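As the output shows, matrix() fills the values column by column. The sketch below reads individual cells with [row, column] and uses the byrow argument of matrix() to fill row by row instead; the name y_byrow is introduced here only for illustration.

```r
y <- matrix(1:20, nrow = 5, ncol = 4)

dim(y)      # number of rows and columns: 5 4
y[2, 3]     # row 2, column 3: 12
y[, 1]      # the whole first column

# Fill the same values row by row instead of column by column.
y_byrow <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
y_byrow[1, ]  # first row: 1 2 3 4
```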
By the way, if you want to learn basic statistics and research methods, you can follow this playlist on my YouTube channel:
https://youtube.com/playlist?list=PLPNtMcgVLNVTJTvs_0cOHprthsaufsdai
The second most important data object in R programming is the list. A list is an object in which we can combine elements of different types, such as strings, numbers and matrices.
In a list the elements can be dissimilar. The list below contains a) a character value, b) a second character value, c) a matrix and d) a numeric value.
w <-list(name="Fred", mynumbers="charVector", mymatrix=y, age=5.3)
print(w)
## $name
## [1] "Fred"
##
## $mynumbers
## [1] "charVector"
##
## $mymatrix
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
##
## $age
## [1] 5.3
This is how we create and print a list.
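Once a list exists, its elements can be pulled out by name. The following sketch rebuilds the list w from above and shows the usual access forms ($, double brackets and single brackets).

```r
y <- matrix(1:20, nrow = 5, ncol = 4)
w <- list(name = "Fred", mynumbers = "charVector", mymatrix = y, age = 5.3)

names(w)           # the element names
w$name             # "Fred"
w[["age"]]         # 5.3, same as w$age
w$mymatrix[1, 2]   # a cell of the matrix stored in the list: 6

# Single brackets return a (shorter) list, not the element itself.
class(w["age"])    # "list"
```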
A data frame is the most commonly used data type in R. Most data analysis is done on dataframes. Dataframes can be thought of as Excel sheets with rows and columns. Each column can be of a different type, but all elements in the column must be of the same type.
d <-c(1,2,3,4)
e <-c("red", "white", "red", NA)
f <-c(TRUE,TRUE,TRUE,FALSE)
mydataframe <-data.frame(d,e,f)
names(mydataframe) <- c("ID","Color","Passed")
mydataframe
## ID Color Passed
## 1 1 red TRUE
## 2 2 white TRUE
## 3 3 red TRUE
## 4 4 <NA> FALSE
This is how we create a data frame in R programming.
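A short sketch of how to look inside the data frame just created. It builds the same mydataframe, here naming the columns directly inside data.frame(), which is an equivalent alternative to calling names() afterwards.

```r
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydataframe <- data.frame(ID = d, Color = e, Passed = f)

nrow(mydataframe)                   # number of rows
mydataframe$Color                   # one column, selected with $
is.na(mydataframe$Color)            # which colours are missing
mydataframe[mydataframe$Passed, ]   # only the rows where Passed is TRUE
```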
Do it yourself: create a data frame containing the vectors Price and Quantity with hypothetical values. Try to print the first column of the data frame. Hint: use the $ sign to refer to a variable inside the data frame.
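You can learn about free R-based software on my channel.
To extract part of a vector, we put the positions we want inside square brackets. Here we take elements 3 to 7: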
myvector <-c(1,2,4,5,8,4, 10,12)
s <-myvector[3:7]
s
## [1] 4 5 8 4 10
# Remove the 2nd element from myvector
myvector <-c(1,2,4,5,8,4, 10,12)
myvector[-2]
## [1] 1 4 5 8 4 10 12
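Use square brackets [ ] to select elements of a vector or, with a negative index, to drop them. Note that myvector[-2] returns a new vector; myvector itself changes only if we assign the result back to it.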
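The common ways of changing a vector can be sketched as follows. The names shorter, longer and changed are introduced only for this illustration.

```r
myvector <- c(1, 2, 4, 5, 8, 4, 10, 12)

shorter <- myvector[-2]      # drop the 2nd element
longer  <- c(myvector, 15)   # add a value at the end
changed <- myvector
changed[1] <- 100            # replace the 1st element

myvector[myvector > 4]       # select by condition: 5 8 10 12
length(myvector)             # the original is untouched: 8
```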
We can repeat an operation with a ‘for’ loop. Important note: ‘for’ loops in R can be slow compared with vectorized code, so try to use vectorized operations instead of ‘for’ loops wherever possible.
for(i in 1:5){
print(i*5)
}
Here i is the loop variable, which takes the values 1 to 5 in turn.
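To illustrate the advice about avoiding loops, the sketch below computes the same five values twice: once with a loop that stores its results and once with a single vectorized expression. The names result_loop and result_vec are made up for this example.

```r
# Loop version: create the result vector first, then fill it.
result_loop <- numeric(5)
for (i in 1:5) {
  result_loop[i] <- i * 5
}

# Vectorized version: the multiplication applies to every element at once.
result_vec <- (1:5) * 5

identical(result_loop, result_vec)  # TRUE
```

For five values the speed difference is negligible; it starts to matter on long vectors.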
Check the size of the data frame
data("iris")
dim(iris)
## [1] 150 5
A very useful command to check the class of all columns in a data frame is the ‘str’ command. The ‘str’ command lists the columns in the data frame, their types, and a few representative values from each column.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
If you want to use any of R’s built-in data sets, just call the data() function with the name of the data set. R ships with numerous built-in data sets, which makes practice smoother.
Check the 1st few rows of the data frame
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
A final quick tip for data investigation is the summary() function:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
If you want to subset the data, for example to display rows 1-6 and columns 2-4, you can index a data frame in the same way as a matrix: use df[row, column] or df[row vector, column vector].
iris[1:6,2:4]
## Sepal.Width Petal.Length Petal.Width
## 1 3.5 1.4 0.2
## 2 3.0 1.4 0.2
## 3 3.2 1.3 0.2
## 4 3.1 1.5 0.2
## 5 3.6 1.4 0.2
## 6 3.9 1.7 0.4
You can inspect a data set using the above commands before starting any data analysis. Now we move to the most important collection of packages in R for data science, the tidyverse. One last thing before we move on is how to load your own Excel data into RStudio. It is very simple: save the sheet as a CSV file, put it in your project directory, and then use the read.csv("data frame.csv") command to read it into your R session.
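Here is a self-contained sketch of the CSV round trip. Since your own file is not available here, it first writes the first rows of iris to a temporary file; the file name my_data.csv and the object name my_data are made up for this example.

```r
# Write a small CSV file to a temporary folder (stand-in for your own file).
csv_path <- file.path(tempdir(), "my_data.csv")
write.csv(head(iris), csv_path, row.names = FALSE)

# Read it back in, exactly as you would with your own data.
my_data <- read.csv(csv_path)
str(my_data)
```

With a file in your project directory you would pass just the file name, as in read.csv("my_data.csv"). For a workbook that has not been converted to CSV, the read_excel() function from the readxl package is one option.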
Hadley Wickham’s approach to the data analysis workflow is the most popular data science workflow in R programming. This approach is set out in Grolemund and Wickham (2017). Wickham has designed several packages to work on different parts of a workflow.
Typically, a data analyst follows a workflow consisting of these parts:
Get data into R
tidy and transform
visualize and model, and
communicate.
The ‘tidyverse’ (Wickham, 2017) is a collection of Hadley Wickham’s tools that assist us with various areas of the data analysis workflow. We will now use the tidyverse package created by Wickham. This requires us to install the package beforehand. A package only has to be installed once. It can be installed by a command or through an icon within RStudio. We can then load the package using the library() function.
We enter our data in R with each variable as a separate vector. This is hypothetical data of a survey of six persons, and the variables are: payment they received, hours worked, their gender and age.
surv_id <-c(1,2,3,4,5,6)
payment <-c(1000,700,600,1200,800,500)
hours <-c(7,5,3,6,7,4)
gender <-c("F","M","F","M","M","M")
age <-c(28,52,37,35,59,43)
In R, data is usually stored in a data frame. Wickham has designed a ‘tibble’ to improve on R’s data frame. A tibble, which is similar to a data frame, stores our data. We call our tibble labour. We create a tibble with the tibble() function.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
labour <- tibble(surv_id,
payment, hours,
gender,age)
We print labour. Note that R would otherwise only create the tibble, and would do so quietly.
labour
## # A tibble: 6 x 5
## surv_id payment hours gender age
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1000 7 F 28
## 2 2 700 5 M 52
## 3 3 600 3 F 37
## 4 4 1200 6 M 35
## 5 5 800 7 M 59
## 6 6 500 4 M 43
This example is taken from Vikram Dayal’s book, via his GitHub account.
We get a glimpse of our data with the glimpse function.
glimpse(labour)
## Rows: 6
## Columns: 5
## $ surv_id <dbl> 1, 2, 3, 4, 5, 6
## $ payment <dbl> 1000, 700, 600, 1200, 800, 500
## $ hours <dbl> 7, 5, 3, 6, 7, 4
## $ gender <chr> "F", "M", "F", "M", "M", "M"
## $ age <dbl> 28, 52, 37, 35, 59, 43
The glimpse() function is similar to the str() function in base R.
A tibble consists of several vectors. We extract the gender column from the labour dataset, and then the second and third elements of the gender column:
labour$gender
## [1] "F" "M" "F" "M" "M" "M"
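The second and third elements of the gender column can be extracted by adding an index after the column. The sketch below uses a reduced two-column copy of labour so that it runs on its own; it assumes the tibble package (part of the tidyverse) is installed.

```r
library(tibble)

# Reduced copy of the labour tibble: only the columns needed here.
labour <- tibble(
  surv_id = c(1, 2, 3, 4, 5, 6),
  gender  = c("F", "M", "F", "M", "M", "M")
)

# First select the column with $, then the positions with [ ].
labour$gender[2:3]   # "M" "F"
```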
We extract the first row and then the second column.
labour[1,]
## # A tibble: 1 x 5
## surv_id payment hours gender age
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1000 7 F 28
labour[,2]
## # A tibble: 6 x 1
## payment
## <dbl>
## 1 1000
## 2 700
## 3 600
## 4 1200
## 5 800
## 6 500
Five data verbs are the most important functions for data science, especially for data cleaning and manipulation. It is often said that around 80% of data science work is data cleaning. Thus, if you know these five verbs (functions) in the tidyverse package, you know enough to start your own project.
The general format of the code for these data verbs is
verb(dataset, a = b + c)
Here we start with the verb function, then name the data set, and then supply the remaining arguments.
Five data verbs help us do a lot with data. We use these with the help of the dplyr package written by Wickham, which is part of the tidyverse package.
The five data verbs are:
filter - to pick certain rows
select - to pick columns
mutate - to generate new variables
summarize - to summarize
arrange - to sort in some order
All of these can be used with the group_by() function. group_by() groups a data set by the categories of a variable, so that later steps such as summarize work within each group. For example, if you want to summarize your data set by gender, group_by() will first group your data by gender.
You can find basic statistical material on my channel:
https://youtube.com/playlist?list=PLPNtMcgVLNVQAA0apA2DQy2QCPaPSX2mq
These five verbs, along with the pipe symbol, help us accomplish a lot when working with data. With the pipe symbol %>%, the code
x %>%
f(y)
is equivalent to f(x, y).
The pipe symbol pipes x into the function f, where it is used together with y. So what is on the left of the pipe gets piped as the first argument of the function on the right. This might appear strange at first and takes some getting used to, but it greatly helps us carry out actions on data that build on each other and makes code easier to understand.
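The equivalence can be checked directly on small vectors. This sketch uses only base R functions together with the pipe; it loads dplyr because dplyr makes %>% available. The vectors x and z are made up for this example.

```r
library(dplyr)  # makes the %>% pipe available

x <- c(1, 4, 9)

# x %>% f() is the same as f(x)
x %>% sqrt()
sqrt(x)

# Pipes can be chained: take square roots, then add them up.
x %>% sqrt() %>% sum()   # 6

# x %>% f(y) is the same as f(x, y)
z <- c(1.234, 5.678)
identical(z %>% round(1), round(z, 1))   # TRUE
```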
We select rows by using filter. We filter for rows where the gender is female. We create a new tibble by piping labour into the filter function, asking it to give us the rows with female gender (F). Note the use of two equals signs (==).
labour_filter <- labour %>%
filter(gender == "F")
labour_filter
## # A tibble: 2 x 5
## surv_id payment hours gender age
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1000 7 F 28
## 2 3 600 3 F 37
Filter for rows where the gender is male, calling the new tibble labour_filter2. Check that you have done this correctly by printing labour_filter2.
We create new variables with mutate; we calculate the wage rates.
labour_mutate <- labour %>%
mutate(wage = payment /
hours)
labour_mutate
## # A tibble: 6 x 6
## surv_id payment hours gender age wage
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 1000 7 F 28 143.
## 2 2 700 5 M 52 140
## 3 3 600 3 F 37 200
## 4 4 1200 6 M 35 200
## 5 5 800 7 M 59 114.
## 6 6 500 4 M 43 125
Do it by yourself
Create a new variable called minutes with mutate: minutes = hours * 60.
We sort rows with arrange; here we sort the data by hours worked.
labour_arrange <- labour %>%
arrange(hours)
labour_arrange
## # A tibble: 6 x 5
## surv_id payment hours gender age
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 3 600 3 F 37
## 2 6 500 4 M 43
## 3 2 700 5 M 52
## 4 4 1200 6 M 35
## 5 1 1000 7 F 28
## 6 5 800 7 M 59
We pick columns with select; here we keep only hours and gender.
labour_select <- labour %>%
select(hours, gender)
labour_select
## # A tibble: 6 x 2
## hours gender
## <dbl> <chr>
## 1 7 F
## 2 5 M
## 3 3 F
## 4 6 M
## 5 7 M
## 6 4 M
We summarize with summarize, here combined with group_by; we calculate the mean hours worked by gender.
labour_summary <- labour %>%
group_by(gender) %>%
summarize(mean = mean(hours))
labour_summary
## # A tibble: 2 x 2
## gender mean
## <chr> <dbl>
## 1 F 5
## 2 M 5.5
Here the general form is: first take the data set, then pipe it (%>%) into the next step using the group_by verb/function, and then pipe the result into summarize. All the verbs above follow this pattern. You can instead pass the data set as the first argument if you do not use the pipe operator; however, we strongly advise you to use the pipe operator and follow the same code syntax for every verb.
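For comparison, this sketch writes the same summary both ways: with pipes, and with the data set passed as the first argument of each verb. It uses a reduced copy of labour holding only the two columns involved, and the names piped and nested are made up for this example.

```r
library(dplyr)

# Reduced copy of the labour tibble: only the columns needed here.
labour <- tibble(
  hours  = c(7, 5, 3, 6, 7, 4),
  gender = c("F", "M", "F", "M", "M", "M")
)

# With pipes: read from top to bottom.
piped <- labour %>%
  group_by(gender) %>%
  summarize(mean = mean(hours))

# Without pipes: the same calls, nested inside each other.
nested <- summarize(group_by(labour, gender), mean = mean(hours))

identical(piped$mean, nested$mean)   # TRUE
```

The nested form has to be read from the inside out, which is the main reason the piped form is usually easier to follow.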
Do it by yourself
Find the median hours worked by gender.
The package ggplot2 in the tidyverse can do wonderful visualizations. In ggplot2, gg stands for the grammar of graphics. Here we ask for a plot; we specify the x and y variables as aesthetics that are mapped onto the axes.
gg1 <- ggplot(data = labour_mutate,
aes(x = age, y = wage))
gg1
We use geom_point() to tell R that we want points plotted. The different components are added with a plus at the end of the line of code, and this helps us build plots in layers
gg2 <- gg1 +
geom_point()
gg2
The colour aesthetic is used to distinguish the gender of the cases
gg3 <- gg1 +
geom_point(
aes(colour = gender))
gg3
We can fit a linear model with the lm command. This creates an object which we can print or extract information from.
age_wage_fit <- lm(wage~ age, data = labour_mutate)
age_wage_fit
##
## Call:
## lm(formula = wage ~ age, data = labour_mutate)
##
## Coefficients:
## (Intercept) age
## 233.28 -1.88
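Besides printing the fitted object, we can extract pieces of it. The sketch below rebuilds the needed columns of labour_mutate as a plain data frame so that it runs on its own, then uses the base R functions coef(), summary() and predict(); the age of 40 in the prediction is an arbitrary value chosen for illustration.

```r
# Plain data frame with the columns used in the model.
labour_mutate <- data.frame(
  payment = c(1000, 700, 600, 1200, 800, 500),
  hours   = c(7, 5, 3, 6, 7, 4),
  age     = c(28, 52, 37, 35, 59, 43)
)
labour_mutate$wage <- labour_mutate$payment / labour_mutate$hours

age_wage_fit <- lm(wage ~ age, data = labour_mutate)

coef(age_wage_fit)                # intercept and slope as a named vector
summary(age_wage_fit)$r.squared   # share of variation explained
predict(age_wage_fit, newdata = data.frame(age = 40))  # fitted wage at age 40
```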
Finally, we can plot the fitted line along with the scatter plot. We add geom_smooth() and ask for it to be based on a linear model.
gg5 <- gg2 +
geom_smooth(method = "lm")
gg5
## `geom_smooth()` using formula 'y ~ x'
This is how you can start with R basics and then use the tidyverse package for your data science project. Thanks a lot for reading to the end of my small effort. It is free to use for everyone who wants to follow my steps. If you like my effort, please subscribe to my YouTube channel and share the video and these notes, so that others can learn R programming the easy way.
https://www.youtube.com/user/munshi1231/featured
You can also contact me for help with research article writing, data analysis or data science tutoring:
munshinaser-eco@sust.edu
munshi.naser@gmail.com
You can donate to inspire me to create more notes of this kind for you every month for free (only if you wish; it is not compulsory):
Bkash account-+8801877532192 (For Bangladesh)
Nogod account-+8801794666824 (For Bangladesh)
Whatsapp- +8801794666824
For International Bank Transfer,
Account Name- Munshi Naser Ibne Afzal
Account number- A/C-6307102810365001, BRAC Bank Limited, Address- Sylhet Uposhahar Branch, 3/3 Upashahar, Main Road Shahjalal Upashahar Main Rd, Sylhet 3100, Bangladesh.
Bank Details
| BANK | BRAC BANK LIMITED |
|---|---|
| BANK CODE | 060 |
| DISTRICT | SYLHET |
| BRANCH | UPOSHAHAR BRANCH |
| ROUTING NUMBER | 060913737 |
| BRANCH CODE | 373 |
| SWIFT CODE | BRAKBDDH |
| ADDRESS | 3/3 SHAHJALAL UPOSHAHAR, MAIN ROAD, BLOCK D, SYLHET |
Thanks & Cheers!!