Contents

Session Lesson Duration (minutes)
1 1. Introduction to R with installation 15
2. Script vs console, working Directory 30
3. Basic operations, dataset creation, opening and descriptive statistics 30
4. Application of normal distribution 30
5. One-sample t-test 45
6. Standard error and confidence interval 30
2 7. Two-sample t-test (independent, upaired) with assumptions 45
8. Welch test (unequal variance) 15
9. Two-sample t-test (paired) 60
10. Wilcoxon/Mann-Whitney-U test (non-parametric) 60
3 11. Chi-square distribution 45
12. Chi-square test for association/goodness of fit with assumptions 45
13. Determining the cell contribution 45
14. Visualization of chi-square test results, package installation 30
15. Fisher’s exact test 15
4 16. Variance, SD, sum of squares, CV, df, F-value 30
17. One-way ANOVA with assumptions 60
18. Kruskal-Wallis test (non-parametric) 30
19. Two-way anova with assumptions 60
5 20. Post-hoc analysis of ANOVA (Tukey, DMRT, lsd) 60
21. Visualization of ANOVA, mean, error bar 120
6 22. Mixed-effect models (split plot design), ANOVA comparison and best model selection 180
7 23. Correlation (Pearson, parametric) 60
24. Correlation (Spearman, non-parametric) 30
25. Simple regression 30
26. Residual analysis 60
8 2. Multiple regression 60
29. 28. Multicollinearity issues and their solution 60
30. Heteroskedasticity issues and their solution 60
9 31. Practicing visualization and image saving using ggplot2 180
10 32. Assessment of data analysis and visualization 120
33. Feedback, comments, questions 60

Lesson 1: Introduction to R with installation

The present age is information age. Information is data with context and meaning that helps decision making. Data is unprocessed raw observations. Therefore, quality of data analysis determines the accuracy of decision not matter whether it is simple descriptive statistics or advanced machine learning algorithms. A suitable and affordable data analysis software is a must for a data analyst. While for general purpose programming, Python is very popular, R is quite unbeatable for data analysis and visualization. R has a simple syntax structure, wide community support, low-memory transferable file structure, markdown and sharing facilities that have made it a super popular data analysis programming language. It also helps learning statistical concepts.

This command-driven programming language is open source meaning free of cost and literally has unlimited possibilities compared to the menu-driven programs such as SPSS, STATA, and Minitab. Almost any statistical models can be written in R language. R software is basically an engine that runs the code and RStudio is an IDE that facilitates script writing, testing and enhances programming speed. This means we need to install R and RStudio to perform data analysis using R.

Before starting the analysis, we need to know the interface of RStudio (Figure 1) and its different menu and a few shortcut options, e.g. ctrl+enter –> run current line or selected part of the codes, ctrl+s –> save the script. For details, please watch the video tutorial of RStudio interface.

Figure 1. RStudio interface
Figure 1. RStudio interface

Lesson 2: Script vs console, working directory

Open a new R script. You do not need to write any names for it. Go to the Files pan and browse/select a folder where you want to save this script. You can also create a new folder for your purpose. More importantly, you can set or change the working directory from this pan. Your works will be saved always in the working directory or if you load a script or file, the system will always search in the working directory. I have observed, the beginners almost all fall this trap of ’no file in the directory’. Easy solution is just checking the directory and set right in the folder where you have kept your required file.

After setting the working directory, hit the save icon and give a file name. R script has an .R extension in the file name. If you delete this .R, RStudio will not recognize the file as associated with R program.

You can write your code in two places - script and console. The console is used to show your outputs and can also be used for short/draft calculations. Script is the main place for writing your codes that you can save for later use. Commands written in the console is not usable in the next sessions of your analysis.

Lesson 3: Basic operations, dataset creation, opening and descriptive statistics

First write some basic mathematical expression in the script and hit run icon or press ctrl+enter. Check the outputs in the console. You can also write the expressions in the control and hit enter. This way you can use the console as a big calculator. In the script, you can write comments starting with ‘#’ which will not be executed as commands.

3+6
## [1] 9
4*9 + 27/9
## [1] 39
# using pi = 3.14
pi
## [1] 3.141593
x = 4
y = 5
x*y
## [1] 20

You need to run the first two lines to get the result from the last line, otherwise the values of x and y cannot be found. This way you can store values to an object or variable. Without defining an object you cannot use it in a code. We will convert degree to radian now. We will need the value of pi (3.14) which is already known to R. You need to call ‘radian’ after defining to show the output.

pi
## [1] 3.141593
degree = 45
radian = degree*pi/180
radian
## [1] 0.7853982

Now, calculate the value of tangent of 45 degree, which should be equal to 1. First, you need to convert degree to radian, then plug the value in tan() function. R function always has a common syntax with function name followed by brackets without space.

# We have already converted 45 degree to radian and stored the value in the 'radian' object.
tan(radian)
## [1] 1

In R, log(100) means natural logarithm of 100. In Excel, this is 10-based logarithm by default. If you need the 10-based logarithm, you have to write log10(100).

log(100)
## [1] 4.60517
log10(100)
## [1] 2

You can show text output in the console by putting the text inside single or double quotes.

"This is inside double quotes."
## [1] "This is inside double quotes."
'This is inside single quotes'
## [1] "This is inside single quotes"

You can also combines text and R outputs in the same sentence using paste() or paste0() function. This function combines multiple elements separated by comma and past0() will not leave any spaces between elements.

paste('This is a past() function that shows 3+4 =', sum(3,4))
## [1] "This is a past() function that shows 3+4 = 7"
paste0('This is a past0() that shows 3+4 =', 3+4, '.')
## [1] "This is a past0() that shows 3+4 =7."

You can classify a vector into different categories by cut(x, breaks, labels, right=TRUE) function. First create a vector (variable) of age of 30 persons ranges between 20 to 60 using random number from a normal distribution with mean 43 and sd 2, then categorize as young (upto 35), middle-aged (>35 to 50), and old (>50). Categories must be mutually exclusive (one person will fall only in one category) and completely exhaustive (all persons will be classified with no one leaving unclassified).

set.seed(1) # to generate the same set of random numbers
age = rnorm(50, 40, 5)
breaks = c(0, 35, 45, Inf)
labels = c('Young', 'Middle-aged', 'Old')
categories = cut(age, breaks = breaks, labels = labels)
table(categories)
## categories
##       Young Middle-aged         Old 
##           4          41           5

You can also code a variable into a new categorical variable, then summarize using tableI() function. Square bracket [] is used to index or select certain values in a variable. In this you need to convert the age variable into a dataframe.

set.seed(1) # to generate the same set of random numbers
age = round(rnorm(50, 40, 5), 0) 
# to store and show at the same time () around statement is typed
# round() is used to show certain number of digits after decimal
age_cat = ifelse(age<=35, 'Young', 
          ifelse(age>45, 'Old', 'Middle-aged'))



# Now, summarise age
table(age_cat)
## age_cat
## Middle-aged         Old       Young 
##          41           5           4

Use of c(), seq() , rep(), %% and %/%

The c() simply combines multiple elements into a single vector. A sequence of numbers or series is created with seq() having from, to, by, and length.out options. One or more elements can be repeated by rep(). The modulus operator %% finds remainder and %/% results only full integer quotient. Remember the basic division rule: Divident = Divisor * Quotient + Remainder.

c(1:7) # sequence of 1 to 7 by 1 increment
## [1] 1 2 3 4 5 6 7
c(5:12) # sequence of 5 to 12
## [1]  5  6  7  8  9 10 11 12
seq(1,7) # sequence of 1 to 7 by 1 increment
## [1] 1 2 3 4 5 6 7
seq(1, 7, by = 2) # by 2 increment
## [1] 1 3 5 7
seq(0, 50, length.out = 5) # sequence of 0 to 50 with 5 break points
## [1]  0.0 12.5 25.0 37.5 50.0
rep(c(1:4), 3) # repeat 1 to 4 by 3 times
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4
10%%7 # remainder of 10 divide by 7
## [1] 3
10%/%7 # quotient of 10 divide by 7 
## [1] 1

Exercise: Create a series of full integers from 1 to 20. Divide them into even and odd categories.

numbers = seq(1:20)
(output = ifelse(numbers%%2 == 0, 'Even', 'Odd'))
##  [1] "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even"
## [11] "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even"
table(output)
## output
## Even  Odd 
##   10   10

Vector (variable) and dataframe: A vector is a single variable. Multiple variables create a dataframe. A dataframe has equal number of rows for every variables. Vectors with unequal number of rows cannot be bound in a dataframe, but a list is suitable to hold these values. It is important to know to select the individual element(s) of a dataset, list, or vector.

set.seed(1)
serial_vector0 = c(1:30)
age_vector1 = round(c(rnorm(30, 45, 5)), 0) # from normal distribution
edu_vector2 = c(sample(c(1:5), 30, replace = TRUE)) # random sampling from 1 to 5 for 30 times
status_vector3 = c(sample(c('Low', 'Medium', 'High'), 30, replace = TRUE))
gender_vector4 = c(sample(c('Male', 'Female'), 30, replace = TRUE))

dat = data.frame(serial_vector0, age_vector1, edu_vector2, status_vector3, gender_vector4)
dat
# Save this dataset
write.csv(dat, 'my_data.csv', row.names = FALSE)

Loading and exploring a dataset: Now load the dataset named ‘my_data.csv’ in csv format with read.csv(). See the first 6 rows with head(), last 6 with tail(), structure with str(), nrow(), ncol(), dimensions with dim() functions. A comprehensive summary() can also be used to a quick descriptives of the dataset. Mean with mean(), sd with sd(), variance with var. We also see how to create custom functions to calculate standard error (se) and coefficient of variation (cv).

# Before loading the dataset, check the working directory.
# Set the working directory to the folder containing the dataset.
# Now read and save the dataset in dat object.
dat = read.csv('my_data.csv')
head(dat)
tail(dat)
str(dat)
## 'data.frame':    30 obs. of  5 variables:
##  $ serial_vector0: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age_vector1   : int  42 46 41 53 47 41 47 49 48 43 ...
##  $ edu_vector2   : int  3 2 2 5 2 1 3 3 4 3 ...
##  $ status_vector3: chr  "High" "Medium" "Medium" "High" ...
##  $ gender_vector4: chr  "Female" "Female" "Female" "Female" ...
dim(dat)
## [1] 30  5
summary(dat)
##  serial_vector0   age_vector1     edu_vector2    status_vector3    
##  Min.   : 1.00   Min.   :34.00   Min.   :1.000   Length:30         
##  1st Qu.: 8.25   1st Qu.:43.00   1st Qu.:2.000   Class :character  
##  Median :15.50   Median :46.50   Median :3.000   Mode  :character  
##  Mean   :15.50   Mean   :45.50   Mean   :3.033                     
##  3rd Qu.:22.75   3rd Qu.:48.75   3rd Qu.:4.000                     
##  Max.   :30.00   Max.   :53.00   Max.   :5.000                     
##  gender_vector4    
##  Length:30         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Data vector / variable type can be numeric (can be with decimal), integer (full number), character (a, b, c), or factor (text, string). The data type ‘character’ is used to analysis characters in a word or text. Most of the analysis requires a text variable as factor (not character). Therefore, you need to convert the character variable into factor variable using as.factor(). After converting, check the str() again.

Remember: To select a variable, a common syntax is used, which is data$variable. Index syntax with square brackets [] is used to select certain value in a variable.

dat$status_vector3 = as.factor(dat$status_vector3)
dat$gender_vector4 = as.factor(dat$gender_vector4)
str(dat)
## 'data.frame':    30 obs. of  5 variables:
##  $ serial_vector0: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age_vector1   : int  42 46 41 53 47 41 47 49 48 43 ...
##  $ edu_vector2   : int  3 2 2 5 2 1 3 3 4 3 ...
##  $ status_vector3: Factor w/ 3 levels "High","Low","Medium": 1 3 3 1 1 1 2 3 3 2 ...
##  $ gender_vector4: Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 1 2 2 ...
summary(dat)
##  serial_vector0   age_vector1     edu_vector2    status_vector3 gender_vector4
##  Min.   : 1.00   Min.   :34.00   Min.   :1.000   High  :12      Female:13     
##  1st Qu.: 8.25   1st Qu.:43.00   1st Qu.:2.000   Low   : 8      Male  :17     
##  Median :15.50   Median :46.50   Median :3.000   Medium:10                    
##  Mean   :15.50   Mean   :45.50   Mean   :3.033                                
##  3rd Qu.:22.75   3rd Qu.:48.75   3rd Qu.:4.000                                
##  Max.   :30.00   Max.   :53.00   Max.   :5.000

Now you can explore each of the variables in the dataset with some descriptive statistics.

mean(dat$age_vector1)
## [1] 45.5
sd(dat$age_vector1)
## [1] 4.651659
table(dat$status_vector3) # frequency table
## 
##   High    Low Medium 
##     12      8     10

List: A list is similar to a dataframe or matrix but it is more flexible than a dataframe. All the variables/vectors in a dataframe must contain the same number of rows or cases. However, a list can hold variables of different lengths. For example, the following codes produce a list with three variables of variable number of cases. More interestingly, lists can be nested in one or more levels of hierarchical lists. The list_example can be component of another list.

village = c('Patakhali', 'Rajapur', 'Telikhali', 'Shotabdipur')
farmer = c('Hridoy', 'Ripon', 'Ishtiak', 'Labonya', 'Pintu', 'Jamir', 'Emran')
destination = c('Naogaon', 'Cumilla', 'Jashore', 'Barishal', 'Chattogram', 'Rangpur', 'Chapainawabganj')

list_example = list(village, farmer, destination)
str(list_example)
## List of 3
##  $ : chr [1:4] "Patakhali" "Rajapur" "Telikhali" "Shotabdipur"
##  $ : chr [1:7] "Hridoy" "Ripon" "Ishtiak" "Labonya" ...
##  $ : chr [1:7] "Naogaon" "Cumilla" "Jashore" "Barishal" ...

New function: R does not have se and cv function by default. You can easily define these functions. You can even create a complex long function using this method.

sd = sd(x)
n = length(x)
mean = mean(x)
se = function (x) {sd/sqrt(n)}
cv = function (x) {sd/mean * 100}

sd(dat$age_vector1)
## [1] 4.651659
cv(dat$age_vector1)
## [1] NA

Lesson 4: Application of normal distribution and CLT

In order to show the normal distribution, we will create a dataset that will be used to create the normal density function. The dataset contains 1,000,000 average heights of 1,000,000 samples each of which has 30 individuals sampled from a large population of 10,000,000 persons. In other words, take 30 random samples, measure their heights and enter their mean value. Repeat this for 1,000,000 times. We will plot these 1,000,000 data points to show the normal distribution and central limit theorem (CLT).

set.seed(1)
height = rnorm(1000000, mean = 5, sd = 0.5)
hist(height, breaks = 10) # with 10 classes

hist(height, breaks = 200) # with 10 classes

# 4e+05 means 4*10 to the power 5, means 400000.

# If you increase the number of bins to many, many and many more, 
# you will get the density function of normal distribution.

plot(density(height)) # with mean = 5, sd = 0.5

# Now, construct the same density plot with actual values from normal distribution that has mean = 0, sd = 1.

set.seed(1)
values = rnorm(1000000, mean = 0, sd = 1)
plot(density(values))

You notice that the shapes of the both curves are the same, but x-axis values are different. Look at the middle point on the x-axis which is exactly same as the mean values. The area under the curve is the probability, which is 1 (100%) in total, y-axis shows the density meaning relative frequency. From this graph, you can imagine (even calculate) what is the probability of getting a number less than 2 from this 1,000,000 values. Draw and vertical line where x = -2, calculate the area under the curve left to this vertical line.

plot(density(values))
abline(v = -2)

pnorm(-2) 
## [1] 0.02275013
# The area under the curve left to the vertical line will be 0.023, i.e., p = 0.023
# This indicates that there is 2.3% probability of having a number less than -2 in this distribution.

Now calculate the probability of having a number between -1 and +1 (1sd left and 1sd right to the mean value).

plot(density(values))
abline(v = c(-1, 1))
pnorm(1) - pnorm(-1) 
## [1] 0.6826895
text(0,0.2, "p = 68.27%")

# pnorm() estimates area under the curve left to the quantile value inside the bracket.
# This means that 68% of the total values falls between 1sd of the mean of the distribution.

This way, you will find that 95.45% of the total observation falls within 2sd of the mean. For this reason z-value (normal distribution quantile) is 1.96 at 95% probability. This value is 2.58 at 99% probability. Remember this two values.

plot(density(values))
abline(v = c(-2, 2))
pnorm(2) - pnorm(-2)
## [1] 0.9544997
text(0,0.2, "p = 95.45%")

The t-distribution (student t-distribution) is similar to the normal distribution, but it has only df (degree of freedom = n - 1). For large samples (n>30), z and t values tend to be close to each other. For this reason, z-test is used for large sample with known population standard deviation (sigma). The t-test can be used for both small and large samples as it considers df (sample size - 1) to estimate the probabilities and quantiles. You can check with the following commands:

qnorm(p = 0.95) # normal distribution
## [1] 1.644854
qt(p = 0.95, df = 19) # t-distribution with n = 20
## [1] 1.729133
qt(p = 0.95, df = 499) # t-distribution with n = 500
## [1] 1.647913

Central limit theorem (CLT) states that the sampling distribution is normal for large samples (n > 30). Sampling distribution means the distribution of averages of many, many and many more samples from a population. If this is the case, CLT says that 68% of the total observations will fall between one standard deviation and 95% them will fall between 1.96 standard deviation. For population, 68% of the total population will fall between one standard error and 95% them will fall between 1.96 standard error.

Recall the density plot of height and interpret the distribution. We can say that 68% of the heights fall between (5-1*0.5, 5+1*0.5) or 4.5 and 5.5, and 95% of them between 5-1.96*0.5, 5+1.96*0.5) or 4.02 and 5.98.

plot(density(height)) # with mean = 5, sd = 0.5
abline(v=c(5-1*0.5, 5+1*0.5, 5-1.96*0.5, 5+1.96*0.5), col=c('blue', 'blue', 'red', 'red'))

In conclusion, we can say that the normal distribution shows the p (probability) of having a value (quantile on the x-axis) equal to the area under the curve left to the value. If this p is less than 0.05 (or 0.01, 0.001), we say that insufficient p-value and we reject the null hypothesis associated with a test at 5% (or 1, 0,1%) level of significance. The p-value shows evidence to support the null hypothesis. Insufficient evidence is the basis of rejection of a null hypothesis.