setwd("D:/OneDrive/Document/GitHub/Analysis-on-Dollarization-in-Cambodia/R Project")R Tutorial for Data Analysis (1st Week Session)
How to start R
Before analysis, we should set the working directory. We can choose a folder from local folders of your PC. For choosing the folder, select [Session] button, and then select [Set Working Directory]. Or you can choose the folder by using setwd().
In R, we first need to install packages to perform analysis. The packages can be download and installed by using the function install.packages(). Once we install packages, we do not need to install packages every time.
install.packages('tidyverse') # This package is for efficient data management and importing STATA data into R.
install.packages("plotly") # This package is for making a interactive figures.
install.packages("arm") # This package is for visualizing the regression results in figures.
install.packages("render") # This package is for visualizing the regression results in figures.}After we download and install the packages, we can call the functions in the packages by using library(). You need to call libraries (packages) of functions for your analysis every time we restart analysis using R. 1
library(plotly)
library(tidyverse)
library(haven) # This package is included in "tidyverse".
library(shiny)How to import data into R
For importing a dataset, there are various functions in R. If your original data set is in Excel format, you need to use read_excel() from readxl package. If your original data set is in CSV format, you need to use read.csv().2 If you want to import dataset in STATA format, use read_dta() from haven package.
library(readxl)
file_path <- "Data/NBC_MFI.xlsx"
data <- read_excel(file_path, sheet = "Sheet1", range = "A3:BY842")file_path <- "Data.csv"
data <- read.csv(file_path, header = TRUE, sep = ",")library(haven)
file_path <- "Data.dta"
df <- read_dta(file_path)Descriptive statistics
Once you import the data, you may want to calculate the sample statistics of the dataset. Usually, we need to see mean, median, several quantile values, and standard deviation of the variables in the dataset. Those sample statistics are defined as follow.
Mean
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
Median
\[ Med(X) = \begin{cases} X_{(n+1)/2} & \text{if n is odd} \\ \frac{1}{2} (X_{n/2} + X_{n/2 + 1}) & \text{if n is even} \end{cases} \]
Quantile
\[ Q_q(X) = \begin{cases} X_{(n+1)q} & \text{if nq is not an integer} \\ X_{nq} & \text{if nq is an integer} \end{cases} \]
Variance \[ Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
Standard Deviation \[ sd(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \]
summary() function can be used for calculating several sample statistics for the variables in the dataset. For calculating the sample statistics of the specific variable, you can use the following codes.
summary(data["m_asset1"]) m_asset1
Min. : 504
1st Qu.: 9340
Median : 26628
Mean : 310135
3rd Qu.: 85926
Max. :19725521
NA's :6
Or you can select variables using “$”
summary(data$m_asset1) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
504 9340 26628 310135 85926 19725521 6
If you input the whole dataset, the R will return descriptive statistics for all the variables in the dataset.
summary(data) If you want to make a summary statistics for more than one variables at once. You should type the following codes.
summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ]) m_asset1 m_asset2 m_asset3 m_operation1
Min. : 504 Min. : 62 Length:839 Length:839
1st Qu.: 9340 1st Qu.: 1049 Class :character Class :character
Median : 26628 Median : 3636 Mode :character Mode :character
Mean : 310135 Mean : 46354
3rd Qu.: 85926 3rd Qu.: 11034
Max. :19725521 Max. :2438765
NA's :6 NA's :5
You can see that variables “asset1” and “asset2” are imported as “character” (or we say “string”). The calculation of statistics require the variables in the format of “numeric”. In this case, you have to convert the variables into “numeric” by using the following codes.
Then, you can see that the sample statistics of asset1 and asset 2 are shown in the result window.
data$m_asset3 <- as.numeric(data$m_asset3)
data$m_operation1 <- as.numeric(data$m_operation1)
summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ]) m_asset1 m_asset2 m_asset3 m_operation1
Min. : 504 Min. : 62 Min. : 0 Min. : 0
1st Qu.: 9340 1st Qu.: 1049 1st Qu.: 6338 1st Qu.: 1096
Median : 26628 Median : 3636 Median : 21502 Median : 3674
Mean : 310135 Mean : 46354 Mean : 258201 Mean : 41521
3rd Qu.: 85926 3rd Qu.: 11034 3rd Qu.: 70182 3rd Qu.: 12150
Max. :19725521 Max. :2438765 Max. :17730347 Max. :2371615
NA's :6 NA's :5 NA's :8 NA's :8
If you want to calculate sample statistics for the specific sub-sample, you can specify the sub-sample by setting conditions as follow.
M <- data$m_asset1[data$m_asset2 >12000]
summary(M) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
12632 109755 260771 1195924 877337 19725521 5
The summary() function provides the mean and quantile statistics in the default setting. If you want to calculate the stantard deviation, you can use the following codes. In the default settion, the function will provides NA if there is “NA” value in a variable.
sd(data$m_asset1, na.rm = TRUE) [1] 1398422
mean(data$m_asset1, na.rm = TRUE) # For calculating mean, we have this function.[1] 310134.6
quantile(data$m_asset1, 0.25, na.rm = TRUE) # For quantile statistics, we have this function. 25%
9340
You can also see the correlation between variables as follow. (Some variables include “NA” in values. In this case, we need to write the option “use” )
cor(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ], use = "complete.obs") m_asset1 m_asset2 m_asset3 m_operation1
m_asset1 1.0000000 0.9647632 0.9988622 0.9940217
m_asset2 0.9647632 1.0000000 0.9513290 0.9566454
m_asset3 0.9988622 0.9513290 1.0000000 0.9931728
m_operation1 0.9940217 0.9566454 0.9931728 1.0000000
Making graphs
The next step of the analysis is visualizing the data. The package of “ggplot2” is usefule to make professional graphs.
Most used graph is histogram, scatter plot, box plot, line plot and bar plot. Let’s see how you can make those graphs in R.
Scatterplot
Scatter plot can be created using the function of geom_point() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.
ggplot(data, aes(x = m_asset1, y= staff)) + geom_point()In R, you can use mathematical function in the codes of graphs. For example, you can use the log function in the graph as follow.3
ggplot(data, aes(x = log(m_asset1), y= log(staff))) + geom_point()Histogram
For making histgram, you can use the function of geom_histogram() from ggplot2 package. The function aes() is used to specify the variable for x axis.
ggplot(data, aes(x = log(m_asset1))) + geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
Box plot
For box plot, you can use the function of geom_boxplot() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.
ggplot(data, aes(x = as.factor(year), y= log(m_asset1))) + geom_boxplot()Bar plot
For bar plot, you can use the function of geom_bar() from ggplot2 package. The function aes() is used to specify the variables for x and y axis. For bar plot, aggregated data is suitable to use. Thus, we add codes for aggregation by year variable.
data2 <- data %>%
group_by(year) %>%
summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))
graph <- ggplot(data2, aes(x = as.factor(year), y= m_asset1)) + geom_bar(stat = "identity")
graph Line plot
For line plot, you can use the function of geom_line() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.
data2 <- data %>%
group_by(year) %>%
summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))
graph <- ggplot(data2, aes(x = year, y= m_asset1)) + geom_line()
graph Interactive graphs
For making advanced figures, the “Plotly” package gives you a various functions for making interactive graphics. For example, here are the codes for the interactive graph.
fig <- plot_ly(data, x = ~log(m_asset1), y = ~log(staff), type = "scatter")
figInteraction graph can be also created using the function of ggplotly() from the plotly library. Once you create the graph using ggplot(), you can convert it into interactive graph. The interactive figures empower you to explore data in detail.
p <- ggplot(data, aes(x = log(m_asset1), y= log(staff))) +
geom_point(size = 1, aes(text = paste("MFI Name:", Abbreviation, "\n Year: ", year ))) +
theme_minimal()
ggplotly(p)We can also create an interactive graph with a range slider. This is useful to see the time series data. A range slider is a small subplot-like area below a plot which allows users to pan and zoom the X-axis while maintaining an overview of the chart. Check out the reference for more options: https://plotly.com/r/range-slider/
library(plotly)
data2 <- data %>%
group_by(year) %>%
summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))
fig <- plot_ly(data2, type = 'scatter', mode = 'lines')%>%
add_trace(x = ~year, y = ~m_asset1)%>%
layout(showlegend = F, title='Time Series with Rangeslider',
xaxis = list(rangeslider = list(visible = T)))
fig <- fig %>%
layout(
xaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
plot_bgcolor='#e5ecf6', width = 900)Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
figAnalysis on the impact of interest rate cap policy
You can also create a new variable from variables in the data. For example, if you want to create average implicit interest rate of each individual bank , you can calculate it as follow:
\[ Interest Rate (\%) = \frac{Interest Income}{Loans} *100 \]
In R, you can create new variables in data in several ways. One of the good ways to create new variables from readability of R code is using mutate() function from dplyr package.
library(dplyr)
data <- data |>
mutate(interest_rate = m_operation1/m_asset3*100)Or you can also create the new variable as follow:
data["interest_rate"] <- data["m_operation1"]/data["m_asset3"]*100Let’s see the created variable.
summary(data["interest_rate"]) interest_rate
Min. : 0.7526
1st Qu.:13.5661
Median :17.2740
Mean : Inf
3rd Qu.:23.5880
Max. : Inf
NA's :10
There is “Inf” value. This means some observations have “0” in loans and the calculation of division produced infinite values. In this case, the calculation of statistics produce the meaningless results as “Inf”. You can modify it by using following code. We can see that the above-mentioned problems were solved
data$interest_rate[is.infinite(data$interest_rate)] <- NA
summary(data["interest_rate"]) interest_rate
Min. : 0.7526
1st Qu.: 13.5578
Median : 17.2485
Mean : 21.1732
3rd Qu.: 23.4287
Max. :1300.0000
NA's :13
Next, let’s plot this implicit interest rates by individual MFIs.
For this plot, it is
df <- subset(data, year == 2016)
ggplot(df, aes(x = Abbreviation, y = interest_rate)) +
geom_bar(stat = "identity") +
labs(title = "Interest Rates by MFI in 2016",
x = "MFI",
y = "Interest Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))You can also reorder the MFIs from low to high in interest rates as follow:
df <- subset(data, year == 2016)
ggplot(df, aes(x = reorder(Abbreviation, interest_rate), y = interest_rate)) +
geom_bar(stat = "identity") +
labs(title = "Interest Rates by MFI in 2016",
x = "MFI",
y = "Interest Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))You can also compare the interest rate between 2016 and 2017 for each MFI. position = "dodge" is used to place the bars for different years side by side for each bank.
data_filtered <- subset(data, year %in% c(2016, 2017))
# Calculate the order of banks based on the interest rate in 2016
order_banks <- data_filtered[data_filtered$year == 2016, ]
order_banks <- order_banks[order(order_banks$interest_rate), ]
ordered_levels <- order_banks$Abbreviation
# Convert the bank column to a factor with ordered levels
data_filtered$Abbreviation <- factor(data_filtered$Abbreviation, levels = ordered_levels)
# Create the grouped bar plot
ggplot(data_filtered, aes(x = Abbreviation, y = interest_rate, fill = factor(year))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Interest Rates by Bank in 2016 and 2017",
x = "MFI",
y = "Interest Rate",
fill = "Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))Some MFIs show the increase in interest rates from 2016 to 2017, although there is a regulation of the interest rate cap. Why? Consider the problems in the measure of “implicit interest rate”. And propose another approach for analysis of impact of interest rate cap policy.
Hint: There could be multiple issues in the analysis. Consider as many as possible. One of the problems is based on the context of micro lending in Cambodia. And others are based on the definition of the variable.
Aiba, Daiju and Sovvanroeun Samreth and Sothearoath Oeur and Vanndy Vat (2021) “Impact of Interest Rate Cap Policies on the Lending Behavior of Microfinance Institutions: Evidence from Millions of Observations in the Credit Registry Database” JICA Ogata Research Institute Working Paper Series, No. 224, Available at [link] https://ssrn.com/abstract=4026281
Heng, Dyna and Chea, Serey and Heng, Bomakara(April 1, 2021), “Impacts of Interest Rate Cap on Financial Inclusion in Cambodia” . IMF Working Paper No. 2021/107, Available at [link]
Regression Analysis
OLS Estimation
Making regression in R is very simple. We can do that by using lm() function.
We can type the codes as follow.
\[ y_{i} = \alpha + \beta x_i + \epsilon_i \]
reg <- lm( m_asset1 ~ staff, data= data)
summary(reg)
Call:
lm(formula = m_asset1 ~ staff, data = data)
Residuals:
Min 1Q Median 3Q Max
-2994730 28711 112352 146917 8182090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -157225.58 21592.98 -7.281 7.69e-13 ***
staff 1222.26 19.46 62.808 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 583800 on 828 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.8265, Adjusted R-squared: 0.8263
F-statistic: 3945 on 1 and 828 DF, p-value: < 2.2e-16
The results of estimated regression coefficients can be plotted by abline() function. After plotting the data of two variables in the graph, we can add the regression line using the results of regression as follow.
plot(data$staff, data$m_asset1) #x axis comes in the first imput, and y axis comes in the second input. abline(reg,col=“red"")Coding the regression analysis in R is quite flexible. We can run the regression with log transformation of independent variables If we transform y and x variables in the log form, the regression model is expressed as below.
\[ \log(y_i) = \alpha + \beta + \log(x_i) + \epsilon_i \]
The regression of the above mentioned model can be performed using the log function as follow.
log_reg <- lm( log(m_asset1 +1) ~ log(staff +1), data=data)
summary(log_reg)
Call:
lm(formula = log(m_asset1 + 1) ~ log(staff + 1), data = data)
Residuals:
Min 1Q Median 3Q Max
-4.4350 -0.5725 -0.0261 0.5351 3.2044
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.30249 0.09163 68.78 <2e-16 ***
log(staff + 1) 0.94567 0.01972 47.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9323 on 828 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: 0.7352, Adjusted R-squared: 0.7349
F-statistic: 2299 on 1 and 828 DF, p-value: < 2.2e-16
Since the “asset2” include zero values and negative values, log transformation make “NA” as missing values. If the transformed variable contains “NA” or “inf”, the regression will return errors. In that case, we should transform variables in advance and should code the variables not including NA and inf before the analysis.
log transformation can be also used in plot function.
plot(log(data$m_asset2), log(data$m_asset1)) Footnotes
Technically speaking,
library()is the function to import a set of functions into the PC memory. Every time you clear memory by ending R Studio or shutting down your PC, memory is refreshed, and the all the functions and data in the memory are gone.↩︎This function is built-in in R↩︎
In most cases, variables follow an exponential distribution. In such instances, applying the logarithm function to the variable can smooth the distribution, making it more closely resemble a normal distribution.↩︎