R Tutorial for Data Analysis (1st Week Session)

Author

Daiju Aiba

Published

July 25, 2024

How to start R

Before analysis, we should set the working directory. We can choose a folder from local folders of your PC. For choosing the folder, select [Session] button, and then select [Set Working Directory]. Or you can choose the folder by using setwd().

setwd("D:/OneDrive/Document/GitHub/Analysis-on-Dollarization-in-Cambodia/R Project")

In R, we first need to install packages to perform analysis. The packages can be download and installed by using the function install.packages(). Once we install packages, we do not need to install packages every time.

install.packages('tidyverse') # This package is for efficient data management and importing STATA data into R. 
install.packages("plotly") # This package is for making a interactive figures. 
install.packages("arm")  # This package is for visualizing the regression results in figures. 
install.packages("render")  # This package is for visualizing the regression results in figures.}

After we download and install the packages, we can call the functions in the packages by using library(). You need to call libraries (packages) of functions for your analysis every time we restart analysis using R. ¹

library(plotly) 
library(tidyverse) 
library(haven)  # This package is included in "tidyverse". 
library(shiny)

How to import data into R

For importing a dataset, there are various functions in R. If your original data set is in Excel format, you need to use read_excel() from readxl package. If your original data set is in CSV format, you need to use read.csv().² If you want to import dataset in STATA format, use read_dta() from haven package.

library(readxl)

file_path <- "Data/NBC_MFI.xlsx"
data <- read_excel(file_path, sheet = "Sheet1", range = "A3:BY842")

file_path <- "Data.csv"
data <- read.csv(file_path, header = TRUE, sep = ",")

library(haven)

file_path <- "Data.dta"
df <- read_dta(file_path)

Descriptive statistics

Once you import the data, you may want to calculate the sample statistics of the dataset. Usually, we need to see mean, median, several quantile values, and standard deviation of the variables in the dataset. Those sample statistics are defined as follow.

Mean

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

Median

\[ Med(X) = \begin{cases} X_{(n+1)/2} & \text{if n is odd} \\ \frac{1}{2} (X_{n/2} + X_{n/2 + 1}) & \text{if n is even} \end{cases} \]

Quantile

\[ Q_q(X) = \begin{cases} X_{(n+1)q} & \text{if nq is not an integer} \\ X_{nq} & \text{if nq is an integer} \end{cases} \]

Variance \[ Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

Standard Deviation \[ sd(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \]

summary() function can be used for calculating several sample statistics for the variables in the dataset. For calculating the sample statistics of the specific variable, you can use the following codes.

summary(data["m_asset1"])

    m_asset1       
 Min.   :     504  
 1st Qu.:    9340  
 Median :   26628  
 Mean   :  310135  
 3rd Qu.:   85926  
 Max.   :19725521  
 NA's   :6

Or you can select variables using “$”

summary(data$m_asset1)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
     504     9340    26628   310135    85926 19725521        6

If you input the whole dataset, the R will return descriptive statistics for all the variables in the dataset.

summary(data)

If you want to make a summary statistics for more than one variables at once. You should type the following codes.

summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ])

    m_asset1           m_asset2         m_asset3         m_operation1      
 Min.   :     504   Min.   :     62   Length:839         Length:839        
 1st Qu.:    9340   1st Qu.:   1049   Class :character   Class :character  
 Median :   26628   Median :   3636   Mode  :character   Mode  :character  
 Mean   :  310135   Mean   :  46354                                        
 3rd Qu.:   85926   3rd Qu.:  11034                                        
 Max.   :19725521   Max.   :2438765                                        
 NA's   :6          NA's   :5

You can see that variables “asset1” and “asset2” are imported as “character” (or we say “string”). The calculation of statistics require the variables in the format of “numeric”. In this case, you have to convert the variables into “numeric” by using the following codes.

Then, you can see that the sample statistics of asset1 and asset 2 are shown in the result window.

data$m_asset3 <- as.numeric(data$m_asset3)  
data$m_operation1 <- as.numeric(data$m_operation1)  
summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ])

    m_asset1           m_asset2          m_asset3         m_operation1    
 Min.   :     504   Min.   :     62   Min.   :       0   Min.   :      0  
 1st Qu.:    9340   1st Qu.:   1049   1st Qu.:    6338   1st Qu.:   1096  
 Median :   26628   Median :   3636   Median :   21502   Median :   3674  
 Mean   :  310135   Mean   :  46354   Mean   :  258201   Mean   :  41521  
 3rd Qu.:   85926   3rd Qu.:  11034   3rd Qu.:   70182   3rd Qu.:  12150  
 Max.   :19725521   Max.   :2438765   Max.   :17730347   Max.   :2371615  
 NA's   :6          NA's   :5         NA's   :8          NA's   :8

If you want to calculate sample statistics for the specific sub-sample, you can specify the sub-sample by setting conditions as follow.

M <- data$m_asset1[data$m_asset2 >12000]  
summary(M)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
   12632   109755   260771  1195924   877337 19725521        5

The summary() function provides the mean and quantile statistics in the default setting. If you want to calculate the stantard deviation, you can use the following codes. In the default settion, the function will provides NA if there is “NA” value in a variable.

sd(data$m_asset1, na.rm = TRUE)

[1] 1398422

mean(data$m_asset1, na.rm = TRUE)  # For calculating mean, we have this function.

[1] 310134.6

quantile(data$m_asset1, 0.25, na.rm = TRUE) # For quantile statistics, we have this function.

 25% 
9340

You can also see the correlation between variables as follow. (Some variables include “NA” in values. In this case, we need to write the option “use” )

cor(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ], use = "complete.obs")

              m_asset1  m_asset2  m_asset3 m_operation1
m_asset1     1.0000000 0.9647632 0.9988622    0.9940217
m_asset2     0.9647632 1.0000000 0.9513290    0.9566454
m_asset3     0.9988622 0.9513290 1.0000000    0.9931728
m_operation1 0.9940217 0.9566454 0.9931728    1.0000000

Making graphs

The next step of the analysis is visualizing the data. The package of “ggplot2” is usefule to make professional graphs.

Most used graph is histogram, scatter plot, box plot, line plot and bar plot. Let’s see how you can make those graphs in R.

Scatterplot

Scatter plot can be created using the function of geom_point() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.

ggplot(data, aes(x = m_asset1, y= staff)) + geom_point()

In R, you can use mathematical function in the codes of graphs. For example, you can use the log function in the graph as follow.³

ggplot(data, aes(x = log(m_asset1), y= log(staff))) + geom_point()

Histogram

For making histgram, you can use the function of geom_histogram() from ggplot2 package. The function aes() is used to specify the variable for x axis.

ggplot(data, aes(x = log(m_asset1))) + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

Box plot

For box plot, you can use the function of geom_boxplot() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.

ggplot(data, aes(x = as.factor(year), y= log(m_asset1))) + geom_boxplot()

Bar plot

For bar plot, you can use the function of geom_bar() from ggplot2 package. The function aes() is used to specify the variables for x and y axis. For bar plot, aggregated data is suitable to use. Thus, we add codes for aggregation by year variable.

data2 <- data %>%
  group_by(year) %>%
  summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))

graph <- ggplot(data2, aes(x = as.factor(year), y= m_asset1)) + geom_bar(stat = "identity")
graph

Line plot

For line plot, you can use the function of geom_line() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.

data2 <- data %>%
  group_by(year) %>%
  summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))

graph <- ggplot(data2, aes(x = year, y= m_asset1)) + geom_line()
graph

Interactive graphs

For making advanced figures, the “Plotly” package gives you a various functions for making interactive graphics. For example, here are the codes for the interactive graph.

fig <- plot_ly(data, x = ~log(m_asset1), y = ~log(staff), type = "scatter") 
fig

Interaction graph can be also created using the function of ggplotly() from the plotly library. Once you create the graph using ggplot(), you can convert it into interactive graph. The interactive figures empower you to explore data in detail.

p <- ggplot(data, aes(x = log(m_asset1), y= log(staff))) + 
  geom_point(size = 1, aes(text = paste("MFI Name:", Abbreviation, "\n Year: ", year )))  +
  theme_minimal()

ggplotly(p)

We can also create an interactive graph with a range slider. This is useful to see the time series data. A range slider is a small subplot-like area below a plot which allows users to pan and zoom the X-axis while maintaining an overview of the chart. Check out the reference for more options: https://plotly.com/r/range-slider/

library(plotly)

data2 <- data %>%
  group_by(year) %>%
  summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))

fig <- plot_ly(data2, type = 'scatter', mode = 'lines')%>%
  add_trace(x = ~year, y = ~m_asset1)%>%
  layout(showlegend = F, title='Time Series with Rangeslider',
         xaxis = list(rangeslider = list(visible = T)))
fig <- fig %>%
  layout(
         xaxis = list(zerolinecolor = '#ffff',
                      zerolinewidth = 2,
                      gridcolor = 'ffff'),
         yaxis = list(zerolinecolor = '#ffff',
                      zerolinewidth = 2,
                      gridcolor = 'ffff'),
         plot_bgcolor='#e5ecf6', width = 900)

Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()

fig

Analysis on the impact of interest rate cap policy

You can also create a new variable from variables in the data. For example, if you want to create average implicit interest rate of each individual bank , you can calculate it as follow:

\[ Interest Rate (\%) = \frac{Interest Income}{Loans} *100 \]

In R, you can create new variables in data in several ways. One of the good ways to create new variables from readability of R code is using mutate() function from dplyr package.

library(dplyr)
data <- data |>
    mutate(interest_rate = m_operation1/m_asset3*100)

Or you can also create the new variable as follow:

data["interest_rate"] <- data["m_operation1"]/data["m_asset3"]*100

Let’s see the created variable.

summary(data["interest_rate"])

 interest_rate    
 Min.   : 0.7526  
 1st Qu.:13.5661  
 Median :17.2740  
 Mean   :    Inf  
 3rd Qu.:23.5880  
 Max.   :    Inf  
 NA's   :10

There is “Inf” value. This means some observations have “0” in loans and the calculation of division produced infinite values. In this case, the calculation of statistics produce the meaningless results as “Inf”. You can modify it by using following code. We can see that the above-mentioned problems were solved

data$interest_rate[is.infinite(data$interest_rate)] <- NA


summary(data["interest_rate"])

 interest_rate      
 Min.   :   0.7526  
 1st Qu.:  13.5578  
 Median :  17.2485  
 Mean   :  21.1732  
 3rd Qu.:  23.4287  
 Max.   :1300.0000  
 NA's   :13

Next, let’s plot this implicit interest rates by individual MFIs.

For this plot, it is

df <- subset(data, year == 2016)
ggplot(df, aes(x = Abbreviation, y = interest_rate)) +
  geom_bar(stat = "identity") +
  labs(title = "Interest Rates by MFI in 2016",
       x = "MFI",
       y = "Interest Rate") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))

You can also reorder the MFIs from low to high in interest rates as follow:

df <- subset(data, year == 2016)
ggplot(df, aes(x = reorder(Abbreviation, interest_rate), y = interest_rate)) +
  geom_bar(stat = "identity") +
  labs(title = "Interest Rates by MFI in 2016",
       x = "MFI",
       y = "Interest Rate") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))

You can also compare the interest rate between 2016 and 2017 for each MFI. position = "dodge" is used to place the bars for different years side by side for each bank.

data_filtered <- subset(data, year %in% c(2016, 2017))

# Calculate the order of banks based on the interest rate in 2016
order_banks <- data_filtered[data_filtered$year == 2016, ]
order_banks <- order_banks[order(order_banks$interest_rate), ]
ordered_levels <- order_banks$Abbreviation

# Convert the bank column to a factor with ordered levels
data_filtered$Abbreviation <- factor(data_filtered$Abbreviation, levels = ordered_levels)

# Create the grouped bar plot
ggplot(data_filtered, aes(x = Abbreviation, y = interest_rate, fill = factor(year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Interest Rates by Bank in 2016 and 2017",
       x = "MFI",
       y = "Interest Rate",
       fill = "Year") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))

Question

Some MFIs show the increase in interest rates from 2016 to 2017, although there is a regulation of the interest rate cap. Why? Consider the problems in the measure of “implicit interest rate”. And propose another approach for analysis of impact of interest rate cap policy.

Hint: There could be multiple issues in the analysis. Consider as many as possible. One of the problems is based on the context of micro lending in Cambodia. And others are based on the definition of the variable.

Regression Analysis

OLS Estimation

Making regression in R is very simple. We can do that by using lm() function.

We can type the codes as follow.

\[ y_{i} = \alpha + \beta x_i + \epsilon_i \]

reg <- lm( m_asset1 ~ staff, data= data) 
summary(reg)


Call:
lm(formula = m_asset1 ~ staff, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2994730    28711   112352   146917  8182090 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -157225.58   21592.98  -7.281 7.69e-13 ***
staff          1222.26      19.46  62.808  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 583800 on 828 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.8265,    Adjusted R-squared:  0.8263 
F-statistic:  3945 on 1 and 828 DF,  p-value: < 2.2e-16

The results of estimated regression coefficients can be plotted by abline() function. After plotting the data of two variables in the graph, we can add the regression line using the results of regression as follow.

plot(data$staff, data$m_asset1)  #x axis comes in the first imput, and y axis comes in the second input. abline(reg,col=“red"")

Coding the regression analysis in R is quite flexible. We can run the regression with log transformation of independent variables If we transform y and x variables in the log form, the regression model is expressed as below.

\[ \log(y_i) = \alpha + \beta + \log(x_i) + \epsilon_i \]

The regression of the above mentioned model can be performed using the log function as follow.

log_reg <- lm( log(m_asset1 +1) ~ log(staff +1), data=data) 
summary(log_reg)


Call:
lm(formula = log(m_asset1 + 1) ~ log(staff + 1), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4350 -0.5725 -0.0261  0.5351  3.2044 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.30249    0.09163   68.78   <2e-16 ***
log(staff + 1)  0.94567    0.01972   47.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9323 on 828 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.7352,    Adjusted R-squared:  0.7349 
F-statistic:  2299 on 1 and 828 DF,  p-value: < 2.2e-16

Since the “asset2” include zero values and negative values, log transformation make “NA” as missing values. If the transformed variable contains “NA” or “inf”, the regression will return errors. In that case, we should transform variables in advance and should code the variables not including NA and inf before the analysis.

log transformation can be also used in plot function.

plot(log(data$m_asset2), log(data$m_asset1))

Footnotes

Technically speaking, library() is the function to import a set of functions into the PC memory. Every time you clear memory by ending R Studio or shutting down your PC, memory is refreshed, and the all the functions and data in the memory are gone.↩︎
This function is built-in in R↩︎
In most cases, variables follow an exponential distribution. In such instances, applying the logarithm function to the variable can smooth the distribution, making it more closely resemble a normal distribution.↩︎