setwd("D:/OneDrive/Document/GitHub/Analysis-on-Dollarization-in-Cambodia/R Project")R Tutorial for Data Analysis (1st Week Session)
How to start R
Before analysis, we should set the working directory. We can choose a folder from local folders of your PC. For choosing the folder, select [Session] button, and then select [Set Working Directory]. Or you can choose the folder by using setwd().
In R, we first need to install packages to perform analysis. The packages can be download and installed by using the function install.packages(). Once we install packages, we do not need to install packages every time.
install.packages('tidyverse') # This package is for efficient data management and importing STATA data into R.
install.packages("plotly") # This package is for making a interactive figures.
install.packages("arm") # This package is for visualizing the regression results in figures.
install.packages("render") # This package is for visualizing the regression results in figures.}After we download and install the packages, we can call the functions in the packages by using library(). You need to call libraries (packages) of functions for your analysis every time we restart analysis using R. 1
library(plotly)
library(tidyverse)
library(haven) # This package is included in "tidyverse". How to import data into R
For importing a dataset, there are various functions in R. If your original data set is in Excel format, you need to use read_excel() from readxl package. If your original data set is in CSV format, you need to use read.csv().2 If you want to import dataset in STATA format, use read_dta() from haven package.
library(readxl)
file_path <- "Data/NBC_MFI.xlsx"
data <- read_excel(file_path, sheet = "Sheet1", range = "A3:BY842")file_path <- "Data.csv"
data <- read.csv(file_path, header = TRUE, sep = ",")library(haven)
file_path <- "Data.dta"
df <- read_dta(file_path)Descriptive statistics
Once you import the data, you may want to calculate the sample statistics of the dataset. Usually, we need to see mean, median, several quantile values, and standard deviation of the variables in the dataset. Those sample statistics are defined as follow.
Mean
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
Median
Med(X) = \begin{cases} X_{(n+1)/2} & \text{if n is odd} \\ \frac{1}{2} (X_{n/2} + X_{n/2 + 1}) & \text{if n is even} \end{cases}
Quantile
Q_q(X) = \begin{cases} X_{(n+1)q} & \text{if nq is not an integer} \\ X_{nq} & \text{if nq is an integer} \end{cases}
Variance Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
Standard Deviation sd(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}
summary() function can be used for calculating several sample statistics for the variables in the dataset. For calculating the sample statistics of the specific variable, you can use the following codes.
summary(data["m_asset1"]) m_asset1
Min. : 504
1st Qu.: 9340
Median : 26628
Mean : 310135
3rd Qu.: 85926
Max. :19725521
NA's :6
Or you can select variables using “$”
summary(data$m_asset1) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
504 9340 26628 310135 85926 19725521 6
If you input the whole dataset, the R will return descriptive statistics for all the variables in the dataset.
summary(data) If you want to make a summary statistics for more than one variables at once. You should type the following codes.
summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ]) m_asset1 m_asset2 m_asset3 m_operation1
Min. : 504 Min. : 62 Length:839 Length:839
1st Qu.: 9340 1st Qu.: 1049 Class :character Class :character
Median : 26628 Median : 3636 Mode :character Mode :character
Mean : 310135 Mean : 46354
3rd Qu.: 85926 3rd Qu.: 11034
Max. :19725521 Max. :2438765
NA's :6 NA's :5
You can see that variables “asset1” and “asset2” are imported as “character” (or we say “string”). The calculation of statistics require the variables in the format of “numeric”. In this case, you have to convert the variables into “numeric” by using the following codes.
Then, you can see that the sample statistics of asset1 and asset 2 are shown in the result window.
data$m_asset3 <- as.numeric(data$m_asset3)
data$m_operation1 <- as.numeric(data$m_operation1)
summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ]) m_asset1 m_asset2 m_asset3 m_operation1
Min. : 504 Min. : 62 Min. : 0 Min. : 0
1st Qu.: 9340 1st Qu.: 1049 1st Qu.: 6338 1st Qu.: 1096
Median : 26628 Median : 3636 Median : 21502 Median : 3674
Mean : 310135 Mean : 46354 Mean : 258201 Mean : 41521
3rd Qu.: 85926 3rd Qu.: 11034 3rd Qu.: 70182 3rd Qu.: 12150
Max. :19725521 Max. :2438765 Max. :17730347 Max. :2371615
NA's :6 NA's :5 NA's :8 NA's :8
If you want to calculate sample statistics for the specific sub-sample, you can specify the sub-sample by setting conditions as follow.
M <- data$m_asset1[data$m_asset2 >12000]
summary(M) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
12632 109755 260771 1195924 877337 19725521 5
The summary() function provides the mean and quantile statistics in the default setting. If you want to calculate the stantard deviation, you can use the following codes. In the default settion, the function will provides NA if there is “NA” value in a variable.
sd(data$m_asset1, na.rm = TRUE) [1] 1398422
mean(data$m_asset1, na.rm = TRUE) # For calculating mean, we have this function.[1] 310134.6
quantile(data$m_asset1, 0.25, na.rm = TRUE) # For quantile statistics, we have this function. 25%
9340
You can also see the correlation between variables as follow. (Some variables include “NA” in values. In this case, we need to write the option “use” )
cor(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ], use = "complete.obs") m_asset1 m_asset2 m_asset3 m_operation1
m_asset1 1.0000000 0.9647632 0.9988622 0.9940217
m_asset2 0.9647632 1.0000000 0.9513290 0.9566454
m_asset3 0.9988622 0.9513290 1.0000000 0.9931728
m_operation1 0.9940217 0.9566454 0.9931728 1.0000000
You can also create a table for descriptive statistics using describe() from psych package .
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
describe(data) vars n mean sd median trimmed
id 1 838 247.50 33.60 237.50 245.35
Abbreviation 2 837 96.11 58.40 93.00 95.62
mame 3 675 109.00 65.55 106.00 108.98
specialization 4 839 1.87 0.35 2.00 1.96
year 5 839 2016.48 4.26 2017.00 2016.85
staff 6 832 382.28 1040.87 63.50 125.10
total_office 7 429 38.86 56.14 12.00 27.21
...8 8 402 1.00 0.00 1.00 1.00
...9 9 402 7.75 8.79 2.00 6.39
...10 10 402 9.67 11.54 2.00 7.77
n_province 11 313 9.41 8.23 7.00 8.56
n_district 12 313 59.51 64.25 30.00 50.07
n_commune 13 313 394.53 609.35 144.00 274.45
n_villege 14 313 2125.22 3693.27 391.00 1210.29
head_office_capital 15 161 0.95 0.22 1.00 1.00
head_office_province 16 159 0.05 0.22 0.00 0.00
Branch_office_capital 17 161 7.26 6.18 5.00 6.40
Branch_office_provincial 18 161 50.46 58.01 20.00 40.49
foreign_share 19 838 39.57 41.94 17.00 37.09
Amount_paid_foreign 20 837 96.81 114.73 28.00 80.88
local_share 21 834 52.77 35.84 81.50 54.82
Amount_paid_local 22 834 178.15 149.72 163.50 171.42
paid_up_capital 23 836 29499.55 98116.31 10188.00 13659.17
m_asset1 24 833 310134.56 1398421.72 26628.00 54310.86
m_asset2 25 834 46354.06 205324.96 3636.50 7331.67
m_asset3 26 831 258201.43 1189716.51 21502.00 44062.32
m_asset4 27 834 262.22 189.06 252.50 256.51
m_asset5 28 774 21.99 36.21 3.00 13.90
m_asset6 29 834 363.99 208.50 371.50 364.35
m_asset7 30 827 128.88 125.13 79.00 115.07
m_liability1 31 834 309731.39 1397621.84 26633.50 54197.01
m_liability2 32 759 7.66 18.21 2.00 2.22
m_liability3 33 763 15.21 33.16 2.00 5.61
m_liability4 34 763 11.66 26.45 2.00 3.79
m_liability5 35 825 214.88 169.29 199.00 206.79
m_liability6 36 830 217.78 165.64 213.50 211.03
m_liability7 37 828 270.55 215.89 255.50 259.89
m_liability8 38 804 20.47 31.11 10.00 12.25
m_liability9 39 823 99.37 91.23 43.00 87.65
m_liability10 40 824 93.65 106.85 16.00 78.06
m_liability11 41 834 19828.29 64603.87 6113.00 9046.90
m_liability12 42 792 5.67 12.75 2.00 1.93
m_liability13 43 790 6.63 14.93 2.00 2.18
m_liability14 44 807 39.59 63.98 5.00 25.27
m_liability15 45 834 15184.04 70765.94 76.00 2162.53
m_operation1 46 831 41520.65 168028.51 3674.00 8108.50
m_operation2 47 832 263.28 210.56 249.50 252.84
m_operation3 48 834 404.76 235.35 403.50 404.50
m_operation4 49 834 326.53 208.37 321.50 322.21
m_operation5 50 591 -4.50 238.75 0.00 1.04
m_operation6 51 591 62.52 191.84 0.00 10.95
m_operation7 52 834 407.67 236.85 406.50 407.53
m_operation8 53 592 9700.18 26393.14 1064.71 2915.48
m_operation9 54 592 3.59 68.26 0.00 0.00
m_operation10 55 834 391.34 234.78 390.50 391.36
m_operation11 56 591 72.33 291.32 0.00 8.17
m_operation12 57 591 34.68 242.21 0.00 1.07
m_operation13 58 832 286.06 192.10 266.50 278.51
m_operation14 59 670 328.50 190.02 326.50 328.26
m_operation15 60 594 67.15 276.23 0.00 2.26
m_operation16 61 791 9.28 20.22 3.00 3.27
m_operation17 62 591 4.55 111.10 0.00 0.00
m_operation18 63 834 406.41 235.09 404.50 406.24
m_operation19 64 833 198.12 173.10 173.00 186.20
m_operation20 65 834 406.35 233.87 408.50 406.49
m_npl1 66 834 267.20 206.38 255.50 258.85
m_npl2 67 816 220.00 190.77 196.50 207.21
m_portfolio1 68 837 2.14 2.68 2.00 1.79
m_portfolio2 69 834 249.43 208.06 229.50 237.51
...70 70 837 8.90 21.44 2.00 2.42
m_portfolio3 71 834 263.68 214.82 247.50 252.26
m_portfolio4 72 833 202.70 191.21 166.00 186.42
m_portfolio5 73 833 152.33 160.69 97.00 134.71
m_portfolio6 74 834 159.96 168.77 102.50 141.30
m_portfolio7 75 834 266.16 213.76 251.50 255.64
m_portfolio8 76 827 184.65 179.64 144.00 168.31
m_total_credit 77 834 409.12 240.22 409.50 409.21
mad min max range skew
id 34.84 201.00 324.00 123.00 0.51
Abbreviation 74.13 1.00 198.00 197.00 0.07
mame 84.51 1.00 219.00 218.00 0.02
specialization 0.00 1.00 3.00 2.00 -1.99
year 4.45 2006.00 2022.00 16.00 -0.67
staff 71.91 0.00 9573.00 9573.00 5.15
total_office 16.31 0.00 374.00 374.00 1.96
...8 0.00 1.00 1.00 0.00 NaN
...9 1.48 1.00 28.00 27.00 0.93
...10 1.48 1.00 39.00 38.00 1.00
n_province 7.41 0.00 25.00 25.00 0.74
n_district 32.62 0.00 203.00 203.00 1.13
n_commune 174.95 0.00 6091.00 6091.00 3.51
n_villege 524.84 0.00 13916.00 13916.00 1.91
head_office_capital 0.00 0.00 1.00 1.00 -4.11
head_office_province 0.00 0.00 1.00 1.00 4.08
Branch_office_capital 4.45 1.00 24.00 23.00 1.03
Branch_office_provincial 23.72 1.00 352.00 351.00 1.90
foreign_share 23.72 1.00 102.00 101.00 0.44
Amount_paid_foreign 40.03 1.00 353.00 352.00 0.81
local_share 14.08 1.00 102.00 101.00 -0.46
Amount_paid_local 237.96 1.00 431.00 430.00 0.16
paid_up_capital 8836.30 250.00 1646800.00 1646550.00 10.49
m_asset1 30433.33 504.00 19725521.00 19725017.00 8.94
m_asset2 4531.57 62.00 2438765.00 2438703.00 8.15
m_asset3 26719.42 0.00 17730347.00 17730347.00 9.30
m_asset4 253.52 1.00 601.00 600.00 0.14
m_asset5 0.00 1.00 134.00 133.00 1.64
m_asset6 263.90 1.00 726.00 725.00 -0.02
m_asset7 99.33 1.00 403.00 402.00 0.69
m_liability1 30468.91 0.00 19725521.00 19725521.00 8.94
m_liability2 0.00 1.00 95.00 94.00 3.23
m_liability3 0.00 1.00 144.00 143.00 2.47
m_liability4 0.00 1.00 123.00 122.00 2.74
m_liability5 241.66 1.00 527.00 526.00 0.21
m_liability6 227.58 1.00 522.00 521.00 0.15
m_liability7 302.45 1.00 666.00 665.00 0.22
m_liability8 0.00 1.00 138.00 137.00 2.40
m_liability9 34.10 1.00 321.00 320.00 1.02
m_liability10 22.24 1.00 344.00 343.00 0.95
m_liability11 9060.84 0.01 937250.00 937249.99 9.79
m_liability12 0.00 1.00 72.00 71.00 3.42
m_liability13 0.00 1.00 79.00 78.00 3.17
m_liability14 0.00 1.00 228.00 227.00 1.62
m_liability15 1710.34 -24721.00 1143068.00 1167789.00 9.23
m_operation1 4736.91 0.00 2371615.00 2371615.00 8.38
m_operation2 298.00 1.00 648.00 647.00 0.21
m_operation3 303.19 1.00 813.00 812.00 0.01
m_operation4 280.95 1.00 692.00 691.00 0.09
m_operation5 0.00 -1808.45 1695.18 3503.63 -2.10
m_operation6 0.00 -3.09 1666.00 1669.09 4.47
m_operation7 304.67 1.00 818.00 817.00 0.01
m_operation8 1578.54 0.00 223785.38 223785.38 4.51
m_operation9 0.00 0.00 1653.00 1653.00 23.84
m_operation10 301.71 1.00 797.00 796.00 0.00
m_operation11 0.00 0.00 2543.00 2543.00 5.49
m_operation12 0.00 0.00 5167.36 5167.36 16.91
m_operation13 269.83 1.00 643.00 642.00 0.23
m_operation14 242.41 1.00 659.00 658.00 0.01
m_operation15 0.00 -1196.00 2510.52 3706.52 4.33
m_operation16 0.00 1.00 103.00 102.00 3.08
m_operation17 0.00 -1476.22 1397.48 2873.70 -1.89
m_operation18 299.49 1.00 813.00 812.00 0.01
m_operation19 249.08 1.00 521.00 520.00 0.33
m_operation20 299.49 1.00 809.00 808.00 -0.01
m_npl1 286.88 1.00 638.00 637.00 0.17
m_npl2 280.95 1.00 580.00 579.00 0.32
m_portfolio1 0.00 1.00 27.00 26.00 6.66
m_portfolio2 300.23 1.00 631.00 630.00 0.26
...70 0.00 1.00 109.00 108.00 3.12
m_portfolio3 304.67 1.00 659.00 658.00 0.23
m_portfolio4 243.15 1.00 575.00 574.00 0.43
m_portfolio5 140.85 1.00 485.00 484.00 0.60
m_portfolio6 149.00 1.00 510.00 509.00 0.60
m_portfolio7 301.71 1.00 655.00 654.00 0.21
m_portfolio8 210.53 1.00 539.00 538.00 0.47
m_total_credit 309.12 1.00 824.00 823.00 0.00
kurtosis se
id -1.00 1.16
Abbreviation -1.20 2.02
mame -1.23 2.52
specialization 2.62 0.01
year -0.41 0.15
staff 32.73 36.09
total_office 4.32 2.71
...8 NaN 0.00
...9 -0.63 0.44
...10 -0.43 0.58
n_province -0.84 0.47
n_district -0.20 3.63
n_commune 23.35 34.44
n_villege 2.24 208.76
head_office_capital 14.95 0.02
head_office_province 14.70 0.02
Branch_office_capital 0.10 0.49
Branch_office_provincial 4.91 4.57
foreign_share -1.61 1.45
Amount_paid_foreign -0.82 3.97
local_share -1.58 1.24
Amount_paid_local -1.51 5.18
paid_up_capital 134.91 3393.42
m_asset1 96.69 48452.44
m_asset2 78.54 7109.82
m_asset3 105.85 41270.81
m_asset4 -1.26 6.55
m_asset5 1.24 1.30
m_asset6 -1.18 7.22
m_asset7 -0.93 4.35
m_liability1 96.81 48395.69
m_liability2 9.58 0.66
m_liability3 4.87 1.20
m_liability4 6.41 0.96
m_liability5 -1.31 5.89
m_liability6 -1.30 5.75
m_liability7 -1.31 7.50
m_liability8 4.59 1.10
m_liability9 -0.35 3.18
m_liability10 -0.55 3.72
m_liability11 117.35 2237.05
m_liability12 10.97 0.45
m_liability13 9.18 0.53
m_liability14 1.17 2.25
m_liability15 110.72 2450.42
m_operation1 87.74 5828.84
m_operation2 -1.32 7.30
m_operation3 -1.21 8.15
m_operation4 -1.31 7.22
m_operation5 27.29 9.82
m_operation6 23.62 7.89
m_operation7 -1.21 8.20
m_operation8 23.81 1084.75
m_operation9 572.78 2.81
m_operation10 -1.21 8.13
m_operation11 31.58 11.98
m_operation12 341.73 9.96
m_operation13 -1.29 6.66
m_operation14 -1.20 7.34
m_operation15 27.41 11.33
m_operation16 8.56 0.72
m_operation17 120.11 4.57
m_operation18 -1.20 8.14
m_operation19 -1.32 6.00
m_operation20 -1.20 8.10
m_npl1 -1.33 7.15
m_npl2 -1.30 6.68
m_portfolio1 47.13 0.09
m_portfolio2 -1.33 7.20
...70 8.78 0.74
m_portfolio3 -1.31 7.44
m_portfolio4 -1.25 6.62
m_portfolio5 -1.12 5.57
m_portfolio6 -1.10 5.84
m_portfolio7 -1.33 7.40
m_portfolio8 -1.23 6.25
m_total_credit -1.21 8.32
Checking mission values in data
There is the package naniar to visualize missing values in a dataset.3 Here’s how you can use the naniar package in R to visualize missing data. If you want to know how many missing values in each variable, you can use gg_miss_var().
library(naniar)
gg_miss_var(data)If you want to plot the percentages of missing values to total number of observations, the following code will be helpful.
gg_miss_var(data, show_pct = TRUE)If you want to know which observations including missing values in many variables, it is helpful to plot the pattern of missing values as follow.
vis_miss(data)If you want more advanced visualizations, you can use the VIM package.
library(VIM)Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
matrixplot(data)aggr(data, col = c('navyblue', 'yellow'), numbers = TRUE, sortVars = TRUE, labels = names(data), cex.axis = 0.7, gap = 3, ylab = c("Missing data", "Pattern"))Warning in plot.aggr(res, ...): not enough vertical space to display
frequencies (too many combinations)
Variables sorted by number of missings:
Variable Count
head_office_province 0.810488677
head_office_capital 0.808104887
Branch_office_capital 0.808104887
Branch_office_provincial 0.808104887
n_province 0.626936830
n_district 0.626936830
n_commune 0.626936830
n_villege 0.626936830
...8 0.520858164
...9 0.520858164
...10 0.520858164
total_office 0.488676996
m_operation5 0.295589988
m_operation6 0.295589988
m_operation11 0.295589988
m_operation12 0.295589988
m_operation17 0.295589988
m_operation8 0.294398093
m_operation9 0.294398093
m_operation15 0.292014303
m_operation14 0.201430274
mame 0.195470799
m_liability2 0.095351609
m_liability3 0.090584029
m_liability4 0.090584029
m_asset5 0.077473182
m_liability13 0.058402861
m_operation16 0.057210965
m_liability12 0.056019070
m_liability8 0.041716329
m_liability14 0.038140644
m_npl2 0.027413588
m_liability9 0.019070322
m_liability10 0.017878427
m_liability5 0.016686532
m_asset7 0.014302741
m_portfolio8 0.014302741
m_liability7 0.013110846
m_liability6 0.010727056
m_asset3 0.009535161
m_operation1 0.009535161
staff 0.008343266
m_operation2 0.008343266
m_operation13 0.008343266
m_asset1 0.007151371
m_operation19 0.007151371
m_portfolio4 0.007151371
m_portfolio5 0.007151371
local_share 0.005959476
Amount_paid_local 0.005959476
m_asset2 0.005959476
m_asset4 0.005959476
m_asset6 0.005959476
m_liability1 0.005959476
m_liability11 0.005959476
m_liability15 0.005959476
m_operation3 0.005959476
m_operation4 0.005959476
m_operation7 0.005959476
m_operation10 0.005959476
m_operation18 0.005959476
m_operation20 0.005959476
m_npl1 0.005959476
m_portfolio2 0.005959476
m_portfolio3 0.005959476
m_portfolio6 0.005959476
m_portfolio7 0.005959476
m_total_credit 0.005959476
paid_up_capital 0.003575685
Abbreviation 0.002383790
Amount_paid_foreign 0.002383790
m_portfolio1 0.002383790
...70 0.002383790
id 0.001191895
foreign_share 0.001191895
specialization 0.000000000
year 0.000000000
Making graphs
The next step of the analysis is visualizing the data. The package of “ggplot2” is usefule to make professional graphs.
Most used graph is histogram, scatter plot, box plot, line plot and bar plot. Let’s see how you can make those graphs in R.
Scatterplot
Scatter plot can be created using the function of geom_point() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.
ggplot(data, aes(x = m_asset1, y= staff)) + geom_point()In R, you can use mathematical function in the codes of graphs. For example, you can use the log function in the graph as follow.4
ggplot(data, aes(x = log(m_asset1), y= log(staff))) + geom_point()Histogram
For making histgram, you can use the function of geom_histogram() from ggplot2 package. The function aes() is used to specify the variable for x axis.
ggplot(data, aes(x = log(m_asset1))) + geom_histogram()`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
Box plot
For box plot, you can use the function of geom_boxplot() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.
ggplot(data, aes(x = as.factor(year), y= log(m_asset1))) + geom_boxplot()Bar plot
For bar plot, you can use the function of geom_bar() from ggplot2 package. The function aes() is used to specify the variables for x and y axis. For bar plot, aggregated data is suitable to use. Thus, we add codes for aggregation by year variable.
data2 <- data %>%
group_by(year) %>%
summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))
graph <- ggplot(data2, aes(x = as.factor(year), y= m_asset1)) + geom_bar(stat = "identity")
graph Line plot
For line plot, you can use the function of geom_line() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.
data2 <- data %>%
group_by(year) %>%
summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))
graph <- ggplot(data2, aes(x = year, y= m_asset1)) + geom_line()
graph Interactive graphs
For making advanced figures, the “Plotly” package gives you a various functions for making interactive graphics. For example, here are the codes for the interactive graph.
fig <- plot_ly(data, x = ~log(m_asset1), y = ~log(staff), type = "scatter")
figInteraction graph can be also created using the function of ggplotly() from the plotly library. Once you create the graph using ggplot(), you can convert it into interactive graph. The interactive figures empower you to explore data in detail.
p <- ggplot(data, aes(x = log(m_asset1), y= log(staff))) +
geom_point(size = 1, aes(text = paste("MFI Name:", Abbreviation, "\n Year: ", year ))) +
theme_minimal()
ggplotly(p)We can also create an interactive graph with a range slider. This is useful to see the time series data. A range slider is a small subplot-like area below a plot which allows users to pan and zoom the X-axis while maintaining an overview of the chart. Check out the reference for more options: https://plotly.com/r/range-slider/
library(plotly)
data2 <- data %>%
group_by(year) %>%
summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))
fig <- plot_ly(data2, type = 'scatter', mode = 'lines')%>%
add_trace(x = ~year, y = ~m_asset1)%>%
layout(showlegend = F, title='Time Series with Rangeslider',
xaxis = list(rangeslider = list(visible = T)))
fig <- fig %>%
layout(
xaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
plot_bgcolor='#e5ecf6', width = 900)Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
figAnalysis on the impact of interest rate cap policy
You can also create a new variable from variables in the data. For example, if you want to create average implicit interest rate of each individual bank , you can calculate it as follow:
Interest Rate (\%) = \frac{Interest Income}{Loans} *100
In R, you can create new variables in data in several ways. One of the good ways to create new variables from readability of R code is using mutate() function from dplyr package.
library(dplyr)
data <- data |>
mutate(interest_rate = m_operation1/m_asset3*100)Or you can also create the new variable as follow:
data["interest_rate"] <- data["m_operation1"]/data["m_asset3"]*100Let’s see the created variable.
summary(data["interest_rate"]) interest_rate
Min. : 0.7526
1st Qu.:13.5661
Median :17.2740
Mean : Inf
3rd Qu.:23.5880
Max. : Inf
NA's :10
There is “Inf” value. This means some observations have “0” in loans and the calculation of division produced infinite values. In this case, the calculation of statistics produce the meaningless results as “Inf”. You can modify it by using following code. We can see that the above-mentioned problems were solved
data$interest_rate[is.infinite(data$interest_rate)] <- NA
summary(data["interest_rate"]) interest_rate
Min. : 0.7526
1st Qu.: 13.5578
Median : 17.2485
Mean : 21.1732
3rd Qu.: 23.4287
Max. :1300.0000
NA's :13
Next, let’s plot this implicit interest rates by individual MFIs.
For this plot, it is
df <- subset(data, year == 2016)
ggplot(df, aes(x = Abbreviation, y = interest_rate)) +
geom_bar(stat = "identity") +
labs(title = "Interest Rates by MFI in 2016",
x = "MFI",
y = "Interest Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))You can also reorder the MFIs from low to high in interest rates as follow:
df <- subset(data, year == 2016)
ggplot(df, aes(x = reorder(Abbreviation, interest_rate), y = interest_rate)) +
geom_bar(stat = "identity") +
labs(title = "Interest Rates by MFI in 2016",
x = "MFI",
y = "Interest Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))You can also compare the interest rate between 2016 and 2017 for each MFI. position = "dodge" is used to place the bars for different years side by side for each bank.
data_filtered <- subset(data, year %in% c(2016, 2017))
# Calculate the order of banks based on the interest rate in 2016
order_banks <- data_filtered[data_filtered$year == 2016, ]
order_banks <- order_banks[order(order_banks$interest_rate), ]
ordered_levels <- order_banks$Abbreviation
# Convert the bank column to a factor with ordered levels
data_filtered$Abbreviation <- factor(data_filtered$Abbreviation, levels = ordered_levels)
# Create the grouped bar plot
ggplot(data_filtered, aes(x = Abbreviation, y = interest_rate, fill = factor(year))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Interest Rates by Bank in 2016 and 2017",
x = "MFI",
y = "Interest Rate",
fill = "Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))Some MFIs show the increase in interest rates from 2016 to 2017, although there is a regulation of the interest rate cap. Why? Consider the problems in the measure of “implicit interest rate”. And propose another approach for analysis of impact of interest rate cap policy.
Hint: There could be multiple issues in the analysis. Consider as many as possible. One of the problems is based on the context of micro lending in Cambodia. And others are based on the definition of the variable.
Aiba, Daiju and Sovvanroeun Samreth and Sothearoath Oeur and Vanndy Vat (2021) “Impact of Interest Rate Cap Policies on the Lending Behavior of Microfinance Institutions: Evidence from Millions of Observations in the Credit Registry Database” JICA Ogata Research Institute Working Paper Series, No. 224, Available at [link] https://ssrn.com/abstract=4026281
Heng, Dyna and Chea, Serey and Heng, Bomakara(April 1, 2021), “Impacts of Interest Rate Cap on Financial Inclusion in Cambodia” . IMF Working Paper No. 2021/107, Available at [link]
Footnotes
Technically speaking,
library()is the function to import a set of functions into the PC memory. Every time you clear memory by ending R Studio or shutting down your PC, memory is refreshed, and the all the functions and data in the memory are gone.↩︎This function is built-in in R↩︎
For the detail, please see the following document.
https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html↩︎
In most cases, variables follow an exponential distribution. In such instances, applying the logarithm function to the variable can smooth the distribution, making it more closely resemble a normal distribution.↩︎