R Tutorial for Data Analysis (1st Week Session)

Author

Daiju Aiba

Published

July 25, 2024

How to start R

Before analysis, we should set the working directory. We can choose a folder from local folders of your PC. For choosing the folder, select [Session] button, and then select [Set Working Directory]. Or you can choose the folder by using setwd().

setwd("D:/OneDrive/Document/GitHub/Analysis-on-Dollarization-in-Cambodia/R Project")

In R, we first need to install packages to perform analysis. The packages can be download and installed by using the function install.packages(). Once we install packages, we do not need to install packages every time.

install.packages('tidyverse') # This package is for efficient data management and importing STATA data into R. 
install.packages("plotly") # This package is for making a interactive figures. 
install.packages("arm")  # This package is for visualizing the regression results in figures. 
install.packages("render")  # This package is for visualizing the regression results in figures.}

After we download and install the packages, we can call the functions in the packages by using library(). You need to call libraries (packages) of functions for your analysis every time we restart analysis using R. 1

library(plotly) 
library(tidyverse) 
library(haven)  # This package is included in "tidyverse". 

How to import data into R

For importing a dataset, there are various functions in R. If your original data set is in Excel format, you need to use read_excel() from readxl package. If your original data set is in CSV format, you need to use read.csv().2 If you want to import dataset in STATA format, use read_dta() from haven package.

library(readxl)

file_path <- "Data/NBC_MFI.xlsx"
data <- read_excel(file_path, sheet = "Sheet1", range = "A3:BY842")
file_path <- "Data.csv"
data <- read.csv(file_path, header = TRUE, sep = ",")
library(haven)

file_path <- "Data.dta"
df <- read_dta(file_path)

Descriptive statistics

Once you import the data, you may want to calculate the sample statistics of the dataset. Usually, we need to see mean, median, several quantile values, and standard deviation of the variables in the dataset. Those sample statistics are defined as follow.

Mean

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

Median

Med(X) = \begin{cases} X_{(n+1)/2} & \text{if n is odd} \\ \frac{1}{2} (X_{n/2} + X_{n/2 + 1}) & \text{if n is even} \end{cases}

Quantile

Q_q(X) = \begin{cases} X_{(n+1)q} & \text{if nq is not an integer} \\ X_{nq} & \text{if nq is an integer} \end{cases}

Variance Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

Standard Deviation sd(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}

summary() function can be used for calculating several sample statistics for the variables in the dataset. For calculating the sample statistics of the specific variable, you can use the following codes.

summary(data["m_asset1"])  
    m_asset1       
 Min.   :     504  
 1st Qu.:    9340  
 Median :   26628  
 Mean   :  310135  
 3rd Qu.:   85926  
 Max.   :19725521  
 NA's   :6         

Or you can select variables using “$”

summary(data$m_asset1) 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
     504     9340    26628   310135    85926 19725521        6 

If you input the whole dataset, the R will return descriptive statistics for all the variables in the dataset.

summary(data) 

If you want to make a summary statistics for more than one variables at once. You should type the following codes.

summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ])
    m_asset1           m_asset2         m_asset3         m_operation1      
 Min.   :     504   Min.   :     62   Length:839         Length:839        
 1st Qu.:    9340   1st Qu.:   1049   Class :character   Class :character  
 Median :   26628   Median :   3636   Mode  :character   Mode  :character  
 Mean   :  310135   Mean   :  46354                                        
 3rd Qu.:   85926   3rd Qu.:  11034                                        
 Max.   :19725521   Max.   :2438765                                        
 NA's   :6          NA's   :5                                              

You can see that variables “asset1” and “asset2” are imported as “character” (or we say “string”). The calculation of statistics require the variables in the format of “numeric”. In this case, you have to convert the variables into “numeric” by using the following codes.

Then, you can see that the sample statistics of asset1 and asset 2 are shown in the result window.

data$m_asset3 <- as.numeric(data$m_asset3)  
data$m_operation1 <- as.numeric(data$m_operation1)  
summary(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ])
    m_asset1           m_asset2          m_asset3         m_operation1    
 Min.   :     504   Min.   :     62   Min.   :       0   Min.   :      0  
 1st Qu.:    9340   1st Qu.:   1049   1st Qu.:    6338   1st Qu.:   1096  
 Median :   26628   Median :   3636   Median :   21502   Median :   3674  
 Mean   :  310135   Mean   :  46354   Mean   :  258201   Mean   :  41521  
 3rd Qu.:   85926   3rd Qu.:  11034   3rd Qu.:   70182   3rd Qu.:  12150  
 Max.   :19725521   Max.   :2438765   Max.   :17730347   Max.   :2371615  
 NA's   :6          NA's   :5         NA's   :8          NA's   :8        

If you want to calculate sample statistics for the specific sub-sample, you can specify the sub-sample by setting conditions as follow.

M <- data$m_asset1[data$m_asset2 >12000]  
summary(M)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
   12632   109755   260771  1195924   877337 19725521        5 

The summary() function provides the mean and quantile statistics in the default setting. If you want to calculate the stantard deviation, you can use the following codes. In the default settion, the function will provides NA if there is “NA” value in a variable.

sd(data$m_asset1, na.rm = TRUE) 
[1] 1398422
mean(data$m_asset1, na.rm = TRUE)  # For calculating mean, we have this function.
[1] 310134.6
quantile(data$m_asset1, 0.25, na.rm = TRUE) # For quantile statistics, we have this function.
 25% 
9340 

You can also see the correlation between variables as follow. (Some variables include “NA” in values. In this case, we need to write the option “use” )

cor(data[c('m_asset1', 'm_asset2', 'm_asset3', "m_operation1") ], use = "complete.obs")
              m_asset1  m_asset2  m_asset3 m_operation1
m_asset1     1.0000000 0.9647632 0.9988622    0.9940217
m_asset2     0.9647632 1.0000000 0.9513290    0.9566454
m_asset3     0.9988622 0.9513290 1.0000000    0.9931728
m_operation1 0.9940217 0.9566454 0.9931728    1.0000000

You can also create a table for descriptive statistics using describe() from psych package .

library(psych)

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha
describe(data)
                         vars   n      mean         sd   median  trimmed
id                          1 838    247.50      33.60   237.50   245.35
Abbreviation                2 837     96.11      58.40    93.00    95.62
mame                        3 675    109.00      65.55   106.00   108.98
specialization              4 839      1.87       0.35     2.00     1.96
year                        5 839   2016.48       4.26  2017.00  2016.85
staff                       6 832    382.28    1040.87    63.50   125.10
total_office                7 429     38.86      56.14    12.00    27.21
...8                        8 402      1.00       0.00     1.00     1.00
...9                        9 402      7.75       8.79     2.00     6.39
...10                      10 402      9.67      11.54     2.00     7.77
n_province                 11 313      9.41       8.23     7.00     8.56
n_district                 12 313     59.51      64.25    30.00    50.07
n_commune                  13 313    394.53     609.35   144.00   274.45
n_villege                  14 313   2125.22    3693.27   391.00  1210.29
head_office_capital        15 161      0.95       0.22     1.00     1.00
head_office_province       16 159      0.05       0.22     0.00     0.00
Branch_office_capital      17 161      7.26       6.18     5.00     6.40
Branch_office_provincial   18 161     50.46      58.01    20.00    40.49
foreign_share              19 838     39.57      41.94    17.00    37.09
Amount_paid_foreign        20 837     96.81     114.73    28.00    80.88
local_share                21 834     52.77      35.84    81.50    54.82
Amount_paid_local          22 834    178.15     149.72   163.50   171.42
paid_up_capital            23 836  29499.55   98116.31 10188.00 13659.17
m_asset1                   24 833 310134.56 1398421.72 26628.00 54310.86
m_asset2                   25 834  46354.06  205324.96  3636.50  7331.67
m_asset3                   26 831 258201.43 1189716.51 21502.00 44062.32
m_asset4                   27 834    262.22     189.06   252.50   256.51
m_asset5                   28 774     21.99      36.21     3.00    13.90
m_asset6                   29 834    363.99     208.50   371.50   364.35
m_asset7                   30 827    128.88     125.13    79.00   115.07
m_liability1               31 834 309731.39 1397621.84 26633.50 54197.01
m_liability2               32 759      7.66      18.21     2.00     2.22
m_liability3               33 763     15.21      33.16     2.00     5.61
m_liability4               34 763     11.66      26.45     2.00     3.79
m_liability5               35 825    214.88     169.29   199.00   206.79
m_liability6               36 830    217.78     165.64   213.50   211.03
m_liability7               37 828    270.55     215.89   255.50   259.89
m_liability8               38 804     20.47      31.11    10.00    12.25
m_liability9               39 823     99.37      91.23    43.00    87.65
m_liability10              40 824     93.65     106.85    16.00    78.06
m_liability11              41 834  19828.29   64603.87  6113.00  9046.90
m_liability12              42 792      5.67      12.75     2.00     1.93
m_liability13              43 790      6.63      14.93     2.00     2.18
m_liability14              44 807     39.59      63.98     5.00    25.27
m_liability15              45 834  15184.04   70765.94    76.00  2162.53
m_operation1               46 831  41520.65  168028.51  3674.00  8108.50
m_operation2               47 832    263.28     210.56   249.50   252.84
m_operation3               48 834    404.76     235.35   403.50   404.50
m_operation4               49 834    326.53     208.37   321.50   322.21
m_operation5               50 591     -4.50     238.75     0.00     1.04
m_operation6               51 591     62.52     191.84     0.00    10.95
m_operation7               52 834    407.67     236.85   406.50   407.53
m_operation8               53 592   9700.18   26393.14  1064.71  2915.48
m_operation9               54 592      3.59      68.26     0.00     0.00
m_operation10              55 834    391.34     234.78   390.50   391.36
m_operation11              56 591     72.33     291.32     0.00     8.17
m_operation12              57 591     34.68     242.21     0.00     1.07
m_operation13              58 832    286.06     192.10   266.50   278.51
m_operation14              59 670    328.50     190.02   326.50   328.26
m_operation15              60 594     67.15     276.23     0.00     2.26
m_operation16              61 791      9.28      20.22     3.00     3.27
m_operation17              62 591      4.55     111.10     0.00     0.00
m_operation18              63 834    406.41     235.09   404.50   406.24
m_operation19              64 833    198.12     173.10   173.00   186.20
m_operation20              65 834    406.35     233.87   408.50   406.49
m_npl1                     66 834    267.20     206.38   255.50   258.85
m_npl2                     67 816    220.00     190.77   196.50   207.21
m_portfolio1               68 837      2.14       2.68     2.00     1.79
m_portfolio2               69 834    249.43     208.06   229.50   237.51
...70                      70 837      8.90      21.44     2.00     2.42
m_portfolio3               71 834    263.68     214.82   247.50   252.26
m_portfolio4               72 833    202.70     191.21   166.00   186.42
m_portfolio5               73 833    152.33     160.69    97.00   134.71
m_portfolio6               74 834    159.96     168.77   102.50   141.30
m_portfolio7               75 834    266.16     213.76   251.50   255.64
m_portfolio8               76 827    184.65     179.64   144.00   168.31
m_total_credit             77 834    409.12     240.22   409.50   409.21
                              mad       min         max       range  skew
id                          34.84    201.00      324.00      123.00  0.51
Abbreviation                74.13      1.00      198.00      197.00  0.07
mame                        84.51      1.00      219.00      218.00  0.02
specialization               0.00      1.00        3.00        2.00 -1.99
year                         4.45   2006.00     2022.00       16.00 -0.67
staff                       71.91      0.00     9573.00     9573.00  5.15
total_office                16.31      0.00      374.00      374.00  1.96
...8                         0.00      1.00        1.00        0.00   NaN
...9                         1.48      1.00       28.00       27.00  0.93
...10                        1.48      1.00       39.00       38.00  1.00
n_province                   7.41      0.00       25.00       25.00  0.74
n_district                  32.62      0.00      203.00      203.00  1.13
n_commune                  174.95      0.00     6091.00     6091.00  3.51
n_villege                  524.84      0.00    13916.00    13916.00  1.91
head_office_capital          0.00      0.00        1.00        1.00 -4.11
head_office_province         0.00      0.00        1.00        1.00  4.08
Branch_office_capital        4.45      1.00       24.00       23.00  1.03
Branch_office_provincial    23.72      1.00      352.00      351.00  1.90
foreign_share               23.72      1.00      102.00      101.00  0.44
Amount_paid_foreign         40.03      1.00      353.00      352.00  0.81
local_share                 14.08      1.00      102.00      101.00 -0.46
Amount_paid_local          237.96      1.00      431.00      430.00  0.16
paid_up_capital           8836.30    250.00  1646800.00  1646550.00 10.49
m_asset1                 30433.33    504.00 19725521.00 19725017.00  8.94
m_asset2                  4531.57     62.00  2438765.00  2438703.00  8.15
m_asset3                 26719.42      0.00 17730347.00 17730347.00  9.30
m_asset4                   253.52      1.00      601.00      600.00  0.14
m_asset5                     0.00      1.00      134.00      133.00  1.64
m_asset6                   263.90      1.00      726.00      725.00 -0.02
m_asset7                    99.33      1.00      403.00      402.00  0.69
m_liability1             30468.91      0.00 19725521.00 19725521.00  8.94
m_liability2                 0.00      1.00       95.00       94.00  3.23
m_liability3                 0.00      1.00      144.00      143.00  2.47
m_liability4                 0.00      1.00      123.00      122.00  2.74
m_liability5               241.66      1.00      527.00      526.00  0.21
m_liability6               227.58      1.00      522.00      521.00  0.15
m_liability7               302.45      1.00      666.00      665.00  0.22
m_liability8                 0.00      1.00      138.00      137.00  2.40
m_liability9                34.10      1.00      321.00      320.00  1.02
m_liability10               22.24      1.00      344.00      343.00  0.95
m_liability11             9060.84      0.01   937250.00   937249.99  9.79
m_liability12                0.00      1.00       72.00       71.00  3.42
m_liability13                0.00      1.00       79.00       78.00  3.17
m_liability14                0.00      1.00      228.00      227.00  1.62
m_liability15             1710.34 -24721.00  1143068.00  1167789.00  9.23
m_operation1              4736.91      0.00  2371615.00  2371615.00  8.38
m_operation2               298.00      1.00      648.00      647.00  0.21
m_operation3               303.19      1.00      813.00      812.00  0.01
m_operation4               280.95      1.00      692.00      691.00  0.09
m_operation5                 0.00  -1808.45     1695.18     3503.63 -2.10
m_operation6                 0.00     -3.09     1666.00     1669.09  4.47
m_operation7               304.67      1.00      818.00      817.00  0.01
m_operation8              1578.54      0.00   223785.38   223785.38  4.51
m_operation9                 0.00      0.00     1653.00     1653.00 23.84
m_operation10              301.71      1.00      797.00      796.00  0.00
m_operation11                0.00      0.00     2543.00     2543.00  5.49
m_operation12                0.00      0.00     5167.36     5167.36 16.91
m_operation13              269.83      1.00      643.00      642.00  0.23
m_operation14              242.41      1.00      659.00      658.00  0.01
m_operation15                0.00  -1196.00     2510.52     3706.52  4.33
m_operation16                0.00      1.00      103.00      102.00  3.08
m_operation17                0.00  -1476.22     1397.48     2873.70 -1.89
m_operation18              299.49      1.00      813.00      812.00  0.01
m_operation19              249.08      1.00      521.00      520.00  0.33
m_operation20              299.49      1.00      809.00      808.00 -0.01
m_npl1                     286.88      1.00      638.00      637.00  0.17
m_npl2                     280.95      1.00      580.00      579.00  0.32
m_portfolio1                 0.00      1.00       27.00       26.00  6.66
m_portfolio2               300.23      1.00      631.00      630.00  0.26
...70                        0.00      1.00      109.00      108.00  3.12
m_portfolio3               304.67      1.00      659.00      658.00  0.23
m_portfolio4               243.15      1.00      575.00      574.00  0.43
m_portfolio5               140.85      1.00      485.00      484.00  0.60
m_portfolio6               149.00      1.00      510.00      509.00  0.60
m_portfolio7               301.71      1.00      655.00      654.00  0.21
m_portfolio8               210.53      1.00      539.00      538.00  0.47
m_total_credit             309.12      1.00      824.00      823.00  0.00
                         kurtosis       se
id                          -1.00     1.16
Abbreviation                -1.20     2.02
mame                        -1.23     2.52
specialization               2.62     0.01
year                        -0.41     0.15
staff                       32.73    36.09
total_office                 4.32     2.71
...8                          NaN     0.00
...9                        -0.63     0.44
...10                       -0.43     0.58
n_province                  -0.84     0.47
n_district                  -0.20     3.63
n_commune                   23.35    34.44
n_villege                    2.24   208.76
head_office_capital         14.95     0.02
head_office_province        14.70     0.02
Branch_office_capital        0.10     0.49
Branch_office_provincial     4.91     4.57
foreign_share               -1.61     1.45
Amount_paid_foreign         -0.82     3.97
local_share                 -1.58     1.24
Amount_paid_local           -1.51     5.18
paid_up_capital            134.91  3393.42
m_asset1                    96.69 48452.44
m_asset2                    78.54  7109.82
m_asset3                   105.85 41270.81
m_asset4                    -1.26     6.55
m_asset5                     1.24     1.30
m_asset6                    -1.18     7.22
m_asset7                    -0.93     4.35
m_liability1                96.81 48395.69
m_liability2                 9.58     0.66
m_liability3                 4.87     1.20
m_liability4                 6.41     0.96
m_liability5                -1.31     5.89
m_liability6                -1.30     5.75
m_liability7                -1.31     7.50
m_liability8                 4.59     1.10
m_liability9                -0.35     3.18
m_liability10               -0.55     3.72
m_liability11              117.35  2237.05
m_liability12               10.97     0.45
m_liability13                9.18     0.53
m_liability14                1.17     2.25
m_liability15              110.72  2450.42
m_operation1                87.74  5828.84
m_operation2                -1.32     7.30
m_operation3                -1.21     8.15
m_operation4                -1.31     7.22
m_operation5                27.29     9.82
m_operation6                23.62     7.89
m_operation7                -1.21     8.20
m_operation8                23.81  1084.75
m_operation9               572.78     2.81
m_operation10               -1.21     8.13
m_operation11               31.58    11.98
m_operation12              341.73     9.96
m_operation13               -1.29     6.66
m_operation14               -1.20     7.34
m_operation15               27.41    11.33
m_operation16                8.56     0.72
m_operation17              120.11     4.57
m_operation18               -1.20     8.14
m_operation19               -1.32     6.00
m_operation20               -1.20     8.10
m_npl1                      -1.33     7.15
m_npl2                      -1.30     6.68
m_portfolio1                47.13     0.09
m_portfolio2                -1.33     7.20
...70                        8.78     0.74
m_portfolio3                -1.31     7.44
m_portfolio4                -1.25     6.62
m_portfolio5                -1.12     5.57
m_portfolio6                -1.10     5.84
m_portfolio7                -1.33     7.40
m_portfolio8                -1.23     6.25
m_total_credit              -1.21     8.32

Checking mission values in data

There is the package naniar to visualize missing values in a dataset.3 Here’s how you can use the naniar package in R to visualize missing data. If you want to know how many missing values in each variable, you can use gg_miss_var().

library(naniar)
gg_miss_var(data)

If you want to plot the percentages of missing values to total number of observations, the following code will be helpful.

gg_miss_var(data, show_pct = TRUE)

If you want to know which observations including missing values in many variables, it is helpful to plot the pattern of missing values as follow.

vis_miss(data)

If you want more advanced visualizations, you can use the VIM package.

library(VIM)
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

Attaching package: 'VIM'
The following object is masked from 'package:datasets':

    sleep
matrixplot(data)

aggr(data, col = c('navyblue', 'yellow'), numbers = TRUE, sortVars = TRUE, labels = names(data), cex.axis = 0.7, gap = 3, ylab = c("Missing data", "Pattern"))
Warning in plot.aggr(res, ...): not enough vertical space to display
frequencies (too many combinations)


 Variables sorted by number of missings: 
                 Variable       Count
     head_office_province 0.810488677
      head_office_capital 0.808104887
    Branch_office_capital 0.808104887
 Branch_office_provincial 0.808104887
               n_province 0.626936830
               n_district 0.626936830
                n_commune 0.626936830
                n_villege 0.626936830
                     ...8 0.520858164
                     ...9 0.520858164
                    ...10 0.520858164
             total_office 0.488676996
             m_operation5 0.295589988
             m_operation6 0.295589988
            m_operation11 0.295589988
            m_operation12 0.295589988
            m_operation17 0.295589988
             m_operation8 0.294398093
             m_operation9 0.294398093
            m_operation15 0.292014303
            m_operation14 0.201430274
                     mame 0.195470799
             m_liability2 0.095351609
             m_liability3 0.090584029
             m_liability4 0.090584029
                 m_asset5 0.077473182
            m_liability13 0.058402861
            m_operation16 0.057210965
            m_liability12 0.056019070
             m_liability8 0.041716329
            m_liability14 0.038140644
                   m_npl2 0.027413588
             m_liability9 0.019070322
            m_liability10 0.017878427
             m_liability5 0.016686532
                 m_asset7 0.014302741
             m_portfolio8 0.014302741
             m_liability7 0.013110846
             m_liability6 0.010727056
                 m_asset3 0.009535161
             m_operation1 0.009535161
                    staff 0.008343266
             m_operation2 0.008343266
            m_operation13 0.008343266
                 m_asset1 0.007151371
            m_operation19 0.007151371
             m_portfolio4 0.007151371
             m_portfolio5 0.007151371
              local_share 0.005959476
        Amount_paid_local 0.005959476
                 m_asset2 0.005959476
                 m_asset4 0.005959476
                 m_asset6 0.005959476
             m_liability1 0.005959476
            m_liability11 0.005959476
            m_liability15 0.005959476
             m_operation3 0.005959476
             m_operation4 0.005959476
             m_operation7 0.005959476
            m_operation10 0.005959476
            m_operation18 0.005959476
            m_operation20 0.005959476
                   m_npl1 0.005959476
             m_portfolio2 0.005959476
             m_portfolio3 0.005959476
             m_portfolio6 0.005959476
             m_portfolio7 0.005959476
           m_total_credit 0.005959476
          paid_up_capital 0.003575685
             Abbreviation 0.002383790
      Amount_paid_foreign 0.002383790
             m_portfolio1 0.002383790
                    ...70 0.002383790
                       id 0.001191895
            foreign_share 0.001191895
           specialization 0.000000000
                     year 0.000000000

Making graphs

The next step of the analysis is visualizing the data. The package of “ggplot2” is usefule to make professional graphs.

Most used graph is histogram, scatter plot, box plot, line plot and bar plot. Let’s see how you can make those graphs in R.

Scatterplot

Scatter plot can be created using the function of geom_point() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.

ggplot(data, aes(x = m_asset1, y= staff)) + geom_point()

In R, you can use mathematical function in the codes of graphs. For example, you can use the log function in the graph as follow.4

ggplot(data, aes(x = log(m_asset1), y= log(staff))) + geom_point()

Histogram

For making histgram, you can use the function of geom_histogram() from ggplot2 package. The function aes() is used to specify the variable for x axis.

ggplot(data, aes(x = log(m_asset1))) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

Box plot

For box plot, you can use the function of geom_boxplot() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.

ggplot(data, aes(x = as.factor(year), y= log(m_asset1))) + geom_boxplot()

Bar plot

For bar plot, you can use the function of geom_bar() from ggplot2 package. The function aes() is used to specify the variables for x and y axis. For bar plot, aggregated data is suitable to use. Thus, we add codes for aggregation by year variable.

data2 <- data %>%
  group_by(year) %>%
  summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))

graph <- ggplot(data2, aes(x = as.factor(year), y= m_asset1)) + geom_bar(stat = "identity")
graph 

Line plot

For line plot, you can use the function of geom_line() from ggplot2 package. The function aes() is used to specify the variables for x and y axis.

data2 <- data %>%
  group_by(year) %>%
  summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))

graph <- ggplot(data2, aes(x = year, y= m_asset1)) + geom_line()
graph 

Interactive graphs

For making advanced figures, the “Plotly” package gives you a various functions for making interactive graphics. For example, here are the codes for the interactive graph.

fig <- plot_ly(data, x = ~log(m_asset1), y = ~log(staff), type = "scatter") 
fig

Interaction graph can be also created using the function of ggplotly() from the plotly library. Once you create the graph using ggplot(), you can convert it into interactive graph. The interactive figures empower you to explore data in detail.

p <- ggplot(data, aes(x = log(m_asset1), y= log(staff))) + 
  geom_point(size = 1, aes(text = paste("MFI Name:", Abbreviation, "\n Year: ", year )))  +
  theme_minimal()

ggplotly(p)

We can also create an interactive graph with a range slider. This is useful to see the time series data. A range slider is a small subplot-like area below a plot which allows users to pan and zoom the X-axis while maintaining an overview of the chart. Check out the reference for more options: https://plotly.com/r/range-slider/

library(plotly)

data2 <- data %>%
  group_by(year) %>%
  summarise(m_asset1 = mean(m_asset1, na.rm = TRUE))

fig <- plot_ly(data2, type = 'scatter', mode = 'lines')%>%
  add_trace(x = ~year, y = ~m_asset1)%>%
  layout(showlegend = F, title='Time Series with Rangeslider',
         xaxis = list(rangeslider = list(visible = T)))
fig <- fig %>%
  layout(
         xaxis = list(zerolinecolor = '#ffff',
                      zerolinewidth = 2,
                      gridcolor = 'ffff'),
         yaxis = list(zerolinecolor = '#ffff',
                      zerolinewidth = 2,
                      gridcolor = 'ffff'),
         plot_bgcolor='#e5ecf6', width = 900)
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
fig

Analysis on the impact of interest rate cap policy

You can also create a new variable from variables in the data. For example, if you want to create average implicit interest rate of each individual bank , you can calculate it as follow:

Interest Rate (\%) = \frac{Interest Income}{Loans} *100

In R, you can create new variables in data in several ways. One of the good ways to create new variables from readability of R code is using mutate() function from dplyr package.

library(dplyr)
data <- data |>
    mutate(interest_rate = m_operation1/m_asset3*100)

Or you can also create the new variable as follow:

data["interest_rate"] <- data["m_operation1"]/data["m_asset3"]*100

Let’s see the created variable.

summary(data["interest_rate"])
 interest_rate    
 Min.   : 0.7526  
 1st Qu.:13.5661  
 Median :17.2740  
 Mean   :    Inf  
 3rd Qu.:23.5880  
 Max.   :    Inf  
 NA's   :10       

There is “Inf” value. This means some observations have “0” in loans and the calculation of division produced infinite values. In this case, the calculation of statistics produce the meaningless results as “Inf”. You can modify it by using following code. We can see that the above-mentioned problems were solved

data$interest_rate[is.infinite(data$interest_rate)] <- NA


summary(data["interest_rate"])
 interest_rate      
 Min.   :   0.7526  
 1st Qu.:  13.5578  
 Median :  17.2485  
 Mean   :  21.1732  
 3rd Qu.:  23.4287  
 Max.   :1300.0000  
 NA's   :13         

Next, let’s plot this implicit interest rates by individual MFIs.

For this plot, it is

df <- subset(data, year == 2016)
ggplot(df, aes(x = Abbreviation, y = interest_rate)) +
  geom_bar(stat = "identity") +
  labs(title = "Interest Rates by MFI in 2016",
       x = "MFI",
       y = "Interest Rate") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))

You can also reorder the MFIs from low to high in interest rates as follow:

df <- subset(data, year == 2016)
ggplot(df, aes(x = reorder(Abbreviation, interest_rate), y = interest_rate)) +
  geom_bar(stat = "identity") +
  labs(title = "Interest Rates by MFI in 2016",
       x = "MFI",
       y = "Interest Rate") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))

You can also compare the interest rate between 2016 and 2017 for each MFI. position = "dodge" is used to place the bars for different years side by side for each bank.

data_filtered <- subset(data, year %in% c(2016, 2017))

# Calculate the order of banks based on the interest rate in 2016
order_banks <- data_filtered[data_filtered$year == 2016, ]
order_banks <- order_banks[order(order_banks$interest_rate), ]
ordered_levels <- order_banks$Abbreviation

# Convert the bank column to a factor with ordered levels
data_filtered$Abbreviation <- factor(data_filtered$Abbreviation, levels = ordered_levels)

# Create the grouped bar plot
ggplot(data_filtered, aes(x = Abbreviation, y = interest_rate, fill = factor(year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Interest Rates by Bank in 2016 and 2017",
       x = "MFI",
       y = "Interest Rate",
       fill = "Year") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0,size = 5))

Question

Some MFIs show the increase in interest rates from 2016 to 2017, although there is a regulation of the interest rate cap. Why? Consider the problems in the measure of “implicit interest rate”. And propose another approach for analysis of impact of interest rate cap policy.

Hint: There could be multiple issues in the analysis. Consider as many as possible. One of the problems is based on the context of micro lending in Cambodia. And others are based on the definition of the variable.

Related papers
  • Aiba, Daiju and Sovvanroeun Samreth and Sothearoath Oeur and Vanndy Vat (2021) “Impact of Interest Rate Cap Policies on the Lending Behavior of Microfinance Institutions: Evidence from Millions of Observations in the Credit Registry Database” JICA Ogata Research Institute Working Paper Series, No. 224, Available at [link] https://ssrn.com/abstract=4026281

  • Heng, Dyna and Chea, Serey and Heng, Bomakara(April 1, 2021), “Impacts of Interest Rate Cap on Financial Inclusion in Cambodia” . IMF Working Paper No. 2021/107, Available at [link]

Footnotes

  1. Technically speaking, library() is the function to import a set of functions into the PC memory. Every time you clear memory by ending R Studio or shutting down your PC, memory is refreshed, and the all the functions and data in the memory are gone.↩︎

  2. This function is built-in in R↩︎

  3. For the detail, please see the following document.

    https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html↩︎

  4. In most cases, variables follow an exponential distribution. In such instances, applying the logarithm function to the variable can smooth the distribution, making it more closely resemble a normal distribution.↩︎