Mahdiar Khosravi

STAT-545A hw#05
October.07.2013

Source Data

I decided to work on a new data set for this homework. I picked a data set from OECD.Stat Extracts about Bank Profitability Statistics under its Finance subsection.

Data Import

Used libraries:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.2
library(xtable)
gDat <- read.delim(file="BankProfitabilityStatistics.csv")

Basic sanity check:

str(gDat)
## 'data.frame':    50689 obs. of  5 variables:
##  $ Country     : Factor w/ 33 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bank        : Factor w/ 7 levels "All banks","Co-operative banks",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Item        : Factor w/ 49 levels "1. Interest income",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year        : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ Value..Flags: num  22333 27413 26867 23522 20846 ...
names(gDat)
## [1] "Country"      "Bank"         "Item"         "Year"        
## [5] "Value..Flags"
levels(gDat$Country)
##  [1] "Austria"            "Belgium"            "Canada"            
##  [4] "Chile"              "Czech Republic"     "Denmark"           
##  [7] "Estonia"            "Finland"            "France"            
## [10] "Germany"            "Greece"             "Hungary"           
## [13] "Ireland"            "Israel"             "Italy"             
## [16] "Japan"              "Korea"              "Luxembourg"        
## [19] "Mexico"             "Netherlands"        "New Zealand"       
## [22] "Norway"             "Poland"             "Portugal"          
## [25] "Russian Federation" "Slovak Republic"    "Slovenia"          
## [28] "Spain"              "Sweden"             "Switzerland"       
## [31] "Turkey"             "United Kingdom"     "United States"
levels(gDat$Bank)
## [1] "All banks"                                
## [2] "Co-operative banks"                       
## [3] "Commercial banks"                         
## [4] "Foreign commercial banks"                 
## [5] "Large commercial banks"                   
## [6] "Other miscellaneous monetary institutions"
## [7] "Savings banks"

As we can see here, this data set presents some information about financial indexes and parameters for different types of banking institutions, in different countries and for 11 years (1999 to 2009).

Data Manipulation

This is long.format data.frame. There are a number of dependent variables and levels which we can omit from the main data.frame.All Bank for example is the summation of the values for different types of banks. I also chose to omit the two levels Foreign commercial banks and Large commercial banks, for consistency reasons.

BankLevels <- c("Co-operative banks","Commercial banks","Other miscellaneous monetary institutions",
                "Savings banks")
iDat <- droplevels(subset(gDat, subset= Bank %in% BankLevels))
str(iDat)
## 'data.frame':    20896 obs. of  5 variables:
##  $ Country     : Factor w/ 20 levels "Canada","Denmark",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bank        : Factor w/ 4 levels "Co-operative banks",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Item        : Factor w/ 49 levels "1. Interest income",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year        : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ Value..Flags: num  76301 85474 86333 68692 66578 ...

The matter of dependency also exists for levels in Item variable; many of the following parameters are acquired by subtracting some previous ones:

levels(iDat$Item)
##  [1] "1. Interest income"                             
##  [2] "10. Income tax"                                 
##  [3] "11. Net income after tax "                      
##  [4] "12. Distributed profit"                         
##  [5] "13. Retained profit "                           
##  [6] "14. Cash and balance with Central bank"         
##  [7] "15. Interbank deposits"                         
##  [8] "16. Loans"                                      
##  [9] "17. Securities"                                 
## [10] "18. Other assets"                               
## [11] "19. Capital and reserves"                       
## [12] "2. Interest expenses"                           
## [13] "20. Borrowing from Central bank"                
## [14] "21. Interbank deposits"                         
## [15] "22. Customer deposits"                          
## [16] "23. Bonds"                                      
## [17] "24. Other liabilities"                          
## [18] "25. End-year total"                             
## [19] "26. Average total"                              
## [20] "27. Short-term securities"                      
## [21] "28. Bonds"                                      
## [22] "29. Shares and participations"                  
## [23] "3. Net interest income "                        
## [24] "30. Claims on non-residents"                    
## [25] "31. Liabilities to non-residents"               
## [26] "32. Tier 1 Capital"                             
## [27] "33. Tier 2 Capital"                             
## [28] "34. Supervisory deductions"                     
## [29] "35. Total regulatory capital"                   
## [30] "36. Risk-weighted assets"                       
## [31] "37. Number of institutions"                     
## [32] "38. Number of branches"                         
## [33] "39. Number of employees"                        
## [34] "4. Net non-interest income"                     
## [35] "4.a. Fees and commissions receivable"           
## [36] "4.b. Fees and commissions payable"              
## [37] "4.c. Net profit or loss on financial operations"
## [38] "4.d. Other net non-interest income"             
## [39] "5. Net interest and non-interest income"        
## [40] "6. Operating expenses"                          
## [41] "6.a. Staff costs"                               
## [42] "6.b. Property costs"                            
## [43] "6.c. Other operating expenses"                  
## [44] "7. Net income before provisions"                
## [45] "8. Net provisions"                              
## [46] "8.a. Provisions on loans"                       
## [47] "8.b. Provisions on securities"                  
## [48] "8.c. Other net provisions"                      
## [49] "9. Income before tax "

As one can imagine from this levels, we have a problem of levels names here, which I did not know how to appropriately cope with. I did not succumb to the temptation of replacing them by reasonable names in the .csv file prior to import! Instead, I decided to reshape the data.frame and change the names whenever needed. The levels here address different categories. And I define some vectors to use later on.

incomeLevels <- c("1. Interest income","2. Interest expenses",
                "4. Net non-interest income","6. Operating expenses",
                "8. Net provisions","10. Income tax",
                "12. Distributed profit","13. Retained profit ")
assetLevels <- c("14. Cash and balance with Central bank","15. Interbank deposits","16. Loans",
                 "17. Securities","18. Other assets")
liabLevels <- c("19. Capital and reserves","20. Borrowing from Central bank",
                "21. Interbank deposits","22. Customer deposits","23. Bonds","24. Other liabilities")
QuantLevels <- c("37. Number of institutions","38. Number of branches","39. Number of employees")

Now I can introduce a data.frame addressing Income Statements as follows:

incomeDat <- droplevels(subset(iDat, Item %in% incomeLevels))
incomeDat <- reshape(incomeDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(incomeDat) <- c("Country","Bank","Year","InterestIncome","InterestExpenses",
                       "NetNonInterestIncome","OperatingExpenses","NetProvisions","IncomeTax",
                       "DistributedProfit","RetainedProfit")
str(incomeDat)
## 'data.frame':    490 obs. of  11 variables:
##  $ Country             : Factor w/ 20 levels "Canada","Denmark",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bank                : Factor w/ 4 levels "Co-operative banks",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year                : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ InterestIncome      : num  76301 85474 86333 68692 66578 ...
##  $ InterestExpenses    : num  51788 59487 56558 35553 33460 ...
##  $ NetNonInterestIncome: num  26931 33016 33851 30884 31789 ...
##  $ OperatingExpenses   : num  33518 38757 42364 43299 43090 ...
##  $ NetProvisions       : num  3177 4310 6436 9432 4123 ...
##  $ IncomeTax           : num  4542 5003 3348 2325 4297 ...
##  $ DistributedProfit   : num  3470 3483 4296 4359 5570 ...
##  $ RetainedProfit      : num  6736 7450 7182 4610 7825 ...
##  - attr(*, "reshapeWide")=List of 5
##   ..$ v.names: NULL
##   ..$ timevar: chr "Item"
##   ..$ idvar  : chr  "Country" "Bank" "Year"
##   ..$ times  : Factor w/ 8 levels "1. Interest income",..: 1 5 6 7 8 2 3 4
##   ..$ varying: chr [1, 1:8] "Value..Flags.1. Interest income" "Value..Flags.2. Interest expenses" "Value..Flags.4. Net non-interest income" "Value..Flags.6. Operating expenses" ...

We can see some rows of this data.frame. From the table below we can notice that some data is missing and this comes from the source data, as far as I could see on the website.

Dumm1 <- incomeDat[sample(nrow(incomeDat), 10),  ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)
Country Bank Year InterestIncome InterestExpenses NetNonInterestIncome OperatingExpenses NetProvisions IncomeTax DistributedProfit RetainedProfit
Finland Commercial banks 2008 10731.30 7832.57 578.16 1590.33 89.43 406.25 1387.70 3.20
Finland Other miscellaneous monetary institutions 2006 639.79 425.76 102.02 142.40 30.70 32.63 4.34 105.97
Finland Co-operative banks 2000 1124.97 372.21 378.86 666.28 57.15 84.27 19.02 304.91
Germany Savings banks 2002 53932.00 30698.00 6174.00 19022.00 6959.00 1471.00 1280.00 676.00
Korea Other miscellaneous monetary institutions 2003 11760000.00 6463000.00 239000.00 2029000.00 2540000.00 168000.00 156000.00 644000.00
Portugal Commercial banks 2004 13292.42 8341.83 2760.60 4235.83 1674.27 145.86
Russian Federation Other miscellaneous monetary institutions 2004 363.50 82.94 1164.63 1037.77 9.91 103.27 291.72
Sweden Savings banks 2007 6222.00 2695.00 2150.00 3401.00 59.00 497.00 26.00 1693.00
Canada Commercial banks 2007 110557.00 74941.00 42650.00 49054.00 3444.00 4746.00 9034.00 11988.00
Poland Co-operative banks 2002 2475.00 985.00 706.00 1550.00 150.00 140.00 24.00 332.00

With this data.frame we can look into the interest income from different types of banking with countries as the grouping variable. Also, the distribution of Operating Expenses values and change of this value by time are illustrated.

ggplot(incomeDat, aes(x= Bank, y= InterestIncome, color=Country)) + 
      geom_jitter(alpha=1/2, position = position_jitter(width = .2)) + scale_y_log10()

plot of chunk unnamed-chunk-10

ggplot(incomeDat, aes(x= OperatingExpenses, fill= Country)) + geom_density() + scale_x_log10()

plot of chunk unnamed-chunk-11


ggplot(incomeDat, aes(x= Year, y= OperatingExpenses, color=Bank)) +
       facet_wrap(~ Country) + geom_point() + scale_y_log10()

plot of chunk unnamed-chunk-11

Surprisingly, the last plot does not show significant changes in expenses by time. Now, I want to do some HR study on banking institutes using another level vector:

HRDat <- droplevels(subset(gDat, Item %in% QuantLevels))
HRDat <- reshape(HRDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(HRDat) <- c("Country","Bank","Year","institutions","branches","employees")

We can take look at some rows of this new data.frame.

Dumm1 <- HRDat[sample(nrow(HRDat), 10),  ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)
Country Bank Year institutions branches employees
United States Commercial banks 2004 7701.00 74519.00 1819664.00
United States Large commercial banks 2007 100.00 45042.00 1391736.00
Denmark All banks 2009 99.00 1760.00 45935.00
Israel Large commercial banks 2000 5.00 1075.00 39251.00
United Kingdom Large commercial banks 2002 41.00 10754.00 322800.00
United States Commercial banks 2002 7975.00 70571.00 1756260.00
Slovak Republic Large commercial banks 2007 6.00 843.00 14723.00
Mexico Foreign commercial banks 2007 18.00 6375.00 101137.00
Turkey Foreign commercial banks 2000 7.00 21.00 755.00
Denmark All banks 2003 101.00 2014.00 41000.00

We can do a different set of investigations using this new set. Here, I illustrate number of employees in different banking institutions in U.S. by year and number of institutions.

ggplot(subset(HRDat, Country == "United States"), aes(x = Year, y = employees, color= Bank,
                size = 1500 *sqrt(institutions/pi))) + geom_point()

plot of chunk unnamed-chunk-14

As we can see here, although Other monetary institutions have a considerable number of institutions, they do not embrace a considerable number of employees and Large commercial banks show a fantastic contribution given their relatively low number of institutions. Another interesting observation here is the reduction in employees after 2007 in Commercial banks, who consistently had the majority of employees.

At the end, I should say, maybe a more professional way of manipulating this data set is avoiding segregating it through level vectors I used here. I just used it as a faster approach to the problem and after I got a headache of trying to prepare the input data due to weird level names!
I am also a bit suspicious about my reshaping and am not sure if I have not messed things using it!