Mahdiar Khosravi

STAT-545A hw#05
October.07.2013

Source Data

I decided to work on a new data set for this homework. I picked a data set from OECD.Stat Extracts about Bank Profitability Statistics under its Finance subsection.

Data Import

Used libraries:

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.0.2

library(xtable)

gDat <- read.delim(file="BankProfitabilityStatistics.csv")

Basic sanity check:

str(gDat)

## 'data.frame':    50689 obs. of  5 variables:
##  $ Country     : Factor w/ 33 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bank        : Factor w/ 7 levels "All banks","Co-operative banks",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Item        : Factor w/ 49 levels "1. Interest income",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year        : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ Value..Flags: num  22333 27413 26867 23522 20846 ...

names(gDat)

## [1] "Country"      "Bank"         "Item"         "Year"        
## [5] "Value..Flags"

levels(gDat$Country)

##  [1] "Austria"            "Belgium"            "Canada"            
##  [4] "Chile"              "Czech Republic"     "Denmark"           
##  [7] "Estonia"            "Finland"            "France"            
## [10] "Germany"            "Greece"             "Hungary"           
## [13] "Ireland"            "Israel"             "Italy"             
## [16] "Japan"              "Korea"              "Luxembourg"        
## [19] "Mexico"             "Netherlands"        "New Zealand"       
## [22] "Norway"             "Poland"             "Portugal"          
## [25] "Russian Federation" "Slovak Republic"    "Slovenia"          
## [28] "Spain"              "Sweden"             "Switzerland"       
## [31] "Turkey"             "United Kingdom"     "United States"

levels(gDat$Bank)

## [1] "All banks"                                
## [2] "Co-operative banks"                       
## [3] "Commercial banks"                         
## [4] "Foreign commercial banks"                 
## [5] "Large commercial banks"                   
## [6] "Other miscellaneous monetary institutions"
## [7] "Savings banks"

As we can see here, this data set presents some information about financial indexes and parameters for different types of banking institutions, in different countries and for 11 years (1999 to 2009).

Data Manipulation

This is long.format data.frame. There are a number of dependent variables and levels which we can omit from the main data.frame.All Bank for example is the summation of the values for different types of banks. I also chose to omit the two levels Foreign commercial banks and Large commercial banks, for consistency reasons.

BankLevels <- c("Co-operative banks","Commercial banks","Other miscellaneous monetary institutions",
                "Savings banks")
iDat <- droplevels(subset(gDat, subset= Bank %in% BankLevels))
str(iDat)

## 'data.frame':    20896 obs. of  5 variables:
##  $ Country     : Factor w/ 20 levels "Canada","Denmark",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bank        : Factor w/ 4 levels "Co-operative banks",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Item        : Factor w/ 49 levels "1. Interest income",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year        : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ Value..Flags: num  76301 85474 86333 68692 66578 ...

The matter of dependency also exists for levels in Item variable; many of the following parameters are acquired by subtracting some previous ones:

levels(iDat$Item)

##  [1] "1. Interest income"                             
##  [2] "10. Income tax"                                 
##  [3] "11. Net income after tax "                      
##  [4] "12. Distributed profit"                         
##  [5] "13. Retained profit "                           
##  [6] "14. Cash and balance with Central bank"         
##  [7] "15. Interbank deposits"                         
##  [8] "16. Loans"                                      
##  [9] "17. Securities"                                 
## [10] "18. Other assets"                               
## [11] "19. Capital and reserves"                       
## [12] "2. Interest expenses"                           
## [13] "20. Borrowing from Central bank"                
## [14] "21. Interbank deposits"                         
## [15] "22. Customer deposits"                          
## [16] "23. Bonds"                                      
## [17] "24. Other liabilities"                          
## [18] "25. End-year total"                             
## [19] "26. Average total"                              
## [20] "27. Short-term securities"                      
## [21] "28. Bonds"                                      
## [22] "29. Shares and participations"                  
## [23] "3. Net interest income "                        
## [24] "30. Claims on non-residents"                    
## [25] "31. Liabilities to non-residents"               
## [26] "32. Tier 1 Capital"                             
## [27] "33. Tier 2 Capital"                             
## [28] "34. Supervisory deductions"                     
## [29] "35. Total regulatory capital"                   
## [30] "36. Risk-weighted assets"                       
## [31] "37. Number of institutions"                     
## [32] "38. Number of branches"                         
## [33] "39. Number of employees"                        
## [34] "4. Net non-interest income"                     
## [35] "4.a. Fees and commissions receivable"           
## [36] "4.b. Fees and commissions payable"              
## [37] "4.c. Net profit or loss on financial operations"
## [38] "4.d. Other net non-interest income"             
## [39] "5. Net interest and non-interest income"        
## [40] "6. Operating expenses"                          
## [41] "6.a. Staff costs"                               
## [42] "6.b. Property costs"                            
## [43] "6.c. Other operating expenses"                  
## [44] "7. Net income before provisions"                
## [45] "8. Net provisions"                              
## [46] "8.a. Provisions on loans"                       
## [47] "8.b. Provisions on securities"                  
## [48] "8.c. Other net provisions"                      
## [49] "9. Income before tax "

As one can imagine from this levels, we have a problem of levels names here, which I did not know how to appropriately cope with. I did not succumb to the temptation of replacing them by reasonable names in the .csv file prior to import! Instead, I decided to reshape the data.frame and change the names whenever needed. The levels here address different categories. And I define some vectors to use later on.

incomeLevels <- c("1. Interest income","2. Interest expenses",
                "4. Net non-interest income","6. Operating expenses",
                "8. Net provisions","10. Income tax",
                "12. Distributed profit","13. Retained profit ")
assetLevels <- c("14. Cash and balance with Central bank","15. Interbank deposits","16. Loans",
                 "17. Securities","18. Other assets")
liabLevels <- c("19. Capital and reserves","20. Borrowing from Central bank",
                "21. Interbank deposits","22. Customer deposits","23. Bonds","24. Other liabilities")
QuantLevels <- c("37. Number of institutions","38. Number of branches","39. Number of employees")

Now I can introduce a data.frame addressing Income Statements as follows:

incomeDat <- droplevels(subset(iDat, Item %in% incomeLevels))
incomeDat <- reshape(incomeDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(incomeDat) <- c("Country","Bank","Year","InterestIncome","InterestExpenses",
                       "NetNonInterestIncome","OperatingExpenses","NetProvisions","IncomeTax",
                       "DistributedProfit","RetainedProfit")
str(incomeDat)

## 'data.frame':    490 obs. of  11 variables:
##  $ Country             : Factor w/ 20 levels "Canada","Denmark",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bank                : Factor w/ 4 levels "Co-operative banks",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Year                : int  1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
##  $ InterestIncome      : num  76301 85474 86333 68692 66578 ...
##  $ InterestExpenses    : num  51788 59487 56558 35553 33460 ...
##  $ NetNonInterestIncome: num  26931 33016 33851 30884 31789 ...
##  $ OperatingExpenses   : num  33518 38757 42364 43299 43090 ...
##  $ NetProvisions       : num  3177 4310 6436 9432 4123 ...
##  $ IncomeTax           : num  4542 5003 3348 2325 4297 ...
##  $ DistributedProfit   : num  3470 3483 4296 4359 5570 ...
##  $ RetainedProfit      : num  6736 7450 7182 4610 7825 ...
##  - attr(*, "reshapeWide")=List of 5
##   ..$ v.names: NULL
##   ..$ timevar: chr "Item"
##   ..$ idvar  : chr  "Country" "Bank" "Year"
##   ..$ times  : Factor w/ 8 levels "1. Interest income",..: 1 5 6 7 8 2 3 4
##   ..$ varying: chr [1, 1:8] "Value..Flags.1. Interest income" "Value..Flags.2. Interest expenses" "Value..Flags.4. Net non-interest income" "Value..Flags.6. Operating expenses" ...

We can see some rows of this data.frame. From the table below we can notice that some data is missing and this comes from the source data, as far as I could see on the website.

Dumm1 <- incomeDat[sample(nrow(incomeDat), 10),  ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)

Country	Bank	Year	InterestIncome	InterestExpenses	NetNonInterestIncome	OperatingExpenses	NetProvisions	IncomeTax	DistributedProfit	RetainedProfit
Finland	Commercial banks	2008	10731.30	7832.57	578.16	1590.33	89.43	406.25	1387.70	3.20
Finland	Other miscellaneous monetary institutions	2006	639.79	425.76	102.02	142.40	30.70	32.63	4.34	105.97
Finland	Co-operative banks	2000	1124.97	372.21	378.86	666.28	57.15	84.27	19.02	304.91
Germany	Savings banks	2002	53932.00	30698.00	6174.00	19022.00	6959.00	1471.00	1280.00	676.00
Korea	Other miscellaneous monetary institutions	2003	11760000.00	6463000.00	239000.00	2029000.00	2540000.00	168000.00	156000.00	644000.00
Portugal	Commercial banks	2004	13292.42	8341.83	2760.60	4235.83	1674.27	145.86
Russian Federation	Other miscellaneous monetary institutions	2004	363.50	82.94	1164.63	1037.77	9.91	103.27		291.72
Sweden	Savings banks	2007	6222.00	2695.00	2150.00	3401.00	59.00	497.00	26.00	1693.00
Canada	Commercial banks	2007	110557.00	74941.00	42650.00	49054.00	3444.00	4746.00	9034.00	11988.00
Poland	Co-operative banks	2002	2475.00	985.00	706.00	1550.00	150.00	140.00	24.00	332.00

With this data.frame we can look into the interest income from different types of banking with countries as the grouping variable. Also, the distribution of Operating Expenses values and change of this value by time are illustrated.

ggplot(incomeDat, aes(x= Bank, y= InterestIncome, color=Country)) + 
      geom_jitter(alpha=1/2, position = position_jitter(width = .2)) + scale_y_log10()

plot of chunk unnamed-chunk-10

ggplot(incomeDat, aes(x= OperatingExpenses, fill= Country)) + geom_density() + scale_x_log10()

plot of chunk unnamed-chunk-11


ggplot(incomeDat, aes(x= Year, y= OperatingExpenses, color=Bank)) +
       facet_wrap(~ Country) + geom_point() + scale_y_log10()

plot of chunk unnamed-chunk-11

Surprisingly, the last plot does not show significant changes in expenses by time. Now, I want to do some HR study on banking institutes using another level vector:

HRDat <- droplevels(subset(gDat, Item %in% QuantLevels))
HRDat <- reshape(HRDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(HRDat) <- c("Country","Bank","Year","institutions","branches","employees")

We can take look at some rows of this new data.frame.

Dumm1 <- HRDat[sample(nrow(HRDat), 10),  ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)

Country	Bank	Year	institutions	branches	employees
United States	Commercial banks	2004	7701.00	74519.00	1819664.00
United States	Large commercial banks	2007	100.00	45042.00	1391736.00
Denmark	All banks	2009	99.00	1760.00	45935.00
Israel	Large commercial banks	2000	5.00	1075.00	39251.00
United Kingdom	Large commercial banks	2002	41.00	10754.00	322800.00
United States	Commercial banks	2002	7975.00	70571.00	1756260.00
Slovak Republic	Large commercial banks	2007	6.00	843.00	14723.00
Mexico	Foreign commercial banks	2007	18.00	6375.00	101137.00
Turkey	Foreign commercial banks	2000	7.00	21.00	755.00
Denmark	All banks	2003	101.00	2014.00	41000.00

We can do a different set of investigations using this new set. Here, I illustrate number of employees in different banking institutions in U.S. by year and number of institutions.

ggplot(subset(HRDat, Country == "United States"), aes(x = Year, y = employees, color= Bank,
                size = 1500 *sqrt(institutions/pi))) + geom_point()

plot of chunk unnamed-chunk-14

As we can see here, although Other monetary institutions have a considerable number of institutions, they do not embrace a considerable number of employees and Large commercial banks show a fantastic contribution given their relatively low number of institutions. Another interesting observation here is the reduction in employees after 2007 in Commercial banks, who consistently had the majority of employees.

At the end, I should say, maybe a more professional way of manipulating this data set is avoiding segregating it through level vectors I used here. I just used it as a faster approach to the problem and after I got a headache of trying to prepare the input data due to weird level names!
I am also a bit suspicious about my reshaping and am not sure if I have not messed things using it!