STAT-545A hw#05
October.07.2013
I decided to work on a new data set for this homework. I picked a data set from OECD.Stat Extracts about Bank Profitability Statistics under its Finance subsection.
Used libraries:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.2
library(xtable)
gDat <- read.delim(file="BankProfitabilityStatistics.csv")
Basic sanity check:
str(gDat)
## 'data.frame': 50689 obs. of 5 variables:
## $ Country : Factor w/ 33 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bank : Factor w/ 7 levels "All banks","Co-operative banks",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Item : Factor w/ 49 levels "1. Interest income",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
## $ Value..Flags: num 22333 27413 26867 23522 20846 ...
names(gDat)
## [1] "Country" "Bank" "Item" "Year"
## [5] "Value..Flags"
levels(gDat$Country)
## [1] "Austria" "Belgium" "Canada"
## [4] "Chile" "Czech Republic" "Denmark"
## [7] "Estonia" "Finland" "France"
## [10] "Germany" "Greece" "Hungary"
## [13] "Ireland" "Israel" "Italy"
## [16] "Japan" "Korea" "Luxembourg"
## [19] "Mexico" "Netherlands" "New Zealand"
## [22] "Norway" "Poland" "Portugal"
## [25] "Russian Federation" "Slovak Republic" "Slovenia"
## [28] "Spain" "Sweden" "Switzerland"
## [31] "Turkey" "United Kingdom" "United States"
levels(gDat$Bank)
## [1] "All banks"
## [2] "Co-operative banks"
## [3] "Commercial banks"
## [4] "Foreign commercial banks"
## [5] "Large commercial banks"
## [6] "Other miscellaneous monetary institutions"
## [7] "Savings banks"
As we can see here, this data set presents some information about financial indexes and parameters for different types of banking institutions, in different countries and for 11 years (1999 to 2009).
This is long.format data.frame. There are a number of dependent variables and levels which we can omit from the main data.frame.All Bank for example is the summation of the values for different types of banks.
I also chose to omit the two levels Foreign commercial banks and Large commercial banks, for consistency reasons.
BankLevels <- c("Co-operative banks","Commercial banks","Other miscellaneous monetary institutions",
"Savings banks")
iDat <- droplevels(subset(gDat, subset= Bank %in% BankLevels))
str(iDat)
## 'data.frame': 20896 obs. of 5 variables:
## $ Country : Factor w/ 20 levels "Canada","Denmark",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bank : Factor w/ 4 levels "Co-operative banks",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Item : Factor w/ 49 levels "1. Interest income",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
## $ Value..Flags: num 76301 85474 86333 68692 66578 ...
The matter of dependency also exists for levels in Item variable; many of the following parameters are acquired by subtracting some previous ones:
levels(iDat$Item)
## [1] "1. Interest income"
## [2] "10. Income tax"
## [3] "11. Net income after tax "
## [4] "12. Distributed profit"
## [5] "13. Retained profit "
## [6] "14. Cash and balance with Central bank"
## [7] "15. Interbank deposits"
## [8] "16. Loans"
## [9] "17. Securities"
## [10] "18. Other assets"
## [11] "19. Capital and reserves"
## [12] "2. Interest expenses"
## [13] "20. Borrowing from Central bank"
## [14] "21. Interbank deposits"
## [15] "22. Customer deposits"
## [16] "23. Bonds"
## [17] "24. Other liabilities"
## [18] "25. End-year total"
## [19] "26. Average total"
## [20] "27. Short-term securities"
## [21] "28. Bonds"
## [22] "29. Shares and participations"
## [23] "3. Net interest income "
## [24] "30. Claims on non-residents"
## [25] "31. Liabilities to non-residents"
## [26] "32. Tier 1 Capital"
## [27] "33. Tier 2 Capital"
## [28] "34. Supervisory deductions"
## [29] "35. Total regulatory capital"
## [30] "36. Risk-weighted assets"
## [31] "37. Number of institutions"
## [32] "38. Number of branches"
## [33] "39. Number of employees"
## [34] "4. Net non-interest income"
## [35] "4.a. Fees and commissions receivable"
## [36] "4.b. Fees and commissions payable"
## [37] "4.c. Net profit or loss on financial operations"
## [38] "4.d. Other net non-interest income"
## [39] "5. Net interest and non-interest income"
## [40] "6. Operating expenses"
## [41] "6.a. Staff costs"
## [42] "6.b. Property costs"
## [43] "6.c. Other operating expenses"
## [44] "7. Net income before provisions"
## [45] "8. Net provisions"
## [46] "8.a. Provisions on loans"
## [47] "8.b. Provisions on securities"
## [48] "8.c. Other net provisions"
## [49] "9. Income before tax "
As one can imagine from this levels, we have a problem of levels names here, which I did not know how to appropriately cope with. I did not succumb to the temptation of replacing them by reasonable names in the .csv file prior to import! Instead, I decided to reshape the data.frame and change the names whenever needed. The levels here address different categories. And I define some vectors to use later on.
incomeLevels <- c("1. Interest income","2. Interest expenses",
"4. Net non-interest income","6. Operating expenses",
"8. Net provisions","10. Income tax",
"12. Distributed profit","13. Retained profit ")
assetLevels <- c("14. Cash and balance with Central bank","15. Interbank deposits","16. Loans",
"17. Securities","18. Other assets")
liabLevels <- c("19. Capital and reserves","20. Borrowing from Central bank",
"21. Interbank deposits","22. Customer deposits","23. Bonds","24. Other liabilities")
QuantLevels <- c("37. Number of institutions","38. Number of branches","39. Number of employees")
Now I can introduce a data.frame addressing Income Statements as follows:
incomeDat <- droplevels(subset(iDat, Item %in% incomeLevels))
incomeDat <- reshape(incomeDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(incomeDat) <- c("Country","Bank","Year","InterestIncome","InterestExpenses",
"NetNonInterestIncome","OperatingExpenses","NetProvisions","IncomeTax",
"DistributedProfit","RetainedProfit")
str(incomeDat)
## 'data.frame': 490 obs. of 11 variables:
## $ Country : Factor w/ 20 levels "Canada","Denmark",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Bank : Factor w/ 4 levels "Co-operative banks",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : int 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 ...
## $ InterestIncome : num 76301 85474 86333 68692 66578 ...
## $ InterestExpenses : num 51788 59487 56558 35553 33460 ...
## $ NetNonInterestIncome: num 26931 33016 33851 30884 31789 ...
## $ OperatingExpenses : num 33518 38757 42364 43299 43090 ...
## $ NetProvisions : num 3177 4310 6436 9432 4123 ...
## $ IncomeTax : num 4542 5003 3348 2325 4297 ...
## $ DistributedProfit : num 3470 3483 4296 4359 5570 ...
## $ RetainedProfit : num 6736 7450 7182 4610 7825 ...
## - attr(*, "reshapeWide")=List of 5
## ..$ v.names: NULL
## ..$ timevar: chr "Item"
## ..$ idvar : chr "Country" "Bank" "Year"
## ..$ times : Factor w/ 8 levels "1. Interest income",..: 1 5 6 7 8 2 3 4
## ..$ varying: chr [1, 1:8] "Value..Flags.1. Interest income" "Value..Flags.2. Interest expenses" "Value..Flags.4. Net non-interest income" "Value..Flags.6. Operating expenses" ...
We can see some rows of this data.frame. From the table below we can notice that some data is missing and this comes from the source data, as far as I could see on the website.
Dumm1 <- incomeDat[sample(nrow(incomeDat), 10), ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)
| Country | Bank | Year | InterestIncome | InterestExpenses | NetNonInterestIncome | OperatingExpenses | NetProvisions | IncomeTax | DistributedProfit | RetainedProfit |
|---|---|---|---|---|---|---|---|---|---|---|
| Finland | Commercial banks | 2008 | 10731.30 | 7832.57 | 578.16 | 1590.33 | 89.43 | 406.25 | 1387.70 | 3.20 |
| Finland | Other miscellaneous monetary institutions | 2006 | 639.79 | 425.76 | 102.02 | 142.40 | 30.70 | 32.63 | 4.34 | 105.97 |
| Finland | Co-operative banks | 2000 | 1124.97 | 372.21 | 378.86 | 666.28 | 57.15 | 84.27 | 19.02 | 304.91 |
| Germany | Savings banks | 2002 | 53932.00 | 30698.00 | 6174.00 | 19022.00 | 6959.00 | 1471.00 | 1280.00 | 676.00 |
| Korea | Other miscellaneous monetary institutions | 2003 | 11760000.00 | 6463000.00 | 239000.00 | 2029000.00 | 2540000.00 | 168000.00 | 156000.00 | 644000.00 |
| Portugal | Commercial banks | 2004 | 13292.42 | 8341.83 | 2760.60 | 4235.83 | 1674.27 | 145.86 | ||
| Russian Federation | Other miscellaneous monetary institutions | 2004 | 363.50 | 82.94 | 1164.63 | 1037.77 | 9.91 | 103.27 | 291.72 | |
| Sweden | Savings banks | 2007 | 6222.00 | 2695.00 | 2150.00 | 3401.00 | 59.00 | 497.00 | 26.00 | 1693.00 |
| Canada | Commercial banks | 2007 | 110557.00 | 74941.00 | 42650.00 | 49054.00 | 3444.00 | 4746.00 | 9034.00 | 11988.00 |
| Poland | Co-operative banks | 2002 | 2475.00 | 985.00 | 706.00 | 1550.00 | 150.00 | 140.00 | 24.00 | 332.00 |
With this data.frame we can look into the interest income from different types of banking with countries as the grouping variable. Also, the distribution of Operating Expenses values and change of this value by time are illustrated.
ggplot(incomeDat, aes(x= Bank, y= InterestIncome, color=Country)) +
geom_jitter(alpha=1/2, position = position_jitter(width = .2)) + scale_y_log10()
ggplot(incomeDat, aes(x= OperatingExpenses, fill= Country)) + geom_density() + scale_x_log10()
ggplot(incomeDat, aes(x= Year, y= OperatingExpenses, color=Bank)) +
facet_wrap(~ Country) + geom_point() + scale_y_log10()
Surprisingly, the last plot does not show significant changes in expenses by time. Now, I want to do some HR study on banking institutes using another level vector:
HRDat <- droplevels(subset(gDat, Item %in% QuantLevels))
HRDat <- reshape(HRDat, idvar=c("Country","Bank","Year"), timevar="Item", direction= "wide")
names(HRDat) <- c("Country","Bank","Year","institutions","branches","employees")
We can take look at some rows of this new data.frame.
Dumm1 <- HRDat[sample(nrow(HRDat), 10), ]
print(xtable(Dumm1),type = "html", include.rownames = FALSE)
| Country | Bank | Year | institutions | branches | employees |
|---|---|---|---|---|---|
| United States | Commercial banks | 2004 | 7701.00 | 74519.00 | 1819664.00 |
| United States | Large commercial banks | 2007 | 100.00 | 45042.00 | 1391736.00 |
| Denmark | All banks | 2009 | 99.00 | 1760.00 | 45935.00 |
| Israel | Large commercial banks | 2000 | 5.00 | 1075.00 | 39251.00 |
| United Kingdom | Large commercial banks | 2002 | 41.00 | 10754.00 | 322800.00 |
| United States | Commercial banks | 2002 | 7975.00 | 70571.00 | 1756260.00 |
| Slovak Republic | Large commercial banks | 2007 | 6.00 | 843.00 | 14723.00 |
| Mexico | Foreign commercial banks | 2007 | 18.00 | 6375.00 | 101137.00 |
| Turkey | Foreign commercial banks | 2000 | 7.00 | 21.00 | 755.00 |
| Denmark | All banks | 2003 | 101.00 | 2014.00 | 41000.00 |
We can do a different set of investigations using this new set. Here, I illustrate number of employees in different banking institutions in U.S. by year and number of institutions.
ggplot(subset(HRDat, Country == "United States"), aes(x = Year, y = employees, color= Bank,
size = 1500 *sqrt(institutions/pi))) + geom_point()
As we can see here, although Other monetary institutions have a considerable number of institutions, they do not embrace a considerable number of employees and Large commercial banks show a fantastic contribution given their relatively low number of institutions. Another interesting observation here is the reduction in employees after 2007 in Commercial banks, who consistently had the majority of employees.
At the end, I should say, maybe a more professional way of manipulating this data set is avoiding segregating it through level vectors I used here. I just used it as a faster approach to the problem and after I got a headache of trying to prepare the input data due to weird level names!
I am also a bit suspicious about my reshaping and am not sure if I have not messed things using it!