Project2 - Dataset 3

Dataset 1 - “Coca-Cola”

For the third untidy data set, I chose Coca-Cola’s financial info Excel sheet.

This data set is untidy because years are represented as columns and variables are represent by rows.

Source: https://www.kaggle.com/datasets/shivavashishtha/dirty-excel-data/data

Load the Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(stringi)
library(tidyr)
library(ggplot2)
library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibble     1.1.6     ✔ feasts      0.4.1
## ✔ tsibbledata 0.4.1     ✔ fable       0.4.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

Read the Data

df_coca_cola <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607/refs/heads/main/Cola(COCA%20COLA%20CO).csv", skip = 2)

## New names:
## Rows: 101 Columns: 12
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (12): Profit & Loss statement, ...2, ...3, ...4, ...5, ...6, ...7, ...8,...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`

df_coca_cola

## # A tibble: 101 × 12
##    Profit & Loss stateme…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
##    <chr>                   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 <NA>                    in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
##  2 <NA>                    NET … 30,9… 35,1… 46,5… 48,0… 46,8… 45,9… 44,2… 41,8…
##  3 <NA>                    Cost… 11,0… 12,6… 18,2… 19,0… 18,4… 17,8… 17,4… 16,4…
##  4 <NA>                    Gros… 19,9… 22,4… 28,3… 28,9… 28,4… 28,1… 26,8… 25,3…
##  5 <NA>                    Sell… 11,3… 13,1… 17,4… 17,7… 17,3… 17,2… 16,4… 15,2…
##  6 <NA>                    Othe… 313   819   732   447   895   1,183 1,657 1,510
##  7 <NA>                    Oper… 8,231 8,413 10,1… 10,7… 10,2… 9,708 8,728 8,626
##  8 <NA>                    Inte… 249   317   483   471   534   594   613   642  
##  9 <NA>                    Inte… 355   733   417   397   463   483   856   733  
## 10 <NA>                    Equi… 781   1,025 690   819   602   769   489   835  
## # ℹ 91 more rows
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>

This excel provides 3 different data sets within one spreadsheet, so let’s create 3 unique data frames to represent the data.

Initial Clean Up (Date Frame 1)

Let’s first focus on the Profit & Loss statement.

profit_loss <- df_coca_cola[1:19,]
head(profit_loss, 10)

## # A tibble: 10 × 12
##    Profit & Loss stateme…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
##    <chr>                   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 <NA>                    in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
##  2 <NA>                    NET … 30,9… 35,1… 46,5… 48,0… 46,8… 45,9… 44,2… 41,8…
##  3 <NA>                    Cost… 11,0… 12,6… 18,2… 19,0… 18,4… 17,8… 17,4… 16,4…
##  4 <NA>                    Gros… 19,9… 22,4… 28,3… 28,9… 28,4… 28,1… 26,8… 25,3…
##  5 <NA>                    Sell… 11,3… 13,1… 17,4… 17,7… 17,3… 17,2… 16,4… 15,2…
##  6 <NA>                    Othe… 313   819   732   447   895   1,183 1,657 1,510
##  7 <NA>                    Oper… 8,231 8,413 10,1… 10,7… 10,2… 9,708 8,728 8,626
##  8 <NA>                    Inte… 249   317   483   471   534   594   613   642  
##  9 <NA>                    Inte… 355   733   417   397   463   483   856   733  
## 10 <NA>                    Equi… 781   1,025 690   819   602   769   489   835  
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>

The current columns are not the actual columns in the untidy data set (or at least how I perceived them), so let’s remove the first row. Additionally, the first column is empty so let’s remove that as well:

names(profit_loss) <- profit_loss[1,]

## Warning: The `value` argument of `names<-()` can't be empty as of tibble 3.0.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `value` argument of `names<-()` must be a character vector as of tibble
## 3.0.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

profit_loss <- profit_loss[-1,-1]
profit_loss

## # A tibble: 18 × 11
##    `in million USD`        `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
##    <chr>                   <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
##  1 NET OPERATING REVENUES  30,990   35,119   46,542   48,017   46,854   45,998  
##  2 Cost of goods sold      11,088   12,693   18,215   19,053   18,421   17,889  
##  3 Gross Profit            19,902   22,426   28,327   28,964   28,433   28,109  
##  4 Selling, general and a… 11,358   13,194   17,422   17,738   17,310   17,218  
##  5 Other operating charges 313      819      732      447      895      1,183   
##  6 Operating Income        8,231    8,413    10,173   10,779   10,228   9,708   
##  7 Interest income         249      317      483      471      534      594     
##  8 Interest expense        355      733      417      397      463      483     
##  9 Equity income (loss) -… 781      1,025    690      819      602      769     
## 10 Other income (loss) - … 40       5,185    529      137      576      -1,263  
## 11 Income before income t… 8,946    14,207   11,458   11,809   11,477   9,325   
## 12 Income taxes            2,040    2,370    2,812    2,723    2,851    2,201   
## 13 CONSOLIDATED NET INCOME 6,906    11,837   8,646    9,086    8,626    7,124   
## 14 Less: Net income attri… 82       50       62       67       42       26      
## 15 Net income from contin… <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
## 16 Income (Loss) from Dis… <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
## 17 NET INCOME ATTRIBUTABL… 6,824    11,787   8,584    9,019    8,584    7,098   
## 18 <NA>                    <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
## # ℹ 4 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>,
## #   `FY '18` <chr>

Now this matches the untidy data set.

Tidying the Data

Now let’s change the individual years to rows/observations:

colnames(profit_loss)[1] <- "test"
profit_loss3 <- profit_loss %>% 
   pivot_longer(cols = "FY '09":"FY '18", names_to = "year", values_to = "cost")
glimpse(profit_loss3)

## Rows: 180
## Columns: 3
## $ test <chr> "NET OPERATING REVENUES", "NET OPERATING REVENUES", "NET OPERATIN…
## $ year <chr> "FY '09", "FY '10", "FY '11", "FY '12", "FY '13", "FY '14", "FY '…
## $ cost <chr> "30,990", "35,119", "46,542", "48,017", "46,854", "45,998", "44,2…

We now have a year column which is great, but now we need to make categories such as NET OPERATING REVENUES variables in the data frame:

colnames(profit_loss)[1] <- "test"
profit_loss4 <- profit_loss3 %>% 
   pivot_wider(names_from = test, values_from = cost)
head(profit_loss4, 10)

## # A tibble: 10 × 19
##    year   `NET OPERATING REVENUES` `Cost of goods sold` `Gross Profit`
##    <chr>  <chr>                    <chr>                <chr>         
##  1 FY '09 30,990                   11,088               19,902        
##  2 FY '10 35,119                   12,693               22,426        
##  3 FY '11 46,542                   18,215               28,327        
##  4 FY '12 48,017                   19,053               28,964        
##  5 FY '13 46,854                   18,421               28,433        
##  6 FY '14 45,998                   17,889               28,109        
##  7 FY '15 44,294                   17,482               26,812        
##  8 FY '16 41,863                   16,465               25,398        
##  9 FY '17 35,410                   13,255               22,155        
## 10 FY '18 31,856                   11,770               20,086        
## # ℹ 15 more variables: `Selling, general and administrative expenses` <chr>,
## #   `Other operating charges` <chr>, `Operating Income` <chr>,
## #   `Interest income` <chr>, `Interest expense` <chr>,
## #   `Equity income (loss) - net` <chr>, `Other income (loss) - net` <chr>,
## #   `Income before income taxes` <chr>, `Income taxes` <chr>,
## #   `CONSOLIDATED NET INCOME` <chr>,
## #   `Less: Net income attributable to noncontrolling interests` <chr>, …

Now we have a tidy data frame! Each observation/row represents one year. Now if we wanted to add data for 2019, all we would need is a new row.

Now let’s do the same for the other data frames in the csv.

Balance Sheet:

balance <- df_coca_cola[21:63,]
balance

## # A tibble: 43 × 12
##    Profit & Loss stateme…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
##    <chr>                   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 <NA>                    in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
##  2 <NA>                    Asse… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  3 <NA>                    Cash… 7,021 8,517 12,8… 8,442 10,4… 8,958 7,309 8,555
##  4 <NA>                    Shor… 2,130 2,682 1,088 5,017 6,707 9,052 8,322 9,595
##  5 <NA>                    Tota… 9,151 11,1… 13,8… 13,4… 17,1… 18,0… 15,6… 18,1…
##  6 <NA>                    Mark… 62    138   144   3,092 3,147 3,665 4,269 4,051
##  7 <NA>                    Trad… 3,758 4,430 4,920 4,759 4,873 4,466 3,941 3,856
##  8 <NA>                    Inve… 2,354 2,650 3,092 3,264 3,277 3,100 2,902 2,675
##  9 <NA>                    Prep… 2,226 3,162 3,450 2,781 2,886 3,066 2,752 2,481
## 10 <NA>                    Asse… <NA>  <NA>  0     2,973 0     679   3,900 0    
## # ℹ 33 more rows
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>

names(balance) <- balance[1,]
balance <- balance[-1,-1]
balance[10,1] <- "Assets held for sale_v2"
balance

## # A tibble: 42 × 11
##    `in million USD`        `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
##    <chr>                   <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
##  1 Assets                  <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
##  2 Cash and cash equivale… 7,021    8,517    12,803   8,442    10,414   8,958   
##  3 Short-term investments  2,130    2,682    1,088    5,017    6,707    9,052   
##  4 Total cash, cash equiv… 9,151    11,199   13,891   13,459   17,121   18,010  
##  5 Marketable securities   62       138      144      3,092    3,147    3,665   
##  6 Trade accounts receiva… 3,758    4,430    4,920    4,759    4,873    4,466   
##  7 Inventories             2,354    2,650    3,092    3,264    3,277    3,100   
##  8 Prepaid expenses and o… 2,226    3,162    3,450    2,781    2,886    3,066   
##  9 Assets held for sale    <NA>     <NA>     0        2,973    0        679     
## 10 Assets held for sale_v2 <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
## # ℹ 32 more rows
## # ℹ 4 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>,
## #   `FY '18` <chr>

balance2 <- balance %>% 
   pivot_longer(cols = "FY '09":"FY '18", names_to = "year", values_to = "cost")
balance2

## # A tibble: 420 × 3
##    `in million USD` year   cost 
##    <chr>            <chr>  <chr>
##  1 Assets           FY '09 <NA> 
##  2 Assets           FY '10 <NA> 
##  3 Assets           FY '11 <NA> 
##  4 Assets           FY '12 <NA> 
##  5 Assets           FY '13 <NA> 
##  6 Assets           FY '14 <NA> 
##  7 Assets           FY '15 <NA> 
##  8 Assets           FY '16 <NA> 
##  9 Assets           FY '17 <NA> 
## 10 Assets           FY '18 <NA> 
## # ℹ 410 more rows

balance3 <- balance2 %>% 
   pivot_wider(names_from = "in million USD", values_from = cost)
balance3

## # A tibble: 10 × 43
##    year   Assets `Cash and cash equivalents` `Short-term investments`
##    <chr>  <chr>  <chr>                       <chr>                   
##  1 FY '09 <NA>   7,021                       2,130                   
##  2 FY '10 <NA>   8,517                       2,682                   
##  3 FY '11 <NA>   12,803                      1,088                   
##  4 FY '12 <NA>   8,442                       5,017                   
##  5 FY '13 <NA>   10,414                      6,707                   
##  6 FY '14 <NA>   8,958                       9,052                   
##  7 FY '15 <NA>   7,309                       8,322                   
##  8 FY '16 <NA>   8,555                       9,595                   
##  9 FY '17 <NA>   6,006                       9,352                   
## 10 FY '18 <NA>   8,926                       2,025                   
## # ℹ 39 more variables:
## #   `Total cash, cash equivalents and short-term investments` <chr>,
## #   `Marketable securities` <chr>,
## #   `Trade accounts receivable, less allowances` <chr>, Inventories <chr>,
## #   `Prepaid expenses and other assets` <chr>, `Assets held for sale` <chr>,
## #   `Assets held for sale_v2` <chr>, `Total current assets` <chr>,
## #   `Equity method investments` <chr>, `Other investments` <chr>, …

Now we have a tidy data frame that matches the original.

Cash Flow Statement:

cash_flow <- df_coca_cola[65:101,]
cash_flow

## # A tibble: 37 × 12
##    Profit & Loss stateme…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
##    <chr>                   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 <NA>                    in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
##  2 <NA>                    CONS… 6,906 11,8… 8,646 9,086 8,626 7,124 7,366 6,550
##  3 <NA>                    Depr… 1,236 1,443 1,954 1,982 1,977 1,976 1,970 1,787
##  4 <NA>                    (Inc… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  0     0    
##  5 <NA>                    Net … <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  7,366 6,550
##  6 <NA>                    Stoc… 241   380   354   259   227   209   236   258  
##  7 <NA>                    Defe… 353   604   1,035 632   648   -40   73    -856 
##  8 <NA>                    Equi… -359  -671  -269  -426  -201  -371  -122  -449 
##  9 <NA>                    Fore… 61    151   7     -130  168   415   -137  158  
## 10 <NA>                    Sign… -43   -645  -220  -98   -670  831   -374  1,146
## # ℹ 27 more rows
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>

names(cash_flow) <- cash_flow[1,]
cash_flow <- cash_flow[-1,-1]
cash_flow

## # A tibble: 36 × 11
##    `in million USD`        `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
##    <chr>                   <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
##  1 CONSOLIDATED NET INCOME 6,906    11,837   8,646    9,086    8,626    7,124   
##  2 Depreciation and amort… 1,236    1,443    1,954    1,982    1,977    1,976   
##  3 (Income) Loss from dis… <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
##  4 Net income from contin… <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
##  5 Stock-based compensati… 241      380      354      259      227      209     
##  6 Deferred income taxes   353      604      1,035    632      648      -40     
##  7 Equity (income) loss -… -359     -671     -269     -426     -201     -371    
##  8 Foreign currency adjus… 61       151      7        -130     168      415     
##  9 Significant (gains) lo… -43      -645     -220     -98      -670     831     
## 10 Other significant (gai… 0        -4,713   <NA>     <NA>     <NA>     <NA>    
## # ℹ 26 more rows
## # ℹ 4 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>,
## #   `FY '18` <chr>

cash_flow2 <- cash_flow %>%
   pivot_longer(cols = "FY '09":"FY '18", names_to = "year", values_to = "cost")
cash_flow2

## # A tibble: 360 × 3
##    `in million USD`        year   cost  
##    <chr>                   <chr>  <chr> 
##  1 CONSOLIDATED NET INCOME FY '09 6,906 
##  2 CONSOLIDATED NET INCOME FY '10 11,837
##  3 CONSOLIDATED NET INCOME FY '11 8,646 
##  4 CONSOLIDATED NET INCOME FY '12 9,086 
##  5 CONSOLIDATED NET INCOME FY '13 8,626 
##  6 CONSOLIDATED NET INCOME FY '14 7,124 
##  7 CONSOLIDATED NET INCOME FY '15 7,366 
##  8 CONSOLIDATED NET INCOME FY '16 6,550 
##  9 CONSOLIDATED NET INCOME FY '17 1,283 
## 10 CONSOLIDATED NET INCOME FY '18 6,476 
## # ℹ 350 more rows

cash_flow3 <- cash_flow2 %>%
   pivot_wider(names_from = "in million USD", values_from = cost)
cash_flow3

## # A tibble: 10 × 37
##    year   CONSOLIDATED NET INCOM…¹ Depreciation and amo…² (Income) Loss from d…³
##    <chr>  <chr>                    <chr>                  <chr>                 
##  1 FY '09 6,906                    1,236                  <NA>                  
##  2 FY '10 11,837                   1,443                  <NA>                  
##  3 FY '11 8,646                    1,954                  <NA>                  
##  4 FY '12 9,086                    1,982                  <NA>                  
##  5 FY '13 8,626                    1,977                  <NA>                  
##  6 FY '14 7,124                    1,976                  <NA>                  
##  7 FY '15 7,366                    1,970                  0                     
##  8 FY '16 6,550                    1,787                  0                     
##  9 FY '17 1,283                    1,260                  -101                  
## 10 FY '18 6,476                    1,086                  251                   
## # ℹ abbreviated names: ¹`CONSOLIDATED NET INCOME`,
## #   ²`Depreciation and amortization`,
## #   ³`(Income) Loss from discontinued operations`
## # ℹ 33 more variables: `Net income from continuing operations` <chr>,
## #   `Stock-based compensation expense` <chr>, `Deferred income taxes` <chr>,
## #   `Equity (income) loss - net of dividends` <chr>,
## #   `Foreign currency adjustments` <chr>, …

Analysis

For the analysis, I’ll first focus on the Profit and Loss data frame.

Let’s first do some clean up and convert the year to an actual date, make column names lowercase, and remove special characters from column names:

profit_loss5 <- profit_loss4

profit_loss5$year <- paste("20", str_sub(profit_loss5$year,5), sep = "")

names(profit_loss5) <- tolower(names(profit_loss5))
names(profit_loss5) <- str_replace_all(names(profit_loss5), " ", "_")
names(profit_loss5) <- str_replace_all(names(profit_loss5), "\\(", "")
names(profit_loss5) <- str_replace_all(names(profit_loss5), "\\)", "")
names(profit_loss5) <- str_replace_all(names(profit_loss5), ",", "")
names(profit_loss5) <- str_replace_all(names(profit_loss5), ":", "")

profit_loss6 <- profit_loss5[ , 1:18]
names(profit_loss6)

##  [1] "year"                                                                                                         
##  [2] "net_operating_revenues"                                                                                       
##  [3] "cost_of_goods_sold"                                                                                           
##  [4] "gross_profit"                                                                                                 
##  [5] "selling_general_and_administrative_expenses"                                                                  
##  [6] "other_operating_charges"                                                                                      
##  [7] "operating_income"                                                                                             
##  [8] "interest_income"                                                                                              
##  [9] "interest_expense"                                                                                             
## [10] "equity_income_loss_-_net"                                                                                     
## [11] "other_income_loss_-_net"                                                                                      
## [12] "income_before_income_taxes"                                                                                   
## [13] "income_taxes"                                                                                                 
## [14] "consolidated_net_income"                                                                                      
## [15] "less_net_income_attributable_to_noncontrolling_interests"                                                     
## [16] "net_income_from_continuing_operations"                                                                        
## [17] "income_loss_from_discontinued_operations_net_of_tax_including_portion_attributable_to_noncontrolling_interest"
## [18] "net_income_attributable_to_shareowners_of_the_coca-cola_company"

Let’s first see how gross profit changes over time:

# Convert value to number
profit_loss6$gross_profit <- as.numeric(gsub("," ,"", profit_loss6$gross_profit))

ggplot(profit_loss6, aes(x=year, y=gross_profit)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Profit increased until 2012 and then gradualy started to decrease.

Comparatively, let’s see how the general and administrative expense changed over time as well:

profit_loss6$selling_general_and_administrative_expenses <- as.numeric(gsub("," ,"", profit_loss6$selling_general_and_administrative_expenses))
profit_loss6

## # A tibble: 10 × 18
##    year  net_operating_revenues cost_of_goods_sold gross_profit
##    <chr> <chr>                  <chr>                     <dbl>
##  1 2009  30,990                 11,088                    19902
##  2 2010  35,119                 12,693                    22426
##  3 2011  46,542                 18,215                    28327
##  4 2012  48,017                 19,053                    28964
##  5 2013  46,854                 18,421                    28433
##  6 2014  45,998                 17,889                    28109
##  7 2015  44,294                 17,482                    26812
##  8 2016  41,863                 16,465                    25398
##  9 2017  35,410                 13,255                    22155
## 10 2018  31,856                 11,770                    20086
## # ℹ 14 more variables: selling_general_and_administrative_expenses <dbl>,
## #   other_operating_charges <chr>, operating_income <chr>,
## #   interest_income <chr>, interest_expense <chr>,
## #   `equity_income_loss_-_net` <chr>, `other_income_loss_-_net` <chr>,
## #   income_before_income_taxes <chr>, income_taxes <chr>,
## #   consolidated_net_income <chr>,
## #   less_net_income_attributable_to_noncontrolling_interests <chr>, …

ggplot(profit_loss6, aes(x=year, y=selling_general_and_administrative_expenses)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

What’s interesting is that 2011 had a big jump in expense and also had a peak in 2012.

Now let’s see how the other major expense, Operating Charges, changes over time:

profit_loss6$other_operating_charges <- as.numeric(gsub("," ,"", profit_loss6$other_operating_charges))
profit_loss6

## # A tibble: 10 × 18
##    year  net_operating_revenues cost_of_goods_sold gross_profit
##    <chr> <chr>                  <chr>                     <dbl>
##  1 2009  30,990                 11,088                    19902
##  2 2010  35,119                 12,693                    22426
##  3 2011  46,542                 18,215                    28327
##  4 2012  48,017                 19,053                    28964
##  5 2013  46,854                 18,421                    28433
##  6 2014  45,998                 17,889                    28109
##  7 2015  44,294                 17,482                    26812
##  8 2016  41,863                 16,465                    25398
##  9 2017  35,410                 13,255                    22155
## 10 2018  31,856                 11,770                    20086
## # ℹ 14 more variables: selling_general_and_administrative_expenses <dbl>,
## #   other_operating_charges <dbl>, operating_income <chr>,
## #   interest_income <chr>, interest_expense <chr>,
## #   `equity_income_loss_-_net` <chr>, `other_income_loss_-_net` <chr>,
## #   income_before_income_taxes <chr>, income_taxes <chr>,
## #   consolidated_net_income <chr>,
## #   less_net_income_attributable_to_noncontrolling_interests <chr>, …

ggplot(profit_loss6, aes(x=year, y=other_operating_charges)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

This graph does not show a peak in 2012, but it is interesting that this actually increases until 2017 with a drop in 2018. So in 2017, something happened where this took up a higher percentage of the expenses than other years (since the other expenses dropped).

Now let’s look at the Balance data frame, and do some of the same clean up:

balance4 <- balance3

balance4$year <- paste("20", str_sub(balance4$year,5), sep = "")
names(balance4) <- tolower(names(balance4))
names(balance4) <- str_replace_all(names(balance4), " ", "_")
names(balance4) <- str_replace_all(names(balance4), "\\(", "")
names(balance4) <- str_replace_all(names(balance4), "\\)", "")
names(balance4) <- str_replace_all(names(balance4), ",", "")
names(balance4) <- str_replace_all(names(balance4), ":", "")
names(balance4) <- str_replace_all(names(balance4), "'", "")
names(balance4) <- str_replace_all(names(balance4), "-", "_")
balance4

## # A tibble: 10 × 43
##    year  assets cash_and_cash_equivalents short_term_investments
##    <chr> <chr>  <chr>                     <chr>                 
##  1 2009  <NA>   7,021                     2,130                 
##  2 2010  <NA>   8,517                     2,682                 
##  3 2011  <NA>   12,803                    1,088                 
##  4 2012  <NA>   8,442                     5,017                 
##  5 2013  <NA>   10,414                    6,707                 
##  6 2014  <NA>   8,958                     9,052                 
##  7 2015  <NA>   7,309                     8,322                 
##  8 2016  <NA>   8,555                     9,595                 
##  9 2017  <NA>   6,006                     9,352                 
## 10 2018  <NA>   8,926                     2,025                 
## # ℹ 39 more variables:
## #   total_cash_cash_equivalents_and_short_term_investments <chr>,
## #   marketable_securities <chr>,
## #   trade_accounts_receivable_less_allowances <chr>, inventories <chr>,
## #   prepaid_expenses_and_other_assets <chr>, assets_held_for_sale <chr>,
## #   assets_held_for_sale_v2 <chr>, total_current_assets <chr>,
## #   equity_method_investments <chr>, other_investments <chr>, …

# Remove columns we won't need for this data frame to make things simpler
cols_to_delete <- c("assets", "liabilities", "na")
balance5 <- balance4[ , !(names(balance4) %in% cols_to_delete)]
balance5

## # A tibble: 10 × 40
##    year  cash_and_cash_equivalents short_term_investments total_cash_cash_equi…¹
##    <chr> <chr>                     <chr>                  <chr>                 
##  1 2009  7,021                     2,130                  9,151                 
##  2 2010  8,517                     2,682                  11,199                
##  3 2011  12,803                    1,088                  13,891                
##  4 2012  8,442                     5,017                  13,459                
##  5 2013  10,414                    6,707                  17,121                
##  6 2014  8,958                     9,052                  18,010                
##  7 2015  7,309                     8,322                  15,631                
##  8 2016  8,555                     9,595                  18,150                
##  9 2017  6,006                     9,352                  15,358                
## 10 2018  8,926                     2,025                  10,951                
## # ℹ abbreviated name: ¹total_cash_cash_equivalents_and_short_term_investments
## # ℹ 36 more variables: marketable_securities <chr>,
## #   trade_accounts_receivable_less_allowances <chr>, inventories <chr>,
## #   prepaid_expenses_and_other_assets <chr>, assets_held_for_sale <chr>,
## #   assets_held_for_sale_v2 <chr>, total_current_assets <chr>,
## #   equity_method_investments <chr>, other_investments <chr>,
## #   other_assets <chr>, property_plant_and_equipment___net <chr>, …

convert_string_to_number <- function(column) {
   as.numeric(gsub("," ,"", column))
}

balance5 <- balance5 %>% 
  mutate(across(where(is.character), convert_string_to_number))
balance5$year <- as.character(balance5$year)
balance5

## # A tibble: 10 × 40
##    year  cash_and_cash_equivalents short_term_investments total_cash_cash_equi…¹
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 2009                       7021                   2130                   9151
##  2 2010                       8517                   2682                  11199
##  3 2011                      12803                   1088                  13891
##  4 2012                       8442                   5017                  13459
##  5 2013                      10414                   6707                  17121
##  6 2014                       8958                   9052                  18010
##  7 2015                       7309                   8322                  15631
##  8 2016                       8555                   9595                  18150
##  9 2017                       6006                   9352                  15358
## 10 2018                       8926                   2025                  10951
## # ℹ abbreviated name: ¹total_cash_cash_equivalents_and_short_term_investments
## # ℹ 36 more variables: marketable_securities <dbl>,
## #   trade_accounts_receivable_less_allowances <dbl>, inventories <dbl>,
## #   prepaid_expenses_and_other_assets <dbl>, assets_held_for_sale <dbl>,
## #   assets_held_for_sale_v2 <dbl>, total_current_assets <dbl>,
## #   equity_method_investments <dbl>, other_investments <dbl>,
## #   other_assets <dbl>, property_plant_and_equipment___net <dbl>, …

We now have a cleaner, tidy data frame.

Let’s look at individual percentages of liabilities for this, to see what is taking up the most liability:

balance6 <- balance5 %>% 
  mutate(accounts_payable_and_accrued_expenses_perc = accounts_payable_and_accrued_expenses / total_current_liabilities) %>%
  mutate(loans_and_notes_payable_perc = loans_and_notes_payable / total_current_liabilities) %>%
  mutate(current_maturities_of_long_term_debt_perc = current_maturities_of_long_term_debt / total_current_liabilities) %>% 
  mutate(accrued_income_taxes_perc = accrued_income_taxes / total_current_liabilities) %>%
  mutate(liabilities_held_for_sale_perc = liabilities_held_for_sale / total_current_liabilities) %>%
  mutate(liabilities_held_for_sale_discontinued_operations_perc = liabilities_held_for_sale_discontinued_operations / total_current_liabilities)

head(balance6, 10)

## # A tibble: 10 × 46
##    year  cash_and_cash_equivalents short_term_investments total_cash_cash_equi…¹
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 2009                       7021                   2130                   9151
##  2 2010                       8517                   2682                  11199
##  3 2011                      12803                   1088                  13891
##  4 2012                       8442                   5017                  13459
##  5 2013                      10414                   6707                  17121
##  6 2014                       8958                   9052                  18010
##  7 2015                       7309                   8322                  15631
##  8 2016                       8555                   9595                  18150
##  9 2017                       6006                   9352                  15358
## 10 2018                       8926                   2025                  10951
## # ℹ abbreviated name: ¹total_cash_cash_equivalents_and_short_term_investments
## # ℹ 42 more variables: marketable_securities <dbl>,
## #   trade_accounts_receivable_less_allowances <dbl>, inventories <dbl>,
## #   prepaid_expenses_and_other_assets <dbl>, assets_held_for_sale <dbl>,
## #   assets_held_for_sale_v2 <dbl>, total_current_assets <dbl>,
## #   equity_method_investments <dbl>, other_investments <dbl>,
## #   other_assets <dbl>, property_plant_and_equipment___net <dbl>, …

Let’s see a graph of how total liabilities have changed over time:

ggplot(balance6, aes(x=year, y=total_current_liabilities)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

The general trend is increasing with a peak in 2014.

Let’s see how accounts_payable_and_accrued_expense percentage has changed:

ggplot(balance6, aes(x=year, y=accounts_payable_and_accrued_expenses_perc)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplot(balance6, aes(x=year, y=loans_and_notes_payable_perc)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplot(balance6, aes(x=year, y=current_maturities_of_long_term_debt_perc)) +
  geom_bar(stat='identity', position='dodge') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

So it looks like loans_and_notes_payable_perc most of the time, takes up most of the liabilities with the higher percentage. It looks like the general liability spike in 2014 could be attributed to current_maturities_of_long_term_debt_perc which also had a major increase in 2014. It’s also interesting that this generally continued to increase until 2018 (peak at 2018).

Let’s finally compare total assets to total liabilities:

ggplot(data = balance6, aes(x = total_assets, y = total_current_liabilities, color = year)) +
  geom_point()

This shows a positive trend as total assets increases, the total liabilities also increases. This also shows in 2014, where there was a spike in liabilities, there was also a spike in total assets.

Conclusions

This data set was a good practice for using pivot_wider and pivot_longer. This example feels like an excel sheet that could be seen in the real world, where the focus is on how easily presentable it is instead of the most efficient data structure.