For the third untidy data set, I chose Coca-Cola’s financial info Excel sheet.
This data set is untidy because years are represented as columns and variables are represent by rows.
Source: https://www.kaggle.com/datasets/shivavashishtha/dirty-excel-data/data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(stringi)
library(tidyr)
library(ggplot2)
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibble 1.1.6 ✔ feasts 0.4.1
## ✔ tsibbledata 0.4.1 ✔ fable 0.4.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
df_coca_cola <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607/refs/heads/main/Cola(COCA%20COLA%20CO).csv", skip = 2)
## New names:
## Rows: 101 Columns: 12
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (12): Profit & Loss statement, ...2, ...3, ...4, ...5, ...6, ...7, ...8,...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
df_coca_cola
## # A tibble: 101 × 12
## Profit & Loss stateme…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 <NA> NET … 30,9… 35,1… 46,5… 48,0… 46,8… 45,9… 44,2… 41,8…
## 3 <NA> Cost… 11,0… 12,6… 18,2… 19,0… 18,4… 17,8… 17,4… 16,4…
## 4 <NA> Gros… 19,9… 22,4… 28,3… 28,9… 28,4… 28,1… 26,8… 25,3…
## 5 <NA> Sell… 11,3… 13,1… 17,4… 17,7… 17,3… 17,2… 16,4… 15,2…
## 6 <NA> Othe… 313 819 732 447 895 1,183 1,657 1,510
## 7 <NA> Oper… 8,231 8,413 10,1… 10,7… 10,2… 9,708 8,728 8,626
## 8 <NA> Inte… 249 317 483 471 534 594 613 642
## 9 <NA> Inte… 355 733 417 397 463 483 856 733
## 10 <NA> Equi… 781 1,025 690 819 602 769 489 835
## # ℹ 91 more rows
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>
This excel provides 3 different data sets within one spreadsheet, so let’s create 3 unique data frames to represent the data.
Let’s first focus on the
Profit & Loss statement
.
profit_loss <- df_coca_cola[1:19,]
head(profit_loss, 10)
## # A tibble: 10 × 12
## Profit & Loss stateme…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 <NA> NET … 30,9… 35,1… 46,5… 48,0… 46,8… 45,9… 44,2… 41,8…
## 3 <NA> Cost… 11,0… 12,6… 18,2… 19,0… 18,4… 17,8… 17,4… 16,4…
## 4 <NA> Gros… 19,9… 22,4… 28,3… 28,9… 28,4… 28,1… 26,8… 25,3…
## 5 <NA> Sell… 11,3… 13,1… 17,4… 17,7… 17,3… 17,2… 16,4… 15,2…
## 6 <NA> Othe… 313 819 732 447 895 1,183 1,657 1,510
## 7 <NA> Oper… 8,231 8,413 10,1… 10,7… 10,2… 9,708 8,728 8,626
## 8 <NA> Inte… 249 317 483 471 534 594 613 642
## 9 <NA> Inte… 355 733 417 397 463 483 856 733
## 10 <NA> Equi… 781 1,025 690 819 602 769 489 835
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>
The current columns are not the actual columns in the untidy data set (or at least how I perceived them), so let’s remove the first row. Additionally, the first column is empty so let’s remove that as well:
names(profit_loss) <- profit_loss[1,]
## Warning: The `value` argument of `names<-()` can't be empty as of tibble 3.0.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `value` argument of `names<-()` must be a character vector as of tibble
## 3.0.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
profit_loss <- profit_loss[-1,-1]
profit_loss
## # A tibble: 18 × 11
## `in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NET OPERATING REVENUES 30,990 35,119 46,542 48,017 46,854 45,998
## 2 Cost of goods sold 11,088 12,693 18,215 19,053 18,421 17,889
## 3 Gross Profit 19,902 22,426 28,327 28,964 28,433 28,109
## 4 Selling, general and a… 11,358 13,194 17,422 17,738 17,310 17,218
## 5 Other operating charges 313 819 732 447 895 1,183
## 6 Operating Income 8,231 8,413 10,173 10,779 10,228 9,708
## 7 Interest income 249 317 483 471 534 594
## 8 Interest expense 355 733 417 397 463 483
## 9 Equity income (loss) -… 781 1,025 690 819 602 769
## 10 Other income (loss) - … 40 5,185 529 137 576 -1,263
## 11 Income before income t… 8,946 14,207 11,458 11,809 11,477 9,325
## 12 Income taxes 2,040 2,370 2,812 2,723 2,851 2,201
## 13 CONSOLIDATED NET INCOME 6,906 11,837 8,646 9,086 8,626 7,124
## 14 Less: Net income attri… 82 50 62 67 42 26
## 15 Net income from contin… <NA> <NA> <NA> <NA> <NA> <NA>
## 16 Income (Loss) from Dis… <NA> <NA> <NA> <NA> <NA> <NA>
## 17 NET INCOME ATTRIBUTABL… 6,824 11,787 8,584 9,019 8,584 7,098
## 18 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # ℹ 4 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>,
## # `FY '18` <chr>
Now this matches the untidy data set.
Now let’s change the individual years to rows/observations:
colnames(profit_loss)[1] <- "test"
profit_loss3 <- profit_loss %>%
pivot_longer(cols = "FY '09":"FY '18", names_to = "year", values_to = "cost")
glimpse(profit_loss3)
## Rows: 180
## Columns: 3
## $ test <chr> "NET OPERATING REVENUES", "NET OPERATING REVENUES", "NET OPERATIN…
## $ year <chr> "FY '09", "FY '10", "FY '11", "FY '12", "FY '13", "FY '14", "FY '…
## $ cost <chr> "30,990", "35,119", "46,542", "48,017", "46,854", "45,998", "44,2…
We now have a year column which is great, but now we need to make
categories such as NET OPERATING REVENUES
variables in the
data frame:
colnames(profit_loss)[1] <- "test"
profit_loss4 <- profit_loss3 %>%
pivot_wider(names_from = test, values_from = cost)
head(profit_loss4, 10)
## # A tibble: 10 × 19
## year `NET OPERATING REVENUES` `Cost of goods sold` `Gross Profit`
## <chr> <chr> <chr> <chr>
## 1 FY '09 30,990 11,088 19,902
## 2 FY '10 35,119 12,693 22,426
## 3 FY '11 46,542 18,215 28,327
## 4 FY '12 48,017 19,053 28,964
## 5 FY '13 46,854 18,421 28,433
## 6 FY '14 45,998 17,889 28,109
## 7 FY '15 44,294 17,482 26,812
## 8 FY '16 41,863 16,465 25,398
## 9 FY '17 35,410 13,255 22,155
## 10 FY '18 31,856 11,770 20,086
## # ℹ 15 more variables: `Selling, general and administrative expenses` <chr>,
## # `Other operating charges` <chr>, `Operating Income` <chr>,
## # `Interest income` <chr>, `Interest expense` <chr>,
## # `Equity income (loss) - net` <chr>, `Other income (loss) - net` <chr>,
## # `Income before income taxes` <chr>, `Income taxes` <chr>,
## # `CONSOLIDATED NET INCOME` <chr>,
## # `Less: Net income attributable to noncontrolling interests` <chr>, …
Now we have a tidy data frame! Each observation/row represents one year. Now if we wanted to add data for 2019, all we would need is a new row.
Now let’s do the same for the other data frames in the csv.
Balance Sheet
:
balance <- df_coca_cola[21:63,]
balance
## # A tibble: 43 × 12
## Profit & Loss stateme…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 <NA> Asse… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> Cash… 7,021 8,517 12,8… 8,442 10,4… 8,958 7,309 8,555
## 4 <NA> Shor… 2,130 2,682 1,088 5,017 6,707 9,052 8,322 9,595
## 5 <NA> Tota… 9,151 11,1… 13,8… 13,4… 17,1… 18,0… 15,6… 18,1…
## 6 <NA> Mark… 62 138 144 3,092 3,147 3,665 4,269 4,051
## 7 <NA> Trad… 3,758 4,430 4,920 4,759 4,873 4,466 3,941 3,856
## 8 <NA> Inve… 2,354 2,650 3,092 3,264 3,277 3,100 2,902 2,675
## 9 <NA> Prep… 2,226 3,162 3,450 2,781 2,886 3,066 2,752 2,481
## 10 <NA> Asse… <NA> <NA> 0 2,973 0 679 3,900 0
## # ℹ 33 more rows
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>
names(balance) <- balance[1,]
balance <- balance[-1,-1]
balance[10,1] <- "Assets held for sale_v2"
balance
## # A tibble: 42 × 11
## `in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Assets <NA> <NA> <NA> <NA> <NA> <NA>
## 2 Cash and cash equivale… 7,021 8,517 12,803 8,442 10,414 8,958
## 3 Short-term investments 2,130 2,682 1,088 5,017 6,707 9,052
## 4 Total cash, cash equiv… 9,151 11,199 13,891 13,459 17,121 18,010
## 5 Marketable securities 62 138 144 3,092 3,147 3,665
## 6 Trade accounts receiva… 3,758 4,430 4,920 4,759 4,873 4,466
## 7 Inventories 2,354 2,650 3,092 3,264 3,277 3,100
## 8 Prepaid expenses and o… 2,226 3,162 3,450 2,781 2,886 3,066
## 9 Assets held for sale <NA> <NA> 0 2,973 0 679
## 10 Assets held for sale_v2 <NA> <NA> <NA> <NA> <NA> <NA>
## # ℹ 32 more rows
## # ℹ 4 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>,
## # `FY '18` <chr>
balance2 <- balance %>%
pivot_longer(cols = "FY '09":"FY '18", names_to = "year", values_to = "cost")
balance2
## # A tibble: 420 × 3
## `in million USD` year cost
## <chr> <chr> <chr>
## 1 Assets FY '09 <NA>
## 2 Assets FY '10 <NA>
## 3 Assets FY '11 <NA>
## 4 Assets FY '12 <NA>
## 5 Assets FY '13 <NA>
## 6 Assets FY '14 <NA>
## 7 Assets FY '15 <NA>
## 8 Assets FY '16 <NA>
## 9 Assets FY '17 <NA>
## 10 Assets FY '18 <NA>
## # ℹ 410 more rows
balance3 <- balance2 %>%
pivot_wider(names_from = "in million USD", values_from = cost)
balance3
## # A tibble: 10 × 43
## year Assets `Cash and cash equivalents` `Short-term investments`
## <chr> <chr> <chr> <chr>
## 1 FY '09 <NA> 7,021 2,130
## 2 FY '10 <NA> 8,517 2,682
## 3 FY '11 <NA> 12,803 1,088
## 4 FY '12 <NA> 8,442 5,017
## 5 FY '13 <NA> 10,414 6,707
## 6 FY '14 <NA> 8,958 9,052
## 7 FY '15 <NA> 7,309 8,322
## 8 FY '16 <NA> 8,555 9,595
## 9 FY '17 <NA> 6,006 9,352
## 10 FY '18 <NA> 8,926 2,025
## # ℹ 39 more variables:
## # `Total cash, cash equivalents and short-term investments` <chr>,
## # `Marketable securities` <chr>,
## # `Trade accounts receivable, less allowances` <chr>, Inventories <chr>,
## # `Prepaid expenses and other assets` <chr>, `Assets held for sale` <chr>,
## # `Assets held for sale_v2` <chr>, `Total current assets` <chr>,
## # `Equity method investments` <chr>, `Other investments` <chr>, …
Now we have a tidy data frame that matches the original.
Cash Flow Statement
:
cash_flow <- df_coca_cola[65:101,]
cash_flow
## # A tibble: 37 × 12
## Profit & Loss stateme…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> in m… FY '… FY '… FY '… FY '… FY '… FY '… FY '… FY '…
## 2 <NA> CONS… 6,906 11,8… 8,646 9,086 8,626 7,124 7,366 6,550
## 3 <NA> Depr… 1,236 1,443 1,954 1,982 1,977 1,976 1,970 1,787
## 4 <NA> (Inc… <NA> <NA> <NA> <NA> <NA> <NA> 0 0
## 5 <NA> Net … <NA> <NA> <NA> <NA> <NA> <NA> 7,366 6,550
## 6 <NA> Stoc… 241 380 354 259 227 209 236 258
## 7 <NA> Defe… 353 604 1,035 632 648 -40 73 -856
## 8 <NA> Equi… -359 -671 -269 -426 -201 -371 -122 -449
## 9 <NA> Fore… 61 151 7 -130 168 415 -137 158
## 10 <NA> Sign… -43 -645 -220 -98 -670 831 -374 1,146
## # ℹ 27 more rows
## # ℹ abbreviated name: ¹`Profit & Loss statement`
## # ℹ 2 more variables: ...11 <chr>, ...12 <chr>
names(cash_flow) <- cash_flow[1,]
cash_flow <- cash_flow[-1,-1]
cash_flow
## # A tibble: 36 × 11
## `in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CONSOLIDATED NET INCOME 6,906 11,837 8,646 9,086 8,626 7,124
## 2 Depreciation and amort… 1,236 1,443 1,954 1,982 1,977 1,976
## 3 (Income) Loss from dis… <NA> <NA> <NA> <NA> <NA> <NA>
## 4 Net income from contin… <NA> <NA> <NA> <NA> <NA> <NA>
## 5 Stock-based compensati… 241 380 354 259 227 209
## 6 Deferred income taxes 353 604 1,035 632 648 -40
## 7 Equity (income) loss -… -359 -671 -269 -426 -201 -371
## 8 Foreign currency adjus… 61 151 7 -130 168 415
## 9 Significant (gains) lo… -43 -645 -220 -98 -670 831
## 10 Other significant (gai… 0 -4,713 <NA> <NA> <NA> <NA>
## # ℹ 26 more rows
## # ℹ 4 more variables: `FY '15` <chr>, `FY '16` <chr>, `FY '17` <chr>,
## # `FY '18` <chr>
cash_flow2 <- cash_flow %>%
pivot_longer(cols = "FY '09":"FY '18", names_to = "year", values_to = "cost")
cash_flow2
## # A tibble: 360 × 3
## `in million USD` year cost
## <chr> <chr> <chr>
## 1 CONSOLIDATED NET INCOME FY '09 6,906
## 2 CONSOLIDATED NET INCOME FY '10 11,837
## 3 CONSOLIDATED NET INCOME FY '11 8,646
## 4 CONSOLIDATED NET INCOME FY '12 9,086
## 5 CONSOLIDATED NET INCOME FY '13 8,626
## 6 CONSOLIDATED NET INCOME FY '14 7,124
## 7 CONSOLIDATED NET INCOME FY '15 7,366
## 8 CONSOLIDATED NET INCOME FY '16 6,550
## 9 CONSOLIDATED NET INCOME FY '17 1,283
## 10 CONSOLIDATED NET INCOME FY '18 6,476
## # ℹ 350 more rows
cash_flow3 <- cash_flow2 %>%
pivot_wider(names_from = "in million USD", values_from = cost)
cash_flow3
## # A tibble: 10 × 37
## year CONSOLIDATED NET INCOM…¹ Depreciation and amo…² (Income) Loss from d…³
## <chr> <chr> <chr> <chr>
## 1 FY '09 6,906 1,236 <NA>
## 2 FY '10 11,837 1,443 <NA>
## 3 FY '11 8,646 1,954 <NA>
## 4 FY '12 9,086 1,982 <NA>
## 5 FY '13 8,626 1,977 <NA>
## 6 FY '14 7,124 1,976 <NA>
## 7 FY '15 7,366 1,970 0
## 8 FY '16 6,550 1,787 0
## 9 FY '17 1,283 1,260 -101
## 10 FY '18 6,476 1,086 251
## # ℹ abbreviated names: ¹`CONSOLIDATED NET INCOME`,
## # ²`Depreciation and amortization`,
## # ³`(Income) Loss from discontinued operations`
## # ℹ 33 more variables: `Net income from continuing operations` <chr>,
## # `Stock-based compensation expense` <chr>, `Deferred income taxes` <chr>,
## # `Equity (income) loss - net of dividends` <chr>,
## # `Foreign currency adjustments` <chr>, …
For the analysis, I’ll first focus on the Profit and Loss data frame.
Let’s first do some clean up and convert the year to an actual date, make column names lowercase, and remove special characters from column names:
profit_loss5 <- profit_loss4
profit_loss5$year <- paste("20", str_sub(profit_loss5$year,5), sep = "")
names(profit_loss5) <- tolower(names(profit_loss5))
names(profit_loss5) <- str_replace_all(names(profit_loss5), " ", "_")
names(profit_loss5) <- str_replace_all(names(profit_loss5), "\\(", "")
names(profit_loss5) <- str_replace_all(names(profit_loss5), "\\)", "")
names(profit_loss5) <- str_replace_all(names(profit_loss5), ",", "")
names(profit_loss5) <- str_replace_all(names(profit_loss5), ":", "")
profit_loss6 <- profit_loss5[ , 1:18]
names(profit_loss6)
## [1] "year"
## [2] "net_operating_revenues"
## [3] "cost_of_goods_sold"
## [4] "gross_profit"
## [5] "selling_general_and_administrative_expenses"
## [6] "other_operating_charges"
## [7] "operating_income"
## [8] "interest_income"
## [9] "interest_expense"
## [10] "equity_income_loss_-_net"
## [11] "other_income_loss_-_net"
## [12] "income_before_income_taxes"
## [13] "income_taxes"
## [14] "consolidated_net_income"
## [15] "less_net_income_attributable_to_noncontrolling_interests"
## [16] "net_income_from_continuing_operations"
## [17] "income_loss_from_discontinued_operations_net_of_tax_including_portion_attributable_to_noncontrolling_interest"
## [18] "net_income_attributable_to_shareowners_of_the_coca-cola_company"
Let’s first see how gross profit changes over time:
# Convert value to number
profit_loss6$gross_profit <- as.numeric(gsub("," ,"", profit_loss6$gross_profit))
ggplot(profit_loss6, aes(x=year, y=gross_profit)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Profit increased until 2012 and then gradualy started to decrease.
Comparatively, let’s see how the general and administrative expense changed over time as well:
profit_loss6$selling_general_and_administrative_expenses <- as.numeric(gsub("," ,"", profit_loss6$selling_general_and_administrative_expenses))
profit_loss6
## # A tibble: 10 × 18
## year net_operating_revenues cost_of_goods_sold gross_profit
## <chr> <chr> <chr> <dbl>
## 1 2009 30,990 11,088 19902
## 2 2010 35,119 12,693 22426
## 3 2011 46,542 18,215 28327
## 4 2012 48,017 19,053 28964
## 5 2013 46,854 18,421 28433
## 6 2014 45,998 17,889 28109
## 7 2015 44,294 17,482 26812
## 8 2016 41,863 16,465 25398
## 9 2017 35,410 13,255 22155
## 10 2018 31,856 11,770 20086
## # ℹ 14 more variables: selling_general_and_administrative_expenses <dbl>,
## # other_operating_charges <chr>, operating_income <chr>,
## # interest_income <chr>, interest_expense <chr>,
## # `equity_income_loss_-_net` <chr>, `other_income_loss_-_net` <chr>,
## # income_before_income_taxes <chr>, income_taxes <chr>,
## # consolidated_net_income <chr>,
## # less_net_income_attributable_to_noncontrolling_interests <chr>, …
ggplot(profit_loss6, aes(x=year, y=selling_general_and_administrative_expenses)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
What’s interesting is that 2011 had a big jump in expense and also had a peak in 2012.
Now let’s see how the other major expense, Operating Charges, changes over time:
profit_loss6$other_operating_charges <- as.numeric(gsub("," ,"", profit_loss6$other_operating_charges))
profit_loss6
## # A tibble: 10 × 18
## year net_operating_revenues cost_of_goods_sold gross_profit
## <chr> <chr> <chr> <dbl>
## 1 2009 30,990 11,088 19902
## 2 2010 35,119 12,693 22426
## 3 2011 46,542 18,215 28327
## 4 2012 48,017 19,053 28964
## 5 2013 46,854 18,421 28433
## 6 2014 45,998 17,889 28109
## 7 2015 44,294 17,482 26812
## 8 2016 41,863 16,465 25398
## 9 2017 35,410 13,255 22155
## 10 2018 31,856 11,770 20086
## # ℹ 14 more variables: selling_general_and_administrative_expenses <dbl>,
## # other_operating_charges <dbl>, operating_income <chr>,
## # interest_income <chr>, interest_expense <chr>,
## # `equity_income_loss_-_net` <chr>, `other_income_loss_-_net` <chr>,
## # income_before_income_taxes <chr>, income_taxes <chr>,
## # consolidated_net_income <chr>,
## # less_net_income_attributable_to_noncontrolling_interests <chr>, …
ggplot(profit_loss6, aes(x=year, y=other_operating_charges)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
This graph does not show a peak in 2012, but it is interesting that this actually increases until 2017 with a drop in 2018. So in 2017, something happened where this took up a higher percentage of the expenses than other years (since the other expenses dropped).
Now let’s look at the Balance data frame, and do some of the same clean up:
balance4 <- balance3
balance4$year <- paste("20", str_sub(balance4$year,5), sep = "")
names(balance4) <- tolower(names(balance4))
names(balance4) <- str_replace_all(names(balance4), " ", "_")
names(balance4) <- str_replace_all(names(balance4), "\\(", "")
names(balance4) <- str_replace_all(names(balance4), "\\)", "")
names(balance4) <- str_replace_all(names(balance4), ",", "")
names(balance4) <- str_replace_all(names(balance4), ":", "")
names(balance4) <- str_replace_all(names(balance4), "'", "")
names(balance4) <- str_replace_all(names(balance4), "-", "_")
balance4
## # A tibble: 10 × 43
## year assets cash_and_cash_equivalents short_term_investments
## <chr> <chr> <chr> <chr>
## 1 2009 <NA> 7,021 2,130
## 2 2010 <NA> 8,517 2,682
## 3 2011 <NA> 12,803 1,088
## 4 2012 <NA> 8,442 5,017
## 5 2013 <NA> 10,414 6,707
## 6 2014 <NA> 8,958 9,052
## 7 2015 <NA> 7,309 8,322
## 8 2016 <NA> 8,555 9,595
## 9 2017 <NA> 6,006 9,352
## 10 2018 <NA> 8,926 2,025
## # ℹ 39 more variables:
## # total_cash_cash_equivalents_and_short_term_investments <chr>,
## # marketable_securities <chr>,
## # trade_accounts_receivable_less_allowances <chr>, inventories <chr>,
## # prepaid_expenses_and_other_assets <chr>, assets_held_for_sale <chr>,
## # assets_held_for_sale_v2 <chr>, total_current_assets <chr>,
## # equity_method_investments <chr>, other_investments <chr>, …
# Remove columns we won't need for this data frame to make things simpler
cols_to_delete <- c("assets", "liabilities", "na")
balance5 <- balance4[ , !(names(balance4) %in% cols_to_delete)]
balance5
## # A tibble: 10 × 40
## year cash_and_cash_equivalents short_term_investments total_cash_cash_equi…¹
## <chr> <chr> <chr> <chr>
## 1 2009 7,021 2,130 9,151
## 2 2010 8,517 2,682 11,199
## 3 2011 12,803 1,088 13,891
## 4 2012 8,442 5,017 13,459
## 5 2013 10,414 6,707 17,121
## 6 2014 8,958 9,052 18,010
## 7 2015 7,309 8,322 15,631
## 8 2016 8,555 9,595 18,150
## 9 2017 6,006 9,352 15,358
## 10 2018 8,926 2,025 10,951
## # ℹ abbreviated name: ¹total_cash_cash_equivalents_and_short_term_investments
## # ℹ 36 more variables: marketable_securities <chr>,
## # trade_accounts_receivable_less_allowances <chr>, inventories <chr>,
## # prepaid_expenses_and_other_assets <chr>, assets_held_for_sale <chr>,
## # assets_held_for_sale_v2 <chr>, total_current_assets <chr>,
## # equity_method_investments <chr>, other_investments <chr>,
## # other_assets <chr>, property_plant_and_equipment___net <chr>, …
convert_string_to_number <- function(column) {
as.numeric(gsub("," ,"", column))
}
balance5 <- balance5 %>%
mutate(across(where(is.character), convert_string_to_number))
balance5$year <- as.character(balance5$year)
balance5
## # A tibble: 10 × 40
## year cash_and_cash_equivalents short_term_investments total_cash_cash_equi…¹
## <chr> <dbl> <dbl> <dbl>
## 1 2009 7021 2130 9151
## 2 2010 8517 2682 11199
## 3 2011 12803 1088 13891
## 4 2012 8442 5017 13459
## 5 2013 10414 6707 17121
## 6 2014 8958 9052 18010
## 7 2015 7309 8322 15631
## 8 2016 8555 9595 18150
## 9 2017 6006 9352 15358
## 10 2018 8926 2025 10951
## # ℹ abbreviated name: ¹total_cash_cash_equivalents_and_short_term_investments
## # ℹ 36 more variables: marketable_securities <dbl>,
## # trade_accounts_receivable_less_allowances <dbl>, inventories <dbl>,
## # prepaid_expenses_and_other_assets <dbl>, assets_held_for_sale <dbl>,
## # assets_held_for_sale_v2 <dbl>, total_current_assets <dbl>,
## # equity_method_investments <dbl>, other_investments <dbl>,
## # other_assets <dbl>, property_plant_and_equipment___net <dbl>, …
We now have a cleaner, tidy data frame.
Let’s look at individual percentages of liabilities for this, to see what is taking up the most liability:
balance6 <- balance5 %>%
mutate(accounts_payable_and_accrued_expenses_perc = accounts_payable_and_accrued_expenses / total_current_liabilities) %>%
mutate(loans_and_notes_payable_perc = loans_and_notes_payable / total_current_liabilities) %>%
mutate(current_maturities_of_long_term_debt_perc = current_maturities_of_long_term_debt / total_current_liabilities) %>%
mutate(accrued_income_taxes_perc = accrued_income_taxes / total_current_liabilities) %>%
mutate(liabilities_held_for_sale_perc = liabilities_held_for_sale / total_current_liabilities) %>%
mutate(liabilities_held_for_sale_discontinued_operations_perc = liabilities_held_for_sale_discontinued_operations / total_current_liabilities)
head(balance6, 10)
## # A tibble: 10 × 46
## year cash_and_cash_equivalents short_term_investments total_cash_cash_equi…¹
## <chr> <dbl> <dbl> <dbl>
## 1 2009 7021 2130 9151
## 2 2010 8517 2682 11199
## 3 2011 12803 1088 13891
## 4 2012 8442 5017 13459
## 5 2013 10414 6707 17121
## 6 2014 8958 9052 18010
## 7 2015 7309 8322 15631
## 8 2016 8555 9595 18150
## 9 2017 6006 9352 15358
## 10 2018 8926 2025 10951
## # ℹ abbreviated name: ¹total_cash_cash_equivalents_and_short_term_investments
## # ℹ 42 more variables: marketable_securities <dbl>,
## # trade_accounts_receivable_less_allowances <dbl>, inventories <dbl>,
## # prepaid_expenses_and_other_assets <dbl>, assets_held_for_sale <dbl>,
## # assets_held_for_sale_v2 <dbl>, total_current_assets <dbl>,
## # equity_method_investments <dbl>, other_investments <dbl>,
## # other_assets <dbl>, property_plant_and_equipment___net <dbl>, …
Let’s see a graph of how total liabilities have changed over time:
ggplot(balance6, aes(x=year, y=total_current_liabilities)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
The general trend is increasing with a peak in 2014.
Let’s see how accounts_payable_and_accrued_expense
percentage has changed:
ggplot(balance6, aes(x=year, y=accounts_payable_and_accrued_expenses_perc)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplot(balance6, aes(x=year, y=loans_and_notes_payable_perc)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplot(balance6, aes(x=year, y=current_maturities_of_long_term_debt_perc)) +
geom_bar(stat='identity', position='dodge') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
So it looks like loans_and_notes_payable_perc
most of
the time, takes up most of the liabilities with the higher percentage.
It looks like the general liability spike in 2014 could be attributed to
current_maturities_of_long_term_debt_perc
which also had a
major increase in 2014. It’s also interesting that this generally
continued to increase until 2018 (peak at 2018).
Let’s finally compare total assets to total liabilities:
ggplot(data = balance6, aes(x = total_assets, y = total_current_liabilities, color = year)) +
geom_point()
This shows a positive trend as total assets increases, the total liabilities also increases. This also shows in 2014, where there was a spike in liabilities, there was also a spike in total assets.
This data set was a good practice for using pivot_wider and pivot_longer. This example feels like an excel sheet that could be seen in the real world, where the focus is on how easily presentable it is instead of the most efficient data structure.