Data Loading

Gillian McGovern’s data

This R Markdown document uses the dataset provided by Gillian McGovern on the discussion board. For the sake of simplicity, I converted the provided “Coca-cola” file from .xlsx to .csv. Additionally, there is text at the top of the file that obscures and risks distorting the data (the first 3 rows only say “Data provided by SimFin,” an additional row is empty, and there is a “Profit & Loss statement” label that I would rather remove so I have a clean way of naming the columns myself). Because the csv file also has a completely empty first column, I give that column a name so it can be detected and then remove it entirely. The following code loads the dataset for an initial look:

cola <- read_csv(
  "https://raw.githubusercontent.com/GullitNa/DATA607-Project2/main/Cola.csv",
  skip = 4,  # skip the metadata/header rows at the top of the file
  col_names = c(
    "test", "In million USD", "FY09", "FY10", "FY11", "FY12",
    "FY13", "FY14", "FY15", "FY16", "FY17", "FY18"
  )
) %>%
  select(-test)  # drop the empty first column (named "test" only so it can be removed)
## Rows: 100 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): test, In million USD, FY09, FY10, FY11, FY12, FY13, FY14, FY15, FY...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cola)
## # A tibble: 6 × 11
##   `In million USD`   FY09  FY10  FY11  FY12  FY13  FY14  FY15  FY16  FY17  FY18 
##   <chr>              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NET OPERATING REV… 30,9… 35,1… 46,5… 48,0… 46,8… 45,9… 44,2… 41,8… 35,4… 31,8…
## 2 Cost of goods sold 11,0… 12,6… 18,2… 19,0… 18,4… 17,8… 17,4… 16,4… 13,2… 11,7…
## 3 Gross Profit       19,9… 22,4… 28,3… 28,9… 28,4… 28,1… 26,8… 25,3… 22,1… 20,0…
## 4 Selling, general … 11,3… 13,1… 17,4… 17,7… 17,3… 17,2… 16,4… 15,2… 12,6… 10,3…
## 5 Other operating c… 313   819   732   447   895   1,183 1,657 1,510 1,902 1,079
## 6 Operating Income   8,231 8,413 10,1… 10,7… 10,2… 9,708 8,728 8,626 7,599 8,700

Explanation and Initial Thoughts

Loading this dataset took a lot of trial and error to see what works and what doesn’t. Aside from the later code that further tidies the data, the main challenge was deciding how many of the constant non-data rows to skip in order to find the best starting point for tidying, naming the columns (which effectively shifts the data to the left), and then naming the empty first column so it could be removed. In general terms, this is a transformation of the data: skipping the metadata and renaming the columns.
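
As a side note on that trial and error: a quick way to confirm how many leading rows are metadata, before committing to skip = 4, is to read the first few raw lines of the file directly. This is only a minimal check, reusing the same URL as above:

# Peek at the raw text to see which leading rows are metadata rather than data
read_lines(
  "https://raw.githubusercontent.com/GullitNa/DATA607-Project2/main/Cola.csv",
  n_max = 6
)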

Data Cleaning

For readability, I strip the commas from the values in the fiscal-year columns and convert them so that R treats these results as numeric data. I understand that the data still contains NA values; the next section explains why I leave those cells alone.

cola <- cola %>%
  mutate(
    # remove the thousands-separator commas, then coerce each FY column to numeric
    across(starts_with("FY"), ~ as.numeric(gsub(",", "", .x)))
  )
## Warning: There were 10 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(starts_with("FY"), ~as.numeric(gsub(",", "", .x)))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 9 remaining warnings.
cola
## # A tibble: 100 × 11
##    `In million USD`   FY09  FY10  FY11  FY12  FY13  FY14  FY15  FY16  FY17  FY18
##    <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 NET OPERATING RE… 30990 35119 46542 48017 46854 45998 44294 41863 35410 31856
##  2 Cost of goods so… 11088 12693 18215 19053 18421 17889 17482 16465 13255 11770
##  3 Gross Profit      19902 22426 28327 28964 28433 28109 26812 25398 22155 20086
##  4 Selling, general… 11358 13194 17422 17738 17310 17218 16427 15262 12654 10307
##  5 Other operating …   313   819   732   447   895  1183  1657  1510  1902  1079
##  6 Operating Income   8231  8413 10173 10779 10228  9708  8728  8626  7599  8700
##  7 Interest income     249   317   483   471   534   594   613   642   677   682
##  8 Interest expense    355   733   417   397   463   483   856   733   841   919
##  9 Equity income (l…   781  1025   690   819   602   769   489   835  1071  1008
## 10 Other income (lo…    40  5185   529   137   576 -1263   631 -1234 -1764 -1121
## # ℹ 90 more rows
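
The “NAs introduced by coercion” warnings come from cells that were blank in the original spreadsheet (for example, section-header rows with no figures). As a quick sanity check, sketched here rather than being part of the original workflow, the NAs produced in each fiscal-year column can be counted:

# Count how many NA values each fiscal-year column now contains
cola %>%
  summarise(across(starts_with("FY"), ~ sum(is.na(.x))))

An alternative would be readr::parse_number(), which strips the grouping commas on its own; the gsub() approach above simply keeps the conversion explicit.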

Data Tidy Transformations

As with any accounting spreadsheet, having specific cells empty and registered as NA in Gillian McGovern’s dataset is completely normal. It is actually best if I don’t interfere with or tamper with these cells, both for that reason and to preserve the raw data so the analysis isn’t distorted. For example, some expense or revenue categories may not apply in all fiscal years, or a line item might only appear under specific conditions (like the discontinued operations row). As a result, there are NA values in places where data isn’t recorded. These NA values typically won’t affect the analysis, as long as they are ignored or filtered out in dataset-wide calculations. Additionally, as part of the transformation and to make the output clearer to the viewer, rather than showing only the last two digits of each year in the visualization, I mutate the column to add “20” in front so the full year appears (“10” becomes “2010”).

cola_tidy <- cola %>%
  pivot_longer(
    cols = starts_with("FY"),  # gather the fiscal-year columns into long format
    names_to = "Year",
    values_to = "Value"
  ) %>%
  mutate(Year = sub("FY", "", Year),  # drop the "FY" prefix
         Year = paste0("20", Year)    # expand "09"-"18" to "2009"-"2018"
  )
cola_tidy
## # A tibble: 1,000 × 3
##    `In million USD`       Year  Value
##    <chr>                  <chr> <dbl>
##  1 NET OPERATING REVENUES 2009  30990
##  2 NET OPERATING REVENUES 2010  35119
##  3 NET OPERATING REVENUES 2011  46542
##  4 NET OPERATING REVENUES 2012  48017
##  5 NET OPERATING REVENUES 2013  46854
##  6 NET OPERATING REVENUES 2014  45998
##  7 NET OPERATING REVENUES 2015  44294
##  8 NET OPERATING REVENUES 2016  41863
##  9 NET OPERATING REVENUES 2017  35410
## 10 NET OPERATING REVENUES 2018  31856
## # ℹ 990 more rows
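
Purely to illustrate the point above about NA cells (this is not part of the analysis itself), the long format makes it easy to list which statement lines are missing values in some fiscal years:

# Show which statement lines carry NA values, and for how many years
cola_tidy %>%
  filter(is.na(Value)) %>%
  count(`In million USD`, sort = TRUE)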

Analysis

Filtering

I plan to analyze this data by focusing on the category “Net increase (decrease) during the year,” both visualizing it and comparing it across the years.

net_year <- cola_tidy %>%
  filter(`In million USD` == "Net increase (decrease) during the year")
net_year
## # A tibble: 10 × 3
##    `In million USD`                        Year  Value
##    <chr>                                   <chr> <dbl>
##  1 Net increase (decrease) during the year 2009   2320
##  2 Net increase (decrease) during the year 2010   1496
##  3 Net increase (decrease) during the year 2011   4286
##  4 Net increase (decrease) during the year 2012  -4361
##  5 Net increase (decrease) during the year 2013   1972
##  6 Net increase (decrease) during the year 2014  -1456
##  7 Net increase (decrease) during the year 2015  -1649
##  8 Net increase (decrease) during the year 2016   1246
##  9 Net increase (decrease) during the year 2017  -2477
## 10 Net increase (decrease) during the year 2018   2945

Plotting

According to the bar graph of “Net increase (decrease)” over the years, and as an example of how a specific category changes over time, Coca-cola remains positive in 2009, 2010, and 2011. However, 2012 marks the first major negative value, -4,361 million USD. 2013 and 2016 rebound fairly substantially, while 2014, 2015, and 2017 show net decreases. Fiscal year 2018 shows significant growth, indicating that Coca-cola made a strong recovery from its negative-value years.

ggplot(net_year, aes(x = Year, y = Value)) +
  geom_col() +
  labs(
    title = "Net Increase (Decrease) Over the Years",
    x = "Fiscal Year",
    y = "Value (in million USD)"
  )
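
Since the filtering step also mentioned comparing the years, one optional follow-up, sketched here on top of the net_year tibble from above rather than taken from the original write-up, is the year-over-year change in this line item:

# Difference between each year's value and the prior year's value
net_year %>%
  arrange(Year) %>%
  mutate(change_vs_prior_year = Value - lag(Value))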

Conclusion

For the Coca-cola dataset, I skipped the first few lines to remove metadata, renamed the columns, and removed the extra blank column so that each row corresponds to a financial statement line and each column to a fiscal year (FY09–FY18), with “In million USD” as the base categorical column. I then followed up by plotting how a single category changes over time, after gathering its yearly values into long format. This gives insight into year-to-year trends (net increases or decreases) and makes it easy to explore other individual financial statement lines.