This R Markdown will be using the dataset provided by Gillian McGovern from the discussion board. For the sake of simpicity, I’ll convert the provided “Coca-cola” file from .xlsx to .csv and additonally there is text obscuring and at risk of distorting the data (the first 3 rows only say “Data provided by SimFin” with an additional row being empty, and the “Profit & Loss statement” which I would rather remove to have a clean way of naming the columns myself). Due to the csv file having a totally empty first column, I will name it for detection and follow up with removing it entirely. The following code is used to load and observe the dataset:
cola <- read_csv(
"https://raw.githubusercontent.com/GullitNa/DATA607-Project2/main/Cola.csv",
skip = 4,
col_names = c(
"test", "In million USD", "FY09", "FY10", "FY11", "FY12",
"FY13", "FY14", "FY15", "FY16", "FY17", "FY18"
)
) %>%
select(-test)
## Rows: 100 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): test, In million USD, FY09, FY10, FY11, FY12, FY13, FY14, FY15, FY...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cola)
## # A tibble: 6 × 11
## `In million USD` FY09 FY10 FY11 FY12 FY13 FY14 FY15 FY16 FY17 FY18
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NET OPERATING REV… 30,9… 35,1… 46,5… 48,0… 46,8… 45,9… 44,2… 41,8… 35,4… 31,8…
## 2 Cost of goods sold 11,0… 12,6… 18,2… 19,0… 18,4… 17,8… 17,4… 16,4… 13,2… 11,7…
## 3 Gross Profit 19,9… 22,4… 28,3… 28,9… 28,4… 28,1… 26,8… 25,3… 22,1… 20,0…
## 4 Selling, general … 11,3… 13,1… 17,4… 17,7… 17,3… 17,2… 16,4… 15,2… 12,6… 10,3…
## 5 Other operating c… 313 819 732 447 895 1,183 1,657 1,510 1,902 1,079
## 6 Operating Income 8,231 8,413 10,1… 10,7… 10,2… 9,708 8,728 8,626 7,599 8,700
Loading in this dataset took alot of trail and error to see what works and what doesn’t. Aside from the future code to further tidy up the data, I would say the main challenge of this dataset was the skipping of the constant non-data related rows to see which is the best start for tidying, naming the columns and originally shifting the data to the left, but followed up with naming the very first empty column for its removal. This in general term, would be transformation of the data simply because of the skipping of the metadata, and renaming columns.
In order for readability, I can change all the numerical numbers from the columns to not include commas and to ensure that the data is counting these results as numerical data. I understand that the data still contains NA
cola <- cola %>%
mutate(
across(starts_with("FY"), ~ as.numeric(gsub(",", "", .x)))
)
## Warning: There were 10 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(starts_with("FY"), ~as.numeric(gsub(",", "", .x)))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 9 remaining warnings.
cola
## # A tibble: 100 × 11
## `In million USD` FY09 FY10 FY11 FY12 FY13 FY14 FY15 FY16 FY17 FY18
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NET OPERATING RE… 30990 35119 46542 48017 46854 45998 44294 41863 35410 31856
## 2 Cost of goods so… 11088 12693 18215 19053 18421 17889 17482 16465 13255 11770
## 3 Gross Profit 19902 22426 28327 28964 28433 28109 26812 25398 22155 20086
## 4 Selling, general… 11358 13194 17422 17738 17310 17218 16427 15262 12654 10307
## 5 Other operating … 313 819 732 447 895 1183 1657 1510 1902 1079
## 6 Operating Income 8231 8413 10173 10779 10228 9708 8728 8626 7599 8700
## 7 Interest income 249 317 483 471 534 594 613 642 677 682
## 8 Interest expense 355 733 417 397 463 483 856 733 841 919
## 9 Equity income (l… 781 1025 690 819 602 769 489 835 1071 1008
## 10 Other income (lo… 40 5185 529 137 576 -1263 631 -1234 -1764 -1121
## # ℹ 90 more rows
As per Gillian McGovern’s dataset, an accounting spreadsheet with specific cells being empty and registered as NA is completely normal. For the sake of readability, it’s actually best if I don’t interfer and tamper with these cells for the reason I just listed and to preserve the raw data, in the event to not shake and alter the analysis. For example, some expense or revenue categories may not apply in all fiscal years, or a line item might only appear under specific conditions (like the discontinued operations row). As a result, there are NA values in places where data isn’t recorded. These NA values typically won’t affect the analysis, assuming ignoring/filtering them out for specific datawide calculations as one example.
I plan to analyze this data via specifically choosing the category “Net increase (decrease) during the year” in terms of visualization and also by comparing the years.
net_year <- cola %>%
filter(`In million USD` == "Net increase (decrease) during the year")
The following code is for further clarity for the upcoming bar graph that represents the net increase and decrease (via changing the name of the year columns).
years <- 2009:2018
net_year_chain <- c(
net_year$FY09, net_year$FY10, net_year$FY11,
net_year$FY12, net_year$FY13, net_year$FY14,
net_year$FY15, net_year$FY16, net_year$FY17,
net_year$FY18
)
According to the bar graph of “Net increase (decrease) over the years, and as an example of seeing how a specific category changes over time, Coca-cola remains positive within the years of 2009, 2010, and 2011. However, in 2012 marks the first major negative value of -4361 in million USD. 2013, and 2016 rebound somewhat substantially, otherwise 2014, 2015, and 2017 indicate net decreases. 2018’s fiscal year shows significant growth indicating that Coca-cola as a company had significant improvement to get out of bounds form their negative value years.
df_net_year <- data.frame(
Year = years,
Value = net_year_chain
)
ggplot(df_net_year, aes(x = factor(Year), y = Value)) +
geom_col() +
labs(
title = "Net increase (decrease) over the years",
x = "Year",
y = "in million USD"
)
Throughout the Coca-cola dataset, I skipped the first few lines to remove metadata as well as renamed columns and removed any extra blank column so that each row corresponds to a financial statement line and each column is a fiscal year (FY09–FY18) also including the “in million USD” as a base categorical column. Additionally followed up via plotting how a single category changes over time by manually gathering its yearly values. This gives us insights into year-to-year trends (net increases or decreases) and the ability to explore different individual financial statement lines.