Homework 2 is all about using ggplot2. You will use tech_co_cstat.dta (or .zip) data set you have used previously in Homework 1. You are aware of its structure and the meanings of the variables. Recall that you explored it in Homework 1. Knowing your data set well before you start exploring it is absolutely essential for data science.
d1 <- haven::read_dta("tech_co_cstat_dta.zip") %>%
filter(sale > 0) %>%
mutate(conm = stringr::str_to_title(conm)) # Converts the string to title case
This homework consists of 7 questions. Your objective is to reproduce the output shown in the HTML file for Q1 through Q7.
Create a bar graph of the average annual profits of each company using the variable oibdp such that the bars are arranged in descending order. I’m giving you the code to prepare the data set to make the plot:
d1_1 <- d1 %>%
group_by(conm) %>%
summarize(avg_profit = mean(oibdp), .groups = "drop")
d1_1 has the appropriate variable you need to create the bar plot. Notice that I have used .groups = "drop" inside summarize(). This makes sure that the data you will use in ggplot() is not grouped.
There is a lot going on here. Note that the bars are ordered and flipped. The X axis labels are using a “$” sign. Finally, the axis titles are elaborate.
d1_1 %>%
ggplot(aes(x = reorder(conm, avg_profit), y = avg_profit)) +
geom_bar(stat = "identity") +
labs(x = "Company",
y= "Avg Profit in $Millions") + coord_flip() +
scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
Modify the plot in Q1 to add text labels to the bars. Note that I used hjust = -0.2 and size = 3 for the text labels.
d1_1 %>%
ggplot(aes(x = reorder(conm, avg_profit), y = avg_profit)) +
geom_bar(stat = "identity") +
labs(x = "Company",
y= "Avg Profit in $Millions") + coord_flip() +
scale_y_continuous(labels = scales::dollar_format(prefix = "$")) +
geom_text(aes(label = round(avg_profit/1000, 0)), hjust = -0.2, siz = 3)
## Warning: Ignoring unknown parameters: siz
In finance, it is widely believed that companies with more R&D prefer lower debt. Let’s explore whether we can observe this relationship in our data. Using mutate first create these two variables. I am giving you the exact formulas to put inside mutate().
debt_ratio = (dlc + replace_na(dltt, 0)) / at
rnd_int = xrd / sale
Next, create a scatterplot with debt_ratio on the X axis and rnd_int on the Y axis.
q_3 <- d1 %>%
mutate(debt_ratio = (dlc + replace_na(dltt, 0)/at))
q_3 <- q_3 %>%
mutate(rnd_int = xrd/sale)
ggplot(q_3, aes(x = debt_ratio, y = rnd_int)) +
geom_point() +
geom_smooth() +
scale_y_continuous("R&D/Sales Ratio") +
scale_x_continuous(breaks = c(0.0,0.1,0.2,0.3,0.4,0.5),"Debt Ratio")
R&D investments are risky and may take years to generate returns. As such, borrowing money to fund R&D is very expensive. But looks like this is not true in our sample!
Profit margin is simply profits divided by sales. Compare profit margins of the following six companies - Apple, IBM, Facebook, Paypal, Amazon, and Qualcomm - over the full sample period. Use fyear on the X axis. fyear is the fiscal year.
Here I give you the code to get the data set in the required form. First, note that I am using the variable tic to filter certain “ticker symbols”, which are the IDs used by stock markets for companies. I am doing this simply to save on typing rather than writing out the entire company names! You could also use gvkey as it is a company identifier. But gvkey are not intuitive. Ticker symbols can help you guess (in most cases) what a stock is.
Aside: Some comapnies may have non-intuitive tickers. For exmaple, Coca Cola’s ticker is KO. You would expect it to be COKE but you would be wrong! Some other companies have secured cool tickers. For example, the synthetic biology firm Gingko’s ticker is DNA.
As you are using fyear for plotting a time series, we have to make sure that fyear is indeed interpreted by ggplot2 as a time variable. However, it’s not that straightforward. This is because fiscal years, unlike calendar years, don’t all end exactly on the same day! I know it sounds insane but that’s true. Think about this like school years in different school districts. If the fiscal year ends in different months, how can we create a valid comparison among these companies? Indeed, the variable datadate, which is the fiscal year end date, is not the same for all the companies for any given fiscal year. Luckily we are dealing with annual data and so we can artificially choose to pick a common year end date for the sake of making the plot. Note that this is not the right thing to do for statistical or financial analysis! This simply helps us in making a meaningful plot. As such, I am setting the year end date for all the fiscal years to December 31st. Below, paste function will create strings in “yyyy-mm-dd” format with mm being 12 and dd being 31 as show below. Next, as.Date() function from base R will convert it into an actual date format!
d1_4 <- d1 %>%
filter(tic %in% c("AAPL", "FB", "IBM", "PYPL", "AMZN", "QCOM")) %>%
mutate(pr_margin = oibdp / sale,
fyear = as.Date(paste(fyear, "12", "31", sep = "-")))
Now use d1_4 to create the following plot:
ggplot(d1_4, aes(fyear, pr_margin)) +
geom_line(method = "lm", se = FALSE) +
labs(x = "Fiscal Year", y = "Profit Margin") +
facet_wrap(~conm, scales = "fixed")
## Warning: Ignoring unknown parameters: method, se
Tesla is the largest car manufacturer in the world by market value. But what about sales? Let’s compare sales and market value over the 10/11 years period in our sample.
First create a data frame that you can use to create a plot where you can compare sales and market value in the same plot. This requires rearranging the data into “long” form, where we will stack Tesla’s sales and market value on top of each other.
Aside: Learn more about the functions that deal with long and wide formats conversions here - https://tidyr.tidyverse.org/reference/pivot_longer.html and https://tidyr.tidyverse.org/reference/pivot_wider.html
Here is the code to create such a data set. Please read it carefully to understand all the steps.
d1_5 <- d1 %>%
filter(conm == "Tesla Inc") %>%
mutate(mkt_val = prcc_f * cshpri) %>% # Create market value
select(conm, datadate, mkt_val, sale) %>%
pivot_longer(cols = c(mkt_val, sale),
names_to = "fin_var",
values_to = "fin_value")
Print first few rows of d1_5 by using head() function to understand what this data set is.
head(d1_5)
## # A tibble: 6 x 4
## conm datadate fin_var fin_value
## <chr> <date> <chr> <dbl>
## 1 Tesla Inc 2010-12-31 mkt_val 2481.
## 2 Tesla Inc 2010-12-31 sale 117.
## 3 Tesla Inc 2011-12-31 mkt_val 2867.
## 4 Tesla Inc 2011-12-31 sale 204.
## 5 Tesla Inc 2012-12-31 mkt_val 3636.
## 6 Tesla Inc 2012-12-31 sale 413.
Now using d1_5, create the following plot using datadate on the X axis:
d1_5 %>%
ggplot(aes(x = datadate, y = fin_value, color = fin_var)) +
geom_line() +
labs(x = "Date", y = NULL, colour = "Financial Variable") +
theme(legend.position = "top") +
scale_y_continuous(labels = scales::dollar_format(prefix = "$"))
Tesla sales are pretty much flat while its market value zoomed up during the pandemic. What gives? Nobody knows! But we know that a bet against Tesla and Elon Musk means bankruptcy at this point :)
When the time variable is discrete, we can also show a time trend using a bar plot. This is quite common in practice. fyear is an integer so we can use it as a discrete variable and create a bar plot of profits for Facebook and IBM as shown below. Manually change the fill of bars using the following colors: c("#5cc9f5", "#b131a2")
d1_6 <- filter(d1, conm == "Facebook Inc" | conm == "Intl Business Machines Corp")
d1_6 %>%
ggplot(aes(fyear, oibdp, fill = conm)) +
geom_bar(stat = "identity") +
facet_wrap(~conm, nrow = 2) +
scale_fill_manual(values = c("#5cc9f5", "#b131a2")) +
theme(legend.position = "top",
legend.title = element_blank()) +
labs(x = "Fiscal Year",
y = "Profit in $Millions") +
scale_x_continuous(breaks = seq(2010,2020,1))
Use Mark Zuckerberg’s cutout to create the following visualization. You are free to position the picture anywhere and in size you want. Just don’t cover the bars.
d1_7 <- filter(d1, conm == "Facebook Inc")
d1_7 %>%
ggplot(aes(x = fyear, y = oibdp)) +
geom_bar(stat = "identity", color = "light blue", fill = "light blue") +
labs(x = "Fiscal Year",
y = "Profit in $Millions") +
scale_x_continuous(breaks = seq(2010,2020,1)) +
annotation_raster(readPNG("mark-zuckerberg-celebrity-mask.png"),
xmin = 2012, xmax = 2015, ymin = 20000, ymax = 40000, interpolate = T)