Homework 1

Author

NUPOOR KARNIK and jhr497

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This homework uses tech_stocks_csv.zip which we used in the week 3 class.

Read tech_stocks_csv.zip into your R session using read_csv() function from readr package. Store the resulting object in d1.

d1 = read_csv("tech_stocks_csv.zip")
Rows: 93 Columns: 981
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (34): gvkey, indfmt, consol, popsrc, datafmt, tic, cusip, conm, acctch...
dbl  (436): fyear, ajex, ajp, currtr, fyr, ismod, ltcm, pddur, scf, src, upd...
lgl  (506): adrr, bspr, curuscn, ogm, stalt, udpl, acco, accrt, acoxar, acql...
date   (5): datadate, apdedate, fdate, pdate, ipodate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This homework consists of 8 questions. Q1 carries 1 point. Q2 through Q8 carry two points each. We use d2 as the initial input. Your objective is to reproduce the output shown in the PDF file for Q1 through Q8.

Q1

Print a data frame with the medians of at, emp, and xrd.

at is Total assets (USD million) emp is Number of employees (In thousands) xrd is R&D expenses (USD million)

dt2 = select(.data = d1, at, emp, xrd)
summarize(dt2, 
          at_median = median(at),
          emp_median = format(round(median(emp), 1), nsmall = 1),
          xrd_median = median(xrd),
)
# A tibble: 1 × 3
  at_median emp_median xrd_median
      <dbl> <chr>           <dbl>
1     93798 72.0             7754

Q2

Print a data frame with the means of sale, oibdp, and xrd for Apple, Meta, and Tesla. For this, you will need to follow these steps:

  1. Filter only the observations pertaining to Apple, Meta, and Tesla
  2. Group by conm
  3. Summarize sale, oibdp, and xrd to get their means
  4. Output it as a data frame by using as.data.frame() function.

In Step 3, you may optionally use across() inside summarize(). This function will save you writing more code. Read more about it here: https://dplyr.tidyverse.org/reference/across.html

df = d1 %>% 
  select(conm, sale, oibdp, xrd) %>% 
filter(conm %in% c("APPLE INC", "META PLATFORMS INC", "TESLA INC")) %>% 
  group_by(conm) %>% 
  summarize(
            sale_mean = mean(sale),
            oibdp_mean = mean(oibdp),
            xrd_mean = mean(xrd, na.rm = TRUE))
as.data.frame(df)
                conm sale_mean oibdp_mean  xrd_mean
1          APPLE INC 224763.08  71468.923 11167.000
2 META PLATFORMS INC  43413.00  20110.923  9754.923
3          TESLA INC  18585.48   2672.776  1094.280

Q3

Round all the numeric variables in the above data frame to 1 decimal place. Output as a data frame using as.data.frame() function.

For rounding, you will have to use mutate, across, and where functions from dplyr package. Check https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/ for more information.

df = d1 %>% 
  select(conm, sale, oibdp, xrd) %>% 
filter(conm %in% c("APPLE INC", "META PLATFORMS INC", "TESLA INC")) %>% 
  group_by(conm) %>% 
  summarize(
            sale_mean = mean(sale),
            oibdp_mean = mean(oibdp),
            xrd_mean = mean(xrd, na.rm = TRUE))
mutate(df, across(where(is.numeric), round, 1))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(where(is.numeric), round, 1)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 3 × 4
  conm               sale_mean oibdp_mean xrd_mean
  <chr>                  <dbl>      <dbl>    <dbl>
1 APPLE INC            224763.     71469.   11167 
2 META PLATFORMS INC    43413      20111.    9755.
3 TESLA INC             18586.      2673.    1094.
as.data.frame(df)
                conm sale_mean oibdp_mean  xrd_mean
1          APPLE INC 224763.08  71468.923 11167.000
2 META PLATFORMS INC  43413.00  20110.923  9754.923
3          TESLA INC  18585.48   2672.776  1094.280

Q4

Many advertising values are missing. The missing code in R is NA. We can get the total number of missing values for advertising quite easily by running the following function:

sum(is.na(d1$xad))
[1] 13

In the finance literature, a common (but incorrect) practice is to assume that the missing advertising is 0. We will use this adjustment to xad and create a new variable adv and save it in a new object d2.

The first six values of d2 when xad is NA are as follows:

d3 = select(.data = d1, conm, datadate, xad)
d2 <- d3 %>%
  mutate(adv = ifelse(is.na(xad), 0, xad))
head(filter(d2,is.na(xad)),6)
# A tibble: 6 × 4
  conm      datadate     xad   adv
  <chr>     <date>     <dbl> <dbl>
1 APPLE INC 2016-09-30    NA     0
2 APPLE INC 2017-09-30    NA     0
3 APPLE INC 2018-09-30    NA     0
4 APPLE INC 2019-09-30    NA     0
5 APPLE INC 2020-09-30    NA     0
6 APPLE INC 2021-09-30    NA     0

Q5

Using d2, create the following variables and print first 8 rows for NVidia and the new columns along with conm and datadate:

  1. Return on assets (roa) = oibdp / at
  2. Free cash flow (fcf) = oancf / che
  3. Strategic emphasis (strat_emph) = (adv - xrd) / at
d2 <- d1 %>%
 mutate(d2,
    roa = round(oibdp / at, 3),
    fcf = round(oancf / che, 3),
    strat_emph = round((adv - xrd) / at, 3)
  )
nvidia_data <- d2 %>%
  
  select(conm, datadate, roa, fcf, strat_emph) %>% 
filter(conm == "NVIDIA CORP")
nvidia_joined = left_join(d1, nvidia_data, by = "datadate")


head(nvidia_data, 8)
# A tibble: 8 × 5
  conm        datadate     roa   fcf strat_emph
  <chr>       <date>     <dbl> <dbl>      <dbl>
1 NVIDIA CORP 2010-01-31 0.093 0.282     -0.249
2 NVIDIA CORP 2011-01-31 0.129 0.271     -0.187
3 NVIDIA CORP 2012-01-31 0.158 0.291     -0.179
4 NVIDIA CORP 2013-01-31 0.142 0.221     -0.177
5 NVIDIA CORP 2014-01-31 0.107 0.179     -0.182
6 NVIDIA CORP 2015-01-31 0.136 0.196     -0.187
7 NVIDIA CORP 2016-01-31 0.149 0.233     -0.178
8 NVIDIA CORP 2017-01-31 0.217 0.246     -0.147

Q6

You want to know how many profitable years each of the sample company experienced. For this, follow these steps:

  1. Create an indicator variable (dummy variable) called profit_ind such that when oibdp > 0 this variable is 1. Otherwise it is 0.
  2. Group by company names
  3. Summarize profit_ind by taking its sum. Also, get the total number of observations for each company.
d5 <- d1 %>% 
  select(conm, oibdp) %>% 
mutate(profit_ind = ifelse(oibdp > 0,1,0)) %>% 
  group_by(conm) %>% 
       summarize(profit_years=sum(profit_ind), total_years=n())
as.data.frame(d5)
                conm profit_years total_years
1       ALPHABET INC           13          13
2     AMAZON.COM INC           13          13
3          APPLE INC           13          13
4 META PLATFORMS INC           13          13
5     MICROSOFT CORP           14          14
6        NVIDIA CORP           14          14
7          TESLA INC            9          13

Q7

Find the average annual stock returns of all the companies. Follow these steps:

  1. Arrange the data set by conm and datadate.
  2. Group by conm
  3. Calculate stock return stk_ret by taking the difference between mkvalt (Market value of equity) and its lag and then divide the difference by the lag of mkvalt
  4. Calculate stock return stk_ret2 by taking the difference between prcc_f (Stock price at the end of the fiscal year) and its lag and then divide the difference by the lag of prcc_f
  5. Summarize to get the mean of the stock returns stk_ret_mean and stk_ret2_mean.
  6. Display the average stock returns in percentage format.
d6 <- d1 %>%
  arrange(conm, datadate) %>%
  group_by(conm) %>%
  mutate(
    stk_ret = (mkvalt - lag(mkvalt)) / lag(mkvalt),
    stk_ret2 = (prcc_f - lag(prcc_f)) / lag(prcc_f)
  ) %>%
  summarise(
    stk_ret_mean = mean(stk_ret, na.rm = TRUE),
    stk_ret2_mean = mean(stk_ret2, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(
    stk_ret_mean = sprintf("%.2f%%", stk_ret_mean*100),  
    stk_ret2_mean= sprintf("%.2f%%", stk_ret2_mean*100)
  )


as.data.frame(d6)
                conm stk_ret_mean stk_ret2_mean
1       ALPHABET INC       19.87%        11.00%
2     AMAZON.COM INC       29.98%        24.60%
3          APPLE INC       24.31%         5.62%
4 META PLATFORMS INC       29.45%        26.84%
5     MICROSOFT CORP       22.61%        24.01%
6        NVIDIA CORP       57.76%        43.89%
7          TESLA INC      115.37%        43.60%

Not graded: The average stock returns calculated using these two measures are very different. Which of these is correct?

Ans: The second method seems correct. However, it depends on the research question.

Q8

In many statistical and machine learning applications, we use scaled variables instead of the original variables. A scaled variable is typically created by subtracting the sample mean of the variable from the variable and dividing it by its standard deviation. There is a scale() function in base R which can directly do it.

You want to create a scaled variable for sale but separately for each company. Therefore, you can’t use the mean and standard deviation of sale for the entire sample. Instead, you have to calculate these statistics for each company separately and then create a scaled variable. Follow these steps:

  1. Group by conm
  2. Summarize sale to get the mean (sale_mean) and the standard deviation (sale_sd)
  3. Assign this dataframe to d2_sum
  4. Join d2 and d2_sum by conm
  5. Create sale_scaled by subtracting sale_mean from sale and dividing this difference by sale_sd

Print the first 10 rows for Tesla with conm, sale, sale_scaled, sale_mean, and sale_sd using as.data.frame()

d2 = d1
d2_sum = d2 %>% 
  
  
filter(conm %in% c("TESLA INC")) %>% 
  group_by(conm) %>% 
  summarize(
            sale_mean = mean(sale),
            sale_sd = sd(sale))
  

d7 = inner_join(d2, d2_sum, by = "conm")


d2_sum<- d7 %>% 
  mutate(d7,
            sale_scaled = (sale - sale_mean) / sale_sd) %>%
  select(conm, sale, sale_scaled, sale_mean, sale_sd) %>% 
filter(conm == "TESLA INC")

as.data.frame(d2_sum, 10)
        conm      sale sale_scaled sale_mean  sale_sd
1  TESLA INC   116.744  -0.7477853  18585.48 24697.92
2  TESLA INC   204.242  -0.7442426  18585.48 24697.92
3  TESLA INC   413.256  -0.7357798  18585.48 24697.92
4  TESLA INC  2013.496  -0.6709872  18585.48 24697.92
5  TESLA INC  3198.356  -0.6230132  18585.48 24697.92
6  TESLA INC  4046.025  -0.5886917  18585.48 24697.92
7  TESLA INC  7000.132  -0.4690821  18585.48 24697.92
8  TESLA INC 11758.751  -0.2764092  18585.48 24697.92
9  TESLA INC 21461.268   0.1164384  18585.48 24697.92
10 TESLA INC 24578.000   0.2426325  18585.48 24697.92
11 TESLA INC 31536.000   0.5243567  18585.48 24697.92
12 TESLA INC 53823.000   1.4267406  18585.48 24697.92
13 TESLA INC 81462.000   2.5458229  18585.48 24697.92