── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This homework uses tech_stocks_csv.zip which we used in the week 3 class.
Read tech_stocks_csv.zip into your R session using read_csv() function from readr package. Store the resulting object in d1.
d1 =read_csv("tech_stocks_csv.zip")
Rows: 93 Columns: 981
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (34): gvkey, indfmt, consol, popsrc, datafmt, tic, cusip, conm, acctch...
dbl (436): fyear, ajex, ajp, currtr, fyr, ismod, ltcm, pddur, scf, src, upd...
lgl (506): adrr, bspr, curuscn, ogm, stalt, udpl, acco, accrt, acoxar, acql...
date (5): datadate, apdedate, fdate, pdate, ipodate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This homework consists of 8 questions. Q1 carries 1 point. Q2 through Q8 carry two points each. We use d2 as the initial input. Your objective is to reproduce the output shown in the PDF file for Q1 through Q8.
Q1
Print a data frame with the medians of at, emp, and xrd.
at is Total assets (USD million) emp is Number of employees (In thousands) xrd is R&D expenses (USD million)
Print a data frame with the means of sale, oibdp, and xrd for Apple, Meta, and Tesla. For this, you will need to follow these steps:
Filter only the observations pertaining to Apple, Meta, and Tesla
Group by conm
Summarize sale, oibdp, and xrd to get their means
Output it as a data frame by using as.data.frame() function.
In Step 3, you may optionally use across() inside summarize(). This function will save you writing more code. Read more about it here: https://dplyr.tidyverse.org/reference/across.html
conm sale_mean oibdp_mean xrd_mean
1 APPLE INC 224763.08 71468.923 11167.000
2 META PLATFORMS INC 43413.00 20110.923 9754.923
3 TESLA INC 18585.48 2672.776 1094.280
Q3
Round all the numeric variables in the above data frame to 1 decimal place. Output as a data frame using as.data.frame() function.
For rounding, you will have to use mutate, across, and where functions from dplyr package. Check https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/ for more information.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(where(is.numeric), round, 1)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
# A tibble: 3 × 4
conm sale_mean oibdp_mean xrd_mean
<chr> <dbl> <dbl> <dbl>
1 APPLE INC 224763. 71469. 11167
2 META PLATFORMS INC 43413 20111. 9755.
3 TESLA INC 18586. 2673. 1094.
as.data.frame(df)
conm sale_mean oibdp_mean xrd_mean
1 APPLE INC 224763.08 71468.923 11167.000
2 META PLATFORMS INC 43413.00 20110.923 9754.923
3 TESLA INC 18585.48 2672.776 1094.280
Q4
Many advertising values are missing. The missing code in R is NA. We can get the total number of missing values for advertising quite easily by running the following function:
sum(is.na(d1$xad))
[1] 13
In the finance literature, a common (but incorrect) practice is to assume that the missing advertising is 0. We will use this adjustment to xad and create a new variable adv and save it in a new object d2.
The first six values of d2 when xad is NA are as follows:
# A tibble: 6 × 4
conm datadate xad adv
<chr> <date> <dbl> <dbl>
1 APPLE INC 2016-09-30 NA 0
2 APPLE INC 2017-09-30 NA 0
3 APPLE INC 2018-09-30 NA 0
4 APPLE INC 2019-09-30 NA 0
5 APPLE INC 2020-09-30 NA 0
6 APPLE INC 2021-09-30 NA 0
Q5
Using d2, create the following variables and print first 8 rows for NVidia and the new columns along with conm and datadate:
Return on assets (roa) = oibdp / at
Free cash flow (fcf) = oancf / che
Strategic emphasis (strat_emph) = (adv - xrd) / at
conm profit_years total_years
1 ALPHABET INC 13 13
2 AMAZON.COM INC 13 13
3 APPLE INC 13 13
4 META PLATFORMS INC 13 13
5 MICROSOFT CORP 14 14
6 NVIDIA CORP 14 14
7 TESLA INC 9 13
Q7
Find the average annual stock returns of all the companies. Follow these steps:
Arrange the data set by conm and datadate.
Group by conm
Calculate stock return stk_ret by taking the difference between mkvalt (Market value of equity) and its lag and then divide the difference by the lag of mkvalt
Calculate stock return stk_ret2 by taking the difference between prcc_f (Stock price at the end of the fiscal year) and its lag and then divide the difference by the lag of prcc_f
Summarize to get the mean of the stock returns stk_ret_mean and stk_ret2_mean.
Display the average stock returns in percentage format.
conm stk_ret_mean stk_ret2_mean
1 ALPHABET INC 19.87% 11.00%
2 AMAZON.COM INC 29.98% 24.60%
3 APPLE INC 24.31% 5.62%
4 META PLATFORMS INC 29.45% 26.84%
5 MICROSOFT CORP 22.61% 24.01%
6 NVIDIA CORP 57.76% 43.89%
7 TESLA INC 115.37% 43.60%
Not graded: The average stock returns calculated using these two measures are very different. Which of these is correct?
Ans: The second method seems correct. However, it depends on the research question.
Q8
In many statistical and machine learning applications, we use scaled variables instead of the original variables. A scaled variable is typically created by subtracting the sample mean of the variable from the variable and dividing it by its standard deviation. There is a scale() function in base R which can directly do it.
You want to create a scaled variable for sale but separately for each company. Therefore, you can’t use the mean and standard deviation of sale for the entire sample. Instead, you have to calculate these statistics for each company separately and then create a scaled variable. Follow these steps:
Group by conm
Summarize sale to get the mean (sale_mean) and the standard deviation (sale_sd)
Assign this dataframe to d2_sum
Join d2 and d2_sum by conm
Create sale_scaled by subtracting sale_mean from sale and dividing this difference by sale_sd
Print the first 10 rows for Tesla with conm, sale, sale_scaled, sale_mean, and sale_sd using as.data.frame()