Data Analytics Visualization and Communication

Assignment1

Reading the csv file

library(readr)

d1 <- read_csv("Week03/Week03/tech_financials.csv")
## Rows: 120 Columns: 981
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (35): gvkey, indfmt, consol, popsrc, datafmt, tic, cusip, conm, acctchg...
## dbl (448): datadate, fyear, ajex, ajp, currtr, fyr, ismod, ltcm, pddur, scf,...
## lgl (498): adrr, bspr, curuscn, ogm, stalt, udpl, acco, accrt, acoxar, acqln...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filtering rows with sale>0

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.5     v dplyr   1.0.7
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.2.0     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
d2<- filter(d1,sale>0)

Q1

Print a data frame with the medians of cogs, emp, and xrd.

cogs is Cost of goods sold (USD million) emp is Number of employees (In thousands) xrd is R&D expenses (USD million)

summarize(d2,
          cogs_median = median(cogs),
          emp_median = median(emp, na.rm = TRUE),
          xrd_median = median(xrd))
## # A tibble: 1 x 3
##   cogs_median emp_median xrd_median
##         <dbl>      <dbl>      <dbl>
## 1        9348       34.6      3838.

Q2

Print a data frame with the means of sale, oibdp, and xrd for Apple, Meta, and Tesla. For this, you will need to follow these steps:

  1. Filter only the observations pertaining to Apple, Meta, and Tesla
  2. Group by conm
  3. Summarize sale, oibdp, and xrd to get their means
  4. Output it as a data frame by using as.data.frame() function.

In Step 3, you may optionally use across() inside summarize(). This function will save you writing more code. Read more about it here: https://dplyr.tidyverse.org/reference/across.html

Step1: Using filter function to filter only the observations pertaining to Apple, Meta, and Tesla

d3 <-filter(d2, 
       conm == "APPLE INC"| 
       conm == "META PLATFORMS INC"|
       conm == "TESLA INC")

Step2: Using groupby conm and step3: summarizing sale, oibdp,xrd Step4: the output is converted to dataframe in d5

d4 <- summarize(group_by(d3, conm), 
          sale_mean = mean(sale),
          xrd_mean = mean(xrd),
          oibdp_mean = mean(oibdp))
d5 <- as.data.frame(d4)

d5
##                 conm sale_mean  xrd_mean oibdp_mean
## 1          APPLE INC 210632.67 9910.0000   67245.75
## 2 META PLATFORMS INC  37313.33 7623.0000   18266.75
## 3          TESLA INC  13345.77  929.2205    1430.59

Q3

Round all the numeric variables in the above data frame to 1 decimal place. Output as a data frame using as.data.frame() function.

For rounding, you will have to use mutate, across, and where functions from dplyr package. Check https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/ for more information.

d7 <- d3 %>% 
          select(.data = ., 
                 conm,sale,oibdp,xrd) %>%
          group_by(conm) %>%
          summarize(
            round(across(where(is.numeric), mean, na.rm = TRUE),1),
            across(where(is.character),nlevels),
            
          ) %>% as.data.frame()

d7
##                 conm     sale   oibdp    xrd
## 1          APPLE INC 210632.7 67245.8 9910.0
## 2 META PLATFORMS INC  37313.3 18266.8 7623.0
## 3          TESLA INC  13345.8  1430.6  929.2

Q4

Many advertising values are missing. The missing code in R is NA. We can get the total number of missing values for advertising quite easily by running the following function:

sum(is.na(d2$xad))
## [1] 28

In the finance literature, a common (but incorrect) practice is to assume that the missing advertising is 0. We will use this adjustment to xad and create a new variable adv and save it in a new object d3.

The new variable ‘adv’ is saved in d31 instead of d3. But the results are correct.

d31 <- d2 %>% 
          select(.data = ., 
                 conm,datadate,xad) %>%
          mutate(.data = .,adv = ifelse(is.na(xad),0,xad)
              )
head(d31)
## # A tibble: 6 x 4
##   conm                   datadate   xad   adv
##   <chr>                     <dbl> <dbl> <dbl>
## 1 ADVANCED MICRO DEVICES 20101231    NA     0
## 2 ADVANCED MICRO DEVICES 20111231    NA     0
## 3 ADVANCED MICRO DEVICES 20121231    NA     0
## 4 ADVANCED MICRO DEVICES 20131231    NA     0
## 5 ADVANCED MICRO DEVICES 20141231    NA     0
## 6 ADVANCED MICRO DEVICES 20151231    NA     0

Q5

Using d3, create the following variables and print first 8 rows for Netflix and the new columns along with conm and datadate:

  1. Return on assets (roa) = oibdp / at
  2. Free cash flow (fcf) = oancf / che
  3. Strategic emphasis (strat_emph) = (adv - xrd) / at
d32 <- d2 %>% 
          select(.data = .,conm,datadate,xad,oibdp,oancf,at,che,xrd) %>%
          filter(.data = ., conm == "NETFLIX INC") %>% 
          mutate(.data = .,adv = ifelse(is.na(xad),0,xad)) %>% 
          mutate(.data=.,roa = oibdp/at) %>% 
          mutate(.data =.,fcf=oancf/che) %>% 
          mutate(.data =.,strat_emph =(adv-xrd)/at) %>% 
          select(.data = .,conm,datadate,roa,fcf,strat_emph)

d32
## # A tibble: 12 x 5
##    conm        datadate    roa     fcf strat_emph
##    <chr>          <dbl>  <dbl>   <dbl>      <dbl>
##  1 NETFLIX INC 20101231 0.321   0.789    0.0500  
##  2 NETFLIX INC 20111231 0.140   0.398    0.0131  
##  3 NETFLIX INC 20121231 0.0241  0.0304   0.0121  
##  4 NETFLIX INC 20131231 0.0511  0.0815   0.0109  
##  5 NETFLIX INC 20141231 0.0647  0.0102   0.00861 
##  6 NETFLIX INC 20151231 0.0361 -0.324    0.00622 
##  7 NETFLIX INC 20161231 0.0322 -0.850   -0.000714
##  8 NETFLIX INC 20171231 0.0479 -0.633    0.00202 
##  9 NETFLIX INC 20181231 0.0650 -0.706    0.0226  
## 10 NETFLIX INC 20191231 0.0797 -0.575    0.00983 
## 11 NETFLIX INC 20201231 0.125   0.296   -0.00974 
## 12 NETFLIX INC 20211231 0.144   0.0651  -0.0136

Q6

You want to know how many profitable years each of the sample company experienced. For this follow these steps:

  1. Create an indicator variable (dummy variable) called profit_ind such that when oibdp > 0 this variable is 1. Otherwise it is 0.
  2. Group by company names
  3. Summarize profit_ind by taking its sum. Also, get the total number of observations for each company.
d9 <- d2 %>% 
          select(.data = .,conm,sale,oibdp,xrd,fyear) %>%
          mutate(.data = .,profit_ind = ifelse(oibdp>0,1,0)) %>% 
          group_by(conm) %>%
          summarize(.data=.,profit_years = sum(profit_ind))


d10 <- as.data.frame(d9)


d101 <- d2 %>%
                select(.data = .,conm,sale,oibdp,xrd,fyear) %>%
                mutate(.data = .,year_ind = ifelse(fyear>0,1,0)) %>% 
                group_by(conm) %>%
                summarize(.data=.,total_years = sum(year_ind))


d102 <- merge(d10,d101)

d103 <- as.data.frame(d102)
d103
##                      conm profit_years total_years
## 1  ADVANCED MICRO DEVICES           10          12
## 2            ALPHABET INC           12          12
## 3          AMAZON.COM INC           12          12
## 4               APPLE INC           12          12
## 5      META PLATFORMS INC           12          12
## 6          MICROSOFT CORP           13          13
## 7             NETFLIX INC           12          12
## 8            QUALCOMM INC           12          12
## 9               TESLA INC            8          12
## 10            TWITTER INC            6          11

Q7

Find the average annual stock returns of all the companies. Follow these steps:

  1. Arrange the data set by conm and datadate.
  2. Group by conm
  3. Calculate stock return stk_ret by taking the difference between mkvalt (Market value of equity) and its lag and then divide the difference by the lag of mkvalt
  4. Summarize to get the mean of the stock returns stk_ret_mean.
  5. Display the average stock returns in percentage format.
library(tidyverse)
d11<- d2 %>% 
          select(.data = .,conm,datadate,mkvalt) %>%
          arrange(conm,datadate) %>%
          group_by(conm) %>%
                        mutate(.data=.,pre_mkvalt = lag(mkvalt))%>%
                        mutate(.data=.,stk_ret = ((mkvalt-pre_mkvalt)/pre_mkvalt)) %>% 
                        summarize(.data=.,stk_ret_avg = mean(stk_ret,na.rm = TRUE),stk_ret_avg=scales::percent(stk_ret_avg,accuracy=0.1))
                        
                        
d11
## # A tibble: 10 x 2
##    conm                   stk_ret_avg
##    <chr>                  <chr>      
##  1 ADVANCED MICRO DEVICES 69.1%      
##  2 ALPHABET INC           25.4%      
##  3 AMAZON.COM INC         37.2%      
##  4 APPLE INC              27.0%      
##  5 META PLATFORMS INC     40.0%      
##  6 MICROSOFT CORP         21.8%      
##  7 NETFLIX INC            58.2%      
##  8 QUALCOMM INC           9.1%       
##  9 TESLA INC              131.7%     
## 10 TWITTER INC            6.0%

Q8

In many statistical and machine learning applications, we use scaled variables instead of the original variables. A scaled variable is typically created by subtracting the sample mean of the variable from the variable and dividing it by its standard deviation. There is a scale() function in base R which can directly do it.

You want to create a scaled variable for sale but separately for each company. Therefore, you can’t use the mean and standard deviation of sale for the entire sample. Instead, you have to calculate these statistics for each company separately and then create a scaled variable. Follow these steps:

  1. Group by conm
  2. Summarize sale to get the mean (sale_mean) and the standard deviation (sale_sd)
  3. Assign this data frame to d3_sum
  4. Join d3 and d3_sum by conm
  5. Create sale_scaled by subtracting sale_mean from sale and dividing this difference by sale_sd

Print the first 10 rows for Twitter with conm, sale, sale_scaled, sale_mean, and sale_sd using `as.data.frame()

d3_sum <- d2 %>%
          select(.data = .,conm,datadate,sale) %>%
          filter(.data = ., conm == "TWITTER INC") %>% 
          summarize(.data = .,conm = "TWITTER INC",sale_mean=mean(sale),sale_sd=sd(sale))
          

d33 <- d2 %>% 
          filter(.data = ., conm == "TWITTER INC")%>% 
          select(.data = .,conm,sale) 
         

d34 = inner_join(d33, d3_sum, by = "conm")


d34 %>% mutate(.data = .,sale_scaled=((sale-sale_mean)/sale_sd))
## # A tibble: 11 x 5
##    conm         sale sale_mean sale_sd sale_scaled
##    <chr>       <dbl>     <dbl>   <dbl>       <dbl>
##  1 TWITTER INC  106.     2271.   1546.     -1.40  
##  2 TWITTER INC  317.     2271.   1546.     -1.26  
##  3 TWITTER INC  665.     2271.   1546.     -1.04  
##  4 TWITTER INC 1403.     2271.   1546.     -0.561 
##  5 TWITTER INC 2218.     2271.   1546.     -0.0341
##  6 TWITTER INC 2530.     2271.   1546.      0.167 
##  7 TWITTER INC 2443.     2271.   1546.      0.112 
##  8 TWITTER INC 3042.     2271.   1546.      0.499 
##  9 TWITTER INC 3459.     2271.   1546.      0.769 
## 10 TWITTER INC 3716.     2271.   1546.      0.935 
## 11 TWITTER INC 5077.     2271.   1546.      1.82