# This code is done for you
# If you get an error here -- install these packages.
library(tidyverse)
library(plotly)
library(readr)
library(ggplot2)
library(fixest)
library(leaflet)
library(viridis)

ACCT 420 Individual Assignment 2
Seasonality, visualization, and logistic regression
General instructions
This assignment is intended to help you explore some of the concepts we have discussed in weeks 3, 4, and 5:
- Forecasting using linear regression
- Seasonality in data
- Visualization using ggplot2
- Visualization using other tools
- Logistic regression
This assignment is written as a Quarto document. Quarto is a technology to create replicable documents, embedding all code, results, and analysis in the same file. Margin notes are included in this file using the syntax [^1], containing various tips and hints for the exercises1. To get a feel for how this document type works, click the Render button in the toolbar right above this editor window. If you have selected “Preview in window” from the settings dropdown (just right of Render), a window will pop up with the output of the document that refreshes each time you click Render. Alternatively, you can open the new html file that was created in the same folder as this .qmd file. You can open it in your favorite web browser and refresh that browser tab after rendering each time.
1 This is an example of a margin note. The list of notes are all at the bottom of the R Markdown file.
Note that some of the code in the code blocks has already been filled in for you. As a consequence, code blocks are all set to eval: false to start with, so that the document can be compiled before you finish all exercises. As you get to an exercise, change eval: false to eval: true for its code blocks so that they are evaluated when rendering.
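For reference, the eval option sits at the very top of each code block using Quarto’s `#|` comment syntax; a minimal sketch of that chunk-option line:

#| eval: true
# (the `#|` line above must be the first line of the code block; changing
#  false to true makes the block run when the document is rendered)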
I recommend experimenting with your code for the assignment in the R Console at the bottom of the window. To make this easier, you should set the Console to be working from the same directory as this file. You can easily set this from the menu at the top by going to Session > Set Working Directory > To Source File Location. Do note that the console and the code run while rendering are separate – variables and functions defined in one environment are not accessible from the other without running the code separately.
The main tools you will be gaining familiarity with in this Assignment are:
If you need help with code, you can always reference the class slides or the Datacamp tutorials, or stop by office hours.
The assignment will be graded out of 100 points; point values are specified for each part below.
Goal of the assignment
Throughout the assignment, you will build a model to predict quarterly revenue for Singapore retail companies. To do so, the following steps will be taken:
- Load the data (provided for you on eLearn)
- Process the data into a usable form
- Run an initial model on a training sample
- Explore seasonality in the model using visualization
- Run a model incorporating seasonal factors on a training sample
- Look at accuracy on a testing sample
- Make a plot showcasing where the firms in the analysis were located
- Try a logistic model to examine if GSS’ poor 2016 performance was expected.
Part 1: Load data from eLearn (5 points)
Instructions
All data needed for this homework is provided on eLearn. There are two files you will need:
- Compustat_SG_retail_quarterly.csv
- From Compustat - Capital IQ > Global - Daily > Fundamentals Quarterly
- SG_postal_codes.csv
- Parsed from Singapore Postal Codes, github/xkjyeah
Exercise 1
First, we will need to import some needed packages – tidyverse and plotly.
Load in the data from eLearn. To do this, you can use the read_csv() function. After you have loaded the data, print out a summary of the financial data using the summary() function, including revenue and total assets.2 Also output a list of the companies in the data.3
2 Some possible ways to do this: indexing with [,] or using select() from the dplyr package.
3 Use either unique() or table() to do this – look up these functions either online or in R’s help using ?.
# Set `eval: true` after writing your code
# Place your code for Exercise 1 in this code block.
# Store the financial data in a variable called df.
df <- read_csv("~/Documents/SMU/YEAR 4/FFA/Assignment 2/Compustat_SG_retail_quarterly.csv") Rows: 233 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): indfmt, consol, popsrc, datafmt, datacqtr, datafqtr, conm, costat,...
dbl (12): gvkey, fqtr, fyearq, datadate, atq, ibq, revtq, addzip, ggroup, gi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Fix the postal code measure to be numeric (done for you already)
df <- df %>% mutate(addzip = as.numeric(addzip))
# Store the postal code data in a variable called postal_codes
postal_codes <- read_csv("~/Documents/SMU/YEAR 4/FFA/Assignment 2/SG_postal_codes.csv")

New names:
• `` -> `...1`
Rows: 120929 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): BLK_NO, BUILDING, ROAD_NAME
dbl (4): ...1, LATITUDE, LONGITUDE, POSTAL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Output a summary of the financial data
df %>%
select(revtq, atq) %>%
  summary()

     revtq              atq
Min. : 9.318 Min. : 73.42
1st Qu.: 54.239 1st Qu.: 253.91
Median : 75.919 Median : 364.27
Mean : 85.116 Mean : 588.56
3rd Qu.: 97.262 3rd Qu.: 744.65
Max. :260.812 Max. :1701.92
NA's :5 NA's :17
# Output a list of all companies in the data
company_list <- unique(df$conm)
# Print the list of unique company names
print(company_list)

[1] "ROBINSON & CO LTD"      "METRO HOLDINGS LTD"
[3] "ISETAN (SINGAPORE) LTD" "CK TANG LTD"
[5] "COURTS (SINGAPORE) LTD" "ZHONGMIN BAIHUI RETAIL GROUP"
[7] "PARKSON RETAIL ASIA LTD"
Part 2: Processing data (20 points)
Instructions
Next we will construct some needed measures. For this assignment, we will work using first differences, i.e., \(revenue_t - revenue_{t-1}\). We will need to construct the first difference between the current quarter and the next, as well as the four differences prior. We will also compare revenue by quarter, and thus calculating the lag of revenue will be useful. Thus, we will construct the following six measures:
- revtq_lag = last quarter revenue
- diff_lead = next quarter revenue - current quarter revenue
- diff1 = current quarter revenue - last quarter revenue
- diff2 = last quarter revenue - 2 quarters ago revenue
- diff3 = 2 quarters ago revenue - 3 quarters ago revenue
- diff4 = 3 quarters ago revenue - 4 quarters ago revenue
Also useful will be the actual year and quarter for the data, as most retailers in Singapore do not have calendar-aligned fiscal years. This has been coded for you.
Then, plot out current revenue vs last quarter revenue such that we can easily distinguish colors by quarter. What insights do we see from the graph? How does this compare with what we saw about US retail firms in session 3?
Exercise 2
Construct the 6 measures specified above.4
4 You should use mutate() and group_by() for these measures. To identify specific companies, you can use either gvkey or their names.
# Set `eval: true` after writing your code
# Place your code for Exercise 2 in this code block
# Construct the six measures listed in the instructions, storing them in the
# specified names. Make sure to add these into df.
df <- df %>%
  arrange(gvkey, fyearq, fqtr) %>%  # order each firm's quarters chronologically so lag()/lead() line up
  group_by(gvkey) %>%
  mutate(
    revtq_lag = lag(revtq),
    diff_lead = lead(revtq) - revtq,
    diff1 = revtq - lag(revtq),
    diff2 = lag(revtq) - lag(revtq, 2),
    diff3 = lag(revtq, 2) - lag(revtq, 3),
    diff4 = lag(revtq, 3) - lag(revtq, 4)) %>%
  ungroup()
# Other useful measures: actual year and quarter -- these are done for you
df$year <- as.numeric(substr(df$datacqtr, 1, 4))
df$qtr <- as.numeric(substr(df$datacqtr, 6, 6))
df$date <- ymd(df$datadate)
df$qtr <- quarter(df$date) # From lubridate
# Clean the data: Replace NaN, Inf, and -Inf with NA
df <- df %>%
mutate(across(where(is.numeric), ~replace(., !is.finite(.), NA)))
summary(df)

     gvkey            indfmt             consol             popsrc
Min. :100081 Length:233 Length:233 Length:233
1st Qu.:100870 Class :character Class :character Class :character
Median :103725 Mode :character Mode :character Mode :character
Mean :165784
3rd Qu.:296576
Max. :306491
datafmt datacqtr datafqtr fqtr
Length:233 Length:233 Length:233 Min. :1.000
Class :character Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Mode :character Median :2.000
Mean :2.468
3rd Qu.:3.000
Max. :4.000
fyearq datadate atq ibq
Min. :2003 Min. :20030630 Min. : 73.42 Min. :-59.7660
1st Qu.:2006 1st Qu.:20068056 1st Qu.: 253.91 1st Qu.: 0.7408
Median :2011 Median :20110331 Median : 364.27 Median : 4.0105
Mean :2010 Mean :20104507 Mean : 588.56 Mean : 9.3301
3rd Qu.:2014 3rd Qu.:20140930 3rd Qu.: 744.65 3rd Qu.: 11.6830
Max. :2018 Max. :20180331 Max. :1701.92 Max. :204.6640
NA's :5 NA's :17 NA's :5
revtq conm costat fic
Min. : 9.318 Length:233 Length:233 Length:233
1st Qu.: 54.239 Class :character Class :character Class :character
Median : 75.919 Mode :character Mode :character Mode :character
Mean : 85.116
3rd Qu.: 97.262
Max. :260.812
NA's :5
addzip ggroup gind gsector
Min. : 68898 Min. :2550 Min. :255030 Min. :25
1st Qu.:169641 1st Qu.:2550 1st Qu.:255030 1st Qu.:25
Median :179103 Median :2550 Median :255030 Median :25
Mean :241635 Mean :2550 Mean :255030 Mean :25
3rd Qu.:238873 3rd Qu.:2550 3rd Qu.:255030 3rd Qu.:25
Max. :528766 Max. :2550 Max. :255030 Max. :25
gsubind revtq_lag diff_lead diff1
Min. :25503010 Min. : 9.318 Min. :-54.773 Min. :-54.773
1st Qu.:25503010 1st Qu.: 54.285 1st Qu.: -5.557 1st Qu.: -5.557
Median :25503010 Median : 75.722 Median : 0.000 Median : 0.000
Mean :25503011 Mean : 84.726 Mean : 1.355 Mean : 1.355
3rd Qu.:25503010 3rd Qu.: 96.261 3rd Qu.: 9.895 3rd Qu.: 9.895
Max. :25503020 Max. :260.812 Max. :146.420 Max. :146.420
NA's :12 NA's :17 NA's :17
diff2 diff3 diff4 year
Min. :-54.7730 Min. :-54.773 Min. :-54.773 Min. :2003
1st Qu.: -5.3388 1st Qu.: -5.481 1st Qu.: -5.467 1st Qu.:2006
Median : 0.0345 Median : 0.000 Median : 0.000 Median :2010
Mean : 1.7809 Mean : 1.305 Mean : 1.598 Mean :2010
3rd Qu.: 10.2323 3rd Qu.: 9.504 3rd Qu.: 9.726 3rd Qu.:2014
Max. :146.4200 Max. :146.420 Max. :146.420 Max. :2018
NA's :23 NA's :29 NA's :36
qtr date
Min. :1.000 Min. :2003-06-30
1st Qu.:1.750 1st Qu.:2007-03-08
Median :2.500 Median :2011-03-31
Mean :2.496 Mean :2010-12-29
3rd Qu.:3.000 3rd Qu.:2014-09-30
Max. :4.000 Max. :2018-03-31
NA's :5 NA's :5
Next, make a plot of lagged revenue (on the x-axis) vs revenue (on the y-axis).5 Color the points based on the fiscal quarter they are coming from.6
5 The library ggplot2 was already loaded when we loaded tidyverse earlier. Use the function ggplot() along with the function geom_point() to make your scatterplot. Here is a reference for scatterplots.
6 Fiscal quarter is given by fqtr in the data. You can set the color by passing factor() of a variable to the color= aesthetic in ggplot2.
# Set `eval: true` after writing your code
# Place your code for Exercise 2 in this code block
# Make a scatter plot using ggplot and store in plot.
plot <- df %>%
  ggplot(aes(y = revtq, x = revtq_lag, color = as.factor(fqtr))) +
  geom_point() +
  labs(
    x = "Lagged Revenue",
    y = "Revenue",
    colour = "Fiscal Quarter",
    title = "Current Revenue Against Last Quarter Revenue") +
  scale_color_discrete(na.translate = FALSE)
# Make the plot interactive
ggplotly(plot)

Next, make the same plot as before, but color the points based on the actual quarter they are coming from.7
7 Actual quarter is given by qtr.
# Set `eval: true` after writing your code
# Place your code for Exercise 2 in this code block
# Make a scatter plot using ggplot and store in plot.
plot <- df %>%
  ggplot(aes(y = revtq, x = revtq_lag, color = as.factor(qtr))) +
  geom_point() +
  labs(
    x = "Lagged Revenue",
    y = "Revenue",
    colour = "Actual Quarter",
    title = "Current Revenue Against Last Quarter Revenue") +
  scale_color_discrete(na.translate = FALSE)
# Make the plot interactive
ggplotly(plot)

What insights do we see from the graphs? How does this compare with what we saw about US retail firms in session 3?
Explanation: Insights observed
1. Higher lagged revenue is associated with higher revenue in the current quarter (in both graphs).
2. Fiscal Q2 appears to correspond to actual (calendar) Q4: Q2 in graph 1 and Q4 in graph 2 show higher revenue than the other quarters.
3. The points in graph 2 fit a straight line more closely, suggesting a stronger linear relationship between current and last quarter revenue when grouped by actual quarter.
-> In Session 3 we discussed why US retail revenue did not line up with the events in the corresponding fiscal quarters (e.g., Christmas and Black Friday fall in calendar Q4, yet fiscal Q1 showed higher revenue than fiscal Q4). From the graphs above, graph 2 shows a more linear and predictable pattern in Singapore retail revenue when grouped by calendar (actual) quarters. This may be because annual events and public holidays follow the calendar year closely.
Part 3: Initial model (10 points)
Instructions
Next, let’s split our data into training and testing data. We will save 2016 onward for testing later.
Now, we’ll build an initial model for predicting diff_lead. Build a linear model that regresses diff_lead (y) on diff1 (x), and save that model to an object called model1. Then, output information about the model using summary(). Finally, explain the following: 1) how well does the model work, and 2) how do we interpret the coefficient on diff1?
Exercise 3
# Set `eval: true` after writing your code
# Place your code for Exercise 3 in this code block
# Split data into training and test -- before actual year 2016 data will be
# for training, the rest for testing. Call the training data frame training,
# and call the testing data frame testing
training <- df %>% filter(year < 2016)
testing <- df %>% filter(year >= 2016)
# Run the linear model and store it in model1
model1 <- lm(diff_lead ~ diff1, data = training)
# Print out a summary of model1
summary(model1)
Call:
lm(formula = diff_lead ~ diff1, data = training)
Residuals:
Min 1Q Median 3Q Max
-38.548 -8.128 -2.432 8.040 147.235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.23296 1.29834 1.720 0.08727 .
diff1 -0.19981 0.07261 -2.752 0.00656 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.99 on 171 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.04241, Adjusted R-squared: 0.03681
F-statistic: 7.573 on 1 and 171 DF, p-value: 0.006565
Explanation: This regression (Model 1) regresses the change in revenue from the current quarter to the next quarter (diff_lead) on the change in revenue from the previous quarter to the current quarter (diff1).
Context: Suppose we are in Q3; the model uses the change from Q2 to Q3 to predict the change from Q3 to Q4. Regression interpretation: for every $1 increase in the revenue change from Q2 to Q3, the predicted revenue change from Q3 to Q4 decreases by about $0.20. In summary, Model 1 suggests a statistically significant relationship between the change in revenue from the previous quarter to the current quarter (diff1) and the change in revenue from the current quarter to the next quarter (diff_lead), given its low p-value of 0.0066. However, the intercept is not significant, and the overall explanatory power of the model is low, as indicated by the adjusted R-squared of 0.037. This means that other factors not included in the model likely also influence diff_lead.
Possible reason for this result: there is seasonality in our data. Retail revenue is often seasonal, with certain quarters naturally stronger or weaker due to holidays, events, or economic factors. In such cases, an increase in diff1 need not lead to an increase in diff_lead if the next quarter is a weak one.
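As a concrete illustration of the slope (a hedged sketch; the diff1 values below are made up for illustration), predict() can compare the model’s fitted diff_lead at two hypothetical values of diff1:

# Illustration only: predicted diff_lead at two hypothetical diff1 values.
# With a slope of roughly -0.20, the prediction at diff1 = 10 should be about
# $2 lower than the prediction at diff1 = 0.
predict(model1, newdata = data.frame(diff1 = c(0, 10)))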
Part 4: Visualizing the data (10 points)
Instructions
Based on the initial model as well as the visualization from exercise 2, we may want to explore seasonality more in the data. For this section, you should propose a visualization that could help you to better understand the data and any potential seasonality in it.8
8 You can use any tool in R that you would like to. We have mostly used ggplot2 for this in class, along with passing the ggplot2 plots to ggplotly()
Then, explain the following:
- Why did you make the selected visualization?
- What do you learn from the visualization?
Exercise 4
# Set `eval: true` after writing your code
# Place your code for Exercise 4 in this code block
# Create a line plot for quarterly revenue trends (CHOSEN VISUALISATION)
seasonality_plot <- ggplot(data = df, aes(x = year, y = revtq, group = qtr, color =as.factor(qtr))) +
geom_line() +
labs(x = "Year", y = "Revenue") +
ggtitle("Quarterly Revenue Trends Over Time") +
theme_minimal()
# Display the plot
print(seasonality_plot)

Warning: Removed 5 rows containing missing values (`geom_line()`).
# Scatter plot with lag for 1st to 99th percentile of data (ANOTHER VISUALISATION TO CONSIDER FOR FIRST DIFFERENCE DATA)
plt_sct <- function(df, var1, var2) {
  # Scatter plot of var1 (y) against var2 (x), colored by quarter, restricted
  # to observations between the 1st and 99th percentiles of both variables
  df %>%
    filter(.data[[var1]] < quantile(.data[[var1]], 0.99, na.rm = TRUE),
           .data[[var2]] < quantile(.data[[var2]], 0.99, na.rm = TRUE),
           .data[[var1]] > quantile(.data[[var1]], 0.01, na.rm = TRUE),
           .data[[var2]] > quantile(.data[[var2]], 0.01, na.rm = TRUE)) %>%
    ggplot(aes(y = .data[[var1]], x = .data[[var2]], color = factor(qtr))) +
    geom_point() + geom_smooth(method = "lm") + ylab(var1) + xlab(var2)
}
plt_sct(training, "diff_lead", "diff1")`geom_smooth()` using formula = 'y ~ x'
# # Create a scatter plot to explore seasonality (ATTEMPT #1)
# Model1_plot <- ggplot(data=training, aes(y=diff_lead, x=predict(model1, training), color=factor(qtr))) +
# geom_point() +
# geom_abline(slope=1) +
# xlab("Fiscal Quarter") +
# ylab("Predicted Revenue")
#
# #Make the plot interactive
# ggplotly(Model1_plot)
#
# ATTEMPT #2
# #season_plot <- ggplot(data = df, aes(x = qtr, y = revtq, colour=factor(year))) +
# # geom_col(fill = "blue") + # Use geom_col() for continuous data
# # xlab("Fiscal Quarter") +
# # ylab("Predicted Revenue")
# #ggplotly(season_plot)
#
# ATTEMPT #3
# season_line_plot <- ggplot(data = df, aes(x = qtr, y = revtq, color = factor(year))) +
# geom_line() + # Use geom_line() for line plot
# xlab("Fiscal Quarter") +
# ylab("Predicted Revenue") +
# ggtitle("Predicted Revenue Trends Over Fiscal Quarters by Year") +
# theme_minimal()
# ggplotly(season_line_plot)

Explanation:
- Why did you make the selected visualization? From the previous scatterplots, we noticed that each quarter appears to have its own trendline. We therefore wanted to incorporate seasonality in our model, and built this time-series visualization to show the seasonal effects clearly.
-> Quarterly variations: Variation in revenue within each quarter lets us see the quarterly performance of the companies and identify which quarters tend to have higher or lower revenue based on the peaks and valleys in the lines.
-> Outliers: Unusual spikes or dips in revenue are potential outliers that require further investigation. These could be due to specific events or factors that significantly impacted revenue in those quarters (e.g., pandemics, extreme weather, socio-economic conditions).
- What do you learn from the visualization? We see a recurring pattern in which Q4 generally has higher revenue than the other quarters, consistent with our assumption that the Q4 holiday season generates higher sales. One anomaly is 2007, when Q1 outperformed the other quarters; this could reflect macroeconomic conditions in the preceding quarters.
Part 5: Incorporating seasonality (20 points)
Instructions
Based on your graphic in part 4, propose the changes that you would make to the initial model from exercise 3, using variables from the original data set or from the variables created in Exercise 2. Write out your reasoning for the model, and include a null and alternative hypothesis.
Once you have decided on your model, run an OLS regression to test the model.
Interpret the result of the model. And how well does the model perform?
Exercise 5
Model reasoning: The hypotheses below can be tested with standard inference on the coefficients of the linear regression model (e.g., t-tests on individual coefficients or an F-test on several coefficients jointly).
\(H_0\):The coefficients of diff2, diff3, and diff4 in the model are all equal to zero, suggesting that a long-run autoregressive model (including lagged quarters) does not help predict next quarter revenue (diff_lead).
\(H_1\): At least one of the coefficients of diff2, diff3, or diff4 in the model is not equal to zero, suggesting that a long-run autoregressive model (including lagged quarters) helps predict next quarter revenue (diff_lead).
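One way to carry out this joint test (a sketch, not part of the required exercise; the object names training_complete, restricted, and full are placeholders) is a nested-model F-test with anova(), refitting the restricted model on the same rows the full model uses:

# Hedged sketch: joint F-test of H0 that the coefficients on diff2, diff3,
# and diff4 are all zero. The restricted model is refit on the rows where all
# lags are available so that the two models use the same observations.
training_complete <- training %>%
  filter(!is.na(diff_lead), !is.na(diff1), !is.na(diff2), !is.na(diff3), !is.na(diff4))
restricted <- lm(diff_lead ~ diff1, data = training_complete)
full       <- lm(diff_lead ~ diff1 + diff2 + diff3 + diff4, data = training_complete)
anova(restricted, full)  # a small p-value rejects H0 in favour of H1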
# Set `eval: true` after writing your code
# Place your code for Exercise 5 in this code block
# Run your proposed model, and store it in model2 (1 year of lag regression) (CHOSEN MODEL!!)
model2 <- lm(diff_lead ~ diff1 + diff2 + diff3 + diff4, data = training)
#POSSIBLE MODELS CONSIDERED (model3)
#Quarter and Year lag regression
model3 <- lm(diff_lead ~ diff1 + diff4, data = training)
# Print out the model
summary(model2)
Call:
lm(formula = diff_lead ~ diff1 + diff2 + diff3 + diff4, data = training)
Residuals:
Min 1Q Median 3Q Max
-31.877 -6.043 -0.771 3.860 147.896
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.096150 1.293016 0.848 0.398
diff1 -0.005215 0.076723 -0.068 0.946
diff2 0.230185 0.114333 2.013 0.046 *
diff3 0.024257 0.121087 0.200 0.842
diff4 0.687782 0.112465 6.116 8.95e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.35 on 141 degrees of freedom
(51 observations deleted due to missingness)
Multiple R-squared: 0.2816, Adjusted R-squared: 0.2612
F-statistic: 13.82 on 4 and 141 DF, p-value: 1.56e-09
summary(model3)
Call:
lm(formula = diff_lead ~ diff1 + diff4, data = training)
Residuals:
Min 1Q Median 3Q Max
-29.845 -5.329 -0.988 2.774 153.248
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.18999 1.25937 0.945 0.346
diff1 -0.04451 0.07098 -0.627 0.532
diff4 0.66127 0.09950 6.646 5.5e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.32 on 147 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2559, Adjusted R-squared: 0.2457
F-statistic: 25.27 on 2 and 147 DF, p-value: 3.687e-10
Model 2 is preferred: it reflects how the change in revenue over the same pair of quarters one year earlier (diff4, exactly four quarters back) helps predict the next quarter’s revenue change, and it has lower p-values and a higher adjusted R-squared than the other models considered.
Interpretation: Holding the other differences constant, a $1 increase in diff1 decreases predicted diff_lead by about $0.005; a $1 increase in diff2 increases it by about $0.23; a $1 increase in diff3 increases it by about $0.02; and a $1 increase in diff4 increases it by about $0.69.
Model performance: The adjusted R-squared is 0.261, meaning that about 26.1% of the variation in diff_lead is explained by the model (diff1 through diff4). This is still fairly low, indicating limited predictive power, although it is much better than Model 1.
The p-value of diff4 is approximately zero, so diff4 is statistically significant in predicting diff_lead. The p-value of diff2 is 0.046 (< 0.05), so diff2 is also statistically significant. The p-values of diff1 and diff3 are high, so they are not statistically significant predictors of diff_lead.
Part 6: Out of sample accuracy (10 points)
Instructions
Next, we will check how well the initial model and your model perform out of sample. To check this, we will use RMSE. The function for RMSE is given below:
# Takes two vectors and returns the RMSE between the two
rmse <- function(v1, v2) {
sqrt(mean((v1 - v2)^2, na.rm=T))
}

Use predict() and the rmse() function to generate the RMSE for each model using the testing data frame. How would you interpret the performance of the models in relation to one another?
Exercise 6
# Set `eval: true` after writing your code
# Place your code for Exercise 6 in this code block
# Calculate model RMSEs -- fill in the predict function
rmse(testing$diff_lead, predict(model1, testing))

[1] 17.16705

rmse(testing$diff_lead, predict(model2, testing))

[1] 14.60683
Interpretation: Comparing the RMSE values for Model 1 and Model 2: Model 2 has a lower RMSE (14.6) compared to Model 1 (17.2). This suggests that Model 2 has a better predictive performance on the testing data compared to Model 1. In other words, Model 2’s predictions are, on average, closer to the actual values than Model 1’s predictions.
Part 7: Firm locations (10 points)
Instructions
For this part, we will plot out where these seven retail firms are or were headquartered in Singapore, and include some basic financial information about them as well. To do this, you will need to merge in the postal_codes data into df. You will either need to specify the key to match on,9 or change variable names to match each other, as postal code is stored as addzip in the Compustat data and as POSTAL in the postal code data.
9 You can specify matching on different names across the data by specifying the following option: by = c("left_var" = "right_var")
Then, print out an interactive map with the following:10
10 Specifying na.rm=TRUE may be helpful if using min() and max(), so as to tell them to ignore missing data. Also, note that to take the first observation within a group, you can use group_by() to group, and then pipe into slice(1) to extract just the first row of a group, then pipe into ungroup().
- 1 marker (of any appropriate type) where each of the businesses are or were headquartered
- Each marker should, either on hover, click, or a mix of the two, display:
- The name of the company
- The years the company is in the data
- The minimum and maximum quarterly revenue the company had during that period
You can use either leaflet or plotly to make the interactive map.11 You can follow the syntax from Session 3 exercises 3 or 4.12
11 You will get a much more detailed map of Singapore using leaflet. You could also use ggplot2 paired with sf, maps, maptools, and rgeos, passing the result to ggplotly() from plotly. This requires a lot more setup, however, and may require installing other software outside of R if you are not running on Windows.
12 You may have noticed that both plotly and leaflet use ~ in front of variables. This tells it to use the data specified to data= for any variables following the ~. Thus, you can pass expressions like ~lat in place of mapdata$lat. You can also use functions, like ~paste(conm, "<br>Latitude:", lat, "<br>Longitude:", lon).
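Related to the tip in note 10, here is a minimal sketch of the group_by() then slice(1) pattern for keeping one headquarters location per firm (it assumes the merged data frame is called df_map, as in the exercise code below; the object name hq_locations is a placeholder):

# Keep a single row per company: group by company name, sort by date, and
# take the first row of each group, keeping only the location columns.
hq_locations <- df_map %>%
  group_by(conm) %>%
  arrange(datadate) %>%
  slice(1) %>%
  ungroup() %>%
  select(conm, LATITUDE, LONGITUDE)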
Exercise 7
# Set `eval: true` after writing your code
# Place your code for Exercise 7 in this code block
# Make sure that POSTAL is numeric -- this is done for you
postal_codes$POSTAL <- as.numeric(postal_codes$POSTAL)
df <- df %>%
rename(POSTAL = addzip)
# Merge the data
df_map <- inner_join(df, postal_codes, by = "POSTAL")
# If using leaflet, load leaflet. Plotly is already loaded above.
# Process the data for your map -- fill in each function
# Process the data for your map
mapdata <- df_map %>%
group_by(conm) %>%
summarize(
lat = mean(LATITUDE, na.rm = TRUE), # Calculate mean latitude
lon = mean(LONGITUDE, na.rm = TRUE), # Calculate mean longitude
years = paste(min(year), "-", max(year)), # Years in the data
min_revenue = min(revtq, na.rm = TRUE), # Min quarterly revenue
max_revenue = max(revtq, na.rm = TRUE), # Max quarterly revenue
average_revenue = mean(revtq, na.rm = TRUE)) %>%
ungroup()
# Make your map -- consult Session 3 exercises 3 or 4 for details on how to do this.
# Load the leaflet library if not already loaded
# Create the interactive map
my_map <- leaflet(data = mapdata) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 5,  # Adjust the marker radius as needed
    color = ~colorNumeric(
      palette = "Blues",  # use a sequential palette (rather than a single color) for the revenue scale
      domain = mapdata$average_revenue,
      reverse = TRUE
    )(average_revenue),
    fillOpacity = 0.7,  # Opacity of marker fill
    popup = ~paste(
      "<strong>Company:</strong>", conm, "<br>",
      "<strong>Years:</strong>", years, "<br>",
      "<strong>Min Revenue:</strong> $", round(min_revenue, 2), " million<br>",
      "<strong>Max Revenue:</strong> $", round(max_revenue, 2), " million<br>",
      "<strong>Avg Revenue:</strong> $", round(average_revenue, 2), " million"
    )
  ) %>%
  setView(lng = mean(mapdata$lon), lat = mean(mapdata$lat), zoom = 10)
my_map

Explanation:
-> We used the leaflet library to plot the map, centred the view on Singapore, and added a marker for each company so that it is easy to see where these businesses are, or were, headquartered.
As such, we note three insights:
1. Five of the seven retail companies (Isetan, Robinson, Metro, Parkson and Tangs) are located in Singapore's central region, while two (Courts and Zhongmin) are in the east.
2. ZHONGMIN BAIHUI RETAIL GROUP has the highest revenue, with a minimum quarterly revenue of $9.32 million and a maximum of $260.81 million (average: $157.95 million).
3. METRO HOLDINGS LTD has the lowest revenue, with a minimum quarterly revenue of $28.32 million and a maximum of $73.9 million (average: $47.67 million).
Part 8: Running a logistic model (15 points)
Instructions
The Great Singapore Sale in 2016 was generally considered to have performed quite poorly13. For this exercise, we will make the assumption that a successful GSS will lead to higher sales in Q2 than sales from the previous Q2.14
13 See this Straits Times coverage.
14 This doesn’t cover the whole of GSS, but does cover the first month.
Let’s run a simple model to see if:
- A model of revenue growth can predict GSS success, and
- If we would have predicted a poor GSS season for the retailers in our data in 2016.
After that, you should interpret the model – what does it say about the relationship between GSS success and past revenue growth?
Lastly, I have provided code to produce the odds ratios and the probability of success for each company with 2016 Q2 data. I have also provided a baseline for each, which is the average result across all years in the training data. Interpret either the odds ratios or the probabilities. Based on your model, does it seem like GSS 2016 was expected to be bad from the beginning?
Exercise 8
# Set `eval: true` after writing your code
# Place your code for Exercise 8 in this code block
# Create the GSS_Success measure and growth measures by company
# These are done for you
df <- df %>%
group_by(gvkey) %>%
mutate(GSS_success=ifelse(revtq > lag(revtq, 4), 1, 0)) %>%
  mutate(growth1 = (lag(revtq) - lag(revtq, 2)) / lag(revtq, 2),   # revenue growth one quarter back
         growth2 = (lag(revtq, 2) - lag(revtq, 3)) / lag(revtq, 3), # revenue growth two quarters back
         growth3 = (lag(revtq, 3) - lag(revtq, 4)) / lag(revtq, 4), # revenue growth three quarters back
         growth4 = (lag(revtq, 4) - lag(revtq, 5)) / lag(revtq, 5)) %>% # revenue growth four quarters back
ungroup()
df[!is.na(df$qtr) & df$qtr != 2,]$GSS_success = NA
# Split into training and testing data -- these are done for you
training = df %>% filter(year < 2016)
testing = df %>% filter(year == 2016, qtr == 2)
# Run a logistic model using glm() of GSS_success on growth1 through growth4
# using training data. Make sure to set the family option
fit <- glm(GSS_success ~ growth1 + growth2 + growth3 + growth4, data = training, family = binomial)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Output a summary of the model
summary(fit)
Call:
glm(formula = GSS_success ~ growth1 + growth2 + growth3 + growth4,
family = binomial, data = training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.058 1.610 -1.900 0.0575 .
growth1 9.301 6.497 1.432 0.1523
growth2 22.596 9.778 2.311 0.0208 *
growth3 20.166 9.666 2.086 0.0369 *
growth4 5.112 4.824 1.060 0.2892
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 45.234 on 33 degrees of freedom
Residual deviance: 33.165 on 29 degrees of freedom
(163 observations deleted due to missingness)
AIC: 43.165
Number of Fisher Scoring iterations: 7
Explanation of the model: What does the model say about the relationship between GSS success and past revenue growth?
-> The coefficients suggest a positive relationship between past revenue growth and GSS success.
- The p-values of growth2 and growth3 are statistically significant at the 0.05 level, suggesting that revenue growth from two and three quarters ago are important predictors of GSS success.
- Since growth2 and growth3 have positive coefficients (22.596 and 20.166), higher revenue growth two and three quarters ago is associated with a greater likelihood of GSS success.
- growth1 and growth4 also have positive coefficients, but their larger p-values mean they are not statistically significant predictors of GSS success. (A sketch of how to read these coefficients on the odds scale follows below.)
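To read these coefficients on the odds scale, they can be exponentiated (a short hedged sketch using only the fitted model above):

# exp(coefficient) gives the multiplicative change in the odds of GSS success
# for a one-unit (i.e., 100-percentage-point) increase in that growth variable.
exp(coef(fit))
# For a more realistic 10-percentage-point increase in growth2:
exp(0.10 * coef(fit)["growth2"])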
# Set `eval: true` after writing your code
# This code is all already done for you.
# It creates the odds and probabilities of success by firm,
# as well as a baseline from the training data
logodds <- predict(fit, testing)
# Convert to odds
odds = exp(logodds)
names(odds) <- testing$conm
# Convert to probability
probability <- odds / (1 + odds)
names(probability) <- testing$conm
# Baseline probability:
baseline <- sum(training$GSS_success, na.rm=T) /
sum(!is.na(training$GSS_success))
names(baseline) <- "Baseline"
baseline_odds = baseline / (1-baseline)
print("Odds")[1] "Odds"
c(baseline_odds, odds) Baseline METRO HOLDINGS LTD
1.625000000 0.002635516
ISETAN (SINGAPORE) LTD ZHONGMIN BAIHUI RETAIL GROUP
0.165977553 0.152708641
PARKSON RETAIL ASIA LTD
0.554071796
print("Probabilities")[1] "Probabilities"
c(baseline, probability) Baseline METRO HOLDINGS LTD
0.619047619 0.002628588
ISETAN (SINGAPORE) LTD ZHONGMIN BAIHUI RETAIL GROUP
0.142350556 0.132478092
PARKSON RETAIL ASIA LTD
0.356529086
Explanation of 2016 expectation: Based on our model, does it seem like GSS 2016 was expected to be bad from the beginning?
ANSWER: Although the model finds that past revenue growth is positively related to GSS success in general, the predicted probabilities for 2016 Q2 are all well below the baseline, so the model would have expected a poor GSS 2016 from the beginning.
- The baseline probability is 0.619, meaning that across the years before 2016 the probability of GSS success was 61.9%.
- METRO HOLDINGS LTD has a very low predicted probability of success (0.0026), far below the baseline.
- ISETAN (SINGAPORE) LTD and ZHONGMIN BAIHUI RETAIL GROUP have relatively low predicted probabilities (0.1424 and 0.1325 respectively), also well below the baseline.
- PARKSON RETAIL ASIA LTD has the highest predicted probability among the companies (0.3565), but it is still considerably below the baseline.