# This code is done for you
# If you get an error here -- install these packages.
library(tidyverse)
library(plotly)
library(readr)
library(ggplot2)
library(fixest)
library(leaflet)
library(viridis)

ACCT 420 Individual Assignment 2
Seasonality, visualization, and logistic regression
General instructions
This assignment is intended to help you explore some of the concepts we have discussed in weeks 3, 4, and 5:
- Forecasting using linear regression
- Seasonality in data
- Visualization using ggplot2
- Visualization using other tools
- Logistic regression
This assignment is written as a Quarto document. Quarto is a technology to create replicable documents, embedding all code, results, and analysis in the same file. Margin notes are included in this file using the syntax [^1], containing various tips and hints for the exercises1. To get a feel for how this document type works, click the Render button in the toolbar right above this editor window. If you have selected “Preview in window” from the settings dropdown (just right of Render), a window will pop up with the output of the document that refreshes each time you click Render. Alternatively, you can open the new html file that was created in the same folder as this .qmd file. You can open it in your favorite web browser and refresh that browser tab after rendering each time.
1 This is an example of a margin note. The list of notes are all at the bottom of the R Markdown file.
Note that some of the code in the code blocks has already been filled in for you. As a consequence, code blocks are all set to eval: false to start with, so that the document can be compiled before you finish all exercises. As you get to an exercise, change eval: false to eval: true for its code blocks so that they are evaluated when rendering.
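For reference, the eval option sits at the very top of each code block using Quarto’s `#|` comment syntax; a minimal sketch of that chunk-option line:

#| eval: true
# (the `#|` line above must be the first line of the code block; changing
#  false to true makes the block run when the document is rendered)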
I recommend experimenting with your code for the assignment in the R Console at the bottom of the window. To make this easier, you should set the Console to be working from the same directory as this file. You can easily set this from the menu at the top by going to Session > Set Working Directory > To Source File Location. Do note that the console and the code run while rendering are separate – variables and functions defined in one environment are not accessible from the other without running the code separately.
The main tools you will be gaining familiarity with in this Assignment are:
If you need help with code, you can always reference the class slides or the Datacamp tutorials, or stop by office hours.
The assignment will be graded out of 100 points; point values are specified for each part below.
Goal of the assignment
Throughout the assignment, you will build a model to predict quarterly revenue for Singapore retail companies. To do so, the following steps will be taken:
- Load the data (provided for you on eLearn)
- Process the data into a usable form
- Run an initial model on a training sample
- Explore seasonality in the model using visualization
- Run a model incorporating seasonal factors on a training sample
- Look at accuracy on a testing sample
- Make a plot showcasing where the firms in the analysis were located
- Try a logistic model to examine if GSS’ poor 2016 performance was expected.
Part 1: Load data from eLearn (5 points)
Instructions
All data needed for this homework is provided on eLearn. There are two files you will need:
- Compustat_SG_retail_quarterly.csv
- From Compustat - Capital IQ > Global - Daily > Fundamentals Quarterly
- SG_postal_codes.csv
- Parsed from Singapore Postal Codes, github/xkjyeah
Exercise 1
First, we will need to import some needed packages – tidyverse and plotly.
Load in the data from eLearn. To do this, you can use the read_csv() function. After you have loaded the data, print out a summary of the financial data using the summary() function, including revenue and total assets.2 Also output a list of the companies in the data.3
2 Some possible ways to do this: indexing with [,] or using select() from the dplyr package.
3 Use either unique() or table() to do this – look up these functions either online or in R’s help using ?.
# Set `eval: true` after writing your code
# Place your code for Exercise 1 in this code block.
# Store the financial data in a variable called df.
df <- read_csv("~/Documents/SMU/YEAR 4/FFA/Assignment 2/Compustat_SG_retail_quarterly.csv") Rows: 233 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): indfmt, consol, popsrc, datafmt, datacqtr, datafqtr, conm, costat,...
dbl (12): gvkey, fqtr, fyearq, datadate, atq, ibq, revtq, addzip, ggroup, gi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Fix the postal code measure to be numeric (done for you already)
df <- df %>% mutate(addzip = as.numeric(addzip))
# Store the postal code data in a variable called postal_codes
postal_codes <- read_csv("~/Documents/SMU/YEAR 4/FFA/Assignment 2/SG_postal_codes.csv")

New names:
• `` -> `...1`
Rows: 120929 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): BLK_NO, BUILDING, ROAD_NAME
dbl (4): ...1, LATITUDE, LONGITUDE, POSTAL
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Output a summary of the financial data
df %>%
select(revtq, atq) %>%
  summary()

     revtq              atq
Min. : 9.318 Min. : 73.42
1st Qu.: 54.239 1st Qu.: 253.91
Median : 75.919 Median : 364.27
Mean : 85.116 Mean : 588.56
3rd Qu.: 97.262 3rd Qu.: 744.65
Max. :260.812 Max. :1701.92
NA's :5 NA's :17
# Output a list of all companies in the data
company_list <- unique(df$conm)
# Print the list of unique company names
print(company_list)

[1] "ROBINSON & CO LTD"      "METRO HOLDINGS LTD"
[3] "ISETAN (SINGAPORE) LTD" "CK TANG LTD"
[5] "COURTS (SINGAPORE) LTD" "ZHONGMIN BAIHUI RETAIL GROUP"
[7] "PARKSON RETAIL ASIA LTD"
Part 2: Processing data (20 points)
Instructions
Next we will construct some needed measures. For this assignment, we will work using first differences, i.e., \(revenue_t - revenue_{t-1}\). We will need to construct the first difference between the current quarter and the next, as well as the four differences prior. We will also compare revenue by quarter, and thus calculating the lag of revenue will be useful. Thus, we will construct the following six measures:
- revtq_lag = last quarter revenue
- diff_lead = next quarter revenue - current quarter revenue
- diff1 = current quarter revenue - last quarter revenue
- diff2 = last quarter revenue - 2 quarters ago revenue
- diff3 = 2 quarters ago revenue - 3 quarters ago revenue
- diff4 = 3 quarters ago revenue - 4 quarters ago revenue
Also useful will be the actual year and quarter for the data, as most retailers in Singapore do not have calendar-aligned fiscal years. This has been coded for you.
Then, plot out current revenue vs last quarter revenue such that we can easily distinguish colors by quarter. What insights do we see from the graph? How does this compare with what we saw about US retail firms in session 3?
Exercise 2
Construct the 6 measures specified above.4
4 You should use mutate() and group_by() for these measures. To identify specific companies, you can use either gvkey or their names.
# Set `eval: true` after writing your code
# Place your code for Exercise 2 in this code block
# Construct the six measures listed in the instructions, storing them in the
# specified names. Make sure to add these into df.
df <- df %>%
  arrange(gvkey, fyearq, fqtr) %>%  # order each firm's quarters chronologically so lag()/lead() line up
  group_by(gvkey) %>%
  mutate(
    revtq_lag = lag(revtq),
    diff_lead = lead(revtq) - revtq,
    diff1 = revtq - lag(revtq),
    diff2 = lag(revtq) - lag(revtq, 2),
    diff3 = lag(revtq, 2) - lag(revtq, 3),
    diff4 = lag(revtq, 3) - lag(revtq, 4)) %>%
  ungroup()
# Other useful measures: actual year and quarter -- these are done for you
df$year <- as.numeric(substr(df$datacqtr, 1, 4))
df$qtr <- as.numeric(substr(df$datacqtr, 6, 6))
df$date <- ymd(df$datadate)
df$qtr <- quarter(df$date) # From lubridate
# Clean the data: Replace NaN, Inf, and -Inf with NA
df <- df %>%
mutate(across(where(is.numeric), ~replace(., !is.finite(.), NA)))
summary(df)

     gvkey            indfmt             consol             popsrc
Min. :100081 Length:233 Length:233 Length:233
1st Qu.:100870 Class :character Class :character Class :character
Median :103725 Mode :character Mode :character Mode :character
Mean :165784
3rd Qu.:296576
Max. :306491
datafmt datacqtr datafqtr fqtr
Length:233 Length:233 Length:233 Min. :1.000
Class :character Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Mode :character Median :2.000
Mean :2.468
3rd Qu.:3.000
Max. :4.000
fyearq datadate atq ibq
Min. :2003 Min. :20030630 Min. : 73.42 Min. :-59.7660
1st Qu.:2006 1st Qu.:20068056 1st Qu.: 253.91 1st Qu.: 0.7408
Median :2011 Median :20110331 Median : 364.27 Median : 4.0105
Mean :2010 Mean :20104507 Mean : 588.56 Mean : 9.3301
3rd Qu.:2014 3rd Qu.:20140930 3rd Qu.: 744.65 3rd Qu.: 11.6830
Max. :2018 Max. :20180331 Max. :1701.92 Max. :204.6640
NA's :5 NA's :17 NA's :5
revtq conm costat fic
Min. : 9.318 Length:233 Length:233 Length:233
1st Qu.: 54.239 Class :character Class :character Class :character
Median : 75.919 Mode :character Mode :character Mode :character
Mean : 85.116
3rd Qu.: 97.262
Max. :260.812
NA's :5
addzip ggroup gind gsector
Min. : 68898 Min. :2550 Min. :255030 Min. :25
1st Qu.:169641 1st Qu.:2550 1st Qu.:255030 1st Qu.:25
Median :179103 Median :2550 Median :255030 Median :25
Mean :241635 Mean :2550 Mean :255030 Mean :25
3rd Qu.:238873 3rd Qu.:2550 3rd Qu.:255030 3rd Qu.:25
Max. :528766 Max. :2550 Max. :255030 Max. :25
gsubind revtq_lag diff_lead diff1
Min. :25503010 Min. : 9.318 Min. :-54.773 Min. :-54.773
1st Qu.:25503010 1st Qu.: 54.285 1st Qu.: -5.557 1st Qu.: -5.557
Median :25503010 Median : 75.722 Median : 0.000 Median : 0.000
Mean :25503011 Mean : 84.726 Mean : 1.355 Mean : 1.355
3rd Qu.:25503010 3rd Qu.: 96.261 3rd Qu.: 9.895 3rd Qu.: 9.895
Max. :25503020 Max. :260.812 Max. :146.420 Max. :146.420
NA's :12 NA's :17 NA's :17
diff2 diff3 diff4 year
Min. :-54.7730 Min. :-54.773 Min. :-54.773 Min. :2003
1st Qu.: -5.3388 1st Qu.: -5.481 1st Qu.: -5.467 1st Qu.:2006
Median : 0.0345 Median : 0.000 Median : 0.000 Median :2010
Mean : 1.7809 Mean : 1.305 Mean : 1.598 Mean :2010
3rd Qu.: 10.2323 3rd Qu.: 9.504 3rd Qu.: 9.726 3rd Qu.:2014
Max. :146.4200 Max. :146.420 Max. :146.420 Max. :2018
NA's :23 NA's :29 NA's :36
qtr date
Min. :1.000 Min. :2003-06-30
1st Qu.:1.750 1st Qu.:2007-03-08
Median :2.500 Median :2011-03-31
Mean :2.496 Mean :2010-12-29
3rd Qu.:3.000 3rd Qu.:2014-09-30
Max. :4.000 Max. :2018-03-31
NA's :5 NA's :5
Next, make a plot of lagged revenue (on the x-axis) vs revenue (on the y-axis).5 Color the points based on the fiscal quarter they are coming from.6
5 The library ggplot2 was already loaded when we loaded tidyverse earlier. Use the function ggplot() along with the function geom_point() to make your scatterplot. Here is a reference for scatterplots.
6 Fiscal quarter is given by fqtr in the data. You can set the color by passing factor() of a variable to the color= aesthetic in ggplot2.
# Set `eval: true` after writing your code
# Place your code for Exercise 2 in this code block
# Make a scatter plot using ggplot and store in plot.
plot <- df %>%
  ggplot(aes(y = revtq, x = revtq_lag, color = as.factor(fqtr))) +
  geom_point() +
  labs(
    x = "Lagged Revenue",
    y = "Revenue",
    colour = "Fiscal Quarter",
    title = "Current Revenue Against Last Quarter Revenue") +
  scale_color_discrete(na.translate = FALSE)
# Make the plot interactive
ggplotly(plot)

Next, make the same plot as before, but color the points based on the actual quarter they are coming from.7
7 Actual quarter is given by qtr.
# Set `eval: true` after writing your code
# Place your code for Exercise 2 in this code block
# Make a scatter plot using ggplot and store in plot.
plot <- df %>%
  ggplot(aes(y = revtq, x = revtq_lag, color = as.factor(qtr))) +
  geom_point() +
  labs(
    x = "Lagged Revenue",
    y = "Revenue",
    colour = "Actual Quarter",
    title = "Current Revenue Against Last Quarter Revenue") +
  scale_color_discrete(na.translate = FALSE)
# Make the plot interactive
ggplotly(plot)

What insights do we see from the graphs? How does this compare with what we saw about US retail firms in session 3?
Explanation: Insights observed
1. Higher lagged revenue is associated with higher revenue in the current quarter (in both graphs).
2. Fiscal Q2 appears to correspond to actual (calendar) Q4: Q2 in graph 1 and Q4 in graph 2 show higher revenue than the other quarters.
3. The points in graph 2 fit a straight line more closely, suggesting a stronger linear relationship between current and last quarter revenue when grouped by actual quarter.
-> In Session 3 we discussed why US retail revenue did not line up with the events in the corresponding fiscal quarters (e.g., Christmas and Black Friday fall in calendar Q4, yet fiscal Q1 showed higher revenue than fiscal Q4). From the graphs above, graph 2 shows a more linear and predictable pattern in Singapore retail revenue when grouped by calendar (actual) quarters. This may be because annual events and public holidays follow the calendar year closely.
Part 3: Initial model (10 points)
Instructions
Next, let’s split our data into training and testing data. We will save 2016 onward for testing later.
Now, we’ll build an initial model for predicting diff_lead. Build a linear model that regresses diff_lead (y) on diff1 (x), and save that model to an object called model1. Then, output information about the model using summary(). Finally, explain the following: 1) how well does the model work, and 2) how do we interpret the coefficient on diff1?
Exercise 3
# Set `eval: true` after writing your code
# Place your code for Exercise 3 in this code block
# Split data into training and test -- before actual year 2016 data will be
# for training, the rest for testing. Call the training data frame training,
# and call the testing data frame testing
training <- df %>% filter(year < 2016)
testing <- df %>% filter(year >= 2016)
# Run the linear model and store it in model1
model1 <- lm(diff_lead ~ diff1, data = training)
# Print out a summary of model1
summary(model1)
Call:
lm(formula = diff_lead ~ diff1, data = training)
Residuals:
Min 1Q Median 3Q Max
-38.548 -8.128 -2.432 8.040 147.235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.23296 1.29834 1.720 0.08727 .
diff1 -0.19981 0.07261 -2.752 0.00656 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.99 on 171 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.04241, Adjusted R-squared: 0.03681
F-statistic: 7.573 on 1 and 171 DF, p-value: 0.006565
Explanation: This regression (Model 1) regresses the change in revenue from the current quarter to the next quarter (diff_lead) on the change in revenue from the previous quarter to the current quarter (diff1).
Context: Suppose we are in Q3; the model uses the change from Q2 to Q3 to predict the change from Q3 to Q4. Regression interpretation: for every $1 increase in the revenue change from Q2 to Q3, the predicted revenue change from Q3 to Q4 decreases by about $0.20. In summary, Model 1 suggests a statistically significant relationship between the change in revenue from the previous quarter to the current quarter (diff1) and the change in revenue from the current quarter to the next quarter (diff_lead), given its low p-value of 0.0066. However, the intercept is not significant, and the overall explanatory power of the model is low, as indicated by the adjusted R-squared of 0.037. This means that other factors not included in the model likely also influence diff_lead.
Possible reason for this result: there is seasonality in our data. Retail revenue is often seasonal, with certain quarters naturally stronger or weaker due to holidays, events, or economic factors. In such cases, an increase in diff1 need not lead to an increase in diff_lead if the next quarter is a weak one.
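As a concrete illustration of the slope (a hedged sketch; the diff1 values below are made up for illustration), predict() can compare the model’s fitted diff_lead at two hypothetical values of diff1:

# Illustration only: predicted diff_lead at two hypothetical diff1 values.
# With a slope of roughly -0.20, the prediction at diff1 = 10 should be about
# $2 lower than the prediction at diff1 = 0.
predict(model1, newdata = data.frame(diff1 = c(0, 10)))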
Part 4: Visualizing the data (10 points)
Instructions
Based on the initial model as well as the visualization from exercise 2, we may want to explore seasonality more in the data. For this section, you should propose a visualization that could help you to better understand the data and any potential seasonality in it.8
8 You can use any tool in R that you would like to. We have mostly used ggplot2 for this in class, along with passing the ggplot2 plots to ggplotly()
Then, explain the following:
- Why did you make the selected visualization?
- What do you learn from the visualization?
Exercise 4
# Set `eval: true` after writing your code
# Place your code for Exercise 4 in this code block
# Create a line plot for quarterly revenue trends (CHOSEN VISUALISATION)
seasonality_plot <- ggplot(data = df, aes(x = year, y = revtq, group = qtr, color =as.factor(qtr))) +
geom_line() +
labs(x = "Year", y = "Revenue") +
ggtitle("Quarterly Revenue Trends Over Time") +
theme_minimal()
# Display the plot
print(seasonality_plot)

Warning: Removed 5 rows containing missing values (`geom_line()`).
# Scatter plot with lag for 1st to 99th percentile of data (ANOTHER VISUALISATION TO CONSIDER FOR FIRST DIFFERENCE DATA)
plt_sct <- function(df, var1, var2) {
  # Scatter plot of var1 (y) against var2 (x), colored by quarter, restricted
  # to observations between the 1st and 99th percentiles of both variables
  df %>%
    filter(.data[[var1]] < quantile(.data[[var1]], 0.99, na.rm = TRUE),
           .data[[var2]] < quantile(.data[[var2]], 0.99, na.rm = TRUE),
           .data[[var1]] > quantile(.data[[var1]], 0.01, na.rm = TRUE),
           .data[[var2]] > quantile(.data[[var2]], 0.01, na.rm = TRUE)) %>%
    ggplot(aes(y = .data[[var1]], x = .data[[var2]], color = factor(qtr))) +
    geom_point() + geom_smooth(method = "lm") + ylab(var1) + xlab(var2)
}
plt_sct(training, "diff_lead", "diff1")`geom_smooth()` using formula = 'y ~ x'
# # Create a scatter plot to explore seasonality (ATTEMPT #1)
# Model1_plot <- ggplot(data=training, aes(y=diff_lead, x=predict(model1, training), color=factor(qtr))) +
# geom_point() +
# geom_abline(slope=1) +
# xlab("Fiscal Quarter") +
# ylab("Predicted Revenue")
#
# #Make the plot interactive
# ggplotly(Model1_plot)
#
# ATTEMPT #2
# #season_plot <- ggplot(data = df, aes(x = qtr, y = revtq, colour=factor(year))) +
# # geom_col(fill = "blue") + # Use geom_col() for continuous data
# # xlab("Fiscal Quarter") +
# # ylab("Predicted Revenue")
# #ggplotly(season_plot)
#
# ATTEMPT #3
# season_line_plot <- ggplot(data = df, aes(x = qtr, y = revtq, color = factor(year))) +
# geom_line() + # Use geom_line() for line plot
# xlab("Fiscal Quarter") +
# ylab("Predicted Revenue") +
# ggtitle("Predicted Revenue Trends Over Fiscal Quarters by Year") +
# theme_minimal()
# ggplotly(season_line_plot)

Explanation:
- Why did you make the selected visualization? From the previous scatterplots, we noticed that each quarter appears to have its own trendline. We therefore wanted to incorporate seasonality in our model, and built this time-series visualization to show the seasonal effects clearly.
-> Quarterly variations: Variation in revenue within each quarter lets us see the quarterly performance of the companies and identify which quarters tend to have higher or lower revenue based on the peaks and valleys in the lines.
-> Outliers: Unusual spikes or dips in revenue are potential outliers that require further investigation. These could be due to specific events or factors that significantly impacted revenue in those quarters (e.g., pandemics, extreme weather, socio-economic conditions).
- What do you learn from the visualization? We see a recurring pattern in which Q4 generally has higher revenue than the other quarters, consistent with our assumption that the Q4 holiday season generates higher sales. One anomaly is 2007, when Q1 outperformed the other quarters; this could reflect macroeconomic conditions in the preceding quarters.
Part 5: Incorporating seasonality (20 points)
Instructions
Based on your graphic in part 4, propose the changes that you would make to the initial model from exercise 3, using variables from the original data set or from the variables created in Exercise 2. Write out your reasoning for the model, and include a null and alternative hypothesis.
Once you have decided on your model, run an OLS regression to test the model.
Interpret the result of the model. And how well does the model perform?
Exercise 5
Model reasoning: The hypotheses below can be tested with standard inference on the coefficients of the linear regression model (e.g., t-tests on individual coefficients or an F-test on several coefficients jointly).
\(H_0\):The coefficients of diff2, diff3, and diff4 in the model are all equal to zero, suggesting that a long-run autoregressive model (including lagged quarters) does not help predict next quarter revenue (diff_lead).
\(H_1\): At least one of the coefficients of diff2, diff3, or diff4 in the model is not equal to zero, suggesting that a long-run autoregressive model (including lagged quarters) helps predict next quarter revenue (diff_lead).
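One way to carry out this joint test (a sketch, not part of the required exercise; the object names training_complete, restricted, and full are placeholders) is a nested-model F-test with anova(), refitting the restricted model on the same rows the full model uses:

# Hedged sketch: joint F-test of H0 that the coefficients on diff2, diff3,
# and diff4 are all zero. The restricted model is refit on the rows where all
# lags are available so that the two models use the same observations.
training_complete <- training %>%
  filter(!is.na(diff_lead), !is.na(diff1), !is.na(diff2), !is.na(diff3), !is.na(diff4))
restricted <- lm(diff_lead ~ diff1, data = training_complete)
full       <- lm(diff_lead ~ diff1 + diff2 + diff3 + diff4, data = training_complete)
anova(restricted, full)  # a small p-value rejects H0 in favour of H1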
# Set `eval: true` after writing your code
# Place your code for Exercise 5 in this code block
# Run your proposed model, and store it in model2 (1 year of lag regression) (CHOSEN MODEL!!)
model2 <- lm(diff_lead ~ diff1 + diff2 + diff3 + diff4, data = training)
#POSSIBLE MODELS CONSIDERED (model3)
#Quarter and Year lag regression
model3 <- lm(diff_lead ~ diff1 + diff4, data = training)
# Print out the model
summary(model2)
Call:
lm(formula = diff_lead ~ diff1 + diff2 + diff3 + diff4, data = training)
Residuals:
Min 1Q Median 3Q Max
-31.877 -6.043 -0.771 3.860 147.896
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.096150 1.293016 0.848 0.398
diff1 -0.005215 0.076723 -0.068 0.946
diff2 0.230185 0.114333 2.013 0.046 *
diff3 0.024257 0.121087 0.200 0.842
diff4 0.687782 0.112465 6.116 8.95e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.35 on 141 degrees of freedom
(51 observations deleted due to missingness)
Multiple R-squared: 0.2816, Adjusted R-squared: 0.2612
F-statistic: 13.82 on 4 and 141 DF, p-value: 1.56e-09
summary(model3)
Call:
lm(formula = diff_lead ~ diff1 + diff4, data = training)
Residuals:
Min 1Q Median 3Q Max
-29.845 -5.329 -0.988 2.774 153.248
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.18999 1.25937 0.945 0.346
diff1 -0.04451 0.07098 -0.627 0.532
diff4 0.66127 0.09950 6.646 5.5e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.32 on 147 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.2559, Adjusted R-squared: 0.2457
F-statistic: 25.27 on 2 and 147 DF, p-value: 3.687e-10
Model 2 is preferred: it reflects how the change in revenue over the same pair of quarters one year earlier (diff4, exactly four quarters back) helps predict the next quarter’s revenue change, and it has lower p-values and a higher adjusted R-squared than the other models considered.
Interpretation: Holding the other differences constant, a $1 increase in diff1 decreases predicted diff_lead by about $0.005; a $1 increase in diff2 increases it by about $0.23; a $1 increase in diff3 increases it by about $0.02; and a $1 increase in diff4 increases it by about $0.69.
Model performance: The adjusted R-squared is 0.261, meaning that about 26.1% of the variation in diff_lead is explained by the model (diff1 through diff4). This is still fairly low, indicating limited predictive power, although it is much better than Model 1.
The p-value of diff4 is approximately zero, so diff4 is statistically significant in predicting diff_lead. The p-value of diff2 is 0.046 (< 0.05), so diff2 is also statistically significant. The p-values of diff1 and diff3 are high, so they are not statistically significant predictors of diff_lead.
Part 6: Out of sample accuracy (10 points)
Instructions
Next, we will check how well the initial model and your model perform out of sample. To check this, we will use RMSE. The function for RMSE is given below:
# Takes two vectors and returns the RMSE between the two
rmse <- function(v1, v2) {
sqrt(mean((v1 - v2)^2, na.rm=T))
}

Use predict() and the rmse() function to generate the RMSE for each model using the testing data frame. How would you interpret the performance of the models in relation to one another?
Exercise 6
# Set `eval: true` after writing your code
# Place your code for Exercise 6 in this code block
# Calculate model RMSEs -- fill in the predict function
rmse(testing$diff_lead, predict(model1, testing))

[1] 17.16705

rmse(testing$diff_lead, predict(model2, testing))

[1] 14.60683
Interpretation: Comparing the RMSE values for Model 1 and Model 2: Model 2 has a lower RMSE (14.6) compared to Model 1 (17.2). This suggests that Model 2 has a better predictive performance on the testing data compared to Model 1. In other words, Model 2’s predictions are, on average, closer to the actual values than Model 1’s predictions.
Part 7: Firm locations (10 points)
Instructions
For this part, we will plot out where these seven retail firms are or were headquartered in Singapore, and include some basic financial information about them as well. To do this, you will need to merge in the postal_codes data into df. You will either need to specify the key to match on,9 or change variable names to match each other, as postal code is stored as addzip in the Compustat data and as POSTAL in the postal code data.
9 You can specify matching on different names across the data by specifying the following option: by = c("left_var" = "right_var")
Then, print out an interactive map with the following:10
10 Specifying na.rm=TRUE may be helpful if using min() and max(), so as to tell them to ignore missing data. Also, note that to take the first observation within a group, you can use group_by() to group, and then pipe into slice(1) to extract just the first row of a group, then pipe into ungroup().
- 1 marker (of any appropriate type) where each of the businesses are or were headquartered
- Each marker should, either on hover, click, or a mix of the two, display:
- The name of the company
- The years the company is in the data
- The minimum and maximum quarterly revenue the company had during that period
You can use either leaflet or plotly to make the interactive map.11 You can follow the syntax from Session 3 exercises 3 or 4.12
11 You will get a much more detailed map of Singapore using leaflet. You could also use ggplot2 paired with sf, maps, maptools, and rgeos, passing the result to ggplotly() from plotly. This requires a lot more setup, however, and may require installing other software outside of R if you are not running on Windows.
12 You may have noticed that both plotly and leaflet use ~ in front of variables. This tells it to use the data specified to data= for any variables following the ~. Thus, you can pass expressions like ~lat in place of mapdata$lat. You can also use functions, like ~paste(conm, "<br>Latitude:", lat, "<br>Longitude:", lon).
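Related to the tip in note 10, here is a minimal sketch of the group_by() then slice(1) pattern for keeping one headquarters location per firm (it assumes the merged data frame is called df_map, as in the exercise code below; the object name hq_locations is a placeholder):

# Keep a single row per company: group by company name, sort by date, and
# take the first row of each group, keeping only the location columns.
hq_locations <- df_map %>%
  group_by(conm) %>%
  arrange(datadate) %>%
  slice(1) %>%
  ungroup() %>%
  select(conm, LATITUDE, LONGITUDE)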
Exercise 7
# Set `eval: true` after writing your code
# Place your code for Exercise 7 in this code block
# Make sure that POSTAL is numeric -- this is done for you
postal_codes$POSTAL <- as.numeric(postal_codes$POSTAL)
df <- df %>%
rename(POSTAL = addzip)
# Merge the data
df_map <- inner_join(df, postal_codes, by = "POSTAL")
# If using leaflet, load leaflet. Plotly is already loaded above.
# Process the data for your map -- fill in each function
# Process the data for your map
mapdata <- df_map %>%
group_by(conm) %>%
summarize(
lat = mean(LATITUDE, na.rm = TRUE), # Calculate mean latitude
lon = mean(LONGITUDE, na.rm = TRUE), # Calculate mean longitude
years = paste(min(year), "-", max(year)), # Years in the data
min_revenue = min(revtq, na.rm = TRUE), # Min quarterly revenue
max_revenue = max(revtq, na.rm = TRUE), # Max quarterly revenue
average_revenue = mean(revtq, na.rm = TRUE)) %>%
ungroup()
# Make your map -- consult Session 3 exercises 3 or 4 for details on how to do this.
# Load the leaflet library if not already loaded
# Create the interactive map
my_map <- leaflet(data = mapdata) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 5,  # Adjust the marker radius as needed
    color = ~colorNumeric(
      palette = "Blues",  # use a sequential palette (rather than a single color) for the revenue scale
      domain = mapdata$average_revenue,
      reverse = TRUE
    )(average_revenue),
    fillOpacity = 0.7,  # Opacity of marker fill
    popup = ~paste(
      "<strong>Company:</strong>", conm, "<br>",
      "<strong>Years:</strong>", years, "<br>",
      "<strong>Min Revenue:</strong> $", round(min_revenue, 2), " million<br>",
      "<strong>Max Revenue:</strong> $", round(max_revenue, 2), " million<br>",
      "<strong>Avg Revenue:</strong> $", round(average_revenue, 2), " million"
    )
  ) %>%
  setView(lng = mean(mapdata$lon), lat = mean(mapdata$lat), zoom = 10)
my_map

Explanation:
-> We used the leaflet library to plot the map, centred the view on Singapore, and added a marker for each company so that it is easy to see where these businesses are, or were, headquartered.
As such, we note three insights:
1. Five of the seven retail companies (Isetan, Robinson, Metro, Parkson and Tangs) are located in Singapore's central region, while two (Courts and Zhongmin) are in the east.
2. ZHONGMIN BAIHUI RETAIL GROUP has the highest revenue, with a minimum quarterly revenue of $9.32 million and a maximum of $260.81 million (average: $157.95 million).
3. METRO HOLDINGS LTD has the lowest revenue, with a minimum quarterly revenue of $28.32 million and a maximum of $73.9 million (average: $47.67 million).
Part 8: Running a logistic model (15 points)
Instructions
The Great Singapore Sale in 2016 was generally considered to have performed quite poorly13. For this exercise, we will make the assumption that a successful GSS will lead to higher sales in Q2 than sales from the previous Q2.14
13 See this Straits Times coverage.
14 This doesn’t cover the whole of GSS, but does cover the first month.
Let’s run a simple model to see if:
- A model of revenue growth can predict GSS success, and
- If we would have predicted a poor GSS season for the retailers in our data in 2016.
After that, you should interpret the model – what does it say about the relationship between GSS success and past revenue growth?
Lastly, I have provided code to produce the odds ratios and the probability of success for each company with 2016 Q2 data. I have also provided a baseline for each, which is the average result across all years in the training data. Interpret either the odds ratios or the probabilities. Based on your model, does it seem like GSS 2016 was expected to be bad from the beginning?
Exercise 8
# Set `eval: true` after writing your code
# Place your code for Exercise 8 in this code block
# Create the GSS_Success measure and growth measures by company
# These are done for you
df <- df %>%
group_by(gvkey) %>%
mutate(GSS_success=ifelse(revtq > lag(revtq, 4), 1, 0)) %>%
  mutate(growth1 = (lag(revtq) - lag(revtq, 2)) / lag(revtq, 2),   # revenue growth one quarter back
         growth2 = (lag(revtq, 2) - lag(revtq, 3)) / lag(revtq, 3), # revenue growth two quarters back
         growth3 = (lag(revtq, 3) - lag(revtq, 4)) / lag(revtq, 4), # revenue growth three quarters back
         growth4 = (lag(revtq, 4) - lag(revtq, 5)) / lag(revtq, 5)) %>% # revenue growth four quarters back
ungroup()
df[!is.na(df$qtr) & df$qtr != 2,]$GSS_success = NA
# Split into training and testing data -- these are done for you
training = df %>% filter(year < 2016)
testing = df %>% filter(year == 2016, qtr == 2)
# Run a logistic model using glm() of GSS_success on growth1 through growth4
# using training data. Make sure to set the family option
fit <- glm(GSS_success ~ growth1 + growth2 + growth3 + growth4, data = training, family = binomial)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Output a summary of the model
summary(fit)
Call:
glm(formula = GSS_success ~ growth1 + growth2 + growth3 + growth4,
family = binomial, data = training)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.058 1.610 -1.900 0.0575 .
growth1 9.301 6.497 1.432 0.1523
growth2 22.596 9.778 2.311 0.0208 *
growth3 20.166 9.666 2.086 0.0369 *
growth4 5.112 4.824 1.060 0.2892
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 45.234 on 33 degrees of freedom
Residual deviance: 33.165 on 29 degrees of freedom
(163 observations deleted due to missingness)
AIC: 43.165
Number of Fisher Scoring iterations: 7
Explanation of the model: What does the model say about the relationship between GSS success and past revenue growth?
-> The coefficients suggest a positive relationship between past revenue growth and GSS success.
- The p-values of growth2 and growth3 are statistically significant at the 0.05 level, suggesting that revenue growth from two and three quarters ago are important predictors of GSS success.
- Since growth2 and growth3 have positive coefficients (22.596 and 20.166), higher revenue growth two and three quarters ago is associated with a greater likelihood of GSS success.
- growth1 and growth4 also have positive coefficients, but their larger p-values mean they are not statistically significant predictors of GSS success. (A sketch of how to read these coefficients on the odds scale follows below.)
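To read these coefficients on the odds scale, they can be exponentiated (a short hedged sketch using only the fitted model above):

# exp(coefficient) gives the multiplicative change in the odds of GSS success
# for a one-unit (i.e., 100-percentage-point) increase in that growth variable.
exp(coef(fit))
# For a more realistic 10-percentage-point increase in growth2:
exp(0.10 * coef(fit)["growth2"])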
# Set `eval: true` after writing your code
# This code is all already done for you.
# It creates the odds and probabilities of success by firm,
# as well as a baseline from the training data
logodds <- predict(fit, testing)
# Convert to odds
odds = exp(logodds)
names(odds) <- testing$conm
# Convert to probability
probability <- odds / (1 + odds)
names(probability) <- testing$conm
# Baseline probability:
baseline <- sum(training$GSS_success, na.rm=T) /
sum(!is.na(training$GSS_success))
names(baseline) <- "Baseline"
baseline_odds = baseline / (1-baseline)
print("Odds")[1] "Odds"
c(baseline_odds, odds) Baseline METRO HOLDINGS LTD
1.625000000 0.002635516
ISETAN (SINGAPORE) LTD ZHONGMIN BAIHUI RETAIL GROUP
0.165977553 0.152708641
PARKSON RETAIL ASIA LTD
0.554071796
print("Probabilities")[1] "Probabilities"
c(baseline, probability) Baseline METRO HOLDINGS LTD
0.619047619 0.002628588
ISETAN (SINGAPORE) LTD ZHONGMIN BAIHUI RETAIL GROUP
0.142350556 0.132478092
PARKSON RETAIL ASIA LTD
0.356529086
Explanation of 2016 expectation: Based on our model, does it seem like GSS 2016 was expected to be bad from the beginning?
ANSWER: Although the model finds that past revenue growth is positively related to GSS success in general, the predicted probabilities for 2016 Q2 are all well below the baseline, so the model would have expected a poor GSS 2016 from the beginning.
- The baseline probability is 0.619, meaning that across the years before 2016 the probability of GSS success was 61.9%.
- METRO HOLDINGS LTD has a very low predicted probability of success (0.0026), far below the baseline.
- ISETAN (SINGAPORE) LTD and ZHONGMIN BAIHUI RETAIL GROUP have relatively low predicted probabilities (0.1424 and 0.1325 respectively), also well below the baseline.
- PARKSON RETAIL ASIA LTD has the highest predicted probability among the companies (0.3565), but it is still considerably below the baseline.