ACCT 420 Individual Assignment 1

Forecasting with OLS

Author

Ruth Priya Ramani & Mah Min Hui, Joyce

Published

September 8, 2023

General instructions

This assignment is intended to help you explore some of the concepts we have discussed in the first few weeks of class:

  1. R Coding
  2. Forecasting using linear regression
  3. Visualization using ggplot2

This assignment is written as a Quarto document. Quarto is a technology to create replicable documents, embedding all code, results, and analysis in the same file. Margin notes are included in this file using the syntax [^1], containing various tips and hints for the exercises.1 To get a feel for how this document type works, click the Render button in the toolbar right above this editor window. If you have selected “Preview in window” from the settings dropdown (just right of Render), a window will pop up with the output of the document that refreshes each time you click Render. Alternatively, you can open the new html file that was created in the same folder as this .qmd file. You can open it in your favorite web browser and refresh that browser tab after rendering each time.

  • 1 This is an example of a margin note. The list of notes is at the bottom of the .qmd source file.

  • Note that some of the code in the code blocks has already been filled in for you. As a consequence, code blocks are all set to eval: false to start with, so that the document can be compiled before you finish all exercises. As you get to an exercise, change eval: false to eval: true for its code blocks so that they are evaluated when rendering.

    I recommend experimenting with your code for the assignment in the R Console at the bottom of the window. To make this easier, you should set the Console to be working from the same directory as this file. You can easily set this from the menu at the top by going to Session > Set Working Directory > To Source File Location. Do note that the console and the code run while rendering are separate – variables and functions defined in one environment are not accessible from the other without running the code separately.

    The main tools you will be gaining familiarity with in this Assignment are:

    If you need help with code, you can always reference the class slides or the Datacamp tutorials, or stop by office hours.

    The assignment will be graded out of 100 points; point values are specified for each part below.

    Goal of the assignment

    Throughout the assignment, you will build a model to predict ROA (net income divided by assets) for US real estate companies. To do so, the following steps will be taken:

    1. Obtain US real estate company data from Compustat
    2. Process the data into a usable form
    3. Run an initial model using only accounting data
    4. Obtain and process data on home values in the US
    5. Explore the macro data
    6. Merge the accounting and macro data
    7. Build a hybrid model using both
    8. See how well the models perform

    Part 1: Obtaining accounting data (10 points)

    Instructions

    Before starting with Part 1, think about a financial ratio that you think might help predict these companies’ ROA. You will need this for Part 1 and Part 3. What financial ratio did you pick?2

  • 2 Here is a list of financial ratios if you need a refresher.

  • Ratio: debt-to-assets ratio (total debt divided by total assets)

    Log on to WRDS.3 Navigate to “Compustat - Capital IQ”, then to “North America - Daily”, then to “Fundamentals Annual.” The resulting page is where we will select our data. Select the following options4:

  • 3 If you do not have an account, you can register for an account at this link. Be sure to input your SMU email under Email address, select “Singapore Management University” as the Subscriber, and “Students (Masters / Undergrad)” as User type.

  • 4 If an option is not mentioned, you should leave it unchanged at the default setting.

    1. Date range: 2012-01 to 2022-12
    2. Company codes and screening
    • Company codes: Select “Search the entire database”
    • Industry format: uncheck “FS”
    • Currency: uncheck “CAD”
    3. Variable types:
    • CONM – Company Name
    • GSECTOR – GIC Sectors
    • STATE – State/Province
    • ADDZIP – Postal Code
    • REVT – Revenue - Total
    • NI – Net Income (Loss)
    • AT – Assets - Total
    • Also select any other measure(s) needed for the financial ratio you picked.
    • Conditional Statement5: GSECTOR, equal, 60
  • 5 This will keep the download size more manageable.

    4. Output
    • Output format: comma-delimited text (*.csv)
    • Compression: None or zip (uncompress the file after downloading if you pick zip)
    • Date Format YYMMDDn8 (the default)
    5. Click submit
    • When WRDS finishes, download the csv file and store it in the same location as this file

    Exercise 1

    Load in the data you downloaded from Compustat. To do this, you can use the read_csv() function. After you have loaded the data, print out a summary of the data using the summary() function, including only fiscal year, revenue, and total assets.6 Below the code block, write a brief explanation on what the summary says about revenue for these firms over the time period.

  • 6 Some possible ways to do this: indexing with [,], using subset(), using filter() from the dplyr package.

  • # Set `eval: true` after writing your code
    # Place your code for Exercise 1 in this code block.
    library(readr)
    
    # Store the data in a variable called df.
    df <- read_csv("~/Downloads/FFA/FFA Ass 1/compustat1.csv")
    Rows: 3168 Columns: 20
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr (11): datadate, indfmt, consol, popsrc, datafmt, tic, conm, curcd, costa...
    dbl  (9): gvkey, fyear, at, cogs, dt, gp, ni, revt, gsector
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # Output a summary of the data
    summary_df <- summary(df[, c("fyear", "revt", "at")])
    print(summary_df)
         fyear           revt                 at          
     Min.   :2011   Min.   :   -8.798   Min.   :     0.0  
     1st Qu.:2014   1st Qu.:   71.250   1st Qu.:   470.9  
     Median :2017   Median :  314.462   Median :  2114.1  
     Mean   :2017   Mean   :  874.769   Mean   :  4612.3  
     3rd Qu.:2020   3rd Qu.:  884.093   3rd Qu.:  5574.4  
     Max.   :2022   Max.   :30828.246   Max.   :122520.0  
                    NA's   :110         NA's   :82        

    Explanation: From this output, we see that revenue varies greatly even if we exclude the extreme ends and look only at the 1st quartile to the 3rd quartile. Median revenue is USD 314 million, with a minimum of USD -8.80 million and a maximum of USD 30,828.25 million. This spread likely reflects differences in company size, where smaller or lesser-known companies earn less revenue than bigger, more notable companies. The negative minimum revenue may reflect the effects of Covid, when property markets struggled due to construction halts.

    Assets comprise the properties, subsidiaries, and cash held by each real estate company. Total assets vary from 0 to USD 122,520 million, with a median of USD 2,114 million. Higher assets generally indicate greater financial stability and capacity for investment growth.
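    As an aside, footnote 6 also mentions dplyr; the same three-column summary could be produced with select() instead of [ , ] indexing. A minimal sketch, assuming df has already been loaded as above:

    library(dplyr)
    
    # Equivalent to summary(df[, c("fyear", "revt", "at")]), using dplyr's select()
    df %>%
      select(fyear, revt, at) %>%
      summary()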

    Part 2: Processing data (20 points)

    Instructions

    First, we will need to import some needed packages – tidyverse and plotly.

    # If you get an error here -- install the packages from the Session 3 slides.
    library(tidyverse)
    library(plotly)

    In order to predict ROA, we will need to construct some measures, particularly ROA and future (1 year ahead) ROA. We will also need to construct the financial ratio you picked above. Remember that ROA is defined as:

    \[ ROA_t = \frac{Net~Income_t}{Assets_t} \]

    We will also need to remove the problematic observations indicated by the summary() of revenue in Exercise 1. Remove any observations from df where revenue is not more than 0, assets7 are less than 300 million USD, revenue is NA8, or assets is NA.

  • 7 Remember, numbers in Compustat are already in Millions of USD. So the conditional expression for this is at < 300 or df$at < 300.

  • 8 The function is.na() is very useful for this. It takes a vector as an input and returns a vector of TRUE and FALSE: TRUE whenever the element is NA, FALSE otherwise. Also, negation using ! may be useful; see the short example after these notes.

  • After constructing the measures and cleaning the data, take a look at the distribution of ROA vs future ROA.
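    As a quick illustration of footnote 8 (not part of the graded code), is.na() and its negation behave like this on a small vector:

    x <- c(1, NA, 3)
    is.na(x)    # FALSE  TRUE FALSE -- TRUE where the element is missing
    !is.na(x)   #  TRUE FALSE  TRUE -- TRUE where the element is present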

    Exercise 2

    Construct the three needed measures: ROA9, 1 year ahead ROA10, and your chosen measure.

  • 9 Nothing fancy is needed for calculating ROA: just net income divided by assets.

  • 10 This follows a very similar structure to revt_lead for the Singapore real estate companies data in the Session 2 slides.

  • # Set `eval: true` after writing your code
    # Place your code for Exercise 2 in this code block
    
    # Clean up the data following the instructions.  Overwrite df with the cleaned data.
    df <- df %>%
      filter(
        revt > 0,         # Revenue is more than 0
        at >= 300,        # Assets are at least 300 million USD
        !is.na(revt),
        !is.na(at)
      )
    
    
    # Construct ROA.  Call this roa; add it to your data frame.
    df <- df %>%
      mutate(roa = ni/at)
    
    # Construct 1 year ahead ROA.  Call this roa_lead; add it to your data frame.
    df <- df %>%
      mutate(roa_lead = lead(roa))
    
    # Construct your financial ratio.  Pick an appropriate name; add it to your data frame.
    df <- df %>%
      mutate(dta = dt/at) %>% # dta represents debt to asset ratio
      filter(!is.na(dta))
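    Note that lead(roa) above takes the next row’s ROA regardless of which company that row belongs to. A safer variant, shown only as a sketch (it assumes gvkey identifies the company and fyear the fiscal year, and it is not the code used to produce the results below), would compute the lead within each company:

    # Sketch: compute 1-year-ahead ROA within each company, so the lead never
    # pulls the first observation of the next firm in the data.
    df_alt <- df %>%
      group_by(gvkey) %>%
      arrange(fyear, .by_group = TRUE) %>%
      mutate(roa_lead = lead(roa)) %>%
      ungroup()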

    Next, make a plot of ROA (on the x-axis) vs future ROA (on the y-axis).11 Store the plot in a variable called plot.

  • 11 The library ggplot2 was already loaded when we loaded tidyverse earlier. Use the function ggplot() along with the function geom_point() to make your scatterplot. Here is a reference for scatterplots.

  • # Set `eval: true` after writing your code
    # Place your code for Exercise 2 in this code block
    # Feel free to clean up the plot a bit, such as adjusting the formatting
    # or adding a regression line as well.
    
    # Make a scatterplot using ggplot and store in plot.
    plot <-  ggplot(df, aes(x = roa, y = roa_lead)) +
      geom_point() +
      labs(x = "ROA", y = "Future ROA")
    
    #Make the plot interactive
    library(plotly)
    ggplotly(plot)  # No need to change this line
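    As the comment above suggests, a regression line could optionally be layered on top of the scatterplot. An optional tweak, not required for the exercise:

    # Optional: add an OLS fit line to the scatterplot
    plot + geom_smooth(method = "lm")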

    Part 3: Initial model (20 points)

    Instructions

    Now, let’s build our first model for predicting the next year’s ROA. Based on the plot in Part 2 (if you zoom in to the center of the cluster), it seems that including ROA from the current year would be a good start. Build a linear model that regresses roa_lead on roa, and save that model to an object called model1. Then, output information about the model using summary(). Then explain the following: 1) how well does the model work, and 2) how do we interpret the coefficient on ROA?

    Next, make a new model called model2, regressing future ROA on both ROA and your chosen ratio. Again, output information about the model using summary(). Does your ratio improve prediction?12 What evidence is there in the summary?

  • 12 It is fine if it doesn’t – you do not need to redo the above until you find one that works.

  • Exercise 3

    # Set `eval: true` after writing your code
    # Place your code for Exercise 3 in this code block
    
    # Run the linear model and store it in model1
    model1 <- lm(roa_lead ~ roa, data = df)
    
    # Print out a summary of model1
    summary(model1)
    
    Call:
    lm(formula = roa_lead ~ roa, data = df)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.81199 -0.01499  0.00019  0.01653  0.79009 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 0.014312   0.001247   11.47   <2e-16 ***
    roa         0.274262   0.019241   14.25   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.0587 on 2444 degrees of freedom
      (1 observation deleted due to missingness)
    Multiple R-squared:  0.07675,   Adjusted R-squared:  0.07637 
    F-statistic: 203.2 on 1 and 2444 DF,  p-value: < 2.2e-16

    Explanation: Qn: Explain the following: 1) how well does the model work, and 2) how do we interpret the coefficient on ROA?

    1. The model’s fit is fairly weak: the adjusted R-squared of 0.0764 means that only about 7.6% of the variation in roa_lead is explained by the variation in roa. However, the p-values for both the intercept and roa are essentially zero (< 2e-16), so roa is still a statistically significant predictor.
    2. Since ROA is a ratio rather than a dollar amount, the coefficient means that a one-unit increase in current roa is associated with a 0.274-unit increase in next year’s roa_lead (equivalently, each additional percentage point of current ROA predicts about 0.27 percentage points of additional ROA next year).
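    Written out, the fitted model from the summary above is approximately:

    \[ \widehat{ROA}_{t+1} = 0.0143 + 0.274 \times ROA_t \]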
    # Set `eval: true` after writing your code
    # Place your code for Exercise 3 in this code block
    
    # Run the linear model and store it in model2
    model2 <- lm(roa_lead ~ roa + dta , data = df)
    
    # Print out a summary of model2
    summary(model2)
    
    Call:
    lm(formula = roa_lead ~ roa + dta, data = df)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.81253 -0.01434 -0.00011  0.01643  0.78324 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.020607   0.003276   6.290 3.74e-10 ***
    roa          0.267243   0.019523  13.689  < 2e-16 ***
    dta         -0.013002   0.006257  -2.078   0.0378 *  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.05866 on 2443 degrees of freedom
      (1 observation deleted due to missingness)
    Multiple R-squared:  0.07838,   Adjusted R-squared:  0.07762 
    F-statistic: 103.9 on 2 and 2443 DF,  p-value: < 2.2e-16

    Explanation: Qn: Next, make a new model called model2, regressing future ROA on both ROA and your chosen ratio. Again, output information about the model using summary(). Does your ratio improve prediction?13 What evidence is there in the summary?

  • 13 It is fine if it doesn’t – you do not need to redo the above until you find one that works.

  • A one-unit increase in roa is associated with a 0.267-unit increase in roa_lead; with a p-value below 2e-16, roa remains statistically significant. Meanwhile, a one-unit increase in dta is associated with a 0.013-unit decrease in roa_lead. With a p-value of 0.038, dta is also statistically significant at the 5% level, although it does not predict roa_lead as strongly as roa does. The coefficient on roa is roughly 20 times larger in magnitude than the coefficient on dta, meaning roa affects roa_lead far more than dta does.

    Adjusted R-squared increases only slightly, from 0.07637 to 0.07762 (a relative improvement of about 1.6%), which is negligible. Hence, we conclude that our ratio, dta, is not a strong predictor of roa_lead.
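    For reference, the fitted model from the summary above is approximately:

    \[ \widehat{ROA}_{t+1} = 0.0206 + 0.267 \times ROA_t - 0.013 \times \frac{Debt_t}{Assets_t} \]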

    Part 4: Obtaining macro data (5 points)

    Instructions

    The file for the macro data is available on eLearn. The file is provided publicly by Zillow. Please download the file and add it to the same directory as this file. Then, load the file using read_csv(), storing it into a variable named macro. This data contains median house prices in the US for each zip code (US name for a postal code) for each month since January 2011.

    Take a look at the format of the data – either 1) load it through the R Console below and then click it in the Environment viewer on the right side of RStudio, or 2) look at the file in Excel or a similar app. Notice how it has U.S. Zip codes (postal codes) as rows and months as columns. This is an example of a wide format dataset. Code is provided in the next exercise to convert this to our more familiar long format.
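    As a toy illustration of the wide-vs-long distinction (using made-up numbers, not the Zillow data), pivot_longer() turns the month columns into rows:

    library(tidyr)
    
    # Wide: one row per zip code, one column per month
    wide <- data.frame(zip = c("10001", "90210"),
                       `2011-01-31` = c(500000, 900000),
                       `2011-02-28` = c(505000, 910000),
                       check.names = FALSE)
    
    # Long: one row per zip-month combination
    long <- pivot_longer(wide, cols = -zip,
                         names_to = "date", values_to = "housing_index")
    long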

    Exercise 4

    # Set `eval: true` after writing your code
    # Place your code for Exercise 4 in this code block
    library(readr)
    # Load in the macro data
    macro <- read_csv("~/Downloads/FFA/FFA Ass 1/Zillow_SFR_ZIP_2011_2023.csv")
    Rows: 26909 Columns: 159
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr   (7): RegionName, RegionType, StateName, State, City, Metro, CountyName
    dbl (152): RegionID, SizeRank, 2011-01-31, 2011-02-28, 2011-03-31, 2011-04-3...
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

    Part 5: Exploring the macro data (20 points)

    Instructions

    First we will need to reshape the data into a long format – that’s actually quite easy to do using the pivot_longer() command from tidyr, a part of tidyverse which we loaded earlier.

    Next up, we should rename the RegionName variable to something that is easier to understand, such as zip, since it actually contains zip codes (U.S. postal codes).

    Lastly, we’ll need to fix the dates – they load from the file as text in “YYYY-MM-DD” format, where YYYY is the year, MM is the month, and DD is the day. The provided solution parses these into proper dates with lubridate’s ymd(). We’ll cover working with text and dates more in week 8, so this is done for you, for now.

    The below code contains all the above pre-coded for you – just change eval: false to eval: true and run the code. Do take a look at the code and try to understand what exactly it is doing!

    # Set `eval: true` after finishing Part 4
    # No need to change any of the below code -- it is all done for you
    
    # Convert from wide to long format
    macro <- macro %>%
      pivot_longer(cols=c(-RegionID, -SizeRank, -RegionName, -RegionType,
                          -StateName, -State, -City, -Metro, -CountyName),
                   names_to="date", values_to="housing_index")
    
    # rename RegionName to zip
    macro <- macro %>% rename(zip=RegionName)
    
    # Fix dates using lubridate
    library(lubridate)
    macro <- macro %>%
      mutate(date = ymd(date),
        year = year(date),
        month = month(date))

    Now, for your part. First, we only need one observation per year, so just keep data from each January. Then, sort the data by zip and year to make it a bit more organized. Lastly, we need to construct a more useful measure from the housing_index measure. Let’s calculate growth in the housing index (year over year), and call it housing_growth.

    After that, explore the data a bit using the provided ggplot2 code. You will see which states have the most expensive and least expensive locations, as well as the overall trend in the US housing market.
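    In symbols, the growth measure to be constructed in the code below is:

    \[ Housing~Growth_t = \frac{Housing~Index_t - Housing~Index_{t-1}}{Housing~Index_{t-1}} \]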

    Exercise 5

    Filter the data following the instructions, sort the data, and calculate housing index growth14.

  • 14 Make sure to use group_by() when calculating housing index growth. We wouldn’t want to accidentally pull data from other zip codes when calculating growth for a particular zip code.

  • # Set `eval: true` after writing your code
    # Place your code for Exercise 5 in this code block
    
    # Filter to just January data.  Store in macro.
    macro <- macro %>%
      filter(month == 1)
    
    # Sort the data by zip code and year.  Store in macro.
    macro <- macro %>% 
      arrange(zip, year) 
    
    # Calculate housing index growth, name it housing_growth, and add it to macro.
    macro <- macro %>%
      group_by(zip) %>%
      mutate(housing_growth = (housing_index - lag(housing_index))/ lag(housing_index)) %>%
      ungroup()
    summary(macro)
        RegionID         SizeRank         zip             RegionType       
     Min.   : 58196   Min.   :    0   Length:349817      Length:349817     
     1st Qu.: 68946   1st Qu.: 6812   Class :character   Class :character  
     Median : 78973   Median :13716   Mode  :character   Mode  :character  
     Mean   : 80538   Mean   :14393                                        
     3rd Qu.: 88602   3rd Qu.:21376                                        
     Max.   :808739   Max.   :39809                                        
                                                                           
      StateName            State               City              Metro          
     Length:349817      Length:349817      Length:349817      Length:349817     
     Class :character   Class :character   Class :character   Class :character  
     Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                
                                                                                
                                                                                
                                                                                
      CountyName             date            housing_index           year     
     Length:349817      Min.   :2011-01-31   Min.   :     802   Min.   :2011  
     Class :character   1st Qu.:2014-01-31   1st Qu.:  113009   1st Qu.:2014  
     Mode  :character   Median :2017-01-31   Median :  174541   Median :2017  
                        Mean   :2017-01-30   Mean   :  244784   Mean   :2017  
                        3rd Qu.:2020-01-31   3rd Qu.:  286885   3rd Qu.:2020  
                        Max.   :2023-01-31   Max.   :10837078   Max.   :2023  
                                             NA's   :49737                    
         month   housing_growth 
     Min.   :1   Min.   :-0.86  
     1st Qu.:1   1st Qu.: 0.02  
     Median :1   Median : 0.06  
     Mean   :1   Mean   : 0.07  
     3rd Qu.:1   3rd Qu.: 0.10  
     Max.   :1   Max.   :19.65  
                 NA's   :76142  

    Inside the code sections below you will find completed code for three graphs.15 Change eval: false to eval: true in each, and then render the file. Take a look at the graphs, and, in 1 to 2 sentences below each graph, write out 1) if there is anything interesting or surprising that you see in the graph, and 2) what your main takeaway of the graph is.

  • 15 Feel free to explore the code to see how the graphs were made.

  • Interesting/Surprising:

    It is interesting to see that the majority of states have relatively similar median housing prices, except for Florida, Colorado, and California (ranging around USD 7-8 million). Prices in those states may be high because they are desirable locations: Florida offers coastal living, California has Silicon Valley, and Colorado has a limited supply of land, which drives up demand and hence median prices.

    Main Takeaway: Our assumptions may not hold true; we expected New York to have the highest median housing prices given that it is a cosmopolitan, highly populated city and a major tourist destination.

    Interesting/Surprising: Even in a graph filtered to the least expensive areas in America, house prices of around USD 400,000 stand out as considerably high compared to the other states shown.

    Main Takeaway: Geographic location and accessibility could be more important factors in determining house prices, and these could in turn help us predict ROA better than relying on accounting variables alone.

    Interesting/Surprising: The top 1% growth and the bottom 1% growth are correlated, rising and falling together. However, the top 1% growth line fluctuates much more than the bottom 1% growth line. In 2020, housing growth dipped sharply before rising rapidly to a peak in 2022. Based on the average growth line, there is generally an upward trend in housing prices.

    Main Takeaway: We see a general increase in house prices from 2012 onward, and these high prices may well continue and become the norm in the future.
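    The provided graph code is not reproduced here. Purely as an illustration (a sketch, not the provided code), an average-growth-by-year line like the one discussed above could be built along these lines, assuming macro already contains housing_growth:

    # Sketch: average housing index growth by year across all zip codes
    macro %>%
      group_by(year) %>%
      summarize(avg_growth = mean(housing_growth, na.rm = TRUE)) %>%
      ggplot(aes(x = year, y = avg_growth)) +
      geom_line() +
      labs(x = "Year", y = "Average housing index growth")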

    Part 6: Merge the data sources (5 points)

    Instructions

    There are two potential issues with merging this data. First, Compustat does not have a year variable – it has fiscal year (fyear), which may be different from the calendar year, and datadate. To merge the data properly, we need to extract the year from datadate and add it to the data frame as year. Second, zip code was originally imported as character data. Since our macro data only has numeric zip codes, we can just cast that variable to numeric. However, some of these zip codes are in US zip+4 format (XXXXX-XXXX), whereas others are in the standard 5-digit format (XXXXX). There are also zip codes in a Canadian format using letters. First we need to standardize the format, and then cast to numeric. In the process, the Canadian zip codes will turn into NA. Both of these have already been coded for you. Take a look at the code to see how these conversions were done.

    Next, use the left_join() command from dplyr to merge df and macro together. Name the output as df_macro.

    Exercise 6

    # Set `eval: true` after writing your code
    # Place your code for Exercise 6 in this code block
    
    # Add in year based on datadate
    #df$year <- round(df$datadate / 10000, 0)  ### No need to change 
    df$datadate <- dmy(df$datadate) # Convert the "datadate" column to a Date object
    df$year <- year(df$datadate) # creates a "year" column 
    
    # Fix addzip to be standard zip format and numeric
    df$zip <- substr(df$addzip, 1, 5)  # No need to change
    
    # Join the compustat and macro data together into df_macro
    df_macro <- left_join(df, macro, by = c("year", "zip"))
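    A quick optional sanity check (not required) on the merge: since left_join() keeps every row of df, rows that found no matching zip-year in macro will simply have NA for the macro columns.

    # How many firm-year rows picked up a housing_growth observation?
    nrow(df_macro)
    sum(!is.na(df_macro$housing_growth))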

    Part 7: Build a hybrid model (10 points)

    Instructions

    Now we will use the macro data in a linear model to predict future ROA. Include both current ROA and housing_growth as inputs in the model. Do not include any other inputs. Save the model as model3, and then output information about the model using summary. Then, interpret the results of this regression. How does housing_growth seem to affect real estate firm performance? Should it be included in the regression?

    Exercise 7

    # Set `eval: true` after writing your code
    # Place your code for Exercise 7 in this code block
    
    # Create the linear model
    model3 <- lm(roa_lead ~ roa + housing_growth, data=df_macro)
    
    # Output model information
    summary(model3)
    
    Call:
    lm(formula = roa_lead ~ roa + housing_growth, data = df_macro)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.36393 -0.01415 -0.00046  0.01515  0.79483 
    
    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
    (Intercept)     0.014966   0.001400  10.688  < 2e-16 ***
    roa             0.324635   0.021807  14.887  < 2e-16 ***
    housing_growth -0.013114   0.004582  -2.862  0.00426 ** 
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.05296 on 1829 degrees of freedom
      (615 observations deleted due to missingness)
    Multiple R-squared:  0.1137,    Adjusted R-squared:  0.1127 
    F-statistic: 117.3 on 2 and 1829 DF,  p-value: < 2.2e-16

    Explanation: Then, interpret the results of this regression. How does housing_growth seem to affect real estate firm performance? Should it be included in the regression?

    With the addition of housing_growth, the adjusted R-squared is higher, rising from 7.76% to 11.27%. The coefficient on housing_growth is -0.013, which means there is a negative association between housing growth and roa_lead: for every one-unit increase in housing_growth, predicted roa_lead decreases by about 0.013. housing_growth captures the year-over-year change in US housing prices, which the graphs above show have generally been rising. One possible explanation for the negative coefficient is that when prices grow quickly, housing becomes less affordable (especially amid recent inflation and recession concerns), reducing purchasing power and making consumers less inclined to buy properties, which lowers income and therefore next year’s ROA. For this reason, we believe housing_growth should be included. Furthermore, it has a relatively low p-value of 0.00426, which indicates that it is statistically significant in explaining roa_lead.
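    For reference, the fitted hybrid model from the summary above is approximately:

    \[ \widehat{ROA}_{t+1} = 0.0150 + 0.325 \times ROA_t - 0.0131 \times Housing~Growth_t \]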

    Part 8: Model performance (10 points)

    Instructions

    Run the provided code16 to generate a \(\chi^2\) comparison of a model with just ROA and a model with both ROA and housing_growth. Interpret the results of the test and say whether or not the measure should be kept.

  • 16 If you get an error here, make sure you used the names stated in the instructions for your models and variables.

  • # Set `eval: true` after writing your code
    # No need to change any of the below code -- it is all done for you
    
    model4 <- lm(roa_lead ~ roa, data=df_macro[!is.na(df_macro$housing_growth),])
    anova(model4, model3, test="Chisq")
    Analysis of Variance Table
    
    Model 1: roa_lead ~ roa
    Model 2: roa_lead ~ roa + housing_growth
      Res.Df    RSS Df Sum of Sq Pr(>Chi)   
    1   1830 5.1537                         
    2   1829 5.1307  1  0.022979 0.004209 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Explanation: Based on the p-value of 0.004209 in the Pr(>Chi) column, which is below the typical 0.05 (5%) significance level, we reject the null hypothesis that Model 1 explains roa_lead just as well as Model 2. This suggests that Model 2 (roa_lead ~ roa + housing_growth) is statistically better at explaining the variation in roa_lead than Model 1 (roa_lead ~ roa), so housing_growth should be kept in the model.