For this question, we will revisit and expand on the housing data from Buenos Aires that you have been using in section. Once again, please imagine that you are a policy analyst for Asociación Civil por la Igualdad y la Justicia, an advocacy organization in Argentina. You specialize in housing policy and are preparing a report about housing affordability in Buenos Aires. You downloaded data from the Buenos Aires municipal government (Source: Gobierno de la Ciudad de Buenos Aires).

Your housing universe is visualized in the map below (making these maps is not required, but if you would like to, please revisit the materials from section on 10/21):

Setup

Question 1a

First, please fit an OLS model with apartment price (preciousd) as your outcome and apartment size in square meters (m2total) as your single predictor. Using R, calculate the predicted value according to this model for an apartment with a size of 112 square meters.

fit1 <- lm(preciousd ~ m2total, data = arg)
ft1 <- tidy(fit1)

pred_value <- ft1$estimate[1] + (ft1$estimate[2] * arg$m2total[11])  # intercept + slope * 112
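
As a quick cross-check (not required), the same prediction can be obtained with the predict() function, which we will rely on later in this problem set. A minimal sketch, assuming the tidyverse is loaded as in Setup:

# predict() plugs new data into the fitted model for us
predict(fit1, newdata = tibble(m2total = 112))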

Question 1b

Coincidentally, the eleventh (11th) apartment in the dataset has a value of 112 for m2total. Given your prediction above, manually calculate the value of the residual. Then, use the resid() function to get the vector of residuals in the data. Inspect the eleventh one (corresponding to the eleventh row in the data), and verify that it is the same as your calculated value.

arg %>% 
  slice(11) %>% 
  select(m2total, preciousd)
arg$preciousd[11] - pred_value
## [1] 268355.1
resid(fit1)[11]
##       11 
## 268355.1
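
If you prefer a programmatic check to eyeballing the printed values, all.equal() compares the two quantities up to floating-point tolerance. A small sketch using the objects created above:

all.equal(unname(resid(fit1)[11]),         # drop the "11" name so only the values are compared
          arg$preciousd[11] - pred_value)  # the manual residual from above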

Question 1c

You make a critical error and send your boss your R script above without any accompanying interpretation or context. Your boss, understandably, is frightened by R output — though their interest is piqued by the fact that your manual calculation returns the same number as what seems to be a built-in resid() command.

They are very skeptical of the fact that you seem to have used an apartment in the dataset to make a prediction – they ask, if you already know what the true answer is, why is the residual for a row within your dataset of any interest? Please explain how you generated a predicted value for a given apartment and answer their question about residuals with as little technical language as possible.

Basically, we are building a model that tries to uncover patterns in the existing data. Our results show that larger apartments tend to be more expensive, and our model quantifies how much more expensive we would expect an apartment to be for a given increase in size. We can then use our model estimates to predict how expensive any particular apartment should be, given its size.

We are interested in residuals as a way of investigating how well our model explains the data we already have. We could also use the model to generate out-of-sample predictions, which would be a way of evaluating its predictive accuracy on apartments it has never seen. Both can be important and of interest.
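
For illustration, here is a minimal sketch of what such an out-of-sample check could look like, assuming the arg data are loaded as in Setup; the 80/20 split and the seed are arbitrary choices for this example:

set.seed(123)                                    # arbitrary seed for reproducibility
train_rows <- sample(nrow(arg), size = floor(0.8 * nrow(arg)))
train <- arg[train_rows, ]                       # 80% of rows used to fit the model
test  <- arg[-train_rows, ]                      # held-out 20% used only for prediction
fit_train <- lm(preciousd ~ m2total, data = train)
oos_preds <- predict(fit_train, newdata = test)  # out-of-sample predictions
sqrt(mean((test$preciousd - oos_preds)^2))       # root mean squared prediction error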

Question 1d

Inspecting residuals is an important part of prediction. Please replicate the plot below as closely as possible.

Hint: recall from section that you can access a vector of residuals from a model object called reg with resid(reg).

Hint #2: two other functions you might find helpful are scale_y_continuous(labels = scales::label_dollar(prefix = "$")) and geom_hline(). Don’t forget to read function documentation to know what arguments to use.

tibble(resid = resid(fit1),
       m2total = arg$m2total) %>% 
  ggplot(aes(x = m2total, y = resid)) + 
    geom_point() + 
    theme_bw() +
    scale_y_continuous(labels = scales::label_dollar(prefix = "$")) + 
    geom_hline(yintercept = 0, col = "red", lty = "dashed") + 
    labs(x = "Apartment Square Meters", y = "Model Residual")

Question 1e

Please revisit the five assumptions for OLS from Handouts 16 and 18. In light of those assumptions, please interpret the plot above. These assumptions concern patterns in the population, which we cannot observe, but an important step in fitting statistical models is diagnosing potential violations using the sample we do observe. Given the evidence in your sample, does this plot seem consistent with these five assumptions? If not, which ones? Why? Be specific.

Note: please do not critique the simple bivariate model itself, which is intentionally simplistic. Instead, evaluate our five OLS assumptions in light of the evidence in the residual plot above.

Assumptions MLR4 (zero conditional mean) and MLR5 (homoskedasticity) seem to be violated here. Conditional on our single X, the plotted residuals do not appear to have a mean of zero (there is a downward trend), and the variance does not appear to be constant (it grows as X increases).
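
If you want a numerical companion to this visual diagnosis, one common option is a Breusch-Pagan test for non-constant error variance. A sketch, assuming the lmtest package is installed (it is not part of the Setup above):

library(lmtest)   # assumed to be installed separately
bptest(fit1)      # a small p-value is evidence of heteroskedastic residuals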

Question 1f

You may have noticed that some of our problems in the previous question may come from the scale of our data – some apartments are very expensive, causing skew in our data. When fitting regressions and making predictions, it is common to transform variables to make them closer to normal. Below, I have plotted histograms of the baseline preciousd and m2total variables for you, and you can see both are skewed.

In this case, skewed variables (those with long tails to the right or left) are often log transformed before we include them in an OLS model. Often, this helps to resolve the unsavory patterns we saw in the residuals. Please do the following:

  1. Take the log of preciousd and m2total, and save them to new variables in your dataset.
  2. Create histograms similar to those presented below. Confirm visually that your new variables look more like a normal distribution than before, and that the skew problem has been mostly solved.
###################################
## Skewed
p1 <- ggplot(arg) +
  geom_histogram(aes(x = preciousd, 
                     y = stat(width*density)), binwidth = 10000) + 
    theme_bw() + 
    scale_x_continuous(labels = scales::label_dollar(prefix = "$")) + 
    labs(y = "Proportion")

p2 <- ggplot(arg) +
  geom_histogram(aes(x = m2total, 
                     y = stat(width*density)), binwidth = 10) + 
    theme_bw() + 
    labs(y = "Proportion")

p1 + p2

arg$logm2 <- log(arg$m2total)
arg$logprice <- log(arg$preciousd)
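
Equivalently, the same transformation in the pipe-and-mutate style used elsewhere in this document:

arg <- arg %>% 
  mutate(logm2    = log(m2total),     # log of apartment size
         logprice = log(preciousd))   # log of price in USD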

p1 <- ggplot(arg) +
  geom_histogram(aes(x = logm2, 
                     y = stat(width*density)), bins = 30) + 
    theme_bw() + 
    labs(y = "Proportion")

p2 <- ggplot(arg) +
  geom_histogram(aes(x = logprice, 
                     y = stat(width*density)), bins = 30) + 
    theme_bw() + 
    labs(y = "Proportion")

p1 + p2

Question 1g

Now, please fit a new bivariate regression between your logged m2total variable and your logged preciousd variable. Recreate your residual plot from Question 1d for this new model. Please explain any patterns that you see. Are you more, less, or just as confident in the model assumptions now? Please explain any lingering issues.

###################################
fit2 <- lm(logprice ~ logm2, data = arg)

tibble(resid = resid(fit2),
       m2 = arg$logm2) %>% 
  ggplot(aes(x = m2, y = resid)) + 
    geom_point() + 
    theme_bw() + 
    geom_hline(yintercept = 0, lty = "dashed", col = "red")

The residuals look much more randomly distributed, but there are still some noticeable outliers.
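
If you want to see which apartments drive those outliers, one quick sketch is to sort the rows by the absolute size of their residuals (this assumes no rows were dropped for missing values when fitting fit2, so the residual vector lines up with arg):

arg %>% 
  mutate(resid2 = resid(fit2)) %>%        # residuals from the logged model
  arrange(desc(abs(resid2))) %>%          # largest residuals first
  select(preciousd, m2total, resid2) %>% 
  slice_head(n = 5)                       # the five most extreme apartments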

Question 1h

Now that we have thoroughly investigated our data and transformed some variables, we will predict. Please fit a regression model with apartment price in USD (preciousd) as your outcome and two predictors: your logged meters-squared variable (logm2) and building age (antig in your data).

So far in this course, we have calculated predicted values manually. Now, we will use the predict() function to make many predictions at once. Please read this short tutorial on the predict() function.

Then, use the predict() function to generate the predicted value for a “typical” apartment – a single apartment that has the median value of both of your predictors.

fit3 <- lm(preciousd ~ logm2 + antig, data = arg)

new_df <- tibble(antig = median(arg$antig),
                 logm2 = median(arg$logm2))

predict(fit3, newdata = new_df)
##        1 
## 247321.6
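
predict() can also return interval estimates around this point prediction; for example, using the same fit3 and new_df objects as above:

predict(fit3, newdata = new_df, interval = "prediction", level = 0.95)  # point estimate plus a 95% prediction interval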

Question 1i

When you have multiple variables, you have heard the intuition in lecture that each coefficient represents the relationship between one variable and the outcome, “holding all else constant.” Let’s explore that intuition.

A nice feature of predict() is that we can easily generate many predictions at once. Please create a new tibble of apartments at 1,000 equally sized steps between a value of 3.5 and the maximum observed value of logm2, where every apartment shares the median value of antig. That is, the first row of your new dataset will have a value of 3.5 in the logm2 column and the median value of antig in the antig column. The next row will have the same value of antig but a slightly larger value of logm2, and so on until you reach the maximum observed value of logm2.

Finally, use predict() on that new dataset to generate predicted values for each fictitious apartment.

Hint: don’t forget the seq() function from math camp. How do you create a sequence of a certain length? Never, ever forget to read the documentation…

new_df <- tibble(logm2 = seq(3.5, max(arg$logm2), length.out = 1000),
                 antig = median(arg$antig))

preds <- predict(fit3, newdata = new_df)
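
As a quick check on the “holding all else constant” intuition, the gap between any two adjacent predictions should equal the logm2 coefficient times the step size, since antig never changes across rows. A sketch using the objects created above:

step <- new_df$logm2[2] - new_df$logm2[1]   # size of one step in logm2
preds[2] - preds[1]                         # change in the prediction across one step
coef(fit3)["logm2"] * step                  # slope * step; should match the line above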

Question 1j

Visualize your predictions on a scatterplot, with the logm2 values from the new dataset created in the previous question on the x-axis and your predicted values on the y-axis. Please explain what you see. How would you explain these predictions to your boss, and what is one limitation of them based on the assumptions in the previous question?

tibble(logm2 = new_df$logm2,
       preds = preds) %>% 
  ggplot(aes(x = logm2, y = preds)) + 
    geom_point() + 
    theme_bw() +
    scale_y_continuous(labels = scales::label_dollar(prefix = "$")) + 
    labs(x = "Log(Sq. Meters)",
         y = "Predicted Cost",
         title = "Predicted Cost, Building Age Held at Median")

The predictions fall on a straight line, as our model implies. The primary limitation is that we had to hold building age at its median value, since it is easier to visualize two dimensions. You might expect lower apartment costs in older buildings, but this graph only visualizes the relationship between one variable and our outcome, holding the other constant.
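
One way to probe that limitation is to repeat the prediction exercise at a different building age and compare. Here is a sketch that holds antig at its 25th and 75th percentiles instead of the median; the quantile cutoffs are an arbitrary choice for illustration:

new_df_q <- tibble(logm2 = rep(seq(3.5, max(arg$logm2), length.out = 1000), times = 2),
                   antig = rep(quantile(arg$antig, c(0.25, 0.75)), each = 1000))
new_df_q$preds <- predict(fit3, newdata = new_df_q)   # predictions along two building-age "slices"

ggplot(new_df_q, aes(x = logm2, y = preds, color = factor(antig))) + 
  geom_line() + 
  theme_bw() + 
  labs(x = "Log(Sq. Meters)", y = "Predicted Cost", color = "Building Age")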

Question 1k

Please go to the following URL, where we have prepared a 3D visualization for you with log meters-squared, building age, and predicted apartment prices shown all at once. This plot conducts the same predictive exercise as in the previous question, but for all values of both variables in your model. Please interpret your results from the previous question in terms of this visualization. Where do your predicted values fall on this graph, and what do they represent in terms of this 3D visualization?

Our predicted values represent a single line on this graph -- a slice of the regression plane where building age is held constant at its median. Wow! Cool! :)
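
If you wanted to rebuild something like that surface yourself, one approach is to predict over a full grid of both variables. A minimal sketch, assuming tidyr's expand_grid() is available via the tidyverse:

grid <- expand_grid(logm2 = seq(min(arg$logm2), max(arg$logm2), length.out = 50),
                    antig = seq(min(arg$antig), max(arg$antig), length.out = 50))
grid$preds <- predict(fit3, newdata = grid)   # one prediction per grid point: the full plane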