Assignment C

Diagnostics & Remedial Measures

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Make R code chunks to insert code and type your answer outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.

  3. Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan and submit separately.

  4. The assignment is worth 100 points, and is due on 22nd October 2023 at 11:59 pm.

  5. There is an extra credit question worth 10 points at the end. You can score 110/100 on the assignment.

  6. Five points are for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.). There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
  • Final answers of each question are written clearly (1 pt).
  • The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

Real estate sales

Read the file real_estate_sales.txt.

1.1

Develop a linear regression model to predict the price of a house based on square_feet (the floor area of the house). Check the following model assumptions using diagnostic plots:

  1. Linear relationship

  2. Homoscedasticity

  3. Normal distribution of errors

For each of the above assumptions, comment on whether it appears to be satisfied or violated based on the plots.

(2 + 2 + 2 = 6 points)

library(ggplot2)
# fit the simple linear model and draw R's four standard diagnostic plots
res <- read.table('real_estate_sales.txt', header = TRUE)
model <- lm(price ~ square_feet, data = res)
par(mfrow = c(2, 2))
plot(model)

1- Based on the residuals vs. fitted plot, it appears that a quadratic relationship may be more appropriate than a linear one. Though slight, it appears the line on the plot has a parabolic shape with a distinct minimum and change of slope.

2- Based on the residuals vs. fitted plot, it appears that this data displays heteroscedasticity. For lower fitted values, the residuals have relatively low variance. For higher fitted values, the residuals have greater variance, demonstrating a marked change/difference.

3- Based on the Q-Q residuals plot, it appears that the errors are not normally distributed. The plot also suggests that the residuals are right-skewed.

1.2

Consider performing the statistical test to check the linear relationship assumption. What condition must be satisfied by the data for the test to be performed? Show that the condition is satisfied by this data.

(1 + 2 = 3 points)

For the lack-of-fit test of the linearity assumption to be performed, the data must contain replications: at least one value of the predictor must have multiple observations, so that a pure-error sum of squares can be computed. Equivalently, the number of distinct predictor values \(c\) must be smaller than the number of observations \(n\).

# check for replicates: the number of distinct square_feet values
# must be smaller than the number of observations
nrow(res)                        # n = 522
length(unique(res$square_feet))  # c = 415 < n, so replicates exist

With \(n = 522\) observations but only \(c = 415\) distinct values of square_feet (consistent with the full model's residual degrees of freedom \(n - c = 107\) in C.1.3), the data contain replicates and the condition is satisfied.

1.3

Perform the statistical test to check the linear relationship assumption. What is your conclusion?

(2 points)

# lack-of-fit test: the full model fits a separate mean at each distinct
# square_feet value; the reduced model is the straight-line fit
full_model <- lm(price ~ as.factor(square_feet), data = res)
anova(model, full_model)
Analysis of Variance Table

Model 1: price ~ square_feet
Model 2: price ~ as.factor(square_feet)
  Res.Df        RSS  Df  Sum of Sq      F    Pr(>F)    
1    520 3.2554e+12                                    
2    107 3.3303e+11 413 2.9224e+12 2.2735 5.208e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The resulting p-value is 5.208e-07, far below any conventional significance level. We therefore reject the null hypothesis of linearity and conclude that the relationship between price and square_feet is nonlinear.
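For reference, the F statistic reported in the table is the usual lack-of-fit ratio of mean squares, computed from the two residual sums of squares above:

\(F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F} = \frac{(3.2554\times10^{12} - 3.3303\times10^{11})/413}{3.3303\times10^{11}/107} \approx 2.27\),

which matches the reported F = 2.2735, with the p-value taken from the \(F_{413,\,107}\) distribution.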

1.4

Check assumptions (2) and (3) as mentioned in C.1.1 with statistical tests, and mention your conclusion.

(2 + 2 = 4 points)

First, we check for homoscedasticity with the Breusch-Pagan test:

library(lmtest)
bptest(model)

    studentized Breusch-Pagan test

data:  model
BP = 72.174, df = 1, p-value < 2.2e-16

The resulting p-value is very low (< 2.2e-16), so we conclude that the homoscedasticity assumption is violated: the error variance is not constant.
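For intuition, the studentized Breusch-Pagan statistic (lmtest's default) equals \(nR^2\) from an auxiliary regression of the squared residuals on the predictor. A minimal sketch, which should reproduce BP = 72.174:

# auxiliary regression behind the studentized Breusch-Pagan test
aux <- lm(residuals(model)^2 ~ square_feet, data = res)
nrow(res) * summary(aux)$r.squared  # should match the BP statistic above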

Next, we check for normal distribution of the error terms with the Shapiro-Wilk test:

shapiro.test(model$residuals)

    Shapiro-Wilk normality test

data:  model$residuals
W = 0.90048, p-value < 2.2e-16

The resulting p-value is very low (< 2.2e-16), so we conclude that the model's error terms are not normally distributed.

1.5

Use the Box-Cox procedure to identify the appropriate transformation of the regression model developed in C.1.1. Write the transformed model equation.

(3 + 2 = 5 points)

library(MASS)
boxcox(model, lambda = seq(-0.6, 0, 0.1))

# the profile log-likelihood peaks near lambda = -0.25
boxcox_model <- lm(I(price^(-0.25)) ~ square_feet, data = res)
par(mfrow = c(2, 2))
plot(boxcox_model)

The transformed model equation: \(Y_i^{-0.25} = \beta_0 + \beta_1X_i + \epsilon_i\)
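Rather than reading \(\lambda\) off the plot, the maximizing value can be extracted from the grid that boxcox() evaluates; a small sketch, whose result should be close to the eyeballed \(-0.25\):

# boxcox() returns the lambda grid and the profile log-likelihood;
# take the grid point where the likelihood peaks
bc <- boxcox(model, lambda = seq(-0.6, 0, 0.01), plotit = FALSE)
bc$x[which.max(bc$y)]  # should be near -0.25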

1.6

Check all the 3 model assumptions mentioned in C.1.1 for the transformed model developed in C.1.5. Use both diagnostic plots and statistical tests. Mention your comments based on the plots and conclusions based on the tests.

(2 + 2 + 2 = 6 points)

# linear relationship: lack-of-fit test on the transformed response
full_boxcox_model <- lm(I(price^(-0.25)) ~ as.factor(square_feet), data = res)
anova(boxcox_model, full_boxcox_model)
Analysis of Variance Table

Model 1: I(price^(-0.25)) ~ square_feet
Model 2: I(price^(-0.25)) ~ as.factor(square_feet)
  Res.Df       RSS  Df Sum of Sq      F   Pr(>F)   
1    520 0.0034703                                 
2    107 0.0004943 413  0.002976 1.5599 0.003097 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#homoscedasticity
bptest(boxcox_model)

    studentized Breusch-Pagan test

data:  boxcox_model
BP = 14.817, df = 1, p-value = 0.0001185
#normal distribution of errors
shapiro.test(boxcox_model$residuals)

    Shapiro-Wilk normality test

data:  boxcox_model$residuals
W = 0.98591, p-value = 6.126e-05

Based on the residuals vs. fitted plot, it still appears that a quadratic relationship may be more appropriate; the transformation alone does not produce a linear relationship. The statistical test resulted in a p-value of 0.003097, which is small enough to conclude that the linearity assumption is still not satisfied, though notably this p-value is larger than that of the original model.

Based on the residuals vs. fitted plot, the residuals still appear heteroscedastic. The statistical test resulted in a p-value of 0.0001185, which is small enough to conclude that the homoscedasticity assumption is not satisfied, though again the p-value is larger than that of the original model.

Based on the Q-Q residuals plot, the residuals still appear mildly non-normal, perhaps with a slight left skew. The statistical test resulted in a p-value of 6.126e-05, which is small enough to conclude that the normality assumption is not satisfied, though once more the p-value is larger than that of the original model.

1.7

Is the transformed model (developed in C.1.5) better than the original model (developed in C.1.1) with respect to the model assumptions? Comment based on the tests/plots in the previous question (C.1.6).

(2 points)

Yes, although none of the three assumptions is satisfied by either model (as judged by both the statistical tests and visual inspection of the plots). However, the p-values for the transformed model are all larger than those for the original model, indicating that the transformed model comes closer to fulfilling the assumptions. Neither model is perfect, but the transformed model is likely more useful than the original.

1.8

If the linearity assumption is still not satisfied in the transformed model (developed in C.1.5),

  1. Propose another transformation based on the diagnostic plot(s) plotted in the previous question.

  2. Write the transformed model equation.

  3. Show that the transformed model (based on the proposed transformation in (a)) satisfies the linearity assumption based on the statistical test, if the probability of type I error considered is 1%.

  4. Also make the diagnostic plot for the transformed model (based on the proposed transformation in (a)) to show that it satisfies the linearity assumption.

(2 + 2 + 2 + 2 = 8 points)

quadratic_model <- lm(I(price^(-0.25)) ~ square_feet + I(square_feet^2), data = res)
par(mfrow = c(2, 2))
plot(quadratic_model)

# lack-of-fit test: the full model is still the separate-mean-per-level model
full_quadratic_model <- lm(I(price^(-0.25)) ~ as.factor(square_feet), data = res)
anova(quadratic_model, full_quadratic_model)
Analysis of Variance Table

Model 1: I(price^(-0.25)) ~ square_feet + I(square_feet^2)
Model 2: I(price^(-0.25)) ~ as.factor(square_feet)
  Res.Df        RSS  Df Sum of Sq      F  Pr(>F)  
1    519 0.00306524                               
2    107 0.00049428 412  0.002571 1.3509 0.03078 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Based on the plots in the previous question, the transformed model could benefit from adding a quadratic term in square_feet, i.e., regressing the transformed response on both square_feet and its square.

  2. Transformed Model: \(Y_i^{-0.25}=\beta_0+\beta_1X_i+\beta_2X_i^2\)

  3. The lack-of-fit test for this model results in a p-value of 0.03078, which is greater than 0.01; at a 1% probability of type I error we fail to reject linearity, so the linearity assumption is satisfied.

  4. The residuals vs. fitted diagnostic plot appears improved over the previous models and is consistent with the linearity assumption.

1.9

Does the transformed model developed in the previous question (C.1.8) satisfy the homoscedasticity and normally distributed errors assumptions? Verify based on diagnostic plots and statistical tests.

(2 + 2 = 4 points)

#homoscedasticity
bptest(quadratic_model)

    studentized Breusch-Pagan test

data:  quadratic_model
BP = 18.404, df = 2, p-value = 0.0001008
#normal dist of error
shapiro.test(quadratic_model$residuals)

    Shapiro-Wilk normality test

data:  quadratic_model$residuals
W = 0.99194, p-value = 0.006281

It is difficult to definitively tell whether or not this model satisfies the homoscedasticity assumption based on the residuals vs. fitted plot. The data looks slightly heteroscedastic, but not egregiously so. The statistical test resulted in a p-value of 0.0001008, which is small enough that we can conclude that the assumption is not satisfied.

Based on the Q-Q residuals plot, it appears that the error terms are slightly non-normally distributed, perhaps with a slight left skew. The statistical test resulted in a p-value of 0.006281, which is almost large enough to conclude that the assumption is satisfied, but it is still less than 0.01, so we conclude that it is not satisfied.

1.10

Plot all three models - the original model (developed in C.1.1), the Box-Cox transformed model (developed in C.1.5), and the final model (developed in C.1.8) - over a scatterplot of price vs square_feet. Which model seems to have the best fit? Also report the \(R^2\) of all 3 models.

(1 + 3 + 1 + 3 = 8 points)

# the transformed models' fitted values are on the price^(-0.25) scale,
# so back-transform them with ^(-4) before plotting on the price scale
ggplot(data = res, aes(x = square_feet, y = price)) +
  geom_point() +
  geom_line(aes(y = quadratic_model$fitted.values^(-4), color = "Transformed Model")) +
  geom_line(aes(y = boxcox_model$fitted.values^(-4), color = "Box-Cox Model")) +
  geom_line(aes(y = model$fitted.values, color = "Original Model"))

Original model: \(R^2 = 0.6715\)
Box-Cox model: \(R^2 = 0.6981\)
Transformed (quadratic) model: \(R^2 = 0.7333\)
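These values can be pulled directly from each fit; note that the Box-Cox and quadratic \(R^2\) values are computed on the transformed response scale:

summary(model)$r.squared            # original model
summary(boxcox_model)$r.squared     # Box-Cox model
summary(quadratic_model)$r.squared  # quadratic (final) model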

Once the fitted values of the two transformed models are back-transformed to the price scale, the final (quadratic) transformed model appears to track the data best, which is consistent with it having the highest \(R^2\) of the three.

Mortality vs Income

The dataset infmort.csv gives the infant mortality of different countries in the world. The column mortality contains the infant mortality in deaths per 1000 births.

2.1

Read the dataset. There are 4 observations that have missing values. Remove those observations from the dataset.

Hint: You may use the R function complete.cases().

(2 points)

infmort <- read.csv('infmort.csv')
# keep only the rows with no missing values
infmort <- infmort[complete.cases(infmort), ]
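As a sanity check, the number of incomplete rows can be counted before dropping them; per the problem statement it should be 4:

# complete.cases() is FALSE for any row containing a missing value
sum(!complete.cases(read.csv('infmort.csv')))  # expected: 4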

2.2

Over the scatterplot of mortality against income, plot the regression model predicting mortality based on income. Report the \(R^2\) for this model.

(2 + 1 = 3 points)

model <- lm(mortality ~ income, data = infmort)
# scatterplot of mortality vs income with the fitted regression line overlaid
ggplot(data = infmort, aes(x = income, y = mortality)) +
  geom_point() +
  geom_smooth(method = "lm")

summary(model)

Call:
lm(formula = mortality ~ income, data = infmort)

Residuals:
   Min     1Q Median     3Q    Max 
-90.25 -45.84 -23.47  22.65 571.57 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 110.421086  10.538807  10.478  < 2e-16 ***
income       -0.020907   0.005999  -3.485 0.000735 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86.13 on 99 degrees of freedom
Multiple R-squared:  0.1093,    Adjusted R-squared:  0.1003 
F-statistic: 12.14 on 1 and 99 DF,  p-value: 0.0007354

The \(R^2\) for this model is 0.1093.

2.3

Are there outlying observations in the data with respect to the model developed in the previous question (C.2.2)? How many? Consider observations having a magnitude of standardized residual more than 5 as outliers.

(2 points)

par(mfrow=c(2,2))
plot(model)

According to the diagnostic plots, only one observation has a standardized residual with magnitude greater than 5, so there is one outlier.
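As a cross-check, the count can be computed directly with rstandard() rather than read off the plot; this should confirm the single outlier:

# standardized residuals with magnitude greater than 5
sum(abs(rstandard(model)) > 5)  # expected: 1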

2.4

Based on the plot in C.2.2, propose a transformation for income. Justify the proposed transformation. Write the equation of the transformed model and report its \(R^2\).

(2 + 2 + 1 + 1 = 6 points)

log_model <- lm(mortality~income+log(income), data=infmort)
par(mfrow=c(2,2))
plot(log_model)

summary(log_model)

Call:
lm(formula = mortality ~ income + log(income), data = infmort)

Residuals:
   Min     1Q Median     3Q    Max 
-88.87 -35.29 -18.42   7.08 600.61 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 281.763823  70.128714   4.018 0.000115 ***
income        0.006321   0.012480   0.507 0.613632    
log(income) -33.007259  13.363652  -2.470 0.015242 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 83.99 on 98 degrees of freedom
Multiple R-squared:  0.1615,    Adjusted R-squared:  0.1443 
F-statistic: 9.435 on 2 and 98 DF,  p-value: 0.0001789

A log transformation of income was chosen based on the shape of the scatterplot: mortality drops sharply at low incomes and then flattens, resembling the curve \(y=\log(1/x) = -\log(x)\), so a model that includes \(\log(\text{income})\) should fit the data better than a straight line.

Transformed Model: \(Y_i = \beta_0 + \beta_1X_i + \beta_2\log(X_i)\)

However, this model has an \(R^2\) of 0.1615, which is only marginally better than the original model.

2.5

Plot the transformed model developed in the previous question over a scatterplot of mortality vs the transformed income (as transformed in the previous question (C.2.4)). Did the model fit of the transformed model (developed in C.2.4) improve over the fit of the original model (developed in C.2.2)?

(2 + 1 = 3 points)

ggplot(data=infmort, aes(x=log(income), y=mortality)) + geom_point() + geom_line(aes(y = model$fitted.values, color = "Original Model")) + geom_line(aes(y = log_model$fitted.values, color = "Log linear model"))

The fit did seem to improve with the transformed model, but the \(R^2\) values are only marginally different.

2.6

Use Box-Cox to identify the appropriate transformation for mortality in the transformed model (developed in C.2.4). Write the equation of the Box-Cox transformed model, and report its \(R^2\).

(3 points)

boxcox(log_model, lambda = seq(-0.6, 0, 0.1))

# the profile log-likelihood peaks near lambda = -0.2
boxcox_model <- lm(I(mortality^(-0.2)) ~ income + log(income), data = infmort)
par(mfrow = c(2, 2))
plot(boxcox_model)

summary(boxcox_model)

Call:
lm(formula = I(mortality^(-0.2)) ~ income + log(income), data = infmort)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.235983 -0.029812 -0.000236  0.037583  0.174665 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.969e-01  5.030e-02   3.915 0.000167 ***
income      7.712e-06  8.951e-06   0.862 0.391000    
log(income) 4.106e-02  9.585e-03   4.284  4.3e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06024 on 98 degrees of freedom
Multiple R-squared:  0.5432,    Adjusted R-squared:  0.5339 
F-statistic: 58.27 on 2 and 98 DF,  p-value: < 2.2e-16

Box-Cox transformed model: \(Y_i^{-0.2} = \beta_0 + \beta_1X_i + \beta_2\log(X_i)\)

\(R^2\): 0.5432

2.7

Plot the Box-Cox transformed model developed in the previous question (C.2.6) over a scatterplot of the transformed mortality vs the transformed income. Did the fit of the Box-Cox transformed model improve over the fit of the transformed model (developed in C.2.4)?

(2 + 1 = 3 points)

ggplot(data=infmort, aes(x=log(income), y=mortality^(-.2))) + geom_point() + geom_line(aes(y = boxcox_model$fitted.values, color = "Box-Cox Transformed Model"))

The Box-Cox transformed model is definitely an improvement from the fit of the transformed model.

2.8

Plot the Box-Cox transformed model (developed in C.2.6) over the scatterplot of mortality vs income.

(3 points)

# boxcox_model's fitted values are on the mortality^(-0.2) scale,
# so back-transform them with ^(-5) before plotting on the mortality scale
ggplot(data = infmort, aes(x = income, y = mortality)) +
  geom_point() +
  geom_line(aes(y = boxcox_model$fitted.values^(-5), color = "Box-Cox Transformed Model"))

2.9

Are there outlying observations in the data with respect to the Box-Cox transformed model (developed in C.2.6)? If not, then how did they disappear given that there were outliers in the original model (developed in C.2.2)?

(1 + 3 = 4 points)

par(mfrow = c(2, 2))
plot(boxcox_model)

Using the outlier definition from C.2.3, there are no longer any outlying observations. The standardized residual is the residual divided by its estimated standard deviation, \(r_i = e_i/(s\sqrt{1-h_{ii}})\), where \(s\) is the residual standard error and \(h_{ii}\) is the leverage. The Box-Cox transformation compresses the long right tail of mortality and yields a much better fit, so the formerly extreme residuals are no longer large relative to the residual standard error, and no observation exceeds the cutoff of 5.
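A quick check of this claim (the exact values are not shown here, but per the diagnostic plots the first should exceed 5 and the second should not):

# largest standardized residual before and after the transformation
max(abs(rstandard(model)))         # original model: above the cutoff of 5
max(abs(rstandard(boxcox_model)))  # Box-Cox model: below the cutoff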

3

Suppose the error terms in the following linear regression model are independent \(N(0, \sigma^2)\):

\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i, \epsilon_i \sim N(0, \sigma^2)\).

3.1

If \(X_i\) is transformed to \(X_i'=1/X_i\), then will the error terms still be independent \(N(0, \sigma^2)\)? If not, then what will be the change in their distribution?

(4 points)

The error terms will still be independent \(N(0, \sigma^2)\) if \(X_i\) is transformed to \(X_i'=1/X_i\). Transforming only the predictor changes the shape of the regression function; the error term remains additive and untouched, so its distribution is unaffected.

3.2

Instead of \(X_i\), if \(Y_i\) is transformed to \(Y_i'=1/Y_i\), then will the error terms still be independent \(N(0, \sigma^2)\)? If not, then what will be the change in their distribution?

(4 points)

Transforming \(Y_i\) to \(Y_i'=1/Y_i\) does change the distribution of the error terms. They remain independent across observations (each transformed error depends only on its own \(\epsilon_i\)), but they are no longer \(N(0, \sigma^2)\): they become heteroscedastic, with variance approximately \(\sigma^2/\mu_i^4\) (i.e., proportional to the fourth power of the transformed mean response), and are only approximately normal.
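A first-order Taylor expansion makes the change explicit. Writing \(Y_i = \mu_i + \epsilon_i\) with \(\mu_i = \beta_0 + \beta_1X_i\),

\(Y_i' = \frac{1}{\mu_i + \epsilon_i} \approx \frac{1}{\mu_i} - \frac{\epsilon_i}{\mu_i^2}\),

so the error of the transformed response is approximately \(-\epsilon_i/\mu_i^2\): still independent and mean-zero, but with variance \(\sigma^2/\mu_i^4\), which varies with the mean response. Since \(1/\mu_i\) is the approximate mean of \(Y_i'\), this variance is proportional to the fourth power of the transformed mean.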

4

A simple linear regression model with intercept \(\beta_0 = 0\) is under consideration. Data have been obtained that contain replications.

State the full and reduced models for testing the appropriateness of the regression function under consideration.

What are the degrees of freedom associated with the full and reduced model if number of observations \(n=20\) and number of distinct values of the predictor \(c=10\)?

(2 + 2 = 4 points)

Full model: \(Y_{ij}=\mu_j+\epsilon_{ij}\), where \(\mu_j\) is the mean response at the \(j^{th}\) distinct level of the predictor.

Reduced model: \(Y_{ij} = \beta_1X_j + \epsilon_{ij}\) (the intercept is \(\beta_0 = 0\), so only the slope is estimated).

Full model degrees of freedom \(=n-c=20-10=10\). Reduced model degrees of freedom \(=n-1=20-1=19\), since the regression through the origin estimates a single parameter.

5

Let the observed value of the response variable for the \(i^{th}\) replicate of the \(j^{th}\) level of the predictor \(X\) be \(Y_{ij}\), where \(i=1,...,n_j\), \(j= 1,...,c\).

The fitted regression model is:

\(\hat{Y}_{ij} = \hat{\beta}_0+\hat{\beta}_1X_j\)

The error deviation can be decomposed as the pure error deviation, and the lack of fit deviation as follows:

\(Y_{ij}-\hat{Y}_{ij} = (Y_{ij}-\bar{Y}_j) + (\bar{Y}_j-\hat{Y}_{ij})\)

Given the above equation, show that:

\(\sum_{j=1}^c\sum_{i=1}^{n_j}(Y_{ij}-\hat{Y}_{ij})^2 = \sum_{j=1}^c\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_j)^2 + \sum_{j=1}^c\sum_{i=1}^{n_j}(\bar{Y}_j-\hat{Y}_{ij})^2\).

(6 points)

See attached paper.

Bonus question

(This is an extra credit question; you are not required to do it.)

For \(\lambda \ne 0\), the Box-Cox transformation is given as:

\(y_i^{(\lambda)}=\frac{y_i^\lambda-1}{\lambda} \quad (1)\)

However, typically practitioners transform the response as:

\(y_i^{(\lambda)}=y_i^\lambda \quad (2)\).

Why is (2) acceptable in general even when (1) is the transformation proposed by George Box and David Cox? In which cases would it be detrimental to use (2) instead of (1), so that one must use (1)?

Support your arguments with examples based on simulated data to answer this question. You must provide an example when both (1) and (2) are acceptable, and another example when (2) is not effective, and only (1) must be used.

(10 points)