The problems on this assignment require you to fit a quantile regression model.
Answer all questions specified in each problem, but if you see something interesting and want to do more analysis, please report it as well. Don't forget to include discussion and to make your plots look nice and presentable.
Submit the knitted PDF (or a knitted Word document saved as a PDF). If you are still having trouble with .Rmd, let us know and we will help you, but both the .Rmd file and the PDF are required.
Consider the clouds data from the HSAUR3 package.
As shown by the summary of the model built in chapter 6 of the textbook, rainfall is explained by the seeding variable and the suitability criterion (sne). The coefficient for seedingyes is positive and significant, which means rainfall increases when cloud seeding takes place.
##
## Call:
## lm(formula = clouds_formula, data = clouds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5259 -1.1486 -0.2704 1.0401 4.3913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34624 2.78773 -0.124 0.90306
## seedingyes 15.68293 4.44627 3.527 0.00372 **
## time -0.04497 0.02505 -1.795 0.09590 .
## seedingno:sne 0.41981 0.84453 0.497 0.62742
## seedingyes:sne -2.77738 0.92837 -2.992 0.01040 *
## seedingno:cloudcover 0.38786 0.21786 1.780 0.09839 .
## seedingyes:cloudcover -0.09839 0.11029 -0.892 0.38854
## seedingno:prewetness 4.10834 3.60101 1.141 0.27450
## seedingyes:prewetness 1.55127 2.69287 0.576 0.57441
## seedingno:echomotionstationary 3.15281 1.93253 1.631 0.12677
## seedingyes:echomotionstationary 2.59060 1.81726 1.426 0.17757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.205 on 13 degrees of freedom
## Multiple R-squared: 0.7158, Adjusted R-squared: 0.4972
## F-statistic: 3.274 on 10 and 13 DF, p-value: 0.02431

The plot below shows how rainfall is explained by the suitability criterion when seeding either took place or did not. Notice that the fit is much better when cloud seeding was present than when it was not.
b) Fit a median regression model
To fit a median regression model we use the formula rainfall ~ sne for both values of the \(seeding\) variable, setting \(\tau\) equal to 0.5, the median.
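A minimal sketch of these fits, using rq() from the quantreg package (object names here are my own, not the original code):

```r
library(quantreg)

# median regression (tau = 0.5) of rainfall on the suitability
# criterion, fitted separately for each level of seeding
fit_no  <- rq(rainfall ~ sne, tau = 0.5, data = subset(clouds, seeding == "no"))
fit_yes <- rq(rainfall ~ sne, tau = 0.5, data = subset(clouds, seeding == "yes"))

coef(fit_no)
coef(fit_yes)
```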
c) Compare the two results.
As shown by the plot below, the median regression fit on the no-seeding data has a positive slope, compared to the negative slope of the simple linear regression fit; this reflects the high variability of rainfall when cloud seeding is absent. Additionally, the median regression line is not pulled toward the outliers when seeding is present, and appears to explain the data better overall given the high variability of the rainfall variable.
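A base-graphics sketch of such a comparison plot (the panel layout and line types are my choices, assuming the fits from the previous chunk):

```r
# side-by-side comparison of least squares and median regression fits
layout(matrix(1:2, ncol = 2))
for (s in levels(clouds$seeding)) {
  d <- subset(clouds, seeding == s)
  plot(rainfall ~ sne, data = d, main = paste("Seeding:", s),
       xlab = "S-NE criterion", ylab = "Rainfall")
  abline(lm(rainfall ~ sne, data = d), lty = 1)             # least squares
  abline(rq(rainfall ~ sne, tau = 0.5, data = d), lty = 2)  # median regression
}
layout(1)
```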
Below is a table of mean squared errors (MSE) for the fitted models (an illustrative computation follows the table). As shown, the MSE of the median regression model is lower than that of the linear model when seeding is present, because the median fit is not pulled toward the outliers in the data. When seeding is absent, however, the MSE of the linear model is lower, because the median fit is thrown off by the variability (unpredictability) of rainfall in the absence of seeding.
## linear.model median.regression
## No Seeding 6.407 10.832
## Seeding 11.726 10.145
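One way this table could be computed, assuming the linear fits are rainfall ~ sne on each subset (the mse helper is a hypothetical name):

```r
# hypothetical helper: mean squared error of a fit on a data set
mse <- function(fit, data) mean((data$rainfall - predict(fit, newdata = data))^2)

no_seed  <- subset(clouds, seeding == "no")
yes_seed <- subset(clouds, seeding == "yes")

rbind("No Seeding" = c(linear.model      = mse(lm(rainfall ~ sne, data = no_seed), no_seed),
                       median.regression = mse(fit_no, no_seed)),
      "Seeding"    = c(linear.model      = mse(lm(rainfall ~ sne, data = yes_seed), yes_seed),
                       median.regression = mse(fit_yes, yes_seed)))
```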
Regression Tree With All Variables

First, as in chapter 9 of the book, we will build a regression tree with a minimum split of 10.
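A sketch of such a tree, assuming the bodyfat data from the TH.data package (the data set used in chapter 9) and the rpart package:

```r
library(rpart)
data("bodyfat", package = "TH.data")

# regression tree for body fat (DEXfat) on all covariates,
# requiring at least 10 observations in a node before a split is attempted
bodyfat_rpart <- rpart(DEXfat ~ age + waistcirc + hipcirc +
                         elbowbreadth + kneebreadth,
                       data = bodyfat,
                       control = rpart.control(minsplit = 10))

plot(bodyfat_rpart)
text(bodyfat_rpart)
```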
Regression Tree Pruning
Judging by the pruned tree, the waist circumference and hip circumference splits explain the majority of the data. These are the two variables we will select for quantile regression.
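A sketch of the pruning step, choosing the complexity parameter that minimizes the cross-validated error, as in the chapter 9 code (object names are mine):

```r
# pick the cp value with the smallest cross-validated error (xerror)
opt <- which.min(bodyfat_rpart$cptable[, "xerror"])
cp  <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)

plot(bodyfat_prune)
text(bodyfat_prune)
```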
Fit a Quantile Regression Model With All Variables From the Pruned Tree, and With the First Split on Waist Circumference and Hip Circumference
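A sketch of these fits (the split value of 85 for waist circumference is taken from the discussion below; adjust it to match your tree):

```r
# median regression using the variables retained by the pruned tree
qr_prune <- rq(DEXfat ~ waistcirc + hipcirc, tau = 0.5, data = bodyfat)

# separate median regressions on the two subsets defined by the first split
qr_lo <- rq(DEXfat ~ waistcirc + hipcirc, tau = 0.5,
            data = subset(bodyfat, waistcirc < 85))
qr_hi <- rq(DEXfat ~ waistcirc + hipcirc, tau = 0.5,
            data = subset(bodyfat, waistcirc >= 85))
```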
Plot the Relationship of Body Fat to Age by Waist Circumference

Both median regression lines have a positive slope, but the line for the subset with waist circumference less than 85 increases much faster.
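A base-graphics sketch of this plot (plotting symbols and line types are my choices):

```r
plot(DEXfat ~ age, data = bodyfat,
     pch = ifelse(bodyfat$waistcirc < 85, 1, 19),
     xlab = "Age", ylab = "Body fat (DEXfat)")
abline(rq(DEXfat ~ age, tau = 0.5, data = subset(bodyfat, waistcirc < 85)), lty = 1)
abline(rq(DEXfat ~ age, tau = 0.5, data = subset(bodyfat, waistcirc >= 85)), lty = 2)
legend("topleft", legend = c("waistcirc < 85", "waistcirc >= 85"),
       lty = 1:2, pch = c(1, 19), bty = "n")
```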
Compare the MSE of All Models Built

Interestingly, the MSE of the pruned regression tree is slightly lower than that of the median regression model with all variables. However, the fit on the subset where waist circumference is less than 85 has a lower MSE than the regression tree. Its counterpart, where waist circumference is greater than or equal to the median, has an MSE about twice as high, which suggests additional subsets of the data (further splits in the tree) are needed to explain the variance. A sketch of the computation follows the output below.
## [[1]]
## MSE
## Regression.Tree 14.575
##
## [[2]]
## MSE
## QR.Prune.Vars 15.025
##
## [[3]]
## MSE
## Waistcirc.>=.Median 26.534
## Waistcirc.<.Median 12.444
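One way these numbers could be produced, reusing the fits from the earlier chunks (the mse helper is a hypothetical name):

```r
mse <- function(fit, data) mean((data$DEXfat - predict(fit, newdata = data))^2)

lo <- subset(bodyfat, waistcirc < 85)
hi <- subset(bodyfat, waistcirc >= 85)

mse(bodyfat_prune, bodyfat)  # pruned regression tree
mse(qr_prune, bodyfat)       # median regression, all pruned-tree variables
mse(qr_hi, hi)               # waistcirc >= median subset
mse(qr_lo, lo)               # waistcirc < median subset
```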
The shrinkage factor, lambda, causes the fitted quantile curves to smooth out over the data, rather than being choppy as they are when lambda = 0. This is because the penalty term lambda is applied to the variation in the slope of the fitted curve, which smooths the overall fit of the additive quantile regression model. Additionally, the model allows for a more realistic interpretation of the sparsity of the covariate effects: essentially, it is assumed that small subsets of covariates influence the conditional distribution. I identified this issue in the second problem, where waist circumference was above the median of 85 and the MSE was twice as large.
A simple loop over lambda, on top of the tau loop from the chapter 12 code, makes it easy to plot the curves for varying values of lambda.
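A sketch of such a double loop, assuming the db data (head circumference of Dutch boys) from the gamlss.data package used in the chapter 12 code; the tau and lambda grids are my choices:

```r
library(quantreg)
data("db", package = "gamlss.data")

taus    <- c(0.05, 0.5, 0.95)
lambdas <- c(0, 1, 10)  # lambda = 0 reproduces the rough, unpenalized fit

plot(head ~ age, data = db, col = "grey", pch = ".",
     xlab = "Age (years)", ylab = "Head circumference (cm)")
ord <- order(db$age)
for (tau in taus) {
  for (i in seq_along(lambdas)) {
    # additive quantile regression with a smooth term in age;
    # larger lambda penalizes roughness more heavily
    fit <- rqss(head ~ qss(age, lambda = lambdas[i]), tau = tau, data = db)
    lines(db$age[ord], fitted(fit)[ord], lty = i)
  }
}
legend("bottomright", legend = paste("lambda =", lambdas),
       lty = seq_along(lambdas), bty = "n")
```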
Introduction
The ‘Quantile Regression’ paper by Koenker and Hallock describes quantile regression and its applications. The purpose of quantile regression is to describe a population at different quantiles, indexed by a parameter \(\tau\). It minimizes a sum of asymmetrically weighted absolute residuals, which ensures that a proportion \(\tau\) of the observations lie below the fitted quantile and \(1-\tau\) above it. This contrasts with ordinary linear regression, which estimates the conditional mean by minimizing the sum of squared residuals.
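In symbols (notation mine, following the paper's check-function formulation), the \(\tau\)-th regression quantile solves

\[
\hat{\beta}(\tau) = \operatorname*{arg\,min}_{\beta} \sum_{i=1}^{n} \rho_\tau\bigl(y_i - x_i^\top \beta\bigr),
\qquad
\rho_\tau(u) = u\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
\]

so for \(\tau = 0.5\) the loss is proportional to the absolute residual and the fit is the median regression.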
Motivation
Quantile regression, which can estimate the median (or any other quantile) rather than the mean, can yield more appropriate fits when a distribution contains outliers that would otherwise pull the mean in a particular direction. The paper illustrates this with the ‘Quantile Engel Curves’, where the least squares estimate was affected by two unusual points with high income and low food expenditure. Quantile regression reduces this sensitivity by minimizing absolute rather than squared residuals.
Case study
The case study on infant birth weight examined the relationship between low infant birth weight and several factors, including public policy initiatives. Low birth weight was defined as weighing less than 2,500 grams (about 5 pounds, 8 ounces) at birth. Quantile regression was applicable in this study because least squares estimates of the mean say little about the lower tail of the birth weight distribution, which is precisely the region of interest.
Findings
The first noticeable finding of the study was that boys are generally born heavier than girls, by about 100 grams according to the least squares estimate. However, that estimate shifts across the distribution: at the 0.05 quantile boys were only about 45 grams heavier, while at the 0.95 quantile they were about 130 grams heavier. In this case least squares (linear regression) does a poor job of describing how the effect varies across the distribution.
The second significant finding of the study was the disparity between the birth weights of infants born to black and white mothers: at the 5th percentile it was about a third of a kilogram. Additionally, the mother's age, education beyond high school, prenatal care, marital status, and smoking all contribute to an infant's birth weight. Finally, for almost every covariate the quantile estimates fell outside the confidence intervals of the least squares estimate, which suggests that the covariate effects vary substantially across the distribution and that a single mean estimate can be misleading.
Conclusion
The primary takeaway is that distributions with long tails can pull the mean in such a way that least squares regression gives a misleading representation of the data. Using median (quantile) regression, a more faithful picture of the data and its underlying trends can be captured. This is especially useful in econometrics, where substantial outliers can greatly influence a least squares model. When I think of this in practice, the first thing that comes to mind is the median home price: a few extremely expensive homes can raise the mean price enough that homes in the area appear overvalued.