The problems on this assignment require you to fit a quantile regression model.
Answer all questions specified in each problem, but if you see something interesting and want to do more analysis, please report it as well. Don't forget to include discussion and to make your plots look nice and presentable.
Submit the knitted PDF (or a knitted Word document saved as a PDF). If you are still having trouble with .Rmd, let us know and we will help you, but both the .Rmd file and the PDF are required.
Consider the clouds data from the HSAUR3 package.
As shown by the summary of the model built in chapter 6 of the textbook, rainfall is explained by the seeding variable and the suitability criterion (sne). The coefficient for seedingyes is positive and significant, which means rainfall increases when cloud seeding takes place.
##
## Call:
## lm(formula = clouds_formula, data = clouds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5259 -1.1486 -0.2704 1.0401 4.3913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34624 2.78773 -0.124 0.90306
## seedingyes 15.68293 4.44627 3.527 0.00372 **
## time -0.04497 0.02505 -1.795 0.09590 .
## seedingno:sne 0.41981 0.84453 0.497 0.62742
## seedingyes:sne -2.77738 0.92837 -2.992 0.01040 *
## seedingno:cloudcover 0.38786 0.21786 1.780 0.09839 .
## seedingyes:cloudcover -0.09839 0.11029 -0.892 0.38854
## seedingno:prewetness 4.10834 3.60101 1.141 0.27450
## seedingyes:prewetness 1.55127 2.69287 0.576 0.57441
## seedingno:echomotionstationary 3.15281 1.93253 1.631 0.12677
## seedingyes:echomotionstationary 2.59060 1.81726 1.426 0.17757
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.205 on 13 degrees of freedom
## Multiple R-squared: 0.7158, Adjusted R-squared: 0.4972
## F-statistic: 3.274 on 10 and 13 DF, p-value: 0.02431

The plot below shows how rainfall is explained by the suitability criterion when seeding either took place or did not. Notice that the fit is much better when cloud seeding was present than when it was not.
b) Fit a median regression model
To fit a median regression model we use the formula rainfall ~ sne for both values of the \(seeding\) variable, setting \(\tau\) equal to 0.5, the median.
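A minimal sketch of these fits, using rq() from the quantreg package (object names here are my own, not the original code):

```r
library(quantreg)

# median regression (tau = 0.5) of rainfall on the suitability
# criterion, fitted separately for each level of seeding
fit_no  <- rq(rainfall ~ sne, tau = 0.5, data = subset(clouds, seeding == "no"))
fit_yes <- rq(rainfall ~ sne, tau = 0.5, data = subset(clouds, seeding == "yes"))

coef(fit_no)
coef(fit_yes)
```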
c) Compare the two results.
As shown by the plot below, the median regression fit on the no-seeding data has a positive slope, compared to the negative slope of the simple linear regression fit; this reflects the high variability of rainfall when cloud seeding is absent. Additionally, the median regression line is not pulled toward the outliers when seeding is present, and appears to explain the data better overall given the high variability of the rainfall variable.
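A base-graphics sketch of such a comparison plot (the panel layout and line types are my choices, assuming the fits from the previous chunk):

```r
# side-by-side comparison of least squares and median regression fits
layout(matrix(1:2, ncol = 2))
for (s in levels(clouds$seeding)) {
  d <- subset(clouds, seeding == s)
  plot(rainfall ~ sne, data = d, main = paste("Seeding:", s),
       xlab = "S-NE criterion", ylab = "Rainfall")
  abline(lm(rainfall ~ sne, data = d), lty = 1)             # least squares
  abline(rq(rainfall ~ sne, tau = 0.5, data = d), lty = 2)  # median regression
}
layout(1)
```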
Below is a table of mean squared errors (MSE) for the fitted models (an illustrative computation follows the table). As shown, the MSE of the median regression model is lower than that of the linear model when seeding is present, because the median fit is not pulled toward the outliers in the data. When seeding is absent, however, the MSE of the linear model is lower, because the median fit is thrown off by the variability (unpredictability) of rainfall in the absence of seeding.
## linear.model median.regression
## No Seeding 6.407 10.832
## Seeding 11.726 10.145
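One way this table could be computed, assuming the linear fits are rainfall ~ sne on each subset (the mse helper is a hypothetical name):

```r
# hypothetical helper: mean squared error of a fit on a data set
mse <- function(fit, data) mean((data$rainfall - predict(fit, newdata = data))^2)

no_seed  <- subset(clouds, seeding == "no")
yes_seed <- subset(clouds, seeding == "yes")

rbind("No Seeding" = c(linear.model      = mse(lm(rainfall ~ sne, data = no_seed), no_seed),
                       median.regression = mse(fit_no, no_seed)),
      "Seeding"    = c(linear.model      = mse(lm(rainfall ~ sne, data = yes_seed), yes_seed),
                       median.regression = mse(fit_yes, yes_seed)))
```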
Regression Tree With All Variables

First, as in chapter 9 of the book, we will build a regression tree with a minimum split of 10.
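A sketch of such a tree, assuming the bodyfat data from the TH.data package (the data set used in chapter 9) and the rpart package:

```r
library(rpart)
data("bodyfat", package = "TH.data")

# regression tree for body fat (DEXfat) on all covariates,
# requiring at least 10 observations in a node before a split is attempted
bodyfat_rpart <- rpart(DEXfat ~ age + waistcirc + hipcirc +
                         elbowbreadth + kneebreadth,
                       data = bodyfat,
                       control = rpart.control(minsplit = 10))

plot(bodyfat_rpart)
text(bodyfat_rpart)
```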
Regression Tree Pruning
Judging by the pruned tree, the waist circumference and hip circumference splits explain the majority of the data. These are the two variables we will select for quantile regression.
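A sketch of the pruning step, choosing the complexity parameter that minimizes the cross-validated error, as in the chapter 9 code (object names are mine):

```r
# pick the cp value with the smallest cross-validated error (xerror)
opt <- which.min(bodyfat_rpart$cptable[, "xerror"])
cp  <- bodyfat_rpart$cptable[opt, "CP"]
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)

plot(bodyfat_prune)
text(bodyfat_prune)
```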
Fit a Quantile Regression Model With All Variables From the Pruned Tree, and With the First Split on Waist Circumference and Hip Circumference
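A sketch of these fits (the split value of 85 for waist circumference is taken from the discussion below; adjust it to match your tree):

```r
# median regression using the variables retained by the pruned tree
qr_prune <- rq(DEXfat ~ waistcirc + hipcirc, tau = 0.5, data = bodyfat)

# separate median regressions on the two subsets defined by the first split
qr_lo <- rq(DEXfat ~ waistcirc + hipcirc, tau = 0.5,
            data = subset(bodyfat, waistcirc < 85))
qr_hi <- rq(DEXfat ~ waistcirc + hipcirc, tau = 0.5,
            data = subset(bodyfat, waistcirc >= 85))
```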
Plot the Relationship of Body Fat to Age by Waist Circumference

Both median regression lines have a positive slope, but the line for the subset with waist circumference less than 85 increases much faster.
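A base-graphics sketch of this plot (plotting symbols and line types are my choices):

```r
plot(DEXfat ~ age, data = bodyfat,
     pch = ifelse(bodyfat$waistcirc < 85, 1, 19),
     xlab = "Age", ylab = "Body fat (DEXfat)")
abline(rq(DEXfat ~ age, tau = 0.5, data = subset(bodyfat, waistcirc < 85)), lty = 1)
abline(rq(DEXfat ~ age, tau = 0.5, data = subset(bodyfat, waistcirc >= 85)), lty = 2)
legend("topleft", legend = c("waistcirc < 85", "waistcirc >= 85"),
       lty = 1:2, pch = c(1, 19), bty = "n")
```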
Compare the MSE of All Models Built

Interestingly, the MSE of the pruned regression tree is slightly lower than that of the median regression model with all variables. However, the fit on the subset where waist circumference is less than 85 has a lower MSE than the regression tree. Its counterpart, where waist circumference is greater than or equal to the median, has an MSE about twice as high, which suggests additional subsets of the data (further splits in the tree) are needed to explain the variance. A sketch of the computation follows the output below.
## [[1]]
## MSE
## Regression.Tree 14.575
##
## [[2]]
## MSE
## QR.Prune.Vars 15.025
##
## [[3]]
## MSE
## Waistcirc.>=.Median 26.534
## Waistcirc.<.Median 12.444
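One way these numbers could be produced, reusing the fits from the earlier chunks (the mse helper is a hypothetical name):

```r
mse <- function(fit, data) mean((data$DEXfat - predict(fit, newdata = data))^2)

lo <- subset(bodyfat, waistcirc < 85)
hi <- subset(bodyfat, waistcirc >= 85)

mse(bodyfat_prune, bodyfat)  # pruned regression tree
mse(qr_prune, bodyfat)       # median regression, all pruned-tree variables
mse(qr_hi, hi)               # waistcirc >= median subset
mse(qr_lo, lo)               # waistcirc < median subset
```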
The shrinkage factor, lambda, causes the fitted quantile curves to smooth out over the data, rather than being choppy as they are when lambda = 0. This is because the penalty term lambda is applied to the variation in the slope of the fitted curve, which smooths the overall fit of the additive quantile regression model. Additionally, the model allows for a more realistic interpretation of the sparsity of the covariate effects: essentially, it is assumed that small subsets of covariates influence the conditional distribution. I identified this issue in the second problem, where waist circumference was above the median of 85 and the MSE was twice as large.
A simple loop over lambda, on top of the tau loop from the chapter 12 code, makes it easy to plot the curves for varying values of lambda.
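A sketch of such a double loop, assuming the db data (head circumference of Dutch boys) from the gamlss.data package used in the chapter 12 code; the tau and lambda grids are my choices:

```r
library(quantreg)
data("db", package = "gamlss.data")

taus    <- c(0.05, 0.5, 0.95)
lambdas <- c(0, 1, 10)  # lambda = 0 reproduces the rough, unpenalized fit

plot(head ~ age, data = db, col = "grey", pch = ".",
     xlab = "Age (years)", ylab = "Head circumference (cm)")
ord <- order(db$age)
for (tau in taus) {
  for (i in seq_along(lambdas)) {
    # additive quantile regression with a smooth term in age;
    # larger lambda penalizes roughness more heavily
    fit <- rqss(head ~ qss(age, lambda = lambdas[i]), tau = tau, data = db)
    lines(db$age[ord], fitted(fit)[ord], lty = i)
  }
}
legend("bottomright", legend = paste("lambda =", lambdas),
       lty = seq_along(lambdas), bty = "n")
```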
Introduction
The ‘Quantile Regression’ paper by Koenker and Hallock describes quantile regression and its applications. The purpose of quantile regression is to describe a population at different quantiles, indexed by a parameter \(\tau\). It minimizes a sum of asymmetrically weighted absolute residuals, which ensures that a proportion \(\tau\) of the observations lie below the fitted quantile and \(1-\tau\) above it. This contrasts with ordinary linear regression, which estimates the conditional mean by minimizing the sum of squared residuals.
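In symbols (notation mine, following the paper's check-function formulation), the \(\tau\)-th regression quantile solves

\[
\hat{\beta}(\tau) = \operatorname*{arg\,min}_{\beta} \sum_{i=1}^{n} \rho_\tau\bigl(y_i - x_i^\top \beta\bigr),
\qquad
\rho_\tau(u) = u\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
\]

so for \(\tau = 0.5\) the loss is proportional to the absolute residual and the fit is the median regression.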
Motivation
Quantile regression, which can estimate the median (or any other quantile) rather than the mean, can yield more appropriate fits when a distribution contains outliers that would otherwise pull the mean in a particular direction. The paper illustrates this with the ‘Quantile Engel Curves’, where the least squares estimate was affected by two unusual points with high income and low food expenditure. Quantile regression reduces this sensitivity by minimizing absolute rather than squared residuals.
Case study
The case study on infant birth weight examined the relationship between low infant birth weight and several factors, including public policy initiatives. Low birth weight was defined as weighing less than 2,500 grams (about 5 pounds, 8 ounces) at birth. Quantile regression was applicable in this study because least squares estimates of the mean say little about the lower tail of the birth weight distribution, which is precisely the region of interest.
Findings
The first noticeable finding of the study was that boys are generally born heavier than girls, by about 100 grams according to the least squares estimate. However, that estimate shifts across the distribution: at the 0.05 quantile boys were only about 45 grams heavier, while at the 0.95 quantile they were about 130 grams heavier. In this case least squares (linear regression) does a poor job of describing how the effect varies across the distribution.
The second significant finding of the study was the disparity between the birth weights of infants born to black and white mothers: at the 5th percentile it was about a third of a kilogram. Additionally, the mother's age, education beyond high school, prenatal care, marital status, and smoking all contribute to an infant's birth weight. Finally, for almost every covariate the quantile estimates fell outside the confidence intervals of the least squares estimate, which suggests that the covariate effects vary substantially across the distribution and that a single mean estimate can be misleading.
Conclusion
The primary takeaway is that distributions with long tails can pull the mean in such a way that least squares regression gives a misleading representation of the data. Using median (quantile) regression, a more faithful picture of the data and its underlying trends can be captured. This is especially useful in econometrics, where substantial outliers can greatly influence a least squares model. When I think of this in practice, the first thing that comes to mind is the median home price: a few extremely expensive homes can raise the mean price enough that homes in the area appear overvalued.