Homework 5

Using the prostate data, fit a model with lpsa as the response and the other variables as predictors.
a) Plot the residuals versus predicted values. Do the residuals seem to have constant variance? (1 point)

proc import out=Prostate
datafile="C:/Users/Scott/Downloads/Programming/prostate(1).csv"
dbms = csv;
run;

PROC REG DATA = Prostate;
MODEL lpsa = lcavol lweight age lbph svi lcp gleason pgg45 / R;
output out=New p=predicted r=residual;
run;
/* This code creates a new data set named New that contains all the data from Prostate plus two new columns holding the predicted value and the residual for each observation. */
proc GPLOT data=New;
Plot residual*predicted;
run;
/* This plots the residuals versus the predicted values. */

The points in the plot above appear to be randomly scattered around zero, so the assumption that the error terms have mean zero is reasonable. The vertical spread of the points does not appear to increase or decrease across the fitted values, so the assumption that the error terms have constant variance is also reasonable.

There is definitely a noticeable pattern here! The residuals are positive at small and large fitted values and negative in the middle. The width of the scatter seems consistent, but the points are not randomly scattered around the zero line from left to right. This curved pattern suggests that the regression model that produced these residuals is misspecified and should not be used as is.

b) Fit a regression with the absolute values of the residuals as the dependent variable and the predicted values as the predictor. Is the dependence on the predicted values statistically significant? (2 points)

/* Export the data to Excel and add a column called abs that holds the absolute values of the residuals. The result is saved as Book1.csv. */
proc import out=ProstateUpdate
datafile="C:/Users/Scott/Downloads/Programming/Book1.csv"
dbms = csv;
run;
/* This creates a new data set named ProstateUpdate from Book1.csv. */
PROC REG DATA = ProstateUpdate;
MODEL abs = predicted;
run;

If the p-value is less than the significance level, the result is statistically significant and we reject the null hypothesis that the slope is zero. The p-value for the slope on the predicted values is 0.8605, which is greater than 0.05, so the dependence of the absolute residuals on the predicted values is not statistically significant and we cannot reject the null hypothesis.
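
The Excel step described above can probably be avoided, since a DATA step can compute the absolute residuals directly. This is a minimal sketch, assuming the data set New from part a) with its residual and predicted columns. The names ProstateAbs and abs_resid are hypothetical and do not appear in the original work.

```sas
/* Sketch: build the absolute residuals inside SAS instead of Excel. */
/* Assumes New (created by the output statement in part a) is still available. */
data ProstateAbs;                 /* hypothetical data set name */
  set New;
  abs_resid = abs(residual);      /* hypothetical column name */
run;

/* Regress the absolute residuals on the predicted values. */
proc reg data=ProstateAbs;
  model abs_resid = predicted;
run;
quit;
```

Keeping the transformation in SAS means the whole analysis can be rerun from the original CSV without a manual edit in between.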

c) Draw the QQ plot of the residuals. What does the QQ plot tell us? (1 point)

/* The QQPLOT statement creates a quantile-quantile plot (Q-Q plot), which compares ordered values of a variable with quantiles of a specified theoretical distribution such as the normal. If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern. Thus, you can use a Q-Q plot to determine how well a theoretical distribution models a set of measurements. */

/* Q-Q plots are similar to probability plots, which you can create with the PROBPLOT statement. Q-Q plots are preferable for graphical estimation of distribution parameters and capability indices, whereas probability plots are preferable for graphical estimation of percentiles. */
proc capability data=ProstateUpdate noprint;
qqplot residual;
run;

The plot compares the ordered values of residual with quantiles of the normal distribution. The points follow an approximately linear pattern, which indicates that the residuals are approximately normally distributed.
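
As a cross-check on the visual impression, the same plot can be drawn with a fitted reference line and accompanied by formal normality tests. This is a sketch using PROC UNIVARIATE, which is an alternative to the PROC CAPABILITY call above and not the method used in the original answer. It assumes ProstateUpdate contains the residual column.

```sas
/* Sketch: Q-Q plot with a normal reference line, plus tests for normality. */
/* The normal option on the proc statement requests the normality tests.    */
proc univariate data=ProstateUpdate normal;
  var residual;
  /* mu=est and sigma=est draw the line using the sample mean and std. dev. */
  qqplot residual / normal(mu=est sigma=est);
run;
```

Points that stay close to the reference line, together with large p-values in the normality tests, would support the conclusion drawn from the plot.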

d) Check for correlated errors. (1 point)

/* Autocorrelation (a relationship between values separated from each other by a given time lag) in the error terms needs to be checked in linear regression. If strong autocorrelation exists, the independence assumption of linear regression is severely violated and an autoregressive model is more appropriate. If no strong autocorrelation exists, we have no additional reason to reject linear regression as a suitable model. */

/* When the data set of interest is a time series, we may want to compute the first-order autocorrelation of the residuals and test whether it is zero. One common test is the Durbin-Watson test. The Durbin-Watson test statistic can be computed in proc reg by specifying the dw option in the model statement. */

/* Note that the Durbin-Watson statistic is not appropriate in certain scenarios, such as when the normality assumption is violated or when strongly time-dependent variables (lagged variables) are used as predictors. In these cases, the Breusch-Godfrey test for serial correlation is more relevant. */

proc reg data=ProstateUpdate;
model residual=predicted /dwprob;
run;

Durbin-Watson D	1.507
Pr < DW	0.0055
Pr > DW	0.9945
Number of Observations	97
1st Order Autocorrelation	0.202

/* Since d is approximately equal to 2(1 - r), where r is the sample autocorrelation of the residuals, d = 2 indicates no autocorrelation. The value of d always lies between 0 and 4. If the Durbin-Watson statistic is substantially less than 2, there is evidence of positive serial correlation. As a rough rule of thumb, if Durbin-Watson is less than 1.0, there may be cause for alarm. Small values of d indicate that successive error terms are, on average, close in value to one another, or positively correlated. If d > 2, successive error terms are, on average, much different in value from one another, i.e., negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance. */
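
To make the definition of d concrete, the statistic can be computed by hand as the sum of squared differences between successive residuals divided by the sum of squared residuals. This is a sketch under the assumptions that ProstateUpdate holds the residuals in their original row order and that none are missing. The variable names prev, num, den and d are hypothetical.

```sas
/* Sketch: d = sum((e_t - e_{t-1})**2) / sum(e_t**2), computed in a DATA step. */
data _null_;
  set ProstateUpdate end=last;
  prev = lag(residual);                        /* previous residual; missing on row 1 */
  if _n_ > 1 then num + (residual - prev)**2;  /* sum statement: retained, starts at 0 */
  den + residual**2;
  if last then do;
    d = num / den;
    put "Durbin-Watson d = " d 6.3;            /* written to the SAS log */
  end;
run;
```

If the assumptions hold, the value written to the log should agree with the Durbin-Watson D that proc reg reports for the same residuals.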

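The Durbin-Watson D is 1.507 and the first-order autocorrelation is 0.202. D is greater than 1, so it is above the rough threshold for alarm, but it is less than 2, which points toward positive rather than negative serial correlation. The p-value for the test of positive autocorrelation (Pr < DW) is 0.0055, which is less than 0.05, so there is statistically significant evidence of mild positive serial correlation in the residuals. The p-value for the test of negative autocorrelation (Pr > DW) is 0.9945, so there is no evidence that successive error terms are much different from one another in value.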
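
The code above requests the test on a regression of residual on predicted. Because least-squares residuals are uncorrelated with the fitted values, the D statistic should come out the same, but the test can also be requested on the original model directly. This is a sketch of that alternative, assuming the Prostate data set from part a). The p-values may differ slightly from those shown above, because they depend on the predictors in the model.

```sas
/* Sketch: Durbin-Watson statistic and p-values for the original model. */
/* dw prints the statistic; dwprob adds the p-values.                   */
proc reg data=Prostate;
  model lpsa = lcavol lweight age lbph svi lcp gleason pgg45 / dw dwprob;
run;
quit;
```

This version also removes the dependence on the exported Book1.csv file.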

Using the divusa data, fit a model with divorce as the response and the other variables except year as predictors.

proc import out=divusa
datafile="C:/Users/Scott/Downloads/Programming/divusa(1).csv"
dbms = csv;
run;
PROC REG DATA = divusa;
MODEL divorce = unemployed femlab marriage birth military;
run;

a) Plot the residuals versus predicted values. (1 point)

PROC REG DATA = divusa;
MODEL divorce = unemployed femlab marriage birth military / R;
output out=divusa2 p=predicted r=residual;
run;
quit;
title "divorce residual v. predicted";
proc GPLOT data=divusa2;
Plot residual*predicted;
run;
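
A horizontal reference line at zero makes it easier to judge whether the residuals are centered and evenly spread. This is a sketch using PROC SGPLOT as an alternative to the GPLOT call above. It assumes the divusa2 data set created by the output statement.

```sas
/* Sketch: residuals versus predicted values with a zero reference line. */
proc sgplot data=divusa2;
  scatter x=predicted y=residual;
  refline 0 / axis=y;              /* horizontal line at residual = 0 */
run;
```

The same three statements can be pointed at the data sets for the sqrt and log models to produce comparable plots.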

b) Fit sqrt(divorce) as the response and plot the residuals versus predicted for this model. (1 point)

proc import out=sqrtdivusa
datafile="C:/Users/Scott/Downloads/Programming/divusa(2).csv"
dbms = csv;
run;
PROC REG DATA = sqrtdivusa;
MODEL sqrtdiv = unemployed femlab marriage birth military / R;
output out=sqrtdivusa2 p=predicted r=residual;
run;
quit;
title "sqrt(divorce) residual v. predicted";
proc GPLOT data=sqrtdivusa2;
Plot residual*predicted;
run;

c) Fit log(divorce) as the response and plot the residuals versus predicted for this model. (1 point)

proc import out=logdivusa
datafile="C:/Users/Scott/Downloads/Programming/divusa(3).csv"
dbms = csv;
run;
PROC REG DATA = logdivusa;
MODEL logdiv = unemployed femlab marriage birth military / R;
output out=logdivusa2 p=predicted r=residual;
run;
quit;
title "log(divorce) residual v. predicted";
proc GPLOT data=logdivusa2;
Plot residual*predicted;
run;

d) Based on the R^2, which of the three models is the best fit? (1 point)

data divcalc;
set divusa;
divorcesqrt=sqrt(divorce);
divorcelog=log(divorce);
run;
proc reg data=divcalc;
model divorcesqrt = unemployed femlab marriage birth military;
output out=divrt r=resid p=predicted;
run;quit;
proc reg data=divcalc;
model divorcelog = unemployed femlab marriage birth military;
output out=divlog r=resid p=predicted;
run;quit;

R^2 (divorce) = 0.9208
R^2 (sqrt(divorce)) = 0.9363
R^2 (log(divorce)) = 0.9498

The model with log(divorce) as the response has the largest R^2 (0.9498), so it explains the largest share of the variance in its response and is the best fit of the three by this criterion.
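
The three R^2 values can be collected side by side in one step instead of being read off separate listings. This is a sketch, assuming the divcalc data set created above. The model labels m_raw, m_sqrt and m_log and the output data set fits are hypothetical names.

```sas
/* Sketch: fit all three models in one PROC REG and save R-square for each. */
/* outest= writes one row per model; rsquare adds the _RSQ_ column.         */
proc reg data=divcalc outest=fits rsquare noprint;
  m_raw:  model divorce     = unemployed femlab marriage birth military;
  m_sqrt: model divorcesqrt = unemployed femlab marriage birth military;
  m_log:  model divorcelog  = unemployed femlab marriage birth military;
run;
quit;

/* Print the model label, response variable and R-square for comparison. */
proc print data=fits noobs;
  var _MODEL_ _DEPVAR_ _RSQ_;
run;
```

The printed table should show one row per response, which makes the comparison in this part easy to reproduce.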

e) Check for correlated errors for the model from point b). (1 point)

proc import out=absresiddiv
datafile="C:/Users/Scott/Downloads/Programming/absresiddiv.csv"
dbms = csv;
run;
proc reg data=absresiddiv;
model residual=predicted /dwprob;
run;

The Durbin-Watson D is 0.351. This is less than 1, indicating that successive error terms are, on average, close to one another in value. This is evidence of strong positive serial correlation.
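
Because the Durbin-Watson statistic depends on the order of the observations, it may be worth sorting by year and requesting the test on the sqrt(divorce) model itself. This is a sketch, assuming divcalc still contains the year variable from divusa. The data set name divsorted is hypothetical.

```sas
/* Sketch: make sure the rows are in time order before testing. */
proc sort data=divcalc out=divsorted;
  by year;
run;

/* Durbin-Watson statistic and p-values for the sqrt(divorce) model. */
proc reg data=divsorted;
  model divorcesqrt = unemployed femlab marriage birth military / dw dwprob;
run;
quit;
```

If the rows were already in year order, the D statistic should be unchanged by the sort.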

Scott Goley

November 11, 2016