The data set “S4173proj”, referred to hereafter as “the data”, comes from a survey of nearly 20,000 participants. Its variables include the attributes Race, Age, Gender, and Cigarette Smoking, as well as basic medical history: Diabetic Status, Coronary Heart Disease, Body Mass Index, Systolic and Diastolic Blood Pressure, Cholesterol, and death from heart disease during a follow-up.

However, preliminary analysis revealed that many observations had incomplete values. After observations with missing values were removed, the data set was reduced to 16,189 observations and 13 variables (See Program 1). An exploratory analysis was then conducted in which summary statistics for the predictors were calculated within each level of the dependent variable (See Program 2).

                                            



A linear regression (See Program 2) was also run to determine the basic characteristics of the full model. The resulting output, shown immediately below, indicates that the full model fits poorly: its R-Square of only 0.1432 means that it accounts for about 14% of the total variation in the dependent variable.

                                                  

                                                           

The parameter estimates reveal three variables that do not reach the required significance in the model (Race, smoking, and cholesterol); all other variables are sufficiently significant.

                                                  

The two variable selection methods agreed. Forward selection (See Program 3) retained the variables Age, Chd, Male, Diab, Sbp, Chol, and CurrSmok. Each had a Chi-Square p-value below .0001 except CurrSmok, whose p-value was still very small at .0004. The forward selection output is shown in the table below.

                                                  

Following this, backward elimination (See Program 3) was performed to confirm the model. It removed the two variables dbp and bmi because a Chi-Square test found them insignificant.

                                                  

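The selected models agree on the important factors (Age, Coronary Heart Disease, Gender, Diabetes, Systolic Blood Pressure, Cholesterol, and Smoking). Because both selections were fitted with PROC LOGISTIC, the estimates are on the log-odds scale, and the left-hand side below stands for the log-odds of the modeled response level. The fitted model is: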

\[d\_chd = 9.7495 - 0.0774\,Age - 0.4874\,Male - 0.3422\,Currsmok - 0.5062\,Diab - 1.1143\,chd - 0.00842\,sbp - 0.00338\,chol\]
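
As a sketch of how these estimates can be read, assume they are PROC LOGISTIC coefficients on the log-odds scale. Here \(p\) denotes the probability of whichever response level the procedure modeled (the output identifying that level is not reproduced here), and \(\eta\) denotes the right-hand side of the equation above; both symbols are notation introduced for illustration. The fitted equation then converts to a probability by

\[p = \frac{1}{1 + e^{-\eta}}\]

and each coefficient exponentiates to an odds ratio, for example

\[e^{-0.4874} \approx 0.614, \qquad e^{-1.1143} \approx 0.328, \qquad e^{-0.0774} \approx 0.926\]

Under this reading, the odds of the modeled level for Male = 1 are about 0.61 times those for Male = 0 with the other predictors held fixed, and each additional year of Age multiplies those odds by about 0.93.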

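Every coefficient in the model is negative. Because Program 3 specifies neither the DESCENDING nor the EVENT= option, PROC LOGISTIC by default models the probability of the lower-ordered response value (d_chd = 0). These signs therefore most likely mean that an increase in each predictor lowers the odds of not dying from coronary heart disease, that is, raises the odds of death. This reading should be confirmed against the response profile in the output. Finally, the model was examined for stability (See Program 5). A Q-Q plot of the residuals shows that they are not normally distributed.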

                                                  

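Also, the Durbin-Watson test was conducted to check the residuals for autocorrelation. The result was slightly concerning: although D = 1.977 is close to 2, the p-value for positive autocorrelation (Pr < DW = 0.0755) is significant at the 10% level, though not at the 5% level.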

Durbin-Watson D               1.977
Pr < DW                       0.0755
Pr > DW                       0.9245
Number of Observations        16189
1st Order Autocorrelation     0.011
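
As a quick consistency check on this output, the Durbin-Watson statistic is approximately tied to the first-order autocorrelation of the residuals. The symbol \(r_1\) is notation introduced here for the reported value 0.011, and the relation is the usual large-sample approximation rather than an exact identity:

\[D \approx 2(1 - r_1) = 2(1 - 0.011) = 1.978\]

This agrees with the reported D of 1.977 up to rounding of \(r_1\). A value near 2 corresponds to little first-order autocorrelation.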



However, a test of the studentized residuals detected no outliers.
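
The screening rule behind this test can be written out as a sketch of the Bonferroni-style cutoff that Program 5 appears to compute. The symbols \(t_i\) (studentized residual of observation \(i\)), \(n\), \(\alpha\), and \(\nu\) (the residual degrees of freedom used in Program 5) are notation introduced here, not taken from the output. An observation would be flagged when

\[|t_i| > t_{1-\alpha/(2n),\,\nu}, \qquad \frac{\alpha}{2n} = \frac{0.05}{2 \times 16189} \approx 1.54 \times 10^{-6}\]

With \(n\) this large, the cutoff lies far out in the tail of the t distribution, so only very extreme residuals would be flagged.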

                                                  

No further statement can therefore be made about coronary heart disease, because problems within this data set prevent it from supporting a truly representative analysis. It does not satisfy several of the assumptions required for a valid regression, most importantly normality of the error distribution, and the Durbin-Watson result leaves independence of the errors in some doubt.



Program Index

Program 1.

libname s4173 "/courses/u_fsu.edu1/i_648881/c_5934/Sas4173" access=readonly;
/* Keep only observations with no missing value in any analysis variable. */
data project;
  set s4173.s4173proj;
  if nmiss(Race, Age, Male, Currsmok, Diab, Chd, Bmi, sbp, dbp, chol, d_chd) > 0 then delete;
run;

Program 2.

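/* Summary statistics for each numeric variable, by level of d_chd. */
proc means data=project maxdec=3;
  class d_chd;
run;
/* Full linear regression of d_chd on all ten predictors. */
proc reg data=project;
  model d_chd = Race Male Age currsmok chd diab bmi sbp dbp chol;
run; quit;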

Program 3.

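/* Forward selection. No DESCENDING or EVENT= option is given, so the   */
/* default response level is modeled. race is declared in CLASS but is  */
/* not in the MODEL statement, so it is not a candidate for selection.  */
proc logistic data=project;
  class race (ref="1" param=ref) male (ref="0" param=ref)
        diab (ref="0" param=ref) Currsmok (ref="0" param=ref);
  model d_chd = Age Male Currsmok Diab Chd Bmi sbp dbp chol / selection=forward;
run;

/* Backward elimination on the same candidate set, for confirmation. */
proc logistic data=project;
  class race (ref="1" param=ref) male (ref="0" param=ref)
        diab (ref="0" param=ref) Currsmok (ref="0" param=ref);
  model d_chd = Age Male Currsmok Diab Chd Bmi sbp dbp chol / selection=backward;
run;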

Program 4.

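/* Draw 1000 bootstrap replicates: sampling with replacement (method=urs) */
/* at 100% of the original sample size, one output row per selection.     */
proc surveyselect data=project out=outboot
  seed=12345 method=urs samprate=1 outhits noprint rep=1000;
run;
/* Describe d_chd separately within each replicate. */
proc univariate data=outboot;
  var d_chd;
  by Replicate;
run;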

Program 5.

  • QQ Plot
/* Residuals against normal quantiles; residuals are saved to work.residuals. */
proc reg data=project;
  model d_chd = Age Male Currsmok Diab chd sbp chol;
  plot r.*nqq. / noline mse cframe=ligr;
  output out=residuals r=resid;
run; quit;
  • Durbin-Watson
/* dwprob requests the Durbin-Watson statistic with its p-values. */
proc reg data=project;
  model d_chd = Age Male Currsmok Diab chd sbp chol / dwprob;
  plot r.*p.;
  output out=studentized rstudent=studentresid;
run; quit;
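  • Studentized residuals and cutoff for outliers
/* Absolute studentized residual for each observation. */
data studentized;
  set studentized;
  absstudentresid = abs(studentresid);
run;
/* Sort ascending, so the largest absolute residuals appear last. */
proc sort data=studentized;
  by absstudentresid;
run;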
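/* Bonferroni cutoff at alpha = 0.05 for n = 16189. Degrees of freedom are */
/* n - p - 1, with p = 8 parameters (7 predictors plus the intercept).     */
data quantiles;
  cutoff = abs(tinv(0.05/(2*16189), 16189-8-1));
run;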