Biostatistics for Health Researchers II: Formative Assessment I

Author
Affiliation

Bongani Ncube (3002164)

University of the Witwatersrand (School of Public Health)

Published

May 12, 2025

Keywords

Regression Analysis, Simple Linear, Multicollinearity, Multiple Regression, Cook's distance

About the dataset
id radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean
842302 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871
842517 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667
84300903 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999
84348301 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744
84358402 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883
843786 12.45 15.70 82.57 477.1 0.12780 0.17000 0.1578 0.08089 0.2087 0.07613

Data description (5 marks)

(a) Use appropriate descriptive statistics to summarise the variables, and include at least two types of graphical displays to support your summary

import delimited BreastCancer

dtable, continuous(radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean )  title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)


Table1
----------------------------------------
                            Summary     
----------------------------------------
N                                    569
radius_mean               14.127 (3.524)
texture_mean              19.290 (4.301)
perimeter_mean           91.969 (24.299)
area_mean              654.889 (351.914)
smoothness_mean            0.096 (0.014)
compactness_mean           0.104 (0.053)
concavity_mean             0.089 (0.080)
concave_points_mean        0.049 (0.039)
symmetry_mean              0.181 (0.027)
fractal_dimension_mean     0.063 (0.007)
----------------------------------------
(collection DTable exported to file Statin_table.docx)
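
Mean (SD) can be misleading for skewed variables, so medians and quartiles are a useful complement to the table above. The following is a minimal sketch using `tabstat`, assuming the same BreastCancer file and variable names as above and that it is run from a do-file (the `///` line continuation is not available when typing interactively). No output is shown because it was not part of the original analysis.

```stata
* Sketch: robust summaries (median, quartiles, skewness) alongside mean and SD
import delimited BreastCancer, clear

tabstat radius_mean texture_mean perimeter_mean area_mean smoothness_mean ///
    compactness_mean concavity_mean concave_points_mean symmetry_mean ///
    fractal_dimension_mean, ///
    statistics(n mean sd p50 p25 p75 min max skewness) ///
    columns(statistics) format(%9.3f)
```

A skewness value well above zero would suggest reporting the median and interquartile range for that variable instead of the mean and SD.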

Graphical representation

Histograms

import delimited BreastCancer

histogram texture_mean, title("Histogram of texture_mean") note("") normal ytitle("Density") name(graph1, replace) mcolor(blue%50)

histogram area_mean, title("Histogram of area_mean") note("") normal ytitle("Density") name(graph2, replace) mcolor(blue%50)

histogram compactness_mean, title("Histogram of compactness_mean") note("") normal ytitle("Density") name(graph3, replace) mcolor(blue%50)

histogram concave_points_mean, title("Histogram of concave_points_mean") note("") normal ytitle("Density") name(graph4, replace) mcolor(blue%50)

histogram fractal_dimension_mean, title("Histogram of fractal_dimension_mean") note("") normal ytitle("Density") name(graph5, replace) mcolor(blue%50)

histogram radius_mean, title("Histogram of radius_mean") note("") normal ytitle("Density") name(graph6, replace) mcolor(blue%50)

histogram perimeter_mean, title("Histogram of perimeter_mean") note("") normal ytitle("Density") name(graph7, replace) mcolor(blue%50)

histogram smoothness_mean, title("Histogram of smoothness_mean") note("") normal ytitle("Density") name(graph8, replace) mcolor(blue%50)

histogram concavity_mean, title("Histogram of concavity_mean") note("") normal ytitle("Density") name(graph9, replace) mcolor(blue%50)

histogram symmetry_mean, title("Histogram of symmetry_mean") note("") normal ytitle("Density") name(graph10, replace) mcolor(blue%50)

graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 ,title("") saving("all_graphs.svg" ,replace) 

quietly graph export all_graphs.svg, replace
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)

(bin=23, start=9.71, width=1.2856521)

(bin=23, start=143.5, width=102.5)

(bin=23, start=.01938, width=.01417478)

(bin=23, start=0, width=.00874783)

(bin=23, start=.04996, width=.00206435)

(bin=23, start=6.9809999, width=.9186522)

(bin=23, start=43.790001, width=6.2917391)

(bin=23, start=.05263, width=.00481609)

(bin=23, start=0, width=.01855652)

(bin=23, start=.106, width=.0086087)

file all_graphs.svg saved as .gph format

[Figure: combined histograms with overlaid normal density curves (y-axis: Density) for texture_mean, area_mean, compactness_mean, concave_points_mean, fractal_dimension_mean, radius_mean, perimeter_mean, smoothness_mean, concavity_mean, and symmetry_mean]
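
The ten near-identical histogram commands can also be generated in a loop. The following is a minimal sketch, assuming it is run from a do-file; the graph names `h_...` are illustrative and do not appear in the original code.

```stata
* Sketch: draw one histogram per variable in a loop, then combine them
import delimited BreastCancer, clear

local vars radius_mean texture_mean perimeter_mean area_mean smoothness_mean ///
    compactness_mean concavity_mean concave_points_mean symmetry_mean ///
    fractal_dimension_mean

local graphs
foreach v of local vars {
    * nodraw builds the graph in memory without displaying it
    histogram `v', normal title("Histogram of `v'") ytitle("Density") ///
        name(h_`v', replace) nodraw
    * collect the graph names for graph combine
    local graphs `graphs' h_`v'
}

graph combine `graphs'
graph export all_graphs.svg, replace
```

The same pattern would work for the boxplots by swapping `histogram` for `graph box` and dropping the `normal` and `ytitle()` options.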

Boxplots

import delimited BreastCancer

graph box texture_mean ,title("Boxplot of texture_mean") note( "") name(graph1 , replace)

graph box area_mean ,title("Boxplot of area_mean") note( "") name(graph2 , replace)

graph box compactness_mean ,title("Boxplot of compactness_mean") note( "") name(graph3 , replace)

graph box concave_points_mean ,title("Boxplot of concave_points_mean") note( "")  name(graph4 , replace)

graph box fractal_dimension_mean ,title("Boxplot of fractal_dimension_mean") note( "")  name(graph5 , replace)

graph box radius_mean ,title("Boxplot of radius_mean") note( "") name(graph6 , replace)

graph box perimeter_mean ,title("Boxplot of perimeter_mean") note( "") name(graph7 , replace)

graph box smoothness_mean ,title("Boxplot of smoothness_mean") note( "") name(graph8 , replace)

graph box concavity_mean ,title("Boxplot of concavity_mean") note( "")  name(graph9 , replace)

graph box symmetry_mean ,title("Boxplot of symmetry_mean") note( "")  name(graph10 , replace)

graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 ,title("") saving("all_boxes.svg" ,replace) 

quietly graph export all_boxes.svg, replace
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)











file all_boxes.svg saved as .gph format

[Figure: combined boxplots of texture_mean, area_mean, compactness_mean, concave_points_mean, fractal_dimension_mean, radius_mean, perimeter_mean, smoothness_mean, concavity_mean, and symmetry_mean]

2. Exploring variable relationships and multicollinearity (25 marks)

(a) Perform a correlation analysis among all continuous variables in the dataset, including the outcome variable radius mean

import delimited BreastCancer

pwcorr radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean     concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean , star(0.05) sig
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)

             | radius~n textur~n perime~n area_m~n smooth~n compac~n c~y_mean
-------------+---------------------------------------------------------------
 radius_mean |   1.0000 
             |
             |
texture_mean |   0.3238*  1.0000 
             |   0.0000
             |
perimeter_~n |   0.9979*  0.3295*  1.0000 
             |   0.0000   0.0000
             |
   area_mean |   0.9874*  0.3211*  0.9865*  1.0000 
             |   0.0000   0.0000   0.0000
             |
smoothness~n |   0.1706* -0.0234   0.2073*  0.1770*  1.0000 
             |   0.0000   0.5777   0.0000   0.0000
             |
compactnes~n |   0.5061*  0.2367*  0.5569*  0.4985*  0.6591*  1.0000 
             |   0.0000   0.0000   0.0000   0.0000   0.0000
             |
concavity_~n |   0.6768*  0.3024*  0.7161*  0.6860*  0.5220*  0.8831*  1.0000 
             |   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
             |
concave_po~n |   0.8225*  0.2935*  0.8510*  0.8233*  0.5537*  0.8311*  0.9214*
             |   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
             |
symmetry_m~n |   0.1477*  0.0714   0.1830*  0.1513*  0.5578*  0.6026*  0.5007*
             |   0.0004   0.0888   0.0000   0.0003   0.0000   0.0000   0.0000
             |
fractal_di~n |  -0.3116* -0.0764  -0.2615* -0.2831*  0.5848*  0.5654*  0.3368*
             |   0.0000   0.0685   0.0000   0.0000   0.0000   0.0000   0.0000
             |

             | c~e_po~n symmet~n fracta~n
-------------+---------------------------
concave_po~n |   1.0000 
             |
             |
symmetry_m~n |   0.4625*  1.0000 
             |   0.0000
             |
fractal_di~n |   0.1669*  0.4799*  1.0000 
             |   0.0001   0.0000
             |
Comments
  • radius_mean and perimeter_mean (0.9979) → near-perfect positive relationship.
  • radius_mean and area_mean (0.9874) → near-perfect positive relationship.
  • perimeter_mean and area_mean (0.9865) → near-perfect positive relationship.
  • concavity_mean and compactness_mean (0.8831) → strong positive relationship.
  • concave_points_mean and concavity_mean (0.9214) → very strong positive relationship.

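Moderate to strong positive correlations:

  • compactness_mean and smoothness_mean (0.6591).
  • concave_points_mean and radius_mean (0.8225).

texture_mean shows only weak correlations with the other variables (the highest is 0.3295, with perimeter_mean). symmetry_mean is weakly correlated with the size variables (0.1477 to 0.1830) but moderately correlated with compactness_mean (0.6026) and smoothness_mean (0.5578).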

  • The plot below highlights the variables that have negative correlations.
* Install corrtable if necessary
ssc install corrtable

* Load the breast cancer data
import delimited BreastCancer

* Make correlation table
* The half option shows only the lower triangle and puts variable names on the diagonal.
* flag1 and howflag1 tell corrtable to shade positive correlations (r(rho) > 0)
* in blue (blue*0.1);
* flag2 and howflag2 similarly shade negative correlations (r(rho) < 0) in pink.
corrtable radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean, half flag1(r(rho) > 0) howflag1(plotregion(color(blue * 0.1))) flag2(r(rho) < 0) howflag2(plotregion(color(pink*0.1)))

quietly graph export heatmap2.svg, replace
checking corrtable consistency and verifying not already installed...
all files already exist and are up to date.

(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)

[Figure: lower-triangle correlation table of the ten variables from corrtable, with positive correlations shaded blue and negative correlations shaded pink]

Note

Negative correlations can be shown below:

fractal_dimension_mean is negatively correlated with size variables i.e:

  • With radius_mean (-0.3116).
  • With perimeter_mean (-0.2615).
  • With area_mean (-0.2831).
Conclusion
  • The preliminary correlation analysis shows strong positive correlations between radius_mean and perimeter_mean \((r \approx 1.00)\), and between radius_mean and area_mean \((r \approx 0.99)\): as the radius increases, the perimeter and area increase with it. This is expected, since both area and perimeter are functions of the radius.

  • concavity_mean and concave_points_mean are strongly positively correlated (r = 0.92), suggesting that higher concavity is associated with more concave points.

  • Moderate positive correlations are observed between compactness_mean and smoothness_mean (r = 0.66), and between compactness_mean and symmetry_mean (r = 0.60). Both are statistically significant (p < 0.001) but clearly weaker than the size-related correlations.

  • Weak positive correlations are seen between radius_mean and smoothness_mean (r = 0.17, p < 0.001) and between symmetry_mean and texture_mean (r = 0.07, p = 0.089); the latter is not statistically significant at the 5% level.

  • fractal_dimension_mean shows weak negative correlations with radius_mean (r = -0.31) and area_mean (r = -0.28), suggesting that larger radius or area values are slightly associated with lower fractal dimension.
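
The significance flags in the pwcorr output can be checked by hand. The following is a sketch, assuming the p-values come from the usual t-test for a Pearson correlation with \(n-2\) degrees of freedom:

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]

For symmetry_mean and texture_mean (\(r = 0.0714\), \(n = 569\)):

\[ t = \frac{0.0714 \times \sqrt{567}}{\sqrt{1-0.0714^2}} \approx \frac{1.700}{0.997} \approx 1.70 \]

This is below the two-sided 5% critical value of about 1.96, which agrees with the reported \(p = 0.0888\). For radius_mean and symmetry_mean (\(r = 0.1477\)):

\[ t = \frac{0.1477 \times \sqrt{567}}{\sqrt{1-0.1477^2}} \approx \frac{3.517}{0.989} \approx 3.56 \]

This is well above 1.96, in line with the reported \(p = 0.0004\). With 569 observations, even correlations of about 0.15 are statistically significant, so significance alone says little about the strength of a relationship.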

(b) Summarise the results using a clear and well-labelled correlation matrix table or a visual format such as a correlogram

Correlogram

ssc install heatplot
ssc install palettes, replace 
ssc install colrspace, replace
import delimited BreastCancer
corr radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean


return list
matrix corrmatrix = r(C)
heatplot corrmatrix, values(format(%4.3f) size(tiny)) legend(off) color(hcl diverging,  intensity(.7)) aspectratio(1) xlabel(,labsize(small) angle(45)) xsize(10) ysize(13)

quietly graph export heatmap.svg, replace
checking heatplot consistency and verifying not already installed...
all files already exist and are up to date.

checking palettes consistency and verifying not already installed...
all files already exist and are up to date.

checking colrspace consistency and verifying not already installed...
all files already exist and are up to date.

(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)

(obs=569)

             | radius~n textur~n perime~n area_m~n smooth~n compac~n c~y_mean
-------------+---------------------------------------------------------------
 radius_mean |   1.0000
texture_mean |   0.3238   1.0000
perimeter_~n |   0.9979   0.3295   1.0000
   area_mean |   0.9874   0.3211   0.9865   1.0000
smoothness~n |   0.1706  -0.0234   0.2073   0.1770   1.0000
compactnes~n |   0.5061   0.2367   0.5569   0.4985   0.6591   1.0000
concavity_~n |   0.6768   0.3024   0.7161   0.6860   0.5220   0.8831   1.0000
concave_po~n |   0.8225   0.2935   0.8510   0.8233   0.5537   0.8311   0.9214
symmetry_m~n |   0.1477   0.0714   0.1830   0.1513   0.5578   0.6026   0.5007
fractal_di~n |  -0.3116  -0.0764  -0.2615  -0.2831   0.5848   0.5654   0.3368

             | c~e_po~n symmet~n fracta~n
-------------+---------------------------
concave_po~n |   1.0000
symmetry_m~n |   0.4625   1.0000
fractal_di~n |   0.1669   0.4799   1.0000



scalars:
                  r(N) =  569
                r(rho) =  .3237818874068258

matrices:
                  r(C) :  10 x 10

[Figure: heat plot of the 10 × 10 correlation matrix, with each cell labelled with its correlation coefficient; rows and columns are radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean, symmetry_mean, and fractal_dimension_mean]

Pairwise scatter plot

import delimited BreastCancer

graph matrix radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean, half maxis(ylabel(none) xlabel(none))

quietly graph export scatter.svg, replace
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)

[Figure: lower-triangle scatterplot matrix of radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean, symmetry_mean, and fractal_dimension_mean]

(d) Identify the variables that show the most substantial positive or negative correlations with the outcome variable or with other predictors, and explain why these associations make sense

Note

Substantial positive correlations with the outcome variable:

radius_mean is strongly positively correlated with:

  • perimeter_mean \((r = 1.00)\)
  • area_mean \((r = 0.99)\)
  • concave_points_mean \((r = 0.82)\)
  • concavity_mean \((r = 0.68)\)

Substantial positive correlations between predictor variables:

  • perimeter_mean with area_mean (r = 0.99)
  • concavity_mean with concave_points_mean (r = 0.92)
  • compactness_mean with concavity_mean (r = 0.88)
  • compactness_mean with concave_points_mean (r = 0.83)

Most substantial negative correlations

With the outcome variable:

  • radius_mean with fractal_dimension_mean (r = -0.31)

With predictor variables:

  • fractal_dimension_mean with area_mean (r = -0.28)
  • fractal_dimension_mean with perimeter_mean (r = -0.26)

Justifications

  • The most substantial positive correlations are among radius_mean, perimeter_mean, and area_mean \((r \approx 0.99\) to \(1.00)\), meaning they are almost perfectly positively related.

This is expected because all three measure size, and both perimeter and area are functions of the radius. For a circle:

\(A=\pi r^2 , \quad \text{Perimeter} = 2\pi r\)

From these formulas, perimeter is directly proportional to the radius, while area is proportional to \(r^2\). Although the area relationship is quadratic, it is monotonic, so over the observed range of radii the linear correlation is still about 0.99.

  • concavity_mean and concave_points_mean are very strongly correlated (r = 0.92), which makes sense because both capture aspects of concavity.
  • Substantial positive correlations among the predictors are also seen between compactness_mean, concavity_mean, and concave_points_mean, which makes sense because shapes with higher compactness scores (less smooth outlines) tend to have more noticeable concavities.
  • fractal_dimension_mean shows the most substantial negative correlations, all with the size-related measures radius_mean, perimeter_mean, and area_mean (r ≈ -0.26 to -0.31). Larger tumors therefore tend to have somewhat less complex edges, which would be consistent with larger tumors having smoother borders than smaller, more irregular ones.

(e) Identify any predictor variables that may be redundant due to very high correlations with each other, and explain how such redundancy could affect model performance or interpretation

Comments
  • perimeter_mean and area_mean are highly correlated \((r \approx 0.99)\), and both are determined by the outcome variable radius_mean. Because each is a function of the radius (and hence of the other), they carry almost the same information and are a likely source of multicollinearity.

For a circle of radius \(r\): \[\text{Area}=\pi r^2 = \frac{1}{2}\times 2\pi r\times r=\frac{1}{2}\times r\times \text{Perimeter}\]

  • concavity_mean and concave_points_mean are highly correlated \((r=0.92)\)
  • compactness_mean and concavity_mean are highly correlated \((r=0.88)\)

How redundancy could affect model performance and interpretation:

  1. Unstable coefficient estimates that change markedly when a variable or a few observations are added or removed.
  2. Inflated standard errors, with correspondingly lower t statistics and wider confidence intervals.
  3. Unexpected changes in coefficient magnitudes or signs.
  4. Non-significant individual coefficients despite a high \(R^2\).
  5. Worse model generalization (overfitting to noise in the training data).

Linear Regression

Linear regression fits this model:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \epsilon \]

  • \(Y\) represents the outcome variable
  • \(X_1, X_2, \cdots, X_p\) represent the predictors, of which there are \(p\) total.
  • \(\beta_0\) represents the intercept. If you have a subject for which every predictor is equal to zero, \(\beta_0\) represents their predicted outcome.
  • The other \(\beta\)’s are called the coefficients, and represent the relationship between each predictor and the response. We will cover their interpretation in detail later.
  • \(\epsilon\) represents the error. Regression is a game of averages, but for any individual observation, the model will contain some error.
import delimited BreastCancer
regress radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean

vif
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)

      Source |       SS           df       MS      Number of obs   =       569
-------------+----------------------------------   F(9, 559)       =  99738.39
       Model |  7049.55654         9   783.28406   Prob > F        =    0.0000
    Residual |  4.39004275       559  .007853386   R-squared       =    0.9994
-------------+----------------------------------   Adj R-squared   =    0.9994
       Total |  7053.94658       568    12.41892   Root MSE        =    .08862

------------------------------------------------------------------------------
 radius_mean | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
texture_mean |   .0002343   .0009418     0.25   0.804    -.0016156    .0020843
perimeter_~n |   .1568643   .0013373   117.30   0.000     .1542376     .159491
   area_mean |  -.0002857   .0000783    -3.65   0.000    -.0004396   -.0001318
smoothness~n |   1.273811   .4514421     2.82   0.005     .3870806    2.160541
compactnes~n |  -4.827446   .2654031   -18.19   0.000    -5.348755   -4.306136
concavity_~n |  -.7595862   .1563819    -4.86   0.000    -1.066754   -.4524182
concave_po~n |  -.2975441   .4463421    -0.67   0.505    -1.174257    .5791686
symmetry_m~n |   .2350665   .1806098     1.30   0.194    -.1196903    .5898233
fractal_di~n |   3.251577   1.332831     2.44   0.015     .6336083    5.869546
       _cons |   .0994146   .1342057     0.74   0.459    -.1641945    .3630236
------------------------------------------------------------------------------

    Variable |       VIF       1/VIF  
-------------+----------------------
perimeter_~n |     76.37    0.013094
   area_mean |     54.98    0.018190
concave_po~n |     21.69    0.046094
compactnes~n |     14.21    0.070375
concavity_~n |     11.24    0.088962
fractal_di~n |      6.40    0.156137
smoothness~n |      2.92    0.342988
symmetry_m~n |      1.77    0.563991
texture_mean |      1.19    0.842569
-------------+----------------------
    Mean VIF |     21.20
Note
  • The fitted model has the form: \[ \begin{aligned} \text{radius\_mean} = {} & \beta_0 + \beta_1\,\text{texture\_mean} + \beta_2\,\text{perimeter\_mean} + \beta_3\,\text{area\_mean}\\ & + \beta_4\,\text{smoothness\_mean} + \beta_5\,\text{compactness\_mean} + \beta_6\,\text{concavity\_mean}\\ & + \beta_7\,\text{concave\_points\_mean} + \beta_8\,\text{symmetry\_mean} + \beta_9\,\text{fractal\_dimension\_mean} + \epsilon \end{aligned}\] where the estimated \(\beta\)'s are given in the Coefficient column.
  • The following variables show serious multicollinearity, as their VIF values are above 10:
  1. perimeter_mean (VIF = 76.37)
  2. area_mean (VIF = 54.98)
  3. concave_points_mean (VIF = 21.69)
  4. compactness_mean (VIF = 14.21)
  5. concavity_mean (VIF = 11.24).
  • These high VIFs indicate that these predictors are strongly collinear with one another and contribute little unique information to the model.
  • fractal_dimension_mean (VIF = 6.40) lies between the common thresholds of 5 and 10, indicating moderate collinearity.
  • Only smoothness_mean, symmetry_mean, and texture_mean have VIFs well below 5, suggesting little multicollinearity.
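
To make the VIF values concrete, recall the standard definition, where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on all the other predictors:

\[ \mathrm{VIF}_j = \frac{1}{1-R_j^2} \quad\Longrightarrow\quad R_j^2 = 1-\frac{1}{\mathrm{VIF}_j} \]

The 1/VIF column in the output above is therefore \(1-R_j^2\). Using the reported values:

\[ \text{perimeter\_mean: } R_j^2 = 1-0.013094 = 0.9869, \qquad \text{texture\_mean: } R_j^2 = 1-0.842569 = 0.1574 \]

About 98.7% of the variation in perimeter_mean is explained by the other predictors, against only about 15.7% for texture_mean. Since the variance of a coefficient is multiplied by its VIF, the standard error for perimeter_mean is roughly \(\sqrt{76.37}\approx 8.74\) times larger than it would be if perimeter_mean were uncorrelated with the other predictors.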

Model without redundant variables

import delimited BreastCancer
regress radius_mean fractal_dimension_mean smoothness_mean symmetry_mean texture_mean concavity_mean
vif
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)

      Source |       SS           df       MS      Number of obs   =       569
-------------+----------------------------------   F(5, 563)       =    438.92
       Model |  5613.78998         5    1122.758   Prob > F        =    0.0000
    Residual |   1440.1566       563  2.55800462   R-squared       =    0.7958
-------------+----------------------------------   Adj R-squared   =    0.7940
       Total |  7053.94658       568    12.41892   Root MSE        =    1.5994

------------------------------------------------------------------------------
 radius_mean | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
fractal_di~n |  -326.6724    12.1021   -26.99   0.000    -350.4432   -302.9017
smoothness~n |   34.51059   6.802347     5.07   0.000     21.14951    47.87167
symmetry_m~n |  -4.461625   3.169397    -1.41   0.160    -10.68691    1.763662
texture_mean |   .0222003   .0168809     1.32   0.189    -.0109569    .0553575
concavity_~n |   36.88825   1.106025    33.35   0.000     34.71581    39.06069
       _cons |   28.42048    .820753    34.63   0.000     26.80837    30.03259
------------------------------------------------------------------------------

    Variable |       VIF       1/VIF  
-------------+----------------------
smoothness~n |      2.03    0.492051
concavity_~n |      1.73    0.579283
symmetry_m~n |      1.68    0.596549
fractal_di~n |      1.62    0.616847
texture_mean |      1.17    0.854312
-------------+----------------------
    Mean VIF |      1.65
Comments on Multicollinearity
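  • \(R^2\) fell from 0.9994 to 0.7958 (adjusted \(R^2\) from 0.9994 to 0.7940) after removing perimeter_mean, area_mean, compactness_mean, and concave_points_mean. The drop is expected: the near-perfect fit of the full model came mainly from perimeter_mean and area_mean, which are essentially functions of the outcome itself. In the reduced model all VIFs are at most 2.03 (mean VIF = 1.65), so the multicollinearity problem is resolved, and the model still explains about 80% of the variation in radius_mean.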
Comments on model coefficients
  • For a one-unit increase in fractal_dimension_mean, radius_mean decreases by 326.67 on average, adjusting for all other variables; the association is statistically significant \((p<0.001)\).

  • For a one-unit increase in smoothness_mean, radius_mean increases by 34.51 on average, adjusting for all other variables \((p<0.001)\).

  • For a one-unit increase in symmetry_mean, radius_mean decreases by 4.46 on average, adjusting for all other variables. However, this relationship is not statistically significant \((p = 0.160 > 0.05)\).

  • For a one-unit increase in texture_mean, radius_mean increases by 0.022 on average, adjusting for all other variables. However, this relationship is not significant at the 5% level \((p = 0.189 > 0.05)\).

  • For a one-unit increase in concavity_mean, radius_mean increases by 36.89 on average, adjusting for all other variables \((p<0.001)\).
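
The large coefficients reflect the scale of the predictors: fractal_dimension_mean, smoothness_mean, and concavity_mean all take values well below 1 in this dataset, so a one-unit change lies far outside the observed range. As a sketch of a more interpretable reading, the same coefficients can be expressed per 0.01-unit increase (the choice of 0.01 is an illustrative assumption, not part of the original analysis):

\[ \begin{aligned} \text{fractal\_dimension\_mean: } & 0.01 \times (-326.6724) \approx -3.27\\ \text{smoothness\_mean: } & 0.01 \times 34.51059 \approx 0.35\\ \text{concavity\_mean: } & 0.01 \times 36.88825 \approx 0.37 \end{aligned} \]

For example, an increase of 0.01 in fractal_dimension_mean is associated with a decrease of about 3.27 in radius_mean, holding the other predictors constant. The p-values and the model fit are unaffected by this rescaling.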

Section B

a) Research Question

  • What demographic, lifestyle, and clinical factors are associated with systolic blood pressure (sysBP) among participants in the Framingham Heart Study?

b) Simple linear model

import delimited framingham_clean
regress sysbp bmi
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(1, 3654)      =    449.60
       Model |  195452.085         1  195452.085   Prob > F        =    0.0000
    Residual |  1588465.99     3,654  434.719756   R-squared       =    0.1096
-------------+----------------------------------   Adj R-squared   =    0.1093
       Total |  1783918.07     3,655   488.07608   Root MSE        =     20.85

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         bmi |   1.798533   .0848209    21.20   0.000     1.632232    1.964834
       _cons |   85.99432   2.214055    38.84   0.000     81.65341    90.33522
------------------------------------------------------------------------------
Note

Interpretation

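  • Predictor - body mass index (bmi)
  • Outcome - systolic blood pressure (sysbp)

For each one-unit increase in body mass index, systolic blood pressure increases by 1.80 on average (95% CI 1.63 to 1.96; p < 0.001). Because this is a simple linear model with a single predictor, the estimate is unadjusted for any other variable. BMI alone explains about 11% of the variation in systolic blood pressure (\(R^2 = 0.1096\)).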
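
As a worked illustration, the fitted line from the output above is

\[ \widehat{\text{sysbp}} = 85.99432 + 1.798533 \times \text{bmi} \]

The BMI values 25 and 30 below are chosen only for illustration:

\[ \begin{aligned} \text{bmi} = 25: \quad & 85.99432 + 1.798533 \times 25 \approx 130.96\\ \text{bmi} = 30: \quad & 85.99432 + 1.798533 \times 30 \approx 139.95 \end{aligned} \]

The difference between the two predictions is \(5 \times 1.798533 \approx 8.99\), that is, five times the slope.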

import delimited framingham_clean

regress sysbp bmi age diabp heartrate glucose cigsperday totchol i.male i.education i.currentsmoker i.prevalentstroke i.prevalenthyp i.bpmeds i.diabetes, allbase
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(16, 3639)     =    629.36
       Model |  1310373.33        16   81898.333   Prob > F        =    0.0000
    Residual |  473544.743     3,639   130.13046   R-squared       =    0.7345
-------------+----------------------------------   Adj R-squared   =    0.7334
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.407

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         bmi |  -.0248742   .0518146    -0.48   0.631    -.1264628    .0767143
         age |   .4283898   .0250708    17.09   0.000     .3792356     .477544
       diabp |    1.04679   .0210667    49.69   0.000     1.005486    1.088093
   heartrate |   .0474775   .0164587     2.88   0.004     .0152083    .0797467
     glucose |   .0464323   .0100716     4.61   0.000     .0266858    .0661788
  cigsperday |    .012802    .026168     0.49   0.625    -.0385034    .0641073
     totchol |   .0085838   .0045199     1.90   0.058    -.0002779    .0174455
             |
        male |
          0  |          0  (base)
          1  |  -2.960022   .4172047    -7.09   0.000    -3.778001   -2.142044
             |
   education |
          1  |          0  (base)
          2  |  -.6177673   .4674476    -1.32   0.186    -1.534253     .298718
          3  |   -1.62405   .5581798    -2.91   0.004    -2.718427    -.529674
          4  |  -2.599828   .6391894    -4.07   0.000    -3.853033   -1.346623
             |
currentsmo~r |
          0  |          0  (base)
          1  |   .3914087   .6068143     0.65   0.519    -.7983212    1.581138
             |
prevalents~e |
          0  |          0  (base)
          1  |  -1.355887   2.518326    -0.54   0.590    -6.293358    3.581584
             |
prevalenthyp |
          0  |          0  (base)
          1  |   12.90949   .5428089    23.78   0.000     11.84525    13.97373
             |
      bpmeds |
          0  |          0  (base)
          1  |    7.41619   1.152858     6.43   0.000     5.155878    9.676501
             |
    diabetes |
          0  |          0  (base)
          1  |  -.2279949   1.477869    -0.15   0.877    -3.125529    2.669539
             |
       _cons |   13.08773   2.608305     5.02   0.000     7.973844    18.20161
------------------------------------------------------------------------------
Note

Interpretation

  • Check the number of observations (Number of obs). The data has 3656 rows and the regression model uses all of them, so there is no missingness in our data.
  • The F-test that follows (F(16, 3639) = 629.36, Prob > F = 0.0000, where 16 and 3639 are the degrees of freedom) tests the null hypothesis that all slope coefficients are 0. If this test failed to reject, we would conclude that the model captures no relationship between the predictors and the outcome. Here the null hypothesis is rejected, so we conclude that at least one of the coefficients is not equal to zero. The model is globally significant.
  • The \(R^2\) (R-squared) is a measure of model fit: the proportion of the variation in the response that is explained by the linear relationship with the predictors. For this model \(R^2= 0.7345\), meaning that \(73.45\%\) of the variability in systolic blood pressure among the participants can be explained by the variables included in my model.
  • Mathematically, adding a new predictor to the model never decreases the \(R^2\), regardless of how useless the variable is. This makes \(R^2\) poor for model comparison, as it would always select the model with the most predictors. The adjusted \(R^2\) (“Adj R-squared”) accounts for this by penalizing the \(R^2\) for the number of predictors in the model. For this model \(Adjusted~ R^2 = 0.7334\), which is only slightly lower than \(R^2\), so the penalty for predictors that contribute little explanatory power is small.
  • The root mean squared error (Root MSE, also known as RMSE) measures the typical difference between the observed outcome (systolic blood pressure) and the predicted outcome. For this model \(Root ~MSE = 11.407\), so the typical prediction error is about 11.407 in the units of systolic blood pressure.
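The fit statistics discussed above can be recomputed by hand from the ANOVA part of the output. The sketch below is only an arithmetic check: the sums of squares and degrees of freedom are copied from the full-model output, and the scalar names are illustrative choices, not part of the original analysis.

```stata
* Recompute R-squared, adjusted R-squared, Root MSE and F from the ANOVA table.
* All inputs are copied from the full-model regression output.
scalar ss_model = 1310373.33
scalar ss_resid = 473544.743
scalar ss_total = 1783918.07
scalar df_model = 16
scalar df_resid = 3639
scalar df_total = 3655

scalar r2     = ss_model / ss_total
scalar r2_adj = 1 - (ss_resid / df_resid) / (ss_total / df_total)
scalar rmse   = sqrt(ss_resid / df_resid)
scalar f_stat = (ss_model / df_model) / (ss_resid / df_resid)

display "R-squared     = " %6.4f r2
display "Adj R-squared = " %6.4f r2_adj
display "Root MSE      = " %6.3f rmse
display "F(16, 3639)   = " %8.2f f_stat
display "Prob > F      = " %6.4f Ftail(df_model, df_resid, f_stat)
```

The adjusted \(R^2\) replaces the sums of squares with mean squares, which is where the penalty for the number of predictors enters.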

Hence, for the research question at hand, which is to model systolic blood pressure based on demographic, lifestyle, and clinical variables, the \(R^2\) and adjusted \(R^2\) are both about 0.73. This suggests that the model fits the data reasonably well and that, taken together, the variables explain a substantial share of the variation in systolic blood pressure in this sample.

4. Variable selection (10 marks)

Note

a) method using stepwise backward elimination

Backward elimination begins with a model that includes all candidate variables. Variables are then deleted from the model one at a time until every variable remaining in the model meets the retention criterion (here a Wald-test p-value below 0.1, set by pr(0.1)). At each step, the variable that contributes least to the model, that is, the one with the largest p-value, is deleted. Once a variable is deleted, it cannot re-enter the model.

*Method 1*
import delimited framingham_clean
stepwise, pr(0.1): regress sysbp i.male age i.currentsmoker i.education cigsperday i.bpmeds i.prevalentstroke i.prevalenthyp i.diabetes totchol diabp bmi heartrate glucose,allbase
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

note: 0b.male omitted because of estimability.
note: 0b.currentsmoker omitted because of estimability.
note: 1b.education omitted because of estimability.
note: 0b.bpmeds omitted because of estimability.
note: 0b.prevalentstroke omitted because of estimability.
note: 0b.prevalenthyp omitted because of estimability.
note: 0b.diabetes omitted because of estimability.

Wald test, begin with full model:
p = 0.8774 >= 0.1000, removing 1.diabetes
p = 0.6270 >= 0.1000, removing bmi
p = 0.6305 >= 0.1000, removing cigsperday
p = 0.5851 >= 0.1000, removing 1.prevalentstroke
p = 0.2021 >= 0.1000, removing 2.education
p = 0.1128 >= 0.1000, removing 1.currentsmoker

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(10, 3645)     =   1006.77
       Model |  1309731.82        10  130973.182   Prob > F        =    0.0000
    Residual |  474186.256     3,645  130.092251   R-squared       =    0.7342
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.406

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.male |  -2.754165   .3882471    -7.09   0.000    -3.515368   -1.992962
         age |   .4274418   .0240442    17.78   0.000     .3803003    .4745833
     glucose |   .0447748   .0079966     5.60   0.000     .0290966     .060453
     totchol |   .0085681   .0045045     1.90   0.057    -.0002634    .0173997
             |
   education |
          3  |  -1.345866   .5165159    -2.61   0.009    -2.358555   -.3331771
          4  |  -2.329898   .6034925    -3.86   0.000    -3.513114   -1.146682
             |
   heartrate |   .0505673   .0163351     3.10   0.002     .0185406    .0825941
    1.bpmeds |   7.319757    1.14673     6.38   0.000      5.07146    9.568053
       diabp |   1.041687   .0203397    51.21   0.000     1.001809    1.081565
1.prevalen~p |   12.88386    .541168    23.81   0.000     11.82284    13.94488
       _cons |   12.77098   2.327819     5.49   0.000     8.207023    17.33494
------------------------------------------------------------------------------
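In the output above, backward elimination removed 2.education on its own while keeping levels 3 and 4, because each indicator of a factor variable is treated as a separate term. A possible alternative is sketched below. It assumes that framingham_clean.csv is in the working directory and that stepwise accepts a factor-variable term inside its documented parentheses grouping syntax, so this is an untested variation and not the analysis reported here.

```stata
* Sketch only: assumes framingham_clean.csv is available, as in the commands above,
* and that stepwise accepts a factor-variable term inside parentheses.
import delimited framingham_clean, clear

* (i.education) asks stepwise to keep or drop all education levels together.
stepwise, pr(0.1): regress sysbp i.male age i.currentsmoker (i.education) ///
    cigsperday i.bpmeds i.prevalentstroke i.prevalenthyp i.diabetes ///
    totchol diabp bmi heartrate glucose
```

With this grouping, education would be tested as a single term with a joint Wald test, so the selected model may differ from the one shown above.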
Note

Method 2: using best subset selection method

The basic idea of the all possible subsets approach is to run every possible combination of the predictors to find the best subset to meet some pre-defined objective criteria such as \(C_p\) and adjusted \(R^2\).

*Method 2*
import delimited framingham_clean
ssc install gvselect, replace
gvselect <term> male age education currentsmoker cigsperday bpmeds prevalentstroke prevalenthyp diabetes totchol diabp bmi heartrate glucose, nmodels(1): regress sysbp <term>
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

checking gvselect consistency and verifying not already installed...
all files already exist and are up to date.


Optimal models: 

   # Preds        LL       AIC       BIC
         1 -14739.67  29483.34  29495.75
         2 -14353.41  28712.82  28731.43
         3 -14169.83  28347.66  28372.48
         4 -14136.63  28283.26  28314.28
         5 -14116.25   28244.5  28281.72
         6 -14098.56  28211.12  28254.55
         7 -14087.98  28191.97   28241.6
         8 -14082.63  28183.26   28239.1
         9 -14080.72  28181.44  28243.48
        10 -14079.38  28180.76     28249
        11 -14079.23  28182.46  28256.91
        12  -14079.1  28184.19  28264.84
        13 -14078.97  28185.95  28272.81
        14 -14078.96  28187.92  28280.99

predictors for each model:

1 : diabp
2 : prevalenthyp diabp
3 : age prevalenthyp diabp
4 : male age prevalenthyp diabp
5 : male age bpmeds prevalenthyp diabp
6 : male age bpmeds prevalenthyp diabp glucose
7 : male age education bpmeds prevalenthyp diabp glucose
8 : male age education bpmeds prevalenthyp diabp heartrate glucose
9 : male age education bpmeds prevalenthyp totchol diabp heartrate glucose
10 : male age education currentsmoker bpmeds prevalenthyp totchol diabp
    heartrate glucose
11 : male age education currentsmoker bpmeds prevalentstroke prevalenthyp
    totchol diabp heartrate glucose
12 : male age education currentsmoker bpmeds prevalentstroke prevalenthyp
    totchol diabp bmi heartrate glucose
13 : male age education currentsmoker cigsperday bpmeds prevalentstroke
    prevalenthyp totchol diabp bmi heartrate glucose
14 : male age education currentsmoker cigsperday bpmeds prevalentstroke
    prevalenthyp diabetes totchol diabp bmi heartrate glucose
Comments based on best subset selection
  • Focusing on the Bayesian Information Criterion (BIC) from the best subset selection, the model with 8 predictors appears to fit best: its BIC = 28239.1 is the lowest among all candidate subsets, including the model with all 14 predictors (BIC = 28280.99).
  • The 8-predictor model has only the fourth-lowest AIC, but its AIC differs only marginally from those of the models with a lower AIC.
  • The model with 10 predictors performs best when the focus is on the Akaike Information Criterion (\(AIC=28180.76\)):
  1. \(\Delta AIC = 2.50\) for 8 predictors vs 10
  2. \(\Delta AIC = 1.82\) for 8 predictors vs 9
  • For these reasons, the 8-predictor and 10-predictor models will be compared on other metrics, both with each other and with the full model.
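The AIC and BIC values in the best subset output can be recomputed from the reported log likelihoods using \(AIC = -2LL + 2k\) and \(BIC = -2LL + k\ln(n)\). The sketch below is an arithmetic check only. It assumes that k counts the predictors plus the constant, which is the value that reproduces the table, and the scalar names are illustrative.

```stata
* Information criteria recomputed from the log likelihoods reported by gvselect.
* k = number of predictors + 1 (constant); n = number of observations.
scalar n = 3656

scalar ll8  = -14082.63
scalar ll9  = -14080.72
scalar ll10 = -14079.38

scalar aic8  = -2*ll8  + 2*9
scalar aic9  = -2*ll9  + 2*10
scalar aic10 = -2*ll10 + 2*11
scalar bic8  = -2*ll8  + 9*ln(n)
scalar bic10 = -2*ll10 + 11*ln(n)

display "AIC:  8 preds = " %9.2f aic8 "   9 preds = " %9.2f aic9 "   10 preds = " %9.2f aic10
display "BIC:  8 preds = " %9.2f bic8 "   10 preds = " %9.2f bic10
display "Delta AIC (8 vs 10) = " %5.2f aic8 - aic10
display "Delta AIC (8 vs 9)  = " %5.2f aic8 - aic9
display "Delta BIC (10 vs 8) = " %5.2f bic10 - bic8
```

Because \(\ln(3656) \approx 8.2\), each extra coefficient costs about 8.2 points of BIC but only 2 points of AIC, which is why BIC prefers the smaller model.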

Comments based on stepwise Regression

  • Stepwise regression (backward elimination) favors a model with 9 predictors.

The model from the stepwise regression, together with the 8- and 10-predictor models from best subset selection, will be compared further.

Best subsets VS stepwise VS Full model

import delimited framingham_clean
regress sysbp bmi age diabp heartrate glucose cigsperday totchol i.male i.education i.currentsmoker i.prevalentstroke i.prevalenthyp i.bpmeds i.diabetes, allbase

est store ModelA14

regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose
est store ModelA8

regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol i.currentsmoker

est store ModelA10

stepwise, pr(0.1): regress sysbp i.male age i.currentsmoker i.education cigsperday i.bpmeds i.prevalentstroke i.prevalenthyp i.diabetes totchol diabp bmi heartrate glucose,allbase

est store Modelstep

estout ModelA14 ModelA10 Modelstep ModelA8 , stats(r2 r2_a bic aic)
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(16, 3639)     =    629.36
       Model |  1310373.33        16   81898.333   Prob > F        =    0.0000
    Residual |  473544.743     3,639   130.13046   R-squared       =    0.7345
-------------+----------------------------------   Adj R-squared   =    0.7334
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.407

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         bmi |  -.0248742   .0518146    -0.48   0.631    -.1264628    .0767143
         age |   .4283898   .0250708    17.09   0.000     .3792356     .477544
       diabp |    1.04679   .0210667    49.69   0.000     1.005486    1.088093
   heartrate |   .0474775   .0164587     2.88   0.004     .0152083    .0797467
     glucose |   .0464323   .0100716     4.61   0.000     .0266858    .0661788
  cigsperday |    .012802    .026168     0.49   0.625    -.0385034    .0641073
     totchol |   .0085838   .0045199     1.90   0.058    -.0002779    .0174455
             |
        male |
          0  |          0  (base)
          1  |  -2.960022   .4172047    -7.09   0.000    -3.778001   -2.142044
             |
   education |
          1  |          0  (base)
          2  |  -.6177673   .4674476    -1.32   0.186    -1.534253     .298718
          3  |   -1.62405   .5581798    -2.91   0.004    -2.718427    -.529674
          4  |  -2.599828   .6391894    -4.07   0.000    -3.853033   -1.346623
             |
currentsmo~r |
          0  |          0  (base)
          1  |   .3914087   .6068143     0.65   0.519    -.7983212    1.581138
             |
prevalents~e |
          0  |          0  (base)
          1  |  -1.355887   2.518326    -0.54   0.590    -6.293358    3.581584
             |
prevalenthyp |
          0  |          0  (base)
          1  |   12.90949   .5428089    23.78   0.000     11.84525    13.97373
             |
      bpmeds |
          0  |          0  (base)
          1  |    7.41619   1.152858     6.43   0.000     5.155878    9.676501
             |
    diabetes |
          0  |          0  (base)
          1  |  -.2279949   1.477869    -0.15   0.877    -3.125529    2.669539
             |
       _cons |   13.08773   2.608305     5.02   0.000     7.973844    18.20161
------------------------------------------------------------------------------

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(10, 3645)     =   1005.93
       Model |  1309440.99        10  130944.099   Prob > F        =    0.0000
    Residual |  474477.078     3,645  130.172038   R-squared       =    0.7340
-------------+----------------------------------   Adj R-squared   =    0.7333
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.409

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.male |  -2.823336   .3881123    -7.27   0.000    -3.584275   -2.062398
         age |   .4318045   .0239838    18.00   0.000     .3847815    .4788275
             |
   education |
          2  |  -.5455734    .464128    -1.18   0.240     -1.45555    .3644029
          3  |  -1.558717   .5544055    -2.81   0.005    -2.645692   -.4717407
          4  |  -2.539784   .6359611    -3.99   0.000    -3.786659    -1.29291
             |
    1.bpmeds |   7.428462   1.146756     6.48   0.000     5.180114    9.676809
1.prevalen~p |   12.89554   .5413036    23.82   0.000     11.83426    13.95683
       diabp |   1.044746   .0202756    51.53   0.000     1.004993    1.084498
   heartrate |   .0528364   .0163023     3.24   0.001     .0208738     .084799
     glucose |   .0447871   .0079992     5.60   0.000     .0291038    .0604705
       _cons |    14.4055   2.306565     6.25   0.000     9.883218    18.92779
------------------------------------------------------------------------------

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(12, 3643)     =    839.82
       Model |  1310270.63        12  109189.219   Prob > F        =    0.0000
    Residual |  473647.445     3,643  130.015768   R-squared       =    0.7345
-------------+----------------------------------   Adj R-squared   =    0.7336
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.402

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.male |   -2.92213   .3987409    -7.33   0.000    -3.703907   -2.140352
         age |   .4276106   .0249925    17.11   0.000     .3786098    .4766113
             |
   education |
          2  |  -.5922715   .4642179    -1.28   0.202    -1.502424    .3178813
          3  |  -1.592616   .5543974    -2.87   0.004    -2.679576   -.5056559
          4  |  -2.567674   .6359235    -4.04   0.000    -3.814476   -1.320873
             |
    1.bpmeds |   7.345691   1.146728     6.41   0.000     5.097398    9.593985
1.prevalen~p |   12.88502   .5410092    23.82   0.000     11.82431    13.94573
       diabp |    1.04436   .0204142    51.16   0.000     1.004336    1.084384
   heartrate |   .0481429   .0164055     2.93   0.003      .015978    .0803077
     glucose |    .045165   .0080001     5.65   0.000     .0294798    .0608502
     totchol |   .0085569   .0045069     1.90   0.058    -.0002793    .0173931
1.currents~r |    .644381   .3984548     1.62   0.106    -.1368357    1.425598
       _cons |   12.70032   2.402614     5.29   0.000     7.989722    17.41093
------------------------------------------------------------------------------


note: 0b.male omitted because of estimability.
note: 0b.currentsmoker omitted because of estimability.
note: 1b.education omitted because of estimability.
note: 0b.bpmeds omitted because of estimability.
note: 0b.prevalentstroke omitted because of estimability.
note: 0b.prevalenthyp omitted because of estimability.
note: 0b.diabetes omitted because of estimability.

Wald test, begin with full model:
p = 0.8774 >= 0.1000, removing 1.diabetes
p = 0.6270 >= 0.1000, removing bmi
p = 0.6305 >= 0.1000, removing cigsperday
p = 0.5851 >= 0.1000, removing 1.prevalentstroke
p = 0.2021 >= 0.1000, removing 2.education
p = 0.1128 >= 0.1000, removing 1.currentsmoker

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(10, 3645)     =   1006.77
       Model |  1309731.82        10  130973.182   Prob > F        =    0.0000
    Residual |  474186.256     3,645  130.092251   R-squared       =    0.7342
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.406

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.male |  -2.754165   .3882471    -7.09   0.000    -3.515368   -1.992962
         age |   .4274418   .0240442    17.78   0.000     .3803003    .4745833
     glucose |   .0447748   .0079966     5.60   0.000     .0290966     .060453
     totchol |   .0085681   .0045045     1.90   0.057    -.0002634    .0173997
             |
   education |
          3  |  -1.345866   .5165159    -2.61   0.009    -2.358555   -.3331771
          4  |  -2.329898   .6034925    -3.86   0.000    -3.513114   -1.146682
             |
   heartrate |   .0505673   .0163351     3.10   0.002     .0185406    .0825941
    1.bpmeds |   7.319757    1.14673     6.38   0.000      5.07146    9.568053
       diabp |   1.041687   .0203397    51.21   0.000     1.001809    1.081565
1.prevalen~p |   12.88386    .541168    23.81   0.000     11.82284    13.94488
       _cons |   12.77098   2.327819     5.49   0.000     8.207023    17.33494
------------------------------------------------------------------------------



----------------------------------------------------------------
                 ModelA14     ModelA10    Modelstep      ModelA8
                        b            b            b            b
----------------------------------------------------------------
bmi             -.0248742                                       
age              .4283898     .4276106     .4274418     .4318045
diabp             1.04679      1.04436     1.041687     1.044746
heartrate        .0474775     .0481429     .0505673     .0528364
glucose          .0464323      .045165     .0447748     .0447871
cigsperday        .012802                                       
totchol          .0085838     .0085569     .0085681             
0.male                  0            0                         0
1.male          -2.960022     -2.92213    -2.754165    -2.823336
1.education             0            0                         0
2.education     -.6177673    -.5922715                 -.5455734
3.education      -1.62405    -1.592616    -1.345866    -1.558717
4.education     -2.599828    -2.567674    -2.329898    -2.539784
0.currents~r            0            0                          
1.currents~r     .3914087      .644381                          
0.prevalen~e            0                                       
1.prevalen~e    -1.355887                                       
0.prevalen~p            0            0                         0
1.prevalen~p     12.90949     12.88502     12.88386     12.89554
0.bpmeds                0            0                         0
1.bpmeds          7.41619     7.345691     7.319757     7.428462
0.diabetes              0                                       
1.diabetes      -.2279949                                       
_cons            13.08773     12.70032     12.77098      14.4055
----------------------------------------------------------------
r2               .7345479     .7344904     .7341883     .7340253
r2_a             .7333808     .7336158     .7334591     .7332956
bic              28297.08     28265.06     28252.81     28255.05
aic              28191.61      28184.4     28184.56      28186.8
----------------------------------------------------------------
Comments
  • The table above compares
    1. the full model with all predictors (ModelA14)
    2. the model resulting from backward elimination (Modelstep)
    3. two models from best subset selection, chosen for the lowest AIC (ModelA10) and the lowest BIC (ModelA8)
  • Modelstep (the model from stepwise regression) appears to perform best overall.
  • The stepwise model has the lowest \(BIC=28252.81\). The margin is small relative to ModelA8 (28255.05, a difference of 2.24) but larger relative to ModelA10 (28265.06) and the full model (28297.08).
  • Although the model with 10 predictors has the lowest \(AIC=28184.4\), the AIC of the stepwise model (28184.56) is only 0.16 higher, a negligible difference. The stepwise model is therefore best on \(BIC\) and effectively tied for best on \(AIC\).
  • The adjusted \(R^2\) values of all four models are almost the same, with small marginal differences (all \(adj.R^2 \approx 0.73\)).
  • The best model is therefore the optimal model from stepwise regression.
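The AIC and BIC rows of the comparison table can be checked from the residual sums of squares alone, because for a linear regression with normal errors \(-2LL = n\,[\ln(2\pi) + \ln(RSS/n) + 1]\). The sketch below does this for Modelstep and ModelA8. The RSS values are copied from the output above, k is taken as the model degrees of freedom plus one for the constant, and the scalar names are illustrative.

```stata
* Recover AIC and BIC from the residual sum of squares (RSS) of each model.
*   -2*LL = n*(ln(2*pi) + ln(RSS/n) + 1)
* Both models have 10 model df, so k = 11 coefficients including the constant.
scalar n = 3656

scalar rss_step = 474186.256    // Modelstep
scalar rss_a8   = 474477.078    // ModelA8

scalar m2ll_step = n*(ln(2*_pi) + ln(rss_step/n) + 1)
scalar m2ll_a8   = n*(ln(2*_pi) + ln(rss_a8/n) + 1)

scalar aic_step = m2ll_step + 2*11
scalar bic_step = m2ll_step + 11*ln(n)
scalar aic_a8   = m2ll_a8 + 2*11
scalar bic_a8   = m2ll_a8 + 11*ln(n)

display "Modelstep: AIC = " %9.2f aic_step "   BIC = " %9.2f bic_step
display "ModelA8:   AIC = " %9.2f aic_a8   "   BIC = " %9.2f bic_a8
display "BIC difference (ModelA8 - Modelstep) = " %5.2f bic_a8 - bic_step
```

Since the two models estimate the same number of coefficients, their AIC and BIC differences are identical and come entirely from the difference in RSS.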

5. Model diagnostics

Linearity

import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol 
rvfplot, yline(0) title("Residual vs Fitted Values")
 
estat hettest
graph export linear.png ,replace
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(11, 3644)     =    915.52
       Model |  1309930.59        11  119084.599   Prob > F        =    0.0000
    Residual |  473987.479     3,644  130.073403   R-squared       =    0.7343
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.405

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.male |  -2.777589   .3886811    -7.15   0.000    -3.539643   -2.015535
         age |   .4207818   .0246387    17.08   0.000     .3724748    .4690888
             |
   education |
          2  |  -.5738198   .4641805    -1.24   0.216    -1.483899    .3362595
          3  |  -1.595382   .5545176    -2.88   0.004    -2.682577    -.508186
          4  |  -2.578322   .6360304    -4.05   0.000    -3.825333   -1.331312
             |
    1.bpmeds |   7.353518   1.146972     6.41   0.000     5.104746    9.602289
1.prevalen~p |   12.88441    .541129    23.81   0.000     11.82346    13.94535
       diabp |   1.041448   .0203391    51.20   0.000     1.001571    1.081325
   heartrate |   .0506771   .0163341     3.10   0.002     .0186521     .082702
     glucose |   .0446825   .0079963     5.59   0.000     .0290047    .0603602
     totchol |   .0087429   .0045064     1.94   0.052    -.0000924    .0175782
       _cons |   13.33082   2.371297     5.62   0.000     8.681618    17.98002
------------------------------------------------------------------------------



Breusch–Pagan/Cook–Weisberg test for heteroskedasticity 
Assumption: Normal error terms
Variable: Fitted values of sysbp

H0: Constant variance

    chi2(1) = 526.41
Prob > chi2 = 0.0000

file linear.png saved as PNG format

Note
  • The points are randomly scattered around the zero line and hence do not indicate any strong departure from linearity.
  • The Breusch-Pagan test for heteroskedasticity gives \(\chi^2(1) = 526.41\) (\(p<0.001\)), so the null hypothesis of constant variance is rejected, suggesting that our residuals are heteroskedastic.
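One common response to heteroskedastic residuals is to keep the same model but report heteroskedasticity-robust standard errors. The sketch below shows how that could be done. It assumes framingham_clean.csv is in the working directory, as in the commands above, and it is a suggested follow-up rather than part of the reported analysis.

```stata
* Sketch only: same model as above, refitted with robust (Huber-White) standard errors.
import delimited framingham_clean, clear

regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate ///
    glucose totchol, vce(robust)
```

The coefficients are unchanged by vce(robust). Only the standard errors, t statistics, p-values and confidence intervals are recomputed.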

Normality

import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol 
predict resid, residual 

histogram resid, normal title("Histogram of Residuals with Normal Curve")

graph export hist.png , replace
qnorm resid, title("Normal Q-Q Plot of Residuals")

graph export normality.png , replace
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(11, 3644)     =    915.52
       Model |  1309930.59        11  119084.599   Prob > F        =    0.0000
    Residual |  473987.479     3,644  130.073403   R-squared       =    0.7343
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.405

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      1.male |  -2.777589   .3886811    -7.15   0.000    -3.539643   -2.015535
         age |   .4207818   .0246387    17.08   0.000     .3724748    .4690888
             |
   education |
          2  |  -.5738198   .4641805    -1.24   0.216    -1.483899    .3362595
          3  |  -1.595382   .5545176    -2.88   0.004    -2.682577    -.508186
          4  |  -2.578322   .6360304    -4.05   0.000    -3.825333   -1.331312
             |
    1.bpmeds |   7.353518   1.146972     6.41   0.000     5.104746    9.602289
1.prevalen~p |   12.88441    .541129    23.81   0.000     11.82346    13.94535
       diabp |   1.041448   .0203391    51.20   0.000     1.001571    1.081325
   heartrate |   .0506771   .0163341     3.10   0.002     .0186521     .082702
     glucose |   .0446825   .0079963     5.59   0.000     .0290047    .0603602
     totchol |   .0087429   .0045064     1.94   0.052    -.0000924    .0175782
       _cons |   13.33082   2.371297     5.62   0.000     8.681618    17.98002
------------------------------------------------------------------------------


(bin=35, start=-41.595417, width=3.7957909)

file hist.png saved as PNG format


file normality.png saved as PNG format

Note
  • The histogram shows that the residuals do not deviate much from normality.
  • The normal quantile-quantile plot, however, suggests some deviation from normality, indicating that a transformation of the outcome might be worth considering.
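The graphical checks above could be complemented with a formal test and with a trial transformation. The sketch below is one possible way to do this. It assumes framingham_clean.csv is in the working directory, and the variable names resid, ln_sysbp and resid_ln are illustrative. sktest and sfrancia are used because swilk is limited to at most 2,000 observations, fewer than the 3,656 here.

```stata
* Sketch only: formal normality tests on the residuals of the final model.
import delimited framingham_clean, clear

regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol
predict resid, residuals

sktest resid      // skewness and kurtosis test
sfrancia resid    // Shapiro-Francia test (allows up to 5,000 observations)

* Trial transformation: refit on the log of the outcome and re-check the Q-Q plot.
gen ln_sysbp = ln(sysbp)
regress ln_sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol
predict resid_ln, residuals
qnorm resid_ln, title("Normal Q-Q Plot of Residuals (log outcome)")
```

With a sample this large, formal tests tend to reject normality even for small departures, so the plots remain the more informative guide.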

(c) Test for multicollinearity using Variance Inflation Factor (VIF)

import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol , allbase
vif
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(11, 3644)     =    915.52
       Model |  1309930.59        11  119084.599   Prob > F        =    0.0000
    Residual |  473987.479     3,644  130.073403   R-squared       =    0.7343
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.405

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        male |
          0  |          0  (base)
          1  |  -2.777589   .3886811    -7.15   0.000    -3.539643   -2.015535
             |
         age |   .4207818   .0246387    17.08   0.000     .3724748    .4690888
             |
   education |
          1  |          0  (base)
          2  |  -.5738198   .4641805    -1.24   0.216    -1.483899    .3362595
          3  |  -1.595382   .5545176    -2.88   0.004    -2.682577    -.508186
          4  |  -2.578322   .6360304    -4.05   0.000    -3.825333   -1.331312
             |
      bpmeds |
          0  |          0  (base)
          1  |   7.353518   1.146972     6.41   0.000     5.104746    9.602289
             |
prevalenthyp |
          0  |          0  (base)
          1  |   12.88441    .541129    23.81   0.000     11.82346    13.94535
             |
       diabp |   1.041448   .0203391    51.20   0.000     1.001571    1.081325
   heartrate |   .0506771   .0163341     3.10   0.002     .0186521     .082702
     glucose |   .0446825   .0079963     5.59   0.000     .0290047    .0603602
     totchol |   .0087429   .0045064     1.94   0.052    -.0000924    .0175782
       _cons |   13.33082   2.371297     5.62   0.000     8.681618    17.98002
------------------------------------------------------------------------------

    Variable |       VIF       1/VIF  
-------------+----------------------
      1.male |      1.05    0.954127
         age |      1.25    0.799839
   education |
          2  |      1.27    0.784590
          3  |      1.20    0.836742
          4  |      1.16    0.859593
    1.bpmeds |      1.09    0.918649
1.prevalen~p |      1.77    0.566482
       diabp |      1.67    0.599928
   heartrate |      1.08    0.928929
     glucose |      1.03    0.973544
     totchol |      1.11    0.901242
-------------+----------------------
    Mean VIF |      1.24
Note
  • Based on the Stata output, all the variables have a VIF below 5 (the largest is 1.77 and the mean is 1.24), so there is no evidence of problematic multicollinearity and no need to adjust or remove any variables.
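A VIF is defined as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors. The sketch below shows this for diabp as a check on the table above. It assumes framingham_clean.csv is in the working directory and is an illustration of the definition, not an extra analysis from the report.

```stata
* Sketch only: reproduce the VIF of diabp by hand from its auxiliary regression.
import delimited framingham_clean, clear

* Regress diabp on the remaining predictors of the final model.
regress diabp i.male age i.education i.bpmeds i.prevalenthyp heartrate glucose totchol

display "R-squared of auxiliary regression = " %6.4f e(r2)
display "VIF for diabp = " %5.2f 1/(1 - e(r2))
```

The result should match the value of 1.67 that vif reports for diabp, with the 1/VIF column equal to \(1-R_j^2\).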

(d) Identify any influential observations (e.g., using Cook’s Distance), and discuss their impact

import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol , allbase
predict cookd, cooksd
gen obs = _n
gen threshold = 4/_N
twoway (scatter cookd obs) ///
       (line threshold obs, lcolor(red) lpattern(dash)), ///
       title("Cook's Distance Plot") ///
       ytitle("Cook's Distance") xtitle("Observation Number")
graph export cooks.png , replace
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(11, 3644)     =    915.52
       Model |  1309930.59        11  119084.599   Prob > F        =    0.0000
    Residual |  473987.479     3,644  130.073403   R-squared       =    0.7343
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.405

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        male |
          0  |          0  (base)
          1  |  -2.777589   .3886811    -7.15   0.000    -3.539643   -2.015535
             |
         age |   .4207818   .0246387    17.08   0.000     .3724748    .4690888
             |
   education |
          1  |          0  (base)
          2  |  -.5738198   .4641805    -1.24   0.216    -1.483899    .3362595
          3  |  -1.595382   .5545176    -2.88   0.004    -2.682577    -.508186
          4  |  -2.578322   .6360304    -4.05   0.000    -3.825333   -1.331312
             |
      bpmeds |
          0  |          0  (base)
          1  |   7.353518   1.146972     6.41   0.000     5.104746    9.602289
             |
prevalenthyp |
          0  |          0  (base)
          1  |   12.88441    .541129    23.81   0.000     11.82346    13.94535
             |
       diabp |   1.041448   .0203391    51.20   0.000     1.001571    1.081325
   heartrate |   .0506771   .0163341     3.10   0.002     .0186521     .082702
     glucose |   .0446825   .0079963     5.59   0.000     .0290047    .0603602
     totchol |   .0087429   .0045064     1.94   0.052    -.0000924    .0175782
       _cons |   13.33082   2.371297     5.62   0.000     8.681618    17.98002
------------------------------------------------------------------------------





file cooks.png saved as PNG format

Interpretation
  • In the Cook’s Distance plot, the red dashed line marks the common influence threshold of \(4/n\), which for \(n = 3656\) is about 0.0011 (the value of threshold generated in the code above).
  • Most of the observations (blue points) lie below this line, suggesting that the majority have minimal influence on the regression model.
  • Note also that none of the Cook’s D values approaches or exceeds 1, implying that there are no highly influential observations in the model and sample.
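To discuss the impact of the flagged observations more concretely, one could count how many exceed each cut-off and refit the model without them. The sketch below is one way to do that. It assumes framingham_clean.csv is in the working directory, and the scalar name cutoff is illustrative. No counts are quoted here because they are not shown in the output above.

```stata
* Sketch only: count influential observations and run a sensitivity refit.
import delimited framingham_clean, clear

regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol
predict cookd, cooksd

* Store the 4/n cut-off before the next estimation replaces e(N).
scalar cutoff = 4/e(N)
display "4/n threshold = " %8.6f scalar(cutoff)

count if cookd > scalar(cutoff) & !missing(cookd)   // above the 4/n line
count if cookd > 1 & !missing(cookd)                // above the stricter cut-off of 1

* Sensitivity check: refit without the observations above 4/n and compare coefficients.
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate ///
    glucose totchol if cookd <= scalar(cutoff)
```

If the coefficients and their significance change little between the two fits, the flagged observations have limited practical impact on the conclusions.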

Autocorrelation

import delimited framingham_clean
gen trend = _n
tsset trend

regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol , allbase

dwstat
estat bgodfrey
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)



Time variable: trend, 1 to 3656
        Delta: 1 unit

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(11, 3644)     =    915.52
       Model |  1309930.59        11  119084.599   Prob > F        =    0.0000
    Residual |  473987.479     3,644  130.073403   R-squared       =    0.7343
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.405

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        male |
          0  |          0  (base)
          1  |  -2.777589   .3886811    -7.15   0.000    -3.539643   -2.015535
             |
         age |   .4207818   .0246387    17.08   0.000     .3724748    .4690888
             |
   education |
          1  |          0  (base)
          2  |  -.5738198   .4641805    -1.24   0.216    -1.483899    .3362595
          3  |  -1.595382   .5545176    -2.88   0.004    -2.682577    -.508186
          4  |  -2.578322   .6360304    -4.05   0.000    -3.825333   -1.331312
             |
      bpmeds |
          0  |          0  (base)
          1  |   7.353518   1.146972     6.41   0.000     5.104746    9.602289
             |
prevalenthyp |
          0  |          0  (base)
          1  |   12.88441    .541129    23.81   0.000     11.82346    13.94535
             |
       diabp |   1.041448   .0203391    51.20   0.000     1.001571    1.081325
   heartrate |   .0506771   .0163341     3.10   0.002     .0186521     .082702
     glucose |   .0446825   .0079963     5.59   0.000     .0290047    .0603602
     totchol |   .0087429   .0045064     1.94   0.052    -.0000924    .0175782
       _cons |   13.33082   2.371297     5.62   0.000     8.681618    17.98002
------------------------------------------------------------------------------


Durbin–Watson d-statistic( 12,  3656) =  1.995538


Breusch–Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |          0.018               1                   0.8927
---------------------------------------------------------------------------
                        H0: no serial correlation
Note
  • The Durbin–Watson test gives a statistic of \(1.995538 \approx 2\), indicating no sign of positive or negative autocorrelation.
  • We further test for autocorrelation with the Breusch–Godfrey LM test. The associated \(p = 0.8927\) is above 0.05, so there is no evidence to reject the null hypothesis of no serial correlation. Thus there is no evidence of autocorrelation in our data.
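
As a rough check on why a value near 2 signals no autocorrelation, the standard large-sample approximation linking the Durbin–Watson statistic \(d\) to the first-order residual autocorrelation \(\hat{\rho}\) can be applied to the reported value (the symbol \(\hat{\rho}\) is introduced here for illustration):

\[
d \approx 2(1-\hat{\rho}) \quad\Longrightarrow\quad \hat{\rho} \approx 1 - \frac{d}{2} = 1 - \frac{1.995538}{2} \approx 0.002
\]

An implied first-order autocorrelation of about 0.002 is effectively zero, which agrees with the Breusch–Godfrey result.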

6 Interpretation and reflection

import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol , allbase
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(11, 3644)     =    915.52
       Model |  1309930.59        11  119084.599   Prob > F        =    0.0000
    Residual |  473987.479     3,644  130.073403   R-squared       =    0.7343
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.405

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        male |
          0  |          0  (base)
          1  |  -2.777589   .3886811    -7.15   0.000    -3.539643   -2.015535
             |
         age |   .4207818   .0246387    17.08   0.000     .3724748    .4690888
             |
   education |
          1  |          0  (base)
          2  |  -.5738198   .4641805    -1.24   0.216    -1.483899    .3362595
          3  |  -1.595382   .5545176    -2.88   0.004    -2.682577    -.508186
          4  |  -2.578322   .6360304    -4.05   0.000    -3.825333   -1.331312
             |
      bpmeds |
          0  |          0  (base)
          1  |   7.353518   1.146972     6.41   0.000     5.104746    9.602289
             |
prevalenthyp |
          0  |          0  (base)
          1  |   12.88441    .541129    23.81   0.000     11.82346    13.94535
             |
       diabp |   1.041448   .0203391    51.20   0.000     1.001571    1.081325
   heartrate |   .0506771   .0163341     3.10   0.002     .0186521     .082702
     glucose |   .0446825   .0079963     5.59   0.000     .0290047    .0603602
     totchol |   .0087429   .0045064     1.94   0.052    -.0000924    .0175782
       _cons |   13.33082   2.371297     5.62   0.000     8.681618    17.98002
------------------------------------------------------------------------------
Comments
  • Males have, on average, 2.777589 mmHg lower systolic blood pressure than females when adjusting for the other variables, and the result is statistically significant.
  • A 1-year increase in age is associated with a 0.4207818 mmHg increase in systolic blood pressure when adjusting for the other variables. The result is statistically significant at the 5% level of significance.
  • Going up the education level categories, systolic blood pressure decreases relative to the education level 1 baseline category when adjusting for the other variables. More precisely:
  1. Education level category 2 individuals have 0.5738198 mmHg lower systolic blood pressure than the baseline (education level 1), though the result is not statistically significant (\(p = 0.216\)).
  2. Education level category 3 individuals have 1.595382 mmHg lower systolic blood pressure than the baseline (education level 1), and the result is statistically significant.
  3. Education level category 4 individuals have 2.578322 mmHg lower systolic blood pressure than the baseline (education level 1), and the result is statistically significant.
  • When adjusting for the other variables, people who take blood pressure medication (bpmeds) have on average 7.353518 mmHg higher systolic blood pressure than those who do not, and the result is statistically significant.
  • When adjusting for the other variables, patients with prevalent hypertension have on average 12.88441 mmHg higher systolic blood pressure than those without, and the result is statistically significant.
  • A unit increase in diastolic blood pressure is associated with a significant 1.041448 mmHg increase in systolic blood pressure when adjusting for the other variables.
  • A unit increase in heart rate is associated with a significant 0.0506771 mmHg increase in systolic blood pressure when adjusting for the other variables.
  • A unit increase in glucose level is associated with a significant 0.0446825 mmHg increase in systolic blood pressure when adjusting for the other variables.
  • A unit increase in total cholesterol is associated with a 0.0087429 mmHg increase in systolic blood pressure when adjusting for the other variables, but the result is not statistically significant at the 5% level (\(p = 0.052\)).
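
The education comments above compare each category with the baseline one at a time. A possible follow-up, sketched here as an assumption and not part of the original analysis, is to test the education terms jointly and to compare two non-baseline levels directly:

```stata
* Sketch only: run immediately after the regress command above
* Joint F test that all education coefficients are zero
testparm i.education

* Difference between education levels 4 and 3, with its confidence interval
lincom 4.education - 3.education
```

The joint test answers whether education as a whole matters, which the individual t-tests against level 1 do not show.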

Section C

(a) Fit a direct model that resembles the final model on question 4, show the SEM diagram and results table (side by side with those for 4c). Comment on the similarities and differences in your result

  • First we create dummy variables for education, since it has more than 2 levels.
import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol, allbase


*Creating the dummy variables for education category*
tab education, gen(educationlevels)

*Fitting the model
sem (diabp -> sysbp, ) (educationlevels2 -> sysbp, ) (educationlevels3 -> sysbp, ) (educationlevels4 -> sysbp, ) (glucose -> sysbp, ) (heartrate -> sysbp, ) (bpmeds -> sysbp, ) (prevalenthyp -> sysbp, ) (male -> sysbp, ) (age -> sysbp, ) (totchol-> sysbp, ), covstructure(e._endogenous , unstructured) nocapslatent


*Education level 2 is not significant in this model (p-value = 0.216), so it is dropped from the final model in (b).

estat mindices
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)

      Source |       SS           df       MS      Number of obs   =     3,656
-------------+----------------------------------   F(11, 3644)     =    915.52
       Model |  1309930.59        11  119084.599   Prob > F        =    0.0000
    Residual |  473987.479     3,644  130.073403   R-squared       =    0.7343
-------------+----------------------------------   Adj R-squared   =    0.7335
       Total |  1783918.07     3,655   488.07608   Root MSE        =    11.405

------------------------------------------------------------------------------
       sysbp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        male |
          0  |          0  (base)
          1  |  -2.777589   .3886811    -7.15   0.000    -3.539643   -2.015535
             |
         age |   .4207818   .0246387    17.08   0.000     .3724748    .4690888
             |
   education |
          1  |          0  (base)
          2  |  -.5738198   .4641805    -1.24   0.216    -1.483899    .3362595
          3  |  -1.595382   .5545176    -2.88   0.004    -2.682577    -.508186
          4  |  -2.578322   .6360304    -4.05   0.000    -3.825333   -1.331312
             |
      bpmeds |
          0  |          0  (base)
          1  |   7.353518   1.146972     6.41   0.000     5.104746    9.602289
             |
prevalenthyp |
          0  |          0  (base)
          1  |   12.88441    .541129    23.81   0.000     11.82346    13.94535
             |
       diabp |   1.041448   .0203391    51.20   0.000     1.001571    1.081325
   heartrate |   .0506771   .0163341     3.10   0.002     .0186521     .082702
     glucose |   .0446825   .0079963     5.59   0.000     .0290047    .0603602
     totchol |   .0087429   .0045064     1.94   0.052    -.0000924    .0175782
       _cons |   13.33082   2.371297     5.62   0.000     8.681618    17.98002
------------------------------------------------------------------------------


  education |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,526       41.74       41.74
          2 |      1,101       30.11       71.85
          3 |        606       16.58       88.43
          4 |        423       11.57      100.00
------------+-----------------------------------
      Total |      3,656      100.00


Endogenous variables
  Observed: sysbp

Exogenous variables
  Observed: diabp educationlevels2 educationlevels3 educationlevels4 glucose
            heartrate bpmeds prevalenthyp male age totchol

Fitting target model:
Iteration 0:  Log likelihood = -98047.863  
Iteration 1:  Log likelihood = -98047.863  

Structural equation model                                Number of obs = 3,656
Estimation method: ml

Log likelihood = -98047.863

------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  sysbp      |
       diabp |   1.041448   .0203057    51.29   0.000      1.00165    1.081246
  educatio~2 |  -.5738198   .4634181    -1.24   0.216    -1.482103    .3344629
  educatio~3 |  -1.595382   .5536068    -2.88   0.004    -2.680431   -.5103322
  educatio~4 |  -2.578322   .6349857    -4.06   0.000    -3.822872   -1.333773
     glucose |   .0446825   .0079832     5.60   0.000     .0290357    .0603292
   heartrate |   .0506771   .0163073     3.11   0.002     .0187154    .0826387
      bpmeds |   7.353518   1.145089     6.42   0.000     5.109186     9.59785
  prevalen~p |   12.88441   .5402402    23.85   0.000     11.82555    13.94326
        male |  -2.777589   .3880427    -7.16   0.000    -3.538139   -2.017039
         age |   .4207818   .0245982    17.11   0.000     .3725701    .4689934
     totchol |   .0087429    .004499     1.94   0.052     -.000075    .0175607
       _cons |   13.33082   2.367402     5.63   0.000     8.690795    17.97084
-------------+----------------------------------------------------------------
 var(e.sysbp)|   129.6465   3.032303                      123.8374     135.728
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0) = 0.00                 Prob > chi2 = .

(no modification indices to report, all MI values less than 3.841458820694123)

Model Structural Output

Comments
  • The linear regression output and the structural equation model output had many similarities and a few small differences.
  • The estimated path coefficients in both outputs were identical for all variables, including the constant/intercept.
  • The z and t values differed only slightly (for example 51.29 versus 51.20 for diabp), because sem estimates by maximum likelihood and so reports marginally smaller standard errors than regress.
  • The significance of the variables at the 5% level was also identical.
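
The small gap between the two sets of standard errors can be reproduced from the reported output. The sketch below assumes the usual divisors, \(n-k\) for least squares and \(n\) for maximum likelihood, with \(n = 3656\) observations and \(k = 12\) estimated coefficients; RSS is the residual sum of squares from the regress table.

\[
\hat{\sigma}^2_{OLS} = \frac{RSS}{n-k} = \frac{473987.479}{3644} \approx 130.0734,
\qquad
\hat{\sigma}^2_{ML} = \frac{RSS}{n} = \frac{473987.479}{3656} \approx 129.6465
\]

\[
SE_{ML} = SE_{OLS}\sqrt{\frac{n-k}{n}} \quad\Longrightarrow\quad 0.0203391 \times \sqrt{\frac{3644}{3656}} \approx 0.0203391 \times 0.99836 \approx 0.0203057
\]

The first line matches var(e.sysbp) = 129.6465 in the sem output, and the second matches the sem standard error for diabp, so the two fits differ only in this scaling.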

(b) Work on improving the direct model by introducing some indirect pathways based on research knowledge of the field or suggested pathways from “estat mindices”. Display the final direct and indirect SEM diagram and explain your approach of the indirect pathways and/or correlations introduced. Hint: Do not make the modifications too complex, make a few alterations that help improve the model

import delimited framingham_clean


*Creating the dummy variables for education category since it has more than two levels
tab education, gen(educationlevels)

sem (prevalenthyp -> sysbp, ) (educationlevels3 -> sysbp, ) (educationlevels4 -> sysbp, ) (male -> sysbp, ) (glucose -> sysbp, ) (glucose -> prevalenthyp, ) (heartrate -> sysbp, ) (heartrate -> prevalenthyp, ) (diabp -> sysbp, ) (diabp -> prevalenthyp, ) (age -> sysbp, ) (age -> prevalenthyp, ) (bpmeds -> sysbp, ) (bpmeds -> prevalenthyp, ) (totchol -> sysbp, ), nocapslatent

estat gof, stats(all)  
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)


  education |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,526       41.74       41.74
          2 |      1,101       30.11       71.85
          3 |        606       16.58       88.43
          4 |        423       11.57      100.00
------------+-----------------------------------
      Total |      3,656      100.00


Endogenous variables
  Observed: prevalenthyp sysbp

Exogenous variables
  Observed: educationlevels3 educationlevels4 male glucose heartrate diabp
            age bpmeds totchol

Fitting target model:
Iteration 0:  Log likelihood = -96155.335  
Iteration 1:  Log likelihood = -96155.335  

Structural equation model                                Number of obs = 3,656
Estimation method: ml

Log likelihood = -96155.335

------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  prevalen~p |
     glucose |   .0003951   .0002444     1.62   0.106    -.0000839     .000874
   heartrate |   .0017932    .000492     3.64   0.000      .000829    .0027575
       diabp |   .0211249   .0005093    41.48   0.000     .0201267    .0221231
         age |   .0093578   .0006969    13.43   0.000     .0079918    .0107237
      bpmeds |   .3480831   .0344983    10.09   0.000     .2804678    .4156985
       _cons |  -2.082419   .0588731   -35.37   0.000    -2.197809    -1.96703
  -----------+----------------------------------------------------------------
  sysbp      |
  prevalen~p |   12.88386   .5403533    23.84   0.000     11.82479    13.94293
  educatio~3 |  -1.345866   .5157383    -2.61   0.009    -2.356694   -.3350375
  educatio~4 |  -2.329898   .6025839    -3.87   0.000    -3.510941   -1.148855
        male |  -2.754165   .3876626    -7.10   0.000     -3.51397    -1.99436
     glucose |   .0447748   .0079845     5.61   0.000     .0291254    .0604242
   heartrate |   .0505673   .0163105     3.10   0.002     .0185994    .0825353
       diabp |   1.041687    .020309    51.29   0.000     1.001882    1.081492
         age |   .4274418    .024008    17.80   0.000     .3803869    .4744967
      bpmeds |   7.319757   1.145004     6.39   0.000      5.07559    9.563923
     totchol |   .0085681   .0044977     1.91   0.057    -.0002472    .0173835
       _cons |   12.77098   2.324315     5.49   0.000     8.215408    17.32655
-------------+----------------------------------------------------------------
var(e.prev~p)|   .1216345   .0028449                      .1161845    .1273403
 var(e.sysbp)|   129.7008   3.033575                      123.8894    135.7849
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(4) = 4.01            Prob > chi2 = 0.4051


----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
          chi2_ms(4) |      4.007   model vs. saturated
            p > chi2 |      0.405
         chi2_bs(19) |   6921.819   baseline vs. saturated
            p > chi2 |      0.000
---------------------+------------------------------------------------------
Population error     |
               RMSEA |      0.001   Root mean squared error of approximation
 90% CI, lower bound |      0.000
         upper bound |      0.025
              pclose |      1.000   Probability RMSEA <= 0.05
---------------------+------------------------------------------------------
Information criteria |
                 AIC | 192348.671   Akaike's information criterion
                 BIC | 192466.549   Bayesian information criterion
---------------------+------------------------------------------------------
Baseline comparison  |
                 CFI |      1.000   Comparative fit index
                 TLI |      1.000   Tucker–Lewis index
---------------------+------------------------------------------------------
Size of residuals    |
                SRMR |      0.003   Standardized root mean squared residual
                  CD |      0.707   Coefficient of determination
----------------------------------------------------------------------------
Comments
  • Firstly, education level 2 was removed because it was not significant (\(p=0.216>0.05\)).

  • Running the estat mindices command in Stata on the initial direct model did not suggest any improvements, hence I used expert opinion and prior belief to create the indirect pathways.

  • The direct relationship between diastolic blood pressure and systolic blood pressure was maintained. This is supported both biologically and statistically, since diastolic blood pressure is known to be closely related to systolic blood pressure through shared cardiovascular risk factors.

  • Prevalent hypertension (prevalenthyp) was introduced as a key mediator, since individuals with prevalent hypertension often have elevated diastolic and systolic blood pressure.

Justification for the approach

  • The changes result in a more parsimonious model, since only a few justified changes were made, to avoid overfitting.

Model Structural Output

(c) Perform and comment on all five SEM model goodness of fit procedures and comment on how each performs based on your final SEM model.

Note

The following command was run in Stata to obtain the model goodness-of-fit indices

estat gof, stats(all) 

Comments

  1. Likelihood ratio test
  • The test of the model against the saturated model gives \(p = 0.405\), so there is no significant difference between the two. The null hypothesis that the model fits the data is not rejected, which is the desired outcome in SEM: the model reproduces the observed data structure very well.
  2. RMSEA (Root Mean Squared Error of Approximation)
  • An RMSEA below 0.05 indicates close model fit. Here \(RMSEA=0.001\), which is excellent. Also, pclose = 1.000 is the p-value for the null hypothesis that the true RMSEA is at most 0.05, so close fit is not rejected.
  • The 90% lower and upper bounds are also within the expected range, i.e. \(LB = 0.000 < 0.05\) and \(UB = 0.025 < 0.1\), which also suggests a good model fit.
  3. CFI and TLI (Comparative Fit Index and Tucker–Lewis Index)
  • Both indices are above 0.95 (both 1.000), indicating excellent comparative fit. The model is much better than the baseline model that assumes no relationships among the variables.
  4. SRMR (Standardized Root Mean Squared Residual)
  • SRMR < 0.08 is generally considered good. For this model \(SRMR=0.003\), which indicates a near-perfect fit: the model-predicted correlations very closely match the observed ones.
  5. Coefficient of determination
  • The value is \(CD=0.707\), which is quite high.
  • The model explains about 70.7% of the variance in the outcome variables, indicating clinically meaningful explanatory power.
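
The reported indices can be reproduced approximately from the two chi-squared statistics in the estat gof output. The formulas below are the standard textbook definitions; whether Stata divides by \(N\) or \(N-1\) in the RMSEA is an assumption here, and it does not change the rounded result. Subscripts \(ms\) and \(bs\) denote the model-versus-saturated and baseline-versus-saturated tests.

\[
RMSEA = \sqrt{\frac{\max(\chi^2_{ms} - df_{ms},\, 0)}{df_{ms}\,(N-1)}} = \sqrt{\frac{4.007 - 4}{4 \times 3655}} \approx 0.0007
\]

\[
CFI = 1 - \frac{\max(\chi^2_{ms} - df_{ms},\, 0)}{\max(\chi^2_{bs} - df_{bs},\, \chi^2_{ms} - df_{ms},\, 0)} = 1 - \frac{0.007}{6902.819} \approx 1.000
\]

\[
TLI = \frac{\chi^2_{bs}/df_{bs} - \chi^2_{ms}/df_{ms}}{\chi^2_{bs}/df_{bs} - 1} = \frac{364.306 - 1.002}{364.306 - 1} \approx 1.000
\]

The 4 degrees of freedom appear to correspond to the four paths into prevalenthyp that the model omits (from educationlevels3, educationlevels4, male and totchol), since every other possible path is estimated.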

(d) Draw-up the table of results from the final SEM model and verify numerically the STATA drawn direct effects, indirect effects and total effects for “diabp” on your outcome variable “sysbp”.

import delimited framingham_clean


*Creating the dummy variables for education category since it has more than two levels
tab education, gen(educationlevels)

sem (prevalenthyp -> sysbp, ) (educationlevels3 -> sysbp, ) (educationlevels4 -> sysbp, ) (male -> sysbp, ) (glucose -> sysbp, ) (glucose -> prevalenthyp, ) (heartrate -> sysbp, ) (heartrate -> prevalenthyp, ) (diabp -> sysbp, ) (diabp -> prevalenthyp, ) (age -> sysbp, ) (age -> prevalenthyp, ) (bpmeds -> sysbp, ) (bpmeds -> prevalenthyp, ) (totchol -> sysbp, ), nocapslatent

estat teffects 
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)


  education |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,526       41.74       41.74
          2 |      1,101       30.11       71.85
          3 |        606       16.58       88.43
          4 |        423       11.57      100.00
------------+-----------------------------------
      Total |      3,656      100.00


Endogenous variables
  Observed: prevalenthyp sysbp

Exogenous variables
  Observed: educationlevels3 educationlevels4 male glucose heartrate diabp
            age bpmeds totchol

Fitting target model:
Iteration 0:  Log likelihood = -96155.335  
Iteration 1:  Log likelihood = -96155.335  

Structural equation model                                Number of obs = 3,656
Estimation method: ml

Log likelihood = -96155.335

------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  prevalen~p |
     glucose |   .0003951   .0002444     1.62   0.106    -.0000839     .000874
   heartrate |   .0017932    .000492     3.64   0.000      .000829    .0027575
       diabp |   .0211249   .0005093    41.48   0.000     .0201267    .0221231
         age |   .0093578   .0006969    13.43   0.000     .0079918    .0107237
      bpmeds |   .3480831   .0344983    10.09   0.000     .2804678    .4156985
       _cons |  -2.082419   .0588731   -35.37   0.000    -2.197809    -1.96703
  -----------+----------------------------------------------------------------
  sysbp      |
  prevalen~p |   12.88386   .5403533    23.84   0.000     11.82479    13.94293
  educatio~3 |  -1.345866   .5157383    -2.61   0.009    -2.356694   -.3350375
  educatio~4 |  -2.329898   .6025839    -3.87   0.000    -3.510941   -1.148855
        male |  -2.754165   .3876626    -7.10   0.000     -3.51397    -1.99436
     glucose |   .0447748   .0079845     5.61   0.000     .0291254    .0604242
   heartrate |   .0505673   .0163105     3.10   0.002     .0185994    .0825353
       diabp |   1.041687    .020309    51.29   0.000     1.001882    1.081492
         age |   .4274418    .024008    17.80   0.000     .3803869    .4744967
      bpmeds |   7.319757   1.145004     6.39   0.000      5.07559    9.563923
     totchol |   .0085681   .0044977     1.91   0.057    -.0002472    .0173835
       _cons |   12.77098   2.324315     5.49   0.000     8.215408    17.32655
-------------+----------------------------------------------------------------
var(e.prev~p)|   .1216345   .0028449                      .1161845    .1273403
 var(e.sysbp)|   129.7008   3.033575                      123.8894    135.7849
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(4) = 4.01            Prob > chi2 = 0.4051



Direct effects
------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  prevalen~p |
     glucose |   .0003951   .0002444     1.62   0.106    -.0000839     .000874
   heartrate |   .0017932    .000492     3.64   0.000      .000829    .0027575
       diabp |   .0211249   .0005093    41.48   0.000     .0201267    .0221231
         age |   .0093578   .0006969    13.43   0.000     .0079918    .0107237
      bpmeds |   .3480831   .0344983    10.09   0.000     .2804678    .4156985
  -----------+----------------------------------------------------------------
  sysbp      |
  prevalen~p |   12.88386   .5403533    23.84   0.000     11.82479    13.94293
  educatio~3 |  -1.345866   .5157383    -2.61   0.009    -2.356694   -.3350375
  educatio~4 |  -2.329898   .6025839    -3.87   0.000    -3.510941   -1.148855
        male |  -2.754165   .3876626    -7.10   0.000     -3.51397    -1.99436
     glucose |   .0447748   .0079845     5.61   0.000     .0291254    .0604242
   heartrate |   .0505673   .0163105     3.10   0.002     .0185994    .0825353
       diabp |   1.041687    .020309    51.29   0.000     1.001882    1.081492
         age |   .4274418    .024008    17.80   0.000     .3803869    .4744967
      bpmeds |   7.319757   1.145004     6.39   0.000      5.07559    9.563923
     totchol |   .0085681   .0044977     1.91   0.057    -.0002472    .0173835
------------------------------------------------------------------------------


Indirect effects
------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  prevalen~p |
     glucose |          0  (no path)
   heartrate |          0  (no path)
       diabp |          0  (no path)
         age |          0  (no path)
      bpmeds |          0  (no path)
  -----------+----------------------------------------------------------------
  sysbp      |
  prevalen~p |          0  (no path)
  educatio~3 |          0  (no path)
  educatio~4 |          0  (no path)
        male |          0  (no path)
     glucose |   .0050901   .0031555     1.61   0.107    -.0010945    .0112747
   heartrate |   .0231035   .0064122     3.60   0.000     .0105359    .0356711
       diabp |   .2721698   .0131665    20.67   0.000      .246364    .2979756
         age |   .1205643   .0103049    11.70   0.000     .1003672    .1407615
      bpmeds |   4.484655   .4826297     9.29   0.000     3.538718    5.430591
     totchol |          0  (no path)
------------------------------------------------------------------------------


Total effects
------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  prevalen~p |
     glucose |   .0003951   .0002444     1.62   0.106    -.0000839     .000874
   heartrate |   .0017932    .000492     3.64   0.000      .000829    .0027575
       diabp |   .0211249   .0005093    41.48   0.000     .0201267    .0221231
         age |   .0093578   .0006969    13.43   0.000     .0079918    .0107237
      bpmeds |   .3480831   .0344983    10.09   0.000     .2804678    .4156985
  -----------+----------------------------------------------------------------
  sysbp      |
  prevalen~p |   12.88386   .5403533    23.84   0.000     11.82479    13.94293
  educatio~3 |  -1.345866   .5157383    -2.61   0.009    -2.356694   -.3350375
  educatio~4 |  -2.329898   .6025839    -3.87   0.000    -3.510941   -1.148855
        male |  -2.754165   .3876626    -7.10   0.000     -3.51397    -1.99436
     glucose |   .0498649   .0085801     5.81   0.000     .0330483    .0666815
   heartrate |   .0736708   .0174753     4.22   0.000     .0394199    .1079218
       diabp |   1.313857   .0180371    72.84   0.000     1.278505    1.349209
         age |   .5480061   .0251491    21.79   0.000     .4987148    .5972975
      bpmeds |   11.80441   1.214032     9.72   0.000     9.424953    14.18387
     totchol |   .0085681   .0044977     1.91   0.057    -.0002472    .0173835
------------------------------------------------------------------------------
Solution
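Effect of diabp on sysbp

Effect                       prevalenthyp   sysbp (outcome)
Direct effect                0.0211         1.0417
Indirect via prevalenthyp    (no path)      12.8839 × 0.021125 = 0.2722
Total effect                 0.0211         1.0417 + 0.2722 = 1.3139

These values agree with the estat teffects output above (indirect 0.2722, total 1.3139).

Direct effect contribution

\(\frac{1.0417}{1.3139} \times 100=79.3\%\)

Indirect effect contribution

\(\frac{0.2722}{1.3139} \times 100=20.7\%\)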
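
The same verification can be sketched inside Stata with nlcom, which also returns a delta-method standard error. This is an illustrative sketch that assumes the coefficient labels take the form _b[outcome:predictor]; the coeflegend step is included to confirm the labels before they are used.

```stata
* Sketch only: run after the sem command above
* List the coefficient labels that sem stores, to confirm the _b[] names used below
sem, coeflegend

* Indirect effect of diabp on sysbp through prevalenthyp
nlcom _b[sysbp:prevalenthyp] * _b[prevalenthyp:diabp]

* Total effect = direct effect + indirect effect
nlcom _b[sysbp:diabp] + _b[sysbp:prevalenthyp] * _b[prevalenthyp:diabp]
```

The two results should agree with the diabp rows of the indirect and total effects tables from estat teffects.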

(e) Interpret your final SEM model and comment on whether SEM helped improve the direct model from 4c)

Comments

Final SEM Model

  • The final model has:
  1. Endogenous variables (observed): prevalenthyp and sysbp. Here we observe interrelationships.

  2. Exogenous variables (observed): educationlevels3 educationlevels4 male glucose heartrate diabp age bpmeds totchol

Summary of results

Direct effects on systolic blood pressure

  • Prevalent hypertension has a major effect on systolic blood pressure: those with prevalent hypertension have about 12.88 mmHg higher systolic blood pressure than their counterparts, adjusting for the other variables (\(\beta \approx 12.88, p<0.001\)).
  • Diastolic blood pressure has a positive, significant total effect on systolic blood pressure (\(p<0.001\)): a unit increase in diastolic blood pressure is associated with a 1.314 mmHg increase in systolic blood pressure, combining the direct path and the path mediated by prevalent hypertension, and controlling for the other variables. About \(20.7\%\) of this effect is indirect through prevalent hypertension (0.272) and the remaining \(79.3\%\) is the direct effect of diastolic blood pressure on systolic blood pressure (1.042).

Model improvement

  • The \(SEM\) helped to improve the model since:
  1. The root mean squared error of approximation (\(RMSEA=0.001<0.05\)) indicates a close fit.
  2. CFI and TLI are both 1.000, showing an excellent fit.
  3. Overall, the chi-squared test against the saturated model gives \(p=0.405\), so the final model is not significantly worse than a saturated model. The direct model was just-identified (\(\chi^2(0)=0.00\), with no p-value), so its fit could not be tested in this way.

General additional effects shown on the table below:

Structural Equation Model Results with Clinical Interpretation
Outcome | Predictor | β | SE | p | Clinical Interpretation
Binary Outcome: Hypertension Status
Prevalent Hypertension | Glucose | 0.0004 | 0.0002 | 0.106 | NS: No significant association with prevalent hypertension
Prevalent Hypertension | Heart Rate | 0.0018 | 0.0005 | <0.001 | Sig: Each 1 bpm increase → 0.18 percentage points higher probability of hypertension
Prevalent Hypertension | Diastolic BP | 0.0211 | 0.0005 | <0.001 | STRONG: Each 1 mmHg → 2.1 percentage points higher probability of hypertension (key predictor)
Prevalent Hypertension | Age | 0.0094 | 0.0007 | <0.001 | Sig: Each year of age → 0.94 percentage points higher probability of hypertension
Prevalent Hypertension | BP Meds | 0.3481 | 0.0345 | <0.001 | Sig: BP med users have a 34.8 percentage points higher probability of hypertension (indication bias)
Prevalent Hypertension | Constant | -2.0824 | 0.0589 | <0.001 | Intercept of the linear equation for hypertension
Continuous Outcome: Systolic BP (mmHg)
Systolic BP | Prevalent Hypertension | 12.8839 | 0.5404 | <0.001 | STRONG: Hypertensives average 12.9 mmHg higher SBP
Systolic BP | Education (level 3) | -1.3459 | 0.5157 | 0.009 | Sig: Education level 3 → 1.35 mmHg lower SBP vs levels 1-2 (reference)
Systolic BP | Education (level 4) | -2.3299 | 0.6026 | <0.001 | STRONG: Education level 4 → 2.33 mmHg lower SBP vs levels 1-2 (reference)
Systolic BP | Male | -2.7542 | 0.3877 | <0.001 | Sig: Males average 2.75 mmHg lower SBP than females
Systolic BP | Glucose | 0.0448 | 0.0080 | <0.001 | Sig: Each glucose unit → 0.045 mmHg higher SBP
Systolic BP | Heart Rate | 0.0506 | 0.0163 | 0.002 | Sig: Each 1 bpm → 0.051 mmHg higher SBP
Systolic BP | Diastolic BP | 1.0417 | 0.0203 | <0.001 | STRONG: Each 1 mmHg diastolic → 1.04 mmHg higher SBP
Systolic BP | Age | 0.4274 | 0.0240 | <0.001 | STRONG: Each year of age → 0.43 mmHg higher SBP
Systolic BP | BP Meds | 7.3198 | 1.1450 | <0.001 | Sig: BP med users average 7.3 mmHg higher SBP (treatment group)
Systolic BP | Total Cholesterol | 0.0086 | 0.0045 | 0.057 | Marginal (p=0.057): Cholesterol shows weak positive trend
Systolic BP | Constant | 12.7710 | 2.3243 | <0.001 | Baseline SBP for reference group
Notes: Model fit: χ²(4)=4.01, p=0.405 (Excellent fit); SRMR=0.003; CD=0.707
Prevalent hypertension is binary but is fitted by sem as a linear equation, so its coefficients are changes in probability, not odds or log-odds.
NS = Not Significant (p>0.05); Sig = Significant (p<0.05); STRONG = p<0.001 with large effect size

Footnotes

  1. The only exception is if the predictor being added is either constant or identical to another variable.↩︎