id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean |
---|---|---|---|---|---|---|---|---|---|---|
842302 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 |
842517 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
84300903 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 |
84348301 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 |
84358402 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 |
843786 | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.1578 | 0.08089 | 0.2087 | 0.07613 |
Data description (5 marks)
(a) Use appropriate descriptive statistics to summarise the variables, and include at least two types of graphical displays to support your summary
import delimited BreastCancer
dtable, continuous(radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean) title(Table1) titlestyles(font(, bold)) export("Statin_table", as(docx) replace)
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
Table1
----------------------------------------
Summary
----------------------------------------
N 569
radius_mean 14.127 (3.524)
texture_mean 19.290 (4.301)
perimeter_mean 91.969 (24.299)
area_mean 654.889 (351.914)
smoothness_mean 0.096 (0.014)
compactness_mean 0.104 (0.053)
concavity_mean 0.089 (0.080)
concave_points_mean 0.049 (0.039)
symmetry_mean 0.181 (0.027)
fractal_dimension_mean 0.063 (0.007)
----------------------------------------
(collection DTable exported to file Statin_table.docx)
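Table1 reports each mean with its standard deviation in parentheses. Several variables appear right-skewed (for area_mean the standard deviation is more than half the mean), so medians and quartiles may be a useful complement. The following is a minimal sketch, assuming the same BreastCancer file is in the working directory; the choice of statistics is illustrative and not part of the original analysis.

```stata
* Sketch: robust summaries to sit alongside the mean (SD) table
import delimited BreastCancer, clear
tabstat radius_mean texture_mean perimeter_mean area_mean smoothness_mean ///
    compactness_mean concavity_mean concave_points_mean symmetry_mean ///
    fractal_dimension_mean, ///
    statistics(n mean sd p25 median p75 min max) columns(statistics) format(%9.3f)
```

With columns(statistics), each variable occupies one row, which keeps the layout comparable to Table1.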
Graphical representation
Histograms
import delimited BreastCancer
histogram texture_mean ,title("Histogram of texture_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace) mcolor(blue%50)
histogram area_mean ,title("Histogram of area_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace) mcolor(blue%50)
histogram compactness_mean ,title("Histogram of compactness_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace) mcolor(blue%50)
histogram concave_points_mean ,title("Histogram of concave_points_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph4 , replace) mcolor(blue%50)
histogram fractal_dimension_mean ,title("Histogram of fractal_dimension_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph5 , replace) mcolor(blue%50)
histogram radius_mean ,title("Histogram of radius_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph6 , replace) mcolor(blue%50)
histogram perimeter_mean ,title("Histogram of perimeter_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph7 , replace) mcolor(blue%50)
histogram smoothness_mean ,title("Histogram of smoothness_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph8 , replace) mcolor(blue%50)
histogram concavity_mean ,title("Histogram of concavity_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph9 , replace) mcolor(blue%50)
histogram symmetry_mean ,title("Histogram of symmetry_mean") note( "") ytitle("Fraction") normal ytitle("Density") name(graph10 , replace) mcolor(blue%50)
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 ,title("") saving("all_graphs.svg" ,replace)
quietly graph export all_graphs.svg, replace
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
(bin=23, start=9.71, width=1.2856521)
(bin=23, start=143.5, width=102.5)
(bin=23, start=.01938, width=.01417478)
(bin=23, start=0, width=.00874783)
(bin=23, start=.04996, width=.00206435)
(bin=23, start=6.9809999, width=.9186522)
(bin=23, start=43.790001, width=6.2917391)
(bin=23, start=.05263, width=.00481609)
(bin=23, start=0, width=.01855652)
(bin=23, start=.106, width=.0086087)
file all_graphs.svg saved as .gph format
Boxplots
import delimited BreastCancer
graph box texture_mean ,title("Boxplot of texture_mean") note( "") name(graph1 , replace)
graph box area_mean ,title("Boxplot of area_mean") note( "") name(graph2 , replace)
graph box compactness_mean ,title("Boxplot of compactness_mean") note( "") name(graph3 , replace)
graph box concave_points_mean ,title("Boxplot of concave_points_mean") note( "") name(graph4 , replace)
graph box fractal_dimension_mean ,title("Boxplot of fractal_dimension_mean") note( "") name(graph5 , replace)
graph box radius_mean ,title("Boxplot of radius_mean") note( "") name(graph6 , replace)
graph box perimeter_mean ,title("Boxplot of perimeter_mean") note( "") name(graph7 , replace)
graph box smoothness_mean ,title("Boxplot of smoothness_mean") note( "") name(graph8 , replace)
graph box concavity_mean ,title("Boxplot of concavity_mean") note( "") name(graph9 , replace)
graph box symmetry_mean ,title("Boxplot of symmetry_mean") note( "") name(graph10 , replace)
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 ,title("") saving("all_boxes.svg" ,replace)
quietly graph export all_boxes.svg, replace
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
file all_boxes.svg saved as .gph format
2. Exploring variable relationships and multicollinearity (25 marks)
(a) Perform a correlation analysis among all continuous variables in the dataset, including the outcome variable radius mean
import delimited BreastCancer
pwcorr radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean , star(0.05) sig
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
| radius~n textur~n perime~n area_m~n smooth~n compac~n c~y_mean
-------------+---------------------------------------------------------------
radius_mean | 1.0000
|
|
texture_mean | 0.3238* 1.0000
| 0.0000
|
perimeter_~n | 0.9979* 0.3295* 1.0000
| 0.0000 0.0000
|
area_mean | 0.9874* 0.3211* 0.9865* 1.0000
| 0.0000 0.0000 0.0000
|
smoothness~n | 0.1706* -0.0234 0.2073* 0.1770* 1.0000
| 0.0000 0.5777 0.0000 0.0000
|
compactnes~n | 0.5061* 0.2367* 0.5569* 0.4985* 0.6591* 1.0000
| 0.0000 0.0000 0.0000 0.0000 0.0000
|
concavity_~n | 0.6768* 0.3024* 0.7161* 0.6860* 0.5220* 0.8831* 1.0000
| 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
|
concave_po~n | 0.8225* 0.2935* 0.8510* 0.8233* 0.5537* 0.8311* 0.9214*
| 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
|
symmetry_m~n | 0.1477* 0.0714 0.1830* 0.1513* 0.5578* 0.6026* 0.5007*
| 0.0004 0.0888 0.0000 0.0003 0.0000 0.0000 0.0000
|
fractal_di~n | -0.3116* -0.0764 -0.2615* -0.2831* 0.5848* 0.5654* 0.3368*
| 0.0000 0.0685 0.0000 0.0000 0.0000 0.0000 0.0000
|
| c~e_po~n symmet~n fracta~n
-------------+---------------------------
concave_po~n | 1.0000
|
|
symmetry_m~n | 0.4625* 1.0000
| 0.0000
|
fractal_di~n | 0.1669* 0.4799* 1.0000
| 0.0001 0.0000
|
Strong to near-perfect positive correlations:
- radius_mean and perimeter_mean (0.9979) → near-perfect positive relationship.
- radius_mean and area_mean (0.9874) → near-perfect positive relationship.
- perimeter_mean and area_mean (0.9865) → near-perfect positive relationship.
- concavity_mean and compactness_mean (0.8831) → strong positive relationship.
- concave_points_mean and concavity_mean (0.9214) → very strong positive relationship.
Moderate to strong positive correlations:
- compactness_mean and smoothness_mean (0.6591).
- concave_points_mean and radius_mean (0.8225).
texture_mean shows only weak correlations with the other variables (its highest is 0.3295, with perimeter_mean). symmetry_mean is weakly correlated with texture_mean and the size variables (at most 0.1830, with perimeter_mean) but moderately correlated with compactness_mean (0.6026) and smoothness_mean (0.5578).
- The plot below highlights the variables that have negative correlations.
* Install corrtable if necessary
ssc install corrtable
* Load the data
import delimited BreastCancer
* Make correlation table
* The half option just shows the lower triangle and puts variable names on the axis.
* The flag1 and howflag1 options tell corrtable to plot positive correlations (r(rho) > 0) as blue (blue*.1),
* and flag2 and howflag2 similarly tell it to plot negative correlations as pink.
corrtable radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean, half flag1(r(rho) > 0) howflag1(plotregion(color(blue * 0.1))) flag2(r(rho) < 0) howflag2(plotregion(color(pink*0.1)))
quietly graph export heatmap2.svg, replace
checking corrtable consistency and verifying not already installed...
all files already exist and are up to date.
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
The negative correlations are listed below:
fractal_dimension_mean is negatively correlated with the size variables, i.e.:
- With radius_mean (-0.3116).
- With perimeter_mean (-0.2615).
- With area_mean (-0.2831).
The preliminary analysis of correlations shows strong positive correlations between radius_mean and perimeter_mean \((r \approx 1.00)\), as well as between radius_mean and area_mean \((r \approx 0.99)\), indicating that as the radius increases, the perimeter and area increase almost in lockstep. This is expected, as both area and perimeter depend on the radius.
concavity_mean and concave_points_mean are strongly positively correlated (r = 0.92), suggesting that higher concavity is associated with more concave points. Moderate positive correlations are observed between compactness_mean and smoothness_mean (r = 0.66), and between compactness_mean and symmetry_mean (r = 0.60), implying statistically significant associations of moderate strength between these variables.
Weak positive correlations can be seen between radius_mean and smoothness_mean (r = 0.17) and between symmetry_mean and texture_mean (r = 0.07), indicating very weak associations; the latter is not statistically significant (p = 0.0888). On the other hand, fractal_dimension_mean shows a weak negative correlation with radius_mean (r = -0.31) and area_mean (r = -0.28), suggesting that larger radius or area values are slightly associated with lower fractal dimension.
(b) Summarise the results using a clear and well-labelled correlation matrix table or a visual format such as a correlogram
Correlogram
ssc install heatplot
ssc install palettes, replace
ssc install colrspace, replace
import delimited BreastCancer
corr radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean
return list
matrix corrmatrix = r(C)
heatplot corrmatrix, values(format(%4.3f) size(tiny)) legend(off) color(hcl diverging, intensity(.7)) aspectratio(1) xlabel(,labsize(small) angle(45)) xsize(10) ysize(13)
quietly graph export heatmap.svg, replace
checking heatplot consistency and verifying not already installed...
all files already exist and are up to date.
checking palettes consistency and verifying not already installed...
all files already exist and are up to date.
checking colrspace consistency and verifying not already installed...
all files already exist and are up to date.
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
(obs=569)
| radius~n textur~n perime~n area_m~n smooth~n compac~n c~y_mean
-------------+---------------------------------------------------------------
radius_mean | 1.0000
texture_mean | 0.3238 1.0000
perimeter_~n | 0.9979 0.3295 1.0000
area_mean | 0.9874 0.3211 0.9865 1.0000
smoothness~n | 0.1706 -0.0234 0.2073 0.1770 1.0000
compactnes~n | 0.5061 0.2367 0.5569 0.4985 0.6591 1.0000
concavity_~n | 0.6768 0.3024 0.7161 0.6860 0.5220 0.8831 1.0000
concave_po~n | 0.8225 0.2935 0.8510 0.8233 0.5537 0.8311 0.9214
symmetry_m~n | 0.1477 0.0714 0.1830 0.1513 0.5578 0.6026 0.5007
fractal_di~n | -0.3116 -0.0764 -0.2615 -0.2831 0.5848 0.5654 0.3368
| c~e_po~n symmet~n fracta~n
-------------+---------------------------
concave_po~n | 1.0000
symmetry_m~n | 0.4625 1.0000
fractal_di~n | 0.1669 0.4799 1.0000
scalars:
r(N) = 569
r(rho) = .3237818874068258
matrices:
r(C) : 10 x 10
Pairwise scatter plot
import delimited BreastCancer
graph matrix radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean, half maxis(ylabel(none) xlabel(none))
quietly graph export scatter.svg, replace
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
(d) Identify the variables that show the most substantial positive or negative correlations with the outcome variable or with other predictors, and explain why these associations make sense
Substantial positive correlations with the outcome variable:
radius_mean is strongly positively correlated with:
- perimeter_mean \((r = 1.00)\)
- area_mean \((r = 0.99)\)
- concavity_mean \((r = 0.68)\)
- concave_points_mean \((r = 0.82)\)
Substantial positive correlations between predictor variables:
- perimeter_mean with area_mean (r = 0.99)
- concavity_mean with concave_points_mean (r = 0.92)
- compactness_mean with concavity_mean (r = 0.88)
- compactness_mean with concave_points_mean (r = 0.83)
Most substantial negative correlations with the outcome variable:
- radius_mean with fractal_dimension_mean (r = -0.31)
Most substantial negative correlations between predictor variables:
- fractal_dimension_mean with area_mean (r = -0.28)
- fractal_dimension_mean with perimeter_mean (r = -0.26)
Justifications
- The most substantial positive correlations are between radius_mean, perimeter_mean, and area_mean, with correlations approximately equal to 1.00, meaning they are almost perfectly positively related.
This is expected because all three measure size, and for a roughly circular outline both perimeter and area are functions of the radius, i.e.:
\(A=\pi r^2 , \quad \text{Perimeter} = 2\pi r\)
From these formulas, perimeter is directly proportional to the radius, while area is proportional to \(r^2\), which also increases steadily with the radius. This explains why both correlations are close to 1.
- concavity_mean and concave_points_mean are very strongly correlated (r = 0.92), which makes sense as both capture aspects of concavity.
- Substantial positive correlations among predictors are also seen between compactness_mean, concavity_mean, and concave_points_mean, which makes sense because tumors with higher compactness values (less smooth outlines) tend to have more noticeable concavities.
- Notably, fractal_dimension_mean shows the most substantial negative correlations with the size-related measures radius_mean, perimeter_mean, and area_mean (r ≈ -0.26 to -0.31), meaning that larger tumors tend to have somewhat less complex edges. This would be consistent with larger tumors having smoother borders than smaller, more irregular ones, although the correlations are weak and the pattern should be read cautiously.
(e) Identify any predictor variables that may be redundant due to very high correlations with each other, and explain how such redundancy could affect model performance or interpretation
- perimeter_mean and area_mean are highly correlated with each other \((r \approx 0.99)\), and both are naturally tied to the outcome variable radius_mean. We also note that, for a circular outline, each is a deterministic function of the radius (perimeter is linear in \(r\), area is quadratic), which produces severe multicollinearity.
Proof: \[ \text{Area} = \pi r^2 = \frac{1}{2}\cdot 2\pi r\cdot r = \frac{1}{2}\cdot r\cdot \text{Perimeter} \]
- concavity_mean and concave_points_mean are highly correlated \((r=0.92)\)
- compactness_mean and concavity_mean are highly correlated \((r=0.88)\)
How Redundancy Could Affect Model Performance:
- Unstable coefficient estimates.
- Inflated standard errors, with correspondingly lower t-statistics.
- Worse model generalization (overfitting to training data noise).
- Unexpected changes in coefficient magnitudes or signs.
- Non-significant coefficients despite a high \(R^2\).
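The effect on standard errors can be made explicit with the standard OLS result for the sampling variance of a single coefficient. In the sketch below, \(R_j^2\) denotes the \(R^2\) obtained by regressing predictor \(X_j\) on all the other predictors, and \(\sigma^2\) denotes the error variance; this notation is introduced here for illustration only.

\[ \operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}\cdot\frac{1}{1-R_j^2} \]

When \(X_j\) is almost fully explained by the other predictors, \(R_j^2\) approaches 1 and the second factor grows without bound, so the standard error of \(\hat\beta_j\) is inflated even if the overall model fits well.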
Linear Regression
Linear regression fits this model:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \epsilon \]
- \(Y\) represents the outcome variable
- \(X_1, X_2, \cdots, X_p\) represent the predictors, of which there are \(p\) total.
- \(\beta_0\) represents the intercept. If you have a subject for which every predictor is equal to zero, \(\beta_0\) represents their predicted outcome.
- The other \(\beta\)’s are called the coefficients, and represent the relationship between each predictor and the response. We will cover their interpretation in detail later.
- \(\epsilon\) represents the error. Regression is a game of averages, but for any individual observation, the model will contain some error.
import delimited BreastCancer
regress radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean
vif
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
Source | SS df MS Number of obs = 569
-------------+---------------------------------- F(9, 559) = 99738.39
Model | 7049.55654 9 783.28406 Prob > F = 0.0000
Residual | 4.39004275 559 .007853386 R-squared = 0.9994
-------------+---------------------------------- Adj R-squared = 0.9994
Total | 7053.94658 568 12.41892 Root MSE = .08862
------------------------------------------------------------------------------
radius_mean | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
texture_mean | .0002343 .0009418 0.25 0.804 -.0016156 .0020843
perimeter_~n | .1568643 .0013373 117.30 0.000 .1542376 .159491
area_mean | -.0002857 .0000783 -3.65 0.000 -.0004396 -.0001318
smoothness~n | 1.273811 .4514421 2.82 0.005 .3870806 2.160541
compactnes~n | -4.827446 .2654031 -18.19 0.000 -5.348755 -4.306136
concavity_~n | -.7595862 .1563819 -4.86 0.000 -1.066754 -.4524182
concave_po~n | -.2975441 .4463421 -0.67 0.505 -1.174257 .5791686
symmetry_m~n | .2350665 .1806098 1.30 0.194 -.1196903 .5898233
fractal_di~n | 3.251577 1.332831 2.44 0.015 .6336083 5.869546
_cons | .0994146 .1342057 0.74 0.459 -.1641945 .3630236
------------------------------------------------------------------------------
Variable | VIF 1/VIF
-------------+----------------------
perimeter_~n | 76.37 0.013094
area_mean | 54.98 0.018190
concave_po~n | 21.69 0.046094
compactnes~n | 14.21 0.070375
concavity_~n | 11.24 0.088962
fractal_di~n | 6.40 0.156137
smoothness~n | 2.92 0.342988
symmetry_m~n | 1.77 0.563991
texture_mean | 1.19 0.842569
-------------+----------------------
Mean VIF | 21.20
- The fitted model is: \[ \begin{aligned} \text{radius\_mean} = {} & \beta_0 + \beta_1\,\text{texture\_mean} + \beta_2\,\text{perimeter\_mean} + \beta_3\,\text{area\_mean} \\ & + \beta_4\,\text{smoothness\_mean} + \beta_5\,\text{compactness\_mean} + \beta_6\,\text{concavity\_mean} \\ & + \beta_7\,\text{concave\_points\_mean} + \beta_8\,\text{symmetry\_mean} + \beta_9\,\text{fractal\_dimension\_mean} + \epsilon \end{aligned}\] where the estimated \(\beta\)'s can be found under the Coefficient column.
- The following variables show serious multicollinearity, as their VIF values are well above 10:
- perimeter_mean (VIF = 76.37)
- area_mean (VIF = 54.98)
- concave_points_mean (VIF = 21.69)
- compactness_mean (VIF = 14.21)
- concavity_mean (VIF = 11.24).
- These very high VIFs indicate that these predictors are strongly collinear with one another and contribute little unique information to the model.
- fractal_dimension_mean (VIF = 6.40) shows moderate multicollinearity, and only smoothness_mean, symmetry_mean, and texture_mean have VIFs well below 5, suggesting little multicollinearity.
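The two columns of the VIF table can be read together. The 1/VIF column is the tolerance, \(1 - R_j^2\), where \(R_j^2\) (notation assumed here) is the \(R^2\) from regressing predictor \(j\) on the other predictors. A worked check for the largest and smallest entries:

\[ \mathrm{VIF}_j = \frac{1}{1-R_j^2}, \qquad \text{perimeter\_mean: } R_j^2 = 1 - 0.013094 = 0.986906, \quad \mathrm{VIF} = \frac{1}{0.013094} \approx 76.37 \]

\[ \text{texture\_mean: } R_j^2 = 1 - 0.842569 = 0.157431, \quad \mathrm{VIF} = \frac{1}{0.842569} \approx 1.19 \]

So about 98.7% of the variation in perimeter_mean is reproduced by the other predictors, and its standard error is roughly \(\sqrt{76.37} \approx 8.7\) times larger than it would be with uncorrelated predictors.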
Model without redundant variables
import delimited BreastCancer
regress radius_mean fractal_dimension_mean smoothness_mean symmetry_mean texture_mean concavity_mean
vif
(encoding automatically selected: ISO-8859-2)
(11 vars, 569 obs)
Source | SS df MS Number of obs = 569
-------------+---------------------------------- F(5, 563) = 438.92
Model | 5613.78998 5 1122.758 Prob > F = 0.0000
Residual | 1440.1566 563 2.55800462 R-squared = 0.7958
-------------+---------------------------------- Adj R-squared = 0.7940
Total | 7053.94658 568 12.41892 Root MSE = 1.5994
------------------------------------------------------------------------------
radius_mean | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
fractal_di~n | -326.6724 12.1021 -26.99 0.000 -350.4432 -302.9017
smoothness~n | 34.51059 6.802347 5.07 0.000 21.14951 47.87167
symmetry_m~n | -4.461625 3.169397 -1.41 0.160 -10.68691 1.763662
texture_mean | .0222003 .0168809 1.32 0.189 -.0109569 .0553575
concavity_~n | 36.88825 1.106025 33.35 0.000 34.71581 39.06069
_cons | 28.42048 .820753 34.63 0.000 26.80837 30.03259
------------------------------------------------------------------------------
Variable | VIF 1/VIF
-------------+----------------------
smoothness~n | 2.03 0.492051
concavity_~n | 1.73 0.579283
symmetry_m~n | 1.68 0.596549
fractal_di~n | 1.62 0.616847
texture_mean | 1.17 0.854312
-------------+----------------------
Mean VIF | 1.65
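- We notice from the output that the mean VIF fell from 21.20 to 1.65, so removing these variables resolves the multicollinearity problem. However, \(R^2\) dropped from 0.9994 to 0.7958 (adjusted \(R^2\) from 0.9994 to 0.7940), which is not a slight reduction: the near-perfect fit of the full model came mainly from perimeter_mean and area_mean, which are almost deterministic functions of the radius. The reduced model still explains about 80% of the variation in radius_mean, and its coefficients can be interpreted more reliably.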
- For a one-unit increase in fractal_dimension_mean, radius_mean significantly decreases by 326.6724, adjusting for all other variables \((p < 0.001)\).
- For a one-unit increase in smoothness_mean, radius_mean significantly increases by 34.51059, adjusting for all other variables \((p < 0.001)\).
- For a one-unit increase in symmetry_mean, radius_mean decreases by 4.461625, adjusting for all other variables. However, this relationship is not significant \((p = 0.160 > 0.05)\).
- For a one-unit increase in texture_mean, radius_mean increases by 0.0222003, adjusting for all other variables. However, this relationship is not significant at the 5% level of significance \((p = 0.189 > 0.05)\).
- For a one-unit increase in concavity_mean, radius_mean significantly increases by 36.88825, adjusting for all other variables \((p < 0.001)\).
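Because fractal_dimension_mean, smoothness_mean and concavity_mean vary over very narrow ranges (Table1 reports standard deviations of 0.007, 0.014 and 0.080), a full one-unit increase lies far outside the observed data. Rescaling each significant coefficient to a one-standard-deviation change may be easier to interpret. The following is a minimal sketch using the reduced-model coefficients and the rounded standard deviations from Table1; the scalar names are hypothetical.

```stata
* Sketch: expected change in radius_mean per one-SD increase in a predictor
* (coefficient from the reduced model times the rounded SD from Table1)
scalar b_fractal   = -326.6724
scalar b_smooth    =   34.51059
scalar b_concavity =   36.88825

scalar d_fractal   = b_fractal   * 0.007
scalar d_smooth    = b_smooth    * 0.014
scalar d_concavity = b_concavity * 0.080

display "fractal_dimension_mean: " %6.3f d_fractal
display "smoothness_mean:        " %6.3f d_smooth
display "concavity_mean:         " %6.3f d_concavity
```

Under these assumptions the changes are roughly -2.29, +0.48 and +2.95 units of radius_mean per standard deviation. The values are approximate because the standard deviations in Table1 are rounded to three decimals.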
Section B
a) Research Question
- What demographic, lifestyle, and clinical factors are associated with systolic blood pressure (sysBP) among participants in the Framingham Heart Study?
b) Simple linear model
import delimited framingham_clean
regress sysbp bmi
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(1, 3654) = 449.60
Model | 195452.085 1 195452.085 Prob > F = 0.0000
Residual | 1588465.99 3,654 434.719756 R-squared = 0.1096
-------------+---------------------------------- Adj R-squared = 0.1093
Total | 1783918.07 3,655 488.07608 Root MSE = 20.85
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
bmi | 1.798533 .0848209 21.20 0.000 1.632232 1.964834
_cons | 85.99432 2.214055 38.84 0.000 81.65341 90.33522
------------------------------------------------------------------------------
Interpretation
- Predictor - Body mass index
- Outcome - systolic blood pressure (sysbp)
For a one-unit increase in body mass index, systolic blood pressure increases significantly by 1.798533 on average \((p < 0.001)\). Because this is a simple linear model with a single predictor, the estimate is not adjusted for any other variable.
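The fitted line can be written out from the coefficient table and used for a worked prediction. The BMI value of 25 below is a hypothetical input chosen only for illustration:

\[ \widehat{\text{sysbp}} = 85.99432 + 1.798533 \times \text{bmi}, \qquad \text{bmi} = 25: \quad 85.99432 + 1.798533 \times 25 \approx 130.96 \]

The same table also shows how the test statistics relate to each other. With a single predictor, the overall F statistic is the square of the slope's t statistic:

\[ t = \frac{1.798533}{0.0848209} \approx 21.20, \qquad t^2 \approx 449.6 = F(1, 3654) \]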
import delimited framingham_clean
regress sysbp bmi age diabp heartrate glucose cigsperday totchol i.male i.education i.currentsmoker i.prevalentstroke i.prevalenthyp i.bpmeds i.diabetes, allbase
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(16, 3639) = 629.36
Model | 1310373.33 16 81898.333 Prob > F = 0.0000
Residual | 473544.743 3,639 130.13046 R-squared = 0.7345
-------------+---------------------------------- Adj R-squared = 0.7334
Total | 1783918.07 3,655 488.07608 Root MSE = 11.407
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
bmi | -.0248742 .0518146 -0.48 0.631 -.1264628 .0767143
age | .4283898 .0250708 17.09 0.000 .3792356 .477544
diabp | 1.04679 .0210667 49.69 0.000 1.005486 1.088093
heartrate | .0474775 .0164587 2.88 0.004 .0152083 .0797467
glucose | .0464323 .0100716 4.61 0.000 .0266858 .0661788
cigsperday | .012802 .026168 0.49 0.625 -.0385034 .0641073
totchol | .0085838 .0045199 1.90 0.058 -.0002779 .0174455
|
male |
0 | 0 (base)
1 | -2.960022 .4172047 -7.09 0.000 -3.778001 -2.142044
|
education |
1 | 0 (base)
2 | -.6177673 .4674476 -1.32 0.186 -1.534253 .298718
3 | -1.62405 .5581798 -2.91 0.004 -2.718427 -.529674
4 | -2.599828 .6391894 -4.07 0.000 -3.853033 -1.346623
|
currentsmo~r |
0 | 0 (base)
1 | .3914087 .6068143 0.65 0.519 -.7983212 1.581138
|
prevalents~e |
0 | 0 (base)
1 | -1.355887 2.518326 -0.54 0.590 -6.293358 3.581584
|
prevalenthyp |
0 | 0 (base)
1 | 12.90949 .5428089 23.78 0.000 11.84525 13.97373
|
bpmeds |
0 | 0 (base)
1 | 7.41619 1.152858 6.43 0.000 5.155878 9.676501
|
diabetes |
0 | 0 (base)
1 | -.2279949 1.477869 -0.15 0.877 -3.125529 2.669539
|
_cons | 13.08773 2.608305 5.02 0.000 7.973844 18.20161
------------------------------------------------------------------------------
Interpretation
- Check Number of obs. Here, the data has 3,656 rows, so the regression model is using all the data (there is no missingness in our data).
- The F-test which follows (F(16, 3639), where 16 and 3639 are the degrees of freedom, and Prob > F = 0.0000) tests the null hypothesis that all coefficients other than the intercept are 0. In other words, if this test fails to reject, the conclusion is that the model captures no relationships. For this model the null hypothesis is rejected, and we conclude that at least one of the coefficients is not equal to zero. The model is globally significant.
- The \(R^2\) (R-squared) is a measure of model fit, expressed as the percentage of the variation in the response that is explained by the linear relationship with the predictors. For this model \(R^2 = 0.7345\), meaning that \(73.45\%\) of the variability in systolic blood pressure among the participants can be explained by the variables included in the model.
- Mathematically, adding a new predictor to the model can never decrease the \(R^2\), regardless of how useless the variable is. This makes \(R^2\) poor for model comparison, as it would always select the model with the most predictors. Instead, the adjusted \(R^2\) ("Adj R-squared") accounts for this; it penalizes the \(R^2\) by the number of predictors in the model. Hence for this model the adjusted \(R^2 = 0.7334\) is slightly lower than \(R^2\), reflecting the penalty for predictors that contribute little explanatory power.
- The root mean squared error (Root MSE, also known as RMSE) is a measure of the typical difference between the observed outcome (systolic blood pressure) and the predicted outcome. For this model, Root MSE = 11.407, so the typical prediction error is about 11.4 in the units of systolic blood pressure.
Hence, for the research question at hand, which is to model systolic blood pressure based on demographic, lifestyle, and clinical variables, the \(R^2\) and adjusted \(R^2\) are reasonably high (about 0.73), suggesting that the model fits the data fairly well and that the variables jointly help explain systolic blood pressure in this sample.
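The fit statistics quoted above can be reproduced directly from the ANOVA block of the regression output. A worked check, using the SS and MS columns of the full model:

\[ R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{1310373.33}{1783918.07} \approx 0.7345 \]

\[ R^2_{\text{adj}} = 1 - \frac{MS_{\text{Residual}}}{MS_{\text{Total}}} = 1 - \frac{130.13046}{488.07608} \approx 0.7334 \]

\[ \text{Root MSE} = \sqrt{MS_{\text{Residual}}} = \sqrt{130.13046} \approx 11.407 \]

The adjusted version uses mean squares (sums of squares divided by their degrees of freedom, 3639 and 3655), which is where the penalty for the 16 estimated slopes enters.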
4. Variable selection (10 marks)
a) method using stepwise backward elimination
Backward elimination begins with a model which includes all candidate variables. Variables are then deleted from the model one by one until all the variables remaining in the model meet the retention criterion (here a p-value below 0.1, set by pr(0.1)). At each step, the variable contributing least to the model, that is the one with the largest p-value, is deleted. Once a variable is deleted, it cannot come back to the model.
*Method 1*
import delimited framingham_clean
stepwise, pr(0.1): regress sysbp i.male age i.currentsmoker i.education cigsperday i.bpmeds i.prevalentstroke i.prevalenthyp i.diabetes totchol diabp bmi heartrate glucose,allbase
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
note: 0b.male omitted because of estimability.
note: 0b.currentsmoker omitted because of estimability.
note: 1b.education omitted because of estimability.
note: 0b.bpmeds omitted because of estimability.
note: 0b.prevalentstroke omitted because of estimability.
note: 0b.prevalenthyp omitted because of estimability.
note: 0b.diabetes omitted because of estimability.
Wald test, begin with full model:
p = 0.8774 >= 0.1000, removing 1.diabetes
p = 0.6270 >= 0.1000, removing bmi
p = 0.6305 >= 0.1000, removing cigsperday
p = 0.5851 >= 0.1000, removing 1.prevalentstroke
p = 0.2021 >= 0.1000, removing 2.education
p = 0.1128 >= 0.1000, removing 1.currentsmoker
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(10, 3645) = 1006.77
Model | 1309731.82 10 130973.182 Prob > F = 0.0000
Residual | 474186.256 3,645 130.092251 R-squared = 0.7342
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.406
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.male | -2.754165 .3882471 -7.09 0.000 -3.515368 -1.992962
age | .4274418 .0240442 17.78 0.000 .3803003 .4745833
glucose | .0447748 .0079966 5.60 0.000 .0290966 .060453
totchol | .0085681 .0045045 1.90 0.057 -.0002634 .0173997
|
education |
3 | -1.345866 .5165159 -2.61 0.009 -2.358555 -.3331771
4 | -2.329898 .6034925 -3.86 0.000 -3.513114 -1.146682
|
heartrate | .0505673 .0163351 3.10 0.002 .0185406 .0825941
1.bpmeds | 7.319757 1.14673 6.38 0.000 5.07146 9.568053
diabp | 1.041687 .0203397 51.21 0.000 1.001809 1.081565
1.prevalen~p | 12.88386 .541168 23.81 0.000 11.82284 13.94488
_cons | 12.77098 2.327819 5.49 0.000 8.207023 17.33494
------------------------------------------------------------------------------
Method 2: using best subset selection method
The basic idea of the all possible subsets approach is to run every possible combination of the predictors to find the best subset to meet some pre-defined objective criteria such as \(C_p\) and adjusted \(R^2\).
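To give a sense of the size of this search: with \(p = 14\) candidate predictors, as in the call below, each predictor is either in or out of a model, so the number of non-empty subsets is

\[ 2^{p} - 1 = 2^{14} - 1 = 16383 \]

The count doubles with every additional candidate predictor, which is why the output reports only the best model of each size rather than every fitted model.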
*Method 2*
import delimited framingham_clean
ssc install gvselect, replace
gvselect <term> male age education currentsmoker cigsperday bpmeds prevalentstroke prevalenthyp diabetes totchol diabp bmi heartrate glucose, nmodels(1): regress sysbp <term>
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
checking gvselect consistency and verifying not already installed...
all files already exist and are up to date.
Optimal models:
# Preds LL AIC BIC
1 -14739.67 29483.34 29495.75
2 -14353.41 28712.82 28731.43
3 -14169.83 28347.66 28372.48
4 -14136.63 28283.26 28314.28
5 -14116.25 28244.5 28281.72
6 -14098.56 28211.12 28254.55
7 -14087.98 28191.97 28241.6
8 -14082.63 28183.26 28239.1
9 -14080.72 28181.44 28243.48
10 -14079.38 28180.76 28249
11 -14079.23 28182.46 28256.91
12 -14079.1 28184.19 28264.84
13 -14078.97 28185.95 28272.81
14 -14078.96 28187.92 28280.99
predictors for each model:
1 : diabp
2 : prevalenthyp diabp
3 : age prevalenthyp diabp
4 : male age prevalenthyp diabp
5 : male age bpmeds prevalenthyp diabp
6 : male age bpmeds prevalenthyp diabp glucose
7 : male age education bpmeds prevalenthyp diabp glucose
8 : male age education bpmeds prevalenthyp diabp heartrate glucose
9 : male age education bpmeds prevalenthyp totchol diabp heartrate glucose
10 : male age education currentsmoker bpmeds prevalenthyp totchol diabp
heartrate glucose
11 : male age education currentsmoker bpmeds prevalentstroke prevalenthyp
totchol diabp heartrate glucose
12 : male age education currentsmoker bpmeds prevalentstroke prevalenthyp
totchol diabp bmi heartrate glucose
13 : male age education currentsmoker cigsperday bpmeds prevalentstroke
prevalenthyp totchol diabp bmi heartrate glucose
14 : male age education currentsmoker cigsperday bpmeds prevalentstroke
prevalenthyp diabetes totchol diabp bmi heartrate glucose
- Focusing on the Bayesian Information Criterion (BIC) from the best subset selection method, the model with 8 predictors appears to be the best fit: its \(BIC=28239.1\) is the lowest among all candidate subsets, including the model with all predictors.
- It is also worth noting that the model with 8 predictors has the fourth-lowest AIC (\(AIC=28183.26\)), and its AIC differs only marginally from those of the models with lower AIC in the best subset output.
- The model with 10 predictors performs best when our focus is on the Akaike Information Criterion (\(AIC=28180.76\)).
- \(\Delta AIC = 2.50\) for 8 predictors vs 10
- \(\Delta AIC = 1.82\) for 8 predictors vs 9
- For these reasons, the model with 8 predictors and the model with 10 predictors are compared on other metrics, and both are also compared to the full model.
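The AIC and BIC columns can be reproduced from the log likelihoods in the table of optimal models. The sketch below is a minimal check in Python, assuming the usual definitions \(AIC = 2k - 2LL\) and \(BIC = k\ln(n) - 2LL\), and assuming that \(k\) counts the intercept as well as the predictors (an inference that matches the reported values, not something stated in the output). Small discrepancies in the second decimal place come from the rounded log likelihoods.

```python
from math import log

N_OBS = 3656  # number of observations reported by Stata

# (number of predictors -> log likelihood), copied from the table of optimal models
LOGLIK = {
    1: -14739.67, 2: -14353.41, 3: -14169.83, 4: -14136.63, 5: -14116.25,
    6: -14098.56, 7: -14087.98, 8: -14082.63, 9: -14080.72, 10: -14079.38,
    11: -14079.23, 12: -14079.10, 13: -14078.97, 14: -14078.96,
}


def aic(ll, k):
    return 2 * k - 2 * ll


def bic(ll, k, n=N_OBS):
    return k * log(n) - 2 * ll


# Assumption: k = predictors + 1 (the intercept is counted as a parameter).
table = {p: (aic(ll, p + 1), bic(ll, p + 1)) for p, ll in LOGLIK.items()}

best_aic = min(table, key=lambda p: table[p][0])
best_bic = min(table, key=lambda p: table[p][1])
delta_aic_8_vs_10 = table[8][0] - table[10][0]
delta_aic_8_vs_9 = table[8][0] - table[9][0]

print("lowest AIC at", best_aic, "predictors; lowest BIC at", best_bic)
print("AIC differences:", round(delta_aic_8_vs_10, 2), round(delta_aic_8_vs_9, 2))
```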
Best subsets VS stepwise VS Full model
import delimited framingham_clean
regress sysbp bmi age diabp heartrate glucose cigsperday totchol i.male i.education i.currentsmoker i.prevalentstroke i.prevalenthyp i.bpmeds i.diabetes, allbase
est store ModelA14
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose
est store ModelA8
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol i.currentsmoker
est store ModelA10
stepwise, pr(0.1): regress sysbp i.male age i.currentsmoker i.education cigsperday i.bpmeds i.prevalentstroke i.prevalenthyp i.diabetes totchol diabp bmi heartrate glucose,allbase
est store Modelstep
estout ModelA14 ModelA10 Modelstep ModelA8, stats(r2 r2_a bic aic)
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(16, 3639) = 629.36
Model | 1310373.33 16 81898.333 Prob > F = 0.0000
Residual | 473544.743 3,639 130.13046 R-squared = 0.7345
-------------+---------------------------------- Adj R-squared = 0.7334
Total | 1783918.07 3,655 488.07608 Root MSE = 11.407
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
bmi | -.0248742 .0518146 -0.48 0.631 -.1264628 .0767143
age | .4283898 .0250708 17.09 0.000 .3792356 .477544
diabp | 1.04679 .0210667 49.69 0.000 1.005486 1.088093
heartrate | .0474775 .0164587 2.88 0.004 .0152083 .0797467
glucose | .0464323 .0100716 4.61 0.000 .0266858 .0661788
cigsperday | .012802 .026168 0.49 0.625 -.0385034 .0641073
totchol | .0085838 .0045199 1.90 0.058 -.0002779 .0174455
|
male |
0 | 0 (base)
1 | -2.960022 .4172047 -7.09 0.000 -3.778001 -2.142044
|
education |
1 | 0 (base)
2 | -.6177673 .4674476 -1.32 0.186 -1.534253 .298718
3 | -1.62405 .5581798 -2.91 0.004 -2.718427 -.529674
4 | -2.599828 .6391894 -4.07 0.000 -3.853033 -1.346623
|
currentsmo~r |
0 | 0 (base)
1 | .3914087 .6068143 0.65 0.519 -.7983212 1.581138
|
prevalents~e |
0 | 0 (base)
1 | -1.355887 2.518326 -0.54 0.590 -6.293358 3.581584
|
prevalenthyp |
0 | 0 (base)
1 | 12.90949 .5428089 23.78 0.000 11.84525 13.97373
|
bpmeds |
0 | 0 (base)
1 | 7.41619 1.152858 6.43 0.000 5.155878 9.676501
|
diabetes |
0 | 0 (base)
1 | -.2279949 1.477869 -0.15 0.877 -3.125529 2.669539
|
_cons | 13.08773 2.608305 5.02 0.000 7.973844 18.20161
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(10, 3645) = 1005.93
Model | 1309440.99 10 130944.099 Prob > F = 0.0000
Residual | 474477.078 3,645 130.172038 R-squared = 0.7340
-------------+---------------------------------- Adj R-squared = 0.7333
Total | 1783918.07 3,655 488.07608 Root MSE = 11.409
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.male | -2.823336 .3881123 -7.27 0.000 -3.584275 -2.062398
age | .4318045 .0239838 18.00 0.000 .3847815 .4788275
|
education |
2 | -.5455734 .464128 -1.18 0.240 -1.45555 .3644029
3 | -1.558717 .5544055 -2.81 0.005 -2.645692 -.4717407
4 | -2.539784 .6359611 -3.99 0.000 -3.786659 -1.29291
|
1.bpmeds | 7.428462 1.146756 6.48 0.000 5.180114 9.676809
1.prevalen~p | 12.89554 .5413036 23.82 0.000 11.83426 13.95683
diabp | 1.044746 .0202756 51.53 0.000 1.004993 1.084498
heartrate | .0528364 .0163023 3.24 0.001 .0208738 .084799
glucose | .0447871 .0079992 5.60 0.000 .0291038 .0604705
_cons | 14.4055 2.306565 6.25 0.000 9.883218 18.92779
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(12, 3643) = 839.82
Model | 1310270.63 12 109189.219 Prob > F = 0.0000
Residual | 473647.445 3,643 130.015768 R-squared = 0.7345
-------------+---------------------------------- Adj R-squared = 0.7336
Total | 1783918.07 3,655 488.07608 Root MSE = 11.402
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.male | -2.92213 .3987409 -7.33 0.000 -3.703907 -2.140352
age | .4276106 .0249925 17.11 0.000 .3786098 .4766113
|
education |
2 | -.5922715 .4642179 -1.28 0.202 -1.502424 .3178813
3 | -1.592616 .5543974 -2.87 0.004 -2.679576 -.5056559
4 | -2.567674 .6359235 -4.04 0.000 -3.814476 -1.320873
|
1.bpmeds | 7.345691 1.146728 6.41 0.000 5.097398 9.593985
1.prevalen~p | 12.88502 .5410092 23.82 0.000 11.82431 13.94573
diabp | 1.04436 .0204142 51.16 0.000 1.004336 1.084384
heartrate | .0481429 .0164055 2.93 0.003 .015978 .0803077
glucose | .045165 .0080001 5.65 0.000 .0294798 .0608502
totchol | .0085569 .0045069 1.90 0.058 -.0002793 .0173931
1.currents~r | .644381 .3984548 1.62 0.106 -.1368357 1.425598
_cons | 12.70032 2.402614 5.29 0.000 7.989722 17.41093
------------------------------------------------------------------------------
note: 0b.male omitted because of estimability.
note: 0b.currentsmoker omitted because of estimability.
note: 1b.education omitted because of estimability.
note: 0b.bpmeds omitted because of estimability.
note: 0b.prevalentstroke omitted because of estimability.
note: 0b.prevalenthyp omitted because of estimability.
note: 0b.diabetes omitted because of estimability.
Wald test, begin with full model:
p = 0.8774 >= 0.1000, removing 1.diabetes
p = 0.6270 >= 0.1000, removing bmi
p = 0.6305 >= 0.1000, removing cigsperday
p = 0.5851 >= 0.1000, removing 1.prevalentstroke
p = 0.2021 >= 0.1000, removing 2.education
p = 0.1128 >= 0.1000, removing 1.currentsmoker
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(10, 3645) = 1006.77
Model | 1309731.82 10 130973.182 Prob > F = 0.0000
Residual | 474186.256 3,645 130.092251 R-squared = 0.7342
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.406
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.male | -2.754165 .3882471 -7.09 0.000 -3.515368 -1.992962
age | .4274418 .0240442 17.78 0.000 .3803003 .4745833
glucose | .0447748 .0079966 5.60 0.000 .0290966 .060453
totchol | .0085681 .0045045 1.90 0.057 -.0002634 .0173997
|
education |
3 | -1.345866 .5165159 -2.61 0.009 -2.358555 -.3331771
4 | -2.329898 .6034925 -3.86 0.000 -3.513114 -1.146682
|
heartrate | .0505673 .0163351 3.10 0.002 .0185406 .0825941
1.bpmeds | 7.319757 1.14673 6.38 0.000 5.07146 9.568053
diabp | 1.041687 .0203397 51.21 0.000 1.001809 1.081565
1.prevalen~p | 12.88386 .541168 23.81 0.000 11.82284 13.94488
_cons | 12.77098 2.327819 5.49 0.000 8.207023 17.33494
------------------------------------------------------------------------------
----------------------------------------------------------------
ModelA14 ModelA10 Modelstep ModelA8
b b b b
----------------------------------------------------------------
bmi -.0248742
age .4283898 .4276106 .4274418 .4318045
diabp 1.04679 1.04436 1.041687 1.044746
heartrate .0474775 .0481429 .0505673 .0528364
glucose .0464323 .045165 .0447748 .0447871
cigsperday .012802
totchol .0085838 .0085569 .0085681
0.male 0 0 0
1.male -2.960022 -2.92213 -2.754165 -2.823336
1.education 0 0 0
2.education -.6177673 -.5922715 -.5455734
3.education -1.62405 -1.592616 -1.345866 -1.558717
4.education -2.599828 -2.567674 -2.329898 -2.539784
0.currents~r 0 0
1.currents~r .3914087 .644381
0.prevalen~e 0
1.prevalen~e -1.355887
0.prevalen~p 0 0 0
1.prevalen~p 12.90949 12.88502 12.88386 12.89554
0.bpmeds 0 0 0
1.bpmeds 7.41619 7.345691 7.319757 7.428462
0.diabetes 0
1.diabetes -.2279949
_cons 13.08773 12.70032 12.77098 14.4055
----------------------------------------------------------------
r2 .7345479 .7344904 .7341883 .7340253
r2_a .7333808 .7336158 .7334591 .7332956
bic 28297.08 28265.06 28252.81 28255.05
aic 28191.61 28184.4 28184.56 28186.8
----------------------------------------------------------------
- The table above compares:
  - the full model with all predictors (ModelA14)
  - the model resulting from backward elimination (Modelstep)
  - two models from best subset selection, chosen for the lowest AIC (ModelA10) and the lowest BIC (ModelA8)
- Modelstep (the model from stepwise regression) performs best overall when compared to the other models.
- The stepwise optimal model has the lowest \(BIC=28252.81\): it is 2.24 below ModelA8, 12.25 below ModelA10 and 44.27 below the full model.
- The model with 10 predictors has the lowest \(AIC=28184.4\), but the stepwise optimal model (\(AIC=28184.56\)) is only 0.16 higher. This difference is negligible, so the two models are effectively tied on AIC while the stepwise model is clearly ahead on BIC.
- The adjusted \(R^2\) values are almost the same for all four models, with only marginal differences (all \(adj.R^2 \approx 0.73\)).
- The best model is therefore the optimal model from stepwise regression.
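The ranking can be checked directly from the numbers in the comparison table. The following Python sketch computes each model's distance from the best AIC and BIC, and reproduces the adjusted \(R^2\) of the full model from its \(R^2\). The helper names `deltas` and `adjusted_r2` are illustrative; the figures are copied from the estout table and the full-model regression output.

```python
N_OBS = 3656

# AIC and BIC copied from the estout comparison table
MODELS = {
    "ModelA14": {"aic": 28191.61, "bic": 28297.08},
    "ModelA10": {"aic": 28184.40, "bic": 28265.06},
    "Modelstep": {"aic": 28184.56, "bic": 28252.81},
    "ModelA8": {"aic": 28186.80, "bic": 28255.05},
}


def deltas(models, criterion):
    """Difference between each model's criterion and the smallest value."""
    best = min(m[criterion] for m in models.values())
    return {name: round(m[criterion] - best, 2) for name, m in models.items()}


def adjusted_r2(r2, n, k):
    """Adjusted R-squared; k is the number of slope coefficients (no intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


delta_bic = deltas(MODELS, "bic")
delta_aic = deltas(MODELS, "aic")
adj_full = adjusted_r2(0.7345479, N_OBS, 16)  # full model: F(16, 3639)

print("delta BIC:", delta_bic)
print("delta AIC:", delta_aic)
print("adjusted R2, full model:", round(adj_full, 4))
```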
5 Model diagnostics
Linearity
import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol
rvfplot, yline(0) title("Residual vs Fitted Values")
estat hettest
graph export linear.png ,replace
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(11, 3644) = 915.52
Model | 1309930.59 11 119084.599 Prob > F = 0.0000
Residual | 473987.479 3,644 130.073403 R-squared = 0.7343
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.405
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.male | -2.777589 .3886811 -7.15 0.000 -3.539643 -2.015535
age | .4207818 .0246387 17.08 0.000 .3724748 .4690888
|
education |
2 | -.5738198 .4641805 -1.24 0.216 -1.483899 .3362595
3 | -1.595382 .5545176 -2.88 0.004 -2.682577 -.508186
4 | -2.578322 .6360304 -4.05 0.000 -3.825333 -1.331312
|
1.bpmeds | 7.353518 1.146972 6.41 0.000 5.104746 9.602289
1.prevalen~p | 12.88441 .541129 23.81 0.000 11.82346 13.94535
diabp | 1.041448 .0203391 51.20 0.000 1.001571 1.081325
heartrate | .0506771 .0163341 3.10 0.002 .0186521 .082702
glucose | .0446825 .0079963 5.59 0.000 .0290047 .0603602
totchol | .0087429 .0045064 1.94 0.052 -.0000924 .0175782
_cons | 13.33082 2.371297 5.62 0.000 8.681618 17.98002
------------------------------------------------------------------------------
Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
Assumption: Normal error terms
Variable: Fitted values of sysbp
H0: Constant variance
chi2(1) = 526.41
Prob > chi2 = 0.0000
file linear.png saved as PNG format
- The points are randomly scattered around the zero line and hence do not indicate any strong departure from linearity.
- The Breusch–Pagan test for heteroskedasticity gives \(\chi^2(1)=526.41\) with \(p<0.001\), so we reject the null hypothesis of constant variance: the residuals are heteroskedastic.
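Stata prints the p-value as 0.0000 because it is rounded to four decimals. As a rough check, the upper-tail probability of a chi-square statistic with 1 degree of freedom can be computed with the standard library alone, using the identity \(P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})\). The function name `chi2_sf_1df` is an illustrative choice.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))


p_breusch_pagan = chi2_sf_1df(526.41)  # statistic from estat hettest

print("p < 0.001:", p_breusch_pagan < 0.001)
print("p at the 5% critical value 3.8415:", round(chi2_sf_1df(3.841458820694124), 4))
```

The identity holds only for 1 degree of freedom; a test with more degrees of freedom would need the general chi-square distribution.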
Normality
import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol
predict resid, residual
histogram resid, normal title("Histogram of Residuals with Normal Curve")
graph export hist.png , replace
qnorm resid, title("Normal Q-Q Plot of Residuals")
graph export normality.png , replace
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(11, 3644) = 915.52
Model | 1309930.59 11 119084.599 Prob > F = 0.0000
Residual | 473987.479 3,644 130.073403 R-squared = 0.7343
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.405
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.male | -2.777589 .3886811 -7.15 0.000 -3.539643 -2.015535
age | .4207818 .0246387 17.08 0.000 .3724748 .4690888
|
education |
2 | -.5738198 .4641805 -1.24 0.216 -1.483899 .3362595
3 | -1.595382 .5545176 -2.88 0.004 -2.682577 -.508186
4 | -2.578322 .6360304 -4.05 0.000 -3.825333 -1.331312
|
1.bpmeds | 7.353518 1.146972 6.41 0.000 5.104746 9.602289
1.prevalen~p | 12.88441 .541129 23.81 0.000 11.82346 13.94535
diabp | 1.041448 .0203391 51.20 0.000 1.001571 1.081325
heartrate | .0506771 .0163341 3.10 0.002 .0186521 .082702
glucose | .0446825 .0079963 5.59 0.000 .0290047 .0603602
totchol | .0087429 .0045064 1.94 0.052 -.0000924 .0175782
_cons | 13.33082 2.371297 5.62 0.000 8.681618 17.98002
------------------------------------------------------------------------------
(bin=35, start=-41.595417, width=3.7957909)
file hist.png saved as PNG format
file normality.png saved as PNG format
- The histogram shows that the residuals do not deviate much from normality.
- The normal quantile-quantile plot, however, suggests a slight deviation from normality, indicating that a transformation might be required.
(c) Test for multicollinearity using Variance Inflation Factor (VIF)
import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol, allbase
vif
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(11, 3644) = 915.52
Model | 1309930.59 11 119084.599 Prob > F = 0.0000
Residual | 473987.479 3,644 130.073403 R-squared = 0.7343
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.405
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male |
0 | 0 (base)
1 | -2.777589 .3886811 -7.15 0.000 -3.539643 -2.015535
|
age | .4207818 .0246387 17.08 0.000 .3724748 .4690888
|
education |
1 | 0 (base)
2 | -.5738198 .4641805 -1.24 0.216 -1.483899 .3362595
3 | -1.595382 .5545176 -2.88 0.004 -2.682577 -.508186
4 | -2.578322 .6360304 -4.05 0.000 -3.825333 -1.331312
|
bpmeds |
0 | 0 (base)
1 | 7.353518 1.146972 6.41 0.000 5.104746 9.602289
|
prevalenthyp |
0 | 0 (base)
1 | 12.88441 .541129 23.81 0.000 11.82346 13.94535
|
diabp | 1.041448 .0203391 51.20 0.000 1.001571 1.081325
heartrate | .0506771 .0163341 3.10 0.002 .0186521 .082702
glucose | .0446825 .0079963 5.59 0.000 .0290047 .0603602
totchol | .0087429 .0045064 1.94 0.052 -.0000924 .0175782
_cons | 13.33082 2.371297 5.62 0.000 8.681618 17.98002
------------------------------------------------------------------------------
Variable | VIF 1/VIF
-------------+----------------------
1.male | 1.05 0.954127
age | 1.25 0.799839
education |
2 | 1.27 0.784590
3 | 1.20 0.836742
4 | 1.16 0.859593
1.bpmeds | 1.09 0.918649
1.prevalen~p | 1.77 0.566482
diabp | 1.67 0.599928
heartrate | 1.08 0.928929
glucose | 1.03 0.973544
totchol | 1.11 0.901242
-------------+----------------------
Mean VIF | 1.24
- Based on the Stata output, all the variables have a VIF below 5 (the largest is 1.77, for prevalenthyp, and the mean VIF is 1.24). There is therefore no evidence of problematic multicollinearity, and no need to adjust or remove any variables.
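The two columns of the VIF table are linked by \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors, so the `1/VIF` column is the tolerance \(1-R_j^2\). The Python sketch below recovers the VIFs, the implied auxiliary \(R_j^2\) and the mean VIF from the tolerances in the table; the dictionary keys are labels chosen here for readability.

```python
# Tolerance (1/VIF) values copied from the Stata vif output
TOLERANCE = {
    "1.male": 0.954127,
    "age": 0.799839,
    "2.education": 0.784590,
    "3.education": 0.836742,
    "4.education": 0.859593,
    "1.bpmeds": 0.918649,
    "1.prevalenthyp": 0.566482,
    "diabp": 0.599928,
    "heartrate": 0.928929,
    "glucose": 0.973544,
    "totchol": 0.901242,
}

vif = {name: 1 / tol for name, tol in TOLERANCE.items()}
aux_r2 = {name: 1 - tol for name, tol in TOLERANCE.items()}  # R_j^2 of each auxiliary regression
mean_vif = sum(vif.values()) / len(vif)
worst = max(vif, key=vif.get)

print("largest VIF:", worst, round(vif[worst], 2))
print("implied auxiliary R2 for that term:", round(aux_r2[worst], 3))
print("mean VIF:", round(mean_vif, 2))
```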
(d) Identify any influential observations (e.g., using Cook’s Distance), and discuss their impact
import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol, allbase
predict cookd, cooksd
gen obs = _n
gen threshold = 4/_N
twoway (scatter cookd obs) ///
(line threshold obs, lcolor(red) lpattern(dash)), ///
title("Cook's Distance Plot") ///
ytitle("Cook's Distance") xtitle("Observation Number")
graph export cooks.png , replace
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(11, 3644) = 915.52
Model | 1309930.59 11 119084.599 Prob > F = 0.0000
Residual | 473987.479 3,644 130.073403 R-squared = 0.7343
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.405
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male |
0 | 0 (base)
1 | -2.777589 .3886811 -7.15 0.000 -3.539643 -2.015535
|
age | .4207818 .0246387 17.08 0.000 .3724748 .4690888
|
education |
1 | 0 (base)
2 | -.5738198 .4641805 -1.24 0.216 -1.483899 .3362595
3 | -1.595382 .5545176 -2.88 0.004 -2.682577 -.508186
4 | -2.578322 .6360304 -4.05 0.000 -3.825333 -1.331312
|
bpmeds |
0 | 0 (base)
1 | 7.353518 1.146972 6.41 0.000 5.104746 9.602289
|
prevalenthyp |
0 | 0 (base)
1 | 12.88441 .541129 23.81 0.000 11.82346 13.94535
|
diabp | 1.041448 .0203391 51.20 0.000 1.001571 1.081325
heartrate | .0506771 .0163341 3.10 0.002 .0186521 .082702
glucose | .0446825 .0079963 5.59 0.000 .0290047 .0603602
totchol | .0087429 .0045064 1.94 0.052 -.0000924 .0175782
_cons | 13.33082 2.371297 5.62 0.000 8.681618 17.98002
------------------------------------------------------------------------------
file cooks.png saved as PNG format
- In the Cook’s Distance plot, the red dashed line marks the common influence threshold of \(4/n = 4/3656 \approx 0.0011\).
- Most of the observations (blue points) lie below this line, suggesting that the majority have minimal influence on the regression model.
- Moreover, none of the Cook’s D values approaches or exceeds 1, implying that there are no highly influential observations in the sample.
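The cut-off drawn in the plot comes from `gen threshold = 4/_N`. A small Python sketch of the same rule is shown here, together with a helper that flags observations above it. The Cook's D values in `example_d` are hypothetical and serve only to illustrate the flagging step; they are not taken from the data.

```python
def cooks_cutoff(n_obs):
    """Rule-of-thumb influence threshold 4/n."""
    return 4 / n_obs


def flag_influential(cooks_d, n_obs):
    """Return the positions of observations whose Cook's D exceeds 4/n."""
    cutoff = cooks_cutoff(n_obs)
    return [i for i, d in enumerate(cooks_d) if d > cutoff]


cutoff = cooks_cutoff(3656)  # sample size reported by Stata

# Hypothetical Cook's D values, for illustration only
example_d = [0.0002, 0.0031, 0.0009, 0.0150]
flagged = flag_influential(example_d, 3656)

print("cutoff:", round(cutoff, 4))
print("flagged positions:", flagged)
```

Points above \(4/n\) deserve a closer look, while values near 1 are the conventional sign of a seriously influential observation.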
Autocorrelation
import delimited framingham_clean
gen trend = _n
tsset trend
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol , allbase
dwstat
estat bgodfrey
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Time variable: trend, 1 to 3656
Delta: 1 unit
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(11, 3644) = 915.52
Model | 1309930.59 11 119084.599 Prob > F = 0.0000
Residual | 473987.479 3,644 130.073403 R-squared = 0.7343
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.405
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male |
0 | 0 (base)
1 | -2.777589 .3886811 -7.15 0.000 -3.539643 -2.015535
|
age | .4207818 .0246387 17.08 0.000 .3724748 .4690888
|
education |
1 | 0 (base)
2 | -.5738198 .4641805 -1.24 0.216 -1.483899 .3362595
3 | -1.595382 .5545176 -2.88 0.004 -2.682577 -.508186
4 | -2.578322 .6360304 -4.05 0.000 -3.825333 -1.331312
|
bpmeds |
0 | 0 (base)
1 | 7.353518 1.146972 6.41 0.000 5.104746 9.602289
|
prevalenthyp |
0 | 0 (base)
1 | 12.88441 .541129 23.81 0.000 11.82346 13.94535
|
diabp | 1.041448 .0203391 51.20 0.000 1.001571 1.081325
heartrate | .0506771 .0163341 3.10 0.002 .0186521 .082702
glucose | .0446825 .0079963 5.59 0.000 .0290047 .0603602
totchol | .0087429 .0045064 1.94 0.052 -.0000924 .0175782
_cons | 13.33082 2.371297 5.62 0.000 8.681618 17.98002
------------------------------------------------------------------------------
Durbin–Watson d-statistic( 12, 3656) = 1.995538
Breusch–Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
lags(p) | chi2 df Prob > chi2
-------------+-------------------------------------------------------------
1 | 0.018 1 0.8927
---------------------------------------------------------------------------
H0: no serial correlation
- The Durbin–Watson test gives a statistic of \(1.995538 \approx 2\), indicating no sign of positive or negative autocorrelation.
- The Breusch–Godfrey LM test agrees: its \(p = 0.8927\) is above 0.05, so we fail to reject the null hypothesis of no serial correlation. There is therefore no evidence of autocorrelation in our data.
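The reading of the Durbin–Watson statistic relies on the approximation \(d \approx 2(1-\hat\rho)\), where \(\hat\rho\) is the first-order autocorrelation of the residuals. A minimal Python sketch of that conversion, with an illustrative function name, is:

```python
def implied_rho(dw_statistic):
    """First-order residual autocorrelation implied by d ~ 2(1 - rho)."""
    return 1 - dw_statistic / 2


rho_hat = implied_rho(1.995538)  # Durbin-Watson statistic from dwstat

print("implied first-order autocorrelation:", round(rho_hat, 4))
```

A value of \(d\) near 0 would imply \(\hat\rho\) near 1 (strong positive autocorrelation) and a value near 4 would imply \(\hat\rho\) near -1. Note that the observations here are ordered by row number, not by time, so the test checks for dependence between adjacent rows of the file.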
6 Interpretation and reflection
import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol, allbase
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(11, 3644) = 915.52
Model | 1309930.59 11 119084.599 Prob > F = 0.0000
Residual | 473987.479 3,644 130.073403 R-squared = 0.7343
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.405
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male |
0 | 0 (base)
1 | -2.777589 .3886811 -7.15 0.000 -3.539643 -2.015535
|
age | .4207818 .0246387 17.08 0.000 .3724748 .4690888
|
education |
1 | 0 (base)
2 | -.5738198 .4641805 -1.24 0.216 -1.483899 .3362595
3 | -1.595382 .5545176 -2.88 0.004 -2.682577 -.508186
4 | -2.578322 .6360304 -4.05 0.000 -3.825333 -1.331312
|
bpmeds |
0 | 0 (base)
1 | 7.353518 1.146972 6.41 0.000 5.104746 9.602289
|
prevalenthyp |
0 | 0 (base)
1 | 12.88441 .541129 23.81 0.000 11.82346 13.94535
|
diabp | 1.041448 .0203391 51.20 0.000 1.001571 1.081325
heartrate | .0506771 .0163341 3.10 0.002 .0186521 .082702
glucose | .0446825 .0079963 5.59 0.000 .0290047 .0603602
totchol | .0087429 .0045064 1.94 0.052 -.0000924 .0175782
_cons | 13.33082 2.371297 5.62 0.000 8.681618 17.98002
------------------------------------------------------------------------------
- Males have on average 2.777589 mmHg lower systolic blood pressure than females when adjusting for other variables, and the result is statistically significant.
- A 1-year increase in age is associated with a 0.4207818 mmHg increase in systolic blood pressure when adjusting for other variables. The result is statistically significant at the 5% level of significance.
- Going up the education level categories, systolic blood pressure decreases relative to the education level 1 baseline category when adjusting for other variables, although not every contrast is significant. More precisely:
  - Education level category 2 individuals have 0.5738198 mmHg lower systolic blood pressure than the baseline (education level 1), though the result is not statistically significant (\(p=0.216\)).
  - Education level category 3 individuals have 1.595382 mmHg lower systolic blood pressure than the baseline (education level 1), and the result is statistically significant.
  - Education level category 4 individuals have 2.578322 mmHg lower systolic blood pressure than the baseline (education level 1), and the result is statistically significant.
- When adjusting for other variables, people who take bpmeds have on average 7.353518 mmHg higher systolic blood pressure than those who do not take meds, and the result is statistically significant.
- When adjusting for other variables, prevalent hypertension patients have on average 12.88441 mmHg higher systolic blood pressure than those who are not prevalent hypertension patients, and the result is statistically significant.
- A unit increase in diastolic blood pressure is associated with a significant 1.041448 mmHg increase in systolic blood pressure when adjusting for other variables.
- A unit increase in heart rate is associated with a significant 0.0506771 mmHg increase in systolic blood pressure when adjusting for other variables.
- A unit increase in glucose level is associated with a significant 0.0446825 mmHg increase in systolic blood pressure when adjusting for other variables.
- A unit increase in total cholesterol is associated with a 0.0087429 mmHg increase in systolic blood pressure when adjusting for other variables, but this result is not statistically significant at the 5% level (\(p=0.052\)).
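To see how these adjusted effects combine, the Python sketch below plugs the reported coefficients into the fitted equation for one hypothetical individual. The profile used (a 50-year-old male, education level 3, no medication or prevalent hypertension, diastolic pressure 80, heart rate 75, glucose 80, total cholesterol 230) is invented for illustration and is not a record from the data.

```python
# Coefficients copied from the final regression output
COEF = {
    "_cons": 13.33082,
    "male": -2.777589,
    "age": 0.4207818,
    "education": {1: 0.0, 2: -0.5738198, 3: -1.595382, 4: -2.578322},
    "bpmeds": 7.353518,
    "prevalenthyp": 12.88441,
    "diabp": 1.041448,
    "heartrate": 0.0506771,
    "glucose": 0.0446825,
    "totchol": 0.0087429,
}


def predict_sysbp(male, age, education, bpmeds, prevalenthyp,
                  diabp, heartrate, glucose, totchol):
    """Fitted systolic blood pressure (mmHg) from the linear model."""
    return (COEF["_cons"]
            + COEF["male"] * male
            + COEF["age"] * age
            + COEF["education"][education]
            + COEF["bpmeds"] * bpmeds
            + COEF["prevalenthyp"] * prevalenthyp
            + COEF["diabp"] * diabp
            + COEF["heartrate"] * heartrate
            + COEF["glucose"] * glucose
            + COEF["totchol"] * totchol)


# Hypothetical individual, for illustration only
profile = dict(male=1, age=50, education=3, bpmeds=0, prevalenthyp=0,
               diabp=80, heartrate=75, glucose=80, totchol=230)
fitted_male = predict_sysbp(**profile)
fitted_female = predict_sysbp(**{**profile, "male": 0})

print("fitted sysbp:", round(fitted_male, 1))
print("male minus female:", round(fitted_male - fitted_female, 3))
```

Changing one input while holding the others fixed shifts the fitted value by exactly that variable's coefficient, which is what "adjusting for other variables" means in the bullets above.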
Section C
(a) Fit a direct model that resembles the final model on question 4, show the SEM diagram and results table (side by side with those for 4c). Comment on the similarities and differences in your result
- First we create dummy variables for education since it has more than 2 levels.
import delimited framingham_clean
regress sysbp i.male age i.education i.bpmeds i.prevalenthyp diabp heartrate glucose totchol, allbase
*Creating the dummy variables for education category
tab education, gen(educationlevels)
*Fitting the model
sem (diabp -> sysbp, ) (educationlevels2 -> sysbp, ) (educationlevels3 -> sysbp, ) (educationlevels4 -> sysbp, ) (glucose -> sysbp, ) (heartrate -> sysbp, ) (bpmeds -> sysbp, ) (prevalenthyp -> sysbp, ) (male -> sysbp, ) (age -> sysbp, ) (totchol -> sysbp, ), covstructure(e._endogenous, unstructured) nocapslatent
*Education level 2 was removed because it was insignificant in the model (p-value=0.240) from the final model.
estat mindices
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
Source | SS df MS Number of obs = 3,656
-------------+---------------------------------- F(11, 3644) = 915.52
Model | 1309930.59 11 119084.599 Prob > F = 0.0000
Residual | 473987.479 3,644 130.073403 R-squared = 0.7343
-------------+---------------------------------- Adj R-squared = 0.7335
Total | 1783918.07 3,655 488.07608 Root MSE = 11.405
------------------------------------------------------------------------------
sysbp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
male |
0 | 0 (base)
1 | -2.777589 .3886811 -7.15 0.000 -3.539643 -2.015535
|
age | .4207818 .0246387 17.08 0.000 .3724748 .4690888
|
education |
1 | 0 (base)
2 | -.5738198 .4641805 -1.24 0.216 -1.483899 .3362595
3 | -1.595382 .5545176 -2.88 0.004 -2.682577 -.508186
4 | -2.578322 .6360304 -4.05 0.000 -3.825333 -1.331312
|
bpmeds |
0 | 0 (base)
1 | 7.353518 1.146972 6.41 0.000 5.104746 9.602289
|
prevalenthyp |
0 | 0 (base)
1 | 12.88441 .541129 23.81 0.000 11.82346 13.94535
|
diabp | 1.041448 .0203391 51.20 0.000 1.001571 1.081325
heartrate | .0506771 .0163341 3.10 0.002 .0186521 .082702
glucose | .0446825 .0079963 5.59 0.000 .0290047 .0603602
totchol | .0087429 .0045064 1.94 0.052 -.0000924 .0175782
_cons | 13.33082 2.371297 5.62 0.000 8.681618 17.98002
------------------------------------------------------------------------------
education | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,526 41.74 41.74
2 | 1,101 30.11 71.85
3 | 606 16.58 88.43
4 | 423 11.57 100.00
------------+-----------------------------------
Total | 3,656 100.00
Endogenous variables
Observed: sysbp
Exogenous variables
Observed: diabp educationlevels2 educationlevels3 educationlevels4 glucose
heartrate bpmeds prevalenthyp male age totchol
Fitting target model:
Iteration 0: Log likelihood = -98047.863
Iteration 1: Log likelihood = -98047.863
Structural equation model Number of obs = 3,656
Estimation method: ml
Log likelihood = -98047.863
------------------------------------------------------------------------------
| OIM
| Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Structural |
sysbp |
diabp | 1.041448 .0203057 51.29 0.000 1.00165 1.081246
educatio~2 | -.5738198 .4634181 -1.24 0.216 -1.482103 .3344629
educatio~3 | -1.595382 .5536068 -2.88 0.004 -2.680431 -.5103322
educatio~4 | -2.578322 .6349857 -4.06 0.000 -3.822872 -1.333773
glucose | .0446825 .0079832 5.60 0.000 .0290357 .0603292
heartrate | .0506771 .0163073 3.11 0.002 .0187154 .0826387
bpmeds | 7.353518 1.145089 6.42 0.000 5.109186 9.59785
prevalen~p | 12.88441 .5402402 23.85 0.000 11.82555 13.94326
male | -2.777589 .3880427 -7.16 0.000 -3.538139 -2.017039
age | .4207818 .0245982 17.11 0.000 .3725701 .4689934
totchol | .0087429 .004499 1.94 0.052 -.000075 .0175607
_cons | 13.33082 2.367402 5.63 0.000 8.690795 17.97084
-------------+----------------------------------------------------------------
var(e.sysbp)| 129.6465 3.032303 123.8374 135.728
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0) = 0.00 Prob > chi2 = .
(no modification indices to report, all MI values less than 3.841458820694123)
Model Structural Output
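- The linear regression output and the structural equation model (SEM) output are largely similar, with a few differences.
- The estimated path coefficients in both outputs are identical for all variables, including the constant/intercept.
- The z and t values differ only slightly (for example, 51.29 vs 51.20 for diabp), because sem reports maximum-likelihood (OIM) standard errors and z statistics, whereas regress reports OLS standard errors and t statistics.
- The significance of the variables at the 5% level is also identical.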
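The small gap between the two sets of standard errors follows from the divisor used for the error variance: maximum likelihood divides the residual sum of squares by \(n\), while OLS divides by \(n-k\), where \(k\) is the number of estimated coefficients. The Python sketch below checks this against the reported output for diabp, assuming \(k=12\) (11 slopes plus the intercept).

```python
from math import sqrt

N_OBS = 3656
N_COEF = 12            # assumption: 11 slopes + intercept
RSS = 473987.479       # residual sum of squares from regress
OLS_SE_DIABP = 0.0203391

ml_scale = sqrt((N_OBS - N_COEF) / N_OBS)
ml_se_diabp = OLS_SE_DIABP * ml_scale      # compare with the sem std. err. for diabp
ml_error_var = RSS / N_OBS                 # compare with var(e.sysbp) in sem
ols_error_var = RSS / (N_OBS - N_COEF)     # compare with MS Residual in regress

print("ML standard error for diabp:", round(ml_se_diabp, 7))
print("ML error variance:", round(ml_error_var, 4))
print("OLS error variance:", round(ols_error_var, 4))
```

With 3,656 observations the scaling factor is very close to 1, which is why the two sets of test statistics and significance conclusions agree.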
(b) Work on improving the direct model by introducing some indirect pathways based on research knowledge of the field or suggested pathways from "estat mindices". Display the final direct and indirect SEM diagram and explain your approach of the indirect pathways and/or correlations introduced. Hint: Do not make the modifications too complex, make a few alterations that help improve the model
import delimited framingham_clean
*Creating the dummy variables for education category since it has more than two levels
tab education, gen(educationlevels)
sem (prevalenthyp -> sysbp, ) (educationlevels3 -> sysbp, ) (educationlevels4 -> sysbp, ) (male -> sysbp, ) (glucose -> sysbp, ) (glucose -> prevalenthyp, ) (heartrate -> sysbp, ) (heartrate -> prevalenthyp, ) (diabp -> sysbp, ) (diabp -> prevalenthyp, ) (age -> sysbp, ) (age -> prevalenthyp, ) (bpmeds -> sysbp, ) (bpmeds -> prevalenthyp, ) (totchol -> sysbp, ), nocapslatent
estat gof, stats(all)
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
education | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,526 41.74 41.74
2 | 1,101 30.11 71.85
3 | 606 16.58 88.43
4 | 423 11.57 100.00
------------+-----------------------------------
Total | 3,656 100.00
Endogenous variables
Observed: prevalenthyp sysbp
Exogenous variables
Observed: educationlevels3 educationlevels4 male glucose heartrate diabp
age bpmeds totchol
Fitting target model:
Iteration 0: Log likelihood = -96155.335
Iteration 1: Log likelihood = -96155.335
Structural equation model Number of obs = 3,656
Estimation method: ml
Log likelihood = -96155.335
------------------------------------------------------------------------------
| OIM
| Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Structural |
prevalen~p |
glucose | .0003951 .0002444 1.62 0.106 -.0000839 .000874
heartrate | .0017932 .000492 3.64 0.000 .000829 .0027575
diabp | .0211249 .0005093 41.48 0.000 .0201267 .0221231
age | .0093578 .0006969 13.43 0.000 .0079918 .0107237
bpmeds | .3480831 .0344983 10.09 0.000 .2804678 .4156985
_cons | -2.082419 .0588731 -35.37 0.000 -2.197809 -1.96703
-----------+----------------------------------------------------------------
sysbp |
prevalen~p | 12.88386 .5403533 23.84 0.000 11.82479 13.94293
educatio~3 | -1.345866 .5157383 -2.61 0.009 -2.356694 -.3350375
educatio~4 | -2.329898 .6025839 -3.87 0.000 -3.510941 -1.148855
male | -2.754165 .3876626 -7.10 0.000 -3.51397 -1.99436
glucose | .0447748 .0079845 5.61 0.000 .0291254 .0604242
heartrate | .0505673 .0163105 3.10 0.002 .0185994 .0825353
diabp | 1.041687 .020309 51.29 0.000 1.001882 1.081492
age | .4274418 .024008 17.80 0.000 .3803869 .4744967
bpmeds | 7.319757 1.145004 6.39 0.000 5.07559 9.563923
totchol | .0085681 .0044977 1.91 0.057 -.0002472 .0173835
_cons | 12.77098 2.324315 5.49 0.000 8.215408 17.32655
-------------+----------------------------------------------------------------
var(e.prev~p)| .1216345 .0028449 .1161845 .1273403
var(e.sysbp)| 129.7008 3.033575 123.8894 135.7849
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(4) = 4.01 Prob > chi2 = 0.4051
----------------------------------------------------------------------------
Fit statistic | Value Description
---------------------+------------------------------------------------------
Likelihood ratio |
chi2_ms(4) | 4.007 model vs. saturated
p > chi2 | 0.405
chi2_bs(19) | 6921.819 baseline vs. saturated
p > chi2 | 0.000
---------------------+------------------------------------------------------
Population error |
RMSEA | 0.001 Root mean squared error of approximation
90% CI, lower bound | 0.000
upper bound | 0.025
pclose | 1.000 Probability RMSEA <= 0.05
---------------------+------------------------------------------------------
Information criteria |
AIC | 192348.671 Akaike's information criterion
BIC | 192466.549 Bayesian information criterion
---------------------+------------------------------------------------------
Baseline comparison |
CFI | 1.000 Comparative fit index
TLI | 1.000 Tucker–Lewis index
---------------------+------------------------------------------------------
Size of residuals |
SRMR | 0.003 Standardized root mean squared residual
CD | 0.707 Coefficient of determination
----------------------------------------------------------------------------
Firstly, education level 2 was removed because it was not significant (\(p=0.216>0.05\)).
Running the estat mindices command in Stata on the initial direct model did not suggest any modifications, so I relied on expert opinion and prior belief to create the indirect pathways. The direct relationship between diastolic blood pressure and systolic blood pressure was maintained; this is supported both biologically and statistically, since diastolic blood pressure is known to affect systolic blood pressure through cardiovascular risk factors.
Prevalent hypertension (prevalenthyp) was introduced as a key mediator, since individuals with prevalent hypertension often have elevated diastolic and systolic blood pressure. Glucose, heart rate, diastolic blood pressure, age and BP medication were each given a path to prevalenthyp in addition to their direct paths to sysbp.
Justification for the approach
- The changes result in a more parsimonious model, as only a few justified changes were made, to avoid overfitting.
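The modification-index step described above can be sketched as follows. This is a minimal sketch only: the direct-only specification shown here is an assumption (it simply points every predictor retained in the final model at sysbp), and the threshold of 1 passed to minchi2() is an arbitrary illustrative value below the default of roughly 3.84.

```stata
* Hypothetical direct-only model: every predictor points straight at sysbp
sem (prevalenthyp educationlevels3 educationlevels4 male glucose heartrate ///
     diabp age bpmeds totchol -> sysbp), nocapslatent

* Default listing: modification indices significant at the 5% level
estat mindices

* Lower the reporting threshold to also list weaker candidate paths
estat mindices, minchi2(1)
```

If nothing is listed even at the lower threshold, the added pathways have to come from subject-matter knowledge, as was done here.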
Model Structural Output
(c) Perform and comment on all five SEM model goodness of fit procedures and comment on how each performs based on your final SEM model.
The following command was run in Stata to obtain the model goodness-of-fit indices:
estat gof, stats(all)
Comments
- Likelihood Ratio Test
- The test (\(\chi^2(4)=4.007\), \(p=0.405\)) shows no significant difference between the fitted model and the saturated model. The null hypothesis that the model fits the data is not rejected, which is the desired outcome in SEM: the model reproduces the observed covariance structure very well.
- RMSEA (Root Mean Squared Error of Approximation)
- An RMSEA below 0.05 indicates close model fit; here \(RMSEA=0.001\), which is excellent. In addition, pclose = 1.000 is the p-value for the test of close fit (\(H_0: RMSEA \le 0.05\)), so the hypothesis of close fit is not rejected.
- The 90% confidence bounds are also within the expected range, i.e. \(LB=0.000<0.05\) and \(UB=0.025<0.1\), again suggesting a good model fit.
- CFI and TLI (Comparative Fit Index and Tucker-Lewis Index)
- Both indices are above 0.95 (both equal 1.000), indicating excellent comparative fit. The model is much better than the baseline model, which assumes no relationships among the variables.
- SRMR (Standardized Root Mean Squared Residual)
- An SRMR below 0.08 is generally considered good. For this model \(SRMR=0.003\), indicating a near-perfect fit: the model-implied correlations very closely match the observed ones.
- Coefficient of determination
- The value is \(CD=0.707\), which is quite high.
- The model explains about 70.7% of the variance in the outcome variables, indicating clinically meaningful explanatory power.
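As a rough check, the RMSEA, CFI and TLI can be recomputed by hand from the chi-squared values in the estat gof output (\(\chi^2_{ms}=4.007\), \(df_{ms}=4\), \(\chi^2_{bs}=6921.819\), \(df_{bs}=19\), \(N=3656\)). The formulas below are the common textbook forms; whether the RMSEA denominator uses \(N\) or \(N-1\) differs between references and is an assumption here, although it does not change the result at three decimals.

\[
\begin{aligned}
RMSEA &= \sqrt{\max\left(0,\ \frac{\chi^2_{ms}-df_{ms}}{df_{ms}\,(N-1)}\right)} = \sqrt{\frac{4.007-4}{4\times 3655}} \approx 0.0007 \\
CFI &= 1-\frac{\max(\chi^2_{ms}-df_{ms},\ 0)}{\max(\chi^2_{bs}-df_{bs},\ \chi^2_{ms}-df_{ms},\ 0)} = 1-\frac{0.007}{6902.819} \approx 1.000 \\
TLI &= \frac{\chi^2_{bs}/df_{bs}-\chi^2_{ms}/df_{ms}}{\chi^2_{bs}/df_{bs}-1} = \frac{364.306-1.002}{364.306-1} \approx 1.000
\end{aligned}
\]

The hand-computed RMSEA of about 0.0007 rounds to the 0.001 reported by Stata, and both comparative indices round to 1.000.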
(d) Draw-up the table of results from the final SEM model and verify numerically the STATA drawn direct effects, indirect effects and total effects for “diabp” on your outcome variable “sysbp”.
import delimited framingham_clean
*Creating the dummy variables for education category since it has more than two levels
tab education, gen(educationlevels)
sem (prevalenthyp -> sysbp, ) (educationlevels3 -> sysbp, ) (educationlevels4 -> sysbp, ) (male -> sysbp, ) (glucose -> sysbp, ) (glucose -> prevalenthyp, ) (heartrate -> sysbp, ) (heartrate -> prevalenthyp, ) (diabp -> sysbp, ) (diabp -> prevalenthyp, ) (age -> sysbp, ) (age -> prevalenthyp, ) (bpmeds -> sysbp, ) (bpmeds -> prevalenthyp, ) (totchol -> sysbp, ), nocapslatent
estat teffects
(encoding automatically selected: ISO-8859-1)
(16 vars, 3,656 obs)
education | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,526 41.74 41.74
2 | 1,101 30.11 71.85
3 | 606 16.58 88.43
4 | 423 11.57 100.00
------------+-----------------------------------
Total | 3,656 100.00
Endogenous variables
Observed: prevalenthyp sysbp
Exogenous variables
Observed: educationlevels3 educationlevels4 male glucose heartrate diabp
age bpmeds totchol
Fitting target model:
Iteration 0: Log likelihood = -96155.335
Iteration 1: Log likelihood = -96155.335
Structural equation model Number of obs = 3,656
Estimation method: ml
Log likelihood = -96155.335
------------------------------------------------------------------------------
| OIM
| Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Structural |
prevalen~p |
glucose | .0003951 .0002444 1.62 0.106 -.0000839 .000874
heartrate | .0017932 .000492 3.64 0.000 .000829 .0027575
diabp | .0211249 .0005093 41.48 0.000 .0201267 .0221231
age | .0093578 .0006969 13.43 0.000 .0079918 .0107237
bpmeds | .3480831 .0344983 10.09 0.000 .2804678 .4156985
_cons | -2.082419 .0588731 -35.37 0.000 -2.197809 -1.96703
-----------+----------------------------------------------------------------
sysbp |
prevalen~p | 12.88386 .5403533 23.84 0.000 11.82479 13.94293
educatio~3 | -1.345866 .5157383 -2.61 0.009 -2.356694 -.3350375
educatio~4 | -2.329898 .6025839 -3.87 0.000 -3.510941 -1.148855
male | -2.754165 .3876626 -7.10 0.000 -3.51397 -1.99436
glucose | .0447748 .0079845 5.61 0.000 .0291254 .0604242
heartrate | .0505673 .0163105 3.10 0.002 .0185994 .0825353
diabp | 1.041687 .020309 51.29 0.000 1.001882 1.081492
age | .4274418 .024008 17.80 0.000 .3803869 .4744967
bpmeds | 7.319757 1.145004 6.39 0.000 5.07559 9.563923
totchol | .0085681 .0044977 1.91 0.057 -.0002472 .0173835
_cons | 12.77098 2.324315 5.49 0.000 8.215408 17.32655
-------------+----------------------------------------------------------------
var(e.prev~p)| .1216345 .0028449 .1161845 .1273403
var(e.sysbp)| 129.7008 3.033575 123.8894 135.7849
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(4) = 4.01 Prob > chi2 = 0.4051
Direct effects
------------------------------------------------------------------------------
| OIM
| Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Structural |
prevalen~p |
glucose | .0003951 .0002444 1.62 0.106 -.0000839 .000874
heartrate | .0017932 .000492 3.64 0.000 .000829 .0027575
diabp | .0211249 .0005093 41.48 0.000 .0201267 .0221231
age | .0093578 .0006969 13.43 0.000 .0079918 .0107237
bpmeds | .3480831 .0344983 10.09 0.000 .2804678 .4156985
-----------+----------------------------------------------------------------
sysbp |
prevalen~p | 12.88386 .5403533 23.84 0.000 11.82479 13.94293
educatio~3 | -1.345866 .5157383 -2.61 0.009 -2.356694 -.3350375
educatio~4 | -2.329898 .6025839 -3.87 0.000 -3.510941 -1.148855
male | -2.754165 .3876626 -7.10 0.000 -3.51397 -1.99436
glucose | .0447748 .0079845 5.61 0.000 .0291254 .0604242
heartrate | .0505673 .0163105 3.10 0.002 .0185994 .0825353
diabp | 1.041687 .020309 51.29 0.000 1.001882 1.081492
age | .4274418 .024008 17.80 0.000 .3803869 .4744967
bpmeds | 7.319757 1.145004 6.39 0.000 5.07559 9.563923
totchol | .0085681 .0044977 1.91 0.057 -.0002472 .0173835
------------------------------------------------------------------------------
Indirect effects
------------------------------------------------------------------------------
| OIM
| Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Structural |
prevalen~p |
glucose | 0 (no path)
heartrate | 0 (no path)
diabp | 0 (no path)
age | 0 (no path)
bpmeds | 0 (no path)
-----------+----------------------------------------------------------------
sysbp |
prevalen~p | 0 (no path)
educatio~3 | 0 (no path)
educatio~4 | 0 (no path)
male | 0 (no path)
glucose | .0050901 .0031555 1.61 0.107 -.0010945 .0112747
heartrate | .0231035 .0064122 3.60 0.000 .0105359 .0356711
diabp | .2721698 .0131665 20.67 0.000 .246364 .2979756
age | .1205643 .0103049 11.70 0.000 .1003672 .1407615
bpmeds | 4.484655 .4826297 9.29 0.000 3.538718 5.430591
totchol | 0 (no path)
------------------------------------------------------------------------------
Total effects
------------------------------------------------------------------------------
| OIM
| Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
Structural |
prevalen~p |
glucose | .0003951 .0002444 1.62 0.106 -.0000839 .000874
heartrate | .0017932 .000492 3.64 0.000 .000829 .0027575
diabp | .0211249 .0005093 41.48 0.000 .0201267 .0221231
age | .0093578 .0006969 13.43 0.000 .0079918 .0107237
bpmeds | .3480831 .0344983 10.09 0.000 .2804678 .4156985
-----------+----------------------------------------------------------------
sysbp |
prevalen~p | 12.88386 .5403533 23.84 0.000 11.82479 13.94293
educatio~3 | -1.345866 .5157383 -2.61 0.009 -2.356694 -.3350375
educatio~4 | -2.329898 .6025839 -3.87 0.000 -3.510941 -1.148855
male | -2.754165 .3876626 -7.10 0.000 -3.51397 -1.99436
glucose | .0498649 .0085801 5.81 0.000 .0330483 .0666815
heartrate | .0736708 .0174753 4.22 0.000 .0394199 .1079218
diabp | 1.313857 .0180371 72.84 0.000 1.278505 1.349209
age | .5480061 .0251491 21.79 0.000 .4987148 .5972975
bpmeds | 11.80441 1.214032 9.72 0.000 9.424953 14.18387
totchol | .0085681 .0044977 1.91 0.057 -.0002472 .0173835
------------------------------------------------------------------------------
diabp | PrevalentHyp | Sysbp (Outcome) |
---|---|---|
Direct effect | 0.0211 | 1.0417 |
Indirect via PrevalentHyp | 0 (no path) | 12.8839 × 0.021125 = 0.2722 |
Total effect | 0.0211 | 1.0417 + 0.2722 = 1.3139 |
Direct Effect contribution
\(\frac{1.0417}{1.3139}\times 100=79.3\%\)
Indirect Effect contribution
\(\frac{0.2722}{1.3139}\times 100=20.7\%\)
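The numerical verification can also be done inside Stata. The sketch below assumes the final sem model above is still the active estimation result; the labels indirect and total are illustrative names, not part of the original analysis. The display lines redo the arithmetic from the reported coefficients, and nlcom adds delta-method standard errors for the same quantities.

```stata
* Assumes the final sem model above is the active estimation result

* Hand check of the point estimates from the reported coefficients
display 12.88386 * .0211249              // indirect effect via prevalenthyp
display 1.041687 + 12.88386 * .0211249   // total effect of diabp on sysbp

* Same quantities from the stored coefficients, with delta-method SEs
nlcom (indirect: _b[sysbp:prevalenthyp] * _b[prevalenthyp:diabp]) ///
      (total:    _b[sysbp:diabp] + _b[sysbp:prevalenthyp] * _b[prevalenthyp:diabp])
```

Both results should agree closely with the indirect effect (about 0.272) and total effect (about 1.314) that estat teffects reports for diabp on sysbp.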
(e) Interpret your final SEM model and comment on whether SEM helped improve the direct model from 4c)
Final SEM Model
- The final model has:
- Endogenous variables (observed): prevalenthyp and sysbp; here we observe interrelationships
- Exogenous variables (observed): educationlevels3, educationlevels4, male, glucose, heartrate, diabp, age, bpmeds, totchol
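Written out as equations, the fitted model takes the form below. This is simply a restatement of the coefficients in the Stata output, rounded for readability; the hat notation for fitted values is added here for illustration.

\[
\begin{aligned}
\widehat{\text{prevalenthyp}} &= -2.082 + 0.0004\,\text{glucose} + 0.0018\,\text{heartrate} + 0.0211\,\text{diabp} + 0.0094\,\text{age} + 0.348\,\text{bpmeds} \\
\widehat{\text{sysbp}} &= 12.771 + 12.884\,\text{prevalenthyp} - 1.346\,\text{educationlevels3} - 2.330\,\text{educationlevels4} - 2.754\,\text{male} \\
&\quad + 0.045\,\text{glucose} + 0.051\,\text{heartrate} + 1.042\,\text{diabp} + 0.427\,\text{age} + 7.320\,\text{bpmeds} + 0.0086\,\text{totchol}
\end{aligned}
\]

Because prevalenthyp appears on the left of the first equation and on the right of the second, any predictor in the first equation also reaches sysbp indirectly through it.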
Summary of results
Direct effects on systolic blood pressure
- Prevalent hypertension has a major effect on systolic blood pressure: those with prevalent hypertension have, on average, 12.88 mmHg higher systolic blood pressure than their counterparts, adjusting for the other variables (\(\beta \approx 12.88, p<0.001\)).
- Diastolic blood pressure has a positive, significant total effect on systolic blood pressure (\(p<0.001\)): a one-unit (1 mmHg) increase in diastolic blood pressure is associated with a 1.314 mmHg increase in systolic blood pressure, combining the direct path and the path mediated by prevalent hypertension and controlling for the other variables. About \(20.7\%\) of this effect is indirect, through prevalent hypertension, and the remaining \(79.3\%\) is the direct effect of diastolic blood pressure on systolic blood pressure.
Model improvement
- The \(SEM\) helped to improve the model since:
- The root mean squared error of approximation (\(RMSEA=0.001<0.05\)) indicates a close fit.
- CFI and TLI both equal 1, showing an excellent fit.
- Overall, the chi-squared test p-value improved from \(0.00\) to \(p=0.405\), indicating that the model is no longer significantly worse than a saturated model; hence the final model is greatly improved.
The remaining effects are shown in the table below:
Outcome | Predictor | β | SE | p | Clinical Interpretation |
---|---|---|---|---|---|
Binary Outcome: Hypertension Status | |||||
Prevalent Hypertension | Glucose | 0.0004 | 0.0002 | 0.106 | NS: No significant association with hypertension risk |
Prevalent Hypertension | Heart Rate | 0.0018 | 0.0005 | <0.001 | Sig: Each 1 bpm increase → 0.18 percentage-point higher probability of hypertension |
Prevalent Hypertension | Diastolic BP | 0.0211 | 0.0005 | <0.001 | STRONG: Each 1 mmHg → 2.1 percentage-point higher probability of hypertension (key predictor) |
Prevalent Hypertension | Age | 0.0094 | 0.0007 | <0.001 | Sig: Each year of age → 0.94 percentage-point higher probability of hypertension |
Prevalent Hypertension | BP Meds | 0.3481 | 0.0345 | <0.001 | Sig: BP med users have a 34.8 percentage-point higher probability of hypertension (indication bias) |
Prevalent Hypertension | Constant | -2.0824 | 0.0589 | <0.001 | Intercept of the linear equation fitted by sem (a linear probability scale, not log-odds) |
Continuous Outcome: Systolic BP (mmHg) | |||||
Systolic BP | Prevalent Hypertension | 12.8839 | 0.5404 | <0.001 | STRONG: Hypertensives average 12.9 mmHg higher SBP |
Systolic BP | Education (Level 3) | -1.3459 | 0.5157 | 0.009 | Sig: Education level 3 → 1.35 mmHg lower SBP vs education levels 1-2 (reference) |
Systolic BP | Education (Level 4) | -2.3299 | 0.6026 | <0.001 | STRONG: Education level 4 → 2.33 mmHg lower SBP vs education levels 1-2 (reference) |
Systolic BP | Male | -2.7542 | 0.3877 | <0.001 | Sig: Males average 2.75 mmHg lower SBP than females |
Systolic BP | Glucose | 0.0448 | 0.0080 | <0.001 | Sig: Each glucose unit → 0.045 mmHg higher SBP |
Systolic BP | Heart Rate | 0.0506 | 0.0163 | 0.002 | Sig: Each 1 bpm → 0.051 mmHg higher SBP |
Systolic BP | Diastolic BP | 1.0417 | 0.0203 | <0.001 | STRONG: Each 1 mmHg diastolic → 1.04 mmHg higher SBP |
Systolic BP | Age | 0.4274 | 0.0240 | <0.001 | STRONG: Each year of age → 0.43 mmHg higher SBP |
Systolic BP | BP Meds | 7.3198 | 1.1450 | <0.001 | Sig: BP med users average 7.3 mmHg higher SBP (treatment group) |
Systolic BP | Total Cholesterol | 0.0086 | 0.0045 | 0.057 | Marginal (p=0.057): Cholesterol shows weak positive trend |
Systolic BP | Constant | 12.7710 | 2.3243 | <0.001 | Baseline SBP for reference group |
Notes: Model fit: χ²(4)=4.01, p=0.405 (Excellent fit); SRMR=0.003; CD=0.707 NS = Not Significant (p>0.05); Sig = Significant (p<0.05); STRONG = p<0.001 with large effect size |
Footnotes
The only exception is if the predictor being added is either constant or identical to another variable.
Comments based on stepwise Regression
The model from the stepwise regression together with the 8 and 10 Predictor model from best subset selection will be compared further