Body weight (kg) | Plasma volume (litres) |
---|---|
58.0 | 2.75 |
70.0 | 2.86 |
74.0 | 3.37 |
63.5 | 2.76 |
62.0 | 2.62 |
70.5 | 3.49 |
71.0 | 3.05 |
66.0 | 3.12 |
Correlation is a statistical measure that describes the strength and direction of a relationship between two quantitative variables. There are three common correlation tests:
Use Pearson’s r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.
Spearman’s rho and Kendall’s tau are rank-based (non-parametric) measures, so they are valid for both quantitative and ordinal variables and do not require normality or homoscedasticity. However, non-parametric tests have less statistical power than parametric tests, so use these correlations only when Pearson’s r does not apply.
Pearson’s r
Pearson’s \(r\)
\[r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}} = \frac{cov(X,Y)}{s_X s_Y}\]
estimates the population correlation \(\rho\). Pearson’s \(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship), and \(r = 0\) when there is no linear relationship. A correlation in the range:
- \((.1, .3)\) is considered small,
- \((.3, .5)\) medium,
- and \((.5, 1.0)\) large.
Pearson’s \(r\) only applies if the variables are interval or ratio, normally distributed, linearly related, there are minimal outliers, and the residuals are homoscedastic.
Data Used for Illustration
Types of correlations
Almost perfect correlations
p1 <- ggplot(data = df, aes(x = `Body weight`, y = `Plasma volume`)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "positive and strong correlation")

p2 <- ggplot(data = df, aes(x = `Body weight`, y = 1/`Plasma volume`)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "Negative yet strong correlation")

gridExtra::grid.arrange(p1, p2, nrow = 1)
Some weak correlations
- Here is a small fake dataset that produces weak correlations:
df2 <- tibble(
  Height = c(115, 101, 99, 107, 118, 127, 120, 129),
  Weight = c(56, 50, 67, 64, 55, 70, 61, 59)
)
Visualizing weak correlations
p1 <- ggplot(data = df2, aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "positive and weak correlation")

p2 <- ggplot(data = df2, aes(x = Height, y = 1/Weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "Negative yet weak correlation")

gridExtra::grid.arrange(p1, p2, nrow = 1)
Manually computing Correlations
We are interested in finding \(\sum{(X_i - \bar{X})(Y_i - \bar{Y})}\), \(\sum{(X_i - \bar{X})^2}\), and \(\sum{(Y_i - \bar{Y})^2}\).
dummy_mse <- df %>%
  mutate("$(X_i - \\bar{X})$" = (`Body weight` - mean(`Body weight`))) %>%
  mutate("$(Y_i - \\bar{Y})$" = (`Plasma volume` - mean(`Plasma volume`))) %>%
  mutate("$(X_i - \\bar{X})^2$" = (`Body weight` - mean(`Body weight`))^2) %>%
  mutate("$(Y_i - \\bar{Y})^2$" = (`Plasma volume` - mean(`Plasma volume`))^2) %>%
  mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (`Plasma volume` - mean(`Plasma volume`)) * (`Body weight` - mean(`Body weight`)))

dummy_mse %>%
  knitr::kable()
Body weight | Plasma volume | \((X_i - \bar{X})\) | \((Y_i - \bar{Y})\) | \((X_i - \bar{X})^2\) | \((Y_i - \bar{Y})^2\) | \((Y_i - \bar{Y})(X_i - \bar{X})\) |
---|---|---|---|---|---|---|
58.0 | 2.75 | -8.875 | -0.2525 | 78.765625 | 0.0637562 | 2.2409375 |
70.0 | 2.86 | 3.125 | -0.1425 | 9.765625 | 0.0203063 | -0.4453125 |
74.0 | 3.37 | 7.125 | 0.3675 | 50.765625 | 0.1350563 | 2.6184375 |
63.5 | 2.76 | -3.375 | -0.2425 | 11.390625 | 0.0588063 | 0.8184375 |
62.0 | 2.62 | -4.875 | -0.3825 | 23.765625 | 0.1463062 | 1.8646875 |
70.5 | 3.49 | 3.625 | 0.4875 | 13.140625 | 0.2376563 | 1.7671875 |
71.0 | 3.05 | 4.125 | 0.0475 | 17.015625 | 0.0022562 | 0.1959375 |
66.0 | 3.12 | -0.875 | 0.1175 | 0.765625 | 0.0138063 | -0.1028125 |
Summations and calculating correlation
(dummy_mse1 <- dummy_mse %>%
  summarise("$\\sum X_i$" = sum(.[1]),
            "$\\sum Y_i$" = sum(.[2]),
            "$\\sum(X_i - \\bar{X})$" = sum(.[3]),
            "$\\sum(Y_i - \\bar{Y})$" = sum(.[4]),
            "$\\sum(X_i - \\bar{X})^2$" = sum(.[5]),
            "$\\sum(Y_i - \\bar{Y})^2$" = sum(.[6]),
            "$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$" = sum(.[7])) %>%
  mutate(Pearson_R = (.[7] / sqrt(.[5] * .[6]))) %>%  ## seventh column / sqrt(5th times 6th column)
  relocate(Pearson_R) %>%
  knitr::kable())
Pearson_R | \(\sum X_i\) | \(\sum Y_i\) | \(\sum(X_i - \bar{X})\) | \(\sum(Y_i - \bar{Y})\) | \(\sum(X_i - \bar{X})^2\) | \(\sum(Y_i - \bar{Y})^2\) | \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\) |
---|---|---|---|---|---|---|---|
0.7591266 | 535 | 24.02 | 0 | 0 | 205.375 | 0.67795 | 8.9575 |
- While it is important to know the mathematical theory behind the correlation coefficient, R has a built-in function for this.
- We use the cor(x, y) function:
cor(df$`Body weight`,df$`Plasma volume`)
[1] 0.7591266
use PlasmaB , clear
pwcorr Body_weight Plasma_volume
| Body_w~t Plasma~e
-------------+------------------
Body_weight | 1.0000
Plasma_vol~e | 0.7591 1.0000
Spearman’s Rho
Spearman’s \(\rho\) is Pearson’s \(r\) applied to the ranks of the sample values. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs, with mean ranks \(\bar{X} = \bar{Y} = (n+1)/2\). Spearman’s rho is
\[\hat{\rho} = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}}\] You will also encounter another formula given by \[\hat{\rho} = 1 -\frac{6\sum{D^2}}{n(n^2-1)}\] where \(D=rank(X)-rank(Y)\)
Spearman’s rho is a non-parametric test, so there is no associated confidence interval.
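In R, Spearman’s rho (and Kendall’s tau) can be obtained with the same cor() and cor.test() functions used for Pearson’s r by changing the method argument; a minimal sketch using the body weight and plasma volume data in df:

# Spearman's rho: Pearson's r applied to the ranks
cor(df$`Body weight`, df$`Plasma volume`, method = "spearman")

# The same value computed manually from the ranks
cor(rank(df$`Body weight`), rank(df$`Plasma volume`))

# Kendall's tau as the other rank-based alternative
cor.test(df$`Body weight`, df$`Plasma volume`, method = "kendall")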
Practical Example 1
In the breast cancer dataset, the variable radius_mean measures the average size of the tumour. We want to explore how radius_mean is related to other tumour characteristics such as texture_mean, perimeter_mean, smoothness_mean, and compactness_mean.
(data <- haven::read_dta("Breastcancer.dta")) |>
  head(6)
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concavepoints_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concavepoints_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concavepoints_worst | symmetry_worst | fractal_dimension_worst | radius_hat | resid | resid_label | tag | cooksd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 19.97863 | -1.988633 | -1.99 | 1 | 0.0046133 |
842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 13.25931 | 7.310688 | 7.31 | 1 | 0.0063065 |
84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 16.00364 | 3.686358 | 3.69 | 1 | 0.0027413 |
84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 20.19140 | -8.771399 | -8.77 | 0 | 0.0961080 |
84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 15.08842 | 5.201585 | 5.20 | 0 | 0.0033317 |
843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.1578 | 0.08089 | 0.2087 | 0.07613 | 0.3345 | 0.8902 | 2.217 | 27.19 | 0.007510 | 0.03345 | 0.03672 | 0.01137 | 0.02165 | 0.005082 | 15.47 | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 | 16.34474 | -3.894743 | -3.89 | 0 | 0.0037038 |
Radius mean vs smoothness mean
data |>
  ggplot(aes(x = radius_mean, y = smoothness_mean)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
cor(data$radius_mean,data$smoothness_mean)
[1] 0.1705812
use Breastcancer
twoway (scatter radius_mean smoothness_mean, mcolor(blue)) ///
       (lfit radius_mean smoothness_mean, lcolor(red))
graph export scatter1.png, replace
file scatter1.png saved as PNG format
use Breastcancer
pwcorr radius_mean smoothness_mean, sig
bootstrap r=r(rho), reps(1000): corr radius_mean smoothness_mean
| radius~n smooth~n
-------------+------------------
radius_mean | 1.0000
|
|
smoothness~n | 0.1706 1.0000
| 0.0000
|
(running correlate on estimation sample)
warning: correlate does not set e(sample), so no observations will be
excluded from the resampling because of missing values or other
reasons. To exclude observations, press Break, save the data, drop
any observations that are to be excluded, and rerun bootstrap.
Bootstrap replications (1,000): .......... done
Bootstrap results Number of obs = 569
Replications = 1,000
Command: correlate radius_mean smoothness_mean
r: r(rho)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
r | .1705812 .0412809 4.13 0.000 .0896721 .2514903
------------------------------------------------------------------------------
The results show a weak correlation between radius mean and smoothness mean (\(r = 0.1705812\)).
Radius mean vs Perimeter mean
data |>
  ggplot(aes(x = radius_mean, y = perimeter_mean)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
cor(data$radius_mean,data$perimeter_mean)
[1] 0.9978553
use Breastcancer
twoway (scatter radius_mean perimeter_mean, mcolor(blue)) ///
       (lfit radius_mean perimeter_mean, lcolor(red))
graph export scatter2.png, replace
The results show an almost perfect correlation between radius mean and perimeter mean (\(r = 0.9978553\)).
Why Linear Regression
Correlation and simple linear regression (LR) are used to investigate the relationship between two quantitative variables.
Correlation quantifies the strength and direction of a relationship between two variables, but it does not allow one to predict one variable from the other, while LR can be used to determine how the value of one variable varies in response to the values of the other variable.
Correlation does not designate a dependent or independent variable (a symmetric relationship) but is concerned only with assessing the overall co-movement of the two variables, while LR specifies one variable as the outcome and the other as the explanatory variable (an asymmetric relationship).
The mathematical equation of a straight line
Simple linear regression involves a numeric dependent (or outcome) variable \(Y\) and one independent (or explanatory) variable \(X\) that is either numeric or categorical.
The general equation of a straight line is: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \tag{1}\]
Preparing the data for these exercises
Dataset 1 : InClass
library(readxl)
BirthWeight2 <- haven::read_dta("abrevaya1.dta") |>
  mutate(birwtkg = birwt/1000)
Basic concepts
First, we plot the data in a scatter plot of the previous Plasma volume vs Body weight dataset:
ggplot(df, aes(`Body weight`, `Plasma volume`)) +
geom_point(size = 2, alpha = 0.4) +
#theme_prism(base_size = 14, base_line_size = 0.4, palette = "office") +
labs(x = "x = Body weight", y = "y = Plasma volume")
The objective is to find a mathematical approach that yields a line that, on average, goes through most of the points on the graph. This line is commonly known as a regression line or a line of best fit. We write the equation of the best-fitting line as:
\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]
where \(b_0\) and \(b_1\) are best choices or estimates of the parameters \(\beta_0\) and \(\beta_1\) in Equation 1, and \(\hat y\) is the value of Y that is expected by this model for a specified value of X.
We define the following three concepts:
Observed value \(y_i\): the observed value of the dependent variable Y for a given \(x_i\) value.
Expected (or fitted) value \(\widehat{y_i}\): the value of the dependent variable Y that is expected by the model for a given \(x_i\) value.
Residual \(e_i\): the deviation (error) between the observed value \(y_i\) and the expected value \(\hat y\) for a given \(x_i\) value (\(e_i = y_i - \hat y_i\)).
Let’s see, for example, the values for the observation with body weight 70 kg in our data set (Figure 3).
The observed value \(y\) = 2.86 litres for \(x\) = 70 kg.
The expected value \(\widehat{y}\) is the value 3.14 litres on the regression line for \(x\) = 70. This value can be computed from the fitted regression equation.
The residual is computed by subtracting the expected (fitted) value \(\widehat{y}\) from the observed value \(y\). In this case, it is \(y - \widehat{y} = 2.86 - 3.14 = -0.28\) litres.
Ordinary least squares (OLS) estimation of \(b_0\) and \(b_1\) in matrix form
Let’s describe for the \(i^{th}\) observation the underlying association between \(y_i\) and \(x_i\) by:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
- \(Y_i\): response (dependent) variable at i-th observation
- \(\beta_0,\beta_1\): regression parameters for intercept and slope.
- \(X_i\): known constant (independent or predictor variable) for i-th observation
- \(\epsilon_i\): random error term
\[ E(\epsilon_i) = 0, \qquad var(\epsilon_i) = \sigma^2, \qquad cov(\epsilon_i,\epsilon_j) = 0 \text{ for all } i \neq j \tag{2}\]
where \(e_i\) is the residual, the deviation between the observed value \(y_i\) and the fitted value \(\hat{y}_i\) of the linear model (\(e_i = y_i - \hat{y}_i\)).

We make the following assumptions about the error terms:

- Linearity: the relationship between the independent variable(s) and the outcome is linear.
- Zero mean: the expected value of the error term is zero.
- Homoscedasticity: the error term has a constant variance.
- Independence of errors: there is no autocorrelation between the error terms.
- Normality: the error terms are normally distributed.
In a matrix notation, given \(n\) observations of the explanatory variable x and the outcome variable y, the Equation 2 becomes:
\[\begin{bmatrix} y_1\\ y_2 \\ y_3 \\ ... \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ 1 & x_3\\ ... \\ 1 & x_n \end{bmatrix} \bullet \begin{bmatrix} b_0\\ b_1\end{bmatrix} + \begin{bmatrix} e_1\\ e_2 \\ e_3 \\ ... \\ e_n \end{bmatrix} \tag{3}\]
Then the outcome y can be compactly modeled by:
\[Y = X\hat{B} + e \tag{4}\]
where the matrix X is called the design matrix (the first column contains 1s, and the second column contains the actual observations of x).
The coefficients \(b_0\) and \(b_1\) are selected by minimizing the sum of squares of the residuals \(e\). It can be proven that the least squares estimates are obtained by the following matrix formula:
\[\hat{B} = (X^TX)^{-1}X^TY = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} \tag{5}\]
That is, we’ve got one matrix equation which gives us both coefficient estimates.
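As a quick check, Equation 5 can be evaluated directly in R; a minimal sketch, assuming the plasma volume data are stored in df with the column names used above:

# Design matrix: a column of 1s and the observed body weights
X <- cbind(1, df$`Body weight`)
Y <- df$`Plasma volume`

# Least squares estimates: (X'X)^{-1} X'Y
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
B_hat  # first element is b0 (intercept), second is b1 (slope)

The two elements of B_hat should agree with the intercept and slope returned by lm() below.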
model_lm <- lm(`Plasma volume` ~ `Body weight`, data = df)
model_lm
Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)
Coefficients:
(Intercept) `Body weight`
0.08572 0.04362
The regression equation for our example becomes the one shown in the plot below.
The intercept \(b_0 = 0.0857\) is the mean Plasma volume when weight equals 0. Visually, it’s the point where the line crosses the y-axis when x equals 0 (Figure 4).
# Create a linear regression line with annotations
ggplot(df, aes(x = `Body weight`, y = `Plasma volume`)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "blue", fullrange = TRUE) +
xlim(0, 100) +
ylim(0, 4) +
geom_hline(yintercept = 0, color = "black", linetype = "solid") +
geom_vline(xintercept = 0, color = "black", linetype = "solid") +
annotate("text", x = 67.5, y = 3.05, label = bquote(paste (widehat(PlasmaVolume), "= 0.0857 + 0.0436*weight")), color = "purple", angle = 32, size = 5) +
#theme_prism(base_size = 14, base_line_size = 0.4, palette = "office") +
labs(x = "x = Body weight", y = "y = Plasma volume")
Of greater importance is the slope of the fitted line, \(b_1 = 0.0436\), as this describes the association between the Plasma volume and weight variables.
The positive slope indicates a positive association between plasma volume and weight.
For every 1 kg increase in weight, there is on average an associated increase of 0.0436 litres in plasma volume.
The Standard error (SE) of the coefficients
The standard error is a way to measure the “uncertainty” in the estimate of the regression coefficients and it can be calculated by:
\[ \mathrm{SE}_{b_j} = \sqrt{\mathrm{Var}(b_j)} \tag{6}\]
where the \(\mathrm{Var}(b_j)\) represents the variance of the coefficients of the model.
Test statistic and confidence intervals
The hypothesis testing for the coefficients of the regression model are:
\[ \begin{aligned} H_0 &: \beta_j = 0\\ \text{vs } H_1&: \beta_j \neq 0. \end{aligned} \]
The null hypothesis, \(H_{0}\), states that the coefficients are equal to zero, and the alternative hypothesis, \(H_{1}\), states that the coefficients are not equal to zero.
The t-statistic for the coefficients are defined by the following equation:
\[ \ t_j = \frac{\ b_j}{\text{SE}_{b_j}} \tag{7}\]
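For the slope in our example, Equation 7 gives \(t_1 = b_1 / \text{SE}_{b_1} = 0.04362 / 0.01527 \approx 2.86\), which matches the t value reported in the output below.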
In R:
# Compute t for the coefficients b0 and b1
summary(model_lm)
Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.27880 -0.14178 -0.01928 0.13986 0.32939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08572 1.02400 0.084 0.9360
`Body weight` 0.04362 0.01527 2.857 0.0289 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2188 on 6 degrees of freedom
Multiple R-squared: 0.5763, Adjusted R-squared: 0.5057
F-statistic: 8.16 on 1 and 6 DF, p-value: 0.02893
We can use the p-value to guide our decision:
- If p − value < 0.05, reject the null hypothesis, \(H_{0}\).
- If p − value ≥ 0.05, do not reject the null hypothesis, \(H_{0}\).
In our example, \(p = 0.0289 < 0.05\), hence we reject \(H_{0}\).
The \(95\%\) CI of the coefficients for a significance level α, \(df=n-k\) degrees of freedom and for a two-tailed t-test is given by:
\[ 95\% \ \text{CI}_{b_j} = b_j \pm t^{*}_{df;a/2} \cdot \text{SE}_{b_j} \tag{8}\]
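These intervals can be obtained in R directly with confint(); a minimal sketch, which also reproduces the slope interval by hand using Equation 8 with \(df = n - k = 6\) in our example:

# 95% confidence intervals for both coefficients
confint(model_lm, level = 0.95)

# Manual computation for the slope: b1 +/- t* x SE(b1)
b1    <- coef(model_lm)[2]
se_b1 <- coef(summary(model_lm))[2, "Std. Error"]
b1 + c(-1, 1) * qt(0.975, df = 6) * se_b1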
Summary table
All the estimates and statistics of interest for the regression model are presented in the following summary table (Figure 5):
term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 0.09 | 1.02 | 0.08 | 0.936 | -2.42 | 2.59 |
Body weight | 0.04 | 0.01 | 2.86 | 0.029 | 0.01 | 0.08 |
Verifying Model Assumptions
In the breast cancer dataset, the variable radius mean represents the average size of the tumour. We aim to investigate the effect of compactness mean on radius mean using simple linear regression.
Specific assumptions have to be met for reliable hypothesis tests and confidence intervals in simple linear regression: linearity of the data, independence of the residuals, normality of the residuals, and equality of variance of the residuals.
We will describe some statistical tests and diagnostic plots in R for checking the assumptions underlying the linear regression model.
Different versions of these statistical tests may give slightly different results.
Linearity of the data
Linear association between the independent variable x and the dependent variable (outcome) y.
use Breastcancer

* Fit the linear regression
regress radius_mean compactness_mean

* Predict fitted values
predict yhat

* Plot the scatter with residuals and regression line
twoway (scatter radius_mean compactness_mean, mcolor(blue)) (line yhat compactness_mean, lcolor(red)) (rcap radius_mean yhat compactness_mean, lcolor(gs10)), ytitle("Radius Mean") xtitle("Compactness Mean")
graph export dia.png, replace
Source | SS df MS Number of obs = 569
-------------+---------------------------------- F(1, 567) = 195.26
Model | 1806.94648 1 1806.94648 Prob > F = 0.0000
Residual | 5247.0001 567 9.25396842 R-squared = 0.2562
-------------+---------------------------------- Adj R-squared = 0.2548
Total | 7053.94658 568 12.41892 Root MSE = 3.042
------------------------------------------------------------------------------
radius_mean | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
compactnes~n | 33.77222 2.416857 13.97 0.000 29.02514 38.51931
_cons | 10.60346 .2825897 37.52 0.000 10.04841 11.15852
------------------------------------------------------------------------------
(option xb assumed; fitted values)
data <- readr::read_csv("BreastCancer.csv")
model_lm2 <- lm(radius_mean ~ compactness_mean, data)
summary(model_lm2)
Call:
lm(formula = radius_mean ~ compactness_mean, data = data)
Residuals:
Min 1Q Median 3Q Max
-8.8971 -2.0131 -0.4438 1.5317 12.3867
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.6035 0.2826 37.52 <0.0000000000000002 ***
compactness_mean 33.7722 2.4169 13.97 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.042 on 567 degrees of freedom
Multiple R-squared: 0.2562, Adjusted R-squared: 0.2548
F-statistic: 195.3 on 1 and 567 DF, p-value: < 0.00000000000000022
diagnostic_plots <- plot(check_model(model_lm2, panel = FALSE))
diagnostic_plots[2]
$NCV
Residuals vs Fitted values plot is used to check the linear association assumption. We would expect to see a random scatter of points around the horizontal dashed line at zero, as demonstrated in our example. The green line is a scatter plot smoother, showing the average value of the residuals corresponding to each fitted value. A horizontal green line, without noticeable patterns or curvature, indicates a linear association.
Independence of the residuals
The residuals (or errors) resulting from the model should be independent of each other.
Statistical test: Durbin-Watson test
We can perform a Durbin-Watson test to check for autocorrelated residuals (a p-value < 0.05 indicates autocorrelated residuals).
set.seed(126)
check_autocorrelation(model_lm2, nsim = 1000)
OK: Residuals appear to be independent and not autocorrelated (p = 0.064).
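If the lmtest package is available (it is used again below for the Breusch-Pagan test), the classical Durbin-Watson test should give an equivalent check; a sketch:

# Durbin-Watson test for autocorrelation of the residuals
lmtest::dwtest(model_lm2)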
Normality of the residuals
The residuals should be normally distributed.
Statistical test: Shapiro test
check_normality(model_lm2)
Warning: Non-normality of residuals detected (p < .001).
The function performs a shapiro.test and checks the standardized residuals for normal distribution. According to the documentation, “this formal test almost always yields significant results for the distribution of residuals for large samples and visual inspection (e.g., Q-Q plots) are preferable.”
Plot: Normal Q-Q plot
diagnostic_plots[5]
$QQ
The normal Q-Q plot is used to examine whether the residuals are roughly normally distributed. Ideally, the standardized residual points should follow the green line. In our example, the Q-Q plot deviates greatly from normality.
Equality of variance of the residuals (Homoscedasticity)
The spread or dispersion of the residuals should remain roughly the same for all values of the X variable.
Statistical test: Breusch-Pagan test
The most common test for checking equality of variance of the residuals (homoscedasticity) is the Breusch-Pagan test (a p-value < 0.05 indicates the presence of heteroscedasticity).
The original version of Breusch-Pagan test:
lmtest::bptest(model_lm2, studentize = F)
Breusch-Pagan test
data: model_lm2
BP = 78.115, df = 1, p-value < 0.00000000000000022
However, the studentized BP test (which is the default) is more robust than the original one:
lmtest::bptest(model_lm2)
studentized Breusch-Pagan test
data: model_lm2
BP = 56.116, df = 1, p-value = 0.00000000000006833
Plot: Square root of standardized residuals vs Fitted values
diagnostic_plots[3]
$HOMOGENEITY
This plot checks the assumption of equal variance of the residuals (homoscedasticity). To verify the assumption, we plot the square root of standardized residuals against the fitted values from the regression. In this case, the residuals are re-scaled and all values are positive. A horizontal line with equally spread points would be the desired pattern. However, if the spread of residuals widens or narrows systematically as we move along the x-axis, it indicates a violation of homoscedasticity.
Influential observations in the data
The dataset should not include data points with exceptional influence on the model.
Plot: Standardized residuals vs Leverage
diagnostic_plots[4]
$OUTLIERS
This plot is used to detect influential observations: outlier points that might have a substantial impact on the regression model’s results when included in or excluded from the analysis. If any point in this plot falls outside the dashed green lines, it is considered an influential observation. In our example, all the data points are inside the dashed green lines, suggesting that none of the observations is influential on the fitted model.
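The underlying influence measures can also be inspected numerically with base R functions; a minimal sketch, where the cut-offs are common rules of thumb rather than fixed thresholds:

# Cook's distance and leverage (hat values) for each observation
cd  <- cooks.distance(model_lm2)
lev <- hatvalues(model_lm2)

# Observations exceeding the usual rule-of-thumb cut-offs
n <- nobs(model_lm2)
p <- length(coef(model_lm2))
which(cd > 4 / n)        # Cook's distance larger than 4/n
which(lev > 2 * p / n)   # leverage larger than 2p/n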
Overall Model fit
Two related measures are typically used to evaluate the linear regression model: the residual standard error (RSE) and the coefficient of determination \(R^2\).
Residual standard error (RSE)
The residual standard error is the square root of the residual variance \(\sigma^2_{e}\) that we have already calculated. It informs us about the average error of the regression model in terms of the units of the outcome variable. Lower values are better because it indicates that the observations are closer to the fitted line. In our example:
\[\ RSE = \sqrt{\frac{SS_{residuals}}{n-k}} \tag{9}\]
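A minimal sketch of computing the RSE by hand for the plasma volume model and comparing it with the value reported by summary(), with \(n = 8\) observations and \(k = 2\) estimated coefficients:

# Residual standard error from Equation 9
ss_res <- sum(residuals(model_lm)^2)
sqrt(ss_res / (8 - 2))

# The same value extracted directly from the fitted model
sigma(model_lm)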
Coefficient of determination \(R^2\)
The regression model allows us to divide the total variation in the outcome into two components: one explained by the model, and the other that remains unexplained by the model.
\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{model}} + \mathrm{SS}_{\mathrm{residual}} \]
\[ \begin{aligned} \sum_{i=1}^n (Y_i - \bar{Y})^2 &= \sum_{i=1}^n (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 \\ &= \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^n(Y_i - \hat{Y}_i)(\hat{Y}_i-\bar{Y}) \\ &= \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n(\hat{Y}_i -\bar{Y})^2 \\ SS_{total} &= SS_{residuals} + SS_{model} \\ \end{aligned} \]
where SSR (\(SS_{model}\)) is the regression sum of squares, which measures how the conditional mean varies about a central value: the variation explained by the model (what the model learns from the data).
The cross-product term in the decomposition is 0:
\[ \begin{aligned} \sum_{i=1}^n (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) &= \sum_{i=1}^{n}(Y_i - \bar{Y} -b_1 (X_i - \bar{X}))(\bar{Y} + b_1 (X_i - \bar{X})-\bar{Y}) \\ &= b_1 \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) - b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= b_1 \frac{\sum_{i=1}^{n}(Y_i -\bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \sum_{i=1}^{n}(X_i - \bar{X})^2 - b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 - b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= 0 \end{aligned} \]
\[ \begin{aligned} SSTO &= SSR + SSE \\ (n-1 d.f) &= (1 d.f.) + (n-2 d.f.) \end{aligned} \]
Source of Variation | Sum of Squares | df | Mean Square | F |
---|---|---|---|---|
Regression (model) | SSR | 1 | MSR = SSR/df | MSR/MSE |
Error | SSE | n-2 | MSE = SSE/df | |
Total (Corrected) | SSTO | n-1 |
\[ E(MSE) = \sigma^2 \\ E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
Coefficient of Determination (\(R^2\)) and Adjusted \(R^2\)
\(R^2\) is the square of the correlation coefficient r.
- It tells us the proportion of variability in the outcome variable that is explained by the model.
- In our regression, \(R^2 = 0.5763\): about \(58\%\) of the variability in plasma volume is explained by bodyweight.
- Adjusted \(R^2 = 0.5057\): tells us how well the model explains the outcome after accounting for the sample size and the number of predictors (see the formula below). It gives a more honest estimate of model performance, especially with small datasets. After adjusting for model complexity and sample size, about \(51\%\) of the variability in plasma volume is still explained by body weight.
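For reference, a standard way to compute the adjusted \(R^2\) from \(R^2\), the sample size \(n\), and the number of predictors \(p\) is:

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \]

In our example, \(1 - (1 - 0.5763)\times\frac{7}{6} \approx 0.506\), which matches the adjusted \(R^2\) reported by summary().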
\[ R^2 = \frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{residuals}}{SS_{total}} \]
where \(0 \le R^2 \le 1\)
Interpretation: The proportionate reduction of the total variation in Y after fitting a linear model in X.
It is not really correct to say that \(R^2\) is the “variation in Y explained by X”.
\(R^2\) is related to the correlation coefficient between Y and X:
\[ R^2 = (r)^2 \]
where \(r= corr(x,y)\) is an estimate of the Pearson correlation coefficient.
The coefficient of determination, \(R^2\), indicates the percentage of the total variation in the dependent variable Y that can be explained by the regression model (in this simple case the independent variable X). Hence, it is a measure of the ‘goodness of fit’ of the regression line to the data.
\[\ R^2 = \frac{\ explained \ \ variation}{total \ \ variation} = \frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{residual}}{SS_{total}} \tag{10}\]
The R-squared measure ranges between 0 and 1. An \(R^2\) approaching 1 indicates that a substantial proportion of the variation in the dependent variable Y has been explained by the regression model. Conversely, a value near 0 implies that the regression model has not explained most of the variation in the outcome.
In our example, \(R^2 = 0.58\) indicates that about 58% of the variation in Plasma Volume can be explained by the variation of the body weight.
(cor(df$`Body weight`, df$`Plasma volume`))^2
[1] 0.5762732
In matrix notation:
- the sum of squares total is defined as:
\[ \mathrm{SS}_{total} = (\mathbf{Y} - \bar{\mathbf{Y}})^2 = (\mathbf{Y} - \bar{\mathbf{Y}})^\intercal(\mathbf{Y} - \bar{\mathbf{Y}}) \]
- the residual sum of squares is:
\[ SS_{residuals} = \mathbf{e}^2 =\mathbf{e}^\intercal\mathbf{e} \]
Therefore,
\[ \mathrm{SS}_{\mathrm{model}} = \mathrm{SS}_{\mathrm{total}} - \mathrm{SS}_{\mathrm{residual}} \]
F-Test
The F-test for the model is a test of the null hypothesis that none of the independent variables linearly predict the dependent variable, that is, the model parameters are jointly zero:
The F-statistic tests whether the regression model explains a significant portion of the total variability in the outcome variable:
\[H_0 : \beta_1 = \ldots = \beta_k = 0\] The regression mean sum of squares \(MSR = \frac{(\hat{y} - \bar{y})'(\hat{y} - \bar{y})}{k-1}\) and the error mean sum of squares \(MSE = \frac{\hat{\epsilon}'\hat{\epsilon}}{n-k}\) are each (scaled) chi-square variables. Their ratio has an F distribution with \(k - 1\) numerator degrees of freedom and \(n - k\) denominator degrees of freedom. The F statistic can also be expressed in terms of the coefficient of determination \(R^2 = \frac{SS_{model}}{SS_{total}}\).
\[F(k - 1, n - k) = \frac{MS_{model}}{MS_{residuals}} = \frac{R^2}{1 - R^2} \frac{n-k}{k-1}\]
The expected value of MSE is \(\sigma^2\). If \(H_0\) is true, that is, there is no relationship between the predictors and the response, then the expected value of \(MSR\) is also \(\sigma^2\), so \(F\) is expected to be close to 1. As \(R^2 \rightarrow 1\), \(F \rightarrow \infty\), and as \(R^2 \rightarrow 0\), \(F \rightarrow 0\). For a given \(R^2\), F increases with \(n\) and decreases with \(k\).
A large F-value indicates that the model explains significantly more variability than the mean alone. In our example, \(F = 8.16\) with a p-value of 0.029.
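A minimal sketch verifying this relationship for the simple regression above, where \(k = 2\) parameters are estimated from \(n = 8\) observations:

# ANOVA decomposition and F statistic
anova(model_lm)

# The same F recovered from R-squared: F = R^2/(1 - R^2) * (n - k)/(k - 1)
r2 <- summary(model_lm)$r.squared
(r2 / (1 - r2)) * (8 - 2) / (2 - 1)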
Present the results
We can use the above information to write up a final report:
Multiple Linear Regression and General Linear Model
We would like to fit a model that includes all potentially important variables simultaneously. This would help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables. This is the strategy used in multiple regression. While we remain cautious about making any causal interpretations using multiple regression, such models are a common first step in providing evidence of a causal connection.
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p+\epsilon_i \tag{11}\]
Just as with the single predictor case, a multiple regression model may be missing important components or it might not precisely represent the relationship between the outcome and the available explanatory variables. While no model is perfect, we wish to explore the possibility that this model may fit the data reasonably well.
We estimate the parameters \(\beta_0\), \(\beta_1\), …, \(\beta_p\) in the same way as we did in the case of a single predictor. We select \(b_0\), \(b_1\), …, \(b_p\) that minimize the sum of the squared residuals:
\[ \text{SSE} = e_1^2 + e_2^2 + \dots + e_{n}^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \]
heart <- readr::read_csv("heart.csv")

modelA <- lm(chol ~ age + sex + cp + trestbps + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data = heart)

MASS::stepAIC(modelA, direction = "both")
Start: AIC=7969.2
chol ~ age + sex + cp + trestbps + fbs + restecg + thalach +
exang + oldpeak + slope + ca + thal
Df Sum of Sq RSS AIC
- fbs 1 27 2378351 7967.2
- ca 1 658 2378982 7967.5
- oldpeak 1 2021 2380345 7968.1
- slope 1 3210 2381534 7968.6
- trestbps 1 3350 2381674 7968.6
<none> 2378324 7969.2
- cp 1 6847 2385171 7970.1
- exang 1 7582 2385906 7970.5
- thalach 1 16619 2394943 7974.3
- thal 1 27512 2405836 7979.0
- restecg 1 41775 2420099 7985.0
- age 1 69200 2447524 7996.6
- sex 1 112947 2491271 8014.8
Step: AIC=7967.21
chol ~ age + sex + cp + trestbps + restecg + thalach + exang +
oldpeak + slope + ca + thal
Df Sum of Sq RSS AIC
- ca 1 634 2378985 7965.5
- oldpeak 1 2069 2380420 7966.1
- slope 1 3270 2381621 7966.6
- trestbps 1 3332 2381683 7966.6
<none> 2378351 7967.2
- cp 1 7012 2385363 7968.2
- exang 1 7559 2385910 7968.5
+ fbs 1 27 2378324 7969.2
- thalach 1 16592 2394943 7972.3
- thal 1 27763 2406114 7977.1
- restecg 1 41812 2420163 7983.1
- age 1 69299 2447650 7994.6
- sex 1 113275 2491626 8012.9
Step: AIC=7965.48
chol ~ age + sex + cp + trestbps + restecg + thalach + exang +
oldpeak + slope + thal
Df Sum of Sq RSS AIC
- oldpeak 1 2485 2381470 7964.6
- trestbps 1 3430 2382415 7965.0
- slope 1 3590 2382576 7965.0
<none> 2378985 7965.5
- exang 1 7390 2386375 7966.7
- cp 1 7641 2386627 7966.8
+ ca 1 634 2378351 7967.2
+ fbs 1 3 2378982 7967.5
- thalach 1 16261 2395246 7970.5
- thal 1 28457 2407443 7975.7
- restecg 1 42232 2421217 7981.5
- age 1 74852 2453837 7995.2
- sex 1 112786 2491771 8011.0
Step: AIC=7964.55
chol ~ age + sex + cp + trestbps + restecg + thalach + exang +
slope + thal
Df Sum of Sq RSS AIC
- slope 1 1659 2383129 7963.3
- trestbps 1 4286 2385756 7964.4
<none> 2381470 7964.6
+ oldpeak 1 2485 2378985 7965.5
- cp 1 7940 2389410 7966.0
+ ca 1 1050 2380420 7966.1
- exang 1 8523 2389993 7966.2
+ fbs 1 22 2381449 7966.5
- thalach 1 15436 2396906 7969.2
- thal 1 31170 2412640 7975.9
- restecg 1 41542 2423012 7980.3
- age 1 77027 2458497 7995.2
- sex 1 111210 2492680 8009.3
Step: AIC=7963.26
chol ~ age + sex + cp + trestbps + restecg + thalach + exang +
thal
Df Sum of Sq RSS AIC
- trestbps 1 3828 2386956 7962.9
<none> 2383129 7963.3
+ slope 1 1659 2381470 7964.6
- exang 1 7722 2390851 7964.6
- cp 1 8187 2391315 7964.8
+ ca 1 1120 2382009 7964.8
+ oldpeak 1 553 2382576 7965.0
+ fbs 1 37 2383092 7965.2
- thalach 1 20679 2403807 7970.1
- thal 1 30676 2413804 7974.4
- restecg 1 40738 2423866 7978.6
- age 1 77309 2460438 7994.0
- sex 1 110883 2494012 8007.9
Step: AIC=7962.91
chol ~ age + sex + cp + restecg + thalach + exang + thal
Df Sum of Sq RSS AIC
<none> 2386956 7962.9
+ trestbps 1 3828 2383129 7963.3
- cp 1 7412 2394368 7964.1
+ ca 1 1330 2385627 7964.3
+ slope 1 1200 2385756 7964.4
+ oldpeak 1 1087 2385869 7964.4
- exang 1 8750 2395706 7964.7
+ fbs 1 8 2386948 7964.9
- thalach 1 22208 2409164 7970.4
- thal 1 31968 2418924 7974.5
- restecg 1 43516 2430472 7979.4
- age 1 92435 2479392 7999.9
- sex 1 114492 2501448 8008.9
Call:
lm(formula = chol ~ age + sex + cp + restecg + thalach + exang +
thal, data = heart)
Coefficients:
(Intercept) age sex cp restecg thalach
148.2538 1.1645 -23.8543 -2.9184 -12.5114 0.2418
exang thal
7.1878 9.4035
modelB <- lm(formula = chol ~ age + cp + sex + restecg + thalach + exang + thal, data = heart)
summary(modelB)
Call:
lm(formula = chol ~ age + cp + sex + restecg + thalach + exang +
thal, data = heart)
Residuals:
Min 1Q Median 3Q Max
-119.725 -30.585 -3.199 29.412 276.662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 148.25377 19.52092 7.595 0.0000000000000698 ***
age 1.16445 0.18555 6.276 0.0000000005145924 ***
cp -2.91837 1.64225 -1.777 0.075857 .
sex -23.85426 3.41539 -6.984 0.0000000000051637 ***
restecg -12.51140 2.90566 -4.306 0.0000182425378184 ***
thalach 0.24183 0.07862 3.076 0.002154 **
exang 7.18776 3.72265 1.931 0.053783 .
thal 9.40350 2.54797 3.691 0.000236 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 48.45 on 1017 degrees of freedom
Multiple R-squared: 0.1243, Adjusted R-squared: 0.1182
F-statistic: 20.62 on 7 and 1017 DF, p-value: < 0.00000000000000022
summary(modelA)
Call:
lm(formula = chol ~ age + sex + cp + trestbps + fbs + restecg +
thalach + exang + oldpeak + slope + ca + thal, data = heart)
Residuals:
Min 1Q Median 3Q Max
-126.835 -30.997 -0.782 28.271 281.288
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.8311 21.4148 6.343 0.00000000033947 ***
age 1.0677 0.1968 5.426 0.00000007198361 ***
sex -23.9646 3.4568 -6.933 0.00000000000735 ***
cp -2.8520 1.6708 -1.707 0.088138 .
trestbps 0.1115 0.0934 1.194 0.232786
fbs -0.4760 4.4364 -0.107 0.914577
restecg -12.3752 2.9352 -4.216 0.00002707898712 ***
thalach 0.2218 0.0834 2.659 0.007956 **
exang 6.8454 3.8111 1.796 0.072765 .
oldpeak 1.5620 1.6845 0.927 0.354012
slope 3.6714 3.1414 1.169 0.242784
ca 0.8508 1.6078 0.529 0.596776
thal 8.8567 2.5885 3.422 0.000648 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 48.48 on 1012 degrees of freedom
Multiple R-squared: 0.1274, Adjusted R-squared: 0.1171
F-statistic: 12.32 on 12 and 1012 DF, p-value: < 0.00000000000000022
anova(modelA,modelB)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
1012 | 2378324 | NA | NA | NA | NA |
1017 | 2386956 | -5 | -8632.434 | 0.734637 | 0.5975415 |
AIC(modelA,modelB)
df | AIC | |
---|---|---|
modelA | 14 | 10880.02 |
modelB | 9 | 10873.73 |
BIC(modelA,modelB)
df | BIC | |
---|---|---|
modelA | 14 | 10949.07 |
modelB | 9 | 10918.12 |
// multiple linear regression
import delimited "heart.csv"
// drop outcome
drop target
//run multiple regression - full model
regress chol age sex cp trestbps fbs restecg thalach exang oldpeak slope ca thal
est store modelA // Save the estimates
// Run stepwise regression with 5% level
stepwise , pe(0.05): regress chol age sex cp trestbps fbs restecg thalach exang oldpeak slope ca thal
//Run with selected variables
regress chol age sex restecg thalach exang thal
est store modelB // Save the estimates
// compare models
ssc install estout, replace
estout modelA modelB, stats(r2 r2_a aic bic)
//Information criteria
estat ic
// check multicollinearity- using variance inflation factor
vif
(encoding automatically selected: ISO-8859-1)
(14 vars, 1,025 obs)
Source | SS df MS Number of obs = 1,025
-------------+---------------------------------- F(12, 1012) = 12.32
Model | 347346.089 12 28945.5074 Prob > F = 0.0000
Residual | 2378323.91 1,012 2350.12244 R-squared = 0.1274
-------------+---------------------------------- Adj R-squared = 0.1171
Total | 2725670 1,024 2661.78711 Root MSE = 48.478
------------------------------------------------------------------------------
chol | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 1.067711 .1967646 5.43 0.000 .6815978 1.453825
sex | -23.96464 3.45684 -6.93 0.000 -30.74803 -17.18124
cp | -2.852014 1.670825 -1.71 0.088 -6.130692 .4266636
trestbps | .1115105 .0933981 1.19 0.233 -.0717655 .2947865
fbs | -.475993 4.436357 -0.11 0.915 -9.181505 8.229519
restecg | -12.37525 2.935229 -4.22 0.000 -18.13508 -6.615418
thalach | .2217753 .0833985 2.66 0.008 .0581216 .385429
exang | 6.845388 3.8111 1.80 0.073 -.6331754 14.32395
oldpeak | 1.561961 1.684487 0.93 0.354 -1.743527 4.867449
slope | 3.671422 3.141354 1.17 0.243 -2.492892 9.835735
ca | .8508421 1.607758 0.53 0.597 -2.304078 4.005762
thal | 8.85672 2.588534 3.42 0.001 3.777211 13.93623
_cons | 135.8311 21.41484 6.34 0.000 93.80857 177.8537
------------------------------------------------------------------------------
Wald test, begin with empty model:
p = 0.0000 < 0.0500, adding age
p = 0.0000 < 0.0500, adding sex
p = 0.0000 < 0.0500, adding restecg
p = 0.0000 < 0.0500, adding thal
p = 0.0480 < 0.0500, adding thalach
p = 0.0097 < 0.0500, adding exang
Source | SS df MS Number of obs = 1,025
-------------+---------------------------------- F(6, 1018) = 23.48
Model | 331301.787 6 55216.9646 Prob > F = 0.0000
Residual | 2394368.21 1,018 2352.03164 R-squared = 0.1215
-------------+---------------------------------- Adj R-squared = 0.1164
Total | 2725670 1,024 2661.78711 Root MSE = 48.498
------------------------------------------------------------------------------
chol | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 1.149399 .1855542 6.19 0.000 .7852869 1.513512
sex | -24.09322 3.416361 -7.05 0.000 -30.79713 -17.3893
restecg | -12.62229 2.90807 -4.34 0.000 -18.32879 -6.915791
thal | 9.848034 2.538348 3.88 0.000 4.867041 14.82903
thalach | .2162272 .0773674 2.79 0.005 .0644095 .3680449
exang | 9.200627 3.549896 2.59 0.010 2.234677 16.16658
_cons | 148.6544 19.5403 7.61 0.000 110.3106 186.9983
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 1,025
-------------+---------------------------------- F(6, 1018) = 23.48
Model | 331301.787 6 55216.9646 Prob > F = 0.0000
Residual | 2394368.21 1,018 2352.03164 R-squared = 0.1215
-------------+---------------------------------- Adj R-squared = 0.1164
Total | 2725670 1,024 2661.78711 Root MSE = 48.498
------------------------------------------------------------------------------
chol | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 1.149399 .1855542 6.19 0.000 .7852869 1.513512
sex | -24.09322 3.416361 -7.05 0.000 -30.79713 -17.3893
restecg | -12.62229 2.90807 -4.34 0.000 -18.32879 -6.915791
thalach | .2162272 .0773674 2.79 0.005 .0644095 .3680449
exang | 9.200627 3.549896 2.59 0.010 2.234677 16.16658
thal | 9.848034 2.538348 3.88 0.000 4.867041 14.82903
_cons | 148.6544 19.5403 7.61 0.000 110.3106 186.9983
------------------------------------------------------------------------------
checking estout consistency and verifying not already installed...
all files already exist and are up to date.
--------------------------------------
modelA modelB
b b
--------------------------------------
age 1.067711 1.149399
sex -23.96464 -24.09322
cp -2.852014
trestbps .1115105
fbs -.475993
restecg -12.37525 -12.62229
thalach .2217753 .2162272
exang 6.845388 9.200627
oldpeak 1.561961
slope 3.671422
ca .8508421
thal 8.85672 9.848034
_cons 135.8311 148.6544
--------------------------------------
r2 .1274351 .1215488
r2_a .1170885 .1163712
aic 10878.02 10872.91
bic 10942.14 10907.44
--------------------------------------
Akaike's information criterion and Bayesian information criterion
-----------------------------------------------------------------------------
Model | N ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
modelB | 1,025 -5495.873 -5429.455 7 10872.91 10907.44
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.
Variable | VIF 1/VIF
-------------+----------------------
thalach | 1.38 0.725030
age | 1.23 0.810527
exang | 1.23 0.815470
thal | 1.08 0.925408
sex | 1.08 0.928529
restecg | 1.03 0.974691
-------------+----------------------
Mean VIF | 1.17
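The multicollinearity check done with vif in Stata has a close analogue in R; a sketch, assuming the car package is installed:

# Variance inflation factors for the reduced model
car::vif(modelB)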
General Linear model
Use the one-way ANOVA test to compare the mean response of a continuous dependent variable among the levels of a factor variable (if you only have two levels, use the independent-samples t-test). The observations must be independent, meaning the data-generating units cannot influence each other (e.g., the same participant appearing in different groups, or participants interacting with each other to produce the observed outcome).
The ANOVA method decomposes the deviation of observation \(Y_{ij}\) around the overall mean \(\bar{Y}_{..}\) into two parts: the deviation of the observations around their treatment means, \(SSE\), and the deviation of the treatment means around the overall mean, \(SSR\). The ratio of the corresponding mean squares, \(F = \frac{SSR/(k-1)}{SSE/(N-k)}\), follows an F-distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom. The more of the observation variance that is captured by the treatments, the larger \(F\) is, and the less likely it is that the null hypothesis, \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\), is true.
Compare the F-statistic to the F-distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom
The F test does not indicate which populations cause the rejection of \(H_0\). For this, use one of the post-hoc tests: Tukey, Fisher’s Least Significant Difference (LSD), Bonferroni, Scheffe, or Dunnett.
ANOVA returns reliable results if the following conditions hold:
- there should be no significant outliers within the factor levels;
- the response variable should be approximately normally distributed within each factor level; and
- the response variable variances within the factor levels should be equal.
Data set abrevaya1.dta contains a sample of data from the first birth of mothers in the study, with data from 3,978 mothers who have complete data for the variables of interest and who had singleton births. Birth outcomes of interest are birthweight and whether the baby was of low birthweight (birthweight < 2500 grams).
ANC
Note: “kess” is an example of a factor – a categorical variable that divides the data into three groups according to the quality of antenatal care (adequate, intermediate or inadequate)
Question of interest is: “What can be said about the variation in birthweight between the different levels of antenatal care?”
df <- haven::read_dta("abrevaya1.dta") |>
  mutate(birwtkg = birwt/1000,
         kess_cat = as.factor(kess))
use abrevaya1
gen birwtkg = birwt/1000
lab def kessl2 1 "adequate" 2 "intermediate" 3 "inadequate"
lab val kess kessl2
lab var kess "Kessner ANC score"

oneway birwtkg kess, tabulate
Kessner ANC | Summary of birwtkg
score | Mean Std. dev. Freq.
------------+------------------------------------
adequate | 3.4660382 .49382928 3,086
intermedi | 3.3503617 .53106249 705
inadequat | 3.2852353 .55723067 187
------------+------------------------------------
Total | 3.4370382 .50664017 3,978
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 12.2011191 2 6.10055954 24.04 0.0000
Within groups 1008.6322 3975 .25374395
------------------------------------------------------------------------
Total 1020.83332 3977 .256684265
Bartlett's equal-variances test: chi2(2) = 10.5193 Prob>chi2 = 0.005
Bonferroni in Stata
use abrevaya1
gen birwtkg = birwt/1000
lab def kessl2 1 "adequate" 2 "intermediate" 3 "inadequate"
lab val kess kessl2
lab var kess "Kessner ANC score"

oneway birwtkg kess, tabulate bon
Kessner ANC | Summary of birwtkg
score | Mean Std. dev. Freq.
------------+------------------------------------
adequate | 3.4660382 .49382928 3,086
intermedi | 3.3503617 .53106249 705
inadequat | 3.2852353 .55723067 187
------------+------------------------------------
Total | 3.4370382 .50664017 3,978
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 12.2011191 2 6.10055954 24.04 0.0000
Within groups 1008.6322 3975 .25374395
------------------------------------------------------------------------
Total 1020.83332 3977 .256684265
Bartlett's equal-variances test: chi2(2) = 10.5193 Prob>chi2 = 0.005
Comparison of birwtkg by Kessner ANC score
(Bonferroni)
Row Mean-|
Col Mean | adequate intermed
---------+----------------------
intermed | -.115677
| 0.000
|
inadequa | -.180803 -.065126
| 0.000 0.348
(aov_model <- aov(birwtkg ~ kess_cat, data = df))
Call:
aov(formula = birwtkg ~ kess_cat, data = df)
Terms:
kess_cat Residuals
Sum of Squares 12.2011 1008.6322
Deg. of Freedom 2 3975
Residual standard error: 0.50373
Estimated effects may be unbalanced
summary(aov_model)
Df Sum Sq Mean Sq F value Pr(>F)
kess_cat 2 12.2 6.101 24.04 0.0000000000418 ***
Residuals 3975 1008.6 0.254
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
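The equal-variance condition reported by Stata (Bartlett’s test) can also be checked in R; a minimal sketch using the same df and kess_cat:

# Bartlett's test of homogeneity of variances across the Kessner groups
bartlett.test(birwtkg ~ kess_cat, data = df)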
Bonferroni in R
pairwise.t.test(df$birwtkg,df$kess_cat,p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: df$birwtkg and df$kess_cat
1 2
2 0.00000012 -
3 0.00000584 0.35
P value adjustment method: bonferroni
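Tukey’s HSD, mentioned earlier as another post-hoc option, can be applied to the fitted aov object as an alternative to the Bonferroni adjustment; a sketch:

# Tukey honest significant differences between the Kessner ANC groups
TukeyHSD(aov_model, "kess_cat")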
The overall analysis of variance is significant \((F = 24.04;\ P < 0.0001)\), so we can reject \(H_0\) and conclude that the mean birthweight is not the same for all three groups of mothers defined by the level of antenatal care.

The Bonferroni-adjusted comparison of means shows that the mean birthweight is significantly lower for mothers who had either intermediate or inadequate antenatal care compared with mothers who had adequate antenatal care, while the difference in mean birthweight between mothers with intermediate antenatal care and mothers with inadequate antenatal care is not statistically significant.
There is a close connection between multiple regression models and analysis of variance models. We measure the effect of intermediate antenatal care and inadequate antenatal care relative to adequate antenatal care (the first level and also the most common in our study), i.e. we set Kessner level 1 (adequate antenatal care) to be the reference or baseline level. We define indicator variables to measure the effect of Kess level 2 and Kess level 3 relative to the baseline as follows:
use abrevaya1
gen birwtkg = birwt/1000
gen kess2 = 0
replace kess2=1 if kess==2
gen kess3 = 0
replace kess3 =1 if kess==3
reg birwtkg kess2 kess3
(705 real changes made)
(187 real changes made)
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(2, 3975) = 24.04
Model | 12.2011191 2 6.10055954 Prob > F = 0.0000
Residual | 1008.6322 3,975 .25374395 R-squared = 0.0120
-------------+---------------------------------- Adj R-squared = 0.0115
Total | 1020.83332 3,977 .256684265 Root MSE = .50373
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
kess2 | -.1156765 .0210272 -5.50 0.000 -.1569017 -.0744514
kess3 | -.180803 .037936 -4.77 0.000 -.2551789 -.106427
_cons | 3.466038 .0090678 382.24 0.000 3.44826 3.483816
------------------------------------------------------------------------------
OR
use abrevaya1
gen birwtkg = birwt/1000
reg birwtkg i.kess

* Change baseline *
reg birwtkg ib3.kess
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(2, 3975) = 24.04
Model | 12.2011191 2 6.10055954 Prob > F = 0.0000
Residual | 1008.6322 3,975 .25374395 R-squared = 0.0120
-------------+---------------------------------- Adj R-squared = 0.0115
Total | 1020.83332 3,977 .256684265 Root MSE = .50373
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
kess |
Intermedi.. | -.1156765 .0210272 -5.50 0.000 -.1569017 -.0744514
Inadequat.. | -.180803 .037936 -4.77 0.000 -.2551789 -.106427
|
_cons | 3.466038 .0090678 382.24 0.000 3.44826 3.483816
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(2, 3975) = 24.04
Model | 12.2011191 2 6.10055954 Prob > F = 0.0000
Residual | 1008.6322 3,975 .25374395 R-squared = 0.0120
-------------+---------------------------------- Adj R-squared = 0.0115
Total | 1020.83332 3,977 .256684265 Root MSE = .50373
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
kess |
Adequate .. | .180803 .037936 4.77 0.000 .106427 .2551789
Intermedi.. | .0651264 .0414348 1.57 0.116 -.016109 .1463618
|
_cons | 3.285235 .0368364 89.18 0.000 3.213015 3.357455
------------------------------------------------------------------------------
(lm_model <- lm(birwtkg ~ kess_cat, data = df))
Call:
lm(formula = birwtkg ~ kess_cat, data = df)
Coefficients:
(Intercept) kess_cat2 kess_cat3
3.4660 -0.1157 -0.1808
summary(lm_model)
Call:
lm(formula = birwtkg ~ kess_cat, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.53004 -0.31904 -0.00604 0.30664 1.77896
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.466038 0.009068 382.238 < 0.0000000000000002 ***
kess_cat2 -0.115677 0.021027 -5.501 0.0000000401 ***
kess_cat3 -0.180803 0.037936 -4.766 0.0000019464 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5037 on 3975 degrees of freedom
Multiple R-squared: 0.01195, Adjusted R-squared: 0.01145
F-statistic: 24.04 on 2 and 3975 DF, p-value: 0.00000000004181
If we compare the analysis-of-variance table for the regression model to that of the one-way ANOVA, we see that they are in fact identical. The model sum of squares and degrees of freedom equal the "between groups" sum of squares and degrees of freedom in the ANOVA; similarly, the residual sum of squares and degrees of freedom equal the "within groups" sum of squares and degrees of freedom. This results in identical F-ratios (\(24.04\)) and identical P-values (\(<0.0001\)).
General linear model: allows us to include both variates (quantitative variables) and factors (categorical variables) as explanatory variables, and thus allows us to analyze a wide range of data from research studies, provided it is reasonable to assume that the quantitative outcome or response variable has an underlying normal distribution.
Example 2
- Does birthweight increase with increasing maternal age?
- Is there a difference in mean birthweight between babies whose mothers smoked and babies whose mothers did not smoke, adjusting for maternal age?
- Does birthweight increase with increasing maternal age at the same rate for babies whose mothers smoked and babies whose mothers did not smoke during pregnancy?
df |>
  ggplot(aes(x = mage, y = birwtkg, color = smoke)) +
  geom_point() +
  geom_smooth(method = "lm")
df |>
  ggplot(aes(x = mage, y = birwtkg, color = smoke)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~smoke)
(lm_model <- lm(birwtkg ~ mage, data = df))
Call:
lm(formula = birwtkg ~ mage, data = df)
Coefficients:
(Intercept) mage
3.11228 0.01187
summary(lm_model)
Call:
lm(formula = birwtkg ~ mage, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.49672 -0.31169 0.00489 0.31881 1.66987
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.112278 0.041854 74.360 < 0.0000000000000002 ***
mage 0.011868 0.001502 7.904 0.00000000000000347 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5028 on 3976 degrees of freedom
Multiple R-squared: 0.01547, Adjusted R-squared: 0.01522
F-statistic: 62.47 on 1 and 3976 DF, p-value: 0.00000000000000347
use abrevaya1
gen birwtkg = birwt/1000
reg birwtkg mage
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(1, 3976) = 62.47
Model | 15.791756 1 15.791756 Prob > F = 0.0000
Residual | 1005.04157 3,976 .252777054 R-squared = 0.0155
-------------+---------------------------------- Adj R-squared = 0.0152
Total | 1020.83332 3,977 .256684265 Root MSE = .50277
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
mage | .011868 .0015015 7.90 0.000 .0089242 .0148119
_cons | 3.112278 .0418542 74.36 0.000 3.03022 3.194336
------------------------------------------------------------------------------
There is overwhelming evidence that birthweight is associated with maternal age: for each additional year of maternal age, mean birthweight increases by 0.012 kg (95% CI 0.0089 to 0.015; P<0.0001).
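The same estimate and interval can be recovered from the R fit; a minimal sketch, assuming the lm_model object (birwtkg ~ mage) fitted above:
# 95% confidence interval for the maternal-age slope
confint(lm_model, "mage", level = 0.95)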
df <- df |>
  mutate(mage_cent = mage - mean(mage),
         smoke = as.factor(smoke))
lm_model_smoke <- lm(birwtkg ~ mage_cent + smoke, data = df)
summary(lm_model_smoke)
Call:
lm(formula = birwtkg ~ mage_cent + smoke, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.50063 -0.30397 0.00637 0.31290 1.80470
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.474831 0.008471 410.188 < 0.0000000000000002 ***
mage_cent 0.009116 0.001495 6.099 0.00000000117 ***
smoke1 -0.267983 0.022800 -11.754 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4943 on 3975 degrees of freedom
Multiple R-squared: 0.04854, Adjusted R-squared: 0.04806
F-statistic: 101.4 on 2 and 3975 DF, p-value: < 0.00000000000000022
augment(lm_model_smoke) %>%
gf_point(birwtkg~mage_cent,color=~smoke) %>%
gf_line(.fitted~mage_cent,data=subset(augment(lm_model_smoke), smoke == 1),color=~smoke)%>%
gf_line(.fitted~mage_cent,data=subset(augment(lm_model_smoke), smoke == 0),color=~smoke) %>%
gf_theme(theme_bw())
anova(lm_model,lm_model_smoke)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
3976 | 1005.0416 | NA | NA | NA | NA |
3975 | 971.2843 | 1 | 33.75728 | 138.1523 | 0 |
use abrevaya1
egen mage_bar = mean(mage)
gen mage_cent = mage -mage_bar
gen birwtkg = birwt/1000
reg birwtkg mage_cent i.smoke
testparm i.smoke
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(2, 3975) = 101.39
Model | 49.5490345 2 24.7745173 Prob > F = 0.0000
Residual | 971.284287 3,975 .244348248 R-squared = 0.0485
-------------+---------------------------------- Adj R-squared = 0.0481
Total | 1020.83332 3,977 .256684265 Root MSE = .49432
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
mage_cent | .0091161 .0014947 6.10 0.000 .0061856 .0120466
1.smoke | -.2679825 .0227996 -11.75 0.000 -.3126825 -.2232825
_cons | 3.474831 .0084713 410.19 0.000 3.458222 3.491439
------------------------------------------------------------------------------
( 1) 1.smoke = 0
F( 1, 3975) = 138.15
Prob > F = 0.0000
Model equation
\[ \mbox{E}(Birwtkg)=\beta_0 + \beta_1*\text{mage_cent}+ \beta_2*\text{(smoke=Yes)} \]
The post-model test (`testparm i.smoke`) shows that adding i.smoke significantly improves the fit of the model \((F=138.15; P<0.0001)\). The estimated effect of smoking is -0.268 (se 0.023) after adjusting for the effect of maternal age. Compare this with the result of the t-test, which said that if the mother smoked, birthweight was lower by 0.290 (se 0.023); so the effect of smoking is slightly reduced by adjusting for mother's age. In this case the standard error is unchanged (to 3 decimal places), although adjusting often leads to a smaller standard error.
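The adjusted smoking effect and its standard error can also be read straight from the R fit; a minimal sketch, assuming lm_model_smoke from the earlier chunk (the coefficient is labelled smoke1 in that output):
# Estimate, SE, t and p for the adjusted smoking effect, plus its 95% CI
coef(summary(lm_model_smoke))["smoke1", ]
confint(lm_model_smoke)["smoke1", ]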
use abrevaya1
egen mage_bar = mean(mage)
gen mage_cent = mage -mage_bar
gen birwtkg = birwt/1000
reg birwtkg mage_cent i.smoke smoke#c.mage_cent
testparm smoke#c.mage_cent
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(3, 3974) = 67.78
Model | 49.6892689 3 16.5630896 Prob > F = 0.0000
Residual | 971.144053 3,974 .244374447 R-squared = 0.0487
-------------+---------------------------------- Adj R-squared = 0.0480
Total | 1020.83332 3,977 .256684265 Root MSE = .49434
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
mage_cent | .0096089 .0016302 5.89 0.000 .0064127 .012805
1.smoke | -.2731566 .0238019 -11.48 0.000 -.3198217 -.2264916
|
smoke#|
c.mage_cent |
1 | -.0030948 .0040854 -0.76 0.449 -.0111045 .0049149
|
_cons | 3.474665 .0084746 410.01 0.000 3.45805 3.49128
------------------------------------------------------------------------------
( 1) 1.smoke#c.mage_cent = 0
F( 1, 3974) = 0.57
Prob > F = 0.4488
lm_model_smoke_INT <- lm(birwtkg ~ mage_cent + smoke + smoke*mage_cent, data = df)
summary(lm_model_smoke_INT)
Call:
lm(formula = birwtkg ~ mage_cent + smoke + smoke * mage_cent,
data = df)
Residuals:
Min 1Q Median 3Q Max
-2.50077 -0.30495 0.00597 0.31122 1.80389
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.474665 0.008475 410.009 < 0.0000000000000002 ***
mage_cent 0.009609 0.001630 5.894 0.00000000408 ***
smoke1 -0.273157 0.023802 -11.476 < 0.0000000000000002 ***
mage_cent:smoke1 -0.003095 0.004085 -0.758 0.449
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4943 on 3974 degrees of freedom
Multiple R-squared: 0.04868, Adjusted R-squared: 0.04796
F-statistic: 67.78 on 3 and 3974 DF, p-value: < 0.00000000000000022
augment(lm_model_smoke_INT) %>%
gf_point(birwtkg~mage_cent,color=~smoke) %>%
gf_line(.fitted~mage_cent,data=subset(augment(lm_model_smoke_INT), smoke == 1),color=~smoke)%>%
gf_line(.fitted~mage_cent,data=subset(augment(lm_model_smoke_INT), smoke == 0),color=~smoke) %>%
gf_theme(theme_bw())
anova(lm_model_smoke_INT, lm_model_smoke)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
3974 | 971.1441 | NA | NA | NA | NA |
3975 | 971.2843 | -1 | -0.1402345 | 0.5738508 | 0.4487777 |
The post-estimation test after fitting the interaction term shows no evidence of an improvement to the model \((F=0.57; P=0.449)\). We would therefore not choose the model with an interaction (the separate-lines model); we would choose the parallel-lines model. This is an example of the principle of parsimony: we only choose a more complicated model over a simpler model if it fits significantly better.
Also Note
There is no different slope for each level of the smoke variable, so there is no synergy or interaction between these variables: smoking status does not change the effect of maternal age on birthweight.
use abrevaya1
egen mage_bar = mean(mage)
gen mage_cent = mage -mage_bar
gen birwtkg = birwt/1000
sw , pr(0.15) lockterm1 : reg birwtkg (i.smoke) mage meduc (i.male) (i.married) (i.black) (i.kess)
margins smoke
margins kess
note: 0b.male omitted because of estimability.
note: 0b.married omitted because of estimability.
note: 0b.black omitted because of estimability.
note: 1b.kess omitted because of estimability.
Wald test, begin with full model:
p = 0.1640 >= 0.1500, removing meduc
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(7, 3970) = 47.96
Model | 79.5997715 7 11.3713959 Prob > F = 0.0000
Residual | 941.23355 3,970 .237086537 R-squared = 0.0780
-------------+---------------------------------- Adj R-squared = 0.0763
Total | 1020.83332 3,977 .256684265 Root MSE = .48692
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.smoke | -.240804 .023359 -10.31 0.000 -.2866008 -.1950072
mage | .0059965 .0015393 3.90 0.000 .0029786 .0090145
|
kess |
Intermedi.. | -.0499028 .0209532 -2.38 0.017 -.0909828 -.0088229
Inadequat.. | -.0487722 .0380074 -1.28 0.199 -.1232881 .0257436
|
male |
male | .1027106 .0154438 6.65 0.000 .0724321 .1329892
|
married |
Yes | .0695491 .0286438 2.43 0.015 .0133911 .1257071
|
black |
Yes | -.207705 .0324948 -6.39 0.000 -.271413 -.143997
_cons | 3.219545 .0477062 67.49 0.000 3.126014 3.313076
------------------------------------------------------------------------------
Predictive margins Number of obs = 3,978
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
smoke |
0 | 3.470998 .0083935 413.53 0.000 3.454542 3.487454
1 | 3.230194 .0214987 150.25 0.000 3.188044 3.272343
------------------------------------------------------------------------------
Predictive margins Number of obs = 3,978
Model VCE: OLS
Expression: Linear prediction, predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
kess |
Adequate .. | 3.448175 .0088707 388.72 0.000 3.430783 3.465566
Intermedi.. | 3.398272 .0187141 181.59 0.000 3.361582 3.434962
Inadequat.. | 3.399403 .0366415 92.77 0.000 3.327565 3.471241
------------------------------------------------------------------------------
sw : stepwise
pr(0.15) : probability (significance level) for removing a variable
df |>
  mutate_at(c("smoke", "male", "married", "black", "kess"), as.factor) -> df
lm_model_new <- lm(birwtkg ~ mage + meduc + smoke + male + married + black + kess, data = df)
step(lm_model_new, direction = "backward")
use abrevaya1
egen mage_bar = mean(mage)
gen mage_cent = mage -mage_bar
gen birwtkg = birwt/1000
reg birwtkg i.smoke mage i.male i.married i.black kess
cprplot mage , lowess
graph export cpr.png
Source | SS df MS Number of obs = 3,978
-------------+---------------------------------- F(6, 3971) = 55.80
Model | 79.3706313 6 13.2284385 Prob > F = 0.0000
Residual | 941.46269 3,971 .237084535 R-squared = 0.0778
-------------+---------------------------------- Adj R-squared = 0.0764
Total | 1020.83332 3,977 .256684265 Root MSE = .48691
------------------------------------------------------------------------------
birwtkg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.smoke | -.2404347 .0233559 -10.29 0.000 -.2862254 -.194644
mage | .006056 .0015381 3.94 0.000 .0030404 .0090716
|
male |
male | .1027676 .0154436 6.65 0.000 .0724894 .1330459
|
married |
Yes | .0695001 .0286437 2.43 0.015 .0133424 .1256577
|
black |
Yes | -.2073296 .0324924 -6.38 0.000 -.2710329 -.1436262
kess | -.0356693 .0151464 -2.35 0.019 -.0653647 -.0059738
_cons | 3.252061 .0533562 60.95 0.000 3.147453 3.356669
------------------------------------------------------------------------------
Collinearity
Multicollinearity refers to correlation among explanatory variables. Informal indicators of a multicollinearity problem include:
- large changes in the estimated regression coefficient when a predictor variable is added or deleted, or when an observation is altered or deleted.
- nonsignificant results in individual tests on regression coefficients for important predictor variables.
- estimated regression coefficients with an algebraic sign that is the opposite of that expected from theoretical consideration or prior experience.
- large coefficients of simple correlation between pairs of predictor variables in the correlation matrix.
- wide confidence intervals for the regression coefficients representing important predictor variables.
Severe multicollinearity arises when some of the X variables are so highly correlated that the inverse \((X'X)^{-1}\) does not exist or is computationally very unstable.
Correlated predictor variables: if some X variables are perfectly correlated, the system is underdetermined and there are an infinite number of models that fit the data. That is, if \(X'X\) is singular, then \((X'X)^{-1}\) doesn't exist. Then,
- parameters cannot be interpreted (\(\mathbf{b = (X'X)^{-1}X'y}\))
- sampling variability is infinite (\(\mathbf{s^2(b) = MSE (X'X)^{-1}}\))
VIFs
Let \(R^2_k\) be the coefficient of multiple determination when \(X_k\) is regressed on the p - 2 other X variables in the model. Then,
\[ VIF_k = \frac{1}{1-R^2_k} \]
- large values indicate that a near collinearity is causing the variance of \(b_k\) to be inflated, \(var(b_k) \propto \sigma^2 (VIF_k)\)
- Typically, the rule of thumb is that \(VIF_k > 4\) means you should investigate why that is the case, and \(VIF_k > 10\) indicates a serious collinearity problem that could result in poor parameter estimates.
- The mean of all VIFs provides an estimate of the ratio of the true multicollinearity to a model in which the X variables are uncorrelated.
- There is serious multicollinearity if \(\overline{VIF} \gg 1\).
The no-multicollinearity assumption is violated when two or more of the predictors in a regression model are correlated. Multicollinearity can occur for structural reasons, as when one variable is a transformation of another variable, or for data reasons, as occurs in observational studies. Multicollinearity is a problem because it inflates the variances of the estimated coefficients, resulting in wider confidence intervals.
When predictor variables are correlated, the precision of the estimated regression coefficients decreases with each added correlated predictor variable. The usual interpretation of a slope coefficient as the change in the mean response for each additional unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor necessarily changes the others.
A residuals vs fits plot \((\epsilon \sim \hat{Y})\) should have correlation \(\rho \sim 0\). A correlation matrix is helpful for picking out the correlation strengths. A good rule of thumb is correlation coefficients should be less than 0.80. However, this test may not work when a variable is correlated with a function of other variables. A model with multicollinearity may have a significant F-test with insignificant individual slope estimator t-tests. Another way to detect multicollinearity is by calculating variance inflation factors. The predictor variance \(Var(\hat{\beta_k})\) increases by a factor
\[VIF_k = \frac{1}{1 - R_k^2}\]
where \(R_k^2\) is the \(R^2\) of a regression of the \(k^{th}\) predictor on the remaining predictors. A \(VIF_k\) of \(1\) indicates no inflation (no correlation). A \(VIF_k \geq 4\) warrants investigation. A \(VIF_k \geq 10\) requires correction.
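As an informal first check against the 0.80 rule of thumb above, the pairwise correlations among the quantitative predictors can be inspected directly; a small sketch, assuming the numeric variables mage and meduc in df:
# Correlation matrix of quantitative predictors; values above about 0.80 deserve attention
cor(df[, c("mage", "meduc")], use = "complete.obs")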
If the multicollinearity occurs because you are using a polynomial regression model, center the predictor variables (subtract their means).
- May be significant change in regression coefficients if we add or delete an explanatory variable.
- Estimated standard errors of fitted coefficients are inflated.
- Estimated coefficients may not be statistically significant, even though statistical relationship exists between outcome and explanatory variables.
You can check for multicollinearity after fitting a regression model by calculating the variance inflation factor (VIF), which gives, for each coefficient, an estimate of the factor by which its variance has been increased due to multicollinearity. Most analysts apply informal rules of thumb to the VIF; there is evidence of multicollinearity if (a short R sketch follows the list):
- Largest VIF is greater than 10
- Mean of all VIF’s is considerably larger than 1
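In R, VIFs can be computed from the fitted model object; a minimal sketch, assuming the car package is available and lm_model_new from the model-building step above:
library(car)        # provides vif()
vif(lm_model_new)   # with factor terms this reports generalized VIFs (GVIF)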
APPENDIX
\(Y_i\) is random since \(\epsilon_i\) is:
\[ \begin{aligned} E(Y_i) &= E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= E(\beta_0) + E(\beta_1 X_i) + E(\epsilon) \\ &= \beta_0 + \beta_1 X_i \end{aligned} \]
\[ \begin{aligned} var(Y_i) &= var(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= var(\epsilon_i) \\ &= \sigma^2 \end{aligned} \]
\[ \beta_1 = \frac{Cov(Y_i, X_i)}{Var(X_i)} \] \[b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}=\frac{\sigma_{xy}}{\sigma_{xx}}\] \[b_0 = \frac{1}{n}(\sum_{i=1}^{n}Y_i - b_1\sum_{i=1}^{n}X_i) = \bar{Y} - b_1 \bar{X}\]
Properties of Least Squares Estimators
\[E(b_1) = \beta_1\] \[E(b_0) = E(\bar{Y}) - \bar{X}\beta_1\] \[E(\bar{Y}) = \beta_0 + \beta_1 \bar{X}\] \[E(b_0) = \beta_0\] \[var(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\] \[var(b_0) = \sigma^2 (\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}) \]
Mean Square Error
\[ MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}e_i^2}{n-2} = \frac{\sum(Y_i - \hat{Y_i})^2}{n-2} \]
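In R the MSE can be recovered directly from a fitted simple regression; a small sketch, assuming lm_model (birwtkg ~ mage) from earlier:
# MSE = SSE / (n - 2); sigma() returns its square root (the residual standard error)
deviance(lm_model) / df.residual(lm_model)
sigma(lm_model)^2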
Interpretations
Model | Form | Interpretation of \(\beta\) | In words |
---|---|---|---|
Level-Level | \(y =\beta_0+\beta_1x+\epsilon\) | \(\Delta y = \beta_1 \Delta x\) | A unit change in x will result in \(\beta_1\) unit change in y |
Log-Level | \(ln(y) = \beta_0 + \beta_1x + \epsilon\) | \(\% \Delta y=100 \beta_1 \Delta x\) | A unit change in x results in a \(100\beta_1\) percent change in y |
Level-Log | \(y = \beta _0 + \beta_1 ln (x) + \epsilon\) | \(\Delta y = (\beta_1/ 100)\%\Delta x\) | A one percent change in x results in a \(\beta_1/100\) unit change in y |
Log-Log | \(ln(y) = \beta_0 + \beta_1 ln(x) +\epsilon\) | \(\% \Delta y= \beta _1 \% \Delta x\) | A one percent change in x results in a \(\beta_1\) percent change in y |
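As an illustration of the log-level row, the birthweight model could be refitted on the log scale; a sketch only, with log_mod a hypothetical name and df the data used earlier:
# Log-level model: 100 * slope is roughly the % change in birthweight per extra year of maternal age
log_mod <- lm(log(birwtkg) ~ mage, data = df)
100 * coef(log_mod)["mage"]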
Model Diagnostics
Akaike Information Criterion (AIC)
\[ AIC_p = n \ln\left(\frac{SSE_p}{n}\right) + 2p \]
- As p (the number of parameters) increases, the first term decreases while the second term increases.
- We want models with small values of AIC. If the AIC increases when a parameter is added to the model, that parameter is not needed.
- AIC represents a tradeoff between precision of fit against the number of parameters used.
Bayes (or Schwarz) Information Criterion
\[ BIC = n \ln(\frac{SSE_p}{n})+ (\ln n)p \]
The coefficient in front of p tends to penalize more heavily models with a larger number of parameters (as compared to AIC).
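Both criteria are easy to compare across models already fitted; a minimal sketch, assuming lm_model_smoke and lm_model_smoke_INT from above (R computes AIC/BIC from the log-likelihood, which differs from the SSE-based formula only by a constant for a given data set):
# Smaller is better; BIC penalizes the extra interaction parameter more heavily
AIC(lm_model_smoke, lm_model_smoke_INT)
BIC(lm_model_smoke, lm_model_smoke_INT)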
Stepwise Selection Procedures
The forward stepwise procedure:
- finds a plausible subset sequentially.
- at each step, a variable is added or deleted.
- criterion for adding or deleting is based on SSE, \(R^2\), T, or F-statistic.
Note:
- Instead of using exact F-values, computer packages usually specify the equivalent “significance” level. For example, SLE is the “significance” level to enter, and SLS is the “significance” level to stay. The SLE and SLS are guides rather than true tests of significance.
- The choice of SLE and SLS represents a balancing of opposing tendencies. Use of large SLE values tends to result in too many predictor variables; models with small SLE tend to be under-specified resulting in \(\sigma^2\) being badly overestimated.
- As for the choice of SLE, values between 0.05 and 0.5 are typically used.
- If SLE > SLS then a cycling pattern may occur, although most computer packages can detect this and stop when it happens. A quick fix: SLS = SLE/2.
- If SLE < SLS then the procedure is conservative and may lead to variables with a low contribution being retained.
- Order of variable entry does not matter.
Automated Selection Procedures:
- Forward selection: same idea as forward stepwise, except it does not test whether variables should be dropped once they have entered (not as good as forward stepwise).
- Backward elimination: begins with all variables and, at each step, identifies the one with the smallest F-value to be dropped.
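A sketch of forward selection with step(), by analogy with the backward run shown earlier; it assumes df and the same candidate variables, and null_mod is a hypothetical name:
# Start from the intercept-only model and add terms while AIC improves
null_mod <- lm(birwtkg ~ 1, data = df)
step(null_mod, scope = ~ mage + meduc + smoke + male + married + black + kess, direction = "forward")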
Diagnostics
Influential observations/outliers
Hat matrix
\[ \mathbf{H = X(X'X)^{-1}X'} \]
where \(\mathbf{\hat{Y}= HY}\), \(\mathbf{e = (I-H)Y}\), and \(var(\mathbf{e}) = \sigma^2 (\mathbf{I-H})\)
- \(\sigma^2(e_i) = \sigma^2 (1-h_{ii})\), where \(h_{ii}\) is the i-th element of the main diagonal of \(\mathbf{H}\) (must be between 0 and 1).
- \(\sum_{i=1}^{n} h_{ii} = p\)
- \(cov(e_i,e_j) = -h_{ij}\sigma^2\) where \(i \neq j\)
- Estimate: \(s^2(e_i) = MSE (1-h_{ii})\)
- Estimate: \(\hat{cov}(e_i,e_j) = -h_{ij}(MSE)\); if the model assumptions are correct, this covariance is very small for large data sets.
- If \(\mathbf{x}_i = [1 X_{i,1} ... X_{i,p-1}]'\) (the vector of X-values for a given response), then \(h_{ii} = \mathbf{x_i'(X'X)^{-1}x_i}\) (depends on relative positions of the design points \(X_{i,1},...,X_{i,p-1}\))
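In R, the diagonal elements \(h_{ii}\) are returned by hatvalues(); a small sketch, assuming lm_model_new from the model-building step:
# Leverage values; flag cases above the large-sample cutoff 2p/n
h <- hatvalues(lm_model_new)
p <- length(coef(lm_model_new))
which(h > 2 * p / length(h))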
Studentized Residuals
\[ r_i = \frac{e_i}{s(e_i)} \\ r_i \sim N(0,1) \]
where \(s(e_i) = \sqrt{MSE(1-h_{ii})}\). \(r_i\) is called the studentized residual or standardized residual.
- You can also use the semi-studentized residual introduced earlier, \(e_i^* = e_i/\sqrt{MSE}\); this does not take into account the different variances for each \(e_i\).
We would want to see the model without a particular value. You delete the i-th case, fit the regression to the remaining n-1 cases, get estimated responses for the i-th case, \(\hat{Y}_{i(i)}\), and find the difference, called the deleted residual:
\[ d_i = Y_i - \hat{Y}_{i(i)} \\ = \frac{e_i}{1-h_{ii}} \]
we don’t need to recompute the regression model for each case
As \(h_{ii}\) increases, \(d_i\) increases.
\[ s^2(d_i)= \frac{MSE_{(i)}}{1-h_{ii}} \]
where \(MSE_{(i)}\) is the mean square error when the i-th case is omitted.
Let
\[ t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}(1-h_{ii})}} \]
be the studentized deleted residual, which follows a t-distribution with n-p-1 df.
\[ (n-p)MSE = (n-p-1)MSE_{(i)}+ \frac{e^2_{i}}{1-h_{ii}} \]
hence, we do not need to fit regressions for each case and
\[ t_i = e_i (\frac{n-p-1}{SSE(1-h_{ii})-e^2_i})^{1/2} \]
The outlying Y-observations are those cases whose studentized deleted residuals are large in absolute value. If there are many residuals to consider, a Bonferroni critical value can be used (\(t_{1-\alpha/2n;\, n-p-1}\)).
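A minimal sketch of this outlier check in R, assuming lm_model_new from above:
# Studentized deleted residuals compared with a Bonferroni critical value
t_i  <- rstudent(lm_model_new)
n    <- length(t_i)
p    <- length(coef(lm_model_new))
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
which(abs(t_i) > crit)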
Outlying X Observations
Recall, \(0 \le h_{ii} \le 1\) and \(\sum_{i=1}^{n}h_{ii}=p\) (the total number of parameters)
A large \(h_{ii}\) indicates that the i-th case is distant from the center of all X observations (the leverage of the i-th case). That is, a large value suggests that the observation exercises substantial leverage in determining the fitted value \(\hat{Y}_i\)
We have \(\mathbf{\hat{Y}=HY}\), a linear combination of Y-values; \(h_{ii}\) is the weight of the observation \(Y_i\); so \(h_{ii}\) measures the role of the X values in determining how important \(Y_i\) is in affecting the \(\hat{Y}_i\).
Large \(h_{ii}\) implies \(var(e_i)\) is small, so larger \(h_{ii}\) implies that \(\hat{Y}_i\) is close to \(Y_i\)
- small data sets: \(h_{ii} > .5\) suggests “large”.
- large data sets: \(h_{ii} > \frac{2p}{n}\) is "large".
Using the hat matrix to identify extrapolation:
- Let \(\mathbf{x_{new}}\) be a vector containing the X values for which an inference about a mean response or a new observation is to be made.
- Let \(\mathbf{X}\) be the data design matrix used to fit the data. Then, if \(h_{new,new} = \mathbf{x_{new}'(X'X)^{-1}x_{new}}\) is within the range of leverage values (\(h_{ii}\)) for cases in the data set, no extrapolation is involved; otherwise, extrapolation is indicated.
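A sketch of this extrapolation check for the single-predictor model lm_model (birwtkg ~ mage), using a hypothetical new maternal age of 55:
# Leverage of a new X point compared with the observed leverages
X     <- model.matrix(lm_model)
x_new <- c(1, 55)                 # intercept plus the new mage value (hypothetical)
h_new <- drop(t(x_new) %*% solve(crossprod(X)) %*% x_new)
h_new
range(hatvalues(lm_model))        # extrapolation is indicated if h_new falls outside this range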
Identifying Influential Cases:
By influential we mean that exclusion of an observation causes major changes in the fitted regression (not all outliers are influential).
DFFITS
Influence on Single Fitted Values: DFFITS
\[
(DFFITS)_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}h_{ii}}} = t_i \left(\frac{h_{ii}}{1-h_{ii}}\right)^{1/2}
\]
- the standardized difference between the i-th fitted value computed with all observations and with the i-th case removed.
- equivalently, the studentized deleted residual multiplied by a factor that is a function of the i-th leverage value.
There is evidence of influence if:
- small to medium data sets: \(|DFFITS| > 1\)
- large data sets: \(|DFFITS| > 2\sqrt{p/n}\)
Cook’s D
Influence on All Fitted Values: Cook’s D
\[ D_i = \frac{\sum_{j=1}^{n}(\hat{Y}_j - \hat{Y}_{j(i)})^2}{p(MSE)} \\ = \frac{e^2_i}{p(MSE)}(\frac{h_{ii}}{(1-h_{ii})^2}) \]
gives the influence of i-th case on all fitted values.
If \(e_i\) increases or \(h_{ii}\) increases, then \(D_i\) increases.
\(D_i\) can be referred to an \(F_{(p,\,n-p)}\) distribution; if it corresponds to a percentile greater than \(0.5\) (50%), the i-th case has major influence. In practice, if \(D_i > 4/n\), the i-th case is flagged as having major influence.
DFBETAS
Influence on the Regression Coefficients: DFBETAS
\[ (DFBETAS)_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}c_{kk}}} \]
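All three influence measures are available in base R; a minimal sketch, assuming lm_model_new from the model-building step:
# DFFITS, Cook's distance, and DFBETAS for every observation
dff <- dffits(lm_model_new)
cd  <- cooks.distance(lm_model_new)
dfb <- dfbetas(lm_model_new)
n <- nobs(lm_model_new); p <- length(coef(lm_model_new))
which(abs(dff) > 2 * sqrt(p / n))   # large-data DFFITS cutoff
which(cd > 4 / n)                   # common Cook's D cutoff
head(dfb)                           # one row per observation, one column per coefficient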