Introduction to Regression Modelling Strategies for Public Health Researchers

Author
Affiliation

Bongani Ncube

University Of the Witwatersrand (School of Public Health)

Published

April 21, 2025

Keywords

Regression, Public Health, Analysis Of Variance, Predictions, Statistical Analysis

Correlation

Correlation is a statistical measure that describes the strength and direction of a relationship between two quantitative variables. There are three common correlation tests: Pearson's r, Spearman's rho, and Kendall's tau.

Use Pearson's r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.

Spearman's rho and Kendall's tau are rank-based measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson's r does not apply.

Pearson’s r

Pearson’s \(r\)

\[r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}} = \frac{cov(X,Y)}{s_X s_Y}\]

estimates the population correlation \(\rho\). Pearson's \(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship), and \(r = 0\) when there is no linear relationship. A correlation in the range:

  • \((.1, .3)\) is considered small,
  • \((.3, .5)\) medium,
  • and \((.5, 1.0)\) large.

Pearson’s \(r\) only applies if the variables are interval or ratio, normally distributed, linearly related, there are minimal outliers, and the residuals are homoscedastic.

Data Used for Illustration

Body weight (kg)   Plasma volume (litres)
58.0               2.75
70.0               2.86
74.0               3.37
63.5               2.76
62.0               2.62
70.5               3.49
71.0               3.05
66.0               3.12
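A minimal sketch (assuming the tibble package from the tidyverse is loaded) that recreates this data as the object `df` used in the code chunks below:

library(tibble)

# Illustration data: body weight (kg) and plasma volume (litres)
df <- tibble(
  `Body weight`   = c(58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0),
  `Plasma volume` = c(2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12)
)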

Types of correlations

Almost perfect correlations

p1 <- ggplot(data = df, aes(x = `Body weight`, y = `Plasma volume`)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "positive and strong correlation")

p2 <- ggplot(data = df, aes(x = `Body weight`, y = 1/`Plasma volume`)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "Negative yet strong correlation")

gridExtra::grid.arrange(p1, p2, nrow = 1)

Some weak correlations

  • here is a fake dataset producing weak correlations

df2 <- tibble(
  Height = c(115, 101, 99, 107, 118, 127, 120, 129),
  Weight = c(56, 50, 67, 64, 55, 70, 61, 59)
)

Visualizing weak correlations

p1 <- ggplot(data = df2, aes(x = Height, y = Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "positive and weak correlation")

p2 <- ggplot(data = df2, aes(x = Height, y = 1/Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "Negative yet weak correlation")

gridExtra::grid.arrange(p1, p2, nrow = 1)

Manually computing Correlations

We are interested in finding \(\sum{(X_i - \bar{X})(Y_i - \bar{Y})}\), \(\sum{(X_i - \bar{X})^2}\) and \(\sum{(Y_i - \bar{Y})^2}\).

dummy_mse <- df %>% 
  mutate("$(X_i - \\bar{X})$" = (`Body weight`- mean(`Body weight`))) %>% 
  mutate("$(Y_i - \\bar{Y})$" = (`Plasma volume`- mean(`Plasma volume`)))%>% 
  mutate("$(X_i - \\bar{X})^2$" = (`Body weight`- mean(`Body weight`))^2) %>% 
  mutate("$(Y_i - \\bar{Y})^2$" = (`Plasma volume`- mean(`Plasma volume`))^2) %>%  
  mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (`Plasma volume`- mean(`Plasma volume`))*(`Body weight`- mean(`Body weight`)))

dummy_mse %>% 
  knitr::kable()
Body weight Plasma volume \((X_i - \bar{X})\) \((Y_i - \bar{Y})\) \((X_i - \bar{X})^2\) \((Y_i - \bar{Y})^2\) \((Y_i - \bar{Y})(X_i - \bar{X})\)
58.0 2.75 -8.875 -0.2525 78.765625 0.0637562 2.2409375
70.0 2.86 3.125 -0.1425 9.765625 0.0203063 -0.4453125
74.0 3.37 7.125 0.3675 50.765625 0.1350563 2.6184375
63.5 2.76 -3.375 -0.2425 11.390625 0.0588063 0.8184375
62.0 2.62 -4.875 -0.3825 23.765625 0.1463062 1.8646875
70.5 3.49 3.625 0.4875 13.140625 0.2376563 1.7671875
71.0 3.05 4.125 0.0475 17.015625 0.0022562 0.1959375
66.0 3.12 -0.875 0.1175 0.765625 0.0138063 -0.1028125

Summations and calculating correlation

Here \(X\) denotes Body weight and \(Y\) denotes Plasma volume.

(dummy_mse1<-dummy_mse %>% 
  summarise("$\\sum X_i$"=sum(.[1]),
            "$\\sum Y_i$"=sum(.[2]),
            "$\\sum(X_i - \\bar{X})$"=sum(.[3]),
            "$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
            "$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
            "$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
            "$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%  
   mutate(Pearson_R = (.[7]/sqrt(.[5]*.[6]))) %>%  ## seventh column / sqrt(5th times 6th column)
   relocate(Pearson_R)%>% 
   knitr::kable())
Pearson_R \(\sum X_i\) \(\sum Y_i\) \(\sum(X_i - \bar{X})\) \(\sum(Y_i - \bar{Y})\) \(\sum(X_i - \bar{X})^2\) \(\sum(Y_i - \bar{Y})^2\) \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\)
0.7591266 535 24.02 0 0 205.375 0.67795 8.9575
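As a check on the manual calculation, the built-in cor() and cor.test() functions give the same value:

# Pearson's r computed directly
cor(df$`Body weight`, df$`Plasma volume`)          # 0.7591266

# cor.test() adds a t-test and a 95% confidence interval for the correlation
cor.test(df$`Body weight`, df$`Plasma volume`, method = "pearson")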

Spearman’s Rho

Spearman’s \(\rho\) is the Pearson’s r applied to the sample variable ranks. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs with mean ranks \(\bar{X} = \bar{Y} = (n+1)/2\). Spearman’s rho is

\[\hat{\rho} = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}}\] You will also encounter another formula given by \[\hat{\rho} = 1 -\frac{6\sum{D^2}}{n(n^2-1)}\] where \(D=rank(X)-rank(Y)\)

Spearman’s rho is a non-parametric test, so there is no associated confidence interval.
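A short sketch (using the same `df` data) showing that Spearman's rho is just Pearson's r applied to the ranks, and how to obtain it directly:

# Pearson's r on the ranks equals Spearman's rho
cor(rank(df$`Body weight`), rank(df$`Plasma volume`))

# Built-in equivalents
cor(df$`Body weight`, df$`Plasma volume`, method = "spearman")
cor.test(df$`Body weight`, df$`Plasma volume`, method = "spearman")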

Practical Example 1

In the breast cancer dataset, the variable radius mean measures the average size of the tumour. We want to explore how radius mean is related to other tumour characteristics such as texture mean, perimeter mean, smoothness mean, and compactness mean.

(data<-haven::read_dta("Breastcancer.dta")) |> 
  head(6)
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concavepoints_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concavepoints_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concavepoints_worst symmetry_worst fractal_dimension_worst radius_hat resid resid_label tag cooksd
842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 19.97863 -1.988633 -1.99 1 0.0046133
842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 13.25931 7.310688 7.31 1 0.0063065
84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 16.00364 3.686358 3.69 1 0.0027413
84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 20.19140 -8.771399 -8.77 0 0.0961080
84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 15.08842 5.201585 5.20 0 0.0033317
843786 M 12.45 15.70 82.57 477.1 0.12780 0.17000 0.1578 0.08089 0.2087 0.07613 0.3345 0.8902 2.217 27.19 0.007510 0.03345 0.03672 0.01137 0.02165 0.005082 15.47 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440 16.34474 -3.894743 -3.89 0 0.0037038

Radius mean vs smoothness mean

Note

The results show a weak correlation between radius mean and smoothness mean (\(r=0.1705812\)).

Radius mean vs Perimeter mean

Note

The results show a near-perfect correlation between radius mean and perimeter mean (\(r=0.9978553\)).
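The correlations quoted in the notes above can be reproduced with cor() (a sketch, assuming the breast cancer data read in earlier as `data`):

cor(data$radius_mean, data$smoothness_mean)   # weak, about 0.17
cor(data$radius_mean, data$perimeter_mean)    # near-perfect, about 0.998

# cor.test() adds a p-value and confidence interval, e.g.
cor.test(data$radius_mean, data$smoothness_mean)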

Why Linear Regression

Note
  • Correlation and simple linear regression (LR) are used to investigate the relationship between two quantitative variables.

  • Correlation quantifies the strength and direction of a relationship between two variables, but it does not allow one to predict one variable from the other, while LR can be used to determine how the value of one variable varies in response to the values of the other variable.

  • Correlation does not specify a dependent or independent variable (symmetric relationship) but is just concerned with assessing the global movement shared between two variables, while LR specifies one variable as the outcome and the other as the explanatory variable (asymmetric relationship).

The mathematical equation of a straight line

Simple linear regression involves a numeric dependent (or outcome) variable \(Y\) and one independent (or explanatory) variable \(X\) that is either numeric or categorical.

The general equation of a straight line is: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \tag{1}\]

Figure 1: Parameters of a straight line.

Preparing the data for these exercises

Dataset 1 : InClass

library(readxl)
BirthWeight2 <- haven::read_dta("abrevaya1.dta") |> 
  mutate(birwtkg =birwt/1000)

Basic concepts

First, we plot the data in a scatter plot of the previous Plasma volume vs Body weight dataset:

ggplot(df, aes(`Body weight`, `Plasma volume`)) +
  geom_point(size = 2, alpha = 0.4) +
  #theme_prism(base_size = 14, base_line_size = 0.4, palette = "office") +
  labs(x = "x = Body weight", y = "y = Plasma volume")
Figure 2: Scatter plot of weight vs Plasma volume.

The objective is to find a mathematical approach that yields a line that, on average, passes as close as possible to the points on the graph. This line is commonly known as a regression line or a line of best fit. We write the equation of the best-fitting line as:

\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]

where \(b_0\) and \(b_1\) are best choices or estimates of the parameters \(\beta_0\) and \(\beta_1\) in Equation 1, and \(\hat y\) is the value of Y that is expected by this model for a specified value of X.

 

We define the following three concepts:

  1. Observed value \(y_i\): the observed value of the dependent variable Y for a given \(x_i\) value.

  2. Expected (or fitted) value \(\widehat{y_i}\): the value of the dependent variable Y that is expected by the model for a given \(x_i\) value.

  3. Residual \(e_i\): the deviation (error) between the observed value \(y_i\) and the expected value \(\hat y\) for a given \(x_i\) value (\(e_i = y_i - \hat y_i\)).

Let's see, for example, the values for the observation with a body weight of 70 kg (the second row of our data set) (Figure 3).

Figure 3: Explanation of the basic concepts of the linear model.
  • The observed value \(y\) = 2.86 litres for \(x\) = 70 kg.

  • The expected value \(\widehat{y}\) is the value 3.14 litres on the regression line for \(x\) = 70. This value is computed from the fitted regression equation \(\widehat{y_i} = b_0 + b_1 \cdot x_i\).

  • The residual is computed by subtracting the expected (fitted) value \(\widehat{y}\) from the observed value \(y\). In this case, it is \(y - \widehat{y} = 2.86 - 3.14 = -0.28\) litres.

Comment

The residuals are exactly the vertical distance between the observed data point and the associated point on the regression line (expected value) (Figure 3). Positive residuals are associated with y values above the fitted line, while negative residuals correspond to values below the line.

Residuals can be thought of as “noise” - they are fluctuations caused by various factors, preventing the observed values from forming a perfectly straight line on the scatter plot. Residuals play a crucial role in linear regression as they provide valuable insights into the assessment of a linear regression model.

Ordinary least squares (OLS) estimation of \(b_0\) and \(b_1\) in matrix form

Let’s describe for the \(i^{th}\) observation the underlying association between \(y_i\) and \(x_i\) by:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]

  • \(Y_i\): response (dependent) variable at i-th observation
  • \(\beta_0,\beta_1\): regression parameters for intercept and slope.
  • \(X_i\): known constant (independent or predictor variable) for i-th observation
  • \(\epsilon_i\): random error term

\[ E(\epsilon_i) = 0, \qquad var(\epsilon_i) = \sigma^2, \qquad cov(\epsilon_i,\epsilon_j) = 0 \text{ for all } i \neq j \tag{2}\]

where \(\epsilon_i\) is the random error term. Its sample counterpart is the residual \(e_i = y_i - \hat y_i\), the deviation between the observed value \(y_i\) and the fitted value \(\hat y_i\) of the linear model.

We make the following assumptions about \(\epsilon_i\):

  1. Linearity: The relationship between the independent variable(s) and the outcome is linear.
  2. Zero Mean: The expected value of the error term is zero.
  3. Homoscedasticity: The error term has a constant variance.
  4. Independence of Errors: There is no autocorrelation between the error terms.
  5. Normality: The error terms are normally distributed.

In matrix notation, given \(n\) observations of the explanatory variable x and the outcome variable y, the simple linear regression model becomes:

\[\begin{bmatrix} y_1\\ y_2 \\ y_3 \\ ... \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ 1 & x_3\\ ... \\ 1 & x_n \end{bmatrix} \bullet \begin{bmatrix} b_0\\ b_1\end{bmatrix} + \begin{bmatrix} e_1\\ e_2 \\ e_3 \\ ... \\ e_n \end{bmatrix} \tag{3}\]

Then the outcome y can be compactly modeled by:

\[Y = X\hat{B} + e \tag{4}\]

where the matrix X is called the design matrix (the first column contains 1s, and the second column contains the actual observations of x).

The coefficients \(b_0\) and \(b_1\) are selected by minimizing the sum of squares of the residuals \(e\). It can be proven that the least squares estimates are obtained by the following matrix formula:

\[\hat{B} = (X^TX)^{-1}X^TY = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} \tag{5}\]

That is, we’ve got one matrix equation which gives us both coefficient estimates.
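A minimal sketch of Equation 5 applied to the plasma-volume data (assuming `df` as defined earlier), which reproduces the coefficients that lm() reports below:

X <- cbind(1, df$`Body weight`)   # design matrix: a column of 1s and the x values
Y <- df$`Plasma volume`

# (X'X)^{-1} X'Y gives b0 and b1
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
B_hat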

model_lm <- lm(`Plasma volume`~ `Body weight`, data = df)
model_lm

Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)

Coefficients:
  (Intercept)  `Body weight`  
      0.08572        0.04362  
Summary information about the model
summary(model_lm)

Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27880 -0.14178 -0.01928  0.13986  0.32939 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.08572    1.02400   0.084   0.9360  
`Body weight`  0.04362    0.01527   2.857   0.0289 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2188 on 6 degrees of freedom
Multiple R-squared:  0.5763,    Adjusted R-squared:  0.5057 
F-statistic:  8.16 on 1 and 6 DF,  p-value: 0.02893
use PlasmaB , clear

regress Plasma_volume Body_weight 
      Source |       SS           df       MS      Number of obs   =         8
-------------+----------------------------------   F(1, 6)         =      8.16
       Model |  .390684388         1  .390684388   Prob > F        =    0.0289
    Residual |  .287265612         6  .047877602   R-squared       =    0.5763
-------------+----------------------------------   Adj R-squared   =    0.5057
       Total |      .67795         7      .09685   Root MSE        =    .21881

------------------------------------------------------------------------------
Plasma_vol~e | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 Body_weight |   .0436153   .0152684     2.86   0.029      .006255    .0809757
       _cons |   .0857243   1.023998     0.08   0.936    -2.419909    2.591357
------------------------------------------------------------------------------

The regression equation for our example becomes the one shown in the plot below.

The intercept \(b_0 = 0.0857\) is the mean Plasma volume at a body weight of 0 kg. Visually, it's the point where the line crosses the y-axis when x equals 0 (Figure 4).

# Create a linear regression line with annotations
ggplot(df, aes(x = `Body weight`, y = `Plasma volume`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", fullrange = TRUE) +
  xlim(0, 100) +
  ylim(0, 4) +
  geom_hline(yintercept = 0, color = "black", linetype = "solid") +
  geom_vline(xintercept = 0, color = "black", linetype = "solid") +
  annotate("text", x = 67.5, y = 3.05, label = bquote(paste (widehat(PlasmaVolume), "= 0.0857 + 0.0436*weight")), color = "purple", angle = 32, size = 5) +
  #theme_prism(base_size = 14, base_line_size = 0.4, palette = "office") +
  labs(x = "x = Body weight", y = "y = Plasma volume")
Figure 4: Scatter plot with fitted line crossing the y-axis.
Comment

Note that while there is a mathematical interpretation for the intercept of the regression line, it doesn’t hold any physical meaning in this case, as the observation of a weight of 0 is not feasible.

Of greater importance is the slope of the fitted line, \(b_1 = 0.0436\), as this describes the association between the Plasma volume and weight variables.

The positive slope indicates a positive association between plasma volume and weight.

For every 1 kg increase in weight, there is on average an associated increase of 0.0436 litres in plasma volume.

The Standard error (SE) of the coefficients

The standard error is a way to measure the “uncertainty” in the estimate of the regression coefficients and it can be calculated by:

\[ \mathrm{SE}_{b_j} = \sqrt{\mathrm{Var}(b_j)} \tag{6}\]

where the \(\mathrm{Var}(b_j)\) represents the variance of the coefficients of the model.
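In R, the estimated variances of the coefficients sit on the diagonal of the coefficient covariance matrix; a short sketch for the model fitted above:

# Standard errors of b0 and b1: square roots of the diagonal of vcov()
sqrt(diag(vcov(model_lm)))
# These match the "Std. Error" column of summary(model_lm)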

Test statistic and confidence intervals

The hypothesis testing for the coefficients of the regression model are:

\[ \begin{aligned} H_0 &: \beta_j = 0\\ \text{vs } H_1&: \beta_j \neq 0. \end{aligned} \]

The null hypothesis, \(H_{0}\), states that the coefficients are equal to zero, and the alternative hypothesis, \(H_{1}\), states that the coefficients are not equal to zero.

The t-statistic for the coefficients are defined by the following equation:

\[ \ t_j = \frac{\ b_j}{\text{SE}_{b_j}} \tag{7}\]

In R:

# Compute t for the coefficients bo and b1
summary(model_lm)

Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27880 -0.14178 -0.01928  0.13986  0.32939 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.08572    1.02400   0.084   0.9360  
`Body weight`  0.04362    0.01527   2.857   0.0289 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2188 on 6 degrees of freedom
Multiple R-squared:  0.5763,    Adjusted R-squared:  0.5057 
F-statistic:  8.16 on 1 and 6 DF,  p-value: 0.02893

We can use the p-value to guide our decision:

  • If p − value < 0.05, reject the null hypothesis, \(H_{0}\).
  • If p − value ≥ 0.05, do not reject the null hypothesis, \(H_{0}\).

In our example \(p = 0.0289 \le 0.05\), hence we reject \(H_{0}\).

The \(95\%\) CI of the coefficients for a significance level α, \(df=n-k\) degrees of freedom and for a two-tailed t-test is given by:

\[ 95\% \ \text{CI}_{b_j} = b_j \pm t^{*}_{df;a/2} \cdot \text{SE}_{b_j} \tag{8}\]
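A sketch of Equation 8 for model_lm, both with confint() and by hand for the slope (here \(df = n - k = 8 - 2 = 6\)):

confint(model_lm, level = 0.95)

# By hand for the slope: b1 +/- t* x SE(b1)
b1  <- coef(model_lm)[2]
se1 <- sqrt(diag(vcov(model_lm)))[2]
b1 + c(-1, 1) * qt(0.975, df = 6) * se1    # about 0.006 to 0.081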

Summary table

All the estimates and statistics of interest for the regression model are presented in the following summary table (Figure 5):

term        estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept       0.09       1.02       0.08    0.936     -2.42      2.59
Body weight     0.04       0.01       2.86    0.029      0.01      0.08
Figure 5: The linear regression table.

 

Verifying Model Assumptions

Note

In the breast cancer dataset, the variable radius mean represents the average size of the tumour. We aim to investigate the effect of compactness mean on radius mean using simple linear regression.

Specific assumptions have to be met for reliable hypothesis tests and confidence intervals in simple linear regression: linearity of the data, independence of the residuals, normality of the residuals, and equality of variance of the residuals.

We will describe some statistical tests and diagnostic plots in R for testing the assumptions underlying the linear regression model.

Warning

Variations or different versions of the statistical tests may have different results.
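The checks that follow refer to a fitted model named model_lm2 and a list of diagnostic plots named diagnostic_plots; one possible sketch of how these could be created (assuming the breast cancer data read earlier as `data` and the performance and see packages):

library(performance)   # check_autocorrelation(), check_normality(), check_model()
library(see)           # plotting methods for performance objects

# Simple linear regression of radius_mean on compactness_mean
model_lm2 <- lm(radius_mean ~ compactness_mean, data = data)

# One way to obtain the individual diagnostic panels indexed below
diagnostic_plots <- plot(check_model(model_lm2, panel = FALSE))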

Linearity of the data

Linear association between the independent variable x and the dependent variable (outcome) y.

Independence of the residuals

The residuals (or errors) resulting from the model should be independent of each other.

  Statistical test: Durbin-Watson test

We can perform a Durbin-Watson-Test to check for autocorrelated residuals (a p-value < 0.05 indicates autocorrelated residuals).

set.seed(126)
check_autocorrelation(model_lm2, nsim = 1000)
OK: Residuals appear to be independent and not autocorrelated (p = 0.064).

 

Normality of the residuals

The residuals should be normally distributed.

  Statistical test: Shapiro test

check_normality(model_lm2)
Warning: Non-normality of residuals detected (p < .001).

The function performs a shapiro.test and checks the standardized residuals for normal distribution. According to the documentation “this formal test almost always yields significant results for the distribution of residuals for large samples and visual inspection (e.g., Q-Q plots) are preferable.”

  Plot: Normal Q-Q plot

diagnostic_plots[5] 
$QQ

The normal Q-Q plot is used to examine whether the residuals are roughly normally distributed. Ideally, the standardized residual points should follow the green line. In our example, the Q-Q plot deviates greatly from that line, indicating non-normal residuals.

 

Equality of variance of the residuals (Homoscedasticity)

The spread or dispersion of the residuals should remain roughly the same for all values of the X variable.

  Statistical test: Breusch-Pagan test

The most common test for checking equality of variance of the residuals (homoscedasticity) is the Breusch-Pagan test (a p-value < 0.05 indicates the presence of heteroscedasticity).

The original version of Breusch-Pagan test:

lmtest::bptest(model_lm2, studentize = F) 

    Breusch-Pagan test

data:  model_lm2
BP = 78.115, df = 1, p-value < 0.00000000000000022

However, the studentized BP test (which is the default) is more robust than the original one:

lmtest::bptest(model_lm2) 

    studentized Breusch-Pagan test

data:  model_lm2
BP = 56.116, df = 1, p-value = 0.00000000000006833

  Plot: Square root of standardized residuals vs Fitted values

diagnostic_plots[3] 
$HOMOGENEITY

This plot checks the assumption of equal variance of the residuals (homoscedasticity). To verify the assumption, we plot the square root of standardized residuals against the fitted values from the regression. In this case, the residuals are re-scaled and all values are positive. A horizontal line with equally spread points would be the desired pattern. However, if the spread of residuals widens or narrows systematically as we move along the x-axis, it indicates a violation of homoscedasticity.

 

Influential observations in the data

The dataset should not include data points with exceptional influence on the model.

  Plot: Standardized residuals vs Leverage

diagnostic_plots[4] 
$OUTLIERS

This plot is used to detect influential observations: data points that might have a substantial impact on the regression model's results when included in or excluded from the analysis. If any point in this plot falls outside the dashed green lines, it is considered an influential observation. In our example, all the data points are inside the dashed green lines, suggesting that none of the observations is influential on the fitted model.

Comment

The standardized residual is the residual divided by its standard deviation. The standardization, which results in a mean of zero and a variance of one, enables the residuals to be compared on the "standard scale". A standardized residual beyond +/- 2 indicates an unusual observation, while a value beyond +/- 3 indicates a highly unusual observation.

Overall Model fit

Two related measures are typically used to evaluate the linear regression model: the residual standard error (RSE) and the coefficient of determination \(R^2\).

Residual standard error (RSE)

The residual standard error is the square root of the residual variance \(\sigma^2_{e}\). It informs us about the average error of the regression model in the units of the outcome variable; lower values are better because they indicate that the observations are closer to the fitted line. In our example, \(RSE = 0.2188\) litres on 6 degrees of freedom. In general:

\[\ RSE = \sqrt{\frac{SS_{residuals}}{n-k}} \tag{9}\]
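For model_lm this can be computed directly (a short sketch):

# RSE by hand: sqrt(SS_residuals / (n - k)), and via sigma()
sqrt(sum(residuals(model_lm)^2) / df.residual(model_lm))
sigma(model_lm)    # 0.2188 in this example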

Coefficient of determination \(R^2\)

The regression model allows us to divide the total variation in the outcome into two components: one explained by the model, and the other that remains unexplained by the model.

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{model}} + \mathrm{SS}_{\mathrm{residual}} \]

Proof

\[ \begin{aligned} \sum_{i=1}^n (Y_i - \bar{Y})^2 &= \sum_{i=1}^n (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 \\ &= \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^n(Y_i - \hat{Y}_i)(\hat{Y}_i-\bar{Y}) \\ &= \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n(\hat{Y}_i -\bar{Y})^2 \\ SS_{total} &= SS_{residuals} + SS_{model} \\ \end{aligned} \]

where SSR (\(SS_{model}\)) is the regression sum of squares, which measures how the conditional mean varies about a central value; this is the variation explained by the model (what the model learns from the data).

The cross-product term in the decomposition is 0:

\[ \begin{aligned} \sum_{i=1}^n (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) &= \sum_{i=1}^{n}(Y_i - \bar{Y} -b_1 (X_i - \bar{X}))(\bar{Y} + b_1 (X_i - \bar{X})-\bar{Y}) \\ &= b_1 \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) - b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= b_1 \frac{\sum_{i=1}^{n}(Y_i -\bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \sum_{i=1}^{n}(X_i - \bar{X})^2 - b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 - b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= 0 \end{aligned} \]

\[ \begin{aligned} SSTO &= SSR + SSE \\ (n-1 d.f) &= (1 d.f.) + (n-2 d.f.) \end{aligned} \]

Source of Variation   Sum of Squares   df     Mean Square    F
Regression (model)    SSR              1      MSR = SSR/df   MSR/MSE
Error                 SSE              n-2    MSE = SSE/df
Total (Corrected)     SSTO             n-1

\[ E(MSE) = \sigma^2 \\ E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

Coefficient of Determination (\(R^2\)) and Adjusted \(R^2\)

\(R^2\) is the square of the correlation coefficient r.

  • It tells us the proportion of variability in the outcome variable that is explained by the model.
  • In our regression, \(R^2 = 0.5763\): about \(58\%\) of the variability in plasma volume is explained by body weight.
  • Adjusted \(R^2 = 0.5057\): it tells us how well the model explains the outcome after accounting for sample size and the number of predictors, giving a more honest estimate of model performance, especially with small datasets. After adjusting for model complexity and sample size, about \(51\%\) of the variability in plasma volume is still explained by body weight.

\[ R^2 = \frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{residuals}}{SS_{total}} \]

where \(0 \le R^2 \le 1\)

Remember

Interpretation: The proportionate reduction of the total variation in Y after fitting a linear model in X.
It is not really correct to say that \(R^2\) is the “variation in Y explained by X”.

\(R^2\) is related to the correlation coefficient between Y and X:

\[ R^2 = (r)^2 \]

where \(r= corr(x,y)\) is an estimate of the Pearson correlation coefficient.

The coefficient of determination, \(R^2\), indicates the percentage of the total variation in the dependent variable Y that can be explained by the regression model (in this simple case the independent variable X). Hence, it is a measure of the ‘goodness of fit’ of the regression line to the data.

\[\ R^2 = \frac{\ explained \ \ variation}{total \ \ variation} = \frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{residual}}{SS_{total}} \tag{10}\]

The R-squared measure ranges between 0 and 1. When \(R^2\) approaches 1, it indicates that a substantial proportion of the variation in the dependent variable Y has been explained by the regression model. Conversely, when it nears 0, it implies that the regression model has not effectively explained most of the variation in the outcome.

In our example, \(R^2 = 0.58\) indicates that about 58% of the variation in Plasma Volume can be explained by the variation of the body weight.

Comment

In simple linear regression \(\sqrt{0.5762732} = 0.7591266\) which equals to the Pearson’s correlation coefficient r.

(cor(df$`Body weight`,df$`Plasma volume`))^2
[1] 0.5762732

In matrix notation:

  • the sum of squares total is defined as:

\[ \mathrm{SS}_{total} = (\mathbf{Y} - \bar{\mathbf{Y}})^2 = (\mathbf{Y} - \bar{\mathbf{Y}})^\intercal(\mathbf{Y} - \bar{\mathbf{Y}}) \]

  • the residual sum of squares is:

\[ SS_{residuals} = \mathbf{e}^2 =\mathbf{e}^\intercal\mathbf{e} \]

Therefore,

\[ \mathrm{SS}_{\mathrm{model}} = \mathrm{SS}_{\mathrm{total}} - \mathrm{SS}_{\mathrm{residual}} \]
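A short sketch verifying the decomposition and \(R^2\) for model_lm:

y      <- df$`Plasma volume`
ss_tot <- sum((y - mean(y))^2)          # SS_total
ss_res <- sum(residuals(model_lm)^2)    # SS_residuals
ss_mod <- ss_tot - ss_res               # SS_model

c(SS_total = ss_tot, SS_model = ss_mod, SS_residual = ss_res)
ss_mod / ss_tot                         # R-squared, about 0.576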

F-Test

The F-test for the model is a test of the null hypothesis that none of the independent variables linearly predict the dependent variable, that is, the model parameters are jointly zero:

The F-statistic tests whether the regression model explains a significant portion of the total variability in the outcome variable:

\[H_0 : \beta_1 = \ldots = \beta_k = 0.\] The regression mean sum of squares \(MSR = \frac{(\hat{y} - \bar{y})'(\hat{y} - \bar{y})}{k-1}\) and the error mean sum of squares \(MSE = \frac{\hat{\epsilon}'\hat{\epsilon}}{n-k}\) are each (scaled) chi-square variables. Their ratio has an F distribution with \(k - 1\) numerator degrees of freedom and \(n - k\) denominator degrees of freedom. The F statistic can also be expressed in terms of the coefficient of determination \(R^2 = \frac{SS_{model}}{SS_{total}}\).

\[F(k - 1, n - k) = \frac{MS_{model}}{MS_{residuals}} = \frac{R^2}{1 - R^2} \frac{n-k}{k-1}\]

The expected value of MSE is \(\sigma^2\). If \(H_0\) is true, that is, there is no relationship between the predictors and the response, then the expected value of MSR is also \(\sigma^2\), so \(F \approx 1\). As \(R^2 \rightarrow 1\), \(F \rightarrow \infty\), and as \(R^2 \rightarrow 0\), \(F \rightarrow 0\). F increases with \(n\) and decreases with \(k\).

A large F-value indicates that the model explains significantly more variability than the mean alone. In our example, \(F = 8.16\) with a p-value of 0.029, so we reject the null hypothesis that the slope is zero.
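The F-statistic reported by summary(model_lm) can also be read off the ANOVA table of the fitted model, or computed from \(R^2\) (a sketch with \(n = 8\) observations and \(k = 2\) parameters):

anova(model_lm)   # Sum Sq, Mean Sq, F = 8.16 and p = 0.029 for `Body weight`

# From R-squared
r2 <- summary(model_lm)$r.squared
(r2 / (1 - r2)) * (8 - 2) / (2 - 1)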

Present the results

We can use the above information to write up a final report:

Final report

In summary, the regression coefficient of the weight is positive and significantly different from zero \((p = 0.029)\). There is on average an increase of 0.0436 (\(95\%\) CI: 0.0063 to 0.0810) litres in Plasma Volume for every 1 kg increase in weight. The model explains a substantial proportion of variance (\(58\%\) of the variation in Plasma Volume can be explained by the variation of body weight).

Multiple Linear Regression and General Linear Model

We would like to fit a model that includes all potentially important variables simultaneously. This would help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables. This is the strategy used in multiple regression. While we remain cautious about making any causal interpretations using multiple regression, such models are a common first step in providing evidence of a causal connection.

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p+\epsilon_i \tag{11}\]

Just as with the single predictor case, a multiple regression model may be missing important components or it might not precisely represent the relationship between the outcome and the available explanatory variables. While no model is perfect, we wish to explore the possibility that this model may fit the data reasonably well.

We estimate the parameters \(\beta_0\), \(\beta_1\), …, \(\beta_p\) in the same way as we did in the case of a single predictor. We select \(b_0\), \(b_1\), …, \(b_p\) that minimize the sum of the squared residuals:

\[ \text{SSE} = e_1^2 + e_2^2 + \dots + e_{n}^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \]

heart<-readr::read_csv("heart.csv")
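The columns of heart.csv are not shown here, so as a concrete sketch we can instead use the birthweight data loaded earlier (abrevaya1.dta, object BirthWeight2); the predictors below are the ones used in the Stata model fitted later in this section:

# Multiple regression of birthweight (kg) on several maternal characteristics
mlr <- lm(birwtkg ~ mage + factor(smoke) + factor(male) + factor(married) +
            factor(black) + kess,
          data = BirthWeight2)
summary(mlr)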

General Linear model

Recap : Analysis Of Variance

Use the one-way ANOVA test to compare the mean response of a continuous dependent variable among the levels of a factor variable (if you only have two levels, use the independent-samples t-test). The observations must be independent, meaning they cannot influence each other (e.g., the same participant appearing in different groups, or participants interacting with each other to produce the observed outcome).

The ANOVA method decomposes the deviation of observation \(Y_{ij}\) around the overall mean \(\bar{Y}_{..}\) into two parts: the deviation of the observations around their treatment means, \(SSE\), and the deviation of the treatment means around the overall mean, \(SSR\). The ratio of the corresponding mean squares, \(F = \frac{SSR/(k-1)}{SSE/(N-k)}\), follows an F-distribution with \(k-1\) numerator dof and \(N-k\) denominator dof. The more observation variance captured by the treatments, the larger \(F\) is, and the less likely the null hypothesis, \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\), is true.

Compare the F-statistic to the F-distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom

The F test does not indicate which populations cause the rejection of \(H_0\). For this, use one of the post-hoc tests: Tukey, Fisher’s Least Significant Difference (LSD), Bonferroni, Scheffe, or Dunnett.

ANOVA returns reliable results if the following conditions hold:

    1. there should be no significant outliers within the factor levels;
    2. the response variable should be approximately normally distributed within each factor level; and
    3. the response variable variances within the factor levels should be equal.

Example dataset

Data set abrevaya1.dta contains a sample of data from the first birth of mothers in the study, with data from 3,978 mothers who have complete data for the variables of interest and who had singleton births. Birth outcomes of interest are birthweight and whether the baby was of low birthweight (birthweight < 2500 grams).

ANC

Note: “kess” is an example of a factor – a categorical variable that divides the data into three groups according to the quality of antenatal care (adequate, intermediate or inadequate)

Question of interest is: “What can be said about the variation in birthweight between the different levels of antenatal care?”

df<-haven::read_dta("abrevaya1.dta") |> 
   mutate(birwtkg = birwt/1000,
          kess_cat = as.factor(kess))
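A sketch of the one-way ANOVA and the Bonferroni-adjusted pairwise comparisons behind the conclusion below (using the df object just created):

# One-way ANOVA of birthweight (kg) by Kessner antenatal-care category
anova_fit <- aov(birwtkg ~ kess_cat, data = df)
summary(anova_fit)                       # overall F-test

# Bonferroni-adjusted pairwise comparisons of group means
pairwise.t.test(df$birwtkg, df$kess_cat, p.adjust.method = "bonferroni")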

Conclusion

The overall analysis of variance is significant \((F=24.04;\ P<0.0001)\), so we can reject \(H_0\) and conclude that the mean birthweight is not the same for all three groups of mothers defined by the level of antenatal care.

The Bonferroni-adjusted comparison of means shows that the mean birthweight is significantly lower for mothers who had either intermediate or inadequate antenatal care compared to mothers who had adequate antenatal care, while the difference in mean birthweight between mothers with intermediate antenatal care and mothers with inadequate antenatal care is not statistically significant.

Anova and Regression

There is a close connection between multiple regression models and analysis of variance models. We can measure the effect of intermediate antenatal care and inadequate antenatal care relative to adequate antenatal care (the first level and also the most common in our study); i.e. we set Kessner level 1 (adequate antenatal care) to be the reference or baseline level. We then define variables to measure the effect of Kess level 2 and Kess level 3 relative to the baseline, as in the sketch below.
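One way to define these indicator variables in R and check that the regression reproduces the one-way ANOVA (a sketch, with Kessner level 1 as the reference):

# Indicator (dummy) variables for Kessner levels 2 and 3
df <- df |>
  mutate(kess2 = as.numeric(kess == 2),   # 1 = intermediate antenatal care
         kess3 = as.numeric(kess == 3))   # 1 = inadequate antenatal care

lm_kess <- lm(birwtkg ~ kess2 + kess3, data = df)
anova(lm_kess)    # same F-ratio and P-value as the one-way ANOVA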

Note

If we compare the analysis of variance table for the regression model to that of the one-way ANOVA, we see that they are in fact identical. The model sum of squares and degrees of freedom are equal to the "Between groups" sum of squares and degrees of freedom in the ANOVA, and the residual sum of squares and residual degrees of freedom likewise match the within-groups sum of squares and degrees of freedom. This results in identical F-ratios (\(24.04\)) and identical P-values (\(<0.0001\)).

General linear model: allows us to include both variates (quantitative variables) and factors as explanatory variables, and thus lets us analyze a wide range of data from research studies, provided it is reasonable to assume that the quantitative outcome or response variable has an underlying normal distribution.

Example 2

  1. Does birthweight increase with increasing maternal age?
  2. Is there a difference in mean birthweight between babies whose mothers smoked & babies whose mothers did not smoke adjusting for maternal age?
  3. Does birth weight increase with increasing maternal age at same rate for babies where mother smoked and babies where mother did not smoke during pregnancy?
df |> 
  ggplot(aes(x = mage, y = birwtkg, color = factor(smoke))) +
  geom_point() +
  geom_smooth(method = "lm")

df |> 
  ggplot(aes(x = mage, y = birwtkg, color = factor(smoke))) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~smoke)

Note

There is overwhelming evidence that birthweight is associated with maternal age: for each additional year of maternal age, birthweight increases by 0.012 kg (95% CI 0.0089 to 0.015; P < 0.0001).

Model equation

\[ \mbox{E}(Birwtkg)=\beta_0 + \beta_1*\text{mage_cent}+ \beta_2*\text{(smoke=Yes)} \]

Note

The post-model test ("testparm i.smoke") shows that adding i.smoke significantly improves the fit of the model \((F=138.15;\ P<0.0001)\). The estimated effect of smoking is -0.268 (se 0.023) after adjusting for the effect of maternal age. Compare this with the result of the t-test, which said that if the mother smoked, birth weight was lower by 0.290 (se 0.023): the effect of smoking is slightly reduced by adjusting for mother's age. In this case the standard error is unchanged (to 3 decimal places); often adjusting leads to a smaller s.e.

Tip

The post-estimation test after fitting the interaction term shows no evidence of an improvement to the model \((F=0.57;\ P=0.449)\). We would therefore not choose the model with an interaction (the separate-lines model); we would choose the parallel-lines model. This is an example of the principle of parsimony: we only choose a more complicated model over a simpler model if it fits significantly better.

Also Note

The slope does not differ between the levels of the Smoke variable, so there is no synergy or interaction between these variables: smoking status does not change the effect of maternal age on birthweight.

use abrevaya1

egen mage_bar = mean(mage)
gen mage_cent = mage -mage_bar


gen birwtkg = birwt/1000

reg birwtkg i.smoke mage i.male i.married i.black kess

cprplot mage , lowess

graph export cpr.png, replace
      Source |       SS           df       MS      Number of obs   =     3,978
-------------+----------------------------------   F(6, 3971)      =     55.80
       Model |  79.3706313         6  13.2284385   Prob > F        =    0.0000
    Residual |   941.46269     3,971  .237084535   R-squared       =    0.0778
-------------+----------------------------------   Adj R-squared   =    0.0764
       Total |  1020.83332     3,977  .256684265   Root MSE        =    .48691

------------------------------------------------------------------------------
     birwtkg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     1.smoke |  -.2404347   .0233559   -10.29   0.000    -.2862254    -.194644
        mage |    .006056   .0015381     3.94   0.000     .0030404    .0090716
             |
        male |
       male  |   .1027676   .0154436     6.65   0.000     .0724894    .1330459
             |
     married |
        Yes  |   .0695001   .0286437     2.43   0.015     .0133424    .1256577
             |
       black |
        Yes  |  -.2073296   .0324924    -6.38   0.000    -.2710329   -.1436262
        kess |  -.0356693   .0151464    -2.35   0.019    -.0653647   -.0059738
       _cons |   3.252061   .0533562    60.95   0.000     3.147453    3.356669
------------------------------------------------------------------------------



Collinearity

Multicollinearity refers to correlation among explanatory variables.

  • large changes in the estimated regression coefficient when a predictor variable is added or deleted, or when an observation is altered or deleted.
  • nonsignificant results in individual tests on regression coefficients for important predictor variables.
  • estimated regression coefficients with an algebraic sign that is the opposite of that expected from theoretical consideration or prior experience.
  • large coefficients of simple correlation between pairs of predictor variables in the correlation matrix.
  • wide confidence intervals for the regression coefficients representing important predictor variables.

Multicollinearity arises when some of the X variables are so highly correlated that the inverse \((X'X)^{-1}\) does not exist or is very computationally unstable.

Correlated Predictor Variables: if some X variables are “perfectly” correlated, the system is undetermined and there are an infinite number of models that fit the data. That is, if \(X'X\) is singular, then \((X'X)^{-1}\) doesn’t exist. Then,

  • parameters cannot be interpreted (\(\mathbf{b = (X'X)^{-1}X'y}\))
  • sampling variability is infinite (\(\mathbf{s^2(b) = MSE (X'X)^{-1}}\))
VIFs

Let \(R^2_k\) be the coefficient of multiple determination when \(X_k\) is regressed on the p - 2 other X variables in the model. Then,

\[ VIF_k = \frac{1}{1-R^2_k} \]

  • Large values indicate that near collinearity is inflating the variance of \(b_k\), since \(var(b_k) \propto \sigma^2 (VIF_k)\).
  • Typically, the rule of thumb is that \(VIF_k > 4\) means you should investigate why this is the case, and \(VIF_k > 10\) indicates a serious collinearity problem that could result in poor parameter estimates.
  • The mean of all VIFs provides an estimate of the ratio of the true multicollinearity to a model where the X variables are uncorrelated.
  • There is serious multicollinearity if \(\overline{VIF} \gg 1\).
Multicollinearity

The multicollinearity condition is violated when two or more of the predictors in a regression model are correlated. Multicollinearity can occur for structural reasons, as when one variable is a transformation of another variable, or for data reasons, as occurs in observational studies. Multicollinearity is a problem because it inflates the variances of the estimated coefficients, resulting in larger confidence intervals.

When predictor variables are correlated, the precision of the estimated regression coefficients decreases with each added correlated predictor variable. The usual interpretation of a slope coefficient as the change in the mean response for each additional unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor necessarily changes the others.

A residuals vs fits plot \((\epsilon \sim \hat{Y})\) should have correlation \(\rho \sim 0\). A correlation matrix is helpful for picking out the correlation strengths. A good rule of thumb is correlation coefficients should be less than 0.80. However, this test may not work when a variable is correlated with a function of other variables. A model with multicollinearity may have a significant F-test with insignificant individual slope estimator t-tests. Another way to detect multicollinearity is by calculating variance inflation factors. The predictor variance \(Var(\hat{\beta_k})\) increases by a factor

\[VIF_k = \frac{1}{1 - R_k^2}\]

where \(R_k^2\) is the \(R^2\) of a regression of the \(k^{th}\) predictor on the remaining predictors. A \(VIF_k\) of \(1\) indicates no inflation (no correlation). A \(VIF_k \geq 4\) warrants investigation. A \(VIF_k \geq 10\) requires correction.

If the multicollinearity occurs because you are using a polynomial regression model, center the predictor variables (subtract their means).

In summary (Multicollinearity)
  1. May be significant change in regression coefficients if we add or delete an explanatory variable.
  2. Estimated standard errors of fitted coefficients are inflated.
  3. Estimated coefficients may not be statistically significant, even though statistical relationship exists between outcome and explanatory variables.

We can check for multicollinearity after fitting the regression model by calculating the variance inflation factor (VIF), which gives for each coefficient an estimate of the factor by which its variance has been increased due to multicollinearity. Most analysts apply informal rules of thumb to the VIF; there is evidence of multicollinearity if:

  • Largest VIF is greater than 10
  • Mean of all VIF’s is considerably larger than 1
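VIFs do not need to be computed by hand; a sketch for a fitted multiple regression (for example the mlr object sketched earlier, or any lm fit with several predictors), assuming the car or performance package:

# Variance inflation factors (generalised VIFs are reported for factor terms)
car::vif(mlr)

# Alternative from the performance package
performance::check_collinearity(mlr)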

APPENDIX

\(Y_i\) is random since \(\epsilon_i\) is:

\[ \begin{aligned} E(Y_i) &= E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= E(\beta_0) + E(\beta_1 X_i) + E(\epsilon) \\ &= \beta_0 + \beta_1 X_i \end{aligned} \]

\[ \begin{aligned} var(Y_i) &= var(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= var(\epsilon_i) \\ &= \sigma^2 \end{aligned} \]

\[ \beta_1 = \frac{Cov(Y_i, X_i)}{Var(X_i)} \] \[b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}=\frac{\sigma_{xy}}{\sigma_{xx}}\] \[b_0 = \frac{1}{n}(\sum_{i=1}^{n}Y_i - b_1\sum_{i=1}^{n}X_i) = \bar{Y} - b_1 \bar{X}\]

Properties of Least Squares Estimators

\[E(b_1) = \beta_1\] \[E(b_0) = E(\bar{Y}) - \bar{X}\beta_1\] \[E(\bar{Y}) = \beta_0 + \beta_1 \bar{X}\] \[E(b_0) = \beta_0\] \[var(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\] \[var(b_0) = \sigma^2 (\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}) \]

Mean Square Error

\[ MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}e_i^2}{n-2} = \frac{\sum(Y_i - \hat{Y_i})^2}{n-2} \]

Interpretations

Model         Form                                             Interpretation of \(\beta_1\)             In words
Level-Level   \(y =\beta_0+\beta_1x+\epsilon\)                 \(\Delta y = \beta_1 \Delta x\)           A unit change in x results in a \(\beta_1\)-unit change in y
Log-Level     \(ln(y) = \beta_0 + \beta_1x + \epsilon\)        \(\% \Delta y=100 \beta_1 \Delta x\)      A unit change in x results in a \(100\beta_1\%\) change in y
Level-Log     \(y = \beta_0 + \beta_1 ln(x) + \epsilon\)       \(\Delta y = (\beta_1/100)\%\Delta x\)    A one-percent change in x results in a \(\beta_1/100\)-unit change in y
Log-Log       \(ln(y) = \beta_0 + \beta_1 ln(x) +\epsilon\)    \(\% \Delta y= \beta_1 \% \Delta x\)      A one-percent change in x results in a \(\beta_1\)-percent change in y

Model Diagnostics

Akaike Information Criterion (AIC)

\[ AIC_p = n \ln(\frac{SSE_p}{n}) + 2p \]

  • Increasing p (the number of parameters) decreases the first term and increases the second term.
  • We want a model with a small value of AIC. If the AIC increases when a parameter is added to the model, that parameter is not needed.
  • AIC represents a trade-off between precision of fit and the number of parameters used.

Bayes (or Schwarz) Information Criterion

\[ BIC = n \ln(\frac{SSE_p}{n})+ (\ln n)p \]

The coefficient in front of p tends to penalize more heavily models with a larger number of parameters (as compared to AIC).
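Both criteria are available directly for a fitted lm object; a short sketch (R's AIC() and BIC() are computed from the log-likelihood, but for models fitted to the same data they rank models in the same way as the SSE-based formulas above):

AIC(model_lm)   # Akaike information criterion
BIC(model_lm)   # Schwarz / Bayesian information criterion
# Lower values indicate a better trade-off between fit and number of parameters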

Stepwise Selection Procedures

The forward stepwise procedure:

  • finds a plausible subset sequentially.
  • at each step, a variable is added or deleted.
  • criterion for adding or deleting is based on SSE, \(R^2\), T, or F-statistic.

Note:

  • Instead of using exact F-values, computer packages usually specify the equivalent “significance” level. For example, SLE is the “significance” level to enter, and SLS is the “significance” level to stay. The SLE and SLS are guides rather than true tests of significance.
  • The choice of SLE and SLS represents a balancing of opposing tendencies. Use of large SLE values tends to result in too many predictor variables; models with small SLE tend to be under-specified resulting in \(\sigma^2\) being badly overestimated.
  • As for choice of SLE, can choose between 0.05 and 0.5.
  • If SLE > SLS then a cycling pattern may occur, although most computer packages can detect this and stop when it happens. A quick fix: SLS = SLE / 2.
  • If SLE < SLS then the procedure is conservative and may lead variables with low contribution to be retained.
  • Order of variable entry does not matter.

Automated Selection Procedures:

  • Forward selection: same idea as forward stepwise except it doesn't test whether variables should be dropped once entered (not as good as forward stepwise).
  • Backward Elimination: begin with all variables and identifies the one with the smallest F-value to be dropped.


Diagnostics

Influential observations/outliers

Hat matrix

\[ \mathbf{H = X(X'X)^{-1}X'} \]

where \(\mathbf{\hat{Y}= HY, e = (I-H)Y}\) and \(var(\mathbf{e}) = \sigma^2 (\mathbf{I-H})\)

  • \(\sigma^2(e_i) = \sigma^2 (1-h_{ii})\), where \(h_{ii}\) is the i-th element of the main diagonal of \(\mathbf{H}\) (must be between 0 and 1).
  • \(\sum_{i=1}^{n} h_{ii} = p\)
  • \(cov(e_i,e_j) = -h_{ij}\sigma^2\) where \(i \neq j\)
  • Estimate: \(s^2(e_i) = MSE (1-h_{ii})\)
  • Estimate: \(\hat{cov}(e_i,e_j) = -h_{ij}(MSE)\); if model assumption are correct, this covariance is very small for large data sets.
  • If \(\mathbf{x}_i = [1 X_{i,1} ... X_{i,p-1}]'\) (the vector of X-values for a given response), then \(h_{ii} = \mathbf{x_i'(X'X)^{-1}x_i}\) (depends on relative positions of the design points \(X_{i,1},...,X_{i,p-1}\))
Studentized Residuals

\[ r_i = \frac{e_i}{s(e_i)} \\ r_i \sim N(0,1) \]

where \(s(e_i) = \sqrt{MSE(1-h_{ii})}\). \(r_i\) is called the studentized residual or standardized residual.

  • You can also use the semi-studentized residual from before, \(e_i^*= e_i / \sqrt{MSE}\); this doesn't take into account the different variances of each \(e_i\).

We would want to see the model without a particular value. You delete the i-th case, fit the regression to the remaining n-1 cases, get estimated responses for the i-th case, \(\hat{Y}_{i(i)}\), and find the difference, called the deleted residual:

\[ d_i = Y_i - \hat{Y}_{i(i)} \\ = \frac{e_i}{1-h_{ii}} \]

We don't need to recompute the regression model for each case, since \(d_i\) can be obtained from \(e_i\) and \(h_{ii}\).

As \(h_{ii}\) increases, \(d_i\) increases.

\[ s^2(d_i)= \frac{MSE_{(i)}}{1-h_{ii}} \]

where \(MSE_{(i)}\) is the mean square error when the i-th case is omitted.

Let

\[ t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}(1-h_{ii})}} \]

be the studentized deleted residual, which follows a t-distribution with n-p-1 df.

\[ (n-p)MSE = (n-p-1)MSE_{(i)}+ \frac{e^2_{i}}{1-h_{ii}} \]

hence, we do not need to fit regressions for each case and

\[ t_i = e_i (\frac{n-p-1}{SSE(1-h_{ii})-e^2_i})^{1/2} \]

The outlying Y-observations are those cases whose studentized deleted residuals are large in absolute value. If there are many residuals to consider, a Bonferroni critical value can be used (\(t_{1-\alpha/2n;\,n-p-1}\)).

Outlying X Observations

Recall, \(0 \le h_{ii} \le 1\) and \(\sum_{i=1}^{n}h_{ii}=p\) (the total number of parameters)

A large \(h_{ii}\) indicates that the i-th case is distant from the center of all X observations (the leverage of the i-th case). That is, a large value suggests that the observation exercises substantial leverage in determining the fitted value \(\hat{Y}_i\)

We have \(\mathbf{\hat{Y}=HY}\), a linear combination of Y-values; \(h_{ii}\) is the weight of the observation \(Y_i\); so \(h_{ii}\) measures the role of the X values in determining how important \(Y_i\) is in affecting the \(\hat{Y}_i\).

Large \(h_{ii}\) implies \(var(e_i)\) is small, so larger \(h_{ii}\) implies that \(\hat{Y}_i\) is close to \(Y_i\)

  • small data sets: \(h_{ii} > .5\) suggests "large".
  • large data sets: \(h_{ii} > \frac{2p}{n}\) is "large".

Using the hat matrix to identify extrapolation:

  • Let \(\mathbf{x_{new}}\) be a vector containing the X values for which an inference about a mean response or a new observation is to be made.
  • Let \(\mathbf{X}\) be the data design matrix used to fit the data. Then, if \(h_{new,new} = \mathbf{x_{new}'(X'X)^{-1}x_{new}}\) is within the range of leverage values (\(h_{ii}\)) for cases in the data set, no extrapolation is involved; otherwise, extrapolation is indicated.

Identifying Influential Cases:

By influential we mean that exclusion of an observation causes major changes in the fitted regression (not all outliers are influential).

  • Influence on Single Fitted Values: DFFITS
  • Influence on All Fitted Values: Cook’s D
  • Influence on the Regression Coefficients: DFBETAS
DFFITS

Influence on Single Fitted Values: DFFITS

\[ (DFFITS)_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}h_{ii}}} = t_i (\frac{h_{ii}}{1-h_{ii}})^{1/2} \]

  • the standardized difference between the i-th fitted value computed with all observations and with the i-th case removed.

  • the studentized deleted residual multiplied by a factor that is a function of the i-th leverage value.

  • influence if:

    • small to medium data sets: \(|DFFITS|>1\)
    • large data sets: \(|DFFITS|> 2 \sqrt{p/n}\)
Cook’s D

Influence on All Fitted Values: Cook’s D

\[ D_i = \frac{\sum_{j=1}^{n}(\hat{Y}_j - \hat{Y}_{j(i)})^2}{p(MSE)} \\ = \frac{e^2_i}{p(MSE)}(\frac{h_{ii}}{(1-h_{ii})^2}) \]

gives the influence of i-th case on all fitted values.

If \(e_i\) increases or \(h_{ii}\) increases, then \(D_i\) increases.

\(D_i\) is compared to the percentiles of an \(F_{(p,n-p)}\) distribution. If the corresponding percentile is greater than \(.5\) (50\%), then the i-th case has major influence. In practice, if \(D_i > 4/n\), the i-th case is often flagged as influential.

DFBETAS

Influence on the Regression Coefficients: DFBETAS

\[ (DFBETAS)_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}c_{kk}}} \]
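None of these influence measures has to be computed by hand; base R provides them for any fitted lm object (a sketch using model_lm):

hatvalues(model_lm)        # leverage values h_ii
rstudent(model_lm)         # studentized deleted residuals t_i
dffits(model_lm)           # DFFITS
cooks.distance(model_lm)   # Cook's D
dfbetas(model_lm)          # DFBETAS

# influence.measures() collects all of these and flags potentially influential cases
influence.measures(model_lm)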