Statistics Spring Coursework

In this report, I will be investigating associations between the variables age,gest,sex,smokes,weight and rate with birth weight (bwt). I will use data from 327 children to eventually produce a refined model that accurately predicts bwt. To begin I will explore the initial data, then will apply a transformation. After looking at various plots, comparing different models and their significance, I will remove variables that do not have a significant association to improve the model further. Then finally I will remove outliars and high leverage points from the model.

Part B of this will be using my final model from part A, to predict the birthweight of 100 individuals. I will use plots, MSE and distribution of both the predicted and actual birthweight to compare and see the effectiveness of the model.

## 'data.frame':    327 obs. of  7 variables:
##  $ age   : int  25 34 29 23 26 30 27 26 40 19 ...
##  $ gest  : int  294 252 280 280 266 245 273 273 287 287 ...
##  $ sex   : chr  "Female" "Female" "Male" "Female" ...
##  $ smokes: chr  "No" "No" "Heavy" "Light" ...
##  $ weight: num  65.2 58.6 57.3 65.9 59.1 49.1 69.5 75 82 58.3 ...
##  $ rate  : num  -0.01618 0.00678 0.03293 -0.02179 0.06346 ...
##  $ bwt   : int  3870 2440 3400 2820 2940 2400 3080 3600 3600 3930 ...

This string shows the content of the Training data. The variables ‘age’,‘bwt’ and ‘gest’ are integers, ‘weight’ and ‘rate’ are numeric, with ‘rate’ being the only variable that has negative values. ‘sex’ and ‘smokes’ are not numeric variables.

Now I will look into summary plots of the full linear model to initially highlight key variables that are associated with birthweight.

## 
## Call:
## lm(formula = bwt ~ gest + weight + age + smokes + rate + sex, 
##     data = Births)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1141.01  -262.64    -9.88   255.56  1801.05 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4010.687    584.583  -6.861 3.57e-11 ***
## gest           23.434      2.045  11.458  < 2e-16 ***
## weight          8.979      2.410   3.726  0.00023 ***
## age             8.262      4.326   1.910  0.05704 .  
## smokesLight   111.809    157.627   0.709  0.47864    
## smokesNo      167.604    116.365   1.440  0.15075    
## rate         1711.712    719.352   2.380  0.01792 *  
## sexMale        83.385     45.429   1.835  0.06736 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 408.5 on 319 degrees of freedom
## Multiple R-squared:  0.3631, Adjusted R-squared:  0.3491 
## F-statistic: 25.98 on 7 and 319 DF,  p-value: < 2.2e-16

Based on the summary output, ‘gest’, ‘weight’ and ‘rate’ are significant variables as all have a p-value <0.05. These findings suggest strong statistical evidence for their roles in influencing birth weight. Although the other variables currently show higher p-values, indicating less statistical significance, I will not eliminate any variables at this point and will investigate further.

To visualise the association between birth weight, here is a scatter plot matrix.

Corresponding numerical values for the Pearson sample correlation coefficients between these variables
	weight	rate	bwt	gest	age
weight	1.0000000	0.0562945	0.2695799	0.1433880	0.1782062
rate	0.0562945	1.0000000	0.1148574	-0.0269921	0.1603730
bwt	0.2695799	0.1148574	1.0000000	0.5402130	0.0868098
gest	0.1433880	-0.0269921	0.5402130	1.0000000	-0.0976735
age	0.1782062	0.1603730	0.0868098	-0.0976735	1.0000000

The correlation coefficients shows that the variables in order of strongest to weakest linear relationship with birthweight is ‘gest’, ‘weight’,‘rate’ and finally ‘age’. This agrees with the previous summary as ‘age’ was not statistically significant and has the lowest positive correlation coefficient, I will still not remove this variable at this point and investigate what the effect of transformations has on significance.

I removed the ‘smokes’ and ‘sex’ variable from this scatter plot matrix as they are not numeric variables. These will also not be changed by the transformation.

I have chosen to use the transformation of log(x+1), in hope of having a better fit model, it is (x+1) as it makes the ‘gest’ variable values now all postive, hence all values can now be logged. Here is a summary and a component and residual plot for the transformed variables.

## 
## Call:
## lm(formula = log_bwt ~ log_weight + log_gest + log_rate + log_age + 
##     sex + smokes, data = Births)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34080 -0.07406  0.00367  0.07339  0.37943 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.63995    0.90410  -5.132 4.99e-07 ***
## log_weight   0.15721    0.04362   3.604 0.000363 ***
## log_gest     2.09696    0.15986  13.118  < 2e-16 ***
## log_rate     0.51611    0.21512   2.399 0.017005 *  
## log_age      0.07803    0.03534   2.208 0.027957 *  
## sexMale      0.02431    0.01312   1.853 0.064822 .  
## smokesLight  0.04770    0.04555   1.047 0.295766    
## smokesNo     0.06235    0.03363   1.854 0.064623 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.118 on 319 degrees of freedom
## Multiple R-squared:  0.4171, Adjusted R-squared:  0.4043 
## F-statistic: 32.61 on 7 and 319 DF,  p-value: < 2.2e-16

Bases on the summary statistics of the transformed variables, ‘weight’, ‘gest’, ‘rate’ and ‘age’ are statistically significant as p-value <0.05. ‘gest’ and ‘rate’ p-value<0.001, indicating these are very significant variables. The component and residual plots allow us to fit the fitted model and the mean residual to see where the residual mean and model disagree. For the variables ‘age’.’weight’ and ‘rate’ both lines fit on top of each other fairly well suggesting that the transformation makes a good fit. Towards the ends there is some disagreement, suggesting possible outliars or high leverage points. For the ‘gest’ variable there is a large disagreement for low values, indicating outliars or high leverage points, especially the first point at 5.41 which makes the pink line (model) clearly deviate away from the mean residual. Therfore later on I will use a method to remove the outlairs/high influence points.

Here is two QQ plots of the residuals, the first one being the full model and the second being the transformed model.

## [1] 137 290

## [1] 109 137

The QQ plots provide a visual indication of how well the residuals of the models fit to a normal distribution. The transformed model shows a better fit in terms of the normality assumption of the residuals as there is a better alignment. However, there are some points towards the ends that curve upwards from reference line, such as ‘109’ and ‘137’,these likely represent data points of high leverage or outliars which require further attention as they will skew results if left.

To explore further the associations between birth weight and other variables, I will use step wise regression which will remove predictors that are not statistically valid.

## Start:  AIC=-1389.52
## log_bwt ~ log_weight + log_gest + log_rate + log_age + sex + 
##     smokes
## 
##              Df Sum of Sq    RSS     AIC
## - smokes      2   0.04975 4.4945 -1389.9
## <none>                    4.4448 -1389.5
## - sex         1   0.04784 4.4926 -1388.0
## - log_age     1   0.06793 4.5127 -1386.6
## - log_rate    1   0.08020 4.5250 -1385.7
## - log_weight  1   0.18098 4.6257 -1378.5
## - log_gest    1   2.39754 6.8423 -1250.5
## 
## Step:  AIC=-1389.88
## log_bwt ~ log_weight + log_gest + log_rate + log_age + sex
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    4.4945 -1389.9
## - sex         1   0.05449 4.5490 -1387.9
## - log_age     1   0.06770 4.5622 -1387.0
## - log_rate    1   0.07529 4.5698 -1386.5
## - log_weight  1   0.19338 4.6879 -1378.1
## - log_gest    1   2.44270 6.9372 -1250.0

## 
## Call:
## lm(formula = log_bwt ~ log_weight + log_gest + log_rate + log_age + 
##     sex + smokes, data = Births)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34080 -0.07406  0.00367  0.07339  0.37943 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.63995    0.90410  -5.132 4.99e-07 ***
## log_weight   0.15721    0.04362   3.604 0.000363 ***
## log_gest     2.09696    0.15986  13.118  < 2e-16 ***
## log_rate     0.51611    0.21512   2.399 0.017005 *  
## log_age      0.07803    0.03534   2.208 0.027957 *  
## sexMale      0.02431    0.01312   1.853 0.064822 .  
## smokesLight  0.04770    0.04555   1.047 0.295766    
## smokesNo     0.06235    0.03363   1.854 0.064623 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.118 on 319 degrees of freedom
## Multiple R-squared:  0.4171, Adjusted R-squared:  0.4043 
## F-statistic: 32.61 on 7 and 319 DF,  p-value: < 2.2e-16

Here, the step wise regression model has removed the ‘smokes’ variables, it did not provide a statistically significant contribution to the model and with it removed, the AIC is lower, from AIC=-1389.52 in the full transformed model to AIC=-1389.88 when removing the smokes variable, hence indicating an improvement in the model. The summary after this removal also shows ‘gest’,‘weight’,‘age’ and ‘rate’ variables as significant as P-value<0.05. Going forward, I will remove the ‘smokes’ variable from my model. The step section indicates that after removing ‘smokes’, the next step was to consider removing more variables, however removing any other variable would increase the AIC which is not desirable, so implies no further variables should be removed at this step. I will do further investigation.

Now I will investigate the ‘sex’ variable as it has not been captured previously in the correlation coefficient. First I will use two different models, one looking at the transformed model with ‘weight’,‘rate’,‘gest’,‘age’ and ‘sex’ and then removing the ‘sex’ variable to test its contribution to the model.

## Analysis of Variance Table
## 
## Model 1: log_bwt ~ log_weight + log_rate + log_gest + log_age
## Model 2: log_bwt ~ log_weight + log_rate + log_gest + log_age + sex
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    322 4.5490                              
## 2    321 4.4945  1  0.054488 3.8915 0.04939 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, when comparing the two models, the significance level is p = 0.04939, which is just below alpha= 0.05. This indicates that the inclusion of sex as a variable is statistically significant. However it is just below the threshold so I will investigate further. Now I will look at comparing two new models - one including the sex variable and one as a baseline model with just an intercept (representing the mean birthweight regardless of sex)

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Analysis of Variance Table
## 
## Model 1: log_bwt ~ 1
## Model 2: log_bwt ~ sex
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    326 7.6252                           
## 2    325 7.5659  1  0.059263 2.5457 0.1116

The p-value, 0.1116>0.05, indicating inclusion of ‘sex’ does not significantly improve the model’s fit for predicting birthweight. It may be that previously , there was association between other variables (not birthweight) and sex which caused it to be significant. However as we are focusing on the association with birthweight, I will remove this variable.This decision is also reinforced when looking at the summary statistics of both the full model and transformed model, in both the p-values>0.05, the ‘sex’ variable is not statistically significant. Therefore I will remove it from my model, it will not only simplify but also improve it as it only focuses on relevant variables.

Now I will investigate the ‘age’ variable, as initially it had the lowest correlation coefficient compared to the other variables. I will compare two models one including the transformed variables ‘weight’, ‘rate’, ‘gest’ and ‘age’, and the other model removing ‘age’ to investigate its contribution to the model.

## `geom_smooth()` using formula = 'y ~ x'

## Analysis of Variance Table
## 
## Model 1: log_bwt ~ log_weight + log_rate + log_gest
## Model 2: log_bwt ~ log_weight + log_rate + log_gest + log_age
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    323 4.6202                              
## 2    322 4.5490  1  0.071228 5.0419 0.02542 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the ANOVA table, he P-value is 0.02542<0.05, suggesting that keeping in the ‘age’ variable is statistically significant. Furthermore, including ‘age’ significantly reduces the Residual Sum of Squares (RSS) compared to the model without it. This decrease in RSS indicates that ‘age’ explains a significant portion of the variance in birthweight that is not explained by the other variables in the model. Also its significant agrees with summary statistics of the transformed variables too and stepmodel. Therefore I will keep in the ‘age’ variable.

Now I have a model, transformed by log(x+1) and only including the variable age’,‘gest’,‘weight’ and ‘rate’ as they are all statistically significant.

## 
## Call:
## lm(formula = log_bwt ~ log_age + log_gest + log_weight + log_rate, 
##     data = Births)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40462 -0.06777  0.00422  0.08046  0.38601 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.70291    0.90993  -5.168 4.15e-07 ***
## log_age      0.07987    0.03557   2.245 0.025420 *  
## log_gest     2.11875    0.16066  13.188  < 2e-16 ***
## log_weight   0.15891    0.04375   3.632 0.000327 ***
## log_rate     0.49934    0.21627   2.309 0.021584 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1189 on 322 degrees of freedom
## Multiple R-squared:  0.4034, Adjusted R-squared:  0.396 
## F-statistic: 54.44 on 4 and 322 DF,  p-value: < 2.2e-16

Previously, I showed how ‘age’,‘weight’ and ‘rate’ fit well looking at the component and residual plots suggesting a good model, however ‘gest’ variable does not fit well. I will look at cooks distance to mitigate the impact of these potential outliers or high leverage points, in hope to improve the model’s accuracy and the predictive reliability.

## [1] 133  43 231 236

##        133         43        231        236 
## 0.11847053 0.08788170 0.05984093 0.05479554

##       StudRes         Hat      CookD
## 43  -2.158946 0.087045100 0.08788170
## 133 -3.548338 0.046475119 0.11847053
## 137  3.313889 0.009782872 0.02104667
## 231 -1.845790 0.081286447 0.05984093

##        StudRes        Hat       CookD
## 58   0.4289587 0.05862437 0.002297694
## 137  3.5449722 0.01016817 0.024912695
## 143 -3.0370556 0.02380441 0.043849803
## 187 -1.6040700 0.06256876 0.034178400
## 290  3.2787576 0.01180169 0.024913377

## 
## Call:
## lm(formula = log_bwt ~ log_weight + log_gest + log_rate + log_age, 
##     data = Births[-largest.cooks, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33822 -0.06964  0.00559  0.07338  0.39551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.14516    0.98345  -3.198  0.00152 ** 
## log_weight   0.16996    0.04207   4.040 6.72e-05 ***
## log_gest     1.84334    0.17392  10.599  < 2e-16 ***
## log_rate     0.49280    0.20910   2.357  0.01904 *  
## log_age      0.06440    0.03433   1.876  0.06161 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1142 on 318 degrees of freedom
## Multiple R-squared:  0.3301, Adjusted R-squared:  0.3217 
## F-statistic: 39.18 on 4 and 318 DF,  p-value: < 2.2e-16

## Calls:
## 1: lm(formula = log_bwt ~ log_age + log_gest + log_weight + log_rate, data =
##    Births)
## 2: lm(formula = log_bwt ~ log_weight + log_gest + log_rate + log_age, data =
##    Births[-largest.cooks, ])
## 
##             Model 1 Model 2
## (Intercept)  -4.703  -3.145
## SE            0.910   0.983
##                            
## log_age      0.0799  0.0644
## SE           0.0356  0.0343
##                            
## log_gest      2.119   1.843
## SE            0.161   0.174
##                            
## log_weight   0.1589  0.1700
## SE           0.0438  0.0421
##                            
## log_rate      0.499   0.493
## SE            0.216   0.209
##

This reduces the cooks distance from 0.118 to 0.0438, this is a significant decrease so the model has improved by this, it is less sensitive to individual data points. The four points, ‘43’,‘133’,‘137’ and ‘231’ make sense to be removed as they are visually shown by larger bubbles and of darker colour. Also 131 and 137 are also positioned towards the right of the plot indicating high leverage. Furthermore looking at the coefficients table for the model with and wihout the cooks distances, for ‘gest’ variable there is significant decrease in the regression coefficient from 2.119 to 1.843, also the standard error slightly increased, suggesting the removed points had a disproportionate influence. This improvement is demonstrated in the new component and residual plot for ‘gest’, which now shows one fluctuating on top of eachother, there is also less disagreement towards the ends and a more consistent pattern of residuals around the regression line. This indicates that the final model provides a more accurate and reliable estimation of the relationship between ‘gest’ and birthweight, reducing the potential distortion caused by outliers.

Therefore I will now have a final model which including the variables ‘gest’,‘rate’,‘age’ and ‘weight’, transformed by log(x+1) and with removed influence points.

‘lm(log_bwt~log_weight+log_gest+log_rate,data=Births[-largest.cooks,])’

To conclude, and further reinforcing this decision to exclude the variables ‘smokes’ and ‘sex’ variable, we look at the best subsets regression model.

## Subset selection object
## Call: regsubsets.formula(log_bwt ~ log_age + log_gest + sex + smokes + 
##     log_weight + log_rate, data = Births)
## 7 Variables  (and intercept)
##             Forced in Forced out
## log_age         FALSE      FALSE
## log_gest        FALSE      FALSE
## sexMale         FALSE      FALSE
## smokesLight     FALSE      FALSE
## smokesNo        FALSE      FALSE
## log_weight      FALSE      FALSE
## log_rate        FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          log_age log_gest sexMale smokesLight smokesNo log_weight log_rate
## 1  ( 1 ) " "     "*"      " "     " "         " "      " "        " "     
## 2  ( 1 ) " "     "*"      " "     " "         " "      "*"        " "     
## 3  ( 1 ) " "     "*"      " "     " "         " "      "*"        "*"     
## 4  ( 1 ) "*"     "*"      " "     " "         " "      "*"        "*"     
## 5  ( 1 ) "*"     "*"      "*"     " "         " "      "*"        "*"     
## 6  ( 1 ) "*"     "*"      "*"     " "         "*"      "*"        "*"     
## 7  ( 1 ) "*"     "*"      "*"     "*"         "*"      "*"        "*"

## [1] 34.321813 17.980424 12.592947  9.480900  7.570325  7.096783  8.000000

From the summary it shows that the order of variables in terms of desirability in the model is ‘gest’, ‘weight’, ‘rate’ and ‘age’ then variables relating to ‘sex’ and ‘smokes’ hence this observation aligns with my decision to exclude ‘sex’ and ‘smokes’ from the final model due to their minimal contribution to improving the fit. From looking at the BIC diagram, it suggests the model first including ‘gest’, ‘weight’ and ‘rate’, the the secondly the model ‘age’, ‘gest’ ‘weight’ and ‘rate’ which is the one we chose as the final model. It indicating ‘age’ is not in the top model is not unexpected, but from my previous models investigating this variable I decided it did have a significant impact, also having four variables means that the model is not too simple either.

Another potential good model could be transformed by log(x+1) and include all the variables apart from ‘smokes’ - this is agreement with the cp as it has the lowest value of 7.096783 and also what the stepwise selection showed. However, I am favouring a model which is more simpler model with less variables, but that all the variables have a more significant impact.

The model I have chosen includes the most relevant variables in term of associated to birthweight but is refined and less complex which reduces possibility of over fitting, ensuring it maintains robustness.

To finally conclude I will look at the histograms, comparing the full model, transformed model and the final model.

All three models are unimodal and roughly follow normal distribution,symmetric about 0. The histogram of the full model has a fairly central tendency around 0, so centered around the mean however it does not peak sharply. There are also only 7 bars compared to the other which have considerably more, this implies an overfitted distribution which makes sense as the model has not been improved yet. The histogram of the transformed model is improved, it centers around 0 and has a defined peak. However it appears to be more negatively skewed, this is due to the presence of outliars which have not yet been removed. The final model if further improved as there is now skew, as the points found from cooks distance have been removed. It has a central tendency, which indicates it does not have bias or is under/over estimated, it also is relatively symmetrical which is a key characteristic of normal distribution.

Part B

This is the summary of log(x+1) of the ‘bwt’ variable in the Test data set, as my final model will predict ‘bwt’ with this transformation too. This allows for a correct comparison to be made.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.669   8.063   8.161   8.150   8.244   8.487

Now I will use the final model I found in part a to predict log(bwt+1).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.973   8.122   8.149   8.150   8.187   8.266

The mean value is the same for both the actual and predicted. Median is also very similar before and after applying the model, suggesting that the model effectively captures the median and average behaviors of the data set. IQR for predicted is 0.065, for actual is 0.181, therefore with the model it reduces the impact of out liars or extreme values in the data. Furthermore, the maximum and minimum values of predicted birth weight is more contained compared to actual, it moderates extreme values which may make more stable predictions. Also, while the predictions from the model show a more contained range, they do not oversimplify the complexities of the data. The figures across key statistical measures are very similar before and after applying the model, suggesting a strong alignment with the original data distribution. This similarity implies that the model is well-calibrated and provides a good fit, accurately reflecting the central characteristics of the dataset without losing essential details.

Now I will plot the prediction vs actual ‘bwt’, where both have been transformed by log(x+1)

The model shows a generally good fit for most of the data points, as evidenced by the cluster of points close to the ideal line. There’s a clear cluster of points around the line, particularly in the middle range of values (around actual birthweights of 8.0 to 8.2), suggesting the model predicts these values with a high degree of accuracy. Towards very low and very high values, for example the point at 7.6 and 8.5, these show the model may be less accurate when predicting extreme birthweight values. Therefore there are potential areas for model improvement to ensure that these extreme birthweights are captured for in the model.

## [1] 0.01231915

The mean squared error for the predictors compared to the actual values is relatively small, this low value indicates a high degree of accuracy in the model’s predictions, as the errors between predicted and actual values are minor on average.

Finally I will do residual plots to see the fit of the regression model.

The variable ‘rate’ and ‘weight’ has a very good fit as the residual distribution is uniformly scattered around the zero line implying these variables are well accounted for in the model, also it maintains homoscedasticity which is crucial for reliability of regression results. The variable ‘gest’ residuals has a much better fit with this Test data then the Train data, the residuals are evenly spread above and below zero, suggesting variability is consistent, there are a few points that deviate away from the zero line but these do not form a pattern, there could be further outliars so potentially investigating if there are any more high leverage points in this variable would be wise. For the variable ‘age’, at low ages (2.8) it deviates up away from the zero line, this is bad but is only an isolated deviation, not systematic noise. Therefore it may be wise to do further investigation into the ‘age’ variable, potentially refining the model further by removing more high influence points.

Statistics Spring Coursework

Sophia Forbes

2024-03-25