Introduction

“The Health Survey for England (HSE) is designed to monitor trends in the nation’s health. It began in 1991 and is carried out annually. The study provides regular information not available from other sources on a range of aspects concerning the public’s health, including many factors that affect health. Some core questions are included in every wave but each year’s survey also has a particular focus on a disease or condition or population group. The survey combines questionnaire-based answers with physical measurements and the analysis of blood samples. The dataset for this coursework is based on the Health Survey for England 2011. It includes data for 10,617 cases and 58 variables.”[1]

Data

Name      Explanation
omsysval  Systolic blood pressure
Age       Age
bmival    BMI
totalwu   Total units of alcohol per week
cigst1    Cigarette smoking status, coded as:
          1 = Never smoked cigarettes at all
          2 = Used to smoke cigarettes occasionally
          3 = Used to smoke cigarettes regularly
          4 = Current cigarette smoker

Libraries and loading/wrangling data.

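# libraries (tidyverse attaches ggplot2, dplyr and readr)
library(tidyverse)
# load data, keeping only the five variables used in this report
HSE_data <- read_csv('7402_F1.csv') %>% select(omsysval, Age, bmival, totalwu, cigst1)
# check for missing data marked with "*". None is found here; the summary below shows that uncollected values are instead coded as negative numbers.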
incompleteObs <- which(HSE_data$omsysval == "*" | HSE_data$Age == "*" | HSE_data$bmival == "*" | HSE_data$totalwu == "*" | HSE_data$cigst1 == "*")
length(incompleteObs)
## [1] 0
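
The check above finds no "*" markers, while the summary statistics show negative minimum values that stand for uncollected data. One possible way to make those codes explicit is to recode them as NA before analysis. The sketch below is illustrative only: the data frame `toy` and its values are invented stand-ins for HSE_data, and it assumes that every negative value is a missing-data code.

```r
# Toy data frame standing in for HSE_data (values invented for illustration).
toy <- data.frame(
  omsysval = c(-1, 163, 120.5, -8, 131),
  Age      = c(75, 84, 46, 15, 52),
  bmival   = c(25.3, -1, 31.2, 22.0, 27.4),
  totalwu  = c(0, 24.1, 12.0, -9, 0),
  cigst1   = c(3, 3, 1, -1, 4)
)

# Assumption: any negative value is a missing-data code, so recode it as NA.
toy_na <- toy
toy_na[toy_na < 0] <- NA

# Count missing values per column and the number of fully observed rows.
missing_per_column <- colSums(is.na(toy_na))
n_complete <- sum(complete.cases(toy_na))

print(missing_per_column)
print(n_complete)
```

With NA in place of the negative codes, functions such as `summary()` report the number of missing values separately instead of folding the codes into the mean and quartiles. A value of zero (for example zero units of alcohol) is kept as valid data.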

Viewing data and summary statistics.

# glimpse at the data to get a feel for structure and any interesting patterns or observations.
glimpse(HSE_data)
## Observations: 10,617
## Variables: 5
## $ omsysval <dbl> -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 163.0, -1.0, -1.0...
## $ Age      <int> 75, 47, 77, 66, 44, 66, 84, 63, 62, 74, 46, 44, 15, 1...
## $ bmival   <dbl> 25.32541, -1.00000, 25.58436, -1.00000, -1.00000, -1....
## $ totalwu  <dbl> 0.0580, 4.9910, 49.0290, 0.0000, 30.2300, 13.5580, 24...
## $ cigst1   <int> 3, 3, 1, 1, 2, 3, 3, 1, 3, 3, 1, 4, -1, -1, 1, 1, 1, ...
# view the data to get a feel for structure and any interesting patterns or observations.
#View(HSE_data)
# view summary statistics noting the negative min values which correspond to how the data has been coded as per the code book.
summary(HSE_data)
##     omsysval           Age             bmival         totalwu       
##  Min.   : -8.00   Min.   :  0.00   Min.   :-1.00   Min.   : -9.000  
##  1st Qu.: -1.00   1st Qu.: 22.00   1st Qu.:16.05   1st Qu.:  0.000  
##  Median : 92.50   Median : 42.00   Median :23.83   Median :  1.356  
##  Mean   : 62.38   Mean   : 41.56   Mean   :20.24   Mean   :  8.384  
##  3rd Qu.:122.00   3rd Qu.: 61.00   3rd Qu.:28.23   3rd Qu.: 10.615  
##  Max.   :203.50   Max.   :100.00   Max.   :65.28   Max.   :461.500  
##      cigst1      
##  Min.   :-9.000  
##  1st Qu.: 1.000  
##  Median : 1.000  
##  Mean   : 1.558  
##  3rd Qu.: 3.000  
##  Max.   : 4.000

1. Exploring Strength of Association.

i. Age vs Omron Valid Mean Systolic BP

# create a vector of labels to title the facet grid, following the cigst1 coding given in the Data section.
labels <- c("1" = "Never Smoked", "2" = "Smoked Occasionally", "3" = "Smoked Regularly", "4" = "Current Smoker")

# both variables are continuous, therefore a scatterplot is appropriate.
# filter where omsysval > 0 & cigst1 > 0 so that we only consider observations where data was actually collected as per the code book.
# create scatter plot of Age vs omsysval, coloured by smoker status.
# fit a line of best fit using the lm method, set standard error to FALSE and colour it blue.
# split into facets via cigst1 and title plots using the labels vector.
# add main title
# add theme, make adjustments and label the x axis and y axis.
HSE_data %>% filter(omsysval > 0 & cigst1 > 0) %>% ggplot(aes(x = Age, y = omsysval, colour = factor(cigst1))) + 
      geom_point(shape = 1, alpha= 0.6, na.rm = TRUE, position = "jitter") +
      geom_smooth(method = "lm", color = "blue", se = FALSE) +
      facet_grid(.~cigst1, labeller = labeller(cigst1 = labels)) +
      ggtitle("Fig.1 Age vs Omron Valid Mean Systolic BP") +
      theme_bw() +
      theme(legend.position = "none", plot.title = element_text(face = "bold", hjust = 0.5)) +
      labs(x = "Age", y = "Omron Valid Mean Systolic BP")

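# find the correlation between Age and Omron Valid Mean Systolic BP.
# We use Pearson's correlation as we assume the relationship to be linear.
# cor = cov(xdata, ydata) / (sd(xdata) * sd(ydata))
# note: this is computed on the unfiltered data, so the negative missing-data codes are included.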
cor(HSE_data$omsysval, HSE_data$Age)
## [1] 0.2559397
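
The correlation above is computed on the full data frame, whereas the plot uses only rows with valid readings. As a hedged illustration of the difference, the sketch below applies the same formula, cov(x, y) / (sd(x) * sd(y)), after filtering. The data are simulated (the data frame `toy` and all its values are invented), so the numbers it prints say nothing about the HSE data.

```r
set.seed(1)

# Simulated stand-in for HSE_data: BP rises with age, plus noise.
toy <- data.frame(Age = sample(16:90, 200, replace = TRUE))
toy$omsysval <- 100 + 0.35 * toy$Age + rnorm(200, sd = 15)

# Inject a missing-data code (-1) into the first 40 BP readings.
toy$omsysval[1:40] <- -1

# Keep only rows where BP was actually measured.
valid <- toy[toy$omsysval > 0, ]

# Pearson's r from the built-in function and from its definition.
r_builtin <- cor(valid$omsysval, valid$Age)
r_manual  <- cov(valid$omsysval, valid$Age) /
  (sd(valid$omsysval) * sd(valid$Age))

# For comparison: the same statistic with the -1 codes left in.
r_unfiltered <- cor(toy$omsysval, toy$Age)

print(c(filtered = r_builtin, manual = r_manual, unfiltered = r_unfiltered))
```

The filtered and manual values agree by construction. The unfiltered value differs because the -1 codes are treated as if they were real blood pressure readings.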

ii. BMI vs Omron Valid Mean Systolic BP

# create a vector of labels to title the facet grid, following the cigst1 coding given in the Data section.
labels <- c("1" = "Never Smoked", "2" = "Smoked Occasionally", "3" = "Smoked Regularly", "4" = "Current Smoker")

# both variables are continuous, therefore a scatterplot is appropriate.
# filter where omsysval > 0 & cigst1 > 0 & bmival > 0 so that we only consider observations where data was actually collected as per the code book.
# create scatter plot of BMI vs omsysval, coloured by smoker status.
# fit a line of best fit using the lm method, set standard error to FALSE and colour it blue.
# split into facets via cigst1 and title plots using the labels vector.
# add main title
# add theme, make adjustments and label the x axis and y axis.
HSE_data %>% filter(omsysval > 0 & cigst1 > 0 & bmival >0) %>% ggplot(aes(x = bmival, y = omsysval, colour = factor(cigst1))) + 
      geom_point(shape = 1, alpha= 0.6, na.rm = TRUE, position = "jitter") +
      geom_smooth(method = "lm", color = "blue", se = FALSE) +
      facet_grid(.~cigst1, labeller = labeller(cigst1 = labels)) +
      ggtitle("Fig.2 BMI vs Omron Valid Mean Systolic BP") +
      theme_bw() +
      theme(legend.position = "none", plot.title = element_text(face = "bold", hjust = 0.5)) +
      labs(x = "BMI", y = "Omron Valid Mean Systolic BP")

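# find the correlation between BMI and Omron Valid Mean Systolic BP.
# We use Pearson's correlation as we assume the relationship to be linear.
# cor = cov(xdata, ydata) / (sd(xdata) * sd(ydata))
# note: this is computed on the unfiltered data, so the negative missing-data codes are included.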
cor(HSE_data$omsysval, HSE_data$bmival)
## [1] 0.3036062

iii. Total Units of Alcohol per Week vs Omron Valid Mean Systolic BP

# create a vector of labels to title the facet grid, following the cigst1 coding given in the Data section.
labels <- c("1" = "Never Smoked", "2" = "Smoked Occasionally", "3" = "Smoked Regularly", "4" = "Current Smoker")

# both variables are continuous, therefore a scatterplot is appropriate.
# filter where omsysval > 0 & cigst1 > 0 & totalwu > 0 so that we only consider observations where data was actually collected as per the code book.
# note: totalwu > 0 also excludes respondents who reported zero units.
# create scatter plot of totalwu vs omsysval, coloured by smoker status.
# fit a line of best fit using the lm method, set standard error to FALSE and colour it blue.
# split into facets via cigst1 and title plots using the labels vector.
# add main title
# add theme, make adjustments and label the x axis and y axis.
HSE_data %>% filter(omsysval > 0 & cigst1 > 0 & totalwu > 0) %>% ggplot(aes(x = totalwu, y = omsysval, colour = factor(cigst1))) + 
      geom_point(shape = 1, alpha= 0.6, na.rm = TRUE, position = "jitter") +
      geom_smooth(method = "lm", color = "blue", se = FALSE) +
      facet_grid(.~cigst1, labeller = labeller(cigst1 = labels)) +
      ggtitle("Fig.3 Total Units of Alcohol per Week vs Omron Valid Mean Systolic BP") +
      theme_bw() +
      theme(legend.position = "none", plot.title = element_text(face = "bold", hjust = 0.5)) +
      labs(x = "Total Units of Alcohol per Week", y = "Omron Valid Mean Systolic BP")

# find the correlation between total units of alcohol per week and Omron Valid Mean Systolic BP.
# We use Pearson's correlation as we assume the relationship to be linear.
# cor = cov(xdata, ydata) / (sd(xdata) * sd(ydata))
# note: this is computed on the unfiltered data, so the negative missing-data codes are included.
cor(HSE_data$omsysval, HSE_data$totalwu)
## [1] 0.05760052

iv. Smoking Status vs Omron Valid Mean Systolic BP

# create a vector of labels following the cigst1 coding given in the Data section.
labels <- c("1" = "Never Smoked", "2" = "Smoked Occasionally", "3" = "Smoked Regularly", "4" = "Current Smoker")

# smoking status is discrete and BP is continuous, therefore a boxplot is appropriate.
# filter where omsysval > 0 & cigst1 > 0 so that we only consider observations where data was actually collected as per the code book.
# create boxplot of smoker status vs omsysval, coloured by smoker status.
# add main title
# add theme, make adjustments and label the x axis and y axis.
HSE_data %>% filter(omsysval > 0 & cigst1 > 0) %>% ggplot(aes(x = cigst1, y = omsysval, color = factor(cigst1))) + 
      geom_boxplot() +
      ggtitle("Fig.4 Smoking Status vs Omron Valid Mean Systolic BP", subtitle = "1 - Red - Never Smoked\n2 - Green - Used to Smoke Occasionally\n3 - Teal - Used to Smoke Regularly\n4 - Purple - Current Smoker") +
      theme_bw() +
      theme(plot.title = element_text(face = "bold", hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), legend.position = "none") +
      labs(x = "Smoking Status", y = "Omron Valid Mean Systolic BP")

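# find the correlation between smoking status and Omron Valid Mean Systolic BP.
# We use Pearson's correlation, which treats the category codes 1 to 4 as if they were numeric.
# cor = cov(xdata, ydata) / (sd(xdata) * sd(ydata))
# note: this is computed on the unfiltered data, so the negative missing-data codes are included.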
cor(HSE_data$omsysval, HSE_data$cigst1)
## [1] 0.1177323
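
Because smoking status is a category rather than a measured quantity, per-group summaries can be easier to read than a single correlation. The sketch below computes, for each group, the five-number summary that a boxplot displays and counts the points beyond 1.5 times the box height. It uses base R's `fivenum()` and `boxplot.stats()` on invented readings, not the HSE data. Note that `geom_boxplot()` computes its quartiles slightly differently from the hinges used here, so counts can differ at the margins.

```r
# Invented BP readings for three smoking-status groups (codes as in the Data section).
bp  <- c(110, 115, 120, 125, 130, 180,   # group 1: never smoked
         118, 122, 126, 130, 134,        # group 2: used to smoke occasionally
         112, 121, 129, 138, 147)        # group 4: current smoker
grp <- factor(c(rep(1, 6), rep(2, 5), rep(4, 5)))

# Five-number summary (min, lower hinge, median, upper hinge, max) per group.
five_num <- tapply(bp, grp, fivenum)

# Number of points lying more than 1.5 box-heights beyond the hinges, per group.
n_out <- tapply(bp, grp, function(x) length(boxplot.stats(x)$out))

print(five_num)
print(n_out)
```

In this toy example only group 1 has a flagged point (the reading of 180). In real data, larger groups tend to show more flagged points simply because they contain more observations.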

2. Using Multiple Linear Regression

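The relationship of interest is the effect of Age, BMI, total weekly units of alcohol and cigarette smoking status on systolic BP. Because cigst1 enters the model as a factor, it contributes one indicator term per estimated level rather than a single slope.

\[ \mathtt{Systolic\ BP}_i = \beta_0 + \beta_1 \mathtt{Age}_i + \beta_2 \mathtt{bmival}_i + \beta_3 \mathtt{totalwu}_i + \sum_{k=1}^{4} \beta_{4,k} \, \mathbf{1}[\mathtt{cigst1}_i = k] + \epsilon_i \]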

# filter the data so that observations that were not collected as per the code book are removed.
# note: totalwu > 0 also drops respondents who reported zero units, and cigst1 is not filtered here,
# so a value below 1 (a missing-data code) remains in the factor as its reference level.
# fit a multiple regression with omsysval as the dependent variable and the remaining variables in the data frame as the independent variables.
HSE_dataLM <- HSE_data %>% filter(omsysval > 0 & Age > 0 & bmival > 0 & totalwu > 0)
model <- lm(formula = omsysval ~ Age + bmival + totalwu + as.factor(cigst1), data = HSE_dataLM)
summary(model)
## 
## Call:
## lm(formula = omsysval ~ Age + bmival + totalwu + as.factor(cigst1), 
##     data = HSE_dataLM)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.114  -9.967  -1.168   8.604  74.969 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        82.55206   15.21701   5.425 6.17e-08 ***
## Age                 0.34312    0.01507  22.772  < 2e-16 ***
## bmival              0.55217    0.04943  11.170  < 2e-16 ***
## totalwu             0.09220    0.01339   6.887 6.66e-12 ***
## as.factor(cigst1)1  9.98025   15.10922   0.661    0.509    
## as.factor(cigst1)2  9.22480   15.13619   0.609    0.542    
## as.factor(cigst1)3  9.85581   15.10972   0.652    0.514    
## as.factor(cigst1)4 10.45043   15.12242   0.691    0.490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.1 on 3655 degrees of freedom
## Multiple R-squared:  0.189,  Adjusted R-squared:  0.1875 
## F-statistic: 121.7 on 7 and 3655 DF,  p-value: < 2.2e-16
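
The output above lists coefficients for all four smoking categories, which means the factor's reference level is some other value of cigst1 (a code below 1). The sketch below shows one possible alternative, not the model fitted in this report: exclude non-positive cigst1 codes and use an explicit factor with "never smoked" as the reference. The data are simulated, and the names `toy`, `valid`, `smoker` and the level labels are invented for illustration. It also keeps totalwu = 0, which is an assumption that differs from the filter used above.

```r
set.seed(42)
n <- 300

# Simulated stand-in for HSE_data, including a -1 missing-data code in cigst1.
toy <- data.frame(
  Age     = sample(16:90, n, replace = TRUE),
  bmival  = round(rnorm(n, mean = 27, sd = 5), 1),
  totalwu = round(rexp(n, rate = 1 / 8), 1),
  cigst1  = sample(c(-1, 1, 2, 3, 4), n, replace = TRUE,
                   prob = c(0.05, 0.50, 0.10, 0.20, 0.15))
)
toy$omsysval <- 85 + 0.35 * toy$Age + 0.55 * toy$bmival +
  0.09 * toy$totalwu + rnorm(n, sd = 15)

# Drop rows with missing-data codes; zero alcohol units is kept as valid.
valid <- subset(toy, omsysval > 0 & Age >= 0 & bmival > 0 &
                  totalwu >= 0 & cigst1 > 0)

# Explicit factor: the first level ("Never") becomes the reference category.
valid$smoker <- factor(valid$cigst1, levels = 1:4,
                       labels = c("Never", "ExOccasional", "ExRegular", "Current"))

fit <- lm(omsysval ~ Age + bmival + totalwu + smoker, data = valid)
coef_names <- names(coef(fit))
print(coef_names)
```

With four levels and an intercept, the fit has three smoking coefficients, each a contrast against never-smokers, which makes the estimates directly interpretable.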

Explanation of findings:

This section analyses the findings above in two parts: first, the strength of association between systolic BP and each of Age, BMI, total weekly units of alcohol consumed and cigarette smoking status, using a fitted linear line and the correlation coefficient; second, a multiple linear regression assessing the combined effect of these variables on systolic BP. Strength of association is defined as follows: “A measure of association may be determined by any of several different analyses, including correlation analysis and regression analysis. (Although the terms correlation and association are often used interchangeably, correlation in a stricter sense refers to linear correlation, and association refers to any relationship between variables.)”[2].

i. Age vs Omron Valid Mean Systolic BP

In Fig.1 age is compared to BP, with a separate panel for each smoking status. The fitted line is almost identical in every panel, so although BP increases with age, smoking status does not appear to alter that relationship. The correlation coefficient of 0.256 (3 d.p.) indicates a weak positive linear correlation between Age and BP: BP rises slowly with age, which is plausible. This shows strength of association only, not a causal effect. The coefficient was computed on the full data set, including the negative missing-data codes for BP, so it should be read with some caution. The summary statistics give a range of 0 to 100 for Age; both values are plausible, so erroneous outliers in Age need not be considered when interpreting this correlation.

ii. BMI vs Systolic BP

In Fig.2 BMI is compared to BP, with a separate panel for each smoking status.
The fitted line is almost identical in every panel except panel 1, where the increase in BP with BMI is more pronounced. Under the coding given in the Data section, cigst1 = 1 is "never smoked", so the more pronounced slope belongs to those who never smoked rather than to current smokers, and the figure does not show current smoking strengthening the BMI effect. The correlation coefficient of 0.304 (3 d.p.) indicates a weak to moderate positive linear correlation between BMI and BP: as BMI increases, so does BP. This shows strength of association only, not a causal effect. Excluding the negative missing-data codes, BMI ranges from 8.34011 to 65.27721; both values are plausible, so erroneous outliers in BMI need not be considered when interpreting this correlation.

iii. Total Units of Alcohol per Week vs Systolic BP

In Fig.3 total weekly units of alcohol consumed is compared to BP, with a separate panel for each smoking status. The fitted line is almost identical in every panel except panel 2 (used to smoke occasionally), where the increase in BP with weekly units is more pronounced. This is somewhat unexpected, as current or former regular smoking might be expected to have the most influence; occasional smoking may be linked to other lifestyle choices that affect both weekly units and BP. The correlation coefficient of 0.058 (3 d.p.) indicates a very weak positive linear correlation between total weekly units and BP. This shows strength of association only, not a causal effect.
Excluding the negative missing-data codes, totalwu ranges from 0 to 461.5. The upper end of this range seems implausible but is not medically impossible, so erroneous outliers are not considered further for this correlation.

iv. Smoking Status vs Systolic BP

In Fig.4 BP is compared across smoking status using a separate boxplot for each category. “Boxplots are a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”)”.[3] The median BP is similar across categories, except that group 3 (used to smoke regularly) has a higher median than the others. The shorter lower whisker for group 2 (used to smoke occasionally) shows less downward variability in BP. None of the four boxplots appears skewed. Group 1 has the most outliers, that is, points lying far from the rest of their group.[4] Under the coding given in the Data section, group 1 is "never smoked", so this does not by itself indicate a negative impact of current smoking on BP; the number of outliers also tends to grow with group size. The correlation coefficient of 0.118 (3 d.p.) indicates a very weak positive linear correlation between smoking status and BP. Because smoking status is a category code with four levels rather than a measured quantity, this coefficient gives limited insight into the strength of association. As before, it shows association only, not a causal effect.

Multiple linear regression with systolic BP as the dependent variable and Age, BMI, total weekly units of alcohol consumed and cigarette smoking status as the independent variables fits a linear model to the data; inference from that model is valid if certain assumptions regarding the errors are satisfied.
These include that the errors have an expected value of zero, are independent and have constant variance.[5] These assumptions will be examined further in the analysis.

The coefficients are interpreted as follows. The intercept has no practical interpretation because it corresponds to age and BMI both being zero. The coefficient \(\beta_1\) on Age is statistically significant at the 0.1% level: holding the other variables constant, a 1 year increase in age is associated with a 0.34312 mmHg increase in Omron Valid Mean Systolic BP. The coefficient \(\beta_2\) on bmival is statistically significant at the 0.1% level: a 1 unit increase in BMI is associated with a 0.55217 mmHg increase in BP. The coefficient \(\beta_3\) on totalwu is statistically significant at the 0.1% level: a 1 unit increase in total weekly units of alcohol is associated with a 0.09220 mmHg increase in BP. The coefficients on the smoking status factor levels (the \(\beta_4\) term) are not statistically significant, so they are not interpreted further. These coefficients are contrasts against the factor's reference level, which here is a value below 1 (a missing-data code), because cigst1 was not filtered before fitting.

The model has a multiple R^2 of 0.189 and an adjusted R^2 of 0.1875, so it explains about 19% of the variation of the response around its mean. This value is low; however, low R^2 values are not uncommon in the social sciences.
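
The error assumptions listed above (zero mean, independence, constant variance) can be given a first numeric look using the residuals of a fitted model. The sketch below is a rough illustration on simulated data, not a check of the model in this report: the data frame `toy`, the two-predictor model and the summary statistics chosen are all assumptions made for the example.

```r
set.seed(7)
n <- 200

# Simulated data with constant-variance, independent errors.
toy <- data.frame(
  Age    = sample(16:90, n, replace = TRUE),
  bmival = rnorm(n, mean = 27, sd = 5)
)
toy$omsysval <- 85 + 0.35 * toy$Age + 0.55 * toy$bmival + rnorm(n, sd = 15)

fit <- lm(omsysval ~ Age + bmival, data = toy)
res <- resid(fit)
fitted_vals <- fitted(fit)

# Zero mean: with an intercept, least-squares residuals average to zero by
# construction, so this mainly confirms the fit rather than the assumption.
mean_resid <- mean(res)

# Constant variance: compare residual spread for low vs high fitted values.
upper_half <- fitted_vals > median(fitted_vals)
sd_ratio <- sd(res[upper_half]) / sd(res[!upper_half])

# Independence (rough): correlation between consecutive residuals in row order.
lag1_cor <- cor(res[-1], res[-n])

print(c(mean_resid = mean_resid, sd_ratio = sd_ratio, lag1_cor = lag1_cor))
```

A spread ratio near 1 and a lag-1 correlation near 0 are consistent with the assumptions, though neither proves them. The visual counterparts are the residuals-versus-fitted and normal Q-Q plots produced by `plot(fit, which = 1:2)`.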

References

[1] MA50258: Assessed Coursework 1 https://moodle.bath.ac.uk/pluginfile.php/1328970/mod_resource/content/1/MA50258_CW1.pdf

[2] Measure of association https://www.britannica.com/topic/measure-of-association

[3] Understanding Boxplots https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

[4] Reading and Interpreting Box Plots https://magoosh.com/statistics/reading-interpreting-box-plots/

[5] MA50258: Lab 3 Solutions https://moodle.bath.ac.uk/course/view.php?id=58412&section=3