mydata <- read.csv("./HW3.csv",
                    header = TRUE,
                    sep = ';',
                    dec = ',')

head(mydata)
##   id age education sex is_smoking cigsPerDay totChol sysBP diaBP BMI heartRate
## 1  1  36         4   0          0          0     212 168.0    98  30        72
## 2  2  46         1   1          1         10     250 116.0    71  20        88
## 3  3  50         1   0          1         20     233 158.0    88  28        68
## 4  4  64         1   1          1         30     241 136.5    85  26        70
## 5  5  61         3   1          0          0     272 182.0   121  33        85
## 6  6  61         1   0          0          0     238 232.0   136  25        75
##   glucose TenYearCHD
## 1      75          0
## 2      94          0
## 3      94          1
## 4      77          0
## 5      65          1
## 6      79          0

Description of the variables:

id: Number of the patient/resident of Framingham (row identifier)

Age: Age of the patient/resident of Framingham (in years)

education: Level of education (level 1 - level 4)

sex: Gender of the patient (0- male patient, 1- female patient)

is_smoking: Whether or not the patient/resident is a current smoker (YES or NO)?

cigsPerDay: Whether or not the patient is a current smoker (YES or NO)?

totChol: Total Cholesterol Level (in mg/dL) for individual

sysBP: Systolic Blood Pressure (in mmHG) for individual

diaBP: Diastolic Blood Pressure (in mmHG) for individual

BMI: Body Mass Index (in kg/m2) for individual

heartRate: Heart Rate (in x times) for individual

glucose: Glucose Level (in mg/dL) for individual

TenYearCHD: Individual’s 10 year risk of coronary heart disease CHD (Yes = 1 or No = 0)

Sample has 150 observations. Unit of observation is a patient/resident of the town of Framingham, Massachusetts. The dataset was found on the Kaggle website (link: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea).

The main goal of this analysis is to find out if patients with higher BMI have higher diabolic blood pressure, have higher glucose level, are older and are male patients. From the internet, we can say that age and gender influence BMI (link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4184278/,http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-53072017000300276).

Below, we will show the variables that will be used in the analysis. Then we will also rename them.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata[, c(2,4,9,10,12)]
colnames(mydata) <- c('Age','Gender','Diastolic_Blood_Pressure','BMI','Glucose_Level')
head(mydata)
##   Age Gender Diastolic_Blood_Pressure BMI Glucose_Level
## 1  36      0                       98  30            75
## 2  46      1                       71  20            94
## 3  50      0                       88  28            94
## 4  64      1                       85  26            77
## 5  61      1                      121  33            65
## 6  61      0                      136  25            79

Below, we will put the variable Gender into the factor variable.

mydata$GenderF <- factor(mydata$Gender,
                         levels = c(0,1),
                         labels = c('M','F'))
head(mydata)
##   Age Gender Diastolic_Blood_Pressure BMI Glucose_Level GenderF
## 1  36      0                       98  30            75       M
## 2  46      1                       71  20            94       F
## 3  50      0                       88  28            94       M
## 4  64      1                       85  26            77       F
## 5  61      1                      121  33            65       F
## 6  61      0                      136  25            79       M

Definition of the regression model

With this regression model, we want to identify the relationship between different variables in our dataset.

Variables:

Age - independent variable

Gender - independent variable

Diabolic_Blood_Pressure - independent variable

BMI - dependent variable

Glucose_Level - independent variable

With this regression model, we want to see what is the magnitude of the relationship of these variables and its statistical significance. We want to further examine how independent variables influence dependent variable. From the stated variables above, we can see that we have chosen BMI as a dependent variable and other variables (Age, Gender, Diabolic_Blood_Pressure and Glucose_Level) as independent. All this will present us an insight in how different factors/variables influence the BMI of the patient/resident.

Scatterplot

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(mydata[,c(4,1,3,5)], 
                  smooth = FALSE) 

From the table of graphs above, we can see that there is a positive relationship between BMI and other independent variables (Age, Diastolic Blood Pressure, Glucose Level).

Regression model

fit1 <- lm(BMI ~ Age + Diastolic_Blood_Pressure + Glucose_Level + GenderF, data = mydata)

Multicolinearity

library(car)
library(dplyr)
vif(fit1)
##                      Age Diastolic_Blood_Pressure            Glucose_Level 
##                 1.095947                 1.066858                 1.031062 
##                  GenderF 
##                 1.016385
mean(vif(fit1))
## [1] 1.052563

As we can see the VIF is less than 5 for all numeric variables and less than the √5 for the factor variable. We have mean around 1, which concludes that multicollinearity is not a problem in our regression model.

Detection of outliers

mydata$StdResid <- round(rstandard(fit1), 3) 
hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals",
     col = "lightblue")

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.93292, p-value = 1.594e-06

The given p-value is less than 5% that is why we can reject H0 and we accept the H1, which states that errors are not normally distributed. We can see from histogram that we will need to remove some units.

mydata$CooksD <- round(cooks.distance(fit1), 3)
hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances",
     col = "lightblue")

Removing outliers

head(mydata[order(mydata$StdResid),], 5)
##     Age Gender Diastolic_Blood_Pressure BMI Glucose_Level GenderF StdResid
## 6    61      0                      136  25            79       M   -2.168
## 104  39      1                       87  17            75       F   -1.852
## 64   58      1                       98  20            60       F   -1.719
## 146  48      1                       87  18            63       F   -1.684
## 62   63      0                      104  23            73       M   -1.511
##     CooksD
## 6    0.145
## 104  0.019
## 64   0.021
## 146  0.011
## 62   0.021

Here we can see that none of the units have standard residuals higher than -3.

head(mydata[order(-mydata$StdResid),], 5)
##     Age Gender Diastolic_Blood_Pressure BMI Glucose_Level GenderF StdResid
## 18   42      1                       92  44            60       F    4.374
## 147  67      1                      102  44            62       F    3.685
## 119  48      1                       80  36            67       F    2.714
## 117  63      0                       75  38           120       M    2.695
## 107  64      1                       78  36            74       F    2.499
##     CooksD
## 18   0.124
## 147  0.155
## 119  0.024
## 117  0.087
## 107  0.042

Here we can see that the unit ID=18 and unit ID=147 have standard residuals higher than 3. So, we need to remove them.

head(mydata[order(-mydata$CooksD),], 10) 
##     Age Gender Diastolic_Blood_Pressure BMI Glucose_Level GenderF StdResid
## 147  67      1                      102  44            62       F    3.685
## 6    61      0                      136  25            79       M   -2.168
## 68   63      1                       88  37           170       F    1.974
## 18   42      1                       92  44            60       F    4.374
## 117  63      0                       75  38           120       M    2.695
## 107  64      1                       78  36            74       F    2.499
## 125  57      0                       97  38            94       M    2.211
## 12   58      1                      120  36            85       F    1.225
## 119  48      1                       80  36            67       F    2.714
## 62   63      0                      104  23            73       M   -1.511
##     CooksD
## 147  0.155
## 6    0.145
## 68   0.135
## 18   0.124
## 117  0.087
## 107  0.042
## 125  0.027
## 12   0.025
## 119  0.024
## 62   0.021

From Cooks Distances, we can detect that units having Cooks Distance higher than 0.03 have big jump and are not so constant than before. That is why we will use 0.03 as a threshold. So we need to remove units with ID: 147,6,68,18,117,107.

mydata <- mydata[mydata$StdResid < 3,]
mydata <- mydata[mydata$StdResid > -3,]
mydata <- mydata[mydata$CooksD < 0.03,]

Estimating new regression model

fit2 <- lm(BMI ~ Age + Diastolic_Blood_Pressure + Glucose_Level + GenderF, data = mydata)

Below, we will look again at the diagnostics.

mydata$StdResid <- round(rstandard(fit2), 3) 
mydata$CooksD <- round(cooks.distance(fit2), 3)

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.98432, p-value = 0.1002

From the shapiro test, we cannot reject the H0 due to the given p-value > 0.05, which states that errors are normally distributed. The normality assumption is met.

Linearity

mydata$StdFitValues <- scale(fit2$fitted.values)

scatterplot(y = mydata$StdResid, x = mydata$StdFitValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = TRUE,
            smooth = FALSE)

From the scatterplot above, we can see that the linearity assumption is met.

Homoskedasticity

Ho: The variance is constant (homoskedasticity).

H1: The variance is not constant (heteroskedasticity).

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##              Data               
##  -------------------------------
##  Response : BMI 
##  Variables: fitted values of BMI 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.10924 
##  Prob > Chi2   =    0.2922472

From the given p-value, we cannot reject the H0 because p-value > 0.05. Therefore, the variance is constant and we have homoskedasticity.

library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
round(stat.desc(mydata[ ,c(1,2,3,4,5)]), 1)
##                 Age Gender Diastolic_Blood_Pressure    BMI Glucose_Level
## nbr.val       144.0  144.0                    144.0  144.0         144.0
## nbr.null        0.0   67.0                      0.0    0.0           0.0
## nbr.na          0.0    0.0                      0.0    0.0           0.0
## min            34.0    0.0                     57.0   17.0          55.0
## max            68.0    1.0                    121.0   38.0         177.0
## range          34.0    1.0                     64.0   21.0         122.0
## sum          7149.0   77.0                  11812.0 3599.0       11578.0
## median         48.5    1.0                     80.5   25.0          77.0
## mean           49.6    0.5                     82.0   25.0          80.4
## SE.mean         0.7    0.0                      1.0    0.3           1.6
## CI.mean.0.95    1.5    0.1                      1.9    0.6           3.1
## var            79.9    0.3                    139.1   15.5         347.1
## std.dev         8.9    0.5                     11.8    3.9          18.6
## coef.var        0.2    0.9                      0.1    0.2           0.2

From the table above, we see that we have now 144 observations after removing outliers and units of high impact on the data.

Age: The minimum age in the sample is 34 years. The maximum age in the sample is 68 years. The average age in the sample is 49.6 years. Half of the individuals have up to or equal to 48.5 years in the sample and other half has above 48.5 years. From the table above, we can see that variable age has a standard deviation of 8.9. It means that the age of individual deviates from the mean by 8.9 years.

Diastolic Blood Pressure: The minimum diastolic blood pressure in the sample is 57 mmHG. The maximum BMI in the sample is 121 mmHG. The average diastolic blood pressure in the sample is 82 mmHG. Half of the individuals have up to or equal to 80.5 mmHG and other half has above 80.5 mmHG. From the table above, we can see that diastolic blood pressure has a standard deviation of 11.8. It means that the diastolic blood pressure of individual deviates from the mean by 11.8 mmHG.

BMI: The minimum BMI in the sample is 17 (kg/m2). The maximum BMI in the sample is 38 (kg/m2). The average BMI in the sample is 25 (kg/m2). Half of the individuals have up to or equal to 25 BMI (kg/m2) and other half has above 25 BMI (kg/m2). From the table above, we can see that BMI has a standard deviation of 3.9. It means that the BMI of individual deviates from the mean by 3.9 BMI (kg/m2).

Glucose_Level: The minimum glucose level in the sample is 55 mg/dL. The maximum glucose level in the sample is 177 mg/dL. The average glucose level in the sample is 80.4 mg/dL. Half of the individuals have up to or equal to 77 mg/dL glucose level and other half has above 77 mg/dL glucose level. From the table above, we can see that the glucose level has a standard deviation of 18.6. It means that the glucose level of individual deviates from the mean by 18.6 mg/dL.

summary(lm(formula = BMI ~ Age + GenderF + Diastolic_Blood_Pressure + Glucose_Level, data = mydata))
## 
## Call:
## lm(formula = BMI ~ Age + GenderF + Diastolic_Blood_Pressure + 
##     Glucose_Level, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.687 -2.425 -0.046  2.013 12.159 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              10.83414    2.66465   4.066 7.97e-05 ***
## Age                       0.03310    0.03376   0.980   0.3286    
## GenderFF                 -1.17976    0.58705  -2.010   0.0464 *  
## Diastolic_Blood_Pressure  0.14102    0.02537   5.558 1.35e-07 ***
## Glucose_Level             0.01963    0.01580   1.243   0.2161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.488 on 139 degrees of freedom
## Multiple R-squared:  0.236,  Adjusted R-squared:  0.214 
## F-statistic: 10.73 on 4 and 139 DF,  p-value: 1.308e-07

Final regression equation:

BMI = 10.834 + 0.033 * Age - 1.18 * GenderF + 0.141 * Diastolic_Blood_Pressure + 0.02 * Glucose_Level

For the conclusion, we need to state that the regression coefficients of Age and Glucose level are not statistically significant because of p-value > 0.05 and we will not explain it.

GenderF: On average, females will have 1.18 lower BMI than men, ceteris paribus.

Diastolic_Blood_Pressure: If diastolic blood pressure increases by 1 mmHG, ceteris paribus, on average BMI will increase by 0.14 kg/m2.

The Multiple R-squared has the value of 0.236, and it indicates that 23.6% of variance of dependent variable is explained by all independent variables in the model (Age, Gender, Diastolic Blood Pressure and Glucose Level).

For F-test, we can determine if the regression model is good.

H0: ρ2 = 0

H1: ρ2 > 0

In our case, the p-value is lower than 0.001, so we can reject the H0. We can accept H1, which states that some variability of dependent variable is explained by other variables in the model.

sqrt(summary(fit2)$r.squared)
## [1] 0.4857693

The linear relationship between BMI and all explanatory variables is moderate.