mydata1 <- read.csv("./hw111.csv",
                    header = TRUE,
                    sep = ';',
                    dec = ',')

mydata1$glucose <- strtoi(mydata1$glucose)
head(mydata1)
##   id age sex is_smoking totChol   BMI heartRate glucose TenYearCHD
## 1  1  36   0         NO     212 29.77        72      75          0
## 2  2  46   1        YES     250 20.35        88      94          0
## 3  3  50   0        YES     233 28.26        68      94          1
## 4  4  64   1        YES     241 26.42        70      77          0
## 5  5  61   1         NO     272 32.80        85      65          1
## 6  6  61   0         NO     238 24.83        75      79          0
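
Before going further, it can help to confirm how each column was parsed. The sketch below rebuilds the six rows printed above as in-memory text and inspects the column types. The names `csv_text`, `demo` and `col_types` are illustrative, and the semicolon and decimal-comma layout is assumed from the `read.csv` arguments. It also uses `as.integer()` as one common alternative to `strtoi()` for the glucose column.

```r
# Six rows copied from the head() output above, written in the assumed
# semicolon-separated, decimal-comma layout of the CSV file.
csv_text <- "id;age;sex;is_smoking;totChol;BMI;heartRate;glucose;TenYearCHD
1;36;0;NO;212;29,77;72;75;0
2;46;1;YES;250;20,35;88;94;0
3;50;0;YES;233;28,26;68;94;1
4;64;1;YES;241;26,42;70;77;0
5;61;1;NO;272;32,80;85;65;1
6;61;0;NO;238;24,83;75;79;0"

# read.csv() accepts in-memory text through its `text` argument.
demo <- read.csv(text = csv_text, header = TRUE, sep = ";", dec = ",",
                 stringsAsFactors = FALSE)

# Show the structure: BMI should be numeric because dec = "," was given.
str(demo)

# as.integer() converts both character and numeric input to integers.
demo$glucose <- as.integer(demo$glucose)

# One class per column, as a named character vector.
col_types <- sapply(demo, class)
print(col_types)
```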

Description of the variables:

The sample has 90 observations. The unit of observation is a patient/resident of the town of Framingham, Massachusetts. The dataset was found on the Kaggle website (link: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea).

The main goal of this data analysis is to examine how various factors are related to BMI (research question).

Below, I changed the names of the variables.

colnames(mydata1) <- c('ID','Age','Gender','Smoking?','Cholesterol_Level','BMI','Heart_Rate','Glucose_Level','Ten_Year_CHD_Prediction')

head(mydata1)
##   ID Age Gender Smoking? Cholesterol_Level   BMI Heart_Rate Glucose_Level
## 1  1  36      0       NO               212 29.77         72            75
## 2  2  46      1      YES               250 20.35         88            94
## 3  3  50      0      YES               233 28.26         68            94
## 4  4  64      1      YES               241 26.42         70            77
## 5  5  61      1       NO               272 32.80         85            65
## 6  6  61      0       NO               238 24.83         75            79
##   Ten_Year_CHD_Prediction
## 1                       0
## 2                       0
## 3                       1
## 4                       0
## 5                       1
## 6                       0

Here I have deleted the variable Smoking?.

mydata1a <- mydata1[,-4]

head(mydata1a)
##   ID Age Gender Cholesterol_Level   BMI Heart_Rate Glucose_Level
## 1  1  36      0               212 29.77         72            75
## 2  2  46      1               250 20.35         88            94
## 3  3  50      0               233 28.26         68            94
## 4  4  64      1               241 26.42         70            77
## 5  5  61      1               272 32.80         85            65
## 6  6  61      0               238 24.83         75            79
##   Ten_Year_CHD_Prediction
## 1                       0
## 2                       0
## 3                       1
## 4                       0
## 5                       1
## 6                       0
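
Dropping a column by position (`-4`) breaks silently if the column order ever changes. A minimal sketch of dropping by name instead is shown below, using a reduced version of the first six rows. The object names `demo` and `demo_dropped` are illustrative and not part of the analysis.

```r
# Reduced copy of the first six rows; check.names = FALSE keeps the
# non-syntactic column name "Smoking?" unchanged.
demo <- data.frame(
  ID = 1:6,
  Age = c(36, 46, 50, 64, 61, 61),
  Gender = c(0, 1, 0, 1, 1, 0),
  `Smoking?` = c("NO", "YES", "YES", "YES", "NO", "NO"),
  BMI = c(29.77, 20.35, 28.26, 26.42, 32.80, 24.83),
  check.names = FALSE
)

# Keep every column except the one named "Smoking?".
demo_dropped <- demo[, setdiff(names(demo), "Smoking?")]

names(demo_dropped)
```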

Here I introduced a new variable GenderF.

mydata1a$GenderF <- factor(mydata1a$Gender,
                           levels = c(0,1),
                           labels = c('M','F'))
head(mydata1a)
##   ID Age Gender Cholesterol_Level   BMI Heart_Rate Glucose_Level
## 1  1  36      0               212 29.77         72            75
## 2  2  46      1               250 20.35         88            94
## 3  3  50      0               233 28.26         68            94
## 4  4  64      1               241 26.42         70            77
## 5  5  61      1               272 32.80         85            65
## 6  6  61      0               238 24.83         75            79
##   Ten_Year_CHD_Prediction GenderF
## 1                       0       M
## 2                       0       F
## 3                       1       M
## 4                       0       F
## 5                       1       F
## 6                       0       M
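
A quick way to check that the recoding did what was intended is to cross-tabulate the original codes against the new labels. The sketch below does this for the six Gender values shown above. The mapping 0 = M and 1 = F is taken from the `factor()` call, and the names `gender_code`, `gender_f` and `recode_check` are illustrative.

```r
# Gender codes of the first six rows shown above.
gender_code <- c(0, 1, 0, 1, 1, 0)

# Same recoding as in the analysis: 0 becomes "M", 1 becomes "F".
gender_f <- factor(gender_code, levels = c(0, 1), labels = c("M", "F"))

# Every observation should fall on the diagonal (0 with M, 1 with F).
recode_check <- table(code = gender_code, label = gender_f)
print(recode_check)
```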

Below I applied the condition BMI >= 25 to see how many individuals in the sample have a BMI of 25 or higher. Only the 6 rows with the lowest qualifying BMI are displayed; in total, 41 individuals have a BMI of 25 or higher.

mydata1aa <- mydata1a[mydata1a$BMI >= 25, ]

head(mydata1aa[order(mydata1aa$BMI), ])
##    ID Age Gender Cholesterol_Level   BMI Heart_Rate Glucose_Level
## 64 64  61      1               183 25.05         66            70
## 20 20  38      1               164 25.75         70            75
## 17 17  42      0               232 25.77         72            70
## 60 60  62      0               264 26.15         73            63
## 34 34  46      1               193 26.18         75            NA
## 71 71  62      1               288 26.18         68            87
##    Ten_Year_CHD_Prediction GenderF
## 64                       1       F
## 20                       0       F
## 17                       0       M
## 60                       0       M
## 34                       0       F
## 71                       0       F
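
The number of individuals meeting the condition can also be counted directly instead of being read off the subset. The sketch below shows one way to do this on the six BMI values printed at the start of the document, so its result is for those six rows only and not for the full sample. The names `bmi`, `n_high`, `share_high` and `rows_high` are illustrative.

```r
# BMI values of the first six rows shown at the top of the document.
bmi <- c(29.77, 20.35, 28.26, 26.42, 32.80, 24.83)

# TRUE counts as 1, so sum() gives the number of values at or above 25.
n_high <- sum(bmi >= 25, na.rm = TRUE)

# mean() of a logical vector gives the share of TRUE values.
share_high <- mean(bmi >= 25, na.rm = TRUE)

# which() ignores NA, so a missing BMI does not produce an all-NA row
# when the result is used for subsetting.
bmi_with_na <- c(bmi, NA)
rows_high <- which(bmi_with_na >= 25)

print(n_high)
print(round(share_high, 2))
print(rows_high)
```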

The summary below excludes ID, Gender and Ten_Year_CHD_Prediction. I also deleted the two rows that had missing values, which is why the sample size changed to 88 observations.

library(tidyr)
mydata1a <- drop_na(mydata1a)
summary(mydata1a[ ,c(-1,-3,-8)])
##       Age        Cholesterol_Level      BMI          Heart_Rate    
##  Min.   :34.00   Min.   :150.0     Min.   :18.10   Min.   : 50.00  
##  1st Qu.:42.00   1st Qu.:209.5     1st Qu.:21.89   1st Qu.: 68.75  
##  Median :50.50   Median :234.5     Median :24.58   Median : 75.00  
##  Mean   :49.81   Mean   :234.6     Mean   :25.33   Mean   : 74.99  
##  3rd Qu.:58.00   3rd Qu.:262.5     3rd Qu.:27.95   3rd Qu.: 80.00  
##  Max.   :65.00   Max.   :346.0     Max.   :43.69   Max.   :100.00  
##  Glucose_Level    GenderF
##  Min.   : 57.00   M:42   
##  1st Qu.: 67.75   F:46   
##  Median : 77.00          
##  Mean   : 78.60          
##  3rd Qu.: 85.00          
##  Max.   :170.00
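
Before dropping rows, it is useful to see where the missing values are. The sketch below counts missing values per column and then removes incomplete rows with base R's `complete.cases()`, which for this simple case should keep the same rows as `drop_na()`. It uses three rows taken from the BMI >= 25 output above (IDs 64, 34 and 71) with only three of their columns; the names `demo`, `na_per_column` and `demo_complete` are illustrative.

```r
# Three rows from the BMI >= 25 output above; ID 34 has a missing glucose level.
demo <- data.frame(
  ID = c(64, 34, 71),
  BMI = c(25.05, 26.18, 26.18),
  Glucose_Level = c(70, NA, 87)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Keep only rows without any missing value.
demo_complete <- demo[complete.cases(demo), ]
print(demo_complete)
```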

Age: The minimum age in the sample is 34 years and the maximum age is 65 years. The mean age is 49.81 years. Half of the individuals are 50.50 years old or younger, and the other half are older than 50.50 years.

Cholesterol_Level: The minimum cholesterol level in the sample is 150 mg/dL and the maximum is 346 mg/dL. The mean cholesterol level is 234.6 mg/dL. Half of the individuals have a cholesterol level of 234.5 mg/dL or lower, and the other half have a level above 234.5 mg/dL.

BMI: The minimum BMI in the sample is 18.10 kg/m2 and the maximum is 43.69 kg/m2. The mean BMI is 25.33 kg/m2. Half of the individuals have a BMI of 24.58 kg/m2 or lower, and the other half have a BMI above 24.58 kg/m2.

Heart_Rate: The minimum heart rate in the sample is 50 beats per minute and the maximum is 100 beats per minute. The mean heart rate is 74.99 beats per minute. Half of the individuals have a heart rate of 75 beats per minute or lower, and the other half have a heart rate above 75.

Glucose_Level: The minimum glucose level in the sample is 57 mg/dL and the maximum is 170 mg/dL. The mean glucose level is 78.60 mg/dL. Half of the individuals have a glucose level of 77 mg/dL or lower, and the other half have a level above 77 mg/dL.

From the summary, we can see that the sample contains 42 male and 46 female patients/residents. The sample now consists of 88 patients/residents.

library(psych)
describe(mydata1a[,c(-1, -3, -8, -9)])
##                   vars  n   mean    sd median trimmed   mad   min    max  range
## Age                  1 88  49.81  9.08  50.50   49.83 12.60  34.0  65.00  31.00
## Cholesterol_Level    2 88 234.57 40.44 234.50  233.64 40.77 150.0 346.00 196.00
## BMI                  3 88  25.33  4.35  24.58   24.98  4.56  18.1  43.69  25.59
## Heart_Rate           4 88  74.99  9.78  75.00   74.99  8.90  50.0 100.00  50.00
## Glucose_Level        5 88  78.60 16.53  77.00   76.71 13.34  57.0 170.00 113.00
##                   skew kurtosis   se
## Age               0.03    -1.25 0.97
## Cholesterol_Level 0.30     0.06 4.31
## BMI               1.11     2.35 0.46
## Heart_Rate        0.03    -0.34 1.04
## Glucose_Level     2.36     9.51 1.76

From the table above, we can see that glucose level has a standard deviation of 16.53 mg/dL, which means that an individual's glucose level typically deviates from the mean by about 16.53 mg/dL. With the same interpretation, the standard deviation is 4.35 kg/m2 for BMI, 9.78 beats per minute for heart rate, 40.44 mg/dL for cholesterol level and 9.08 years for age. BMI has the smallest range (25.59) and cholesterol level has the largest (196). However, the variables are measured in different units, so their ranges are not directly comparable.

All skewness values are positive. For age and heart rate (0.03 each) the value is so close to zero that the distributions are approximately symmetric. Cholesterol level (0.30) is slightly skewed to the right, while BMI (1.11) and glucose level (2.36) are clearly skewed to the right.

The describe() function reports excess kurtosis, so the values are compared with 0 rather than 3. Age (-1.25) and heart rate (-0.34) are platykurtic, cholesterol level (0.06) is close to mesokurtic, and BMI (2.35) and glucose level (9.51) are leptokurtic, which means they have heavier tails than a normal distribution.
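
Because the variables are measured in different units, one way to compare their variability is the coefficient of variation (standard deviation divided by the mean). This measure is not part of the original analysis. The sketch below computes it from the means and standard deviations reported in the describe() table; the names `desc` and `cv_percent` are illustrative.

```r
# Means and standard deviations copied from the describe() output above.
desc <- data.frame(
  variable = c("Age", "Cholesterol_Level", "BMI", "Heart_Rate", "Glucose_Level"),
  mean = c(49.81, 234.57, 25.33, 74.99, 78.60),
  sd = c(9.08, 40.44, 4.35, 9.78, 16.53)
)

# Coefficient of variation in percent: unit-free, so comparable across variables.
desc$cv_percent <- round(100 * desc$sd / desc$mean, 1)

# Sort from the least to the most variable (relative to the mean).
desc[order(desc$cv_percent), ]
```

With these figures, glucose level has the largest relative spread and heart rate the smallest.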

Male_Glucose_Level <- mydata1a$Glucose_Level[mydata1a$GenderF=="M"]

meanMG <- round(mean(Male_Glucose_Level),2 )
print(meanMG)
## [1] 74.76
Female_Glucose_Level <- mydata1a$Glucose_Level[mydata1a$GenderF=="F"]

meanFG <- round(mean(Female_Glucose_Level),2 )
print(meanFG)
## [1] 82.11

I computed the mean glucose level for each gender category to see which group has the higher average. Female patients have a higher average glucose level (82.11 mg/dL) than male patients (74.76 mg/dL).
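
The two group means can also be obtained in one step with `tapply()`. The sketch below applies it to the first six rows only, so its values differ from the full-sample means reported above and should not be read as results. The names `glucose`, `gender_f` and `group_means` are illustrative.

```r
# Glucose levels and gender codes of the first six rows shown earlier.
glucose <- c(75, 94, 94, 77, 65, 79)
gender_f <- factor(c(0, 1, 0, 1, 1, 0), levels = c(0, 1), labels = c("M", "F"))

# tapply() splits glucose by gender and applies mean() to each group.
group_means <- round(tapply(glucose, gender_f, mean), 2)
print(group_means)
```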

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata1a, aes(x = GenderF, y = Glucose_Level, fill = GenderF)) +
  geom_boxplot() +
  xlab("Gender") +
  ylab("Glucose Level")

From the boxplot, we can also see that male patients/residents have a lower median glucose level than female patients/residents. There are also two outliers. Furthermore, the male group has the lower minimum glucose level and the female group has the higher maximum glucose level.
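
The points drawn as outliers in a boxplot can be listed with `boxplot.stats()`, which applies the usual rule of 1.5 times the box length beyond the hinges. The sketch below illustrates this on a handful of glucose values that appear in the outputs above (including the maximum of 170). It is not the full sample, so it does not reproduce the two outliers in the plot. The names `glucose_demo` and `out` are illustrative.

```r
# A few glucose values seen in the outputs above; not the full sample.
glucose_demo <- c(65, 70, 75, 77, 79, 94, 170)

# $stats holds the lower whisker, lower hinge, median, upper hinge, upper whisker.
stats <- boxplot.stats(glucose_demo)
print(stats$stats)

# $out holds the values beyond the whiskers (plotted as individual points).
out <- stats$out
print(out)
```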

ggplot(mydata1a,
       aes(x=Glucose_Level, fill= GenderF)) +
  geom_histogram(position = "dodge", bins = 10) +
  xlab('Glucose Level') +
  ylab('Frequency')

From this histogram, we can again see the outliers, and we can see that male patients tend to have lower glucose levels than female patients.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(mydata1a[c(-1,-3,-6,-8,-9)],
            smooth = FALSE)

From the scatterplot matrix above and the correlation coefficients calculated below, we can see that the relationship between Age and BMI is positive and the relationship between Cholesterol Level and BMI is negative. The scatterplot matrix also suggests a positive relationship between Glucose Level and BMI.

cor(mydata1a$BMI,mydata1a$Age)
## [1] 0.1285334
cor(mydata1a$BMI,mydata1a$Cholesterol_Level)
## [1] -0.02604946

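The correlation between BMI and Age (r = 0.13) is positive but very weak. The correlation between BMI and Cholesterol Level (r = -0.03) is negative and so close to zero that there is practically no linear relationship.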
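
All pairwise correlations, including the one between BMI and Glucose Level, can be computed at once by passing a data frame to `cor()`. The sketch below does this for the first six rows only, so its coefficients will not match the full-sample values reported above. The names `demo` and `cor_mat` are illustrative.

```r
# Four numeric variables from the first six rows shown earlier.
demo <- data.frame(
  Age = c(36, 46, 50, 64, 61, 61),
  Cholesterol_Level = c(212, 250, 233, 241, 272, 238),
  BMI = c(29.77, 20.35, 28.26, 26.42, 32.80, 24.83),
  Glucose_Level = c(75, 94, 94, 77, 65, 79)
)

# use = "complete.obs" drops rows with missing values before computing
# the Pearson correlations (there are none in these six rows).
cor_mat <- cor(demo, use = "complete.obs")

# The BMI row or column shows its correlation with every other variable.
round(cor_mat, 2)
```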