1. Correlation
Correlation quantifies the relationship or association between two or more variables. It describes the extent to which changes in one variable are accompanied by consistent changes in another, and it indicates both the strength and the direction of the linear relationship between them.
The correlation coefficient is denoted by r or \({\rho}\) and ranges from -1 to +1. \({\rho = -1}\) indicates a perfect negative (inverse) relationship: as one variable increases, the other decreases. \({\rho = +1}\) indicates a perfect positive (direct) relationship: as one variable increases, the other increases as well. Lastly, \({\rho = 0}\) indicates no linear relationship, and values close to 0 indicate a weak one.
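As a minimal sketch of these three cases, the snippet below computes the Pearson coefficient on small made-up vectors, once by hand and once with base R's cor(). The data and the helper name pearson_r are illustrative assumptions, not part of the analysis in this report.

```r
# Toy data (hypothetical values, not taken from the study)
x      <- c(1, 2, 3, 4, 5)
y_pos  <- 2 * x + 1          # rises exactly in step with x
y_neg  <- -3 * x + 10        # falls exactly in step with x
y_none <- c(2, 1, 3, 1, 2)   # no linear trend in x

# Pearson's r written out from its definition (hypothetical helper)
pearson_r <- function(a, b) {
  da <- a - mean(a)
  db <- b - mean(b)
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

r_pos  <- pearson_r(x, y_pos)    # perfect positive relationship
r_neg  <- pearson_r(x, y_neg)    # perfect negative relationship
r_none <- pearson_r(x, y_none)   # no linear relationship

# The hand-written version agrees with the built-in function
same_as_builtin <- isTRUE(all.equal(r_pos, cor(x, y_pos)))
```

The hand-written helper is only there to make the formula visible; in practice cor() is the function to use.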
1.2 Correlation Table
A correlation table is a tabular representation of the correlation coefficients between multiple variables. It provides a comprehensive overview of the relationships between pairs of variables within a data set. Each cell in the table contains a correlation coefficient, and cells are often color-coded to show the strength of the relationship.
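A correlation table can be sketched in a few lines of base R. The example below uses three columns of the built-in mtcars data set purely as a stand-in; the variables analysed in this report are different.

```r
# Stand-in data: three numeric columns from the built-in mtcars data set
vars <- mtcars[, c("mpg", "wt", "hp")]

# cor() on a data frame returns the full matrix of pairwise coefficients
corr_table <- round(cor(vars), 2)
print(corr_table)

# A single cell is addressed by row and column name
mpg_wt <- corr_table["mpg", "wt"]
```

The matrix is symmetric with ones on the diagonal, because every variable correlates perfectly with itself.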
1.2.1 Correlation Table using the Variables Stroke, CHD, Diabetes, BMI, Age, Gender and Smoking.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: readxl
## [1] 364
## [1] 208
The smoking variable has a substantial number of missing values. One option is to work with complete cases only and ignore the missing values. However, dropping that many rows could introduce bias into the results, so we look for a solution that suits the context of smoking and the sample itself. Suppose the missingness arises because participants are reluctant to disclose such sensitive information. Overall, 33% of the smoking values are missing, which is safe to call a large amount of missing information. Imputing the mean for the NA records is therefore likely a better option than dropping them.
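The trade-off can be sketched on a toy data frame (hypothetical values and column names, not the study data): first measure the share of missing values per column, then compare the observed values with a mean-imputed version.

```r
# Toy data with missing values (hypothetical, for illustration only)
toy <- data.frame(
  bmi = c(22.5, NA, 30.1, 27.4, NA, 24.0),
  age = c(34, 51, 47, NA, 62, 29)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Mean imputation: replace each NA with the mean of the observed values
bmi_mean   <- mean(toy$bmi, na.rm = TRUE)
bmi_filled <- ifelse(is.na(toy$bmi), bmi_mean, toy$bmi)

# The mean is preserved, but the spread shrinks
var_observed <- var(toy$bmi, na.rm = TRUE)
var_filled   <- var(bmi_filled)
```

Mean imputation keeps every row and leaves the column mean unchanged, but it reduces the variance of the imputed column, which tends to pull correlations involving that column towards zero.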
1.2.1.1 Case for Filling Nulls
1.2.1.1.1 Pre-Processing and Transformation
This stage prepares the data so that it is of sufficient quality to be used in further analysis. It involves transforming the string values of the smoking variable into numeric category codes and filling in the NAs.
dataframe %>% distinct(smoking)
## # A tibble: 6 × 1
## smoking
## <chr>
## 1 <NA>
## 2 never
## 3 not current
## 4 ever
## 5 former
## 6 current
There are 5 categories in the smoking variable, so we code them as follows:
- never as 1
- not current as 2
- ever as 3
- former as 4
- current as 5
# Recode the smoking categories as numeric scores; NA (or any other value) stays NA
dataframe2 <- dataframe %>%
  mutate(smoking_ = case_when(
    smoking == "never" ~ 1,
    smoking == "not current" ~ 2,
    smoking == "ever" ~ 3,
    smoking == "former" ~ 4,
    smoking == "current" ~ 5,
    TRUE ~ NA_real_
  ))
dataframe2
## # A tibble: 630 × 9
## gender age diabetes hypertension stroke `coronary heart disease` smoking
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Female 2 0 0 0 0 <NA>
## 2 Male 2 0 0 0 0 <NA>
## 3 Female 2 0 0 0 0 <NA>
## 4 Male 2 0 0 0 0 <NA>
## 5 Female 2 0 0 0 0 never
## 6 Female 3 0 0 0 0 <NA>
## 7 Female 3 0 0 0 0 <NA>
## 8 Female 3 0 0 0 0 <NA>
## 9 Female 3 0 0 0 0 <NA>
## 10 Female 3 0 0 0 0 <NA>
## # ℹ 620 more rows
## # ℹ 2 more variables: BMI <dbl>, smoking_ <dbl>
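As an alternative to a chain of conditions, the same recoding can be sketched with a named lookup vector in base R. The vector names smoking_codes and smoking_raw are hypothetical, and the sample values are made up.

```r
# Lookup table: category label -> numeric score (same mapping as listed above)
smoking_codes <- c("never" = 1, "not current" = 2, "ever" = 3,
                   "former" = 4, "current" = 5)

# Made-up sample of raw values, including a missing one
smoking_raw <- c(NA, "never", "current", "former", "ever", "not current")

# Indexing by name maps each label to its score; NA stays NA
smoking_score <- unname(smoking_codes[smoking_raw])
```

A lookup vector keeps the mapping in one place, which makes it easier to reuse the same coding in a later chunk.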
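Filling the missing smoking and BMI values with their rounded means:
# Impute missing smoking scores with the rounded mean, then drop the original text column
smoking_mean <- round(mean(dataframe2$smoking_, na.rm = TRUE))
dataframe2 <- dataframe2 %>%
  mutate(smoking_ = ifelse(is.na(smoking_), smoking_mean, smoking_)) %>%
  select(-smoking)
# Impute missing BMI with the rounded mean, code gender numerically (Male = 1, otherwise 2),
# and give the recoded column the name smoking
dataframe2 <- dataframe2 %>%
  mutate(BMI = ifelse(is.na(BMI), round(mean(dataframe2$BMI, na.rm = TRUE)), BMI)) %>%
  mutate(gender = ifelse(gender == "Male", 1, 2)) %>%
  rename(smoking = smoking_)
dataframe2
## # A tibble: 630 × 8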
## gender age diabetes hypertension stroke `coronary heart disease` BMI
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 2 0 0 0 0 27
## 2 1 2 0 0 0 0 18.2
## 3 2 2 0 0 0 0 18.4
## 4 1 2 0 0 0 0 16.5
## 5 2 2 0 0 0 0 15.9
## 6 2 3 0 0 0 0 19.0
## 7 2 3 0 0 0 0 21.2
## 8 2 3 0 0 0 0 17.7
## 9 2 3 0 0 0 0 18.6
## 10 2 3 0 0 0 0 16.3
## # ℹ 620 more rows
## # ℹ 1 more variable: smoking <dbl>
Running this chunk twice raises an error, because the original smoking column is dropped after the transformation. With smoking transformed, the next step is to convert the variables of interest to numeric and select them for the correlation.
# Convert the variables of interest to numeric and keep only those used in the correlation
subset1 <- dataframe2 %>%
  mutate(smoking = as.numeric(smoking), diabetes = as.numeric(diabetes), stroke = as.numeric(stroke), CHD = as.numeric(`coronary heart disease`)) %>%
  select(stroke, CHD, diabetes, BMI, age, gender, smoking)
The chunk selects the variables for the correlation and converts the ordinal and categorical variables to numeric.
correlation <- round(cor(subset1), 2)
head(correlation)
## stroke CHD diabetes BMI age gender smoking
## stroke 1.00 0.06 0.12 0.00 0.13 0.03 -0.01
## CHD 0.06 1.00 0.23 0.06 0.23 -0.10 0.04
## diabetes 0.12 0.23 1.00 0.28 0.31 -0.08 0.08
## BMI 0.00 0.06 0.28 1.00 0.33 -0.01 0.08
## age 0.13 0.23 0.31 0.33 1.00 -0.06 0.10
## gender 0.03 -0.10 -0.08 -0.01 -0.06 1.00 -0.16
require(ggcorrplot)
## Loading required package: ggcorrplot
ggcorrplot(correlation, method = "circle") +
  ggtitle("Correlation Plot")
This is the correlation table and its plot. It is worth noting that the null values were filled in to reduce bias.
1.2.1.1.2 Interpretation.
The correlations between “stroke” and the other variables are generally weak (at most 0.13, with age), suggesting no strong linear relationship between stroke and the rest of the variables. The correlations between “gender” and the other variables are also weak. “CHD” and “diabetes” show weak-to-moderate positive correlations with several variables. Notably, “diabetes” correlates more strongly with “BMI” (0.28) and “age” (0.31) than with the other variables. “BMI” shows positive correlations with “CHD” (0.06), “diabetes” (0.28), and “age” (0.33), implying a mild tendency for higher BMI values to accompany CHD, diabetes, and older age.
Additionally, “age” shows weak positive correlations with “diabetes,” “BMI,” and “stroke,” indicating that as age increases, the likelihood of having diabetes, a higher BMI, and a stroke also increases slightly. Lastly, “gender” has weak negative correlations with “CHD” (-0.10) and “diabetes” (-0.08). Because gender is coded 1 for male and 2 for female, a negative sign means that these conditions are slightly less common among females, that is, slightly more common among males.
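The sign of a correlation with a two-level variable depends entirely on which level gets the larger code. A minimal sketch on made-up data (the vectors below are hypothetical) shows the sign flipping when the coding is reversed:

```r
# Hypothetical sample: gender coded 1 = Male, 2 = Female; chd is 0/1
gender_code <- c(1, 1, 2, 2, 1, 2, 1, 2)
chd         <- c(1, 0, 0, 0, 1, 0, 1, 1)

# CHD is more common among males in this toy sample
chd_rate_male   <- mean(chd[gender_code == 1])
chd_rate_female <- mean(chd[gender_code == 2])

# With Female as the higher code, the correlation is negative
r_original <- cor(gender_code, chd)

# Reversing the coding (1 = Female, 2 = Male) flips only the sign
r_reversed <- cor(3 - gender_code, chd)
```

The magnitude is identical in both cases, so the coding scheme has to be stated before a sign can be read as "more common among males" or "more common among females".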
1.2.1.2 Case for Complete Cases
We use pairwise complete observations to calculate each correlation coefficient: for any pair of variables, only the rows with a missing value in one of those two variables are ignored.
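A minimal sketch of how pairwise deletion differs from listwise deletion, using the use argument of cor() on a made-up data frame (the column names a, b and c are hypothetical):

```r
# Toy data: only column c has missing values (rows 1 and 2)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(NA, NA, 3, 5, 4, 6)
)

# Pairwise: each coefficient uses every row where that pair is observed
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise: rows with any NA are dropped before computing all coefficients
r_listwise <- cor(toy, use = "complete.obs")

# a and b are fully observed, so pairwise deletion uses all six rows for them,
# while listwise deletion uses only rows 3 to 6
ab_pairwise <- r_pairwise["a", "b"]
ab_listwise <- r_listwise["a", "b"]
```

Pairwise deletion retains more data per coefficient, at the cost of each cell of the matrix potentially being based on a different set of rows.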
# Code gender and smoking numerically, but leave missing values in place
subset2 <- dataframe %>%
  mutate(gender = ifelse(gender == "Male", 1, 2)) %>%
  mutate(smoking = case_when(
    smoking == "never" ~ 1,
    smoking == "not current" ~ 2,
    smoking == "ever" ~ 3,
    smoking == "former" ~ 4,
    smoking == "current" ~ 5,
    TRUE ~ NA_real_
  )) %>%
  mutate(CHD = `coronary heart disease`) %>%
  select(stroke, CHD, diabetes, BMI, age, gender, smoking)
correlation_2 <- round(cor(subset2, use = "pairwise.complete.obs"), 2)
correlation_2
## stroke CHD diabetes BMI age gender smoking
## stroke 1.00 0.06 0.12 0.00 0.13 0.03 -0.02
## CHD 0.06 1.00 0.23 0.06 0.23 -0.10 0.04
## diabetes 0.12 0.23 1.00 0.29 0.31 -0.08 0.07
## BMI 0.00 0.06 0.29 1.00 0.38 -0.01 0.09
## age 0.13 0.23 0.31 0.38 1.00 -0.06 0.11
## gender 0.03 -0.10 -0.08 -0.01 -0.06 1.00 -0.19
## smoking -0.02 0.04 0.07 0.09 0.11 -0.19 1.00
This is the correlation matrix computed from pairwise complete cases only.
ggcorrplot(correlation_2) +
  ggtitle("Correlation Plot with Complete Pairwise Cases")
1.2.1.2.1 Interpretation.
The analysis reveals various levels of correlation among these variables. “Stroke” shows weak positive associations with “CHD” (0.06), “diabetes” (0.12), and “age” (0.13), and a negligible negative association with “smoking” (-0.02). Individuals who have had a stroke may thus be slightly more likely to have CHD and diabetes and to be older; the association with smoking is too small to interpret. “CHD” shows weak positive correlations with “diabetes” (0.23), “age” (0.23), and “smoking” (0.04), suggesting that individuals with a history of coronary heart disease may be more likely to have diabetes and to be older, with little difference in smoking score. “Diabetes” has moderate positive correlations with “BMI” (0.29) and “age” (0.31) and a weak positive correlation with “smoking” (0.07), indicating a stronger association of diabetes with higher BMI values and older age than with smoking. “BMI” shows positive correlations with “diabetes” (0.29), “age” (0.38), and “smoking” (0.09), suggesting that higher BMI values may be associated with diabetes, older age, and, weakly, a higher smoking score. “Age” shows positive correlations with “diabetes,” “BMI,” “stroke,” and “smoking” (0.11), suggesting that older individuals may have a higher prevalence of diabetes, higher BMI values, a history of stroke, and slightly higher smoking scores.
“Gender” has weak negative correlations with “CHD” (-0.10) and “diabetes” (-0.08). With gender coded 1 for male and 2 for female, this indicates that these conditions are slightly less common among females than among males. The weak negative correlation between “gender” and “smoking” (-0.19) likewise suggests that males tend to have higher smoking scores than females.
2. Regression
Regression is a modeling technique used to analyze the relationship between a dependent variable and one or more independent variables. It aims to understand how changes in the independent variables are associated with changes in the dependent variable.
2.1 What Factors are Associated with The Incidence of Stroke?
regression1 <- lm(stroke ~ CHD + diabetes + BMI + age + gender + smoking, subset2)
summary(regression1)
##
## Call:
## lm(formula = stroke ~ CHD + diabetes + BMI + age + gender + smoking,
## data = subset2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09980 -0.03600 -0.01115 0.00658 0.95231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006047 0.043493 -0.139 0.8895
## CHD 0.008257 0.034269 0.241 0.8097
## diabetes 0.036210 0.022441 1.614 0.1076
## BMI -0.001302 0.001092 -1.192 0.2342
## age 0.001174 0.000390 3.011 0.0028 **
## gender 0.008602 0.015059 0.571 0.5682
## smoking -0.006436 0.004666 -1.379 0.1688
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1306 on 327 degrees of freedom
## (296 observations deleted due to missingness)
## Multiple R-squared: 0.05413, Adjusted R-squared: 0.03678
## F-statistic: 3.119 on 6 and 327 DF, p-value: 0.005505
2.1.2 Results & Interpretation
Among the variables in the model, age is the only statistically significant predictor (p = 0.0028): each one-unit increase in age is associated with an increase of about 0.0012 in the predicted probability of stroke, so the likelihood of experiencing a stroke increases as individuals grow older. The remaining variables (CHD, diabetes, BMI, gender, and smoking) did not show statistically significant associations with stroke incidence in this model.
The adjusted R-squared measures the proportion of variation in stroke incidence explained by the model, adjusted for the number of predictors. At 0.03678 it is relatively low, indicating that additional factors not accounted for in the analysis might contribute to the incidence of stroke.
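The adjusted value can be reproduced from the figures in the model summary. The sketch below applies the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1); the variable names are illustrative, and n is inferred from the 327 residual degrees of freedom plus the 7 estimated coefficients.

```r
# Figures read off the printed model summary
r2 <- 0.05413      # multiple R-squared
p  <- 6            # number of predictors
n  <- 327 + p + 1  # residual df + predictors + intercept = 334 rows used

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The result agrees with the reported 0.03678 up to the rounding of the printed R-squared. With a fitted model object, the same number is available as summary(regression1)$adj.r.squared.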
2.2 What Factors are Associated With The Incidence of CHD?
2.2.1 Logistic Regression
Logistic regression is a modelling technique used to analyze the relationship between a binary dependent variable and one or more independent variables. The goal of logistic regression is to estimate the probability of the occurrence of the event of interest based on the values of the independent variables.
regression2 <- glm(CHD ~ age + gender + smoking + diabetes, data = subset2, family = binomial)
regression2
##
## Call: glm(formula = CHD ~ age + gender + smoking + diabetes, family = binomial,
## data = subset2)
##
## Coefficients:
## (Intercept) age gender smoking diabetes
## -6.82407 0.06984 -0.50890 0.04924 1.20404
##
## Degrees of Freedom: 421 Total (i.e. Null); 417 Residual
## (208 observations deleted due to missingness)
## Null Deviance: 155
## Residual Deviance: 121.6 AIC: 131.6
2.2.1.1 Results and Interpretation
The results reveal the following. First, age showed a positive association with the log-odds of CHD, with an estimated coefficient of 0.0698: as age increases, the likelihood of CHD tends to increase. Gender had a negative association with CHD, with a coefficient estimate of -0.51. Because gender is coded 1 for male and 2 for female, this means that being female was associated with lower log-odds of CHD than being male, after controlling for the other variables in the model.
Smoking showed a slightly positive association with CHD, with an estimated coefficient of 0.049: each one-step increase in the smoking score (from 1 = never to 5 = current) was associated with slightly higher log-odds of CHD, all other factors being equal. Lastly, diabetes showed a strong positive association with CHD, with a coefficient estimate of 1.2, suggesting that individuals with diabetes had considerably higher log-odds of CHD than those without diabetes, holding other factors constant. Note that the printed model object reports no standard errors or p-values, so these statements describe the direction and size of the estimates rather than their statistical significance.
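Log-odds coefficients are easier to read once exponentiated into odds ratios. The sketch below copies the estimates from the printed output into a named vector (the vector name log_odds is hypothetical) and applies exp():

```r
# Coefficient estimates copied from the printed glm output
log_odds <- c(age = 0.06984, gender = -0.50890,
              smoking = 0.04924, diabetes = 1.20404)

# exp() turns a change in log-odds into a multiplicative change in odds
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))
```

An odds ratio above 1 means higher odds of CHD per one-unit increase in the predictor, and a value below 1 means lower odds. With the fitted object at hand, exp(coef(regression2)) gives the same numbers without retyping them.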
2.3 What Factors are Associated with Hypertension?
regression3 <- glm(hypertension ~ gender + age + diabetes + stroke + `coronary heart disease` + smoking + BMI, data = dataframe, family = binomial)
regression3
##
## Call: glm(formula = hypertension ~ gender + age + diabetes + stroke +
## `coronary heart disease` + smoking + BMI, family = binomial,
## data = dataframe)
##
## Coefficients:
## (Intercept) genderMale age
## -5.65586 -0.02412 0.05872
## diabetes stroke `coronary heart disease`
## 0.61324 1.72889 0.79667
## smokingever smokingformer smokingnever
## 0.32718 -0.95861 -0.82378
## smokingnot current BMI
## -0.88866 0.02304
##
## Degrees of Freedom: 333 Total (i.e. Null); 323 Residual
## (296 observations deleted due to missingness)
## Null Deviance: 240.8
## Residual Deviance: 188 AIC: 210
2.3.1 Results and Interpretation
Firstly, age showed a positive association with hypertension, with a coefficient of 0.059. For each one-unit increase in age, the odds of having hypertension are multiplied by approximately \(e^{0.05872} \approx 1.060\), an increase of about 6%. Diabetes and stroke were also positively associated with hypertension, with coefficient estimates of 0.61 and 1.73 respectively. This showed that individuals with diabetes and those with a history of stroke had higher odds of hypertension than those without these conditions. Smoking entered this model as a categorical variable whose reference level is “current”, the one level without a coefficient in the output. Relative to current smokers, individuals who reported being former, never, or not current smokers had lower odds of hypertension, with coefficient estimates of -0.95861, -0.82378, and -0.88866 respectively. Individuals in the “ever” category had higher odds of hypertension than current smokers, with a coefficient estimate of 0.32718.
Coronary heart disease (CHD) was positively associated with hypertension, with a coefficient estimate of 0.79667. This showed that individuals with a history of coronary heart disease were more likely to have hypertension. Lastly, BMI showed a positive association with hypertension, with an estimated coefficient of 0.02304. This suggested that as BMI increased, the odds of hypertension also tended to increase.
Therefore, the logistic regression model suggested that age, diabetes, stroke, coronary heart disease, smoking status, and BMI were associated with hypertension, although the printed output does not include the p-values needed to assess statistical significance.
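When a character or factor variable enters a model formula, R treats the first factor level as the reference category, and by default levels are sorted alphabetically. A minimal sketch with the five smoking labels (the object names are hypothetical) shows which level that is and how relevel() changes it:

```r
# The five smoking labels as a factor; levels are sorted alphabetically
smoking <- factor(c("never", "current", "former", "ever", "not current"))
default_levels <- levels(smoking)

# The first level is the reference category in lm() and glm()
default_reference <- default_levels[1]

# relevel() moves a chosen level to the front, making it the new reference
smoking_vs_never <- relevel(smoking, ref = "never")
new_reference <- levels(smoking_vs_never)[1]
```

Choosing “never” as the reference would express every smoking coefficient as a contrast against people who never smoked, which is often the easier comparison to interpret.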
3. What Factors are Associated With Job Satisfaction?
dataframe3 <- read_xlsx("nurses1 (2).xlsx")
#names(dataframe3)
x <- sum(is.na(dataframe3))
cat("Null cases are", x)
## Null cases are 420
dataframe3 <- dataframe3[complete.cases(dataframe3),]
Checking the levels in job satisfaction:
dataframe3 %>%
distinct(Jobsat) %>%
  pull(Jobsat)
## [1] 15 12 14 10 11 9 18 17 16 8 7 21 19 13 24 22 23 20 31 25
names(dataframe3)
## [1] "RespondentID"
## [2] "gender"
## [3] "age"
## [4] "currentemploymentstatus"
## [5] "maritalstatus"
## [6] "n_children"
## [7] "education"
## [8] "hours_work"
## [9] "workhours_informallyexpected"
## [10] "workhoursareinformallyexpected_colleagues"
## [11] "annualincome"
## [12] "current position"
## [13] "supervisoryduties"
## [14] "tenure"
## [15] "intquit"
## [16] "jobsecurity"
## [17] "Lifesat"
## [18] "Jobsat"
## [19] "Careersat"
## [20] "Jobstress"
## [21] "psychosomatic"
This shows that job satisfaction is a numeric score, with higher values indicating higher job satisfaction and lower values indicating lower job satisfaction. Multiple linear regression (MLR) is therefore an appropriate procedure.
We convert the categorical variables into factors before fitting the MLR.
# Collect the names of the categorical columns and convert them to factors
factors <- dataframe3 %>%
  select(gender, currentemploymentstatus, maritalstatus,
         education, annualincome, `current position`,
         supervisoryduties)
factors <- names(factors)
dataframe3[,factors] <- lapply(dataframe3[,factors], factor)
# Regress job satisfaction on all remaining variables
regression4 <- lm(Jobsat ~ gender + age + currentemploymentstatus + maritalstatus
                  + n_children + education + hours_work + workhours_informallyexpected
                  + workhoursareinformallyexpected_colleagues + annualincome
                  + `current position` + supervisoryduties + tenure + intquit
                  + jobsecurity + Lifesat + Careersat + Jobstress + psychosomatic, data = dataframe3)
summary(regression4)
##
## Call:
## lm(formula = Jobsat ~ gender + age + currentemploymentstatus +
## maritalstatus + n_children + education + hours_work + workhours_informallyexpected +
## workhoursareinformallyexpected_colleagues + annualincome +
## `current position` + supervisoryduties + tenure + intquit +
## jobsecurity + Lifesat + Careersat + Jobstress + psychosomatic,
## data = dataframe3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2890 -2.0297 -0.3223 1.6449 11.2436
##
## Coefficients:
##                                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               -1.7943145  6.1529519  -0.292  0.77120
## gender2                                   -0.7730453  2.9978218  -0.258  0.79705
## age                                        0.6560028  0.6356958   1.032  0.30466
## currentemploymentstatus2                   1.6889166  1.2714066   1.328  0.18717
## maritalstatus2                             1.0140232  1.8993239   0.534  0.59464
## maritalstatus3                             1.5961068  1.5665656   1.019  0.31081
## n_children                                -0.5001866  0.3854265  -1.298  0.19745
## education2                                 2.4488033  1.7430671   1.405  0.16325
## education3                                 2.1194881  2.4188844   0.876  0.38307
## education4                                 1.7616006  1.6250152   1.084  0.28103
## education5                                 4.8050248  2.3849678   2.015  0.04670 *
## education6                                -1.8439998  3.4946917  -0.528  0.59894
## hours_work                                 0.0093636  0.0646937   0.145  0.88522
## workhours_informallyexpected              -0.0009189  0.0444724  -0.021  0.98356
## workhoursareinformallyexpected_colleagues  0.0043937  0.0427214   0.103  0.91830
## annualincome2                              4.3593932  3.1978781   1.363  0.17597
## annualincome3                              4.6758065  3.2552742   1.436  0.15411
## annualincome4                              3.7377491  3.3429116   1.118  0.26628
## annualincome5                              3.4368422  3.8265202   0.898  0.37132
## `current position`2                       -0.2781898  0.9233308  -0.301  0.76384
## `current position`3                        0.7425186  2.2753049   0.326  0.74487
## `current position`4                       -0.0766547  3.2349704  -0.024  0.98114
## `current position`5                       -4.4582116  4.3313724  -1.029  0.30591
## `current position`6                        1.3578155  1.7309073   0.784  0.43469
## `current position`7                       -1.6148285  2.0887214  -0.773  0.44133
## supervisoryduties2                         0.5066841  0.8109406   0.625  0.53356
## tenure                                     0.0395116  0.0617447   0.640  0.52373
## intquit                                    0.2390910  0.1523659   1.569  0.11986
## jobsecurity                                0.0952355  0.2548000   0.374  0.70939
## Lifesat                                    0.0603512  0.0597209   1.011  0.31474
## Careersat                                  0.4535551  0.1096307   4.137 7.49e-05 ***
## Jobstress                                 -0.2038342  0.0756148  -2.696  0.00828 **
## psychosomatic                              0.0560981  0.0394100   1.423  0.15781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.668 on 97 degrees of freedom
## Multiple R-squared: 0.5269, Adjusted R-squared: 0.3709
## F-statistic: 3.376 on 32 and 97 DF, p-value: 2.18e-06
Results and Interpretation
From the results, a few variables are associated with job satisfaction, while most variables in the model are not statistically significant predictors of Jobsat.
The variables ‘education5’ (Master's degree in nursing), ‘Careersat’ (career satisfaction), and ‘Jobstress’ had statistically significant coefficients, meaning that they are associated with the job satisfaction score. Holding a Master's degree in nursing was positively associated with job satisfaction relative to the reference education level, suggesting that individuals with advanced education may experience higher job satisfaction. Similarly, individuals who reported higher career satisfaction (Careersat) had higher job satisfaction. On the other hand, job stress (Jobstress) showed a negative association with job satisfaction: higher stress levels are associated with lower job satisfaction.
Factors such as gender, age, marital status, and current employment status were not statistically significant predictors of job satisfaction in this model.
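With this many coefficients, the significant terms can also be pulled out of the summary programmatically rather than read off by eye. The sketch below uses a small model on the built-in mtcars data as a stand-in, since the nurses data set is not reproduced here; the object names are hypothetical.

```r
# Stand-in model on built-in data (not the job satisfaction model)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(...)) is a matrix with one row per term and columns for
# the estimate, standard error, t value and p-value
coef_table <- coef(summary(fit))

# Keep only the rows whose p-value is below the 5% level
significant <- coef_table[coef_table[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
print(round(significant, 4))
```

Applied to the model above, the same two lines with regression4 in place of fit would list the terms flagged with stars in the printed summary.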