Introduction
Understanding the relationship between demographic variables and socioeconomic factors is important to organizations and individuals alike. Identifying differences in means across groups such as gender or age helps organizations develop strategies and policies that enhance employee well-being, satisfaction and productivity. Individuals can also monitor their own performance and the characteristics that may affect their work experience, enabling them to make informed decisions about career choices and work-life balance. This study examines differences in means across demographic groups using t-tests and analysis of variance (ANOVA). These techniques are applied to two datasets. The first is used to compare two variables, age and body mass index (BMI), between genders. The second focuses on job and life satisfaction, covering work-related factors such as job stress, job satisfaction, intent to quit, life satisfaction, psychosomatic complaints and working hours. Together, the two analyses yield findings with practical implications for practitioners and show how these variables relate in real-life settings.
Comparing Ages and BMI by Gender using T-Test
The t-test is a statistical test used to compare the means of two groups and determine whether the difference between them is statistically significant. It assesses whether the observed difference in means is larger than what one would expect from random chance alone. We use the independent samples t-test to compare mean age, and later mean BMI, between males and females.
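As a minimal sketch of an independent samples t-test in base R, the block below uses synthetic data. The data frame `toy` and its columns `group` and `value` are made up for illustration and are not part of the study data.

```r
# Synthetic data for illustration only (not the study data)
set.seed(1)
toy <- data.frame(
  group = rep(c("A", "B"), times = c(30, 20)),
  value = c(rnorm(30, mean = 50, sd = 10), rnorm(20, mean = 55, sd = 12))
)

# Formula interface: numeric outcome on the left, two-level grouping variable on the right.
# By default t.test() runs Welch's test (var.equal = FALSE) with a two-sided alternative.
welch <- t.test(value ~ group, data = toy)

welch$statistic  # t statistic
welch$parameter  # Welch-adjusted degrees of freedom
welch$p.value    # two-sided p-value
welch$conf.int   # 95% confidence interval for the difference in means
```

The returned object is a list, so individual results can be extracted by name instead of being read off the printed output.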
Computing Independent Samples T-Test in R
The stats package (shipped with R), DescTools and ggpubr can be used to compare means and to visualize the group comparisons.
Clear R Environment
This clears the R environment so that objects created earlier do not interfere with the ones created below.
rm(list = ls())
Libraries
The readxl library lets us load Excel files into the R environment. The tidyverse package attaches several libraries, including dplyr, which helps us subset and sort the data. ggpubr and ggplot2 (included in tidyverse) support the t-test comparisons and the visualizations.
require(readxl)
## Loading required package: readxl
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(ggpubr)
## Loading required package: ggpubr
Load GE Dataset using readxl
dataset <- read_xlsx("GE Dataset1 (2).xlsx")
dataset
## # A tibble: 630 × 8
## gender age diabetes hypertension stroke `coronary heart disease` smoking
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Female 2 0 0 0 0 <NA>
## 2 Male 2 0 0 0 0 <NA>
## 3 Female 2 0 0 0 0 <NA>
## 4 Male 2 0 0 0 0 <NA>
## 5 Female 2 0 0 0 0 never
## 6 Female 3 0 0 0 0 <NA>
## 7 Female 3 0 0 0 0 <NA>
## 8 Female 3 0 0 0 0 <NA>
## 9 Female 3 0 0 0 0 <NA>
## 10 Female 3 0 0 0 0 <NA>
## # ℹ 620 more rows
## # ℹ 1 more variable: BMI <dbl>
Data Pre-Processing
Print Missing Values present in the whole dataset
#fix(dataset)#Comprehensive Tabular View of the data set
x <- sum(is.na(dataset))
sprintf("Missing Values: %d", x)
## [1] "Missing Values: 364"
# Since the analysis focuses on age and BMI by gender, we check missing values in those variables only
sum(is.na(dataset$age))
## [1] 0
sum(is.na(dataset$BMI))
## [1] 156
sum(is.na(dataset$gender))
## [1] 0
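The per-variable checks above can also be done for every column at once. The sketch below uses a small hypothetical data frame, `toy`, standing in for `dataset`; its values are invented for illustration.

```r
# Hypothetical data frame standing in for `dataset`
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Male"),
  age    = c(34, 51, 27, 46),
  BMI    = c(22.5, NA, 30.1, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Share of missing values in each column, as a percentage
na_percent <- round(100 * colMeans(is.na(toy)), 1)
na_percent
```

Applied to the real data, `colSums(is.na(dataset))` would list the missing count for each of the eight columns in a single call.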
Comparison of Age between Genders
Descriptive Statistics
We subset the variables gender and age.
t <- dataset %>% select(gender, age)
t %>% group_by(gender) %>%
  summarize(Count = n(),
            Mean = mean(age),
            SD = sd(age),
            Median = median(age))
## # A tibble: 2 × 5
## gender Count Mean SD Median
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Female 373 42.4 22.3 43
## 2 Male 257 44.9 21.4 46
The descriptive statistics give an overview of the age distribution of the sample by gender. There are 373 females and 257 males in the dataset. The mean age for females is approximately 42.40 years, with a standard deviation of 22.29. Males have a mean age of about 44.89 years, with a slightly lower standard deviation of 21.43. The median age is 43 for females and 46 for males. On average, the males in the sample tend to be slightly older than the females. However, the standard deviations of more than 21 years indicate a high degree of variability in age within both groups.
dataset %>% select(gender, age) %>%
  ggplot(aes(gender, age, color = gender)) +
  geom_boxplot() +
  ylim(0, 100) +
  ggtitle("Age Comparison by Gender Groups using Boxplot") +
  theme_minimal()
The boxplot highlights the similarities and differences between the two gender groups. The interquartile ranges of the two groups are roughly the same, while the maximum and minimum values differ. The median line for males sits slightly above the median line for females.
T-Test
We state the null and alternative hypotheses to be tested. The two groups are of unequal size (373 females and 257 males); t.test() runs Welch's test by default, which does not assume equal variances. Because the alternative hypothesis is two-sided, the alternative argument does not need to be set: its default is "two.sided", and the one-sided options are "greater" and "less".
- \(H_0:\) There is no significant difference in age between the genders.
- \(H_1:\) There is a significant difference in age between the genders.
ageFemale <- t %>% filter(gender == "Female") %>%
  pull(age)  # extract age as a plain numeric vector
ageMale <- t %>% filter(gender == "Male") %>%
  pull(age)
ind.test <- t.test(ageFemale, ageMale)
ind.test
##
## Welch Two Sample t-test
##
## data: ageFemale and ageMale
## t = -1.4056, df = 564.18, p-value = 0.1604
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.9510217 0.9863541
## sample estimates:
## mean of x mean of y
## 42.40483 44.88716
Results
The Welch two-sample t-test compares mean age between females and males. It yields a t-statistic of -1.4056 with 564.18 degrees of freedom and a p-value of 0.1604.
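The reported statistics can be reconstructed from the group summaries. The sketch below plugs the counts, means and (rounded) standard deviations from the descriptive table into the Welch formulas. Because the standard deviations are rounded to two decimals, the results agree with the t.test() output only approximately.

```r
# Summary statistics as reported above (SDs are rounded to two decimals)
n_f <- 373; mean_f <- 42.40483; sd_f <- 22.29
n_m <- 257; mean_m <- 44.88716; sd_m <- 21.43

# Squared standard error contributed by each group
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean_f - mean_m) / sqrt(v_f + v_m)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

The fractional degrees of freedom in the output come from this approximation, which is what distinguishes Welch's test from the pooled-variance t-test.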
Interpretation
The p-value of 0.1604 is greater than 0.05, so we fail to reject the null hypothesis that the true difference in mean age between the two groups is 0. The data do not provide sufficient evidence for the alternative hypothesis of a difference in means. The 95% confidence interval, (-5.95, 0.99), is consistent with this conclusion: it contains 0, so the true difference in means could plausibly be 0.
Comparison of BMI Between Genders
This section compares body mass index (BMI) between males and females. BMI is a measure of body fat based on a person's weight and height.
Descriptive Statistics
To obtain measures of central tendency and dispersion, we use dplyr's summarize function with na.rm = TRUE, because BMI is missing for 156 of the 630 cases (about 25%). It would be sign
b <- dataset %>% select(Gender = gender, BMI)
b %>% group_by(Gender) %>%
  summarize(Count = n(),
            Mean = mean(BMI, na.rm = TRUE),
            SD = sd(BMI, na.rm = TRUE),
            Median = median(BMI, na.rm = TRUE)
  )
## # A tibble: 2 × 5
## Gender Count Mean SD Median
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Female 373 27.3 7.93 25.8
## 2 Male 257 27.5 6.36 27.7
The descriptive statistics describe the distribution of BMI within each gender group. There are 373 females and 257 males; these counts are group sizes from n() and include cases with missing BMI. For females, the mean BMI is approximately 27.33 with a standard deviation of 7.93, and the median is 25.76, somewhat below the mean. For males, the mean BMI is slightly higher at about 27.52, with a lower standard deviation of 6.36, and the median is 27.66, close to the mean.
ggplot(b, aes(x = BMI, fill = Gender)) +
  geom_density(alpha = 0.5, adjust = 3) +
  ggtitle("Distribution of BMI Between Genders") +
  theme_minimal()
## Warning: Removed 156 rows containing non-finite values (`stat_density()`).
The density plots give a visual picture of the BMI distribution in the two groups. Both distributions are skewed to the right, with tails extending towards a BMI of about 60. Neither group appears to follow a normal distribution.
library(e1071)
bmiFemale <- b %>% filter(Gender == "Female") %>%
  pull(BMI)  # extract BMI as a plain numeric vector
bmiMale <- b %>% filter(Gender == "Male") %>%
  pull(BMI)
print(paste("Kurtosis in Female group:", round(kurtosis(bmiFemale, na.rm = TRUE), 2)))
## [1] "Kurtosis in Female group: 1.48"
print(paste("Skewness in Female group:", round(skewness(bmiFemale, na.rm = TRUE), 2)))
## [1] "Skewness in Female group: 1.06"
print(paste("Kurtosis in Male group:", round(kurtosis(bmiMale, na.rm = TRUE), 2)))
## [1] "Kurtosis in Male group: 0.86"
# skewness() is the intended call here. The recorded run called kurtosis() again,
# which is why the output below repeats 0.86 and does not show the male skewness.
print(paste("Skewness in Male group:", round(skewness(bmiMale, na.rm = TRUE), 2)))
## [1] "Skewness in Male group: 0.86"
Kurtosis measures how heavy or light the tails of a distribution are compared to a normal distribution, while skewness measures its asymmetry. Both gender groups have positive kurtosis values (1.48 for females and 0.86 for males), suggesting somewhat heavier tails than the normal distribution. The female skewness of 1.06 indicates right skew, that is, a tail towards higher BMI values. The male skewness was not actually printed, since the value labelled as skewness in the recorded output is the kurtosis, but the density plot suggests right skew for males as well.
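To make the two measures concrete, here is a minimal sketch of moment-based skewness and excess kurtosis in base R. The helper `moment_shape()` is a hypothetical name introduced for illustration. The e1071 functions apply a small-sample scaling by default, so their values may differ slightly from these.

```r
# Moment-based skewness and excess kurtosis (missing values are dropped first)
moment_shape <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)  # second central moment
  m3 <- mean(d^3)  # third central moment
  m4 <- mean(d^4)  # fourth central moment
  c(skewness = m3 / m2^1.5,
    excess_kurtosis = m4 / m2^2 - 3)
}

# Symmetric toy vector: skewness is 0 and the tails are lighter than normal
sym <- moment_shape(c(1, 2, 3, 4, 5, NA))
sym

# One large value on the right produces positive skewness
right <- moment_shape(c(1, 2, 3, 4, 20))
right
```

A positive skewness means the long tail lies on the right, and a positive excess kurtosis means heavier tails than a normal distribution.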
Comparison of BMI Between Gender Groups
We formulate our null and alternative hypotheses as follows:
- \(H_0:\) The true difference in mean BMI between females and males is equal to 0.
- \(H_1:\) The true difference in mean BMI between females and males is not equal to 0.
t.test(bmiFemale, bmiMale)
##
## Welch Two Sample t-test
##
## data: bmiFemale and bmiMale
## t = -0.28289, df = 459.62, p-value = 0.7774
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.480250 1.107704
## sample estimates:
## mean of x mean of y
## 27.33138 27.51766
Results
From the analysis, we get a t-value of -0.28289 with 459.62 degrees of freedom, yielding a non-significant p-value of 0.7774.
Interpretation
There is no statistically significant difference in mean BMI between females (M = 27.33) and males (M = 27.52) in the sample. The 95% confidence interval for the difference in means runs from -1.48 to 1.11 and contains 0, which is consistent with the absence of a difference between the two groups. Based on the available data, there is no compelling evidence of a disparity in BMI between the two gender groups.
Comparing Job Stress by Marital status
Dataset
dataset_2 <- read_xlsx("nurses1 (2).xlsx") ## By default, the first sheet is selected
names(dataset_2)
## [1] "RespondentID"
## [2] "gender"
## [3] "age"
## [4] "currentemploymentstatus"
## [5] "maritalstatus"
## [6] "n_children"
## [7] "education"
## [8] "hours_work"
## [9] "workhours_informallyexpected"
## [10] "workhoursareinformallyexpected_colleagues"
## [11] "annualincome"
## [12] "current position"
## [13] "supervisoryduties"
## [14] "tenure"
## [15] "intquit"
## [16] "jobsecurity"
## [17] "Lifesat"
## [18] "Jobsat"
## [19] "Careersat"
## [20] "Jobstress"
## [21] "psychosomatic"
Data Pre-Processing
sum(is.na(dataset_2$maritalstatus))
## [1] 5
There are 5 missing cases in the marital status variable. It is best to omit these cases, since they are few and are unlikely to affect the overall results.
dataset_2 <- dataset_2 %>% drop_na(maritalstatus)
Descriptive Statistics
Subset the data into Marital status and Job Stress
j <- dataset_2 %>% select(Marital_Status = maritalstatus, Job_Stress = Jobstress)
j %>% group_by(Marital_Status) %>%
  summarize(Count = n(),
            Mean = mean(Job_Stress, na.rm = TRUE),
            SD = sd(Job_Stress, na.rm = TRUE),
            Median = median(Job_Stress, na.rm = TRUE))
## # A tibble: 3 × 5
## Marital_Status Count Mean SD Median
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 54 24.0 5.72 23
## 2 2 35 25.5 5.60 25
## 3 3 196 25.0 5.87 25
Category 1: There are 54 individuals with Marital_Status 1. The mean job stress score in this category is approximately 24.04, with a standard deviation of 5.72. The median is 23, slightly below the mean.
Category 2: There are 35 individuals with Marital_Status 2. The mean job stress score in this category is approximately 25.52, with a standard deviation of 5.60. The median is 25.
Category 3: There are 196 individuals with Marital_Status 3. The mean job stress score in this category is approximately 25.04, with a standard deviation of 5.87. The median is also 25.
The counts are group sizes from n() and include respondents with a missing job stress score.
ggplot(j, aes(x = Marital_Status, y = Job_Stress, fill = Marital_Status)) +
  geom_bar(stat = "identity") +
  xlab("Marital Status") +
  ylab("Job Stress") +
  ggtitle("Bar graph Showing Distribution of Marital Status Groups") +
  theme_minimal()
## Warning: Removed 41 rows containing missing values (`position_stack()`).
Because geom_bar(stat = "identity") stacks the individual scores, the height of each bar is the sum of the job stress scores in that category, so it mainly reflects group size. Category 3 has the tallest bar, followed by category 1, and category 2 has the shortest, which matches the order of the group counts.
model <- aov(Job_Stress ~ Marital_Status, data = j)
model
## Call:
## aov(formula = Job_Stress ~ Marital_Status, data = j)
##
## Terms:
## Marital_Status Residuals
## Sum of Squares 24.797 8162.035
## Deg. of Freedom 1 242
##
## Residual standard error: 5.807531
## Estimated effects may be unbalanced
## 41 observations deleted due to missingness
Results
In the ANOVA output, the sum of squares for Marital_Status is 24.797 on 1 degree of freedom, while the residual sum of squares is 8162.035 on 242 degrees of freedom. The residual standard error of 5.808 reflects the variability in job stress that is not accounted for by marital status. The output also notes that 41 observations were excluded because of missing values. The single degree of freedom shows that Marital_Status, which is stored as a numeric code, was fitted as one linear term rather than as three separate groups.
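Printing an aov object shows sums of squares but no test statistic. As a sketch, the F statistic and p-value can be derived by hand from the printed values. The variable names below are introduced for illustration only.

```r
# Values copied from the printed aov output
ss_effect <- 24.797;   df_effect <- 1
ss_resid  <- 8162.035; df_resid  <- 242

# Mean squares: sum of squares divided by degrees of freedom
ms_effect <- ss_effect / df_effect
ms_resid  <- ss_resid / df_resid

# F statistic and its upper-tail p-value
f_stat  <- ms_effect / ms_resid
p_value <- pf(f_stat, df_effect, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_value)
sqrt(ms_resid)  # reproduces the reported residual standard error
```

Calling summary(model) on the fitted object would report the same F value and Pr(>F) directly.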
Interpretation
The printed model does not report an F statistic or a p-value, but both follow from the sums of squares: F(1, 242) = (24.797 / 1) / (8162.035 / 242) ≈ 0.74, well below the 5% critical value of about 3.88. The ANOVA therefore does not indicate a significant difference in mean job stress related to marital status. The sum of squares for Marital_Status (24.797) is a very small fraction of the residual sum of squares (8162.035), so marital status explains almost none of the variability in job stress. The residual standard error estimates the typical variation in job stress that is not explained by marital status.
In conclusion, these results provide no evidence that marital status affects job stress levels. Since Marital_Status was treated as numeric, a comparison of the three group means requires converting it to a factor before fitting the model.
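The effect of numeric versus factor coding can be seen in a small sketch. The data below are synthetic, and the names `toy`, `status` and `stress` are hypothetical stand-ins for the marital status and job stress variables.

```r
# Synthetic data for illustration only: three groups coded 1, 2, 3
set.seed(42)
toy <- data.frame(
  status = rep(1:3, each = 20),
  stress = rnorm(60, mean = 25, sd = 6)
)

# Numeric predictor: aov() fits a single slope, so the term has 1 df
fit_numeric <- aov(stress ~ status, data = toy)

# Factor predictor: one mean per group, so the term has (3 - 1) = 2 df
toy$status_f <- factor(toy$status)
fit_factor <- aov(stress ~ status_f, data = toy)

summary(fit_numeric)
summary(fit_factor)  # includes the F value and Pr(>F) for the group comparison
```

With the factor coding, the F test asks whether any of the three group means differ, which is the question a one-way ANOVA is meant to answer.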
Comparing Job Stress, Job Satisfaction, Intent to Quit, Life Satisfaction and Psychosomatic Complaints by Age Categories
model2 <- manova(cbind(Jobstress, Jobsat, intquit, Lifesat, psychosomatic) ~ age, data = dataset_2)
summary(model2)
## Df Pillai approx F num Df den Df Pr(>F)
## age 1 0.10738 5.0768 5 211 0.0002043 ***
## Residuals 215
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results & Interpretation
A multivariate analysis of variance (MANOVA) was conducted to examine job stress, job satisfaction, intent to quit, life satisfaction and psychosomatic complaints in relation to age. The results indicate a significant multivariate effect of age on the combined dependent variables (Pillai's trace = 0.107, F(5, 211) = 5.08, p = 0.0002). Pillai's trace measures the strength of the overall effect of age on the set of dependent variables, and the p-value indicates that this effect is statistically significant. Age enters the model with 1 degree of freedom, which means either that there are only two age categories or that age was fitted as a single numeric predictor rather than as separate categories. Examination of the individual variables (the univariate output is not shown here; it can be obtained with summary.aov(model2)) is reported to show significant differences for job stress, intent to quit and psychosomatic complaints (F = 3.61, p < 0.05; F = 7.56, p < 0.01; F = 6.56, p < 0.01, respectively). No significant difference is reported for job satisfaction and life satisfaction (F = 1.24, p = 0.29; F = 0.62, p = 0.73, respectively). From these findings, age plays a role in aspects of job-related stress and well-being, but its impact varies across variables.
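The link between Pillai's trace and the approximate F statistic can be checked by hand. The sketch below relies on the fact that, when the effect has a single degree of freedom, F follows directly from the trace. The variable names are introduced for illustration.

```r
# Values copied from the printed summary(model2) output
pillai <- 0.10738
num_df <- 5    # numerator df: one per dependent variable
den_df <- 211  # denominator df

# With a single hypothesis degree of freedom, F follows directly from Pillai's trace
f_stat  <- (pillai / (1 - pillai)) * (den_df / num_df)
p_value <- pf(f_stat, num_df, den_df, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The result agrees with the reported approx F of 5.0768 up to rounding of the trace.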
Comparing Total Hours Worked in a typical week by Current Position.
c <- dataset_2 %>% select(hours_work, Current_Position = 'current position')
c <- c %>% drop_na(Current_Position)
model3 <- aov(hours_work ~ Current_Position, data = c)
summary(model3)
## Df Sum Sq Mean Sq F value Pr(>F)
## Current_Position 1 365 365.1 4.455 0.0359 *
## Residuals 232 19013 82.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
Results and Interpretation.
The ANOVA table gives the sums of squares, degrees of freedom and mean squares. The residual mean square of 82.0, on 232 degrees of freedom, estimates the variability in total hours worked that is not explained by current position. The one-way ANOVA indicates a significant effect of current position on mean total hours worked in a typical week (F(1, 232) = 4.455, p = 0.0359), since the p-value is below the chosen significance level of 0.05. Current_Position has only 1 degree of freedom, which means either that there are only two positions or that the variable was fitted as a numeric code; in the latter case the test reflects a linear trend across the position codes rather than a comparison of each position's mean. Two observations were excluded because of missing values.