This first stage of the study presents calculations, visualizations, and interpretations based on a dataset of qualitative and quantitative variables. It builds on prior knowledge from the Gestión de Datos (Data Management) course taught by Giancarlo Libreros Londoño for meaningful learning, in the in-person modality (cohort 2023-4) at the Universidad del Valle (Zarzal campus, Valle). The dataset was obtained through Kaggle: https://www.kaggle.com/datasets?tags=12127-Software. The earlier work done for the Data Management course can be consulted temporarily at: https://rpubs.com/DFVV00/1222963. Finally, this work was processed with R version 4.4.1 (2024-06-15 ucrt) through RStudio 2024.09.0+375 on an x86_64-w64-mingw32 platform.
Description of the dataset:
The original dataset contains 20 fields and 5,000 records, and the cleaned (depurado) version retains 11 of those fields. In the cleaned version, one field is a record identifier, one is dichotomous, the other categorical fields are polytomous, and the four numeric fields are non-negative. The list further below, after the structure printouts, describes fields in the left-to-right order in which they appear in the data and states, for each field except the identifier Employee_ID, the type of variable and its measurement scale using the nomenclature (variable_type::measurement_scale[ordering]).
str(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_original)
## tibble [5,000 × 20] (S3: tbl_df/tbl/data.frame)
## $ Employee_ID : chr [1:5000] "EMP0001" "EMP0002" "EMP0003" "EMP0004" ...
## $ Age : num [1:5000] 32 40 59 27 49 59 31 42 56 30 ...
## $ Gender : chr [1:5000] "Non-binary" "Female" "Non-binary" "Male" ...
## $ Job_Role : chr [1:5000] "HR" "Data Scientist" "Software Engineer" "Software Engineer" ...
## $ Industry : chr [1:5000] "Healthcare" "IT" "Education" "Finance" ...
## $ Years_of_Experience : num [1:5000] 13 3 22 20 32 31 24 6 9 28 ...
## $ Work_Location : chr [1:5000] "Hybrid" "Remote" "Hybrid" "Onsite" ...
## $ Hours_Worked_Per_Week : num [1:5000] 47 52 46 32 35 39 51 54 24 57 ...
## $ Number_of_Virtual_Meetings : num [1:5000] 7 4 11 8 12 3 7 7 4 6 ...
## $ Work_Life_Balance_Rating : num [1:5000] 2 1 5 4 2 4 3 3 2 1 ...
## $ Stress_Level : chr [1:5000] "Medium" "Medium" "Medium" "High" ...
## $ Mental_Health_Condition : chr [1:5000] "Depression" "Anxiety" "Anxiety" "Depression" ...
## $ Access_to_Mental_Health_Resources: chr [1:5000] "No" "No" "No" "Yes" ...
## $ Productivity_Change : chr [1:5000] "Decrease" "Increase" "No Change" "Increase" ...
## $ Social_Isolation_Rating : num [1:5000] 1 3 4 3 3 5 5 5 2 2 ...
## $ Satisfaction_with_Remote_Work : chr [1:5000] "Unsatisfied" "Satisfied" "Unsatisfied" "Unsatisfied" ...
## $ Company_Support_for_Remote_Work : num [1:5000] 1 2 5 3 3 1 3 4 4 1 ...
## $ Physical_Activity : chr [1:5000] "Weekly" "Weekly" "None" "None" ...
## $ Sleep_Quality : chr [1:5000] "Good" "Good" "Poor" "Poor" ...
## $ Region : chr [1:5000] "Europe" "Asia" "North America" "Europe" ...
(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_original)
## # A tibble: 5,000 × 20
## Employee_ID Age Gender Job_Role Industry Years_of_Experience Work_Location
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 EMP0001 32 Non-bi… HR Healthc… 13 Hybrid
## 2 EMP0002 40 Female Data Sc… IT 3 Remote
## 3 EMP0003 59 Non-bi… Softwar… Educati… 22 Hybrid
## 4 EMP0004 27 Male Softwar… Finance 20 Onsite
## 5 EMP0005 49 Male Sales Consult… 32 Onsite
## 6 EMP0006 59 Non-bi… Sales IT 31 Hybrid
## 7 EMP0007 31 Prefer… Sales IT 24 Remote
## 8 EMP0008 42 Non-bi… Data Sc… Manufac… 6 Onsite
## 9 EMP0009 56 Prefer… Data Sc… Healthc… 9 Hybrid
## 10 EMP0010 30 Female HR IT 28 Hybrid
## # ℹ 4,990 more rows
## # ℹ 13 more variables: Hours_Worked_Per_Week <dbl>,
## # Number_of_Virtual_Meetings <dbl>, Work_Life_Balance_Rating <dbl>,
## # Stress_Level <chr>, Mental_Health_Condition <chr>,
## # Access_to_Mental_Health_Resources <chr>, Productivity_Change <chr>,
## # Social_Isolation_Rating <dbl>, Satisfaction_with_Remote_Work <chr>,
## # Company_Support_for_Remote_Work <dbl>, Physical_Activity <chr>, …
str(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado)
## tibble [5,000 × 11] (S3: tbl_df/tbl/data.frame)
## $ Employee_ID : chr [1:5000] "EMP0001" "EMP0002" "EMP0003" "EMP0004" ...
## $ Age : num [1:5000] 32 40 59 27 49 59 31 42 56 30 ...
## $ Gender : chr [1:5000] "Non-binary" "Female" "Non-binary" "Male" ...
## $ Job_Role : chr [1:5000] "HR" "Data Scientist" "Software Engineer" "Software Engineer" ...
## $ Years_of_Experience : num [1:5000] 13 3 22 20 32 31 24 6 9 28 ...
## $ Work_Location : chr [1:5000] "Hybrid" "Remote" "Hybrid" "Onsite" ...
## $ Hours_Worked_Per_Week : num [1:5000] 47 52 46 32 35 39 51 54 24 57 ...
## $ Number_of_Virtual_Meetings : num [1:5000] 7 4 11 8 12 3 7 7 4 6 ...
## $ Access_to_Mental_Health_Resources: chr [1:5000] "No" "No" "No" "Yes" ...
## $ Productivity_Change : chr [1:5000] "Decrease" "Increase" "No Change" "Increase" ...
## $ Sleep_Quality : chr [1:5000] "Good" "Good" "Poor" "Poor" ...
(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado)
## # A tibble: 5,000 × 11
## Employee_ID Age Gender Job_Role Years_of_Experience Work_Location
## <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 EMP0001 32 Non-binary HR 13 Hybrid
## 2 EMP0002 40 Female Data Sc… 3 Remote
## 3 EMP0003 59 Non-binary Softwar… 22 Hybrid
## 4 EMP0004 27 Male Softwar… 20 Onsite
## 5 EMP0005 49 Male Sales 32 Onsite
## 6 EMP0006 59 Non-binary Sales 31 Hybrid
## 7 EMP0007 31 Prefer not to s… Sales 24 Remote
## 8 EMP0008 42 Non-binary Data Sc… 6 Onsite
## 9 EMP0009 56 Prefer not to s… Data Sc… 9 Hybrid
## 10 EMP0010 30 Female HR 28 Hybrid
## # ℹ 4,990 more rows
## # ℹ 5 more variables: Hours_Worked_Per_Week <dbl>,
## # Number_of_Virtual_Meetings <dbl>, Access_to_Mental_Health_Resources <chr>,
## # Productivity_Change <chr>, Sleep_Quality <chr>
Employee_ID: (identifier) Records a sequential code, starting at EMP0001, that uniquely identifies each person's record in the database.
Age: (quantitative::ratio) Records the person's age in years.
Gender: (qualitative::nominal) Records the gender of the employee whose data were recorded; this field takes four possible values: male, female, non-binary, and prefer not to say.
Job_Role: (qualitative::nominal) Records each person's occupation, job, or trade.
Work_Location: (qualitative::nominal) Describes the work setting in which employees carry out their professional activities; it takes three values: remote, hybrid, and onsite.
Hours_Worked_Per_Week: (quantitative::ratio) Measures the total number of hours an employee works in a week.
Work_Life_Balance_Rating: (quantitative::ordinal) Measures employees' perception of the balance between their work and personal life on a rating scale from one to five.
Stress_Level: (qualitative::ordinal) Measures employees' perception of their level of stress at work using three categories: low, medium, and high.
Access_to_Mental_Health_Resources: (qualitative::nominal) Indicates whether employees have access to mental health resources provided by the company, which may include services such as therapy, counseling, mental wellness programs, and other related resources; the response is dichotomous, taking only the values yes or no.
Satisfaction_with_Remote_Work: (qualitative::ordinal) Measures employees' level of satisfaction with their remote work experience using the categories unsatisfied, neutral, and satisfied.
Company_Support_for_Remote_Work: (quantitative::ordinal) Measures the level of support employees perceive from their company for remote work on a rating scale from one to five.
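The ordering part of the nomenclature can be made explicit in R by storing ordinal fields as ordered factors. The following is a minimal sketch, not the authors' code: it uses the first four Stress_Level values shown in the structure printout, and the label "Low" is an assumption taken from the description (low, medium, high), since it does not appear among the printed values.

```r
# First four Stress_Level values as they appear in the str() output
stress_raw <- c("Medium", "Medium", "Medium", "High")

# Assumed level labels; "Low" is not among the values printed in this report
stress_levels <- c("Low", "Medium", "High")

# An ordered factor keeps the ranking Low < Medium < High
stress_level <- factor(stress_raw, levels = stress_levels, ordered = TRUE)

# Comparisons and order statistics respect the ranking
at_least_medium <- stress_level >= "Medium"
highest_level <- max(stress_level)

print(table(stress_level))
print(at_least_medium)
print(highest_level)
```

Nominal fields such as Gender or Work_Location would stay as plain (unordered) factors or character vectors, because their categories have no ranking.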
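# Column means of the four numeric variables (the identifier and the categorical fields are dropped by position)
apply(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)], 2, mean)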
##                        Age        Years_of_Experience
##                    40.9950                    17.8102
##      Hours_Worked_Per_Week Number_of_Virtual_Meetings
##                    39.6146                     7.5590
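The means above are obtained by dropping the non-numeric fields by position. A hedged alternative is to select columns by type, so that the selection does not depend on column order. The sketch below uses a toy data frame (`toy`, a hypothetical name) built from the first three rows printed earlier, not the full dataset.

```r
# Toy data: first three rows of a few fields, copied from the printout above
toy <- data.frame(
  Employee_ID = c("EMP0001", "EMP0002", "EMP0003"),
  Age = c(32, 40, 59),
  Gender = c("Non-binary", "Female", "Non-binary"),
  Years_of_Experience = c(13, 3, 22),
  Work_Location = c("Hybrid", "Remote", "Hybrid"),
  stringsAsFactors = FALSE
)

# Flag numeric columns, then keep only those
is_numeric_col <- vapply(toy, is.numeric, logical(1))
toy_numeric <- toy[, is_numeric_col, drop = FALSE]

# Column means of the numeric fields
toy_means <- colMeans(toy_numeric)
print(toy_means)
```

On the full cleaned dataset the same pattern would keep Age, Years_of_Experience, Hours_Worked_Per_Week, and Number_of_Virtual_Meetings, which are the four fields stored as numeric.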
# Keep only the four numeric variables
cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado <- cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)]
# One boxplot per numeric variable, drawn side by side
par(mfrow = c(1, ncol(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado)))
invisible(lapply(seq_len(ncol(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado)), function(i) boxplot(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado[, i])))
#### Variance-covariance matrix
round(cov(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)]),2)
##                               Age Years_of_Experience Hours_Worked_Per_Week
## Age                        127.60               -0.51                 -0.18
## Years_of_Experience         -0.51              100.41                 -2.20
## Hours_Worked_Per_Week       -0.18               -2.20                140.66
## Number_of_Virtual_Meetings   0.19                0.88                 -0.25
##                            Number_of_Virtual_Meetings
## Age                                              0.19
## Years_of_Experience                              0.88
## Hours_Worked_Per_Week                           -0.25
## Number_of_Virtual_Meetings                      21.49
# Correlation matrix of the same four numeric variables
round(cor(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)]),3)
##                               Age Years_of_Experience Hours_Worked_Per_Week
## Age                         1.000              -0.004                -0.001
## Years_of_Experience        -0.004               1.000                -0.019
## Hours_Worked_Per_Week      -0.001              -0.019                 1.000
## Number_of_Virtual_Meetings  0.004               0.019                -0.005
##                            Number_of_Virtual_Meetings
## Age                                             0.004
## Years_of_Experience                             0.019
## Hours_Worked_Per_Week                          -0.005
## Number_of_Virtual_Meetings                      1.000
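Each correlation is the corresponding covariance divided by the product of the two standard deviations, $r_{ij} = s_{ij} / \sqrt{s_{ii}\, s_{jj}}$. As a check, the sketch below re-enters the rounded covariance matrix reported above (the object name `S` is introduced here for illustration) and converts it with base R's `cov2cor()`. Because the covariances are rounded to two decimals, the result can differ from the reported correlations in the third decimal.

```r
vars <- c("Age", "Years_of_Experience",
          "Hours_Worked_Per_Week", "Number_of_Virtual_Meetings")

# Rounded covariance matrix, copied from the output above
S <- matrix(
  c(127.60,  -0.51,  -0.18,  0.19,
     -0.51, 100.41,  -2.20,  0.88,
     -0.18,  -2.20, 140.66, -0.25,
      0.19,   0.88,  -0.25, 21.49),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Convert covariances to correlations in one step
R_from_S <- cov2cor(S)

# Same computation by hand for a single pair
r_manual <- S["Years_of_Experience", "Hours_Worked_Per_Week"] /
  sqrt(S["Years_of_Experience", "Years_of_Experience"] *
       S["Hours_Worked_Per_Week", "Hours_Worked_Per_Week"])

print(round(R_from_S, 3))
print(r_manual)
```

All off-diagonal correlations are within about 0.02 of zero, so the four numeric variables show essentially no linear association with one another.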
set.seed(780728)
# Random sample of 400 rows, numeric variables only
cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado <- cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[sample(1:nrow(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado),400),-c(1,3,4,6,9,10,11)]
# Scatterplot matrix; ggpairs() comes from the GGally package
GGally::ggpairs(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado)
#### Star plot
set.seed(780720)
# Random sample of 20 rows, numeric variables only
cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado <- cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[sample(1:nrow(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado),20),-c(1,3,4,6,9,10,11)]
# One segment diagram per sampled row
stars(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado, len = 0.5, cex = 0.5, key.loc = c(3,4,5,6), draw.segments = TRUE)
#### Chernoff faces
set.seed(780728)
# Random sample of 23 rows, numeric variables only
cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado <- cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[sample(1:nrow(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado),23),-c(1,3,4,6,9,10,11)]
# One face per sampled row; faces() comes from the aplpack package
aplpack::faces(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_muestreado)
## effect of variables:
## modified item Var
## "height of face " "Age"
## "width of face " "Years_of_Experience"
## "structure of face" "Hours_Worked_Per_Week"
## "height of mouth " "Number_of_Virtual_Meetings"
## "width of mouth " "Age"
## "smiling " "Years_of_Experience"
## "height of eyes " "Hours_Worked_Per_Week"
## "width of eyes " "Number_of_Virtual_Meetings"
## "height of hair " "Age"
## "width of hair " "Years_of_Experience"
## "style of hair " "Hours_Worked_Per_Week"
## "height of nose " "Number_of_Virtual_Meetings"
## "width of nose " "Age"
## "width of ear " "Years_of_Experience"
## "height of ear " "Hours_Worked_Per_Week"
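The printout shows that, with only four variables, each variable drives several of the 15 facial features: the variables are assigned to the features in order and reused cyclically. The following sketch reproduces that assignment pattern with `rep_len()`. It illustrates the mapping seen in the output and is not a description of the internals of `faces()`; the object names are introduced here for illustration.

```r
# The 15 facial features, in the order listed in the output above
features <- c(
  "height of face", "width of face", "structure of face",
  "height of mouth", "width of mouth", "smiling",
  "height of eyes", "width of eyes", "height of hair",
  "width of hair", "style of hair", "height of nose",
  "width of nose", "width of ear", "height of ear"
)

# The four numeric variables, in column order
variables <- c("Age", "Years_of_Experience",
               "Hours_Worked_Per_Week", "Number_of_Virtual_Meetings")

# Recycle the variables across the features
feature_map <- data.frame(
  feature = features,
  variable = rep_len(variables, length(features)),
  stringsAsFactors = FALSE
)

# How many features each variable controls
features_per_variable <- table(feature_map$variable)

print(feature_map)
print(features_per_variable)
```

When reading the faces, this means that features sharing a variable (for example face height, mouth width, hair height, and nose width, all driven by Age) move together and carry no independent information.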
# Mardia's multivariate skewness and kurtosis tests; mvn() comes from the MVN package
MVN::mvn(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)], mvnTest="mardia")
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 8.87403201942417 0.984320855901219 YES
## 2 Mardia Kurtosis -24.4508232183314 0 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Age 58.3829 <0.001 NO
## 2 Anderson-Darling Years_of_Experience 59.3035 <0.001 NO
## 3 Anderson-Darling Hours_Worked_Per_Week 58.3981 <0.001 NO
## 4 Anderson-Darling Number_of_Virtual_Meetings 66.5611 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th
## Age 5000 40.9950 11.296021 41 22 60 31 51
## Years_of_Experience 5000 17.8102 10.020412 18 1 35 9 26
## Hours_Worked_Per_Week 5000 39.6146 11.860194 40 20 60 29 50
## Number_of_Virtual_Meetings 5000 7.5590 4.636121 8 0 15 4 12
## Skew Kurtosis
## Age -0.020564124 -1.204454
## Years_of_Experience 0.007743975 -1.207130
## Hours_Worked_Per_Week 0.032296538 -1.204903
## Number_of_Virtual_Meetings -0.015113999 -1.203879
# Henze-Zirkler multivariate normality test
MVN::mvn(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)], mvnTest="hz")
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 10.08074 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Age 58.3829 <0.001 NO
## 2 Anderson-Darling Years_of_Experience 59.3035 <0.001 NO
## 3 Anderson-Darling Hours_Worked_Per_Week 58.3981 <0.001 NO
## 4 Anderson-Darling Number_of_Virtual_Meetings 66.5611 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th
## Age 5000 40.9950 11.296021 41 22 60 31 51
## Years_of_Experience 5000 17.8102 10.020412 18 1 35 9 26
## Hours_Worked_Per_Week 5000 39.6146 11.860194 40 20 60 29 50
## Number_of_Virtual_Meetings 5000 7.5590 4.636121 8 0 15 4 12
## Skew Kurtosis
## Age -0.020564124 -1.204454
## Years_of_Experience 0.007743975 -1.207130
## Hours_Worked_Per_Week 0.032296538 -1.204903
## Number_of_Virtual_Meetings -0.015113999 -1.203879
# Doornik-Hansen multivariate normality test
MVN::mvn(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)], mvnTest="dh")
## $multivariateNormality
## Test E df p value MVN
## 1 Doornik-Hansen 1761.553 8 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Age 58.3829 <0.001 NO
## 2 Anderson-Darling Years_of_Experience 59.3035 <0.001 NO
## 3 Anderson-Darling Hours_Worked_Per_Week 58.3981 <0.001 NO
## 4 Anderson-Darling Number_of_Virtual_Meetings 66.5611 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th
## Age 5000 40.9950 11.296021 41 22 60 31 51
## Years_of_Experience 5000 17.8102 10.020412 18 1 35 9 26
## Hours_Worked_Per_Week 5000 39.6146 11.860194 40 20 60 29 50
## Number_of_Virtual_Meetings 5000 7.5590 4.636121 8 0 15 4 12
## Skew Kurtosis
## Age -0.020564124 -1.204454
## Years_of_Experience 0.007743975 -1.207130
## Hours_Worked_Per_Week 0.032296538 -1.204903
## Number_of_Virtual_Meetings -0.015113999 -1.203879
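# Royston's test is run on a random subsample of 400 rows; no seed is set here, so the rows drawn change between runs
subset_data <- cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[sample(nrow(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado), 400), -c(1,3,4,6,9,10,11)]
MVN::mvn(subset_data, mvnTest="royston")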
## $multivariateNormality
## Test H p value MVN
## 1 Royston 163.4475 2.663488e-34 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Age 3.4968 <0.001 NO
## 2 Anderson-Darling Years_of_Experience 5.4696 <0.001 NO
## 3 Anderson-Darling Hours_Worked_Per_Week 5.5836 <0.001 NO
## 4 Anderson-Darling Number_of_Virtual_Meetings 6.2630 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th
## Age 400 40.8000 10.972282 41 22 60 32 50
## Years_of_Experience 400 18.2350 9.837006 18 1 35 10 27
## Hours_Worked_Per_Week 400 39.9275 12.177972 40 20 60 29 51
## Number_of_Virtual_Meetings 400 7.6300 4.749156 8 0 15 3 12
## Skew Kurtosis
## Age -0.04642292 -1.104395
## Years_of_Experience 0.05249222 -1.244883
## Hours_Worked_Per_Week 0.06970159 -1.251568
## Number_of_Virtual_Meetings -0.05145374 -1.239707
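In every set of descriptives above, the skewness is close to 0 and the kurtosis is close to -1.2. One reading, offered here as an observation rather than a result stated in the source, is that the values are spread roughly uniformly over their range: a discrete uniform distribution on n equally spaced values has excess kurtosis $-6(n^2+1)/(5(n^2-1))$, which approaches -1.2 as n grows, and standard deviation $\sqrt{(n^2-1)/12}$. The sketch evaluates both formulas for the Min-Max ranges reported in the descriptives, assuming each variable takes every integer in its range.

```r
# Integer ranges taken from the Min and Max columns of the descriptives
ranges <- list(
  Age = 22:60,
  Years_of_Experience = 1:35,
  Hours_Worked_Per_Week = 20:60,
  Number_of_Virtual_Meetings = 0:15
)

# Excess kurtosis of a discrete uniform distribution on n consecutive integers
uniform_excess_kurtosis <- function(n) -6 * (n^2 + 1) / (5 * (n^2 - 1))

# Standard deviation of the same distribution
uniform_sd <- function(n) sqrt((n^2 - 1) / 12)

n_values <- vapply(ranges, length, integer(1))
theoretical_kurtosis <- uniform_excess_kurtosis(n_values)
theoretical_sd <- uniform_sd(n_values)

print(round(theoretical_kurtosis, 4))
print(round(theoretical_sd, 3))
```

These theoretical values can be compared with the Kurtosis and Std.Dev columns of the output. A flat, light-tailed distribution of this kind is far from bell-shaped, which is consistent with every test above rejecting normality.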
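# PCA on the standardized numeric variables; PCA() comes from FactoMineR and get_eigenvalue() from factoextra
factoextra::get_eigenvalue(FactoMineR::PCA(cdd_Impact_of_Remote_Work_on_Mental_Health_G11_depurado[,-c(1,3,4,6,9,10,11)], ncp = 6, scale.unit = TRUE, graph = FALSE))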
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  1.0289855         25.72464                    25.72464
## Dim.2  1.0016178         25.04044                    50.76508
## Dim.3  0.9950404         24.87601                    75.64109
## Dim.4  0.9743564         24.35891                   100.00000
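With `scale.unit = TRUE`, the PCA eigenvalues are the eigenvalues of the correlation matrix, so they sum to the number of variables (4) and each percentage is the eigenvalue divided by that sum. The sketch below re-enters the rounded correlation matrix reported earlier (the name `R` is introduced here for illustration) and uses base R's `eigen()`. Because the inputs are rounded to three decimals, the eigenvalues only approximate those in the table.

```r
vars <- c("Age", "Years_of_Experience",
          "Hours_Worked_Per_Week", "Number_of_Virtual_Meetings")

# Rounded correlation matrix, copied from the cor() output earlier in the report
R <- matrix(
  c( 1.000, -0.004, -0.001,  0.004,
    -0.004,  1.000, -0.019,  0.019,
    -0.001, -0.019,  1.000, -0.005,
     0.004,  0.019, -0.005,  1.000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Eigenvalues of the correlation matrix, largest first
eig <- eigen(R, symmetric = TRUE)$values

# Same three columns as the get_eigenvalue() table
eig_table <- data.frame(
  eigenvalue = eig,
  variance.percent = 100 * eig / sum(eig),
  cumulative.variance.percent = cumsum(100 * eig / sum(eig)),
  row.names = paste0("Dim.", seq_along(eig))
)

print(round(eig_table, 3))
```

Eigenvalues this close to 1, with each dimension carrying about a quarter of the variance, indicate that no component summarizes the four variables much better than any single original variable, which agrees with the near-zero correlations.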