Čistenie údajov

Moje dáta sú v správnom formáte, názvy stĺpcov neobsahujú medzery, preto nie je potrebné ich hromadne upravovať.

Import údajov

Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).

# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ",": variables are separated by colons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors

udaje1 <- read.csv2(
  "udaje/productivity.csv",
  header = TRUE,
  sep = ",",
  dec = ".",
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)

# Show the first rows of the dataset
library(knitr)
kable(head(udaje1), caption = "Náhľad prvých 6 riadkov datasetu")
Náhľad prvých 6 riadkov datasetu
Employee_ID Department Gender Age Job_Title Hire_Date Years_At_Company Education_Level Performance_Score Monthly_Salary Work_Hours_Per_Week Projects_Handled Overtime_Hours Sick_Days Remote_Work_Frequency Team_Size Training_Hours Promotions Employee_Satisfaction_Score Resigned
1 IT Male 55 Specialist 2022-01-19 08:03:05.556036 2 High School 5 6750 33 32 22 2 0 14 66 0 2.63 False
2 Finance Male 29 Developer 2024-04-18 08:03:05.556036 0 High School 5 7500 34 34 13 14 100 12 61 2 1.72 False
3 Finance Male 55 Specialist 2015-10-26 08:03:05.556036 8 High School 3 5850 37 27 6 3 50 10 1 0 3.17 False
4 Customer Support Female 48 Analyst 2016-10-22 08:03:05.556036 7 Bachelor 2 4800 52 10 28 12 100 10 0 1 1.86 False
5 Engineering Female 36 Analyst 2021-07-23 08:03:05.556036 3 Bachelor 2 4800 38 11 29 13 100 15 9 1 1.25 False
6 IT Male 43 Manager 2016-08-14 08:03:05.556036 8 High School 3 7800 46 31 8 0 100 15 95 0 2.77 False

Upravovanie názvov premenných

# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)

# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
  rename_with(~ abbreviate(.x, strict = FALSE))

# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))

# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
  Original_Name = old_names,
  Shortened_Name = names(udaje1)
)

print(comparison)
##                  Original_Name Shortened_Name
## 1                  Employee_ID           E_ID
## 2                   Department           Dprt
## 3                       Gender           Gndr
## 4                          Age            Age
## 5                    Job_Title           Jb_T
## 6                    Hire_Date           Hr_D
## 7             Years_At_Company           Y_A_
## 8              Education_Level           Ed_L
## 9            Performance_Score           Pr_S
## 10              Monthly_Salary           Mn_S
## 11         Work_Hours_Per_Week           W_H_
## 12            Projects_Handled           Pr_H
## 13              Overtime_Hours           Ov_H
## 14                   Sick_Days           Sc_D
## 15       Remote_Work_Frequency           R_W_
## 16                   Team_Size           Tm_S
## 17              Training_Hours           Tr_H
## 18                  Promotions           Prmt
## 19 Employee_Satisfaction_Score           E_S_
## 20                    Resigned           Rsgn

Počiatočné čistenie obsahu databázy, imputácia chýbahúcich údajov

Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.

library(mice)
library(VIM)

# Count missing values in each column
missing_counts <- colSums(is.na(udaje1))
print("Počet chýbajúcich údajov:")
## [1] "Počet chýbajúcich údajov:"
print(missing_counts)
## E_ID Dprt Gndr  Age Jb_T Hr_D Y_A_ Ed_L Pr_S Mn_S W_H_ Pr_H Ov_H Sc_D R_W_ Tm_S Tr_H Prmt E_S_ Rsgn 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Z analýzy vyplýva, že databáza neobsahuje žiadne chýbajúce hodnoty (NA = 0).

# pattern of missingness
md.pattern(udaje1)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##        E_ID Dprt Gndr Age Jb_T Hr_D Y_A_ Ed_L Pr_S Mn_S W_H_ Pr_H Ov_H Sc_D R_W_ Tm_S Tr_H Prmt E_S_ Rsgn
## 100000    1    1    1   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
##           0    0    0   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##         
## 100000 0
##        0
# visualize missing data
aggr(udaje1, bars=FALSE, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, combined=TRUE)

## 
##  Variables sorted by number of missings: 
##  Variable Count
##      E_ID     0
##      Dprt     0
##      Gndr     0
##       Age     0
##      Jb_T     0
##      Hr_D     0
##      Y_A_     0
##      Ed_L     0
##      Pr_S     0
##      Mn_S     0
##      W_H_     0
##      Pr_H     0
##      Ov_H     0
##      Sc_D     0
##      R_W_     0
##      Tm_S     0
##      Tr_H     0
##      Prmt     0
##      E_S_     0
##      Rsgn     0
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
#imp <- mice(udaje1, seed=123)   # konkretne parametre imputacie vieme nastavovat - pozri help
#udaje_imputovane <- complete(imp, 1)
#udaje1 <- udaje_imputovane
#head(udaje1)
#rm(imp)
#rm(udaje_imputovane)
#print("pocet chybajucich udajov za jednotlive premenne")
#colSums(is.na(udaje1))

Analýza potvrdila, že dataset je kompletný a neobsahuje žiadne chýbajúce hodnoty, preto nie je potrebná ďalšia imputácia údajov.

Tabuľky, grafy, jednoduché štatistiky

Grafy

ggplot2 - knižnica pre grafy

Výber a následné triedenie

library(dplyr)

udaje.IT <- udaje1 %>%
  filter(Dprt == "IT") %>%
  dplyr::select(Pr_S, Mn_S, Ov_H, Sc_D)

Scatter plot

# Basic scatter plot
library(ggplot2)
ggplot(udaje.IT, aes(x = Mn_S, y = Pr_S)) +            # specifikacia osi
 geom_point(color = "blue") +
  theme_minimal() +
  labs(title = "Vzťah medzi platom a výkonom v IT oddelení", 
       x = "Mesačný plat", 
       y = "Skóre výkonu")

Boxplot

# Bar plot with grouping
library(ggplot2)

library(ggplot2)

ggplot(udaje1, aes(x = factor(Dprt), y = E_S_)) +        # specifikacia osi
  geom_boxplot(fill = "lightblue", color = "darkblue") +      # typ grafu - boxplot
  labs(                                                       # oznacenie osi, nazov grafu
    title = "Spokojnosť zamestnancov podľa oddelenia",
    x = "oddelenie",
    y = "Spokojnosť zamestanancov"
  ) +
  theme_minimal()

Základné štatistiky.

knitr - tabuľka

library(dplyr)
library(knitr)

# Summarise basic statistics
salary.stats <- udaje1 %>%
  group_by(Dprt) %>% 
  summarise(
    n      = n(),
    mean   = mean(Mn_S, na.rm = TRUE),
    sd     = sd(Mn_S, na.rm = TRUE),
    min    = min(Mn_S, na.rm = TRUE),
    q25    = quantile(Mn_S, 0.25, na.rm = TRUE),
    median = median(Mn_S, na.rm = TRUE),
    q75    = quantile(Mn_S, 0.75, na.rm = TRUE),
    max    = max(Mn_S, na.rm = TRUE),
    .groups = "drop"
  )

# Create knitr table
kable(salary.stats, digits = 2, caption = "Základné štatistiky mesačného platu podľa oddelení")
Základné štatistiky mesačného platu podľa oddelení
Dprt n mean sd min q25 median q75 max
Customer Support 11116 6403.88 1371.62 3850 5250 6500 7500 9000
Engineering 10956 6417.25 1365.20 3850 5250 6500 7500 9000
Finance 11200 6398.58 1364.24 3850 5250 6500 7500 9000
HR 10960 6400.36 1369.48 3850 5250 6500 7500 9000
IT 11131 6414.60 1378.18 3850 5250 6500 7500 9000
Legal 11118 6391.28 1375.83 3850 5250 6500 7500 9000
Marketing 11216 6377.86 1382.82 3850 5250 6500 7200 9000
Operations 11181 6412.31 1379.73 3850 5250 6500 7500 9000
Sales 11122 6413.15 1365.13 3850 5250 6500 7500 9000

alebo krajšie tabuľky s pomocou .kableExtra.:

library(dplyr)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# Summarise basic statistics
salary.stats <- udaje1 %>%
  group_by(Dprt) %>% 
  summarise(
    Počet      = n(),
    Priemer   = mean(Mn_S, na.rm = TRUE),
    Smerodajná_odchylka     = sd(Mn_S, na.rm = TRUE),
    Minimum = min(Mn_S, na.rm = TRUE),
    Prvý_kvartil    = quantile(Mn_S, 0.25, na.rm = TRUE),
    Median = median(Mn_S, na.rm = TRUE),
    Treti_kvartal    = quantile(Mn_S, 0.75, na.rm = TRUE),
    Maximum   = max(Mn_S, na.rm = TRUE),
    .groups = "drop"
  )


# Create styled kableExtra table
salary.stats %>%
  kable(digits = 2, caption = "Základné štatistiky mesačných platov podľa oddelení") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f2f2f2")
Základné štatistiky mesačných platov podľa oddelení
Dprt Počet Priemer Smerodajná_odchylka Minimum Prvý_kvartil Median Treti_kvartal Maximum
Customer Support 11116 6403.88 1371.62 3850 5250 6500 7500 9000
Engineering 10956 6417.25 1365.20 3850 5250 6500 7500 9000
Finance 11200 6398.58 1364.24 3850 5250 6500 7500 9000
HR 10960 6400.36 1369.48 3850 5250 6500 7500 9000
IT 11131 6414.60 1378.18 3850 5250 6500 7500 9000
Legal 11118 6391.28 1375.83 3850 5250 6500 7500 9000
Marketing 11216 6377.86 1382.82 3850 5250 6500 7200 9000
Operations 11181 6412.31 1379.73 3850 5250 6500 7500 9000
Sales 11122 6413.15 1365.13 3850 5250 6500 7500 9000

Testovanie hypotéz

t-test: Porovnanie výkonu dvoch oddelení

t.test.result <- t.test(
  udaje1$Pr_S[udaje1$Dprt == "IT"],
  udaje1$Pr_S[udaje1$Dprt == "Sales"]
)

print(t.test.result)
## 
##  Welch Two Sample t-test
## 
## data:  udaje1$Pr_S[udaje1$Dprt == "IT"] and udaje1$Pr_S[udaje1$Dprt == "Sales"]
## t = 0.29833, df = 22250, p-value = 0.7655
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03155226  0.04288116
## sample estimates:
## mean of x mean of y 
##  3.000000  2.994336

ANOVA: Porovnanie výkonu naprieč všetkými oddeleniami

anova.result <- aov(Pr_S ~ Dprt, data = udaje1)
summary(anova.result)
##                Df Sum Sq Mean Sq F value Pr(>F)
## Dprt            8     15   1.917   0.958  0.467
## Residuals   99991 200128   2.002

Linear Regression: Predikcia výkonu pomocou platu, nadčasov a tréningu

model <- lm(Pr_S ~ Mn_S + Ov_H + Tr_H, data = udaje1)
summary(model)
## 
## Call:
## lm(formula = Pr_S ~ Mn_S + Ov_H + Tr_H, data = udaje1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10648 -0.93884 -0.04578  0.97527  2.61844 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.785e-01  2.056e-02 -18.406   <2e-16 ***
## Mn_S         5.257e-04  2.804e-06 187.507   <2e-16 ***
## Ov_H         3.667e-05  4.442e-04   0.083    0.934    
## Tr_H         1.426e-04  1.332e-04   1.071    0.284    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.217 on 99996 degrees of freedom
## Multiple R-squared:  0.2601, Adjusted R-squared:  0.2601 
## F-statistic: 1.172e+04 on 3 and 99996 DF,  p-value: < 2.2e-16