Čistenie údajov

Moje dáta sú v správnom formáte, názvy stĺpcov neobsahujú medzery, preto nie je potrebné ich hromadne upravovať.

Import údajov

Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).

# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ",": variables are separated by colons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors

udaje1 <- read.csv2(
  "udaje/productivity.csv",
  header = TRUE,
  sep = ",",
  dec = ".",
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)

# Show the first rows of the dataset
library(knitr)
kable(head(udaje1), caption = "Náhľad prvých 6 riadkov datasetu")

Náhľad prvých 6 riadkov datasetu
Employee_ID	Department	Gender	Age	Job_Title	Hire_Date	Years_At_Company	Education_Level	Performance_Score	Monthly_Salary	Work_Hours_Per_Week	Projects_Handled	Overtime_Hours	Sick_Days	Remote_Work_Frequency	Team_Size	Training_Hours	Promotions	Employee_Satisfaction_Score	Resigned
1	IT	Male	55	Specialist	2022-01-19 08:03:05.556036	2	High School	5	6750	33	32	22	2	0	14	66	0	2.63	False
2	Finance	Male	29	Developer	2024-04-18 08:03:05.556036	0	High School	5	7500	34	34	13	14	100	12	61	2	1.72	False
3	Finance	Male	55	Specialist	2015-10-26 08:03:05.556036	8	High School	3	5850	37	27	6	3	50	10	1	0	3.17	False
4	Customer Support	Female	48	Analyst	2016-10-22 08:03:05.556036	7	Bachelor	2	4800	52	10	28	12	100	10	0	1	1.86	False
5	Engineering	Female	36	Analyst	2021-07-23 08:03:05.556036	3	Bachelor	2	4800	38	11	29	13	100	15	9	1	1.25	False
6	IT	Male	43	Manager	2016-08-14 08:03:05.556036	8	High School	3	7800	46	31	8	0	100	15	95	0	2.77	False

Upravovanie názvov premenných

# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)

# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
  rename_with(~ abbreviate(.x, strict = FALSE))

# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))

# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
  Original_Name = old_names,
  Shortened_Name = names(udaje1)
)

print(comparison)

##                  Original_Name Shortened_Name
## 1                  Employee_ID           E_ID
## 2                   Department           Dprt
## 3                       Gender           Gndr
## 4                          Age            Age
## 5                    Job_Title           Jb_T
## 6                    Hire_Date           Hr_D
## 7             Years_At_Company           Y_A_
## 8              Education_Level           Ed_L
## 9            Performance_Score           Pr_S
## 10              Monthly_Salary           Mn_S
## 11         Work_Hours_Per_Week           W_H_
## 12            Projects_Handled           Pr_H
## 13              Overtime_Hours           Ov_H
## 14                   Sick_Days           Sc_D
## 15       Remote_Work_Frequency           R_W_
## 16                   Team_Size           Tm_S
## 17              Training_Hours           Tr_H
## 18                  Promotions           Prmt
## 19 Employee_Satisfaction_Score           E_S_
## 20                    Resigned           Rsgn

Počiatočné čistenie obsahu databázy, imputácia chýbahúcich údajov

Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.

library(mice)
library(VIM)

# Count missing values in each column
missing_counts <- colSums(is.na(udaje1))
print("Počet chýbajúcich údajov:")

## [1] "Počet chýbajúcich údajov:"

print(missing_counts)

## E_ID Dprt Gndr  Age Jb_T Hr_D Y_A_ Ed_L Pr_S Mn_S W_H_ Pr_H Ov_H Sc_D R_W_ Tm_S Tr_H Prmt E_S_ Rsgn 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Z analýzy vyplýva, že databáza neobsahuje žiadne chýbajúce hodnoty (NA = 0).

# pattern of missingness
md.pattern(udaje1)

##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##        E_ID Dprt Gndr Age Jb_T Hr_D Y_A_ Ed_L Pr_S Mn_S W_H_ Pr_H Ov_H Sc_D R_W_ Tm_S Tr_H Prmt E_S_ Rsgn
## 100000    1    1    1   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
##           0    0    0   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##         
## 100000 0
##        0

# visualize missing data
aggr(udaje1, bars=FALSE, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, combined=TRUE)

## 
##  Variables sorted by number of missings: 
##  Variable Count
##      E_ID     0
##      Dprt     0
##      Gndr     0
##       Age     0
##      Jb_T     0
##      Hr_D     0
##      Y_A_     0
##      Ed_L     0
##      Pr_S     0
##      Mn_S     0
##      W_H_     0
##      Pr_H     0
##      Ov_H     0
##      Sc_D     0
##      R_W_     0
##      Tm_S     0
##      Tr_H     0
##      Prmt     0
##      E_S_     0
##      Rsgn     0

# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
#imp <- mice(udaje1, seed=123)   # konkretne parametre imputacie vieme nastavovat - pozri help
#udaje_imputovane <- complete(imp, 1)
#udaje1 <- udaje_imputovane
#head(udaje1)
#rm(imp)
#rm(udaje_imputovane)
#print("pocet chybajucich udajov za jednotlive premenne")
#colSums(is.na(udaje1))

Analýza potvrdila, že dataset je kompletný a neobsahuje žiadne chýbajúce hodnoty, preto nie je potrebná ďalšia imputácia údajov.

Tabuľky, grafy, jednoduché štatistiky

Grafy

ggplot2 - knižnica pre grafy

Výber a následné triedenie

library(dplyr)

udaje.IT <- udaje1 %>%
  filter(Dprt == "IT") %>%
  dplyr::select(Pr_S, Mn_S, Ov_H, Sc_D)

Scatter plot

# Basic scatter plot
library(ggplot2)
ggplot(udaje.IT, aes(x = Mn_S, y = Pr_S)) +            # specifikacia osi
 geom_point(color = "blue") +
  theme_minimal() +
  labs(title = "Vzťah medzi platom a výkonom v IT oddelení", 
       x = "Mesačný plat", 
       y = "Skóre výkonu")

Boxplot

# Bar plot with grouping
library(ggplot2)

library(ggplot2)

ggplot(udaje1, aes(x = factor(Dprt), y = E_S_)) +        # specifikacia osi
  geom_boxplot(fill = "lightblue", color = "darkblue") +      # typ grafu - boxplot
  labs(                                                       # oznacenie osi, nazov grafu
    title = "Spokojnosť zamestnancov podľa oddelenia",
    x = "oddelenie",
    y = "Spokojnosť zamestanancov"
  ) +
  theme_minimal()

Základné štatistiky.

knitr - tabuľka

library(dplyr)
library(knitr)

# Summarise basic statistics
salary.stats <- udaje1 %>%
  group_by(Dprt) %>% 
  summarise(
    n      = n(),
    mean   = mean(Mn_S, na.rm = TRUE),
    sd     = sd(Mn_S, na.rm = TRUE),
    min    = min(Mn_S, na.rm = TRUE),
    q25    = quantile(Mn_S, 0.25, na.rm = TRUE),
    median = median(Mn_S, na.rm = TRUE),
    q75    = quantile(Mn_S, 0.75, na.rm = TRUE),
    max    = max(Mn_S, na.rm = TRUE),
    .groups = "drop"
  )

# Create knitr table
kable(salary.stats, digits = 2, caption = "Základné štatistiky mesačného platu podľa oddelení")

Základné štatistiky mesačného platu podľa oddelení
Dprt	n	mean	sd	min	q25	median	q75	max
Customer Support	11116	6403.88	1371.62	3850	5250	6500	7500	9000
Engineering	10956	6417.25	1365.20	3850	5250	6500	7500	9000
Finance	11200	6398.58	1364.24	3850	5250	6500	7500	9000
HR	10960	6400.36	1369.48	3850	5250	6500	7500	9000
IT	11131	6414.60	1378.18	3850	5250	6500	7500	9000
Legal	11118	6391.28	1375.83	3850	5250	6500	7500	9000
Marketing	11216	6377.86	1382.82	3850	5250	6500	7200	9000
Operations	11181	6412.31	1379.73	3850	5250	6500	7500	9000
Sales	11122	6413.15	1365.13	3850	5250	6500	7500	9000

alebo krajšie tabuľky s pomocou .kableExtra.:

library(dplyr)
library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

# Summarise basic statistics
salary.stats <- udaje1 %>%
  group_by(Dprt) %>% 
  summarise(
    Počet      = n(),
    Priemer   = mean(Mn_S, na.rm = TRUE),
    Smerodajná_odchylka     = sd(Mn_S, na.rm = TRUE),
    Minimum = min(Mn_S, na.rm = TRUE),
    Prvý_kvartil    = quantile(Mn_S, 0.25, na.rm = TRUE),
    Median = median(Mn_S, na.rm = TRUE),
    Treti_kvartal    = quantile(Mn_S, 0.75, na.rm = TRUE),
    Maximum   = max(Mn_S, na.rm = TRUE),
    .groups = "drop"
  )


# Create styled kableExtra table
salary.stats %>%
  kable(digits = 2, caption = "Základné štatistiky mesačných platov podľa oddelení") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#f2f2f2")

Základné štatistiky mesačných platov podľa oddelení
Dprt	Počet	Priemer	Smerodajná_odchylka	Minimum	Prvý_kvartil	Median	Treti_kvartal	Maximum
Customer Support	11116	6403.88	1371.62	3850	5250	6500	7500	9000
Engineering	10956	6417.25	1365.20	3850	5250	6500	7500	9000
Finance	11200	6398.58	1364.24	3850	5250	6500	7500	9000
HR	10960	6400.36	1369.48	3850	5250	6500	7500	9000
IT	11131	6414.60	1378.18	3850	5250	6500	7500	9000
Legal	11118	6391.28	1375.83	3850	5250	6500	7500	9000
Marketing	11216	6377.86	1382.82	3850	5250	6500	7200	9000
Operations	11181	6412.31	1379.73	3850	5250	6500	7500	9000
Sales	11122	6413.15	1365.13	3850	5250	6500	7500	9000

Testovanie hypotéz

t-test: Porovnanie výkonu dvoch oddelení

t.test.result <- t.test(
  udaje1$Pr_S[udaje1$Dprt == "IT"],
  udaje1$Pr_S[udaje1$Dprt == "Sales"]
)

print(t.test.result)

## 
##  Welch Two Sample t-test
## 
## data:  udaje1$Pr_S[udaje1$Dprt == "IT"] and udaje1$Pr_S[udaje1$Dprt == "Sales"]
## t = 0.29833, df = 22250, p-value = 0.7655
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03155226  0.04288116
## sample estimates:
## mean of x mean of y 
##  3.000000  2.994336

ANOVA: Porovnanie výkonu naprieč všetkými oddeleniami

anova.result <- aov(Pr_S ~ Dprt, data = udaje1)
summary(anova.result)

##                Df Sum Sq Mean Sq F value Pr(>F)
## Dprt            8     15   1.917   0.958  0.467
## Residuals   99991 200128   2.002

Linear Regression: Predikcia výkonu pomocou platu, nadčasov a tréningu

model <- lm(Pr_S ~ Mn_S + Ov_H + Tr_H, data = udaje1)
summary(model)

## 
## Call:
## lm(formula = Pr_S ~ Mn_S + Ov_H + Tr_H, data = udaje1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10648 -0.93884 -0.04578  0.97527  2.61844 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.785e-01  2.056e-02 -18.406   <2e-16 ***
## Mn_S         5.257e-04  2.804e-06 187.507   <2e-16 ***
## Ov_H         3.667e-05  4.442e-04   0.083    0.934    
## Tr_H         1.426e-04  1.332e-04   1.071    0.284    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.217 on 99996 degrees of freedom
## Multiple R-squared:  0.2601, Adjusted R-squared:  0.2601 
## F-statistic: 1.172e+04 on 3 and 99996 DF,  p-value: < 2.2e-16

Analýza výkonu zamestnancov

Iris Vitárius

March 2026