Moje dáta sú v správnom formáte, názvy stĺpcov neobsahujú medzery, preto nie je potrebné ich hromadne upravovať.
Skontrolujeme si, či sa doplňujú na miesta chýbajúcich údajov doplňujú NA hodnoty (NA - Not Available).
# Import the CSV file into a data frame
# - header = TRUE: the first row contains variable names
# - sep = ",": variables are separated by colons
# - dec = ".": decimal numbers use a dot
# - na.strings = c("", "NA"): empty cells and text "NA" are treated as missing values
# - stringsAsFactors = FALSE: text variables remain text, not factors
udaje1 <- read.csv2(
"udaje/productivity.csv",
header = TRUE,
sep = ",",
dec = ".",
na.strings = c("", "NA"),
stringsAsFactors = FALSE
)
# Show the first rows of the dataset
library(knitr)
kable(head(udaje1), caption = "Náhľad prvých 6 riadkov datasetu")| Employee_ID | Department | Gender | Age | Job_Title | Hire_Date | Years_At_Company | Education_Level | Performance_Score | Monthly_Salary | Work_Hours_Per_Week | Projects_Handled | Overtime_Hours | Sick_Days | Remote_Work_Frequency | Team_Size | Training_Hours | Promotions | Employee_Satisfaction_Score | Resigned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | IT | Male | 55 | Specialist | 2022-01-19 08:03:05.556036 | 2 | High School | 5 | 6750 | 33 | 32 | 22 | 2 | 0 | 14 | 66 | 0 | 2.63 | False |
| 2 | Finance | Male | 29 | Developer | 2024-04-18 08:03:05.556036 | 0 | High School | 5 | 7500 | 34 | 34 | 13 | 14 | 100 | 12 | 61 | 2 | 1.72 | False |
| 3 | Finance | Male | 55 | Specialist | 2015-10-26 08:03:05.556036 | 8 | High School | 3 | 5850 | 37 | 27 | 6 | 3 | 50 | 10 | 1 | 0 | 3.17 | False |
| 4 | Customer Support | Female | 48 | Analyst | 2016-10-22 08:03:05.556036 | 7 | Bachelor | 2 | 4800 | 52 | 10 | 28 | 12 | 100 | 10 | 0 | 1 | 1.86 | False |
| 5 | Engineering | Female | 36 | Analyst | 2021-07-23 08:03:05.556036 | 3 | Bachelor | 2 | 4800 | 38 | 11 | 29 | 13 | 100 | 15 | 9 | 1 | 1.25 | False |
| 6 | IT | Male | 43 | Manager | 2016-08-14 08:03:05.556036 | 8 | High School | 3 | 7800 | 46 | 31 | 8 | 0 | 100 | 15 | 95 | 0 | 2.77 | False |
# Load the dplyr package
# dplyr provides convenient tools for working with data frames
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# -----------------------------
# 1. Save the original column names
# -----------------------------
old_names <- names(udaje1)
# -----------------------------
# 2. Shorten (abbreviate) column names
# -----------------------------
# rename_with() applies a function to all column names
# abbreviate() automatically shortens long names
# strict = FALSE allows a more flexible abbreviation
udaje1 <- udaje1 %>%
rename_with(~ abbreviate(.x, strict = FALSE))
# -----------------------------
# 3. Ensure that column names are unique
# -----------------------------
# Sometimes abbreviation may create identical names
# make.unique() automatically adds suffixes (.1, .2, ...) if necessary
names(udaje1) <- make.unique(names(udaje1))
# -----------------------------
# 4. Show comparison: old vs. new names
# -----------------------------
comparison <- data.frame(
Original_Name = old_names,
Shortened_Name = names(udaje1)
)
print(comparison)## Original_Name Shortened_Name
## 1 Employee_ID E_ID
## 2 Department Dprt
## 3 Gender Gndr
## 4 Age Age
## 5 Job_Title Jb_T
## 6 Hire_Date Hr_D
## 7 Years_At_Company Y_A_
## 8 Education_Level Ed_L
## 9 Performance_Score Pr_S
## 10 Monthly_Salary Mn_S
## 11 Work_Hours_Per_Week W_H_
## 12 Projects_Handled Pr_H
## 13 Overtime_Hours Ov_H
## 14 Sick_Days Sc_D
## 15 Remote_Work_Frequency R_W_
## 16 Team_Size Tm_S
## 17 Training_Hours Tr_H
## 18 Promotions Prmt
## 19 Employee_Satisfaction_Score E_S_
## 20 Resigned Rsgn
Odporúčam tu použiť knižnice VIM, Amelia, mice a iné. Pokiaľ máme databázu dostatočne nekonzistentnú a nevieme ju upraviť vynechaním niekoľkých riadkov / stĺpcov, potom odporúčame blog M. Fatih Tüzen: Handling Missing Data in R: A Comprehensive Guide, R bloggers.
library(mice)
library(VIM)
# Count missing values in each column
missing_counts <- colSums(is.na(udaje1))
print("Počet chýbajúcich údajov:")## [1] "Počet chýbajúcich údajov:"
## E_ID Dprt Gndr Age Jb_T Hr_D Y_A_ Ed_L Pr_S Mn_S W_H_ Pr_H Ov_H Sc_D R_W_ Tm_S Tr_H Prmt E_S_ Rsgn
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Z analýzy vyplýva, že databáza neobsahuje žiadne chýbajúce hodnoty (NA = 0).
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## E_ID Dprt Gndr Age Jb_T Hr_D Y_A_ Ed_L Pr_S Mn_S W_H_ Pr_H Ov_H Sc_D R_W_ Tm_S Tr_H Prmt E_S_ Rsgn
## 100000 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 100000 0
## 0
# visualize missing data
aggr(udaje1, bars=FALSE, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, combined=TRUE)##
## Variables sorted by number of missings:
## Variable Count
## E_ID 0
## Dprt 0
## Gndr 0
## Age 0
## Jb_T 0
## Hr_D 0
## Y_A_ 0
## Ed_L 0
## Pr_S 0
## Mn_S 0
## W_H_ 0
## Pr_H 0
## Ov_H 0
## Sc_D 0
## R_W_ 0
## Tm_S 0
## Tr_H 0
## Prmt 0
## E_S_ 0
## Rsgn 0
# multiple imputation - v pripade, ak vam chyba mensi rozsah udajov
#imp <- mice(udaje1, seed=123) # konkretne parametre imputacie vieme nastavovat - pozri help
#udaje_imputovane <- complete(imp, 1)
#udaje1 <- udaje_imputovane
#head(udaje1)
#rm(imp)
#rm(udaje_imputovane)
#print("pocet chybajucich udajov za jednotlive premenne")
#colSums(is.na(udaje1))Analýza potvrdila, že dataset je kompletný a neobsahuje žiadne chýbajúce hodnoty, preto nie je potrebná ďalšia imputácia údajov.
Výber a následné triedenie
library(dplyr)
udaje.IT <- udaje1 %>%
filter(Dprt == "IT") %>%
dplyr::select(Pr_S, Mn_S, Ov_H, Sc_D)# Basic scatter plot
library(ggplot2)
ggplot(udaje.IT, aes(x = Mn_S, y = Pr_S)) + # specifikacia osi
geom_point(color = "blue") +
theme_minimal() +
labs(title = "Vzťah medzi platom a výkonom v IT oddelení",
x = "Mesačný plat",
y = "Skóre výkonu")# Bar plot with grouping
library(ggplot2)
library(ggplot2)
ggplot(udaje1, aes(x = factor(Dprt), y = E_S_)) + # specifikacia osi
geom_boxplot(fill = "lightblue", color = "darkblue") + # typ grafu - boxplot
labs( # oznacenie osi, nazov grafu
title = "Spokojnosť zamestnancov podľa oddelenia",
x = "oddelenie",
y = "Spokojnosť zamestanancov"
) +
theme_minimal()library(dplyr)
library(knitr)
# Summarise basic statistics
salary.stats <- udaje1 %>%
group_by(Dprt) %>%
summarise(
n = n(),
mean = mean(Mn_S, na.rm = TRUE),
sd = sd(Mn_S, na.rm = TRUE),
min = min(Mn_S, na.rm = TRUE),
q25 = quantile(Mn_S, 0.25, na.rm = TRUE),
median = median(Mn_S, na.rm = TRUE),
q75 = quantile(Mn_S, 0.75, na.rm = TRUE),
max = max(Mn_S, na.rm = TRUE),
.groups = "drop"
)
# Create knitr table
kable(salary.stats, digits = 2, caption = "Základné štatistiky mesačného platu podľa oddelení")| Dprt | n | mean | sd | min | q25 | median | q75 | max |
|---|---|---|---|---|---|---|---|---|
| Customer Support | 11116 | 6403.88 | 1371.62 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Engineering | 10956 | 6417.25 | 1365.20 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Finance | 11200 | 6398.58 | 1364.24 | 3850 | 5250 | 6500 | 7500 | 9000 |
| HR | 10960 | 6400.36 | 1369.48 | 3850 | 5250 | 6500 | 7500 | 9000 |
| IT | 11131 | 6414.60 | 1378.18 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Legal | 11118 | 6391.28 | 1375.83 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Marketing | 11216 | 6377.86 | 1382.82 | 3850 | 5250 | 6500 | 7200 | 9000 |
| Operations | 11181 | 6412.31 | 1379.73 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Sales | 11122 | 6413.15 | 1365.13 | 3850 | 5250 | 6500 | 7500 | 9000 |
alebo krajšie tabuľky s pomocou .kableExtra.:
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# Summarise basic statistics
salary.stats <- udaje1 %>%
group_by(Dprt) %>%
summarise(
Počet = n(),
Priemer = mean(Mn_S, na.rm = TRUE),
Smerodajná_odchylka = sd(Mn_S, na.rm = TRUE),
Minimum = min(Mn_S, na.rm = TRUE),
Prvý_kvartil = quantile(Mn_S, 0.25, na.rm = TRUE),
Median = median(Mn_S, na.rm = TRUE),
Treti_kvartal = quantile(Mn_S, 0.75, na.rm = TRUE),
Maximum = max(Mn_S, na.rm = TRUE),
.groups = "drop"
)
# Create styled kableExtra table
salary.stats %>%
kable(digits = 2, caption = "Základné štatistiky mesačných platov podľa oddelení") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, background = "#f2f2f2")| Dprt | Počet | Priemer | Smerodajná_odchylka | Minimum | Prvý_kvartil | Median | Treti_kvartal | Maximum |
|---|---|---|---|---|---|---|---|---|
| Customer Support | 11116 | 6403.88 | 1371.62 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Engineering | 10956 | 6417.25 | 1365.20 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Finance | 11200 | 6398.58 | 1364.24 | 3850 | 5250 | 6500 | 7500 | 9000 |
| HR | 10960 | 6400.36 | 1369.48 | 3850 | 5250 | 6500 | 7500 | 9000 |
| IT | 11131 | 6414.60 | 1378.18 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Legal | 11118 | 6391.28 | 1375.83 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Marketing | 11216 | 6377.86 | 1382.82 | 3850 | 5250 | 6500 | 7200 | 9000 |
| Operations | 11181 | 6412.31 | 1379.73 | 3850 | 5250 | 6500 | 7500 | 9000 |
| Sales | 11122 | 6413.15 | 1365.13 | 3850 | 5250 | 6500 | 7500 | 9000 |
t.test.result <- t.test(
udaje1$Pr_S[udaje1$Dprt == "IT"],
udaje1$Pr_S[udaje1$Dprt == "Sales"]
)
print(t.test.result)##
## Welch Two Sample t-test
##
## data: udaje1$Pr_S[udaje1$Dprt == "IT"] and udaje1$Pr_S[udaje1$Dprt == "Sales"]
## t = 0.29833, df = 22250, p-value = 0.7655
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03155226 0.04288116
## sample estimates:
## mean of x mean of y
## 3.000000 2.994336
## Df Sum Sq Mean Sq F value Pr(>F)
## Dprt 8 15 1.917 0.958 0.467
## Residuals 99991 200128 2.002
##
## Call:
## lm(formula = Pr_S ~ Mn_S + Ov_H + Tr_H, data = udaje1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.10648 -0.93884 -0.04578 0.97527 2.61844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.785e-01 2.056e-02 -18.406 <2e-16 ***
## Mn_S 5.257e-04 2.804e-06 187.507 <2e-16 ***
## Ov_H 3.667e-05 4.442e-04 0.083 0.934
## Tr_H 1.426e-04 1.332e-04 1.071 0.284
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.217 on 99996 degrees of freedom
## Multiple R-squared: 0.2601, Adjusted R-squared: 0.2601
## F-statistic: 1.172e+04 on 3 and 99996 DF, p-value: < 2.2e-16