Assignment DS Programming Week 5
Angelique Kiyoshi Lakeisha B.U
NIM: 52250001
Data Science student at Institut Teknologi Sains Bandung
1 Dynamic Multi-Formula
This task uses functions, nested loops, and conditional statements to generate structured outputs. The results are then visualized to highlight differences in growth patterns across formulas.
1.1 Nested loops
The function compute_multiple_formulas() calculates several mathematical formulas in a single call. It uses nested for loops, where the outer loop iterates over the formula types (formulas) and the inner loop iterates over the input values (x_values). Conditional statements (if-else) inside the loop apply the correct formula (linear, quadratic, cubic, exponential). Each result is appended to a data frame using rbind().
# compute several formula types using nested loops
compute_multiple_formulas <- function(x_values, formulas) {
  results <- data.frame()          # empty data frame to collect the results
  for (f in formulas) {            # outer loop: each formula type
    for (i in x_values) {          # inner loop: each x value
      # choose the formula based on its type
      if (f == "linear") {
        y <- 3*i - 2
      } else if (f == "quadratic") {
        y <- i^2 - 4*i + 5
      } else if (f == "cubic") {
        y <- 0.05*i^3 - 3*i^2 + i + 5
      } else if (f == "exponential") {
        y <- 1.5^i
      } else {
        y <- NA                    # unknown formula name: no value is computed
      }
      results <- rbind(results, data.frame(x = i, y = y, formula = f))  # store the row
    }
  }
  return(results)
}
Interpretation:
The results confirm that each formula produces distinct value patterns. Linear growth increases steadily, while quadratic and cubic functions show accelerating trends, and the exponential function grows the fastest. This also shows that the nested loop structure can efficiently generate and organize multiple formula outputs in a single dataset.
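As a point of comparison, the same table can be built without an inner loop, because R arithmetic is vectorized. The sketch below is an alternative and not the assignment's own method; the names formula_table and compute_formulas_vectorized are hypothetical and introduced here only for illustration.

```r
# Hypothetical alternative: keep each formula as a function in a named list
formula_table <- list(
  linear      = function(x) 3 * x - 2,
  quadratic   = function(x) x^2 - 4 * x + 5,
  cubic       = function(x) 0.05 * x^3 - 3 * x^2 + x + 5,
  exponential = function(x) 1.5^x
)

compute_formulas_vectorized <- function(x_values, formulas) {
  # Build one data frame per formula, then bind them once at the end
  pieces <- lapply(formulas, function(f) {
    data.frame(x = x_values, y = formula_table[[f]](x_values), formula = f)
  })
  do.call(rbind, pieces)
}

demo <- compute_formulas_vectorized(1:5, c("linear", "quadratic"))
print(demo)
```

Binding once at the end avoids copying the growing data frame on every iteration. This sketch does no validation, so a formula name that is not in formula_table stops with an error.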
1.2 Validation
Input validation is added in compute_formula_checked(). A predefined vector valid_formulas is used with the %in% operator to check whether each requested formula is valid. An invalid formula is still recorded in the output, with status "Invalid" and NA values, while valid formulas are computed using the same nested loop structure.
# compute the formulas with input validation
compute_formula_checked <- function(x_values, formulas) {
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  results <- data.frame()
  for (f in formulas) {                      # loop over each formula
    if (!(f %in% valid_formulas)) {          # check whether the formula is valid
      results <- rbind(results, data.frame(
        x = NA,
        y = NA,
        formula = f,
        status = "Invalid"                   # mark invalid formulas
      ))
      next                                   # skip to the next formula
    }
    for (i in x_values) {                    # valid formulas are computed
      if (f == "linear") {
        y <- 3*i - 2
      } else if (f == "quadratic") {
        y <- i^2 - 4*i + 5
      } else if (f == "cubic") {
        y <- 0.05*i^3 - 3*i^2 + i + 5
      } else if (f == "exponential") {
        y <- 1.5^i
      }
      results <- rbind(results, data.frame(  # store the result with a valid status
        x = i,
        y = y,
        formula = f,
        status = "Valid"
      ))
    }
  }
  return(results)
}
Interpretation:
The system successfully handles invalid inputs without causing errors. This highlights the importance of validation in maintaining data quality and ensuring a stable analysis process.
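The validation step on its own can be sketched in a few lines. The requested vector below, including the name "logarithmic", is made up for illustration.

```r
valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
requested <- c("linear", "logarithmic", "cubic")  # "logarithmic" is an invented invalid name

# %in% returns one TRUE/FALSE per requested name
is_valid <- requested %in% valid_formulas

status_table <- data.frame(
  formula = requested,
  status  = ifelse(is_valid, "Valid", "Invalid")
)
print(status_table)
```

Because %in% is vectorized, all requested names are checked in one call, and invalid ones can be reported before any computation starts.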
1.3 Visualization
The visualization uses ggplot2 to generate the line plot and plotly, via ggplotly(), to make the graph interactive. The plot takes data_nested as input and maps x, y, and formula to aesthetics. The geom_line() function displays the trend for each formula, allowing direct comparison of their growth patterns.
Interpretation:
The graph highlights differences in growth patterns, with the exponential function increasing the fastest. This visualization makes it easier to identify patterns that are difficult to observe from tables alone.
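A minimal sketch of such a plot is given below, assuming the ggplot2 package is installed. The small data_nested built here (x from 1 to 10, three formulas) is a stand-in for illustration; the x range and styling used in the assignment are not shown in the text.

```r
library(ggplot2)

# Stand-in data with the same columns as the nested-loop output
x <- 1:10
data_nested <- rbind(
  data.frame(x = x, y = 3 * x - 2,       formula = "linear"),
  data.frame(x = x, y = x^2 - 4 * x + 5, formula = "quadratic"),
  data.frame(x = x, y = 1.5^x,           formula = "exponential")
)

# One line per formula, distinguished by color
p <- ggplot(data_nested, aes(x = x, y = y, color = formula)) +
  geom_line() +
  labs(title = "Formula comparison", x = "x", y = "y")

# plotly::ggplotly(p) would convert the static plot into an interactive one
print(p)
```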
2 Sales & Discounts
This task uses functions, loops, conditional logic, and nested functions to generate realistic data, apply discount rules, and compute cumulative sales. The results are summarized and visualized to analyze performance patterns.
2.1 Nested Function
The function hitung_cumulative() computes cumulative sales per salesperson. Inside it, a nested function (cumulative_manual()) manually calculates running totals using a for loop. It is applied with ave(), which groups the data by sales_id and runs the cumulative calculation within each group.
hitung_cumulative <- function(data) {   # cumulative sales per salesperson
  cumulative_manual <- function(x) {    # nested helper: manual running total
    total <- 0                          # initialise the total
    hasil <- c()                        # empty vector for the results
    for (i in x) {                      # loop over each sales value
      total <- total + i                # add the value to the total
      hasil <- c(hasil, total)          # store the cumulative value
    }
    return(hasil)
  }
  data$cumulative_sales <- ave(         # apply the helper within each sales_id
    data$sales_amount,
    data$sales_id,
    FUN = cumulative_manual
  )
  return(data)                          # return the data with the new column
}
Interpretation:
The cumulative calculation correctly aggregates sales over time for each salesperson, ensuring that individual performance is tracked independently. This approach highlights how total sales evolve sequentially rather than as isolated daily values.
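The manual running total matches what base R's cumsum() computes, so the grouped calculation can be checked on a tiny example. The data frame below is invented for illustration.

```r
# Invented data: two salespersons with interleaved rows
sales <- data.frame(
  sales_id     = c("A", "A", "B", "A", "B"),
  sales_amount = c(100, 200, 50, 300, 150)
)

# ave() applies cumsum() separately within each sales_id and keeps the row order
sales$cumulative_sales <- ave(sales$sales_amount, sales$sales_id, FUN = cumsum)
print(sales)
```

Salesperson A accumulates 100, 300, 600 and salesperson B accumulates 50, 200, each independently of the other.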
2.2 Loops Conditional Discounts
The dataset is generated by generate_sales_data(), which applies nested loops to iterate over salespersons and days. Random sales values are created using sample(), and if-else conditions assign discount rates based on the sales amount.
The discount rules are defined as follows:
- High Sales: Sales amount > 1000 → Discount 20%
- Medium Sales: Sales amount > 500 and ≤ 1000 → Discount 10%
- Low Sales: Sales amount ≤ 500 → Discount 5%
# generate the sales data
generate_sales_data <- function(n_salesperson, days) {
  results <- data.frame()
  for (s in 1:n_salesperson) {               # loop over each salesperson
    sales_id <- paste0("5225", s)            # build a unique ID
    for (d in 1:days) {                      # loop over each day
      sales_amount <- sample(100:1500, 1)    # random sales amount
      # assign the discount based on the sales amount
      if (sales_amount > 1000) {
        discount <- "20%"
      } else if (sales_amount > 500) {
        discount <- "10%"
      } else {
        discount <- "5%"
      }
      # append the row to the data frame
      results <- rbind(results, data.frame(
        sales_id = sales_id,
        day = d,
        sales_amount = sales_amount,
        discount_rate = discount
      ))
    }
  }
  return(results)
}
Interpretation:
The generated dataset reflects realistic variation in daily sales, where higher sales values result in higher discount rates. The cumulative column further shows how repeated transactions contribute to increasing total performance over time for each salesperson.
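The same discount rules can also be applied to a whole column at once. The sketch below is a vectorized alternative, not the assignment's loop; the seed, the sizes (3 salespersons, 4 days), and the name sales_demo are assumptions chosen only to keep the example small and reproducible.

```r
set.seed(42)          # assumed seed, used only so the demo is reproducible
n_salesperson <- 3    # assumed size
days <- 4             # assumed size

# One row per salesperson/day combination
sales_demo <- expand.grid(day = 1:days, s = 1:n_salesperson)
sales_demo$sales_id <- paste0("5225", sales_demo$s)

# Independent random draws, as in the loop version
sales_demo$sales_amount <- sample(100:1500, nrow(sales_demo), replace = TRUE)

# Nested ifelse() mirrors the if / else if / else chain
sales_demo$discount_rate <- ifelse(
  sales_demo$sales_amount > 1000, "20%",
  ifelse(sales_demo$sales_amount > 500, "10%", "5%")
)
print(sales_demo)
```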
2.3 Summary Cumulative
The summary uses dplyr functions such as group_by() and summarise() to calculate total and average sales per salesperson. The visualization is created using ggplot2 with geom_line() and geom_point(), then converted into an interactive plot using plotly via ggplotly().
summary_sales <- sales_data %>%                # summary per salesperson
  group_by(sales_id) %>%                       # group by sales_id
  summarise(
    total_sales = sum(sales_amount),           # total sales
    avg_sales = round(mean(sales_amount), 2)   # average sales
  )
Interpretation:
The summary highlights differences in sales performance across individuals, where some salespersons achieve higher total sales and averages than others. This indicates variability in productivity and effectiveness among salespersons.
Interpretation:
The cumulative sales plot shows a consistent upward trend for all salespersons, as values accumulate over time. Differences in line steepness indicate varying sales performance, where steeper lines represent faster growth in total sales.
3 Performance Categorization
This task combines loop-based functions, dplyr for aggregation, and plotly for visualization. The goal is to transform raw sales data into meaningful performance insights.
3.1 Loop Through Vector
The function categorize_performance() classifies sales values into five categories: Excellent, Very Good, Good, Average, and Poor. It uses a for loop to iterate through each value in the input vector and applies if-else conditions to determine the appropriate category.
The performance classification rules are defined as follows:
- Excellent: Sales amount > 1200
- Very Good: Sales amount > 900 and ≤ 1200
- Good: Sales amount > 600 and ≤ 900
- Average: Sales amount > 300 and ≤ 600
- Poor: Sales amount ≤ 300
# categorise performance based on the sales amount
categorize_performance <- function(sales_vector) {
  categories <- c()                # collects the results
  # loop over each sales value
  for (i in sales_vector) {
    if (i > 1200) {
      category <- "Excellent"
    } else if (i > 900) {
      category <- "Very Good"
    } else if (i > 600) {
      category <- "Good"
    } else if (i > 300) {
      category <- "Average"
    } else {
      category <- "Poor"
    }
    categories <- c(categories, category)
  }
  return(categories)
}
Interpretation:
The function successfully classifies each sales value into predefined performance categories using a loop-based approach. This ensures that all observations are consistently evaluated based on the same thresholds, making the categorization process structured and reproducible.
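The same thresholds can be expressed with base R's cut(), which is useful for checking how boundary values are treated. The function name categorize_with_cut is hypothetical, and this is an alternative sketch rather than the loop-based method the task asks for.

```r
categorize_with_cut <- function(sales_vector) {
  as.character(cut(
    sales_vector,
    breaks = c(-Inf, 300, 600, 900, 1200, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = TRUE   # intervals are (lower, upper], so 300 is "Poor" and 301 is "Average"
  ))
}

boundary_check <- categorize_with_cut(c(300, 301, 900, 1200, 1201))
print(boundary_check)
```

The boundary values land as "Poor", "Average", "Good", "Very Good", "Excellent", which agrees with the strict greater-than comparisons in the loop version.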
3.2 Percentages per Category
The summary uses dplyr functions such as group_by() and summarise() to count the number of observations in each category. The mutate() function is then used to compute percentages, giving a clearer view of how the categories are distributed.
sales_vector <- sales_data$sales_amount               # sales column from task 2
performance <- categorize_performance(sales_vector)   # categorise performance
sales_data$performance <- performance                 # add the category to the dataset
summary_perf <- sales_data %>%                        # count and percentage per category
  group_by(performance) %>%                           # group by category
  summarise(count = n()) %>%                          # number of rows per category
  mutate(percentage = round((count / sum(count)) * 100, 2))  # percentage per category
Interpretation:
The distribution shows that most sales fall into the “Excellent” and “Good” categories, each having the highest frequency. The “Average” category appears moderately, while “Poor” has the lowest count. This indicates that overall sales performance tends to be strong, with only a small portion of low-performing transactions.
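The count-then-percentage step can be reproduced in base R with table() and prop.table(). The category vector below is invented for illustration and does not reflect the assignment's data.

```r
# Invented category labels, eight observations
performance <- c("Good", "Poor", "Good", "Excellent",
                 "Good", "Average", "Excellent", "Good")

counts      <- table(performance)                  # frequency per category
percentages <- round(100 * prop.table(counts), 2)  # share of the total, in percent

summary_demo <- data.frame(
  performance = names(counts),
  count       = as.vector(counts),
  percentage  = as.vector(percentages)
)
print(summary_demo)
```

With four of eight values labelled "Good", that category receives 50%, and the percentages sum to 100.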
3.3 Bar plot and pie chart of distribution
The distribution is visualized with plotly, which displays both a horizontal bar chart and a donut chart in a single layout.
Interpretation:
The visualization clearly shows that the distribution is concentrated in the higher performance categories, especially “Excellent” and “Good”, which have the largest proportions. In contrast, “Average” and “Poor” contribute a smaller share of the data. This pattern confirms that most sales transactions achieve relatively high performance levels, while low-performance cases are limited.
4 Company Dataset Simulation
4.1 Nested Loops
The function generate_company_data() is implemented using nested loops to simulate hierarchical data across companies and employees. Random values are generated using sample() and rnorm(), while if-else conditions classify employee performance.
The following code is used to create the company dataset:
generate_company_data <- function(n_company, n_employees) {   # generate the company data
  data <- data.frame()
  dept_list <- c("HR", "Finance", "IT", "Marketing")
  for (c in 1:n_company) {                                # loop over each company
    company_id <- paste0("COMP", c)
    for (e in 1:n_employees) {                            # loop over each employee
      employee_id <- paste0(company_id, "_EMP", e)
      salary <- sample(4000:15000, 1)                     # random salary
      department <- sample(dept_list, 1)                  # random department
      performance_score <- sample(65:98, 1)               # performance score
      KPI_score <- round(rnorm(1, mean = 85, sd = 8))     # KPI with normal variation
      KPI_score <- max(min(KPI_score, 100), 70)           # clamp to 70–100
      # conditional classification
      if (KPI_score > 90) {
        status <- "Top Performer"
      } else {
        status <- "Regular"
      }
      data <- rbind(data, data.frame(
        company_id = company_id,
        employee_id = employee_id,
        salary = salary,
        department = department,
        performance_score = performance_score,
        KPI_score = KPI_score,
        status = status
      ))
    }
  }
  return(data)
}
Interpretation:
This function successfully generates a structured dataset by combining companies and employees through nested loops. This approach allows for more complex and realistic data simulation.
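Calling rbind() inside the inner loop copies the whole data frame on every iteration, which becomes slow as the number of rows grows. A common alternative, sketched below, collects the rows in a preallocated list and binds them once. The function name build_rows_with_list is hypothetical, and only three of the columns are reproduced to keep the example short.

```r
build_rows_with_list <- function(n_company, n_employees) {
  rows <- vector("list", n_company * n_employees)  # one slot per employee
  k <- 0
  for (comp in seq_len(n_company)) {               # outer loop: companies
    for (emp in seq_len(n_employees)) {            # inner loop: employees
      k <- k + 1
      rows[[k]] <- data.frame(
        company_id  = paste0("COMP", comp),
        employee_id = paste0("COMP", comp, "_EMP", emp),
        salary      = sample(4000:15000, 1)
      )
    }
  }
  do.call(rbind, rows)                             # single bind at the end
}

set.seed(1)  # assumed seed for a reproducible demo
demo_company <- build_rows_with_list(2, 3)
print(demo_company)
```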
4.2 Conditional Logic
The function uses if-else to classify employees based on KPI scores. The resulting dataset is then shuffled using sample() so that the rows are no longer ordered by company.
company_data <- generate_company_data(4, 12)                 # generate the data
company_data <- company_data[sample(nrow(company_data)), ]   # shuffle the rows
Interpretation:
The conditional logic successfully classifies employees into “Top Performer” and “Regular” based on KPI scores. This helps distinguish high-performing employees and simplifies further analysis across companies.
4.3 Summary per company
The summary uses dplyr functions such as group_by() and summarise() to compute averages and maximum values per company.
summary_company <- company_data %>%                       # summary per company
  group_by(company_id) %>%                                # group by company
  summarise(
    avg_salary = round(mean(salary), 2),                  # average salary
    avg_performance = round(mean(performance_score), 2),  # average performance
    max_KPI = max(KPI_score)                              # highest KPI
  )
Interpretation:
The summary shows that each company has different average salary and performance levels. However, higher average salary does not always align with higher maximum KPI. This indicates that overall compensation does not directly guarantee top individual performance, and performance variation still exists within each company.
4.4 Visualizations
This section uses plotly to create interactive visualizations. The bar chart compares average salary, while the line chart shows maximum KPI per company.
Interpretation:
The bar chart shows that Company 1 has the highest average salary among all companies. However, the line chart reveals that the highest maximum KPI is achieved by Company 2 and Company 4 (reaching 100), while Company 1 is slightly below. This indicates that the company with the highest salary does not necessarily have the best KPI performance.
5 Pi & Probability
5.1 Loop
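The simulation uses a for loop and the runif() function to generate random points within the unit square (both coordinates between 0 and 1). Conditional logic (if-else) determines whether each point lies inside the circle of radius 1 centered at the origin, of which one quarter falls inside this square, and whether it lies within a sub-square.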
# Monte Carlo simulation
monte_carlo_sim <- function(n_points) {
  data <- data.frame()
  for (i in 1:n_points) {             # loop that generates the points
    x <- runif(1)                     # random coordinate in (0, 1)
    y <- runif(1)
    if (x^2 + y^2 <= 1) {             # check whether the point is inside the circle
      posisi <- "Inside"
    } else {
      posisi <- "Outside"
    }
    if (x <= 0.5 && y <= 0.5) {       # check whether the point is in the sub-square
      square <- "Yes"
    } else {
      square <- "No"
    }
    data <- rbind(data, data.frame(   # append the row to the data frame
      x = x,
      y = y,
      posisi = posisi,
      in_square = square
    ))
  }
  return(data)
}
Interpretation:
This function generates random points and classifies them based on specific conditions. This approach enables a probability-based simulation that mimics random distribution in a two-dimensional space.
5.2 Count
Points inside the circle are counted using sum(). The proportion of points inside the circle relative to the total number of points is then multiplied by 4 to estimate π, since the quarter circle covers π/4 of the unit square.
n_points <- 3000                                   # number of trial points in the Monte Carlo simulation
mc_data <- monte_carlo_sim(n_points)               # dataset of point positions
inside_count <- sum(mc_data$posisi == "Inside")    # number of points labelled "Inside"
pi_estimate <- 4 * (inside_count / n_points)       # estimate π from the area ratio
Interpretation:
The estimated value of π is 3.176, which is reasonably close to the actual value of π (≈ 3.1416). The difference occurs because the Monte Carlo method relies on random sampling, so the result is an approximation rather than an exact value. Increasing the number of points generally improves accuracy, with the typical error shrinking in proportion to 1/√n.
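The effect of the sample size can be sketched with a vectorized version of the same estimator. The function name estimate_pi, the seed, and the chosen sample sizes are assumptions for illustration; the standard error uses the usual binomial formula for a proportion, scaled by 4.

```r
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  p_hat <- mean(x^2 + y^2 <= 1)   # share of points inside the quarter circle
  c(
    estimate  = 4 * p_hat,
    std_error = 4 * sqrt(p_hat * (1 - p_hat) / n_points)
  )
}

set.seed(123)  # assumed seed for a reproducible demo
for (n in c(1000, 10000, 100000)) {
  print(round(estimate_pi(n), 4))
}
```

Each tenfold increase in n reduces the standard error by a factor of about √10, which is why large gains in accuracy require many more points.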
5.3 Probability of Random Points
The probability is computed using sum() and comparison operators. The in_square column identifies whether a point falls inside the sub-square (x ≤ 0.5 and y ≤ 0.5). The number of such points is counted and divided by the total number of points to obtain the probability.
# count the points that fall inside the sub-square
square_count <- sum(mc_data$in_square == "Yes")   # number of "Yes" rows
# probability that a point falls inside the sub-square
prob_square <- square_count / n_points            # points in the square divided by all points
Interpretation:
The probability result is close to 25%, which aligns with the theoretical expectation since the sub-square represents one-fourth of the total area. This confirms that the random points are distributed uniformly, and the simulation behaves as expected.
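How close to 25% the simulated value should be can be gauged from the binomial standard error. The sketch below is a supplementary check under the assumption that the points are independent and uniform; it is not part of the original analysis.

```r
n_points <- 3000                 # same sample size as the simulation above
p_theory <- 0.5 * 0.5            # area of the sub-square inside the unit square

# Standard error of a sample proportion
std_error <- sqrt(p_theory * (1 - p_theory) / n_points)

# Rough range covering about 95% of simulation runs
interval <- c(
  lower = p_theory - 2 * std_error,
  upper = p_theory + 2 * std_error
)
print(round(interval, 4))
```

With 3000 points, a result between roughly 23.4% and 26.6% is consistent with the theoretical 25%.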
5.4 Visualization
The plot is built with ggplot2, where geom_point() displays the random points and stat_function() draws the circle boundary. The plot is then converted into an interactive visualization using ggplotly() from the plotly package, which makes the data easier to explore.
Interpretation:
The visualization shows a clear separation between points inside and outside the circle. The density of points inside the boundary reflects the proportion used in estimating π. This visual pattern helps validate the Monte Carlo approach, where area comparison is approximated through random sampling.
6 Advanced Data
6.1 Loop
Two transformation functions are defined: normalize_df() and zscore_df(). Both use a for loop to iterate through the selected numeric columns. The min() and max() functions are used in normalization to scale values into the 0–1 range, while mean() and sd() are used in the z-score calculation to measure how many standard deviations each value lies from the mean.
normalize_df <- function(df, cols) {   # min-max normalization (0–1 scale)
  for (col in cols) {                  # loop over each numeric column
    min_val <- min(df[[col]])          # minimum value
    max_val <- max(df[[col]])          # maximum value
    # compute the normalized values and store them in a new column with suffix "_norm"
    df[[paste0(col, "_norm")]] <- (df[[col]] - min_val) / (max_val - min_val)
  }
  return(df)                           # return the normalized data
}
zscore_df <- function(df, cols) {      # z-score standardization
  for (col in cols) {                  # loop over each numeric column
    mean_val <- mean(df[[col]])        # mean
    sd_val <- sd(df[[col]])            # standard deviation
    # compute the z-score and store it in a new column with suffix "_z"
    df[[paste0(col, "_z")]] <- (df[[col]] - mean_val) / sd_val
  }
  return(df)                           # return the transformed data
}
Interpretation:
The loop-based approach ensures that normalization and standardization are applied consistently across all selected variables. As a result, each numeric column is transformed into a comparable scale, making further analysis more reliable.
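A small worked example makes both formulas concrete. The three salary values are invented, and safe_normalize is a hypothetical helper showing one possible way to handle a constant column, where max and min are equal and the normalization formula would divide by zero.

```r
x <- c(4000, 7000, 10000)                      # invented salary values

x_norm <- (x - min(x)) / (max(x) - min(x))     # min-max: 0, 0.5, 1
x_z    <- (x - mean(x)) / sd(x)                # z-score: -1, 0, 1 (mean 7000, sd 3000)

# Hypothetical guard: a constant column has zero range
safe_normalize <- function(v) {
  value_range <- max(v) - min(v)
  if (value_range == 0) {
    return(rep(0, length(v)))                  # assumed convention for constant columns
  }
  (v - min(v)) / value_range
}

print(x_norm)
print(x_z)
print(safe_normalize(c(5, 5, 5)))
```

Returning zeros for a constant column is only one convention; returning NA or raising an error would be equally defensible depending on the analysis.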
6.2 Feature Engineering
The dataset from the previous task (company_data) is used as the input. The functions normalize_df() and zscore_df() are applied to transform numeric variables such as salary, performance_score, and KPI_score. Additionally, the ifelse() function is used to create new features such as performance_category and salary_bracket, which group the data for easier analysis.
# use the data from the previous task
df <- company_data                                           # copy the dataset so the original is unchanged
num_cols <- c("salary", "performance_score", "KPI_score")    # numeric columns to transform
df <- normalize_df(df, num_cols)                             # min-max normalization (0–1)
df <- zscore_df(df, num_cols)                                # z-scores: deviation from the mean
# round the transformed values for readability
for (col in c("salary_norm", "performance_score_norm", "KPI_score_norm",
              "salary_z", "performance_score_z", "KPI_score_z")) {
  df[[col]] <- round(df[[col]], 3)                           # round to 3 decimals
}
# FEATURE ENGINEERING
# performance category based on performance_score
df$performance_category <- ifelse(df$performance_score > 85, "High",
                                  ifelse(df$performance_score > 70, "Medium", "Low"))
# salary bracket based on salary range
df$salary_bracket <- ifelse(df$salary > 10000, "High",
                            ifelse(df$salary > 7000, "Medium", "Low"))
Interpretation:
The newly created features transform continuous variables into categorical groups, making patterns easier to interpret. For example, employees can now be quickly compared based on performance level and salary range without relying on raw numeric values.
6.3 Compare Before After
The comparison uses the summary() function to observe differences in descriptive statistics before and after transformation. This shows how normalization changes the scale without altering the relative distribution.
compare_table <- data.frame(                                    # before/after comparison table
  Statistic = c("Min", "Q1", "Median", "Mean", "Q3", "Max"),    # descriptive statistics
  Before = as.numeric(summary(company_data$salary)),            # salary statistics before
  After = round(as.numeric(summary(df$salary_norm)), 4)         # statistics after normalization
)
datatable(                            # display the table in interactive form
  compare_table,
  rownames = FALSE,                   # hide the row index
  options = list(pageLength = 6),     # number of rows shown
  class = "stripe hover"              # table styling
) %>%
  formatStyle(
    names(compare_table)              # apply the style to all columns
  )
Interpretation:
The comparison shows that normalization changes the scale of the data into a 0–1 range while preserving the overall distribution pattern. This means that relative differences between values remain unchanged, allowing fair comparison across variables.
6.4 Visualizations
The visualizations are built with ggplot2, where geom_histogram() is used to observe the data distribution and geom_boxplot() to examine spread and outliers. The plots are then converted into interactive visuals using ggplotly() from the plotly package.
Interpretation:
The histogram shows that the overall shape of the distribution remains consistent before and after transformation, confirming that normalization does not distort the data pattern. The boxplot further supports this by showing that the relative spread and median position remain proportional.
To enable a direct visual comparison, the normalized data is rescaled back to the original range. This helps demonstrate that normalization changes only the scale, not the underlying structure of the data.
7 Mini Project
7.1 Generate Dataset
The sample() function generates random values, while rnorm() adds random variation to KPI relative to performance.
generate_company_data <- function() {                        # generate the multi-company dataset
  emp_counts <- c(70, 50, 95, 65, 135, 110, 150)             # number of employees per company
  departments <- c("HR", "Finance", "IT", "Marketing", "Operations")   # department list
  data <- data.frame()                                       # empty data frame
  for (c in seq_along(emp_counts)) {                         # loop over each company
    company_id <- paste0("1626C", c)                         # company ID
    n_emp <- emp_counts[c]                                   # number of employees
    for (e in 1:n_emp) {                                     # loop over each employee in the company
      employee_id <- paste0(company_id, "_", sprintf("%03d", e))   # employee ID
      salary <- sample(3000:15000, 1)                        # random salary
      performance <- sample(60:100, 1)                       # random performance score
      KPI <- round(performance + rnorm(1, 0, 5))             # KPI based on performance plus normal noise
      KPI <- max(min(KPI, 100), 50)                          # clamp KPI to 50–100
      dept <- sample(departments, 1)                         # random department
      # append the row to the dataset
      data <- rbind(data, data.frame(
        employee_id = employee_id,
        company_id = company_id,
        salary = salary,
        performance_score = performance,
        KPI_score = KPI,
        department = dept
      ))
    }
  }
  data <- data[sample(nrow(data)), ]                         # shuffle the row order
  return(data)                                               # return the dataset
}
Interpretation:
The generated dataset contains multiple companies with varying employee counts, salaries, performance, and KPI values. This structure simulates real-world organizational data and supports further analysis across multiple dimensions.
7.2 Summarize per company
The summary uses group_by() and summarise() to calculate the average salary, the average KPI, and the number of top performers per company.
summary_company <- company_data %>%          # summary per company
  group_by(company_id) %>%                   # group by company
  summarise(
    avg_salary = round(mean(salary), 0),     # average salary
    avg_KPI = round(mean(KPI_score), 1),     # average KPI
    top_performers = sum(KPI_score > 90)     # number of employees with KPI > 90
  )
Interpretation:
The summary highlights differences in average salary, KPI, and number of top performers across companies. This indicates that each company has a distinct performance profile and workforce composition.
7.3 Loop-based
A for loop with if-else groups KPI scores into High, Medium, and Low categories:
- High: KPI score > 90
- Medium: KPI score > 75 and ≤ 90
- Low: KPI score ≤ 75
categorize_kpi <- function(df) {       # categorise KPI into tiers
  tiers <- c()                         # empty vector for the categories
  for (i in df$KPI_score) {            # loop over each KPI value
    if (i > 90) {
      tiers <- c(tiers, "High")        # high KPI
    } else if (i > 75) {
      tiers <- c(tiers, "Medium")      # medium KPI
    } else {
      tiers <- c(tiers, "Low")         # low KPI
    }
  }
  df$KPI_tier <- tiers                 # add the category column to the data
  return(df)                           # return the data with the new column
}
company_data <- categorize_kpi(company_data)   # apply the function to the dataset
Interpretation:
The loop-based categorization groups employees into KPI tiers, making it easier to analyze performance distribution. This approach simplifies comparison across companies and improves interpretability.
7.4 Output Tables
The output tables use filter() for top performers and group_by() for department-level insights.
top_perf <- company_data %>%                 # employees with a high KPI (top performers)
  filter(KPI_score > 90)                     # keep employees with KPI above 90
# summary per company and department
dept_analysis <- company_data %>%
  group_by(company_id, department) %>%       # group by company and department
  summarise(
    avg_salary = round(mean(salary), 0),     # average salary
    count = n(),                             # number of employees
    .groups = "drop"                         # drop the grouping structure
  )
output_list <- list()                        # list that collects the output
for (comp in unique(dept_analysis$company_id)) {             # loop over each company
  title <- tags$h4(paste("Department Analysis -", comp))     # heading per company
  table <- datatable(
    dept_analysis %>% filter(company_id == comp),            # rows for this company
    rownames = FALSE,
    class = "stripe hover"                                   # table styling
  )
  output_list[[comp]] <- tagList(title, table, tags$br())    # store the result in the list
}
tagList(output_list)                         # display the output
Output: one Department Analysis table per company, 1626C1 to 1626C7.
Interpretation:
The results show how top performers are distributed across companies and how departments differ in salary and employee count. This helps identify which departments contribute most to overall performance.
7.5 Salary Distribution
The plot uses geom_histogram() to observe how salary values are spread across companies. This helps in understanding compensation variability between companies.
Interpretation:
The salary distribution shows variation across companies, indicating differences in compensation structures and workforce composition. Some companies display wider ranges, suggesting more diverse salary levels.
7.6 Advanced visualization
The plots use geom_col() for grouped bar charts and geom_smooth() to add regression lines to scatter plots. This helps compare departments and analyze the relationship between performance and KPI.
Interpretation:
The grouped bar chart compares average salary across departments and companies, while the scatter plot shows a positive relationship between performance and KPI. This suggests that higher performance generally leads to higher KPI scores.
8 Automated Report
The function generate_company_report() is created to automatically generate a report for a single company. Inside the function, filter() subsets the data, summarise() computes key metrics, and ggplot2 creates the visualizations. The subplot() function combines multiple plots into one view.
The function is then called for every company. Each call generates a complete report, which is stored and displayed using tagList() from the htmltools package.
generate_company_report <- function(df, comp_id) {
  data_comp <- df %>% filter(company_id == comp_id)   # data for one company
  # main summary
  summary <- data_comp %>%
    summarise(
      avg_salary = round(mean(salary), 0),            # average salary
      avg_KPI = round(mean(KPI_score), 1),            # average KPI
      total_employee = n()                            # number of employees
    )
  # top performers
  top_perf <- data_comp %>% filter(KPI_score > 90)
  # salary histogram
  p1 <- ggplot(data_comp, aes(salary)) +
    geom_histogram(fill = "#3b82f6", bins = 15, alpha = 0.7) +
    labs(title = "Salary Distribution") +
    theme_minimal() +
    theme(
      plot.background = element_rect(fill = "#eaf2ff", color = NA),
      panel.background = element_rect(fill = "#eaf2ff", color = NA)
    )
  # scatter plot of performance vs KPI
  p2 <- ggplot(data_comp, aes(performance_score, KPI_score)) +
    geom_point(color = "#1e4b7a", alpha = 0.6) +
    geom_smooth(method = "lm", se = FALSE, color = "#f97316") +
    labs(title = "Performance vs KPI") +
    theme_minimal() +
    theme(
      plot.background = element_rect(fill = "#eaf2ff", color = NA),
      panel.background = element_rect(fill = "#eaf2ff", color = NA)
    )
  # short automatic insight
  insight <- paste(
    "Avg salary:", summary$avg_salary,
    "| Avg KPI:", summary$avg_KPI,
    "| Top performers:", nrow(top_perf)
  )
  combined_plot <- subplot(                           # combine the plots
    ggplotly(p1),
    ggplotly(p2),
    nrows = 1
  ) %>%
    layout(
      paper_bgcolor = "#eaf2ff",
      plot_bgcolor = "#eaf2ff"
    )
  # output per company
  tagList(
    tags$h3(paste("Company Report -", comp_id), style = "color:#1e4b7a;"),
    tags$h4("Summary"),
    datatable(summary, rownames = FALSE),
    tags$h4("Top Performers"),
    datatable(top_perf, rownames = FALSE, options = list(pageLength = 5)),
    tags$h4("Visualizations"),
    combined_plot,
    tags$p(insight, style = "font-size:20px; color:#006400; margin-top:10px"),
    tags$hr()
  )
}
This section applies a loop to automatically generate reports for all companies. The lapply() function is used to iterate through each company_id, calling the report function and storing the results in a list.
# loop over all companies
company_list <- sort(unique(company_data$company_id))   # sorted company IDs
report_list <- lapply(company_list, function(comp) {
  generate_company_report(company_data, comp)           # generate the report for one company
})
tagList(report_list)                                    # display all reports
Each company report contains Summary, Top Performers, and Visualizations sections. The automatic insight line for each company reads:
- Company Report - 1626C1: Avg salary: 9430 | Avg KPI: 80.2 | Top performers: 17
- Company Report - 1626C2: Avg salary: 8562 | Avg KPI: 80.2 | Top performers: 10
- Company Report - 1626C3: Avg salary: 8988 | Avg KPI: 79.7 | Top performers: 22
- Company Report - 1626C4: Avg salary: 9002 | Avg KPI: 82 | Top performers: 17
- Company Report - 1626C5: Avg salary: 8871 | Avg KPI: 79.2 | Top performers: 25
- Company Report - 1626C6: Avg salary: 9464 | Avg KPI: 81.6 | Top performers: 33
- Company Report - 1626C7: Avg salary: 9466 | Avg KPI: 80.1 | Top performers: 42
Interpretation:
The function successfully generates an automated report for each company using a combination of functions, loops, and data visualization. Each report includes summary statistics, top performers, and interactive plots, making the analysis more structured and efficient.
The following code exports the data of each company to a separate CSV file:
# export each company's data to CSV
for (comp in company_list) {
  data_comp <- company_data %>%
    filter(company_id == comp)
  write.csv(
    data_comp,
    paste0("report_", comp, ".csv"),
    row.names = FALSE
  )
}
Interpretation:
The export process successfully generates separate CSV files for each company. This demonstrates how automation can be extended beyond visualization into data storage, allowing the results to be reused for further analysis or reporting outside the R environment.
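The export can be verified by reading one of the files back. The sketch below uses base R only, a small invented data frame, and a temporary folder (out_dir) as an assumed output location, so it does not write into the working directory.

```r
# Invented data with two companies
demo_data <- data.frame(
  company_id = c("1626C1", "1626C1", "1626C2"),
  salary     = c(5000, 7000, 9000)
)

out_dir <- tempdir()   # assumed output folder for the demo

for (comp in unique(demo_data$company_id)) {
  write.csv(
    demo_data[demo_data$company_id == comp, ],            # rows for this company
    file.path(out_dir, paste0("report_", comp, ".csv")),
    row.names = FALSE
  )
}

# Read one file back to confirm the content survived the round trip
check <- read.csv(file.path(out_dir, "report_1626C1.csv"))
print(check)
```

Using file.path() keeps the path portable across operating systems, and reading the file back confirms that each CSV contains only the rows of its own company.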