Assignment DS Programming Week 5

Advanced Practicum: Functions & Loops + Data Science

1 Dynamic Multi-Formula

Dynamic Multi-Formula Computation

This task implements a dynamic function in R to compute multiple mathematical formulas, including linear, quadratic, cubic, and exponential. The computation leverages functions, nested loops, and conditional statements to generate structured outputs. The results are further visualized to highlight differences in growth patterns across formulas.

1.1 Nested loops

Nested loops to compute multiple formulas at once

This section implements a custom function compute_multiple_formulas() to calculate multiple mathematical formulas simultaneously. The function uses nested for loops, where the outer loop iterates over the formula types (formulas) and the inner loop iterates over input values (x_values). Conditional statements (if-else) are used inside the loop to apply the correct formula (linear, quadratic, cubic, exponential). The results are stored dynamically using rbind() into a data frame.

Linear

y = 3x - 2

Quadratic

y = x² - 4x + 5

Cubic

y = 0.05x³ - 3x² + x + 5

Exponential

y = 1.5ˣ

# compute each formula type using nested loops
compute_multiple_formulas <- function(x_values, formulas) {
  results <- data.frame()                # empty data frame to store the results
  for (f in formulas) {                  # loop over each formula type
    for (i in x_values) {                # loop over each x value

      # choose the formula based on the formula type
      if (f == "linear") {
        y <- 3*i - 2
      } else if (f == "quadratic") {
        y <- i^2 - 4*i + 5
      } else if (f == "cubic") {
        y <- 0.05*i^3 - 3*i^2 + i + 5
      } else if (f == "exponential") {
        y <- 1.5^i
      } else {
        stop("Unknown formula: ", f)     # avoid silently reusing a stale y value
      }
      results <- rbind(results, data.frame(x = i, y = y, formula = f))  # store the result
    }
  }
  return(results)
}

Interpretation:

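The results confirm that each formula produces a distinct value pattern. The linear formula increases steadily and the quadratic accelerates, while the exponential function grows the fastest at larger x. The cubic behaves differently over x = 1 to 20 (the range plotted in Section 1.3): its -3x² term dominates there, so it falls to about -775 at x = 20 instead of rising. The nested loop structure collects all formula outputs in a single long-format data frame, although growing a data frame with rbind() inside a loop becomes slow for large inputs.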
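As a point of comparison, the same table can be built without the inner loop, because R arithmetic is vectorized. The sketch below is an alternative, not the function used in this assignment; the names formula_table and compute_formulas_vectorized are hypothetical.

```r
# Hypothetical lookup table: one vectorized function per formula name
formula_table <- list(
  linear      = function(x) 3 * x - 2,
  quadratic   = function(x) x^2 - 4 * x + 5,
  cubic       = function(x) 0.05 * x^3 - 3 * x^2 + x + 5,
  exponential = function(x) 1.5^x
)

compute_formulas_vectorized <- function(x_values, formulas) {
  # Build one data frame per formula, then bind them once at the end
  pieces <- lapply(formulas, function(f) {
    data.frame(x = x_values, y = formula_table[[f]](x_values), formula = f)
  })
  do.call(rbind, pieces)
}

demo <- compute_formulas_vectorized(1:20, c("linear", "cubic"))
head(demo)
```

Binding once at the end avoids copying the growing data frame on every iteration, which is the main cost of the rbind()-in-a-loop pattern.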

1.2 Validation

Validate Formula Input

This section introduces validation using the function compute_formula_checked(). A predefined vector valid_formulas is used along with the %in% operator to check whether the input formula is valid. If the formula is not valid, it is still recorded in the output with status "Invalid", while valid formulas are computed using the same nested loop structure.

# function to compute formulas with input validation
compute_formula_checked <- function(x_values, formulas) {
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  results <- data.frame()

  for (f in formulas) {                 # loop over each formula
    if (!(f %in% valid_formulas)) {     # check whether the formula is valid
      results <- rbind(results, data.frame(
        x = NA,
        y = NA,
        formula = f,
        status = "Invalid"              # status recorded for invalid input
      ))
      next                              # skip to the next formula
    }

    for (i in x_values) {               # valid formulas are computed
      if (f == "linear") {
        y <- 3*i - 2
      } else if (f == "quadratic") {
        y <- i^2 - 4*i + 5
      } else if (f == "cubic") {
        y <- 0.05*i^3 - 3*i^2 + i + 5
      } else if (f == "exponential") {
        y <- 1.5^i
      }

      results <- rbind(results, data.frame(   # store the result with status "Valid"
        x = i,
        y = y,
        formula = f,
        status = "Valid"
      ))
    }
  }
  return(results)
}

Interpretation:

The system successfully handles invalid inputs without causing errors. This highlights the importance of validation in maintaining data quality and ensuring a stable analysis process.
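The %in% check can also be applied to the whole input vector at once. The following is a minimal sketch of that idea; the requested vector is a made-up input chosen only to include one invalid name.

```r
valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
requested <- c("linear", "logarithmic", "cubic")   # hypothetical user input

is_valid   <- requested %in% valid_formulas         # one logical value per request
to_compute <- requested[is_valid]                   # names that can be evaluated
rejected   <- requested[!is_valid]                  # names to report as "Invalid"
status     <- ifelse(is_valid, "Valid", "Invalid")

data.frame(formula = requested, status = status)
```

Separating the valid and rejected names up front makes it easy to report every invalid input in one message before any computation starts.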

1.3 Visualization

Plot all formulas on same graph for x = 1:20

Visualization is created using ggplot2 to generate the line plot and plotly via ggplotly() to make the graph interactive. The plot uses data_nested as input and maps x, y, and formula into aesthetics. The geom_line() function is used to display trends for each formula, allowing direct comparison of their growth patterns.

Interpretation:

The graph highlights differences in growth patterns, with the exponential function increasing the fastest. This visualization makes it easier to identify patterns that are difficult to observe from tables alone.

2 Sales & Discounts

Multi-Sales Simulation with Discounts and Cumulative Analysis

This task simulates sales data for multiple salespersons over several days. The process combines functions, loops, conditional logic, and nested functions to generate realistic data, apply discount rules, and compute cumulative sales. The results are summarized and visualized to analyze performance patterns.

2.1 Nested Function

Cumulative Sales Calculation using hitung_cumulative()

This section defines a function hitung_cumulative() to compute cumulative sales per salesperson. Inside it, a nested function (cumulative_manual()) is created to manually calculate running totals using a for loop. The function is applied using ave(), which groups data by sales_id and applies the cumulative calculation for each group.

hitung_cumulative <- function(data) {   # compute cumulative sales per salesperson
  cumulative_manual <- function(x) {    # manual running-total function
    total <- 0                    # initialize the total
    hasil <- c()                  # empty vector for the results

    for (i in x) {                # loop over each sales value
      total <- total + i          # add the value to the total
      hasil <- c(hasil, total)    # store the cumulative result
    }
    return(hasil)
  }

  data$cumulative_sales <- ave(   # apply the cumulative function to each sales_id
    data$sales_amount, 
    data$sales_id, 
    FUN = cumulative_manual
  )
  return(data)   # return the data with the new column
}

Interpretation:

The cumulative calculation correctly aggregates sales over time for each salesperson, ensuring that individual performance is tracked independently. This approach highlights how total sales evolve sequentially rather than as isolated daily values.
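For reference, base R's cumsum() computes the same running total as the manual loop, so it can be passed to ave() directly. The small data frame below is invented for illustration and deliberately interleaves two salespersons.

```r
# Hypothetical toy data: rows for two salespersons in mixed order
sales <- data.frame(
  sales_id     = c("A", "A", "B", "A", "B"),
  sales_amount = c(100, 250, 400, 50, 600)
)

# ave() splits by sales_id, applies cumsum() within each group,
# and returns the values in the original row order
sales$cumulative_sales <- ave(sales$sales_amount, sales$sales_id, FUN = cumsum)
sales
```

Because ave() preserves row order, the running totals stay aligned with their rows even when the groups are not contiguous.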

2.2 Loops Conditional Discounts

Sales Data Generation using generate_sales_data()

Sales data is generated using the function generate_sales_data(), which applies nested loops to iterate over salespersons and days. Random sales values are created using sample(), and if-else conditions are used to assign discount rates based on sales amount.

The discount rules are defined as follows:

High Sales: Sales amount > 1000 → Discount 20%
Medium Sales: Sales amount > 500 and ≤ 1000 → Discount 10%
Low Sales: Sales amount ≤ 500 → Discount 5%

# function to generate the sales data
generate_sales_data <- function(n_salesperson, days) {
  results <- data.frame()

  for (s in 1:n_salesperson) {              # loop over each salesperson
    sales_id <- paste0("5225", s)           # build a unique ID

    for (d in 1:days) {                     # loop over each day
      sales_amount <- sample(100:1500, 1)   # draw a random sales amount

      # assign the discount based on the sales amount
      if (sales_amount > 1000) {
        discount <- "20%"
      } else if (sales_amount > 500) {
        discount <- "10%"
      } else {
        discount <- "5%"
      }

      # append to the data frame
      results <- rbind(results, data.frame(
        sales_id = sales_id,
        day = d,
        sales_amount = sales_amount,
        discount_rate = discount
      ))
    }
  }

  return(results)
}

Interpretation:

The generated dataset reflects realistic variation in daily sales, where higher sales values result in higher discount rates. The cumulative column further shows how repeated transactions contribute to increasing total performance over time for each salesperson.
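The three discount rules can also be expressed as interval breaks. The sketch below uses cut() as an alternative to the if-else chain; the input values are chosen to sit on and around the 500 and 1000 boundaries.

```r
sales_amount <- c(100, 500, 501, 1000, 1001, 1500)   # boundary test values

# Intervals are closed on the right by default:
# (-Inf, 500] -> 5%, (500, 1000] -> 10%, (1000, Inf) -> 20%
discount_rate <- cut(
  sales_amount,
  breaks = c(-Inf, 500, 1000, Inf),
  labels = c("5%", "10%", "20%")
)

data.frame(sales_amount, discount_rate = as.character(discount_rate))
```

Testing the exact boundary values is a quick way to confirm that "≤ 500" and "> 1000" are handled the same way as in the written rules.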

2.3 Summary Cumulative

Summary and Visualization using dplyr, ggplot2, and plotly

Summary statistics are computed using dplyr functions such as group_by() and summarise() to calculate total and average sales per salesperson. Visualization is created using ggplot2 with geom_line() and geom_point(), then converted into an interactive plot using plotly via ggplotly().


library(dplyr)                               # provides %>%, group_by(), summarise()

summary_sales <- sales_data %>%              # build a summary per salesperson
  group_by(sales_id) %>%                     # group by sales_id
  summarise(
    total_sales = sum(sales_amount),         # total sales
    avg_sales = round(mean(sales_amount), 2) # average sales
  )

Interpretation:

The summary highlights differences in sales performance across individuals, where some salespersons achieve higher total sales and averages than others. This indicates variability in productivity and effectiveness among salespersons.

Interpretation:

The cumulative sales plot shows a consistent upward trend for all salespersons, as values accumulate over time. Differences in line steepness indicate varying sales performance, where steeper lines represent faster growth in total sales.

3 Performance Categorization

Performance Categorization and Distribution Analysis

This task focuses on categorizing sales performance into multiple levels and analyzing its distribution. The process uses loop-based functions, dplyr for aggregation, and plotly for visualization. The goal is to transform raw sales data into meaningful performance insights.

3.1 Loop Through Vector

Performance Categorization using categorize_performance()

This section defines a custom function categorize_performance() that classifies sales values into five categories: Excellent, Very Good, Good, Average, and Poor. The function uses a for loop to iterate through each value in the input vector and applies if-else conditions to determine the appropriate category.

The performance classification rules are defined as follows:

Excellent: Sales amount > 1200
Very Good: Sales amount > 900 and ≤ 1200
Good: Sales amount > 600 and ≤ 900
Average: Sales amount > 300 and ≤ 600
Poor: Sales amount ≤ 300

# function to categorize performance based on sales
categorize_performance <- function(sales_vector) {
  categories <- c()  # stores the results

  # loop over each sales value
  for (i in sales_vector) {

    # named "category" to avoid masking base::cat()
    if (i > 1200) {
      category <- "Excellent"
    } else if (i > 900) {
      category <- "Very Good"
    } else if (i > 600) {
      category <- "Good"
    } else if (i > 300) {
      category <- "Average"
    } else {
      category <- "Poor"
    }
    categories <- c(categories, category)
  }
  return(categories)
}

Interpretation:

The function successfully classifies each sales value into predefined performance categories using a loop-based approach. This ensures that all observations are consistently evaluated based on the same thresholds, making the categorization process structured and reproducible.
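The same thresholds can be applied to a whole vector with findInterval(). This is a sketch of an alternative, not the loop required by the task; categorize_vectorized, thresholds, and labels are hypothetical names.

```r
labels     <- c("Poor", "Average", "Good", "Very Good", "Excellent")
thresholds <- c(300, 600, 900, 1200)

categorize_vectorized <- function(sales_vector) {
  # left.open = TRUE makes each interval (lower, upper], matching the
  # "> lower and <= upper" rules; the result is 0..4, so add 1 to index labels
  labels[findInterval(sales_vector, thresholds, left.open = TRUE) + 1]
}

categorize_vectorized(c(300, 301, 900, 1200, 1201))
```

Keeping the thresholds in one vector means a rule change only needs to be made in a single place.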

3.2 Percentages per Category

Category Distribution using dplyr

After categorization, the data is aggregated using dplyr functions such as group_by() and summarise() to count the number of observations in each category. The mutate() function is then used to compute percentages, providing a clearer view of distribution proportions.

sales_vector <- sales_data$sales_amount              # take the sales column from Task 2
performance <- categorize_performance(sales_vector)  # categorize performance
sales_data$performance <- performance                # add the categories to the dataset
summary_perf <- sales_data %>%                       # count and percentage per category
  group_by(performance) %>%                          # group by category
  summarise(count = n()) %>%                         # count the observations
  mutate(percentage = round((count / sum(count)) * 100, 2))  # compute the percentage

Interpretation:

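The distribution shows that most sales fall into the “Excellent” and “Good” categories, which have the highest frequencies. The “Average” category appears moderately often, while “Poor” has the lowest count. Because sales_amount is drawn uniformly from 100 to 1500, the small share of “Poor” mainly reflects the category widths: the “Poor” range covers about 200 possible values, against 300 for each of the other four categories. The result therefore describes the simulation design more than genuinely strong sales performance.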
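The count and percentage columns can be cross-checked with base R. The sketch below uses table() and prop.table() on a short, invented vector of categories rather than the simulated sales data.

```r
# Hypothetical category labels, for illustration only
performance <- c("Good", "Excellent", "Poor", "Good",
                 "Average", "Excellent", "Good", "Very Good")

counts <- table(performance)                       # frequency per category

summary_perf_check <- data.frame(
  performance = names(counts),
  count       = as.vector(counts),
  percentage  = round(100 * as.vector(prop.table(counts)), 2)
)
summary_perf_check
```

The percentages should sum to 100 up to rounding, which is a simple sanity check on any distribution table.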

3.3 Bar plot and pie chart of distribution

Visualization using plotly (Bar Chart & Donut Chart)

Visualization is created using plotly to display both a horizontal bar chart and a donut chart in a single layout.

Interpretation:

The visualization clearly shows that the distribution is concentrated in the higher performance categories, especially “Excellent” and “Good”, which have the largest proportions. In contrast, “Average” and “Poor” contribute a smaller share of the data. This pattern confirms that most sales transactions achieve relatively high performance levels, while low-performance cases are limited.

4 Company Dataset Simulation

Multi-Company Dataset Simulation

This simulation aims to generate a company dataset consisting of multiple employees with attributes such as salary, department, and performance. The data is used to analyze patterns and comparisons across companies.

4.1 Nested Loops

Nested loops per company & employee

In this section, a custom function generate_company_data() is implemented using nested loops to simulate hierarchical data across companies and employees. Random values are generated using sample() and rnorm(), while if-else conditions are used to classify employee performance.

The following code is used to create the company dataset:

generate_company_data <- function(n_company, n_employees) {  # generate the company data
  data <- data.frame()
  dept_list <- c("HR", "Finance", "IT", "Marketing")

  for (c in 1:n_company) {             # loop over each company
    company_id <- paste0("COMP", c)

    for (e in 1:n_employees) {         # loop over each employee
      employee_id <- paste0(company_id, "_EMP", e)
      salary <- sample(4000:15000, 1)       # random salary
      department <- sample(dept_list, 1)    # random department
      performance_score <- sample(65:98, 1) # performance score
      KPI_score <- round(rnorm(1, mean = 85, sd = 8)) # KPI with normal variation
      KPI_score <- max(min(KPI_score, 100), 70)  # clamp to 70–100

      # conditional
      if (KPI_score > 90) {
        status <- "Top Performer"
      } else {
        status <- "Regular"
      }
      data <- rbind(data, data.frame(
        company_id = company_id,
        employee_id = employee_id,
        salary = salary,
        department = department,
        performance_score = performance_score,
        KPI_score = KPI_score,
        status = status
      ))
    }
  }
  return(data)
}

Interpretation:

This function successfully generates a structured dataset by combining companies and employees through nested loops. This approach allows for more complex and realistic data simulation.
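The KPI step (draw from a normal distribution, round, clamp to 70–100, then label) can be illustrated on its own with vectorized functions. This is a sketch under the same parameters as above; the seed and the sample size of 48 are arbitrary choices.

```r
set.seed(123)                                   # arbitrary seed for reproducibility
n <- 48                                         # e.g. 4 companies x 12 employees

KPI_score <- round(rnorm(n, mean = 85, sd = 8)) # normal variation around 85
KPI_score <- pmin(pmax(KPI_score, 70), 100)     # element-wise clamp to 70-100
status    <- ifelse(KPI_score > 90, "Top Performer", "Regular")

table(status)
range(KPI_score)
```

Note that clamping moves every draw below 70 or above 100 onto the boundary, so the two limits can appear more often than a pure normal distribution would suggest.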

4.2 Conditional Logic

Conditional logic: top performers KPI > 90

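This section relies on the if-else logic inside generate_company_data(), which labels employees with a KPI score above 90 as "Top Performer" and all others as "Regular". The dataset is generated and its rows are then shuffled using sample() to simulate a more natural data structure.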

company_data <- generate_company_data(4, 12)                # generate the data
company_data <- company_data[sample(nrow(company_data)), ]  # shuffle the rows

Interpretation:

The conditional logic successfully classifies employees into “Top Performer” and “Regular” based on KPI scores. This helps distinguish high-performing employees and simplifies further analysis across companies.

4.3 Summary per company

Summary per company: avg salary, avg performance, max KPI

In this section, we use dplyr functions such as group_by() and summarise() to compute averages and maximum values per company.

summary_company <- company_data %>%        # build a summary per company
  group_by(company_id) %>%                 # group the data by company
  summarise(
    avg_salary = round(mean(salary), 2),   # average salary
    avg_performance = round(mean(performance_score), 2),  # average performance
    max_KPI = max(KPI_score)               # highest KPI score
  )

Interpretation:

The summary shows that each company has different average salary and performance levels. However, higher average salary does not always align with higher maximum KPI. This indicates that overall compensation does not directly guarantee top individual performance, and performance variation still exists within each company.

4.4 Visualizations

Company Performance Overview

In this section, we use plotly to create interactive visualizations. The bar chart compares average salary, while the line chart shows maximum KPI per company.

Interpretation:

The bar chart shows that Company 1 has the highest average salary among all companies. However, the line chart reveals that the highest maximum KPI is achieved by Company 2 and Company 4 (reaching 100), while Company 1 is slightly below. This indicates that the company with the highest salary does not necessarily have the best KPI performance.

5 Pi & Probability

Monte Carlo Simulation: Pi & Probability

This simulation is used to estimate the value of π (pi) and compute probability using a random sampling approach. This method is known as Monte Carlo simulation, commonly used in data analysis to model probabilistic phenomena.

5.1 Loop

Loop for iterations

In this section, we use a for loop and the runif() function to generate random points in the unit square (both coordinates between 0 and 1). If-else conditions determine whether each point lies inside the quarter circle of radius 1 and whether it falls within the sub-square.

# function for the Monte Carlo simulation
monte_carlo_sim <- function(n_points) {
  data <- data.frame()
  for (i in 1:n_points) {           # loop to generate the points
    x <- runif(1)                   # random coordinate in (0, 1)
    y <- runif(1)

    if (x^2 + y^2 <= 1) {           # check whether the point is inside the circle
      posisi <- "Inside"
    } else {
      posisi <- "Outside"
    }

    if (x <= 0.5 && y <= 0.5) {     # check whether the point is inside the sub-square
      square <- "Yes"
    } else {
      square <- "No"
    }

    data <- rbind(data, data.frame( # append to the data frame
      x = x,
      y = y,
      posisi = posisi,
      in_square = square
    ))
  }
  return(data)
}

Interpretation:

This function generates random points and classifies them based on specific conditions. This approach enables a probability-based simulation that mimics random distribution in a two-dimensional space.

5.2 Count

Count points in circle & compute pi

In this section, the number of points inside the circle is calculated using sum(). The proportion of points inside the circle relative to the total number of points is then used to estimate π using the Monte Carlo formula.

n_points <- 3000                                  # number of trial points in the Monte Carlo simulation
mc_data <- monte_carlo_sim(n_points)              # generate the dataset of point positions
inside_count <- sum(mc_data$posisi == "Inside")   # count the points labeled "Inside"
pi_estimate <- 4 * (inside_count / n_points)      # estimate π from the area ratio of the quarter circle

Interpretation:

The estimated value of π is 3.176, which is reasonably close to the actual value of π (≈ 3.1416). The difference occurs because the Monte Carlo method relies on random sampling, making the result an approximation rather than an exact value. Increasing the number of points generally improves the accuracy of the estimation.
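To make the effect of sample size concrete, the sketch below repeats the estimate for several values of n and reports a binomial standard error, 4·sqrt(p(1 − p)/n), alongside each one. It is a vectorized variant, not the loop used above; estimate_pi and the seed are assumptions made for illustration. For n = 3000 this standard error is about 0.03, so an estimate of 3.176 lies roughly one standard error from π.

```r
set.seed(2024)   # arbitrary seed for reproducibility

estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  p_hat <- mean(x^2 + y^2 <= 1)                 # share of points inside the quarter circle
  c(n         = n_points,
    estimate  = 4 * p_hat,                      # pi is about 4 times that share
    std_error = 4 * sqrt(p_hat * (1 - p_hat) / n_points))
}

# One row per sample size
convergence <- t(sapply(c(100, 1000, 10000, 100000), estimate_pi))
convergence
```

The standard error shrinks in proportion to 1/sqrt(n), so each additional digit of accuracy costs roughly a hundred times more points.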

5.3 Probability of Random Points

Compute probability of random points falling in a sub-square

In this part, probability is calculated using logical conditions and basic R functions such as sum() and comparison operators. The in_square column identifies whether a point falls inside the sub-square (x ≤ 0.5 and y ≤ 0.5). The number of such points is counted and divided by the total number of points to obtain the probability.

# count the points that fall inside the sub-square
square_count <- sum(mc_data$in_square == "Yes")   # number of "Yes" values

# compute the probability that a point falls inside the sub-square
prob_square <- square_count / n_points   # count in the sub-square divided by total points

Interpretation:

The probability result is close to 25%, which matches the theoretical expectation because the sub-square covers one-fourth of the unit square. This is consistent with the points being uniformly distributed and indicates that the simulation behaves as expected.
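How close "close to 25%" should be can be judged against the sampling error of a proportion. The sketch below is a standalone check with its own random draws and an arbitrary seed; it does not reuse mc_data.

```r
set.seed(7)                                   # arbitrary seed
n_points <- 3000
x <- runif(n_points)
y <- runif(n_points)

prob_square <- mean(x <= 0.5 & y <= 0.5)      # observed share in the sub-square
std_error   <- sqrt(0.25 * 0.75 / n_points)   # theoretical SE when p = 0.25
z_score     <- (prob_square - 0.25) / std_error

c(prob_square = prob_square, std_error = std_error, z_score = z_score)
```

With 3000 points the standard error is just under 0.008, so observed values between roughly 23.4% and 26.6% are within two standard errors of the theoretical 25%.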

5.4 Visualization

Plot points inside vs outside circle

The visualization is created using ggplot2, where geom_point() displays random points and stat_function() draws the circle boundary. The plot is then converted into an interactive visualization using ggplotly() from the plotly package, allowing better data exploration.

Interpretation:

The visualization shows a clear separation between points inside and outside the circle. The density of points inside the boundary reflects the proportion used in estimating π. This visual pattern helps validate the Monte Carlo approach, where area comparison is approximated through random sampling.

6 Advanced Data

Advanced Data Transformation & Feature Engineering

This section focuses on data transformation and feature engineering to prepare the dataset for further analysis. Techniques such as normalization and z-score standardization are applied to adjust the scale of numeric variables, while additional features are created to enhance interpretability. These processes are essential in data science to ensure comparability across variables and to extract more meaningful insights from the data.

6.1 Loop

Loop-based Normalization & Z-score

In this section, two main functions are created: normalize_df() and zscore_df(). Both functions utilize a for loop to iterate through selected numeric columns. The min() and max() functions are used for normalization to scale values into a 0–1 range, while mean() and sd() are used in z-score calculation to measure how far values deviate from the average.

normalize_df <- function(df, cols) {  # min-max normalization (0–1 scale)

  for (col in cols) {                 # loop over each numeric column
    min_val <- min(df[[col]])         # minimum value
    max_val <- max(df[[col]])         # maximum value

    # compute the normalized values and store them in a new column with suffix "_norm"
    df[[paste0(col, "_norm")]] <- (df[[col]] - min_val) / (max_val - min_val)
  }

  return(df)   # return the normalized data
}

zscore_df <- function(df, cols) {           # function to compute z-scores

  for (col in cols) {                       # loop over each numeric column
    mean_val <- mean(df[[col]])             # mean
    sd_val <- sd(df[[col]])                 # standard deviation
    df[[paste0(col, "_z")]] <- (df[[col]] - mean_val) / sd_val  # z-score stored in a new column with suffix "_z"
  }

  return(df)   # return the transformed data
}

Interpretation:

The loop-based approach ensures that normalization and standardization are applied consistently across all selected variables. As a result, each numeric column is transformed into a comparable scale, making further analysis more reliable.
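Both formulas can be checked on a small vector. The sketch below compares the manual z-score with base R's scale() and also shows an edge case the functions above do not guard against: a constant column, where max − min is zero. The salary values are invented.

```r
x <- c(4000, 5500, 7000, 9000, 15000)            # hypothetical salaries

x_norm <- (x - min(x)) / (max(x) - min(x))       # min-max normalization
x_z    <- (x - mean(x)) / sd(x)                  # manual z-score

z_builtin <- as.numeric(scale(x))                # scale() centers and divides by sd
all.equal(x_z, z_builtin)

range(x_norm)                                    # smallest value maps to 0, largest to 1

# Edge case: a constant column gives 0 / 0, which R reports as NaN
constant      <- c(5, 5, 5)
constant_norm <- (constant - min(constant)) / (max(constant) - min(constant))
constant_norm
```

If constant columns are possible in the input, checking max_val == min_val before dividing would avoid filling the new column with NaN.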

6.2 Feature Engineering

Create new features

The dataset from Task 4 (company_data) is used as the input. The functions normalize_df() and zscore_df() are applied to transform numeric variables such as salary, performance_score, and KPI_score. Additionally, the ifelse() function is used to create new features such as performance_category and salary_bracket, which help categorize the data for better analysis.

# use the data from the previous task
df <- company_data                                         # copy the dataset so the original is not modified
num_cols <- c("salary", "performance_score", "KPI_score")  # numeric columns to transform
df <- normalize_df(df, num_cols)                           # min-max normalization (0–1) with the function defined above
df <- zscore_df(df, num_cols)                              # z-scores to show deviation from the mean

# round the transformed values for readability
for (col in c("salary_norm","performance_score_norm","KPI_score_norm",
              "salary_z","performance_score_z","KPI_score_z")) {

  df[[col]] <- round(df[[col]], 3)   # round to 3 decimal places
}

# FEATURE ENGINEERING
# create a performance category based on performance_score
df$performance_category <- ifelse(df$performance_score > 85, "High",
                           ifelse(df$performance_score > 70, "Medium", "Low"))

# create a salary bracket based on the salary range
df$salary_bracket <- ifelse(df$salary > 10000, "High",
                     ifelse(df$salary > 7000, "Medium", "Low"))

Interpretation:

The newly created features transform continuous variables into categorical groups, making patterns easier to interpret. For example, employees can now be quickly compared based on performance level and salary range without relying on raw numeric values.

6.3 Compare Before After

Distribution Comparison

A comparison is performed using the summary() function to observe differences in descriptive statistics before and after transformation. This helps in understanding how normalization changes the scale without altering the relative distribution.

library(DT)                        # provides datatable() and formatStyle()

compare_table <- data.frame(       # build the before/after comparison table
  Statistic = c("Min", "Q1", "Median", "Mean", "Q3", "Max"),  # descriptive statistics
  Before = as.numeric(summary(company_data$salary)),          # salary statistics before
  After = round(as.numeric(summary(df$salary_norm)), 4)       # statistics after normalization
)

datatable(                           # display the table in an interactive format
  compare_table,
  rownames = FALSE,                  # hide the row index
  options = list(pageLength = 6),    # number of rows displayed
  class = "stripe hover"             # table styling
) %>%
  formatStyle(
    names(compare_table)             # apply the style to all columns
  )

Interpretation:

The comparison shows that normalization changes the scale of the data into a 0–1 range while preserving the overall distribution pattern. This means that relative differences between values remain unchanged, allowing fair comparison across variables.
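The claim that only the scale changes can be verified numerically: min-max normalization is a linear map, so each summary statistic of the normalized data should equal the normalized summary statistic of the original data, and the ranking of observations should be unchanged. The sketch below uses an invented salary vector.

```r
salary <- c(4200, 5100, 6800, 7400, 9900, 12500, 14800)   # hypothetical values
lo <- min(salary)
hi <- max(salary)
salary_norm <- (salary - lo) / (hi - lo)

before <- as.numeric(summary(salary))        # Min, Q1, Median, Mean, Q3, Max
after  <- as.numeric(summary(salary_norm))

# Apply the same linear map to the "before" statistics
mapped <- (before - lo) / (hi - lo)

all.equal(mapped, after)                     # statistics transform linearly
identical(rank(salary), rank(salary_norm))   # ordering is preserved
```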

6.4 Visualizations

Histograms & Boxplots

Visualization is created using ggplot2, where geom_histogram() is used to observe data distribution and geom_boxplot() to examine spread and outliers. The plots are then converted into interactive visuals using ggplotly() from the plotly package.

Interpretation:

The histogram shows that the overall shape of the distribution remains consistent before and after transformation, confirming that normalization does not distort the data pattern. The boxplot further supports this by showing that the relative spread and median position remain proportional.

To enable a direct visual comparison, the normalized data is rescaled back to the original range. This helps demonstrate that normalization changes only the scale, not the underlying structure of the data.
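The rescaling step is the inverse of the normalization formula. The plotting code is not shown here, so the following is only a sketch of the arithmetic, assuming the stored minimum and maximum of the original column are reused; the values are invented.

```r
salary <- c(4200, 5100, 6800, 7400, 9900, 12500, 14800)   # hypothetical values
lo <- min(salary)
hi <- max(salary)

salary_norm <- (salary - lo) / (hi - lo)      # forward: original -> 0-1 scale
salary_back <- salary_norm * (hi - lo) + lo   # inverse: 0-1 scale -> original range

all.equal(salary_back, salary)
```

The round trip recovers the original values up to floating-point error, provided the normalized column has not been rounded in between.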

7 Mini Project

Company KPI Dashboard & Simulation

This task simulates a multi-company dataset to analyze employee performance, salary, and KPI. The data is processed, summarized, and visualized to identify patterns and relationships across companies.

7.1 Generate Dataset

Generate multi-company dataset

The dataset is generated using nested loops for each company and employee. The sample() function generates random values, while rnorm() introduces variation in KPI relative to performance.

generate_company_data <- function() {   # function to generate the multi-company dataset
  emp_counts <- c(70, 50, 95, 65, 135, 110, 150)   # number of employees per company
  departments <- c("HR", "Finance", "IT", "Marketing", "Operations")  # list of departments
  data <- data.frame()   # empty data frame to fill

  for (c in 1:length(emp_counts)) {    # loop over each company
    company_id <- paste0("1626C", c)   # build the company ID
    n_emp <- emp_counts[c]             # number of employees in this company

    for (e in 1:n_emp) {               # loop over each employee in the company
      employee_id <- paste0(company_id, "_", sprintf("%03d", e))  # build the employee ID
      salary <- sample(3000:15000, 1)             # random salary
      performance <- sample(60:100, 1)            # random performance score
      KPI <- round(performance + rnorm(1, 0, 5))  # KPI based on performance plus normal noise
      KPI <- max(min(KPI, 100), 50)               # clamp KPI to 50–100
      dept <- sample(departments, 1)              # pick a department at random

      # append the row to the dataset
      data <- rbind(data, data.frame(
        employee_id = employee_id,
        company_id = company_id,
        salary = salary,
        performance_score = performance,
        KPI_score = KPI,
        department = dept
      ))
    }
  }

  data <- data[sample(nrow(data)), ]   # shuffle the row order
  return(data)                         # return the dataset
}

Interpretation:

The generated dataset contains multiple companies with varying employee counts, salaries, performance, and KPI values. This structure simulates real-world organizational data and supports further analysis across multiple dimensions.
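With 675 employees in total, calling rbind() on every iteration copies the growing data frame each time. A common alternative keeps the nested loops but stores each row in a preallocated list and binds once at the end. The sketch below follows the same generation rules with a much smaller, hypothetical emp_counts so the output is easy to inspect.

```r
set.seed(1)                                   # arbitrary seed
emp_counts  <- c(3, 2, 4)                     # hypothetical small company sizes
departments <- c("HR", "Finance", "IT", "Marketing", "Operations")

rows <- vector("list", sum(emp_counts))       # preallocate one slot per employee
k <- 0

for (comp in seq_along(emp_counts)) {         # loop over each company
  company_id <- paste0("1626C", comp)

  for (e in seq_len(emp_counts[comp])) {      # loop over each employee
    k <- k + 1
    performance <- sample(60:100, 1)
    KPI <- max(min(round(performance + rnorm(1, 0, 5)), 100), 50)  # clamp to 50-100

    rows[[k]] <- data.frame(
      employee_id       = paste0(company_id, "_", sprintf("%03d", e)),
      company_id        = company_id,
      salary            = sample(3000:15000, 1),
      performance_score = performance,
      KPI_score         = KPI,
      department        = sample(departments, 1)
    )
  }
}

company_small <- do.call(rbind, rows)         # bind all rows in one step
company_small
```

seq_along() and seq_len() are used instead of 1:length(x) and 1:n because they produce an empty sequence when the length is zero, whereas 1:0 would iterate over 1 and 0.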

7.2 Summarize per company

Summary per company

The summary is computed using group_by() and summarise() to calculate average salary, average KPI, and top performers.

summary_company <- company_data %>%        # build a summary per company
  group_by(company_id) %>%                 # group the data by company
  summarise(
    avg_salary = round(mean(salary), 0),   # average salary
    avg_KPI = round(mean(KPI_score), 1),   # average KPI
    top_performers = sum(KPI_score > 90)   # number of employees with KPI > 90
  )

Interpretation:

The summary highlights differences in average salary, KPI, and number of top performers across companies. This indicates that each company has a distinct performance profile and workforce composition.

7.3 Loop-based

KPI Categorization using Loop

Categorization is performed using loops and if-else to group KPI into High, Medium, and Low categories.

High: KPI score > 90
Medium: KPI score > 75 and ≤ 90
Low: KPI score ≤ 75

categorize_kpi <- function(df) {   # categorize KPI into tiers
  tiers <- c()                     # empty vector to store the categories

  for (i in df$KPI_score) {        # loop over each KPI value
    if (i > 90) {
      tiers <- c(tiers, "High")    # high KPI
    } else if (i > 75) {
      tiers <- c(tiers, "Medium")  # medium KPI
    } else {
      tiers <- c(tiers, "Low")     # low KPI
    }
  }

  df$KPI_tier <- tiers   # add the tier column to the data
  return(df)             # return the data with the added tier
}

company_data <- categorize_kpi(company_data)  # apply the function to the dataset

Interpretation:

The loop-based categorization groups employees into KPI tiers, making it easier to analyze performance distribution. This approach simplifies comparison across companies and improves interpretability.

7.4 Output Tables

Top Performers & Department Analysis

Analysis is performed using filter() for top performers and group_by() for department-level insights.

library(htmltools)             # provides tags and tagList()

top_perf <- company_data %>%   # select employees with a high KPI (top performers)
  filter(KPI_score > 90)       # keep employees with KPI above 90

# build a summary per company and department
dept_analysis <- company_data %>% 
  group_by(company_id, department) %>%    # group the data by company and department
  summarise(
    avg_salary = round(mean(salary), 0),  # average salary
    count = n(),                          # number of employees
    .groups = "drop"                      # drop the grouping structure
  )
output_list <- list()                     # list to hold the output

for (comp in unique(dept_analysis$company_id)) {           # loop over each company
  title <- tags$h4(paste("Department Analysis -", comp))   # build the heading per company
  table <- datatable(
    dept_analysis %>% filter(company_id == comp),          # filter the data for this company
    rownames = FALSE,
    class = "stripe hover"                                 # table styling
  )
  output_list[[comp]] <- tagList(title, table, tags$br())  # store the result in the list
}

tagList(output_list)   # display the output

[Rendered output: one interactive table per company, titled "Department Analysis - <company_id>", for companies 1626C1 through 1626C7.]

Interpretation:

The department tables show how average salary and employee count differ across departments within each company, while the filtered top_perf data lists the employees with a KPI score above 90. Together they help identify where headcount and salary are concentrated.

7.5 Salary Distribution

Salary Distribution

Salary distribution is visualized using geom_histogram() to observe how salary values are spread across companies. This helps in understanding compensation variability between companies.
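
The plotting code for this step is not shown in the report, so the following is only a minimal sketch of how such a histogram could be drawn. It assumes ggplot2 is installed and uses a small made-up data frame (demo_data, with placeholder company IDs "A", "B", "C") in place of the real company_data. Faceting by company is one possible layout, not necessarily the one used for the original figure.

```r
library(ggplot2)

# Made-up data for illustration only; the real company_data is not reproduced here.
set.seed(1)
demo_data <- data.frame(
  company_id = rep(c("A", "B", "C"), each = 50),
  salary     = round(c(rnorm(50, 9000, 800),
                       rnorm(50, 8500, 1500),
                       rnorm(50, 9500, 500)))
)

# One histogram panel per company; facet_wrap() keeps the axes shared by default,
# so the spread of each company can be compared directly.
p_salary <- ggplot(demo_data, aes(x = salary)) +
  geom_histogram(bins = 15, fill = "#3b82f6", alpha = 0.7) +
  facet_wrap(~ company_id) +
  labs(title = "Salary Distribution by Company", x = "Salary", y = "Employees") +
  theme_minimal()

p_salary
```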

Interpretation:

The salary distribution shows variation across companies, indicating differences in compensation structures and workforce composition. Some companies display wider ranges, suggesting more diverse salary levels.

7.6 Advanced visualization

Grouped bar charts, scatter plots with regression lines

Advanced visualization is performed using geom_col() for grouped bar charts and geom_smooth() to add regression lines in scatter plots. This helps compare departments and analyze the relationship between performance and KPI.
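
Since the plotting code is not included here, the two chart types can be sketched as follows. This assumes dplyr and ggplot2 are installed and uses made-up data (demo_data with placeholder companies "A" and "B"). The column names follow those used elsewhere in this report, but the values are synthetic, and the positive link between performance_score and KPI_score is built into the fake data on purpose.

```r
library(dplyr)
library(ggplot2)

# Synthetic data for illustration only; not the report's company_data.
set.seed(2)
demo_data <- data.frame(
  company_id        = rep(c("A", "B"), each = 40),
  department        = rep(c("HR", "IT", "Sales", "Ops"), times = 20),
  salary            = round(runif(80, 6000, 12000)),
  performance_score = round(runif(80, 50, 100))
)
# KPI is constructed from performance plus noise, so the trend is by design.
demo_data$KPI_score <- round(0.6 * demo_data$performance_score + rnorm(80, 30, 5), 1)

# Average salary per company and department
dept_salary <- demo_data %>%
  group_by(company_id, department) %>%
  summarise(avg_salary = mean(salary), .groups = "drop")

# Grouped bar chart: position = "dodge" places the companies side by side
p_bar <- ggplot(dept_salary, aes(x = department, y = avg_salary, fill = company_id)) +
  geom_col(position = "dodge") +
  labs(title = "Average Salary by Department", x = "Department", y = "Average salary") +
  theme_minimal()

# Scatter plot with a fitted linear regression line
p_scatter <- ggplot(demo_data, aes(x = performance_score, y = KPI_score)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Performance vs KPI", x = "Performance score", y = "KPI score") +
  theme_minimal()

p_bar
p_scatter
```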

Interpretation:

The grouped bar chart compares average salary across departments and companies, while the scatter plot shows a positive relationship between performance and KPI. Employees with higher performance scores tend to have higher KPI scores, although the plot shows an association and not a causal effect.

8 Automated Report

Function-based Report Automation

This task focuses on building an automated reporting system using functions and loops. The goal is to generate a complete summary for each company, including tables and visualizations, in a structured and reusable format.

A function generate_company_report() is created to automatically generate a report for one company. Inside the function, filter() is used to subset the data, summarise() computes key metrics, and ggplot2 is used to create the visualizations. The plots are converted with ggplotly() and combined into one view with subplot() from the plotly package.

The automation is implemented by iterating over all companies with lapply(). Each iteration generates a complete report, and the resulting list is displayed using tagList() from the htmltools package.

generate_company_report <- function(df, comp_id) {
  data_comp <- df %>% filter(company_id == comp_id)  # data for this company only
  
  # main summary
  summary <- data_comp %>% 
    summarise(
      avg_salary = round(mean(salary), 0),   # average salary
      avg_KPI = round(mean(KPI_score), 1),   # average KPI score
      total_employee = n()                   # number of employees
    )
  
  # top performers
  top_perf <- data_comp %>% filter(KPI_score > 90)
  
  # salary histogram
  p1 <- ggplot(data_comp, aes(salary)) +
    geom_histogram(fill = "#3b82f6", bins = 15, alpha = 0.7) +
    labs(title = "Salary Distribution") +
    theme_minimal() +
    theme(
      plot.background = element_rect(fill = "#eaf2ff", color = NA),
      panel.background = element_rect(fill = "#eaf2ff", color = NA)
    )
  
  # scatter plot of performance vs KPI
  p2 <- ggplot(data_comp, aes(performance_score, KPI_score)) +
    geom_point(color = "#1e4b7a", alpha = 0.6) +
    geom_smooth(method = "lm", se = FALSE, color = "#f97316") +
    labs(title = "Performance vs KPI") +
    theme_minimal() +
    theme(
      plot.background = element_rect(fill = "#eaf2ff", color = NA),
      panel.background = element_rect(fill = "#eaf2ff", color = NA)
    )
  
  # short automatic insight line
  insight <- paste(
    "Avg salary:", summary$avg_salary,
    "| Avg KPI:", summary$avg_KPI,
    "| Top performers:", nrow(top_perf)
  )
  
  combined_plot <- subplot(    # combine both plots side by side
    ggplotly(p1),
    ggplotly(p2),
    nrows = 1
  ) %>%
    layout(
      paper_bgcolor = "#eaf2ff",
      plot_bgcolor = "#eaf2ff"
    )
  
  # output for this company
  tagList(
    tags$h3(paste("Company Report -", comp_id), style = "color:#1e4b7a;"),
    
    tags$h4("Summary"),
    datatable(summary, rownames = FALSE),
    
    tags$h4("Top Performers"),
    datatable(top_perf, rownames = FALSE, options = list(pageLength = 5)),
    
    tags$h4("Visualizations"),
    combined_plot,
    
    tags$p(insight, style = "font-size:20px; color:#006400; margin-top:10px"),
    
    tags$hr()
  )
}

The reports for all companies are then generated automatically. lapply() iterates over each company_id, calls the report function, and collects the results in a list that tagList() renders in order.

# loop over all companies
company_list <- sort(unique(company_data$company_id))  # sorted company IDs

report_list <- lapply(company_list, function(comp) {
  generate_company_report(company_data, comp)  # generate the report for one company
})

tagList(report_list)  # display all reports

[Rendered output: one report per company, each with a Summary table, a Top Performers table, and a Visualizations panel. The automatic insight line of each report is listed below.]

Company Report - 1626C1: Avg salary: 9430 | Avg KPI: 80.2 | Top performers: 17
Company Report - 1626C2: Avg salary: 8562 | Avg KPI: 80.2 | Top performers: 10
Company Report - 1626C3: Avg salary: 8988 | Avg KPI: 79.7 | Top performers: 22
Company Report - 1626C4: Avg salary: 9002 | Avg KPI: 82 | Top performers: 17
Company Report - 1626C5: Avg salary: 8871 | Avg KPI: 79.2 | Top performers: 25
Company Report - 1626C6: Avg salary: 9464 | Avg KPI: 81.6 | Top performers: 33
Company Report - 1626C7: Avg salary: 9466 | Avg KPI: 80.1 | Top performers: 42

Interpretation:

The function successfully generates an automated report for each company using a combination of functions, loops, and data visualization. Each report includes summary statistics, top performers, and interactive plots, making the analysis more structured and efficient.

Export Company Report Data to CSV

In this stage, each company's data is exported into CSV files as part of the automated reporting process. This is implemented using a for loop to iterate over each company_id, combined with the filter() function from dplyr to extract relevant data, and write.csv() to save the results as external files. This approach ensures that the analysis is not only displayed but also stored for further external use.

The following code shows the implementation:

# export each company's data to a CSV file
for (comp in company_list) {
  data_comp <- company_data %>% 
    filter(company_id == comp)
  write.csv(
    data_comp,
    paste0("report_", comp, ".csv"),
    row.names = FALSE
  )
}

Interpretation:

The export process successfully generates separate CSV files for each company. This demonstrates how automation can be extended beyond visualization into data storage, allowing the results to be reused for further analysis or reporting outside the R environment.
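
One way to confirm that an export like this wrote what was intended is to read the files back and compare row counts. The sketch below uses base R only, a made-up data frame (demo_data with placeholder IDs "A" and "B"), and a temporary folder (out_dir, a hypothetical name) so that it does not touch the working directory. It illustrates the check and is not part of the original pipeline.

```r
# Made-up data for illustration; company IDs "A" and "B" are placeholders.
demo_data <- data.frame(
  company_id = c("A", "A", "B"),
  salary     = c(9000, 9500, 8700)
)

# Hypothetical output folder inside the session's temporary directory
out_dir <- tempfile("reports_")
dir.create(out_dir)

# Same pattern as the export loop: one CSV file per company
for (comp in unique(demo_data$company_id)) {
  data_comp <- demo_data[demo_data$company_id == comp, ]
  write.csv(
    data_comp,
    file.path(out_dir, paste0("report_", comp, ".csv")),
    row.names = FALSE
  )
}

# Read every exported file back and total the rows
written <- list.files(out_dir, pattern = "^report_.*\\.csv$", full.names = TRUE)
rows_written <- sum(vapply(written, function(f) nrow(read.csv(f)), integer(1)))

basename(written)                   # names of the files that were created
rows_written == nrow(demo_data)     # TRUE when no rows were lost or duplicated
```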
