Name : Hirose Kawarin Sirait
ID Number : 52250012
Study Program : Data Science
Lecturer : Mr. Bakti Siregar, M.Sc., CDS.
Course : Data Science Programming
Format : RPubs (R)

0.1 Introduction

In this practicum, we explore various concepts in data science including function development, data simulation, statistical analysis, and visualization.

The objective of this project is to build structured and reusable functions using loops and conditional logic, simulate real-world datasets, and analyze the results through meaningful visualizations.

Each task is designed to reflect practical data science scenarios, such as sales analysis, performance evaluation, and company-level data processing. By completing this practicum, we aim to enhance our understanding of how data can be generated, transformed, and interpreted effectively.

Furthermore, this project emphasizes clean coding practices, clear data presentation, and insightful interpretation to support data-driven decision making.

1 Objectives

This project aims to:

  1. Build multi-layer functions using loops and conditional logic

  2. Perform data simulation and transformation

  3. Create data visualizations

  4. Develop automated workflows in R

2 TASK 1 – Dynamic Multi-Formula Function

This task creates a function to compute multiple mathematical formulas (linear, quadratic, cubic, exponential) and visualize them in one graph.

2.1 FUNCTION

compute_formula <- function(x, formula) {
  
  if (formula == "linear") {
    return(2*x + 5)
    
  } else if (formula == "quadratic") {
    return(x^2 + 3*x + 2)
    
  } else if (formula == "cubic") {
    return(x^3 - 2*x^2 + x)
    
  } else if (formula == "exponential") {
    return(exp(x))
    
  } else {
    stop("Invalid formula input")  # input validation
  }
}

2.2 DATA & LOOP

x <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

2.3 NESTED LOOP + COUNT

results <- data.frame()

for (f in formulas) {
  for (i in x) {
    y <- compute_formula(i, f)
    results <- rbind(results, data.frame(x=i, y=y, formula=f))
  }
}

# Inspect the first rows of the result
head(results)
##   x  y formula
## 1 1  7  linear
## 2 2  9  linear
## 3 3 11  linear
## 4 4 13  linear
## 5 5 15  linear
## 6 6 17  linear
Sample of Computed Results

 x    y   formula
 1    7   linear
 2    9   linear
 3   11   linear
 4   13   linear
 5   15   linear
 6   17   linear
 7   19   linear
 8   21   linear
 9   23   linear
10   25   linear
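
Growing a data frame with rbind() inside a nested loop copies the whole frame on every iteration. The following is a minimal sketch of an alternative that builds one piece per formula and binds them once. It assumes the same four formulas as in section 2.1, restated here in an equivalent switch() form so the block runs on its own; compute_all is a hypothetical helper name, not part of the original report.

```r
# Equivalent switch()-based restatement of compute_formula() from section 2.1
compute_formula <- function(x, formula) {
  switch(formula,
    linear      = 2 * x + 5,
    quadratic   = x^2 + 3 * x + 2,
    cubic       = x^3 - 2 * x^2 + x,
    exponential = exp(x),
    stop("Invalid formula input")  # default branch: input validation
  )
}

x <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

# Hypothetical helper: one data frame per formula, bound together once at the end
compute_all <- function(x, formulas) {
  pieces <- lapply(formulas, function(f) {
    # The arithmetic is vectorized, so all x values are computed in one call
    data.frame(x = x, y = compute_formula(x, f), formula = f)
  })
  do.call(rbind, pieces)
}

results <- compute_all(x, formulas)
head(results)
```

The result has the same columns as the loop-based version (x, y, formula), so the plotting code in section 2.4 should work on it unchanged.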

2.4 VISUALIZATION

library(ggplot2)
library(plotly)

p <- ggplot(results, aes(x = x, y = y, color = formula)) +
  geom_line(linewidth=1.2) +
  geom_point(size=2, alpha=0.8) +
  
  scale_y_log10() + 
  
  labs(
    title = "Interactive Comparison of Mathematical Functions",
    subtitle = "With Soft Purple Theme",
    x = "X Value",
    y = "Y Value",
    color = "Formula"
  ) +
  
  theme_minimal() +
  
  theme(
    plot.background = element_rect(fill = "#F3E8FF", color = NA),   # soft purple
    panel.background = element_rect(fill = "#F9F5FF", color = NA),
    
    plot.title = element_text(size=16, face="bold"),
    plot.subtitle = element_text(size=12),
    
    legend.position = "top"
  )

# make the plot interactive
ggplotly(p)

2.5 Interpretation

The interactive visualization lets users explore how each function behaves over x = 1 to 20.

  • The linear function grows at a constant rate (slope 2).

  • The quadratic and cubic functions grow polynomially, with the cubic pulling ahead as x increases.

  • The exponential function grows fastest: at x = 20 it is several orders of magnitude larger than the other three.

On a linear y-axis this imbalance would flatten the linear and polynomial curves against the x-axis. The logarithmic y-scale keeps all four curves visible and comparable, and the zoom and hover tools make individual values easy to read. One caveat: the cubic formula evaluates to 0 at x = 1, and log10(0) is -Inf, so that point cannot be shown at its true value on a log axis.
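
To make the scale imbalance concrete, the sketch below evaluates the four formulas from section 2.1 at both ends of the range (x = 1 and x = 20). It also checks the one value that a log axis cannot represent. The data frame name values is illustrative only.

```r
# Evaluate each formula at the smallest and largest x used in the report
x <- c(1, 20)
values <- data.frame(
  x           = x,
  linear      = 2 * x + 5,
  quadratic   = x^2 + 3 * x + 2,
  cubic       = x^3 - 2 * x^2 + x,
  exponential = exp(x)
)
print(values)

# The cubic formula equals 0 at x = 1, and log10(0) is -Inf,
# so this single point has no finite position on a log10 y-axis
log10(values$cubic[1])

# How many times larger the exponential is than the cubic at x = 20
values$exponential[2] / values$cubic[2]
```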

3 TASK 2 – Nested Simulation: Multi-Sales & Discounts

3.1 MAIN FUNCTION

simulate_sales <- function(n_salesperson, days) {
  
  data <- data.frame()
  
  for (s in 1:n_salesperson) {
    for (d in 1:days) {
      
      sales_amount <- sample(100:1000, 1)
      
      # Conditional discount
      if (sales_amount > 800) {
        discount_rate <- 0.2
      } else if (sales_amount > 500) {
        discount_rate <- 0.1
      } else {
        discount_rate <- 0.05
      }
      
      data <- rbind(data, data.frame(
        salesperson_id = s,
        day = d,
        sales_amount = sales_amount,
        discount_rate = discount_rate
      ))
    }
  }
  
  return(data)
}

3.2 RUN SIMULATION

sales_data <- simulate_sales(5, 10)

head(sales_data)
##   salesperson_id day sales_amount discount_rate
## 1              1   1          118          0.05
## 2              1   2          559          0.10
## 3              1   3          841          0.20
## 4              1   4          137          0.05
## 5              1   5          238          0.05
## 6              1   6          541          0.10
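
Because sample() draws new numbers on every run, the output above changes each time the report is knitted. The following is a minimal sketch of a seeded, vectorized variant of the simulation. The function name simulate_sales_vec and its seed argument are hypothetical additions, not part of the original function, and the seeded values will not match the numbers printed above.

```r
simulate_sales_vec <- function(n_salesperson, days, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # fix the random stream when a seed is given

  # One row per salesperson-day pair; day varies fastest within each salesperson
  grid <- expand.grid(day = seq_len(days),
                      salesperson_id = seq_len(n_salesperson))

  # Draw all sales amounts at once, with replacement
  sales_amount <- sample(100:1000, nrow(grid), replace = TRUE)

  # Same discount tiers as in section 3.1
  discount_rate <- ifelse(sales_amount > 800, 0.20,
                   ifelse(sales_amount > 500, 0.10, 0.05))

  data.frame(
    salesperson_id = grid$salesperson_id,
    day            = grid$day,
    sales_amount   = sales_amount,
    discount_rate  = discount_rate
  )
}

run_a <- simulate_sales_vec(5, 10, seed = 123)
run_b <- simulate_sales_vec(5, 10, seed = 123)
identical(run_a, run_b)  # the same seed reproduces the same data
```

Calling set.seed() once at the top of the report would have a similar effect for the original loop-based function.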

3.3 CUMULATIVE SALES FUNCTION

calculate_cumulative <- function(df) {
  
  df$cumulative_sales <- ave(df$sales_amount, df$salesperson_id, FUN = cumsum)
  
  return(df)
}

sales_data <- calculate_cumulative(sales_data)

head(sales_data)
##   salesperson_id day sales_amount discount_rate cumulative_sales
## 1              1   1          118          0.05              118
## 2              1   2          559          0.10              677
## 3              1   3          841          0.20             1518
## 4              1   4          137          0.05             1655
## 5              1   5          238          0.05             1893
## 6              1   6          541          0.10             2434

3.4 TABLE

📊 Sales Simulation Data

salesperson_id   day   sales_amount   discount_rate   cumulative_sales
             1     1            118            0.05                118
             1     2            559            0.10                677
             1     3            841            0.20               1518
             1     4            137            0.05               1655
             1     5            238            0.05               1893
             1     6            541            0.10               2434
             1     7            434            0.05               2868
             1     8            961            0.20               3829
             1     9            100            0.05               3929
             1    10            706            0.10               4635

3.5 SUMMARY STATS

aggregate(sales_amount ~ salesperson_id, data = sales_data, sum)
##   salesperson_id sales_amount
## 1              1         4635
## 2              2         4963
## 3              3         4201
## 4              4         4662
## 5              5         6552

3.6 VISUALIZATION

library(ggplot2)
library(plotly)

p <- ggplot(sales_data, aes(x = day, y = cumulative_sales, color = factor(salesperson_id))) +
  geom_line(linewidth=1.2) +
  geom_point() +
  
  labs(
    title = "Cumulative Sales per Salesperson",
    x = "Day",
    y = "Cumulative Sales",
    color = "Salesperson"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#F3E8FF", color = NA),
    legend.position = "top"
  )

ggplotly(p)

3.7 INTERPRETATION

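The simulation shows how each salesperson accumulates sales over the 10 simulated days.

  • Salespersons with higher daily sales accumulate totals faster; in this run salesperson 5 finishes highest (6,552) and salesperson 3 lowest (4,201).

  • The discount rate rises with the sale amount (5%, 10%, or 20%), but it is only recorded, not applied: the cumulative figures are gross sales, so net revenue would be lower.

  • The cumulative trend helps identify top performers, although all amounts come from the same random distribution, so the differences here reflect chance rather than skill.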
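
The simulation stores a discount rate for every sale, while the cumulative totals use the gross amount. A sketch of how net revenue could be derived is shown below, using the six rows printed in section 3.2 as sample input. The column name net_revenue is an assumption, not part of the original data.

```r
# First six simulated sales for salesperson 1, copied from the output in 3.2
sales_sample <- data.frame(
  sales_amount  = c(118, 559, 841, 137, 238, 541),
  discount_rate = c(0.05, 0.10, 0.20, 0.05, 0.05, 0.10)
)

# Net revenue = gross amount reduced by the discount rate (hypothetical column)
sales_sample$net_revenue <-
  sales_sample$sales_amount * (1 - sales_sample$discount_rate)

sales_sample

gross_total <- sum(sales_sample$sales_amount)  # matches cumulative_sales on day 6
net_total   <- sum(sales_sample$net_revenue)
c(gross = gross_total, net = net_total)
```

Applying cumsum() per salesperson to net_revenue instead of sales_amount would give a cumulative net curve to plot alongside the gross one.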

4 TASK 3 – Multi-Level Performance Categorization

In this task, we classify sales performance into multiple categories based on sales amount. The purpose is to better understand how performance is distributed across different levels.

The categorization helps in identifying high-performing and low-performing sales outcomes, which is useful for decision-making and performance evaluation.

We will also calculate the percentage distribution of each category and visualize the results using bar charts and pie charts.

4.1 FUNCTION

categorize_performance <- function(sales_amount) {
  
  category <- c()
  
  for (i in sales_amount) {
    
    if (i > 800) {
      category <- c(category, "Excellent")
      
    } else if (i > 600) {
      category <- c(category, "Very Good")
      
    } else if (i > 400) {
      category <- c(category, "Good")
      
    } else if (i > 200) {
      category <- c(category, "Average")
      
    } else {
      category <- c(category, "Poor")
    }
  }
  
  return(category)
}
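
The loop above appends to the result vector one element at a time. As a sketch of a vectorized alternative, base R's cut() can assign the same five categories in one call. The function name categorize_performance_cut is hypothetical; the thresholds are the ones used in the loop.

```r
categorize_performance_cut <- function(sales_amount) {
  as.character(cut(
    sales_amount,
    breaks = c(-Inf, 200, 400, 600, 800, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = TRUE  # intervals are (lower, upper], so 800 is still "Very Good"
  ))
}

# Same six amounts as the first rows of sales_data shown in section 3.2
first_six <- categorize_performance_cut(c(118, 559, 841, 137, 238, 541))
first_six

# Boundary values fall into the lower category, as in the if/else chain
boundaries <- categorize_performance_cut(c(200, 400, 600, 800))
boundaries
```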

4.2 APPLY TO DATA

sales_data$performance_category <- categorize_performance(sales_data$sales_amount)

head(sales_data)
##   salesperson_id day sales_amount discount_rate cumulative_sales
## 1              1   1          118          0.05              118
## 2              1   2          559          0.10              677
## 3              1   3          841          0.20             1518
## 4              1   4          137          0.05             1655
## 5              1   5          238          0.05             1893
## 6              1   6          541          0.10             2434
##   performance_category
## 1                 Poor
## 2                 Good
## 3            Excellent
## 4                 Poor
## 5              Average
## 6                 Good

4.3 CALCULATE PERCENTAGE

category_table <- table(sales_data$performance_category)

percentage <- prop.table(category_table) * 100

category_table
## 
##   Average Excellent      Good      Poor Very Good 
##        13         8        15         6         8
percentage
## 
##   Average Excellent      Good      Poor Very Good 
##        26        16        30        12        16

4.4 BAR VISUALIZATION

library(ggplot2)

bar_data <- as.data.frame(category_table)

ggplot(bar_data, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat="identity") +
  
  # add count labels above the bars
  geom_text(aes(label = Freq), vjust = -0.5, size = 5) +
  
  labs(
    title = "Performance Category Distribution",
    x = "Category",
    y = "Count"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "#F3E8FF", color = NA),
    legend.position = "none"
  )
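
table() sorts the categories alphabetically, so the bars appear in the order Average, Excellent, Good, Poor, Very Good. The sketch below imposes the performance order before plotting, using the counts reported in section 4.3. The name level_order and the Percent column are illustrative additions.

```r
# Counts copied from the category_table output in section 4.3
bar_data <- data.frame(
  Var1 = c("Average", "Excellent", "Good", "Poor", "Very Good"),
  Freq = c(13, 8, 15, 6, 8)
)

# Explicit factor levels control the left-to-right order of bars in ggplot2
level_order <- c("Poor", "Average", "Good", "Very Good", "Excellent")
bar_data$Var1 <- factor(bar_data$Var1, levels = level_order)
bar_data <- bar_data[order(bar_data$Var1), ]

# Share of each category in percent
bar_data$Percent <- 100 * bar_data$Freq / sum(bar_data$Freq)
bar_data
```

With the factor levels set this way, the same ggplot() call would draw the bars from Poor to Excellent.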

4.5 PIE CHART VISUALIZATION

# convert to a data frame
pie_data <- data.frame(
  category = names(percentage),
  percent = as.numeric(percentage)
)

# percentage labels
labels <- paste0(pie_data$category, " (", round(pie_data$percent,1), "%)")

pie(pie_data$percent,
    labels = labels,
    main = "📊 Performance Distribution (%)",
    col = c("#C4B5FD","#A78BFA","#7C3AED","#DDD6FE","#F3E8FF"))

4.6 INTERPRETATION

The visualization shows the distribution of sales performance across categories.

  • Good (30%) and Average (26%) are the largest categories, so most sales fall in the middle of the range.

  • Excellent and Very Good each account for 16% of sales, meaning about a third of sales exceed 600.

  • Poor is the smallest category (12%), covering sales of 200 or less.

This distribution helps in understanding overall sales quality and identifying performance trends in the sales data.

5 TASK 4 – Multi-Company Dataset Simulation

In this task, we simulate a dataset representing multiple companies and their employees. Each company contains several employees with attributes such as salary, department, performance score, and KPI score.

The objective is to generate structured data using nested loops and apply conditional logic to identify top performers. This simulation reflects real-world organizational data, where companies analyze employee performance and salary distribution.

We will also summarize the dataset at the company level, including average salary, average performance score, and maximum KPI score, followed by visualizations to better understand the patterns within the data.

5.1 DATA GENERATION FUNCTION

generate_company_data <- function(n_company, n_employees) {
  
  data <- data.frame()
  
  departments <- c("HR", "Finance", "IT", "Marketing")
  
  for (c in 1:n_company) {
    for (e in 1:n_employees) {
      
      salary <- sample(3000:10000, 1)
      performance_score <- sample(60:100, 1)
      KPI_score <- sample(50:100, 1)
      department <- sample(departments, 1)
      
      # conditional: top performer
      if (KPI_score > 90) {
        performer <- "Top Performer"
      } else {
        performer <- "Regular"
      }
      
      data <- rbind(data, data.frame(
        company_id = c,
        employee_id = paste0("C", c, "_E", e),
        salary = salary,
        department = department,
        performance_score = performance_score,
        KPI_score = KPI_score,
        performer_status = performer
      ))
    }
  }
  
  return(data)
}

5.2 GENERATE DATA

company_data <- generate_company_data(5, 20)

head(company_data)
##   company_id employee_id salary department performance_score KPI_score
## 1          1       C1_E1   9996         IT                75        75
## 2          1       C1_E2   6692         HR                83        91
## 3          1       C1_E3   6726         HR                61        70
## 4          1       C1_E4   3310  Marketing                70        56
## 5          1       C1_E5   6916         HR                63        86
## 6          1       C1_E6   7234    Finance                72        57
##   performer_status
## 1          Regular
## 2    Top Performer
## 3          Regular
## 4          Regular
## 5          Regular
## 6          Regular

5.3 SUMMARY PER COMPANY

summary_data <- aggregate(cbind(salary, performance_score, KPI_score) ~ company_id, 
                          data = company_data, 
                          FUN = mean)

max_kpi <- aggregate(KPI_score ~ company_id, data = company_data, max)

summary_data$max_KPI <- max_kpi$KPI_score

summary_data
##   company_id  salary performance_score KPI_score max_KPI
## 1          1 6572.90             78.45     73.55      94
## 2          2 6788.55             72.45     76.85     100
## 3          3 6782.00             76.95     75.50      99
## 4          4 6455.90             81.55     70.80      92
## 5          5 6498.65             82.40     80.15      98

5.4 TABLE

Company Summary

company_id    salary   performance_score   KPI_score   max_KPI
         1   6572.90               78.45       73.55        94
         2   6788.55               72.45       76.85       100
         3   6782.00               76.95       75.50        99
         4   6455.90               81.55       70.80        92
         5   6498.65               82.40       80.15        98

5.5 VISUALIZATION

5.5.1 Average Salary per Company

library(ggplot2)

ggplot(summary_data, aes(x=factor(company_id), y=salary, fill=factor(company_id))) +
  geom_bar(stat="identity") +
  geom_text(aes(label=round(salary,0)), vjust=-0.5) +
  
  labs(
    title="Average Salary per Company",
    x="Company",
    y="Average Salary"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill="#F3E8FF"),
    legend.position="none"
  )

5.5.2 KPI Distribution (Scatter)

ggplot(company_data, aes(x=performance_score, y=KPI_score, color=factor(company_id))) +
  geom_point() +
  
  labs(
    title="Performance vs KPI",
    x="Performance Score",
    y="KPI Score",
    color="Company"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill="#F3E8FF")
  )

5.6 INTERPRETATION

The generated dataset represents multiple companies with varying employee attributes.

From the summary table and bar chart:

  • Average salaries differ only modestly between companies, from about 6,456 (company 4) to about 6,789 (company 2).

  • Because every salary is drawn from the same uniform range (3,000 to 10,000), these differences reflect random sampling variation rather than genuine differences in compensation structure.

From the scatter plot:

  • No systematic relationship between performance score and KPI score should be expected, because the two are sampled independently of each other.

  • Any apparent trend in a single run is due to chance.

  • Top performers (KPI > 90) are distributed across different companies, indicating that high performance is not limited to a single company.

Overall, the simulation demonstrates how organizational data can be analyzed to identify performance trends and company-level differences.
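
In generate_company_data(), performance_score and KPI_score come from separate sample() calls, so they are statistically independent. The sketch below checks what correlation such independent draws produce in a large sample. The seed and the sample size of 10,000 are arbitrary choices for illustration.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility of this sketch
n <- 10000

# Same value ranges as in generate_company_data(), drawn independently
performance_score <- sample(60:100, n, replace = TRUE)
KPI_score         <- sample(50:100, n, replace = TRUE)

# Pearson correlation; expected to be close to 0 for independent variables
r <- cor(performance_score, KPI_score)
r
```

With only 20 employees per company, the sample correlation can stray noticeably from 0 by chance, which is why a single scatter plot can appear to show a trend.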

6 TASK 5 – Monte Carlo Simulation: Pi & Probability

In this task, we use Monte Carlo simulation to estimate the value of π (pi) and analyze probability through random point generation.

By generating random points within a square and checking whether they fall inside a circle, we can approximate π mathematically. Additionally, we compute the probability of points falling within a defined sub-region.

This method demonstrates how randomness and probability can be used to solve mathematical problems and simulate real-world uncertainty.

6.1 FUNCTION

monte_carlo_pi <- function(n_points) {
  
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  
  inside_circle <- x^2 + y^2 <= 1
  
  pi_estimate <- 4 * sum(inside_circle) / n_points
  
  # probability of landing in a sub-square (e.g., a small area)
  inside_square <- (x > 0 & x < 0.5 & y > 0 & y < 0.5)
  probability <- sum(inside_square) / n_points
  
  data <- data.frame(x, y, inside_circle)
  
  return(list(
    pi_estimate = pi_estimate,
    probability = probability,
    data = data
  ))
}

6.2 RUN

result <- monte_carlo_pi(5000)

result$pi_estimate
## [1] 3.0968
result$probability
## [1] 0.0612

6.3 VISUALIZATION

ggplot(result$data, aes(x=x, y=y, color=inside_circle)) +
  geom_point(alpha=0.6) +
  
  labs(
    title="Monte Carlo Simulation for Pi",
    subtitle=paste("Estimated Pi =", round(result$pi_estimate,4)),
    x="X",
    y="Y"
  ) +
  
  theme_minimal() +
  theme(
    plot.background = element_rect(fill="#F3E8FF")
  )

6.4 INTERPRETATION

The simulation estimates π from the share of random points that fall inside the unit circle.

  • The circle has area π and the enclosing square has area 4, so the share of points inside the circle approximates π/4; multiplying by 4 gives the estimate.

  • With 5,000 points the estimate is 3.0968, close to but not exactly the true value (≈ 3.1416). The estimate tends to get closer as the number of points increases.

  • In the plot, points inside the circle form a filled disc, while points outside remain within the square but do not satisfy x² + y² ≤ 1.

  • The sub-square 0 < x < 0.5, 0 < y < 0.5 has area 0.25, so the theoretical probability is 0.25 / 4 = 0.0625; the simulated value of 0.0612 is close to it.

This demonstrates how randomness can approximate mathematical constants and estimate probabilities.
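
To illustrate how the estimate behaves as the number of points grows, the sketch below repeats the simulation for several sample sizes. The helper estimate_pi is a compact, hypothetical variant of monte_carlo_pi() that returns only the two estimates; the seed and the list of sizes are arbitrary choices.

```r
# Compact variant of monte_carlo_pi(): returns the estimates without the point data
estimate_pi <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  list(
    pi_estimate = 4 * mean(x^2 + y^2 <= 1),                 # share inside circle * 4
    probability = mean(x > 0 & x < 0.5 & y > 0 & y < 0.5)   # share inside sub-square
  )
}

set.seed(1)  # arbitrary seed for reproducibility
sizes <- c(1e2, 1e3, 1e4, 1e5, 1e6)
estimates <- sapply(sizes, function(n) estimate_pi(n)$pi_estimate)

data.frame(
  n_points    = sizes,
  pi_estimate = estimates,
  abs_error   = abs(estimates - pi)
)

# Sub-square probability with many points; the theoretical value is 0.25 / 4 = 0.0625
big <- estimate_pi(1e6)
big$probability
```

The error does not have to shrink at every step, since each run is random, but its typical size falls roughly in proportion to 1 / sqrt(n_points).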

7 TASK 6 – Data Transformation & Feature Engineering

In this task, we perform data transformation and feature engineering to improve the quality and usability of data for analysis.

Min-max normalization and z-score standardization are applied to the numerical columns so that variables on different scales can be compared. In addition, new features are created to enrich the dataset: a salary bracket and a performance category.

These transformations help reveal patterns in the data and prepare it for further analysis and visualization.

7.1 NORMALIZATION FUNCTION

normalize_columns <- function(df) {
  
  df_norm <- df
  
  for (col in c("salary", "performance_score", "KPI_score")) {
    
    df_norm[[col]] <- (df[[col]] - min(df[[col]])) /
                      (max(df[[col]]) - min(df[[col]]))
  }
  
  return(df_norm)
}

7.2 Z-SCORE FUNCTION

z_score <- function(df) {
  
  df_z <- df
  
  for (col in c("salary", "performance_score", "KPI_score")) {
    
    df_z[[col]] <- (df[[col]] - mean(df[[col]])) /
                   sd(df[[col]])
  }
  
  return(df_z)
}
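
Both transformations divide by a quantity that is zero for a constant column (the range for min-max, the standard deviation for z-scores). The sketch below shows vector-level versions with a guard for the min-max case and a few sanity checks. The helper names min_max and z_standardize are hypothetical, and returning 0 for a constant column is one possible convention, not the only one.

```r
# Min-max scaling to [0, 1]; a constant vector is mapped to 0 to avoid 0/0
min_max <- function(v) {
  rng <- range(v, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(v)))
  (v - rng[1]) / (rng[2] - rng[1])
}

# Z-score standardization using the sample standard deviation
z_standardize <- function(v) {
  (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
}

salary <- c(9996, 6692, 6726, 3310, 6916, 7234)  # first six salaries from section 5.2

salary_norm <- min_max(salary)
salary_z    <- z_standardize(salary)

range(salary_norm)               # smallest value maps to 0, largest to 1
c(mean(salary_z), sd(salary_z))  # approximately 0 and exactly 1 up to rounding

# Base R's scale() performs the same standardization
all.equal(salary_z, as.numeric(scale(salary)))

min_max(c(5, 5, 5))              # constant input handled by the guard
```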

7.3 APPLY

norm_data <- normalize_columns(company_data)
z_data <- z_score(company_data)

head(norm_data)
##   company_id employee_id     salary department performance_score KPI_score
## 1          1       C1_E1 1.00000000         IT             0.375      0.50
## 2          1       C1_E2 0.51927834         HR             0.575      0.82
## 3          1       C1_E3 0.52422523         HR             0.025      0.40
## 4          1       C1_E4 0.02720792  Marketing             0.250      0.12
## 5          1       C1_E5 0.55186963         HR             0.075      0.72
## 6          1       C1_E6 0.59813764    Finance             0.300      0.14
##   performer_status
## 1          Regular
## 2    Top Performer
## 3          Regular
## 4          Regular
## 5          Regular
## 6          Regular
head(z_data)
##   company_id employee_id      salary department performance_score   KPI_score
## 1          1       C1_E1  1.70530338         IT        -0.2825985 -0.02766579
## 2          1       C1_E2  0.03656675         HR         0.3902550  1.16869256
## 3          1       C1_E3  0.05373898         HR        -1.4600921 -0.40152777
## 4          1       C1_E4 -1.67156500  Marketing        -0.7031319 -1.44834133
## 5          1       C1_E5  0.14970143         HR        -1.2918787  0.79483058
## 6          1       C1_E6  0.31031228    Finance        -0.5349185 -1.37356893
##   performer_status
## 1          Regular
## 2    Top Performer
## 3          Regular
## 4          Regular
## 5          Regular
## 6          Regular

7.4 FEATURE ENGINEERING

company_data$salary_bracket <- ifelse(company_data$salary > 7000, "High", "Low")

company_data$performance_category <- ifelse(company_data$performance_score > 85, "High", "Low")

head(company_data)
##   company_id employee_id salary department performance_score KPI_score
## 1          1       C1_E1   9996         IT                75        75
## 2          1       C1_E2   6692         HR                83        91
## 3          1       C1_E3   6726         HR                61        70
## 4          1       C1_E4   3310  Marketing                70        56
## 5          1       C1_E5   6916         HR                63        86
## 6          1       C1_E6   7234    Finance                72        57
##   performer_status salary_bracket performance_category
## 1          Regular           High                  Low
## 2    Top Performer            Low                  Low
## 3          Regular            Low                  Low
## 4          Regular            Low                  Low
## 5          Regular            Low                  Low
## 6          Regular           High                  Low

7.5 VISUALIZATION

library(ggplot2)
library(plotly)

# Histogram Salary (Before vs Normalized)
p1 <- ggplot(company_data, aes(x = salary)) +
  geom_histogram(fill="#A78BFA", bins=20, alpha=0.8) +
  labs(title="Original Salary Distribution") +
  theme_minimal() +
  theme(plot.background = element_rect(fill="#F3E8FF"))

p2 <- ggplot(norm_data, aes(x = salary)) +
  geom_histogram(fill="#7C3AED", bins=20, alpha=0.8) +
  labs(title="Normalized Salary Distribution") +
  theme_minimal() +
  theme(plot.background = element_rect(fill="#F3E8FF"))

ggplotly(p1)
ggplotly(p2)

7.5.1 BOXPLOT: PERFORMANCE BY SALARY BRACKET

p3 <- ggplot(company_data, aes(x = salary_bracket, y = performance_score, fill = salary_bracket)) +
  geom_boxplot() +
  labs(
    title="Performance by Salary Bracket",
    x="Salary Category",
    y="Performance Score"
  ) +
  theme_minimal() +
  theme(plot.background = element_rect(fill="#F3E8FF"))

ggplotly(p3)

7.6 INTERPRETATION

The visualizations provide a comprehensive view of how data transformation and feature engineering affect the dataset.

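  • The original salary distribution is spread roughly evenly between 3,000 and 10,000, as expected from uniform sampling.

  • After min-max normalization, the values are rescaled to the range 0 to 1. The shape of the distribution is unchanged; only the axis changes, which makes the variable comparable with others on the same scale.

  • The boxplot compares performance scores for the High (salary > 7,000) and Low salary brackets. Because salary and performance score are generated independently, any gap between the two boxes reflects sampling variation, not a real link between compensation and performance.

Overall, the transformations improve data comparability, while the engineered features provide grouping variables for comparison. On real data, the same boxplot could reveal a genuine relationship between salary and performance.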

8 TASK 7 – Mini Project: Company KPI Dashboard

In this mini project, a comprehensive dataset was built representing several companies and their employees, including salaries, performance scores, KPI scores, and departments.

The goal is to simulate real-world company data and create a dashboard that summarizes key performance indicators (KPIs). We categorize employees, analyze company-level metrics, and visualize patterns using advanced plots.

This task integrates all previous concepts, including data simulation, loops, feature engineering, and visualization, to produce a complete data analysis workflow.

8.1 GENERATE DATA

dashboard_data <- generate_company_data(6, 50)

head(dashboard_data)
##   company_id employee_id salary department performance_score KPI_score
## 1          1       C1_E1   8493    Finance                63        64
## 2          1       C1_E2   3503  Marketing                70       100
## 3          1       C1_E3   3077         HR               100        64
## 4          1       C1_E4   4300  Marketing                99        61
## 5          1       C1_E5   8764  Marketing                87        68
## 6          1       C1_E6   5182         HR                91        77
##   performer_status
## 1          Regular
## 2    Top Performer
## 3          Regular
## 4          Regular
## 5          Regular
## 6          Regular

8.2 KPI SUMMARY

library(dplyr)
summary_kpi <- dashboard_data %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = mean(salary),
    avg_KPI = mean(KPI_score),
    top_performers = sum(KPI_score > 90)
  )

summary_kpi
## # A tibble: 6 × 4
##   company_id avg_salary avg_KPI top_performers
##        <int>      <dbl>   <dbl>          <int>
## 1          1      6252.    73.2              5
## 2          2      6663.    74.0              6
## 3          3      6311.    74.6              5
## 4          4      6994.    71.4              7
## 5          5      6478.    74.2             12
## 6          6      7006.    72.7              8

8.3 KPI CATEGORY

dashboard_data$KPI_category <- ifelse(
  dashboard_data$KPI_score > 90, "High",
  ifelse(dashboard_data$KPI_score > 75, "Medium", "Low")
)

8.4 DASHBOARD VISUALIZATION

8.4.1 Bar Chart KPI

p1 <- ggplot(summary_kpi, aes(x=factor(company_id), y=avg_KPI, fill=factor(company_id))) +
  geom_bar(stat="identity") +
  geom_text(aes(label=round(avg_KPI,1)), vjust=-0.5) +
  theme_minimal() +
  labs(title="Average KPI per Company")

ggplotly(p1)

8.4.2 Scatter + Regression

p2 <- ggplot(dashboard_data, aes(x=salary, y=KPI_score, color=department)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  theme_minimal() +
  labs(title="Salary vs KPI")

ggplotly(p2)

8.4.3 Salary Distribution

p3 <- ggplot(dashboard_data, aes(x=salary)) +
  geom_histogram(fill="#7C3AED", bins=20) +
  theme_minimal() +
  labs(title="Salary Distribution")

ggplotly(p3)

8.5 INTERPRETATION

The dashboard provides a complete overview of company performance.

  • The bar chart shows differences in average KPI across companies, helping identify top-performing companies.

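  • The scatter plot with per-department regression lines compares salary and KPI. Since the two variables are simulated independently, the fitted lines should be close to flat, and any slope reflects random noise rather than a real salary effect.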

  • The histogram illustrates the distribution of salaries across all companies.

Overall, the dashboard highlights key performance patterns and supports data-driven decision making at the company level.

9 TASK 8 – Automated Report Generation

In this final task, we develop an automated reporting system that generates summaries for each company using functions and loops.

The goal is to create a scalable workflow where reports, including tables and visualizations, are automatically produced without manual repetition.

This approach reflects real-world data science practices, where automation is essential for handling large datasets efficiently and consistently.

9.1 REPORT FUNCTION

generate_report <- function(data, company_id) {
  
  df <- data[data$company_id == company_id, ]
  
  summary <- data.frame(
    Company = company_id,
    Avg_Salary = mean(df$salary),
    Avg_KPI = mean(df$KPI_score),
    Top_Performers = sum(df$KPI_score > 90)
  )
  
  return(list(data=df, summary=summary))
}

9.2 LOOP OVER ALL COMPANIES

company_ids <- unique(dashboard_data$company_id)

reports <- list()

for (cid in company_ids) {
  reports[[as.character(cid)]] <- generate_report(dashboard_data, cid)
}
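
The per-company summaries stored in the list can also be stacked into a single comparison table. The sketch below shows one way to do this with lapply() and do.call(rbind, ...). It restates generate_report() from section 9.1 and runs on a small made-up data frame (toy_data), so the numbers are illustrative and not taken from dashboard_data.

```r
# generate_report() as defined in section 9.1
generate_report <- function(data, company_id) {
  df <- data[data$company_id == company_id, ]
  summary <- data.frame(
    Company        = company_id,
    Avg_Salary     = mean(df$salary),
    Avg_KPI        = mean(df$KPI_score),
    Top_Performers = sum(df$KPI_score > 90)
  )
  list(data = df, summary = summary)
}

# Hypothetical toy input: two companies with two employees each
toy_data <- data.frame(
  company_id = c(1, 1, 2, 2),
  salary     = c(4000, 6000, 5000, 7000),
  KPI_score  = c(95, 70, 91, 92)
)

company_ids <- unique(toy_data$company_id)

# Build one report per company, keyed by company id
reports <- lapply(company_ids, function(cid) generate_report(toy_data, cid))
names(reports) <- as.character(company_ids)

# Stack the one-row summaries into a single table
summary_table <- do.call(rbind, lapply(reports, function(r) r$summary))
summary_table
```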

9.3 DISPLAY AUTOMATED REPORTS

Company 1

Company   Avg_Salary   Avg_KPI   Top_Performers
      1      6251.88     73.18                5

Company 2

Company   Avg_Salary   Avg_KPI   Top_Performers
      2      6662.68     74.02                6

Company 3

Company   Avg_Salary   Avg_KPI   Top_Performers
      3      6310.70     74.58                5

Company 4

Company   Avg_Salary   Avg_KPI   Top_Performers
      4      6994.08     71.36                7

Company 5

Company   Avg_Salary   Avg_KPI   Top_Performers
      5      6477.78     74.24               12

Company 6

Company   Avg_Salary   Avg_KPI   Top_Performers
      6      7006.44     72.66                8

9.4 VISUALIZATION

9.4.1 SCATTER + REGRESSION

library(ggplot2)
library(plotly)
library(htmltools)

plots <- list()

for (cid in company_ids) {
  
  df <- reports[[as.character(cid)]]$data
  
  p <- ggplot(df, aes(x=salary, y=KPI_score, color=department)) +
    geom_point(size=2, alpha=0.7) +
    geom_smooth(method="lm", se=FALSE) +
    
    labs(
      title=paste("Company", cid, "- Salary vs KPI"),
      x="Salary",
      y="KPI Score"
    ) +
    
    theme_minimal()
  
  plots[[cid]] <- ggplotly(p)
}

tagList(plots)

9.4.2 HISTOGRAM

plots_hist <- list()

for (cid in company_ids) {
  
  df <- reports[[as.character(cid)]]$data
  
  p <- ggplot(df, aes(x=salary)) +
    geom_histogram(fill="#7C3AED", bins=15) +
    
    labs(
      title=paste("Salary Distribution - Company", cid),
      x="Salary",
      y="Count"
    ) +
    
    theme_minimal() +
    theme(plot.background = element_rect(fill="#F3E8FF"))
  
  plots_hist[[cid]] <- ggplotly(p)
}

tagList(plots_hist)

9.5 EXPORT

write.csv(dashboard_data, "company_data.csv", row.names = FALSE)
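
A quick way to confirm that an export worked is to read the file back and compare it with the original. The sketch below does this with a small made-up data frame and a temporary file, so it does not touch company_data.csv. The names toy_data, out_file, and reloaded are illustrative.

```r
# Hypothetical toy data with the same kinds of columns as dashboard_data
toy_data <- data.frame(
  company_id  = c(1L, 2L),
  employee_id = c("C1_E1", "C2_E1"),
  salary      = c(8493L, 3503L),
  stringsAsFactors = FALSE
)

# Write to a temporary file instead of the working directory
out_file <- tempfile(fileext = ".csv")
write.csv(toy_data, out_file, row.names = FALSE)

# Read the file back and compare structure and content
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
identical(names(reloaded), names(toy_data))
all(reloaded$salary == toy_data$salary)

unlink(out_file)  # remove the temporary file
```

Setting row.names = FALSE, as in the export above, keeps R's row numbers out of the file, so the reloaded columns match the original ones.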

9.6 INTERPRETATION

The automated reporting system successfully generates summaries and visualizations for each company using loops and functions.

  • The scatter plots show salary against KPI for each company. Because salary and KPI are generated independently, no consistent relationship is expected.

  • Any slope in the regression lines, positive or negative, reflects sampling noise in the small per-department groups rather than a real effect.

  • Differences between departments can be observed through color variations.

  • The salary distribution plots reveal how employee compensation varies within each company.

The interactive features allow deeper exploration of the data, making it easier to identify patterns and outliers. Overall, automation improves efficiency and demonstrates how data science workflows can scale to handle multiple datasets simultaneously.