KAYLA APRILIA
Data Science Student at ITSB
NIM: 52250057
Email: kaylaaprilia2142@gmail.com
1 Dynamic Multi-Formula Function
This project develops a dynamic multi-formula function in R that computes linear, quadratic, cubic, and exponential models. The implementation uses nested loops, conditional logic, and input validation to ensure flexibility and correctness.
Additionally, the workflow demonstrates a complete data science process:
- Data computation
- Data transformation
- Data visualization
library(ggplot2)
library(dplyr)
library(reshape2)
library(plotly)
library(htmltools)
# Define function
compute_formula <- function(x, formulas) {
  # Input validation
  valid_formulas <- c("linear", "quadratic", "cubic", "exponential")
  if (!all(formulas %in% valid_formulas)) {
    stop("Invalid formula detected. Allowed: linear, quadratic, cubic, exponential")
  }
  results <- list()
  # Outer loop (formulas)
  for (f in formulas) {
    y_values <- c()
    # Inner loop (x values)
    for (i in x) {
      if (f == "linear") {
        y <- 2*i + 3
      } else if (f == "quadratic") {
        y <- i^2 + 2*i + 1
      } else if (f == "cubic") {
        y <- i^3 - i^2 + 2
      } else if (f == "exponential") {
        y <- 2^i
      }
      y_values <- c(y_values, y)
    }
    results[[f]] <- y_values
  }
  return(results)
}
# Generate data
x <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")
results <- compute_formula(x, formulas)
# Convert to dataframe
df <- data.frame(x = x)
for (name in names(results)) {
df[[name]] <- results[[name]]
}
# Reshape data for plotting
df_long <- melt(df, id.vars = "x")
# Plot visualization
p <- ggplot(df_long, aes(x = x, y = value, color = variable)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  labs(
    title = "Comparison of Multiple Mathematical Functions",
    x = "X values",
    y = "Y values",
    color = "Function Type"
  ) +
  theme_minimal()
ggplotly(p, width = 700, height = 450)
Visualization Interpretation
The graph compares four mathematical functions over the same range of x values (1–20):
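- The linear function increases at a constant rate of 2 per unit of x, forming a straight line that reaches 43 at x = 20.
- The quadratic function, equal to (x + 1)^2, curves upward and reaches 441 at x = 20.
- The cubic function rises more steeply and reaches 7,602 at x = 20.
- The exponential function grows slowly at first, overtakes the cubic at x = 10 (1,024 versus 902), and reaches 1,048,576 at x = 20. On a linear y-axis it dominates the plot, so the other three curves appear almost flat.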
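The nested loops in compute_formula can also be written in vectorized form. The sketch below is one possible variant, not the implementation used above: the names formula_table and compute_formula_vec are hypothetical, and the coefficients are copied from the loop version.

```r
# Hypothetical vectorized variant of compute_formula (illustrative only)
formula_table <- list(
  linear      = function(x) 2 * x + 3,
  quadratic   = function(x) x^2 + 2 * x + 1,
  cubic       = function(x) x^3 - x^2 + 2,
  exponential = function(x) 2^x
)

compute_formula_vec <- function(x, formulas) {
  stopifnot(is.numeric(x))
  # Input validation: report exactly which names are not supported
  unknown <- setdiff(formulas, names(formula_table))
  if (length(unknown) > 0) {
    stop("Invalid formula detected: ", paste(unknown, collapse = ", "))
  }
  # Each function is applied to the whole x vector at once
  lapply(formula_table[formulas], function(f) f(x))
}

res <- compute_formula_vec(1:20, c("linear", "cubic", "exponential"))
res$linear[1:3]      # 5 7 9
res$cubic[10]        # 902
res$exponential[10]  # 1024
```

Adding a new model in this variant only requires a new entry in formula_table; the validation and the evaluation step stay unchanged.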
2 Nested Simulation: Multi-Sales & Discounts
This project simulates a multi-salesperson environment using nested loops and a nested helper function in R.
The function generates sales data including salesperson, sales ID, day, sales amount, discount rate, and cumulative sales.
This demonstrates a complete workflow: simulation → transformation → analysis → visualization.
# Main simulation function
simulate_sales <- function(n_salesperson, days) {
  all_data <- data.frame(
    salesperson = character(),
    sales_id = character(),
    day = integer(),
    sales_amount = numeric(),
    discount_rate = numeric(),
    cumulative_sales = numeric()
  )
  # Loop per salesperson
  for (s in 1:n_salesperson) {
    sales_amounts <- c()
    # Nested loop per day
    for (d in 1:days) {
      # Generate random sales amount
      sales_amount <- round(runif(1, 100, 1000), 2)
      # Conditional discount logic
      if (sales_amount > 800) {
        discount_rate <- 0.20
      } else if (sales_amount > 500) {
        discount_rate <- 0.10
      } else {
        discount_rate <- 0.05
      }
      sales_amounts <- c(sales_amounts, sales_amount)
      # Create row
      temp <- data.frame(
        salesperson = paste0("SP", s),
        sales_id = paste0("S", s, "_", d),
        day = d,
        sales_amount = sales_amount,
        discount_rate = discount_rate,
        cumulative_sales = NA
      )
      all_data <- dplyr::bind_rows(all_data, temp)
    }
    # Nested function for cumulative sales
    cumulative_sales <- function(x) {
      return(cumsum(x))
    }
    # Apply cumulative calculation
    idx <- all_data$salesperson == paste0("SP", s)
    all_data$cumulative_sales[idx] <- cumulative_sales(sales_amounts)
  }
  return(all_data)
}
# Run simulation
set.seed(123)
sales_data <- simulate_sales(n_salesperson = 5, days = 30)
# Summary statistics
summary_stats <- sales_data %>%
group_by(salesperson) %>%
summarise(
total_sales = sum(sales_amount),
avg_sales = mean(sales_amount),
max_sales = max(sales_amount),
min_sales = min(sales_amount)
)
# Display table
knitr::kable(summary_stats,
caption = "Sales Summary per Salesperson",
align = "c")
| salesperson | total_sales | avg_sales | max_sales | min_sales |
|---|---|---|---|---|
| SP1 | 18454.79 | 615.1597 | 994.84 | 137.85 |
| SP2 | 14843.30 | 494.7767 | 966.72 | 122.15 |
| SP3 | 16840.52 | 561.3507 | 986.46 | 100.56 |
| SP4 | 17019.89 | 567.3297 | 959.03 | 154.65 |
| SP5 | 15904.32 | 530.1440 | 985.80 | 109.42 |
# Plot cumulative sales
p <- ggplot(sales_data, aes(x = day, y = cumulative_sales, color = salesperson)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  labs(
    title = "Cumulative Sales per Salesperson",
    x = "Day",
    y = "Cumulative Sales"
  ) +
  theme_minimal()
ggplotly(p, width = 700, height = 450)
Visualization Interpretation
The graph shows the cumulative sales growth of each salesperson over time.
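Each line represents one salesperson’s total accumulated sales. Every line rises because each daily sales amount is positive (between 100 and 1,000), so the cumulative total can only grow; the upward trend does not mean that daily sales themselves are increasing. Differences in slope reflect performance: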
- A steeper line means higher daily sales accumulation.
- A flatter line indicates slower growth.
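The same simulation can be sketched without growing a data frame row by row. The function below is an illustrative alternative, not the code used for the results above: simulate_sales_vec is a hypothetical name, and it assumes the same discount thresholds (above 800, above 500, otherwise) as the loop version.

```r
# Hypothetical vectorized variant of simulate_sales (illustrative only)
simulate_sales_vec <- function(n_salesperson, days) {
  n <- n_salesperson * days
  sp_index <- rep(seq_len(n_salesperson), each = days)
  out <- data.frame(
    salesperson  = paste0("SP", sp_index),
    day          = rep(seq_len(days), times = n_salesperson),
    sales_amount = round(runif(n, 100, 1000), 2)
  )
  out$sales_id <- paste0("S", sp_index, "_", out$day)
  # Same thresholds as the loop version: > 800 -> 20%, > 500 -> 10%, else 5%
  out$discount_rate <- ifelse(out$sales_amount > 800, 0.20,
                       ifelse(out$sales_amount > 500, 0.10, 0.05))
  # ave() applies cumsum separately within each salesperson
  out$cumulative_sales <- ave(out$sales_amount, out$salesperson, FUN = cumsum)
  out
}

set.seed(123)
sales_vec <- simulate_sales_vec(n_salesperson = 5, days = 30)
head(sales_vec, 3)
```

Building all columns at once avoids the repeated copying that bind_rows causes inside a loop, which matters mainly when the number of salespeople or days becomes large.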
3 Multi-Level Performance Categorization
This project develops a function to categorize sales performance into five levels: Excellent, Very Good, Good, Average, and Poor.
The implementation includes:
- Looping through a vector of sales data
- Applying conditional logic for categorization
- Calculating percentage distribution for each category
- Visualizing results using a bar plot and pie chart
This demonstrates data classification, transformation, and visualization in a structured workflow.
library(ggplot2)
library(plotly)
library(dplyr)
# Function to categorize performance
categorize_performance <- function(sales_amount) {
  categories <- c()
  # Loop through sales values
  for (i in sales_amount) {
    if (i > 800) {
      category <- "Excellent"
    } else if (i > 600) {
      category <- "Very Good"
    } else if (i > 400) {
      category <- "Good"
    } else if (i > 200) {
      category <- "Average"
    } else {
      category <- "Poor"
    }
    categories <- c(categories, category)
  }
  return(categories)
}
# Example data (can reuse from previous task)
set.seed(123)
sales_amount <- round(runif(100, 100, 1000), 2)
# Apply categorization
performance <- categorize_performance(sales_amount)
# Create dataframe
df <- data.frame(
sales_amount = sales_amount,
performance = performance
)
# Calculate percentages
category_counts <- table(df$performance)
category_percent <- prop.table(category_counts) * 100
# Convert to dataframe
df_summary <- data.frame(
category = names(category_counts),
count = as.numeric(category_counts),
percentage = as.numeric(category_percent)
)
knitr::kable(df_summary,
caption = "Performance Category Distribution",
align = "c")
| category | count | percentage |
|---|---|---|
| Average | 24 | 24 |
| Excellent | 24 | 24 |
| Good | 24 | 24 |
| Poor | 9 | 9 |
| Very Good | 19 | 19 |
# Bar plot
p1 <- ggplot(df_summary, aes(x = category, y = percentage, fill = category)) +
geom_bar(stat = "identity") +
labs(
title = "Performance Distribution (Bar Plot)",
x = "Category",
y = "Percentage (%)"
) +
theme_minimal()
ggplotly(p1, width = 700, height = 450)
# Pie chart
plot_ly(df_summary,
labels = ~category,
values = ~percentage,
type = 'pie') %>%
layout(
title = "Performance Distribution (Pie Chart)",
width = 700,
height = 450
)
Visualization Interpretation
The bar plot and pie chart show the distribution of sales performance across five categories.
- The bar plot clearly compares percentages between categories.
- The pie chart highlights the proportion of each category in the dataset.
Key observations:
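- Because the sales amounts are drawn uniformly between 100 and 1,000, each category's share is roughly proportional to the width of its range: about 22% each for Average, Good, Very Good, and Excellent, and about 11% for Poor.
- In this sample, Average, Good, and Excellent each account for 24%, Very Good for 19%, and Poor for 9%; the deviations from the expected shares are sampling variation.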
- The distribution reflects how sales performance is spread across different levels.
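The categorization and the expected share of each category can be sketched with cut(). This is an illustrative alternative to the loop, assuming the same thresholds; categorize_cut and expected_pct are hypothetical names.

```r
# Thresholds taken from categorize_performance above
breaks <- c(-Inf, 200, 400, 600, 800, Inf)
labels <- c("Poor", "Average", "Good", "Very Good", "Excellent")

categorize_cut <- function(sales_amount) {
  # right = TRUE gives intervals (lo, hi], matching the `>` comparisons in the loop
  cut(sales_amount, breaks = breaks, labels = labels, right = TRUE)
}

# Expected share of each category for Uniform(100, 1000):
# width of the category's interval divided by the total width (900)
lower <- pmax(head(breaks, -1), 100)
upper <- pmin(tail(breaks, -1), 1000)
expected_pct <- setNames(round((upper - lower) / 900 * 100, 1), labels)
expected_pct
# Poor 11.1, Average 22.2, Good 22.2, Very Good 22.2, Excellent 22.2

categorize_cut(c(200, 200.01, 800, 800.01))
# Poor, Average, Very Good, Excellent
```

A side effect of cut() is that the result is a factor whose levels run from Poor to Excellent, so tables and bar plots built from it follow that order instead of the alphabetical order seen in the summary table.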
4 Multi-Company Dataset Simulation
This project simulates a multi-company dataset using nested loops in R.
The function generates employee-level data including:
- Company ID
- Employee ID
- Salary
- Department
- Performance Score
- KPI Score
This demonstrates a full workflow: simulation → aggregation → analysis → visualization.
# Function to generate company data
generate_company_data <- function(n_company, n_employees) {
  all_data <- data.frame()
  departments <- c("HR", "Finance", "IT", "Marketing", "Operations")
  # Loop per company
  for (c in 1:n_company) {
    # Loop per employee
    for (e in 1:n_employees) {
      salary <- round(runif(1, 3000, 10000), 2)
      performance_score <- round(runif(1, 60, 100), 2)
      KPI_score <- round(runif(1, 50, 100), 2)
      # Conditional logic: top performer
      top_performer <- ifelse(KPI_score > 90, "Yes", "No")
      temp <- data.frame(
        company_id = paste0("C", c),
        employee_id = paste0("E", c, "_", e),
        salary = salary,
        department = sample(departments, 1),
        performance_score = performance_score,
        KPI_score = KPI_score,
        top_performer = top_performer
      )
      all_data <- rbind(all_data, temp)
    }
  }
  return(all_data)
}
# Generate data
set.seed(123)
company_data <- generate_company_data(n_company = 3, n_employees = 50)
# Summary per company
summary_table <- company_data %>%
group_by(company_id) %>%
summarise(
avg_salary = mean(salary),
avg_performance = mean(performance_score),
max_KPI = max(KPI_score)
)
knitr::kable(summary_table,
caption = "Company Performance Summary",
align = "c")
| company_id | avg_salary | avg_performance | max_KPI |
|---|---|---|---|
| C1 | 5951.905 | 84.2378 | 96.87 |
| C2 | 6468.252 | 79.2874 | 99.97 |
| C3 | 6639.389 | 79.2358 | 99.45 |
# Plot: Average Salary per Company
p1 <- ggplot(summary_table, aes(x = company_id, y = avg_salary, fill = company_id)) +
geom_bar(stat = "identity") +
labs(
title = "Average Salary per Company",
x = "Company",
y = "Average Salary"
) +
theme_minimal()
ggplotly(p1, width = 700, height = 450)
# Plot: Average Performance per Company
p2 <- ggplot(summary_table, aes(x = company_id, y = avg_performance, fill = company_id)) +
geom_bar(stat = "identity") +
labs(
title = "Average Performance Score per Company",
x = "Company",
y = "Average Performance"
) +
theme_minimal()
ggplotly(p2, width = 700, height = 450)
# Plot: KPI Distribution
p3 <- ggplot(company_data, aes(x = KPI_score, fill = company_id)) +
geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
labs(
title = "KPI Score Distribution",
x = "KPI Score",
y = "Frequency"
) +
theme_minimal()
ggplotly(p3, width = 700, height = 450)
Visualization Interpretation
The visualizations provide insights into company performance. Because every company is simulated from the same uniform distributions, the differences between companies here reflect random sampling rather than structural differences.
Average Salary Plot: shows how compensation differs between companies. Higher bars indicate companies paying more on average.
Average Performance Plot: displays the mean employee performance score per company. Higher values indicate stronger average performance.
KPI Distribution Histogram: illustrates how KPI scores are spread across employees.
- A concentration near high values suggests many high performers.
- The presence of values above 90 highlights top performers.
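As a reference point for reading the summary table, the theoretical values behind the simulation can be computed directly from the runif() ranges used in generate_company_data. This is a small sketch; uniform_mean, expected, p_top, and se_salary are hypothetical names introduced for illustration.

```r
# Mean of a Uniform(a, b) distribution
uniform_mean <- function(a, b) (a + b) / 2

expected <- c(
  salary            = uniform_mean(3000, 10000),  # 6500
  performance_score = uniform_mean(60, 100),      # 80
  KPI_score         = uniform_mean(50, 100)       # 75
)
expected

# Probability that a Uniform(50, 100) KPI score exceeds 90
p_top <- (100 - 90) / (100 - 50)   # 0.2

# Standard error of the mean salary for a company of 50 employees:
# sd of Uniform(a, b) is (b - a) / sqrt(12)
se_salary <- (10000 - 3000) / sqrt(12) / sqrt(50)
se_salary   # about 286
```

With a standard error of roughly 286, company averages that differ by a few hundred from 6,500 are within the range expected from sampling alone.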
5 Monte Carlo Simulation: Pi & Probability
This project implements a Monte Carlo simulation to estimate the value of π (pi) and the probability that a random point in the square falls inside a smaller central sub-square.
This demonstrates simulation-based estimation and probabilistic modeling in data science.
# Monte Carlo function
monte_carlo_pi <- function(n_points) {
  inside_circle <- 0
  inside_square_small <- 0
  x_vals <- c()
  y_vals <- c()
  inside_flag <- c()
  # Loop for simulation
  for (i in 1:n_points) {
    x <- runif(1, -1, 1)
    y <- runif(1, -1, 1)
    x_vals <- c(x_vals, x)
    y_vals <- c(y_vals, y)
    # Check if inside circle
    if (x^2 + y^2 <= 1) {
      inside_circle <- inside_circle + 1
      inside_flag <- c(inside_flag, "Inside")
    } else {
      inside_flag <- c(inside_flag, "Outside")
    }
    # Probability for sub-square (-0.5 to 0.5)
    if (x >= -0.5 && x <= 0.5 && y >= -0.5 && y <= 0.5) {
      inside_square_small <- inside_square_small + 1
    }
  }
  # Estimate Pi
  pi_estimate <- 4 * (inside_circle / n_points)
  # Probability of falling in sub-square
  prob_subsquare <- inside_square_small / n_points
  # Create dataframe
  df <- data.frame(
    x = x_vals,
    y = y_vals,
    position = inside_flag
  )
  return(list(
    pi_estimate = pi_estimate,
    prob_subsquare = prob_subsquare,
    data = df
  ))
}
# Run simulation
set.seed(123)
result <- monte_carlo_pi(5000)
# Print results
result$pi_estimate
## [1] 3.1552
result$prob_subsquare
## [1] 0.2624
# Plot points
p <- ggplot(result$data, aes(x = x, y = y, color = position)) +
geom_point(alpha = 0.6) +
labs(
title = "Monte Carlo Simulation of Pi",
x = "X",
y = "Y",
color = "Position"
) +
theme_minimal()
ggplotly(p, width = 700, height = 450)
Visualization Interpretation
The scatter plot displays randomly generated points:
- Points inside the circle form a circular shape centered at (0,0).
- Points outside the circle fill the remaining square area.
Key observations:
- The circular boundary becomes clearer as the number of points increases.
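- The share of points inside the circle approximates π/4, the ratio of the circle's area (π) to the square's area (4); multiplying it by 4 gives the estimate of π (3.1552 for 5,000 points).
- The sub-square probability estimates how often points fall within the central region from -0.5 to 0.5 on both axes. Its theoretical value is 1/4 (an area of 1 out of 4), close to the simulated 0.2624.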
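The estimator can also be sketched in vectorized form together with a rough measure of its uncertainty. This is an illustrative variant, not the function used for the results above: monte_carlo_pi_vec and pi_se are hypothetical names, and because it draws all x values before all y values, it will not reproduce the loop version's numbers for the same seed.

```r
# Hypothetical vectorized Monte Carlo estimate of pi (illustrative only)
monte_carlo_pi_vec <- function(n_points) {
  x <- runif(n_points, -1, 1)
  y <- runif(n_points, -1, 1)
  inside <- x^2 + y^2 <= 1                 # point lies in the unit circle
  in_sub <- abs(x) <= 0.5 & abs(y) <= 0.5  # point lies in the central sub-square
  p_hat <- mean(inside)                    # estimates pi / 4
  list(
    pi_estimate    = 4 * p_hat,
    # Binomial standard error of the share, scaled by 4
    pi_se          = 4 * sqrt(p_hat * (1 - p_hat) / n_points),
    prob_subsquare = mean(in_sub)          # theoretical value is 0.25
  )
}

set.seed(42)
est <- monte_carlo_pi_vec(100000)
est$pi_estimate
est$pi_se
est$prob_subsquare
```

The standard error shrinks in proportion to 1 / sqrt(n_points), so gaining one extra decimal digit of accuracy requires about 100 times as many points.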
6 Advanced Data Transformation & Feature Engineering
This project focuses on data transformation and feature engineering techniques in R: min-max normalization, z-score standardization, and binning continuous variables into categories.
This demonstrates how raw data can be transformed into more meaningful and analysis-ready features.
# Example dataset
set.seed(123)
df <- data.frame(
salary = runif(100, 3000, 10000),
performance_score = runif(100, 60, 100),
KPI_score = runif(100, 50, 100)
)
# Function: Min-Max Normalization
normalize_columns <- function(df) {
  df_norm <- df
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      min_val <- min(df[[col]])
      max_val <- max(df[[col]])
      df_norm[[col]] <- (df[[col]] - min_val) / (max_val - min_val)
    }
  }
  return(df_norm)
}
# Function: Z-score Standardization
z_score <- function(df) {
  df_z <- df
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      mean_val <- mean(df[[col]])
      sd_val <- sd(df[[col]])
      df_z[[col]] <- (df[[col]] - mean_val) / sd_val
    }
  }
  return(df_z)
}
# Apply transformations
df_normalized <- normalize_columns(df)
df_zscore <- z_score(df)
# Feature Engineering
df$performance_category <- cut(
df$performance_score,
breaks = c(-Inf, 70, 80, 90, Inf),
labels = c("Poor", "Average", "Good", "Excellent")
)
df$salary_bracket <- cut(
df$salary,
breaks = c(-Inf, 4000, 7000, Inf),
labels = c("Low", "Medium", "High")
)
# Histogram before & after normalization
p1 <- ggplot(df, aes(x = salary, fill = salary_bracket)) +
geom_histogram(bins = 20, alpha = 0.7) +
labs(title = "Salary Distribution (Original)") +
theme_minimal()
ggplotly(p1, width = 700, height = 450)
p2 <- ggplot(df_normalized, aes(x = salary)) +
geom_histogram(bins = 20, fill = "steelblue", alpha = 0.7) +
labs(title = "Salary Distribution (Normalized)") +
theme_minimal()
ggplotly(p2, width = 700, height = 450)
# Boxplot comparison
p3 <- ggplot(df, aes(x = salary_bracket, y = salary, fill = salary_bracket)) +
geom_boxplot() +
labs(title = "Boxplot (Original Salary)") +
theme_minimal()
ggplotly(p3, width = 700, height = 450)
p4 <- ggplot(df_zscore, aes(y = salary)) +
geom_boxplot(fill = "tomato", alpha = 0.7) +
labs(title = "Boxplot (Z-Score Salary)") +
theme_minimal()
ggplotly(p4, width = 700, height = 450)
Visualization Interpretation
The visualizations compare data distributions before and after transformation:
- Histogram (Original vs Normalized)
  - Original data shows the actual salary distribution.
  - Normalized data scales values between 0 and 1, making them easier to compare across variables.
- Boxplot (Original vs Z-score)
  - The original boxplot shows the raw spread and any outliers within each salary bracket.
  - Z-score transformation centers the data at 0 with a standard deviation of 1, making patterns more comparable.
Key observations:
- Normalization changes the scale but preserves the distribution shape.
- Z-score transformation standardizes the data, making it suitable for statistical modeling.
- Feature engineering (performance_category & salary_bracket) simplifies analysis by grouping continuous data into meaningful categories.
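The two transformations can be checked on a single column with a few lines of base R. The sketch below is illustrative: safe_min_max is a hypothetical helper, and mapping a constant column to 0 is an assumption made here, since min-max scaling would otherwise divide zero by zero and return NaN.

```r
set.seed(1)
salary <- runif(100, 3000, 10000)

# Min-max normalization with a guard for constant columns (assumed behavior)
safe_min_max <- function(v) {
  rng <- range(v, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(v)))
  (v - rng[1]) / (rng[2] - rng[1])
}

min_max <- safe_min_max(salary)
z <- (salary - mean(salary)) / sd(salary)

range(min_max)                           # 0 1
all.equal(z, as.numeric(scale(salary)))  # TRUE: matches base R's scale()

# Both transformations are linear, so the ordering and shape of the
# distribution are preserved (correlation with the original is 1)
cor(salary, min_max)
cor(salary, z)
```

Because both transformations are linear maps of the original values, histograms before and after differ only in the axis scale, not in shape.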
7 Mini Project: Company KPI Dashboard & Simulation
This mini project simulates a company KPI dashboard using synthetic data.
The dataset includes:
- Employee ID
- Company ID
- Salary
- Performance Score
- KPI Score
- Department
This represents a complete data science pipeline from simulation to dashboard-style insights.
# Function to generate dataset
generate_kpi_data <- function(n_company) {
  all_data <- data.frame()
  departments <- c("HR", "Finance", "IT", "Marketing", "Operations")
  for (c in 1:n_company) {
    n_employees <- sample(50:200, 1)
    for (e in 1:n_employees) {
      salary <- round(runif(1, 3000, 10000), 2)
      performance_score <- round(runif(1, 60, 100), 2)
      KPI_score <- round(runif(1, 50, 100), 2)
      temp <- data.frame(
        employee_id = paste0("E", c, "_", e),
        company_id = paste0("C", c),
        salary = salary,
        performance_score = performance_score,
        KPI_score = KPI_score,
        department = sample(departments, 1)
      )
      all_data <- dplyr::bind_rows(all_data, temp)
    }
  }
  return(all_data)
}
# Generate data
set.seed(123)
df <- generate_kpi_data(5)
# KPI Categorization
df$KPI_tier <- NA
for (i in 1:nrow(df)) {
  if (df$KPI_score[i] > 90) {
    df$KPI_tier[i] <- "Top Performer"
  } else if (df$KPI_score[i] > 75) {
    df$KPI_tier[i] <- "High"
  } else if (df$KPI_score[i] > 60) {
    df$KPI_tier[i] <- "Medium"
  } else {
    df$KPI_tier[i] <- "Low"
  }
}
# Summary per company
summary_company <- df %>%
group_by(company_id) %>%
summarise(
avg_salary = mean(salary),
avg_KPI = mean(KPI_score),
top_performers = sum(KPI_score > 90)
)
knitr::kable(summary_company,
caption = "Company KPI Summary",
align = "c")
| company_id | avg_salary | avg_KPI | top_performers |
|---|---|---|---|
| C1 | 6082.763 | 73.68413 | 12 |
| C2 | 6612.674 | 75.56000 | 17 |
| C3 | 6324.746 | 73.57075 | 23 |
| C4 | 6683.919 | 75.73927 | 17 |
| C5 | 6333.176 | 75.43160 | 35 |
# Department analysis
dept_summary <- df %>%
group_by(company_id, department) %>%
summarise(avg_KPI = mean(KPI_score), .groups = "drop")
# Salary distribution plot
p1 <- ggplot(df, aes(x = salary, fill = company_id)) +
geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
labs(title = "Salary Distribution by Company") +
theme_minimal()
ggplotly(p1, width = 700, height = 450)
# Grouped bar chart (department KPI)
p2 <- ggplot(dept_summary, aes(x = department, y = avg_KPI, fill = company_id)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Average KPI by Department and Company",
x = "Department",
y = "Average KPI"
) +
theme_minimal()
ggplotly(p2, width = 700, height = 450)
# Scatter plot with regression line
p3 <- ggplot(df, aes(x = salary, y = KPI_score, color = company_id)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Salary vs KPI Score",
x = "Salary",
y = "KPI Score"
) +
theme_minimal()
ggplotly(p3, width = 700, height = 450)
# Top performers table
top_performers <- df %>%
filter(KPI_score > 90) %>%
arrange(desc(KPI_score))
knitr::kable(head(top_performers),
caption = "Top Performers (KPI > 90)",
align = "c")
| employee_id | company_id | salary | performance_score | KPI_score | department | KPI_tier |
|---|---|---|---|---|---|---|
| E5_121 | C5 | 5099.34 | 71.58 | 99.96 | Operations | Top Performer |
| E3_122 | C3 | 6795.32 | 76.66 | 99.94 | Marketing | Top Performer |
| E5_7 | C5 | 6046.03 | 67.89 | 99.73 | Marketing | Top Performer |
| E5_34 | C5 | 7044.63 | 83.18 | 99.68 | Operations | Top Performer |
| E4_28 | C4 | 3425.87 | 69.40 | 99.65 | Operations | Top Performer |
| E3_99 | C3 | 8171.24 | 90.28 | 99.62 | IT | Top Performer |
Visualization Interpretation
The dashboard visualizations provide several insights:
- Salary Distribution: shows how salaries vary across companies. Overlapping distributions indicate similarities, while shifts suggest differences in pay structure.
- Grouped Bar Chart (Department KPI): compares average KPI across departments and companies.
  - Some departments score higher than others within a company, but departments are assigned at random in this simulation, so these gaps reflect sampling variation.
  - Differences between companies can be read the same way; with real data they would highlight organizational performance gaps.
- Scatter Plot with Regression Line: displays the relationship between salary and KPI score.
  - Salary and KPI score are generated independently, so the fitted lines should be close to flat; with real data, a positive slope would suggest that higher salaries are associated with higher KPI scores.
  - The spread of points shows variability among employees.
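Whether a fitted slope reflects more than noise can be checked numerically. The sketch below uses freshly simulated values with the same ranges as generate_kpi_data rather than the dashboard data itself; the sample size of 500 and the variable names are assumptions for illustration.

```r
set.seed(123)
n <- 500
salary <- runif(n, 3000, 10000)  # same range as the simulated salaries
kpi    <- runif(n, 50, 100)      # drawn independently of salary

# Pearson correlation with a test of the null hypothesis "no correlation"
ct <- cor.test(salary, kpi)
ct$estimate   # sample correlation, expected to be near 0
ct$p.value

# Slope of the regression line drawn by geom_smooth(method = "lm")
fit <- lm(kpi ~ salary)
coef(fit)["salary"]
```

Since the two variables are generated independently, the correlation should stay close to zero, and any visible tilt in a regression line comes from sampling variation.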
8 Automated Company Report Generation
This project automates the creation of company-level reports using functions and loops in R.
This demonstrates how automation can streamline reporting workflows in data science.
# Load libraries
library(dplyr)
library(ggplot2)
set.seed(123)
generate_kpi_data <- function(n_company) {
  all_data <- data.frame()
  departments <- c("HR", "Finance", "IT", "Marketing", "Operations")
  for (c in 1:n_company) {
    n_employees <- sample(50:200, 1)
    for (e in 1:n_employees) {
      temp <- data.frame(
        employee_id = paste0("E", c, "_", e),
        company_id = paste0("C", c),
        salary = runif(1, 3000, 10000),
        performance_score = runif(1, 60, 100),
        KPI_score = runif(1, 50, 100),
        department = sample(departments, 1)
      )
      all_data <- rbind(all_data, temp)
    }
  }
  return(all_data)
}
df <- generate_kpi_data(5)
# Function to generate report per company
generate_company_report <- function(data, company_name) {
  cat("====================================\n")
  cat("Report for Company:", company_name, "\n")
  cat("====================================\n")
  df_company <- data %>% filter(company_id == company_name)
  # Summary table (named company_summary to avoid masking base::summary)
  company_summary <- df_company %>%
    summarise(
      avg_salary = mean(salary),
      avg_KPI = mean(KPI_score),
      max_KPI = max(KPI_score)
    )
  print(company_summary)
  # Plot salary distribution
  p1 <- ggplot(df_company, aes(x = salary)) +
    geom_histogram(bins = 20, fill = "skyblue") +
    labs(title = paste("Salary Distribution -", company_name)) +
    theme_minimal()
  # Plot KPI distribution
  p2 <- ggplot(df_company, aes(x = KPI_score)) +
    geom_histogram(bins = 20, fill = "orange") +
    labs(title = paste("KPI Distribution -", company_name)) +
    theme_minimal()
  gridExtra::grid.arrange(p1, p2, ncol = 2, widths = c(1,1))
  # Export CSV
  write.csv(df_company, paste0("report_", company_name, ".csv"), row.names = FALSE)
}
# Loop through companies
companies <- unique(df$company_id)
for (c in companies) {
generate_company_report(df, c)
}
## ====================================
## Report for Company: C1
## ====================================
## avg_salary avg_KPI max_KPI
## 1 6082.763 73.68449 98.36992
## ====================================
## Report for Company: C2
## ====================================
## avg_salary avg_KPI max_KPI
## 1 6612.675 75.55986 99.45391
## ====================================
## Report for Company: C3
## ====================================
## avg_salary avg_KPI max_KPI
## 1 6324.746 73.57111 99.94413
## ====================================
## Report for Company: C4
## ====================================
## avg_salary avg_KPI max_KPI
## 1 6683.919 75.73943 99.65078
## ====================================
## Report for Company: C5
## ====================================
## avg_salary avg_KPI max_KPI
## 1 6333.177 75.43162 99.96369
Visualization Interpretation
Each company report includes:
- Summary Table: displays average salary, average KPI, and maximum KPI, giving a quick overview of company performance.
- Salary Distribution Plot: shows how employee salaries are spread within the company. A wider spread indicates higher variability in compensation.
- KPI Distribution Plot: illustrates employee performance levels. Concentration at higher values indicates stronger overall performance.
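The CSV export step can be sketched separately from the plotting. The function below is an illustrative variant: write_company_reports and the demo data frame are hypothetical, and writing to tempdir() instead of the working directory is an assumption made so the example leaves no files behind.

```r
# Hypothetical export helper: one CSV per company (illustrative only)
write_company_reports <- function(data, out_dir = tempdir()) {
  # split() returns one data frame per company_id
  pieces <- split(data, data$company_id)
  vapply(names(pieces), function(id) {
    path <- file.path(out_dir, paste0("report_", id, ".csv"))
    write.csv(pieces[[id]], path, row.names = FALSE)
    path
  }, character(1))
}

# Small made-up dataset used only to demonstrate the helper
demo <- data.frame(
  company_id = rep(c("C1", "C2"), each = 3),
  salary     = c(4000, 5500, 7000, 3500, 6000, 9000),
  KPI_score  = c(55, 72, 91, 60, 80, 95)
)

paths <- write_company_reports(demo)
file.exists(paths)
nrow(read.csv(paths[["C1"]]))   # 3
```

Returning the file paths makes it easy to verify that every report was written or to pass the files on to a later step.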