Functions & Loops + Data Science

Assignment ~ Week 5

1 Introduction

This assignment explores the use of functions, loops, and conditional logic in R for building structured and automated data science workflows. The project integrates simulation, mathematical modeling, and visualization to demonstrate scalable analysis techniques.

1.1 Objectives

Build multi-layer functions with nested loops and conditional logic.
Handle multi-dataset simulations.
Perform advanced statistics, data transformation, and visualization.
Develop an automated data science workflow.

2 Task 1 – Dynamic Multi-Formula Function

2.1 Function Definition

compute_formula <- function(x, formula) {
  # Takes an input x and a formula type; returns the result for that formula
  
  if (formula == "linear") {
    return(2*x + 3)  # linear
    
  } else if (formula == "quadratic") {
    return(0.5*x^2 + 2*x + 1) # quadratic
    
  } else if (formula == "cubic") {
    return(0.05*x^3 - 0.5*x^2 + x) # cubic
    
  } else if (formula == "exponential") {
    return(exp(x/6)) # exponential
    
  } else {
    stop("Invalid formula input!")
  }
}

2.2 Nested Loop Computation

x_values <- 1:20
formula_list <- c("linear","quadratic","cubic","exponential")

results <- data.frame()

for (f in formula_list) {
  for (x in x_values) {
    results <- rbind(results,
                     data.frame(x=x,
                                y=compute_formula(x,f),
                                formula=f))
  }
}
head(results)

2.3 Visualization (Revised)

library(ggplot2)  # ggplot(), geom_*(), theme_*()
library(plotly)   # ggplotly(), plot_ly()

p1 <- ggplot(results, aes(x=x, y=y, color=formula)) +
  geom_line() +
  geom_point() +
  theme_minimal() +       # remove the colored background
  theme(panel.background = element_blank(),
        plot.background = element_blank()) +
  labs(title="Dynamic Multi-Formula Function", y="Result (y)", x="Input (x)")

ggplotly(p1)

2.4 Interpretation

Multiple mathematical formulas (linear, quadratic, cubic, exponential) are computed dynamically using a single function.
Nested loops allow calculation across a range of input values for each formula.
Results are organized into a structured data frame for easy analysis.
Line plots with points visualize the output of all formulas, showing trends and differences clearly.
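
The nested loop grows the results table one row at a time with rbind(). Because the four formulas use only arithmetic that is vectorized over x, the same table can also be built with one call per formula. The following is a minimal sketch rather than the assignment's method; compute_formula_vec and results_vec are hypothetical names, and the helper restates the formulas from section 2.1 with switch().

```r
# Hypothetical compact restatement of the four formulas from section 2.1
compute_formula_vec <- function(x, formula) {
  switch(formula,
         linear      = 2 * x + 3,
         quadratic   = 0.5 * x^2 + 2 * x + 1,
         cubic       = 0.05 * x^3 - 0.5 * x^2 + x,
         exponential = exp(x / 6),
         stop("Invalid formula input!"))  # default branch for unknown names
}

x_values <- 1:20
formula_list <- c("linear", "quadratic", "cubic", "exponential")

# One data frame per formula, then a single rbind() instead of 80 incremental ones
results_vec <- do.call(rbind, lapply(formula_list, function(f) {
  data.frame(x = x_values, y = compute_formula_vec(x_values, f), formula = f)
}))

head(results_vec)
```

The output has the same columns (x, y, formula) and the same 80 rows as the loop version, so the plotting code can use it unchanged.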

3 Task 2 – Multi-Sales & Discounts

3.1 Function Definition

simulate_sales <- function(n_salesperson, days) {
  # Simulate sales data per salesperson per day
  
  sales_data <- data.frame()
  
  for (s in 1:n_salesperson) {
    for (d in 1:days) {
      sales_amount <- sample(200:1200,1)  # simulated sales amount
      
      if (sales_amount > 900) {
        discount <- 0.25
      } else if (sales_amount > 600) {
        discount <- 0.15
      } else {
        discount <- 0.05
      }
      
      sales_data <- rbind(sales_data,
                         data.frame(
                           salesperson_id=s,
                           day=d,
                           sales_amount=sales_amount,
                           discount_rate=discount))
    }
  }
  sales_data
}

3.2 Implementation

sales_data <- simulate_sales(5,10)
head(sales_data)

3.3 Nested Function (Cumulative)

calculate_cumulative <- function(df) {
  # Compute cumulative sales per salesperson
  df$cumulative <- ave(df$sales_amount,
                       df$salesperson_id,
                       FUN=cumsum)
  df
}
sales_data <- calculate_cumulative(sales_data)

3.4 Result

aggregate(sales_amount ~ salesperson_id, sales_data, sum)

3.5 Visualization

p2 <- ggplot(sales_data,
             aes(day, cumulative,
                 color=factor(salesperson_id))) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title="Cumulative Sales per Salesperson", y="Cumulative Sales", x="Day")

ggplotly(p2)

3.6 Interpretation

Daily sales data is simulated for multiple salespeople over a defined period.
Discounts are automatically assigned based on sales thresholds.
Cumulative sales per salesperson highlight top performers over time.
Line plots visualize cumulative sales trajectories, allowing performance comparison.
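
The discount rule is a threshold lookup, so it can also be applied to a whole vector of amounts at once. The sketch below assumes the same thresholds as simulate_sales() (900 and 600). The helper name assign_discount is hypothetical, and the set.seed() call is an addition that is not in the original code; it only makes the simulated amounts repeatable.

```r
# Hypothetical vectorized version of the discount rule from section 3.1
assign_discount <- function(sales_amount) {
  ifelse(sales_amount > 900, 0.25,
         ifelse(sales_amount > 600, 0.15, 0.05))
}

# Boundary values: the comparisons are strict, so 900 and 600 fall into the lower tier
assign_discount(c(200, 600, 601, 900, 901, 1200))
# 0.05 0.05 0.15 0.15 0.25 0.25

set.seed(123)  # assumed seed, only so the simulated amounts can be reproduced
amounts <- sample(200:1200, 10, replace = TRUE)
sim <- data.frame(day = 1:10,
                  sales_amount = amounts,
                  discount_rate = assign_discount(amounts),
                  cumulative = cumsum(amounts))  # running total for one salesperson
head(sim)
```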

4 Task 3 – Performance Categorization

4.1 Function Definition

categorize_performance <- function(x){
  # Categorize a sale by its amount
  if(x>1000) "Excellent"
  else if(x>800) "Very Good"
  else if(x>600) "Good"
  else if(x>400) "Average"
  else "Poor"
}

4.2 Loop Implementation

category <- c()
for(i in sales_data$sales_amount){
  category <- c(category, categorize_performance(i))
}
sales_data$category <- category

4.3 Result Table

category_df <- as.data.frame(prop.table(table(sales_data$category))*100)
colnames(category_df) <- c("Category","Percentage")
category_df

4.4 Visualization (Revised – Interactive Pie Chart)

p3 <- plot_ly(category_df, labels=~Category, values=~Percentage, type='pie') %>%
  layout(title='Sales Performance Categories',
         showlegend=TRUE)
p3

4.5 Interpretation

Sales amounts are categorized into performance levels: Poor, Average, Good, Very Good, Excellent.
This categorization identifies individual and overall team performance.
Pie charts illustrate the distribution of performance categories.
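
The same thresholds can be expressed with cut(), which classifies a whole vector in one call and returns a factor whose levels keep the Poor-to-Excellent order. This is a sketch of an alternative, not the loop used above; categorize_performance_vec is a hypothetical name.

```r
# Hypothetical vectorized equivalent of categorize_performance() using cut()
categorize_performance_vec <- function(x) {
  cut(x,
      breaks = c(-Inf, 400, 600, 800, 1000, Inf),
      labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
      right = TRUE)  # intervals are (lower, upper], matching the strict ">" tests
}

# Boundary values: 400 is still "Poor", 1000 is still "Very Good"
amounts <- c(400, 401, 800, 1000, 1001)
as.character(categorize_performance_vec(amounts))
# "Poor" "Average" "Good" "Very Good" "Excellent"

# Share of each category in percent, as in the result table
round(prop.table(table(categorize_performance_vec(amounts))) * 100, 1)
```

Because the result is a factor with all five levels, a category with no observations still appears in the table with 0 percent.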

5 Task 4 – Company Dataset Simulation

5.1 Function Definition

generate_company_data <- function(n_company,n_employees){
  # Generate employee data per company
  
  data <- data.frame()
  
  for(c in 1:n_company){
    for(e in 1:n_employees){
      data <- rbind(data,
                    data.frame(
                      company_id=c,
                      employee_id=e,
                      salary=sample(3000:10000,1),
                      department=sample(c("HR","IT","Finance","Marketing"),1),
                      performance_score=runif(1,60,100),
                      KPI_score=runif(1,50,100)
                    ))
    }
  }
  data
}

5.2 Implementation

company_data <- generate_company_data(3,20)
company_data$top_performer <- ifelse(company_data$KPI_score > 90,"Yes","No")

5.3 Summary

summary_company <- aggregate(cbind(salary,performance_score,KPI_score)~company_id,
                            company_data,
                            function(x) c(mean=mean(x), max=max(x)))
summary_company

5.4 Visualization

p5 <- plot_ly(company_data, x=~factor(company_id), y=~salary, color=~factor(company_id),
              type='box') %>%
  layout(title='Salary Distribution per Company')
p5

5.5 Interpretation

Synthetic company data includes employee salary, department, performance score, and KPI score.
Summary statistics provide a snapshot of workforce characteristics per company.
Boxplots show salary distributions, revealing variability within and across companies.
Top performers are identified based on KPI scores.
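
For larger simulations, the row-by-row rbind() can be replaced by drawing every column in one call. The sketch below assumes the same column names and value ranges as generate_company_data(); generate_company_data_vec and company_small are hypothetical names. The random draws happen in a different order than in the loop version, so the two functions do not produce identical rows even with the same seed.

```r
# Hypothetical vectorized generator; column names follow section 5.1
generate_company_data_vec <- function(n_company, n_employees) {
  n <- n_company * n_employees
  data.frame(
    # each= lists all employees of company 1 first, then company 2, and so on
    company_id = rep(seq_len(n_company), each = n_employees),
    employee_id = rep(seq_len(n_employees), times = n_company),
    salary = sample(3000:10000, n, replace = TRUE),
    department = sample(c("HR", "IT", "Finance", "Marketing"), n, replace = TRUE),
    performance_score = runif(n, 60, 100),
    KPI_score = runif(n, 50, 100)
  )
}

set.seed(42)  # assumed seed for a repeatable example
company_small <- generate_company_data_vec(3, 20)
company_small$top_performer <- ifelse(company_small$KPI_score > 90, "Yes", "No")

# Count of top performers ("Yes") and others ("No") per company
table(company_small$company_id, company_small$top_performer)
```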

6 Task 5 – Monte Carlo Simulation

6.1 Function Definition

monte_carlo_pi <- function(n){
  # Estimate π with Monte Carlo
  
  inside <- 0
  x <- runif(n)
  y <- runif(n)
  
  for(i in 1:n){
    if(x[i]^2 + y[i]^2 <= 1){
      inside <- inside + 1
    }
  }
  
  pi_est <- 4 * inside / n
  prob <- mean(x<=0.5 & y<=0.5)
  
  list(pi=pi_est, prob=prob, x=x, y=y)
}

6.2 Run Simulation

mc <- monte_carlo_pi(5000)
mc$pi

## [1] 3.196

mc$prob

## [1] 0.251

6.3 Visualization

df <- data.frame(x=mc$x, y=mc$y)
df$inside <- (df$x^2 + df$y^2 <= 1)

p6 <- plot_ly(df, x=~x, y=~y, color=~inside, type='scatter', mode='markers',
              marker=list(size=5, opacity=0.5)) %>%
  layout(title='Monte Carlo Simulation Points Inside vs Outside Circle')
p6

6.4 Interpretation

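Monte Carlo simulation estimates π by sampling random points uniformly in the unit square and multiplying the share that falls inside the quarter circle (x^2 + y^2 <= 1) by 4.
The simulation also estimates the probability that a point falls in the lower-left quarter of the square (x <= 0.5 and y <= 0.5), which is 0.25 in theory.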
Scatter plots visualize points inside vs. outside the circle.
This demonstrates statistical estimation and visual validation.
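
The counting loop can be replaced by a logical vector, and the same vector gives a standard error for the estimate. Each point is inside the quarter circle with probability π/4, so the share of inside points has a binomial standard error, which is multiplied by 4 for the π estimate. For n = 5000 this comes to about 0.023. The sketch below is an illustration under these assumptions; monte_carlo_pi_vec and the seed are not part of the original code.

```r
# Hypothetical vectorized variant of monte_carlo_pi() that also reports a standard error
monte_carlo_pi_vec <- function(n) {
  x <- runif(n)
  y <- runif(n)
  inside <- x^2 + y^2 <= 1          # TRUE for points inside the quarter circle
  p_hat <- mean(inside)             # estimates pi / 4
  list(pi = 4 * p_hat,
       se = 4 * sqrt(p_hat * (1 - p_hat) / n))  # binomial standard error, scaled by 4
}

set.seed(2024)  # assumed seed for a repeatable example
for (n in c(1000, 10000, 100000)) {
  est <- monte_carlo_pi_vec(n)
  cat(sprintf("n = %6d  pi_hat = %.4f  se = %.4f\n", n, est$pi, est$se))
}
```

The standard error shrinks with the square root of n, so 100 times more points give roughly one more correct decimal digit.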

7 Task 6 – Data Transformation

7.1 Functions

normalize_columns <- function(df){
  # Min-max normalize the numeric columns (ID columns are numeric too, so they are rescaled as well)
  num <- sapply(df,is.numeric)
  for(col in names(df)[num]){
    df[[col]] <- (df[[col]]-min(df[[col]]))/
                 (max(df[[col]])-min(df[[col]]))
  }
  df
}

z_score <- function(df){
  # Z-score the numeric columns
  num <- sapply(df,is.numeric)
  for(col in names(df)[num]){
    df[[col]] <- (df[[col]]-mean(df[[col]]))/sd(df[[col]])
  }
  df
}

7.2 Apply Transformation

normalized_data <- normalize_columns(company_data)
z_data <- z_score(company_data)

# Feature engineering
company_data$salary_bracket <- cut(company_data$salary,
                                   breaks=c(0,4000,7000,10000),
                                   labels=c("Low","Medium","High"))

7.3 Visualization (All Plotly)

# Histogram Salary
p_hist <- plot_ly(company_data, x=~salary, type='histogram') %>%
  layout(title='Salary Histogram')
p_hist

# Boxplot Salary by Bracket
p_box <- plot_ly(company_data, x=~salary_bracket, y=~salary, type='box', color=~salary_bracket) %>%
  layout(title='Salary by Bracket')
p_box

7.4 Interpretation

Numeric columns are normalized (min-max) and standardized (z-score).
Feature engineering, such as salary brackets, enables meaningful categorization.
Histograms and boxplots highlight patterns and variability in salaries.
Data transformations facilitate comparison and downstream analysis.
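
A small worked example shows what the two transformations do to a single column. The vector below is made up for illustration and is not taken from company_data.

```r
x <- c(2, 4, 6, 8, 10)

# Min-max normalization: (x - min) / (max - min), mapped onto [0, 1]
min_max <- (x - min(x)) / (max(x) - min(x))
min_max
# 0.00 0.25 0.50 0.75 1.00

# Z-score: (x - mean) / sd, using the sample standard deviation from sd()
z <- (x - mean(x)) / sd(x)
round(z, 3)
# -1.265 -0.632  0.000  0.632  1.265

# The standardized values have mean 0 and standard deviation 1
c(mean = mean(z), sd = sd(z))
```

Min-max keeps the shape of the distribution but fixes its range, while the z-score fixes its center and spread. Both formulas divide by zero when a column is constant.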

8 Task 7 – KPI Dashboard Mini Project

8.1 Generate Data

company_big <- generate_company_data(5,100)

8.2 KPI Summary per Company

library(dplyr)  # group_by(), summarise()

summary_kpi <- company_big %>%
  group_by(company_id) %>%
  summarise(avg_salary=mean(salary),
            avg_KPI=mean(KPI_score),
            top_performers=sum(KPI_score>90))
summary_kpi

8.3 Categorize KPI Tiers

company_big$KPI_tier <- cut(company_big$KPI_score,
                            breaks=c(0,60,75,90,100),
                            labels=c("Low","Medium","High","Top"))

8.4 Top Performers Table

top_perf <- subset(company_big, KPI_score > 90)
head(top_perf)

8.5 Visualizations

# Scatter KPI vs Salary with Regression
p7_scatter <- plot_ly(company_big, x=~KPI_score, y=~salary,
                      color=~factor(company_id), type='scatter', mode='markers') %>%
  layout(title='KPI vs Salary per Company') %>%
  add_lines(x=~KPI_score, y=~predict(lm(salary~KPI_score, data=company_big)), line=list(color='black'))
p7_scatter

# Grouped Bar Chart Department Count per Company
dept_summary <- company_big %>%
  group_by(company_id, department) %>%
  summarise(count=n(), .groups="drop")  # drop the grouping after summarising

p7_bar <- plot_ly(dept_summary, x=~factor(company_id), y=~count, color=~department,
                  type='bar') %>%
  layout(barmode='group', title='Employees per Department per Company')
p7_bar

# Salary Distribution Histogram
p7_salary <- plot_ly(company_big, x=~salary, type='histogram') %>%
  layout(title='Salary Distribution')
p7_salary

8.6 Interpretation

KPI dashboards summarize average salary, average KPI, and top performer counts per company.
Employees are categorized into KPI tiers (Low, Medium, High, Top).
Scatter plots show KPI vs. Salary relationships, with regression lines for trends.
Bar charts display department-wise employee counts per company.
Histograms display salary distributions across companies.
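
The tier boundaries are easiest to check on a few hand-picked scores. By default cut() uses intervals that are open on the left and closed on the right, so a score of exactly 90 lands in "High" and is not counted by the KPI_score > 90 rule; the two definitions agree. The scores below are made up for illustration.

```r
kpi <- c(55, 60, 75.5, 90, 95)

tier <- cut(kpi,
            breaks = c(0, 60, 75, 90, 100),
            labels = c("Low", "Medium", "High", "Top"))  # intervals are (lower, upper]

# Tier next to the top-performer flag used in the KPI summary
data.frame(KPI_score = kpi,
           KPI_tier = tier,
           top_performer = kpi > 90)
```

Only the score 95 is both in the "Top" tier and flagged as a top performer.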

9 Task 8 – Automated Company Report

library(plotly)
library(dplyr)
library(kableExtra)
library(htmltools)
library(ggplot2)

# Function to generate report per company
generate_company_report <- function(company_df, company_id){
  df <- company_df %>% filter(company_id == !!company_id)
  
  # Summary table
  summary_tbl <- df %>%
    summarise(
      Company = unique(company_id),
      Avg_Salary = round(mean(salary),2),
      Avg_KPI = round(mean(KPI_score),2),
      Top_Performers = sum(KPI_score > 90)
    )
  
  # Scatter plot: KPI vs Salary
  # Color only the points so geom_smooth() fits a single line per company
  p_scatter <- ggplot(df, aes(x=KPI_score, y=salary)) +
    geom_point(aes(color=department), alpha=0.7) +
    geom_smooth(method="lm", formula=y ~ x, se=FALSE, color="black") +
    labs(title=paste("KPI vs Salary - Company", company_id),
         x="KPI Score", y="Salary") +
    theme_minimal()
  
  # Histogram: Salary
  p_hist <- ggplot(df, aes(x=salary)) +
    geom_histogram(fill="#7C3AED", bins=15, alpha=0.7) +
    labs(title=paste("Salary Distribution - Company", company_id),
         x="Salary", y="Count") +
    theme_minimal()
  
  # Bar chart: Department count
  dept_summary <- df %>% group_by(department) %>% summarise(count=n())
  p_dept <- plot_ly(dept_summary, x=~department, y=~count, type='bar', color=~department) %>%
    layout(title=paste("Employees per Department - Company", company_id),
           xaxis=list(title="Department"),
           yaxis=list(title="Count"))
  
  list(
    summary = summary_tbl,
    scatter = plotly::ggplotly(p_scatter),
    histogram = plotly::ggplotly(p_hist),
    bar_chart = p_dept
  )
}

# Loop over all companies
company_ids <- unique(company_big$company_id)
reports <- lapply(company_ids, function(cid){
  generate_company_report(company_big, cid)
})
names(reports) <- paste0("Company_", company_ids)

# Render reports
report_html <- lapply(company_ids, function(cid){
  rep <- reports[[paste0("Company_", cid)]]
  
  tagList(
    h2(paste("Company", cid)),
    HTML(kable(rep$summary, "html") %>%
           kable_styling(full_width = FALSE, position = "center", bootstrap_options = "striped")),
    rep$scatter,
    rep$histogram,
    rep$bar_chart
  )
})

browsable(tagList(report_html))

Company 1

Company	Avg_Salary	Avg_KPI	Top_Performers
1	6399.8	72.3	17

Company 2

Company	Avg_Salary	Avg_KPI	Top_Performers
2	6630.13	77.11	21

Company 3

Company	Avg_Salary	Avg_KPI	Top_Performers
3	6637.58	74.61	17

Company 4

Company	Avg_Salary	Avg_KPI	Top_Performers
4	6451.07	75.57	18

Company 5

Company	Avg_Salary	Avg_KPI	Top_Performers
5	6484.17	77.89	29

# Export combined summary CSV
dashboard_data <- bind_rows(lapply(reports, function(r) r$summary))
write.csv(dashboard_data, "dashboard_company_data.csv", row.names = FALSE)

9.1 Interpretation

Automatic reports are generated for each company.
Scatter plots show the relationship between KPI score and salary, with a fitted regression line.
Histograms display the salary distribution for each company.
Bar charts display the number of employees per department.
Every step is a function applied in a loop (lapply over the company IDs), which keeps the workflow scalable.
CSV export allows data to be used for further analysis or additional reporting.
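
The core of the report is the pattern "one function, applied to each company, results combined". A minimal base R sketch of that pattern follows; the toy values and the names toy, summarise_company, and dashboard are hypothetical and stand in for company_big and generate_company_report().

```r
# Toy data (hypothetical values) with the same columns used by the report
toy <- data.frame(company_id = c(1, 1, 2, 2),
                  salary     = c(4000, 6000, 5000, 7000),
                  KPI_score  = c(95, 80, 91, 92))

# Function applied to the rows of one company
summarise_company <- function(df) {
  data.frame(Company        = df$company_id[1],
             Avg_Salary     = round(mean(df$salary), 2),
             Avg_KPI        = round(mean(df$KPI_score), 2),
             Top_Performers = sum(df$KPI_score > 90))
}

# split() yields one data frame per company; lapply() applies the function to each
per_company <- lapply(split(toy, toy$company_id), summarise_company)

# Stack the one-row summaries into a single table, ready for write.csv()
dashboard <- do.call(rbind, per_company)
dashboard
```

Adding a company to the data adds a row to the output without any change to the code, which is what makes the approach scale.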

10 Conclusion

This assignment shows how functions and loops can automate data workflows. Dynamic computations, simulations, and nested loops were used to analyze sales, employee performance, and a Monte Carlo estimate of π in a structured, repeatable way. Visualizations such as line plots, scatter plots, histograms, and bar charts revealed trends, distributions, and top performers. Data transformations and feature engineering improved clarity and enabled meaningful categorization. Automated dashboards and company reports summarized key metrics, and CSV exports allow further analysis and reporting. Overall, the project demonstrates how structured R workflows can integrate computation, visualization, and reporting to generate actionable insights efficiently.

References

Siregar, B. (n.d.). Data Science Programming: Study Case Using R and Python. Retrieved from https://bookdown.org/dsciencelabs/data_science_programming/