ULIN NIKMAH (52250042)

INSTITUT TEKNOLOGI SAINS BANDUNG

Course: Data Science Programming
Study Program: Data Science
Lecturer: Bakti Siregar, M.Sc., CDS.

Introduction

This practicum aims to practice the use of functions, loops, and conditional logic in a data science context. Through several tasks, data simulations are performed, including mathematical function computation, sales analysis, and company dataset generation.

In addition, this practicum includes data transformation, statistical analysis, and visualization to understand data patterns. Each task is designed to help students build a more structured and realistic data science workflow.

1. Task 1 - Dynamic Multi-Formula Function

This program aims to compute and compare multiple mathematical functions (linear, quadratic, cubic, and exponential) dynamically using loops and visualize them on a single graph.

Visualization

library(ggplot2)
library(dplyr)
library(plotly)

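# Compute y = f(x) for each requested formula and plot all curves on one graph
compute_formula <- function(x, formulas){
  results <- data.frame()

  for(f in formulas){
    # Evaluate the selected formula at every x value
    y <- sapply(x, function(val){
      if(f == "linear") return(2*val + 1)
      else if(f == "quadratic") return(val^2 + 2*val + 1)
      else if(f == "cubic") return(val^3)
      else if(f == "exponential") return(exp(val))
      else stop("Unknown formula: ", f)  # fail loudly instead of returning NULL
    })
    # Append this formula's points in long format (one row per x value)
    results <- rbind(results, data.frame(x=x, y=y, formula=f))
  }

  p <- ggplot(results, aes(x=x, y=y, color=formula)) +
    geom_line() + geom_point() +
    ggtitle("Function Comparison") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

  # Convert the static ggplot into an interactive plotly chart
  ggplotly(p)
}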

x <- 1:20
compute_formula(x, c("linear","quadratic","cubic","exponential"))

Interpretation

The graph compares four functions (linear, quadratic, cubic, and exponential) over the range \(x = 1\) to \(x = 20\).

It can be observed that:

  • The linear function increases steadily at a constant rate.
  • The quadratic and cubic functions grow faster than the linear function but remain within a moderate scale (441 and 8,000 at \(x = 20\)).
  • The exponential function increases extremely rapidly, especially for \(x > 10\), reaching roughly \(4.85 \times 10^8\) at \(x = 20\), far larger than the other functions.

As a result, the exponential function dominates the graph scale, making the other functions appear almost flat near the bottom. This highlights that exponential growth is significantly faster than linear and polynomial growth.
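
The scale gap described above can be checked numerically. The following is a minimal sketch rather than the report's own code: it stores the same four formulas in a named list and evaluates them with vectorized arithmetic. The names `formula_table` and `evaluate_formulas` are illustrative assumptions.

```r
# Illustrative lookup table of the four formulas used in Task 1.
# `formula_table` and `evaluate_formulas` are assumed names, not part of the report.
formula_table <- list(
  linear      = function(x) 2 * x + 1,
  quadratic   = function(x) x^2 + 2 * x + 1,
  cubic       = function(x) x^3,
  exponential = function(x) exp(x)
)

evaluate_formulas <- function(x, formulas) {
  unknown <- setdiff(formulas, names(formula_table))
  if (length(unknown) > 0) stop("Unknown formula: ", paste(unknown, collapse = ", "))
  # One data frame per formula, stacked into long format
  pieces <- lapply(formulas, function(f) {
    data.frame(x = x, y = formula_table[[f]](x), formula = f)
  })
  do.call(rbind, pieces)
}

values <- evaluate_formulas(1:20, names(formula_table))
at_20 <- subset(values, x == 20)
print(at_20)

# log10 compresses the range, which is why a log-scaled y axis keeps all curves visible
print(round(log10(at_20$y), 2))
```

If the goal is to keep all four curves readable on one chart, adding `scale_y_log10()` to the ggplot object is one way to apply the same idea to the plot.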

2. Task 2 - Nested Simulation: Multi Sales & Discounts

This simulation is designed to analyze sales data from multiple salespersons over several days, including discount calculations, cumulative sales, and performance summaries.

Visualization

library(ggplot2)
library(dplyr)
library(plotly)
library(knitr)
library(kableExtra)
library(readr)

# =========================
# LOAD DATA
# =========================
sales_df <- read_csv("sales_data_final.csv")

# =========================
# ADD COLUMNS
# =========================
sales_df <- sales_df %>%
  mutate(
    discount_rate = case_when(
      sales_amount > 800 ~ 0.2,
      sales_amount > 500 ~ 0.1,
      TRUE ~ 0.05
    ),
    final_sales = sales_amount * (1 - discount_rate)
  ) %>%
  group_by(sales_id) %>%
  mutate(cumulative_sales = cumsum(sales_amount)) %>%
  ungroup()

# =========================
# SUMMARY
# =========================
summary_stats <- sales_df %>%
  group_by(sales_id) %>%
  summarise(
    total_sales = sum(sales_amount),
    total_final_sales = sum(final_sales),
    avg_sales = mean(sales_amount),
    max_sales = max(sales_amount),
    min_sales = min(sales_amount)
  )

kable(summary_stats, "html", caption = "Summary Statistics per Salesperson") %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped","hover","condensed","responsive"))

Summary Statistics per Salesperson

sales_id  total_sales  total_final_sales  avg_sales  max_sales  min_sales
       1         3407             3057.0   486.7143        992        253
       2         3120             2804.0   445.7143        775       -180
       3         4216             3701.9   602.2857       1402        253

# =========================
# PLOT
# =========================
p <- ggplot(sales_df, aes(x=day, y=cumulative_sales, color=factor(sales_id))) +
  geom_line(linewidth=1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size=2) +
  labs(
    title = "Cumulative Sales per Salesperson",
    x = "Day",
    y = "Cumulative Sales",
    color = "Salesperson"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size=14),
    axis.title = element_text(face="bold")
  )

ggplotly(p)

Interpretation

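Based on the code, sales data is processed using tiered discounts (20% for amounts above 800, 10% for amounts above 500, and 5% otherwise), producing final_sales. cumulative_sales is the running total of the undiscounted sales_amount for each salesperson.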

From the chart:

  • Salesperson 3 shows the best performance with the highest growth.
  • Salesperson 1 is relatively stable.
  • Salesperson 2 grows early but slows down later.

From the summary:

  • Salesperson 3 has the highest total and average sales.
  • Salesperson 2 has a negative minimum value (-180), indicating possible returns or data errors. Because the discount rule does not check the sign, this record still falls into the 5% tier.

Conclusion: Salesperson 3 performs the best, while Salesperson 2 needs evaluation.
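
The discount tiers and the running total can be reproduced in base R on a small example. This is a sketch under stated assumptions: the data in `toy_sales` is invented for illustration (the real values are in sales_data_final.csv), and the `needs_review` flag is one possible way to surface non-positive amounts, not something the report's code does.

```r
# Hypothetical toy data; the real values live in sales_data_final.csv
toy_sales <- data.frame(
  sales_id     = c(1, 1, 1, 2, 2, 2),
  day          = c(1, 2, 3, 1, 2, 3),
  sales_amount = c(900, 550, 300, 600, -180, 820)
)

# Same tiers as the report: > 800 -> 20%, > 500 -> 10%, otherwise 5%
discount_rate <- function(amount) {
  ifelse(amount > 800, 0.20, ifelse(amount > 500, 0.10, 0.05))
}

toy_sales$discount_rate <- discount_rate(toy_sales$sales_amount)
toy_sales$final_sales   <- toy_sales$sales_amount * (1 - toy_sales$discount_rate)

# Sort by salesperson and day so the running total follows time order
toy_sales <- toy_sales[order(toy_sales$sales_id, toy_sales$day), ]
toy_sales$cumulative_sales <- ave(toy_sales$sales_amount, toy_sales$sales_id, FUN = cumsum)

# Assumption: non-positive amounts are flagged for review rather than silently discounted
toy_sales$needs_review <- toy_sales$sales_amount <= 0

print(toy_sales)
```

Sorting before `cumsum` matters: a running total is only meaningful if the rows of each salesperson are in day order.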

3. Task 3 - Multi Level Performance Categorization

This analysis aims to classify sales data into five performance categories and calculate their distribution and percentages through visualization.

Visualization

library(ggplot2)
library(dplyr)
library(plotly)
library(knitr)
library(kableExtra)
library(readr)
library(RColorBrewer)

# =========================
# LOAD CSV DATA
# =========================
sales_df <- read_csv("sales_data_final.csv")

# =========================
# CATEGORY FUNCTION
# =========================
categorize_performance <- function(sales){
  category <- sapply(sales, function(s){
    if(s > 800) "Excellent"
    else if(s > 600) "Very Good"
    else if(s > 400) "Good"
    else if(s > 200) "Average"
    else "Poor"
  })
  data.frame(sales_amount = sales, performance_category = category)
}

# =========================
# APPLY TO CSV DATA
# =========================
perf_df <- categorize_performance(sales_df$sales_amount)

# =========================
# SUMMARY
# =========================
perf_summary <- perf_df %>%
  group_by(performance_category) %>%
  summarise(count = n()) %>%
  mutate(percentage = round(count / sum(count) * 100, 2)) %>%
  arrange(desc(count))

kable(perf_summary, "html", caption = "Performance Distribution Summary") %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Performance Distribution Summary

performance_category  count  percentage
Good                      7       33.33
Average                   6       28.57
Very Good                 4       19.05
Excellent                 2        9.52
Poor                      2        9.52

# =========================
# BAR PLOT
# =========================
bar_plot <- ggplot(perf_summary, aes(x=performance_category, y=count, fill=performance_category)) +
  geom_bar(stat="identity") +
  labs(title="Performance Distribution", x="Category", y="Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5, face="bold"))

ggplotly(bar_plot)
# =========================
# PIE CHART
# =========================
pie_chart <- plot_ly(
  perf_summary,
  labels = ~performance_category,
  values = ~count,
  type = 'pie',
  textposition = 'inside',
  textinfo = 'label+percent',
  hoverinfo = 'label+value+percent',
  marker = list(
    colors = brewer.pal(n = 5, name = "Set2"),
    line = list(color = '#FFFFFF', width = 1)
  )
) %>%
  layout(
    title = list(text="Performance Distribution", font=list(size=18)),
    showlegend = TRUE
  )

pie_chart

Interpretation

Based on the code, sales data is categorized into five performance levels based on sales amount.

The results show:

  • Good is the most common category (33.33%, 7 of 21 records), indicating generally good performance.
  • Average is the second most common (28.57%, 6 records), showing some standard-level performance.
  • Very Good accounts for a sizable share (19.05%, 4 records).
  • Excellent and Poor have the lowest proportions (9.52% each, 2 records each).

Conclusion: Overall, sales performance is relatively stable at a medium-to-high level, but improvements are still needed to increase the number of Excellent performances.
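
The five-level rule can also be written without an element-wise loop. The sketch below uses base R's `cut()` with the same thresholds; the function name `categorize_with_cut` and the values in `sample_sales` are assumptions chosen to exercise each boundary, not data from the report.

```r
# Vectorized categorization with cut(). Intervals are (lower, upper], so a value
# of exactly 800 is "Very Good" and 801 is "Excellent", matching the if/else chain.
categorize_with_cut <- function(sales) {
  cut(
    sales,
    breaks = c(-Inf, 200, 400, 600, 800, Inf),
    labels = c("Poor", "Average", "Good", "Very Good", "Excellent"),
    right  = TRUE
  )
}

# Hypothetical sample values, chosen to hit every boundary
sample_sales <- c(-180, 200, 201, 400, 600, 601, 800, 801, 1402)
categories <- categorize_with_cut(sample_sales)
print(data.frame(sales_amount = sample_sales, performance_category = categories))

# Counts and percentages, analogous to perf_summary
counts <- table(categories)
print(round(100 * prop.table(counts), 2))
```

A side effect of `cut()` is that the result is a factor whose levels run from Poor to Excellent, so a bar chart built from it follows that order instead of alphabetical order.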

4. Task 4 - Multi Company Dataset Simulation

This program analyzes a simulated employee dataset covering multiple companies. The version shown loads the data from a CSV file rather than generating it randomly, then summarizes average salary and KPI for each company.

Visualization

library(ggplot2)
library(dplyr)
library(plotly)
library(DT)
library(htmltools)
library(readr)

# =========================
# LOAD CSV DATA (REPLACES RANDOM GENERATION)
# =========================
company_df <- read_csv("company_data_final.csv")

# =========================
# DETAIL TABLE
# =========================
df1 <- company_df %>%
  arrange(company_id, employee_id) %>%
  select(company_id, employee_id, salary, department, performance_score, KPI_score)

datatable(df1,
          options=list(scrollX=TRUE, lengthMenu=c(10,25,50,100)),
          caption=tags$caption(
            style='caption-side: bottom; text-align: center;',
            'Table: ', em('Company Employee Dataset'))
)
# =========================
# SUMMARY
# =========================
summary_df <- company_df %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = mean(salary),
    avg_performance = mean(KPI_score),  # despite the name, this is the mean KPI score
    max_KPI = max(KPI_score)
  )

datatable(summary_df,
          options=list(scrollX=TRUE),
          caption=tags$caption(
            style='caption-side: bottom; text-align: center;',
            'Table: ', em('Company Summary'))
)
# =========================
# PLOT
# =========================
p1 <- ggplot(summary_df, aes(x=factor(company_id), y=avg_salary)) +
  geom_bar(stat="identity", fill="steelblue") +
  geom_text(aes(label=round(avg_salary,0)), nudge_y=200) +
  labs(title="Average Salary per Company", x="Company ID", y="Average Salary") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold", size=18))

p2 <- ggplot(summary_df, aes(x=factor(company_id), y=avg_performance)) +
  geom_bar(stat="identity", fill="darkgreen") +
  geom_text(aes(label=round(avg_performance,1)), nudge_y=2) +
  labs(title="Average KPI Score per Company", x="Company ID", y="Average KPI") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold", size=18))

p3 <- ggplot(summary_df, aes(x=factor(company_id), y=max_KPI)) +
  geom_bar(stat="identity", fill="orange") +
  geom_text(aes(label=max_KPI), nudge_y=2) +
  labs(title="Maximum KPI Score per Company", x="Company ID", y="Max KPI") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold", size=18))

ggplotly(p1)
ggplotly(p2)
ggplotly(p3)

Interpretation

Based on the code, company data is processed to obtain average salary, average KPI, and maximum KPI for each company.

The results show:

  • Company 1 has the highest average salary (6696) but the lowest average KPI (69.3).
  • Company 2 has the lowest average salary (6024) but the highest average KPI (80.1) and maximum KPI (100).
  • Company 3 is in the middle in terms of both salary (6253) and KPI (75.8; max 96).

Conclusion: Higher salary does not necessarily correlate with better performance. Company 2 demonstrates the best performance despite having the lowest average salary.
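
The code in this task reads the dataset from a CSV file, so the nested-loop generation named in the task title is not shown. A minimal sketch of how such a dataset could be generated follows. Everything about the generator is an assumption: the function name `simulate_companies`, the value ranges, and the department labels are illustrative, and only the column names are taken from the table above.

```r
set.seed(123)  # reproducible draws

# Hypothetical generator: parameter names, ranges, and department labels are assumptions
simulate_companies <- function(n_companies, n_employees,
                               departments = c("Finance", "HR", "IT", "Marketing")) {
  rows <- vector("list", n_companies * n_employees)  # one slot per employee
  k <- 0
  for (company in seq_len(n_companies)) {        # outer loop: companies
    for (employee in seq_len(n_employees)) {     # inner loop: employees within a company
      k <- k + 1
      rows[[k]] <- data.frame(
        company_id        = company,
        employee_id       = employee,
        salary            = round(runif(1, min = 3000, max = 10000)),
        department        = sample(departments, 1),
        performance_score = round(runif(1, min = 50, max = 100), 1),
        KPI_score         = round(runif(1, min = 50, max = 100), 1)
      )
    }
  }
  do.call(rbind, rows)  # bind once at the end instead of growing a data frame
}

sim_df <- simulate_companies(n_companies = 3, n_employees = 20)

# Per-company means in base R, mirroring the dplyr summary
summary_sim <- aggregate(cbind(salary, KPI_score) ~ company_id, data = sim_df, FUN = mean)
print(summary_sim)
```

Collecting rows in a preallocated list and binding them once keeps the nested loops readable while avoiding repeated copying of the growing data frame.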

5. Task 5 - Monte Carlo Simulation: Pi & Probability

This Monte Carlo simulation is used to estimate the value of π and compute the probability of random points falling within a specific area through iterative processes.

Visualization

library(ggplot2)
library(plotly)
set.seed(123)

monte_carlo_pi <- function(n_points){
  inside_count <- 0
  # Preallocate the columns; growing a data frame with rbind() on every iteration is slow
  x <- numeric(n_points)
  y <- numeric(n_points)
  inside <- logical(n_points)
  
  for(i in seq_len(n_points)){
    x[i] <- runif(1)
    y[i] <- runif(1)
    inside[i] <- x[i]^2 + y[i]^2 <= 1
    if(inside[i]) inside_count <- inside_count + 1
  }
  points_df <- data.frame(x=x, y=y, inside=inside)
  
  pi_estimate <- 4 * inside_count / n_points
  cat("Estimated Pi:", pi_estimate, "\n")
  
  in_subsquare <- sum(points_df$x >= 0.25 & points_df$x <= 0.75 & points_df$y >= 0.25 & points_df$y <= 0.75)
  prob_subsquare <- in_subsquare / n_points
  cat("Probability in sub-square [0.25,0.75]^2:", prob_subsquare, "\n")
  
  p <- ggplot(points_df, aes(x=x, y=y, color=inside)) +
    geom_point(alpha=0.6) +
    scale_color_manual(values=c("red","blue"), labels=c("Outside Circle","Inside Circle")) +
    coord_fixed() +
    labs(title="Monte Carlo Simulation of Pi", subtitle=paste("n =", n_points), x="X", y="Y", color="Legend") +
    theme_minimal() + theme(plot.title=element_text(hjust=0.5, face="bold", size=18), plot.subtitle=element_text(hjust=0.5))
  
  ggplotly(p)
}

monte_carlo_pi(3000)
## Estimated Pi: 3.176 
## Probability in sub-square [0.25,0.75]^2: 0.256

Interpretation

Monte Carlo Simulation for Estimating π

This simulation applies the Monte Carlo method to estimate the value of π by randomly generating points inside a 1×1 square. Each point is classified as inside the quarter circle (blue) or outside (red) using the condition

\[ x^2 + y^2 \leq 1 \]

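Because the quarter circle of radius 1 has area π/4 and the unit square has area 1, the fraction of points inside the quarter circle approximates π/4. The value of π is therefore estimated using:

\[ \pi \approx 4 \times \frac{\text{number of points inside the quarter circle}}{\text{total points}} \]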

In the plot, blue points (TRUE) lie inside the quarter circle, while red points (FALSE) lie outside. With n = 3000 points, the run above gives an estimate of 3.176, about 1.1% above the true value (≈ 3.1416). The estimate tends to move closer to π as the number of points increases.

Additionally, the code calculates the probability of points falling within a sub-square

\[ [0.25, 0.75]^2 \]

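representing an empirical probability over a specific region. Since the sub-square has area 0.5 × 0.5 = 0.25, the simulated value of 0.256 is close to the theoretical probability.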

In short: this simulation demonstrates how probabilistic methods can approximate π. Larger sample sizes tend to give more accurate results, with the typical error shrinking in proportion to \(1/\sqrt{n}\).
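
How the accuracy depends on the number of points can be illustrated with a vectorized version of the estimator that also reports a standard error. This is a sketch, not the report's function: the name `estimate_pi` and its return structure are assumptions. It draws all x values first and then all y values, so its results will not match the loop-based run (3.176) even with the same seed.

```r
# Vectorized estimator with a binomial standard error.
# The function name and return structure are illustrative assumptions.
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  p_hat <- mean(x^2 + y^2 <= 1)  # fraction of points inside the quarter circle
  list(
    estimate    = 4 * p_hat,
    std_error   = 4 * sqrt(p_hat * (1 - p_hat) / n_points),  # binomial SE scaled by 4
    p_subsquare = mean(x >= 0.25 & x <= 0.75 & y >= 0.25 & y <= 0.75)
  )
}

set.seed(123)
for (n in c(100, 1000, 10000, 100000)) {
  res <- estimate_pi(n)
  cat(sprintf("n = %6d  estimate = %.4f  std. error = %.4f  sub-square = %.4f\n",
              n, res$estimate, res$std_error, res$p_subsquare))
}
```

For n = 3000 the theoretical standard error is about 0.03, so an estimate that differs from π by a few hundredths is within the expected range for that sample size.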

6. Task 6 - Advanced Data Transformation & Feature Engineering

This analysis aims to perform data transformation (min-max normalization here; z-score standardization is a common alternative) and to create new features that make the data easier to analyze and compare.

Visualization

library(ggplot2)
library(dplyr)
library(plotly)
library(DT)
library(htmltools)
library(readr)

# =========================
# LOAD CSV DATA
# =========================
company_df <- read_csv("company_data_final.csv")

set.seed(123)

# =========================
# NORMALIZATION FUNCTION
# =========================
normalize <- function(x){ (x - min(x)) / (max(x) - min(x)) }

# =========================
# FEATURE ENGINEERING
# =========================
company_df <- company_df %>%
  mutate(
    normalized_salary = normalize(salary),
    normalized_KPI = normalize(KPI_score),
    
    performance_category = case_when(
      KPI_score > 90 ~ "Top",
      KPI_score > 75 ~ "High",
      KPI_score > 60 ~ "Medium",
      TRUE ~ "Low"
    ),
    
    salary_bracket = case_when(
      salary <= 5000 ~ "Low",
      salary <= 8000 ~ "Medium",
      TRUE ~ "High"
    )
  )

# =========================
# TABLE
# =========================
datatable(
  company_df %>% 
    select(company_id, employee_id, salary, normalized_salary, KPI_score, normalized_KPI, department, performance_category, salary_bracket),
  options=list(scrollX=TRUE, lengthMenu=c(10,25,50,100)),
  caption=tags$caption(
    style='caption-side: bottom; text-align: center;',
    'Table: ', em('Company Employee Dataset with Features')
  )
)
# =========================
# VISUALIZATION
# =========================

# Histogram Salary
p_salary_hist <- ggplot(company_df, aes(x=salary)) +
  geom_histogram(fill="steelblue", bins=10, alpha=0.6) +
  geom_histogram(aes(x=normalized_salary*10000), fill="orange", bins=10, alpha=0.4) +
  labs(title="Salary Distribution: Original vs Normalized", x="Salary", y="Count") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold", size=16))

# Boxplot Salary
p_salary_box <- ggplot(company_df, aes(y=salary)) +
  geom_boxplot(fill="steelblue", alpha=0.6) +
  geom_boxplot(aes(y=normalized_salary*10000), fill="orange", alpha=0.4) +
  labs(title="Boxplot: Original vs Normalized Salary", y="Salary") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold", size=16))

# Histogram KPI
p_KPI_hist <- ggplot(company_df, aes(x=KPI_score)) +
  geom_histogram(fill="darkgreen", bins=10, alpha=0.6) +
  geom_histogram(aes(x=normalized_KPI*100), fill="purple", bins=10, alpha=0.4) +
  labs(title="KPI Distribution: Original vs Normalized", x="KPI Score", y="Count") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold", size=16))

# Boxplot KPI
p_KPI_box <- ggplot(company_df, aes(y=KPI_score)) +
  geom_boxplot(fill="darkgreen", alpha=0.6) +
  geom_boxplot(aes(y=normalized_KPI*100), fill="purple", alpha=0.4) +
  labs(title="Boxplot: Original vs Normalized KPI", y="KPI Score") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold", size=16))

# =========================
# INTERACTIVE
# =========================
ggplotly(p_salary_hist)
ggplotly(p_salary_box)
ggplotly(p_KPI_hist)
ggplotly(p_KPI_box)

Interpretation

Salary:

  • The histogram shows that original and normalized salary (multiplied by 10,000 for plotting) have the same shape → indicating consistent distribution patterns.
  • The boxplot shows the same relative median and spread, with only the scale changed.
  • Outliers keep the same relative positions, because min-max normalization is a linear rescaling.

KPI Score:

  • The KPI histogram also shows similar distributions between original and normalized data.
  • The boxplot indicates that the relative median and variability are preserved.
  • Normalization rescales KPI into the 0–1 range (multiplied by 100 in the plots), making it comparable with other normalized variables.

Conclusion:

Normalization rescales the data without altering the shape of its distribution, making it suitable for further analysis such as modeling or machine learning.
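
Z-score standardization is named in the task description but is not computed in the code for this task. The sketch below compares it with min-max normalization on a small invented vector. The `salary` values are hypothetical, and `min_max` uses the same formula as `normalize()`.

```r
# Hypothetical salary values, including one large outlier
salary <- c(3200, 4500, 5100, 6000, 6800, 7400, 15000)

min_max <- function(x) (x - min(x)) / (max(x) - min(x))  # same formula as normalize()
z_score <- function(x) (x - mean(x)) / sd(x)             # centre on the mean, scale by sd

scaled <- data.frame(
  salary  = salary,
  min_max = round(min_max(salary), 3),
  z_score = round(z_score(salary), 3)
)
print(scaled)

# Both are linear rescalings, so each correlates perfectly with the original values
print(cor(salary, min_max(salary)))
print(cor(salary, z_score(salary)))
```

Min-max output is bounded to [0, 1] but is sensitive to the single largest and smallest values, while z-scores are unbounded and express each value in standard deviations from the mean. Which one fits better depends on the downstream method.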

7. Task 7 - Mini Project: Company KPI Dashboard & Simulation

This project aims to build an employee KPI analysis on a company dataset (loaded here from employee_dataset.csv), including tier classification, performance summaries, and interactive visualizations.

Visualization

library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
library(htmltools)

# Load dataset
company_df <- read.csv("employee_dataset.csv")

# Ensure correct data types
company_df$KPI_score <- as.numeric(company_df$KPI_score)
company_df$salary <- as.numeric(company_df$salary)

# Create KPI tier
company_df <- company_df %>% mutate(KPI_tier = case_when(
  KPI_score > 90 ~ "Tier 1",
  KPI_score > 80 ~ "Tier 2",
  KPI_score > 70 ~ "Tier 3",
  TRUE ~ "Tier 4"
))

# Summary table
summary_df <- company_df %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = mean(salary),
    avg_KPI = mean(KPI_score),
    top_performers = sum(KPI_score > 90)
  )

# =========================
# DATA TABLE 
# =========================
datatable(
  company_df %>%
    select(company_id, employee_id, salary, KPI_score, KPI_tier, performance_score, department),
  options = list(
    scrollX = TRUE,
    lengthMenu = c(10,25,50),
    autoWidth = TRUE
  ),
  class = "cell-border stripe",
  caption = tags$caption(
    style='caption-side: bottom; text-align: center;',
    'Table: ', em('Employee Dataset with KPI Tiers')
  )
) %>%
  formatStyle(columns = names(company_df),
              `text-align` = 'center')
# =========================
# SUMMARY TABLE
# =========================
datatable(
  summary_df,
  options = list(scrollX = TRUE, autoWidth = TRUE),
  class = "cell-border stripe",
  caption = tags$caption(
    style='caption-side: bottom; text-align: center;',
    'Table: ', em('Company Summary')
  )
) %>%
  formatStyle(columns = names(summary_df),
              `text-align` = 'center')
# =========================
# PLOTS 
# =========================

# Salary distribution
p_salary <- ggplot(company_df, aes(x = salary)) +
  geom_histogram(fill = "steelblue", bins = 15, alpha = 0.7) +
  labs(title = "Salary Distribution", x = "Salary", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))

# Salary vs KPI
p_scatter <- ggplot(company_df, aes(x = salary, y = KPI_score)) +
  geom_point(aes(color = KPI_tier)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Salary vs KPI", x = "Salary", y = "KPI Score", color = "KPI Tier") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))

# Average KPI per department
p_bar <- company_df %>%
  group_by(company_id, department) %>%
  summarise(avg_KPI = mean(KPI_score), .groups = "drop") %>%  # drop grouping explicitly to avoid the regrouping message
  ggplot(aes(x = factor(company_id), y = avg_KPI, fill = department)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average KPI per Department per Company",
       x = "Company ID", y = "Average KPI") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))

# =========================
# INTERACTIVE PLOTS
# =========================
ggplotly(p_salary)
ggplotly(p_scatter)
ggplotly(p_bar)

Interpretation

Salary Distribution:

  • The histogram shows salaries are spread fairly evenly within the range of ~3000–10000.
  • No strong skewness → the distribution is roughly symmetric.
  • This indicates no dominant salary group.

Salary vs KPI Relationship:

  • The scatter plot indicates a weak relationship between salary and KPI.
  • The regression line is nearly flat → higher salary does not significantly increase KPI.
  • KPI is likely influenced by other factors (e.g., skills, experience, department).

Average KPI per Department and Company:

  • There are variations in KPI across departments within each company.
  • Departments like Marketing and IT tend to show higher KPI in some companies.
  • Differences across companies are not extreme → performance is relatively consistent.

Conclusion:

Salary distribution is fairly even, there is no strong correlation between salary and KPI, and employee performance appears to vary more by department than by salary level.
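
The strength of the salary-KPI relationship can be stated as numbers rather than read off the scatter plot. The following sketch computes the Pearson correlation, the fitted slope, and R² from a linear model. The helper `salary_kpi_relation` and the data in `toy_df` are illustrative assumptions. The toy KPI values are drawn independently of salary, so they only show what a weak relationship looks like and say nothing about the real dataset.

```r
# Summarize a salary-KPI relationship with a correlation and a fitted slope.
# `salary_kpi_relation` is an illustrative helper, not part of the report's code.
salary_kpi_relation <- function(df) {
  fit <- lm(KPI_score ~ salary, data = df)
  list(
    correlation    = cor(df$salary, df$KPI_score),
    slope_per_1000 = unname(coef(fit)["salary"]) * 1000,  # KPI change per 1000 salary units
    r_squared      = summary(fit)$r.squared
  )
}

# Hypothetical data with KPI drawn independently of salary
set.seed(123)
toy_df <- data.frame(
  salary    = round(runif(60, min = 3000, max = 10000)),
  KPI_score = round(runif(60, min = 50, max = 100), 1)
)
print(salary_kpi_relation(toy_df))
```

Applied to the real data frame, the same helper would take `company_df` as input, since it only relies on the `salary` and `KPI_score` columns.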

Conclusion

Overall, this practicum demonstrates the application of functions, loops, and conditional logic in various data science cases. From simulation to data analysis, each part helps in understanding the data processing workflow more systematically.

With data transformation and visualization, the analysis results become clearer and more informative. This practicum also helps in building a more organized and efficient data analysis workflow.

Reference

Siregar, B. (2025). Data Science Programming: Study Case Using R and Python. Online module. bookdown.org. Retrieved from https://bookdown.org/dsciencelabs/data_science_programming/03-Functions-and-Loops.html