ULIN NIKMAH (52250042)
INSTITUT TEKNOLOGI SAINS BANDUNG
Course: Data Science Programming | Study Program: Data Science | Lecturer: Bakti Siregar, M.Sc., CDS.
This practicum practices the use of functions, loops, and conditional logic in a data science context. Across several tasks, it works with simulated data: mathematical function computation, sales analysis, and company dataset generation.
In addition, the practicum covers data transformation, statistical analysis, and visualization to understand data patterns. Each task is designed to help students build a more structured and realistic data science workflow.
This program aims to compute and compare multiple mathematical functions (linear, quadratic, cubic, and exponential) dynamically using loops and visualize them on a single graph.
library(ggplot2)
library(dplyr)
library(plotly)

# Compute y-values for each requested formula over x and plot them together
compute_formula <- function(x, formulas) {
  results <- data.frame()
  for (f in formulas) {
    y <- sapply(x, function(val) {
      if (f == "linear") return(2 * val + 1)
      else if (f == "quadratic") return(val^2 + 2 * val + 1)
      else if (f == "cubic") return(val^3)
      else if (f == "exponential") return(exp(val))
    })
    results <- rbind(results, data.frame(x = x, y = y, formula = f))
  }
  p <- ggplot(results, aes(x = x, y = y, color = formula)) +
    geom_line() +
    geom_point() +
    ggtitle("Function Comparison") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  ggplotly(p)
}
x <- 1:20
compute_formula(x, c("linear","quadratic","cubic","exponential"))
The graph compares four functions (linear, quadratic, cubic, and exponential) over the range \(x = 1\) to \(20\). The linear curve grows slowest, the quadratic and cubic curves grow progressively faster, and the exponential curve overtakes all of them well before \(x = 20\). As a result, the exponential function dominates the graph scale, making the other functions appear almost flat near the bottom. This highlights that exponential growth is significantly faster than linear and polynomial growth.
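One way to keep all four curves readable is a log-scaled y-axis, which turns exponential growth into a straight line. A minimal sketch that rebuilds the same series without the loop (the standalone `results` data frame here is illustrative):

# Build the same four series directly
x <- 1:20
results <- rbind(
  data.frame(x = x, y = 2 * x + 1, formula = "linear"),
  data.frame(x = x, y = x^2 + 2 * x + 1, formula = "quadratic"),
  data.frame(x = x, y = x^3, formula = "cubic"),
  data.frame(x = x, y = exp(x), formula = "exponential")
)

# log10 y-axis: the exponential becomes a straight line, polynomials stay visible
ggplot(results, aes(x = x, y = y, color = formula)) +
  geom_line() +
  scale_y_log10() +
  ggtitle("Function Comparison (log scale)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))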
This simulation is designed to analyze sales data from multiple salespersons over several days, including discount calculations, cumulative sales, and performance summaries.
library(ggplot2)
library(dplyr)
library(plotly)
library(knitr)
library(kableExtra)
library(readr)
# =========================
# LOAD DATA
# =========================
sales_df <- read_csv("sales_data_final.csv")

# =========================
# ADD COLUMNS
# =========================
sales_df <- sales_df %>%
  mutate(
    # Tiered discount: 20% above 800, 10% above 500, otherwise 5%
    discount_rate = case_when(
      sales_amount > 800 ~ 0.2,
      sales_amount > 500 ~ 0.1,
      TRUE ~ 0.05
    ),
    final_sales = sales_amount * (1 - discount_rate)
  ) %>%
  group_by(sales_id) %>%
  mutate(cumulative_sales = cumsum(sales_amount)) %>%
  ungroup()
# =========================
# SUMMARY
# =========================
summary_stats <- sales_df %>%
  group_by(sales_id) %>%
  summarise(
    total_sales = sum(sales_amount),
    total_final_sales = sum(final_sales),
    avg_sales = mean(sales_amount),
    max_sales = max(sales_amount),
    min_sales = min(sales_amount)
  )

kable(summary_stats, "html", caption = "Summary Statistics per Salesperson") %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| sales_id | total_sales | total_final_sales | avg_sales | max_sales | min_sales |
|---|---|---|---|---|---|
| 1 | 3407 | 3057.0 | 486.7143 | 992 | 253 |
| 2 | 3120 | 2804.0 | 445.7143 | 775 | -180 |
| 3 | 4216 | 3701.9 | 602.2857 | 1402 | 253 |
# =========================
# PLOT
# =========================
p <- ggplot(sales_df, aes(x = day, y = cumulative_sales, color = factor(sales_id))) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Cumulative Sales per Salesperson",
    x = "Day",
    y = "Cumulative Sales",
    color = "Salesperson"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.title = element_text(face = "bold")
  )
ggplotly(p)
Based on the code, sales data is processed with tiered discounts, producing final_sales and cumulative_sales per salesperson. The chart shows Salesperson 3's cumulative line climbing fastest, while Salesperson 2 trails the others. The summary confirms this: Salesperson 3 records the highest total sales (4216) and the highest single sale (1402), whereas Salesperson 2 has the lowest total (3120) and even one negative sale (-180).
Conclusion: Salesperson 3 performs the best, while Salesperson 2 needs evaluation.
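The tier boundaries are easy to misread (the tests are strict inequalities), so a quick sanity check of the `case_when()` rules on a few hand-picked values helps; the boundary values below are illustrative:

# 801 and above -> 20%; 501-800 -> 10%; 500 and below -> 5%
check <- data.frame(sales_amount = c(900, 801, 800, 600, 500, 100)) %>%
  mutate(
    discount_rate = case_when(
      sales_amount > 800 ~ 0.2,
      sales_amount > 500 ~ 0.1,
      TRUE ~ 0.05
    ),
    final_sales = sales_amount * (1 - discount_rate)
  )
check  # note that 800 itself falls in the 10% tier, and 500 in the 5% tier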
This analysis aims to classify sales data into five performance categories and calculate their distribution and percentages through visualization.
library(ggplot2)
library(dplyr)
library(plotly)
library(knitr)
library(kableExtra)
library(readr)
library(RColorBrewer)
# =========================
# LOAD CSV DATA
# =========================
sales_df <- read_csv("sales_data_final.csv")

# =========================
# CATEGORY FUNCTION
# =========================
categorize_performance <- function(sales) {
  category <- sapply(sales, function(s) {
    if (s > 800) "Excellent"
    else if (s > 600) "Very Good"
    else if (s > 400) "Good"
    else if (s > 200) "Average"
    else "Poor"
  })
  data.frame(sales_amount = sales, performance_category = category)
}

# =========================
# APPLY TO CSV DATA
# =========================
perf_df <- categorize_performance(sales_df$sales_amount)
# =========================
# SUMMARY
# =========================
perf_summary <- perf_df %>%
  group_by(performance_category) %>%
  summarise(count = n()) %>%
  mutate(percentage = round(count / sum(count) * 100, 2)) %>%
  arrange(desc(count))

kable(perf_summary, "html", caption = "Performance Distribution Summary") %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| performance_category | count | percentage |
|---|---|---|
| Good | 7 | 33.33 |
| Average | 6 | 28.57 |
| Very Good | 4 | 19.05 |
| Excellent | 2 | 9.52 |
| Poor | 2 | 9.52 |
# =========================
# BAR PLOT
# =========================
bar_plot <- ggplot(perf_summary, aes(x = performance_category, y = count, fill = performance_category)) +
  geom_col() +
  labs(title = "Performance Distribution", x = "Category", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
ggplotly(bar_plot)

# =========================
# PIE CHART
# =========================
pie_chart <- plot_ly(
  perf_summary,
  labels = ~performance_category,
  values = ~count,
  type = 'pie',
  textposition = 'inside',
  textinfo = 'label+percent',
  hoverinfo = 'label+value+percent',
  marker = list(
    colors = brewer.pal(n = 5, name = "Set2"),
    line = list(color = '#FFFFFF', width = 1)
  )
) %>%
  layout(
    title = list(text = "Performance Distribution", font = list(size = 18)),
    showlegend = TRUE
  )
pie_chart
Based on the code, sales data is categorized into five performance levels based on sales amount. The results show:

- Good is the most dominant category (33.33%), indicating generally good performance.
- Average is the second highest (28.57%), showing some standard-level performance.
- Very Good is noticeable (19.05%), indicating improvement.
- Excellent and Poor have the lowest proportions (9.52% each).

Conclusion: Overall, sales performance is relatively stable at a medium-to-high level, but improvements are still needed to increase the number of Excellent performances.
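For reference, the same thresholds can be expressed without an explicit `sapply()` loop; a sketch using `dplyr::case_when()` (the function name `categorize_performance_vec` is illustrative):

# Vectorized equivalent of categorize_performance(): case_when()
# evaluates conditions top-down, so the order encodes the tiers
categorize_performance_vec <- function(sales) {
  data.frame(
    sales_amount = sales,
    performance_category = case_when(
      sales > 800 ~ "Excellent",
      sales > 600 ~ "Very Good",
      sales > 400 ~ "Good",
      sales > 200 ~ "Average",
      TRUE ~ "Poor"
    )
  )
}

# Spot check: one value per tier
categorize_performance_vec(c(950, 700, 450, 250, 100))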
This program analyzes an employee dataset covering multiple companies (loaded from CSV in place of nested-loop random generation) and reports average salary, average KPI, and maximum KPI for each company.
library(ggplot2)
library(dplyr)
library(plotly)
library(DT)
library(htmltools)
library(readr)
# =========================
# LOAD CSV DATA (REPLACES RANDOM GENERATION)
# =========================
company_df <- read_csv("company_data_final.csv")

# =========================
# DETAIL TABLE
# =========================
df1 <- company_df %>%
  arrange(company_id, employee_id) %>%
  select(company_id, employee_id, salary, department, performance_score, KPI_score)

datatable(df1,
          options = list(scrollX = TRUE, lengthMenu = c(10, 25, 50, 100)),
          caption = tags$caption(
            style = 'caption-side: bottom; text-align: center;',
            'Table: ', em('Company Employee Dataset'))
)

# =========================
# SUMMARY
# =========================
summary_df <- company_df %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = mean(salary),
    avg_KPI = mean(KPI_score),  # mean KPI per company (was mislabeled avg_performance)
    max_KPI = max(KPI_score)
  )

datatable(summary_df,
          options = list(scrollX = TRUE),
          caption = tags$caption(
            style = 'caption-side: bottom; text-align: center;',
            'Table: ', em('Company Summary'))
)
# =========================
# PLOT
# =========================
p1 <- ggplot(summary_df, aes(x = factor(company_id), y = avg_salary)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(avg_salary, 0)), nudge_y = 200) +
  labs(title = "Average Salary per Company", x = "Company ID", y = "Average Salary") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 18))

p2 <- ggplot(summary_df, aes(x = factor(company_id), y = avg_KPI)) +
  geom_col(fill = "darkgreen") +
  geom_text(aes(label = round(avg_KPI, 1)), nudge_y = 2) +
  labs(title = "Average KPI Score per Company", x = "Company ID", y = "Average KPI") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 18))

p3 <- ggplot(summary_df, aes(x = factor(company_id), y = max_KPI)) +
  geom_col(fill = "orange") +
  geom_text(aes(label = max_KPI), nudge_y = 2) +
  labs(title = "Maximum KPI Score per Company", x = "Company ID", y = "Max KPI") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 18))

ggplotly(p1)
ggplotly(p2)
ggplotly(p3)
Based on the code, company data is processed to obtain average salary, average KPI, and maximum KPI for each company. The results show:

- Company 1 has the highest average salary (6696) but the lowest average KPI (69.3).
- Company 2 has the lowest average salary (6024) but the highest average KPI (80.1) and maximum KPI (100).
- Company 3 is in the middle on both salary (6253) and KPI (75.8; max 96).

Conclusion: Higher salary does not necessarily correlate with better performance. Company 2 demonstrates the best performance despite having the lowest average salary.
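This conclusion can also be checked numerically; a minimal sketch using the `company_df` loaded above computes the salary-KPI correlation overall and per company:

# Pearson correlation across all employees; a value near 0 supports
# the conclusion that salary and KPI are not strongly related
cor(company_df$salary, company_df$KPI_score, use = "complete.obs")

# Per-company correlations, in case the relationship differs by company
company_df %>%
  group_by(company_id) %>%
  summarise(salary_kpi_cor = cor(salary, KPI_score, use = "complete.obs"))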
This Monte Carlo simulation is used to estimate the value of π and compute the probability of random points falling within a specific area through iterative processes.
library(ggplot2)
library(plotly)

set.seed(123)

monte_carlo_pi <- function(n_points) {
  inside_count <- 0
  points_df <- data.frame(x = numeric(0), y = numeric(0), inside = logical(0))
  for (i in 1:n_points) {
    # Draw a uniform random point in the unit square
    x_val <- runif(1)
    y_val <- runif(1)
    # Is the point inside the quarter circle of radius 1?
    is_inside <- x_val^2 + y_val^2 <= 1
    if (is_inside) inside_count <- inside_count + 1
    points_df <- rbind(points_df, data.frame(x = x_val, y = y_val, inside = is_inside))
  }
  # Area ratio (quarter circle / unit square) = pi/4, so pi ~ 4 * fraction inside
  pi_estimate <- 4 * inside_count / n_points
  cat("Estimated Pi:", pi_estimate, "\n")

  # Empirical probability of landing in the sub-square [0.25, 0.75]^2
  in_subsquare <- sum(points_df$x >= 0.25 & points_df$x <= 0.75 &
                      points_df$y >= 0.25 & points_df$y <= 0.75)
  prob_subsquare <- in_subsquare / n_points
  cat("Probability in sub-square [0.25,0.75]^2:", prob_subsquare, "\n")

  p <- ggplot(points_df, aes(x = x, y = y, color = inside)) +
    geom_point(alpha = 0.6) +
    scale_color_manual(values = c("red", "blue"), labels = c("Outside Circle", "Inside Circle")) +
    coord_fixed() +
    labs(title = "Monte Carlo Simulation of Pi", subtitle = paste("n =", n_points),
         x = "X", y = "Y", color = "Legend") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 18),
          plot.subtitle = element_text(hjust = 0.5))
  ggplotly(p)
}

monte_carlo_pi(3000)
## Estimated Pi: 3.176
## Probability in sub-square [0.25,0.75]^2: 0.256
This simulation applies the Monte Carlo method to estimate the value of π by randomly generating points inside a 1×1 square. Each point is classified as inside the quarter circle (blue) or outside (red) using the condition
\[ x^2 + y^2 \leq 1 \]
The value of π is then estimated using:
\[ \pi \approx 4 \times \frac{\text{number of points inside the circle}}{\text{total points}} \]
In the plot, blue points (TRUE) represent points inside the circle, while red points (FALSE) are outside. As the number of points increases (n = 3000), the distribution becomes more uniform and the estimation of π approaches its true value (~3.14).
Additionally, the code calculates the empirical probability of points falling within the sub-square \([0.25, 0.75]^2\). Since the points are uniform on the unit square, the true probability equals the sub-square's area, \(0.5^2 = 0.25\), and the observed value of 0.256 is close to it.
In short: this simulation demonstrates how probabilistic methods can approximate π, and larger sample sizes lead to more accurate results.
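Growing `points_df` with `rbind()` inside the loop is quadratic in n. For reference, a vectorized sketch (the helper name `estimate_pi` is illustrative) draws all points at once and shows how the estimate tightens as n grows; the exact numbers differ from the loop version because the random draws are consumed in a different order:

set.seed(123)

# Vectorized Monte Carlo estimate of pi: draw all points in one call
estimate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

for (n in c(100, 1000, 10000, 100000)) {
  cat("n =", n, " estimated pi:", estimate_pi(n), "\n")
}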
This analysis aims to perform data transformations such as min-max normalization (a z-score counterpart is sketched after the discussion below), as well as create new features to improve data analysis and comparison.
library(ggplot2)
library(dplyr)
library(plotly)
library(DT)
library(htmltools)
library(readr)
# =========================
# LOAD DATA CSV
# =========================
company_df <- read_csv("company_data_final.csv")
set.seed(123)

# =========================
# MIN-MAX NORMALIZATION FUNCTION
# =========================
# Rescales a numeric vector to [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
# =========================
# FEATURE ENGINEERING
# =========================
company_df <- company_df %>%
  mutate(
    normalized_salary = normalize(salary),
    normalized_KPI = normalize(KPI_score),
    performance_category = case_when(
      KPI_score > 90 ~ "Top",
      KPI_score > 75 ~ "High",
      KPI_score > 60 ~ "Medium",
      TRUE ~ "Low"
    ),
    salary_bracket = case_when(
      salary <= 5000 ~ "Low",
      salary <= 8000 ~ "Medium",
      TRUE ~ "High"
    )
  )
# =========================
# TABLE
# =========================
datatable(
  company_df %>%
    select(company_id, employee_id, salary, normalized_salary, KPI_score,
           normalized_KPI, department, performance_category, salary_bracket),
  options = list(scrollX = TRUE, lengthMenu = c(10, 25, 50, 100)),
  caption = tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    'Table: ', em('Company Employee Dataset with Features')
  )
)
# =========================
# VISUALIZATION
# =========================
# Salary histogram: normalized values rescaled by 10000 so both fit one axis
p_salary_hist <- ggplot(company_df, aes(x = salary)) +
  geom_histogram(fill = "steelblue", bins = 10, alpha = 0.6) +
  geom_histogram(aes(x = normalized_salary * 10000), fill = "orange", bins = 10, alpha = 0.4) +
  labs(title = "Salary Distribution: Original vs Normalized", x = "Salary", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16))

# Salary boxplot
p_salary_box <- ggplot(company_df, aes(y = salary)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  geom_boxplot(aes(y = normalized_salary * 10000), fill = "orange", alpha = 0.4) +
  labs(title = "Boxplot: Original vs Normalized Salary", y = "Salary") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16))

# KPI histogram: normalized values rescaled by 100
p_KPI_hist <- ggplot(company_df, aes(x = KPI_score)) +
  geom_histogram(fill = "darkgreen", bins = 10, alpha = 0.6) +
  geom_histogram(aes(x = normalized_KPI * 100), fill = "purple", bins = 10, alpha = 0.4) +
  labs(title = "KPI Distribution: Original vs Normalized", x = "KPI Score", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16))

# KPI boxplot
p_KPI_box <- ggplot(company_df, aes(y = KPI_score)) +
  geom_boxplot(fill = "darkgreen", alpha = 0.6) +
  geom_boxplot(aes(y = normalized_KPI * 100), fill = "purple", alpha = 0.4) +
  labs(title = "Boxplot: Original vs Normalized KPI", y = "KPI Score") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16))

# =========================
# INTERACTIVE
# =========================
ggplotly(p_salary_hist)
ggplotly(p_salary_box)
ggplotly(p_KPI_hist)
ggplotly(p_KPI_box)
Salary: the overlaid histogram and boxplot show that min-max normalization maps salaries onto [0, 1] (rescaled for plotting) while keeping the same shape and relative spread as the original distribution.
KPI Score: likewise, the normalized KPI values preserve the original distribution's shape; only the scale changes.
Conclusion: normalization effectively rescales the data without altering its distribution, making it suitable for further analysis such as modeling or machine learning.
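The task description also mentions z-scores; as a sketch, a standardization counterpart to `normalize()` could be added alongside the min-max columns (the column names `z_salary` and `z_KPI` are illustrative):

# Z-score standardization: mean 0, standard deviation 1.
# Unlike min-max scaling, values are not bounded to [0, 1],
# but the shape of the distribution is likewise preserved.
standardize <- function(x) (x - mean(x)) / sd(x)

company_df <- company_df %>%
  mutate(
    z_salary = standardize(salary),
    z_KPI = standardize(KPI_score)
  )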
This project analyzes a simulated company dataset and builds an employee KPI analysis, including tier classification, performance summaries, and interactive visualizations.
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
library(htmltools)

# Load dataset
company_df <- read.csv("employee_dataset.csv")

# Ensure correct data types
company_df$KPI_score <- as.numeric(company_df$KPI_score)
company_df$salary <- as.numeric(company_df$salary)

# Create KPI tier
company_df <- company_df %>%
  mutate(KPI_tier = case_when(
    KPI_score > 90 ~ "Tier 1",
    KPI_score > 80 ~ "Tier 2",
    KPI_score > 70 ~ "Tier 3",
    TRUE ~ "Tier 4"
  ))

# Summary table
summary_df <- company_df %>%
  group_by(company_id) %>%
  summarise(
    avg_salary = mean(salary),
    avg_KPI = mean(KPI_score),
    top_performers = sum(KPI_score > 90)
  )
# =========================
# DATA TABLE
# =========================
datatable(
  company_df %>%
    select(company_id, employee_id, salary, KPI_score, KPI_tier, performance_score, department),
  options = list(
    scrollX = TRUE,
    lengthMenu = c(10, 25, 50),
    autoWidth = TRUE
  ),
  class = "cell-border stripe",
  caption = tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    'Table: ', em('Employee Dataset with KPI Tiers')
  )
) %>%
  formatStyle(columns = names(company_df), `text-align` = 'center')

# =========================
# SUMMARY TABLE
# =========================
datatable(
  summary_df,
  options = list(scrollX = TRUE, autoWidth = TRUE),
  class = "cell-border stripe",
  caption = tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    'Table: ', em('Company Summary')
  )
) %>%
  formatStyle(columns = names(summary_df), `text-align` = 'center')
# =========================
# PLOTS
# =========================
# Salary distribution
p_salary <- ggplot(company_df, aes(x = salary)) +
  geom_histogram(fill = "steelblue", bins = 15, alpha = 0.7) +
  labs(title = "Salary Distribution", x = "Salary", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))

# Salary vs KPI
p_scatter <- ggplot(company_df, aes(x = salary, y = KPI_score)) +
  geom_point(aes(color = KPI_tier)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Salary vs KPI", x = "Salary", y = "KPI Score", color = "KPI Tier") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))

# Average KPI per department
p_bar <- company_df %>%
  group_by(company_id, department) %>%
  summarise(avg_KPI = mean(KPI_score), .groups = "drop") %>%
  ggplot(aes(x = factor(company_id), y = avg_KPI, fill = department)) +
  geom_col(position = "dodge") +
  labs(title = "Average KPI per Department per Company",
       x = "Company ID", y = "Average KPI") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))

# =========================
# INTERACTIVE PLOTS
# =========================
ggplotly(p_salary)
ggplotly(p_scatter)
ggplotly(p_bar)
Salary Distribution: salaries are spread fairly evenly, with no extreme concentration at either end of the range.
Salary vs KPI Relationship: the fitted line is close to flat, indicating no strong correlation between salary and KPI score.
Average KPI per Department and Company:

- There are variations in KPI across departments within each company.
- Departments like Marketing and IT tend to show higher KPI in some companies.
- Differences across companies are not extreme, so performance is relatively consistent.

Conclusion: salary distribution is fairly even, there is no strong correlation between salary and KPI, and employee performance is influenced more by department than by salary level.
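The claim that department matters more than salary can be probed directly; a minimal sketch (assuming the `department` and `KPI_score` columns loaded above) compares group means and runs a one-way ANOVA:

# Mean KPI per department: large gaps hint at department-level effects
company_df %>%
  group_by(department) %>%
  summarise(mean_KPI = mean(KPI_score), n = n()) %>%
  arrange(desc(mean_KPI))

# One-way ANOVA: does department explain variation in KPI?
summary(aov(KPI_score ~ factor(department), data = company_df))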
Overall, this practicum demonstrates the application of functions, loops, and conditional logic across several data science cases. From simulation to data analysis, each part reinforces a systematic data-processing workflow.
With data transformation and visualization, the results become clearer and more informative, and the analysis workflow more organized and efficient.
Siregar, B. (2025). Data Science Programming: Study Case Using R and Python. Online module, bookdown.org. https://bookdown.org/dsciencelabs/data_science_programming/03-Functions-and-Loops.html