Undergraduate Student in Data Science at Institut Teknologi Sains Bandung
In an increasingly data-driven world, the ability to understand and process data is no longer an optional skill but a fundamental necessity. This practicum explores the use of R across the data analysis process, from data creation and processing to visualization, and shows how data can be turned into meaningful information. Each task focuses not only on the final outcome but also on the logical thinking behind the solution, expressed through functions, loops, and various data processing techniques.
# Library
library(ggplot2)
library(tidyr)
library(plotly)
# Color
PASTEL <- c('#FFB3C1','#FFD6A5','#A0C4FF','#BDB2FF')
# Function
compute_formula <- function(x, formula) {
valid <- c("linear", "quadratic", "cubic", "exponential")
if (!(formula %in% valid)) {
stop(paste0("Formula '", formula, "' is not valid. Choose one of: ", paste(valid, collapse = ", ")))
}
if (formula == "linear") return(2*x + 3)
else if (formula == "quadratic") return(x^2 - 4*x + 4)
else if (formula == "cubic") return(x^3 - 3*x^2 + 2*x)
else if (formula == "exponential") return(exp(0.3 * x))
}
# Nested loop: compute every formula for x = 1..20
x_vals <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")
results <- data.frame(x = x_vals)
for (formula in formulas) { # outer loop over formulas
  y_vals <- numeric(length(x_vals)) # preallocate instead of growing the vector
  for (i in seq_along(x_vals)) { # inner loop over x values
    y_vals[i] <- compute_formula(x_vals[i], formula)
  }
  results[[formula]] <- y_vals
}
# Reshape for ggplot
df_long <- pivot_longer(results, cols = -x,
names_to = "formula",
values_to = "y")
# Plot
p <- ggplot(df_long, aes(x = x, y = y, color = formula, group = formula)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
# breaks fixes the legend order so that each label lines up with its own formula
scale_color_manual(values = setNames(PASTEL, formulas),
breaks = formulas,
labels = tools::toTitleCase(formulas)) +
labs(
title = "Multi-Formula Comparison",
x = "x",
y = "y = f(x)",
color = "Formula"
) +
theme_minimal(base_size = 13) +
theme(
plot.background = element_rect(fill = "#FFF9F9", color = NA),
panel.background = element_rect(fill = "#FFF9F9", color = NA),
panel.grid.major = element_line(color = "#F0E6EE", linewidth = 0.8),
panel.grid.minor = element_blank(),
axis.text = element_text(color = "#9B7BB8"),
axis.title = element_text(color = "#9B7BB8"),
plot.title = element_text(color = "#7B5EA7", face = "bold", size = 13),
legend.background = element_rect(fill = "#FFF9F9", color = "#CCAACC"),
legend.text = element_text(color = "#7B5EA7"),
legend.title = element_text(color = "#7B5EA7")
)
ggplotly(p)
Interpretation:
The chart compares four functions (linear, quadratic, cubic, and exponential) for x values from 1 to 20. The linear function, y = 2x + 3, forms a straight line: each unit increase in x adds a constant 2 to y, reaching 43 at x = 20. The quadratic function, y = x^2 - 4x + 4, dips to 0 at x = 2 and then rises along a curve, reaching 324 at x = 20. The exponential function, y = exp(0.3x), starts slowly but accelerates at higher x values; it overtakes the quadratic near x = 19 and reaches about 403 at x = 20.
The cubic function, y = x^3 - 3x^2 + 2x, increases far more sharply than the others over this range and gives the largest value, 6840 at x = 20. Overall, all four functions trend upward from x = 2 onward but at different growth rates: the linear function grows at a steady rate, the quadratic and exponential grow faster, and the cubic grows fastest over this x range.
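The growth ordering described above can be checked numerically. The sketch below restates the four formulas as a named list (the names `formula_list`, `x_check`, and `check_table` are illustrative and not part of the original code) and evaluates them at a few x values. Because R arithmetic is vectorised, each formula can be applied to the whole vector at once.

```r
# Hypothetical helper: the same four formulas as compute_formula(), stored as a named list
formula_list <- list(
  linear      = function(x) 2 * x + 3,
  quadratic   = function(x) x^2 - 4 * x + 4,
  cubic       = function(x) x^3 - 3 * x^2 + 2 * x,
  exponential = function(x) exp(0.3 * x)
)

x_check <- c(1, 2, 18, 19, 20)

# sapply() applies each function to the whole vector:
# rows are x values, columns are formulas
check_table <- sapply(formula_list, function(f) f(x_check))
rownames(check_table) <- x_check
print(round(check_table, 1))
```

At x = 20 the table gives 43 (linear), 324 (quadratic), 6840 (cubic), and about 403.4 (exponential), and the rows for x = 18 and x = 19 show where the exponential curve crosses above the quadratic one.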
# Library
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
# Load Data
df23 <- read.csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 2, 3.csv")
datatable(df23,
caption = "Dataset",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
# Color
PASTEL <- c('#FFB3C1','#FFD6A5','#CAFFBF','#9BF6FF','#BDB2FF')
# Function: Discount
apply_discount <- function(x) {
if (x > 900) return(0.20)
else if (x > 700) return(0.15)
else if (x > 500) return(0.10)
else if (x > 300) return(0.05)
else return(0.00)
}
# Nested Function: Cumulative Sales
cumulative_sales_func <- function(sales_vec) {
total <- 0
result <- c()
for (s in sales_vec) {
total <- total + s
result <- c(result, total)
}
return(result)
}
# Loop per Salesperson
sales_ids <- unique(df23$sales_id)
final_data <- data.frame()
for (s in sales_ids) {
temp <- df23 %>% filter(sales_id == s) %>% arrange(day) # sort by day so the running total follows time order
# Apply discount
temp$discount_rate <- sapply(temp$sales_amount, apply_discount)
temp$net_sales <- temp$sales_amount * (1 - temp$discount_rate)
# Cumulative (nested function)
temp$cumulative_sales <- cumulative_sales_func(temp$net_sales)
final_data <- rbind(final_data, temp)
}
# Summary Stats
summary_sales <- final_data %>%
group_by(sales_id) %>%
summarise(
total_sales = sum(net_sales),
avg_sales = mean(sales_amount),
max_sales = max(sales_amount),
avg_discount = mean(discount_rate)
)
# Table Summary
datatable(summary_sales)
# Plot
p <- ggplot(final_data, aes(x = day, y = cumulative_sales,
color = sales_id, group = sales_id)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
scale_color_manual(values = setNames(PASTEL, unique(final_data$sales_id))) +
labs(
title = "Cumulative Sales per Salesperson",
x = "Day",
y = "Cumulative Net Sales"
) +
theme_minimal(base_size = 13) +
theme(
plot.background = element_rect(fill = "#FFF9F9", color = NA),
panel.background = element_rect(fill = "#FFF9F9", color = NA),
panel.grid.major = element_line(color = "#F0E6EE"),
axis.text = element_text(color = "#9B7BB8"),
plot.title = element_text(color = "#7B5EA7", face = "bold")
)
ggplotly(p)
Interpretation:
The cumulative sales chart shows each salesperson's total net sales rising from day to day, because every point adds that day's net sales to the running total. The slope of a line reflects the size of daily sales: a steeper segment means higher net sales in that period, a flatter segment means smaller daily sales, and a nearly constant slope indicates stable daily sales. The chart shows that some salespeople accumulate sales faster than others, which indicates stronger sales performance.
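The loop-based accumulation behind this chart matches base R's `cumsum()`, and the tiered discount can be looked up for a whole vector at once with `cut()`. The sketch below uses made-up sales figures (`toy_sales` is purely illustrative and not taken from the dataset) and assumes the same thresholds as `apply_discount`.

```r
# Illustrative sales amounts only; not taken from the practicum dataset
toy_sales <- c(250, 300, 450, 650, 800, 950)

# Vectorised discount lookup with the same tiers as apply_discount():
# (-Inf,300] -> 0, (300,500] -> 0.05, (500,700] -> 0.10,
# (700,900] -> 0.15, (900,Inf) -> 0.20
rates <- c(0.00, 0.05, 0.10, 0.15, 0.20)
tier  <- cut(toy_sales, breaks = c(-Inf, 300, 500, 700, 900, Inf), labels = FALSE)
discount_rate <- rates[tier]

net_sales <- toy_sales * (1 - discount_rate)

# cumsum() returns the running total that the for loop builds element by element
cumulative_sales <- cumsum(net_sales)

print(data.frame(toy_sales, discount_rate, net_sales, cumulative_sales))
```

In the task code, `cumsum(temp$net_sales)` would return the same vector as `cumulative_sales_func(temp$net_sales)`; the explicit loop is kept there because the exercise is about writing loops and nested functions.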
# Library
library(readr)
library(dplyr)
library(plotly)
library(DT)
# Load Data
data <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 2, 3.csv")
datatable(data,
caption = "Dataset",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
# Category column
kategori <- data$performance_category
# Loop to count the frequency of each category
freq <- c()
for (i in unique(kategori)) {
  freq[i] <- sum(kategori == i)
}
# Compute percentages
persentase <- (freq / sum(freq)) * 100
# Build the table
tabel <- data.frame(
Category = names(freq),
Frequency = as.numeric(freq),
Percentage = as.numeric(persentase)
)
tabel <- tabel %>%
arrange(Frequency)
tabel$Category <- factor(tabel$Category, levels = tabel$Category)
# Bar Chart
bar_plot <- plot_ly(
tabel,
x = ~Category,
y = ~Frequency,
type = "bar",
color = ~Category,
text = ~Frequency,
textposition = "outside",
hovertext = ~paste("Category:", Category,
"<br>Frequency:", Frequency,
"<br>Percentage:", round(Percentage,2), "%"),
hoverinfo = "text"
) %>%
layout(
title = "Bar Plot Distribution of Category",
xaxis = list(title = "Category"),
yaxis = list(title = "Frequency"),
showlegend = FALSE
)
bar_plot
# Pie Chart
pie_chart <- plot_ly(
tabel,
labels = ~Category,
values = ~Frequency,
type = "pie",
textinfo = "percent",
hoverinfo = "label+value+percent"
) %>%
layout(
title = "Pie Chart Distribution of Category"
)
pie_chart
Interpretation:
The bar plot shows that the “Poor” category has the highest frequency, followed by “Very Good” and “Average”, while “Good” and especially “Excellent” have the lowest counts. The pie chart makes the proportions explicit: “Poor” takes the largest share (32%), followed by “Very Good” (24%) and “Average” (20%), while “Excellent” is the smallest at about 10%. Both charts consistently show that performance is unevenly distributed and that the lowest category, “Poor”, is the largest single group.
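The frequency loop that feeds both charts has a direct base R counterpart in `table()` and `prop.table()`. A minimal sketch, using a made-up category vector (`toy_category` is illustrative, and its counts are not those of the dataset):

```r
# Illustrative categories only; the real column is data$performance_category
toy_category <- c("Poor", "Good", "Poor", "Average", "Excellent",
                  "Poor", "Very Good", "Average", "Very Good", "Poor")

freq_tab <- table(toy_category)          # counts per category
pct_tab  <- prop.table(freq_tab) * 100   # share of each category in percent

toy_table <- data.frame(
  Category   = names(freq_tab),
  Frequency  = as.integer(freq_tab),
  Percentage = as.numeric(pct_tab)
)

# Sort ascending by frequency, as in the bar chart
toy_table <- toy_table[order(toy_table$Frequency), ]
print(toy_table)
```

Note that `table()` ignores missing values by default, so the two approaches are only guaranteed to agree when the category column contains no `NA`.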
# Library
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
# Load Data
data <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 4,6.csv")
datatable(data,
caption = "Dataset",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
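# Nested loop over companies and employees
companies <- unique(data$company_id)
top_performers <- data.frame()
for (comp in companies) { # outer loop: companies (avoids shadowing base::c)
  company_data <- data[data$company_id == comp, ]
  employees <- unique(company_data$employee_id)
  for (e in employees) { # inner loop: employees of this company
    emp_data <- company_data[company_data$employee_id == e, ]
    # Keep only rows above the threshold; which() drops NA scores, and the
    # test below stays length-one even if an employee has several rows
    top_rows <- emp_data[which(emp_data$KPI_score > 90), ]
    if (nrow(top_rows) > 0) {
      top_performers <- rbind(top_performers, top_rows)
    }
  }
}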
# Top Performers Table
datatable(top_performers,
caption = "Top Performers (KPI > 90)",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
# Summary per Company
summary_company <- data %>%
group_by(company_id) %>%
summarise(
Avg_Salary = mean(salary, na.rm = TRUE),
Avg_Performance = mean(performance_score, na.rm = TRUE),
Max_KPI = max(KPI_score, na.rm = TRUE)
)
# Summary Table
datatable(summary_company,
caption = "Summary per Company",
options = list(pageLength = 5, scrollX = TRUE),
rownames = FALSE)
# Plot 1: AVG SALARY
sc1 <- summary_company %>% arrange(Avg_Salary)
plot_salary <- plot_ly(
sc1,
x = ~company_id,
y = ~Avg_Salary,
type = "bar",
color = ~company_id,
text = ~round(Avg_Salary, 0),
textposition = "outside"
) %>%
layout(
title = "Average Salary per Company",
xaxis = list(title = "Company"),
yaxis = list(title = "Average Salary",
range = c(0, max(sc1$Avg_Salary) * 1.2)),
showlegend = FALSE
)
plot_salary
# Plot 2: AVG PERFORMANCE
sc2 <- summary_company %>% arrange(Avg_Performance)
plot_perf <- plot_ly(
sc2,
x = ~company_id,
y = ~Avg_Performance,
type = "bar",
color = ~company_id,
text = ~round(Avg_Performance, 1),
textposition = "outside"
) %>%
layout(
title = "Average Performance per Company",
xaxis = list(title = "Company"),
yaxis = list(title = "Average Performance",
range = c(0, max(sc2$Avg_Performance) * 1.2)),
showlegend = FALSE
)
plot_perf
# Plot 3: MAX KPI
sc3 <- summary_company %>% arrange(Max_KPI)
plot_kpi <- plot_ly(
sc3,
x = ~company_id,
y = ~Max_KPI,
type = "bar",
color = ~company_id,
text = ~Max_KPI,
textposition = "outside"
) %>%
layout(
title = "Maximum KPI per Company",
xaxis = list(title = "Company"),
yaxis = list(title = "Max KPI",
range = c(0, max(sc3$Max_KPI) * 1.2)),
showlegend = FALSE
)
plot_kpi
Interpretation:
# Library
library(ggplot2)
library(plotly)
# Number of iterations
n <- 5000
# Generate random points in the square [-1, 1] x [-1, 1]
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
inside_circle <- numeric(n)
inside_square <- numeric(n)
# Loop over the points
for (i in 1:n) {
  # Check whether the point lies inside the unit circle
  if (x[i]^2 + y[i]^2 <= 1) {
    inside_circle[i] <- 1
  } else {
    inside_circle[i] <- 0
  }
  # Check whether the point lies inside the small sub-square [-0.5, 0.5] x [-0.5, 0.5]
  if (x[i] >= -0.5 && x[i] <= 0.5 && y[i] >= -0.5 && y[i] <= 0.5) {
    inside_square[i] <- 1
  } else {
    inside_square[i] <- 0
  }
}
# Estimate pi
pi_estimate <- 4 * sum(inside_circle) / n
pi_estimate
## [1] 3.1424
# Probability that a point falls inside the sub-square
prob_square <- sum(inside_square) / n
prob_square
## [1] 0.2416
# Data for the plot
points_data <- data.frame(
x = x,
y = y,
inside_circle = as.factor(inside_circle)
)
# Plot
p <- ggplot(points_data, aes(x = x, y = y, color = inside_circle)) +
geom_point(alpha = 0.6) +
labs(
title = "Monte Carlo Simulation",
color = "Inside Circle"
) +
theme_minimal()
ggplotly(p)
Interpretation:
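The graph shows the random points generated in the Monte Carlo simulation: the green points lie inside the unit circle and the orange points lie outside the circle but inside the square. Because the circle has area π and the square has area 4, the share of points inside the circle approximates π/4, so multiplying that share by 4 estimates π; this run gives 3.1424 with n = 5000. The estimate is random, but its typical error shrinks as more points are used, roughly in proportion to 1/√n. The share of points in the sub-square, 0.2416, is likewise close to its theoretical value of 0.25, the ratio of the sub-square's area (1) to the square's area (4).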
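The effect of sample size on accuracy can be illustrated by repeating the estimate for several values of n. The sketch below is a vectorised variant of the loop; the function name `estimate_pi` and the seed are assumptions added here for repeatability and are not part of the original run. It also prints the theoretical standard error of the estimator, 4·sqrt(p(1 − p)/n) with p = π/4.

```r
# Hypothetical helper: vectorised Monte Carlo estimate of pi from n uniform points
estimate_pi <- function(n) {
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)
  4 * mean(x^2 + y^2 <= 1)  # share of points inside the unit circle, times 4
}

set.seed(123)  # arbitrary seed, chosen only so the sketch is repeatable

sizes <- c(100, 1000, 10000, 100000)
estimates <- sapply(sizes, estimate_pi)

# Theoretical standard error of the estimator for each n, with p = pi / 4
p <- pi / 4
std_error <- 4 * sqrt(p * (1 - p) / sizes)

print(data.frame(n = sizes, estimate = estimates,
                 abs_error = abs(estimates - pi), std_error = std_error))
```

For the n = 5000 used in this task the same formula gives a standard error of about 0.023. Any single run can land closer to or further from π than that, so the absolute errors in the table need not decrease at every step even though the standard error does.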
# Library
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
# Load Data
company <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset 4,6.csv")
datatable(company,
caption = "Dataset",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
# Normalization Function
normalize <- function(x){
(x - min(x)) / (max(x) - min(x))
}
# Apply normalization
company$salary_norm <- normalize(company$salary)
# Feature Engineering
company$salary_bracket <- cut(
company$salary,
breaks = 3,
labels = c("Low", "Medium", "High")
)
company$performance_category <- cut(
company$performance_score,
breaks = 3,
labels = c("Low", "Medium", "High")
)
# Table
datatable(
company %>% select(salary, salary_norm, salary_bracket),
caption = "Salary Transformation Table",
options = list(pageLength = 10),
rownames = FALSE
)
# Histogram
df_plot <- data.frame(
value = c(company$salary, company$salary_norm * max(company$salary)),
type = c(rep("Before", nrow(company)),
rep("After", nrow(company)))
)
p1 <- ggplot(df_plot, aes(x=value, fill=type)) +
geom_histogram(alpha=0.5, bins=30, position="identity") +
labs(
title="Salary Distribution: Before vs After Normalization",
x="Salary",
y="Count",
fill="Condition"
) +
scale_fill_manual(values = c("#FFB3C1", "#A0C4FF")) +
theme_minimal()
ggplotly(p1)
# Boxplot
df_box <- data.frame(
value = c(company$salary, company$salary_norm * max(company$salary)),
type = c(rep("Before", nrow(company)),
rep("After", nrow(company)))
)
p2 <- ggplot(df_box, aes(x=type, y=value, fill=type)) +
geom_boxplot(alpha=0.7) +
labs(
title="Salary Distribution: Before vs After Normalization",
x="Condition",
y="Salary",
fill="Condition"
) +
scale_fill_manual(values = c("#FFB3C1", "#A0C4FF")) +
theme_minimal()
ggplotly(p2)
Interpretation:
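Independent of the actual salary values, one property of the transformation follows from the formula alone: min-max normalization is a linear rescaling, so it changes the range of the values but not their ordering or relative spacing. The sketch below illustrates this with made-up salaries (`toy_salary` and the helper name `normalize_minmax` are illustrative; the helper repeats the formula used in `normalize`).

```r
# Illustrative salaries only; not taken from the practicum dataset
toy_salary <- c(3000, 4500, 5000, 7500, 12000)

# Same min-max formula as normalize() in the task code
normalize_minmax <- function(x) (x - min(x)) / (max(x) - min(x))

toy_norm     <- normalize_minmax(toy_salary)  # values in [0, 1]
toy_rescaled <- toy_norm * max(toy_salary)    # what the "After" series plots

print(data.frame(toy_salary, toy_norm = round(toy_norm, 3), toy_rescaled))

# A linear rescaling leaves the values perfectly correlated with the originals
print(cor(toy_salary, toy_norm))
```

Under this rescaling the "After" series runs from 0 to the maximum salary, while the "Before" series runs from the minimum to the maximum salary, so the two distributions have the same shape on different ranges.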
# Library
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(DT)
# Load Data
data <- read_csv("C:/Users/Asus/OneDrive/Desktop/Assignment Week 5/dataset task 7.csv")
datatable(data,
caption = "Dataset",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
# Color
PASTEL <- c(
"#FFB3C1", "#FFD6A5", "#A0C4FF",
"#BDB2FF", "#CAFFBF", "#FFC6FF",
"#9BF6FF", "#FDFFB6", "#CAFFBF",
"#E7C6FF"
)
# Summary per Company
company_summary <- data %>%
group_by(company_id) %>%
summarise(
avg_salary = mean(salary),
avg_KPI = mean(KPI_score),
top_performers = sum(KPI_score > 90)
)
datatable(company_summary, caption = "Company Summary")
# KPI Tier (Loop)
KPI_tier <- c()
for(i in 1:nrow(data)){
if(data$KPI_score[i] >= 90){
KPI_tier[i] <- "Excellent"
} else if(data$KPI_score[i] >= 75){
KPI_tier[i] <- "Good"
} else if(data$KPI_score[i] >= 60){
KPI_tier[i] <- "Average"
} else {
KPI_tier[i] <- "Low"
}
}
data$KPI_tier <- KPI_tier
datatable(data %>% select(employee_id, KPI_score, KPI_tier),
caption = "KPI Tier Table")
# Top Performers
top_perf_summary <- data %>%
filter(KPI_score > 90) %>%
group_by(company_id) %>%
summarise(count = n()) %>%
arrange(count)
top_perf_summary$company_id <- factor(
top_perf_summary$company_id,
levels = top_perf_summary$company_id
)
bar_plot <- plot_ly(
top_perf_summary,
x = ~company_id,
y = ~count,
type = "bar",
color = ~company_id,
text = ~count,
textposition = "outside"
) %>%
layout(
title = "Top Performers per Company",
xaxis = list(title = "Company"),
yaxis = list(title = "Count",
range = c(0, max(top_perf_summary$count) * 1.2)),
showlegend = FALSE
)
bar_plot
# AVG Salary per Department
dept_summary <- data %>%
group_by(department) %>%
summarise(avg_salary = mean(salary)) %>%
arrange(avg_salary)
dept_summary$department <- factor(
dept_summary$department,
levels = dept_summary$department
)
bar_plot2 <- plot_ly(
dept_summary,
x = ~department,
y = ~avg_salary,
type = "bar",
color = ~department,
text = ~round(avg_salary, 0),
textposition = "outside"
) %>%
layout(
title = "Average Salary per Department",
xaxis = list(title = "Department"),
yaxis = list(title = "Average Salary",
range = c(0, max(dept_summary$avg_salary) * 1.2)),
showlegend = FALSE
)
bar_plot2
# Salary Distribution
p3 <- ggplot(data, aes(x=salary)) +
geom_histogram(fill=PASTEL[1], bins=30, alpha=0.7) +
labs(
title="Salary Distribution",
x="Salary",
y="Count"
) +
theme_minimal()
ggplotly(p3)
# Scatter + Regression
p4 <- ggplot(data,
aes(x=salary, y=performance_score)) +
geom_point(color=PASTEL[3], size=2) +
geom_smooth(method="lm", se=FALSE, color=PASTEL[4]) +
labs(
title="Salary vs Performance Score",
x="Salary",
y="Performance Score"
) +
theme_minimal()
ggplotly(p4)
Interpretation:
In conclusion, this practicum provides a comprehensive understanding of fundamental data processing and analysis using R. Through a series of tasks, various programming concepts such as functions, loops, and data manipulation techniques have been effectively applied to handle and analyze data. Furthermore, the practicum demonstrates how raw data can be transformed into meaningful information through preprocessing, feature engineering, and visualization. The use of simulated datasets, including employee and sales data, also highlights the practical application of these concepts in real-world scenarios. Overall, this practicum not only strengthens technical programming skills but also enhances analytical thinking in interpreting data and generating relevant insights.