
Clara Maisie Wanghili

52250039

Student Major in Data Science at
Institut Teknologi Sains Bandung

Data Science, Data Science Programming. Lecturer: Bakti Siregar, M.Sc., CDS.

Introduction

In programming, we often need to repeat the same tasks many times. Functions and loops help make our code more organized and efficient. A function allows us to reuse code for specific tasks, while a loop helps us run the same process multiple times without rewriting it.

In this practicum, these concepts are used in several activities, such as creating functions, using loops, and applying conditional logic. They are also used for more practical tasks like generating data, processing datasets, and performing simple analysis.

We also work with data by summarizing it, creating new features, and visualizing it using graphs. These visualizations help us understand patterns and relationships in the data more easily.

Overall, this practicum shows how basic programming concepts can be applied to solve problems and analyze data in a more structured and efficient way.

1 Dynamic Multi-Formula

This code creates a dynamic system to calculate multiple mathematical formulas using a function and loops.

First, a function called compute_formula is created. This function takes two inputs: a value x and a formula type (linear, quadratic, cubic, or exponential). Based on the formula chosen, the function calculates different results.

Then, a sequence of values from 1 to 20 is generated. For each formula, a loop is used to calculate the result of the function for every value of x. The results are stored in a data frame, so each formula has its own column.

After that, the data is reshaped into a long format using pivot_longer, which makes it easier to visualize multiple formulas in one plot.

Finally, a line plot is created using ggplot, where each formula is shown as a different colored line.

library(tidyr)
library(ggplot2)
library(plotly)

compute_formula <- function(x, formula){
  if(formula == "linear"){
    return(2*x + 3)
  } else if(formula == "quadratic"){
    return(x^2 + 2*x + 1)
  } else if(formula == "cubic"){
    return(x^3)
  } else if(formula == "exponential"){
    return(2^x)
  } else {
    stop("Invalid formula")
  }
}

x <- 1:20
formulas <- c("linear","quadratic","cubic","exponential")

df <- data.frame(x)

for(f in formulas){
  y <- c()
  for(i in x){
    y <- c(y, compute_formula(i,f))
  }
  df[[f]] <- y
}

df_long <- pivot_longer(df, -x)

# colors
colors <- c(
  "linear" = "#FFB3BA",
  "quadratic" = "#BAFFC9",
  "cubic" = "#BAE1FF",
  "exponential" = "#FFD6A5"
)

# ggplot
p <- ggplot(df_long, aes(x, value, color=name)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  scale_color_manual(values = colors) +
  ggtitle("Multiple Formula Plot") +
  theme_minimal()

ggplotly(p)
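
The nested loop above can also be written without growing `y` one element at a time. The following is a minimal sketch rather than the approach the practicum asks for: it assumes the same four formulas and uses `switch()` together with `vapply()`. The names `compute_formula2` and `df2` are introduced here only for illustration.

```r
# Alternative sketch: same formulas, dispatched with switch() instead of if/else.
compute_formula2 <- function(x, formula) {
  switch(formula,
    linear      = 2 * x + 3,
    quadratic   = x^2 + 2 * x + 1,
    cubic       = x^3,
    exponential = 2^x,
    stop("Invalid formula")  # unnamed last argument acts as the default branch
  )
}

x <- 1:20
formulas <- c("linear", "quadratic", "cubic", "exponential")

# One column per formula; vapply() checks that each call returns a single number.
df2 <- data.frame(x = x)
for (f in formulas) {
  df2[[f]] <- vapply(x, compute_formula2, numeric(1), formula = f)
}

tail(df2, 1)  # values at x = 20
```

Because the arithmetic inside each branch is already vectorized, calling `compute_formula2(x, f)` on the whole vector would give the same column without `vapply()`.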


The visualization shows how each mathematical formula grows as the value of x increases.

  • The linear function increases steadily in a straight line.
  • The quadratic function grows faster and forms a curved line.
  • The cubic function increases even more rapidly.
  • The exponential function grows the fastest, which is why its line rises very sharply compared to the others.

Because the exponential values become very large (2^20 = 1,048,576, against 8,000 for the cubic at x = 20), the other lines look almost flat near the bottom of the graph. The formulas have different growth rates, and for large x exponential growth far outpaces linear, quadratic, and cubic growth.
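
The ordering of the curves is not the same over the whole range. As a quick check (a sketch in base R, using the same cubic and exponential formulas as above), the snippet below lists the x values in 1 to 20 where the exponential exceeds the cubic and compares the two at the right edge of the plot.

```r
x <- 1:20
cubic       <- x^3
exponential <- 2^x

# x values where 2^x is larger than x^3
x[exponential > cubic]

# how many times larger the exponential is at x = 20
exponential[20] / cubic[20]
```

On this range the cubic is actually above the exponential for x from 2 to 9, and the exponential only takes over from x = 10. A logarithmic y-axis (for example `scale_y_log10()` in ggplot2) is one way to keep all four curves readable in a single plot.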

    2 Sales Simulation

    This code simulates daily sales for several salespersons over a number of days using functions, loops, and conditional logic.

    First, a helper function called cumulative_sales is created. It takes a vector of daily sales and returns the running total, adding each day’s sales to the previous total so we can see how sales grow day by day. It is defined at the top level and called from inside simulate_sales.

    Then, the main function simulate_sales generates the data. For each salesperson, a loop draws a random integer daily sales value between 100 and 1000, and the cumulative sales are then calculated with cumulative_sales.

    Next, a conditional statement assigns a discount based on each day’s sales value: 20% for sales above 800, 10% for sales above 500, and 5% otherwise.

    Finally, all the generated data (salesperson, day, sales, discount, and cumulative sales) are combined into a data frame and returned.

    library(plotly)
    
    # Nested Function (Cumulative)
    cumulative_sales <- function(sales){
      total <- 0
      result <- c()
      
      for(s in sales){
        total <- total + s
        result <- c(result, total)
      }
      
      return(result)
    }
    
    # function
    simulate_sales <- function(n_salesperson, days){
      data <- data.frame()
      
      for(sp in 1:n_salesperson){
        
        sales_list <- c()
        
        # generate sales
        for(d in 1:days){
          sales <- sample(100:1000, 1)
          sales_list <- c(sales_list, sales)
        }
        
        # use nested function
        cumulative <- cumulative_sales(sales_list)
        
        # filled data
        for(d in 1:days){
          sales <- sales_list[d]
          
          # conditional discount
          if(sales > 800){
            discount <- 0.2
          } else if(sales > 500){
            discount <- 0.1
          } else {
            discount <- 0.05
          }
          
          data <- rbind(data, data.frame(
            salesperson = sp,
            day = d,
            sales = sales,
            discount = discount,
            cumulative = cumulative[d]
          ))
        }
      }
      
      return(data)
    }
    
    sales <- simulate_sales(5, 10)
    
    # colors
    colors <- c("#FFB3BA", "#BAE1FF", "#BAFFC9", "#FFDFBA", "#E8BAFF")
    
    # visualization
    fig <- plot_ly()
    
    for(sp in unique(sales$salesperson)){
      sp_data <- sales[sales$salesperson == sp, ]
      
      fig <- add_trace(
        fig,
        data = sp_data,
        x = ~day,
        y = ~cumulative,
        type = "scatter",
        mode = "lines+markers",
        name = paste("Salesperson", sp),
        line = list(color = colors[sp], width = 2.5),
        marker = list(color = colors[sp], size = 7),
        hovertemplate = paste0(
          "<b>Salesperson ", sp, "</b><br>",
          "Day: %{x}<br>",
          "Cumulative Sales: %{y:,}<br>",
          "<extra></extra>"
        )
      )
    }
    
    fig <- layout(
      fig,
      title = list(text = "Cumulative Sales per Salesperson"),
      xaxis = list(title = "Day", dtick = 1),
      yaxis = list(title = "Cumulative Sales"),
      legend = list(title = list(text = "Salesperson")),
      hovermode = "x unified"
    )
    
    fig
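
The hand-written accumulation loop computes the same thing as base R's `cumsum()`. The sketch below checks this on one simulated salesperson. The call `set.seed(42)` is an arbitrary choice added here so the draw is repeatable; it is not part of the original code.

```r
# Loop version, as defined in this section
cumulative_sales <- function(sales) {
  total <- 0
  result <- c()
  for (s in sales) {
    total <- total + s
    result <- c(result, total)
  }
  result
}

set.seed(42)                                   # arbitrary seed, for repeatability only
daily <- sample(100:1000, 10, replace = TRUE)  # ten daily sales values

loop_total   <- cumulative_sales(daily)
vector_total <- cumsum(daily)

all.equal(loop_total, as.numeric(vector_total))  # both give the same running total
```

Writing the loop by hand is useful for practicing the concept, while `cumsum()` is the shorter choice once the logic is understood.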


    The visualization shows cumulative sales for each salesperson over time.

    Each line represents one salesperson; the x-axis shows the day and the y-axis shows the total accumulated sales. Every line rises from day to day because each daily sales value is at least 100, so the cumulative total can only increase.

    Lines that rise more steeply reach higher totals faster. Because all daily values are drawn at random from the same range, these differences reflect chance in the simulation rather than real differences in performance; with real data, a steeper cumulative line would indicate a more productive salesperson.

    3 Performance Categorization

    This code is used to categorize sales performance into different levels based on the sales value.

    First, a function called categorize is created. This function takes sales values as input and uses conditional logic to assign a category: Excellent (above 800), Very Good (above 600), Good (above 400), Average (above 200), or Poor (200 or below). The higher the sales value, the better the category.

    The function is applied to the sales data using sapply, so each sales value is converted into a category. The result is stored in a new column called category.

    After that, the code counts how many data points fall into each category using table, and then calculates the percentage using prop.table.

    Finally, two visualizations are created: a bar chart to show the count of each category and a pie chart to show the proportion of each category.

    library(plotly)
    
    set.seed(123)
    
    # simulate sales data
    simulate_sales <- function(n_salesperson, days){
      data <- data.frame()
      
      for(sp in 1:n_salesperson){
        
        for(d in 1:days){
          
          sales <- sample(100:1000, 1)
          
          data <- rbind(data, data.frame(
            salesperson = sp,
            day = d,
            sales = sales
          ))
        }
      }
      
      return(data)
    }
    
    sales <- simulate_sales(5, 10)
    
    # categorization function
    categorize <- function(x){
      sapply(x, function(s){
        if(s > 800) "Excellent"
        else if(s > 600) "Very Good"
        else if(s > 400) "Good"
        else if(s > 200) "Average"
        else "Poor"
      })
    }
    
    sales$category <- categorize(sales$sales)
    
    # count
    counts  <- table(sales$category)
    percent <- round(prop.table(counts) * 100, 1)
    
    # colors 
    colors <- c(
      "Excellent" = "#FFB3BA",
      "Very Good" = "#BAE1FF",
      "Good"      = "#BAFFC9",
      "Average"   = "#FFDFBA",
      "Poor"      = "#E8BAFF"
    )
    
    cat_names  <- names(counts)
    cat_colors <- colors[cat_names]
    
    # bar chart
    fig_bar <- plot_ly(
      x    = cat_names,
      y    = as.numeric(counts),
      type = "bar",
      marker = list(color = cat_colors),
      text = as.numeric(counts),
      textposition = "outside",
      hovertemplate = "<b>%{x}</b><br>Count: %{y}<extra></extra>"
    ) %>%
      layout(
        title  = list(text = "Performance Category Distribution"),
        xaxis  = list(title = "Category"),
        yaxis  = list(title = "Count", range = c(0, max(counts) * 1.3)),
        showlegend = FALSE
      )
    
    # pie chart
    fig_pie <- plot_ly(
      labels = cat_names,
      values = as.numeric(counts),
      type   = "pie",
      marker = list(colors = cat_colors),
      textinfo      = "label+percent",
      textposition  = "inside",
      insidetextorientation = "radial",
      hovertemplate = "<b>%{label}</b><br>Count: %{value}<br>Percent: %{percent}<extra></extra>",
      showlegend = TRUE
    ) %>%
      layout(
        title  = list(text = "Performance Distribution"),
        legend = list(title = list(text = "Category"))
      )
    
    # display
    fig_bar
    fig_pie
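
`table()` on a character vector lists the categories alphabetically (Average, Excellent, Good, Poor, Very Good), so the counts do not follow the performance ranking. One possible fix, sketched here with made-up sales values, is to turn the category into a factor with explicit levels before counting. The vector `example_sales` and the name `level_order` are illustrative only.

```r
categorize <- function(x) {
  sapply(x, function(s) {
    if (s > 800) "Excellent"
    else if (s > 600) "Very Good"
    else if (s > 400) "Good"
    else if (s > 200) "Average"
    else "Poor"
  })
}

example_sales <- c(950, 120, 610, 450, 300, 820, 700)  # made-up values

level_order <- c("Excellent", "Very Good", "Good", "Average", "Poor")
category <- factor(categorize(example_sales), levels = level_order)

counts <- table(category)
counts                              # entries now follow level_order
round(prop.table(counts) * 100, 1)  # percentages in the same order
```

With the factor in place, `names(counts)` can be passed to the charts as before and the categories appear from best to worst.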

    Bar Chart

    The bar chart shows the number of daily sales records (5 salespersons × 10 days = 50 observations) in each performance category. Average and Very Good have the highest counts, while Poor has the lowest. Most daily sales therefore fall in the moderate to good range, with only a small number of low values.

    Pie Chart

    The pie chart shows the percentage distribution of the categories. The largest portions are Average (26%) and Very Good (26%), followed by Excellent (20%) and Good (18%), while Poor (10%) is the smallest. Poor is expected to be the smallest group because its range (100 to 200) is about half as wide as the 200-point ranges of the other categories.

    Overall, most daily sales fall into the middle categories, and no single category dominates the distribution.
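
Because the sales are drawn uniformly from the integers 100 to 1000, the expected share of each category follows directly from how many integers fall in each range. The short calculation below is a sketch of that arithmetic. It applies the same thresholds as `categorize` to every possible sales value and uses no simulated data.

```r
values <- 100:1000

# Same thresholds as categorize(), applied to every possible sales value.
# Intervals are right-closed, so 200 is "Poor" and 201 is "Average".
category <- cut(
  values,
  breaks = c(-Inf, 200, 400, 600, 800, Inf),
  labels = c("Poor", "Average", "Good", "Very Good", "Excellent")
)

expected <- round(prop.table(table(category)) * 100, 1)
expected
```

The expected shares are about 11.2% for Poor and 22.2% for each of the other four categories. With only 50 observations, deviations of a few percentage points from these values are ordinary sampling noise.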

    4 Multi-Company Dataset Simulation

    This code analyzes a dataset containing employee data from several companies.

    First, the data is loaded from a CSV file and shown in an interactive table for easy reading.

    Next, a new logical variable called top_performer is created using a condition: employees with a KPI score above 90 are flagged as top performers.

    Then, the data is summarized by company_id. For each company, the following are calculated:

  • average salary
  • average KPI
  • maximum KPI value

    This summary helps us compare performance across companies.

    Finally, the results are visualized using bar charts, one per measure, to highlight the differences between companies.

    library(plotly)
    library(dplyr)
    library(DT)
    
    # load CSV 
    company <- read.csv("company_data.csv")
    
    datatable(
      company,
      caption = "Table: Company Employee Dataset",
      options = list(
        pageLength = 10,
        autoWidth = TRUE,
        scrollX = TRUE
      ),
      rownames = FALSE
    )
    # top performer
    company$top_performer <- company$KPI_score > 90
    
    # summary per company
    summary <- company %>%
      group_by(company_id) %>%
      summarise(
        avg_salary     = mean(salary),
        avg_performance = mean(KPI_score),
        max_kpi        = max(KPI_score)
      )
    
    summary
    # colors
    colors <- c("#FFB3BA", "#BAE1FF", "#BAFFC9", "#FFDFBA", "#E8BAFF",
                       "#FFD6BA", "#B3F0FF", "#FFB3F0", "#D4FFBA", "#BAC9FF")
    bar_colors <- colors[1:nrow(summary)]
    
    # Avg Salary
    plot_ly(
      data = summary,
      x    = ~factor(company_id),
      y    = ~avg_salary,
      type = "bar",
      marker = list(color = bar_colors),
      text = ~round(avg_salary, 0),
      textposition = "outside",
      hovertemplate = "<b>Company %{x}</b><br>Avg Salary: %{y:,.0f}<extra></extra>"
    ) %>%
      layout(
        title      = list(text = "Average Salary per Company"),
        xaxis      = list(title = "Company"),
        yaxis      = list(title = "Average Salary", range = c(0, max(summary$avg_salary) * 1.15)),
        showlegend = FALSE
      )
    # Avg KPI
    plot_ly(
      data = summary,
      x    = ~factor(company_id),
      y    = ~avg_performance,
      type = "bar",
      marker = list(color = bar_colors),
      text = ~round(avg_performance, 1),
      textposition = "outside",
      hovertemplate = "<b>Company %{x}</b><br>Avg KPI: %{y:.1f}<extra></extra>"
    ) %>%
      layout(
        title      = list(text = "Average KPI Score per Company"),
        xaxis      = list(title = "Company"),
        yaxis      = list(title = "Average KPI", range = c(0, max(summary$avg_performance) * 1.15)),
        showlegend = FALSE
      )
    # Max KPI
    plot_ly(
      data = summary,
      x    = ~factor(company_id),
      y    = ~max_kpi,
      type = "bar",
      marker = list(color = bar_colors),
      text = ~max_kpi,
      textposition = "outside",
      hovertemplate = "<b>Company %{x}</b><br>Max KPI: %{y}<extra></extra>"
    ) %>%
      layout(
        title      = list(text = "Maximum KPI Score per Company"),
        xaxis      = list(title = "Company"),
        yaxis      = list(title = "Max KPI", range = c(0, max(summary$max_kpi) * 1.15)),
        showlegend = FALSE
      )
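
To see what the grouped summary produces without the CSV file, here is a minimal sketch on a tiny made-up data frame. The rows in `toy` are invented for illustration; only the column names (`company_id`, `salary`, `KPI_score`) follow the dataset used above. It uses base R's `aggregate()` instead of dplyr so that it runs without extra packages, and it adds a count of top performers (`n_top`), which the original summary does not include.

```r
toy <- data.frame(
  company_id = c(1, 1, 1, 2, 2, 2),
  salary     = c(5000, 6000, 7000, 6500, 7500, 8000),
  KPI_score  = c(88, 93, 75, 91, 95, 60)
)

# Same rule as in the section: KPI above 90 marks a top performer
toy$top_performer <- toy$KPI_score > 90

avg <- aggregate(cbind(salary, KPI_score) ~ company_id, data = toy, FUN = mean)
mx  <- aggregate(KPI_score ~ company_id, data = toy, FUN = max)
top <- aggregate(top_performer ~ company_id, data = toy, FUN = sum)

toy_summary <- data.frame(
  company_id      = avg$company_id,
  avg_salary      = avg$salary,
  avg_performance = avg$KPI_score,
  max_kpi         = mx$KPI_score,
  n_top           = top$top_performer
)
toy_summary
```

Summing a logical column counts the TRUE values, which is why `top_performer` can be aggregated directly.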

    Interpretation

    1. Average Salary per Company

    This chart shows the average salary for each company. Company 2 has the highest average salary (around 6925), while Company 5 has the lowest (around 6445). The gap of about 480, roughly 7%, is not very large, indicating that salary levels across companies are relatively similar.

    2. Average KPI Score per Company

    This chart shows the average KPI score for each company. Company 2 and Company 4 have slightly higher average KPI scores, while Company 3 has the lowest. This suggests some variation in employee performance across companies.

    3. Maximum KPI Score per Company

    This chart shows the highest KPI score achieved in each company. All companies have high maximum KPI values (above 90), indicating that every company has at least one high-performing employee. Company 2 has the highest maximum KPI (98).

    Overall, salary levels are relatively consistent across companies, but KPI performance shows some variation. While each company has top performers, some companies perform slightly better on average than others.

    5 Monte Carlo Simulation

    This code uses the Monte Carlo method to estimate the value of \(π\) (pi) using random points.

    First, random points are generated using runif, which creates values between 0 and 1 for both x and y coordinates. These points represent positions inside a square.

    Then, a condition checks whether each point lies inside the quarter circle of radius 1, using \(x^2 + y^2 \le 1\). Points inside are marked TRUE and the others FALSE.

    After that, π is estimated as 4 times the proportion of points inside the quarter circle. The quarter circle has area π/4 and the unit square has area 1, so that proportion approximates π/4.

    The code also estimates the probability that a point falls in the smaller square where both x and y are below 0.5.

    Finally, all the points are visualized using a scatter plot, with different colors showing whether the points are inside or outside the circle.

    library(plotly)
    
    set.seed(123)
    
    monte_carlo_pi <- function(n_points){
      
      x <- runif(n_points)
      y <- runif(n_points)
      
      inside <- (x^2 + y^2) <= 1
      
      pi_est <- 4 * sum(inside) / n_points
      
      subsquare <- (x < 0.5 & y < 0.5)
      prob <- sum(subsquare) / n_points
      
      cat("Estimated Pi:", round(pi_est, 4), "\n")
      cat("Probability:", round(prob, 4), "\n")
      
      df <- data.frame(x = x, y = y, inside = inside)
      
      plot_ly(
        data  = df,
        x     = ~x,
        y     = ~y,
        color = ~inside,
        colors = c("FALSE" = "#E8BAFF", "TRUE" = "#FFB3F0"),
        type  = "scatter",
        mode  = "markers",
        marker = list(opacity = 0.6, size = 4),
        hovertemplate = "X: %{x:.3f}<br>Y: %{y:.3f}<extra></extra>"
      ) %>%
        layout(
          title  = list(text = paste0("Monte Carlo Simulation — Estimated Pi: ", round(pi_est, 4))),
          xaxis  = list(title = "X", scaleanchor = "y", scaleratio = 1),
          yaxis  = list(title = "Y"),
          legend = list(
            title = list(text = "Point Type"),
            itemclick = "toggleothers"
          )
        ) %>%
        style(name = "Outside Circle", traces = 1) %>%
        style(name = "Inside Circle",  traces = 2)
    }
    
    monte_carlo_pi(5000)
    ## Estimated Pi: 3.1816 
    ## Probability: 0.2506


    The plot shows randomly generated points inside the unit square. With the color mapping in the code, the pink points (#FFB3F0, "Inside Circle") lie inside the quarter circle, while the light purple points (#E8BAFF, "Outside Circle") lie outside it.

    The points marked as inside clearly outline a quarter circle, because only points that satisfy \(x^2 + y^2 \le 1\) receive that label.

    The proportion of points that fall inside the quarter circle approximates π/4. The simulation shows that random sampling can be used to estimate mathematical values like π, and the estimate tends to get closer to the actual value as the total number of points increases.

    Output Estimated Pi & Probability

    The output shows the results of the Monte Carlo simulation.

  • Estimated Pi: 3.1816. This is the estimate of π based on the 5000 random points. The actual value of π is about 3.1416, so the error is about 0.04 (roughly 1.3%). The difference arises because the simulation relies on random sampling.

  • Probability: 0.2506. This is the estimated probability that a point falls inside the smaller square (from 0 to 0.5 on both x and y). The expected probability is 0.5 × 0.5 = 0.25, and the result (0.2506) is very close to it.

    Both results are consistent with the theoretical values, which suggests that the simulation works as intended. Running it with more points would, on average, bring the estimates closer still.
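
How the error shrinks with sample size can be checked directly. The sketch below repeats the estimate for several sample sizes. Here `estimate_pi` is a stripped-down helper written for this illustration (no plotting), and the seed is arbitrary. For a proportion estimated from n independent points, the standard error of the π estimate is \(4\sqrt{p(1-p)/n}\) with \(p = π/4\), so it falls roughly as \(1/\sqrt{n}\).

```r
# Stripped-down estimator for illustration: no plot, returns the estimate only
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(2024)  # arbitrary seed so the run is repeatable
n_values  <- c(100, 1000, 10000, 100000)
estimates <- vapply(n_values, estimate_pi, numeric(1))

# Theoretical standard error of the estimate for each n
p  <- pi / 4
se <- 4 * sqrt(p * (1 - p) / n_values)

data.frame(
  n         = n_values,
  estimate  = estimates,
  abs_error = abs(estimates - pi),
  std_error = round(se, 4)
)
```

For n = 5000 the same formula gives a standard error of about 0.023, so the observed error of 0.04 lies within two standard errors.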

    6 Advanced Data Transformation & Feature Engineering

    This code focuses on transforming data and creating new features to make the dataset more useful for analysis.

    First, the dataset is loaded from a CSV file. Then, a min-max normalization function is created to scale the salary values between 0 and 1 by subtracting the minimum and dividing by the range (maximum minus minimum). This makes the values easier to compare.

    Next, a new column called salary_norm is created using the normalization function.

    After that, feature engineering is applied by creating a new variable called salary_bracket. Using cut with breaks = 3, the observed salary range is split into three equal-width intervals labeled Low, Medium, and High.

    The data is also displayed in a table to show the transformation results clearly.

    Finally, the data is visualized using histograms and boxplots to compare the distribution of salary before and after normalization.

    library(ggplot2)
    library(dplyr)
    library(DT)
    library(plotly)
    
    # load data
    company <- read.csv("company_data.csv")
    set.seed(123)
    
    # normalization function
    normalize <- function(x){
      (x - min(x)) / (max(x) - min(x))
    }
    
    company$salary_norm <- normalize(company$salary)
    
    # feature engineering
    company$salary_bracket <- cut(
      company$salary,
      breaks = 3,
      labels = c("Low", "Medium", "High")
    )
    
    # table
    datatable(
      company %>% select(salary, salary_norm, salary_bracket),
      caption = "Table: Salary Transformation",
      options = list(pageLength = 10),
      rownames = FALSE
    )
    # prepare data
    df_plot <- data.frame(
      value = c(company$salary, company$salary_norm * max(company$salary)),
      type  = c(rep("Before", nrow(company)),
                rep("After",  nrow(company)))
    )
    
    # interactive histogram
    p_hist <- plot_ly(
      data = df_plot,
      x    = ~value,
      color = ~type,
      colors = c("After" = "#F4A7B9", "Before" = "#80CBC4"),
      type  = "histogram",
      nbinsx = 30,
      opacity = 0.6,
      hovertemplate = paste(
        "Condition: %{legendgroup}<br>",
        "Salary: %{x}<br>",
        "Count: %{y}<extra></extra>"
      )
    ) %>%
      layout(
        barmode = "overlay",
        title   = list(text = "Salary Distribution: Before vs After Normalization"),
        xaxis   = list(title = "Salary"),
        yaxis   = list(title = "Count"),
        legend  = list(title = list(text = "Condition")),
        plot_bgcolor  = "white",
        paper_bgcolor = "white",
        hovermode = "x unified"
      )
    
    p_hist
    # interactive boxplot
    p_box <- plot_ly(
      data   = df_plot,
      x      = ~type,
      y      = ~value,
      color  = ~type,
      colors = c("After" = "#F4A7B9", "Before" = "#80CBC4"),
      type   = "box",
      boxmean = FALSE,
      hovertemplate = paste(
        "Condition: %{x}<br>",
        "Value: %{y:.2f}<extra></extra>"
      )
    ) %>%
      layout(
        title  = list(text = "Salary Distribution: Before vs After Normalization (Boxplot)"),
        xaxis  = list(title = "Condition"),
        yaxis  = list(title = "Salary"),
        legend = list(title = list(text = "Condition")),
        plot_bgcolor  = "white",
        paper_bgcolor = "white"
      )
    
    p_box
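
The effect of the two transformations is easiest to see on a handful of numbers. The sketch below uses a made-up salary vector (`salary` here is illustrative, not the CSV column) and also shows what the rescaling used for the plots, `salary_norm * max(salary)`, does to the range.

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))  # note: undefined if all values are equal
}

salary <- c(4000, 5500, 7000, 8500, 10000)  # made-up values

salary_norm <- normalize(salary)
salary_norm  # runs from 0 to 1 with the same relative spacing as the original

# Equal-width brackets: cut() splits the observed range into 3 intervals
bracket <- cut(salary, breaks = 3, labels = c("Low", "Medium", "High"))
table(bracket)

# Rescaled series plotted as "After": spans 0 to max(salary)
range(salary)
range(salary_norm * max(salary))
```

In this example the original values span 4000 to 10000, while the rescaled "After" values span 0 to 10000. Equal-width brackets are only one option; brackets based on quantiles would put roughly the same number of employees in each group.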

    Histogram (Before vs After Normalization)

    This histogram compares the salary distribution before and after normalization. The “Before” data are the original salary values. The “After” data are the normalized values (0 to 1) multiplied by the maximum salary so that both series can be drawn on the same axis.

    The two shapes look similar because min-max normalization is a linear rescaling: it changes the range of the data, not the shape of the distribution.

    Boxplot (Before vs After Normalization)

    The boxplot shows the spread of the values before and after normalization. With the rescaling used in the code, the “After” values run from 0 to the maximum salary, while the “Before” values run from the minimum to the maximum salary, so the “After” box starts lower and is stretched by a factor of max / (max − min).

    Because the transformation is linear, the relative positions of the median, quartiles, and any outliers stay the same. On its natural scale, the normalized column simply maps them onto the interval from 0 to 1.

    7 Mini Project: Company KPI Dashboard & Simulation

    This mini-project analyzes company data using functions, loops, and visualizations.

    First, the data is read from a CSV file. A loop then assigns each employee a KPI category: Top (above 90), High (above 75), Medium (above 60), or Low (60 or below).

    After that, the data is summarized by company to calculate the average salary, the average KPI, and the number of top performers.

    Lastly, several visualizations are created to give insight into top performers, departmental distribution, salary distribution, and the relationship between salary and KPI.

    library(ggplot2)
    library(dplyr)
    library(DT)
    library(plotly)
    
    # load data
    company_df <- read.csv("task7.csv")
    
    datatable(
      company_df,
      caption = "Dataset",
      options = list(
        pageLength = 10,
        autoWidth = TRUE,
        scrollX = TRUE
      ),
      rownames = FALSE
    )
    # loop (KPI category)
    company_df$kpi_category <- ""
    for(i in seq_len(nrow(company_df))){  # seq_len() also handles a data frame with zero rows
      if(company_df$KPI_score[i] > 90){
        company_df$kpi_category[i] <- "Top"
      } else if(company_df$KPI_score[i] > 75){
        company_df$kpi_category[i] <- "High"
      } else if(company_df$KPI_score[i] > 60){
        company_df$kpi_category[i] <- "Medium"
      } else {
        company_df$kpi_category[i] <- "Low"
      }
    }
    
    # summary
    summary_df <- company_df %>%
      group_by(company_id) %>%
      summarise(
        avg_salary     = mean(salary),
        avg_kpi        = mean(KPI_score),
        top_performers = sum(KPI_score > 90)
      )
    
    # table summary
    datatable(
      summary_df,
      caption = "Summary per Company",
      options = list(pageLength = 5),
      rownames = FALSE
    )
    # color
    pastel_colors <- c(
      "#F4A7B9", "#A8D8EA", "#B5EAD7", "#FFD6A5",
      "#C9B1FF", "#FFDDD2", "#D4F1F4", "#FCE4EC"
    )
    
    # visualization 1: top performers per company
    p1 <- plot_ly(
      data = summary_df,
      x    = ~factor(company_id),
      y    = ~top_performers,
      type = "bar",
      marker = list(color = pastel_colors[1:nrow(summary_df)]),
      text = ~top_performers,
      textposition = "outside",
      hovertemplate = paste(
        "Company: %{x}<br>",
        "Top Performers: %{y}<extra></extra>"
      )
    ) %>%
      layout(
        title       = list(text = "Top Performers per Company"),
        xaxis       = list(title = "Company"),
        yaxis       = list(title = "Count"),
        showlegend  = FALSE,
        plot_bgcolor  = "white",
        paper_bgcolor = "white"
      )
    
    p1
    # visualization 2: department distribution per company
    dept_summary <- company_df %>%
      group_by(company_id, department) %>%
      summarise(count = n(), .groups = "drop")
    
    company_ids <- unique(dept_summary$company_id)
    
    p2 <- plot_ly()
    
    for(i in seq_along(company_ids)){
      df_sub <- dept_summary %>% filter(company_id == company_ids[i])
      p2 <- add_trace(
        p2,
        data         = df_sub,
        x            = ~department,
        y            = ~count,
        type         = "bar",
        name         = paste("Company", company_ids[i]),
        marker       = list(color = pastel_colors[i]),
        hovertemplate = paste(
          "Department: %{x}<br>",
          "Count: %{y}<br>",
          paste0("Company: ", company_ids[i], "<extra></extra>")
        )
      )
    }
    
    p2 <- p2 %>%
      layout(
        barmode = "group",
        title   = list(text = "Department Distribution per Company"),
        xaxis   = list(title = "Department"),
        yaxis   = list(title = "Count"),
        legend  = list(title = list(text = "Company")),
        plot_bgcolor  = "white",
        paper_bgcolor = "white"
      )
    
    p2
    # visualization 3: salary distribution
    p3 <- plot_ly(
      data  = company_df,
      x     = ~salary,
      type  = "histogram",
      nbinsx = 30,
      marker = list(
        color = "#A8D8EA",
        line  = list(color = "#6BB8D4", width = 0.8)
      ),
      hovertemplate = paste(
        "Salary: %{x}<br>",
        "Count: %{y}<extra></extra>"
      )
    ) %>%
      layout(
        title  = list(text = "Salary Distribution"),
        xaxis  = list(title = "Salary"),
        yaxis  = list(title = "Count"),
        plot_bgcolor  = "white",
        paper_bgcolor = "white"
      )
    
    p3
    # visualization 4: scatter + regression
    lm_model  <- lm(KPI_score ~ salary, data = company_df)
    salary_seq <- seq(min(company_df$salary), max(company_df$salary), length.out = 200)
    kpi_pred   <- predict(lm_model, newdata = data.frame(salary = salary_seq))
    
    p4 <- plot_ly() %>%
      add_trace(
        data = company_df,
        x    = ~salary,
        y    = ~KPI_score,
        type = "scatter",
        mode = "markers",
        name = "Data",
        marker = list(
          color   = "#C9B1FF",
          opacity = 0.6,
          size    = 7,
          line    = list(color = "#9B7FE0", width = 0.5)
        ),
        hovertemplate = paste(
          "Salary: %{x:,.0f}<br>",
          "KPI Score: %{y:.1f}<extra></extra>"
        )
      ) %>%
      add_trace(
        x    = salary_seq,
        y    = kpi_pred,
        type = "scatter",
        mode = "lines",
        name = "Regression",
        line = list(color = "#F4A7B9", width = 2.5),
        hovertemplate = paste(
          "Salary: %{x:,.0f}<br>",
          "Predicted KPI: %{y:.1f}<extra></extra>"
        )
      ) %>%
      layout(
        title  = list(text = "Salary vs KPI Score"),
        xaxis  = list(title = "Salary"),
        yaxis  = list(title = "KPI Score"),
        legend = list(orientation = "h", y = -0.15),
        plot_bgcolor  = "white",
        paper_bgcolor = "white"
      )
    
    p4
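
The row-by-row loop that assigns `kpi_category` can also be expressed in one vectorized call. This is a sketch of an alternative, not a replacement for the loop the practicum asks for. The vector `kpi` holds made-up scores used in place of `company_df$KPI_score`.

```r
kpi <- c(95, 90, 76, 75, 61, 60, 42)  # made-up scores, including boundary values

# Loop version, same rules as in the section
loop_cat <- character(length(kpi))
for (i in seq_along(kpi)) {
  if (kpi[i] > 90) {
    loop_cat[i] <- "Top"
  } else if (kpi[i] > 75) {
    loop_cat[i] <- "High"
  } else if (kpi[i] > 60) {
    loop_cat[i] <- "Medium"
  } else {
    loop_cat[i] <- "Low"
  }
}

# Vectorized version: right-closed intervals reproduce the strict ">" rules
cut_cat <- cut(
  kpi,
  breaks = c(-Inf, 60, 75, 90, Inf),
  labels = c("Low", "Medium", "High", "Top")
)

identical(loop_cat, as.character(cut_cat))
```

The boundary values matter here: a score of exactly 90 is High, not Top, in both versions, because the rule is "above 90".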

    1. Top Performers per Company

    This chart shows the number of top performers in each company. Company 3 has the highest number, while Company 5 has the lowest.

    2. Department Distribution per Company

    This chart shows how employees are distributed across departments in each company. Some departments have more employees than others, and the distribution varies between companies.

    3. Salary Distribution

    This histogram shows how salary values are spread. Most salaries fall in the mid to high range, with no extreme concentration.

    4. Salary vs KPI (Regression)

    This scatter plot shows the relationship between salary and KPI. The regression line is nearly flat, indicating a weak relationship between salary and performance.

    This project shows that data can be processed and visualized to understand company performance, and that a higher salary does not always mean a better KPI.
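
How weak the relationship is can be quantified rather than read off the slope by eye. The sketch below uses simulated data in which salary and KPI are drawn independently, so no real relationship exists by construction. With the real dataset, `sim` would be replaced by `company_df`. All numbers here are synthetic and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed
sim <- data.frame(
  salary    = round(runif(200, 3000, 10000)),
  KPI_score = round(runif(200, 50, 100), 1)
)

fit <- lm(KPI_score ~ salary, data = sim)

slope     <- unname(coef(fit)["salary"])  # change in KPI per unit of salary
r_squared <- summary(fit)$r.squared       # share of KPI variance explained
r         <- cor(sim$salary, sim$KPI_score)

c(slope = slope, r_squared = r_squared, correlation = r)
```

An R-squared close to 0 supports the "weak relationship" reading. The slope alone depends on the units of salary and can look flat even when the fit is good, so it is worth reporting the correlation or R-squared alongside the plot.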

    Conclusion

    Overall, this practicum demonstrates how fundamental programming concepts such as functions, loops, and conditionals can be applied to data processing and analysis.

    From the tasks, I learned how to create data, process it, and extract useful information. Visualization also helped me understand patterns, comparisons, and relationships within the data more clearly.

    From simple calculations to simulations and a mini-project, this practicum improved our understanding of how programming can be used to solve problems and support decision-making.