Urban Business Dataset

Midterm Exam

1 Introduction
2 Background
3 Purpose of Writing
4 Data Preparation
5 Data Visualization
6 Measures of Dispersion
7 Video Presentation

CHRISTIAN MICHAEL JULIANO

HIROSE KAWARIN SIRAIT

CECILIA MUTIARA HANDAYANI

DHEFIO ALIM MUZAKKI

M. YUSTIAN PUTRA MUHADI

November 15, 2025

library(tidyverse)
library(readr)
library(ggplot2)
library(dplyr)
library(ggridges)
library(knitr)
library(DT)
library(modeest)
library(forcats)

1 Introduction

In a rapidly growing urban economy, businesses operate in a highly dynamic environment influenced by population density, consumer preferences, technological adaptation, and market competition. Understanding the factors driving monthly revenue — such as marketing expenditure, product pricing, workforce size, managerial experience, and customer satisfaction — is becoming increasingly important for strategic decision-making. Business performance in urban areas also varies by city and industry sector, with unique patterns emerging among the retail, technology, manufacturing, and food and beverage sectors. To uncover these patterns, it is necessary to apply data visualization, measures of central tendency, and measures of dispersion to trace how revenue fluctuates across different types of businesses, cities, and sales channels. By conducting this descriptive and visual analysis, organizations can identify not only performance gaps but also opportunities to optimize marketing strategies, pricing, and human resource management.

2 Background

In growing cities, businesses face fast changes in competition, technology, and consumer behavior. To succeed, they must understand what affects their monthly revenue, such as marketing, pricing, staff, and customer satisfaction. Using data analysis and visualization helps reveal revenue patterns across different cities and industries, allowing companies to find gaps and improve their marketing, pricing, and management strategies.

In this case, we need to understand the measures of central tendency and central density.

2.1 Central Tendency

Central tendency is a statistical measure that indicates the central point or average value of a data set. In other words, it tells us around which value most of the data cluster. So, central tendency is like the “center of gravity” of the data — where the data values tend to group.

There are three main measures of central tendency:

Mean: the value that represents the center of the data, obtained by summing all the values and dividing by the number of data points. The mean therefore indicates the average value of all the data.

Formula to find the mean

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]

where:

$\bar{X}$ = mean
$X_i$ = the $i$-th data value
$n$ = number of data points
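
As a minimal sketch of the formula above, the following R snippet computes the mean by hand and compares it with R's built-in `mean()`. The vector `x` is made-up illustration data, not taken from the Urban Business Dataset.

```r
# Hypothetical sample values (illustration only)
x <- c(70, 85, 90, 95, 110)

# Apply the formula directly: sum of all values divided by n
n <- length(x)
manual_mean <- sum(x) / n

# Built-in equivalent
builtin_mean <- mean(x)

manual_mean   # 90
```

Both approaches give the same result; the manual version only makes the two steps of the formula (sum, then divide by n) visible.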

Median: the middle value of a data set that has been sorted from smallest to largest. The median indicates the middle position of the data rather than the arithmetic average, so it is often preferred when the data contain extreme values (outliers): a few very large or very small values barely move it.

Formula to find the median

For ungrouped data (with the values sorted in ascending order), the formula for the median is:

\[ \text{Median} = \begin{cases} X_{\frac{n+1}{2}}, & \text{if } n \text{ odd} \\ \\ \frac{X_{\frac{n}{2}} + X_{\frac{n}{2} + 1}}{2}, & \text{if } n \text{ even} \end{cases} \]

For grouped data, the formula for the median is:

\[ \text{Median} = L + \left( \frac{\frac{n}{2} - F}{f} \right) \times c \]

where:

L = lower boundary of the median class
n = total frequency
F = cumulative frequency before the median class
f = frequency of the median class
c = class width
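
The two median formulas above can be sketched in R as follows. All values, class boundaries, and the helper name `median_ungrouped` are hypothetical and chosen only to illustrate the calculation.

```r
# Ungrouped data: odd and even number of observations (hypothetical values)
x_odd  <- c(3, 9, 5, 7, 1)        # n = 5
x_even <- c(3, 9, 5, 7, 1, 11)    # n = 6

median_ungrouped <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                   # middle value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2    # average of the two middle values
  }
}

median_ungrouped(x_odd)    # 5
median_ungrouped(x_even)   # 6

# Grouped data: hypothetical classes 0-10, 10-20, 20-30, 30-40
lower <- c(0, 10, 20, 30)   # lower class boundaries
freq  <- c(5, 8, 12, 5)     # class frequencies
width <- 10                 # class width (c in the formula)

n      <- sum(freq)
cum_f  <- cumsum(freq)
k      <- which(cum_f >= n / 2)[1]          # index of the median class
F_prev <- if (k == 1) 0 else cum_f[k - 1]   # cumulative frequency before it

median_grouped <- lower[k] + ((n / 2 - F_prev) / freq[k]) * width
median_grouped   # about 21.67
```

For ungrouped data the hand-written function agrees with R's built-in `median()`; the grouped formula is an interpolation inside the median class and is only an estimate of the true median.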

Mode: the value that appears most frequently in a data set, that is, the value with the highest frequency.

How to find the mode for ungrouped data. The steps are as follows:

Collect all the data you want to analyze.
Identify the value or category that appears most frequently.
The value with the highest frequency of occurrence is called the mode.
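
The steps above can be sketched in R with a frequency table. The `city` vector is hypothetical illustration data.

```r
# Step 1: collect the data (hypothetical categorical values)
city <- c("Jakarta", "Medan", "Jakarta", "Bandung", "Medan", "Jakarta")

# Step 2: count how often each value occurs
freq <- table(city)

# Step 3: keep the value(s) with the highest frequency
mode_value <- names(freq)[as.vector(freq) == max(freq)]

mode_value   # "Jakarta"
```

If several values tie for the highest frequency, this sketch returns all of them, which is one reasonable way to handle multimodal data.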

2.2 Central Density

Central density is the degree to which data are concentrated around a central value (central tendency) such as the mean, median, or mode. In other words, central density indicates how tightly the data cluster around the center.
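
One way to inspect this concentration in practice is a kernel density estimate. The sketch below assumes that "central density" can be examined with base R's `density()` function; the data are deterministic illustration values, not the Urban Business Dataset.

```r
# Deterministic, roughly bell-shaped illustration data (not from the dataset)
x <- qnorm(ppoints(1000), mean = 50, sd = 10)

dens <- density(x)                   # kernel density estimate (stats package)
peak <- dens$x[which.max(dens$y)]    # location of the highest density

c(mean = mean(x), median = median(x), peak = peak)

# Optional plot: density curve with the mean marked as a vertical line
# plot(dens, main = "Density with mean line"); abline(v = mean(x), lty = 2)
```

For symmetric data such as this, the density peak, the mean, and the median almost coincide; for skewed data they drift apart, which is what the density plots later in the report are meant to reveal.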

The following table explains several uses of Central Density in data analysis:

# Load the required library
library(knitr)

# Create a data frame containing the uses of central density
central_density_uses <- data.frame(
  No = 1:5,
  Purpose = c(
    "Understand the shape of data distribution",
    "Evaluate how data are spread around the center value",
    "Support the analysis of Central Tendency (mean, median, mode)",
    "Assess data normality within a distribution",
    "Compare central density patterns between data groups"
  ),
  Explanation = c(
    "Central density helps visualize whether the data are symmetric (normal) or skewed.",
    "Shows whether the data are concentrated near the mean or widely dispersed.",
    "Provides visual context to measures of central tendency.",
    "Data with a peak in the center and balanced tails indicate a normal distribution.",
    "Helps identify differences in density concentration among groups, such as cities or categories."
  )
)

# Display the table neatly
kable(
  central_density_uses,
  caption = "Table: Uses of Central Density in Data Analysis",
  col.names = c("No", "Purpose", "Explanation")
)

Table: Uses of Central Density in Data Analysis
No	Purpose	Explanation
1	Understand the shape of data distribution	Central density helps visualize whether the data are symmetric (normal) or skewed.
2	Evaluate how data are spread around the center value	Shows whether the data are concentrated near the mean or widely dispersed.
3	Support the analysis of Central Tendency (mean, median, mode)	Provides visual context to measures of central tendency.
4	Assess data normality within a distribution	Data with a peak in the center and balanced tails indicate a normal distribution.
5	Compare central density patterns between data groups	Helps identify differences in density concentration among groups, such as cities or categories.

The following table explains the relationship between Central Tendency and Central Density in data analysis:

# Load library
library(knitr)

# Create data frame explaining the relationship
relationship_cd <- data.frame(
  No = 1:5,
  Aspect = c(
    "Definition",
    "Focus of Analysis",
    "Representation in Graphs",
    "Statistical Purpose",
    "Analytical Connection"
  ),
  Central_Tendency = c(
    "Represents the central value or typical value of a dataset (mean, median, mode).",
    "Focuses on finding the center of the data distribution.",
    "Usually represented by vertical lines indicating mean or median.",
    "Used to summarize data with a single representative value.",
    "Helps identify where most data values cluster in the distribution."
  ),
  Central_Density = c(
    "Represents how data points are distributed around the center.",
    "Focuses on the shape, spread, and concentration of the data.",
    "Represented by smooth curves or peaks in density plots.",
    "Used to understand data distribution patterns and normality.",
    "Provides visual confirmation of how central tendency reflects the data distribution."
  )
)

# Display table neatly
kable(
  relationship_cd,
  caption = "Table: Relationship Between Central Tendency and Central Density",
  col.names = c("No", "Aspect", "Central Tendency", "Central Density")
)

Table: Relationship Between Central Tendency and Central Density
No	Aspect	Central Tendency	Central Density
1	Definition	Represents the central value or typical value of a dataset (mean, median, mode).	Represents how data points are distributed around the center.
2	Focus of Analysis	Focuses on finding the center of the data distribution.	Focuses on the shape, spread, and concentration of the data.
3	Representation in Graphs	Usually represented by vertical lines indicating mean or median.	Represented by smooth curves or peaks in density plots.
4	Statistical Purpose	Used to summarize data with a single representative value.	Used to understand data distribution patterns and normality.
5	Analytical Connection	Helps identify where most data values cluster in the distribution.	Provides visual confirmation of how central tendency reflects the data distribution.

2.3 Standard Deviation

Definition of Standard Deviation

Standard deviation is a measure of how far data is spread from the average value (mean). It means that standard deviation indicates the level of variation or dispersion in a data set. If the standard deviation is small, it means the data is close to the mean (homogeneous data). If the standard deviation is large, it means the data is spread far from the mean (highly varied data).

Standard Deviation Formula

Standard deviation is calculated differently for a population and a sample.

Population Standard Deviation:

\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \]

Sample Standard Deviation:

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
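
As a small worked sketch of the two formulas, the following R code computes both versions by hand for a hypothetical vector `x`. Note that R's built-in `sd()` uses the sample formula (divisor n - 1).

```r
# Hypothetical data values (illustration only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

squared_dev <- (x - mean(x))^2               # squared deviations from the mean

population_sd <- sqrt(sum(squared_dev) / n)        # divisor N
sample_sd     <- sqrt(sum(squared_dev) / (n - 1))  # divisor n - 1

population_sd   # 2
sample_sd       # about 2.138, identical to sd(x)
```

The sample version is slightly larger because dividing by n - 1 compensates for estimating the mean from the same data.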

Description of Each Symbol

library(knitr)

# Create a table describing each variable in the formula
sd_symbols <- data.frame(
  Symbol = c("σ", "s", "xᵢ", "μ", "x̄", "N", "n"),
  Description = c(
    "Population standard deviation",
    "Sample standard deviation",
    "Each individual data value",
    "Population mean (average of all population data)",
    "Sample mean (average of sample data)",
    "Number of data points in the population",
    "Number of data points in the sample"
  )
)

# Display the table neatly
kable(
  sd_symbols,
  caption = "Table: Description of Symbols in the Standard Deviation Formula",
  col.names = c("Symbol", "Description")
)

Table: Description of Symbols in the Standard Deviation Formula
Symbol	Description
σ	Population standard deviation
s	Sample standard deviation
xᵢ	Each individual data value
μ	Population mean (average of all population data)
x̄	Sample mean (average of sample data)
N	Number of data points in the population
n	Number of data points in the sample

3 Purpose of Writing

To demonstrate our understanding of data visualization, central tendency analysis, and measures of dispersion.
To understand trends, patterns, and potential outliers in the displayed data.
To display the average (mean) line that shows the central value of the data.
To identify general trends, that is, whether the data lean towards high values, low values, or are balanced.
To enhance collaborative learning and the application of statistical concepts.

4 Data Preparation

Data preparation is the initial stage of the data analysis process, in which raw data are made ready for statistical analysis, modeling, or visualization. This stage is necessary because collected data are rarely clean or ready to use: there are often typos, missing values, inconsistent formats, or extreme values (outliers). In other words, data preparation is the process of cleaning, organizing, and transforming raw data into a neat, complete, and structured form so that the later analysis is more accurate and reliable. The Urban Business Dataset used here includes the following variables:

City
Business Type
Sales Channel
Marketing Spend
Product Price
Employee Count
Manager Experience
Customer Rating
Monthly Revenue
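
The cleaning steps described above can be sketched on a toy data frame. The values below are invented to exhibit typical problems (stray spaces, inconsistent capitalization, missing values); the column names follow those used in the code of this report, and dropping incomplete rows is only one possible policy, assumed here for simplicity.

```r
# Toy data frame with typical data-quality problems (illustration only)
toy <- data.frame(
  City           = c("Jakarta", " jakarta", "Medan", "Medan", NA),
  MarketingSpend = c(80.5, 92.1, NA, 75.0, 60.2),
  stringsAsFactors = FALSE
)

# 1. Count missing values per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# 2. Standardize inconsistent text: trim spaces, capitalize the first letter
toy$City <- trimws(toy$City)
toy$City <- ifelse(
  is.na(toy$City), NA,
  paste0(toupper(substr(toy$City, 1, 1)), substring(toy$City, 2))
)

# 3. Keep only complete rows (imputation would be an alternative)
clean <- toy[complete.cases(toy), ]
clean
```

After these steps the toy table contains three complete rows with consistently spelled city names.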

# Read the data from term.csv
file_path <- "term.csv"
data <- read_csv(file_path)

# Drop the unnamed index column (...1) if it exists
if ("...1" %in% names(data)) {
  data <- data |> select(-...1)
}

# Show the data as an interactive table
datatable(
  data,
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px; color: #2C3E50;',
    'Urban Business Dataset'
  ),
  rownames = FALSE,
  extensions = 'Buttons',  # needed for the export buttons requested by dom = 'B'
  options = list(
    pageLength = 10,
    autoWidth = TRUE,
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'background-color': '#2C3E50', 'color': '#fff'});",
      "}"
    )
  )
)

# Load the required library (install it once beforehand with install.packages("DT") if needed)
library(DT)

# Build a data frame that classifies each variable
data_class <- data.frame(
  No = 1:9,
  Variable = c("City", "Business Type", "Sales Channel", 
               "Marketing Spend", "Product Price", 
               "Employee Count", "Manager Experience", 
               "Customer Rating", "Monthly Revenue"),
  Type_Data = c("Categorical", "Categorical", "Categorical", 
                 "Numeric", "Numeric", "Numeric", 
                 "Numeric", "Categorical", "Numeric"),
  Data_Subtype = c("Nominal", "Nominal", "Nominal", 
               "Continuous", "Continuous", "Discrete", 
               "Continuous", "Ordinal", "Continuous"),
  Reason = c(
    "Shows the names of cities (Surabaya, Bandung, Jakarta, Makassar, Medan). There is no order or ranking between the cities; the values only distinguish categories.",
    "Indicates the type of business (for example: Retail, Food & Beverage, Manufacturing, Technology). These types differ only in name and have no order or hierarchy.",
    "Indicates the sales channel (Online, Offline). The categories have no particular order, so they are nominal.",
    "Describes the amount of money spent on marketing. The value can be fractional (decimal) and is measured, so it is continuous numeric.",
    "Shows the price of a product. Because it can take decimal values and is measured on a continuous scale, it is continuous.",
    "Indicates the number of employees, in whole numbers (it is not possible to have 30.5 people). Because it is counted, not measured, it is discrete numeric.",
    "Measures the length of a manager's experience, for example in years or months. If recorded in full years (e.g., 3 years, 5 years), it could be called discrete, but if recorded in finer detail (e.g., 3.5 years), it is continuous.",
    "Indicates customer ratings (for example 1–100). Although represented by numbers, this is treated as ordinal rather than truly numeric data because the numbers only represent levels of satisfaction: there is an order, but the distance between values is not necessarily equal.",
    "Shows monthly income in units of money (for example, Rp 8,750,000). Since it can take fractional values and is measured, it is continuous numeric."
  )
)

# Show the interactive table with color styling
datatable(
  data_class,
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; 
             font-weight: bold; font-size: 16px; color: #2C3E50;',
    'Classification of Variables Based on Data Type'
  ),
  rownames = FALSE,
  extensions = 'Buttons',  # needed for the export buttons requested by dom = 'B'
  options = list(
    pageLength = 9,
    autoWidth = TRUE,
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'background-color': '#2C3E50', 'color': '#fff', 'font-weight': 'bold'});",
      "}"
    )
  ),
  class = 'cell-border stripe hover'
)

5 Data Visualization

What is data visualization?

Data visualization is the process of transforming raw data into visual representations such as charts, diagrams, maps, or interactive dashboards, with the aim of presenting information in a way that is easier to understand, more engaging, and meaningful. In a world full of data — whether from research results, business, social media, or digital sensors — people often struggle to comprehend long numbers or tables. This is where data visualization plays an important role: helping humans see patterns, trends, relationships, and anomalies hidden within the data.

The purpose of this visualization is to:

Facilitate data understanding = Visualization helps to quickly see patterns, trends, and comparisons in data without having to read raw numbers in tables.
Support statistical analysis = Charts such as histograms, boxplots, scatterplots, and bar charts help analyze measures of central tendency (mean, median, mode) as well as data dispersion.
Assist decision-making = Visualization makes analysis results clearer, making it easier to draw conclusions and make data-driven decisions.
Communicate results effectively = Charts in RPubs help convey research or assignment results with an interactive and professional display, not just as text or static tables.

library(readr)
data <- read_csv("term.csv")

## ----setup, include=FALSE---------------------------------------------------
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

# Load the required libraries
library(tidyverse)
library(DT)
library(ggplot2)
library(plotly)

# Read the data
file_path <- "term.csv"
data <- read_csv(file_path)

# Drop the unneeded ID column (...1)
if ("...1" %in% names(data)) {
  data <- data |> select(-...1)
}

# Show the raw data
datatable(
  data,
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px;',
    'Table 1. Raw Data of Urban Business Dataset'
  )
)

5.1 Data Visualization

library(readr)
library(tidyverse)
library(ggplot2)
library(DT)

# Read CSV file
data <- read_csv("term.csv")

# Remove unnecessary index column if it exists
if ("...1" %in% names(data)) {
  data <- data |> select(-...1)
}

# Display the first few rows of the dataset
datatable(
  head(data),
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px;',
    "Table 1. Preview of the Raw Dataset"
  )
)

## ----1.1-Data-Visualization-------------------------------------------------
# BAR CHART: Number of Businesses per City

# Count the number of businesses per city and sort from highest to lowest

library(ggplot2)
library(plotly)
library(dplyr)
library(readr)

# Import the CSV file (make sure the file name matches)
data <- read_csv("term.csv")

# Count the number of businesses per city
city_count <- data %>%
  group_by(City) %>%
  summarise(Jumlah_Bisnis = n()) %>%
  arrange(desc(Jumlah_Bisnis))

# Build the bar chart
p <- ggplot(city_count, aes(x = reorder(City, -Jumlah_Bisnis), y = Jumlah_Bisnis, fill = City)) +
  geom_bar(stat = "identity", width = 0.7, color = "black") +
  geom_text(aes(label = Jumlah_Bisnis),
            vjust = -0.5,
            fontface = "bold",
            size = 4) +
  labs(
    title = "Number of Businesses per City",
    x = "City",
    y = "Number of Businesses"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 12),
    axis.text.x = element_text(angle = 25, hjust = 1),
    legend.position = "none"
  )

# Show the chart as an interactive plot
ggplotly(p)

Interpretation: Bar Chart

The visualization above shows the number of businesses in five major Indonesian cities: Jakarta, Bandung, Medan, Makassar, and Surabaya. Jakarta has the most businesses, around 425, which is consistent with its role as the economic center of Indonesia. Medan follows with around 420 businesses and Surabaya with 406, both fairly high counts. Bandung (around 381) and Makassar (around 368) have the fewest businesses among the five cities.

library(plotly)
library(dplyr)
library(readr)

# Import data
data <- read_csv("term.csv")

# Compute the proportion of each sales channel
sales_channel <- data |> 
  count(SalesChannel) |> 
  mutate(Percent = round(n / sum(n) * 100, 1),
         Label = paste0(SalesChannel, ": ", Percent, "%"))

# Build the interactive donut chart directly with plotly
plot_ly(
  data = sales_channel,
  labels = ~SalesChannel,
  values = ~n,
  type = 'pie',
  textinfo = 'label+percent',
  insidetextorientation = 'radial',
  hole = 0.5,  # hole in the center turns the pie into a donut
  marker = list(line = list(color = '#FFFFFF', width = 1))
) %>%
  layout(
    title = list(text = "Sales Channel Proportion (Donut Chart)",
                 x = 0.5,
                 font = list(size = 16, face = "bold")),
    showlegend = TRUE
  )

Interpretation: Donut Chart

The visualization above shows that the Online channel has the largest share, accounting for 54.5% of the businesses in the dataset, while the Offline channel accounts for 45.5%. Because the chart is built from a count of records per sales channel, these percentages describe the share of businesses using each channel rather than the share of sales value. The online channel is therefore slightly more common than offline, although the two are close to evenly balanced. This is consistent with a shift towards online-based selling, but a single snapshot cannot confirm a trend over time.

# HISTOGRAM (Plotly): Marketing Spend + Mean, Median, Mode + Automatic Interpretation
library(plotly)
library(dplyr)
library(readr)

# Import data
data <- read_csv("term.csv")

# Compute the mean, median, and mode
mean_spend <- mean(data$MarketingSpend, na.rm = TRUE)
median_spend <- median(data$MarketingSpend, na.rm = TRUE)
mode_spend <- as.numeric(names(sort(table(data$MarketingSpend), decreasing = TRUE)[1]))

# Determine the shape of the distribution
if (mean_spend > median_spend && median_spend > mode_spend) {
  distribusi_spend <- "Right-skewed"
} else if (mean_spend < median_spend && median_spend < mode_spend) {
  distribusi_spend <- "Left-skewed"
} else {
  distribusi_spend <- "Approximately symmetrical"
}

# Print the values and the interpretation to the console
cat("Mean Marketing Spend:", round(mean_spend, 2), "\n")

## Mean Marketing Spend: 85.27

cat("Median Marketing Spend:", round(median_spend, 2), "\n")

## Median Marketing Spend: 84.9

cat("Mode Marketing Spend:", round(mode_spend, 2), "\n")

## Mode Marketing Spend: 106.2

cat("Distribution:", distribusi_spend, "\n")

## Distribution: Approximately symmetrical

# Build the interactive histogram with Plotly
p <- plot_ly(
  data, 
  x = ~MarketingSpend, 
  type = "histogram",
  nbinsx = 30,
  marker = list(color = 'orange', line = list(color = "white", width = 1)),
  opacity = 0.7,
  name = "Histogram"
)

# Add the mean, median, and mode lines as shapes in the layout
p <- p %>%
  layout(
    title = list(
      text = "Distribution of Marketing Spend (Interactive)",
      x = 0.5,
      font = list(size = 16, face = "bold")
    ),
    xaxis = list(title = "Marketing Spend"),
    yaxis = list(title = "Frequency"),
    shapes = list(
      list(type = "line", x0 = mean_spend, x1 = mean_spend, y0 = 0, y1 = 1,
           xref = "x", yref = "paper", line = list(color = "red", dash = "dash")),
      list(type = "line", x0 = median_spend, x1 = median_spend, y0 = 0, y1 = 1,
           xref = "x", yref = "paper", line = list(color = "green", dash = "dot")),
      list(type = "line", x0 = mode_spend, x1 = mode_spend, y0 = 0, y1 = 1,
           xref = "x", yref = "paper", line = list(color = "purple", dash = "solid"))
    ),
    annotations = list(
      # yref = "paper" places the labels relative to the plot height, matching the lines above
      list(x = mean_spend, y = 0.95, yref = "paper", text = "Mean", showarrow = FALSE, font = list(color = "red")),
      list(x = median_spend, y = 0.9, yref = "paper", text = "Median", showarrow = FALSE, font = list(color = "green")),
      list(x = mode_spend, y = 0.85, yref = "paper", text = "Mode", showarrow = FALSE, font = list(color = "purple"))
    )
  )

p

Interpretation: Histogram

The chart above shows the distribution of marketing spend as a histogram with three reference lines. The red dashed line (Mean) marks the average spend, the green dotted line (Median) marks the middle value, and the solid purple line (Mode) marks the most frequently occurring value. The mean (85.27) and median (84.9) are almost identical, which is why the code classifies the distribution as approximately symmetrical. The mode (106.2) lies to the right of both, which hints at a slight left skew, although the most frequent exact value of a continuous variable is a weak indicator of shape. Most spending values fall between 40 and 120, with a fairly even density and two main peaks (bimodal). In conclusion, the distribution of marketing spend is not perfectly normal: it is roughly symmetric around 85, with at most a slight leftward skew and two prominent spending groups.
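
One way to put a number on the skew discussed above is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. This coefficient is not part of the original analysis; it is added here as an illustrative check, using the rounded values reported in this document (the standard deviation of 37.84 comes from the dispersion table in Section 6). The variable names are hypothetical.

```r
# Rounded values reported in this document for Marketing Spend
reported_mean   <- 85.27   # console output of the histogram chunk
reported_median <- 84.9    # console output of the histogram chunk
reported_sd     <- 37.84   # dispersion table in Section 6

# Pearson's second skewness coefficient: 3 * (mean - median) / sd
pearson_skew <- 3 * (reported_mean - reported_median) / reported_sd

round(pearson_skew, 3)   # 0.029
```

A value this close to zero suggests that, judged by mean and median alone, the distribution is nearly symmetric; the coefficient says nothing about bimodality, which has to be read from the histogram itself.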

# VIOLIN + BOX PLOT (Plotly): Distribution of Marketing Spend
library(plotly)
library(dplyr)
library(readr)

# Import data
data <- read_csv("term.csv")

# Build the interactive plot
p <- plot_ly(data, y = ~MarketingSpend, type = 'violin',
             box = list(visible = TRUE),
             meanline = list(visible = TRUE),
             fillcolor = 'lightpink',
             line = list(color = 'black'),
             opacity = 0.6,
             name = "Marketing Spend") %>%
  layout(
    title = list(
      text = "Distribution of Marketing Spend (Violin + Boxplot)",
      x = 0.5,
      font = list(size = 16, face = "bold")
    ),
    yaxis = list(title = "Marketing Spend"),
    xaxis = list(title = "")
  )

p

Interpretation: Violin + Boxplot

The visualization combines a violin plot and a boxplot to show the distribution of marketing spend. The box in the center shows a median of around 80 to 85 (the computed median is 84.9), meaning that half of the data lie below this value and half above. The interquartile range (IQR) is fairly wide, indicating substantial spread in the data. A violin is mirrored left to right by construction, so the symmetry of the distribution has to be judged along the vertical axis, where the shape is relatively balanced above and below the median (not strongly skewed). No significant outliers are apparent. In conclusion, marketing spend has a fairly even distribution, with a median in the low-to-mid 80s and no extreme deviations.
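
The statement that no outliers are apparent can be checked numerically. The sketch below assumes the common 1.5 × IQR convention (Tukey's rule) for flagging outliers; the data are hypothetical, and the quartiles use R's default method, which may differ slightly from the one the plotting library applies.

```r
# Hypothetical spend values with one unusually large observation
x <- c(10, 12, 13, 15, 18, 20, 22, 95)

q1  <- quantile(x, 0.25, names = FALSE)   # first quartile
q3  <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                            # same as IQR(x)

# Fences: points outside them are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers   # 95
```

Applying the same steps to a real column (for example `data$MarketingSpend`) and getting an empty result would support the visual impression that the boxplot shows no outliers.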

# SCATTER PLOT (Plotly): Relationship between Marketing Spend & Monthly Revenue
library(plotly)
library(dplyr)
library(readr)

# Import data
data <- read_csv("term.csv")

# Fit a linear regression model
model <- lm(MonthlyRevenue ~ MarketingSpend, data = data)

# Build points for the regression line (100 points so it looks smooth)
x_line <- seq(min(data$MarketingSpend), max(data$MarketingSpend), length.out = 100)
y_line <- predict(model, newdata = data.frame(MarketingSpend = x_line))

# Interactive plot
p <- plot_ly(data, 
             x = ~MarketingSpend, 
             y = ~MonthlyRevenue,
             type = 'scatter',
             mode = 'markers',
             marker = list(color = 'tomato', size = 8, opacity = 0.55),
             name = 'Data Points') |> 
  add_lines(
    x = x_line,
    y = y_line,
    line = list(color = 'darkblue', width = 5),
    name = 'Regression Line'
  ) |> 
  layout(
    title = list(
      text = "Relationship between Marketing Spend and Monthly Revenue (with Regression Line)",
      x = 0.5,
      font = list(size = 16)
    ),
    xaxis = list(title = "Marketing Spend"),
    yaxis = list(title = "Monthly Revenue"),
    hovermode = "closest"
  )

p

Interpretation: Scatter Plot

The scatter plot with a regression line shows a positive linear relationship between Marketing Spend and Monthly Revenue. The data points form an upward pattern to the right: businesses with higher marketing expenditure tend to have higher monthly revenue. The upward-sloping regression line reinforces this pattern by showing the average increase in revenue as marketing spend increases. The relatively tight clustering of points around the line suggests that the relationship is strong and that marketing spend is a fairly good predictor of revenue, although there is still variation due to other factors. Overall, this visualization indicates that higher marketing investment is associated with higher revenue; because the data are observational, this association alone does not prove that marketing spend causes the increase.
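
The strength of a relationship like this is usually quantified with the Pearson correlation and the R² of the fitted line. The following is a minimal sketch on a made-up five-row data frame (`toy`), using the same column names as the report; the same calls could be applied to the real `data` object.

```r
# Hypothetical data (illustration only, not the Urban Business Dataset)
toy <- data.frame(
  MarketingSpend = c(40, 60, 80, 100, 120),
  MonthlyRevenue = c(150, 190, 200, 260, 280)
)

# Pearson correlation between the two variables
r <- cor(toy$MarketingSpend, toy$MonthlyRevenue)

# Simple linear regression, as in the scatter plot above
fit <- lm(MonthlyRevenue ~ MarketingSpend, data = toy)

slope     <- unname(coef(fit)["MarketingSpend"])   # revenue change per unit of spend
r_squared <- summary(fit)$r.squared                # share of variance explained

round(c(r = r, slope = slope, r_squared = r_squared), 3)
```

In a simple regression with one predictor, R² equals the squared correlation, so either number can back up a verbal claim such as "strong positive relationship".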

5.2 Central Tendency Analysis

This section analyzes the measures of central tendency (Mean, Median, and Mode) for two numerical variables: Marketing Spend and Monthly Revenue.
These values represent the “center” or “typical” point of the data distribution.

# Function to calculate mode
get_mode <- function(x) {
  uniq_x <- unique(x)
  uniq_x[which.max(tabulate(match(x, uniq_x)))]
}

# Calculate central tendency for Marketing Spend
marketing_mean <- mean(data$MarketingSpend, na.rm = TRUE)
marketing_median <- median(data$MarketingSpend, na.rm = TRUE)
marketing_mode <- get_mode(data$MarketingSpend)

# Calculate central tendency for Monthly Revenue
revenue_mean <- mean(data$MonthlyRevenue, na.rm = TRUE)
revenue_median <- median(data$MonthlyRevenue, na.rm = TRUE)
revenue_mode <- get_mode(data$MonthlyRevenue)

# Create a summary table
central_tendency <- tibble(
  Variable = c("Marketing Spend", "Monthly Revenue"),
  Mean = c(marketing_mean, revenue_mean),
  Median = c(marketing_median, revenue_median),
  Mode = c(marketing_mode, revenue_mode)
)

datatable(
  central_tendency,
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px;',
    "Table 2. Measures of Central Tendency (Mean, Median, Mode)"
  )
)

# HISTOGRAM (Plotly): Marketing Spend & Monthly Revenue
library(plotly)
library(readr)
library(dplyr)

# Import data
data <- read_csv("term.csv")

# Histogram 1: Marketing Spend
hist1 <- plot_ly(data,
                 x = ~MarketingSpend,
                 type = "histogram",
                 name = "Marketing Spend",   # legend label
                 marker = list(color = "lightgreen",
                               line = list(color = "white", width = 1))) %>%
  layout(
    title = list(
      text = "Histogram of Marketing Spend",
      x = 0.5,
      font = list(size = 16, face = "bold")
    ),
    xaxis = list(title = "Marketing Spend"),
    yaxis = list(title = "Frequency"),
    bargap = 0.05
  )

# Histogram 2: Monthly Revenue
hist2 <- plot_ly(data,
                 x = ~MonthlyRevenue,
                 type = "histogram",
                 name = "Monthly Revenue",   # legend label
                 marker = list(color = "orange",
                               line = list(color = "white", width = 1))) %>%
  layout(
    title = list(
      text = "Histogram of Monthly Revenue",
      x = 0.5,
      font = list(size = 16, face = "bold")
    ),
    xaxis = list(title = "Monthly Revenue"),
    yaxis = list(title = "Frequency"),
    bargap = 0.05
  )

# Combined subplot
subplot(hist1, hist2, nrows = 1, shareY = TRUE) %>%
  layout(
    title = list(
      text = "Histograms: Marketing Spend & Monthly Revenue",
      x = 0.5,
      font = list(size = 18, face = "bold")
    ),
    showlegend = TRUE  # make sure the legend is shown
  )

Interpretation: Histograms of Marketing Spend & Monthly Revenue

The histograms show different distribution shapes for the two variables. Marketing Spend appears slightly right-skewed (positively skewed), with most expenditures at low to medium levels and fewer observations at very high values; since its mean (85.27) and median (84.9) are nearly equal, any such skew is mild. In contrast, Monthly Revenue appears left-skewed (negatively skewed), with most revenue at high values and only a few data points at the lower end. This difference in shape suggests that marketing spending among the businesses tends to sit at lower levels, whereas revenue more frequently occurs at higher levels.

6 Measures of Dispersion

This section focuses on analyzing how spread out or variable the data is.
Four measures are used: Range, Variance, Standard Deviation, and Interquartile Range (IQR) for Marketing Spend and Monthly Revenue.
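
Before applying these measures to the dataset, the sketch below shows what each one computes on a small hypothetical vector, with the manual calculation next to the built-in R function.

```r
# Hypothetical values (illustration only)
x <- c(10, 20, 30, 40, 50)

# Range: distance between the largest and smallest value
range_val <- max(x) - min(x)                              # 40

# Sample variance: average squared deviation, divisor n - 1
var_val <- sum((x - mean(x))^2) / (length(x) - 1)         # 250, same as var(x)

# Standard deviation: square root of the variance
sd_val <- sqrt(var_val)                                   # about 15.81, same as sd(x)

# Interquartile range: distance between the third and first quartile
iqr_val <- quantile(x, 0.75, names = FALSE) -
  quantile(x, 0.25, names = FALSE)                        # 20, same as IQR(x)

c(Range = range_val, Variance = var_val, SD = sd_val, IQR = iqr_val)
```

Range and variance react strongly to extreme values, while the IQR only looks at the middle 50% of the data, which is why the two are usually reported together.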

# Calculate dispersion measures for Marketing Spend
marketing_range <- max(data$MarketingSpend) - min(data$MarketingSpend)
marketing_var <- var(data$MarketingSpend)
marketing_sd <- sd(data$MarketingSpend)
marketing_iqr <- IQR(data$MarketingSpend)

# Calculate dispersion measures for Monthly Revenue
revenue_range <- max(data$MonthlyRevenue) - min(data$MonthlyRevenue)
revenue_var <- var(data$MonthlyRevenue)
revenue_sd <- sd(data$MonthlyRevenue)
revenue_iqr <- IQR(data$MonthlyRevenue)

# Create summary table
dispersion_summary <- tibble(
  Variable = c("Marketing Spend", "Monthly Revenue"),
  Range = c(marketing_range, revenue_range),
  Variance = c(marketing_var, revenue_var),
  `Standard Deviation` = c(marketing_sd, revenue_sd),
  IQR = c(marketing_iqr, revenue_iqr)
)

datatable(
  dispersion_summary,
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px;',
    "Table 3. Measures of Dispersion (Range, Variance, SD, IQR)"
  )
)

## ----1.3-Measures-of-Dispersion--------------------------------------------

# Function to compute all dispersion measures
measure_dispersion <- function(x) {
  range_val <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
  var_val <- var(x, na.rm = TRUE)
  sd_val <- sd(x, na.rm = TRUE)
  iqr_val <- IQR(x, na.rm = TRUE)
  
  result <- data.frame(
    Range = round(range_val, 2),
    Variance = round(var_val, 2),
    Standard_Deviation = round(sd_val, 2),
    IQR = round(iqr_val, 2)
  )
  return(result)
}

# Compute for the two variables
dispersion_revenue <- measure_dispersion(data$MonthlyRevenue)
dispersion_spend <- measure_dispersion(data$MarketingSpend)

# Combine the results into one table
dispersion_table <- rbind(
  cbind(Variable = "Monthly Revenue", dispersion_revenue),
  cbind(Variable = "Marketing Spend", dispersion_spend)
)

# Display as a formatted table
library(knitr)
kable(dispersion_table, caption = "Table: Measures of Dispersion for Numerical Variables")

Table: Measures of Dispersion for Numerical Variables
Variable	Range	Variance	Standard_Deviation	IQR
Monthly Revenue	256.65	2232.43	47.25	73.86
Marketing Spend	129.90	1432.17	37.84	66.12
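
Because Monthly Revenue and Marketing Spend sit on different scales, their raw standard deviations are not directly comparable. One common option is the coefficient of variation (standard deviation divided by the mean). The sketch below is an illustration only: `coef_variation` is a hypothetical helper, the two short vectors are made up, and the last line reuses the standard deviation and mean of Monthly Revenue reported in this document's tables.

```r
# Hypothetical helper: coefficient of variation = SD / mean
coef_variation <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# Illustrative vectors (not from term.csv)
small_scale <- c(8, 10, 12)      # mean 10, SD 2
large_scale <- c(80, 100, 120)   # mean 100, SD 20

coef_variation(small_scale)  # 0.2
coef_variation(large_scale)  # 0.2: same relative spread despite a 10x larger SD

# Monthly Revenue, from the reported SD (47.25) and mean (180.83)
revenue_cv <- 47.25 / 180.83   # roughly 0.26
```

Running `coef_variation(data$MarketingSpend)` on the actual data would give the matching figure for Marketing Spend.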

# BOX PLOT COMPARISON (Plotly Version)
library(plotly)
library(readr)

# Read data
data <- read_csv("term.csv")

# BOX PLOT 1: Marketing Spend
p1 <- plot_ly(
  data,
  y = ~MarketingSpend,
  type = "box",
  name = "Marketing Spend",
  boxpoints = "outliers",
  marker = list(color = "lightgreen"),
  line = list(color = "darkgreen")
) %>%
  layout(
    title = list(
      text = "Boxplot of Marketing Spend",
      x = 0.5,
      font = list(size = 16, face = "bold")
    ),
    yaxis = list(title = "Marketing Spend")
  )

# BOX PLOT 2: Monthly Revenue
p2 <- plot_ly(
  data,
  y = ~MonthlyRevenue,
  type = "box",
  name = "Monthly Revenue",
  boxpoints = "outliers",
  marker = list(color = "orange"),
  line = list(color = "darkred")
) %>%
  layout(
    title = list(
      text = "Boxplot of Monthly Revenue",
      x = 0.5,
      font = list(size = 16, face = "bold")
    ),
    yaxis = list(title = "Monthly Revenue")
  )

# Combine the two boxplots into a single side-by-side view
subplot(p1, p2, nrows = 1, shareY = FALSE, titleX = TRUE, titleY = TRUE) %>%
  layout(
    title = list(
      text = "Comparison of Marketing Spend and Monthly Revenue (Boxplots)",
      x = 0.5,
      font = list(size = 18, face = "bold")
    )
  )

Interpretation Box Plot: Comparison of Marketing Spend & Monthly Revenue

This visualization compares Marketing Spend and Monthly Revenue side by side, each on its own y-axis scale. Typical Monthly Revenue (approximately $180) is much higher than typical Marketing Spend (approximately $85). A boxplot alone cannot establish profit margins, since other costs are not shown, but it does indicate that revenue is roughly double the marketing outlay in a typical case. Monthly Revenue also shows greater variability than Marketing Spend, which is more tightly clustered; this matches the larger standard deviation and IQR reported for revenue in the dispersion table. In conclusion, revenue levels are high relative to marketing spend, but they vary considerably across observations.
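
The points drawn individually by `boxpoints = "outliers"` are values beyond the whiskers. A common convention for this is the 1.5 × IQR rule; the sketch below assumes that rule and uses R's default quartiles, which may differ slightly from the quartile method Plotly applies. The helper `flag_outliers` and the sample values are hypothetical.

```r
# Hypothetical helper: flag values outside the 1.5 * IQR fences
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < (q[1] - fence) | x > (q[2] + fence)
}

# Illustrative values (not from term.csv); 300 is far from the rest
spend <- c(60, 70, 75, 80, 85, 90, 95, 100, 300)

# Q1 = 75, Q3 = 95, so the fences are 45 and 125
spend[flag_outliers(spend)]   # 300
```

The same helper could be applied to `data$MarketingSpend` or `data$MonthlyRevenue` to list the flagged observations instead of only seeing them on the plot.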

# SCATTER PLOT: Marketing Spend vs Monthly Revenue (Plotly)
library(plotly)
library(readr)

# Read the CSV data
data <- read_csv("term.csv")

# Interactive scatter plot
plot_ly(
  data,
  x = ~MarketingSpend,
  y = ~MonthlyRevenue,
  type = "scatter",
  mode = "markers",
  marker = list(color = "dodgerblue", size = 8, opacity = 0.6, line = list(width = 1, color = "white"))
) %>%
  layout(
    title = list(
      text = "Scatter Plot of Marketing Spend vs Monthly Revenue",
      x = 0.5,
      font = list(size = 18, face = "bold")
    ),
    xaxis = list(title = "Marketing Spend"),
    yaxis = list(title = "Monthly Revenue"),
    plot_bgcolor = "rgba(245,245,245,1)",
    paper_bgcolor = "rgba(255,255,255,1)"
  )

Interpretation Scatter Plot of Marketing Spend vs Monthly Revenue

The graph above shows that most of the data falls within a marketing spend range of around 40 to 120, with monthly revenue ranging from 100 to 250. The points rise from the lower left to the upper right, indicating a positive correlation between marketing spend and revenue. Although a few points are scattered outside the main pattern, overall, this scatterplot suggests that a larger marketing budget is associated with higher monthly revenue.

# BOX + VIOLIN PLOT (Plotly Version)
library(plotly)
library(readr)
library(dplyr)
library(tidyr)

# Read data
data <- read_csv("term.csv")

# Reshape the data to long format (similar to melt)
data_long <- data %>%
  select(MonthlyRevenue, MarketingSpend) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Interactive plot (combined violin + box)
plot_ly(data_long, y = ~Value, x = ~Variable, color = ~Variable, type = "violin",
        box = list(visible = TRUE),
        meanline = list(visible = TRUE),
        points = "all",
        jitter = 0.3,
        marker = list(opacity = 0.3, size = 3)) %>%
  layout(
    title = list(
      text = "Boxplot and Violin Plot: Monthly Revenue vs Marketing Spend",
      x = 0.5,
      font = list(size = 18, face = "bold")
    ),
    xaxis = list(title = "Variable"),
    yaxis = list(title = "Value"),
    showlegend = FALSE
  )

Interpretation Boxplot & Violinplot: Monthly Revenue vs Marketing Spend

The visualization above compares marketing spend and monthly revenue through a combination of boxplots and violin plots. Marketing Spend values generally range from 20 to 150, with most of the data clustered around 80 to 100, while Monthly Revenue has a higher range, between 100 and 300, with most of the data around 180 to 220. The wider violin for Monthly Revenue suggests greater variation in revenue than in marketing spend. Marketing Spend appears denser in the middle region, indicating a relatively high frequency of mid-range values. Overall, marketing spend is smaller and less variable than monthly revenue; because the two distributions are drawn separately, this plot by itself does not show whether higher spend goes with higher revenue.

# SCATTER PLOT + REGRESSION LINE (Plotly Version)
library(plotly)
library(readr)
library(dplyr)

# Read data
data <- read_csv("term.csv")

# Fit a linear regression model
model <- lm(MonthlyRevenue ~ MarketingSpend, data = data)

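# Predicted values for the regression line
# (newdata = data returns one prediction per row, even if some rows contain NA)
data$Predicted <- predict(model, newdata = data)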

# Interactive scatter plot + regression line
plot_ly() %>%
  add_markers(
    data = data,
    x = ~MarketingSpend,
    y = ~MonthlyRevenue,
    name = "Data Points",
    marker = list(color = "darkorange", opacity = 0.6, size = 7)
  ) %>%
  add_lines(
    data = data,
    x = ~MarketingSpend,
    y = ~Predicted,
    name = "Regression Line",
    line = list(color = "blue", width = 2)
  ) %>%
  layout(
    title = list(
      text = "Scatter Plot: Relationship between Marketing Spend and Monthly Revenue",
      x = 0.5,
      font = list(size = 18, face = "bold")
    ),
    xaxis = list(title = "Marketing Spend"),
    yaxis = list(title = "Monthly Revenue"),
    plot_bgcolor = "rgba(245,245,245,1)",
    paper_bgcolor = "rgba(255,255,255,1)",
    legend = list(orientation = "h", x = 0.3, y = -0.2)
  )

Interpretation Scatter plot: Relationship between Marketing Spend and Monthly Revenue

This visualization shows the relationship between Marketing Spend and Monthly Revenue. The orange dots represent actual data, while the blue line is the average trend from the linear regression model. In general, the higher the marketing spend, the higher the monthly revenue.
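
The regression line can be summarised numerically by its slope and R-squared. The sketch below uses synthetic data, not term.csv, whose column names mirror those in this report; `toy` and `toy_model` are hypothetical names used only for illustration.

```r
# Synthetic data for illustration only (not term.csv)
set.seed(1)
toy <- data.frame(MarketingSpend = seq(20, 150, length.out = 50))
toy$MonthlyRevenue <- 80 + 1.2 * toy$MarketingSpend + rnorm(50, sd = 10)

toy_model <- lm(MonthlyRevenue ~ MarketingSpend, data = toy)

coef(toy_model)                # intercept and slope of the fitted line
summary(toy_model)$r.squared   # share of revenue variance explained by the line

# The slope is the expected change in revenue per one-unit increase in spend
slope <- unname(coef(toy_model)["MarketingSpend"])
```

Calling `coef(model)` and `summary(model)$r.squared` on the model fitted above would report the same quantities for the actual dataset.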

## ----Central-Density----------------------------------------------------
library(plotly)
library(dplyr)
library(readr)
library(knitr)

# Read data
data <- read_csv("term.csv")

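# Compute the mean, median, and mode of Monthly Revenue
mean_rev <- mean(data$MonthlyRevenue, na.rm = TRUE)
median_rev <- median(data$MonthlyRevenue, na.rm = TRUE)
# Note: the mode is taken from a frequency table. For continuous data in which
# every value is unique, all counts tie at 1 and the result is just one of the
# observed values (typically the smallest), not a meaningful mode.
mode_rev <- as.numeric(names(sort(table(data$MonthlyRevenue), decreasing = TRUE))[1])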

# Build the histogram and density curve
dens <- density(data$MonthlyRevenue, na.rm = TRUE)

p <- plot_ly() %>%
  # Histogram
  add_histogram(
    x = data$MonthlyRevenue,
    histnorm = "probability density",
    nbinsx = 30,
    name = "Histogram",
    marker = list(color = "lightblue", line = list(color = "white", width = 1)),
    opacity = 0.7
  ) %>%
  # Density curve
  add_lines(
    x = dens$x, y = dens$y,
    name = "Density Curve",
    line = list(color = "darkblue", width = 2)
  ) %>%
  # Mean line
  add_segments(x = mean_rev, xend = mean_rev, y = 0, yend = max(dens$y),
               name = "Mean", line = list(color = "red", dash = "dash", width = 2)) %>%
  # Median line
  add_segments(x = median_rev, xend = median_rev, y = 0, yend = max(dens$y),
               name = "Median", line = list(color = "green", dash = "dot", width = 2)) %>%
  # Mode line
  add_segments(x = mode_rev, xend = mode_rev, y = 0, yend = max(dens$y),
               name = "Mode", line = list(color = "purple", dash = "dashdot", width = 2)) %>%
  layout(
    title = list(
      text = "Central Density and Central Tendency of Monthly Revenue",
      x = 0.5,
      font = list(size = 18)
    ),
    xaxis = list(title = "Monthly Revenue"),
    yaxis = list(title = "Density"),
    plot_bgcolor = "rgba(245,245,245,1)",
    paper_bgcolor = "rgba(255,255,255,1)",
    legend = list(orientation = "h", x = 0.3, y = -0.2)
  )

p

# Summary table of central tendency values
central_table <- data.frame(
  Measure = c("Mean", "Median", "Mode"),
  Value = round(c(mean_rev, median_rev, mode_rev), 2)
)

kable(central_table, caption = "Table: Central Tendency Measures for Monthly Revenue")

Table: Central Tendency Measures for Monthly Revenue
Measure	Value
Mean	180.83
Median	181.18
Mode	58.73

Interpretation Central Density & Central Tendency of Monthly Revenue

The graph above shows the distribution of Monthly Revenue with a histogram, a density curve, and three measures of central tendency: mean, median, and mode. The distribution is fairly symmetrical, with its peak around 180 to 200, meaning that revenue most often falls within that range. The red dashed line (mean, 180.83) and the green dotted line (median, 181.18) nearly coincide, indicating little skew and a roughly bell-like shape. The purple line (mode, 58.73), by contrast, lies far to the left of both. Because Monthly Revenue is a continuous variable in which exact values rarely repeat, a frequency-table mode of this kind most likely reflects a single low observation rather than the most common revenue level; the peak of the density curve is a better guide to the most typical value. Overall, most monthly revenues range from 100 to 250, with a distribution that does not lean strongly to the left or right.
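
For a continuous variable, one alternative to a frequency-table mode is to take the location of the highest point of the kernel density estimate. The sketch below shows this idea on synthetic data, not term.csv; `density_mode` and `toy_revenue` are hypothetical names, and this is offered as one possible approach rather than the method used in the analysis above.

```r
# Hypothetical helper: estimate the mode of a continuous variable
# as the x position of the highest point of its kernel density estimate
density_mode <- function(x) {
  d <- density(x, na.rm = TRUE)
  d$x[which.max(d$y)]
}

# Synthetic, continuous data for illustration (not term.csv)
set.seed(42)
toy_revenue <- rnorm(500, mean = 180, sd = 45)

# Every value is distinct, so the frequency table has no real "winner"
max(table(toy_revenue))     # 1 when no value repeats

# The density-based estimate lands near the centre of the distribution
density_mode(toy_revenue)
```

The estimate depends on the bandwidth chosen by `density()`, so it should be read as approximate.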

## ----Relationship-Correlation-Regression-------------------------------
library(plotly)
library(dplyr)
library(knitr)
library(readr)

# Read data
data <- read_csv("term.csv")

# Compute the correlation
correlation <- cor(data$MarketingSpend, data$MonthlyRevenue, use = "complete.obs")

# Regression model
model <- lm(MonthlyRevenue ~ MarketingSpend, data = data)

# x values for the regression line (100 points so it renders smoothly)
x_line <- seq(min(data$MarketingSpend, na.rm = TRUE), max(data$MarketingSpend, na.rm = TRUE), length.out = 100)

# Predicted y values for the regression line
y_line <- predict(model, newdata = data.frame(MarketingSpend = x_line))

# ===========================
#      PLOT SCATTER + LINE
# ===========================

# Scatter plot
p <- plot_ly(
  data = data,
  x = ~MarketingSpend,
  y = ~MonthlyRevenue,
  type = "scatter",
  mode = "markers",
  name = "Data Points",
  marker = list(color = "rgba(30,144,255,0.45)", size = 7)
)

# Regression line (a truly straight line)
p <- add_lines(
  p,
  x = x_line,
  y = y_line,
  name = "Regression Line",
  line = list(color = "rgba(139,0,0,1)", width = 6)
)

# Layout
p <- layout(
  p,
  title = list(
    text = paste("Relationship between Marketing Spend and Monthly Revenue<br>",
                 "Correlation =", round(correlation, 3)),
    x = 0.5,
    font = list(size = 16)
  ),
  xaxis = list(title = "Marketing Spend"),
  yaxis = list(title = "Monthly Revenue"),
  legend = list(orientation = "h", x = 0.3, y = -0.2)
)

p

Interpretation Relationship between Marketing Spend & Monthly Revenue

The scatterplot above shows the relationship between Marketing Spend and Monthly Revenue, with a correlation of 0.869. This value indicates a strong positive linear relationship between the two variables. Most of the data points form a pattern rising from the lower left to the upper right, so higher marketing spend tends to go with higher monthly revenue. Most of the data falls within a spend range of around 40 to 130, with monthly revenue ranging from 100 to 280. The dark red regression line reinforces this trend, rising steadily as marketing spend increases. Although some points are scattered outside the main pattern, overall, this graph shows a strong positive association between marketing spend and revenue; correlation alone does not establish that marketing spend causes the increase.
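
In a simple linear regression with one predictor, R-squared equals the squared correlation coefficient, which gives a way to read the reported correlation as a share of variance explained. The sketch below checks this identity on synthetic data, not term.csv; the vectors `spend` and `revenue` are made up for illustration.

```r
# Synthetic data for illustration only (not term.csv)
set.seed(7)
spend   <- runif(100, min = 20, max = 150)
revenue <- 80 + 1.1 * spend + rnorm(100, sd = 20)

r  <- cor(spend, revenue)
r2 <- summary(lm(revenue ~ spend))$r.squared

# In simple linear regression, R-squared equals the squared correlation
stopifnot(isTRUE(all.equal(r^2, r2)))

# Applied to the correlation reported above
0.869^2   # about 0.755
```

Under this identity, a correlation of 0.869 corresponds to roughly 75% of the variation in Monthly Revenue being accounted for by a straight-line relationship with Marketing Spend.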

Summary-Interpretation

Based on the analysis of the Urban Business Dataset, several key insights were identified regarding business patterns across major Indonesian cities. In terms of central tendency (mean, median, and mode), the Monthly Revenue variable has a mean (180.83) close to its median (181.18), indicating an approximately symmetric distribution with minimal skewness. Meanwhile, Marketing Spend shows a slightly higher mean than median, suggesting a mild right-skewed distribution.

For measures of dispersion (range, variance, standard deviation, and IQR), Monthly Revenue shows the larger absolute spread (standard deviation 47.25, IQR 73.86) compared with Marketing Spend (standard deviation 37.84, IQR 66.12). Marketing Spend in turn displays greater variability than Customer Rating, implying that marketing expenditures differ significantly between businesses. Conversely, Customer Rating is relatively consistent, reflecting stable levels of customer satisfaction across cities.

Data visualizations support these findings: the histogram indicates a near-normal shape, boxplots and violin plots reveal some outliers in spending data, and the scatter plot demonstrates a positive linear relationship between Marketing Spend and Monthly Revenue.

In conclusion, businesses with higher marketing expenditures tend to achieve higher monthly revenue. This emphasizes the importance of effective marketing strategies in driving business performance across different regions.
