Data Exploration
Exercises ~ Week 3
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted (discrete)
- Ordinal: YearLevel (Freshman → Senior)
StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
---|---|---|---|---|---|
S001 | Alice | 20 | 45 | Data Sains | Sophomore |
S002 | Budi | 21 | 60 | Mathematics | Junior |
S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
levels = c("Freshman","Sophomore","Junior","Senior"),
ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Sains Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
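One way to confirm that R stores each column with the intended type is to inspect the column classes. The sketch below rebuilds the same data frame so it can run on its own. The helper names `col_classes` and `is_upper` are illustrative, not part of the exercise. Note that R's numeric class does not distinguish continuous from discrete values, so that distinction has to come from the analyst.

```r
# Rebuild the Exercise 1 data frame so this sketch is self-contained
students <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Name = c("Alice", "Budi", "Citra"),
  Age = c(20, 21, 19),
  CreditsCompleted = c(45, 60, 30),
  Major = c("Data Sains", "Mathematics", "Statistics"),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# First class of every column (an ordered factor reports "ordered")
col_classes <- sapply(students, function(x) class(x)[1])
print(col_classes)

# Ordered factors support comparisons that follow the level order
is_upper <- students$YearLevel >= "Junior"
print(students$Name[is_upper])
```

Only the ordinal column supports comparisons such as `>=`; the same comparison on a nominal column would not be meaningful.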
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
No = 1:5,
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"Numeric",
"Numeric",
"Categorical",
"Categorical",
"Categorical"
),
Subtype = c(
"Discrete",
"Continuous",
"Nominal",
"Ordinal",
"Nominal"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types")
No | Variable | DataType | Subtype |
---|---|---|---|
1 | Number of vehicles passing through the toll road each day | Numeric | Discrete |
2 | Student height in cm | Numeric | Continuous |
3 | Employee gender (Male / Female) | Categorical | Nominal |
4 | Customer satisfaction level: Low, Medium, High | Categorical | Ordinal |
5 | Respondent’s favorite color: Red, Blue, Green | Categorical | Nominal |
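The nominal versus ordinal distinction in the table maps onto unordered and ordered factors in R. The following is a minimal sketch under that assumption. The response values are made up for illustration and are not data from the exercise.

```r
# Hypothetical responses, invented only to illustrate the two factor kinds
satisfaction <- factor(c("Low", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)                       # ordinal
favorite_color <- factor(c("Red", "Blue", "Green", "Blue"))  # nominal

# Only the ordinal variable carries an order
print(is.ordered(satisfaction))
print(is.ordered(favorite_color))

# max() is defined for ordered factors and follows the level order
highest <- max(satisfaction)
print(as.character(highest))

# Frequencies are meaningful for both kinds of categorical data
color_counts <- table(favorite_color)
print(color_counts)
```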
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
No = 1:4,
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"Internal",
"External",
"External",
"Internal"
),
Structured_Unstructured = c(
"Structured",
"Structured",
"Unstructured",
"Structured"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
caption = "Table of Data Sources",
rownames = FALSE) # hides the index column
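To see how the four sources spread across the two classifications, a cross-tabulation can help. This is a small sketch that rebuilds only the columns it needs. The name `source_counts` is illustrative.

```r
# Rebuild the classification columns from the table above
data_sources <- data.frame(
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c("Internal", "External", "External", "Internal"),
  Structured_Unstructured = c("Structured", "Structured",
                              "Unstructured", "Structured"),
  stringsAsFactors = FALSE
)

# Count sources for each combination of origin and structure
source_counts <- table(data_sources$Internal_External,
                       data_sources$Structured_Unstructured)
print(source_counts)
```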
4 Exercise 4
Dataset Structure: Consider the following transaction table:
Date | Qty | Price | Product | CustomerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Creating the Transactions Table Above in R
1. Create a data frame in R called transactions containing the data above.
2. Identify which variables are numeric and which are categorical.
3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.
4. Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
5. Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
6. Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.
4.1 Data Variable
Data Frame
library(knitr)
library(reshape2)
# Column vectors for the transactions table
Date <- as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
                  "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
                  "2025-10-05", "2025-10-05"))
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
Product <- c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
             "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard")
CostumerTier <- c("High", "Medium", "Low", "Medium", "Medium",
                  "High", "Low", "High", "Low", "Medium")
# Combine the vectors into the transactions data frame
transactions <- data.frame(Date, Qty, Price,
                           Product, CostumerTier)
kable(transactions,
      caption = "Data Frame")
Date | Qty | Price | Product | CostumerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
Numeric Data
Numeric (quantitative) data are expressed in numbers that represent counts or measurements. They tell us how much or how many of something there is, and they support mathematical operations such as addition, subtraction, averaging, and statistical analysis. In this table, Qty and Price are numeric.
# Numeric variables only
Numeric <- data.frame(Qty, Price)
kable(Numeric,
      caption = "Numeric Data")
Qty | Price |
---|---|
2 | 1000 |
5 | 20 |
1 | 1000 |
3 | 30 |
4 | 50 |
2 | 1000 |
6 | 25 |
1 | 1000 |
3 | 40 |
5 | 10 |
Categorical Data
Categorical (qualitative) data are expressed as labels, names, or categories rather than numbers. They describe qualities, attributes, or classifications on which arithmetic such as addition or subtraction is not meaningful. In this table, Date, Product, and CostumerTier are treated as categorical.
Date | Product | CostumerTier |
---|---|---|
2025-10-01 | Laptop | High |
2025-10-01 | Mouse | Medium |
2025-10-02 | Laptop | Low |
2025-10-02 | Keyboard | Medium |
2025-10-03 | Mouse | Medium |
2025-10-03 | Laptop | High |
2025-10-04 | Keyboard | Low |
2025-10-04 | Laptop | High |
2025-10-05 | Mouse | Low |
2025-10-05 | Keyboard | Medium |
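The numeric and categorical split shown in the two tables above could also be derived programmatically rather than by hand. The sketch below assumes that anything `is.numeric()` rejects counts as categorical, which places Date in the categorical group as in the table above (dates are arguably a type of their own). The names `is_num`, `numeric_cols`, and `categorical_cols` are illustrative.

```r
# Rebuild the transactions data frame (column names as used in this report)
transactions <- data.frame(
  Date = as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
                   "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
                   "2025-10-05", "2025-10-05")),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CostumerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium")
)

# TRUE for columns R treats as numeric (Date objects return FALSE)
is_num <- sapply(transactions, is.numeric)
numeric_cols <- names(transactions)[is_num]
categorical_cols <- names(transactions)[!is_num]

print(numeric_cols)
print(categorical_cols)

# Subset the data frame by variable type
numeric_data <- transactions[, numeric_cols]
categorical_data <- transactions[, categorical_cols]
```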
Total Revenue per Transaction
Date | Qty | Price | Product | CostumerTier | Total |
---|---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High | 2000 |
2025-10-01 | 5 | 20 | Mouse | Medium | 100 |
2025-10-02 | 1 | 1000 | Laptop | Low | 1000 |
2025-10-02 | 3 | 30 | Keyboard | Medium | 90 |
2025-10-03 | 4 | 50 | Mouse | Medium | 200 |
2025-10-03 | 2 | 1000 | Laptop | High | 2000 |
2025-10-04 | 6 | 25 | Keyboard | Low | 150 |
2025-10-04 | 1 | 1000 | Laptop | High | 1000 |
2025-10-05 | 3 | 40 | Mouse | Low | 120 |
2025-10-05 | 5 | 10 | Keyboard | Medium | 50 |
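The Total column in the table above is Qty multiplied by Price, row by row. A minimal sketch of that step follows. It rebuilds only the two columns it needs, so the data frame here is a reduced stand-in for transactions.

```r
# Reduced stand-in for the transactions data frame (Qty and Price only)
transactions <- data.frame(
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
)

# Vectorised multiplication: one revenue value per transaction
transactions$Total <- transactions$Qty * transactions$Price
print(transactions)

# Overall revenue across all ten transactions
grand_total <- sum(transactions$Total)
print(grand_total)
```

Because R arithmetic is vectorised, no loop is needed: the multiplication applies to each row at once.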
# Total quantity sold for each product
total_qty <- aggregate(Qty ~ Product, data = transactions, sum)
# Total revenue per product
total_revenue <- aggregate(Total ~ Product, data = transactions, sum)
# Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)
# Display each summary as its own table
kable(total_qty, caption = "Total Quantity")
kable(total_revenue, caption = "Total Revenue")
kable(avg_price, caption = "Average Price")
Product | Qty |
---|---|
Keyboard | 14 |
Laptop | 6 |
Mouse | 12 |
Product | Total |
---|---|
Keyboard | 290 |
Laptop | 6000 |
Mouse | 420 |
Product | Price |
---|---|
Keyboard | 21.66667 |
Laptop | 1000.00000 |
Mouse | 36.66667 |
barplot(total_qty$Qty,
        names.arg = total_qty$Product, # label each bar with its product name
        main = "Total Quantity Sold per Product",
        col = "navy",
        xlab = "Product",
        ylab = "Total Quantity")
# Pie chart: share of total revenue per customer tier, with percentages
revenue_tier <- aggregate(Total ~ CostumerTier, data = transactions, sum)
revenue_tier$Percent <- round(100 * revenue_tier$Total / sum(revenue_tier$Total), 1)
tier_labels <- paste0(revenue_tier$CostumerTier, " - ", revenue_tier$Percent, "%")
pie(revenue_tier$Total,
    labels = tier_labels,
    main = "Share of Total Revenue per Customer Tier (%)",
    col = c("lightblue", "lightgreen", "pink"))
Date with the Highest Total Revenue
revenue_date <- aggregate(Total ~ Date, data = transactions, sum)
revenue_date[which.max(revenue_date$Total),]
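Working through the data by hand, the daily totals are 2100, 1090, 2200, 1150, and 170, so the two lines above should single out 2025-10-03. The self-contained sketch below repeats the calculation to check this. The name `best_day` is illustrative.

```r
# Rebuild the columns needed for revenue per date
transactions <- data.frame(
  Date = as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
                   "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
                   "2025-10-05", "2025-10-05")),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
)
transactions$Total <- transactions$Qty * transactions$Price

# Sum revenue per date, then keep the row with the largest total
revenue_date <- aggregate(Total ~ Date, data = transactions, sum)
best_day <- revenue_date[which.max(revenue_date$Total), ]
print(best_day)
```

Note that `which.max()` returns only the first maximum, so a tie between dates would report the earlier one.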
# Stacked bar chart: quantity sold per product by customer tier
qty_tier <- aggregate(Qty ~ Product + CostumerTier, data = transactions, sum)
qty_wide <- dcast(qty_tier, Product ~ CostumerTier, value.var = "Qty", fill = 0)
# Transpose so each column (bar) is a product and each row (segment) is a tier
qty_matrix <- t(as.matrix(qty_wide[, -1]))
barplot(
  qty_matrix,
  beside = FALSE, # FALSE stacks the tiers; TRUE would draw grouped bars
  legend.text = rownames(qty_matrix),
  col = rainbow(nrow(qty_matrix)),
  main = "Quantity Sold per Product by Customer Tier",
  xlab = "Product",
  ylab = "Quantity",
  names.arg = qty_wide$Product
)
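A base-R route to the same chart, for readers who would rather not depend on reshape2, is to build the tier-by-product matrix with `xtabs()`. This is a sketch of that alternative, not the approach used in the report. `barplot()` draws one bar per matrix column and stacks the rows, so tiers go in the rows and products in the columns.

```r
# Rebuild the columns needed for the chart (names as used in this report)
transactions <- data.frame(
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CostumerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium")
)

# Rows = customer tiers (stacked segments), columns = products (bars)
qty_matrix <- xtabs(Qty ~ CostumerTier + Product, data = transactions)
print(qty_matrix)

barplot(
  qty_matrix,
  beside = FALSE,                      # stack tiers within each product bar
  legend.text = rownames(qty_matrix),
  col = rainbow(nrow(qty_matrix)),
  main = "Quantity Sold per Product by Customer Tier",
  xlab = "Product",
  ylab = "Quantity"
)
```

The column sums of the matrix equal the total quantity per product, which gives a quick consistency check against the earlier summary table.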
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
1. Open RStudio or the R console.
2. Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within a month/year)
- Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
3. Combine all vectors into a data frame called my_data.
4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.
5. Optional tasks:
- Summarize each column using summary()
- Count the frequency of each category for Nominal and Ordinal columns using table()
5.2 Hints
- Use seq.Date() or as.Date() to generate the Date column.
- Use runif() or rnorm() for continuous numeric data.
- Use sample() for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
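# Exercise 5: KRL (commuter rail) usage data for September
# Column names are in Indonesian: Tanggal = date, Jumlah_Penumpang = number of
# passengers, Waktu_Tunggu = waiting time, Jenis_Jalur = line type,
# Tingkat_Kepadatan = crowding level
library(knitr)
library(DT)
# Simulated data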
Tanggal <- seq.Date(from = as.Date("2025-09-01"), to = as.Date("2025-09-30"), by = "day")
Jumlah_Penumpang <- sample(4800:5500, 30, replace = TRUE)
Waktu_Tunggu <- round(runif(30, min = 6.0, max = 8.5), 1)
Jenis_Jalur <- sample(c("Merah", "Hijau", "Biru", "Kuning"), 30, replace = TRUE)
Tingkat_Kepadatan <- factor(
sample(c("Rendah", "Sedang", "Tinggi", "Sangat Tinggi"), 30, replace = TRUE),
levels = c("Rendah", "Sedang", "Tinggi", "Sangat Tinggi"),
ordered = TRUE
)
#Data Frame
my_data <- data.frame(
Tanggal,
Jumlah_Penumpang,
Waktu_Tunggu,
Jenis_Jalur,
Tingkat_Kepadatan,
stringsAsFactors = FALSE
)
# Interactive table with colored cells
datatable(
my_data,
extensions = 'Buttons', # required for the 'B' (buttons) element in dom
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center; font-weight: bold; font-size: 16px; color: #2C3E50;',
'Table 1. Simulated KRL Usage Data for September'
),
options = list(
pageLength = 10,
autoWidth = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#2C3E50', 'color': '#fff'});",
"}"
)
),
rownames = FALSE,
class = 'cell-border stripe hover compact'
) %>%
formatStyle(
'Jenis_Jalur',
backgroundColor = styleEqual(
c("Merah", "Hijau", "Biru", "Kuning"),
c("#FF6B6B", "#6BCB77", "#4D96FF", "#FFD93D")
),
color = "black",
fontWeight = "bold"
) %>%
formatStyle(
'Tingkat_Kepadatan',
backgroundColor = styleEqual(
c("Rendah", "Sedang", "Tinggi", "Sangat Tinggi"),
c("#DFFFD6", "#FFF3B0", "#FFD6A5", "#FFB5A7")
),
color = "black",
fontWeight = "bold"
)
SUMMARY
Tanggal | Jumlah_Penumpang | Waktu_Tunggu | Jenis_Jalur | Tingkat_Kepadatan |
---|---|---|---|---|
Min. :2025-09-01 | Min. :4831 | Min. :6.200 | Length:30 | Rendah :11 |
1st Qu.:2025-09-08 | 1st Qu.:4940 | 1st Qu.:7.025 | Class :character | Sedang : 3 |
Median :2025-09-15 | Median :5120 | Median :7.250 | Mode :character | Tinggi : 8 |
Mean :2025-09-15 | Mean :5138 | Mean :7.350 | | Sangat Tinggi: 8 |
3rd Qu.:2025-09-22 | 3rd Qu.:5321 | 3rd Qu.:7.850 | | |
Max. :2025-09-30 | Max. :5488 | Max. :8.500 | | |
LINE TYPE, Jenis_Jalur (NOMINAL)
##
## Biru Hijau Kuning Merah
## 6 5 8 11
CROWDING LEVEL, Tingkat_Kepadatan (ORDINAL)
##
## Rendah Sedang Tinggi Sangat Tinggi
## 11 3 8 8
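The summary and frequency tables above can be produced with `summary()` and `table()`, as the optional tasks suggest. The sketch below is self-contained and fixes a seed so that it is repeatable. The seed value is an assumption and was not part of the original run, so the numbers it prints will differ from the output shown above.

```r
set.seed(123)  # assumed seed for repeatability; not used in the original run

# Simulated data with the same structure as my_data in this exercise
my_data <- data.frame(
  Tanggal = seq.Date(as.Date("2025-09-01"), as.Date("2025-09-30"), by = "day"),
  Jumlah_Penumpang = sample(4800:5500, 30, replace = TRUE),
  Waktu_Tunggu = round(runif(30, min = 6.0, max = 8.5), 1),
  Jenis_Jalur = sample(c("Merah", "Hijau", "Biru", "Kuning"), 30, replace = TRUE),
  Tingkat_Kepadatan = factor(
    sample(c("Rendah", "Sedang", "Tinggi", "Sangat Tinggi"), 30, replace = TRUE),
    levels = c("Rendah", "Sedang", "Tinggi", "Sangat Tinggi"),
    ordered = TRUE
  ),
  stringsAsFactors = FALSE
)

# Per-column summary: quantiles for dates and numbers, counts for the factor
print(summary(my_data))

# Frequency of each category
line_counts <- table(my_data$Jenis_Jalur)            # nominal: alphabetical order
crowding_counts <- table(my_data$Tingkat_Kepadatan)  # ordinal: level order
print(line_counts)
print(crowding_counts)
```

The nominal counts come out in alphabetical order because Jenis_Jalur is a character vector, while the ordinal counts follow the declared level order of the factor.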