Data Exploration
Exercises ~ Week 2
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted (discrete)
- Ordinal: YearLevel (Freshman → Senior)
StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
---|---|---|---|---|---|
S001 | Alice | 20 | 45 | Data Science | Sophomore |
S002 | Budi | 21 | 60 | Mathematics | Junior |
S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Science", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
levels = c("Freshman","Sophomore","Junior","Senior"),
ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Science Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
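Because YearLevel is stored as an ordered factor, R can compare and sort its values by level order rather than alphabetically. The following is a minimal sketch of that behavior. It rebuilds a reduced copy of the data so it runs on its own, and the name `students_demo` is introduced here only for illustration.

```r
# Reduced copy of the students data (illustrative name: students_demo)
students_demo <- data.frame(
  Name = c("Alice", "Budi", "Citra"),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# Ordered factors support comparison operators against a level name
above_freshman <- students_demo$Name[students_demo$YearLevel > "Freshman"]
print(above_freshman)

# Sorting follows the declared level order, not alphabetical order
sorted_names <- students_demo$Name[order(students_demo$YearLevel)]
print(sorted_names)
```

The same comparison on a plain character vector would be alphabetical, which is why the `levels` argument matters for ordinal data.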
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
No = 1:5,
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"numeric",
"numeric",
"categorical",
"categorical",
"categorical"
),
Subtype = c(
"discrete",
"continuous",
"nominal",
"ordinal",
"nominal"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types")
No | Variable | DataType | Subtype |
---|---|---|---|
1 | Number of vehicles passing through the toll road each day | numeric | discrete |
2 | Student height in cm | numeric | continuous |
3 | Employee gender (Male / Female) | categorical | nominal |
4 | Customer satisfaction level: Low, Medium, High | categorical | ordinal |
5 | Respondent’s favorite color: Red, Blue, Green | categorical | nominal |
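The classification above can also be mirrored in how each variable is stored in R. The sketch below is only a heuristic: `classify_variable()` is a hypothetical helper written for this illustration, the sample values are made up, and the helper reads the storage class, so whole numbers stored as doubles would be reported as continuous.

```r
# Hypothetical helper: guess a measurement type from the R storage class
classify_variable <- function(x) {
  if (is.ordered(x)) {
    "categorical (ordinal)"
  } else if (is.factor(x) || is.character(x)) {
    "categorical (nominal)"
  } else if (is.integer(x)) {
    "numeric (discrete)"
  } else if (is.numeric(x)) {
    "numeric (continuous)"
  } else {
    "other"
  }
}

# Made-up sample values for the five variables in the table
examples <- list(
  vehicles_per_day = c(1200L, 1350L, 980L),
  height_cm = c(160.5, 172.3, 168.0),
  gender = c("Male", "Female", "Female"),
  satisfaction = factor(c("Low", "High", "Medium"),
                        levels = c("Low", "Medium", "High"), ordered = TRUE),
  favorite_color = factor(c("Red", "Blue", "Green"))
)

types <- sapply(examples, classify_variable)
print(types)
```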
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
No = 1:4,
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"internal",
"external",
"external",
"internal"
),
Structured_Unstructured = c(
"structured",
"structured",
"unstructured",
"structured"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
caption = "Table of Data Sources",
rownames = FALSE) # hides the index column
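Once the sources are classified, a cross-tabulation gives a quick count of each combination. This is a minimal sketch that rebuilds the classification under the illustrative names `sources_demo`, `Origin`, and `Structure`, which are not part of the exercise code.

```r
# Same classification as in the exercise, rebuilt so the sketch runs on its own
sources_demo <- data.frame(
  DataSource = c("Daily sales transactions", "BMKG weather reports",
                 "Social media product reviews", "Warehouse inventory reports"),
  Origin = c("internal", "external", "external", "internal"),
  Structure = c("structured", "structured", "unstructured", "structured"),
  stringsAsFactors = FALSE
)

# Count sources for each origin/structure combination
source_counts <- table(sources_demo$Origin, sources_demo$Structure)
print(source_counts)

# List the sources that are both internal and structured
internal_structured <- subset(sources_demo,
                              Origin == "internal" & Structure == "structured")
print(internal_structured$DataSource)
```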
4 Exercise 4
Dataset Structure: Consider the following transaction table:
Date | Qty | Price | Product | CustomerTier | Total |
---|---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High | 2000 |
2025-10-01 | 5 | 20 | Mouse | Medium | 100 |
2025-10-02 | 1 | 1000 | Laptop | Low | 1000 |
2025-10-02 | 3 | 30 | Keyboard | Medium | 90 |
2025-10-03 | 4 | 50 | Mouse | Medium | 200 |
2025-10-03 | 2 | 1000 | Laptop | High | 2000 |
2025-10-04 | 6 | 25 | Keyboard | Low | 150 |
2025-10-04 | 1 | 1000 | Laptop | High | 1000 |
2025-10-05 | 3 | 40 | Mouse | Low | 120 |
2025-10-05 | 5 | 10 | Keyboard | Medium | 50 |
Assignment instructions: recreate the transactions table above in R.
Create a data frame in R called transactions containing the data above.
Identify which variables are numeric and which are categorical:
No | Variable | Data Type |
---|---|---|
1. | Date | Categorical (Ordinal) |
2. | Qty | Numerical (Discrete) |
3. | Price | Numerical (Discrete) |
4. | Product | Categorical (Nominal) |
5. | CustomerTier | Categorical (Ordinal) |
6. | Total | Numerical (Discrete) |
Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column, Total.
Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.
transactions <- data.frame(
Date = as.Date(c(
"2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
"2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
"2025-10-05", "2025-10-05"
)),
Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
"Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
"High", "Low", "High", "Low", "Medium"),
stringsAsFactors = FALSE
)
# Inspect the structure of the data frame
str(transactions)
## 'data.frame': 10 obs. of 5 variables:
## $ Date : Date, format: "2025-10-01" "2025-10-01" ...
## $ Qty : num 2 5 1 3 4 2 6 1 3 5
## $ Price : num 1000 20 1000 30 50 1000 25 1000 40 10
## $ Product : chr "Laptop" "Mouse" "Laptop" "Keyboard" ...
## $ CustomerTier: chr "High" "Medium" "Low" "Medium" ...
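The structure output shows how R stores each column, which is a starting point for separating numeric from categorical variables. The sketch below rebuilds the data under the illustrative name `tx_demo`. Note that the storage class is not the same as the measurement scale: CustomerTier is stored as character even though it is ordinal.

```r
# Self-contained copy of the transactions data (illustrative name: tx_demo)
tx_demo <- data.frame(
  Date = as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                       "2025-10-04", "2025-10-05"), each = 2)),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

# Split column names by whether R stores them as numeric
is_num <- sapply(tx_demo, is.numeric)
numeric_cols <- names(tx_demo)[is_num]
other_cols <- names(tx_demo)[!is_num]
print(numeric_cols)
print(other_cols)
```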
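# Total revenue per transaction (Qty × Price), added as a new column
transactions$Total <- transactions$Qty * transactions$Price
# Total quantity sold per product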
total_qty <- aggregate(Qty ~ Product, data = transactions, sum)
# Total revenue per product
total_revenue <- aggregate(Total ~ Product, data = transactions, sum)
# Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)
total_qty
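The three per-product summaries can also be combined into a single table. This is one possible sketch using `merge()` from base R; the names `tx_demo`, `product_summary`, `TotalQty`, `TotalRevenue`, and `AvgPrice` are illustrative and not part of the exercise.

```r
# Self-contained copy of the columns needed here (illustrative name: tx_demo)
tx_demo <- data.frame(
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  stringsAsFactors = FALSE
)
tx_demo$Total <- tx_demo$Qty * tx_demo$Price

# One aggregate per statistic, all keyed by Product
per_product <- list(
  aggregate(Qty ~ Product, data = tx_demo, FUN = sum),
  aggregate(Total ~ Product, data = tx_demo, FUN = sum),
  aggregate(Price ~ Product, data = tx_demo, FUN = mean)
)

# Merge the three results on Product and give the columns clearer names
product_summary <- Reduce(function(a, b) merge(a, b, by = "Product"), per_product)
names(product_summary) <- c("Product", "TotalQty", "TotalRevenue", "AvgPrice")
print(product_summary)
```

AvgPrice here is the unweighted mean of the listed prices. A quantity-weighted average would instead be TotalRevenue divided by TotalQty.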
# Barplot total quantity sold per product
barplot(
total_qty$Qty,
names.arg = total_qty$Product,
col = "maroon",
main = "Total Quantity Sold per Product",
xlab = "Product",
ylab = "Total Quantity"
)
# Pie chart total revenue per customer tier
# 1. Summarize total revenue by customer tier
revenue_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum)
# 2. Coral color palette
coral_palette <- c("#FF7F50", "#FF6F61", "#FFA07A")
# Coral, deep coral, light salmon
# 3. Calculate percentage for each tier
percentages <- round(100 * revenue_tier$Total / sum(revenue_tier$Total), 1)
# 4. Combine labels: tier name + percentage + total value
labels <- paste0(
revenue_tier$CustomerTier,
"\n", percentages, "% (", revenue_tier$Total, ")"
)
# 5. Create pie chart
pie(
revenue_tier$Total,
labels = labels,
main = "Proportion of Total Revenue per Customer Tier",
col = coral_palette,
clockwise = TRUE,
border = "white"
)
# 6. Add legend on the right side
legend(
"topright",
legend = paste(revenue_tier$CustomerTier, "-", percentages, "%"),
fill = coral_palette,
border = "white",
title = "Customer Tier",
bty = "n"
)
# Find the date with the highest total revenue
revenue_date <- aggregate(Total ~ Date, data = transactions, sum)
revenue_date[which.max(revenue_date$Total), ]
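`which.max()` returns only the first maximum, so a tie between two dates would go unnoticed. A small sketch of a tie-safe variant follows, using the illustrative names `tx_demo`, `revenue_by_date`, and `top_dates`.

```r
# Self-contained copy of the columns needed here (illustrative name: tx_demo)
tx_demo <- data.frame(
  Date = as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                       "2025-10-04", "2025-10-05"), each = 2)),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
)
tx_demo$Total <- tx_demo$Qty * tx_demo$Price

# Sum revenue per date, then keep every date that reaches the maximum
revenue_by_date <- aggregate(Total ~ Date, data = tx_demo, FUN = sum)
top_dates <- revenue_by_date[revenue_by_date$Total == max(revenue_by_date$Total), ]
print(top_dates)
```

For this table the result is a single row, 2025-10-03 with a total of 2200.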
# Stacked Bar Chart: Quantity Sold per Product by Customer Tier
library(reshape2)
# Make summary of total quantity by product and customer tier
qty_tier <- aggregate(Qty ~ Product + CustomerTier, data = transactions, sum)
# Change to wide format (rows = Product, columns = Customer Tier)
qty_wide <- dcast(qty_tier, Product ~ CustomerTier, value.var = "Qty", fill = 0)
# Coral colors
coral_colors <- c("#FF7F50", "#FF6F61", "#FFA07A")
# Change to matrix for barplot
qty_matrix <- as.matrix(qty_wide[, -1])
# Calculate total per product
total_per_product <- rowSums(qty_matrix)
# Calculate percent for each part
percent <- round(qty_matrix / total_per_product * 100, 1)
# Make the stacked bar chart
bar_pos <- barplot(
t(qty_matrix),
col = coral_colors,
beside = FALSE,
legend = colnames(qty_wide)[-1],
main = "Quantity Sold per Product by Customer Tier",
xlab = "Product",
ylab = "Quantity",
names.arg = qty_wide$Product,
border = "white"
)
# Add labels: number + percent (empty segments are skipped)
for (i in seq_len(nrow(qty_matrix))) {
y_bottom <- 0
for (j in seq_len(ncol(qty_matrix))) {
value <- qty_matrix[i, j]
if (value > 0) {
label <- paste0(value, " (", percent[i, j], "%)")
text(bar_pos[i], y_bottom + value / 2, labels = label, cex = 0.8)
}
y_bottom <- y_bottom + value
}
}
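As an alternative that needs no extra package, `xtabs()` from base R produces the tier-by-product matrix directly. The sketch below also stores CustomerTier as an ordered factor so the stack and legend follow Low, Medium, High instead of alphabetical order. The names `tx_demo` and `qty_table` and the colors are illustrative choices.

```r
# Self-contained copy of the columns needed here (illustrative name: tx_demo)
tx_demo <- data.frame(
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = factor(c("High", "Medium", "Low", "Medium", "Medium",
                          "High", "Low", "High", "Low", "Medium"),
                        levels = c("Low", "Medium", "High"), ordered = TRUE)
)

# Rows = customer tier, columns = product; missing combinations become 0
qty_table <- xtabs(Qty ~ CustomerTier + Product, data = tx_demo)
print(qty_table)

# Each column of the table becomes one stacked bar
barplot(
  qty_table,
  beside = FALSE,
  col = c("#FF7F50", "#FF6F61", "#FFA07A"),
  legend.text = rownames(qty_table),
  main = "Quantity Sold per Product by Customer Tier",
  xlab = "Product",
  ylab = "Quantity"
)
```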
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within a month/year)
- Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
Combine all vectors into a data frame called my_data.
Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.
Optional tasks:
- Summarize each column using summary()
- Count the frequency of each category for Nominal and Ordinal columns using table()
5.2 Hints
- Use seq.Date() or as.Date() to generate the Date column.
- Use runif() or rnorm() for continuous numeric data.
- Use sample() for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
# Daily Fruit Enjoyment (30 Days)
set.seed(321)
# 1. Create basic columns
No <- 1:30
Date <- seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30)
# 2. Continuous variable: enjoyment level after eating fruit (drawn from 5–10, one decimal place)
Enjoyment <- round(runif(30, min = 5, max = 10), 1)
# 3. Discrete variable: number of fruits eaten per day (0–5)
FruitsEaten <- sample(0:5, 30, replace = TRUE)
# 4. Nominal variable: type of fruit eaten
Fruit <- sample(c("Apple", "Banana", "Orange", "Grape", "Mango"), 30, replace = TRUE)
# 5. Ordinal variable: energy level based on enjoyment
Energy <- cut(Enjoyment,
breaks = c(-Inf, 6, 8, Inf),
labels = c("Low", "Medium", "High"),
ordered_result = TRUE)
# 6. Combine all into one data frame
fruit_data <- data.frame(
No,
Date,
Enjoyment,
FruitsEaten,
Fruit,
Energy
)
# 7. Set the column names explicitly (they already match the vector names, so this step is optional)
colnames(fruit_data) <- c("No", "Date", "Enjoyment", "FruitsEaten", "Fruit", "Energy")
# 8. Display the main data table
knitr::kable(fruit_data, caption = "Daily Fruit Enjoyment (30 Days)")
No | Date | Enjoyment | FruitsEaten | Fruit | Energy |
---|---|---|---|---|---|
1 | 2025-10-01 | 9.8 | 3 | Grape | High |
2 | 2025-10-02 | 9.7 | 3 | Banana | High |
3 | 2025-10-03 | 6.2 | 3 | Mango | Medium |
4 | 2025-10-04 | 6.3 | 2 | Mango | Medium |
5 | 2025-10-05 | 7.0 | 5 | Orange | Medium |
6 | 2025-10-06 | 6.7 | 4 | Apple | Medium |
7 | 2025-10-07 | 7.3 | 5 | Grape | Medium |
8 | 2025-10-08 | 6.4 | 3 | Apple | Medium |
9 | 2025-10-09 | 7.3 | 4 | Banana | Medium |
10 | 2025-10-10 | 9.0 | 0 | Orange | High |
11 | 2025-10-11 | 8.0 | 0 | Mango | Medium |
12 | 2025-10-12 | 6.8 | 2 | Banana | Medium |
13 | 2025-10-13 | 8.8 | 5 | Grape | High |
14 | 2025-10-14 | 5.2 | 4 | Grape | Low |
15 | 2025-10-15 | 8.0 | 5 | Banana | Medium |
16 | 2025-10-16 | 6.0 | 0 | Banana | Low |
17 | 2025-10-17 | 8.2 | 1 | Orange | High |
18 | 2025-10-18 | 7.0 | 0 | Mango | Medium |
19 | 2025-10-19 | 6.5 | 5 | Apple | Medium |
20 | 2025-10-20 | 8.2 | 4 | Banana | High |
21 | 2025-10-21 | 8.2 | 2 | Grape | High |
22 | 2025-10-22 | 9.9 | 0 | Mango | High |
23 | 2025-10-23 | 9.7 | 5 | Apple | High |
24 | 2025-10-24 | 7.4 | 3 | Mango | Medium |
25 | 2025-10-25 | 7.9 | 5 | Banana | Medium |
26 | 2025-10-26 | 8.8 | 0 | Apple | High |
27 | 2025-10-27 | 10.0 | 4 | Orange | High |
28 | 2025-10-28 | 7.2 | 2 | Orange | Medium |
29 | 2025-10-29 | 5.6 | 4 | Grape | Low |
30 | 2025-10-30 | 8.0 | 2 | Banana | Medium |
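The instructions ask to confirm that the data frame has 30 rows and that the columns are correct. A minimal sketch of such a check follows; it regenerates comparable data under the illustrative name `check_data`, and the individual checks are assumptions about what "correct" means here.

```r
set.seed(321)

# Comparable data, rebuilt so the sketch runs on its own
check_data <- data.frame(
  Date = seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30),
  Enjoyment = round(runif(30, min = 5, max = 10), 1),
  FruitsEaten = sample(0:5, 30, replace = TRUE),
  Fruit = sample(c("Apple", "Banana", "Orange", "Grape", "Mango"), 30, replace = TRUE),
  stringsAsFactors = FALSE
)
check_data$Energy <- cut(check_data$Enjoyment,
                         breaks = c(-Inf, 6, 8, Inf),
                         labels = c("Low", "Medium", "High"),
                         ordered_result = TRUE)

# One named logical per requirement
checks <- c(
  has_30_rows = nrow(check_data) == 30,
  date_is_date = inherits(check_data$Date, "Date"),
  enjoyment_in_range = all(check_data$Enjoyment >= 5 & check_data$Enjoyment <= 10),
  fruits_are_whole = all(check_data$FruitsEaten %% 1 == 0),
  energy_is_ordered = is.ordered(check_data$Energy)
)
print(checks)
```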
# === Summary Section ===
# Summary for numeric columns (as a data frame)
numeric_summary <- data.frame(
Variable = c("Enjoyment", "FruitsEaten"),
Minimum = c(min(fruit_data$Enjoyment), min(fruit_data$FruitsEaten)),
Maximum = c(max(fruit_data$Enjoyment), max(fruit_data$FruitsEaten)),
Mean = c(round(mean(fruit_data$Enjoyment), 2), round(mean(fruit_data$FruitsEaten), 2)),
Median = c(median(fruit_data$Enjoyment), median(fruit_data$FruitsEaten))
)
# Frequency tables for Nominal and Ordinal columns
fruit_freq <- as.data.frame(table(fruit_data$Fruit))
colnames(fruit_freq) <- c("Fruit", "Frequency")
energy_freq <- as.data.frame(table(fruit_data$Energy))
colnames(energy_freq) <- c("Energy_Level", "Frequency")
# Display all summaries neatly
knitr::kable(numeric_summary, caption = "Summary of Numeric Variables")
knitr::kable(fruit_freq)
knitr::kable(energy_freq)
Variable | Minimum | Maximum | Mean | Median |
---|---|---|---|---|
Enjoyment | 5.2 | 10 | 7.70 | 7.65 |
FruitsEaten | 0.0 | 5 | 2.83 | 3.00 |
Fruit | Frequency |
---|---|
Apple | 5 |
Banana | 8 |
Grape | 6 |
Mango | 6 |
Orange | 5 |
Energy_Level | Frequency |
---|---|
Low | 3 |
Medium | 16 |
High | 11 |
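For an ordinal column, the frequency counts can be extended with shares and cumulative counts, since the level order is meaningful. The sketch below rebuilds the Energy column from the frequencies shown above (3, 16, 11); the names `energy`, `energy_share`, and `energy_cumulative` are illustrative.

```r
# Rebuild the ordinal column from the reported frequencies
energy_levels <- c("Low", "Medium", "High")
energy <- factor(rep(energy_levels, times = c(3, 16, 11)),
                 levels = energy_levels, ordered = TRUE)

# Counts, percentage shares, and cumulative counts in level order
energy_counts <- table(energy)
energy_share <- round(100 * prop.table(energy_counts), 1)
energy_cumulative <- cumsum(energy_counts)

print(energy_counts)
print(energy_share)
print(energy_cumulative)
```

The cumulative counts only make sense because the factor is ordered; for a nominal column such as Fruit they would depend on an arbitrary level order.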