Data Exploration
Exercises ~ Week 2
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted
(discrete)
- Ordinal: YearLevel (Freshman → Senior)
StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
---|---|---|---|---|---|
S001 | Alice | 20 | 45 | Data Sains | Sophomore |
S002 | Budi | 21 | 60 | Mathematics | Junior |
S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
levels = c("Freshman","Sophomore","Junior","Senior"),
ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Sains Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
No = 1:5,
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"Numeric",
"numeric",
"Categorical",
"Categorical",
"Categorical"
),
Subtype = c(
"Continous",
"continous",
"Nominal",
"Ordinal",
"Nominal"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types")
No | Variable | DataType | Subtype |
---|---|---|---|
1 | Number of vehicles passing through the toll road each day | Numeric | Continous |
2 | Student height in cm | numeric | continous |
3 | Employee gender (Male / Female) | Categorical | Nominal |
4 | Customer satisfaction level: Low, Medium, High | Categorical | Ordinal |
5 | Respondent’s favorite color: Red, Blue, Green | Categorical | Nominal |
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
No = 1:4,
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"Internal",
"External",
"External",
"Internal"
),
Structured_Unstructured = c(
"Structured",
"Structured",
"Unstructured",
"Structured"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
caption = "Table of Data Sources",
rownames = FALSE) # hides the index column
4 Exercise 4
Dataset Structure: Consider the following transaction table:
Date | Qty | Price | Product | CustomerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Creating a Transactions Table above in R
Create a data frame in R called
transactions
containing the data above.Identify which variables are numeric and which are categorical
Calculate total revenue for each transaction by multiplying
Qty × Price
and add it as a new columnTotal
.Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame()
,
aggregate()
, barplot()
, pie()
,
and basic arithmetic operations in R.
library(DT)
#transaction
Date <- c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02", "2025-10-03",
"2025-10-03", "2025-10-04", "2025-10-04", "2025-10-05", "2025-10-05")
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
Product <- c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse", "Laptop",
"Keyboard", "Laptop", "Mouse", "Keyboard")
CustomerTier <- c("High", "Medium", "Low", "Medium", "Medium", "High",
"Low", "High", "Low", "Medium")
# data frame
transactions <- data.frame(Date, Qty, Price, Product, CustomerTier)
# Display the data frame
print(transactions)
## Date Qty Price Product CustomerTier
## 1 2025-10-01 2 1000 Laptop High
## 2 2025-10-01 5 20 Mouse Medium
## 3 2025-10-02 1 1000 Laptop Low
## 4 2025-10-02 3 30 Keyboard Medium
## 5 2025-10-03 4 50 Mouse Medium
## 6 2025-10-03 2 1000 Laptop High
## 7 2025-10-04 6 25 Keyboard Low
## 8 2025-10-04 1 1000 Laptop High
## 9 2025-10-05 3 40 Mouse Low
## 10 2025-10-05 5 10 Keyboard Medium
# 3 Total Reveneu
transactions$total <- transactions$Qty * transactions$Price
# view the data frame
print(transactions)
## Date Qty Price Product CustomerTier total
## 1 2025-10-01 2 1000 Laptop High 2000
## 2 2025-10-01 5 20 Mouse Medium 100
## 3 2025-10-02 1 1000 Laptop Low 1000
## 4 2025-10-02 3 30 Keyboard Medium 90
## 5 2025-10-03 4 50 Mouse Medium 200
## 6 2025-10-03 2 1000 Laptop High 2000
## 7 2025-10-04 6 25 Keyboard Low 150
## 8 2025-10-04 1 1000 Laptop High 1000
## 9 2025-10-05 3 40 Mouse Low 120
## 10 2025-10-05 5 10 Keyboard Medium 50
## Product Qty
## 1 Keyboard 14
## 2 Laptop 6
## 3 Mouse 12
## Product total
## 1 Keyboard 290
## 2 Laptop 6000
## 3 Mouse 420
## Product Price
## 1 Keyboard 21.66667
## 2 Laptop 1000.00000
## 3 Mouse 36.66667
# Calculate total quantity per product
qty_per_product <- aggregate(Qty ~ Product, data = transactions, sum)
# Create a bar plot
barplot (qty_per_product$Qty,
names.arg = qty_per_product$Product,
col = "orange",
main = "Total Quantity Sold per Product",
xlab = "Product", ylab = "Quantity Sold")
# Calculate total revenue per CustomerTier
revenue_tier <- aggregate(total ~ CustomerTier, data = transactions, sum)
# create pie chart
pie(revenue_tier$tota,
labels = revenue_tier$CustomerTier,
main = "Proportion of Total Revenue per Customer Tier",
col = c("red", "orange", "pink"))
# Hitung total revenue per tanggal
revenue_per_date <- aggregate(total ~ Date, data = transactions, sum)
# Lihat hasilnya
print(revenue_per_date)
## Date total
## 1 2025-10-01 2100
## 2 2025-10-02 1090
## 3 2025-10-03 2200
## 4 2025-10-04 1150
## 5 2025-10-05 170
# Cari tanggal dengan total revenue paling tingi menggunakan max
max_date <- revenue_per_date[which.max(revenue_per_date$total), ]
#lihat hasilnya
print(max_date)
## Date total
## 3 2025-10-03 2200
library(ggplot2)
# Bikin stacked bar chart
ggplot(transactions, aes(x = reorder(Product, Qty), y = Qty, fill = CustomerTier)) +
geom_bar(stat = "identity") +
labs(title = "Quantity Sold per Product by Customer Tier",
x = "Product",
y = "Quantity Sold") +
theme_minimal()
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within
a month/year)
- Continuous: numeric values that can take decimal
values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole
numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no
order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
- Date: 30 dates (can be sequential or random within
a month/year)
Combine all vectors into a data frame called
my_data
.Check your data frame using
head()
orView()
to ensure it has 30 rows and the columns are correct.Optional tasks:
- Summarize each column using
summary()
- Count the frequency of each category for Nominal
and Ordinal columns using
table()
- Summarize each column using
5.2 Hints
- Use
seq.Date()
oras.Date()
to generate the Date column.
- Use
runif()
orrnorm()
for continuous numeric data.
- Use
sample()
for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with
factor(..., levels = c("Low","Medium","High"), ordered = TRUE)
(or similar).
library(DT)
# Data Financial Frame
# column 1: Transaction Date
Date <- seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30)
# Column 2: Transaction value (Continuous) — In rupiah
TransactionValue <- c(
1250000, 980000, 1500000, 2100000, 1850000, 750000, 2400000, 1300000, 1780000, 2220000,
1950000, 1650000, 1420000, 2550000, 3100000, 2750000, 1980000, 880000, 1750000, 2200000,
1900000, 2600000, 2450000, 1350000, 1500000, 900000, 1200000, 2750000, 3200000, 2100000
)
# Column 3: Number of items purchased (Discrete)
Items <- c(
2, 1, 3, 5, 4, 2, 6, 3, 5, 4,
2, 3, 1, 7, 8, 5, 3, 2, 4, 6,
5, 7, 8, 3, 4, 1, 2, 7, 9, 5
)
# Column 4: Transaction type (Nominal)
TransactionType <- c(
"Pembelian Tunai", "Transfer Bank", "Kartu Kredit", "Pembelian Tunai", "E-Wallet",
"Transfer Bank", "Kartu Kredit", "E-Wallet", "Pembelian Tunai", "E-Wallet",
"Kartu Kredit", "Pembelian Tunai", "Transfer Bank", "E-Wallet", "Kartu Kredit",
"Pembelian Tunai", "E-Wallet", "Transfer Bank", "Kartu Kredit", "Pembelian Tunai",
"E-Wallet", "Transfer Bank", "Kartu Kredit", "Pembelian Tunai", "E-Wallet",
"Kartu Kredit", "Transfer Bank", "Pembelian Tunai", "E-Wallet", "Transfer Bank"
)
# Kolom 5: Customer Tier (Ordinal)
CustomerTier <- factor(
c(
"Sedang", "Rendah", "Sedang", "Tinggi", "Sedang", "Rendah", "Tinggi", "Sedang", "Tinggi", "Sedang",
"Rendah", "Tinggi", "Sedang", "Tinggi", "Tinggi", "Sedang", "Rendah", "Sedang", "Tinggi", "Sedang",
"Rendah", "Tinggi", "Tinggi", "Rendah", "Sedang", "Rendah", "Sedang", "Tinggi", "Tinggi", "Sedang"
),
levels = c("Rendah", "Sedang", "Tinggi"),
ordered = TRUE
)
# Combine into a data frame
data_financial <- data.frame(Date, TransactionValue, Items, TransactionType, CustomerTier)
# Display the data frame as a table
print(data_financial)
## Date TransactionValue Items TransactionType CustomerTier
## 1 2025-10-01 1250000 2 Pembelian Tunai Sedang
## 2 2025-10-02 980000 1 Transfer Bank Rendah
## 3 2025-10-03 1500000 3 Kartu Kredit Sedang
## 4 2025-10-04 2100000 5 Pembelian Tunai Tinggi
## 5 2025-10-05 1850000 4 E-Wallet Sedang
## 6 2025-10-06 750000 2 Transfer Bank Rendah
## 7 2025-10-07 2400000 6 Kartu Kredit Tinggi
## 8 2025-10-08 1300000 3 E-Wallet Sedang
## 9 2025-10-09 1780000 5 Pembelian Tunai Tinggi
## 10 2025-10-10 2220000 4 E-Wallet Sedang
## 11 2025-10-11 1950000 2 Kartu Kredit Rendah
## 12 2025-10-12 1650000 3 Pembelian Tunai Tinggi
## 13 2025-10-13 1420000 1 Transfer Bank Sedang
## 14 2025-10-14 2550000 7 E-Wallet Tinggi
## 15 2025-10-15 3100000 8 Kartu Kredit Tinggi
## 16 2025-10-16 2750000 5 Pembelian Tunai Sedang
## 17 2025-10-17 1980000 3 E-Wallet Rendah
## 18 2025-10-18 880000 2 Transfer Bank Sedang
## 19 2025-10-19 1750000 4 Kartu Kredit Tinggi
## 20 2025-10-20 2200000 6 Pembelian Tunai Sedang
## 21 2025-10-21 1900000 5 E-Wallet Rendah
## 22 2025-10-22 2600000 7 Transfer Bank Tinggi
## 23 2025-10-23 2450000 8 Kartu Kredit Tinggi
## 24 2025-10-24 1350000 3 Pembelian Tunai Rendah
## 25 2025-10-25 1500000 4 E-Wallet Sedang
## 26 2025-10-26 900000 1 Kartu Kredit Rendah
## 27 2025-10-27 1200000 2 Transfer Bank Sedang
## 28 2025-10-28 2750000 7 Pembelian Tunai Tinggi
## 29 2025-10-29 3200000 9 E-Wallet Tinggi
## 30 2025-10-30 2100000 5 Transfer Bank Sedang
# Create a second table for variable type
variables_info <- data.frame(
No = 1:4,
variable = c(
"Transaction Value",
"Items",
"Transaction Type",
"Customer Tier"
),
DataType = c(
"Continuous",
"Discrete",
"Nominal",
"Ordinal"
),
stringsAsFactors = FALSE
)
library(knitr)
kable(
variables_info,
caption = "Table of Variables and Data Types"
)
No | variable | DataType |
---|---|---|
1 | Transaction Value | Continuous |
2 | Items | Discrete |
3 | Transaction Type | Nominal |
4 | Customer Tier | Ordinal |