Data Exploration
Exercises ~ Week 3
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted (discrete)
- Ordinal: YearLevel (Freshman → Senior)
| StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
|---|---|---|---|---|---|
| S001 | Alice | 20 | 45 | Data Science | Sophomore |
| S002 | Budi | 21 | 60 | Mathematics | Junior |
| S003 | Citra | 19 | 30 | Statistics | Freshman |

# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Science", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                    ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)

##   StudentID  Name Age CreditsCompleted        Major YearLevel
## 1      S001 Alice  20               45 Data Science Sophomore
## 2      S002  Budi  21               60  Mathematics    Junior
## 3      S003 Citra  19               30   Statistics  Freshman
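
As a quick check on the classification above, the following sketch inspects how R stores each kind of column and shows what the ordering of `YearLevel` makes possible. It rebuilds a reduced copy of the data frame so that it runs on its own; the names `students_check`, `col_classes`, and `is_upper_year` are illustrative and not part of the exercise.

```r
# Reduced copy of the students data (students_check is an illustrative name)
students_check <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Age = c(20, 21, 19),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# First class label of each column: character, numeric, ordered
col_classes <- sapply(students_check, function(x) class(x)[1])
print(col_classes)

# Ordered factors support comparisons that follow the level order
is_upper_year <- students_check$YearLevel >= "Junior"
print(is_upper_year)
```

Only the ordinal column supports a comparison such as `>= "Junior"`; the same comparison on a nominal (unordered) factor returns `NA` with a warning.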
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
No = 1:5,
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"Numeric",
"Numeric",
"Categorical",
"Categorical",
"Categorical"
),
Subtype = c(
"Discrete",
"Continuous",
"Nominal",
"Ordinal",
"Nominal"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
      caption = "Table of Variables and Data Types")

| No | Variable | DataType | Subtype |
|---|---|---|---|
| 1 | Number of vehicles passing through the toll road each day | Numeric | Discrete |
| 2 | Student height in cm | Numeric | Continuous |
| 3 | Employee gender (Male / Female) | Categorical | Nominal |
| 4 | Customer satisfaction level: Low, Medium, High | Categorical | Ordinal |
| 5 | Respondent’s favorite color: Red, Blue, Green | Categorical | Nominal |
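
The subtypes in the table map fairly directly onto R storage types. The sketch below uses made-up sample values for four of the five variables (all object names and values are assumptions for illustration) and checks how each one is stored.

```r
# Hypothetical sample values, not taken from any real data set
vehicles_per_day <- c(1520L, 1498L, 1733L)          # discrete: integer counts
height_cm <- c(162.5, 170.2, 158.9)                 # continuous: double
satisfaction <- factor(c("Low", "High", "Medium"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)              # ordinal: ordered factor
favorite_color <- factor(c("Red", "Blue", "Red"))   # nominal: unordered factor

# Check the storage of each vector
checks <- c(
  discrete_is_integer  = is.integer(vehicles_per_day),
  continuous_is_double = is.double(height_cm),
  ordinal_is_ordered   = is.ordered(satisfaction),
  nominal_is_ordered   = is.ordered(favorite_color)
)
print(checks)
```

The last check is expected to be `FALSE`: a nominal variable is a factor, but its levels carry no ranking.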
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
No = 1:4,
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"Internal",
"External",
"External",
"Internal"
),
Structured_Unstructured = c(
"Structured",
"Structured",
"Unstructured",
"Structured"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column

4 Exercise 4
Dataset Structure: Consider the following transaction table:
| Date | Qty | Price | Product | CustomerTier |
|---|---|---|---|---|
| 2025-10-01 | 2 | 1000 | Laptop | High |
| 2025-10-01 | 5 | 20 | Mouse | Medium |
| 2025-10-02 | 1 | 1000 | Laptop | Low |
| 2025-10-02 | 3 | 30 | Keyboard | Medium |
| 2025-10-03 | 4 | 50 | Mouse | Medium |
| 2025-10-03 | 2 | 1000 | Laptop | High |
| 2025-10-04 | 6 | 25 | Keyboard | Low |
| 2025-10-04 | 1 | 1000 | Laptop | High |
| 2025-10-05 | 3 | 40 | Mouse | Low |
| 2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Create the Transactions Table Above in R
1. Create a data frame in R called `transactions` containing the data above.
2. Identify which variables are numeric and which are categorical.
3. Calculate total revenue for each transaction by multiplying `Qty × Price` and add it as a new column `Total`.
4. Compute summary statistics:
   - Total quantity sold for each product
   - Total revenue per product
   - Average price per product
5. Visualize the data:
   - Create a barplot showing total quantity sold per product.
   - Create a pie chart showing the proportion of total revenue per customer tier.
6. Optional Challenge:
   - Find which date had the highest total revenue.
   - Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use `data.frame()`, `aggregate()`, `barplot()`, `pie()`, and basic arithmetic operations in R.
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# 1. Create vectors for each variable
# Date (converted from text to R's Date class)
Date <- as.Date(c(
  "2025-10-01",
  "2025-10-01",
  "2025-10-02",
  "2025-10-02",
  "2025-10-03",
  "2025-10-03",
  "2025-10-04",
  "2025-10-04",
  "2025-10-05",
  "2025-10-05"
))
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5) # Numeric / Discrete
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10) # Numeric / Continuous
# Nominal
Product <- c(
  "Laptop",
  "Mouse",
  "Laptop",
  "Keyboard",
  "Mouse",
  "Laptop",
  "Keyboard",
  "Laptop",
  "Mouse",
  "Keyboard"
)
# Ordinal
CustomerTier <- factor(c(
  "High",
  "Medium",
  "Low",
  "Medium",
  "Medium",
  "High",
  "Low",
  "High",
  "Low",
  "Medium"
),
levels = c("Low", "Medium", "High"),
ordered = TRUE)
# Combine all vectors into a data frame
transactions <- data.frame(Date, Qty, Price, Product, CustomerTier, stringsAsFactors = FALSE)
# Display the data frame
kable(transactions)

| Date | Qty | Price | Product | CustomerTier |
|---|---|---|---|---|
| 2025-10-01 | 2 | 1000 | Laptop | High |
| 2025-10-01 | 5 | 20 | Mouse | Medium |
| 2025-10-02 | 1 | 1000 | Laptop | Low |
| 2025-10-02 | 3 | 30 | Keyboard | Medium |
| 2025-10-03 | 4 | 50 | Mouse | Medium |
| 2025-10-03 | 2 | 1000 | Laptop | High |
| 2025-10-04 | 6 | 25 | Keyboard | Low |
| 2025-10-04 | 1 | 1000 | Laptop | High |
| 2025-10-05 | 3 | 40 | Mouse | Low |
| 2025-10-05 | 5 | 10 | Keyboard | Medium |
# 2. Identify variable types
variable_types <- data.frame(
  Variable = c(
    "Date",
    "Qty",
    "Price",
    "Product",
    "CustomerTier"
  ),
  Type = c(
    "Date / Time",
    "Numeric (Discrete)",
    "Numeric (Continuous)",
    "Categorical (Nominal)",
    "Categorical (Ordinal)"
  )
)
# Display the data frame
kable(variable_types, caption = "Variable Types in Transactions Data")

| Variable | Type |
|---|---|
| Date | Date / Time |
| Qty | Numeric (Discrete) |
| Price | Numeric (Continuous) |
| Product | Categorical (Nominal) |
| CustomerTier | Categorical (Ordinal) |
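
Dates deserve a separate note, because R can hold them either as plain text or as `Date` objects, and only the latter behave like points on a calendar. A minimal sketch, using two of the transaction dates and illustrative object names:

```r
# Two of the transaction dates as plain text (illustrative subset)
date_text <- c("2025-10-01", "2025-10-05")
date_value <- as.Date(date_text)

print(class(date_text))   # stored as character
print(class(date_value))  # stored as Date

# Date objects support arithmetic and calendar formatting
days_between <- as.numeric(diff(date_value))
print(days_between)
print(format(date_value, "%Y-%m"))
```

With character storage the subtraction would fail, which is one reason to convert date columns before analysis.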
# 3. Calculate total revenue
transactions$total <- transactions$Qty * transactions$Price
kable(transactions)

| Date | Qty | Price | Product | CustomerTier | total |
|---|---|---|---|---|---|
| 2025-10-01 | 2 | 1000 | Laptop | High | 2000 |
| 2025-10-01 | 5 | 20 | Mouse | Medium | 100 |
| 2025-10-02 | 1 | 1000 | Laptop | Low | 1000 |
| 2025-10-02 | 3 | 30 | Keyboard | Medium | 90 |
| 2025-10-03 | 4 | 50 | Mouse | Medium | 200 |
| 2025-10-03 | 2 | 1000 | Laptop | High | 2000 |
| 2025-10-04 | 6 | 25 | Keyboard | Low | 150 |
| 2025-10-04 | 1 | 1000 | Laptop | High | 1000 |
| 2025-10-05 | 3 | 40 | Mouse | Low | 120 |
| 2025-10-05 | 5 | 10 | Keyboard | Medium | 50 |
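
The multiplication is element-wise: each row's quantity is multiplied by that row's price. The sketch below repeats the calculation on the first three rows with `transform()`, one possible base R alternative to the `$` assignment; `tx_sample` is an illustrative name.

```r
# First three rows of the transaction table, rebuilt so the block runs on its own
tx_sample <- data.frame(
  Qty = c(2, 5, 1),
  Price = c(1000, 20, 1000)
)

# transform() adds the derived column without repeating the data frame name
tx_sample <- transform(tx_sample, total = Qty * Price)
print(tx_sample)
```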
# 4. Compute summary statistics
# a. Total quantity sold for each product
total_Qty <- aggregate(Qty ~ Product, data = transactions, sum)
kable(total_Qty, caption = "Total Quantity Sold per Product")

| Product | Qty |
|---|---|
| Keyboard | 14 |
| Laptop | 6 |
| Mouse | 12 |
# b. Total revenue per product
total_revenue <- aggregate(total ~ Product, data = transactions, sum)
kable(total_revenue, caption = "Total Revenue per Product")

| Product | total |
|---|---|
| Keyboard | 290 |
| Laptop | 6000 |
| Mouse | 420 |
# c. Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)
kable(avg_price, caption = "Average Price per Product")

| Product | Price |
|---|---|
| Keyboard | 21.66667 |
| Laptop | 1000.00000 |
| Mouse | 36.66667 |
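
The averages above treat every transaction equally, regardless of how many units were sold. A quantity-weighted average (total revenue divided by total quantity) answers a slightly different question: the average price paid per unit. The sketch below computes it from the same ten transactions; `tx`, `per_product`, and `weighted_avg_price` are illustrative names, and the weighted average is an addition for comparison rather than something the exercise asks for.

```r
# The ten transactions, rebuilt so the block runs on its own
tx <- data.frame(
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
)
tx$total <- tx$Qty * tx$Price

# Sum quantity and revenue per product in one call
per_product <- aggregate(cbind(Qty, total) ~ Product, data = tx, FUN = sum)

# Quantity-weighted average price = revenue / units sold
per_product$weighted_avg_price <- per_product$total / per_product$Qty
print(per_product)
```

For Keyboard the weighted figure is 290 / 14, roughly 20.71, slightly below the simple mean of 21.67, because the largest orders were placed at the lowest prices.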
# 5. Visualize the data
# a. Bar plot showing total quantity sold per product
total_qty <- tapply(transactions$Qty, transactions$Product, sum)
barplot(total_qty,
        main = "Total Quantity Sold per Product",
        xlab = "Product",
        ylab = "Total Quantity",
        col = "lightblue")
# b. Pie chart showing the proportion of total revenue per customer tier
total_revenue_tier <- tapply(transactions$total, transactions$CustomerTier, sum)
pie(total_revenue_tier,
    main = "Proportion of Total Revenue per Customer Tier",
    col = rainbow(length(total_revenue_tier)))
# 6. Optional challenge
# a. Find which date had the highest total revenue
total_revenue_date <- aggregate(total ~ Date, data = transactions, sum)
total_revenue_date[which.max(total_revenue_date$total), ]
# b. Stacked bar chart showing quantity sold per product by customer tier
# Rows are tiers (stacked segments); columns are products (one bar each)
qty_table <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
barplot(qty_table,
        main = "Quantity Sold per Product by Customer Tier",
        xlab = "Product",
        ylab = "Quantity",
        col = c("lightblue", "lightgreen", "pink"),
        legend.text = TRUE) # legend labels come from the tier row names

5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
1. Open RStudio or the R console.
2. Create a vector for each column in your data frame:
   - Date: 30 dates (can be sequential or random within a month/year)
   - Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
   - Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
   - Nominal: categorical values with no order (e.g., color, gender, city)
   - Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
3. Combine all vectors into a data frame called `my_data`.
4. Check your data frame using `head()` or `View()` to ensure it has 30 rows and the columns are correct.
5. Optional tasks:
   - Summarize each column using `summary()`
   - Count the frequency of each category for Nominal and Ordinal columns using `table()`
5.2 Hints
- Use `seq.Date()` or `as.Date()` to generate the Date column.
- Use `runif()` or `rnorm()` for continuous numeric data.
- Use `sample()` for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with `factor(..., levels = c("Low","Medium","High"), ordered = TRUE)` (or similar).
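
The hints above can be sketched as follows. Because `runif()`, `rnorm()`, and `sample()` draw random values, each run produces different data unless a seed is fixed first; calling `set.seed()` is a suggestion added here, not a requirement of the exercise. The seed value, the variable names, and the distribution parameters are all arbitrary choices for illustration.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

n <- 30
dates <- seq(as.Date("2025-01-01"), by = "day", length.out = n)  # 30 sequential dates
temperature <- round(rnorm(n, mean = 27, sd = 2), 1)             # continuous
visitors <- sample(0:50, n, replace = TRUE)                      # discrete
level <- factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                levels = c("Low", "Medium", "High"),
                ordered = TRUE)                                  # ordinal

# Resetting the same seed reproduces the same random draws
set.seed(123)
temperature_again <- round(rnorm(n, mean = 27, sd = 2), 1)
print(identical(temperature, temperature_again))
```

Fixing the seed means that tables printed in a report match what a reader gets when re-running the code.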
# 1. Coffee Shop Data
# Date: the 30 days of September
Date <- seq(as.Date("2025-09-01"), as.Date("2025-09-30"), by = "day")
# Continuous: millilitres of coffee sold per day (random, 1500 to 4000 ml)
Coffee_ml <- runif(30, min = 1500, max = 4000)
# Discrete: number of cups of coffee sold per day (random, 20 to 100)
Cups_Sold <- sample(20:100, 30, replace = TRUE)
# Nominal: type of coffee drink
Drink_Type <- sample(c("Americano", "Cappuccino", "Latte", "Espresso", "Mocha"), 30, replace = TRUE)
# Ordinal: customer satisfaction level
Customer_Satisfaction <- factor(
  sample(c("Poor", "Fair", "Good", "Very Good", "Excellent"), 30, replace = TRUE),
  levels = c("Poor", "Fair", "Good", "Very Good", "Excellent"),
  ordered = TRUE)
# Combine all vectors into a data frame
my_data <- data.frame(Date, Coffee_ml, Cups_Sold, Drink_Type, Customer_Satisfaction)
kable(my_data)

| Date | Coffee_ml | Cups_Sold | Drink_Type | Customer_Satisfaction |
|---|---|---|---|---|
| 2025-09-01 | 2433.254 | 52 | Latte | Very Good |
| 2025-09-02 | 1950.821 | 93 | Cappuccino | Good |
| 2025-09-03 | 2164.201 | 93 | Espresso | Good |
| 2025-09-04 | 2171.674 | 41 | Cappuccino | Poor |
| 2025-09-05 | 1619.997 | 73 | Mocha | Poor |
| 2025-09-06 | 3433.387 | 92 | Mocha | Fair |
| 2025-09-07 | 3883.128 | 99 | Americano | Poor |
| 2025-09-08 | 1805.518 | 66 | Latte | Good |
| 2025-09-09 | 2996.276 | 20 | Espresso | Good |
| 2025-09-10 | 1866.226 | 93 | Espresso | Poor |
| 2025-09-11 | 3003.257 | 42 | Espresso | Poor |
| 2025-09-12 | 2511.445 | 46 | Americano | Fair |
| 2025-09-13 | 2181.126 | 23 | Americano | Fair |
| 2025-09-14 | 1949.932 | 61 | Cappuccino | Poor |
| 2025-09-15 | 3096.677 | 73 | Mocha | Excellent |
| 2025-09-16 | 2773.454 | 62 | Espresso | Very Good |
| 2025-09-17 | 2928.137 | 33 | Cappuccino | Fair |
| 2025-09-18 | 2531.876 | 28 | Latte | Excellent |
| 2025-09-19 | 1546.950 | 44 | Americano | Excellent |
| 2025-09-20 | 2117.450 | 33 | Mocha | Excellent |
| 2025-09-21 | 3021.659 | 90 | Latte | Good |
| 2025-09-22 | 3329.366 | 37 | Americano | Excellent |
| 2025-09-23 | 2288.079 | 88 | Americano | Fair |
| 2025-09-24 | 2204.085 | 25 | Mocha | Good |
| 2025-09-25 | 3257.974 | 21 | Espresso | Fair |
| 2025-09-26 | 3986.743 | 45 | Espresso | Excellent |
| 2025-09-27 | 2030.136 | 63 | Latte | Fair |
| 2025-09-28 | 2666.939 | 24 | Americano | Fair |
| 2025-09-29 | 3255.205 | 43 | Mocha | Good |
| 2025-09-30 | 2359.553 | 87 | Mocha | Good |

Summary of each column, as produced by summary(my_data):

| Date | Coffee_ml | Cups_Sold | Drink_Type | Customer_Satisfaction |
|---|---|---|---|---|
| Min. :2025-09-01 | Min. :1547 | Min. :20.00 | Length:30 | Poor :6 |
| 1st Qu.:2025-09-08 | 1st Qu.:2129 | 1st Qu.:34.00 | Class :character | Fair :8 |
| Median :2025-09-15 | Median :2472 | Median :49.00 | Mode :character | Good :8 |
| Mean :2025-09-15 | Mean :2579 | Mean :56.33 | NA | Very Good:2 |
| 3rd Qu.:2025-09-22 | 3rd Qu.:3017 | 3rd Qu.:83.50 | NA | Excellent:6 |
| Max. :2025-09-30 | Max. :3987 | Max. :99.00 | NA | NA |
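
The `Length` / `Class` / `Mode` entries for `Drink_Type` appear because that column is stored as character, for which `summary()` has no counts to report. Converting a nominal column to a factor makes `summary()` return per-category counts instead. A small sketch with a handful of made-up drink names (the vector `drink` is illustrative, not the data above):

```r
# A few drink names, used only for illustration
drink <- c("Latte", "Mocha", "Latte", "Espresso", "Mocha", "Latte")

# Character vector: summary() reports Length, Class, and Mode
print(summary(drink))

# Factor: summary() reports one count per level
drink_counts <- summary(factor(drink))
print(drink_counts)
```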
# Category frequencies (nominal, ordinal)
drink_freq <- table(my_data$Drink_Type)
satisfaction_freq <- table(my_data$Customer_Satisfaction)
kable(drink_freq, caption = "Frequency of Drink Types (Nominal)")

| Var1 | Freq |
|---|---|
| Americano | 7 |
| Cappuccino | 4 |
| Espresso | 7 |
| Latte | 5 |
| Mocha | 7 |

Frequency of customer satisfaction levels (ordinal), from satisfaction_freq:

| Var1 | Freq |
|---|---|
| Poor | 6 |
| Fair | 8 |
| Good | 8 |
| Very Good | 2 |
| Excellent | 6 |
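
Because satisfaction is ordinal, its frequencies can also be read cumulatively, for example "how many days were rated Good or worse". The sketch below starts from the counts in the table above, typed in by hand so that it does not depend on the random data; the object names are illustrative.

```r
# Counts copied from the satisfaction frequency table, in level order
satisfaction_counts <- c(Poor = 6, Fair = 8, Good = 8, "Very Good" = 2, Excellent = 6)

# Share of days in each category
proportions <- round(satisfaction_counts / sum(satisfaction_counts), 3)

# Running total along the ordered levels
cumulative <- cumsum(satisfaction_counts)

print(proportions)
print(cumulative)
```

The cumulative count reaches 22 at Good, so 22 of the 30 days were rated Good or lower. A running total like this is only meaningful for ordered categories; for the nominal drink types it would depend on an arbitrary ordering.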