Data Exploration

Exercises ~ Week 2

1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

Nominal: StudentID, Name, Major
Numeric: Age (continuous), CreditsCompleted (discrete)
Ordinal: YearLevel (Freshman → Senior)

StudentID	Name	Age	CreditsCompleted	Major	YearLevel
S001	Alice	20	45	Data Science	Sophomore
S002	Budi	21	60	Mathematics	Junior
S003	Citra	19	30	Statistics	Freshman

# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Science", "Mathematics", "Statistics")

# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman","Sophomore","Junior","Senior"),
                    ordered = TRUE)

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)

##   StudentID  Name Age CreditsCompleted        Major YearLevel
## 1      S001 Alice  20               45 Data Science Sophomore
## 2      S002  Budi  21               60  Mathematics    Junior
## 3      S003 Citra  19               30   Statistics  Freshman
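
As a quick check on the classification above, the following sketch rebuilds a reduced copy of the data frame (Major is omitted for brevity) and inspects how R stores each column. The helper names `col_classes`, `is_ordinal`, and `above_freshman` are illustrative and not part of the original exercise.

```r
# Reduced copy of the example data so this block runs on its own
students <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Name = c("Alice", "Budi", "Citra"),
  Age = c(20, 21, 19),
  CreditsCompleted = c(45, 60, 30),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# Storage class of each column (first class only, because an ordered
# factor carries two classes: "ordered" and "factor")
col_classes <- sapply(students, function(x) class(x)[1])
print(col_classes)

# An ordered factor supports comparisons between its levels
is_ordinal <- is.ordered(students$YearLevel)
above_freshman <- students$Name[students$YearLevel > "Freshman"]
print(above_freshman)
```

The comparison on the last lines only works because YearLevel was created with `ordered = TRUE`; the same comparison on a plain (nominal) factor produces a warning and returns `NA`.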

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Numerical",
    "Numerical",
    "Categorical",
    "Categorical",
    "Categorical"
  ),
  Subtype = c(
    "Discrete",
    "Continuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info,
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types
No	Variable	DataType	Subtype
1	Number of vehicles passing through the toll road each day	Numerical	Discrete
2	Student height in cm	Numerical	Continuous
3	Employee gender (Male / Female)	Categorical	Nominal
4	Customer satisfaction level: Low, Medium, High	Categorical	Ordinal
5	Respondent’s favorite color: Red, Blue, Green	Categorical	Nominal
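
One way to make these subtypes concrete is to store a few values of each variable in the R type that matches it. The sketch below does this with made-up sample values; the vector names and the numbers are assumptions for illustration only.

```r
# Hypothetical sample values, one vector per variable in the table
vehicles_per_day <- c(1520L, 1787L, 1643L)           # discrete: integer counts
height_cm <- c(158.4, 171.0, 165.7)                  # continuous: double
gender <- factor(c("Male", "Female", "Female"))      # nominal: unordered factor
satisfaction <- factor(c("High", "Low", "Medium"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)               # ordinal: ordered factor
favorite_color <- factor(c("Red", "Blue", "Green"))  # nominal: unordered factor

# Collect the type checks in one named logical vector
type_checks <- c(
  discrete = is.integer(vehicles_per_day),
  continuous = is.double(height_cm),
  nominal = is.factor(gender) && !is.ordered(gender),
  ordinal = is.ordered(satisfaction)
)
print(type_checks)

# Only the ordinal variable has a meaningful maximum level
highest_satisfaction <- as.character(max(satisfaction))
print(highest_satisfaction)
```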

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column
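
To see how the two classifications combine, the sources can be cross-tabulated with base R's `table()`. This is a small sketch that rebuilds the data frame without the DT package; the object name `source_counts` is an assumption.

```r
# Rebuild the classification so this block runs without DT
data_sources <- data.frame(
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c("Internal", "External", "External", "Internal"),
  Structured_Unstructured = c("Structured", "Structured",
                              "Unstructured", "Structured"),
  stringsAsFactors = FALSE
)

# Count sources for each origin/structure combination
source_counts <- table(data_sources$Internal_External,
                       data_sources$Structured_Unstructured)
print(source_counts)
```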

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date	Qty	Price	Product	CustomerTier	Total
2025-10-01	2	1000	Laptop	High	2000
2025-10-01	5	20	Mouse	Medium	100
2025-10-02	1	1000	Laptop	Low	1000
2025-10-02	3	30	Keyboard	Medium	90
2025-10-03	4	50	Mouse	Medium	200
2025-10-03	2	1000	Laptop	High	2000
2025-10-04	6	25	Keyboard	Low	150
2025-10-04	1	1000	Laptop	High	1000
2025-10-05	3	40	Mouse	Low	120
2025-10-05	5	10	Keyboard	Medium	50

Assignment instructions: create the transactions table above in R.

Create a data frame in R called transactions containing the data above.
Identify which variables are numeric and which are categorical.

No	Variable	Data Type	Subtype of Data Type
1.	Date	Categorical	Ordinal
2.	Qty	Numerical	Discrete
3.	Price	Numerical	Continuous
4.	Product	Categorical	Nominal
5.	CustomerTier	Categorical	Ordinal
6.	Total	Numerical	Continuous

Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.
Compute summary statistics:
- Total quantity sold per product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

transactions <- data.frame(
  Date = as.Date(c(
    "2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
    "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
    "2025-10-05", "2025-10-05"
  )),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

transactions

str(transactions)

## 'data.frame':    10 obs. of  5 variables:
##  $ Date        : Date, format: "2025-10-01" "2025-10-01" ...
##  $ Qty         : num  2 5 1 3 4 2 6 1 3 5
##  $ Price       : num  1000 20 1000 30 50 1000 25 1000 40 10
##  $ Product     : chr  "Laptop" "Mouse" "Laptop" "Keyboard" ...
##  $ CustomerTier: chr  "High" "Medium" "Low" "Medium" ...
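
The `str()` output shows storage types, but the numeric/categorical split can also be derived programmatically. The sketch below rebuilds the same data frame and separates the columns with `is.numeric()`; the names `numeric_cols` and `categorical_cols` are illustrative. Note that R reports a Date column as non-numeric, so it lands in the second group here; how to treat dates analytically remains a judgment call.

```r
# Rebuild the transactions data so this block runs on its own
transactions <- data.frame(
  Date = as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                       "2025-10-04", "2025-10-05"), each = 2)),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

# Flag numeric columns, then split the column names into two groups
is_num <- sapply(transactions, is.numeric)
numeric_cols <- names(transactions)[is_num]
categorical_cols <- names(transactions)[!is_num]

print(numeric_cols)
print(categorical_cols)

# CustomerTier is ordinal, so an ordered factor captures its ranking
tier_ordered <- factor(transactions$CustomerTier,
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)
print(table(tier_ordered))
```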

transactions$Total <- transactions$Qty * transactions$Price
transactions

# Total quantity sold per product
total_qty <- aggregate(Qty ~ Product, data = transactions, sum)

# Total revenue per product
total_revenue <- aggregate(Total ~ Product, data = transactions, sum)

# Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)

total_qty

total_revenue

avg_price

# Barplot total quantity sold per product
barplot(
  total_qty$Qty,
  names.arg = total_qty$Product,
  col = "coral",
  main = "Total Quantity Sold per Product",
  xlab = "Product",
  ylab = "Total Quantity"
)

# Pie Chart: Total Revenue per Customer Tier

# 1. Summarize total revenue by customer tier
revenue_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum)

# 2. Coral color palette
coral_palette <- c("#FF7F50", "#FF6F61", "#FFA07A")
# Coral, a deeper coral shade, light salmon

# 3. Calculate percentage for each tier
percentages <- round(100 * revenue_tier$Total / sum(revenue_tier$Total), 1)

# 4. Combine labels: tier name + percentage + total value
labels <- paste0(
  revenue_tier$CustomerTier, 
  "\n", percentages, "% (", revenue_tier$Total, ")"
)

# 5. Create pie chart
pie(
  revenue_tier$Total,
  labels = labels,
  main = "Proportion of Total Revenue per Customer Tier",
  col = coral_palette,
  clockwise = TRUE,
  border = "white"
)

# 6. Add legend in the top-right corner
legend(
  "topright",
  legend = paste(revenue_tier$CustomerTier, "-", percentages, "%"),
  fill = coral_palette,
  border = "white",
  title = "Customer Tier",
  bty = "n"
)

# Find the date with the highest total revenue
revenue_date <- aggregate(Total ~ Date, data = transactions, sum)
revenue_date[which.max(revenue_date$Total), ]
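
`which.max()` returns only the first maximum, so a tie between two dates would go unnoticed. As a hedged alternative, the sketch below keeps every date whose revenue equals the maximum; it rebuilds only the columns it needs, and the names `daily` and `best_days` are assumptions.

```r
# Minimal copy of the columns needed for daily revenue
Date <- as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                      "2025-10-04", "2025-10-05"), each = 2))
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)

# Revenue per transaction, then summed per date
daily <- aggregate(Total ~ Date,
                   data = data.frame(Date, Total = Qty * Price),
                   FUN = sum)

# Keep all rows that reach the maximum, so ties are reported too
best_days <- daily[daily$Total == max(daily$Total), ]
print(best_days)
```

For this table the result is a single row, 2025-10-03 with a total of 2200.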


# Stacked Bar Chart: Quantity Sold per Product by Customer Tier

library(reshape2)

# Make summary of total quantity by product and customer tier
qty_tier <- aggregate(Qty ~ Product + CustomerTier, data = transactions, sum)

# Change to wide format (rows = Product, columns = Customer Tier)
qty_wide <- dcast(qty_tier, Product ~ CustomerTier, value.var = "Qty", fill = 0)

# Coral colors
coral_colors <- c("#FF7F50", "#FF6F61", "#FFA07A")

# Change to matrix for barplot
qty_matrix <- as.matrix(qty_wide[, -1])

# Calculate total per product
total_per_product <- rowSums(qty_matrix)

# Calculate percent for each part
percent <- round(qty_matrix / total_per_product * 100, 1)

# Make the stacked bar chart
bar_pos <- barplot(
  t(qty_matrix),
  col = coral_colors,
  beside = FALSE,
  legend = colnames(qty_wide)[-1],
  main = "Quantity Sold per Product by Customer Tier",
  xlab = "Product",
  ylab = "Quantity",
  names.arg = qty_wide$Product,
  border = "white"
)

# Add labels: number + percent
# (empty segments are skipped so "0 (0%)" labels do not overlap their neighbors)
for (i in seq_len(nrow(qty_matrix))) {
  y_bottom <- 0
  for (j in seq_len(ncol(qty_matrix))) {
    value <- qty_matrix[i, j]
    if (value > 0) {
      label <- paste0(value, " (", percent[i, j], "%)")
      text(bar_pos[i], y_bottom + value / 2, labels = label, cex = 0.8)
    }
    y_bottom <- y_bottom + value
  }
}
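
The same chart can be produced without reshape2, in line with the hint to stay with base R. The sketch below is one possible variant using `xtabs()`; it also stores CustomerTier as an ordered factor so the stacks follow Low, Medium, High instead of alphabetical order. The object name `qty_table` is an assumption.

```r
# Minimal copy of the columns needed for the chart
transactions <- data.frame(
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = factor(c("High", "Medium", "Low", "Medium", "Medium",
                          "High", "Low", "High", "Low", "Medium"),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE)
)

# Rows = customer tiers, columns = products, cells = summed quantity
qty_table <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
print(qty_table)

# barplot() stacks the rows of a two-way table within each column
barplot(
  qty_table,
  col = c("#FF7F50", "#FF6F61", "#FFA07A"),
  legend.text = TRUE,
  main = "Quantity Sold per Product by Customer Tier",
  xlab = "Product",
  ylab = "Quantity",
  border = "white"
)
```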

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within a month/year)
- Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
Combine all vectors into a data frame called my_data.
Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.
Optional tasks:
- Summarize each column using summary()
- Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

Use seq.Date() or as.Date() to generate the Date column.
Use runif() or rnorm() for continuous numeric data.
Use sample() for discrete, nominal, and ordinal data.
Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
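
Put together, the hints can be sketched as follows. The column names, value ranges, city names, and seed are placeholder assumptions; any variables of the right types would satisfy the exercise.

```r
set.seed(123)  # arbitrary seed, only to make the random draws repeatable
n <- 30

my_data <- data.frame(
  # 30 consecutive dates
  Date = seq.Date(from = as.Date("2025-01-01"), by = "day", length.out = n),
  # Continuous: hypothetical temperature readings with one decimal
  Temperature = round(rnorm(n, mean = 27, sd = 2), 1),
  # Discrete: hypothetical whole-number visitor counts
  Visitors = sample(0:50, n, replace = TRUE),
  # Nominal: categories without an inherent order
  City = sample(c("Jakarta", "Bandung", "Surabaya"), n, replace = TRUE),
  # Ordinal: ordered factor with explicit level order
  Satisfaction = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE),
  stringsAsFactors = FALSE
)

# Check the structure, then summarize and count categories
str(my_data)
summary(my_data)
table(my_data$City)
table(my_data$Satisfaction)
```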

# Daily Fruit Enjoyment (30 Days)
set.seed(321)

# 1. Create basic columns
No <- 1:30
Date <- seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30)

# 2. Continuous variable: enjoyment level after eating fruit (sampled between 5 and 10)
Enjoyment <- round(runif(30, min = 5, max = 10), 1)

# 3. Discrete variable: number of fruits eaten per day (0–5)
FruitsEaten <- sample(0:5, 30, replace = TRUE)

# 4. Nominal variable: type of fruit eaten
Fruit <- sample(c("Apple", "Banana", "Orange", "Grape", "Mango"), 30, replace = TRUE)

# 5. Ordinal variable: energy level based on enjoyment
Energy <- cut(Enjoyment,
              breaks = c(-Inf, 6, 8, Inf),
              labels = c("Low", "Medium", "High"),
              ordered_result = TRUE)

# 6. Combine all into one data frame
fruit_data <- data.frame(
  No,
  Date,
  Enjoyment,
  FruitsEaten,
  Fruit,
  Energy
)

# 7. Set column names explicitly (they already match the vector names)
colnames(fruit_data) <- c("No", "Date", "Enjoyment", "FruitsEaten", "Fruit", "Energy")

# 8. Display the main data table
knitr::kable(fruit_data, caption = "Daily Fruit Enjoyment (30 Days)")

Daily Fruit Enjoyment (30 Days)
No	Date	Enjoyment	FruitsEaten	Fruit	Energy
1	2025-10-01	9.8	3	Grape	High
2	2025-10-02	9.7	3	Banana	High
3	2025-10-03	6.2	3	Mango	Medium
4	2025-10-04	6.3	2	Mango	Medium
5	2025-10-05	7.0	5	Orange	Medium
6	2025-10-06	6.7	4	Apple	Medium
7	2025-10-07	7.3	5	Grape	Medium
8	2025-10-08	6.4	3	Apple	Medium
9	2025-10-09	7.3	4	Banana	Medium
10	2025-10-10	9.0	0	Orange	High
11	2025-10-11	8.0	0	Mango	Medium
12	2025-10-12	6.8	2	Banana	Medium
13	2025-10-13	8.8	5	Grape	High
14	2025-10-14	5.2	4	Grape	Low
15	2025-10-15	8.0	5	Banana	Medium
16	2025-10-16	6.0	0	Banana	Low
17	2025-10-17	8.2	1	Orange	High
18	2025-10-18	7.0	0	Mango	Medium
19	2025-10-19	6.5	5	Apple	Medium
20	2025-10-20	8.2	4	Banana	High
21	2025-10-21	8.2	2	Grape	High
22	2025-10-22	9.9	0	Mango	High
23	2025-10-23	9.7	5	Apple	High
24	2025-10-24	7.4	3	Mango	Medium
25	2025-10-25	7.9	5	Banana	Medium
26	2025-10-26	8.8	0	Apple	High
27	2025-10-27	10.0	4	Orange	High
28	2025-10-28	7.2	2	Orange	Medium
29	2025-10-29	5.6	4	Grape	Low
30	2025-10-30	8.0	2	Banana	Medium
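
The table shows an enjoyment of 6.0 classified as Low and 8.0 as Medium. This follows from `cut()` using right-closed intervals by default, so each break value belongs to the lower bin. The short sketch below demonstrates this with a few hand-picked values; the vectors `enjoyment` and `energy` are illustrative.

```r
# Hand-picked values on and around the two break points
enjoyment <- c(5.2, 6.0, 6.1, 8.0, 8.2)

# Same breaks and labels as in the exercise; intervals are
# (-Inf, 6], (6, 8], (8, Inf) because right = TRUE is the default
energy <- cut(enjoyment,
              breaks = c(-Inf, 6, 8, Inf),
              labels = c("Low", "Medium", "High"),
              ordered_result = TRUE)

print(data.frame(enjoyment, energy))
```

Passing `right = FALSE` would instead assign 6.0 to Medium and 8.0 to High.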

# === Summary Section ===

# Summary for numeric columns (as a data frame)
numeric_summary <- data.frame(
  Variable = c("Enjoyment", "FruitsEaten"),
  Minimum = c(min(fruit_data$Enjoyment), min(fruit_data$FruitsEaten)),
  Maximum = c(max(fruit_data$Enjoyment), max(fruit_data$FruitsEaten)),
  Mean = c(round(mean(fruit_data$Enjoyment), 2), round(mean(fruit_data$FruitsEaten), 2)),
  Median = c(median(fruit_data$Enjoyment), median(fruit_data$FruitsEaten))
)

# Frequency tables for Nominal and Ordinal columns
fruit_freq <- as.data.frame(table(fruit_data$Fruit))
colnames(fruit_freq) <- c("Fruit", "Frequency")

energy_freq <- as.data.frame(table(fruit_data$Energy))
colnames(energy_freq) <- c("Energy_Level", "Frequency")

# Display all summaries neatly
knitr::kable(numeric_summary, caption = "Summary of Numeric Variables")

Summary of Numeric Variables
Variable	Minimum	Maximum	Mean	Median
Enjoyment	5.2	10	7.70	7.65
FruitsEaten	0.0	5	2.83	3.00

knitr::kable(fruit_freq, caption = "Frequency of Nominal Variable (Fruit)")

Frequency of Nominal Variable (Fruit)
Fruit	Frequency
Apple	5
Banana	8
Grape	6
Mango	6
Orange	5

knitr::kable(energy_freq, caption = "Frequency of Ordinal Variable (Energy Level)")

Frequency of Ordinal Variable (Energy Level)
Energy_Level	Frequency
Low	3
Medium	16
High	11