Data Exploration
Exercises ~ Week 3
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted
(discrete)
- Ordinal: YearLevel (Freshman → Senior)
StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
---|---|---|---|---|---|
S001 | Alice | 20 | 45 | Data Sains | Sophomore |
S002 | Budi | 21 | 60 | Mathematics | Junior |
S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
levels = c("Freshman","Sophomore","Junior","Senior"),
ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Sains Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
No = 1:5,
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"Quantitative",
"Quantitative",
"Qualitative",
"Qualitative",
"Qualitative"
),
Subtype = c(
"Discrete",
"Countinuous",
"Nominal",
"Ordinal",
"Nominal"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types")
No | Variable | DataType | Subtype |
---|---|---|---|
1 | Number of vehicles passing through the toll road each day | Quantitative | Discrete |
2 | Student height in cm | Quantitative | Countinuous |
3 | Employee gender (Male / Female) | Qualitative | Nominal |
4 | Customer satisfaction level: Low, Medium, High | Qualitative | Ordinal |
5 | Respondent’s favorite color: Red, Blue, Green | Qualitative | Nominal |
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
No = 1:4,
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"Internal",
"External",
"External",
"Internal"
),
Structured_Unstructured = c(
"Structured",
"Structured",
"Unstructured",
"Structured"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
caption = "Table of Data Sources",
rownames = FALSE) # hides the index column
4 Exercise 4
Dataset Structure: Consider the following transaction table:
Date | Qty | Price | Product | CustomerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Creating a Transactions Table above in R
Create a data frame in R called
transactions
containing the data above.Identify which variables are numeric and which are categorical
Calculate total revenue for each transaction by multiplying
Qty × Price
and add it as a new columnTotal
.Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame()
,
aggregate()
, barplot()
, pie()
,
and basic arithmetic operations in R.
1. CREATE THE DATA FRAME
transactions <- data.frame(
Date = as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
"2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
"2025-10-05", "2025-10-05")),
Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
"Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
CustomerTier = c("High", "Medium", "Low", "Medium", "Medium", "High", "Low", "High", "Low", "Medium"),
stringsAsFactors = FALSE
)
# View the dataset contents
kable(transactions)
Date | Qty | Price | Product | CustomerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
2. IDENTIFY VARIABLE TYPES
cat_var <- data.frame(
Category = c("Date", "Qty", "Price", "Product", "CustomersTier"),
Variable = c("Categorical (Date)",
"Numeric (Discrete)",
"Numeric(Continuous)",
"Categorical (Nominal)",
"Categorical (Ordinal)"),
stringsAsFactors = FALSE)
#view result
kable(cat_var)
Category | Variable |
---|---|
Date | Categorical (Date) |
Qty | Numeric (Discrete) |
Price | Numeric(Continuous) |
Product | Categorical (Nominal) |
CustomersTier | Categorical (Ordinal) |
3. ADD NEW COLUMN: TOTAL (Qty × Price)
## Data Frame with Total Column
Date | Qty | Price | Product | CustomerTier | Total |
---|---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High | 2000 |
2025-10-01 | 5 | 20 | Mouse | Medium | 100 |
2025-10-02 | 1 | 1000 | Laptop | Low | 1000 |
2025-10-02 | 3 | 30 | Keyboard | Medium | 90 |
2025-10-03 | 4 | 50 | Mouse | Medium | 200 |
2025-10-03 | 2 | 1000 | Laptop | High | 2000 |
2025-10-04 | 6 | 25 | Keyboard | Low | 150 |
2025-10-04 | 1 | 1000 | Laptop | High | 1000 |
2025-10-05 | 3 | 40 | Mouse | Low | 120 |
2025-10-05 | 5 | 10 | Keyboard | Medium | 50 |
4. SUMMARY STATISTICS
##a. Total quantity sold per product
total_qty <- aggregate(Qty ~ Product, data = transactions, sum)
cat("Total Quantity Sold per Product")
## Total Quantity Sold per Product
Product | Qty |
---|---|
Keyboard | 14 |
Laptop | 6 |
Mouse | 12 |
##b. Total revenue per product
total_revenue <- aggregate(Total ~ Product, data = transactions, sum)
cat("Total Revenue per Product")
## Total Revenue per Product
Product | Total |
---|---|
Keyboard | 290 |
Laptop | 6000 |
Mouse | 420 |
##c. Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)
cat("Average Price per Product")
## Average Price per Product
Product | Price |
---|---|
Keyboard | 21.66667 |
Laptop | 1000.00000 |
Mouse | 36.66667 |
5. VISUALIZE THE DATA
## a. Barplot: Total Quantity per Product
# Sort total_qty from the largest to the smallest Qty
total_qty <- total_qty[order(total_qty$Qty, decreasing = TRUE), ]
barplot(total_qty$Qty,
names.arg = total_qty$Product,
col = "skyblue",
main = "Total Quantity Sold per Product",
xlab = "Product",
ylab = "Quantity Sold",
cex.names = 0.9 # Adjust the product labels to fit properly
)
## b. Pie Chart: Total Revenue per Customer Tier
revenue_by_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum)
# Calculate the percentage
percent <- round(revenue_by_tier$Total / sum(revenue_by_tier$Total) * 100, 1)
# Create labels that combine the category name and percentage
labels <- paste0(revenue_by_tier$CustomerTier, " (", percent, "%)")
# Generate the pie chart directly
pie(revenue_by_tier$Total,
labels = labels,
main = "Proportion of Total Revenue by Customer Tier",
col = c("lightgreen", "gold", "red"))
6. OPTIONAL CHALLENGE
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# 1 Create a transaction dataset
transactions <- data.frame(
Date = as.Date(c("2025-10-01","2025-10-01","2025-10-02","2025-10-02","2025-10-03",
"2025-10-03","2025-10-04","2025-10-04","2025-10-05","2025-10-05")),
Qty = c(2,5,1,3,4,2,6,1,3,5),
Price = c(1000,20,1000,30,50,1000,25,1000,40,10),
Product = c("Laptop","Mouse","Laptop","Keyboard","Mouse","Laptop","Keyboard","Laptop","Mouse","Keyboard"),
CustomerTier = c("High","Medium","Low","Medium","Medium","High","Low","High","Low","Medium")
)
# 2 Total revenue per transaction
transactions <- transactions %>%
mutate(TotalRevenue = Qty * Price)
# 3 Total revenue per date
total_per_date <- transactions %>%
group_by(Date) %>%
summarise(Total_Revenue = sum(TotalRevenue)) %>%
arrange(desc(Total_Revenue))
# 4 Find the date with the highest total revenue
highest_revenue_date <- total_per_date %>%
slice_max(Total_Revenue)
# 5 View the result
print("Tanggal dengan total pendapatan tertinggi:")
## [1] "Tanggal dengan total pendapatan tertinggi:"
library(ggplot2)
library(dplyr)
# Calculate total quantity per product to enable sorting
product_order <- transactions %>%
group_by(Product) %>%
summarise(total_qty = sum(Qty)) %>%
arrange(desc(total_qty)) %>%
pull(Product)
# Reorder product factor levels based on total_qty (from highest to lowest)
transactions$Product <- factor(transactions$Product, levels = product_order)
# Plot a clean stacked bar chart sorted from highest to lowest
ggplot(transactions, aes(x = Product, y = Qty, fill = CustomerTier)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Jumlah Penjualan per Produk berdasarkan Tingkatan Pelanggan",
x = "Produk",
y = "Jumlah Penjualan (Qty)",
fill = "Customer Tier"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold")
)
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within
a month/year)
- Continuous: numeric values that can take decimal
values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole
numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no
order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
- Date: 30 dates (can be sequential or random within
a month/year)
Combine all vectors into a data frame called
my_data
.Check your data frame using
head()
orView()
to ensure it has 30 rows and the columns are correct.Optional tasks:
- Summarize each column using
summary()
- Count the frequency of each category for Nominal
and Ordinal columns using
table()
- Summarize each column using
5.2 Hints
- Use
seq.Date()
oras.Date()
to generate the Date column.
- Use
runif()
orrnorm()
for continuous numeric data.
- Use
sample()
for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with
factor(..., levels = c("Low","Medium","High"), ordered = TRUE)
(or similar).
1. CREATE THE DATA FRAME
## [1] "C"
#DATE
Date <- seq(as.Date("2025-09-01"), as.Date("2025-09-30"), by = "day")
#CONTINOUS (TOTAL SALES)
Total_Sales <- round(rnorm(30, mean = 200000, sd = 50000), 0)
#DISCRETE (QUANTITY)
Quantity <- sample(3:20, 30, replace = TRUE)
#NOMINAL (COFFEE TYPE)
Coffee_Type <- sample(c("Americano", "Latte", "Cappuccino", "Espresso", "Mocha"),
30, replace = TRUE)
#ORDINAL (SALES LEVEL)
Sales_Level <- factor(sample(c("Low", "Medium", "High"), 30, replace = TRUE),
levels = c("Low", "Medium", "High"), ordered = TRUE)
# Combine into one table
my_data <- data.frame(
Date,
Total_Sales,
Quantity,
Coffee_Type,
Sales_Level)
#view the contents of the data
kable(my_data)
Date | Total_Sales | Quantity | Coffee_Type | Sales_Level |
---|---|---|---|---|
2025-09-01 | 171976 | 7 | Mocha | High |
2025-09-02 | 188491 | 10 | Cappuccino | Low |
2025-09-03 | 277935 | 14 | Latte | High |
2025-09-04 | 203525 | 15 | Mocha | Medium |
2025-09-05 | 206464 | 20 | Mocha | Medium |
2025-09-06 | 285753 | 3 | Cappuccino | Low |
2025-09-07 | 223046 | 8 | Latte | High |
2025-09-08 | 136747 | 17 | Latte | Medium |
2025-09-09 | 165657 | 11 | Latte | Low |
2025-09-10 | 177717 | 17 | Espresso | High |
2025-09-11 | 261204 | 18 | Latte | High |
2025-09-12 | 217991 | 8 | Latte | High |
2025-09-13 | 220039 | 13 | Espresso | Low |
2025-09-14 | 205534 | 10 | Espresso | Medium |
2025-09-15 | 172208 | 9 | Americano | Medium |
2025-09-16 | 289346 | 18 | Cappuccino | Low |
2025-09-17 | 224893 | 19 | Cappuccino | Medium |
2025-09-18 | 101669 | 20 | Americano | High |
2025-09-19 | 235068 | 19 | Cappuccino | High |
2025-09-20 | 176360 | 4 | Mocha | Low |
2025-09-21 | 146609 | 6 | Latte | Medium |
2025-09-22 | 189101 | 15 | Cappuccino | Low |
2025-09-23 | 148700 | 7 | Latte | Medium |
2025-09-24 | 163555 | 16 | Mocha | High |
2025-09-25 | 168748 | 5 | Mocha | High |
2025-09-26 | 115665 | 10 | Cappuccino | Low |
2025-09-27 | 241889 | 18 | Espresso | High |
2025-09-28 | 207669 | 14 | Espresso | Medium |
2025-09-29 | 143093 | 16 | Espresso | Low |
2025-09-30 | 262691 | 5 | Mocha | Low |
2. OPTIONAL TASKS
## [1] "C"
## 'data.frame': 30 obs. of 5 variables:
## $ Date : Date, format: "2025-09-01" "2025-09-02" ...
## $ Total_Sales: num 171976 188491 277935 203525 206464 ...
## $ Quantity : int 7 10 14 15 20 3 8 17 11 17 ...
## $ Coffee_Type: chr "Mocha" "Cappuccino" "Latte" "Mocha" ...
## $ Sales_Level: Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 1 3 2 2 1 3 2 1 3 ...
## Date Total_Sales Quantity Coffee_Type
## Min. :2025-09-01 Min. :101669 Min. : 3.0 Length:30
## 1st Qu.:2025-09-08 1st Qu.:166430 1st Qu.: 8.0 Class :character
## Median :2025-09-15 Median :196313 Median :13.5 Mode :character
## Mean :2025-09-15 Mean :197645 Mean :12.4
## 3rd Qu.:2025-09-22 3rd Qu.:224431 3rd Qu.:17.0
## Max. :2025-09-30 Max. :289346 Max. :20.0
## Sales_Level
## Low :10
## Medium: 9
## High :11
##
##
##
## NULL
## [1] High Low High Medium Medium Low High Medium Low High
## [11] High High Low Medium Medium Low Medium High High Low
## [21] Medium Low Medium High High Low High Medium Low Low
## Levels: Low < Medium < High