Data Exploration
Exercises ~ Week 2
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted (discrete)
- Ordinal: YearLevel (Freshman → Senior)
StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
---|---|---|---|---|---|
S001 | Alice | 20 | 45 | Data Sains | Sophomore |
S002 | Budi | 21 | 60 | Mathematics | Junior |
S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
levels = c("Freshman","Sophomore","Junior","Senior"),
ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Sains Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
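Declaring YearLevel as an ordered factor is what lets R treat it as ordinal rather than nominal. The following minimal sketch rebuilds only that column from the exercise and shows that order-based comparisons then become meaningful; the object names `is_below_junior` and `highest_level` are illustrative and not part of the original exercise.

```r
# Rebuild the YearLevel column from the exercise as an ordered factor
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                    ordered = TRUE)

# Ordered factors support comparisons that follow the declared level order
is_below_junior <- YearLevel < "Junior"
print(is_below_junior)   # TRUE FALSE TRUE

# max() also respects the level order, not the alphabetical order
highest_level <- max(YearLevel)
print(as.character(highest_level))   # "Junior"
```

A plain (unordered) factor or a character vector such as Major does not support these comparisons, which is the practical difference between nominal and ordinal data in R.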
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
No = 1:5,
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"Quantitative",
"Quantitative",
"Qualitative",
"Qualitative",
"Qualitative"
),
Subtype = c(
"Discrete",
"Continuous",
"Nominal",
"Ordinal",
"Nominal"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types")
No | Variable | DataType | Subtype |
---|---|---|---|
1 | Number of vehicles passing through the toll road each day | Quantitative | Discrete |
2 | Student height in cm | Quantitative | Continuous |
3 | Employee gender (Male / Female) | Qualitative | Nominal |
4 | Customer satisfaction level: Low, Medium, High | Qualitative | Ordinal |
5 | Respondent’s favorite color: Red, Blue, Green | Qualitative | Nominal |
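The same classification can be approximated from how a variable is stored in R. The sketch below is only a heuristic under stated assumptions: `classify_variable()` is a hypothetical helper written for this illustration, and the sample values are made up. It cannot replace judgment about what a variable actually measures.

```r
# Hypothetical helper: guess the data type from the R storage class
classify_variable <- function(x) {
  if (is.ordered(x)) {
    "Qualitative / Ordinal"
  } else if (is.factor(x) || is.character(x)) {
    "Qualitative / Nominal"
  } else if (is.numeric(x)) {
    # Whole numbers only are treated as discrete, anything else as continuous
    if (all(x == round(x), na.rm = TRUE)) {
      "Quantitative / Discrete"
    } else {
      "Quantitative / Continuous"
    }
  } else {
    "Unknown"
  }
}

# Made-up example values for four of the variables in the exercise
examples <- list(
  vehicles     = c(1520L, 1498L, 1610L),
  height_cm    = c(160.5, 172.3, 168.0),
  gender       = c("Male", "Female", "Female"),
  satisfaction = factor(c("Low", "High", "Medium"),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE)
)

guessed_types <- sapply(examples, classify_variable)
print(guessed_types)
```

Note the limitation: a continuous measurement that happens to be recorded in whole numbers would be guessed as discrete, so the result is a starting point rather than a verdict.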
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
No = 1:4,
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"Internal",
"External",
"External",
"Internal"
),
Structured_Unstructured = c(
"Structured",
"Structured",
"Unstructured",
"Structured"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
caption = "Table of Data Sources",
rownames = FALSE) # hides the index column
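Once the sources are classified, a cross-tabulation gives a quick overview of how the two dimensions combine. This is a small sketch that rebuilds the classification with the column names used in the exercise; the object name `source_counts` is an assumption for illustration.

```r
# Rebuild the classification from the exercise (base R only)
data_sources <- data.frame(
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c("Internal", "External", "External", "Internal"),
  Structured_Unstructured = c("Structured", "Structured",
                              "Unstructured", "Structured"),
  stringsAsFactors = FALSE
)

# Count how many sources fall into each origin/structure combination
source_counts <- table(data_sources$Internal_External,
                       data_sources$Structured_Unstructured)
print(source_counts)
```

In this table both internal sources are structured, while the external sources split evenly between structured and unstructured.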
4 Exercise 4
Dataset Structure: Consider the following transaction table:
Date | Qty | Price | Product | CustomerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Creating the Transactions Table Above in R
1. Create a data frame in R called transactions containing the data above.
2. Identify which variables are numeric and which are categorical.
3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.
4. Compute summary statistics:
   - Total quantity sold for each product
   - Total revenue per product
   - Average price per product
5. Visualize the data:
   - Create a barplot showing total quantity sold per product.
   - Create a pie chart showing the proportion of total revenue per customer tier.
6. Optional Challenge:
   - Find which date had the highest total revenue.
   - Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.
4.1 Create the Data Frame
transactions <- data.frame(
  Date = as.Date(c(
    "2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02", "2025-10-03",
    "2025-10-03", "2025-10-04", "2025-10-04", "2025-10-05", "2025-10-05"
  )),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)
4.2 Identify data types
str(transactions) # look at the structure (numeric vs categorical)
4.3 Add a Total column
transactions$Total <- transactions$Qty * transactions$Price
4.4 View results
print(transactions)
4.5 Summary Statistics
4.6 Total quantity sold per product
total_qty <- aggregate(Qty ~ Product, data = transactions, sum)
4.6.1 Total revenue per product
total_revenue <- aggregate(Total ~ Product, data = transactions, sum)
4.6.2 Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)
4.6.3 Display the summary results
cat("=== Total Quantity per Product ===\n")
print(total_qty)
cat("=== Total Revenue per Product ===\n")
print(total_revenue)
cat("=== Average Price per Product ===\n")
print(avg_price)
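The average price computed with mean() weights every transaction equally, regardless of how many units were sold. As a complementary view, the sketch below computes a quantity-weighted average price (total revenue divided by total quantity). This is an addition for illustration, not something the exercise asks for, and the names `per_product` and `WeightedAvgPrice` are assumptions.

```r
# Rebuild only the columns needed for the per-product summary
transactions <- data.frame(
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

# Sum quantity and revenue per product in a single aggregate() call
per_product <- aggregate(cbind(Qty, Total) ~ Product,
                         data = transactions, FUN = sum)

# Quantity-weighted average price: revenue per unit actually sold
per_product$WeightedAvgPrice <- per_product$Total / per_product$Qty
print(per_product)
# Keyboard: 14 units, revenue 290; Laptop: 6 units, revenue 6000;
# Mouse: 12 units, revenue 420
```

For Mouse, for example, the simple mean of the three prices is about 36.67, while the weighted figure is 420 / 12 = 35, because the largest order was placed at the lowest price.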
4.7 Visualization
4.7.1 (a) Barplot - total quantity sold per product
barplot(
  total_qty$Qty,
  names.arg = total_qty$Product,
  main = "Total Quantity Sold per Product",
  xlab = "Product",
  ylab = "Total Quantity",
  col = c("skyblue", "lightgreen", "orange")
)
4.7.2 (b) Pie chart - proportion of total revenue per Customer Tier
revenue_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum)
pie(
  revenue_tier$Total,
  labels = paste(revenue_tier$CustomerTier, "-", revenue_tier$Total),
  main = "Proportion of Total Revenue per Customer Tier",
  col = c("gold", "lightblue", "tomato")
)
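Because the task asks for proportions, it can help to label the slices with percentages rather than raw totals. The following sketch assumes the same transaction data and uses prop.table() to derive the shares; the names `pct` and `pie_labels` are illustrative.

```r
# Rebuild only the columns needed for revenue per customer tier
transactions <- data.frame(
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

revenue_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum)

# Share of total revenue per tier, in percent with one decimal place
pct <- round(100 * prop.table(revenue_tier$Total), 1)
pie_labels <- paste0(revenue_tier$CustomerTier, " (", pct, "%)")
print(pie_labels)   # "High (74.5%)" "Low (18.9%)" "Medium (6.6%)"

pie(revenue_tier$Total,
    labels = pie_labels,
    main = "Proportion of Total Revenue per Customer Tier",
    col = c("gold", "lightblue", "tomato"))
```

The percentages make it immediately visible that the High tier accounts for roughly three quarters of revenue, which is harder to judge from slice angles alone.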
4.8 Optional Challenge
4.8.1 (a) Date with highest total revenue
date_revenue <- aggregate(Total ~ Date, data = transactions, sum)
max_rev_date <- date_revenue[which.max(date_revenue$Total), ]
cat("Date with the highest total revenue:\n")
print(max_rev_date)
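One caveat with which.max() is that it returns only the first maximum, so a tie between two dates would go unnoticed. A tie-aware variant, sketched here on the same data, filters on equality with the maximum instead; the name `top_dates` is an assumption for illustration.

```r
# Rebuild only the columns needed for revenue per date
transactions <- data.frame(
  Date = as.Date(c(
    "2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02", "2025-10-03",
    "2025-10-03", "2025-10-04", "2025-10-04", "2025-10-05", "2025-10-05"
  )),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
)
transactions$Total <- transactions$Qty * transactions$Price

date_revenue <- aggregate(Total ~ Date, data = transactions, sum)

# Keep every date whose revenue equals the maximum (handles ties)
top_dates <- date_revenue[date_revenue$Total == max(date_revenue$Total), ]
print(top_dates)   # 2025-10-03 with a total of 2200
```

With this dataset there is no tie, so the result is the same single row that which.max() selects.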
4.8.2 (b) Stacked bar chart: quantity sold per product by customer tier
qty_stack <- aggregate(Qty ~ Product + CustomerTier, data = transactions, sum)
qty_matrix <- xtabs(Qty ~ CustomerTier + Product, data = qty_stack)
barplot(
  qty_matrix,
  beside = FALSE,
  main = "Quantity Sold per Product by Customer Tier",
  xlab = "Product",
  ylab = "Quantity Sold",
  col = c("lightblue", "gold", "tomato")
)
legend("topright",
       legend = rownames(qty_matrix),
       fill = c("lightblue", "gold", "tomato"),
       title = "Customer Tier")
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
1. Open RStudio or the R console.
2. Create a vector for each column in your data frame:
   - Date: 30 dates (can be sequential or random within a month/year)
   - Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
   - Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
   - Nominal: categorical values with no order (e.g., color, gender, city)
   - Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
3. Combine all vectors into a data frame called my_data.
4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.
5. Optional tasks:
   - Summarize each column using summary()
   - Count the frequency of each category for Nominal and Ordinal columns using table()
5.2 Hints
- Use seq.Date() or as.Date() to generate the Date column.
- Use runif() or rnorm() for continuous numeric data.
- Use sample() for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
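Since runif(), rnorm(), and sample() draw random values, every run of the script produces a different data frame. Fixing the random seed first makes the result reproducible. The sketch below illustrates this; the seed value 123 and the object names are arbitrary choices, not part of the original exercise.

```r
# Fix the seed, draw values, then repeat with the same seed
set.seed(123)   # arbitrary seed chosen for illustration
first_draw <- sample(1:50, 5, replace = TRUE)

set.seed(123)
second_draw <- sample(1:50, 5, replace = TRUE)

# Both draws are identical because the generator started from the same state
print(identical(first_draw, second_draw))   # TRUE
```

Calling set.seed() once at the top of the script is usually enough; anyone who reruns the script with the same R version should then obtain the same 30 rows.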
5.3 Create each column
5.4 Date: 30 consecutive dates in October 2025
Date <- seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30)
5.5 Continuous: e.g. body temperature in °C (decimal values)
Continuous <- round(runif(30, min = 35.5, max = 37.5), 1)
5.6 Discrete: e.g. number of items sold (whole numbers)
Discrete <- sample(1:50, 30, replace = TRUE)
5.7 Nominal: e.g. customer's city of origin (no order)
Nominal <- sample(c("Jakarta", "Bandung", "Surabaya", "Medan", "Bali"), 30, replace = TRUE)
5.8 Ordinal: e.g. satisfaction level (has a defined order)
Ordinal <- factor(
  sample(c("Low", "Medium", "High"), 30, replace = TRUE),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
5.9 Combine all into a data frame
my_data <- data.frame(Date, Continuous, Discrete, Nominal, Ordinal)
5.10 Check the data contents
head(my_data) # display the first 6 rows
View(my_data) # open in RStudio window (optional)
5.11 (Optional) Data Summary
summary(my_data)
5.12 Count category frequencies
cat("=== Nominal Frequency (City) ===\n")
print(table(my_data$Nominal))
cat("=== Ordinal Frequency (Level of Satisfaction) ===\n")
print(table(my_data$Ordinal))
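Beyond eyeballing head() and summary(), the requirements of the exercise can be checked programmatically. The sketch below rebuilds my_data in one step and tests each requirement; the seed value and the `checks` vector are assumptions added for illustration.

```r
set.seed(42)   # assumed seed so the example is reproducible

my_data <- data.frame(
  Date = seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30),
  Continuous = round(runif(30, min = 35.5, max = 37.5), 1),
  Discrete = sample(1:50, 30, replace = TRUE),
  Nominal = sample(c("Jakarta", "Bandung", "Surabaya", "Medan", "Bali"),
                   30, replace = TRUE),
  Ordinal = factor(sample(c("Low", "Medium", "High"), 30, replace = TRUE),
                   levels = c("Low", "Medium", "High"), ordered = TRUE),
  stringsAsFactors = FALSE
)

# One named logical per requirement of the exercise
checks <- c(
  has_30_rows        = nrow(my_data) == 30,
  date_is_Date       = inherits(my_data$Date, "Date"),
  discrete_is_whole  = all(my_data$Discrete == round(my_data$Discrete)),
  nominal_unordered  = !is.ordered(my_data$Nominal),
  ordinal_is_ordered = is.ordered(my_data$Ordinal)
)
print(checks)
```

If any entry prints FALSE, the corresponding column was not created as the instructions require, for example an ordinal column built without ordered = TRUE.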