Data Exploration
Exercises ~ Week 2
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted (discrete)
- Ordinal: YearLevel (Freshman → Senior)
StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
---|---|---|---|---|---|
S001 | Alice | 20 | 45 | Data Sains | Sophomore |
S002 | Budi | 21 | 60 | Mathematics | Junior |
S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                    ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Sains Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
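Because YearLevel is stored as an ordered factor, R can compare and sort its values by level rather than alphabetically. The following is a minimal sketch that rebuilds the same data frame and uses that ordering; the names `upper_years` and `students_sorted` are illustrative and not part of the original exercise.

```r
# Rebuild the students data frame from Exercise 1
students <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Name = c("Alice", "Budi", "Citra"),
  Age = c(20, 21, 19),
  CreditsCompleted = c(45, 60, 30),
  Major = c("Data Sains", "Mathematics", "Statistics"),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# Inspect how R stores each column
str(students)

# Ordered factors support comparisons: who is at least a Sophomore?
upper_years <- students$Name[students$YearLevel >= "Sophomore"]
print(upper_years)

# Sorting follows the level order (Freshman first), not the alphabet
students_sorted <- students[order(students$YearLevel), ]
print(students_sorted$Name)
```

A plain character vector would not allow the `>=` comparison, which is one practical reason to declare ordinal variables with `ordered = TRUE`.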
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical",
    "Categorical"
  ),
  Subtype = c(
    "Discrete",
    "Continuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
      caption = "Table of Variables and Data Types")
No | Variable | DataType | Subtype |
---|---|---|---|
1 | Number of vehicles passing through the toll road each day | Numeric | Discrete |
2 | Student height in cm | Numeric | Continuous |
3 | Employee gender (Male / Female) | Categorical | Nominal |
4 | Customer satisfaction level: Low, Medium, High | Categorical | Ordinal |
5 | Respondent’s favorite color: Red, Blue, Green | Categorical | Nominal |
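Each subtype in the table has a natural storage class in R. The sketch below shows one possible mapping; the sample values and vector names are hypothetical and only illustrate the classification.

```r
# Hypothetical sample values, one vector per variable in the table
vehicles_per_day <- c(1520L, 1498L, 1603L)           # Numeric, discrete -> integer
height_cm <- c(158.5, 171.2, 165.0)                   # Numeric, continuous -> double
gender <- factor(c("Male", "Female", "Female"))       # Categorical, nominal -> factor
satisfaction <- factor(c("Low", "High", "Medium"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)                # Categorical, ordinal -> ordered factor
favorite_color <- factor(c("Red", "Blue", "Green"))   # Categorical, nominal -> factor

# Check how each vector is stored
print(class(vehicles_per_day))
print(class(height_cm))
print(is.ordered(gender))
print(is.ordered(satisfaction))

# Only the ordinal variable has a meaningful maximum
highest_satisfaction <- max(satisfaction)
print(highest_satisfaction)
```

Calling `max()` on a nominal factor such as `gender` raises an error, which reflects the fact that nominal categories have no order.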
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column
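The two classifications can also be summarized together. The sketch below rebuilds the same classification in base R and cross-tabulates origin against structure with `table()`; the name `source_counts` is illustrative.

```r
# Rebuild the classification from Exercise 3 (base R only, no DT needed)
data_sources <- data.frame(
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c("Internal", "External", "External", "Internal"),
  Structured_Unstructured = c("Structured", "Structured",
                              "Unstructured", "Structured"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are the origin, columns are the structure
source_counts <- table(data_sources$Internal_External,
                       data_sources$Structured_Unstructured)
print(source_counts)
```

In this small sample, both internal sources are structured, and the only unstructured source is external.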
4 Exercise 4
Dataset Structure: Consider the following transaction table:
Date | Qty | Price | Product | CustomerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Creating the Transactions Table above in R
Create a data frame in R called transactions containing the data above.
Identify which variables are numeric and which are categorical.
Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.
Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.
library(DT)
# Exercise 4: Create the transactions data frame
Date <- c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02", "2025-10-03",
          "2025-10-03", "2025-10-04", "2025-10-04", "2025-10-05", "2025-10-05")
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
Product <- c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
             "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard")
CustomerTier <- c("High", "Medium", "Low", "Medium", "Medium",
                  "High", "Low", "High", "Low", "Medium")
# Combine all columns into a data frame
transactions <- data.frame(Date, Qty, Price, Product, CustomerTier)
# Add a new column for the total revenue of each transaction
transactions <- transform(transactions, Total = Qty * Price)
# Open the data viewer only in an interactive session, then inspect the structure
if (interactive()) View(transactions)
str(transactions)
## 'data.frame': 10 obs. of 6 variables:
## $ Date : chr "2025-10-01" "2025-10-01" "2025-10-02" "2025-10-02" ...
## $ Qty : num 2 5 1 3 4 2 6 1 3 5
## $ Price : num 1000 20 1000 30 50 1000 25 1000 40 10
## $ Product : chr "Laptop" "Mouse" "Laptop" "Keyboard" ...
## $ CustomerTier: chr "High" "Medium" "Low" "Medium" ...
## $ Total : num 2000 100 1000 90 200 2000 150 1000 120 50
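The `str()` output above already hints at which variables are numeric. As a sketch of the "identify numeric and categorical variables" step, the columns can be split programmatically with `sapply()` and `is.numeric()`. The names `numeric_vars` and `categorical_vars` are illustrative, and treating Date as non-numeric here is a simplification, since it is really a date stored as text.

```r
# Rebuild the transactions data frame from Exercise 4
transactions <- data.frame(
  Date = c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02", "2025-10-03",
           "2025-10-03", "2025-10-04", "2025-10-04", "2025-10-05", "2025-10-05"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

# Split the column names by storage type
is_num <- sapply(transactions, is.numeric)
numeric_vars <- names(transactions)[is_num]
categorical_vars <- names(transactions)[!is_num]
print(numeric_vars)
print(categorical_vars)

# Optional refinements: CustomerTier has a natural order, and Date can be
# stored as a proper Date object instead of text
transactions$CustomerTier <- factor(transactions$CustomerTier,
                                    levels = c("Low", "Medium", "High"),
                                    ordered = TRUE)
transactions$Date <- as.Date(transactions$Date)
str(transactions)
```

Qty is discrete, Price and Total are numeric amounts, Product is nominal, and CustomerTier is ordinal.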
# Total quantity per product
qty_per_product <- aggregate(Qty ~ Product, transactions, sum)
# Bar chart of quantity sold
barplot(qty_per_product$Qty,
        names.arg = qty_per_product$Product,
        main = "Quantity Sold by Product",
        xlab = "Product",
        ylab = "Quantity",
        col = "red")
# Revenue per customer tier
revenue_per_tier <- aggregate(Total ~ CustomerTier, transactions, sum)
# Pie chart of revenue share
pie(revenue_per_tier$Total,
    labels = revenue_per_tier$CustomerTier,
    main = "Revenue Share by Customer Tier",
    col = rainbow(3))
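The assignment also asks for total revenue per product, average price per product, and the two optional challenges. The following sketch shows one possible way to cover them with `aggregate()`, `which.max()`, and `xtabs()`; the object names are illustrative, and the data frame is rebuilt so the block runs on its own.

```r
# Rebuild the transactions data frame, including the Total column
transactions <- data.frame(
  Date = c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02", "2025-10-03",
           "2025-10-03", "2025-10-04", "2025-10-04", "2025-10-05", "2025-10-05"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

# Total revenue per product
revenue_per_product <- aggregate(Total ~ Product, data = transactions, FUN = sum)
print(revenue_per_product)

# Average price per product
avg_price_per_product <- aggregate(Price ~ Product, data = transactions, FUN = mean)
print(avg_price_per_product)

# Optional: date with the highest total revenue
revenue_per_date <- aggregate(Total ~ Date, data = transactions, FUN = sum)
best_date <- revenue_per_date$Date[which.max(revenue_per_date$Total)]
print(best_date)

# Optional: stacked bar chart of quantity per product, split by customer tier
# (rows of the matrix become the stacked segments, columns become the bars)
qty_matrix <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
barplot(qty_matrix,
        main = "Quantity Sold per Product by Customer Tier",
        xlab = "Product",
        ylab = "Quantity",
        col = rainbow(nrow(qty_matrix)),
        legend.text = TRUE)
```

Laptops dominate revenue because of their price, even though keyboards and mice sell in larger quantities.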
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within a month/year)
- Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
Combine all vectors into a data frame called my_data.
Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.
Optional tasks:
- Summarize each column using summary()
- Count the frequency of each category for Nominal and Ordinal columns using table()
5.2 Hints
- Use seq.Date() or as.Date() to generate the Date column.
- Use runif() or rnorm() for continuous numeric data.
- Use sample() for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
# Create a data frame about drink purchases at cafe statis
# Dates from September 1 to September 30
date <- seq.Date(from = as.Date("2030-09-01"),
                 to = as.Date("2030-09-30"), by = "day")
# Number of drinks ordered each day
set.seed(123)
Number_of_Drinks <- sample(1:15, 30, replace = TRUE)
# Total purchase amount
Total_Purchase <- round(Number_of_Drinks * runif(30, min = 15000, max = 25000), 0)
# Type of drink
Drink_Type <- sample(c("Matcha", "Coffe", "Chocolate",
                       "Tea", "Milkshake", "Latte"),
                     30, replace = TRUE)
# Customer satisfaction level: split Total_Purchase into three equal-sized
# groups at its 1/3 and 2/3 quantiles and store the result as an ordered factor
q <- quantile(Total_Purchase, probs = c(0, 1/3, 2/3, 1))
satisfaction <- cut(Total_Purchase,
                    breaks = q,
                    labels = c("Not satisfied", "Satisfied", "Very satisfied"),
                    include.lowest = TRUE,
                    ordered_result = TRUE)
# Combine all columns into one data frame named my_data
my_data <- data.frame(date,
                      Total_Purchase, Number_of_Drinks,
                      Drink_Type, satisfaction)
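# Show the rows (all 30 are listed in the output below), check the row count,
# and summarize each column
head(my_data, 30)
nrow(my_data)
summary(my_data)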
## date Total_Purchase Number_of_Drinks Drink_Type satisfaction
## 1 2030-09-01 360345 15 Tea Very satisfied
## 2 2030-09-02 328606 15 Milkshake Very satisfied
## 3 2030-09-03 68864 3 Coffe Not satisfied
## 4 2030-09-04 213446 14 Matcha Very satisfied
## 5 2030-09-05 59334 3 Matcha Not satisfied
## 6 2030-09-06 225846 10 Chocolate Very satisfied
## 7 2030-09-07 34328 2 Matcha Not satisfied
## 8 2030-09-08 109091 6 Latte Not satisfied
## 9 2030-09-09 190479 11 Milkshake Satisfied
## 10 2030-09-10 82140 5 Matcha Not satisfied
## 11 2030-09-11 76582 4 Coffe Not satisfied
## 12 2030-09-12 267921 14 Tea Very satisfied
## 13 2030-09-13 112131 6 Tea Not satisfied
## 14 2030-09-14 148720 9 Latte Satisfied
## 15 2030-09-15 163881 10 Latte Satisfied
## 16 2030-09-16 190634 11 Chocolate Satisfied
## 17 2030-09-17 98298 5 Latte Not satisfied
## 18 2030-09-18 52979 3 Latte Not satisfied
## 19 2030-09-19 259361 11 Matcha Very satisfied
## 20 2030-09-20 139125 9 Latte Satisfied
## 21 2030-09-21 233064 12 Coffe Very satisfied
## 22 2030-09-22 206903 9 Matcha Very satisfied
## 23 2030-09-23 145971 9 Coffe Satisfied
## 24 2030-09-24 267923 13 Tea Very satisfied
## 25 2030-09-25 51196 3 Milkshake Not satisfied
## 26 2030-09-26 130203 8 Milkshake Satisfied
## 27 2030-09-27 225331 10 Latte Very satisfied
## 28 2030-09-28 167653 7 Chocolate Satisfied
## 29 2030-09-29 187446 10 Matcha Satisfied
## 30 2030-09-30 194860 9 Tea Satisfied
## [1] 30
## date Total_Purchase Number_of_Drinks Drink_Type
## Min. :2030-09-01 Min. : 34328 Min. : 2.000 Length:30
## 1st Qu.:2030-09-08 1st Qu.:100996 1st Qu.: 5.250 Class :character
## Median :2030-09-15 Median :165767 Median : 9.000 Mode :character
## Mean :2030-09-15 Mean :166422 Mean : 8.533
## 3rd Qu.:2030-09-22 3rd Qu.:222360 3rd Qu.:11.000
## Max. :2030-09-30 Max. :360345 Max. :15.000
## satisfaction
## Not satisfied :10
## Satisfied :10
## Very satisfied:10
##
##
##
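The equal counts of 10 per satisfaction level in the summary above are a consequence of cutting Total_Purchase at its 1/3 and 2/3 quantiles: the labels describe a purchase relative to the other 29 days, not against a fixed amount. The sketch below reproduces that effect on hypothetical amounts (`amounts`, `tercile_breaks`, and `level_counts` are illustrative names, not part of the original solution).

```r
# Hypothetical purchase amounts: 30 distinct values from 10,000 to 300,000
amounts <- seq(10000, 300000, by = 10000)

# Break points at the minimum, the 1/3 and 2/3 quantiles, and the maximum
tercile_breaks <- quantile(amounts, probs = c(0, 1/3, 2/3, 1))
print(tercile_breaks)

# Assign each amount to one of three ordered levels
level <- cut(amounts,
             breaks = tercile_breaks,
             labels = c("Not satisfied", "Satisfied", "Very satisfied"),
             include.lowest = TRUE,
             ordered_result = TRUE)

# Each level receives one third of the observations
level_counts <- table(level)
print(level_counts)
```

With fixed thresholds instead of quantiles, the group sizes would depend on the data and would generally not be equal.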
# Make a formatted table using knitr
library(knitr)
kable(my_data, caption = "Table: Drink Purchase Data (30 Days)")
date | Total_Purchase | Number_of_Drinks | Drink_Type | satisfaction |
---|---|---|---|---|
2030-09-01 | 360345 | 15 | Tea | Very satisfied |
2030-09-02 | 328606 | 15 | Milkshake | Very satisfied |
2030-09-03 | 68864 | 3 | Coffe | Not satisfied |
2030-09-04 | 213446 | 14 | Matcha | Very satisfied |
2030-09-05 | 59334 | 3 | Matcha | Not satisfied |
2030-09-06 | 225846 | 10 | Chocolate | Very satisfied |
2030-09-07 | 34328 | 2 | Matcha | Not satisfied |
2030-09-08 | 109091 | 6 | Latte | Not satisfied |
2030-09-09 | 190479 | 11 | Milkshake | Satisfied |
2030-09-10 | 82140 | 5 | Matcha | Not satisfied |
2030-09-11 | 76582 | 4 | Coffe | Not satisfied |
2030-09-12 | 267921 | 14 | Tea | Very satisfied |
2030-09-13 | 112131 | 6 | Tea | Not satisfied |
2030-09-14 | 148720 | 9 | Latte | Satisfied |
2030-09-15 | 163881 | 10 | Latte | Satisfied |
2030-09-16 | 190634 | 11 | Chocolate | Satisfied |
2030-09-17 | 98298 | 5 | Latte | Not satisfied |
2030-09-18 | 52979 | 3 | Latte | Not satisfied |
2030-09-19 | 259361 | 11 | Matcha | Very satisfied |
2030-09-20 | 139125 | 9 | Latte | Satisfied |
2030-09-21 | 233064 | 12 | Coffe | Very satisfied |
2030-09-22 | 206903 | 9 | Matcha | Very satisfied |
2030-09-23 | 145971 | 9 | Coffe | Satisfied |
2030-09-24 | 267923 | 13 | Tea | Very satisfied |
2030-09-25 | 51196 | 3 | Milkshake | Not satisfied |
2030-09-26 | 130203 | 8 | Milkshake | Satisfied |
2030-09-27 | 225331 | 10 | Latte | Very satisfied |
2030-09-28 | 167653 | 7 | Chocolate | Satisfied |
2030-09-29 | 187446 | 10 | Matcha | Satisfied |
2030-09-30 | 194860 | 9 | Tea | Satisfied |
library(knitr)
# Count how many times each drink type appears
drink_table <- as.data.frame(table(my_data$Drink_Type))
colnames(drink_table) <- c("Drink Type", "Frequency")
# Count satisfaction levels
satisfaction_table <- as.data.frame(table(my_data$satisfaction))
colnames(satisfaction_table) <- c("Satisfaction Level", "Frequency")
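# Show both frequency tables
knitr::kable(drink_table, caption = "Table: Frequency of Each Drink Type")
knitr::kable(satisfaction_table, caption = "Table: Frequency of Each Satisfaction Level")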
Drink Type | Frequency |
---|---|
Chocolate | 3 |
Coffe | 4 |
Latte | 7 |
Matcha | 7 |
Milkshake | 4 |
Tea | 5 |
Satisfaction Level | Frequency |
---|---|
Not satisfied | 10 |
Satisfied | 10 |
Very satisfied | 10 |