Data Exploration
Exercises ~ Week 3
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted
(discrete)
- Ordinal: YearLevel (Freshman → Senior)
StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
---|---|---|---|---|---|
S001 | Alice | 20 | 45 | Data Sains | Sophomore |
S002 | Budi | 21 | 60 | Mathematics | Junior |
S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
levels = c("Freshman","Sophomore","Junior","Senior"),
ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Sains Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)
# Create a data frame for Data Types
variables_info <- data.frame(
No = 1:5,
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"Numeric",
"Numeric",
"Categorical",
"Categorical",
"Categorical"
),
Subtype = c(
"Continuous",
"Continuous",
"Nominal",
"Ordinal",
"Nominal"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types")
No | Variable | DataType | Subtype |
---|---|---|---|
1 | Number of vehicles passing through the toll road each day | Numeric | Continuous |
2 | Student height in cm | Numeric | Continuous |
3 | Employee gender (Male / Female) | Categorical | Nominal |
4 | Customer satisfaction level: Low, Medium, High | Categorical | Ordinal |
5 | Respondent’s favorite color: Red, Blue, Green | Categorical | Nominal |
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT)
# Create a data frame for data sources
data_sources <- data.frame(
No = 1:4,
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"Internal",
"External",
"External",
"Internal"
),
Structured_Unstructured = c(
"Structured",
"Structured",
"Unstructured",
"Structured"
),
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
datatable(data_sources,
caption = "Table of Data Sources",
rownames = FALSE) # hides the index column
4 Exercise 4
Dataset Structure: Consider the following transaction table:
Date | Qty | Price | Product | CustomerTier |
---|---|---|---|---|
2025-10-01 | 2 | 1000 | Laptop | High |
2025-10-01 | 5 | 20 | Mouse | Medium |
2025-10-02 | 1 | 1000 | Laptop | Low |
2025-10-02 | 3 | 30 | Keyboard | Medium |
2025-10-03 | 4 | 50 | Mouse | Medium |
2025-10-03 | 2 | 1000 | Laptop | High |
2025-10-04 | 6 | 25 | Keyboard | Low |
2025-10-04 | 1 | 1000 | Laptop | High |
2025-10-05 | 3 | 40 | Mouse | Low |
2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Creating a Transactions Table above in R
Create a data frame in R called
transactions
containing the data above.Identify which variables are numeric and which are categorical
Calculate total revenue for each transaction by multiplying
Qty × Price
and add it as a new columnTotal
.Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame()
,
aggregate()
, barplot()
, pie()
,
and basic arithmetic operations in R.
library(DT)
# 4.1 Transaction
transactions <- data.frame(
No = 1 : 10,
Date = c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
"2025-10-03","2025-10-03", "2025-10-04", "2025-10-04",
"2025-10-05", "2025-10-05"),
Qty = c(2,5,1,3,4,2,6,1,3,5),
Price = c(1000,20,1000,30,50,1000,25,1000,40,10),
Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
"Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
CostumerTier = c("High", "Medium", "Low", "Medium", "Medium",
"High", "Low", "High", "Low", "Medium")
)
library(knitr)
# 4.2 Create a data frame for Data Types
variables_info <- data.frame(
No = 1:4,
Variable = c(
"Qty",
"Price",
"Product",
"Customertier"
),
DataType = c(
"Numeric",
"Numeric",
"Categorical",
"Categorical"
),
stringAsFactors = FALSE
)
library(knitr)
# 4.3 Display the data table as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types")
No | Variable | DataType | stringAsFactors |
---|---|---|---|
1 | Qty | Numeric | FALSE |
2 | Price | Numeric | FALSE |
3 | Product | Categorical | FALSE |
4 | Customertier | Categorical | FALSE |
#4.3 Transactions Total
transactions$Total <- transactions$Qty * transactions$Price
datatable(transactions,
caption = "Table of Transaction",
rownames = FALSE)
library(knitr)
# 4.4 Compute Summary Statistics
# Total Qty per Product
total_qty_per_product <- aggregate(Qty ~ Product, data = transactions, FUN = sum)
print("Total Kuantitas Terjual per Produk:")
## [1] "Total Kuantitas Terjual per Produk:"
## Product Qty
## 1 Keyboard 14
## 2 Laptop 6
## 3 Mouse 12
# Total Revenue per Product
total_revenue_per_product <- aggregate(Total ~ Product, data = transactions, FUN = sum)
print("Total Pendapatan per Produk:")
## [1] "Total Pendapatan per Produk:"
## Product Total
## 1 Keyboard 290
## 2 Laptop 6000
## 3 Mouse 420
# Total Revenue per Customer Tier
total_revenue_per_tier <- aggregate(Total ~ CostumerTier, data = transactions, FUN = sum)
print("Total Pendapatan Per Pelanggan:")
## [1] "Total Pendapatan Per Pelanggan:"
## CostumerTier Total
## 1 High 5000
## 2 Low 1270
## 3 Medium 440
# Average Price per Product
average_price_per_product <- aggregate(Price ~ Product, data = transactions, FUN = mean)
print("Harga Rata-rata per Produk")
## [1] "Harga Rata-rata per Produk"
## Product Price
## 1 Keyboard 21.66667
## 2 Laptop 1000.00000
## 3 Mouse 36.66667
library(knitr)
# 4.5 Visualize Data
barplot(
height = total_qty_per_product$Qty,
names.arg = total_qty_per_product$Product,
col = c("pink", "blue", "green"),
main = "Total Transactions",
xlab = "Product",
ylab = "Quantity",
ylim = c(0, max(total_qty_per_product$Qty) + 2))
#Pie chart
revenue_per_tier <- aggregate(Total ~ CostumerTier, data = transactions, FUN = sum)
total_revenue <- sum(revenue_per_tier$Total)
percentages <- round(revenue_per_tier$Total / total_revenue * 100, 1)
pie_labels <- paste(revenue_per_tier$CostumerTier, "(", percentages, "%)", sep = "")
pie(
x = revenue_per_tier$Total,
labels = pie_labels,
main = "Proportion of Total Revenue per CustomerTier",
col = c("purple", "yellow", "red")
)
library(knitr)
# 4.6 Find the Highest date of total revenue
transactions$Revenue <- transactions$Price * transactions$Qty
total_per_date <- aggregate(Revenue ~ Date, data = transactions, FUN = sum)
highest <- total_per_date[which.max(total_per_date$Revenue), ]
print("The highest data of total Revenue")
## [1] "The highest data of total Revenue"
## Date Revenue
## 1 2025-10-01 2100
## 2 2025-10-02 1090
## 3 2025-10-03 2200
## 4 2025-10-04 1150
## 5 2025-10-05 170
## Date Revenue
## 3 2025-10-03 2200
# Stacked Bar Chart
table_transactions <- with(transactions, tapply(Qty, list(Product, CostumerTier), sum, na.rm = TRUE))
table_transactions[is.na(table_transactions)] <- 0
barplot(
t(table_transactions),
beside = FALSE,
col = c("brown", "gold", "orange"),
main = "Quantity Sold per Product by CustomerTier",
xlab = "Product",
ylab = "Quantity Sold",
legend.text = TRUE)
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within
a month/year)
- Continuous: numeric values that can take decimal
values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole
numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no
order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
- Date: 30 dates (can be sequential or random within
a month/year)
Combine all vectors into a data frame called
my_data
.Check your data frame using
head()
orView()
to ensure it has 30 rows and the columns are correct.Optional tasks:
- Summarize each column using
summary()
- Count the frequency of each category for Nominal
and Ordinal columns using
table()
- Summarize each column using
library(DT)
library(knitr)
# 5.1 Create a vector for each column
set.seed(50)
tanggal <- as.Date("2024-01-01") + sample(0:365, 30, replace = TRUE)
hewan <- c("Kucing", "Anjing", "Kelinci", "Burung",
"Kuda", "Gajah", "Singa", "Harimau", "Zebra", "Panda", "Rusa",
"Unta", "Kudanil", "Badak", "Beruang", "Monyet", "Gorila", "Kanguru",
"Koala", "Serigala", "Kucing", "Anjing", "Burung", "Gajah", "Singa",
"Harimau", "Zebra", "Panda", "Rusa", "Unta")
hewan <- factor(hewan)
jumlah_porsi_makanan <- c(5, 5, 3, 4, 3,
3, 5, 5, 5, 6,
4, 4, 6, 6, 5,
5, 6, 4, 4, 5,
5, 4, 5, 6, 7,
6, 5, 6, 4, 4)
berat <- c(24.56, 48.76, 15.48, 9.64, 289.67, 785.54, 293.27, 257.38, 158.35,
93.32, 82.67, 632.75, 429.71, 836.54, 274.68, 18.56, 467.82, 152.74,
52.85, 172.92, 38.56, 95.27, 6.32, 364.78, 284.92, 256.85, 193.65,
123.45, 110.43, 576.98)
kelincahan <- c("Tinggi", "Tinggi", "Sedang", "Tinggi", "Sedang", "Rendah", "Tinggi",
"Tinggi", "Sedang", "Sedang", "Sedang", "Rendah", "Rendah", "Rendah",
"Sedang", "Tinggi", "Rendah", "Tinggi", "Sedang", "Tinggi", "Sedang",
"Tinggi", "Tinggi", "Sedang", "Tinggi", "Tinggi", "Sedang", "Sedang",
"Sedang", "Rendah")
kelincahan <- factor(kelincahan,
levels = c("Rendah", "Sedang", "Tinggi"),
ordered = TRUE)
# 5.2 Combine all vectors into a data frame
Data_Hewan <- data.frame(tanggal, hewan, jumlah_porsi_makanan, berat, kelincahan)
# 5.3 Create a data frame as a neat table
datatable(Data_Hewan, caption = "Rekapitulasi Data Hewan: Jenis, Berat dan Kelincahan", options = list(pageLength = 10))
# 5.3 Create a table of variable types
variables_info2 <- data.frame(
No = 1:4,
Variable = c(
"Hewan",
"JumlahPorsiMakanan",
"Berat",
"Kelincahan"),
DataType = c(
"Nominal",
"Diskrit",
"Continuous",
"Ordinal"),
stringsAsFactors = FALSE
)
knitr::kable(variables_info2, caption = "Table of Variable and Data Type")
No | Variable | DataType |
---|---|---|
1 | Hewan | Nominal |
2 | JumlahPorsiMakanan | Diskrit |
3 | Berat | Continuous |
4 | Kelincahan | Ordinal |
## tanggal hewan jumlah_porsi_makanan berat
## Min. :2024-01-07 Anjing : 2 Min. :3.000 Min. : 6.32
## 1st Qu.:2024-04-04 Burung : 2 1st Qu.:4.000 1st Qu.: 60.30
## Median :2024-09-06 Gajah : 2 Median :5.000 Median :165.63
## Mean :2024-07-21 Harimau: 2 Mean :4.833 Mean :238.28
## 3rd Qu.:2024-10-28 Kucing : 2 3rd Qu.:5.750 3rd Qu.:292.37
## Max. :2024-12-28 Panda : 2 Max. :7.000 Max. :836.54
## (Other):18
## kelincahan
## Rendah: 6
## Sedang:12
## Tinggi:12
##
##
##
##
## No Date Qty Price
## Min. : 1.00 Length:10 Min. :1.00 Min. : 10.00
## 1st Qu.: 3.25 Class :character 1st Qu.:2.00 1st Qu.: 26.25
## Median : 5.50 Mode :character Median :3.00 Median : 45.00
## Mean : 5.50 Mean :3.20 Mean : 417.50
## 3rd Qu.: 7.75 3rd Qu.:4.75 3rd Qu.:1000.00
## Max. :10.00 Max. :6.00 Max. :1000.00
## Product CostumerTier Total Revenue
## Length:10 Length:10 Min. : 50 Min. : 50
## Class :character Class :character 1st Qu.: 105 1st Qu.: 105
## Mode :character Mode :character Median : 175 Median : 175
## Mean : 671 Mean : 671
## 3rd Qu.:1000 3rd Qu.:1000
## Max. :2000 Max. :2000
summary_table <- summary(Data_Hewan)
knitr::kable(as.data.frame(summary_table), caption = "Summary of Data Hewan")
Var1 | Var2 | Freq |
---|---|---|
tanggal | Min. :2024-01-07 | |
tanggal | 1st Qu.:2024-04-04 | |
tanggal | Median :2024-09-06 | |
tanggal | Mean :2024-07-21 | |
tanggal | 3rd Qu.:2024-10-28 | |
tanggal | Max. :2024-12-28 | |
tanggal | NA | |
hewan | Anjing : 2 | |
hewan | Burung : 2 | |
hewan | Gajah : 2 | |
hewan | Harimau: 2 | |
hewan | Kucing : 2 | |
hewan | Panda : 2 | |
hewan | (Other):18 | |
jumlah_porsi_makanan | Min. :3.000 | |
jumlah_porsi_makanan | 1st Qu.:4.000 | |
jumlah_porsi_makanan | Median :5.000 | |
jumlah_porsi_makanan | Mean :4.833 | |
jumlah_porsi_makanan | 3rd Qu.:5.750 | |
jumlah_porsi_makanan | Max. :7.000 | |
jumlah_porsi_makanan | NA | |
berat | Min. : 6.32 | |
berat | 1st Qu.: 60.30 | |
berat | Median :165.63 | |
berat | Mean :238.28 | |
berat | 3rd Qu.:292.37 | |
berat | Max. :836.54 | |
berat | NA | |
kelincahan | Rendah: 6 | |
kelincahan | Sedang:12 | |
kelincahan | Tinggi:12 | |
kelincahan | NA | |
kelincahan | NA | |
kelincahan | NA | |
kelincahan | NA |
# 5.4 Count the frequency of each category for **Nominal** and **Ordinal** columns
list(
Nominal = table(Data_Hewan$Hewan),
Ordinal = table(Data_Hewan$kelincahan)
)
## $Nominal
## < table of extent 0 >
##
## $Ordinal
##
## Rendah Sedang Tinggi
## 6 12 12