Data Exploration
Exercises ~ Week 3
1 Exercise 1
The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.
This dataset demonstrates a mixture of variable types:
- Nominal: StudentID, Name, Major
- Numeric: Age (continuous), CreditsCompleted (discrete)
- Ordinal: YearLevel (Freshman → Senior)
| StudentID | Name | Age | CreditsCompleted | Major | YearLevel |
|---|---|---|---|---|---|
| S001 | Alice | 20 | 45 | Data Science | Sophomore |
| S002 | Budi | 21 | 60 | Mathematics | Junior |
| S003 | Citra | 19 | 30 | Statistics | Freshman |
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003") # Nominal / ID
Name <- c("Alice", "Budi", "Citra") # Nominal / Name
Age <- c(20, 21, 19) # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30) # Numeric / Discrete
# Nominal
Major <- c("Data Science", "Mathematics", "Statistics")
# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
levels = c("Freshman","Sophomore","Junior","Senior"),
ordered = TRUE)
# 2. Combine all vectors into a data frame
students <- data.frame(
StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
stringsAsFactors = FALSE
)
# 3. Display the data frame
print(students)
## StudentID Name Age CreditsCompleted Major YearLevel
## 1 S001 Alice 20 45 Data Science Sophomore
## 2 S002 Budi 21 60 Mathematics Junior
## 3 S003 Citra 19 30 Statistics Freshman
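To confirm that R stores each column with the intended type, a quick check along the following lines may help. This is a minimal sketch that rebuilds a trimmed copy of the three-row data frame (the Major column is left out) so it can run on its own; the names `col_classes` and `is_upper_year` are illustrative and not part of the exercise.

```r
# Trimmed copy of the students data frame from Exercise 1
students <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Name = c("Alice", "Budi", "Citra"),
  Age = c(20, 21, 19),
  CreditsCompleted = c(45, 60, 30),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# First class of each column: character, numeric, or ordered (factor)
col_classes <- sapply(students, function(x) class(x)[1])
print(col_classes)

# An ordered factor supports comparisons that follow the level order
is_upper_year <- students$YearLevel >= "Junior"
print(students$Name[is_upper_year])
```

Because YearLevel is an ordered factor, the comparison `>= "Junior"` follows the declared level order rather than alphabetical order.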
2 Exercise 2
Identify Data Types: Determine the type of data for each of the following variables:
# Install knitr package if not already installed
# install.packages("knitr")
library(knitr) # for neatly formatted tables
variables_info <- data.frame(
No = 1:5, # row numbers 1 through 5
# c() combines values into a vector
Variable = c(
"Number of vehicles passing through the toll road each day",
"Student height in cm",
"Employee gender (Male / Female)",
"Customer satisfaction level: Low, Medium, High",
"Respondent's favorite color: Red, Blue, Green"
),
DataType = c(
"Numeric",
"Numeric",
"Categorical",
"Categorical",
"Categorical"
),
Subtype = c(
"Discrete",
"Continuous",
"Nominal",
"Ordinal",
"Nominal"
),
# Keep text (string) columns as character instead of converting them to factors
stringsAsFactors = FALSE
)
# Display the data frame as a neat table
kable(variables_info,
caption = "Table of Variables and Data Types") # caption shown above the table
| No | Variable | DataType | Subtype |
|---|---|---|---|
| 1 | Number of vehicles passing through the toll road each day | Numeric | Discrete |
| 2 | Student height in cm | Numeric | Continuous |
| 3 | Employee gender (Male / Female) | Categorical | Nominal |
| 4 | Customer satisfaction level: Low, Medium, High | Categorical | Ordinal |
| 5 | Respondent’s favorite color: Red, Blue, Green | Categorical | Nominal |
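As a rough illustration of how these variables could be represented as R objects, the sketch below builds one small vector per variable. All values are made up for the example and are not part of the exercise.

```r
# Hypothetical sample values, one vector per variable in the table above
vehicles_per_day <- c(1520L, 1633L, 1498L)            # counts: whole numbers only
height_cm <- c(158.5, 172.3, 165.0)                   # measurements: decimals allowed
gender <- factor(c("Male", "Female", "Female"))       # nominal: no order
satisfaction <- factor(c("Low", "High", "Medium"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)                # ordinal: Low < Medium < High
favorite_color <- factor(c("Red", "Blue", "Green"))   # nominal: no order

is.integer(vehicles_per_day)   # checks for integer storage
is.double(height_cm)           # checks for double (decimal) storage
is.ordered(gender)             # a plain factor has no order
is.ordered(satisfaction)       # an ordered factor does
max(satisfaction)              # highest level present in the data
```

Functions such as `max()` work on the ordered factor but would raise an error on the unordered ones, which is one practical difference between ordinal and nominal data in R.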
3 Exercise 3
Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:
# Install DT package if not already installed
# install.packages("DT")
library(DT) # DT (DataTables) package for neat, interactive tables
data_sources <- data.frame(
No = 1:4, # row numbers 1 through 4
DataSource = c(
"Daily sales transaction data of the company",
"Weather reports from BMKG",
"Product reviews on social media",
"Warehouse inventory reports"
),
Internal_External = c(
"Internal",
"External",
"External",
"Internal"
),
Structured_Unstructured = c(
"Structured",
"Structured",
"Unstructured",
"Structured"
),
# Keep text (string) columns as character instead of converting them to factors
stringsAsFactors = FALSE
)
datatable(data_sources,
caption = "Table of Data Sources", # caption shown above the table
rownames = FALSE) # hides the index column
4 Exercise 4
Dataset Structure: Consider the following transaction table:
| Date | Qty | Price | Product | CustomerTier |
|---|---|---|---|---|
| 2025-10-01 | 2 | 1000 | Laptop | High |
| 2025-10-01 | 5 | 20 | Mouse | Medium |
| 2025-10-02 | 1 | 1000 | Laptop | Low |
| 2025-10-02 | 3 | 30 | Keyboard | Medium |
| 2025-10-03 | 4 | 50 | Mouse | Medium |
| 2025-10-03 | 2 | 1000 | Laptop | High |
| 2025-10-04 | 6 | 25 | Keyboard | Low |
| 2025-10-04 | 1 | 1000 | Laptop | High |
| 2025-10-05 | 3 | 40 | Mouse | Low |
| 2025-10-05 | 5 | 10 | Keyboard | Medium |
Your Assignment Instructions: Creating the Transactions Table Above in R
Create a data frame in R called transactions containing the data above.
Identify which variables are numeric and which are categorical.
Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.
Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.
Hints: Use data.frame(),
aggregate(), barplot(), pie(),
and basic arithmetic operations in R.
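For the optional challenge, one possible approach is sketched below. It rebuilds the table from scratch so it can run on its own, and the names `revenue_per_date`, `best_date`, and `qty_matrix` are illustrative choices, not names required by the assignment. It uses `xtabs()` for the cross-tabulation, which goes slightly beyond the functions listed in the hints.

```r
# Self-contained copy of the transaction table shown above
transactions <- data.frame(
  Date = rep(c("2025-10-01", "2025-10-02", "2025-10-03",
               "2025-10-04", "2025-10-05"), each = 2),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = factor(c("High", "Medium", "Low", "Medium", "Medium",
                          "High", "Low", "High", "Low", "Medium"),
                        levels = c("Low", "Medium", "High"), ordered = TRUE),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

# Date with the highest total revenue
revenue_per_date <- aggregate(Total ~ Date, data = transactions, FUN = sum)
best_date <- revenue_per_date$Date[which.max(revenue_per_date$Total)]
print(best_date)

# Stacked bar chart: rows = customer tier, columns = product
qty_matrix <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
barplot(qty_matrix, col = c("red", "yellow", "green"),
        legend.text = rownames(qty_matrix),
        main = "Quantity Sold per Product by Customer Tier",
        xlab = "Product", ylab = "Quantity Sold")
```

Passing a two-way table to `barplot()` stacks the rows within each column by default, so each bar is one product and each colored segment is one customer tier.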
library(DT)
NO = 1:10
transactions <- data.frame(
Date = c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
"2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
"2025-10-05", "2025-10-05"),
Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
"Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
# The columns above are plain vectors (Product is text with no inherent order)
# CustomerTier below is ordinal (Low < Medium < High); it is entered as text here and converted to an ordered factor afterwards
CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
"High", "Low", "High", "Low", "Medium"),
stringsAsFactors = FALSE # keep text (string) columns as character instead of converting them to factors
)
# Set the order of CustomerTier (Low → Medium → High)
transactions$CustomerTier <- factor( # $ selects one column of the transactions data frame
transactions$CustomerTier,
levels = c("Low", "Medium", "High"), # defines the order of the categories
ordered = TRUE # tells R that the levels have a meaningful order,
# so R knows that High ranks above Medium, and so on
)
# Add the column Total = Qty × Price
transactions$Total <- transactions$Qty * transactions$Price
# Display an interactive table with DT
datatable(
transactions,
caption = "Table: Transaction Data with Total Revenue",
rownames = FALSE
)
transactions_summary <- aggregate(
cbind(Qty, Total, Price) ~ Product, # cbind() groups Qty, Total, and Price so all three are aggregated per Product
data = transactions,
# FUN is applied to each column within each product group:
# sum gives the total, mean gives the average
FUN = function(x) c(sum = sum(x), avg = mean(x))
)
# Think of transactions_summary as a cabinet with 4 drawers:
# 1. Product, 2. Qty, 3. Total, 4. Price.
# Each numeric drawer holds two folders (matrix columns): "sum" and "avg".
transactions_summary <- data.frame(
Product = transactions_summary$Product,
# Take the "sum" folder from the Qty drawer
Total_Qty = transactions_summary$Qty[, "sum"],
# Take the "sum" folder from the Total drawer
Total_Revenue = transactions_summary$Total[, "sum"],
# Take the "avg" folder from the Price drawer, rounded to 2 decimal places
Avg_Price = round(transactions_summary$Price[, "avg"], 2)
)
# Display the numeric table
numeric_table <- transactions[, c("Date", "Qty", "Price", "Total")] # keep only Date, Qty, Price, Total
datatable(
numeric_table,
caption = "Table 2: Numeric Table",
rownames = FALSE,
options = list(pageLength = 5) # show 5 rows per page
)
# Display the categorical table
categorical_table <- transactions[, c("Date", "Product", "CustomerTier")] # keep only Date, Product, CustomerTier
datatable(
categorical_table,
caption = "Table 3: Categorical Table",
rownames = FALSE,
options = list(pageLength = 5) # show 5 rows per page
)
# Display the summary table
datatable(
transactions_summary,
caption = "Table 4: Summary Statistics per Product",
rownames = FALSE,
options = list(pageLength = 3) # show 3 rows per page
)
# BAR PLOT
# aggregate() groups the data: here transactions is grouped by the Product column
qty_per_product <- aggregate(Qty ~ Product, data = transactions, sum) # Qty ~ Product: sum Qty within each Product
# Draw the bar plot
barplot(
qty_per_product$Qty, # bar height = total quantity sold
names.arg = qty_per_product$Product, # bar label = product name
col = "skyblue", # bar color
main = "Total Quantity Sold per Product", # plot title
xlab = "Product", # x-axis (horizontal) label
ylab = "Total Quantity Sold" # y-axis (vertical) label
)
# Group transactions by CustomerTier and sum Total for each tier
revenue_per_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum) # Total ~ CustomerTier: sum Total within each CustomerTier
# Draw the pie chart
pie(
revenue_per_tier$Total, # total revenue per tier
labels = revenue_per_tier$CustomerTier, # label for each slice
col = c("red", "yellow", "green"), # slice colors: red, yellow, green
main = "Proportion of Total Revenue per Customer Tier" # chart title
)
5 Exercise 5
Create Your Own Data Frame:
Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.
5.1 Instructions
Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within a month/year)
- Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
Combine all vectors into a data frame called my_data.
Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.
Optional tasks:
- Summarize each column using summary()
- Count the frequency of each category for Nominal and Ordinal columns using table()
5.2 Hints
- Use seq.Date() or as.Date() to generate the Date column.
- Use runif() or rnorm() for continuous numeric data.
- Use sample() for discrete, nominal, and ordinal data.
- Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
# Install kableExtra package if not already installed
# install.packages("kableExtra")
library(knitr)
library(kableExtra) # for the grouped (double) header row in the table
# 1. Date column: 30 consecutive days in October 2025
Date <- seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30)
# 2. Weather Temperature column
# runif() draws random decimal values between 15°C and 35°C
Weather_Temperature <- runif(30, min = 15, max = 35)
# 3. Number of Green Areas column
# sample() draws random whole numbers from 1 to 20
Number_of_Green_Areas <- sample(1:20, 30, replace = TRUE)
# 4. City Name column
Cities <- c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix",
"Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose")
City_Name <- sample(Cities, 30, replace = TRUE) # pick city names at random from the list
# 5. Crime Level column
# factor() with levels ordered from low to high (Low < Medium < High)
Crime_Level <- factor(
sample(c("Low", "Medium", "High"), 30, replace = TRUE),
levels = c("Low", "Medium", "High"),
ordered = TRUE
)
# Combine all columns into one data frame called my_data
my_data <- data.frame(
Date = Date,
Weather_Temperature = Weather_Temperature,
Number_of_Green_Areas = Number_of_Green_Areas,
City_Name = City_Name,
Crime_Level = Crime_Level
)
View(my_data) # open the full data frame in the data viewer
# Render the table with column names and a header row of variable types
kable(my_data,
caption = "City Environment & Crime Level Dataset",
col.names = c("Date", "Air Temperature (°C)", "Number of Green Areas", "City Name", "Crime Level"),
align = "c") %>%
add_header_above(c(" " = 1, "Continuous" = 1, "Discrete" = 1, "Nominal" = 1, "Ordinal" = 1)
)
| Date | Air Temperature (°C) | Number of Green Areas | City Name | Crime Level |
|---|---|---|---|---|
| 2025-10-01 | 15.39455 | 17 | New York | High |
| 2025-10-02 | 34.66373 | 1 | Philadelphia | Medium |
| 2025-10-03 | 20.08228 | 8 | Philadelphia | Medium |
| 2025-10-04 | 22.32757 | 4 | Chicago | Low |
| 2025-10-05 | 25.24550 | 20 | Philadelphia | Medium |
| 2025-10-06 | 24.82646 | 14 | Phoenix | High |
| 2025-10-07 | 17.03657 | 7 | Houston | Low |
| 2025-10-08 | 25.22392 | 6 | Phoenix | Low |
| 2025-10-09 | 19.31232 | 16 | San Diego | Medium |
| 2025-10-10 | 23.78695 | 7 | Chicago | Low |
| 2025-10-11 | 28.01679 | 9 | Dallas | Medium |
| 2025-10-12 | 21.58532 | 12 | San Diego | High |
| 2025-10-13 | 26.28265 | 1 | Houston | Low |
| 2025-10-14 | 29.72757 | 1 | Chicago | Medium |
| 2025-10-15 | 17.24652 | 15 | San Antonio | Low |
| 2025-10-16 | 28.57331 | 16 | Chicago | Medium |
| 2025-10-17 | 22.22300 | 3 | New York | High |
| 2025-10-18 | 19.39754 | 10 | Philadelphia | Medium |
| 2025-10-19 | 19.13467 | 12 | San Jose | High |
| 2025-10-20 | 25.80432 | 20 | Phoenix | Low |
| 2025-10-21 | 29.95591 | 3 | Chicago | Low |
| 2025-10-22 | 15.96142 | 20 | Houston | Low |
| 2025-10-23 | 33.21069 | 17 | Philadelphia | High |
| 2025-10-24 | 19.30976 | 18 | Philadelphia | Low |
| 2025-10-25 | 30.62475 | 19 | Los Angeles | Low |
| 2025-10-26 | 23.48873 | 6 | Houston | Low |
| 2025-10-27 | 22.11962 | 13 | San Diego | Low |
| 2025-10-28 | 17.75099 | 13 | Phoenix | Medium |
| 2025-10-29 | 19.15034 | 5 | Phoenix | High |
| 2025-10-30 | 34.90841 | 9 | Dallas | Medium |
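The optional tasks from the instructions can be sketched as follows. The block regenerates a data frame of the same shape with `set.seed()` (the seed value 123 and the shortened city list are arbitrary choices for this example), so its random values will differ from the table above.

```r
set.seed(123)  # arbitrary seed so the random draws are reproducible

my_data <- data.frame(
  Date = seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30),
  Weather_Temperature = runif(30, min = 15, max = 35),
  Number_of_Green_Areas = sample(1:20, 30, replace = TRUE),
  City_Name = sample(c("New York", "Chicago", "Houston"), 30, replace = TRUE),
  Crime_Level = factor(sample(c("Low", "Medium", "High"), 30, replace = TRUE),
                       levels = c("Low", "Medium", "High"), ordered = TRUE)
)

# Confirm the shape: 30 rows, 5 columns
dim(my_data)

# Column-by-column summary (quartiles for numbers, counts for factors)
summary(my_data)

# Frequency of each category in the nominal and ordinal columns
city_counts <- table(my_data$City_Name)
crime_counts <- table(my_data$Crime_Level)
print(city_counts)
print(crime_counts)
```

Without a call to `set.seed()`, `runif()` and `sample()` produce different values on every run, so a table rendered once cannot be reproduced exactly later.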