Data Exploration

Exercises ~ Week 2

Logo


1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)
StudentID Name Age CreditsCompleted Major YearLevel
S001 Alice 20 45 Data Sains Sophomore
S002 Budi 21 60 Mathematics Junior
S003 Citra 19 30 Statistics Freshman
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")  

# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman","Sophomore","Junior","Senior"),
                    ordered = TRUE)          

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)
##   StudentID  Name Age CreditsCompleted       Major YearLevel
## 1      S001 Alice  20               45  Data Sains Sophomore
## 2      S002  Budi  21               60 Mathematics    Junior
## 3      S003 Citra  19               30  Statistics  Freshman

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical",
    "Categorical"
  ),
  Subtype = c(
    "Discrete",
    "Continous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")
Table of Variables and Data Types
No Variable DataType Subtype
1 Number of vehicles passing through the toll road each day Numeric Discrete
2 Student height in cm Numeric Continous
3 Employee gender (Male / Female) Categorical Nominal
4 Customer satisfaction level: Low, Medium, High Categorical Ordinal
5 Respondent’s favorite color: Red, Blue, Green Categorical Nominal

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "structured",
    "structured",
    "unstructured",
    "structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date Qty Price Product CustomerTier
2025-10-01 2 1000 Laptop High
2025-10-01 5 20 Mouse Medium
2025-10-02 1 1000 Laptop Low
2025-10-02 3 30 Keyboard Medium
2025-10-03 4 50 Mouse Medium
2025-10-03 2 1000 Laptop High
2025-10-04 6 25 Keyboard Low
2025-10-04 1 1000 Laptop High
2025-10-05 3 40 Mouse Low
2025-10-05 5 10 Keyboard Medium

Your Assignment Instructions: Creating a Transactions Table above in R

  1. Create a data frame in R called transactions containing the data above.

  2. Identify which variables are numeric and which are categorical

  3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total. #total revenue Total <- Qty * Price

  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
    • Create a pie chart showing the proportion of total revenue per customer tier.
  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

library(DT)

# Transactions
Date <- c("2025-10-01","2025-01-10","2025-10-02","2025-10-02","2025-10-03","2025-10-03","2025-10-04","2025-10-04","2025-10-05","2025-10-05")
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)                                                          
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)                                     
Product <- c("Laptop","Mouse","Laptop","Keyboard","Mouse","Laptop","Keyboard","Laptop","Mouse","Keyboard")
CustomerTier <- c("High","Medium","Low","Medium","Medium","High","Low","High","Low","Medium")            

# data.frame
transaction <- data.frame(Date, Qty, Price, Product, CustomerTier)

# Display the data frame 
print(transaction)
##          Date Qty Price  Product CustomerTier
## 1  2025-10-01   2  1000   Laptop         High
## 2  2025-01-10   5    20    Mouse       Medium
## 3  2025-10-02   1  1000   Laptop          Low
## 4  2025-10-02   3    30 Keyboard       Medium
## 5  2025-10-03   4    50    Mouse       Medium
## 6  2025-10-03   2  1000   Laptop         High
## 7  2025-10-04   6    25 Keyboard          Low
## 8  2025-10-04   1  1000   Laptop         High
## 9  2025-10-05   3    40    Mouse          Low
## 10 2025-10-05   5    10 Keyboard       Medium
# Create a data frame for data types
variables_info <- data.frame(
  Variable = c(
    "Qty",
    "Price",
    "Product",
    "CustomerTier"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical"
  ),
  stringsAsFactors = FALSE
)

print(variables_info)
##       Variable    DataType
## 1          Qty     Numeric
## 2        Price     Numeric
## 3      Product Categorical
## 4 CustomerTier Categorical
# Compute total cost for every transaction
transaction$total <- transaction$Qty * transaction$Price

# view the data frame 
print(transaction)
##          Date Qty Price  Product CustomerTier total
## 1  2025-10-01   2  1000   Laptop         High  2000
## 2  2025-01-10   5    20    Mouse       Medium   100
## 3  2025-10-02   1  1000   Laptop          Low  1000
## 4  2025-10-02   3    30 Keyboard       Medium    90
## 5  2025-10-03   4    50    Mouse       Medium   200
## 6  2025-10-03   2  1000   Laptop         High  2000
## 7  2025-10-04   6    25 Keyboard          Low   150
## 8  2025-10-04   1  1000   Laptop         High  1000
## 9  2025-10-05   3    40    Mouse          Low   120
## 10 2025-10-05   5    10 Keyboard       Medium    50
# compute summary Statistic
revenue_Qty <- aggregate(Qty ~ Product, data = transaction, sum)
print (revenue_Qty)
##    Product Qty
## 1 Keyboard  14
## 2   Laptop   6
## 3    Mouse  12
revenue_tier <- aggregate(total ~ Product, data = transaction, sum)
print(revenue_tier)
##    Product total
## 1 Keyboard   290
## 2   Laptop  6000
## 3    Mouse   420
avg_price <- aggregate(Price ~ Product, data = transaction, FUN = mean)
print(avg_price) 
##    Product      Price
## 1 Keyboard   21.66667
## 2   Laptop 1000.00000
## 3    Mouse   36.66667
# Calculate total quantity per product 
qty_per_product <- aggregate(Qty ~ Product, data = transaction, sum)

# Barplot
barplot(qty_per_product$Qty,
        names.arg = qty_per_product$Product,
        col = "orange",
        main = "Total Quantity Sold per Product",
        ylab = "Total Quantity",
        xlab = "Product")

# Calculate total revenue per CustomerTier
revenue_tier <- aggregate(total ~ CustomerTier, data = transaction, sum) 

 # Create a pie chart 
pie( revenue_tier$total,
  labels = revenue_tier$CustomerTier,
  main = "Total Revenue per Customer Tier",
  col = c("red", "orange", "pink"))

# Hitung total revenue per tanggal
revenue_per_date <- aggregate(total ~ Date, data = transaction, sum)

# Lihat hasilnya
print(revenue_per_date)
##         Date total
## 1 2025-01-10   100
## 2 2025-10-01  2000
## 3 2025-10-02  1090
## 4 2025-10-03  2200
## 5 2025-10-04  1150
## 6 2025-10-05   170
# Cari tanggal dengan total revenue paling tingi menggunakan max
max_date <- revenue_per_date[which.max(revenue_per_date$total), ]

#lihat hasilnya
print(max_date)
##         Date total
## 4 2025-10-03  2200
library(ggplot2)

# Bikin stacked bar chart
ggplot(transaction, aes(x = reorder(Product, Qty), y = Qty, fill = CustomerTier)) +
  geom_bar(stat = "identity") +
  labs(title = "Quantity Sold per Product by Customer Tier",
       x = "Product",
       y = "Quantity Sold") +
  theme_minimal()

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.
  • Use runif() or rnorm() for continuous numeric data.
  • Use sample() for discrete, nominal, and ordinal data.
  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
library(DT)

# Data Financial Frame

# column 1: Transaction Date
Date <- seq.Date(from = as.Date("2025-10-01"), by = "day", length.out = 30)

# Column 2: Transaction value (Continuous) — In rupiah
TransactionValue <- c(
  1250000, 980000, 1500000, 2100000, 1850000, 750000, 2400000, 1300000, 1780000, 2220000,
  1950000, 1650000, 1420000, 2550000, 3100000, 2750000, 1980000, 880000, 1750000, 2200000,
  1900000, 2600000, 2450000, 1350000, 1500000, 900000, 1200000, 2750000, 3200000, 2100000
)

# Column 3: Number of items purchased (Discrete)
Items <- c(
  2, 1, 3, 5, 4, 2, 6, 3, 5, 4,
  2, 3, 1, 7, 8, 5, 3, 2, 4, 6,
  5, 7, 8, 3, 4, 1, 2, 7, 9, 5
)

# Column 4: Transaction type (Nominal)
TransactionType <- c(
  "Pembelian Tunai", "Transfer Bank", "Kartu Kredit", "Pembelian Tunai", "E-Wallet",
  "Transfer Bank", "Kartu Kredit", "E-Wallet", "Pembelian Tunai", "E-Wallet",
  "Kartu Kredit", "Pembelian Tunai", "Transfer Bank", "E-Wallet", "Kartu Kredit",
  "Pembelian Tunai", "E-Wallet", "Transfer Bank", "Kartu Kredit", "Pembelian Tunai",
  "E-Wallet", "Transfer Bank", "Kartu Kredit", "Pembelian Tunai", "E-Wallet",
  "Kartu Kredit", "Transfer Bank", "Pembelian Tunai", "E-Wallet", "Transfer Bank"
)

# Kolom 5: Customer Tier (Ordinal)
CustomerTier <- factor(
  c(
    "Sedang", "Rendah", "Sedang", "Tinggi", "Sedang", "Rendah", "Tinggi", "Sedang", "Tinggi", "Sedang",
    "Rendah", "Tinggi", "Sedang", "Tinggi", "Tinggi", "Sedang", "Rendah", "Sedang", "Tinggi", "Sedang",
    "Rendah", "Tinggi", "Tinggi", "Rendah", "Sedang", "Rendah", "Sedang", "Tinggi", "Tinggi", "Sedang"
  ),
  levels = c("Rendah", "Sedang", "Tinggi"),
  ordered = TRUE
)

# Combine into a data frame
data_financial <- data.frame(Date, TransactionValue, Items, TransactionType, CustomerTier)

# Display the data frame as a table
print(data_financial)
##          Date TransactionValue Items TransactionType CustomerTier
## 1  2025-10-01          1250000     2 Pembelian Tunai       Sedang
## 2  2025-10-02           980000     1   Transfer Bank       Rendah
## 3  2025-10-03          1500000     3    Kartu Kredit       Sedang
## 4  2025-10-04          2100000     5 Pembelian Tunai       Tinggi
## 5  2025-10-05          1850000     4        E-Wallet       Sedang
## 6  2025-10-06           750000     2   Transfer Bank       Rendah
## 7  2025-10-07          2400000     6    Kartu Kredit       Tinggi
## 8  2025-10-08          1300000     3        E-Wallet       Sedang
## 9  2025-10-09          1780000     5 Pembelian Tunai       Tinggi
## 10 2025-10-10          2220000     4        E-Wallet       Sedang
## 11 2025-10-11          1950000     2    Kartu Kredit       Rendah
## 12 2025-10-12          1650000     3 Pembelian Tunai       Tinggi
## 13 2025-10-13          1420000     1   Transfer Bank       Sedang
## 14 2025-10-14          2550000     7        E-Wallet       Tinggi
## 15 2025-10-15          3100000     8    Kartu Kredit       Tinggi
## 16 2025-10-16          2750000     5 Pembelian Tunai       Sedang
## 17 2025-10-17          1980000     3        E-Wallet       Rendah
## 18 2025-10-18           880000     2   Transfer Bank       Sedang
## 19 2025-10-19          1750000     4    Kartu Kredit       Tinggi
## 20 2025-10-20          2200000     6 Pembelian Tunai       Sedang
## 21 2025-10-21          1900000     5        E-Wallet       Rendah
## 22 2025-10-22          2600000     7   Transfer Bank       Tinggi
## 23 2025-10-23          2450000     8    Kartu Kredit       Tinggi
## 24 2025-10-24          1350000     3 Pembelian Tunai       Rendah
## 25 2025-10-25          1500000     4        E-Wallet       Sedang
## 26 2025-10-26           900000     1    Kartu Kredit       Rendah
## 27 2025-10-27          1200000     2   Transfer Bank       Sedang
## 28 2025-10-28          2750000     7 Pembelian Tunai       Tinggi
## 29 2025-10-29          3200000     9        E-Wallet       Tinggi
## 30 2025-10-30          2100000     5   Transfer Bank       Sedang
# Create a second table for variable type
variables_info <- data.frame(
  No = 1:4,
  variable = c(
    "Transaction Value",
    "Items",
    "Transaction Type",
    "Customer Tier"
  ),
  DataType = c(
    "Continuous",
    "Discrete",
    "Nominal",
    "Ordinal"
  ),
  stringsAsFactors = FALSE
)

library(knitr)

kable(
  variables_info,
  caption = "Table of Variables and Data Types"
)
Table of Variables and Data Types
No variable DataType
1 Transaction Value Continuous
2 Items Discrete
3 Transaction Type Nominal
4 Customer Tier Ordinal