Data Exploration

Exercises ~ Week 3


1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)

StudentID  Name   Age  CreditsCompleted  Major        YearLevel
S001       Alice  20   45                Data Sains   Sophomore
S002       Budi   21   60                Mathematics  Junior
S003       Citra  19   30                Statistics   Freshman

# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")  

# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman","Sophomore","Junior","Senior"),
                    ordered = TRUE)          

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)
##   StudentID  Name Age CreditsCompleted       Major YearLevel
## 1      S001 Alice  20               45  Data Sains Sophomore
## 2      S002  Budi  21               60 Mathematics    Junior
## 3      S003 Citra  19               30  Statistics  Freshman
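
As a quick check on the classification above, the following sketch rebuilds the same three-row data frame and inspects how R stores each column. Using `sapply()` with `class()` and the comparison on `YearLevel` are illustrative choices, not part of the original exercise.

```r
# Rebuild the students data frame from the exercise
students <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Name = c("Alice", "Budi", "Citra"),
  Age = c(20, 21, 19),
  CreditsCompleted = c(45, 60, 30),
  Major = c("Data Sains", "Mathematics", "Statistics"),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# First class of each column (an ordered factor reports "ordered" first)
col_classes <- sapply(students, function(x) class(x)[1])
print(col_classes)

# Ordered factors support comparisons that follow the level order
above_freshman <- students$Name[students$YearLevel > "Freshman"]
print(above_freshman)
```

The comparison only works because `YearLevel` was created with `ordered = TRUE`; on a plain factor, `>` returns `NA` with a warning.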

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical",
    "Categorical"
  ),
  Subtype = c(
    "Discrete",
    "Continuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types

No  Variable                                                   DataType     Subtype
1   Number of vehicles passing through the toll road each day  Numeric      Discrete
2   Student height in cm                                       Numeric      Continuous
3   Employee gender (Male / Female)                            Categorical  Nominal
4   Customer satisfaction level: Low, Medium, High             Categorical  Ordinal
5   Respondent’s favorite color: Red, Blue, Green              Categorical  Nominal
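
To connect these data types to R's storage classes, the sketch below creates one small vector per variable. All sample values are hypothetical and serve only to illustrate the mapping: integer for discrete counts, double for continuous measurements, an unordered factor for nominal data, and an ordered factor for ordinal data.

```r
# Hypothetical sample values, one vector per variable in the table
vehicles_per_day <- c(1520L, 1833L, 1710L)            # Numeric / Discrete
height_cm <- c(158.5, 171.2, 165.0)                   # Numeric / Continuous
gender <- factor(c("Male", "Female", "Female"))       # Categorical / Nominal
satisfaction <- factor(c("Low", "High", "Medium"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)                # Categorical / Ordinal
favorite_color <- factor(c("Red", "Blue", "Green"))   # Categorical / Nominal

# Check how each vector is stored
print(is.integer(vehicles_per_day))
print(is.double(height_cm))
print(is.ordered(gender))
print(is.ordered(satisfaction))

# Only the ordinal variable has a meaningful maximum
highest_satisfaction <- as.character(max(satisfaction))
print(highest_satisfaction)
```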

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column
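
The two classifications can also be cross-tabulated to see how the sources are distributed. This is a minimal sketch that rebuilds only the two classification columns; the use of `table()` here is an illustrative addition, not part of the original exercise.

```r
# Rebuild the two classification columns from the exercise
data_sources <- data.frame(
  Internal_External = c("Internal", "External", "External", "Internal"),
  Structured_Unstructured = c("Structured", "Structured",
                              "Unstructured", "Structured"),
  stringsAsFactors = FALSE
)

# Count sources for each combination of origin and structure
source_counts <- table(data_sources$Internal_External,
                       data_sources$Structured_Unstructured)
print(source_counts)
```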

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date        Qty  Price  Product   CustomerTier
2025-10-01  2    1000   Laptop    High
2025-10-01  5    20     Mouse     Medium
2025-10-02  1    1000   Laptop    Low
2025-10-02  3    30     Keyboard  Medium
2025-10-03  4    50     Mouse     Medium
2025-10-03  2    1000   Laptop    High
2025-10-04  6    25     Keyboard  Low
2025-10-04  1    1000   Laptop    High
2025-10-05  3    40     Mouse     Low
2025-10-05  5    10     Keyboard  Medium

Assignment instructions: recreate the transactions table above in R.

  1. Create a data frame in R called transactions containing the data above.

  2. Identify which variables are numeric and which are categorical.

  3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.

  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
    • Create a pie chart showing the proportion of total revenue per customer tier.
  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# 1. Create a vector for each variable
# Date: parsed with as.Date() so R treats the values as dates, not numbers
Date <- as.Date(c(
  "2025-10-01", "2025-10-01",
  "2025-10-02", "2025-10-02",
  "2025-10-03", "2025-10-03",
  "2025-10-04", "2025-10-04",
  "2025-10-05", "2025-10-05"
))
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)                      # Numeric / Discrete
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)  # Numeric / Continuous

# Nominal
Product <- c(
  "Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
  "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"
)

# Ordinal
CustomerTier <- factor(
  c("High", "Medium", "Low", "Medium", "Medium",
    "High", "Low", "High", "Low", "Medium"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# Combine all vectors into a data frame
transactions <- data.frame(Date, Qty, Price, Product, CustomerTier, stringsAsFactors = FALSE)

# Display the data frame
kable(transactions)

Date        Qty  Price  Product   CustomerTier
2025-10-01  2    1000   Laptop    High
2025-10-01  5    20     Mouse     Medium
2025-10-02  1    1000   Laptop    Low
2025-10-02  3    30     Keyboard  Medium
2025-10-03  4    50     Mouse     Medium
2025-10-03  2    1000   Laptop    High
2025-10-04  6    25     Keyboard  Low
2025-10-04  1    1000   Laptop    High
2025-10-05  3    40     Mouse     Low
2025-10-05  5    10     Keyboard  Medium

# 2. Identify variable types
variable_types <- data.frame(
  Variable = c("Date", "Qty", "Price", "Product", "CustomerTier"),
  Type = c(
    "Date",
    "Numeric (Discrete)",
    "Numeric (Continuous)",
    "Categorical (Nominal)",
    "Categorical (Ordinal)"
  )
)

# Display the data frame
kable(variable_types, caption = "Variable Types in Transactions Data")

Variable Types in Transactions Data

Variable      Type
Date          Date
Qty           Numeric (Discrete)
Price         Numeric (Continuous)
Product       Categorical (Nominal)
CustomerTier  Categorical (Ordinal)

# 3. Calculate total revenue
transactions$total <- transactions$Qty * transactions$Price
kable(transactions)

Date        Qty  Price  Product   CustomerTier  total
2025-10-01  2    1000   Laptop    High          2000
2025-10-01  5    20     Mouse     Medium        100
2025-10-02  1    1000   Laptop    Low           1000
2025-10-02  3    30     Keyboard  Medium        90
2025-10-03  4    50     Mouse     Medium        200
2025-10-03  2    1000   Laptop    High          2000
2025-10-04  6    25     Keyboard  Low           150
2025-10-04  1    1000   Laptop    High          1000
2025-10-05  3    40     Mouse     Low           120
2025-10-05  5    10     Keyboard  Medium        50

# 4. Compute summary statistics

# a. Total quantity sold for each product
total_Qty <- aggregate(Qty ~ Product, data = transactions, sum)
kable(total_Qty, caption = "Total Quantity Sold per Product")

Total Quantity Sold per Product

Product   Qty
Keyboard  14
Laptop    6
Mouse     12

# b. Total revenue per product
total_revenue <- aggregate(total ~ Product, data = transactions, sum)
kable(total_revenue, caption = "Total Revenue per Product")

Total Revenue per Product

Product   total
Keyboard  290
Laptop    6000
Mouse     420

# c. Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)
kable(avg_price, caption = "Average Price per Product")

Average Price per Product

Product   Price
Keyboard  21.66667
Laptop    1000.00000
Mouse     36.66667

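
The average price above is the unweighted mean of the listed prices for each product. An alternative reading, assumed here only for illustration, is a quantity-weighted average price (total revenue divided by total quantity). The sketch below computes both sums in one `aggregate()` call using `cbind()` on the left-hand side of the formula; the column name `weighted_avg_price` is a hypothetical addition.

```r
# Rebuild the columns needed for the per-product summary
transactions <- data.frame(
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  stringsAsFactors = FALSE
)
transactions$total <- transactions$Qty * transactions$Price

# Sum quantity and revenue per product in a single call
per_product <- aggregate(cbind(Qty, total) ~ Product,
                         data = transactions, FUN = sum)

# Quantity-weighted average price: revenue divided by units sold
per_product$weighted_avg_price <- per_product$total / per_product$Qty
print(per_product)
```

For Mouse, the weighted figure is 420 / 12 = 35, slightly below the unweighted mean of 36.67, because more units were sold at the lower prices.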
# 5. Visualize the data

# a. barplot showing total quantity sold per product
total_qty = tapply(transactions$Qty, transactions$Product, sum)
barplot(total_qty,
        main = "Total Quantity Sold per Product",
        xlab = "Product",
        ylab = "Total Quantity",
        col = "lightblue")

# b. pie chart showing the proportion of total revenue
total_revenue_tier = tapply(transactions$total, transactions$CustomerTier, sum)
pie(total_revenue_tier,
    main = "Proportion of Total Revenue per Customer Tier",
    col = rainbow(length(total_revenue_tier)))

# 6. Optional challenge

# a. Find which date had the highest total revenue
total_revenue_date <- aggregate(total ~ Date, data = transactions, sum)
total_revenue_date[which.max(total_revenue_date$total), ]

# b. Stacked bar chart showing quantity sold per product by customer tier
# Rows = customer tiers (stacked segments), columns = products (one bar each)
qty_table <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
barplot(qty_table,
        main = "Quantity Sold per Product by Customer Tier",
        xlab = "Product",
        ylab = "Quantity",
        col = c("lightblue", "lightgreen", "pink"),
        legend.text = TRUE)
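
The highest-revenue date from the optional challenge can be cross-checked by ranking all dates instead of extracting only the maximum. This is a minimal sketch that rebuilds the needed columns; the names `daily` and `best_day` are hypothetical, and the dates are generated with `seq()` and `rep()` on the assumption that each date appears exactly twice, as in the table.

```r
# Rebuild the date, quantity, and price columns (two transactions per day)
days <- seq(as.Date("2025-10-01"), by = "day", length.out = 5)
transactions <- data.frame(
  Date = rep(days, each = 2),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
)
transactions$total <- transactions$Qty * transactions$Price

# Total revenue per date, sorted from highest to lowest
daily <- aggregate(total ~ Date, data = transactions, FUN = sum)
daily <- daily[order(daily$total, decreasing = TRUE), ]
print(daily)

# The first row after sorting is the best day
best_day <- daily$Date[1]
print(best_day)
```

With this data the ranking puts 2025-10-03 first, with a revenue of 2200 (200 from Mouse plus 2000 from Laptop).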

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.
  • Use runif() or rnorm() for continuous numeric data.
  • Use sample() for discrete, nominal, and ordinal data.
  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
# 1. Coffee Shop Data

# Date (30 days in September)
Date = seq(as.Date("2025-09-01"), as.Date("2025-09-30"), by = "day")

# Continuous: ml of coffee sold per day (random, 1500–4000 ml)
Coffee_ml = runif(30, min = 1500, max = 4000)

# Discrete: number of cups of coffee sold per day (random, 20–100)
Cups_Sold = sample(20:100, 30, replace = TRUE)

# Nominal: type of coffee drink
Drink_Type = sample(c("Americano", "Cappuccino", "Latte", "Espresso", "Mocha"), 30, replace = TRUE)

# Ordinal: customer satisfaction level
Customer_Satisfaction = factor(
  sample(c("Poor", "Fair", "Good", "Very Good", "Excellent"), 30, replace = TRUE),
  levels = c("Poor", "Fair", "Good", "Very Good", "Excellent"),
  ordered = TRUE)

# Combine all vectors into a data frame
my_data = data.frame(Date, Coffee_ml, Cups_Sold, Drink_Type, Customer_Satisfaction)
kable(my_data)

Date        Coffee_ml  Cups_Sold  Drink_Type  Customer_Satisfaction
2025-09-01  3243.048   67         Espresso    Good
2025-09-02  2114.439   83         Espresso    Poor
2025-09-03  2893.236   85         Espresso    Excellent
2025-09-04  3023.221   67         Latte       Good
2025-09-05  2185.550   100        Latte       Poor
2025-09-06  1871.067   98         Espresso    Fair
2025-09-07  1509.224   45         Cappuccino  Fair
2025-09-08  2703.869   58         Cappuccino  Good
2025-09-09  1840.767   77         Latte       Poor
2025-09-10  2875.354   97         Latte       Fair
2025-09-11  1816.171   74         Mocha       Fair
2025-09-12  1656.236   88         Cappuccino  Good
2025-09-13  3003.934   86         Latte       Poor
2025-09-14  3980.280   78         Latte       Excellent
2025-09-15  1805.588   95         Mocha       Excellent
2025-09-16  3406.042   66         Latte       Poor
2025-09-17  3751.245   27         Americano   Good
2025-09-18  3262.385   72         Americano   Poor
2025-09-19  2131.619   94         Americano   Fair
2025-09-20  3426.469   92         Espresso    Excellent
2025-09-21  3402.881   99         Mocha       Good
2025-09-22  1712.093   59         Mocha       Fair
2025-09-23  3672.968   60         Americano   Good
2025-09-24  3100.494   29         Americano   Fair
2025-09-25  2625.981   41         Mocha       Excellent
2025-09-26  1730.989   76         Cappuccino  Fair
2025-09-27  3134.336   51         Cappuccino  Fair
2025-09-28  3026.545   40         Latte       Very Good
2025-09-29  3222.764   23         Espresso    Very Good
2025-09-30  3555.928   30         Latte       Excellent

# Summarize the data (optional)
summary_data = summary(my_data)
kable(summary_data)

Date                Coffee_ml     Cups_Sold       Drink_Type        Customer_Satisfaction
Min.   :2025-09-01  Min.   :1509  Min.   : 23.00  Length:30         Poor     :6
1st Qu.:2025-09-08  1st Qu.:1932  1st Qu.: 52.75  Class :character  Fair     :9
Median :2025-09-15  Median :2949  Median : 73.00  Mode  :character  Good     :7
Mean   :2025-09-15  Mean   :2723  Mean   : 68.57  NA                Very Good:2
3rd Qu.:2025-09-22  3rd Qu.:3258  3rd Qu.: 87.50  NA                Excellent:6
Max.   :2025-09-30  Max.   :3980  Max.   :100.00  NA                NA

# Category frequencies (nominal and ordinal columns)
drink_freq = table(my_data$Drink_Type)
satisfaction_freq = table(my_data$Customer_Satisfaction)

kable(drink_freq, caption = "Frequency of Drink Types (Nominal)")

Frequency of Drink Types (Nominal)

Var1        Freq
Americano   5
Cappuccino  5
Espresso    6
Latte       9
Mocha       5

kable(satisfaction_freq, caption = "Frequency of Customer Satisfaction (Ordinal)")

Frequency of Customer Satisfaction (Ordinal)

Var1       Freq
Poor       6
Fair       9
Good       7
Very Good  2
Excellent  6
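
Because `runif()` and `sample()` draw random values, each run of the code above produces a different table. The sketch below shows one way to make the data frame reproducible and to verify its structure. The seed value 2025 and the rounding of `Coffee_ml` are arbitrary assumptions, and the values generated will not match the table shown above.

```r
# Fix the random number generator so repeated runs give the same data
set.seed(2025)  # arbitrary seed, chosen only for illustration
n <- 30
satisfaction_levels <- c("Poor", "Fair", "Good", "Very Good", "Excellent")

my_data <- data.frame(
  Date = seq(as.Date("2025-09-01"), by = "day", length.out = n),
  Coffee_ml = round(runif(n, min = 1500, max = 4000), 1),
  Cups_Sold = sample(20:100, n, replace = TRUE),
  Drink_Type = sample(c("Americano", "Cappuccino", "Latte", "Espresso", "Mocha"),
                      n, replace = TRUE),
  Customer_Satisfaction = factor(sample(satisfaction_levels, n, replace = TRUE),
                                 levels = satisfaction_levels,
                                 ordered = TRUE),
  stringsAsFactors = FALSE
)

# Structural checks: 30 rows, correct classes, values within range
stopifnot(nrow(my_data) == 30)
stopifnot(inherits(my_data$Date, "Date"))
stopifnot(is.ordered(my_data$Customer_Satisfaction))
stopifnot(all(my_data$Cups_Sold >= 20 & my_data$Cups_Sold <= 100))

str(my_data)
```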
