Data Exploration

Exercises ~ Week 3


1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)
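
| StudentID | Name  | Age | CreditsCompleted | Major       | YearLevel |
|-----------|-------|-----|------------------|-------------|-----------|
| S001      | Alice | 20  | 45               | Data Sains  | Sophomore |
| S002      | Budi  | 21  | 60               | Mathematics | Junior    |
| S003      | Citra | 19  | 30               | Statistics  | Freshman  |
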
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")  

# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman","Sophomore","Junior","Senior"),
                    ordered = TRUE)          

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)
##   StudentID  Name Age CreditsCompleted       Major YearLevel
## 1      S001 Alice  20               45  Data Sains Sophomore
## 2      S002  Budi  21               60 Mathematics    Junior
## 3      S003 Citra  19               30  Statistics  Freshman
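
Because YearLevel is an ordered factor, R can compare and sort its values by rank rather than alphabetically. The following is a minimal sketch using the same three values; the name `year_level` and the helper variables are introduced only for this illustration.

```r
# Illustrative sketch: `year_level` mirrors the YearLevel column above
year_level <- factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE)

# Ordered factors support rank comparisons against a level name
above_freshman <- year_level > "Freshman"
print(above_freshman)

# Sorting follows the declared level order, not the alphabet
sorted_levels <- as.character(sort(year_level))
print(sorted_levels)

# A plain character vector sorts alphabetically instead
alphabetical <- sort(c("Sophomore", "Junior", "Freshman"))
print(alphabetical)
```

The contrast between `sorted_levels` and `alphabetical` is the practical reason for declaring `levels` explicitly on ordinal data.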

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical",
    "Categorical"
  ),
  Subtype = c(
    "Discrete",
    "Continuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types

| No | Variable                                                  | DataType    | Subtype    |
|----|-----------------------------------------------------------|-------------|------------|
| 1  | Number of vehicles passing through the toll road each day | Numeric     | Discrete   |
| 2  | Student height in cm                                      | Numeric     | Continuous |
| 3  | Employee gender (Male / Female)                           | Categorical | Nominal    |
| 4  | Customer satisfaction level: Low, Medium, High            | Categorical | Ordinal    |
| 5  | Respondent’s favorite color: Red, Blue, Green             | Categorical | Nominal    |
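
In R, these four subtypes usually map onto integer vectors, double vectors, unordered factors, and ordered factors. The sketch below uses made-up values (not taken from the exercise) purely to show that mapping; all variable names are hypothetical.

```r
# Hypothetical values chosen only to illustrate each data type
vehicles_per_day <- c(1200L, 1350L, 990L)                   # numeric, discrete
height_cm        <- c(162.5, 170.2, 158.9)                  # numeric, continuous
gender           <- factor(c("Male", "Female", "Female"))   # categorical, nominal
satisfaction     <- factor(c("Low", "High", "Medium"),
                           levels = c("Low", "Medium", "High"),
                           ordered = TRUE)                  # categorical, ordinal

# Check how R stores each vector
type_check <- c(
  vehicles_is_integer     = is.integer(vehicles_per_day),
  height_is_double        = is.double(height_cm),
  gender_is_ordered       = is.ordered(gender),
  satisfaction_is_ordered = is.ordered(satisfaction)
)
print(type_check)
```

Only the ordinal variable reports as ordered; the nominal factor has levels but no ranking among them.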

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources,
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column
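
One way to review the classification is to cross-tabulate the two labels with `table()`. This sketch rebuilds the two label columns with normalized spelling, under the hypothetical names `origin` and `structure_type`.

```r
# Labels for the four data sources, in the same order as the table
origin         <- c("Internal", "External", "External", "Internal")
structure_type <- c("Structured", "Structured", "Unstructured", "Structured")

# Count how many sources fall into each combination
source_counts <- table(Origin = origin, Structure = structure_type)
print(source_counts)
```

A cross-tabulation like this also exposes misspelled labels, since each distinct spelling would appear as its own column.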

4 Exercise 4

Dataset Structure: Consider the following transaction table:

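| Date       | Qty | Price | Product  | CustomerTier |
|------------|-----|-------|----------|--------------|
| 2025-10-01 | 2   | 1000  | Laptop   | High         |
| 2025-10-01 | 5   | 20    | Mouse    | Medium       |
| 2025-10-02 | 1   | 1000  | Laptop   | Low          |
| 2025-10-02 | 3   | 30    | Keyboard | Medium       |
| 2025-10-03 | 4   | 50    | Mouse    | Medium       |
| 2025-10-03 | 2   | 1000  | Laptop   | High         |
| 2025-10-04 | 6   | 25    | Keyboard | Low          |
| 2025-10-04 | 1   | 1000  | Laptop   | High         |
| 2025-10-05 | 3   | 40    | Mouse    | Low          |
| 2025-10-05 | 5   | 10    | Keyboard | Medium       |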

Assignment instructions: recreate the transactions table above in R, then analyze it as follows.

  1. Create a data frame in R called transactions containing the data above.

  2. Identify which variables are numeric and which are categorical.

  3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.

  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
    • Create a pie chart showing the proportion of total revenue per customer tier.
  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

# 1. Create a data frame for the transactions
Date <- c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02", "2025-10-03",
          "2025-10-03", "2025-10-04", "2025-10-04", "2025-10-05", "2025-10-05")
Qty <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
Product <- c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
             "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard")

# CustomerTier is ordinal, so it is stored as an ordered factor
CustomerTier <- factor(
  c("High", "Medium", "Low", "Medium", "Medium",
    "High", "Low", "High", "Low", "Medium"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

transactions <- data.frame(Date, Qty, Price, Product, CustomerTier,
                           stringsAsFactors = FALSE)

# Display the data frame
library(knitr)
kable(transactions, caption = "Data Transactions")
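
Data Transactions

| Date       | Qty | Price | Product  | CustomerTier |
|------------|-----|-------|----------|--------------|
| 2025-10-01 | 2   | 1000  | Laptop   | High         |
| 2025-10-01 | 5   | 20    | Mouse    | Medium       |
| 2025-10-02 | 1   | 1000  | Laptop   | Low          |
| 2025-10-02 | 3   | 30    | Keyboard | Medium       |
| 2025-10-03 | 4   | 50    | Mouse    | Medium       |
| 2025-10-03 | 2   | 1000  | Laptop   | High         |
| 2025-10-04 | 6   | 25    | Keyboard | Low          |
| 2025-10-04 | 1   | 1000  | Laptop   | High         |
| 2025-10-05 | 3   | 40    | Mouse    | Low          |
| 2025-10-05 | 5   | 10    | Keyboard | Medium       |
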
# 2. Identify variable types
library(knitr)
variable_types <- data.frame(
  Variable = c(
    "Date",
    "Qty",
    "Price",
    "Product",
    "CustomerTier"
    ),
  Type = c(
    "Date/Time (stored as character)",
    "Numeric (Discrete)",
    "Numeric (Continuous)",
    "Categorical (Nominal)",
    "Categorical (Ordinal)"
    )
)

# Display the data frame
kable(variable_types, caption = "Variable Types in Transactions Data")

Variable Types in Transactions Data

| Variable     | Type                            |
|--------------|---------------------------------|
| Date         | Date/Time (stored as character) |
| Qty          | Numeric (Discrete)              |
| Price        | Numeric (Continuous)            |
| Product      | Categorical (Nominal)           |
| CustomerTier | Categorical (Ordinal)           |

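
The stored classes can also be inspected programmatically. This is a minimal sketch on a two-row stand-in data frame (`tx_demo`, a hypothetical name). R's storage class and the statistical measurement level are related but not identical: a date kept as text, for example, reports as character.

```r
# Two-row stand-in with the same column layout as the transactions table
tx_demo <- data.frame(
  Date = c("2025-10-01", "2025-10-02"),
  Qty = c(2, 5),
  Price = c(1000, 20),
  Product = c("Laptop", "Mouse"),
  CustomerTier = factor(c("High", "Medium"),
                        levels = c("Low", "Medium", "High"), ordered = TRUE),
  stringsAsFactors = FALSE
)

# First class of each column
col_classes <- sapply(tx_demo, function(x) class(x)[1])
print(col_classes)

# Names of the columns stored as numbers
numeric_cols <- names(tx_demo)[sapply(tx_demo, is.numeric)]
print(numeric_cols)
```

Deciding whether a numeric column is discrete or continuous still requires knowing what it measures; the class alone does not say.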
# 3. Calculate total revenue
transactions$total = transactions$Qty * transactions$Price
kable(transactions)
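
| Date       | Qty | Price | Product  | CustomerTier | total |
|------------|-----|-------|----------|--------------|-------|
| 2025-10-01 | 2   | 1000  | Laptop   | High         | 2000  |
| 2025-10-01 | 5   | 20    | Mouse    | Medium       | 100   |
| 2025-10-02 | 1   | 1000  | Laptop   | Low          | 1000  |
| 2025-10-02 | 3   | 30    | Keyboard | Medium       | 90    |
| 2025-10-03 | 4   | 50    | Mouse    | Medium       | 200   |
| 2025-10-03 | 2   | 1000  | Laptop   | High         | 2000  |
| 2025-10-04 | 6   | 25    | Keyboard | Low          | 150   |
| 2025-10-04 | 1   | 1000  | Laptop   | High         | 1000  |
| 2025-10-05 | 3   | 40    | Mouse    | Low          | 120   |
| 2025-10-05 | 5   | 10    | Keyboard | Medium       | 50    |
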
# 4. Compute summary statistics

# a. Total quantity sold for each product
total_Qty = aggregate(Qty ~ Product, data = transactions, sum)
kable(total_Qty, caption = "Total Quantity Sold per Product")

Total Quantity Sold per Product

| Product  | Qty |
|----------|-----|
| Keyboard | 14  |
| Laptop   | 6   |
| Mouse    | 12  |

# b. total revenue per product
total_revenue = aggregate(total ~ Product, data = transactions, sum)
kable(total_revenue, caption = "Total Revenue per Product")

Total Revenue per Product

| Product  | total |
|----------|-------|
| Keyboard | 290   |
| Laptop   | 6000  |
| Mouse    | 420   |

# c. Average price per product
avg_price = aggregate(Price ~ Product, data = transactions, mean)
kable(avg_price, caption = "Average Price per Product")

Average Price per Product

| Product  | Price      |
|----------|------------|
| Keyboard | 21.66667   |
| Laptop   | 1000.00000 |
| Mouse    | 36.66667   |

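
The average-price table is a simple mean of the listed prices, so every transaction counts equally regardless of how many units it covers. If a quantity-weighted figure is wanted instead, total revenue can be divided by total quantity. A sketch of that alternative, rebuilding only the columns it needs (`tx`, `Revenue`, and `weighted_price` are illustrative names, not part of the assignment):

```r
# Rebuild the relevant columns of the transactions table
tx <- data.frame(
  Qty     = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price   = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard")
)
tx$Revenue <- tx$Qty * tx$Price

# Sum revenue and quantity per product, then divide
sums <- aggregate(cbind(Revenue, Qty) ~ Product, data = tx, FUN = sum)
sums$weighted_price <- sums$Revenue / sums$Qty
print(sums)
```

For Laptop the two averages agree because its price never changes; for Keyboard and Mouse the weighted figure is lower because the larger orders were placed at lower prices.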
# 5. Visualize the data

# a. barplot showing total quantity sold per product
total_qty = tapply(transactions$Qty, transactions$Product, sum)
barplot(total_qty,
        main = "Total Quantity Sold per Product",
        xlab = "Product",
        ylab = "Total Quantity",
        col = "lightblue")

# b. pie chart showing the proportion of total revenue
total_revenue_tier = tapply(transactions$total, transactions$CustomerTier, sum)
pie(total_revenue_tier,
    main = "Proportion of Total Revenue per Customer Tier",
    col = rainbow(length(total_revenue_tier)))

# 6. Optional challenge

# a. Find which date had the highest total revenue
total_revenue_date <- aggregate(total ~ Date, data = transactions, sum)
total_revenue_date[which.max(total_revenue_date$total), ]
# Result: 2025-10-03, with a total revenue of 2200 (2 laptops + 4 mice)
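# b. Stacked bar chart showing quantity sold per product by customer tier
# barplot() draws one bar per matrix column and stacks the rows, so
# CustomerTier must be the row variable and Product the column variable
qty_table <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
barplot(qty_table,
        main = "Quantity Sold per Product by Customer Tier",
        xlab = "Product",
        ylab = "Quantity",
        col = c("lightblue", "lightgreen", "pink"),
        legend.text = TRUE)   # legend labels are taken from the tier row names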
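
For a matrix input, `barplot()` draws one bar per column and stacks the rows within each bar, so the orientation of the cross-tabulation decides what appears on the x-axis. The sketch below checks that orientation before plotting; the `demo` data frame and its column names are made up for illustration.

```r
# Small made-up data set with a product and a tier column
demo <- data.frame(
  Qty     = c(2, 5, 1, 3),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard"),
  Tier    = c("High", "Medium", "Low", "Medium")
)

# Rows = Tier (stacked segments), columns = Product (bars)
by_product <- xtabs(Qty ~ Tier + Product, data = demo)
bar_labels <- colnames(by_product)
print(bar_labels)

# t() swaps the roles: one bar per tier, stacked by product
by_tier <- t(by_product)
tier_labels <- colnames(by_tier)
print(tier_labels)
```

Passing `by_product` to `barplot()` would give one bar per product, while `by_tier` would give one bar per tier; the axis label should match whichever is used.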

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.
  • Use runif() or rnorm() for continuous numeric data.
  • Use sample() for discrete, nominal, and ordinal data.
  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
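
The hints can be combined as in the sketch below. It is a five-row illustration rather than the 30-row solution; the seed value, the value ranges, and the column names are arbitrary choices made here.

```r
set.seed(123)  # arbitrary seed so the random draws repeat between runs

n <- 5  # the exercise asks for 30; 5 keeps the sketch short
sketch <- data.frame(
  Date       = seq(as.Date("2025-01-01"), by = "day", length.out = n),
  Continuous = runif(n, min = 0, max = 100),
  Discrete   = sample(1:10, n, replace = TRUE),
  Nominal    = sample(c("Red", "Blue", "Green"), n, replace = TRUE),
  Ordinal    = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                      levels = c("Low", "Medium", "High"), ordered = TRUE)
)

# Inspect the structure: one row per date, one class per column
str(sketch)
```

Calling `set.seed()` first is optional, but without it every run produces a different data frame, which makes printed results hard to compare.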



``` r
# 1. Coffee Shop Data
# No seed is set, so the sampled values change on every run
library(knitr)
Date <- seq(as.Date("2025-09-01"), as.Date("2025-09-30"), by = "day")    # Date
Coffee_ml <- runif(30, min = 1500, max = 4000)      # Continuous / coffee volume (ml)
Cups_Sold <- sample(20:100, 30, replace = TRUE)     # Discrete / number of cups sold
Drink_Type <- sample(c(
  "Americano",
  "Cappuccino",
  "Latte",
  "Espresso",
  "Mocha"), 30, replace = TRUE)                     # Nominal / type of drink
Customer_Satisfaction <- factor(
  sample(c(
    "Poor",
    "Fair",
    "Good",
    "Very Good",
    "Excellent"), 30, replace = TRUE),
  levels = c("Poor", "Fair", "Good", "Very Good", "Excellent"),
  ordered = TRUE)                                   # Ordinal / satisfaction

# Combine all vectors into a data frame
my_data <- data.frame(Date, Coffee_ml, Cups_Sold, Drink_Type,
                      Customer_Satisfaction)
kable(my_data)
```

| Date       | Coffee_ml | Cups_Sold | Drink_Type | Customer_Satisfaction |
|------------|-----------|-----------|------------|-----------------------|
| 2025-09-01 | 3179.608  | 22        | Mocha      | Fair                  |
| 2025-09-02 | 2261.069  | 88        | Cappuccino | Very Good             |
| 2025-09-03 | 3957.235  | 70        | Latte      | Excellent             |
| 2025-09-04 | 1837.464  | 68        | Americano  | Fair                  |
| 2025-09-05 | 1602.321  | 41        | Cappuccino | Excellent             |
| 2025-09-06 | 1531.665  | 21        | Latte      | Good                  |
| 2025-09-07 | 3221.046  | 28        | Americano  | Fair                  |
| 2025-09-08 | 2625.220  | 97        | Cappuccino | Good                  |
| 2025-09-09 | 2565.011  | 68        | Latte      | Good                  |
| 2025-09-10 | 3309.117  | 23        | Cappuccino | Fair                  |
| 2025-09-11 | 2681.909  | 65        | Mocha      | Fair                  |
| 2025-09-12 | 1902.952  | 62        | Mocha      | Poor                  |
| 2025-09-13 | 3371.731  | 78        | Cappuccino | Excellent             |
| 2025-09-14 | 3863.962  | 99        | Espresso   | Poor                  |
| 2025-09-15 | 2953.760  | 91        | Americano  | Good                  |
| 2025-09-16 | 2280.910  | 90        | Latte      | Fair                  |
| 2025-09-17 | 3062.806  | 30        | Mocha      | Excellent             |
| 2025-09-18 | 2437.797  | 77        | Cappuccino | Good                  |
| 2025-09-19 | 2224.364  | 85        | Espresso   | Good                  |
| 2025-09-20 | 2013.447  | 75        | Americano  | Very Good             |
| 2025-09-21 | 3899.187  | 96        | Cappuccino | Very Good             |
| 2025-09-22 | 2634.276  | 40        | Espresso   | Excellent             |
| 2025-09-23 | 3609.000  | 32        | Latte      | Poor                  |
| 2025-09-24 | 1735.574  | 67        | Latte      | Very Good             |
| 2025-09-25 | 1891.297  | 95        | Espresso   | Fair                  |
| 2025-09-26 | 2445.068  | 38        | Americano  | Poor                  |
| 2025-09-27 | 2919.725  | 88        | Latte      | Good                  |
| 2025-09-28 | 2840.019  | 46        | Latte      | Very Good             |
| 2025-09-29 | 2841.009  | 25        | Latte      | Excellent             |
| 2025-09-30 | 3426.712  | 25        | Mocha      | Poor                  |

``` r
# 2. Summary of the data (optional)
summary_data <- summary(my_data)
kable(summary_data)
```

| Date               | Coffee_ml    | Cups_Sold     | Drink_Type       | Customer_Satisfaction |
|--------------------|--------------|---------------|------------------|-----------------------|
| Min. :2025-09-01   | Min. :1532   | Min. :21.00   | Length:30        | Poor :5               |
| 1st Qu.:2025-09-08 | 1st Qu.:2234 | 1st Qu.:33.50 | Class :character | Fair :7               |
| Median :2025-09-15 | Median :2658 | Median :67.50 | Mode :character  | Good :7               |
| Mean :2025-09-15   | Mean :2704   | Mean :61.00   | NA               | Very Good:5           |
| 3rd Qu.:2025-09-22 | 3rd Qu.:3211 | 3rd Qu.:87.25 | NA               | Excellent:6           |
| Max. :2025-09-30   | Max. :3957   | Max. :99.00   | NA               | NA                    |

``` r
# 3. Category frequencies (nominal, ordinal)
drink_freq <- table(my_data$Drink_Type)
satisfaction_freq <- table(my_data$Customer_Satisfaction)

kable(drink_freq, caption = "Frequency of Drink Types (Nominal)")
```

Frequency of Drink Types (Nominal)

| Var1       | Freq |
|------------|------|
| Americano  | 5    |
| Cappuccino | 7    |
| Espresso   | 4    |
| Latte      | 9    |
| Mocha      | 5    |

``` r
kable(satisfaction_freq, caption = "Frequency of Customer Satisfaction (Ordinal)")
```

Frequency of Customer Satisfaction (Ordinal)

| Var1      | Freq |
|-----------|------|
| Poor      | 5    |
| Fair      | 7    |
| Good      | 7    |
| Very Good | 5    |
| Excellent | 6    |
