Data Exploration

Exercises ~ Week 2



1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)

StudentID  Name   Age  CreditsCompleted  Major         YearLevel
S001       Alice  20   45                Data Science  Sophomore
S002       Budi   21   60                Mathematics   Junior
S003       Citra  19   30                Statistics    Freshman

# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Science", "Mathematics", "Statistics")

# Ordinal: levels are listed from lowest to highest
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                    ordered = TRUE)

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)
##   StudentID  Name Age CreditsCompleted        Major YearLevel
## 1      S001 Alice  20               45 Data Science Sophomore
## 2      S002  Budi  21               60  Mathematics    Junior
## 3      S003 Citra  19               30   Statistics  Freshman
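
To confirm that R stores each column with the intended type, a quick check along the following lines may help. This is only a minimal sketch: it rebuilds a reduced version of the data frame (the Major column is left out for brevity), and the name `col_classes` is an assumption introduced here for illustration.

```r
# Rebuild a reduced version of the example data frame
students <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Name = c("Alice", "Budi", "Citra"),
  Age = c(20, 21, 19),
  CreditsCompleted = c(45, 60, 30),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# First class of each column (col_classes is a hypothetical helper name)
col_classes <- sapply(students, function(col) class(col)[1])
print(col_classes)

# Compact overview of the structure
str(students)
```

Character columns report as "character", numeric ones as "numeric", and the ordered factor reports "ordered" as its first class.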

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical",
    "Categorical"
  ),
  Subtype = c(
    "Discrete",
    "Continuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types

No  Variable                                                   DataType     Subtype
1   Number of vehicles passing through the toll road each day  Numeric      Discrete
2   Student height in cm                                       Numeric      Continuous
3   Employee gender (Male / Female)                            Categorical  Nominal
4   Customer satisfaction level: Low, Medium, High             Categorical  Ordinal
5   Respondent’s favorite color: Red, Blue, Green              Categorical  Nominal
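
The practical difference between nominal and ordinal variables shows up in what R allows you to do with them. The sketch below uses made-up sample values (an assumption, not data from the exercise) for a satisfaction level and a favorite color.

```r
# Ordinal: satisfaction levels have a meaningful order (sample values are illustrative)
satisfaction <- factor(c("Low", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

# Nominal: colors have no inherent order
color <- factor(c("Red", "Blue", "Green", "Blue"))

# Order-based operations are defined for the ordinal variable
highest <- max(satisfaction)
at_least_medium <- sum(satisfaction >= "Medium")

print(is.ordered(satisfaction))
print(is.ordered(color))
print(as.character(highest))
print(at_least_medium)
```

For the nominal factor, only counting and equality checks are meaningful; comparisons such as `color > "Blue"` return NA with a warning.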

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "internal",
    "external",
    "external",
    "internal"
  ),
  Structured_Unstructured = c(
    "structured",
    "structured",
    "unstructured",
    "structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column
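
Once the sources are classified, a cross-tabulation gives a quick count of each combination. The following is a rough sketch that rebuilds the classification with shortened column names (`Origin` and `Structure` are assumptions made for this example).

```r
# Rebuild the classification with shortened, illustrative column names
data_sources <- data.frame(
  DataSource = c("Daily sales transactions", "BMKG weather reports",
                 "Social media product reviews", "Warehouse inventory reports"),
  Origin = c("internal", "external", "external", "internal"),
  Structure = c("structured", "structured", "unstructured", "structured"),
  stringsAsFactors = FALSE
)

# Count how many sources fall into each Origin x Structure combination
source_counts <- table(data_sources$Origin, data_sources$Structure)
print(source_counts)
```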

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date        Qty  Price  Product   CustomerTier
2025-10-01  2    1000   Laptop    High
2025-10-01  5    20     Mouse     Medium
2025-10-02  1    1000   Laptop    Low
2025-10-02  3    30     Keyboard  Medium
2025-10-03  4    50     Mouse     Medium
2025-10-03  2    1000   Laptop    High
2025-10-04  6    25     Keyboard  Low
2025-10-04  1    1000   Laptop    High
2025-10-05  3    40     Mouse     Low
2025-10-05  5    10     Keyboard  Medium

Assignment instructions: recreate the transactions table above in R.

  1. Create a data frame in R called transactions containing the data above.

  2. Identify which variables are numeric and which are categorical.

  3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.

  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
    • Create a pie chart showing the proportion of total revenue per customer tier.
  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

library(DT)

# Transactions table from the exercise, plus a row index column (No)
transactions <- data.frame(
  No = 1:10,
  Date = c("2025-10-01", "2025-10-01",
           "2025-10-02", "2025-10-02",
           "2025-10-03", "2025-10-03",
           "2025-10-04", "2025-10-04",
           "2025-10-05", "2025-10-05"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop",
              "Keyboard", "Mouse", "Laptop",
              "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low",
                   "Medium", "Medium", "High",
                   "Low", "High", "Low", "Medium")
)

library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:4,
  Variable = c(
    "Qty",
    "Price",
    "Product",
    "CustomerTier"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types

No  Variable      DataType
1   Qty           Numeric
2   Price         Numeric
3   Product       Categorical
4   CustomerTier  Categorical

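
The classification can also be checked programmatically rather than typed by hand. The sketch below is one possible approach, not the only one: it rebuilds the table with Date converted via `as.Date()` (an assumption, since the exercise stores it as text), and the names `numeric_vars` and `text_vars` are introduced here for illustration.

```r
# Rebuild the transactions table; Date is converted to a proper Date column
transactions <- data.frame(
  Date = as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                       "2025-10-04", "2025-10-05"), each = 2)),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

# Columns stored as numbers vs. columns stored as text
numeric_vars <- names(Filter(is.numeric, transactions))
text_vars <- names(Filter(is.character, transactions))

print(numeric_vars)
print(text_vars)
```

Date falls into neither group here, which matches its role as a time variable rather than a measurement or a category. CustomerTier is stored as text but is ordinal in meaning, so it could additionally be converted to an ordered factor.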
# Add a Total column: revenue per transaction (Qty * Price)
transactions$Total <- transactions$Qty * transactions$Price

datatable(transactions,
          caption = "Table of Transactions",
          rownames = FALSE)

# Total quantity sold per product
total_qty_per_product <- aggregate(Qty ~ Product, data = transactions, FUN = sum)
print("Total Quantity Sold per Product:")
## [1] "Total Quantity Sold per Product:"
print(total_qty_per_product)
##    Product Qty
## 1 Keyboard  14
## 2   Laptop   6
## 3    Mouse  12

# Total revenue per product
total_revenue_per_product <- aggregate(Total ~ Product, data = transactions, FUN = sum)
print("Total Revenue per Product:")
## [1] "Total Revenue per Product:"
print(total_revenue_per_product)
##    Product Total
## 1 Keyboard   290
## 2   Laptop  6000
## 3    Mouse   420

# Average price per product
average_price_per_product <- aggregate(Price ~ Product, data = transactions, FUN = mean)
print("Average Price per Product:")
## [1] "Average Price per Product:"
print(average_price_per_product)
##    Product      Price
## 1 Keyboard   21.66667
## 2   Laptop 1000.00000
## 3    Mouse   36.66667

# Bar plot: total quantity sold per product
barplot(
  height = total_qty_per_product$Qty,
  names.arg = total_qty_per_product$Product,
  main = "Total Quantity Sold per Product",
  xlab = "Product",
  ylab = "Quantity",
  col = c("red4", "pink", "hotpink4"),
  ylim = c(0, max(total_qty_per_product$Qty) + 2)
)

# Pie chart: share of total revenue per customer tier
revenue_per_tier <- aggregate(Total ~ CustomerTier, data = transactions, FUN = sum)
total_revenue <- sum(revenue_per_tier$Total)
percentages <- round(revenue_per_tier$Total / total_revenue * 100, 1)
pie_labels <- paste0(revenue_per_tier$CustomerTier, " (", percentages, "%)")

pie(
  x = revenue_per_tier$Total,
  labels = pie_labels,
  main = "Share of Total Revenue per Customer Tier",
  col = c("thistle", "lightpink1", "lightblue")
)

# Optional (2): stacked bar chart of quantity sold per product by customer tier
# Rows are customer tiers (stacked segments); columns are products (bars)
qty_by_tier <- xtabs(Qty ~ CustomerTier + Product, data = transactions)

barplot(
  qty_by_tier,
  main = "Quantity Sold per Product by Customer Tier",
  xlab = "Product",
  ylab = "Quantity",
  col = c("orchid", "thistle", "lightpink1"),
  legend.text = TRUE
)
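
The first optional challenge, finding the date with the highest total revenue, can be approached with `aggregate()` and `which.max()`. The following is a minimal sketch that rebuilds only the columns it needs; `revenue_per_date` and `best_day` are names chosen here for illustration.

```r
# Rebuild the columns needed for the revenue calculation
transactions <- data.frame(
  Date = rep(c("2025-10-01", "2025-10-02", "2025-10-03",
               "2025-10-04", "2025-10-05"), each = 2),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

# Sum revenue per date, then pick the row with the largest total
revenue_per_date <- aggregate(Total ~ Date, data = transactions, FUN = sum)
best_day <- revenue_per_date[which.max(revenue_per_date$Total), ]
print(best_day)
```

With the table from this exercise, the result should be 2025-10-03 with a total of 2200. Note that `which.max()` returns only the first maximum, so ties would need separate handling.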

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)

  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.
  • Use runif() or rnorm() for continuous numeric data.
  • Use sample() for discrete, nominal, and ordinal data.
  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).

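
The hints above can be combined roughly as follows. This is a sketch under stated assumptions, not a model answer: the column names (Temperature, Visitors, City, Satisfaction), the value ranges, the seed, and the object name `example_data` are all invented for illustration.

```r
set.seed(42)  # assumption: fixed seed so the random draws are reproducible
n <- 30

example_data <- data.frame(
  # Date: 30 consecutive days
  Date = seq.Date(from = as.Date("2025-01-01"), by = "day", length.out = n),
  # Continuous: decimal values drawn uniformly between 20 and 35
  Temperature = round(runif(n, min = 20, max = 35), 1),
  # Discrete: whole-number counts
  Visitors = sample(0:50, n, replace = TRUE),
  # Nominal: categories without order
  City = sample(c("Jakarta", "Bandung", "Surabaya"), n, replace = TRUE),
  # Ordinal: categories with an explicit order
  Satisfaction = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE),
  stringsAsFactors = FALSE
)

head(example_data)
str(example_data)
```

Using `seq.Date()` keeps the Date column as a true Date class, so it can be sorted and filtered chronologically, which text labels such as month names do not allow.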
library(DT)
library(knitr)
# my_data: plant development records

# Month of observation (month names cycle from January; 30 entries)
Date <- rep(month.name, length.out = 30)

# Plant names
Name <- c("Rose", "Jasmine", "Orchid", "Cactus",
          "Aloe vera", "Rice", "Corn",
          "Banana", "Mango", "Guava", "Coconut",
          "Tomato", "Chili", "Spinach",
          "Long bean", "Eggplant", "Papaya",
          "Apple", "Strawberry", "Jackfruit", "Durian",
          "Watermelon", "Cucumber", "Carrot",
          "Potato", "Cassava", "Sugarcane", "Coffee",
          "Tea", "Bamboo")

# Plant type (nominal)
Type <- c("Flower", "Flower", "Flower", "Succulent",
          "Succulent", "Food crop",
          "Food crop", "Fruit", "Tree",
          "Tree", "Tree", "Vegetable", "Vegetable",
          "Vegetable", "Vegetable", "Vegetable", "Fruit",
          "Fruit", "Fruit", "Tree", "Tree",
          "Fruit", "Vegetable", "Vegetable", "Vegetable",
          "Food crop", "Food crop",
          "Tree", "Tree", "Tree")

# Plant height (continuous)
Height <- c(35.4, 28.6, 42.3, 25.1, 39.7, 87.2, 120.5, 210.8,
            250.4, 190.2, 340.6, 45.3, 55.9, 30.1, 80.4, 60.7,
            190.8, 230.5, 25.6, 270.9, 310.3, 85.7, 40.2, 33.9,
            27.5, 150.8, 260.1, 170.6, 145.2, 400.7)

# Number of plants (discrete, stored as integer)
Totalplants <- as.integer(c(12, 15, 10, 8, 14, 22, 18,
                            16, 35, 28, 42, 20, 18, 25,
                            30, 17, 27, 33, 14, 40, 45,
                            19, 22, 24, 18, 26, 32,
                            29, 21, 50))

# Growth rating (ordinal in meaning: Poor < Good < Very good)
Growth <- c("Good", "Very good", "Good",
            "Poor", "Good",
            "Very good", "Good", "Good",
            "Very good", "Good",
            "Very good", "Good", "Good",
            "Very good", "Good",
            "Good", "Very good", "Good",
            "Good", "Good", "Very good",
            "Good", "Good", "Good", "Poor",
            "Good", "Very good", "Good", "Good",
            "Very good")

# Combine into a data frame
my_data <- data.frame(Date, Name, Type, Height, Totalplants, Growth)

# Display interactive table
datatable(my_data, caption = "Table of Plant Development", options = list(pageLength = 10))

# Create a second table for variable types
variables_info <- data.frame(
  No = 1:4,
  Variable = c(
    "Height",
    "Growth", 
    "Totalplants",
    "Type"),
  
  DataType = c(
    "Continuous", 
    "Ordinal", 
    "Discrete", 
    "Nominal"),
  stringsAsFactors = FALSE
)

# Display static table
kable(variables_info, caption = "Table of Variables and Data Types", row.names = FALSE)

Table of Variables and Data Types

No  Variable     DataType
1   Height       Continuous
2   Growth       Ordinal
3   Totalplants  Discrete
4   Type         Nominal

# Summarize the data
cat("--- Summary of my_data: summary(my_data) ---\n")
## --- Summary of my_data: summary(my_data) ---
summary(my_data)
##      Date               Name               Type               Height      
##  Length:30          Length:30          Length:30          Min.   : 25.10  
##  Class :character   Class :character   Class :character   1st Qu.: 39.83  
##  Mode  :character   Mode  :character   Mode  :character   Median : 86.45  
##                                                           Mean   :132.87  
##                                                           3rd Qu.:205.80  
##                                                           Max.   :400.70  
##   Totalplants       Growth         
##  Min.   : 8.00   Length:30         
##  1st Qu.:17.25   Class :character  
##  Median :22.00   Mode  :character  
##  Mean   :24.33                     
##  3rd Qu.:29.75                     
##  Max.   :50.00
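
The summary above reports the categorical columns only as character vectors, so the optional frequency counts are not visible. A sketch of how `table()` and an ordered factor could fill that gap is given below; it uses a small stand-in data frame named `demo` with illustrative values (both are assumptions), not the full my_data.

```r
# Small stand-in data frame; names and values are illustrative only
demo <- data.frame(
  Type = c("Flower", "Fruit", "Tree", "Fruit", "Tree", "Tree"),
  Growth = c("Medium", "High", "Medium", "Low", "High", "Medium"),
  stringsAsFactors = FALSE
)

# Convert the ordinal column to an ordered factor so its order is preserved
demo$Growth <- factor(demo$Growth,
                      levels = c("Low", "Medium", "High"),
                      ordered = TRUE)

# Frequency of each category
type_counts <- table(demo$Type)
growth_counts <- table(demo$Growth)
print(type_counts)
print(growth_counts)

# summary() now reports counts per level instead of "Class :character"
summary(demo$Growth)
```

After the conversion, `table()` lists the ordinal categories in their defined order rather than alphabetically.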
