Data Exploration

Exercises ~ Week 3

Logo


1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)
StudentID Name Age CreditsCompleted Major YearLevel
S001 Alice 20 45 Data Sains Sophomore
S002 Budi 21 60 Mathematics Junior
S003 Citra 19 30 Statistics Freshman
# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")  

# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman","Sophomore","Junior","Senior"),
                    ordered = TRUE)          

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)
##   StudentID  Name Age CreditsCompleted       Major YearLevel
## 1      S001 Alice  20               45  Data Sains Sophomore
## 2      S002  Budi  21               60 Mathematics    Junior
## 3      S003 Citra  19               30  Statistics  Freshman

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Quantitative",
    "Quantitative",
    "Qualitative",
    "Qualitative",
    "Qualitative"
  ),
  Subtype = c(
    "Discrete",
    "Countinuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")
Table of Variables and Data Types
No Variable DataType Subtype
1 Number of vehicles passing through the toll road each day Quantitative Discrete
2 Student height in cm Quantitative Countinuous
3 Employee gender (Male / Female) Qualitative Nominal
4 Customer satisfaction level: Low, Medium, High Qualitative Ordinal
5 Respondent’s favorite color: Red, Blue, Green Qualitative Nominal

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date Qty Price Product CustomerTier
2025-10-01 2 1000 Laptop High
2025-10-01 5 20 Mouse Medium
2025-10-02 1 1000 Laptop Low
2025-10-02 3 30 Keyboard Medium
2025-10-03 4 50 Mouse Medium
2025-10-03 2 1000 Laptop High
2025-10-04 6 25 Keyboard Low
2025-10-04 1 1000 Laptop High
2025-10-05 3 40 Mouse Low
2025-10-05 5 10 Keyboard Medium

Your Assignment Instructions: Creating a Transactions Table above in R

  1. Create a data frame in R called transactions containing the data above.

  2. Identify which variables are numeric and which are categorical

  3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.

  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
    • Create a pie chart showing the proportion of total revenue per customer tier.
  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

1. CREATE THE DATA FRAME

transactions <- data.frame(
  Date = as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
                   "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
                   "2025-10-05", "2025-10-05")),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium", "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
  )

# View the dataset contents
kable(transactions)
Date Qty Price Product CustomerTier
2025-10-01 2 1000 Laptop High
2025-10-01 5 20 Mouse Medium
2025-10-02 1 1000 Laptop Low
2025-10-02 3 30 Keyboard Medium
2025-10-03 4 50 Mouse Medium
2025-10-03 2 1000 Laptop High
2025-10-04 6 25 Keyboard Low
2025-10-04 1 1000 Laptop High
2025-10-05 3 40 Mouse Low
2025-10-05 5 10 Keyboard Medium

2. IDENTIFY VARIABLE TYPES

cat_var <- data.frame(
      Category = c("Date", "Qty", "Price", "Product", "CustomersTier"),
      Variable = c("Categorical (Date)", 
                   "Numeric (Discrete)", 
                   "Numeric(Continuous)", 
                   "Categorical (Nominal)", 
                   "Categorical (Ordinal)"),
      stringsAsFactors = FALSE)
                    
#view result
kable(cat_var)
Category Variable
Date Categorical (Date)
Qty Numeric (Discrete)
Price Numeric(Continuous)
Product Categorical (Nominal)
CustomersTier Categorical (Ordinal)

3. ADD NEW COLUMN: TOTAL (Qty × Price)

    transactions$Total <- transactions$Qty * transactions$Price
    cat("Data Frame with Total Column")
## Data Frame with Total Column
#view result
kable(transactions)
Date Qty Price Product CustomerTier Total
2025-10-01 2 1000 Laptop High 2000
2025-10-01 5 20 Mouse Medium 100
2025-10-02 1 1000 Laptop Low 1000
2025-10-02 3 30 Keyboard Medium 90
2025-10-03 4 50 Mouse Medium 200
2025-10-03 2 1000 Laptop High 2000
2025-10-04 6 25 Keyboard Low 150
2025-10-04 1 1000 Laptop High 1000
2025-10-05 3 40 Mouse Low 120
2025-10-05 5 10 Keyboard Medium 50

4. SUMMARY STATISTICS

##a.  Total quantity sold per product
      total_qty <- aggregate(Qty ~ Product, data = transactions, sum)
      cat("Total Quantity Sold per Product")
## Total Quantity Sold per Product
      kable(total_qty)
Product Qty
Keyboard 14
Laptop 6
Mouse 12
##b.  Total revenue per product
      total_revenue <- aggregate(Total ~ Product, data = transactions, sum)
      cat("Total Revenue per Product")
## Total Revenue per Product
      kable(total_revenue)
Product Total
Keyboard 290
Laptop 6000
Mouse 420
##c.  Average price per product
      avg_price <- aggregate(Price ~ Product, data = transactions, mean)
      cat("Average Price per Product")
## Average Price per Product
      kable(avg_price)
Product Price
Keyboard 21.66667
Laptop 1000.00000
Mouse 36.66667

5. VISUALIZE THE DATA

## a. Barplot: Total Quantity per Product

# Sort total_qty from the largest to the smallest Qty
total_qty <- total_qty[order(total_qty$Qty, decreasing = TRUE), ]

barplot(total_qty$Qty,
        names.arg = total_qty$Product,
        col = "skyblue",
        main = "Total Quantity Sold per Product",
        xlab = "Product",
        ylab = "Quantity Sold",
        cex.names = 0.9 # Adjust the product labels to fit properly
)

## b. Pie Chart: Total Revenue per Customer Tier
revenue_by_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum)

# Calculate the percentage
percent <- round(revenue_by_tier$Total / sum(revenue_by_tier$Total) * 100, 1)

# Create labels that combine the category name and percentage
labels <- paste0(revenue_by_tier$CustomerTier, " (", percent, "%)")

# Generate the pie chart directly
pie(revenue_by_tier$Total,
    labels = labels,
    main = "Proportion of Total Revenue by Customer Tier",
    col = c("lightgreen", "gold", "red"))

6. OPTIONAL CHALLENGE

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# 1 Create a transaction dataset
transactions <- data.frame(
  Date = as.Date(c("2025-10-01","2025-10-01","2025-10-02","2025-10-02","2025-10-03",
                   "2025-10-03","2025-10-04","2025-10-04","2025-10-05","2025-10-05")),
  Qty = c(2,5,1,3,4,2,6,1,3,5),
  Price = c(1000,20,1000,30,50,1000,25,1000,40,10),
  Product = c("Laptop","Mouse","Laptop","Keyboard","Mouse","Laptop","Keyboard","Laptop","Mouse","Keyboard"),
  CustomerTier = c("High","Medium","Low","Medium","Medium","High","Low","High","Low","Medium")
)

# 2 Total revenue per transaction
transactions <- transactions %>%
  mutate(TotalRevenue = Qty * Price)

# 3 Total revenue per date
total_per_date <- transactions %>%
  group_by(Date) %>%
  summarise(Total_Revenue = sum(TotalRevenue)) %>%
  arrange(desc(Total_Revenue))

# 4 Find the date with the highest total revenue
highest_revenue_date <- total_per_date %>%
  slice_max(Total_Revenue)

# 5 View the result
print("Tanggal dengan total pendapatan tertinggi:")
## [1] "Tanggal dengan total pendapatan tertinggi:"
highest_revenue_date
library(ggplot2)
library(dplyr)

# Calculate total quantity per product to enable sorting
product_order <- transactions %>%
  group_by(Product) %>%
  summarise(total_qty = sum(Qty)) %>%
  arrange(desc(total_qty)) %>%
  pull(Product)

# Reorder product factor levels based on total_qty (from highest to lowest)
transactions$Product <- factor(transactions$Product, levels = product_order)

# Plot a clean stacked bar chart sorted from highest to lowest
ggplot(transactions, aes(x = Product, y = Qty, fill = CustomerTier)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Jumlah Penjualan per Produk berdasarkan Tingkatan Pelanggan",
    x = "Produk",
    y = "Jumlah Penjualan (Qty)",
    fill = "Customer Tier"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.
  • Use runif() or rnorm() for continuous numeric data.
  • Use sample() for discrete, nominal, and ordinal data.
  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).

1. CREATE THE DATA FRAME

set.seed(123)
Sys.setlocale("LC_TIME", "C")
## [1] "C"
#DATE 
Date <- seq(as.Date("2025-09-01"), as.Date("2025-09-30"), by = "day")

#CONTINOUS (TOTAL SALES)
Total_Sales <- round(rnorm(30, mean = 200000, sd = 50000), 0)

#DISCRETE (QUANTITY)
Quantity <- sample(3:20, 30, replace = TRUE)

#NOMINAL (COFFEE TYPE)
Coffee_Type <- sample(c("Americano", "Latte", "Cappuccino", "Espresso", "Mocha"), 
                      30, replace = TRUE)

#ORDINAL (SALES LEVEL)
Sales_Level <- factor(sample(c("Low", "Medium", "High"), 30, replace = TRUE),
                      levels = c("Low", "Medium", "High"), ordered = TRUE)


# Combine into one table
my_data <- data.frame(
  Date,
  Total_Sales,
  Quantity, 
  Coffee_Type,
  Sales_Level)
#view the contents of the data
kable(my_data)
Date Total_Sales Quantity Coffee_Type Sales_Level
2025-09-01 171976 7 Mocha High
2025-09-02 188491 10 Cappuccino Low
2025-09-03 277935 14 Latte High
2025-09-04 203525 15 Mocha Medium
2025-09-05 206464 20 Mocha Medium
2025-09-06 285753 3 Cappuccino Low
2025-09-07 223046 8 Latte High
2025-09-08 136747 17 Latte Medium
2025-09-09 165657 11 Latte Low
2025-09-10 177717 17 Espresso High
2025-09-11 261204 18 Latte High
2025-09-12 217991 8 Latte High
2025-09-13 220039 13 Espresso Low
2025-09-14 205534 10 Espresso Medium
2025-09-15 172208 9 Americano Medium
2025-09-16 289346 18 Cappuccino Low
2025-09-17 224893 19 Cappuccino Medium
2025-09-18 101669 20 Americano High
2025-09-19 235068 19 Cappuccino High
2025-09-20 176360 4 Mocha Low
2025-09-21 146609 6 Latte Medium
2025-09-22 189101 15 Cappuccino Low
2025-09-23 148700 7 Latte Medium
2025-09-24 163555 16 Mocha High
2025-09-25 168748 5 Mocha High
2025-09-26 115665 10 Cappuccino Low
2025-09-27 241889 18 Espresso High
2025-09-28 207669 14 Espresso Medium
2025-09-29 143093 16 Espresso Low
2025-09-30 262691 5 Mocha Low

2. OPTIONAL TASKS

set.seed(123)
Sys.setlocale("LC_TIME", "C")
## [1] "C"
str(my_data)          # View data structure
## 'data.frame':    30 obs. of  5 variables:
##  $ Date       : Date, format: "2025-09-01" "2025-09-02" ...
##  $ Total_Sales: num  171976 188491 277935 203525 206464 ...
##  $ Quantity   : int  7 10 14 15 20 3 8 17 11 17 ...
##  $ Coffee_Type: chr  "Mocha" "Cappuccino" "Latte" "Mocha" ...
##  $ Sales_Level: Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 1 3 2 2 1 3 2 1 3 ...
summary(my_data)      # Summary of each column
##       Date             Total_Sales        Quantity    Coffee_Type       
##  Min.   :2025-09-01   Min.   :101669   Min.   : 3.0   Length:30         
##  1st Qu.:2025-09-08   1st Qu.:166430   1st Qu.: 8.0   Class :character  
##  Median :2025-09-15   Median :196313   Median :13.5   Mode  :character  
##  Mean   :2025-09-15   Mean   :197645   Mean   :12.4                     
##  3rd Qu.:2025-09-22   3rd Qu.:224431   3rd Qu.:17.0                     
##  Max.   :2025-09-30   Max.   :289346   Max.   :20.0                     
##  Sales_Level
##  Low   :10  
##  Medium: 9  
##  High  :11  
##             
##             
## 
print(my_data$coffee_type)  # Nominal data frequency
## NULL
print(my_data$Sales_Level)  # Ordinal data frequency
##  [1] High   Low    High   Medium Medium Low    High   Medium Low    High  
## [11] High   High   Low    Medium Medium Low    Medium High   High   Low   
## [21] Medium Low    Medium High   High   Low    High   Medium Low    Low   
## Levels: Low < Medium < High