Data Exploration

Exercises ~ Week 3

Logo

1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

Nominal: StudentID, Name, Major
Numeric: Age (continuous), CreditsCompleted (discrete)
Ordinal: YearLevel (Freshman → Senior)

StudentID	Name	Age	CreditsCompleted	Major	YearLevel
S001	Alice	20	45	Data Sains	Sophomore
S002	Budi	21	60	Mathematics	Junior
S003	Citra	19	30	Statistics	Freshman

# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")  

# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman","Sophomore","Junior","Senior"),
                    ordered = TRUE)          

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)

##   StudentID  Name Age CreditsCompleted       Major YearLevel
## 1      S001 Alice  20               45  Data Sains Sophomore
## 2      S002  Budi  21               60 Mathematics    Junior
## 3      S003 Citra  19               30  Statistics  Freshman

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Quantitative",
    "Quantitative",
    "Qualitative",
    "Qualitative",
    "Qualitative"
  ),
  Subtype = c(
    "Discrete",
    "Countinuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types
No	Variable	DataType	Subtype
1	Number of vehicles passing through the toll road each day	Quantitative	Discrete
2	Student height in cm	Quantitative	Countinuous
3	Employee gender (Male / Female)	Qualitative	Nominal
4	Customer satisfaction level: Low, Medium, High	Qualitative	Ordinal
5	Respondent’s favorite color: Red, Blue, Green	Qualitative	Nominal

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date	Qty	Price	Product	CustomerTier
2025-10-01	2	1000	Laptop	High
2025-10-01	5	20	Mouse	Medium
2025-10-02	1	1000	Laptop	Low
2025-10-02	3	30	Keyboard	Medium
2025-10-03	4	50	Mouse	Medium
2025-10-03	2	1000	Laptop	High
2025-10-04	6	25	Keyboard	Low
2025-10-04	1	1000	Laptop	High
2025-10-05	3	40	Mouse	Low
2025-10-05	5	10	Keyboard	Medium

Your Assignment Instructions: Creating a Transactions Table above in R

Create a data frame in R called transactions containing the data above.
Identify which variables are numeric and which are categorical
Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.
Compute summary statistics:
- Total quantity sold for each product
- Total revenue per product
- Average price per product
Visualize the data:
- Create a barplot showing total quantity sold per product.
- Create a pie chart showing the proportion of total revenue per customer tier.
Optional Challenge:
- Find which date had the highest total revenue.
- Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

1. CREATE THE DATA FRAME

transactions <- data.frame(
  Date = as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
                   "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
                   "2025-10-05", "2025-10-05")),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium", "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
  )

# View the dataset contents
kable(transactions)

Date	Qty	Price	Product	CustomerTier
2025-10-01	2	1000	Laptop	High
2025-10-01	5	20	Mouse	Medium
2025-10-02	1	1000	Laptop	Low
2025-10-02	3	30	Keyboard	Medium
2025-10-03	4	50	Mouse	Medium
2025-10-03	2	1000	Laptop	High
2025-10-04	6	25	Keyboard	Low
2025-10-04	1	1000	Laptop	High
2025-10-05	3	40	Mouse	Low
2025-10-05	5	10	Keyboard	Medium

2. IDENTIFY VARIABLE TYPES

cat_var <- data.frame(
      Category = c("Date", "Qty", "Price", "Product", "CustomersTier"),
      Variable = c("Categorical (Date)", 
                   "Numeric (Discrete)", 
                   "Numeric(Continuous)", 
                   "Categorical (Nominal)", 
                   "Categorical (Ordinal)"),
      stringsAsFactors = FALSE)
                    
#view result
kable(cat_var)

Category	Variable
Date	Categorical (Date)
Qty	Numeric (Discrete)
Price	Numeric(Continuous)
Product	Categorical (Nominal)
CustomersTier	Categorical (Ordinal)

3. ADD NEW COLUMN: TOTAL (Qty × Price)

    transactions$Total <- transactions$Qty * transactions$Price
    cat("Data Frame with Total Column")

## Data Frame with Total Column

#view result
kable(transactions)

Date	Qty	Price	Product	CustomerTier	Total
2025-10-01	2	1000	Laptop	High	2000
2025-10-01	5	20	Mouse	Medium	100
2025-10-02	1	1000	Laptop	Low	1000
2025-10-02	3	30	Keyboard	Medium	90
2025-10-03	4	50	Mouse	Medium	200
2025-10-03	2	1000	Laptop	High	2000
2025-10-04	6	25	Keyboard	Low	150
2025-10-04	1	1000	Laptop	High	1000
2025-10-05	3	40	Mouse	Low	120
2025-10-05	5	10	Keyboard	Medium	50

4. SUMMARY STATISTICS

##a.  Total quantity sold per product
      total_qty <- aggregate(Qty ~ Product, data = transactions, sum)
      cat("Total Quantity Sold per Product")

## Total Quantity Sold per Product

      kable(total_qty)

Product	Qty
Keyboard	14
Laptop	6
Mouse	12

##b.  Total revenue per product
      total_revenue <- aggregate(Total ~ Product, data = transactions, sum)
      cat("Total Revenue per Product")

## Total Revenue per Product

      kable(total_revenue)

Product	Total
Keyboard	290
Laptop	6000
Mouse	420

##c.  Average price per product
      avg_price <- aggregate(Price ~ Product, data = transactions, mean)
      cat("Average Price per Product")

## Average Price per Product

      kable(avg_price)

Product	Price
Keyboard	21.66667
Laptop	1000.00000
Mouse	36.66667

5. VISUALIZE THE DATA

## a. Barplot: Total Quantity per Product

# Sort total_qty from the largest to the smallest Qty
total_qty <- total_qty[order(total_qty$Qty, decreasing = TRUE), ]

barplot(total_qty$Qty,
        names.arg = total_qty$Product,
        col = "skyblue",
        main = "Total Quantity Sold per Product",
        xlab = "Product",
        ylab = "Quantity Sold",
        cex.names = 0.9 # Adjust the product labels to fit properly
)

## b. Pie Chart: Total Revenue per Customer Tier
revenue_by_tier <- aggregate(Total ~ CustomerTier, data = transactions, sum)

# Calculate the percentage
percent <- round(revenue_by_tier$Total / sum(revenue_by_tier$Total) * 100, 1)

# Create labels that combine the category name and percentage
labels <- paste0(revenue_by_tier$CustomerTier, " (", percent, "%)")

# Generate the pie chart directly
pie(revenue_by_tier$Total,
    labels = labels,
    main = "Proportion of Total Revenue by Customer Tier",
    col = c("lightgreen", "gold", "red"))

6. OPTIONAL CHALLENGE

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# 1 Create a transaction dataset
transactions <- data.frame(
  Date = as.Date(c("2025-10-01","2025-10-01","2025-10-02","2025-10-02","2025-10-03",
                   "2025-10-03","2025-10-04","2025-10-04","2025-10-05","2025-10-05")),
  Qty = c(2,5,1,3,4,2,6,1,3,5),
  Price = c(1000,20,1000,30,50,1000,25,1000,40,10),
  Product = c("Laptop","Mouse","Laptop","Keyboard","Mouse","Laptop","Keyboard","Laptop","Mouse","Keyboard"),
  CustomerTier = c("High","Medium","Low","Medium","Medium","High","Low","High","Low","Medium")
)

# 2 Total revenue per transaction
transactions <- transactions %>%
  mutate(TotalRevenue = Qty * Price)

# 3 Total revenue per date
total_per_date <- transactions %>%
  group_by(Date) %>%
  summarise(Total_Revenue = sum(TotalRevenue)) %>%
  arrange(desc(Total_Revenue))

# 4 Find the date with the highest total revenue
highest_revenue_date <- total_per_date %>%
  slice_max(Total_Revenue)

# 5 View the result
print("Tanggal dengan total pendapatan tertinggi:")

## [1] "Tanggal dengan total pendapatan tertinggi:"

highest_revenue_date

library(ggplot2)
library(dplyr)

# Calculate total quantity per product to enable sorting
product_order <- transactions %>%
  group_by(Product) %>%
  summarise(total_qty = sum(Qty)) %>%
  arrange(desc(total_qty)) %>%
  pull(Product)

# Reorder product factor levels based on total_qty (from highest to lowest)
transactions$Product <- factor(transactions$Product, levels = product_order)

# Plot a clean stacked bar chart sorted from highest to lowest
ggplot(transactions, aes(x = Product, y = Qty, fill = CustomerTier)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Jumlah Penjualan per Produk berdasarkan Tingkatan Pelanggan",
    x = "Produk",
    y = "Jumlah Penjualan (Qty)",
    fill = "Customer Tier"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold")
  )

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

Open RStudio or the R console.
Create a vector for each column in your data frame:
- Date: 30 dates (can be sequential or random within a month/year)
- Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
- Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
- Nominal: categorical values with no order (e.g., color, gender, city)
- Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
Combine all vectors into a data frame called my_data.
Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.
Optional tasks:
- Summarize each column using summary()
- Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

Use seq.Date() or as.Date() to generate the Date column.
Use runif() or rnorm() for continuous numeric data.
Use sample() for discrete, nominal, and ordinal data.
Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).

1. CREATE THE DATA FRAME

set.seed(123)
Sys.setlocale("LC_TIME", "C")

## [1] "C"

#DATE 
Date <- seq(as.Date("2025-09-01"), as.Date("2025-09-30"), by = "day")

#CONTINOUS (TOTAL SALES)
Total_Sales <- round(rnorm(30, mean = 200000, sd = 50000), 0)

#DISCRETE (QUANTITY)
Quantity <- sample(3:20, 30, replace = TRUE)

#NOMINAL (COFFEE TYPE)
Coffee_Type <- sample(c("Americano", "Latte", "Cappuccino", "Espresso", "Mocha"), 
                      30, replace = TRUE)

#ORDINAL (SALES LEVEL)
Sales_Level <- factor(sample(c("Low", "Medium", "High"), 30, replace = TRUE),
                      levels = c("Low", "Medium", "High"), ordered = TRUE)


# Combine into one table
my_data <- data.frame(
  Date,
  Total_Sales,
  Quantity, 
  Coffee_Type,
  Sales_Level)
#view the contents of the data
kable(my_data)

Date	Total_Sales	Quantity	Coffee_Type	Sales_Level
2025-09-01	171976	7	Mocha	High
2025-09-02	188491	10	Cappuccino	Low
2025-09-03	277935	14	Latte	High
2025-09-04	203525	15	Mocha	Medium
2025-09-05	206464	20	Mocha	Medium
2025-09-06	285753	3	Cappuccino	Low
2025-09-07	223046	8	Latte	High
2025-09-08	136747	17	Latte	Medium
2025-09-09	165657	11	Latte	Low
2025-09-10	177717	17	Espresso	High
2025-09-11	261204	18	Latte	High
2025-09-12	217991	8	Latte	High
2025-09-13	220039	13	Espresso	Low
2025-09-14	205534	10	Espresso	Medium
2025-09-15	172208	9	Americano	Medium
2025-09-16	289346	18	Cappuccino	Low
2025-09-17	224893	19	Cappuccino	Medium
2025-09-18	101669	20	Americano	High
2025-09-19	235068	19	Cappuccino	High
2025-09-20	176360	4	Mocha	Low
2025-09-21	146609	6	Latte	Medium
2025-09-22	189101	15	Cappuccino	Low
2025-09-23	148700	7	Latte	Medium
2025-09-24	163555	16	Mocha	High
2025-09-25	168748	5	Mocha	High
2025-09-26	115665	10	Cappuccino	Low
2025-09-27	241889	18	Espresso	High
2025-09-28	207669	14	Espresso	Medium
2025-09-29	143093	16	Espresso	Low
2025-09-30	262691	5	Mocha	Low

2. OPTIONAL TASKS

set.seed(123)
Sys.setlocale("LC_TIME", "C")

## [1] "C"

str(my_data)          # View data structure

## 'data.frame':    30 obs. of  5 variables:
##  $ Date       : Date, format: "2025-09-01" "2025-09-02" ...
##  $ Total_Sales: num  171976 188491 277935 203525 206464 ...
##  $ Quantity   : int  7 10 14 15 20 3 8 17 11 17 ...
##  $ Coffee_Type: chr  "Mocha" "Cappuccino" "Latte" "Mocha" ...
##  $ Sales_Level: Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 1 3 2 2 1 3 2 1 3 ...

summary(my_data)      # Summary of each column

##       Date             Total_Sales        Quantity    Coffee_Type       
##  Min.   :2025-09-01   Min.   :101669   Min.   : 3.0   Length:30         
##  1st Qu.:2025-09-08   1st Qu.:166430   1st Qu.: 8.0   Class :character  
##  Median :2025-09-15   Median :196313   Median :13.5   Mode  :character  
##  Mean   :2025-09-15   Mean   :197645   Mean   :12.4                     
##  3rd Qu.:2025-09-22   3rd Qu.:224431   3rd Qu.:17.0                     
##  Max.   :2025-09-30   Max.   :289346   Max.   :20.0                     
##  Sales_Level
##  Low   :10  
##  Medium: 9  
##  High  :11  
##             
##             
##

print(my_data$coffee_type)  # Nominal data frequency

## NULL

print(my_data$Sales_Level)  # Ordinal data frequency

##  [1] High   Low    High   Medium Medium Low    High   Medium Low    High  
## [11] High   High   Low    Medium Medium Low    Medium High   High   Low   
## [21] Medium Low    Medium High   High   Low    High   Medium Low    Low   
## Levels: Low < Medium < High