Data Exploration

Exercises ~ Week 3


1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)

StudentID  Name   Age  CreditsCompleted  Major         YearLevel
S001       Alice  20   45                Data Science  Sophomore
S002       Budi   21   60                Mathematics   Junior
S003       Citra  19   30                Statistics    Freshman

# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Science", "Mathematics", "Statistics")

# Ordinal: the levels argument fixes the ranking Freshman < ... < Senior
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                    ordered = TRUE)

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)
##   StudentID  Name Age CreditsCompleted        Major YearLevel
## 1      S001 Alice  20               45 Data Science Sophomore
## 2      S002  Budi  21               60  Mathematics    Junior
## 3      S003 Citra  19               30   Statistics  Freshman
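
As a quick check of the variable types listed above, the following sketch rebuilds a reduced version of the students data frame (three of the six columns, same values) and inspects how R stores each column. The names col_classes and is_upper_year are illustrative and not part of the original exercise.

```r
# Reduced copy of the students data frame from the exercise.
students <- data.frame(
  StudentID = c("S001", "S002", "S003"),
  Age = c(20, 21, 19),
  YearLevel = factor(c("Sophomore", "Junior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                     ordered = TRUE),
  stringsAsFactors = FALSE
)

# First class of each column: character, numeric, and ordered factor.
col_classes <- sapply(students, function(x) class(x)[1])
print(col_classes)

# Ordered factors can be compared against one of their levels.
is_upper_year <- students$YearLevel >= "Junior"
print(is_upper_year)

# Sorting follows the declared level order, not alphabetical order.
print(sort(students$YearLevel))
```

The comparison and the sort only behave this way because YearLevel was created with ordered = TRUE; a plain character vector would sort alphabetically.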

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Quantitative",
    "Quantitative",
    "Qualitative",   # gender is a category, not a measured quantity
    "Qualitative",
    "Qualitative"
  ),
  Subtype = c(
    "Discrete",
    "Continuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types

No  Variable                                                   DataType      Subtype
1   Number of vehicles passing through the toll road each day  Quantitative  Discrete
2   Student height in cm                                       Quantitative  Continuous
3   Employee gender (Male / Female)                            Qualitative   Nominal
4   Customer satisfaction level: Low, Medium, High             Qualitative   Ordinal
5   Respondent’s favorite color: Red, Blue, Green              Qualitative   Nominal
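
The four subtypes in the table map onto different R storage types. The sketch below shows one possible encoding; all sample values are made up for illustration and do not come from the exercise.

```r
# Hypothetical sample values, one vector per subtype.
vehicles_per_day <- c(1520L, 1498L, 1610L)          # quantitative, discrete
height_cm        <- c(158.4, 171.2, 165.0)          # quantitative, continuous
favorite_color   <- factor(c("Red", "Blue", "Red")) # qualitative, nominal
satisfaction     <- factor(c("Low", "High", "Medium"),
                           levels = c("Low", "Medium", "High"),
                           ordered = TRUE)          # qualitative, ordinal

# Counts are stored as integers, measurements as doubles.
print(is.integer(vehicles_per_day))
print(is.double(height_cm))

# Only the ordinal variable carries a ranking between its categories.
print(is.ordered(favorite_color))
print(is.ordered(satisfaction))
```

Using an unordered factor for nominal data and an ordered factor for ordinal data lets functions such as sort() and comparison operators respect the distinction.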

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT") 
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column
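
To see how the four sources spread across the two classifications, a cross-tabulation can help. This is a minimal sketch that repeats the labels from the exercise; the column names Origin and Format and the object source_counts are illustrative choices.

```r
# Same classification as in the exercise, reduced to the two label columns.
data_sources <- data.frame(
  Origin = c("Internal", "External", "External", "Internal"),
  Format = c("Structured", "Structured", "Unstructured", "Structured"),
  stringsAsFactors = FALSE
)

# Count how many sources fall into each Origin x Format combination.
source_counts <- table(data_sources$Origin, data_sources$Format)
print(source_counts)
```

In this small example only the social media reviews land in the External / Unstructured cell, and no internal source is unstructured.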

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date        Quantity  Price  Product   CustomerTier
2025-10-01  2         1000   Laptop    High
2025-10-01  5         20     Mouse     Medium
2025-10-02  1         1000   Laptop    Low
2025-10-02  3         30     Keyboard  Medium
2025-10-03  4         50     Mouse     Medium
2025-10-03  2         1000   Laptop    High
2025-10-04  6         25     Keyboard  Low
2025-10-04  1         1000   Laptop    High
2025-10-05  3         40     Mouse     Low
2025-10-05  5         10     Keyboard  Medium

Assignment instructions: recreate the transactions table above in R.

  1. Create a data frame in R called transactions containing the data above.

  2. Identify which variables are numeric and which are categorical.

  3. Calculate the total revenue for each transaction by multiplying Quantity × Price, and add it as a new column named Total.

  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
    • Create a pie chart showing the proportion of total revenue per customer tier.
  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

# Create the transactions data frame
library(DT)
library(knitr)

transactions <- data.frame(
  No = 1:10,
  Date = as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
                   "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
                   "2025-10-05", "2025-10-05")),
  Quantity = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = factor(c("High", "Medium", "Low", "Medium", "Medium",
                          "High", "Low", "High", "Low", "Medium"),
                        levels = c("Low", "Medium", "High"), ordered = TRUE)
)

transactions$Total <- transactions$Quantity * transactions$Price

str(transactions)
## 'data.frame':    10 obs. of  7 variables:
##  $ No          : int  1 2 3 4 5 6 7 8 9 10
##  $ Date        : Date, format: "2025-10-01" "2025-10-01" ...
##  $ Quantity    : num  2 5 1 3 4 2 6 1 3 5
##  $ Price       : num  1000 20 1000 30 50 1000 25 1000 40 10
##  $ Product     : chr  "Laptop" "Mouse" "Laptop" "Keyboard" ...
##  $ CustomerTier: Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 2 1 2 2 3 1 3 1 2
##  $ Total       : num  2000 100 1000 90 200 2000 150 1000 120 50
datatable(transactions,
          rownames = FALSE,
          caption = htmltools::tags$strong("Transactions Table")
)
# Create a data frame that classifies each variable
library(DT)

categoryVariable <- data.frame(
  Category = c("Date", "Quantity", "Price", "Product", "CustomerTier"),
  Variable = c("Categorical (Date)",
               "Numeric (Discrete)",
               "Numeric (Continuous)",
               "Categorical (Nominal)",
               "Categorical (Ordinal)"),
  stringsAsFactors = FALSE)

str(categoryVariable)
## 'data.frame':    5 obs. of  2 variables:
##  $ Category: chr  "Date" "Quantity" "Price" "Product" ...
##  $ Variable: chr  "Categorical (Date)" "Numeric (Discrete)" "Numeric (Continuous)" "Categorical (Nominal)" ...
datatable(categoryVariable,
          caption = htmltools::tags$strong("Identification Table"))
# Total quantity sold for each product
library(DT)

totalQuantity <- aggregate(Quantity ~ Product, data = transactions, sum)

str(totalQuantity)
## 'data.frame':    3 obs. of  2 variables:
##  $ Product : chr  "Keyboard" "Laptop" "Mouse"
##  $ Quantity: num  14 6 12
datatable(totalQuantity,
          rownames = FALSE,
          caption = htmltools::tags$strong("Total Quantity Table")
)
# Total revenue per product (sum of Total, not the mean)
library(DT)

totalRevenue <- aggregate(Total ~ Product, data = transactions, sum)

str(totalRevenue)
## 'data.frame':    3 obs. of  2 variables:
##  $ Product: chr  "Keyboard" "Laptop" "Mouse"
##  $ Total  : num  290 6000 420
datatable(totalRevenue,
          rownames = FALSE,
          caption = htmltools::tags$strong("Total Revenue Table")
)
# Average price per product
averagePrice <- aggregate(Price ~ Product, data = transactions, mean)

str(averagePrice)
## 'data.frame':    3 obs. of  2 variables:
##  $ Product: chr  "Keyboard" "Laptop" "Mouse"
##  $ Price  : num  21.7 1000 36.7
datatable(averagePrice,
          rownames = FALSE,
          caption = htmltools::tags$strong("Average Price Table")
)
# Bar plot of total quantity sold per product
quantityColors <- c("#a6cba9", "#ecd59f", "#a0ced9")

barplot(totalQuantity$Quantity,
        names.arg = totalQuantity$Product,
        main = "Total Quantity Sold per Product",
        xlab = "Product",
        ylab = "Total Quantity Sold",
        col = quantityColors,
        border = "black")

# Pie chart of total revenue per customer tier
pie_colors <- c("#cdb4db", "#eba7ac", "#a4c0d6")

revenueTier <- aggregate(Total ~ CustomerTier, data = transactions, sum)

pie(revenueTier$Total, 
    labels = paste(revenueTier$CustomerTier, "-", revenueTier$Total),
    main = "Total Revenue by Customer Tier",
    col = pie_colors)
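
The optional challenge in step 6 could be approached as in the following sketch. It rebuilds the transactions data from the table so that it runs on its own; the object names revenue_by_date, best_date, and qty_matrix are illustrative, and xtabs() is used here as one convenient way to get the Product by CustomerTier matrix that barplot() stacks.

```r
# Rebuild the transactions data from the exercise table.
transactions <- data.frame(
  Date = as.Date(c("2025-10-01", "2025-10-01", "2025-10-02", "2025-10-02",
                   "2025-10-03", "2025-10-03", "2025-10-04", "2025-10-04",
                   "2025-10-05", "2025-10-05")),
  Quantity = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = factor(c("High", "Medium", "Low", "Medium", "Medium",
                          "High", "Low", "High", "Low", "Medium"),
                        levels = c("Low", "Medium", "High"), ordered = TRUE)
)
transactions$Total <- transactions$Quantity * transactions$Price

# Date with the highest total revenue.
revenue_by_date <- aggregate(Total ~ Date, data = transactions, sum)
best_date <- revenue_by_date[which.max(revenue_by_date$Total), ]
print(best_date)

# Quantity per tier (rows) and product (columns); barplot() stacks the rows.
qty_matrix <- xtabs(Quantity ~ CustomerTier + Product, data = transactions)
barplot(qty_matrix,
        main = "Quantity Sold per Product by Customer Tier",
        xlab = "Product",
        ylab = "Quantity Sold",
        col = c("#cdb4db", "#eba7ac", "#a4c0d6"),
        legend.text = rownames(qty_matrix))
```

With the data above, 2025-10-03 has the highest revenue (200 from the mouse sale plus 2000 from the laptop sale, 2200 in total).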

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.

  • Use runif() or rnorm() for continuous numeric data.

  • Use sample() for discrete, nominal, and ordinal data.

  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
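
Put together, the hints above could look like the following sketch. The column names, value ranges, city names, and the seed are all assumptions chosen for illustration; any variables that cover the four data types would do.

```r
set.seed(123)  # assumed seed, only so the random values are reproducible
n <- 30

my_data <- data.frame(
  # 30 sequential dates
  Date = seq.Date(as.Date("2024-05-01"), by = "day", length.out = n),
  # Continuous: decimals drawn uniformly between 20 and 35
  Temperature = round(runif(n, min = 20, max = 35), 1),
  # Discrete: whole-number counts
  ItemsSold = sample(0:20, n, replace = TRUE),
  # Nominal: categories without order
  City = sample(c("Jakarta", "Bandung", "Surabaya"), n, replace = TRUE),
  # Ordinal: categories with a declared order
  Level = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                 levels = c("Low", "Medium", "High"),
                 ordered = TRUE),
  stringsAsFactors = FALSE
)

str(my_data)          # check column types and row count
head(my_data)         # first six rows
table(my_data$City)   # frequency of each nominal category
table(my_data$Level)  # frequency of each ordinal category
```

Calling set.seed() before the random draws is optional, but without it every run produces a different data frame, which makes printed output hard to compare.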

Dataset structure: a table of height growth in relation to sleep duration. The satisfaction levels keep the Indonesian labels used as factor levels in the code: Kurang Puas (less satisfied) < Cukup (fair) < Puas (satisfied) < Sangat Puas (very satisfied).

Date        Respondents  Height  Sleep (hours)  Satisfaction
2024-05-03  1            160.00  6              Cukup
2024-05-06  1            160.03  7              Puas
2024-05-09  1            160.06  7              Puas
2024-05-12  1            160.11  8              Sangat Puas
2024-05-15  1            160.12  6              Cukup
2024-05-18  1            160.12  7              Puas
2024-05-21  1            160.15  5              Kurang Puas
2024-05-24  1            160.15  7              Puas
2024-05-27  1            160.18  8              Sangat Puas
2024-05-30  1            160.23  7              Puas
2024-06-03  1            160.26  6              Cukup
2024-06-06  1            160.27  7              Puas
2024-06-09  1            160.30  5              Kurang Puas
2024-06-12  1            160.35  8              Sangat Puas
2024-06-15  1            160.38  7              Puas
2024-06-18  1            160.43  8              Sangat Puas
2024-06-21  1            160.44  6              Cukup
2024-06-24  1            160.47  7              Puas
2024-06-27  1            160.47  5              Kurang Puas
2024-06-30  1            160.52  8              Sangat Puas
2024-07-03  1            160.55  7              Puas
2024-07-06  2            160.56  6              Cukup
2024-07-09  2            160.59  7              Puas
2024-07-12  2            160.64  8              Sangat Puas
2024-07-15  2            160.65  6              Cukup
2024-07-18  2            160.68  7              Puas
2024-07-21  2            160.73  8              Sangat Puas
2024-07-24  2            160.76  7              Puas
2024-07-27  2            160.76  5              Kurang Puas
2024-07-30  2            160.81  8              Sangat Puas
# Create the height-growth data frame.
# Values are drawn at random, so they differ on every run unless set.seed() is called first.

my_data <- data.frame(
  No = 1:30,
  Tanggal = seq(as.Date("2024-05-03"), by = "day", length.out = 30),
  jumlah_responden = rep(1, 30),
  tinggi_badan = runif(30, min = 160, max = 160.81),
  waktu_tidur = sample(5:8, 30, replace = TRUE),
  tingkat_kepuasan = factor(
    sample(c("Kurang Puas", "Cukup", "Puas", "Sangat Puas"), 30, replace = TRUE),
    levels = c("Kurang Puas", "Cukup", "Puas", "Sangat Puas"),
    ordered = TRUE)
)
knitr::kable(head(my_data, 30), row.names = FALSE)
No  Tanggal     jumlah_responden  tinggi_badan  waktu_tidur  tingkat_kepuasan
1   2024-05-03  1                 160.7255      8            Kurang Puas
2   2024-05-04  1                 160.7136      5            Sangat Puas
3   2024-05-05  1                 160.4565      8            Sangat Puas
4   2024-05-06  1                 160.2841      6            Kurang Puas
5   2024-05-07  1                 160.2131      5            Kurang Puas
6   2024-05-08  1                 160.3886      7            Puas
7   2024-05-09  1                 160.1361      5            Kurang Puas
8   2024-05-10  1                 160.4393      7            Sangat Puas
9   2024-05-11  1                 160.0706      6            Puas
10  2024-05-12  1                 160.4376      7            Sangat Puas
11  2024-05-13  1                 160.4536      5            Sangat Puas
12  2024-05-14  1                 160.1650      7            Sangat Puas
13  2024-05-15  1                 160.1183      6            Cukup
14  2024-05-16  1                 160.5876      8            Puas
15  2024-05-17  1                 160.0273      6            Sangat Puas
16  2024-05-18  1                 160.2752      5            Puas
17  2024-05-19  1                 160.7624      7            Kurang Puas
18  2024-05-20  1                 160.4996      8            Sangat Puas
19  2024-05-21  1                 160.2594      5            Puas
20  2024-05-22  1                 160.4866      6            Puas
21  2024-05-23  1                 160.0972      8            Sangat Puas
22  2024-05-24  1                 160.0030      7            Sangat Puas
23  2024-05-25  1                 160.5602      8            Sangat Puas
24  2024-05-26  1                 160.5707      7            Sangat Puas
25  2024-05-27  1                 160.3742      6            Kurang Puas
26  2024-05-28  1                 160.3316      6            Sangat Puas
27  2024-05-29  1                 160.6764      7            Kurang Puas
28  2024-05-30  1                 160.0485      7            Kurang Puas
29  2024-05-31  1                 160.6136      5            Puas
30  2024-06-01  1                 160.3201      8            Sangat Puas
# Summary statistics for each column
summary(my_data)
##        No           Tanggal           jumlah_responden  tinggi_badan  
##  Min.   : 1.00   Min.   :2024-05-03   Min.   :1        Min.   :160.0  
##  1st Qu.: 8.25   1st Qu.:2024-05-10   1st Qu.:1        1st Qu.:160.2  
##  Median :15.50   Median :2024-05-17   Median :1        Median :160.4  
##  Mean   :15.50   Mean   :2024-05-17   Mean   :1        Mean   :160.4  
##  3rd Qu.:22.75   3rd Qu.:2024-05-24   3rd Qu.:1        3rd Qu.:160.5  
##  Max.   :30.00   Max.   :2024-06-01   Max.   :1        Max.   :160.8  
##   waktu_tidur       tingkat_kepuasan
##  Min.   :5.000   Kurang Puas: 8     
##  1st Qu.:6.000   Cukup      : 1     
##  Median :7.000   Puas       : 7     
##  Mean   :6.533   Sangat Puas:14     
##  3rd Qu.:7.000                      
##  Max.   :8.000
# Frequency of each category in the ordinal column
table(my_data$tingkat_kepuasan)
## 
## Kurang Puas       Cukup        Puas Sangat Puas 
##           8           1           7          14
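
Raw counts are often easier to read as shares. The sketch below takes the four counts printed above as a named vector (the names counts, shares, and cumulative_shares are illustrative) and converts them to percentages. Because the categories are ordinal, a cumulative share is also meaningful.

```r
# Counts copied from the table() output above.
counts <- c("Kurang Puas" = 8, "Cukup" = 1, "Puas" = 7, "Sangat Puas" = 14)

# Percentage of the 30 rows in each category.
shares <- round(100 * counts / sum(counts), 1)
print(shares)

# Cumulative percentage, following the order of the levels.
cumulative_shares <- round(100 * cumsum(counts) / sum(counts), 1)
print(cumulative_shares)
```

On a live data frame the same shares could be obtained with prop.table(table(my_data$tingkat_kepuasan)).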