Data Exploration

Exercises ~ Week 3


1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)

##   StudentID  Name Age CreditsCompleted        Major YearLevel
## 1      S001 Alice  20               45 Data Science Sophomore
## 2      S002  Budi  21               60  Mathematics    Junior
## 3      S003 Citra  19               30   Statistics  Freshman
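
As a minimal sketch of how these variable types could be stored in R, the block below rebuilds the three-student table. The choice of `factor()` for Major and an ordered factor for YearLevel is an assumption about how one might encode them, not something the exercise prescribes.

```r
# Re-create the three-student table, using one R type per kind of variable.
students <- data.frame(
  StudentID        = c("S001", "S002", "S003"),    # nominal identifier (text)
  Name             = c("Alice", "Budi", "Citra"),  # nominal (text)
  Age              = c(20, 21, 19),                # numeric
  CreditsCompleted = c(45L, 60L, 30L),             # numeric, discrete (integer)
  Major            = factor(c("Data Science", "Mathematics", "Statistics")),  # nominal
  YearLevel        = factor(
    c("Sophomore", "Junior", "Freshman"),
    levels  = c("Freshman", "Sophomore", "Junior", "Senior"),
    ordered = TRUE                                 # ordinal: Freshman < ... < Senior
  ),
  stringsAsFactors = FALSE
)

str(students)  # one line per column, showing its class

# Ordered factors support comparisons; unordered factors and text do not.
students$YearLevel[1] < students$YearLevel[2]  # Sophomore < Junior
```

Declaring all four levels keeps "Senior" available even though no student in the sample has reached it.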

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

Table of Variables and Data Types

No  Variable                                                   DataType     Subtype
1   Number of vehicles passing through the toll road each day  Numeric      Discrete
2   Student height in cm                                       Numeric      Continuous
3   Employee gender (Male / Female)                            Categorical  Nominal
4   Customer satisfaction level: Low, Medium, High             Categorical  Ordinal
5   Respondent’s favorite color: Red, Blue, Green              Categorical  Nominal

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:


4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date        Qty  Price  Product   CustomerTier
2025-10-01    2   1000  Laptop    High
2025-10-01    5     20  Mouse     Medium
2025-10-02    1   1000  Laptop    Low
2025-10-02    3     30  Keyboard  Medium
2025-10-03    4     50  Mouse     Medium
2025-10-03    2   1000  Laptop    High
2025-10-04    6     25  Keyboard  Low
2025-10-04    1   1000  Laptop    High
2025-10-05    3     40  Mouse     Low
2025-10-05    5     10  Keyboard  Medium

Assignment: re-create the transactions table above in R and analyze it.

  1. Create a data frame in R called transactions containing the data above.

  2. Identify which variables are numeric and which are categorical.

  3. Calculate the total revenue for each transaction as Qty × Price and add it as a new column, Total.

  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
    • Create a pie chart showing the proportion of total revenue per customer tier.
  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

4.1 Data Variable

Data Frame

Date        Qty  Price  Product   CustomerTier
2025-10-01    2   1000  Laptop    High
2025-10-01    5     20  Mouse     Medium
2025-10-02    1   1000  Laptop    Low
2025-10-02    3     30  Keyboard  Medium
2025-10-03    4     50  Mouse     Medium
2025-10-03    2   1000  Laptop    High
2025-10-04    6     25  Keyboard  Low
2025-10-04    1   1000  Laptop    High
2025-10-05    3     40  Mouse     Low
2025-10-05    5     10  Keyboard  Medium
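
One way to build this table with `data.frame()`, as suggested in the hints, is sketched below. Column names follow the assignment table. Storing Date with `as.Date()` rather than as plain text is an assumption made here so that the dates sort and group correctly.

```r
# Build the ten-row transactions table column by column.
transactions <- data.frame(
  Date         = as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                               "2025-10-04", "2025-10-05"), each = 2)),
  Qty          = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price        = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product      = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
                   "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

str(transactions)  # 10 observations of 5 variables, with the class of each column
transactions       # print the full table to compare against the one above
```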

Numeric Data

Numeric (quantitative) data are expressed in numbers that represent counts or measurements. They tell us how much or how many of something there is, and they support arithmetic such as addition, subtraction, and averaging, as well as statistical analysis.

Qty  Price
  2   1000
  5     20
  1   1000
  3     30
  4     50
  2   1000
  6     25
  1   1000
  3     40
  5     10

Categorical Data

Categorical (qualitative) data are expressed as labels, names, or categories rather than numbers. They describe qualities, attributes, or classifications, so arithmetic operations such as addition or subtraction are not meaningful for them.

Date        Product   CustomerTier
2025-10-01  Laptop    High
2025-10-01  Mouse     Medium
2025-10-02  Laptop    Low
2025-10-02  Keyboard  Medium
2025-10-03  Mouse     Medium
2025-10-03  Laptop    High
2025-10-04  Keyboard  Low
2025-10-04  Laptop    High
2025-10-05  Mouse     Low
2025-10-05  Keyboard  Medium
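
The split into numeric and categorical columns can also be derived programmatically rather than by eye. The sketch below assumes the table is read from text with `read.table()`, so Date arrives as a character column and lands in the non-numeric group, matching the grouping shown above. The helper names `is_num`, `numeric_data`, and `categorical_data` are illustrative only.

```r
# Read the table from text; Qty and Price are parsed as integers,
# the remaining columns stay as character strings.
transactions <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Date        Qty Price Product  CustomerTier
2025-10-01  2   1000  Laptop   High
2025-10-01  5   20    Mouse    Medium
2025-10-02  1   1000  Laptop   Low
2025-10-02  3   30    Keyboard Medium
2025-10-03  4   50    Mouse    Medium
2025-10-03  2   1000  Laptop   High
2025-10-04  6   25    Keyboard Low
2025-10-04  1   1000  Laptop   High
2025-10-05  3   40    Mouse    Low
2025-10-05  5   10    Keyboard Medium
")

# TRUE for columns that hold numbers, FALSE otherwise.
is_num <- sapply(transactions, is.numeric)

numeric_data     <- transactions[, is_num]   # Qty, Price
categorical_data <- transactions[, !is_num]  # Date, Product, CustomerTier

names(numeric_data)
names(categorical_data)
```

If Date were converted with `as.Date()`, `is.numeric()` would still return FALSE for it, so the same split would result.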

Total per Transaction

Date        Qty  Price  Product   CustomerTier  Total
2025-10-01    2   1000  Laptop    High           2000
2025-10-01    5     20  Mouse     Medium          100
2025-10-02    1   1000  Laptop    Low            1000
2025-10-02    3     30  Keyboard  Medium           90
2025-10-03    4     50  Mouse     Medium          200
2025-10-03    2   1000  Laptop    High           2000
2025-10-04    6     25  Keyboard  Low             150
2025-10-04    1   1000  Laptop    High           1000
2025-10-05    3     40  Mouse     Low             120
2025-10-05    5     10  Keyboard  Medium           50

Total Quantity per Product

Product   Qty
Keyboard   14
Laptop      6
Mouse      12

Total Revenue per Product

Product   Total
Keyboard    290
Laptop     6000
Mouse       420

Average Price per Product

Product        Price
Keyboard    21.66667
Laptop    1000.00000
Mouse       36.66667
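
The Total column and the three per-product summaries can be reproduced with basic arithmetic and `aggregate()`, as the hints suggest. This is a sketch that re-creates only the columns the summaries need. The result names (`qty_by_product` and so on) are illustrative.

```r
# Only the columns needed for the summaries are re-created here.
transactions <- data.frame(
  Qty     = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price   = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  stringsAsFactors = FALSE
)

# Revenue per transaction: element-wise product of two numeric columns.
transactions$Total <- transactions$Qty * transactions$Price

# One row per product; the formula reads "summarize the left side by the right side".
qty_by_product     <- aggregate(Qty   ~ Product, data = transactions, FUN = sum)
revenue_by_product <- aggregate(Total ~ Product, data = transactions, FUN = sum)
price_by_product   <- aggregate(Price ~ Product, data = transactions, FUN = mean)

qty_by_product
revenue_by_product
price_by_product
```

Note that the average price is an unweighted mean over transactions, not a quantity-weighted average.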

___
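
For the visualization tasks, a base-graphics sketch using `barplot()` and `pie()` might look like the following. Titles, axis labels, and the percentage labels on the pie chart are choices made for this example, not requirements of the assignment.

```r
# Only the columns needed for the plots are re-created here.
transactions <- data.frame(
  Qty          = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price        = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  Product      = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
                   "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

# Named vectors: one value per product / per customer tier.
qty_by_product  <- tapply(transactions$Qty, transactions$Product, sum)
revenue_by_tier <- tapply(transactions$Total, transactions$CustomerTier, sum)

# Bar plot: total quantity sold per product.
barplot(qty_by_product,
        main = "Total quantity sold per product",
        xlab = "Product", ylab = "Quantity")

# Pie chart: share of total revenue per customer tier, labeled with percentages.
pct <- round(100 * revenue_by_tier / sum(revenue_by_tier), 1)
pie(revenue_by_tier,
    labels = paste0(names(revenue_by_tier), " (", pct, "%)"),
    main = "Share of total revenue by customer tier")

# Optional challenge: stacked bars, one bar per product, stacked by tier.
qty_tier_product <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
barplot(qty_tier_product, legend.text = TRUE,
        main = "Quantity sold per product by customer tier",
        xlab = "Product", ylab = "Quantity")
```

Passing a two-way table to `barplot()` stacks the rows (tiers) within each column (product), which is why CustomerTier comes first in the `xtabs()` formula.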

Date with the Highest Total Revenue
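
A possible approach to this optional challenge is to sum Total by Date and pick the row with the largest sum. The sketch below assumes Date is stored with `as.Date()` and re-creates only the columns it needs. Summing the table by hand gives 2100, 1090, 2200, 1150, and 170 for the five dates, so the result should be 2025-10-03.

```r
# Only the columns needed for the daily totals are re-created here.
transactions <- data.frame(
  Date  = as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                        "2025-10-04", "2025-10-05"), each = 2)),
  Qty   = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
)
transactions$Total <- transactions$Qty * transactions$Price

# Total revenue per date, then the row with the maximum.
revenue_by_date <- aggregate(Total ~ Date, data = transactions, FUN = sum)
best_day <- revenue_by_date[which.max(revenue_by_date$Total), ]

revenue_by_date
best_day
```

`which.max()` returns only the first maximum, so ties between dates would need a different check, for example comparing against `max(revenue_by_date$Total)`.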

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head(), str(), or View(), and confirm with nrow() that it has 30 rows and that the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.
  • Use runif() or rnorm() for continuous numeric data.
  • Use sample() for discrete, nominal, and ordinal data.
  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
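
Putting the instructions and hints together, a minimal sketch could look like this. The column names (Temperature, Items, Color, Level), the value ranges, the start date, and the seed are all arbitrary choices for illustration. Any variables that fit the four data types would do.

```r
set.seed(123)  # arbitrary seed so the random columns are reproducible
n <- 30

my_data <- data.frame(
  # 30 consecutive dates
  Date        = seq(as.Date("2025-09-01"), by = "day", length.out = n),
  # continuous: decimal values drawn uniformly between 20 and 35
  Temperature = round(runif(n, min = 20, max = 35), 1),
  # discrete: whole numbers between 1 and 20
  Items       = sample(1:20, n, replace = TRUE),
  # nominal: categories with no order
  Color       = sample(c("Red", "Blue", "Green"), n, replace = TRUE),
  # ordinal: categories with a defined order Low < Medium < High
  Level       = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                       levels = c("Low", "Medium", "High"), ordered = TRUE),
  stringsAsFactors = FALSE
)

nrow(my_data)          # should be 30
head(my_data)          # first six rows
summary(my_data)       # per-column summary
table(my_data$Color)   # frequency of each nominal category
table(my_data$Level)   # frequency of each ordinal category, in level order
```

Because Level is an ordered factor, `table()` lists its categories in the declared order rather than alphabetically.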

SUMMARY

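Column names are in Indonesian: Tanggal (date), Jumlah_Penumpang (number of passengers), Waktu_Tunggu (waiting time), Jenis_Jalur (line type), and Tingkat_Kepadatan (crowding level: Rendah = low, Sedang = medium, Tinggi = high, Sangat Tinggi = very high).

    Tanggal            Jumlah_Penumpang  Waktu_Tunggu    Jenis_Jalur        Tingkat_Kepadatan
 Min.   :2025-09-01   Min.   :4827      Min.   :6.100   Length:30          Rendah       : 7
 1st Qu.:2025-09-08   1st Qu.:4944      1st Qu.:6.800   Class :character   Sedang       :13
 Median :2025-09-15   Median :5086      Median :7.300   Mode  :character   Tinggi       : 6
 Mean   :2025-09-15   Mean   :5142      Mean   :7.327                      Sangat Tinggi: 4
 3rd Qu.:2025-09-22   3rd Qu.:5373      3rd Qu.:7.975
 Max.   :2025-09-30   Max.   :5498      Max.   :8.500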

LINE TYPE, Jenis_Jalur (NOMINAL)

## 
##   Biru  Hijau Kuning  Merah 
##      7      8      8      7

CROWDING LEVEL, Tingkat_Kepadatan (ORDINAL)

## 
##        Rendah        Sedang        Tinggi Sangat Tinggi 
##             7            13             6             4