Data Exploration

Exercises ~ Week 2


Exercises ~ Week 2 (Data Exploration)

1 Exercise 1

The following table shows sample information for three students. Each observation represents a single student and includes details such as their unique student ID, name, age, total credits completed, major field of study, and year level.

This dataset demonstrates a mixture of variable types:

  • Nominal: StudentID, Name, Major
  • Numeric: Age (continuous), CreditsCompleted (discrete)
  • Ordinal: YearLevel (Freshman → Senior)

StudentID  Name   Age  CreditsCompleted  Major        YearLevel
S001       Alice  20   45                Data Sains   Sophomore
S002       Budi   21   60                Mathematics  Junior
S003       Citra  19   30                Statistics   Freshman

# 1. Create vectors for each variable
StudentID <- c("S001", "S002", "S003")       # Nominal / ID
Name <- c("Alice", "Budi", "Citra")          # Nominal / Name
Age <- c(20, 21, 19)                         # Numeric / Continuous
CreditsCompleted <- c(45, 60, 30)            # Numeric / Discrete

# Nominal
Major <- c("Data Sains", "Mathematics", "Statistics")  

# Ordinal
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman","Sophomore","Junior","Senior"),
                    ordered = TRUE)          

# 2. Combine all vectors into a data frame
students <- data.frame(
  StudentID, Name, Age, CreditsCompleted, Major, YearLevel,
  stringsAsFactors = FALSE
)

# 3. Display the data frame
print(students)
##   StudentID  Name Age CreditsCompleted       Major YearLevel
## 1      S001 Alice  20               45  Data Sains Sophomore
## 2      S002  Budi  21               60 Mathematics    Junior
## 3      S003 Citra  19               30  Statistics  Freshman
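
The ordered factor is what makes YearLevel ordinal in R rather than merely nominal. The following is a minimal sketch, rebuilding only the Major and YearLevel vectors from the code above, of how that difference might be checked. The helper names (is_ordinal, above_freshman, highest_level, major_factor) are introduced here for illustration only.

```r
# Rebuild the two categorical vectors from Exercise 1
Major <- c("Data Sains", "Mathematics", "Statistics")
YearLevel <- factor(c("Sophomore", "Junior", "Freshman"),
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                    ordered = TRUE)

# An ordered factor carries its ranking, so comparisons are meaningful
is_ordinal     <- is.ordered(YearLevel)           # TRUE
above_freshman <- YearLevel > "Freshman"          # TRUE TRUE FALSE
highest_level  <- as.character(max(YearLevel))    # "Junior" (no Senior in the data)

# A nominal variable has labels but no ranking
major_factor     <- factor(Major)
major_is_ordinal <- is.ordered(major_factor)      # FALSE
```

Calling max() on major_factor would raise an error, which is one practical sign that a variable is nominal rather than ordinal.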

2 Exercise 2

Identify Data Types: Determine the type of data for each of the following variables:

# Install knitr package if not already installed
# install.packages("knitr")
library(knitr)

# Create a data frame for Data Types
variables_info <- data.frame(
  No = 1:5,
  Variable = c(
    "Number of vehicles passing through the toll road each day",
    "Student height in cm",
    "Employee gender (Male / Female)",
    "Customer satisfaction level: Low, Medium, High",
    "Respondent's favorite color: Red, Blue, Green"
  ),
  DataType = c(
    "Numeric",
    "Numeric",
    "Categorical",
    "Categorical",
    "Categorical"
  ),
  Subtype = c(
    "Discrete",
    "Continuous",
    "Nominal",
    "Ordinal",
    "Nominal"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
kable(variables_info, 
      caption = "Table of Variables and Data Types")

Table of Variables and Data Types

No  Variable                                                    DataType     Subtype
1   Number of vehicles passing through the toll road each day   Numeric      Discrete
2   Student height in cm                                        Numeric      Continuous
3   Employee gender (Male / Female)                             Categorical  Nominal
4   Customer satisfaction level: Low, Medium, High              Categorical  Ordinal
5   Respondent’s favorite color: Red, Blue, Green               Categorical  Nominal
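
As a rough illustration of how these five classifications could map onto R storage types, the sketch below encodes one small vector per variable. All sample values are hypothetical and invented only for this example; they are not part of the exercise data.

```r
# Hypothetical sample values, invented only to illustrate each type
vehicles_per_day <- c(1520L, 1734L, 1611L)                  # numeric, discrete (counts)
height_cm        <- c(158.4, 171.2, 165.0)                  # numeric, continuous
gender           <- factor(c("Male", "Female", "Female"))   # categorical, nominal
satisfaction     <- factor(c("Low", "High", "Medium"),
                           levels = c("Low", "Medium", "High"),
                           ordered = TRUE)                  # categorical, ordinal
favorite_color   <- factor(c("Red", "Blue", "Green"))       # categorical, nominal

# Storage class R assigns to each variable
var_classes <- c(
  vehicles_per_day = class(vehicles_per_day),   # "integer"
  height_cm        = class(height_cm),          # "numeric"
  gender           = class(gender),             # "factor"
  satisfaction     = class(satisfaction)[1],    # "ordered"
  favorite_color   = class(favorite_color)      # "factor"
)
print(var_classes)
```

Storing counts as integers (the L suffix) is a design choice rather than a requirement; R would also accept them as ordinary doubles.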

3 Exercise 3

Classify Data Sources: Determine whether the following data comes from internal or external sources, and whether it is structured or unstructured:

# Install DT package if not already installed
# install.packages("DT")
library(DT)

# Create a data frame for data sources 
data_sources <- data.frame(
  No = 1:4,
  DataSource = c(
    "Daily sales transaction data of the company",
    "Weather reports from BMKG",
    "Product reviews on social media",
    "Warehouse inventory reports"
  ),
  Internal_External = c(
    "Internal",
    "External",
    "External",
    "Internal"
  ),
  Structured_Unstructured = c(
    "Structured",
    "Structured",
    "Unstructured",
    "Structured"
  ),
  stringsAsFactors = FALSE
)

# Display the data frame as a neat table
datatable(data_sources, 
          caption = "Table of Data Sources",
          rownames = FALSE) # hides the index column
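
Once the classifications are stored in a data frame, a cross-tabulation can summarise how many sources fall into each combination. The sketch below rebuilds the same four rows and uses base R table(); the dimension labels Origin and Format are names chosen here for illustration.

```r
# Rebuild the classification from Exercise 3
data_sources <- data.frame(
  DataSource = c("Daily sales transaction data of the company",
                 "Weather reports from BMKG",
                 "Product reviews on social media",
                 "Warehouse inventory reports"),
  Internal_External = c("Internal", "External", "External", "Internal"),
  Structured_Unstructured = c("Structured", "Structured",
                              "Unstructured", "Structured"),
  stringsAsFactors = FALSE
)

# Count sources by origin (rows) and format (columns)
source_counts <- table(Origin = data_sources$Internal_External,
                       Format = data_sources$Structured_Unstructured)
print(source_counts)
```

With only four sources the table is small, but the same call scales to a longer inventory of data sources.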

4 Exercise 4

Dataset Structure: Consider the following transaction table:

Date        Qty  Price  Product   CustomerTier
2025-10-01  2    1000   Laptop    High
2025-10-01  5    20     Mouse     Medium
2025-10-02  1    1000   Laptop    Low
2025-10-02  3    30     Keyboard  Medium
2025-10-03  4    50     Mouse     Medium
2025-10-03  2    1000   Laptop    High
2025-10-04  6    25     Keyboard  Low
2025-10-04  1    1000   Laptop    High
2025-10-05  3    40     Mouse     Low
2025-10-05  5    10     Keyboard  Medium

Assignment instructions: recreate the transactions table above in R.

  1. Create a data frame in R called transactions containing the data above.
# Import library
library(dplyr)   # For data manipulation
library(ggplot2) # For creating visualizations

# Create the transactions dataset
transactions <- data.frame(
  Date = as.Date(c("2025-10-01","2025-10-01","2025-10-02",
                   "2025-10-02","2025-10-03","2025-10-03",
                   "2025-10-04","2025-10-04","2025-10-05",
                   "2025-10-05")),
  Product = c("Laptop","Mouse","Laptop","Keyboard",
              "Mouse","Laptop","Keyboard","Laptop",
              "Mouse","Keyboard"),
  Qty = c(2,5,1,3,4,2,6,1,3,5),
  Price = c(1000,20,1000,30,50,1000,25,1000,40,10),
  CustomerTier =c("High","Medium","Low","Medium","Medium",
                  "High","Low","High","Low","Medium"),
  stringsAsFactors = FALSE)

# Display the transactions as a paginated, scrollable table
datatable(transactions, caption = "Transaction Data Table", 
          options = list(pageLength = 10, scrollX = TRUE))
  2. Identify which variables are numeric and which are categorical.
# View data structure
str(transactions)        
## 'data.frame':    10 obs. of  5 variables:
##  $ Date        : Date, format: "2025-10-01" "2025-10-01" ...
##  $ Product     : chr  "Laptop" "Mouse" "Laptop" "Keyboard" ...
##  $ Qty         : num  2 5 1 3 4 2 6 1 3 5
##  $ Price       : num  1000 20 1000 30 50 1000 25 1000 40 10
##  $ CustomerTier: chr  "High" "Medium" "Low" "Medium" ...
# Generate summary statistics for each column
summary(transactions)     
##       Date              Product               Qty           Price        
##  Min.   :2025-10-01   Length:10          Min.   :1.00   Min.   :  10.00  
##  1st Qu.:2025-10-02   Class :character   1st Qu.:2.00   1st Qu.:  26.25  
##  Median :2025-10-03   Mode  :character   Median :3.00   Median :  45.00  
##  Mean   :2025-10-03                      Mean   :3.20   Mean   : 417.50  
##  3rd Qu.:2025-10-04                      3rd Qu.:4.75   3rd Qu.:1000.00  
##  Max.   :2025-10-05                      Max.   :6.00   Max.   :1000.00  
##  CustomerTier      
##  Length:10         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
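
The str() output answers the question by inspection: Qty and Price are numeric, Product and CustomerTier are categorical (stored as character), and Date is a date. A minimal sketch of doing the same split programmatically is shown below; the names is_num, numeric_vars and other_vars are introduced here for illustration.

```r
# Rebuild the transactions data from the table above
transactions <- data.frame(
  Date = as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                       "2025-10-04", "2025-10-05"), each = 2)),
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

# Flag numeric columns; is.numeric() returns FALSE for Date columns
is_num <- sapply(transactions, is.numeric)

numeric_vars <- names(transactions)[is_num]    # "Qty" "Price"
other_vars   <- names(transactions)[!is_num]   # "Date" "Product" "CustomerTier"
```

CustomerTier is ordinal in meaning (Low, Medium, High), so converting it to an ordered factor would be a reasonable next step if the ranking matters for the analysis.
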
  3. Calculate total revenue for each transaction by multiplying Qty × Price and add it as a new column Total.
# Add a new column 'Total' that multiplies Qty * Price
transactions$Total <- transactions$Qty * transactions$Price

# Display updated dataset with the new column
datatable(transactions, caption = "Transactions with Total Revenue Column",
          options = list(pageLength = 10, scrollX = TRUE))
  4. Compute summary statistics:

    • Total quantity sold for each product
    • Total revenue per product
    • Average price per product
# Total quantity sold for each product
total_qty <- aggregate(Qty ~ Product, data = transactions, sum)

# Total revenue (Total) for each product
total_revenue <- aggregate(Total ~ Product, data = transactions, sum) 

# Average price per product
avg_price <- aggregate(Price ~ Product, data = transactions, mean)

# Display all results in interactive tables
datatable(total_qty, caption = "Total Quantity per Product")
datatable(total_revenue, caption = "Total Revenue per Product")
datatable(avg_price, caption = "Average Price per Product")

Product   Total_Qty  Total_Revenue  Avg_Price
Laptop    6          6000           1000
Mouse     12         420            36.67
Keyboard  14         290            21.67

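
The three aggregate() calls above produce three separate tables, while the summary lists all figures side by side. One possible way to build that combined table in base R is sketched below; the object name product_summary is an assumption made for this example, and merge() returns the rows in alphabetical order of Product rather than the order shown above.

```r
# Only the columns needed for the per-product summary
transactions <- data.frame(
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  Price = c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10),
  stringsAsFactors = FALSE
)
transactions$Total <- transactions$Qty * transactions$Price

# Sum quantity and revenue in one call, then average the price
totals    <- aggregate(cbind(Qty, Total) ~ Product, data = transactions, FUN = sum)
avg_price <- aggregate(Price ~ Product, data = transactions, FUN = mean)

# Join the two results on Product and rename the columns
product_summary <- merge(totals, avg_price, by = "Product")
names(product_summary) <- c("Product", "Total_Qty", "Total_Revenue", "Avg_Price")
product_summary$Avg_Price <- round(product_summary$Avg_Price, 2)

print(product_summary)
```

Note that Avg_Price is the unweighted mean of the listed prices per product, not revenue divided by quantity.
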
  5. Visualize the data:

    • Create a barplot showing total quantity sold per product.
# Bar chart: total quantity sold per product
barplot(total_qty$Qty,
  names.arg = total_qty$Product,
  main = "Total Quantity Sold per Product",
  xlab = "Product",
  ylab = "Total Quantity",
  # Adds nice colors
  col = c("skyblue", "lightgreen", "salmon"))

  • Create a pie chart showing the proportion of total revenue per customer tier.

How the Calculation Works

  1. Calculate total revenue for each transaction: Total = Qty * Price

  2. Group by customer tier to get the sum of total revenue per tier. Example:

    • High = 2000 + 2000 + 1000 = 5000
    • Medium = (5×20)+(3×30)+(4×50)+(5×10) = 440
    • Low = (1×1000)+(6×25)+(3×40) = 1270
  3. Find the overall total revenue: 5000 + 440 + 1270 = 6710

  4. Calculate each tier’s percentage: \[ \text{Percentage} = \frac{\text{Total Revenue per Tier}}{\text{Overall Total Revenue}} \times 100 \]
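
The four steps above can be checked with a few lines of base R. This is a minimal sketch using plain vectors copied from the transaction table; tapply() is used here in place of the dplyr pipeline, and the names tier_revenue, overall and tier_pct are chosen for this example.

```r
# Columns copied from the transaction table
Qty   <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)
CustomerTier <- c("High", "Medium", "Low", "Medium", "Medium",
                  "High", "Low", "High", "Low", "Medium")

# Step 1: revenue per transaction
Total <- Qty * Price

# Step 2: revenue per tier (names are sorted alphabetically: High, Low, Medium)
tier_revenue <- tapply(Total, CustomerTier, sum)   # 5000, 1270, 440

# Step 3: overall revenue
overall <- sum(tier_revenue)                       # 6710

# Step 4: percentage share per tier, rounded to one decimal
tier_pct <- round(tier_revenue / overall * 100, 1) # 74.5, 18.9, 6.6
```

The High tier accounts for roughly three quarters of revenue because every laptop sale except one falls in that tier.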

# Pie chart: share of total revenue per customer tier
library(dplyr)
library(ggplot2)

# Compute total revenue for each customer tier
tier_revenue <- transactions %>%
  mutate(Total = Qty * Price) %>%
  group_by(CustomerTier) %>%
  summarise(Total_Revenue = sum(Total)) %>%
  mutate(Percentage = Total_Revenue / sum(Total_Revenue) * 100)

# Create a pie chart showing the percentage of total revenue per tier
ggplot(tier_revenue, aes(x = "", y = Total_Revenue, fill = CustomerTier)) +
  geom_bar(stat = "identity", width = 1, color = "white") + 
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 5) +
  scale_fill_brewer(palette = "Set2") +
  theme_void() +
  labs(title = "Proportion of Total Revenue per Customer Tier", fill = "Customer Tier") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))

  6. Optional Challenge:

    • Find which date had the highest total revenue.
    • Create a stacked bar chart showing quantity sold per product by customer tier.

Hints: Use data.frame(), aggregate(), barplot(), pie(), and basic arithmetic operations in R.

# Calculate total revenue per date
revenue_date <- aggregate(Total ~ Date, data = transactions, sum)

# Display all results
datatable(revenue_date, caption = "Total Revenue per Date")
# Identify the date with the highest total revenue
highest_revenue <- revenue_date[which.max(revenue_date$Total), ]

# Print the highest revenue date
highest_revenue
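
The printed result of highest_revenue is not reproduced here. As a cross-check, the sketch below recomputes the daily totals from the table values alone; the object names daily and best are assumptions made for this example.

```r
# Dates, quantities and prices copied from the transaction table
Date  <- as.Date(rep(c("2025-10-01", "2025-10-02", "2025-10-03",
                       "2025-10-04", "2025-10-05"), each = 2))
Qty   <- c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5)
Price <- c(1000, 20, 1000, 30, 50, 1000, 25, 1000, 40, 10)

# Revenue per transaction, then summed per date
daily <- data.frame(Date = Date, Total = Qty * Price)
revenue_date <- aggregate(Total ~ Date, data = daily, FUN = sum)
# Daily totals: 2100, 1090, 2200, 1150, 170

# Row with the largest daily total
best <- revenue_date[which.max(revenue_date$Total), ]
print(best)   # 2025-10-03 with a total of 2200
```

which.max() returns only the first maximum, so a tie between two dates would need a different approach, such as filtering on Total == max(Total).
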
# Group data by product and customer tier
qty_tier <- transactions %>%
  group_by(Product, CustomerTier) %>%
  summarise(Qty = sum(Qty), .groups = "drop")

# Create stacked bar chart with ggplot2
ggplot(qty_tier, aes(x = Product, y = Qty, fill = CustomerTier)) +
  geom_col() +
  labs(title = "Quantity Sold by Product and Customer Tier",
       x = "Product",
       y = "Total Quantity",
       fill = "Customer Tier") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 14))
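
Since the hints point to base R functions, the same quantities can also be arranged as a tier-by-product matrix with xtabs(), which is the shape that barplot() stacks. The sketch below builds only the matrix; the name qty_matrix is chosen for this example.

```r
# Columns needed for the product-by-tier breakdown
transactions <- data.frame(
  Product = c("Laptop", "Mouse", "Laptop", "Keyboard", "Mouse",
              "Laptop", "Keyboard", "Laptop", "Mouse", "Keyboard"),
  Qty = c(2, 5, 1, 3, 4, 2, 6, 1, 3, 5),
  CustomerTier = c("High", "Medium", "Low", "Medium", "Medium",
                   "High", "Low", "High", "Low", "Medium"),
  stringsAsFactors = FALSE
)

# Rows = customer tier, columns = product, cells = summed quantity
qty_matrix <- xtabs(Qty ~ CustomerTier + Product, data = transactions)
print(qty_matrix)

# Column sums reproduce the total quantity per product
product_totals <- colSums(qty_matrix)   # Keyboard 14, Laptop 6, Mouse 12
```

Passing the result to barplot(qty_matrix, legend.text = TRUE) should draw one stacked bar per product, with one segment per tier.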

5 Exercise 5

Create Your Own Data Frame:

Objective: Create a data frame in R with 30 rows containing a mix of data types: continuous, discrete, nominal, and ordinal.

5.1 Instructions

  1. Open RStudio or the R console.

  2. Create a vector for each column in your data frame:

    • Date: 30 dates (can be sequential or random within a month/year)
    • Continuous: numeric values that can take decimal values (e.g., height, weight, temperature)
    • Discrete: numeric values that can only take whole numbers (e.g., number of items, number of vehicles)
    • Nominal: categorical values with no order (e.g., color, gender, city)
    • Ordinal: categorical values with a defined order (e.g., Low, Medium, High; Beginner, Intermediate, Expert)
  3. Combine all vectors into a data frame called my_data.

  4. Check your data frame using head() or View() to ensure it has 30 rows and the columns are correct.

  5. Optional tasks:

    • Summarize each column using summary()
    • Count the frequency of each category for Nominal and Ordinal columns using table()

5.2 Hints

  • Use seq.Date() or as.Date() to generate the Date column.
  • Use runif() or rnorm() for continuous numeric data.
  • Use sample() for discrete, nominal, and ordinal data.
  • Ensure the ordinal vector is created with factor(..., levels = c("Low","Medium","High"), ordered = TRUE) (or similar).
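
The hints above can be combined as in the following sketch. The column names (Temperature, Visitors, City, Rating), the seed, the distribution parameters and the object name sketch_data are all hypothetical choices made for illustration and are not part of the exercise.

```r
set.seed(42)   # arbitrary seed so the sketch is reproducible
n <- 30

# Date: 30 consecutive days
Date <- seq(as.Date("2025-01-01"), by = "day", length.out = n)

# Continuous: simulated temperatures with one decimal place
Temperature <- round(rnorm(n, mean = 27, sd = 2), 1)

# Discrete: whole-number counts
Visitors <- sample(0:50, n, replace = TRUE)

# Nominal: unordered categories
City <- sample(c("Jakarta", "Bandung", "Surabaya"), n, replace = TRUE)

# Ordinal: ordered categories
Rating <- factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                 levels = c("Low", "Medium", "High"),
                 ordered = TRUE)

sketch_data <- data.frame(Date, Temperature, Visitors, City, Rating)
head(sketch_data)
```

Using seq() with by = "day" gives 30 distinct dates, whereas rep() with a single date gives the same date in every row; either satisfies the instructions.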

Title: “GPA of Female Data Science Students in 2025”

# Generate the Date column (all same for simplicity)
Date <- rep(as.Date("2025-06-29"), 30)

# Nominal variable: Name of student
set.seed(123)
Name <- sample(c("Aurelia Grace","Elara Luna",
                 "Isabella Rose","Luna Amara",
                 "Keyra Belle","Amara Jade",
                 "Aurora Mae","Lyra Elisa",
                 "Selena Claire","Iris Bella",
                 "Aisyah Zahra","Nadira Safira",
                 "Kirana Putri","Salsabila Hanum",
                 "Hanna Amirah","Anindya Cahaya",
                 "Zahra Lestari","Rahma Aminah",
                 "Alya Syafira","Cahaya Putri",
                 "Amara Kyra","Ellena ray",
                 "Valerrina Rosi","Clara zune",
                 "Valeria Dawn","Luna Rayila",
                 "Sofiabelle","Diana Fleur",
                 "Aurora Sky","Isella Grace"),
                  30, replace = FALSE)

# Continuous variable: GPA 
GPA <- runif(30, min = 2.0, max = 4.0)

# Discrete variable: number of course credits taken
CourseCredits <- sample(18:24, 30, replace = TRUE)

# Ordinal variable: semester levels
Semester <- factor(sample(c("Semester 1","Semester 2",
                            "Semester 3","Semester 4",
                            "Semester 5","Semester 6",
                            "Semester 7","Semester 8"),
                             30, replace = TRUE),
                   levels = c("Semester 1","Semester 2",
                              "Semester 3","Semester 4",
                              "Semester 5","Semester 6",
                              "Semester 7","Semester 8"),
                               ordered = TRUE)

# Combine all vectors into a data frame
my_data <- data.frame(Date, Name, GPA, CourseCredits, Semester)

my_data                 # Display the dataset
head(my_data)           # Display the first six rows of the dataset
nrow(my_data)           # Show number of rows
## [1] 30
summary(my_data)        # Show summary statistics for each  column
##       Date                Name                GPA        CourseCredits  
##  Min.   :2025-06-29   Length:30          Min.   :2.001   Min.   :18.00  
##  1st Qu.:2025-06-29   Class :character   1st Qu.:2.483   1st Qu.:19.25  
##  Median :2025-06-29   Mode  :character   Median :2.914   Median :21.50  
##  Mean   :2025-06-29                      Mean   :2.960   Mean   :21.30  
##  3rd Qu.:2025-06-29                      3rd Qu.:3.508   3rd Qu.:23.00  
##  Max.   :2025-06-29                      Max.   :3.790   Max.   :24.00  
##                                                                         
##        Semester
##  Semester 2:6  
##  Semester 6:6  
##  Semester 8:5  
##  Semester 7:4  
##  Semester 3:3  
##  Semester 5:3  
##  (Other)   :3
table(my_data$Semester) # Show frequency table for Semester variable
## 
## Semester 1 Semester 2 Semester 3 Semester 4 Semester 5 Semester 6 Semester 7 
##          1          6          3          2          3          6          4 
## Semester 8 
##          5
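
Because Semester is ordinal, cumulative frequencies are meaningful in addition to raw counts. The sketch below starts from the counts printed by table() above, typed in by hand, so it reflects that particular random draw; the names semester_counts, rel_freq and cum_freq are chosen for this example.

```r
# Counts copied from the table() output above
semester_counts <- c("Semester 1" = 1, "Semester 2" = 6, "Semester 3" = 3,
                     "Semester 4" = 2, "Semester 5" = 3, "Semester 6" = 6,
                     "Semester 7" = 4, "Semester 8" = 5)

n_students <- sum(semester_counts)                          # 30

# Relative frequency in percent, one decimal place
rel_freq <- round(semester_counts / n_students * 100, 1)
# 3.3, 20.0, 10.0, 6.7, 10.0, 20.0, 13.3, 16.7

# Cumulative count: students in this semester or an earlier one
cum_freq <- cumsum(semester_counts)
# 1, 7, 10, 12, 15, 21, 25, 30
```

For example, 12 of the 30 simulated students are in Semester 4 or earlier. On the real data frame the same figures could be obtained with prop.table(table(my_data$Semester)) and cumsum(table(my_data$Semester)).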

Visualization for Exercise 5

library(ggplot2)  # Load ggplot2 for data visualization
library(dplyr)    # Load dplyr for data manipulation

# Calculate the average GPA for each semester
avg_gpa <- my_data %>%               
  group_by(Semester) %>%             # Group data by the Semester variable
  summarise(Average_GPA = mean(GPA)) # Compute the mean GPA within each group

# Create a bar chart (column chart) to show the average GPA per semester
ggplot(avg_gpa, aes(x = Semester, y = Average_GPA, fill = Semester)) +   
  geom_col() + # Draw the bars (one for each semester)
  labs(title = "Average GPA per Semester (2025)",  # Add chart title
       x = "Semester",                             # Label for x
       y = "Average GPA") +                        # Label for y
  theme_minimal() + # Use a clean, minimal theme
  # Rotate x-axis labels for readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1))