knitr::opts_chunk$set(echo = TRUE, include = TRUE, message = FALSE, warning = FALSE)

# Install packages if not already installed
if (!require("text2vec")) install.packages("text2vec")
if (!require("geometry")) install.packages("geometry")
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("GGally")) install.packages("GGally")
if (!require("plotly")) install.packages("plotly")
# Load libraries
library(text2vec)
library(geometry)
library(ggplot2)
library(tidyverse)
library(GGally)
library(plotly)

Introduction

The purpose of this prototype demonstration is to show how items can be constructed and analyzed without response data. The demonstration uses item embeddings to examine similarities between items based on the words contained in the item stems; no response options were created. An item embedding represents the words of an item as numbers. Those numbers form a vector for each item, which can then be summarized much as variables are summarized by factors or principal components in more typical data reduction techniques. In fact, we will use PCA to create components.

Data Preparation

Let’s create some items. Twenty items related to knowledge of tools were created for five different crafts. An additional 10 items were created that represent safety regulations for mobile cranes. Thus, we could interpret the overall structure as having two dimensions (tool knowledge and crane safety). If we were more specific, we could interpret the structure as having six dimensions (the five crafts plus crane safety). The items are as follows:

items <- c(
  "What is the primary function of a spirit level in carpentry?",
  "Describe the difference between a flathead and Phillips head screwdriver.",
  "How is a reciprocating saw used in construction?",
  "What is the purpose of a multimeter in electrical work?",
  "Explain the function of wire strippers in electrical installations.",
  "What type of welding does an MIG welder perform?",
  "How is a pipe wrench different from a regular wrench?",
  "What is the main use of a pipe cutter in plumbing?",
  "Describe the function of a manifold gauge set in HVAC work.",
  "What is the purpose of a framing square in carpentry?",
  "How is a circular saw used in construction?",
  "What is the primary function of wire nuts in electrical work?",
  "Explain the use of a welding helmet in construction.",
  "What is the purpose of a pipe threader in plumbing?",
  "How is a vacuum gauge used in HVAC systems?",
  "What is the main function of a chalk line in carpentry?",
  "Describe the use of a conduit bender in electrical installations.",
  "What is the purpose of a welding clamp in metalwork?",
  "How is a pipe reamer used in plumbing?",
  "What is the function of a refrigerant recovery machine in HVAC?",
  "What is the minimum distance a mobile crane should maintain from power lines?",
  "How often should a mobile crane undergo a thorough inspection?",
  "What personal protective equipment (PPE) is required for crane operators?",
  "What weather conditions prohibit the operation of a mobile crane?",
  "How should the load chart be used when operating a mobile crane?",
  "What is the importance of outriggers in mobile crane operation?",
  "How should hand signals be used during crane operations?",
  "What is the purpose of a pre-operational inspection for a mobile crane?",
  "How should the work area be secured during mobile crane operations?",
  "What is the maximum allowable wind speed for safe crane operations?"
  )

Embeddings

Next, we’ll create the embeddings for our item set. TF-IDF (term frequency-inverse document frequency) is a technique for weighting the importance of a word in a document relative to a corpus (i.e., a collection of documents; here, each item stem is one document). High values mean that a word is frequent within a particular document but infrequent across the rest of the corpus.

# Preprocess the text
it <- itoken(items, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Create document-term matrix
dtm <- create_dtm(it, vectorizer)

# Create TF-IDF model
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)

# Use TF-IDF matrix as our embeddings
embeddings <- as.matrix(dtm_tfidf)
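
To make the weighting concrete, here is a minimal base-R sketch of TF-IDF on a toy corpus. The three short documents and all `toy_` names are hypothetical and are not part of the analysis above. The sketch uses the textbook formula (term frequency times log of the number of documents over the number of documents containing the term); text2vec's `TfIdf` applies its own smoothing and normalization defaults, so its values will differ in detail.

```r
# Toy corpus (hypothetical, not the 30 items above)
toy_docs <- c(
  "pipe wrench in plumbing",
  "pipe cutter in plumbing",
  "welding helmet in construction"
)
toy_tokens <- strsplit(tolower(toy_docs), " ", fixed = TRUE)
toy_vocab <- sort(unique(unlist(toy_tokens)))

# Term frequency: share of each document's tokens taken up by each term
toy_tf <- t(sapply(toy_tokens, function(tok) {
  as.numeric(table(factor(tok, levels = toy_vocab))) / length(tok)
}))
colnames(toy_tf) <- toy_vocab

# Inverse document frequency: log(documents / documents containing the term)
toy_df <- colSums(toy_tf > 0)
toy_idf <- log(length(toy_docs) / toy_df)

# TF-IDF: scale each term column by its idf
toy_tfidf <- sweep(toy_tf, 2, toy_idf, `*`)
round(toy_tfidf, 3)
```

In this toy corpus, "in" appears in every document and so receives a weight of zero, while "wrench" (one document) outweighs "pipe" (two documents) within the first document.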

Data Reduction

Our next step is to simplify, or reduce, the data. I chose a principal components analysis (PCA) for simplicity, but also because I wanted to explain the total variance of the data (i.e., all the words). Had I chosen a factor analysis, I would have assumed an error term in addition to the shared variance, and I am not sure that an error term is relevant for embeddings and data reduction. Note that I focus on six dimensions because six is the largest number of content groupings the items were written to cover (five crafts plus crane safety). In other words, if each of the 30 items starts as its own dimension, six is the largest number of reduced dimensions that still makes sense to interpret.

pca <- prcomp(embeddings)
reduced_embeddings <- pca$x[, 1:6] # Using 6 dimensions
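
One way to check how much information a six-component cut retains is the cumulative proportion of variance explained. The sketch below runs on simulated data: `toy` is a hypothetical stand-in for `embeddings`, so the numbers say nothing about the actual items. It relies only on `prcomp()` returning component standard deviations in `sdev`.

```r
# Simulated stand-in for the embeddings matrix (30 rows, 12 columns)
set.seed(1)
toy <- matrix(rnorm(30 * 12), nrow = 30)

# prcomp() centers the columns by default
pca_toy <- prcomp(toy)

# Proportion of total variance carried by each component
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)

# Running total: how much variance the first k components retain
cum_var <- cumsum(var_explained)
round(cum_var[1:6], 3)
```

Applying the same three lines to `pca` from the analysis above would show what share of the TF-IDF variance the first six components retain.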

Analysis and Visualization

Let’s analyze the data by calculating a centroid from the reduced embeddings. The centroid is the central location in an n-dimensional space: the point that minimizes the sum of squared distances between itself and all points in the space. When several centroids are used, points close to a given centroid belong to its cluster and points far from it belong to another cluster. The code below computes a single overall centroid and each item’s Euclidean distance from it, so items with small distances are typical of the item set and items with large distances are atypical. We can then visualize the items. Unfortunately, a six-dimensional space cannot be plotted directly, so I visualized only the first three principal components.
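
The minimizing property of the centroid can be checked numerically. The following is a small sketch on simulated points; `pts`, `toy_centroid`, and `sse()` are hypothetical names used only for illustration. It compares the sum of squared distances at the column means with the same sum at a slightly shifted point.

```r
# Simulated points: 20 observations in 3 dimensions
set.seed(42)
pts <- matrix(rnorm(20 * 3), ncol = 3)

# Centroid = vector of column means
toy_centroid <- colMeans(pts)

# Sum of squared Euclidean distances from every point to a candidate center
sse <- function(center) sum(sweep(pts, 2, center)^2)

sse_centroid <- sse(toy_centroid)
sse_shifted  <- sse(toy_centroid + c(0.1, 0, 0))

# The centroid gives the smaller total
sse_centroid < sse_shifted
```

Moving the center away from the column means by a vector d adds n times the squared length of d to the total (here 20 × 0.01 = 0.2), which is why no other point can do better than the centroid.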

# Calculate centroids and distances
centroid <- colMeans(reduced_embeddings)
distances <- apply(reduced_embeddings, 1, function(x) sqrt(sum((x - centroid)^2)))

# Analyze item distances
item_analysis <- data.frame(
  item = items,
  distance = distances
) %>% 
  arrange(distance) %>%
  mutate(rank = row_number())

# Display the item analysis table
knitr::kable(item_analysis, caption = "Item Analysis Results")
Table: Item Analysis Results

| Item no. | item | distance | rank |
|---:|:---|---:|---:|
| 28 | What is the purpose of a pre-operational inspection for a mobile crane? | 0.1438891 | 1 |
| 20 | What is the function of a refrigerant recovery machine in HVAC? | 0.1507702 | 2 |
| 10 | What is the purpose of a framing square in carpentry? | 0.1531848 | 3 |
| 26 | What is the importance of outriggers in mobile crane operation? | 0.1552178 | 4 |
| 14 | What is the purpose of a pipe threader in plumbing? | 0.1570639 | 5 |
| 18 | What is the purpose of a welding clamp in metalwork? | 0.1573621 | 6 |
| 8 | What is the main use of a pipe cutter in plumbing? | 0.1610955 | 7 |
| 4 | What is the purpose of a multimeter in electrical work? | 0.1730621 | 8 |
| 16 | What is the main function of a chalk line in carpentry? | 0.1737210 | 9 |
| 1 | What is the primary function of a spirit level in carpentry? | 0.1864859 | 10 |
| 9 | Describe the function of a manifold gauge set in HVAC work. | 0.2025362 | 11 |
| 24 | What weather conditions prohibit the operation of a mobile crane? | 0.2058349 | 12 |
| 21 | What is the minimum distance a mobile crane should maintain from power lines? | 0.2085079 | 13 |
| 13 | Explain the use of a welding helmet in construction. | 0.2616504 | 14 |
| 12 | What is the primary function of wire nuts in electrical work? | 0.2725763 | 15 |
| 22 | How often should a mobile crane undergo a thorough inspection? | 0.2815217 | 16 |
| 15 | How is a vacuum gauge used in HVAC systems? | 0.2824550 | 17 |
| 19 | How is a pipe reamer used in plumbing? | 0.2934387 | 18 |
| 17 | Describe the use of a conduit bender in electrical installations. | 0.2966962 | 19 |
| 30 | What is the maximum allowable wind speed for safe crane operations? | 0.3168498 | 20 |
| 25 | How should the load chart be used when operating a mobile crane? | 0.3189447 | 21 |
| 5 | Explain the function of wire strippers in electrical installations. | 0.4118183 | 22 |
| 29 | How should the work area be secured during mobile crane operations? | 0.4582646 | 23 |
| 3 | How is a reciprocating saw used in construction? | 0.5315493 | 24 |
| 11 | How is a circular saw used in construction? | 0.5315493 | 25 |
| 27 | How should hand signals be used during crane operations? | 0.6042911 | 26 |
| 23 | What personal protective equipment (PPE) is required for crane operators? | 0.8086143 | 27 |
| 7 | How is a pipe wrench different from a regular wrench? | 0.8669234 | 28 |
| 2 | Describe the difference between a flathead and Phillips head screwdriver. | 0.9225457 | 29 |
| 6 | What type of welding does an MIG welder perform? | 0.9537438 | 30 |

# Prepare data for visualization
viz_data <- data.frame(reduced_embeddings) %>%
  mutate(item_number = row_number())

# 2D Scatter plot
#p1 <- ggplot(viz_data, aes(x = PC1, y = PC2)) +
#  geom_point() +
#  geom_text(aes(label = item_number), hjust = 0, vjust = 0) +
#  theme_minimal() +
#  labs(title = "Item Embeddings: PC1 vs PC2")

#print(p1)

# 3D Scatter plot
p2 <- plot_ly(viz_data, x = ~PC1, y = ~PC2, z = ~PC3, 
        type = "scatter3d", mode = "markers+text",
        text = ~item_number, hoverinfo = "text") %>%
  layout(scene = list(xaxis = list(title = "PC1"),
                      yaxis = list(title = "PC2"),
                      zaxis = list(title = "PC3")),
         title = "3D Scatterplot of First 3 Principal Components")

p2