knitr::opts_chunk$set(echo = TRUE, include = TRUE, message = FALSE, warning = FALSE)
# Install packages if not already installed
if (!require("text2vec")) install.packages("text2vec")
# Load libraries
library(text2vec)
library(geometry)
library(ggplot2)
library(tidyverse)
library(GGally)
library(plotly)
The purpose of this prototype demonstration is to show how items can be constructed and analyzed without response data. It uses item embeddings to examine similarities between items based on the words contained in the item stems. No response options were created for this demonstration. An item embedding represents the words of an item as numbers, and those numbers form a vector for the item. The dimensions that summarize these vectors are analogous to the factors or principal components of more typical data reduction techniques. In fact, we will use PCA to create components.
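To make the idea of comparing items by their words concrete, here is a minimal sketch in base R. It assumes a plain word-count vector per stem and cosine similarity as the comparison; neither is necessarily what the analysis in this demonstration uses, and the names `stems`, `tokenize`, `toy_vocab`, and `count_vec` are hypothetical.

```r
# Two item stems taken from the item set in this demonstration
stems <- c(
  "How is a reciprocating saw used in construction?",
  "How is a circular saw used in construction?"
)

# Lowercase, strip punctuation, and split on spaces (a deliberately simple tokenizer)
tokenize <- function(x) strsplit(gsub("[^a-z ]", "", tolower(x)), " +")[[1]]
tokens <- lapply(stems, tokenize)

# Shared vocabulary and one word-count vector per stem
toy_vocab <- sort(unique(unlist(tokens)))
count_vec <- function(tok) as.numeric(table(factor(tok, levels = toy_vocab)))
v1 <- count_vec(tokens[[1]])
v2 <- count_vec(tokens[[2]])

# Cosine similarity: 1 means identical word profiles, 0 means no shared words
cosine_sim <- sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
cosine_sim
```

The two stems share seven of their eight words, so the similarity is high even though the tools differ. This is the kind of wording overlap that drives the results later on.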
Let’s create some items. Twenty items on knowledge of tools were created across five different crafts. An additional 10 items were created that represent safety regulations for mobile cranes. Thus, we could interpret the overall structure as having two dimensions (tool knowledge and crane safety). If we were more specific, we could interpret the structure as having six dimensions (the five crafts plus crane safety). The items are as follows:
items <- c(
"What is the primary function of a spirit level in carpentry?",
"Describe the difference between a flathead and Phillips head screwdriver.",
"How is a reciprocating saw used in construction?",
"What is the purpose of a multimeter in electrical work?",
"Explain the function of wire strippers in electrical installations.",
"What type of welding does an MIG welder perform?",
"How is a pipe wrench different from a regular wrench?",
"What is the main use of a pipe cutter in plumbing?",
"Describe the function of a manifold gauge set in HVAC work.",
"What is the purpose of a framing square in carpentry?",
"How is a circular saw used in construction?",
"What is the primary function of wire nuts in electrical work?",
"Explain the use of a welding helmet in construction.",
"What is the purpose of a pipe threader in plumbing?",
"How is a vacuum gauge used in HVAC systems?",
"What is the main function of a chalk line in carpentry?",
"Describe the use of a conduit bender in electrical installations.",
"What is the purpose of a welding clamp in metalwork?",
"How is a pipe reamer used in plumbing?",
"What is the function of a refrigerant recovery machine in HVAC?",
"What is the minimum distance a mobile crane should maintain from power lines?",
"How often should a mobile crane undergo a thorough inspection?",
"What personal protective equipment (PPE) is required for crane operators?",
"What weather conditions prohibit the operation of a mobile crane?",
"How should the load chart be used when operating a mobile crane?",
"What is the importance of outriggers in mobile crane operation?",
"How should hand signals be used during crane operations?",
"What is the purpose of a pre-operational inspection for a mobile crane?",
"How should the work area be secured during mobile crane operations?",
"What is the maximum allowable wind speed for safe crane operations?"
)
Next, we’ll create the embeddings for our item set. TF-IDF (term frequency-inverse document frequency) is a technique for weighting the importance of a word to a document within a corpus of text. Here, each item is a document and the 30 items form the corpus. A high value means that a word occurs often in a particular item but is rare across the rest of the items.
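The weighting can be worked through by hand on a few stems. The sketch below uses the textbook form, term count times log(N / document frequency). That form is an assumption for illustration: the `TfIdf` model in text2vec applies its own normalization and smoothing, so its values may not match these. The names `docs`, `toy_vocab`, and `tfidf_by_hand` are hypothetical.

```r
# Three stems from the item set, each treated as one document
docs <- c(
  "What is the purpose of a pipe threader in plumbing?",
  "How is a pipe reamer used in plumbing?",
  "How should hand signals be used during crane operations?"
)
tokenize <- function(x) strsplit(gsub("[^a-z ]", "", tolower(x)), " +")[[1]]
tokens <- lapply(docs, tokenize)
toy_vocab <- sort(unique(unlist(tokens)))

# Term-frequency matrix: rows are documents, columns are terms
tf <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = toy_vocab))),
  numeric(length(toy_vocab))
))
colnames(tf) <- toy_vocab

# Document frequency (number of documents containing each term) and its inverse
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# TF-IDF: scale each term column by that term's IDF
tfidf_by_hand <- sweep(tf, 2, idf, `*`)
round(tfidf_by_hand[, c("crane", "pipe", "plumbing")], 3)
```

"crane" appears in only one of the three stems, so it receives the largest weight in that stem. "pipe" and "plumbing" appear in two stems each and are weighted less.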
# Preprocess the text
it <- itoken(items, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
# Create document-term matrix
dtm <- create_dtm(it, vectorizer)
# Create TF-IDF model
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)
# Use TF-IDF matrix as our embeddings
embeddings <- as.matrix(dtm_tfidf)
Our next step is to simplify, or reduce, the data. I chose a principal components analysis (PCA) for simplicity, but also because I wanted to explain the total variance of the data (i.e., all the words). Had I chosen a factor analysis, I would have modeled only the shared variance and assumed a unique error term for each variable. I am not sure that an error term is relevant for embeddings and data reduction. Observe that I focus on six dimensions for this analysis because six is the largest number of dimensions the item set was designed to have (five crafts plus crane safety). In other words, if each of the 30 items is treated as its own dimension, how far can the data be reduced while still making sense?
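The code that follows relies on an object called `reduced_embeddings` with columns `PC1`, `PC2`, and so on, but the step that builds it is not shown. The sketch below is one plausible way to perform such a reduction, assuming `prcomp()` from base R and six retained components. It runs on a random toy matrix rather than the real TF-IDF matrix, and the centering and scaling choices are assumptions, not the settings used for the reported results.

```r
set.seed(123)
# Toy stand-in for the TF-IDF matrix: 30 items by 12 terms (hypothetical values)
toy_embeddings <- matrix(runif(30 * 12), nrow = 30,
                         dimnames = list(NULL, paste0("term", 1:12)))

# PCA on centered columns; scaling is switched off here, which is a choice, not a given
pca_fit <- prcomp(toy_embeddings, center = TRUE, scale. = FALSE)

# Keep the scores on the first six components, one row per item
n_components <- 6
toy_reduced <- pca_fit$x[, 1:n_components]

# Cumulative proportion of total variance retained by those six components
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cumsum(var_explained)[n_components]
dim(toy_reduced)
```

Checking the cumulative proportion of variance is a quick way to judge whether six components retain enough of the information in the word weights.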
Let’s analyze the data by calculating a centroid from the reduced embeddings. A centroid is the central location in an n-dimensional space that minimizes the sum of squared distances between itself and the points assigned to it. When there are several centroids, each point belongs to the cluster of its nearest centroid. The code below uses a single centroid, the mean of all items, and ranks the items by their distance from it. Once the distances are calculated we can visualize the items. Unfortunately, a six-dimensional space cannot be plotted directly. Thus, I visualized the first three dimensions.
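The minimizing property of the centroid can be checked numerically. This is a small sketch on random points, not on the item embeddings; `pts`, `ctr`, and `sse` are hypothetical names.

```r
set.seed(42)
# Hypothetical points: 30 rows in a six-dimensional space
pts <- matrix(rnorm(30 * 6), nrow = 30)

# The centroid is the column-wise mean
ctr <- colMeans(pts)

# Sum of squared Euclidean distances from every point to a candidate center
sse <- function(center) sum(sweep(pts, 2, center)^2)

# A center shifted away from the centroid gives a larger sum
shifted <- ctr + 0.1
c(at_centroid = sse(ctr), at_shifted = sse(shifted))
```

Shifting the center by 0.1 on each of the six coordinates raises the sum by exactly 30 × 6 × 0.1² = 1.8, which is the standard decomposition of squared distance around the mean.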
# Calculate centroids and distances
centroid <- colMeans(reduced_embeddings)
distances <- apply(reduced_embeddings, 1, function(x) sqrt(sum((x - centroid)^2)))
# Analyze item distances
item_analysis <- data.frame(
  item = items,
  distance = distances
) %>%
  arrange(distance) %>%
  mutate(rank = row_number())
# Display the item analysis table
knitr::kable(item_analysis, caption = "Item Analysis Results")

| item number | item | distance | rank |
|---|---|---|---|
| 28 | What is the purpose of a pre-operational inspection for a mobile crane? | 0.1438891 | 1 |
| 20 | What is the function of a refrigerant recovery machine in HVAC? | 0.1507702 | 2 |
| 10 | What is the purpose of a framing square in carpentry? | 0.1531848 | 3 |
| 26 | What is the importance of outriggers in mobile crane operation? | 0.1552178 | 4 |
| 14 | What is the purpose of a pipe threader in plumbing? | 0.1570639 | 5 |
| 18 | What is the purpose of a welding clamp in metalwork? | 0.1573621 | 6 |
| 8 | What is the main use of a pipe cutter in plumbing? | 0.1610955 | 7 |
| 4 | What is the purpose of a multimeter in electrical work? | 0.1730621 | 8 |
| 16 | What is the main function of a chalk line in carpentry? | 0.1737210 | 9 |
| 1 | What is the primary function of a spirit level in carpentry? | 0.1864859 | 10 |
| 9 | Describe the function of a manifold gauge set in HVAC work. | 0.2025362 | 11 |
| 24 | What weather conditions prohibit the operation of a mobile crane? | 0.2058349 | 12 |
| 21 | What is the minimum distance a mobile crane should maintain from power lines? | 0.2085079 | 13 |
| 13 | Explain the use of a welding helmet in construction. | 0.2616504 | 14 |
| 12 | What is the primary function of wire nuts in electrical work? | 0.2725763 | 15 |
| 22 | How often should a mobile crane undergo a thorough inspection? | 0.2815217 | 16 |
| 15 | How is a vacuum gauge used in HVAC systems? | 0.2824550 | 17 |
| 19 | How is a pipe reamer used in plumbing? | 0.2934387 | 18 |
| 17 | Describe the use of a conduit bender in electrical installations. | 0.2966962 | 19 |
| 30 | What is the maximum allowable wind speed for safe crane operations? | 0.3168498 | 20 |
| 25 | How should the load chart be used when operating a mobile crane? | 0.3189447 | 21 |
| 5 | Explain the function of wire strippers in electrical installations. | 0.4118183 | 22 |
| 29 | How should the work area be secured during mobile crane operations? | 0.4582646 | 23 |
| 3 | How is a reciprocating saw used in construction? | 0.5315493 | 24 |
| 11 | How is a circular saw used in construction? | 0.5315493 | 25 |
| 27 | How should hand signals be used during crane operations? | 0.6042911 | 26 |
| 23 | What personal protective equipment (PPE) is required for crane operators? | 0.8086143 | 27 |
| 7 | How is a pipe wrench different from a regular wrench? | 0.8669234 | 28 |
| 2 | Describe the difference between a flathead and Phillips head screwdriver. | 0.9225457 | 29 |
| 6 | What type of welding does an MIG welder perform? | 0.9537438 | 30 |
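The table ranks items against a single overall centroid. If separate clusters are wanted, as the discussion of centroids suggests, one option is k-means. The sketch below is an assumption about how that could be done with `kmeans()` from base R. It is not part of the analysis reported here and runs on simulated scores, not on the item embeddings.

```r
set.seed(2024)
# Hypothetical reduced scores: two well-separated groups in three dimensions
group_a <- matrix(rnorm(20 * 3, mean = 0, sd = 0.3), ncol = 3)
group_b <- matrix(rnorm(10 * 3, mean = 3, sd = 0.3), ncol = 3)
toy_scores <- rbind(group_a, group_b)

# k-means with two centers; nstart reruns the algorithm from several random starts
km <- kmeans(toy_scores, centers = 2, nstart = 10)

# One centroid per cluster, and the number of items assigned to each cluster
km$centers
cluster_sizes <- as.integer(table(km$cluster))
cluster_sizes
```

With real item embeddings the groups are unlikely to be this cleanly separated, so the number of centers and the resulting assignments would need to be checked against the intended content areas.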
# Prepare data for visualization
viz_data <- data.frame(reduced_embeddings) %>%
  mutate(item_number = row_number())
# 2D Scatter plot
#p1 <- ggplot(viz_data, aes(x = PC1, y = PC2)) +
# geom_point() +
# geom_text(aes(label = item_number), hjust = 0, vjust = 0) +
# theme_minimal() +
# labs(title = "Item Embeddings: PC1 vs PC2")
#print(p1)
# 3D Scatter plot
p2 <- plot_ly(viz_data, x = ~PC1, y = ~PC2, z = ~PC3,
              type = "scatter3d", mode = "markers+text",
              text = ~item_number, hoverinfo = "text") %>%
  layout(scene = list(xaxis = list(title = "PC1"),
                      yaxis = list(title = "PC2"),
                      zaxis = list(title = "PC3")),
         title = "3D Scatterplot of First 3 Principal Components")
p2