This markdown sequence documents data import of JISH and Alroqi datasets with the goals of: 1. Data wrangling 2. Short form creation 3. CAT creation

This particular document uses the combined Alroqi and JISH data to try and create an informative and useful short forms.

Load data.

load(file = "cached_data/all_arabic_production.Rds")

Psychometric modeling

Prepare data.

d_prod <- select(full_prod, "sheep.sound":"if.2")
words <- names(d_prod)
d_mat <- as.matrix(d_prod)

Model comparison.

Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.

Comparison of Rasch and 2PL models.
Model AIC BIC logLik df
Rasch 375986.4 379997.6 -187146.2 NaN
2PL 368675.3 376688.3 -182645.7 845

The 2PL is favored over the 3PL model by BIC, although AIC prefers the 3PL.

Comparison of 2PL and 3PL models.
Model AIC BIC logLik df
2PL 368675.3 376688.3 -182645.7 NaN
3PL 368383.3 380402.7 -181653.6 846

The 2PL is preferred over both the Rasch (1PL) model and the 3PL model by BIC, so we do the rest of our analyses using the 2PL model as the basis. Next we look for linear dependencies (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.

Examine Linear Dependencies

Some items are very highly correlated with one another. Here are the item pairs with the top residuals from the model.

d_res <- res |>
 as_tibble() |>
 mutate(word1 = rownames(res)) |>
 pivot_longer(-word1, names_to = "word2", values_to = "residual") 

arrange(d_res, desc(residual)) |>
 filter(residual > 450) |>
 knitr::kable(digits = 2)
word1 word2 residual
lizard turkey 647.8310
squirrel owl 574.6708
peas vitamin 551.0986
lizard carriage 531.0199
carriage pig 525.9496
leopard lizard 498.5478
carriage squirrel 488.5663
lizard squirrel 487.2589
crow lizard 487.1227
leopard puppy 486.7679
stretcher milk 485.4912
puppy lizard 481.9509
through around 478.4834
turkey squirrel 472.8403
crow turkey 467.7399
goose squirrel 466.6497
crow goose 460.0712
nine seven 458.4109
penguin owl 457.0020
lizard pig 456.5636
garage milk 454.6480
turkey goose 453.8696
poor dark.in.color 452.9800
crow squirrel 452.5225
leopard squirrel 451.1818

We examined each item for pairwise linear dependencies (LD) with other items using \(\chi^{2}\) (Chen & Thissen, 1997), and found that 117 items show strong LD (Cramer’s \(V \geq 0.5\)), and 835 items show moderate LD (\(V \geq 0.3\)) with at least 10 other items. This suggests multidimensionality, and we may want to look into exploratory factor analysis.

Ill-fitting items

Our next goal is to determine if all items should be included in the item bank. Items that have very bad properties should probably be dropped. We will prune any ill-fitting items (\(\chi^{2}*_{df}\) \(p<.001\)) from the full 2PL model that also showed strong LD.

Check on badly fitting items.

300 items did not fit well in the full 2PL model, and these items are shown below.

bad_items <- itfit2pl$item[bad_items2pl]

kable(matrix(bad_items, ncol=4)) %>%
 html_table_width(rep(150, 8))
sheep.sound sandal winds green
cat.sound gloves shovel dark.in.color
desire.to.eat.sound diapers hose full
car.sound abaya traffic.sign bad
crying.sound mouth castle expired
train.sound elbow mama old.in.age.1
surprise.sound head papa old.in.age.2
pain.sound chin domestic.worker old
falling.sound foot grandmother poor
to.express.bad.smell.sound palm baby brown
bird.sound leg.2 teacher.female.1 yummy
hot.sound forehead clown high
yummy.sound arm firefighters crazy
dog.sound thumb friend sticky
expressing.agreement eyelashes barber soft.not.firm
monster.sound trash.can person colorful
goat lamp.light guard difficult
horse pillow teacher.female.2 sneeze.noun.2
fish.animal plant finger.game.2 saliva
dog television incy.wincy.spider mucus
bear bag hi day
chicken.animal album go.potty later
rabbit tray marhaba today
duck soap wait.imperative before
monkey kettle time.to.sleep after
penguin money turn.around time
donkey rag give.me yesterday
cow matches peace.be.upon.him how
wolf plate calm.down when
lion tissue hide.and.seek he
ball key last.night yours
bubbles pacifier.2 play she
blocks bowl pour this.masculine
toy remote.control measure that.masculine.1
pencil spoon shake mine
story floor.wiper choose they
toys radio share hers
circle lantern read we
car decorative.accent return ours
pickup.truck clock tickle these
ambulance carry.bag splash that.masculine.2
bus glass.bottle pretend those
biscuit jug find there.is.nothing
candy box curse not.here
gello phone think not.nice
soup mattress peel finished.no.more
chocolate cooker fly five
rice side.table look.1 any.one
carrot vacuum.machine swim another.1
buttermilk bed fall.or.scatter different
jam shower crash.by same.thing
popcorn mirror study another.2
rusk vanity.table press more.than
medicine chair be three
hamburger heater smile seven
strawberry chandeliers swing.causative.verb eight
banana door kick nine
butter potty cancel ten
juice rocking.chair fix six
water kitchen swing a.lot.2
lettuce knife.2 spit few
chicken.food plastic.sheets nice away
pepsi roof yellow near
corn ladder orange.color on
doughnut sea having.good.taste about
cherry star blue around
yogurt stone light in
fries building pink with.using
sauce nursery thin through
mulukhiyah preschool alert here.2
skirt tent final.last on.top.of
head.scarf garden second to
shirt.1 work.noun fixed and
shoes clouds red but
undershirt sidewalk light.in.color also

Plot 2PL Coefficients

Next, we examine the coefficients of the 2PL model.

Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).

We remove the “bad items” identified above.

coefs_2pl <- as_tibble(coef(mod_2pl, simplify = TRUE)$items) %>%
  mutate(definition = rownames(coef(mod_2pl, simplify = TRUE)$items)) |>
  ungroup()

ggplot(filter(coefs_2pl, 
              !(definition %in% bad_items)),
       aes(x = a1, y = -d)) + 
  geom_point(alpha = .3) + 
  ggrepel::geom_text_repel(data = filter(coefs_2pl, 
                                a1 < 1 | a1 > 3.8 | -d > 5 | -d < -2.5), 
                  aes(label = definition), size = 2, 
                  show.legend = FALSE) + 
  xlab("Discrimination") + 
  ylab("Difficulty")
## Warning: ggrepel: 145 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Short form construction

Goal is to create a 100 item test with the best items for a given age/ability range.

Let’s find our estimated abilities and see how they relate to age.

qplot(x = full_prod$age_mo, y = fscores_2pl$ability, col = full_prod$source, geom = "point" ) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

So we would like a range of abilities from about -2.5 to 4. Here’s our resulting test information curve.

theta <- matrix(seq(-2.5,4,.01))
tinfo <- testinfo(mod_2pl, theta)
plot(theta, tinfo, type = 'l')

sum(tinfo)
## [1] 221075.8

First let’s do some sanity checks. If we remove the bad items, we should lose less test info than if we remove a random subset.

bad_item_idx <- which(words %in% bad_items)

tinfo_no_bad <- testinfo(mod_2pl, theta, 
                         which.items = setdiff(1:length(words), bad_item_idx))

# run 100 replicates of the shuffled removal to see what happens. 
tinfo_no_random <- replicate(100, 
  sum(testinfo(mod_2pl, theta, 
               which.items = setdiff(1:length(words), 
                                     shuffle(1:length(words))[1:length(bad_items)]))))

# plot(theta, tinfo_no_bad, type = 'l')
# plot(theta, tinfo_no_random, type = 'l')
sum(tinfo_no_bad)
## [1] 143796.7
mean(tinfo_no_random)
## [1] 142699.8

Actually, removing the “bad” items doesn’t make life any worse.

Let’s try making some random 100-item subtests.

coefs_2pl <- mutate(coefs_2pl, idx = 1:n())
  
tinfo_random_100 <- tibble(n = 1:1000) %>%
  split(.$n) |>
  map_df(function(x) {
    tibble(theta = as.vector(theta), 
           testinfo = testinfo(mod_2pl, theta, 
                               which.items = slice_sample(coefs_2pl, n = 100) |>
                                 pull(idx)),
           n = x$n)
  })


tinfo_random_summary <- tinfo_random_100 |>
  group_by(n) |>
  summarise(testinfo = sum(testinfo)) 

ggplot(tinfo_random_summary, 
       aes(x = testinfo)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The mean random test information is 2.6145297^{4}.

Now let’s try selecting high discrimination items.

top_desc <- arrange(coefs_2pl, desc(a1)) |>
  slice(1:100) |>
  pull(definition)

top_desc_idx <- which(words %in% top_desc)

tinfo_top_desc <- testinfo(mod_2pl, theta, which.items = top_desc_idx)
plot(theta, tinfo_top_desc, type = 'l')

The best test selecting based on discrimination has test information of 3.7571746^{4}. That gives us an upper bound on item information.

Now let’s try to do some kind of optimization. Our constraints are:

Let’s look at test information for each of the categories first.

coefs_2pl <- left_join(coefs_2pl, 
                       items |>
                         select(-definition) |>
                         rename(definition = uni_lemma) |>
                         select(definition, category))
## Joining, by = "definition"
coefs_2pl$idx <- 1:nrow(coefs_2pl)


cat_info <- tibble(cat = unique(coefs_2pl$category)) |>
  group_by(cat) |>
  mutate(data = list(tibble(theta = as.vector(theta), 
                            testinfo = testinfo(mod_2pl, theta, 
                                                which.items = filter(coefs_2pl, 
                                                                     category == cat) |>
                                                  pull(idx)),
                            n = nrow(filter(coefs_2pl, category == cat))))) |>
  unnest(cols = "data")

ggplot(cat_info, aes(x = theta, y= testinfo/n)) + 
  geom_line() + 
  facet_wrap(~cat) + 
  ggtitle("Test information per word for different sections")

Given this, let’s exclude the “Sound Effects and Animal Sounds” category. From the other 20 categories, we’ll try taking 5 words from each category, and we’ll try to maximize 1) the standard deviation of difficulty and 2) the mean discrimination.

by_category_max_sd <- coefs_2pl |>
  ungroup() |>
  filter(category != "Sound Effects and Animal Sounds") %>%
  split(.$category) |>
  map_df(function (cat) {
    
    perms <- tibble(n = 1:100) %>%
      split(.$n) |>
      map_df(function (x) {
        slice_sample(cat, n=5) |>
          mutate(score = sd(d) + mean(a1),
                 n = x$n)
      })
    
    filter(perms, score == max(score))
  })
  
by_cat_max_sd_test_info <- testinfo(mod_2pl, theta, which.items = by_category_max_sd$idx)

Let’s try just maximizing discrimination within category.

by_category_max_desc <- coefs_2pl |>
  ungroup() |>
  filter(category != "Sound Effects and Animal Sounds") %>%
  split(.$category) |>
  map_df(function (cat) {
    
    arrange(cat, desc(a1)) |>
      slice(1:5)
  })

by_cat_max_desc_test_info <- testinfo(mod_2pl, theta, which.items = by_category_max_desc$idx)

How about adding the easiest one in each category and then doing the four most discriminating.

by_category_max_desc_one_easy <- coefs_2pl |>
  ungroup() |>
  filter(category != "Sound Effects and Animal Sounds") %>%
  split(.$category) |>
  map_df(function (cat) {
    
    # do this so we definitely get 5 even if the two conditions overlap
    filter(cat, d==max(d) | a1 >= sort(a1, decreasing=TRUE)[5]) |>
      arrange(desc(d)) |>
      slice(1:5)
      
  })

by_category_max_desc_one_easy_test_info <- testinfo(mod_2pl, theta, 
                                                    which.items = by_category_max_desc_one_easy$idx)

Now compare these.

sf_vs_best <- tibble(theta = theta, 
                     `balance discrim and difficulty by category` = by_cat_max_sd_test_info, 
                     `one easy plus most discrim by category` = by_category_max_desc_one_easy_test_info,
                     `most discrim by category` = by_cat_max_desc_test_info, 
                     `most discrim overall` = tinfo_top_desc, 
                     `random` = tinfo_random_100 |> 
                       group_by(theta) |> 
                       summarise(testinfo = mean(testinfo)) |> 
                       pull(testinfo)) |>
  pivot_longer(-theta, names_to = "selection", values_to = "test_information")
                       
                       
ggplot(sf_vs_best, 
       aes(x = theta, 
           y = test_information, col = selection)) + 
  geom_line() 

Examine resulting items

short_metrics <- full_prod |>
  rowwise() |>
  mutate(top_desc = sum(c_across(cols = coefs_2pl$definition[top_desc_idx])),
    one_easy = sum(c_across(cols = 
                              coefs_2pl$definition[by_category_max_desc_one_easy$idx]))) |>
  select(subid, age_mo, total, top_desc, one_easy)

short_metrics_long <- short_metrics |>
  pivot_longer(top_desc:one_easy, names_to = "selection", values_to = "estimate")
  

ggplot(short_metrics_long, aes(x = total, y = estimate)) +
  geom_point(alpha = .25) +
  geom_smooth() + 
  geom_abline(lty = 2, slope = 100/length(coefs_2pl$definition)) +
  facet_wrap(~selection)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlations.

cor.test(short_metrics$total, short_metrics$top_desc)
## 
##  Pearson's product-moment correlation
## 
## data:  short_metrics$total and short_metrics$top_desc
## t = 137.66, df = 840, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9754757 0.9812365
## sample estimates:
##       cor 
## 0.9785465
with(filter(short_metrics, age_mo <= 18), 
     cor.test(total, top_desc))
## 
##  Pearson's product-moment correlation
## 
## data:  total and top_desc
## t = 35.18, df = 260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8854195 0.9280139
## sample estimates:
##      cor 
## 0.909063

Correlations are very high for the whole sample, but lower for younger kids.

cor.test(short_metrics$total, short_metrics$one_easy)
## 
##  Pearson's product-moment correlation
## 
## data:  short_metrics$total and short_metrics$one_easy
## t = 137.62, df = 840, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9754631 0.9812268
## sample estimates:
##       cor 
## 0.9785355
with(filter(short_metrics, age_mo <= 18), 
     cor.test(total, one_easy))
## 
##  Pearson's product-moment correlation
## 
## data:  total and one_easy
## t = 53.856, df = 260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9467053 0.9669170
## sample estimates:
##       cor 
## 0.9579844

Our corrected test (“one easy”) does substantially better with the younger kids.

What’s in that test?

filter(coefs_2pl, idx %in% by_category_max_desc_one_easy$idx) |>
  select(definition, category, d, a1) |>
  DT::datatable()

We can maybe do better by removing some redundancy (e.g. toy/toys), but this doesn’t look totally crazy.