This markdown sequence documents data import of JISH and Alroqi datasets with the goals of: 1. Data wrangling 2. Short form creation 3. CAT creation

This particular document uses the combined Alroqi and JISH data to try and create an informative and useful short forms.

Load data.

load(file = "cached_data/all_arabic_production.Rds")

Psychometric modeling

Prepare data.

d_prod <- select(full_prod, "sheep.sound":"if.2")
words <- names(d_prod)
d_mat <- as.matrix(d_prod)

Model comparison.

Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.

Comparison of Rasch and 2PL models.
Model	AIC	BIC	logLik	df
Rasch	375986.4	379997.6	-187146.2	NaN
2PL	368675.3	376688.3	-182645.7	845

The 2PL is favored over the 3PL model by BIC, although AIC prefers the 3PL.

Comparison of 2PL and 3PL models.
Model	AIC	BIC	logLik	df
2PL	368675.3	376688.3	-182645.7	NaN
3PL	368383.3	380402.7	-181653.6	846

The 2PL is preferred over both the Rasch (1PL) model and the 3PL model by BIC, so we do the rest of our analyses using the 2PL model as the basis. Next we look for linear dependencies (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.

Examine Linear Dependencies

Some items are very highly correlated with one another. Here are the item pairs with the top residuals from the model.

d_res <- res |>
 as_tibble() |>
 mutate(word1 = rownames(res)) |>
 pivot_longer(-word1, names_to = "word2", values_to = "residual") 

arrange(d_res, desc(residual)) |>
 filter(residual > 450) |>
 knitr::kable(digits = 2)

word1	word2	residual
lizard	turkey	647.8310
squirrel	owl	574.6708
peas	vitamin	551.0986
lizard	carriage	531.0199
carriage	pig	525.9496
leopard	lizard	498.5478
carriage	squirrel	488.5663
lizard	squirrel	487.2589
crow	lizard	487.1227
leopard	puppy	486.7679
stretcher	milk	485.4912
puppy	lizard	481.9509
through	around	478.4834
turkey	squirrel	472.8403
crow	turkey	467.7399
goose	squirrel	466.6497
crow	goose	460.0712
nine	seven	458.4109
penguin	owl	457.0020
lizard	pig	456.5636
garage	milk	454.6480
turkey	goose	453.8696
poor	dark.in.color	452.9800
crow	squirrel	452.5225
leopard	squirrel	451.1818

We examined each item for pairwise linear dependencies (LD) with other items using \(\chi^{2}\) (Chen & Thissen, 1997), and found that 117 items show strong LD (Cramer’s \(V \geq 0.5\)), and 835 items show moderate LD (\(V \geq 0.3\)) with at least 10 other items. This suggests multidimensionality, and we may want to look into exploratory factor analysis.

Ill-fitting items

Our next goal is to determine if all items should be included in the item bank. Items that have very bad properties should probably be dropped. We will prune any ill-fitting items (\(\chi^{2}*_{df}\) \(p<.001\)) from the full 2PL model that also showed strong LD.

Check on badly fitting items.

300 items did not fit well in the full 2PL model, and these items are shown below.

bad_items <- itfit2pl$item[bad_items2pl]

kable(matrix(bad_items, ncol=4)) %>%
 html_table_width(rep(150, 8))

sheep.sound	sandal	winds	green
cat.sound	gloves	shovel	dark.in.color
desire.to.eat.sound	diapers	hose	full
car.sound	abaya	traffic.sign	bad
crying.sound	mouth	castle	expired
train.sound	elbow	mama	old.in.age.1
surprise.sound	head	papa	old.in.age.2
pain.sound	chin	domestic.worker	old
falling.sound	foot	grandmother	poor
to.express.bad.smell.sound	palm	baby	brown
bird.sound	leg.2	teacher.female.1	yummy
hot.sound	forehead	clown	high
yummy.sound	arm	firefighters	crazy
dog.sound	thumb	friend	sticky
expressing.agreement	eyelashes	barber	soft.not.firm
monster.sound	trash.can	person	colorful
goat	lamp.light	guard	difficult
horse	pillow	teacher.female.2	sneeze.noun.2
fish.animal	plant	finger.game.2	saliva
dog	television	incy.wincy.spider	mucus
bear	bag	hi	day
chicken.animal	album	go.potty	later
rabbit	tray	marhaba	today
duck	soap	wait.imperative	before
monkey	kettle	time.to.sleep	after
penguin	money	turn.around	time
donkey	rag	give.me	yesterday
cow	matches	peace.be.upon.him	how
wolf	plate	calm.down	when
lion	tissue	hide.and.seek	he
ball	key	last.night	yours
bubbles	pacifier.2	play	she
blocks	bowl	pour	this.masculine
toy	remote.control	measure	that.masculine.1
pencil	spoon	shake	mine
story	floor.wiper	choose	they
toys	radio	share	hers
circle	lantern	read	we
car	decorative.accent	return	ours
pickup.truck	clock	tickle	these
ambulance	carry.bag	splash	that.masculine.2
bus	glass.bottle	pretend	those
biscuit	jug	find	there.is.nothing
candy	box	curse	not.here
gello	phone	think	not.nice
soup	mattress	peel	finished.no.more
chocolate	cooker	fly	five
rice	side.table	look.1	any.one
carrot	vacuum.machine	swim	another.1
buttermilk	bed	fall.or.scatter	different
jam	shower	crash.by	same.thing
popcorn	mirror	study	another.2
rusk	vanity.table	press	more.than
medicine	chair	be	three
hamburger	heater	smile	seven
strawberry	chandeliers	swing.causative.verb	eight
banana	door	kick	nine
butter	potty	cancel	ten
juice	rocking.chair	fix	six
water	kitchen	swing	a.lot.2
lettuce	knife.2	spit	few
chicken.food	plastic.sheets	nice	away
pepsi	roof	yellow	near
corn	ladder	orange.color	on
doughnut	sea	having.good.taste	about
cherry	star	blue	around
yogurt	stone	light	in
fries	building	pink	with.using
sauce	nursery	thin	through
mulukhiyah	preschool	alert	here.2
skirt	tent	final.last	on.top.of
head.scarf	garden	second	to
shirt.1	work.noun	fixed	and
shoes	clouds	red	but
undershirt	sidewalk	light.in.color	also

Plot 2PL Coefficients

Next, we examine the coefficients of the 2PL model.

Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).

We remove the “bad items” identified above.

coefs_2pl <- as_tibble(coef(mod_2pl, simplify = TRUE)$items) %>%
  mutate(definition = rownames(coef(mod_2pl, simplify = TRUE)$items)) |>
  ungroup()

ggplot(filter(coefs_2pl, 
              !(definition %in% bad_items)),
       aes(x = a1, y = -d)) + 
  geom_point(alpha = .3) + 
  ggrepel::geom_text_repel(data = filter(coefs_2pl, 
                                a1 < 1 | a1 > 3.8 | -d > 5 | -d < -2.5), 
                  aes(label = definition), size = 2, 
                  show.legend = FALSE) + 
  xlab("Discrimination") + 
  ylab("Difficulty")

## Warning: ggrepel: 145 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Short form construction

Goal is to create a 100 item test with the best items for a given age/ability range.

Let’s find our estimated abilities and see how they relate to age.

qplot(x = full_prod$age_mo, y = fscores_2pl$ability, col = full_prod$source, geom = "point" ) +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

So we would like a range of abilities from about -2.5 to 4. Here’s our resulting test information curve.

theta <- matrix(seq(-2.5,4,.01))
tinfo <- testinfo(mod_2pl, theta)
plot(theta, tinfo, type = 'l')

sum(tinfo)

## [1] 221075.8

First let’s do some sanity checks. If we remove the bad items, we should lose less test info than if we remove a random subset.

bad_item_idx <- which(words %in% bad_items)

tinfo_no_bad <- testinfo(mod_2pl, theta, 
                         which.items = setdiff(1:length(words), bad_item_idx))

# run 100 replicates of the shuffled removal to see what happens. 
tinfo_no_random <- replicate(100, 
  sum(testinfo(mod_2pl, theta, 
               which.items = setdiff(1:length(words), 
                                     shuffle(1:length(words))[1:length(bad_items)]))))

# plot(theta, tinfo_no_bad, type = 'l')
# plot(theta, tinfo_no_random, type = 'l')
sum(tinfo_no_bad)

## [1] 143796.7

mean(tinfo_no_random)

## [1] 142699.8

Actually, removing the “bad” items doesn’t make life any worse.

Let’s try making some random 100-item subtests.

coefs_2pl <- mutate(coefs_2pl, idx = 1:n())
  
tinfo_random_100 <- tibble(n = 1:1000) %>%
  split(.$n) |>
  map_df(function(x) {
    tibble(theta = as.vector(theta), 
           testinfo = testinfo(mod_2pl, theta, 
                               which.items = slice_sample(coefs_2pl, n = 100) |>
                                 pull(idx)),
           n = x$n)
  })


tinfo_random_summary <- tinfo_random_100 |>
  group_by(n) |>
  summarise(testinfo = sum(testinfo)) 

ggplot(tinfo_random_summary, 
       aes(x = testinfo)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The mean random test information is 2.6145297^{4}.

Now let’s try selecting high discrimination items.

top_desc <- arrange(coefs_2pl, desc(a1)) |>
  slice(1:100) |>
  pull(definition)

top_desc_idx <- which(words %in% top_desc)

tinfo_top_desc <- testinfo(mod_2pl, theta, which.items = top_desc_idx)
plot(theta, tinfo_top_desc, type = 'l')

The best test selecting based on discrimination has test information of 3.7571746^{4}. That gives us an upper bound on item information.

Now let’s try to do some kind of optimization. Our constraints are:

We want diverse representations across categories
We want good coverage across age

Let’s look at test information for each of the categories first.

coefs_2pl <- left_join(coefs_2pl, 
                       items |>
                         select(-definition) |>
                         rename(definition = uni_lemma) |>
                         select(definition, category))

## Joining, by = "definition"

coefs_2pl$idx <- 1:nrow(coefs_2pl)


cat_info <- tibble(cat = unique(coefs_2pl$category)) |>
  group_by(cat) |>
  mutate(data = list(tibble(theta = as.vector(theta), 
                            testinfo = testinfo(mod_2pl, theta, 
                                                which.items = filter(coefs_2pl, 
                                                                     category == cat) |>
                                                  pull(idx)),
                            n = nrow(filter(coefs_2pl, category == cat))))) |>
  unnest(cols = "data")

ggplot(cat_info, aes(x = theta, y= testinfo/n)) + 
  geom_line() + 
  facet_wrap(~cat) + 
  ggtitle("Test information per word for different sections")

Given this, let’s exclude the “Sound Effects and Animal Sounds” category. From the other 20 categories, we’ll try taking 5 words from each category, and we’ll try to maximize 1) the standard deviation of difficulty and 2) the mean discrimination.

by_category_max_sd <- coefs_2pl |>
  ungroup() |>
  filter(category != "Sound Effects and Animal Sounds") %>%
  split(.$category) |>
  map_df(function (cat) {
    
    perms <- tibble(n = 1:100) %>%
      split(.$n) |>
      map_df(function (x) {
        slice_sample(cat, n=5) |>
          mutate(score = sd(d) + mean(a1),
                 n = x$n)
      })
    
    filter(perms, score == max(score))
  })
  
by_cat_max_sd_test_info <- testinfo(mod_2pl, theta, which.items = by_category_max_sd$idx)

Let’s try just maximizing discrimination within category.

by_category_max_desc <- coefs_2pl |>
  ungroup() |>
  filter(category != "Sound Effects and Animal Sounds") %>%
  split(.$category) |>
  map_df(function (cat) {
    
    arrange(cat, desc(a1)) |>
      slice(1:5)
  })

by_cat_max_desc_test_info <- testinfo(mod_2pl, theta, which.items = by_category_max_desc$idx)

How about adding the easiest one in each category and then doing the four most discriminating.

by_category_max_desc_one_easy <- coefs_2pl |>
  ungroup() |>
  filter(category != "Sound Effects and Animal Sounds") %>%
  split(.$category) |>
  map_df(function (cat) {
    
    # do this so we definitely get 5 even if the two conditions overlap
    filter(cat, d==max(d) | a1 >= sort(a1, decreasing=TRUE)[5]) |>
      arrange(desc(d)) |>
      slice(1:5)
      
  })

by_category_max_desc_one_easy_test_info <- testinfo(mod_2pl, theta, 
                                                    which.items = by_category_max_desc_one_easy$idx)

Now compare these.

sf_vs_best <- tibble(theta = theta, 
                     `balance discrim and difficulty by category` = by_cat_max_sd_test_info, 
                     `one easy plus most discrim by category` = by_category_max_desc_one_easy_test_info,
                     `most discrim by category` = by_cat_max_desc_test_info, 
                     `most discrim overall` = tinfo_top_desc, 
                     `random` = tinfo_random_100 |> 
                       group_by(theta) |> 
                       summarise(testinfo = mean(testinfo)) |> 
                       pull(testinfo)) |>
  pivot_longer(-theta, names_to = "selection", values_to = "test_information")
                       
                       
ggplot(sf_vs_best, 
       aes(x = theta, 
           y = test_information, col = selection)) + 
  geom_line()

Examine resulting items

short_metrics <- full_prod |>
  rowwise() |>
  mutate(top_desc = sum(c_across(cols = coefs_2pl$definition[top_desc_idx])),
    one_easy = sum(c_across(cols = 
                              coefs_2pl$definition[by_category_max_desc_one_easy$idx]))) |>
  select(subid, age_mo, total, top_desc, one_easy)

short_metrics_long <- short_metrics |>
  pivot_longer(top_desc:one_easy, names_to = "selection", values_to = "estimate")
  

ggplot(short_metrics_long, aes(x = total, y = estimate)) +
  geom_point(alpha = .25) +
  geom_smooth() + 
  geom_abline(lty = 2, slope = 100/length(coefs_2pl$definition)) +
  facet_wrap(~selection)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlations.

cor.test(short_metrics$total, short_metrics$top_desc)

## 
##  Pearson's product-moment correlation
## 
## data:  short_metrics$total and short_metrics$top_desc
## t = 137.66, df = 840, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9754757 0.9812365
## sample estimates:
##       cor 
## 0.9785465

with(filter(short_metrics, age_mo <= 18), 
     cor.test(total, top_desc))

## 
##  Pearson's product-moment correlation
## 
## data:  total and top_desc
## t = 35.18, df = 260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8854195 0.9280139
## sample estimates:
##      cor 
## 0.909063

Correlations are very high for the whole sample, but lower for younger kids.

cor.test(short_metrics$total, short_metrics$one_easy)

## 
##  Pearson's product-moment correlation
## 
## data:  short_metrics$total and short_metrics$one_easy
## t = 137.62, df = 840, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9754631 0.9812268
## sample estimates:
##       cor 
## 0.9785355

with(filter(short_metrics, age_mo <= 18), 
     cor.test(total, one_easy))

## 
##  Pearson's product-moment correlation
## 
## data:  total and one_easy
## t = 53.856, df = 260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9467053 0.9669170
## sample estimates:
##       cor 
## 0.9579844

Our corrected test (“one easy”) does substantially better with the younger kids.

What’s in that test?

filter(coefs_2pl, idx %in% by_category_max_desc_one_easy$idx) |>
  select(definition, category, d, a1) |>
  DT::datatable()

We can maybe do better by removing some redundancy (e.g. toy/toys), but this doesn’t look totally crazy.

Arabic CDI - Short Forms

Mike & George

2022-06-25