This markdown document is part of a sequence documenting the import and analysis of the JISH and Alroqi datasets, with three goals: 1. Data wrangling 2. Short form creation 3. CAT creation
This particular document uses the combined Alroqi and JISH data to try to create informative and useful short forms.
Load data.
load(file = "cached_data/all_arabic_production.Rds")
Prepare data.
d_prod <- select(full_prod, "sheep.sound":"if.2")
words <- names(d_prod)
d_mat <- as.matrix(d_prod)
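The model-fitting code itself is not shown in this document; the comparisons below presumably come from unidimensional IRT models fit with the mirt package. A minimal sketch, assuming the fitted objects are named `mod_rasch`, `mod_2pl`, and `mod_3pl` (`mod_2pl` is used throughout the rest of the document):

```r
library(mirt)

# Fit unidimensional 1PL (Rasch), 2PL, and 3PL models to the
# binary production response matrix.
mod_rasch <- mirt(d_mat, 1, itemtype = "Rasch")
mod_2pl   <- mirt(d_mat, 1, itemtype = "2PL")
mod_3pl   <- mirt(d_mat, 1, itemtype = "3PL")

# Compare nested models; anova() reports AIC, BIC, and a likelihood ratio test.
anova(mod_rasch, mod_2pl)
anova(mod_2pl, mod_3pl)
```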
Compared to the Rasch model, the 2PL model fits better and is preferred by both AIC and BIC.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
Rasch | 375986.4 | 379997.6 | -187146.2 | NaN |
2PL | 368675.3 | 376688.3 | -182645.7 | 845 |
The 2PL is favored over the 3PL model by BIC, although AIC prefers the 3PL.
Model | AIC | BIC | logLik | df |
---|---|---|---|---|
2PL | 368675.3 | 376688.3 | -182645.7 | NaN |
3PL | 368383.3 | 380402.7 | -181653.6 | 846 |
The 2PL is preferred over both the Rasch (1PL) model and the 3PL model by BIC, so we use the 2PL model as the basis for the rest of our analyses. Next we look for local dependence (LD) among the items, and also check for ill-fitting items. We will remove any items that show both strong LD and poor fit.
Some items are very highly correlated with one another. Here are the item pairs with the largest LD residuals from the model.
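The `res` object used below is not defined in this section; presumably it is the pairwise LD matrix for the 2PL model. A sketch, assuming the Chen & Thissen (1997) statistics come from mirt's `residuals()`:

```r
# Pairwise local dependence statistics (Chen & Thissen, 1997) for the
# fitted 2PL model, returned as an item-by-item matrix.
res <- residuals(mod_2pl, type = "LD")
```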
d_res <- res |>
as_tibble() |>
mutate(word1 = rownames(res)) |>
pivot_longer(-word1, names_to = "word2", values_to = "residual")
arrange(d_res, desc(residual)) |>
filter(residual > 450) |>
knitr::kable(digits = 2)
word1 | word2 | residual |
---|---|---|
lizard | turkey | 647.8310 |
squirrel | owl | 574.6708 |
peas | vitamin | 551.0986 |
lizard | carriage | 531.0199 |
carriage | pig | 525.9496 |
leopard | lizard | 498.5478 |
carriage | squirrel | 488.5663 |
lizard | squirrel | 487.2589 |
crow | lizard | 487.1227 |
leopard | puppy | 486.7679 |
stretcher | milk | 485.4912 |
puppy | lizard | 481.9509 |
through | around | 478.4834 |
turkey | squirrel | 472.8403 |
crow | turkey | 467.7399 |
goose | squirrel | 466.6497 |
crow | goose | 460.0712 |
nine | seven | 458.4109 |
penguin | owl | 457.0020 |
lizard | pig | 456.5636 |
garage | milk | 454.6480 |
turkey | goose | 453.8696 |
poor | dark.in.color | 452.9800 |
crow | squirrel | 452.5225 |
leopard | squirrel | 451.1818 |
We examined each item for pairwise local dependence (LD) with other items using the \(\chi^{2}\) LD statistic (Chen & Thissen, 1997), and found that 117 items show strong LD (Cramér's \(V \geq 0.5\)), and 835 items show moderate LD (\(V \geq 0.3\)) with at least 10 other items. This suggests multidimensionality, and we may want to follow up with exploratory factor analysis.
Our next goal is to determine whether all items should be included in the item bank. Items that have very bad properties should probably be dropped. We will prune any ill-fitting items (\(\chi^{2*}_{df}\), \(p<.001\)) from the full 2PL model that also showed strong LD.
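A sketch of how these counts might be computed, assuming `res_V` is a (hypothetical) item-by-item matrix of Cramér's \(V\) values:

```r
# For each item, count partners showing strong (V >= .5) and
# moderate (V >= .3) local dependence.
strong_ld_partners   <- rowSums(abs(res_V) >= 0.5, na.rm = TRUE)
moderate_ld_partners <- rowSums(abs(res_V) >= 0.3, na.rm = TRUE)

sum(strong_ld_partners > 0)      # items with at least one strong LD partner
sum(moderate_ld_partners >= 10)  # items with moderate LD with 10+ items
```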
Check on badly fitting items.
300 items did not fit well in the full 2PL model, and these items are shown below.
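The objects `itfit2pl` and `bad_items2pl` come from code not shown here; a sketch of how they might be computed with mirt's `itemfit()`. For simplicity this uses the default \(S\text{-}X^2\) statistic rather than the \(\chi^{2*}_{df}\) statistic referenced above, so treat the details as assumptions:

```r
# Item fit statistics for every item in the full 2PL model.
itfit2pl <- mirt::itemfit(mod_2pl)             # default S-X2 statistic
bad_items2pl <- which(itfit2pl$p.S_X2 < .001)  # row indices of ill-fitting items
```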
bad_items <- itfit2pl$item[bad_items2pl]
kable(matrix(bad_items, ncol = 4)) %>%
html_table_width(rep(150, 4))
sheep.sound | sandal | winds | green |
cat.sound | gloves | shovel | dark.in.color |
desire.to.eat.sound | diapers | hose | full |
car.sound | abaya | traffic.sign | bad |
crying.sound | mouth | castle | expired |
train.sound | elbow | mama | old.in.age.1 |
surprise.sound | head | papa | old.in.age.2 |
pain.sound | chin | domestic.worker | old |
falling.sound | foot | grandmother | poor |
to.express.bad.smell.sound | palm | baby | brown |
bird.sound | leg.2 | teacher.female.1 | yummy |
hot.sound | forehead | clown | high |
yummy.sound | arm | firefighters | crazy |
dog.sound | thumb | friend | sticky |
expressing.agreement | eyelashes | barber | soft.not.firm |
monster.sound | trash.can | person | colorful |
goat | lamp.light | guard | difficult |
horse | pillow | teacher.female.2 | sneeze.noun.2 |
fish.animal | plant | finger.game.2 | saliva |
dog | television | incy.wincy.spider | mucus |
bear | bag | hi | day |
chicken.animal | album | go.potty | later |
rabbit | tray | marhaba | today |
duck | soap | wait.imperative | before |
monkey | kettle | time.to.sleep | after |
penguin | money | turn.around | time |
donkey | rag | give.me | yesterday |
cow | matches | peace.be.upon.him | how |
wolf | plate | calm.down | when |
lion | tissue | hide.and.seek | he |
ball | key | last.night | yours |
bubbles | pacifier.2 | play | she |
blocks | bowl | pour | this.masculine |
toy | remote.control | measure | that.masculine.1 |
pencil | spoon | shake | mine |
story | floor.wiper | choose | they |
toys | radio | share | hers |
circle | lantern | read | we |
car | decorative.accent | return | ours |
pickup.truck | clock | tickle | these |
ambulance | carry.bag | splash | that.masculine.2 |
bus | glass.bottle | pretend | those |
biscuit | jug | find | there.is.nothing |
candy | box | curse | not.here |
gello | phone | think | not.nice |
soup | mattress | peel | finished.no.more |
chocolate | cooker | fly | five |
rice | side.table | look.1 | any.one |
carrot | vacuum.machine | swim | another.1 |
buttermilk | bed | fall.or.scatter | different |
jam | shower | crash.by | same.thing |
popcorn | mirror | study | another.2 |
rusk | vanity.table | press | more.than |
medicine | chair | be | three |
hamburger | heater | smile | seven |
strawberry | chandeliers | swing.causative.verb | eight |
banana | door | kick | nine |
butter | potty | cancel | ten |
juice | rocking.chair | fix | six |
water | kitchen | swing | a.lot.2 |
lettuce | knife.2 | spit | few |
chicken.food | plastic.sheets | nice | away |
pepsi | roof | yellow | near |
corn | ladder | orange.color | on |
doughnut | sea | having.good.taste | about |
cherry | star | blue | around |
yogurt | stone | light | in |
fries | building | pink | with.using |
sauce | nursery | thin | through |
mulukhiyah | preschool | alert | here.2 |
skirt | tent | final.last | on.top.of |
head.scarf | garden | second | to |
shirt.1 | work.noun | fixed | and |
shoes | clouds | red | but |
undershirt | sidewalk | light.in.color | also |
Next, we examine the coefficients of the 2PL model.
Items that are estimated to be very easy (e.g., mommy, daddy, ball) or very difficult (would, were, country) are highlighted, as well as those at the extremes of discrimination (a1).
We remove the “bad items” identified above.
coefs_2pl <- as_tibble(coef(mod_2pl, simplify = TRUE)$items) %>%
mutate(definition = rownames(coef(mod_2pl, simplify = TRUE)$items)) |>
ungroup()
ggplot(filter(coefs_2pl,
!(definition %in% bad_items)),
aes(x = a1, y = -d)) +
geom_point(alpha = .3) +
ggrepel::geom_text_repel(data = filter(coefs_2pl,
a1 < 1 | a1 > 3.8 | -d > 5 | -d < -2.5),
aes(label = definition), size = 2,
show.legend = FALSE) +
xlab("Discrimination") +
ylab("Difficulty")
## Warning: ggrepel: 145 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Our goal is to create a 100-item test with the best items for a given age/ability range.
Let’s find our estimated abilities and see how they relate to age.
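The `fscores_2pl` object holding the ability estimates is not created in this section; a sketch, assuming EAP factor scores from mirt:

```r
# EAP latent ability estimate for each child under the 2PL model.
fscores_2pl <- tibble::tibble(
  ability = as.vector(mirt::fscores(mod_2pl, method = "EAP"))
)
```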
qplot(x = full_prod$age_mo, y = fscores_2pl$ability, col = full_prod$source, geom = "point" ) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
So we would like a range of abilities from about -2.5 to 4. Here’s our resulting test information curve.
theta <- matrix(seq(-2.5,4,.01))
tinfo <- testinfo(mod_2pl, theta)
plot(theta, tinfo, type = 'l')
sum(tinfo)
## [1] 221075.8
First let’s do some sanity checks. If we remove the bad items, we should lose less test information than if we remove a random subset of the same size.
bad_item_idx <- which(words %in% bad_items)
tinfo_no_bad <- testinfo(mod_2pl, theta,
which.items = setdiff(1:length(words), bad_item_idx))
# run 100 replicates of random removal (same number of items) for comparison
tinfo_no_random <- replicate(100,
  sum(testinfo(mod_2pl, theta,
               which.items = setdiff(seq_along(words),
                                     sample(seq_along(words),
                                            length(bad_items))))))
# plot(theta, tinfo_no_bad, type = 'l')
# plot(theta, tinfo_no_random, type = 'l')
sum(tinfo_no_bad)
## [1] 143796.7
mean(tinfo_no_random)
## [1] 142699.8
Actually, removing the “bad” items loses us no more test information than removing a random subset of the same size.
Let’s try making some random 100-item subtests.
coefs_2pl <- mutate(coefs_2pl, idx = 1:n())
tinfo_random_100 <- tibble(n = 1:1000) %>%
split(.$n) |>
map_df(function(x) {
tibble(theta = as.vector(theta),
testinfo = testinfo(mod_2pl, theta,
which.items = slice_sample(coefs_2pl, n = 100) |>
pull(idx)),
n = x$n)
})
tinfo_random_summary <- tinfo_random_100 |>
group_by(n) |>
summarise(testinfo = sum(testinfo))
ggplot(tinfo_random_summary,
aes(x = testinfo)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The mean random test information is \(2.6145297 \times 10^{4}\).
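That mean is just the average of the 1000 subtest totals computed above:

```r
# Average total test information across the 1000 random 100-item subtests.
mean(tinfo_random_summary$testinfo)
```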
Now let’s try selecting high discrimination items.
top_desc <- arrange(coefs_2pl, desc(a1)) |>
slice(1:100) |>
pull(definition)
top_desc_idx <- which(words %in% top_desc)
tinfo_top_desc <- testinfo(mod_2pl, theta, which.items = top_desc_idx)
plot(theta, tinfo_top_desc, type = 'l')
The best test selected on the basis of discrimination has test information of \(3.7571746 \times 10^{4}\). That gives us an upper bound on short-form test information.
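For comparison with the random subtests:

```r
# Total test information for the 100 most discriminating items.
sum(tinfo_top_desc)
```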
Now let’s try to do some kind of optimization. Our constraints are:

- 100 items total;
- coverage across the CDI categories (roughly equal numbers of items per category);
- good test information across the target ability range (about -2.5 to 4).
Let’s look at test information for each of the categories first.
coefs_2pl <- left_join(coefs_2pl,
items |>
select(-definition) |>
rename(definition = uni_lemma) |>
select(definition, category))
## Joining, by = "definition"
coefs_2pl$idx <- 1:nrow(coefs_2pl)
cat_info <- tibble(cat = unique(coefs_2pl$category)) |>
group_by(cat) |>
mutate(data = list(tibble(theta = as.vector(theta),
testinfo = testinfo(mod_2pl, theta,
which.items = filter(coefs_2pl,
category == cat) |>
pull(idx)),
n = nrow(filter(coefs_2pl, category == cat))))) |>
unnest(cols = "data")
ggplot(cat_info, aes(x = theta, y= testinfo/n)) +
geom_line() +
facet_wrap(~cat) +
ggtitle("Test information per word for different sections")
Given this, let’s exclude the “Sound Effects and Animal Sounds”
category. From the other 20 categories, we’ll try taking 5 words from
each category, and we’ll try to maximize 1) the standard deviation of
difficulty and 2) the mean discrimination.
by_category_max_sd <- coefs_2pl |>
ungroup() |>
filter(category != "Sound Effects and Animal Sounds") %>%
split(.$category) |>
map_df(function (cat) {
perms <- tibble(n = 1:100) %>%
split(.$n) |>
map_df(function (x) {
slice_sample(cat, n=5) |>
mutate(score = sd(d) + mean(a1),
n = x$n)
})
filter(perms, score == max(score))
})
by_cat_max_sd_test_info <- testinfo(mod_2pl, theta, which.items = by_category_max_sd$idx)
Let’s try just maximizing discrimination within category.
by_category_max_desc <- coefs_2pl |>
ungroup() |>
filter(category != "Sound Effects and Animal Sounds") %>%
split(.$category) |>
map_df(function (cat) {
arrange(cat, desc(a1)) |>
slice(1:5)
})
by_cat_max_desc_test_info <- testinfo(mod_2pl, theta, which.items = by_category_max_desc$idx)
How about adding the easiest item in each category and then taking the four most discriminating items?
by_category_max_desc_one_easy <- coefs_2pl |>
ungroup() |>
filter(category != "Sound Effects and Animal Sounds") %>%
split(.$category) |>
map_df(function (cat) {
# do this so we definitely get 5 even if the two conditions overlap
filter(cat, d==max(d) | a1 >= sort(a1, decreasing=TRUE)[5]) |>
arrange(desc(d)) |>
slice(1:5)
})
by_category_max_desc_one_easy_test_info <- testinfo(mod_2pl, theta,
which.items = by_category_max_desc_one_easy$idx)
Now compare these.
sf_vs_best <- tibble(theta = as.vector(theta),
`balance discrim and difficulty by category` = by_cat_max_sd_test_info,
`one easy plus most discrim by category` = by_category_max_desc_one_easy_test_info,
`most discrim by category` = by_cat_max_desc_test_info,
`most discrim overall` = tinfo_top_desc,
`random` = tinfo_random_100 |>
group_by(theta) |>
summarise(testinfo = mean(testinfo)) |>
pull(testinfo)) |>
pivot_longer(-theta, names_to = "selection", values_to = "test_information")
ggplot(sf_vs_best,
aes(x = theta,
y = test_information, col = selection)) +
geom_line()
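To put numbers on these curves, we can also total the information for each strategy:

```r
# Total test information per selection strategy, highest first.
sf_vs_best |>
  group_by(selection) |>
  summarise(total_info = sum(test_information)) |>
  arrange(desc(total_info))
```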
short_metrics <- full_prod |>
rowwise() |>
mutate(top_desc = sum(c_across(all_of(coefs_2pl$definition[top_desc_idx]))),
one_easy = sum(c_across(all_of(
coefs_2pl$definition[by_category_max_desc_one_easy$idx])))) |>
select(subid, age_mo, total, top_desc, one_easy)
short_metrics_long <- short_metrics |>
pivot_longer(top_desc:one_easy, names_to = "selection", values_to = "estimate")
ggplot(short_metrics_long, aes(x = total, y = estimate)) +
geom_point(alpha = .25) +
geom_smooth() +
geom_abline(lty = 2, slope = 100/length(coefs_2pl$definition)) +
facet_wrap(~selection)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Correlations.
cor.test(short_metrics$total, short_metrics$top_desc)
##
## Pearson's product-moment correlation
##
## data: short_metrics$total and short_metrics$top_desc
## t = 137.66, df = 840, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9754757 0.9812365
## sample estimates:
## cor
## 0.9785465
with(filter(short_metrics, age_mo <= 18),
cor.test(total, top_desc))
##
## Pearson's product-moment correlation
##
## data: total and top_desc
## t = 35.18, df = 260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8854195 0.9280139
## sample estimates:
## cor
## 0.909063
Correlations are very high for the whole sample, but lower for younger kids.
cor.test(short_metrics$total, short_metrics$one_easy)
##
## Pearson's product-moment correlation
##
## data: short_metrics$total and short_metrics$one_easy
## t = 137.62, df = 840, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9754631 0.9812268
## sample estimates:
## cor
## 0.9785355
with(filter(short_metrics, age_mo <= 18),
cor.test(total, one_easy))
##
## Pearson's product-moment correlation
##
## data: total and one_easy
## t = 53.856, df = 260, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9467053 0.9669170
## sample estimates:
## cor
## 0.9579844
Our corrected test (“one easy”) does substantially better with the younger kids (r = .96 vs. .91 for children 18 months and younger).
What’s in that test?
filter(coefs_2pl, idx %in% by_category_max_desc_one_easy$idx) |>
select(definition, category, d, a1) |>
DT::datatable()
We can maybe do better by removing some redundancy (e.g. toy/toys), but this doesn’t look totally crazy.
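One hypothetical way to flag such redundancy is string similarity over item definitions, sketched here with the stringdist package (not part of the analysis above; the threshold is arbitrary):

```r
# Flag short-form item pairs whose definitions are very similar
# (e.g., toy/toys) as candidates for pruning.
defs <- by_category_max_desc_one_easy$definition
d_jw <- stringdist::stringdistmatrix(defs, defs, method = "jw")
close_pairs <- which(d_jw < 0.1 & upper.tri(d_jw), arr.ind = TRUE)
tibble(word1 = defs[close_pairs[, 1]],
       word2 = defs[close_pairs[, 2]])
```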