This markdown demonstrates that QAP regression doesn’t make sense for the analyses comparing local versus global correlations.
The dataframe below has the average local and global correlation for each language pair, for each of the two corpora. Our hypothesis is that the difference between local and global is positive (local more correlated than global). To test this, we ask whether “dif” (local minus global) is greater than zero.
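library(tidyverse) # read_csv, dplyr verbs, pivot_wider / pivot_longer
library(knitr)     # kable
library(broom)     # tidy
DIF_PATH <- "/Users/mollylewis/Documents/research/Projects/1_in_progress/L2ETS/analyses/02_concreteness_semantics/data/local_global_conc.csv"
difs <- read_csv(DIF_PATH)
kable(difs %>% slice(1:10))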
lang1 | lang2 | local | global | dif | corpus |
---|---|---|---|---|---|
ar | bg | 0.3303223 | 0.2830508 | 0.0472715 | wiki |
ar | bn | 0.2480982 | 0.2057305 | 0.0423677 | wiki |
ar | de | 0.3248045 | 0.2796395 | 0.0451650 | wiki |
ar | el | 0.3134378 | 0.2688484 | 0.0445894 | wiki |
ar | en | 0.3697269 | 0.3181968 | 0.0515301 | wiki |
ar | es | 0.3474667 | 0.3001870 | 0.0472796 | wiki |
ar | fa | 0.3054198 | 0.2562161 | 0.0492037 | wiki |
ar | fr | 0.3360880 | 0.2922327 | 0.0438553 | wiki |
ar | gu | 0.2278962 | 0.1947674 | 0.0331288 | wiki |
ar | hi | 0.2927884 | 0.2489938 | 0.0437945 | wiki |
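As a rough illustration of what "dif greater than zero" means as a test, the sketch below uses only the ten "dif" values shown in the table above. It assumes nothing beyond base R, and it ignores the non-independence of language pairs (all ten rows involve Arabic), so it is an illustration rather than the actual analysis. It shows that an intercept-only regression and a one-sample t-test produce the same estimate and test statistic.

```r
# The ten "dif" values from the table above (wiki corpus, ar paired with ten languages).
dif <- c(0.0472715, 0.0423677, 0.0451650, 0.0445894, 0.0515301,
         0.0472796, 0.0492037, 0.0438553, 0.0331288, 0.0437945)

# Intercept-only regression: the intercept is the mean of dif.
fit <- summary(lm(dif ~ 1))$coefficients

# One-sample t-test of the same question (is the mean of dif greater than 0?).
tt <- t.test(dif, mu = 0, alternative = "greater")

# The estimate and the t statistic agree between the two approaches.
all.equal(unname(fit[1, "Estimate"]), mean(dif))
all.equal(unname(fit[1, "t value"]), unname(tt$statistic))
```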
Let’s start by getting the data into wide format (Wiki only).
difs_wiki <- difs %>%
  filter(corpus == "wiki") %>%
  # add the mirrored pairs (lang1 and lang2 swapped) so the matrix is symmetric
  bind_rows(difs %>% filter(corpus == "wiki") %>%
              select(-local, -global) %>%
              rename(lang1 = lang2, lang2 = lang1)) %>%
  select(lang1, lang2, dif) %>%
  arrange(lang1, lang2) %>%
  pivot_wider(names_from = "lang2", values_from = "dif") %>%
  # move "ar" to the front so the column order matches the row order
  select("lang1", "ar", everything()) %>%
  select(-lang1)
difs_wiki
## # A tibble: 35 x 35
## ar bg bn de el en es fa fr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 0.0473 0.0424 0.0452 0.0446 0.0515 0.0473 0.0492 0.0439
## 2 0.0473 NA 0.0460 0.0550 0.0593 0.0556 0.0523 0.0492 0.0505
## 3 0.0424 0.0460 NA 0.0428 0.0511 0.0472 0.0467 0.0434 0.0449
## 4 0.0452 0.0550 0.0428 NA 0.0556 0.0541 0.0519 0.0449 0.0489
## 5 0.0446 0.0593 0.0511 0.0556 NA 0.0582 0.0542 0.0519 0.0522
## 6 0.0515 0.0556 0.0472 0.0541 0.0582 NA 0.0502 0.0454 0.0514
## 7 0.0473 0.0523 0.0467 0.0519 0.0542 0.0502 NA 0.0439 0.0465
## 8 0.0492 0.0492 0.0434 0.0449 0.0519 0.0454 0.0439 NA 0.0445
## 9 0.0439 0.0505 0.0449 0.0489 0.0522 0.0514 0.0465 0.0445 NA
## 10 0.0331 0.0347 0.0318 0.0367 0.0359 0.0409 0.0360 0.0381 0.0327
## # … with 25 more rows, and 26 more variables: gu <dbl>, hi <dbl>, id <dbl>,
## # ig <dbl>, it <dbl>, ja <dbl>, kn <dbl>, ko <dbl>, ml <dbl>, mr <dbl>,
## # ne <dbl>, nl <dbl>, pa <dbl>, pl <dbl>, pt <dbl>, ro <dbl>, ru <dbl>,
## # ta <dbl>, te <dbl>, th <dbl>, tl <dbl>, tr <dbl>, ur <dbl>, vi <dbl>,
## # yo <dbl>, zh <dbl>
Next, let’s do the QAP scramble procedure. The algorithm is: apply the same random permutation to both the rows and the columns of the dependent matrix. In this case, our dependent matrix is the matrix of dif values, and the independent matrix is a matrix of 1s (that is, an intercept only).
The following code does the scrambling:
get_scrambled_mat <- function(x) {
  mat_x <- as.matrix(x)
  # one random permutation, applied to rows and columns together
  random_indices <- sample(nrow(mat_x))
  # return the shuffled matrix visibly
  mat_x[random_indices, random_indices]
}
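The scramble can be checked on a small toy matrix. The sketch below uses made-up values (the matrix `toy` and its labels are hypothetical, not from the data) and applies one permutation to rows and columns at once. The scrambled matrix stays symmetric, keeps its NA diagonal, and contains exactly the same set of off-diagonal values, so its mean cannot change.

```r
set.seed(1)

# Hypothetical symmetric 4 x 4 matrix with NA on the diagonal.
labs <- c("a", "b", "c", "d")
toy <- matrix(NA_real_, 4, 4, dimnames = list(labs, labs))
toy[upper.tri(toy)] <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
toy[lower.tri(toy)] <- t(toy)[lower.tri(toy)]

# Apply the same random permutation to rows and columns.
perm <- sample(nrow(toy))
toy_scrambled <- toy[perm, perm]

# Still symmetric, diagonal still NA, same off-diagonal values, same mean.
identical(toy_scrambled, t(toy_scrambled))
all(is.na(diag(toy_scrambled)))
identical(sort(as.vector(toy_scrambled)), sort(as.vector(toy)))
all.equal(mean(toy_scrambled, na.rm = TRUE), mean(toy, na.rm = TRUE))
```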
The following code converts a matrix from wide to long form (for doing regression):
get_mat_in_long_form <- function(x){
  as.data.frame(x) %>%
    # rows are in the same order as the columns, so column names label the rows
    mutate(lang1 = colnames(x)) %>%
    pivot_longer(cols = 1:nrow(x), names_to = "lang2")
}
conc_in_out_wide_scrambled <- difs_wiki %>%
  get_scrambled_mat() %>%
  get_mat_in_long_form()
Matrix is definitely scrambled:
kable(head(conc_in_out_wide_scrambled))
lang1 | lang2 | value |
---|---|---|
id | id | NA |
id | ja | 0.0094095 |
id | ru | 0.0533725 |
id | mr | 0.0385369 |
id | kn | 0.0482233 |
id | vi | 0.0308962 |
Run the regression on the scrambled matrix:
lm(value ~ 1, conc_in_out_wide_scrambled) %>%
  summary() %>%
  tidy() %>%
  kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0353898 | 0.0004398 | 80.46892 | 0 |
conc_in_out_wide_unscrambled <- difs_wiki %>%
  get_mat_in_long_form()
Matrix is not scrambled:
kable(head(conc_in_out_wide_unscrambled))
lang1 | lang2 | value |
---|---|---|
ar | ar | NA |
ar | bg | 0.0472715 |
ar | bn | 0.0423677 |
ar | de | 0.0451650 |
ar | el | 0.0445894 |
ar | en | 0.0515301 |
lm(value ~ 1, conc_in_out_wide_unscrambled) %>%
  summary() %>%
  tidy() %>%
  kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0353898 | 0.0004398 | 80.46892 | 0 |
The results are the same. That’s because no matter how you scramble the dependent variable, the predictor is always the same (1), so the intercept is just the mean of the same set of values in a different order. This is not true for the Swadesh analyses or for the analysis where we’re predicting semantic distance with five predictors.
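To make the contrast concrete, here is a minimal sketch with simulated matrices. The matrices `y` and `x`, the helper `make_sym`, and the seed are all hypothetical and stand in for a dependent and a predictor matrix; they are not the data from this analysis. With an intercept only, scrambling leaves the estimate untouched. With a real predictor, scrambling pairs each cell of `y` with a different cell of `x`, so the coefficients change.

```r
set.seed(42)
n <- 6

# Hypothetical helper: random symmetric n x n matrix with NA on the diagonal.
make_sym <- function(n) {
  m <- matrix(0, n, n)
  m[upper.tri(m)] <- rnorm(n * (n - 1) / 2)
  m <- m + t(m)
  diag(m) <- NA
  m
}

y <- make_sym(n)  # stand-in dependent matrix
x <- make_sym(n)  # stand-in predictor matrix

# QAP scramble: same permutation for rows and columns of y.
perm <- sample(n)
y_scrambled <- y[perm, perm]

# Intercept-only model: identical estimate before and after scrambling.
b0   <- coef(lm(as.vector(y) ~ 1))
b0_s <- coef(lm(as.vector(y_scrambled) ~ 1))
all.equal(b0, b0_s)

# Model with a predictor: scrambling y changes the coefficients.
b1   <- coef(lm(as.vector(y) ~ as.vector(x)))
b1_s <- coef(lm(as.vector(y_scrambled) ~ as.vector(x)))
rbind(original = b1, scrambled = b1_s)
```

This is why a permutation null distribution is degenerate for the intercept-only case: every scramble returns the observed estimate.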