This markdown demonstrates that QAP regression doesn’t make sense for the analyses comparing local versus global correlations.

The dataframe below has the average local and global correlation for each language pair, for each of the two corpora. Our hypothesis is that the difference between local and global is positive (local more correlated than global). To test this, we ask whether “dif” is greater than zero.

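library(tidyverse)  # read_csv, dplyr verbs, pivot_wider/pivot_longer
library(knitr)      # kable
library(broom)      # tidy
DIF_PATH <- "/Users/mollylewis/Documents/research/Projects/1_in_progress/L2ETS/analyses/02_concreteness_semantics/data/local_global_conc.csv"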
difs <- read_csv(DIF_PATH)

kable(difs %>% slice(1:10))
lang1 lang2 local global dif corpus
ar bg 0.3303223 0.2830508 0.0472715 wiki
ar bn 0.2480982 0.2057305 0.0423677 wiki
ar de 0.3248045 0.2796395 0.0451650 wiki
ar el 0.3134378 0.2688484 0.0445894 wiki
ar en 0.3697269 0.3181968 0.0515301 wiki
ar es 0.3474667 0.3001870 0.0472796 wiki
ar fa 0.3054198 0.2562161 0.0492037 wiki
ar fr 0.3360880 0.2922327 0.0438553 wiki
ar gu 0.2278962 0.1947674 0.0331288 wiki
ar hi 0.2927884 0.2489938 0.0437945 wiki

Let’s start by getting the data into wide format (Wiki only).

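# Keep only the Wiki pairs and the columns needed for the matrix
wiki_pairs <- difs %>%
  filter(corpus == "wiki") %>%
  select(lang1, lang2, dif)

# Mirror each pair (lang1/lang2 swapped) so the matrix is symmetric,
# then spread to a square matrix of "dif" values
difs_wiki <- wiki_pairs %>%
  bind_rows(wiki_pairs %>% rename(lang1 = lang2, lang2 = lang1)) %>%
  arrange(lang1, lang2) %>%
  pivot_wider(names_from = "lang2", values_from = "dif") %>%
  # move "ar" to the front so the column order matches the row order
  select(lang1, ar, everything()) %>%
  select(-lang1)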

difs_wiki
## # A tibble: 35 x 35
##         ar      bg      bn      de      el      en      es      fa      fr
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 NA       0.0473  0.0424  0.0452  0.0446  0.0515  0.0473  0.0492  0.0439
##  2  0.0473 NA       0.0460  0.0550  0.0593  0.0556  0.0523  0.0492  0.0505
##  3  0.0424  0.0460 NA       0.0428  0.0511  0.0472  0.0467  0.0434  0.0449
##  4  0.0452  0.0550  0.0428 NA       0.0556  0.0541  0.0519  0.0449  0.0489
##  5  0.0446  0.0593  0.0511  0.0556 NA       0.0582  0.0542  0.0519  0.0522
##  6  0.0515  0.0556  0.0472  0.0541  0.0582 NA       0.0502  0.0454  0.0514
##  7  0.0473  0.0523  0.0467  0.0519  0.0542  0.0502 NA       0.0439  0.0465
##  8  0.0492  0.0492  0.0434  0.0449  0.0519  0.0454  0.0439 NA       0.0445
##  9  0.0439  0.0505  0.0449  0.0489  0.0522  0.0514  0.0465  0.0445 NA     
## 10  0.0331  0.0347  0.0318  0.0367  0.0359  0.0409  0.0360  0.0381  0.0327
## # … with 25 more rows, and 26 more variables: gu <dbl>, hi <dbl>, id <dbl>,
## #   ig <dbl>, it <dbl>, ja <dbl>, kn <dbl>, ko <dbl>, ml <dbl>, mr <dbl>,
## #   ne <dbl>, nl <dbl>, pa <dbl>, pl <dbl>, pt <dbl>, ro <dbl>, ru <dbl>,
## #   ta <dbl>, te <dbl>, th <dbl>, tl <dbl>, tr <dbl>, ur <dbl>, vi <dbl>,
## #   yo <dbl>, zh <dbl>

Next, let’s do the QAP scramble procedure. The algorithm is: apply the same random permutation to the rows and the columns of the dependent matrix, so that corresponding rows and columns are shuffled together. In this case our dependent matrix is the matrix of dif values, and the independent matrix is a matrix of 1s.

The following code does the scrambling:

get_scrambled_mat <- function(x) {
  mat_x <- as.matrix(x)
  # one random permutation, applied to both rows and columns
  random_indices <- sample(nrow(mat_x))
  # return the shuffled matrix visibly (no trailing assignment)
  mat_x[random_indices, random_indices]
}
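
As a quick sanity check, here is a minimal sketch on a toy matrix (the 4 x 4 values, the seed, and the choice of labels are made up for illustration and are not taken from the data). Applying one permutation to both rows and columns, as get_scrambled_mat() does, keeps the matrix symmetric and keeps the NA cells on the diagonal, so every value stays attached to its language pair:

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical symmetric 4 x 4 matrix with an NA diagonal
langs <- c("ar", "bg", "bn", "de")
toy <- matrix(NA_real_, 4, 4, dimnames = list(langs, langs))
toy[upper.tri(toy)] <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
toy[lower.tri(toy)] <- t(toy)[lower.tri(toy)]

# Same permutation applied to rows and columns
idx <- sample(nrow(toy))
scrambled <- toy[idx, idx]

identical(scrambled, t(scrambled))        # still symmetric
all(is.na(diag(scrambled)))               # diagonal is still NA
scrambled["ar", "bg"] == toy["ar", "bg"]  # values travel with their labels
```

Only the order of the rows and columns changes; the set of cell values is untouched.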

The following code converts a matrix from wide to long form (for doing regression):

get_mat_in_long_form <- function(x) {
  as.data.frame(x) %>%
    # rows are in the same order as columns, so the column names label the rows
    mutate(lang1 = colnames(x)) %>%
    pivot_longer(cols = 1:nrow(x), names_to = "lang2")
}

The scrambled version

conc_in_out_wide_scrambled <- difs_wiki %>%
  get_scrambled_mat() %>%
  get_mat_in_long_form()

Matrix is definitely scrambled:

kable(head(conc_in_out_wide_scrambled))
lang1 lang2 value
id id NA
id ja 0.0094095
id ru 0.0533725
id mr 0.0385369
id kn 0.0482233
id vi 0.0308962

Run the regression

lm(value ~ 1, conc_in_out_wide_scrambled) %>%
  summary() %>%
  tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 0.0353898 0.0004398 80.46892 0

The unscrambled version

conc_in_out_wide_unscrambled <- difs_wiki %>%
  get_mat_in_long_form()

Matrix is not scrambled:

kable(head(conc_in_out_wide_unscrambled))
lang1 lang2 value
ar ar NA
ar bg 0.0472715
ar bn 0.0423677
ar de 0.0451650
ar el 0.0445894
ar en 0.0515301

Run the regression

lm(value ~ 1, conc_in_out_wide_unscrambled) %>%
  summary() %>%
  tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 0.0353898 0.0004398 80.46892 0

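…. The results are the same. That’s because an intercept-only regression just estimates the mean of the dependent variable, and scrambling rearranges the cells without changing their values: no matter how you scramble the dependent variable, the predictor is always the same (1). This is not true for the Swadesh analyses or for the analysis where we’re predicting semantic distance with five predictors.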
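
The invariance can be checked directly on simulated data. In the sketch below (base R only; the 6 x 6 matrices, the seed, and the helper name make_sym are illustrative assumptions, not part of the analysis), the intercept-only estimate equals the mean of the off-diagonal cells and so is the same before and after scrambling, whereas the slope on a non-constant predictor matrix will generally change:

```r
set.seed(42)  # arbitrary seed for this illustration
n <- 6

# Hypothetical helper: random symmetric matrix with an NA diagonal
make_sym <- function(n) {
  m <- matrix(rnorm(n * n), n, n)
  m <- (m + t(m)) / 2
  diag(m) <- NA
  m
}
y <- make_sym(n)  # stands in for the dependent matrix
x <- make_sym(n)  # stands in for a non-constant predictor matrix

# Scramble y with one permutation applied to rows and columns
idx <- sample(n)
y_scrambled <- y[idx, idx]

# Intercept-only model: the estimate is just the mean of the cells
b0           <- coef(lm(as.vector(y) ~ 1))[[1]]
b0_scrambled <- coef(lm(as.vector(y_scrambled) ~ 1))[[1]]
all.equal(b0, mean(y, na.rm = TRUE))
all.equal(b0, b0_scrambled)

# With a real predictor, scrambling y generally changes the fitted slope
slope           <- coef(lm(as.vector(y) ~ as.vector(x)))[[2]]
slope_scrambled <- coef(lm(as.vector(y_scrambled) ~ as.vector(x)))[[2]]
c(observed = slope, scrambled = slope_scrambled)
```

This is why a permutation distribution is only informative when the predictor varies across cells: with a constant predictor, every scrambled fit reproduces the observed one.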