QAP regression

This document demonstrates that QAP regression doesn’t make sense for the analyses comparing local versus global correlations.

The dataframe below has the average local and global correlation for each language pair, for each of the two corpora. Our hypothesis is that the difference between local and global is positive (local more correlated than global). To test this, we ask whether “dif” is greater than zero.

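# Packages used below (presumably loaded in a hidden setup chunk in the original)
library(tidyverse)  # read_csv, dplyr verbs, pivot_wider, pivot_longer
library(knitr)      # kable
library(broom)      # tidy

DIF_PATH <- "/Users/mollylewis/Documents/research/Projects/1_in_progress/L2ETS/analyses/02_concreteness_semantics/data/local_global_conc.csv"
difs <- read_csv(DIF_PATH)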

kable(difs %>% slice(1:10))

lang1	lang2	local	global	dif	corpus
ar	bg	0.3303223	0.2830508	0.0472715	wiki
ar	bn	0.2480982	0.2057305	0.0423677	wiki
ar	de	0.3248045	0.2796395	0.0451650	wiki
ar	el	0.3134378	0.2688484	0.0445894	wiki
ar	en	0.3697269	0.3181968	0.0515301	wiki
ar	es	0.3474667	0.3001870	0.0472796	wiki
ar	fa	0.3054198	0.2562161	0.0492037	wiki
ar	fr	0.3360880	0.2922327	0.0438553	wiki
ar	gu	0.2278962	0.1947674	0.0331288	wiki
ar	hi	0.2927884	0.2489938	0.0437945	wiki

Let’s start by getting the data into wide format (Wiki only).

difs_wiki <- difs %>%
  filter(corpus == "wiki") %>%
  # append the mirrored pairs (lang2, lang1) so both triangles of the matrix are filled
  bind_rows(difs %>%
              filter(corpus == "wiki") %>%
              select(-local, -global) %>%
              rename(lang1 = lang2, lang2 = lang1)) %>%
  select(lang1, lang2, dif) %>%
  arrange(lang1, lang2) %>%
  pivot_wider(names_from = "lang2", values_from = "dif") %>%
  # move "ar" first so the column order matches the row order, then drop the row labels
  select(lang1, ar, everything()) %>%
  select(-lang1)

difs_wiki

## # A tibble: 35 x 35
##         ar      bg      bn      de      el      en      es      fa      fr
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 NA       0.0473  0.0424  0.0452  0.0446  0.0515  0.0473  0.0492  0.0439
##  2  0.0473 NA       0.0460  0.0550  0.0593  0.0556  0.0523  0.0492  0.0505
##  3  0.0424  0.0460 NA       0.0428  0.0511  0.0472  0.0467  0.0434  0.0449
##  4  0.0452  0.0550  0.0428 NA       0.0556  0.0541  0.0519  0.0449  0.0489
##  5  0.0446  0.0593  0.0511  0.0556 NA       0.0582  0.0542  0.0519  0.0522
##  6  0.0515  0.0556  0.0472  0.0541  0.0582 NA       0.0502  0.0454  0.0514
##  7  0.0473  0.0523  0.0467  0.0519  0.0542  0.0502 NA       0.0439  0.0465
##  8  0.0492  0.0492  0.0434  0.0449  0.0519  0.0454  0.0439 NA       0.0445
##  9  0.0439  0.0505  0.0449  0.0489  0.0522  0.0514  0.0465  0.0445 NA     
## 10  0.0331  0.0347  0.0318  0.0367  0.0359  0.0409  0.0360  0.0381  0.0327
## # … with 25 more rows, and 26 more variables: gu <dbl>, hi <dbl>, id <dbl>,
## #   ig <dbl>, it <dbl>, ja <dbl>, kn <dbl>, ko <dbl>, ml <dbl>, mr <dbl>,
## #   ne <dbl>, nl <dbl>, pa <dbl>, pl <dbl>, pt <dbl>, ro <dbl>, ru <dbl>,
## #   ta <dbl>, te <dbl>, th <dbl>, tl <dbl>, tr <dbl>, ur <dbl>, vi <dbl>,
## #   yo <dbl>, zh <dbl>

Next, let’s do the QAP scramble procedure. The algorithm is: apply the same random permutation to both the rows and the columns of the dependent matrix. In this case, our dependent matrix holds the “dif” values, and the independent matrix is a matrix of 1s (the intercept).

The following code does the scrambling:

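get_scrambled_mat <- function(x) {
  mat_x <- as.matrix(x)
  # one random permutation, applied to rows and columns together
  random_indices <- sample(nrow(mat_x))
  # return the shuffled matrix explicitly rather than as the value of an assignment
  mat_x[random_indices, random_indices]
}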
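
As a quick sanity check of the idea, the sketch below applies the same kind of joint row/column permutation to a small made-up symmetric matrix in base R. The helper name `scramble_rows_cols` and the toy values are assumptions for illustration only, not part of the analysis above.

```r
# Toy symmetric matrix with NA on the diagonal (hypothetical values).
set.seed(1)
langs <- c("ar", "bg", "bn", "de")
toy <- matrix(NA_real_, 4, 4, dimnames = list(langs, langs))
toy[upper.tri(toy)] <- c(0.047, 0.042, 0.046, 0.045, 0.055, 0.043)
toy[lower.tri(toy)] <- t(toy)[lower.tri(toy)]

# Hypothetical helper: apply one random permutation to rows and columns together.
scramble_rows_cols <- function(m) {
  idx <- sample(nrow(m))
  m[idx, idx]
}

toy_scrambled <- scramble_rows_cols(toy)

# The permuted matrix is still symmetric, its diagonal is still NA,
# and it holds exactly the same off-diagonal values as before.
identical(toy_scrambled, t(toy_scrambled))
all(is.na(diag(toy_scrambled)))
identical(sort(toy_scrambled), sort(toy))
```

Because the permutation only relabels which language sits in which row and column, the collection of cell values is unchanged; only their positions move.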

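The following code converts a matrix from wide to long form (for doing regression):

get_mat_in_long_form <- function(x) {
  as.data.frame(x) %>%
    # rows are in the same order as the columns, so the column names label the rows
    mutate(lang1 = colnames(x)) %>%
    # gather every language column; the cell values land in "value"
    pivot_longer(cols = -lang1, names_to = "lang2")
}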

The scrambled version

conc_in_out_wide_scrambled <- difs_wiki %>%
  get_scrambled_mat() %>%
  get_mat_in_long_form()

Matrix is definitely scrambled:

kable(head(conc_in_out_wide_scrambled))

lang1	lang2	value
id	id	NA
id	ja	0.0094095
id	ru	0.0533725
id	mr	0.0385369
id	kn	0.0482233
id	vi	0.0308962

Run the regression

lm(value ~ 1, conc_in_out_wide_scrambled) %>%
  summary() %>%
  tidy() %>%
  kable()

term	estimate	std.error	statistic	p.value
(Intercept)	0.0353898	0.0004398	80.46892	0
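
For reference, an intercept-only regression like this one is equivalent to a one-sample t-test of the values against zero. The sketch below illustrates that on simulated numbers in base R; the vector `d` is a made-up stand-in for the “dif” values, not the real data. Both approaches report a two-sided p-value, so a directional hypothesis (dif greater than zero) would additionally need the sign of the estimate to be checked.

```r
set.seed(7)
# Hypothetical difference scores standing in for the "dif" values.
d <- rnorm(50, mean = 0.005, sd = 0.01)

# Intercept-only regression and one-sample t-test against zero.
fit <- summary(lm(d ~ 1))$coefficients
tt <- t.test(d, mu = 0)

# Both give the same estimate, t statistic, and two-sided p-value.
all.equal(unname(fit[1, "Estimate"]), mean(d))
all.equal(unname(fit[1, "t value"]), unname(tt$statistic))
all.equal(unname(fit[1, "Pr(>|t|)"]), tt$p.value)
```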

The unscrambled version

conc_in_out_wide_unscrambled <- difs_wiki %>%
  get_mat_in_long_form()

Matrix is not scrambled:

kable(head(conc_in_out_wide_unscrambled))

lang1	lang2	value
ar	ar	NA
ar	bg	0.0472715
ar	bn	0.0423677
ar	de	0.0451650
ar	el	0.0445894
ar	en	0.0515301

lm(value ~ 1, conc_in_out_wide_unscrambled) %>%
  summary() %>%
  tidy() %>%
  kable()

term	estimate	std.error	statistic	p.value
(Intercept)	0.0353898	0.0004398	80.46892	0

… The results are the same. That’s because no matter how you scramble the dependent variable, the predictor is always the same (a constant 1), so the intercept estimate is just the mean of the values, which does not depend on their order. This is not true for the Swadesh analyses or for the analysis where we’re predicting semantic distance with five predictors.
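
The contrast can be checked on simulated data. The sketch below uses made-up matrices in base R (`y` and `x` are assumptions for illustration, not the matrices from these analyses): the intercept-only fit is unchanged by the row/column permutation, whereas the slope on a non-constant predictor depends on how the cells of the two matrices line up and so will generally differ after scrambling.

```r
set.seed(42)
n <- 6

# Hypothetical symmetric outcome matrix (e.g. pairwise differences).
y <- matrix(0, n, n)
y[upper.tri(y)] <- rnorm(n * (n - 1) / 2, mean = 0.04, sd = 0.01)
y <- y + t(y)
diag(y) <- NA

# Hypothetical symmetric predictor matrix (e.g. pairwise distances).
x <- matrix(0, n, n)
x[upper.tri(x)] <- runif(n * (n - 1) / 2)
x <- x + t(x)
diag(x) <- NA

# Permute rows and columns of the outcome matrix together.
idx <- sample(n)
y_perm <- y[idx, idx]

# Intercept-only model: the estimate is the mean of the off-diagonal cells,
# so it is the same before and after the permutation.
b0 <- coef(lm(as.vector(y) ~ 1))
b0_perm <- coef(lm(as.vector(y_perm) ~ 1))
all.equal(b0, b0_perm)

# With a non-constant predictor, the slope is computed from cell-by-cell
# pairings of y and x, which the permutation changes.
b1 <- coef(lm(as.vector(y) ~ as.vector(x)))[2]
b1_perm <- coef(lm(as.vector(y_perm) ~ as.vector(x)))[2]
c(observed = unname(b1), permuted = unname(b1_perm))
```

In a full QAP test, the permuted slope would be recomputed many times to build a null distribution for the observed slope; with only an intercept, that distribution collapses to a single value.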

Molly Lewis

2020-03-24