class: title-slide
background-image: url("assets/uvalogo_g_CommunicationScience.png"), url("assets/candle.gif")
background-position: 9% 91%, 100% 50%
background-size: 300px, 50% 100%
background-color: #000000

<br/>

# .Large[(In)visible Information]

## .white[.font100[Theory, Measurement, and Boosting]]

<br/><br/><br/><br/>

### .white[.font80[Saurabh Khanna]]

### .white[.font80[21.05.2024]]

---
class: middle

# About me

<br/>

.pull-left[
.font90[

🇳🇱 Assistant Professor, University of Amsterdam [Communication Science]

🇬🇧 Research Associate, University of Oxford [Pembroke College]

<br/>

]
]

.pull-right[
]

---
class: middle

# About me

<br/>

.pull-left[
.font90[

.grey[
🇳🇱 Assistant Professor, University of Amsterdam [Communication Science]

🇬🇧 Research Associate, University of Oxford [Pembroke College]
]

<br/>

Previously:

🇬🇧 Postdoctoral Research Fellow, University of Oxford [Pembroke College]

🇺🇸 PhD, Education Policy [Computer Science minor], Stanford University

🇺🇸 MA, Economics, Stanford University

🇮🇳 MA, Education, TISS Mumbai

🇮🇳 BS, Computer Science, BITS Pilani

]
]

.pull-right[
]

---
class: middle

# Digitizing Human Lives

- Scale and Speed: `\(8.5B\)` Google searches per day

--

- COVID-19: 47% rise in broadband usage across the US

--

- Enabled democratic discourse

--

- Some concerns though

---

# Concerns

- Misinformation
- Bias

---

# Concerns

- Misinformation: Information consumed `\(\ne\)` Truth
- Bias: [Information consumed `\(\ne\)` Truth] `\(+\)` Discriminates against a social group

--

- .red[Truth?]
  - ~~A norm~~ An exception on the Internet (Flanagin & Metzger, 2000)

<br/>

--

- Consider two statements:
  - `\(S_1\)`: The election was rigged
  - `\(S_2\)`: <span style="color:blue">People think</span> the election was rigged

---

# Concerns

- Misinformation: Information consumed `\(\ne\)` Truth
- Bias: [Information consumed `\(\ne\)` Truth] `\(+\)` Discriminates against a social group
- .red[Truth?]
  - ~~A norm~~ An exception on the Internet (Flanagin & Metzger, 2000)

<br/>

- Consider two statements:
  - `\(S_1\)`: The election was rigged ❌
  - `\(S_2\)`: <span style="color:blue">People think</span> the election was rigged ✅

--

- `\(S_1\)` and `\(S_2\)` have a similar effect on the reader (Tucker & Persily, 2020)

--

- Telling the reader that `\(S_1\)` is false doesn't help either (Bail, 2021)

---
class: middle, bottom
background-image: url("assets/venn1.png")
background-size: contain
background-position: 50% 50%

---
class: middle, bottom
background-image: url("assets/venn2.png")
background-size: contain
background-position: 50% 50%
background-color: black

---
class: middle, center
background-color: black

## .white[Towards a Theory of Invisible Information]

---

# Three aspects

### 1. Not all knowledge is digitized

.pull-left-1[
Exclusion of

- minority languages
- rare diseases
- certain types of crime
- ...
]

.pull-right-2[
<img src="assets/internet-users.png" alt="rda_logo" width="600" style="border:0;"/>
]

---

# Three aspects

### 2. Digitized information is dynamically 'ranked' ~~by~~ for us

---

# Three aspects

### 3. We consume the tip of the information iceberg

.pull-left-1[
`$$P_i = \frac{\frac{1}{i^s}}{\sum_{j=1}^{N} \frac{1}{j^s}},$$`

where `\(P_i\)` is the probability of clicking the search result ranked `\(i\)` out of `\(N\)`, and `\(s\)` sets how steeply clicks fall off with rank.
]

.pull-right-2[
<img src="index_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]

---
class: middle, center
background-color: black

## .white[Measuring Invisible Information]

---

### Leveraging text embeddings

- Transformer-based embeddings (2017), sBERT (2019)
- Relevance of a document, given a search query, can be computed from the semantic distance between the document and the query in the embedding space (2020)

.center[
<img src="assets/vectors0.png" alt="rda_logo" width="350" style="border:0;"/>
]

--

.center[
.blockquote[
<img src="assets/vectors.png" alt="rda_logo" width="500" style="border:0;"/>
]
]

---

.center[
<img src="assets/vectors.png" alt="rda_logo" width="600" style="border:0;"/>
]

`\(\vec{q}\)`: query

`\(\vec{C} = \sum w_i \vec{r_i}\)`: corpus constructed as a weighted aggregate of the `\(\vec{r_i}\)` vectors

`\(\vec{r_i}\)`: one out of `\(N\)` search results

`\(w_i\)`: weight assigned to each search result

.content-box-purple[
`$$I_{visibility} = \sum_{i=1}^N P_i (\vec{C} \cdot \vec{r_i}) = \sum_{i=1}^N \left(\frac{\frac{1}{i^s}}{\sum_{j=1}^{N} \frac{1}{j^s}}\right) (\vec{C} \cdot \vec{r_i})$$`
]

---

# Information Visibility

.pull-left[
Benefits:

- Single metric `\(\in [0, 1]\)`
- Intuitive
- Applicable to any ranked information source
  - search results
  - social media feeds
  - image galleries
]

.pull-right[
Limitations:

- Assumes an upper cap on the corpus `\(C\)`
- Depends on embedding quality
]

.content-box-purple[
`$$I_{visibility} = \sum_{i=1}^N P_i (\vec{C} \cdot \vec{r_i}) = \sum_{i=1}^N \left(\frac{\frac{1}{i^s}}{\sum_{j=1}^{N} \frac{1}{j^s}}\right) (\vec{C} \cdot \vec{r_i})$$`
]

---

# Another (more granular) metric

### Information Visibility Curve

.pull-left-1[
<br/>

.content-box-purple[
`$$f(n) = \vec{C} \cdot \left(\sum_{i=1}^n w_i \vec{r_i}\right)$$`
]

The area under the curve indicates how efficiently information visibility was gained for a given dataset.
]

.pull-right-2[
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---
class: middle, center
background-color: black

## .white[Boosting Invisible Information]

---

# Boosting

### Boost **all invisible** information? ❌

---

# Boosting

### Boost **relevant invisible** information ✅

--

.pull-left[
- We need
  - .blue[Data]
  - .red[A measure of relevance]
  - .green[A measure of invisibility]
]

--

.pull-right[
- .blue[IMDb corpus for 697,872 films since 1874
  - 45% have user ratings
  - 100s of features per film]
- .red[Predict ratings for the remaining 55% (A measure of relevance)]
- .green[Compare plot embeddings against the whole corpus (A measure of invisibility)]
]

<br/>

.content-box-gray[
.center[
Boost **hidden gems** based on a harmonic mean of .red[both] .green[measures] above.
]
]

---
class: middle

# [🕯️](https://theinvisiblelab.org/) The (In)visible Lab

<br/>

.pull-left[
🌱 Amsterdam School of Communication Research

🌱 Social and Behavioural Data Science Centre, University of Amsterdam
]

.pull-right[
🇳🇱 University of Amsterdam

🇺🇸 Stanford University

🇬🇧 University of Oxford

🇪🇸 University of Deusto

🇨🇦 Public Knowledge Project

🇮🇸 Citizens Foundation
]

---
class: title-slide
background-image: url("assets/uvalogo_g_CommunicationScience.png"), url("assets/candle.gif")
background-position: 9% 91%, 100% 50%
background-size: 300px, 50% 100%
background-color: #000000

<br/>

# .Large[Thank You!]

<br/><br/><br/><br/>

### .white[Feedback/Questions]

### .white[saurabh.khanna@uva.nl]
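
---

# Backup: computing the metrics

A minimal sketch of how the click probabilities `\(P_i\)`, the visibility score `\(I_{visibility}\)`, and the visibility curve from the earlier slides could be computed. It is not the talk's actual implementation. All function names are hypothetical, and two choices are assumptions not stated in the slides: result embeddings are scaled to unit length, and the weights default to `\(w_i = 1/N\)`.

```python
import math


def click_probabilities(n_results, s=1.0):
    """P_i = (1 / i^s) / sum_j (1 / j^s) for ranks i = 1..n_results."""
    raw = [1.0 / (rank ** s) for rank in range(1, n_results + 1)]
    total = sum(raw)
    return [value / total for value in raw]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length (assumption: embeddings are unit-normalized)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _prepare(results, weights):
    """Normalize the ranked result embeddings and fill in default weights."""
    if not results:
        raise ValueError("need at least one search result")
    unit = [normalize(r) for r in results]
    if weights is None:
        # Assumption: equal weights w_i = 1/N when none are supplied.
        weights = [1.0 / len(unit)] * len(unit)
    return unit, weights


def corpus_vector(unit_results, weights):
    """C = sum_i w_i * r_i, computed coordinate by coordinate."""
    dim = len(unit_results[0])
    return [sum(w * r[d] for w, r in zip(weights, unit_results)) for d in range(dim)]


def information_visibility(results, s=1.0, weights=None):
    """I_visibility = sum_i P_i * (C . r_i), with results given in rank order."""
    unit, weights = _prepare(results, weights)
    c = corpus_vector(unit, weights)
    probs = click_probabilities(len(unit), s)
    return sum(p * dot(c, r) for p, r in zip(probs, unit))


def visibility_curve(results, weights=None):
    """Cumulative curve: C . (sum of w_i * r_i over the top n results), n = 1..N."""
    unit, weights = _prepare(results, weights)
    c = corpus_vector(unit, weights)
    partial = [0.0] * len(c)
    curve = []
    for w, r in zip(weights, unit):
        partial = [acc + w * x for acc, x in zip(partial, r)]
        curve.append(dot(c, partial))
    return curve


if __name__ == "__main__":
    # Toy 2-D "embeddings" in rank order; real embeddings would come from a model.
    ranked = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
    print(information_visibility(ranked, s=1.0))
    print(visibility_curve(ranked))
```

In the toy input, the third result points in a different direction from the first two, so it adds to the curve only once the reader reaches rank 3, where the click probability is lowest.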