Psychological & Behavioural distress of COVID-19 and Infodemics

# 12.07.2021 - Psychological & Behavioural distress of COVID-19 and Infodemics

## Data Science with R &#183; Summer 2021

###  &#183; Knowledge Management & Discovery Lab

#### [https://rpubs.com/ranjiraj9/covidistress](https://rpubs.com/ranjiraj9/covidistress)

---

## Motivation & Problem statement

.pull-left70[
- __COVID-19__ arrived and peaked in __2020__, and it persists in various forms today.
- It aggravated __mental disorders__ and their ill effects on a larger scale.
- Being __isolated__ affects productivity and makes people prone to addictive substances.
- Increased __social media__ usage and __unplanned sleep__ cause distress.
- __Infodemics__ play an **evil role** in all of the above.
- Findings in 3 objectives:
  - Global distress survey data analysis
  - Twitter analysis
  - Infodemic risk analysis
  
- _Shiny app_ <https://covid-distress-infodemics.shinyapps.io/shinyapp/>

## Team

* Madhuri Sajith
* Usama Ashfaq
* Vishnu Jayanand
* Sujith Nyarakkad Sudhakaran
* Ranjiraj Rajendran Nair

]

]

---

## Objective 1: Global distress survey

---

## Survey dataset (Overview)

- Only **86,751** respondents gave consent to participate in the survey. We therefore used only the records in which "Consent" is marked "Yes".
- **64.32%** of the people who took the survey answered every question through to the end. Of the **86,751** participants, only **8,068** answered just `one` question.

- On average, each participant answered **75%** of the questions. The average completion rate per question is also **75%**.
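The completeness figures above can be reproduced with a few counts. The sketch below uses a hypothetical four-row survey (the field names `Q1` to `Q3` and all values are invented for illustration) and is not the project's actual code.

```python
# Hypothetical mini survey: one dict per respondent, None = unanswered.
responses = [
    {"Consent": "Yes", "Q1": 4, "Q2": 3, "Q3": 5},
    {"Consent": "Yes", "Q1": 2, "Q2": None, "Q3": None},
    {"Consent": "No", "Q1": 1, "Q2": 1, "Q3": 1},
    {"Consent": "Yes", "Q1": 5, "Q2": 4, "Q3": None},
]
questions = ["Q1", "Q2", "Q3"]

# Keep only the records with explicit consent.
consented = [r for r in responses if r["Consent"] == "Yes"]


def answered(record):
    """Number of questions this respondent answered."""
    return sum(record[q] is not None for q in questions)


# Share of respondents who answered every question.
full_share = sum(answered(r) == len(questions) for r in consented) / len(consented)

# Respondents who answered exactly one question.
one_answer = sum(answered(r) == 1 for r in consented)

# Average completeness per respondent and per question.
per_respondent = sum(answered(r) / len(questions) for r in consented) / len(consented)
per_question = sum(
    sum(r[q] is not None for r in consented) / len(consented) for q in questions
) / len(questions)
```

Both averages divide the same number of answered cells by the same total number of cells, so they always coincide. This is consistent with the two 75% figures reported above.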

]

---

## World map of Corona stress 🌎

]

- Areas in red hue ( _level = 4_ ) are highly stressed and few in number, while areas in blue hue ( _level = 2_ ) are less stressed and cover most countries.

- Stress severity is highest in _Southern African_ countries like __Namibia__ and _West African_ countries like __Senegal__, __Guinea-Bissau__ and __Burkina Faso__.

- This helps us uncover less-discussed parts of the world that receive little media coverage yet are adversely affected.

]

---

## Distress scale

]

- The above bar plot compares different sources of stress during the corona pandemic.

- People are most stressed by the fear of economic collapse, of catching the coronavirus, and of the risk of dying.

- Not being able to perform religious activities has very little effect on people, because they can also offer their prayers at home.

]

---

## Coping with stress

---

## Final survey analysis

- Participants were also asked **which things they worry about the most** during the pandemic. An interesting finding was that people worry **most** about their *family* and *country* and **least** about *themselves*.

- Overall, respondents reported only medium levels of trust, with the highest levels for their country's *healthcare system* and the *WHO*. Trust in the national government was relatively low compared to the other institutions examined. People also prefer information coming from friends or other people they know.

- The survey asked about the instructions given by the government and international health organizations to stop the spread of the coronavirus. Surprisingly, most respondents chose the option indicating that they follow all the instructions.

- Our analysis reports the bivariate relationships between `Extraversion`, `Perceived Support`, `Perceived Stress` and `Loneliness`. In general, more female than male respondents participated. It is noteworthy that `Loneliness` appears to be a root cause of stress and contributes to it at a major level.

]

---

## Objective 2: Twitter sentiment analysis

---

## Twitter dataset (Overview)

- For 2021, we worked on the most recent dataset (**June 2021**), collected from Twitter with the `twitteR` and `rtweet` libraries for a particular time window and location.

- `twitteR` provides an interface to the Twitter web API, and `rtweet` acts as a client for Twitter's REST and stream APIs; both were used to retrieve the data.

- For 2020, we found an academic dataset of tweet IDs collected for coronavirus research. The dataset can be downloaded here: <https://zenodo.org/record/3831406#.YOGah-j7Q6b/>

- These IDs were then used to extract tweets with an open source application called `hydrator` (<https://github.com/DocNow/hydrator/>). The extracted tweets were then filtered to keep only those in English and within a particular time frame.

- After extracting tweets from both years, we applied pre-processing steps such as removing stop words, emojis and cryptic characters, and converting the text to lowercase to maintain semantic integrity.
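The pre-processing steps listed above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the project: the stop-word list is a tiny hypothetical sample, and the character filter keeps only ASCII letters and digits, so it would also drop accented letters.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used in the project.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "for"}


def clean_tweet(text):
    """Lowercase a tweet, strip URLs and non-alphanumeric characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop emojis, punctuation, cryptic characters
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = clean_tweet("The Delta variant is spreading 😷 in Pune! https://t.co/xyz")
print(tokens)
```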

]

---

## Most frequent and common words

It can be seen that in **2021** greater prominence goes to tweets with the hashtag "**capacity**", which appears more frequently in the recent time period. Our research found that CAPACITY is a registry of patients with COVID-19, established to answer questions on the role of cardiovascular disease in this pandemic.

]

It is evident that "**coronavirus**" outweighs other tags in **2020**. It was trending when the arrival of the pandemic was sensed, which reflects the global sentiment to a large extent.

]

It is evident that word combinations such as "**min age**" and "**covid 19 pune**" are more frequent, and two-word sequences like "**delta variant**" and "**vaccines covishield**" also draw attention to novel topics that are more significant in recent times.

]

In 2020, when the outbreak was first sensed, "**wuhan coronavirus**" and "**coronavirus outbreak**" were more noticeable in the Twitter community.
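Frequent words and two-word sequences like the ones above come from simple counting over tokenized tweets. The sketch below assumes already-cleaned token lists. The three example tweets are hypothetical and only reuse terms mentioned on this slide.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one per tweet).
tweets = [
    ["delta", "variant", "cases", "rising"],
    ["delta", "variant", "vaccines", "covishield"],
    ["vaccines", "covishield", "available"],
]

# Single-word frequencies across all tweets.
word_counts = Counter(token for tweet in tweets for token in tweet)

# Two-word sequences (bigrams) built from adjacent tokens within each tweet.
bigram_counts = Counter(
    " ".join(pair) for tweet in tweets for pair in zip(tweet, tweet[1:])
)

top_bigrams = bigram_counts.most_common(2)
```

Pairing `tweet` with `tweet[1:]` keeps bigrams inside a single tweet, so the last word of one tweet is never joined to the first word of the next.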

]

---

## NRC emotions during pandemic

]

]

Based on a comparative analysis, we infer that the dominant sentiment in 2021 is **positive**, while it was **negative** in 2020.
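Lexicon-based emotion analysis of this kind maps each word to one or more categories and counts the matches. The sketch below uses a toy lexicon invented for illustration. It is not the NRC lexicon, and the token lists are hypothetical.

```python
from collections import Counter

# Assumption: a toy lexicon for illustration only; the real NRC lexicon is far larger.
TOY_LEXICON = {
    "death": ["negative", "sadness"],
    "risk": ["negative", "fear"],
    "vaccine": ["positive", "trust"],
    "recover": ["positive", "joy"],
    "safe": ["positive", "trust"],
}


def emotion_counts(tokens):
    """Count lexicon categories over a list of tokens; unknown words are ignored."""
    counts = Counter()
    for token in tokens:
        counts.update(TOY_LEXICON.get(token, []))
    return counts


# Hypothetical tokens pooled from each year's tweets.
tokens_2020 = ["death", "risk", "outbreak", "risk"]
tokens_2021 = ["vaccine", "recover", "safe", "risk"]

dominant_2020 = emotion_counts(tokens_2020).most_common(1)[0][0]
dominant_2021 = emotion_counts(tokens_2021).most_common(1)[0][0]
```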

]

---

## Sentiment analysis on tweets

]

]

This gives an important insight: a year on, negativity towards the virus and the ongoing after-effects of the pandemic, such as *death* and *risk* symptoms, has increased. On the other hand, positive sentiment has also risen sharply after a year, as people are recovering, helping and supporting each other, staying safe, and getting free vaccination.

]

---

## Feature Engineering ⚙️⚙️⚙️️️

<div class="figure" style="text-align: center">
<img src="figures//LS-3.jpg" alt="Fig 7. Feature selection by PCA" width="100%" />
<p class="caption">Fig 7. Feature selection by PCA</p>
</div>

]
 
.pull-right[

Feature selection by using **Principal Component Analysis (PCA)**

- PCA finds a smaller set of new variables, each a combination of the input variables, that contains essentially the same information as the inputs. We use it to demonstrate our concept of feature selection.

- We infer that the combination of `Color Layout Descriptor` with `Color Histogram` is the best in this setting.
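To make the idea of "new variables as combinations of the inputs" concrete, the sketch below finds the first principal component of a small hypothetical dataset. It uses power iteration on the covariance matrix as a dependency-free stand-in. This is an illustrative substitute, not the routine used in the project.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the leading eigenvector of the covariance matrix and its explained-variance ratio."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    return v, explained


# Hypothetical 3-feature data where the first two columns are strongly correlated.
data = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 8.1, 0.5],
    [5.0, 9.8, 0.4],
]
component, explained_ratio = first_principal_component(data)
```

Here a single component carries almost all of the variance, because the first two columns move together and the third barely varies.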

]

---

## The Semi-Supervised Learning (SSL) concept💡

All about **Assumptions...** 🤔

We have **9,000** raw images. After also extracting any image within an image ( _Droste effect_ ), we estimate **15,000** images in total. On this basis, we assume that an **80-20** split would give roughly **12,000** unlabeled and **3,000** labeled images.

In SSL the training sample contains some unlabeled data in addition to the labeled data.

This gives two goals:

&rarr; Predict the labels on future test data,

&rarr; Predict the labels on the unlabeled instances in the training sample.

]

The SSL methods we propose for solving our task are as follows:

-  **Label Propagation Algorithm (LPA)**

-  **Label Spreading Algorithm**

-  **Semi-Supervised Gaussian Mixture Model (SSGMM)**

]

]

---

## Algorithms

---

## Graph-based ⛓️

**Label Propagation Algorithm (LPA)**: _Assumption_ - Similar images have similar feature descriptors, so they are mapped close together in the graph, with high weights on the edges connecting them.

`Hyperparameters`

- `\(\gamma\)`: Influences how far the impact of a single training point reaches. Low gamma values mean a broad similarity radius, which results in more points being grouped together. With high gamma values, points must be very close to each other to be assigned to the same category (or class).

- **Choice of kernel**: RBF (default) / Linear.

**Label Spreading**: _Manifold Assumption_ - the graphs, constructed based on the local similarity between features, provide a lower-dimensional representation of the high-dimensional input images (images on the same low-dimensional manifold should have the same label).

]

A graph constructed from labeled instances `\(x_1\)`, `\(x_2\)` and unlabeled instances. The label of unlabeled instance `\(x_3\)` will be affected more by the label of `\(x_1\)`, which is closer in the graph, than by the label of `\(x_2\)`, which is farther in the graph, even though `\(x_2\)` is closer in Euclidean distance.
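The graph-based idea can be sketched without any library. The code below is a minimal label propagation with an RBF kernel, under the assumption that edge weights are `exp(-gamma * distance^2)` and that labeled points are clamped after every step. The points and labels are hypothetical, and this is not the project's implementation.

```python
import math


def rbf(a, b, gamma):
    """RBF similarity: larger gamma means a narrower similarity radius."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def label_propagation(points, labels, gamma=1.0, iterations=100):
    """labels[i] is a class index, or None for an unlabeled point."""
    classes = sorted({label for label in labels if label is not None})
    n, k = len(points), len(classes)
    # Row-normalised RBF affinities act as transition probabilities on the graph.
    weights = [[rbf(points[i], points[j], gamma) for j in range(n)] for i in range(n)]
    trans = [[w / sum(row) for w in row] for row in weights]

    def clamp(i):
        # One-hot distribution for a point whose label is known.
        return [1.0 if labels[i] == c else 0.0 for c in classes]

    # Labeled points start one-hot, unlabeled points start uniform.
    dist = [clamp(i) if labels[i] is not None else [1.0 / k] * k for i in range(n)]
    for _ in range(iterations):
        dist = [
            [sum(trans[i][j] * dist[j][c] for j in range(n)) for c in range(k)]
            for i in range(n)
        ]
        # Labeled points keep their known labels after every propagation step.
        dist = [clamp(i) if labels[i] is not None else dist[i] for i in range(n)]
    return [classes[max(range(k), key=lambda c: row[c])] for row in dist]


# Hypothetical 2-D feature vectors: two well-separated groups, one label each.
points = [(0.0, 0.0), (0.3, 0.1), (0.6, 0.0), (3.0, 3.0), (3.3, 3.1), (3.6, 2.9)]
labels = [0, None, None, 1, None, None]
predicted = label_propagation(points, labels, gamma=1.0)
```

Each unlabeled point ends up with the label of the group it is connected to by high-weight edges, which is the behaviour the assumption above describes.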

]

---

## Expectation Maximization ⛓️⛓️

**Semi-Supervised Gaussian Mixture Model (SSGMM)**: _Assumption_ - The images come from the mixture model, where the number of components, prior `\(p(y)\)`, and conditional `\(p(x|y)\)` are all correct.

In the Gaussian mixture model, we maximize the likelihood function `\(P(X_{train}|\pi, \mu, \sigma)\)`

- `\(\pi\)` is the distribution parameter of the label Y,

- `\(\mu\)` and `\(\sigma\)` are the sets of mean vectors and covariance matrices for each category.

`Hyperparameters`

- `\(\alpha\)`: Additive (Laplace/Lidstone) smoothing parameter,

- `\(\beta\)`: Weight applied to the contribution of the unlabeled data,

- **fit_prior**:  Whether to learn class prior probabilities or not (default=True). If false, a uniform prior will be used.

- **class_prior**: Prior probabilities of the classes. If specified the priors are not adjusted according to the data.

- **tol**: Tolerance for convergence of EM algorithm.

- **max_iter**: Maximum number of iterations for EM algorithm.
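The role of `beta`, `tol` and `max_iter` can be illustrated with a one-dimensional semi-supervised EM loop. This is a generic sketch under simplifying assumptions (scalar features, one Gaussian per class, at least two labeled points per class), not the project's model. The smoothing and prior options listed above are omitted, and all data values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def ss_gmm_1d(labeled, unlabeled, n_classes, beta=1.0, tol=1e-6, max_iter=100):
    """labeled: list of (x, y) pairs; unlabeled: list of x. Returns (priors, means, variances)."""
    xs = [x for x, _ in labeled] + list(unlabeled)
    # beta scales how much each unlabeled point contributes to the estimates.
    point_weight = [1.0] * len(labeled) + [beta] * len(unlabeled)
    # Hard responsibilities for labeled points; unlabeled ones start at zero,
    # so the first M-step uses the labeled data only.
    resp = [[1.0 if y == k else 0.0 for k in range(n_classes)] for _, y in labeled]
    resp += [[0.0] * n_classes for _ in unlabeled]
    prev_ll = -math.inf
    for _ in range(max_iter):
        # M-step: weighted priors, means and variances per class.
        mass = [sum(w * r[k] for w, r in zip(point_weight, resp)) for k in range(n_classes)]
        priors = [m / sum(mass) for m in mass]
        means = [
            sum(w * r[k] * x for w, r, x in zip(point_weight, resp, xs)) / mass[k]
            for k in range(n_classes)
        ]
        variances = [
            max(
                sum(w * r[k] * (x - means[k]) ** 2 for w, r, x in zip(point_weight, resp, xs)) / mass[k],
                1e-6,  # floor to avoid a degenerate zero variance
            )
            for k in range(n_classes)
        ]
        # E-step: refresh soft responsibilities of the unlabeled points only,
        # accumulating the weighted log-likelihood along the way.
        ll = sum(math.log(priors[y] * gaussian_pdf(x, means[y], variances[y])) for x, y in labeled)
        for i, x in enumerate(unlabeled, start=len(labeled)):
            joint = [priors[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(n_classes)]
            resp[i] = [j / sum(joint) for j in joint]
            ll += beta * math.log(sum(joint))
        if abs(ll - prev_ll) < tol:  # converged
            break
        prev_ll = ll
    return priors, means, variances


# Hypothetical data: two labeled points per class plus unlabeled points nearby.
labeled = [(0.0, 0), (1.0, 0), (5.0, 1), (6.0, 1)]
unlabeled = [0.2, 0.5, 0.8, 1.2, 4.8, 5.5, 5.9, 6.3]
priors, means, variances = ss_gmm_1d(labeled, unlabeled, n_classes=2)
```

Setting `beta=0` removes the unlabeled data from the estimates altogether, which reduces the model to the purely supervised fit.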

---

## Model Selection and Model Evaluation

---

## Model Selection 👆

]

On the features, we perform class balancing ( _undersampling_ ) and feature selection ( _ANOVA_ and _PCA_ ) over an enumeration of `\(2^c\)` feature combinations (where c is the total number of extracted features). A line plot over all split points in the range [0.1, 0.9] then shows which combination is best suited for model building. Finally, a box plot lets us pick the best model.
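The enumeration of feature combinations and split points can be sketched as follows. The feature names are placeholders, and the empty combination is skipped, so `\(2^c - 1\)` usable subsets remain.

```python
from itertools import combinations

# Placeholder names; the actual extracted features are not listed here.
features = ["feature_a", "feature_b", "feature_c"]

# Every non-empty subset of the features.
subsets = [
    combo
    for size in range(1, len(features) + 1)
    for combo in combinations(features, size)
]

# Split points 0.1, 0.2, ..., 0.9.
split_points = [round(0.1 * i, 1) for i in range(1, 10)]

# One experiment per (feature subset, split point) pair.
experiments = [(combo, split) for combo in subsets for split in split_points]
```

With three features this gives 7 subsets and 63 experiments. The count doubles with every additional feature, which is why the enumeration is only practical for a small number of feature types.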

]

]

---

## Model Evaluation  📝

**Class prediction error plots**

]

- We use a class prediction error plot, which shows the actual targets from the dataset against the predicted values generated by our model.

- It illustrates the support (number of training samples) for each class in the fitted classification model as a stacked bar chart.

- It shows which classes our classifier has particular difficulty with and, more importantly, which incorrect answers it gives on a per-class basis.

- For example, our `MultinomialNBSS` classifier often incorrectly labels "cat" as "cow".
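The counts behind such a stacked bar chart are easy to compute by hand. The sketch below assumes one bar per actual class, segmented by predicted class. The labels are hypothetical and only mirror the "cat" versus "cow" confusion mentioned above.

```python
from collections import Counter, defaultdict

# Hypothetical true and predicted labels.
y_true = ["cat", "cat", "cat", "cow", "cow", "aeroplane"]
y_pred = ["cow", "cat", "cow", "cow", "cow", "aeroplane"]

# One bar per actual class; each bar is segmented by predicted class.
bars = defaultdict(Counter)
for truth, guess in zip(y_true, y_pred):
    bars[truth][guess] += 1


def most_common_mistake(actual):
    """Return the wrong class most often predicted for `actual`, or None if there is none."""
    wrong = {pred: n for pred, n in bars[actual].items() if pred != actual}
    return max(wrong, key=wrong.get) if wrong else None
```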

]

---

## Model Evaluation  📝📝

**ROC curves**

]

- We use ROC curves to show the trade-off between sensitivity and specificity for every possible cut-off, across different feature combinations.

- AUC measures the entire two-dimensional area underneath the ROC curve.
- It provides an aggregate measure of performance across all possible classification thresholds.

- For the `MultinomialNBSS` classifier, the highest AUC is 0.83, observed for the "aeroplane" class: a randomly chosen "aeroplane" image receives a higher score than a randomly chosen image of another class about 83% of the time.
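The ranking interpretation of AUC can be checked directly by comparing every positive score with every negative score. The scores and labels below are hypothetical one-vs-rest values for a single class.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for one class (1 = belongs to the class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
value = auc(scores, labels)
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. An AUC of 0.5 corresponds to random ranking.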

]

---

## Model Evaluation  📝📝📝

**Correlation Heatmaps**

]

- We show correlation heatmaps between all the features to determine how strongly one variable influences another.

- They also show at a glance which variables are correlated, to what degree and in which direction, and they alert us to potential multicollinearity problems.

- For `Label Spreading` the correlation coefficient for "aeroplane"-"aeroplane" is the highest starting from the top-left of 17 along the diagonal.

]

---

## Findings and Comparisons

---

## Intra-model comparison

<div class="figure" style="text-align: center">
<img src="figures//findings.png" alt=" Table 1. An overview of the results of our SSL techniques with 95% C.I." width="80%" />
<p class="caption"> Table 1. An overview of the results of our SSL techniques with 95% C.I.</p>
</div>

**Inference**:

- For our task the best SSL model generalization performance is achieved for `Label spreading` with an accuracy of nearly 31%.

- The transductive graph-based techniques showed better performance than the inductive approach for our task.
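A 95% confidence interval for an accuracy value can be approximated with the normal (Wald) formula. The sketch below is only one common choice and may differ from the method behind Table 1. The test-set size `n=1000` is a hypothetical value, since the actual size is not stated here.

```python
import math


def wald_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for a proportion (z=1.96 for 95%)."""
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - margin, accuracy + margin


# Accuracy of roughly 31% as reported above; n is an assumed test-set size.
low, high = wald_interval(0.31, n=1000)
```

The interval narrows with the square root of `n`, so quadrupling the test set halves its width.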

]

---

## Safeness of SSL

---

### What is  `Safe` in SSL?

**Safe**, here means that the generalization performance is **never statistically significantly** worse than methods using only labeled data.

]

---

## Inter-model comparison

<div class="figure" style="text-align: center">
<img src="figures//SafeSSL.png" alt="Fig 9. Safe SSL pipeline" width="50%" />
<p class="caption">Fig 9. Safe SSL pipeline</p>
</div>

We trained both the semi-supervised and the supervised models on splits with varying labeled and unlabeled proportions and evaluated them on a fixed test set. Comparing the efficiency of SSL with the supervised approach, we concluded that the `Safe SSL` assumption **does not** hold, as the supervised model performed better on the test data.

]

---

## Baseline comparison

We compare our work with a study `\(^{[2]}\)` on a similar dataset using BOV features, implemented with additive and exponential kernel-based supervised SVM classifiers. The performances reported there were in the range of 48.9%-52%. In comparison, our SSL techniques attain nearly 31%, which is not superior to their supervised counterparts.

]

" _Based on thorough analysis and different experimentation on different sets of features at a varying proportion of the labeled and unlabeled set, we were finally able to conclude that an increased proportion of labels can help in achieving a better predictive performance in a semi-supervised learning paradigm._"

]

### Future work

- **Feature refining**: Extract relevant features by using Deep Learning-based techniques like Convolutional Neural Networks.

- **Model ensemble**: Combining classifiers by voting or averaging to improve performance.

- **Active Learning**: State-of-the-art algorithms and statistical methods to boost the predictive power.

- **Online-SSL**: Where labeled and unlabeled instances arrive sequentially.

]

---

## References

[1] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. _The PASCAL Visual Object Classes Challenge 2007 (VOC2007)_ Results. http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html

[2] Florent Perronnin, Jorge Sánchez, and Yan Liu. _Large-scale image categorization with explicit data embedding_. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2297–2304, 2010

---

# Thank you! Questions?

&nbsp;