For the sample of groups obtained after removing online markets that have a website, have an offline store, lack discussion boards, or list fewer than 2 products (1742 markets), I collected the data that we will need for the analysis as control, independent, and dependent variables.
After this step the size of our sample changes again. Below I briefly present the data on reviews and products. During data collection it turned out that some groups had become private or had been banned, so we lose 17 observations here. After some manual inspection I also found 72 clothing-rental groups in the sample; these are irrelevant, so I dropped them as well. Finally, I removed the groups with websites that I had somehow failed to drop in the first place, so before the collection of reviews and products there were 1584 communities.
Two things matter about the products: counts and prices. Counts matter because an extremely low number of products in the market module suggests that the seller does not really use the market as the main mechanism for displaying products. We need to discuss where to set the lower threshold on the number of products. Technically, we need reliable product data only to test the hypothesis involving prices. For the other hypotheses it is only important to know that a community is a marketplace.
Here is the summary of the number of products for 1567 communities. Overall, I collected 91414 products. Note that this sample does not include any communities with fewer than 2 products, which is why the minimum is 2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 8.00 22.00 248.54 85.75 14980.00
The histogram is divided into two parts. The first shows the half of the distribution below the median (median number of products = 22), to illustrate how many communities have an extremely low number of products.
The second histogram shows everything from the median up to the maximum.
I aggregated the prices at the community level; I calculated min, max, median and mean for each of them.
Important to know: VK limits the number of products one can collect per community to 200 items. This means that for communities with more than 200 products (only 14% of ours) the aggregated prices might be inaccurate. However, the products are not sorted alphabetically or by price (ascending or descending), so I believe calculating aggregated prices from the 200 most recently added products is OK.
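As an illustration of the aggregation and the 200-item caveat, here is a minimal Python sketch. It is not the script used for the collection; the function name `aggregate_prices`, the `listed_count` argument, and the example prices are assumptions made up for the illustration.

```python
from statistics import mean, median

VK_PRODUCT_CAP = 200  # per-community collection limit described above


def aggregate_prices(prices, listed_count):
    """Summarise one community's collected prices (in roubles).

    prices       -- prices of the collected products (at most 200 of them)
    listed_count -- total number of products the community lists (assumed known)
    """
    return {
        "min": min(prices),
        "max": max(prices),
        "median": median(prices),
        "mean": mean(prices),
        # True when the aggregates rest on only part of the community's products
        "capped": listed_count > VK_PRODUCT_CAP,
    }


# Hypothetical community with three products, all of them collected
example = aggregate_prices([500.0, 1500.0, 4000.0], listed_count=3)
```

Keeping a `capped` flag next to the aggregates would make it easy to rerun the price analysis later with and without the communities whose prices come from a partial sample.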
The other thing is that some of the prices were given in US dollars or Ukrainian hryvnias; I converted them into roubles using the current exchange rates. Should we later convert roubles into euros?
1 euro = 71 roubles
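A small Python sketch of the conversion logic, for illustration only. The euro rate is the one quoted above; the dollar and hryvnia rates are placeholder values (assumptions), not the rates actually used, and the function names are hypothetical.

```python
# Roubles per one unit of each currency.
# EUR is the rate quoted above; USD and UAH are placeholders (assumptions)
# that must be replaced with the exchange rates actually used.
RUB_PER_UNIT = {"RUB": 1.0, "EUR": 71.0, "USD": 65.0, "UAH": 2.5}


def to_roubles(amount, currency):
    """Convert a price in the given currency into roubles."""
    return amount * RUB_PER_UNIT[currency]


def roubles_to_euros(amount_rub):
    """Convert a rouble amount into euros at the quoted rate."""
    return amount_rub / RUB_PER_UNIT["EUR"]


# Example: a price of 1600 roubles expressed in euros
price_eur = round(roubles_to_euros(1600.0), 2)
```

At 71 roubles per euro, 1600 roubles comes to about 22.54 euros.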
Summary of aggregated median prices
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 871.25 1600.00 2441.95 2700.00 125000.00
Summary of aggregated mean prices
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 989.55 1805.51 2662.66 2902.95 136666.67
Overall distribution of median prices
Distribution from 0 to 10k roubles
For the 1567 communities I first collected all the discussion boards they have. As you remember, communities with no discussion boards were already excluded from this sample. Next, I checked how many of the marketplaces have a discussion board containing words such as review|suggestions|complaints etc. There are only 1231 of those, so the sample shrinks again. Overall, for 1231 communities I got 79390 reviews. However, the distribution is, unsurprisingly, uneven (see below).
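The keyword check can be sketched in Python roughly as follows. This assumes the match is done on board titles, and uses only the three keywords named above (the real list is longer, and the actual titles may not be in English). The community ids and titles are invented for the example.

```python
import re

# Only the keywords named in the text; the full list ("etc.") is not shown here.
REVIEW_BOARD_PATTERN = re.compile(r"review|suggestions|complaints", re.IGNORECASE)


def has_review_board(board_titles):
    """True if at least one board title matches a review-related keyword."""
    return any(REVIEW_BOARD_PATTERN.search(title) for title in board_titles)


# Hypothetical communities mapped to the titles of their discussion boards
boards = {
    "community_a": ["Customer reviews", "Delivery terms"],
    "community_b": ["Size chart", "How to order"],
}
with_reviews = [cid for cid, titles in boards.items() if has_review_board(titles)]
```

Here only `community_a` would stay in the sample, because none of the titles in `community_b` match the pattern.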
For 369 communities we have fewer than 3 reviews, so we will probably exclude them from the sample, because feedback sentiment is our dependent variable.
Again, we have to decide how many reviews on the discussion board are enough to keep a group in the sample. Maybe we can base this on the ratio of reviews to followers?
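One possible way to express the ratio idea, as a Python sketch. The cutoff `MIN_REVIEWS_PER_1000` is a purely hypothetical value that we would still have to agree on, and the follower counts in the example are invented.

```python
MIN_REVIEWS_PER_1000 = 1.0  # hypothetical cutoff, still to be decided


def reviews_per_1000_followers(n_reviews, n_followers):
    """Number of reviews per 1000 followers; 0.0 for communities with no followers."""
    if n_followers == 0:
        return 0.0
    return 1000 * n_reviews / n_followers


def keep_community(n_reviews, n_followers):
    """Keep a community if its review rate reaches the (assumed) cutoff."""
    return reviews_per_1000_followers(n_reviews, n_followers) >= MIN_REVIEWS_PER_1000


# Hypothetical communities: (number of reviews, number of followers)
small_active = keep_community(12, 800)    # 15 reviews per 1000 followers
large_silent = keep_community(5, 40000)   # 0.125 reviews per 1000 followers
```

A relative cutoff like this treats small but active communities more generously than a fixed minimum number of reviews would.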
In addition, we will later have to exclude the reviews written by community managers. Unfortunately, some of them are not marked as community administrators in the dedicated field, so for communities where a high proportion of posts on the review board comes from the same users, we will have to remove them manually.
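To shortlist the communities that need this manual check, something like the following Python sketch could be used. The 0.5 threshold and the author ids are assumptions for illustration, not values derived from our data.

```python
from collections import Counter

SHARE_THRESHOLD = 0.5  # hypothetical cutoff for "a high proportion"


def top_poster_share(author_ids):
    """Return (author_id, share of posts) for the most frequent poster.

    author_ids -- the author id of every post on one review board
    """
    if not author_ids:
        return None, 0.0
    author, n_posts = Counter(author_ids).most_common(1)[0]
    return author, n_posts / len(author_ids)


# Hypothetical review board: user 101 wrote 3 of the 5 posts
posts = [101, 101, 101, 202, 303]
author, share = top_poster_share(posts)
needs_manual_check = share > SHARE_THRESHOLD
```

In this example user 101 accounts for 60% of the posts, so the community would be flagged for manual inspection.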
Distribution of reviews (discussion boards with only 1 or 2 reviews excluded):
Up to the median (the half from min to median n of reviews):
From the median to the max: