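Let’s load the packages we need and read our data into a data frame.

library(tidyverse)  # dplyr, stringr, ggplot2, readr
library(gridExtra)  # grid.arrange()
df <- read.csv("./data/social_media_earlycovid_india.csv")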

Executive summary

In 2020, the arrival of COVID-19 changed lives around the globe, largely through the containment measures that followed. As lockdown policies were put in place, people’s use of media changed too. How exactly did these measures affect media consumption? Did people’s overall media use depend on the restrictions in force where they lived? Were women and men affected in the same way? Did they spend the same amount of time on media, and on the same platforms?

We therefore study whether people’s media use and lifestyle in the early days of COVID-19 depended on factors such as gender or the lockdown policy in their area, using India as a case study.

Data background

The dataset comes from a survey of Indian men and women taken in the early stage of the COVID-19 pandemic. It records their media consumption (time spent and platforms used), the lockdown zone and situation of the region they were in, their sex, age range, exercise habits, sleep habits, and other items such as opinions on the situation. It contains 586 rows (responses) and 39 variables, one of which is an index. The data was originally an Excel file, which we converted to CSV for easier use.

Data cleaning

We first select and rename the columns we are most interested in, because the original names of the 39 variables are too long and awkward to work with.

df_new <- select(df, 
                 sex = Sex,
                 age = Age..in.years.,
                 zone = A..Lockdown.zone.category..As.of.May.2020.,
                 social_media_time = B..Total.time.Spent.on.Social.Media.apps,
                 social_media_top1 = X1,
                 video_streaming_time = E..Time.Spent.on.Video.Streaming.Apps,
                 video_streaming_top1 = X1.1,
                 online_news_time = H..How.much.time.do.you.spend.reading.the.news.online.,
                 music_streaming = H..Do.you.Stream.Music.Online.,
                 sleep_hours = How.Many.hours.Do.you.sleep.on.an.average.,
                 lockdown_exercise = J..Do.yo.exercise.during.lockdown.,
                 lockdown_exercise_weekly = K..How.many.Times.do.you.exercise.in.a.week.,
                 lockdown_exercise_type = L..What.type.of.exercise.method.do.you.prefer.,
                 new_skills = P..Any.new.skills.acquired.during.lockdown.,
                 new_habits_name = Name.any.new.habits.or.lifestyle.changes.that.you.have.incorporated.in.this.ongoing.lockdown.period..Reading..sketching..painting..cooking..home.chores..etc.)

After the selection, we are left with 15 variables.

colnames(df_new)
##  [1] "sex"                      "age"                     
##  [3] "zone"                     "social_media_time"       
##  [5] "social_media_top1"        "video_streaming_time"    
##  [7] "video_streaming_top1"     "online_news_time"        
##  [9] "music_streaming"          "sleep_hours"             
## [11] "lockdown_exercise"        "lockdown_exercise_weekly"
## [13] "lockdown_exercise_type"   "new_skills"              
## [15] "new_habits_name"

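However, all of our variables are categorical, which limits the kinds of graphs we can produce. We therefore add numeric columns that replace each time range with its midpoint (for example, “1Hr-2Hrs” becomes 1.5). The open-ended “6Hrs+” range is coded as 6.5 and an empty answer as 0.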

# We use the "stringr" library to get rid of white spaces in the involved columns so it is easier to deal with
df_new$social_media_time <- str_replace_all(df_new$social_media_time, fixed(" "), "")
df_new$video_streaming_time <- str_replace_all(df_new$video_streaming_time, fixed(" "), "")

# We create vectors with the current time ranges and what their associated value in the new variable will be
old_time <- c("0Hr-1Hr", "1Hr-2Hrs", "2Hrs-3Hrs", "3Hrs-4Hrs",
              "4Hrs-5Hrs", "5Hrs-6Hrs", "6Hrs+", "")
new_time <- c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 0)
# We add the new numerical "average time" columns;
# match() finds each label's position in old_time, which then indexes new_time
df_new <- df_new %>%
    mutate(social_media_time_avg = new_time[match(social_media_time, old_time)],
           video_time_avg = new_time[match(video_streaming_time, old_time)])
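
One caveat of the match() lookup is that any label missing from old_time silently becomes NA, which then propagates to every sum that uses it. The sketch below, in base R and on made-up labels (the "7Hrs-8Hrs" value is hypothetical and does not come from the survey), shows the lookup together with a quick check for unmatched labels.

```r
# Lookup table: same labels and midpoints as in the cleaning step above
old_time <- c("0Hr-1Hr", "1Hr-2Hrs", "2Hrs-3Hrs", "3Hrs-4Hrs",
              "4Hrs-5Hrs", "5Hrs-6Hrs", "6Hrs+", "")
new_time <- c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 0)

# Toy answers; the last label is deliberately absent from old_time
ranges <- c("0Hr-1Hr", "2Hrs-3Hrs", "6Hrs+", "", "7Hrs-8Hrs")

# match() returns the position of each label in old_time, or NA if not found
idx <- match(ranges, old_time)
hours <- new_time[idx]

# Labels that could not be converted; ideally this is empty on real data
unmatched <- unique(ranges[is.na(idx)])
print(hours)
print(unmatched)
```

Running the same check on the real columns before computing totals would show whether any answer was lost in the conversion.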

We add a column for total media time, the sum of social media time and video streaming time.

df_new <- df_new %>%
    mutate(total_media_time = social_media_time_avg + video_time_avg)

Let’s check how many females and males answered the survey, to make sure the two groups are of similar size before we compare them.

df_new %>%
    group_by(sex) %>%
    count()
## # A tibble: 2 × 2
## # Groups:   sex [2]
##   sex        n
##   <chr>  <int>
## 1 Female   289
## 2 Male     297

The numbers of male and female respondents are balanced (297 and 289).
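
As an optional complement to reading the counts by eye, a chi-squared goodness-of-fit test can check whether the split is compatible with equal group sizes. This is only a sketch added for illustration, not part of the original analysis; it uses the two counts printed above and the base R function chisq.test(), which tests against equal proportions when given a single vector of counts.

```r
# Respondent counts taken from the table above
counts <- c(Female = 289, Male = 297)

# Goodness-of-fit test against equal proportions (the default when only x is given)
balance_test <- chisq.test(counts)

# A large p-value means the data give no evidence of an unbalanced sample
print(balance_test$statistic)
print(balance_test$p.value)
```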

We can check on age groups to maybe get some more information:

df_new %>%
    group_by(age) %>%
    count()
## # A tibble: 6 × 2
## # Groups:   age [6]
##   age        n
##   <chr>  <int>
## 1 21-30    463
## 2 31-40     38
## 3 41-50      8
## 4 51-60      7
## 5 60+        1
## 6 Oct-20    69

We notice that most respondents (463 of 586) are between 21 and 30 years old, so the other age groups are too small for a meaningful comparison by age.

We also see a strange value that isn’t an age range: “Oct-20”. It is probably the “10-20” range turned into a date when the data was stored in Excel, but we cannot confirm this from the file itself. Since the number of affected rows is relatively small (69), we decide to delete these answers.

df_new <- df_new %>%
    filter(age != "Oct-20")
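
An alternative to dropping these rows, not used in this report, would be to recode the label. The sketch below assumes that “Oct-20” stands for a “10-20” age range that a spreadsheet converted to a date; that mapping is an assumption and should be checked against the original survey form before being relied on. The age vector is toy data.

```r
# Toy age column containing the suspicious label (hypothetical values)
age <- c("21-30", "Oct-20", "31-40", "Oct-20", "60+")

# ASSUMPTION: "Oct-20" is a date-mangled "10-20"; recode it instead of deleting it
age_recoded <- ifelse(age == "Oct-20", "10-20", age)

# Ordered factor so that the age groups sort naturally in tables and plots
age_levels <- c("10-20", "21-30", "31-40", "41-50", "51-60", "60+")
age_factor <- factor(age_recoded, levels = age_levels, ordered = TRUE)

print(table(age_factor))
```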

We back up the data selected so far.

write_csv(df_new, "./data/selected_columns.csv")

Individual figures

Figure 1

We first look at overall media consumption by gender, without focusing on any particular type of media. We use a density plot so that we can compare the distributions of hours per day.

p1 <- ggplot(data = df_new,
            mapping = aes(x = total_media_time,color = sex, fill = sex))

p1 <- p1 + geom_density(alpha = 0.6) +
    scale_y_continuous(labels = scales::percent) +
    scale_x_continuous(breaks = c(2,4,6,8,10,12)) +
    scale_fill_manual(values = c("mediumpurple3", "slategray2")) +
    scale_color_manual(values = c("mediumpurple3", "slategray2")) +
    theme_bw() +
    facet_wrap(~ sex) +
    guides(color = "none", fill = "none") +
    labs(x = "Total Media Usage Time (in hours)", y = NULL,
       title = "Distribution of hours per day spent on media consumption depending on Gender",
       caption = "DATA: Social Media Usage Data During early Days of Covid (India)")
p1

ggsave("Pic1.png", plot = p1, path = "./images")
## Saving 7 x 5 in image

We notice that total media use is distributed very similarly in the two groups, except that slightly more men fall in the highest range of media consumption.
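
A visual comparison of two densities can be backed up with a few summary statistics per group. The sketch below uses base R on a small made-up data frame (the values are illustrative, not survey results); on the real data, toy would be replaced by df_new.

```r
# Toy data standing in for df_new (hypothetical values)
toy <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male", "Male"),
  total_media_time = c(2, 4, 6, 3, 4, 11)
)

# Mean and median hours per group
mean_by_sex <- tapply(toy$total_media_time, toy$sex, mean)
median_by_sex <- tapply(toy$total_media_time, toy$sex, median)

# 90th percentile per group, to look specifically at the heaviest users
q90_by_sex <- tapply(toy$total_media_time, toy$sex, quantile, probs = 0.9)

print(mean_by_sex)
print(median_by_sex)
print(q90_by_sex)
```

Similar medians combined with a higher upper quantile for one group would match the pattern described above: comparable typical use, with a heavier tail on one side.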

Figure 2

Even though the distribution of total hours per day spent on media is similar for Males and Females, the difference may lie in the type of media each group prefers or in their preferred platforms. We therefore take the most popular social media and video streaming applications (those named as first choice by at least 20 respondents of a given sex) and compare their share of users among Females and Males.

social_by_sex <- df_new %>%
    group_by(sex, social_media_top1) %>%
    # .groups = "drop_last" keeps the grouping by sex, so freq is a share within each sex
    summarize(N = n(), .groups = "drop_last") %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0)) %>%
    filter(N >= 20)
p2 <- social_by_sex %>%
    ggplot(aes(x = social_media_top1, y = freq, color = social_media_top1)) +
    geom_point(size = 5) +
    # Segment from 0 to the observed share: the "stick" of the lollipop
    geom_segment(aes(xend = social_media_top1, y = 0, yend = freq),
                 linewidth = 2) +
    facet_wrap(~sex) +
    coord_flip() +
    scale_y_continuous(labels = scales::percent) +
    scale_color_brewer(palette = "Set2") +
    guides(color = "none") +
    labs(x = "Social Media Application",
         y = NULL,
         title = "Distribution of Social Media usage by Gender",
         caption = "DATA: Social Media Usage Data During early Days of Covid (India)") +
    theme_bw()
p2

ggsave("Pic2.png", plot = p2, path = "./images")
## Saving 7 x 5 in image
video_by_sex <- df_new %>%
    group_by(sex, video_streaming_top1) %>%
    # .groups = "drop_last" keeps the grouping by sex, so freq is a share within each sex
    summarize(N = n(), .groups = "drop_last") %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 0)) %>%
    filter(N >= 20)
p3 <- video_by_sex %>%
    ggplot(aes(x = video_streaming_top1, y = freq, color = video_streaming_top1)) +
    geom_point(size = 5) +
    # Segment from 0 to the observed share: the "stick" of the lollipop
    geom_segment(aes(xend = video_streaming_top1, y = 0, yend = freq),
                 linewidth = 2) +
    facet_wrap(~sex) +
    coord_flip() +
    scale_y_continuous(labels = scales::percent) +
    scale_color_brewer(palette = "Accent") +
    guides(color = "none") +
    labs(x = "Video Streaming Application",
         y = NULL,
         title = "Distribution of Video Streaming usage by Gender",
         caption = "DATA: Social Media Usage Data During early Days of Covid (India)") +
    theme_bw()
p3

ggsave("Pic3.png", plot = p3, path = "./images")
## Saving 7 x 5 in image
fig2 <- grid.arrange(p2, p3, ncol = 1)

ggsave("Figure2.png", plot = fig2, path = "./images")
## Saving 7 x 5 in image

We notice that, in proportion, more Females than Males use Instagram when it comes to messaging and social media. For video streaming services, YouTube is used much more by Males, whereas Females prefer Netflix. One possible explanation, which remains a hypothesis, is that women watch more TV shows and movies while men prefer shorter videos.
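
The comparison above relies on shares computed within each sex rather than on raw counts, so that a difference in group size cannot be mistaken for a difference in preference. A minimal base R sketch of that computation on made-up data (the rows are illustrative, not survey answers):

```r
# Toy data: each row is one respondent's favourite application (hypothetical)
toy <- data.frame(
  sex = c("Female", "Female", "Female", "Female",
          "Male", "Male", "Male", "Male"),
  app = c("Instagram", "Instagram", "Instagram", "WhatsApp",
          "Instagram", "WhatsApp", "WhatsApp", "WhatsApp")
)

# Cross-tabulate, then divide each row (sex) by its own total
counts <- table(toy$sex, toy$app)
share_within_sex <- prop.table(counts, margin = 1)

# Each row sums to 1, so rows are directly comparable
print(share_within_sex)
```

With margin = 1 the shares are relative to each sex, which corresponds to the freq column computed with group_by(sex, ...) and N / sum(N) in the chunks above.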

Figure 3

Lastly, as people spent more time at home due to the coronavirus, many of them took the opportunity to learn new skills. We therefore count and compare the men and women who acquired new skills in each lockdown zone. How many males and females picked up a new skill, and is this related to the lockdown zone?

nsk_c <- df_new %>%
  select(sex, zone, new_skills) %>%
  filter(new_skills == "Yes") %>%
  group_by(zone, sex) %>%
  count(new_skills)
nsk_c
## # A tibble: 8 × 4
## # Groups:   zone, sex [8]
##   zone        sex    new_skills     n
##   <chr>       <chr>  <chr>      <int>
## 1 Containment Female Yes            3
## 2 Containment Male   Yes            4
## 3 Green       Female Yes           92
## 4 Green       Male   Yes           98
## 5 Orange      Female Yes           23
## 6 Orange      Male   Yes           27
## 7 Red         Female Yes           58
## 8 Red         Male   Yes           73
sum(nsk_c$n)  
## [1] 378
p4 <- ggplot(data = nsk_c,
            mapping = aes(x = zone, y = n, fill = zone))
p4 <- p4 + geom_col(alpha = 0.8) +
    facet_wrap(~ sex) +
    scale_fill_manual(values = c("wheat2", "darkolivegreen3", "salmon2", "orangered3")) +
    labs(x = "Lockdown Zone Category", y = "Number of people who acquired new skills",
       title = "New skills acquisition by Gender in different Lockdown Zones",
       caption = "DATA: Social Media Usage Data During early Days of Covid (India)") +
    theme_bw() +
    guides(fill = "none")

p4

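As can be seen from the chart, the numbers of men and women who acquired new skills are similar within each lockdown zone, with men slightly ahead in every zone (for example, 73 against 58 in the Red zone). Because these are raw counts, the differences between zones may simply reflect how many respondents live in each zone rather than an effect of the zone itself.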

ggsave("Pic4.png", plot = p4, path = "./images")
## Saving 7 x 5 in image
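
To compare zones on an equal footing, one option is to look at the share of respondents who answered “Yes” in each zone and sex instead of the raw count. The sketch below uses base R on made-up data (the rows are illustrative only); on the real data, toy would be replaced by df_new, assuming new_skills holds “Yes” for respondents who acquired a skill, as in the filter used above.

```r
# Toy data standing in for df_new (hypothetical values)
toy <- data.frame(
  zone = c("Green", "Green", "Green", "Green", "Red", "Red", "Red", "Red"),
  sex = c("Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male"),
  new_skills = c("Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "No")
)

# TRUE/FALSE indicator; the mean of a logical vector is the share of TRUE
toy$learned <- toy$new_skills == "Yes"

# Share of respondents who acquired a new skill, per zone and sex
share_yes <- aggregate(learned ~ zone + sex, data = toy, FUN = mean)
print(share_yes)
```

Plotting learned instead of n in the bar chart would show whether the propensity to learn new skills actually differs between zones.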