Loading Data into a Data Frame

Introduction: What do men think it means to be a man?

FiveThirtyEight asked more than 1,600 men whether they felt the #MeToo movement had changed their perception of masculinity. The study set out to understand how #MeToo affects the way men feel about being men. It raised important questions about male identity: for example, participants were asked whether society puts unhealthy pressure on men. The study also investigated male perceptions of gender in the workplace, among many other topics. More information about the study can be found in this FiveThirtyEight article.

Masculinity Survey data

This demo explores the dataset, ‘masculinity-survey’, associated with this study. We will start by accessing the data and loading it as an R dataframe, as follows:

We will create the environment we need to run this .Rmd by loading the necessary R packages
Access the data from FiveThirtyEight & use the dim() function to get a sense of the size of our dataframe
Get the current names of the columns

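#load the necessary R packages
#note: library(tidyverse) already attaches dplyr, ggplot2 and readr (which provides read_csv()), so the two calls after it are redundant but harmless
library(tidyverse)
library(dplyr)
library(ggplot2)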

#Access the data from FiveThirtyEight & use the dim() function to get a sense of the size of our dataframe
#the url to the raw data in FiveThirtyEight's GitHub repository
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/raw-responses.csv"
#use the read_csv() function to read the csv file into an R dataframe
masculinity_DF <- read_csv( url )
#use the dim() function to get a sense of the size of the df
dim( masculinity_DF )

## [1] 1615   98

Regarding the dimensions of our dataframe:

This dataframe has 1615 rows, one for each of the 1615 participants in the survey conducted by FiveThirtyEight. Each participant’s record holds 98 columns, so there are 98 variables available to analyze in this dataset.

Let’s view the column names.

#Get the current names of the columns
names( masculinity_DF )

##  [1] "X1"          "StartDate"   "EndDate"     "q0001"       "q0002"      
##  [6] "q0004_0001"  "q0004_0002"  "q0004_0003"  "q0004_0004"  "q0004_0005" 
## [11] "q0004_0006"  "q0005"       "q0007_0001"  "q0007_0002"  "q0007_0003" 
## [16] "q0007_0004"  "q0007_0005"  "q0007_0006"  "q0007_0007"  "q0007_0008" 
## [21] "q0007_0009"  "q0007_0010"  "q0007_0011"  "q0008_0001"  "q0008_0002" 
## [26] "q0008_0003"  "q0008_0004"  "q0008_0005"  "q0008_0006"  "q0008_0007" 
## [31] "q0008_0008"  "q0008_0009"  "q0008_0010"  "q0008_0011"  "q0008_0012" 
## [36] "q0009"       "q0010_0001"  "q0010_0002"  "q0010_0003"  "q0010_0004" 
## [41] "q0010_0005"  "q0010_0006"  "q0010_0007"  "q0010_0008"  "q0011_0001" 
## [46] "q0011_0002"  "q0011_0003"  "q0011_0004"  "q0011_0005"  "q0012_0001" 
## [51] "q0012_0002"  "q0012_0003"  "q0012_0004"  "q0012_0005"  "q0012_0006" 
## [56] "q0012_0007"  "q0013"       "q0014"       "q0015"       "q0017"      
## [61] "q0018"       "q0019_0001"  "q0019_0002"  "q0019_0003"  "q0019_0004" 
## [66] "q0019_0005"  "q0019_0006"  "q0019_0007"  "q0020_0001"  "q0020_0002" 
## [71] "q0020_0003"  "q0020_0004"  "q0020_0005"  "q0020_0006"  "q0021_0001" 
## [76] "q0021_0002"  "q0021_0003"  "q0021_0004"  "q0022"       "q0024"      
## [81] "q0025_0001"  "q0025_0002"  "q0025_0003"  "q0026"       "q0028"      
## [86] "q0029"       "q0030"       "q0034"       "q0035"       "q0036"      
## [91] "race2"       "racethn4"    "educ3"       "educ4"       "age3"       
## [96] "kids"        "orientation" "weight"

Subsetting the Masculinity Survey data

We can see the column names in the output above. Aside from a few columns whose names suggest demographic information or record identifiers, most columns are labelled only with the number of the corresponding survey question, which tells us little about their content.
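The question columns appear to follow a pattern: a base ID such as q0007, optionally followed by a suffix such as _0001 for each sub-item. As a minimal sketch, assuming that pattern holds, the names can be grouped by base question to see how many columns each question occupies. The vector below is a hand-copied sample of the names printed above, not the full set, and the variable names are illustrative.

```r
# A hand-copied sample of column names from the output above (not the full set)
col_names <- c("q0001", "q0002", "q0004_0001", "q0004_0002", "q0004_0003",
               "q0005", "q0007_0001", "q0007_0002", "age3", "weight")

# Keep only the survey-question columns (names starting with "q" plus four digits)
question_cols <- grep("^q[0-9]{4}", col_names, value = TRUE)

# Strip the "_0001"-style suffix to recover the base question ID
base_question <- sub("_[0-9]+$", "", question_cols)

# Count how many columns belong to each base question
items_per_question <- table(base_question)
print(items_per_question)
```

On the real data, replacing col_names with names( masculinity_DF ) should give the same kind of summary for all 98 columns.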

To understand what the data represent, we will have to look at the actual FiveThirtyEight Masculinity Survey.

The questions are broad and cover a variety of topics, from perspectives on dating to opinions about professional working environments. For the sake of this exercise, however, we will create a subset of the dataframe that focuses on just a few of the questions about masculinity & #MeToo. The selection was based on personal interest; I think these particular questions would be interesting to look at as a function of age range. They are, subjectively, broader in scope and have relatively simple categorical answers (e.g. yes or no). If you find another survey question more thought-provoking, please use this code to pursue your own analysis!

Columns of interest:

age3 What is your age? (“18 - 34”, “35 - 64”, or “65 and up”)
q0001 In general, how masculine or ‘manly’ do you feel?
q0002 How important is it to you that others see you as masculine?
q0005 Do you think that society puts pressure on men in a way that is unhealthy or bad for them?
q0014 How much have you heard about the #MeToo movement?
q0015 As a man, would you say you think about your behavior at work differently in the wake of #MeToo?

This next block of code will create a new dataframe that holds a subset of the masculinity_DF corresponding to the columns of interest. We will also rename the columns with more intuitive labels.

#create a dataframe that is a subset of masculinity_DF, holding only the columns we are interested in
#select() also lets us assign the new column names in the same step (new_name = old_name)
subsetMasc_DF <- masculinity_DF %>%
    select( Age = age3, How_Manly = q0001, Important = q0002,
            Unhealthy = q0005, MeToo_Aware = q0014, MeToo_Wake = q0015 )
#display the first several rows of the new dataframe 'subsetMasc_DF'
head( subsetMasc_DF )

## # A tibble: 6 x 6
##   Age       How_Manly         Important        Unhealthy MeToo_Aware  MeToo_Wake
##   <chr>     <chr>             <chr>            <chr>     <chr>        <chr>     
## 1 35 - 64   Somewhat masculi… Somewhat import… Yes       <NA>         <NA>      
## 2 65 and up Somewhat masculi… Somewhat import… Yes       <NA>         <NA>      
## 3 35 - 64   Very masculine    Not too importa… No        A lot        No        
## 4 65 and up Very masculine    Not too importa… No        <NA>         <NA>      
## 5 35 - 64   Very masculine    Very important   Yes       A lot        Yes       
## 6 65 and up Very masculine    Somewhat import… Yes       Only a litt… No
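The preview above shows <NA> in the #MeToo columns for several respondents, so it may be worth counting the missing values per column before analyzing them. The sketch below does this on a toy stand-in that copies only the Age and MeToo_Wake values from the six rows printed above; toy_subset and na_counts are illustrative names, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for subsetMasc_DF: the Age and MeToo_Wake values of the six rows shown above
toy_subset <- data.frame(
  Age        = c("35 - 64", "65 and up", "35 - 64", "65 and up", "35 - 64", "65 and up"),
  MeToo_Wake = c(NA, NA, "No", NA, "Yes", "No"),
  stringsAsFactors = FALSE
)

# Count the missing values in every column
na_counts <- toy_subset %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
print(na_counts)
```

The same summarise() call should work unchanged on subsetMasc_DF.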

Analyzing & visualizing the subset of data

We have just selected a subset of data from a much larger dataset. This subset keeps the specific features (columns) of the data that we are interested in analyzing. In the next block of code, we will look at how men from different age groups perceive how ‘manly’ they feel.

#aggregate the data subset by the 'Age' and 'How_Manly' columns and calculate the frequency of each combination
freq_AgeManly <- subsetMasc_DF %>%
    group_by( Age, How_Manly ) %>%
    summarize( Freq = n() )
#now we need to find the frequency of the Age groups
freq_Age <- subsetMasc_DF %>%
    group_by( Age ) %>%
    summarize( Freq = n() )
#we would like to find the relative proportion of each perceived-manliness answer within each age group, so we will populate a new column with the total frequency of the corresponding age group
index <- freq_Age[["Age"]]
values <- freq_Age[["Freq"]]
freq_AgeManly$AgeFreq <- values[match(freq_AgeManly$Age, index)]
#take a peek at the first few rows to see if things worked...
head( freq_AgeManly )

## # A tibble: 6 x 4
## # Groups:   Age [2]
##   Age     How_Manly             Freq AgeFreq
##   <chr>   <chr>                <int>   <int>
## 1 18 - 34 Not at all masculine     9     133
## 2 18 - 34 Not very masculine      17     133
## 3 18 - 34 Somewhat masculine      62     133
## 4 18 - 34 Very masculine          45     133
## 5 35 - 64 No answer                9     855
## 6 35 - 64 Not at all masculine    13     855

#Great, now to make a new column in our aggregation that represents the percent of perceived manliness for each age group. This will normalize the data so that we can make a more meaningful comparison between the age groups.
freq_AgeManly$Percent <- round( ( freq_AgeManly$Freq/freq_AgeManly$AgeFreq )*100 )
head( freq_AgeManly )

## # A tibble: 6 x 5
## # Groups:   Age [2]
##   Age     How_Manly             Freq AgeFreq Percent
##   <chr>   <chr>                <int>   <int>   <dbl>
## 1 18 - 34 Not at all masculine     9     133       7
## 2 18 - 34 Not very masculine      17     133      13
## 3 18 - 34 Somewhat masculine      62     133      47
## 4 18 - 34 Very masculine          45     133      34
## 5 35 - 64 No answer                9     855       1
## 6 35 - 64 Not at all masculine    13     855       2
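The lookup with match() works, but the same percentages can be computed in a single pipeline. The “# Groups: Age” line in the output above shows that summarize() keeps the result grouped by Age, so a following mutate() can sum the frequencies within each age group. This is a minimal sketch on made-up rows (toy_DF and pct_AgeManly are illustrative names), not a replacement for the author's approach.

```r
library(dplyr)

# Toy stand-in for subsetMasc_DF (made-up rows, not the survey data)
toy_DF <- data.frame(
  Age = c(rep("18 - 34", 4), rep("35 - 64", 5)),
  How_Manly = c("Very masculine", "Very masculine", "Somewhat masculine",
                "Not very masculine", "Very masculine", "Very masculine",
                "Very masculine", "Somewhat masculine", "Somewhat masculine"),
  stringsAsFactors = FALSE
)

pct_AgeManly <- toy_DF %>%
  group_by(Age, How_Manly) %>%
  # "drop_last" leaves the result grouped by Age only
  summarize(Freq = n(), .groups = "drop_last") %>%
  # sum(Freq) is therefore the total number of respondents in each age group
  mutate(AgeFreq = sum(Freq),
         Percent = round(100 * Freq / AgeFreq)) %>%
  ungroup()
print(pct_AgeManly)
```

Because every respondent is counted exactly once, summing Freq within an age group gives the same total as counting the rows of that age group directly.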

#Now we will prepare the data to be plotted as a bar graph.
#This will implement an example graph that can be found at:
#https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html

# create a dataset to plot
data2plot <- data.frame(HM = freq_AgeManly$How_Manly, A = freq_AgeManly$Age, P = freq_AgeManly$Percent)
#head( data2plot )
# Grouped
ggplot( data2plot, aes(fill=A, y=P, x=HM)) + 
    geom_bar(position="dodge", stat="identity") +
    xlab("Perceived Manliness") + ylab("%") +
    ggtitle("Perceived Manliness by Age Group") +
    labs(fill = "Age Group")
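By default, ggplot2 orders a character x-axis alphabetically, which places “No answer” before the substantive answers. One possible refinement, sketched below, is to convert the answers to a factor with an explicit level order. The order chosen here (least to most masculine, with “No answer” last) is an assumption about what reads best, and the data are only the six rows of freq_AgeManly printed earlier, not the full table.

```r
library(ggplot2)

# The six rows of freq_AgeManly printed earlier (a partial sample, not the full table)
plot_data <- data.frame(
  HM = c("Not at all masculine", "Not very masculine", "Somewhat masculine",
         "Very masculine", "No answer", "Not at all masculine"),
  A  = c(rep("18 - 34", 4), rep("35 - 64", 2)),
  P  = c(7, 13, 47, 34, 1, 2),
  stringsAsFactors = FALSE
)

# Assumed display order: least to most masculine, with "No answer" last
answer_levels <- c("Not at all masculine", "Not very masculine",
                   "Somewhat masculine", "Very masculine", "No answer")
plot_data$HM <- factor(plot_data$HM, levels = answer_levels)

# geom_col() is equivalent to geom_bar(stat = "identity")
p <- ggplot(plot_data, aes(x = HM, y = P, fill = A)) +
  geom_col(position = "dodge") +
  labs(x = "Perceived Manliness", y = "%", fill = "Age Group",
       title = "Perceived Manliness by Age Group")
```

Calling print(p) draws the chart, with the bars following the order given in answer_levels.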

Findings & Recommendations

The figure above plots the percentage of men in each age group who self-reported each level of how ‘manly’ they feel. In all age groups, the majority of men reported feeling “Somewhat masculine” or “Very masculine”. Comparatively fewer men self-reported as feeling “Not very” or “Not at all” masculine. The distributions for the “35 - 64” and “65 and up” age groups were very similar. However, the “18 - 34” age group deviated from the other two. For example, the “18 - 34” group had comparatively higher percentages of men who identified as “Not at all” or “Not very” masculine. The trend reverses for the “Somewhat” and “Very” masculine responses, where the youngest group had comparatively lower percentages. Keep in mind that the “18 - 34” group is much smaller (133 respondents, versus 855 in the “35 - 64” group), so its percentages are less precise.

We just examined how perceived feelings of manliness vary across the age groups. There are several other questions we can ask of our data subset:

Does age affect the importance men place on others seeing them as masculine?
Does age impact the likelihood that a man perceives an unhealthy societal pressure because of his gender?
Are different age groups more or less aware of the #MeToo movement than their peers?

These are just a few examples. This dataset is high-dimensional and information-rich, so there are many directions the analysis can take. What other questions would you choose to analyze?
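As one possible starting point for such follow-up questions, the per-age-group percentage calculation can be wrapped in a small reusable function. percent_by_age() below is a hypothetical helper, not part of the original analysis; it assumes the dataframe has an Age column and uses made-up rows for illustration.

```r
library(dplyr)

# Hypothetical helper: percentage of each answer within each age group.
# `answer_col` is passed unquoted, e.g. percent_by_age(df, Unhealthy).
percent_by_age <- function(df, answer_col) {
  df %>%
    count(Age, {{ answer_col }}, name = "Freq") %>%   # rows per Age x answer
    group_by(Age) %>%
    mutate(Percent = round(100 * Freq / sum(Freq))) %>%
    ungroup()
}

# Toy stand-in for subsetMasc_DF (made-up rows, not the survey data)
toy_DF <- data.frame(
  Age       = c("18 - 34", "18 - 34", "18 - 34", "18 - 34", "65 and up", "65 and up"),
  Unhealthy = c("Yes", "Yes", "Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

unhealthy_by_age <- percent_by_age(toy_DF, Unhealthy)
print(unhealthy_by_age)
```

With the real subset, a call such as percent_by_age( subsetMasc_DF, MeToo_Aware ) should produce the equivalent table for any of the other columns of interest.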