This data was uploaded by Nicholas Schettini in my Data 607 class in the CUNY Master’s of Data Science program.
I have also put the raw data on my GitHub, available here:
https://raw.githubusercontent.com/heathergeiger/Data607_project2/master/TimeUse.csv
Nicholas gave the following description for the data:
“I found this dataset on time use by gender and by country. Some of the variables include eating, sleeping, employment, travel, school, study, walking the dog, etc. It seems you could analyze how males vs. females spend their time, and how each countries males and females compare to each other. Maybe certain countries spend more time doing something more than another country; same goes for gender.”
Load libraries.
library(tidyr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Read in data.
timeuse <- read.csv("TimeUse.csv",header=TRUE,skipNul = TRUE,check.names=FALSE,stringsAsFactors=FALSE)
Take a look at the file.
There are a lot of columns, so we’ll display just the first 10, along with the names of all columns.
dim(timeuse)
head(timeuse[,1:10])
colnames(timeuse)
## [1] 28 58
## SEX GEO/ACL00 Total Personal care Sleep
## 1 Males Belgium 24:00 10:45 8:15
## 2 Males Bulgaria 24:00 11:54 9:08
## 3 Males Germany (including former GDR from 1991) 24:00 10:40 8:08
## 4 Males Estonia 24:00 10:35 8:24
## 5 Males Spain 24:00 11:11 8:36
## 6 Males France 24:00 11:44 8:45
## Eating Other and/or unspecified personal care
## 1 1:49 0:42
## 2 2:07 0:39
## 3 1:43 0:49
## 4 1:19 0:52
## 5 1:47 0:48
## 6 2:18 0:41
## Employment, related activities and travel as part of/during main and second job
## 1 3:07
## 2 3:32
## 3 3:27
## 4 4:27
## 5 4:21
## 6 3:48
## Main and second job and related travel Activities related to employment and unspecified employment
## 1 3:05 0:02
## 2 3:27 0:04
## 3 3:21 0:06
## 4 4:20 0:07
## 5 4:17 0:03
## 6 3:46 0:02
## [1] "SEX"
## [2] "GEO/ACL00"
## [3] "Total"
## [4] "Personal care"
## [5] "Sleep"
## [6] "Eating"
## [7] "Other and/or unspecified personal care"
## [8] "Employment, related activities and travel as part of/during main and second job"
## [9] "Main and second job and related travel"
## [10] "Activities related to employment and unspecified employment"
## [11] "Study"
## [12] "School and university except homework"
## [13] "Homework"
## [14] "Free time study"
## [15] "Household and family care"
## [16] "Food management except dish washing"
## [17] "Dish washing"
## [18] "Cleaning dwelling"
## [19] "Household upkeep except cleaning dwelling"
## [20] "Laundry"
## [21] "Ironing"
## [22] "Handicraft and producing textiles and other care for textiles"
## [23] "Gardening; other pet care"
## [24] "Tending domestic animals"
## [25] "Caring for pets"
## [26] "Walking the dog"
## [27] "Construction and repairs"
## [28] "Shopping and services"
## [29] "Childcare, except teaching, reading and talking"
## [30] "Teaching, reading and talking with child"
## [31] "Household management and help family member"
## [32] "Leisure, social and associative life"
## [33] "Organisational work"
## [34] "Informal help to other households"
## [35] "Participatory activities"
## [36] "Visiting and feasts"
## [37] "Other social life"
## [38] "Entertainment and culture"
## [39] "Resting"
## [40] "Walking and hiking"
## [41] "Sports and outdoor activities except walking and hiking"
## [42] "Computer games"
## [43] "Computing"
## [44] "Hobbies and games except computing and computer games"
## [45] "Reading books"
## [46] "Reading, except books"
## [47] "TV and video"
## [48] "Radio and music"
## [49] "Unspecified leisure"
## [50] "Travel except travel related to jobs"
## [51] "Travel to/from work"
## [52] "Travel related to study"
## [53] "Travel related to shopping and services"
## [54] "Transporting a child"
## [55] "Travel related to other household purposes"
## [56] "Travel related to leisure, social and associative life"
## [57] "Unspecified travel"
## [58] "Unspecified time use"
In the first version of this script, I did not realize which columns were umbrella categories.
I realize now that “Personal care” is a superset of “Sleep”, “Eating”, and “Other and/or unspecified personal care”.
“Employment, related activities and travel as part of/during main and second job” is a superset of “Main and second job and related travel” and “Activities related to employment and unspecified employment”.
“Study” is a superset of “School and university except homework”, “Homework”, and “Free time study”.
“Household and family care” is a superset of activities from “Food management except dish washing” to “Household management and help family member”.
“Leisure, social and associative life” is a superset of “Organisational work” to “Unspecified leisure”.
“Travel except travel related to jobs” is a superset of “Travel to/from work” to “Unspecified travel”.
Finally, there is an “Other” type category called “Unspecified time use”.
Let’s make a table of which umbrella category each sub-category fits under.
umbrella_per_sub_category <- data.frame(Individual.activity = colnames(timeuse)[c(5:7,9:10,12:14,16:31,33:49,51:57,58)],
Umbrella = rep(c("Personal care",
"Employment, related activities and travel as part of/during main and second job",
"Study",
"Household and family care",
"Leisure, social and associative life",
"Travel except travel related to jobs",
"Unspecified time use"),
times=c(3,2,3,length(16:31),length(33:49),length(51:57),1)),
stringsAsFactors=FALSE)
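As a quick sanity check on the superset relationship, the sketch below adds up the three “Personal care” sub-categories for the first two rows of the `head()` output above and compares the result with the umbrella column. The `check` data frame and its column names are made up for illustration, and the minutes are typed in by hand from the printed values rather than read from the file.

```r
# Values copied from the head(timeuse[, 1:10]) output above (Belgium and Bulgaria, males).
# "check" and its column names are hypothetical, used only for this illustration.
check <- data.frame(
  Country  = c("Belgium", "Bulgaria"),
  Umbrella = c(10 * 60 + 45, 11 * 60 + 54),  # Personal care: 10:45, 11:54
  Sleep    = c(8 * 60 + 15,  9 * 60 + 8),    # Sleep: 8:15, 9:08
  Eating   = c(1 * 60 + 49,  2 * 60 + 7),    # Eating: 1:49, 2:07
  Other    = c(42, 39),                      # Other personal care: 0:42, 0:39
  stringsAsFactors = FALSE
)

# Sum the sub-categories and compare with the umbrella value.
check$Sub.total  <- check$Sleep + check$Eating + check$Other
check$Difference <- check$Sub.total - check$Umbrella
check
```

For these two rows the sub-categories match the umbrella value to within one minute, which is consistent with the umbrella column being a rounded total of its sub-categories.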
For the remainder of this analysis, we’ll focus only on umbrella categories.
However, we save the data from non-umbrella category activities in a different object, which we could transform and analyze in a similar way if we wanted to.
For this purpose, “Unspecified time use” is included in both the umbrella and non-umbrella sets.
umbrella_categories <- colnames(timeuse)[colnames(timeuse) %in% umbrella_per_sub_category$Umbrella]
non_umbrella_categories <- colnames(timeuse)[colnames(timeuse) %in% umbrella_per_sub_category$Individual.activity]
timeuse_umbrella <- timeuse %>% select(all_of(c("SEX","GEO/ACL00","Total",umbrella_categories)))
timeuse_non_umbrella <- timeuse %>% select(all_of(c("SEX","GEO/ACL00","Total",non_umbrella_categories)))
timeuse <- timeuse_umbrella
What countries are included in this data set?
unique(timeuse[,"GEO/ACL00"])
## [1] "Belgium" "Bulgaria" "Germany (including former GDR from 1991)"
## [4] "Estonia" "Spain" "France"
## [7] "Italy" "Latvia" "Lithuania"
## [10] "Poland" "Slovenia" "Finland"
## [13] "United Kingdom" "Norway"
I’m assuming the “Total” column will be the same for all countries, but let’s check.
If so, remove this column.
Also rename “GEO/ACL00” to “Country” and “SEX” to “Sex”.
table(timeuse$Total)
##
## 24:00
## 28
timeuse <- timeuse[,setdiff(colnames(timeuse),"Total")]
colnames(timeuse)[1:2] <- c("Sex","Country")
Convert from wide to long format.
dim(timeuse)
## [1] 28 9
timeuse <- gather(timeuse,Activity,Time,-Sex,-Country)
dim(timeuse)
## [1] 196 4
head(timeuse)
## Sex Country Activity Time
## 1 Males Belgium Personal care 10:45
## 2 Males Bulgaria Personal care 11:54
## 3 Males Germany (including former GDR from 1991) Personal care 10:40
## 4 Males Estonia Personal care 10:35
## 5 Males Spain Personal care 11:11
## 6 Males France Personal care 11:44
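The same wide-to-long reshaping can also be written with `pivot_longer()`, which newer versions of tidyr recommend over `gather()`. The sketch below is only an illustration on a toy table: `wide` and `long` are hypothetical names, and the values are the Belgium and Bulgaria entries from the `head()` output earlier. Note that `pivot_longer()` orders rows by original row rather than by column, so the output order differs from the `gather()` result above.

```r
library(tidyr)

# Toy wide table (hypothetical object) with two time columns,
# using values printed earlier for Belgium and Bulgaria (males).
wide <- data.frame(
  Sex = c("Males", "Males"),
  Country = c("Belgium", "Bulgaria"),
  `Personal care` = c("10:45", "11:54"),
  Sleep = c("8:15", "9:08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except Sex and Country becomes an Activity/Time pair.
long <- pivot_longer(wide,
                     cols = -c(Sex, Country),
                     names_to = "Activity",
                     values_to = "Time")
long
```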
Write a function to convert the HH:MM notation to number of minutes.
hours_and_minutes_to_minutes <- function(time){
time_split <- strsplit(time,":")[[1]]
hours <- as.numeric(time_split[1])
minutes <- as.numeric(time_split[2])
return((hours * 60) + minutes)
}
Run this function on Time column.
timeuse <- data.frame(timeuse,
Time.in.minutes = unlist(lapply(timeuse$Time,FUN=hours_and_minutes_to_minutes)),
stringsAsFactors=FALSE)
head(timeuse)
## Sex Country Activity Time Time.in.minutes
## 1 Males Belgium Personal care 10:45 645
## 2 Males Bulgaria Personal care 11:54 714
## 3 Males Germany (including former GDR from 1991) Personal care 10:40 640
## 4 Males Estonia Personal care 10:35 635
## 5 Males Spain Personal care 11:11 671
## 6 Males France Personal care 11:44 704
tail(timeuse)
## Sex Country Activity Time Time.in.minutes
## 191 Females Lithuania Unspecified time use 0:04 4
## 192 Females Poland Unspecified time use 0:05 5
## 193 Females Slovenia Unspecified time use 0:02 2
## 194 Females Finland Unspecified time use 0:12 12
## 195 Females United Kingdom Unspecified time use 0:10 10
## 196 Females Norway Unspecified time use 0:03 3
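The function above handles one string at a time, which is why it needs `lapply()`. As an alternative, here is a minimal sketch of a vectorized version that accepts the whole column at once. The name `hhmm_to_minutes` is hypothetical and not part of the original script.

```r
# Hypothetical vectorized variant: takes a character vector of "H:MM" strings
# and returns a numeric vector of minutes.
hhmm_to_minutes <- function(time) {
  parts <- strsplit(time, ":", fixed = TRUE)
  vapply(parts,
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

# Values taken from the outputs shown above.
hhmm_to_minutes(c("10:45", "0:04", "24:00"))
## [1]  645    4 1440
```

With a function like this, the new column could be added as `timeuse$Time.in.minutes <- hhmm_to_minutes(timeuse$Time)`, without the `unlist(lapply(...))` wrapper.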
Let’s make sure time adds up to 24 hours for all countries and genders.
24*60
## [1] 1440
aggregate(Time.in.minutes ~ Country + Sex,FUN=sum,data=timeuse)
## Country Sex Time.in.minutes
## 1 Belgium Females 1440
## 2 Bulgaria Females 1440
## 3 Estonia Females 1440
## 4 Finland Females 1439
## 5 France Females 1440
## 6 Germany (including former GDR from 1991) Females 1440
## 7 Italy Females 1441
## 8 Latvia Females 1441
## 9 Lithuania Females 1440
## 10 Norway Females 1441
## 11 Poland Females 1440
## 12 Slovenia Females 1440
## 13 Spain Females 1439
## 14 United Kingdom Females 1441
## 15 Belgium Males 1440
## 16 Bulgaria Males 1441
## 17 Estonia Males 1439
## 18 Finland Males 1440
## 19 France Males 1440
## 20 Germany (including former GDR from 1991) Males 1440
## 21 Italy Males 1440
## 22 Latvia Males 1440
## 23 Lithuania Males 1439
## 24 Norway Males 1439
## 25 Poland Males 1439
## 26 Slovenia Males 1440
## 27 Spain Males 1441
## 28 United Kingdom Males 1438
Yes, they do, to within two minutes: the totals range from 1438 to 1441, most likely because of rounding in the source data.
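To make the size of these deviations explicit, one could subtract 1440 from each total and keep only the rows that are off. The sketch below does this on three rows copied from the aggregate output above. `totals` and `Deviation` are hypothetical names, and in the real script the input would be the result of the `aggregate()` call.

```r
# Three rows copied from the aggregate() output above (hypothetical object name).
totals <- data.frame(
  Country = c("Belgium", "Finland", "United Kingdom"),
  Sex = c("Males", "Females", "Males"),
  Time.in.minutes = c(1440, 1439, 1438),
  stringsAsFactors = FALSE
)

# Deviation from a full day, in minutes.
totals$Deviation <- totals$Time.in.minutes - 24 * 60

# Keep only country/sex combinations that do not sum to exactly 24 hours.
off_by <- totals[totals$Deviation != 0, ]
off_by
```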
Let’s change some of the category names to something shorter.
timeuse$Activity <- plyr::mapvalues(timeuse$Activity,
from = c("Employment, related activities and travel as part of/during main and second job",
"Leisure, social and associative life",
"Travel except travel related to jobs"),
to = c("Employment",
"Leisure and social",
"Travel, non-job-related"))
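If you would rather not depend on plyr for this one step, the same renaming can be sketched in base R with a named lookup vector. The names `short_names` and `activity` are made up for this illustration, and `activity` stands in for `timeuse$Activity`.

```r
# Lookup table: names are the original labels, values are the short labels.
short_names <- c(
  "Leisure, social and associative life" = "Leisure and social",
  "Travel except travel related to jobs" = "Travel, non-job-related"
)

# Stand-in for timeuse$Activity (hypothetical example vector).
activity <- c("Study",
              "Leisure, social and associative life",
              "Travel except travel related to jobs")

# Replace only the labels that appear in the lookup; leave the rest untouched.
matched <- activity %in% names(short_names)
activity[matched] <- unname(short_names[activity[matched]])
activity
```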
Take a look at the data. Let’s pick a random country and look at its rows for females and for males.
set.seed(1392)
test_country <- sample(unique(timeuse$Country),1)
timeuse %>% filter(Country == test_country & Sex == "Females")
## Sex Country Activity Time Time.in.minutes
## 1 Females Slovenia Personal care 10:32 632
## 2 Females Slovenia Employment 2:42 162
## 3 Females Slovenia Study 0:19 19
## 4 Females Slovenia Household and family care 4:56 296
## 5 Females Slovenia Leisure and social 4:27 267
## 6 Females Slovenia Travel, non-job-related 1:02 62
## 7 Females Slovenia Unspecified time use 0:02 2
timeuse %>% filter(Country == test_country & Sex == "Males")
## Sex Country Activity Time Time.in.minutes
## 1 Males Slovenia Personal care 10:31 631
## 2 Males Slovenia Employment 3:53 233
## 3 Males Slovenia Study 0:15 15
## 4 Males Slovenia Household and family care 2:38 158
## 5 Males Slovenia Leisure and social 5:31 331
## 6 Males Slovenia Travel, non-job-related 1:10 70
## 7 Males Slovenia Unspecified time use 0:02 2
Run some minor clean-up of the country names (to make them shorter where needed).
Then, make a panel plot with time use by country and gender.
timeuse$Country <- plyr::mapvalues(timeuse$Country,
from = c("Germany (including former GDR from 1991)","United Kingdom"),
to = c("Germany","UK"))
ggplot(timeuse,
aes(Country,Time.in.minutes,fill=Sex)) +
geom_bar(stat="identity",position = "dodge") +
facet_wrap(~Activity,scales="free_y") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Time spent on “Unspecified time use” is very small (roughly 10 to 15 minutes a day at most). Let’s plot again without that category.
ggplot(timeuse[!(timeuse$Activity %in% "Unspecified time use"),],
aes(Country,Time.in.minutes,fill=Sex)) +
geom_bar(stat="identity",position = "dodge") +
facet_wrap(~Activity,scales="free_y") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We see dramatic gender differences in time spent on employment (much higher for men across all countries) and household and family care (much higher for women across all countries).
Are there any major differences across countries?
Let’s compare countries now, separated by gender.
Let’s also remove “Study” this time, since very little time is allocated to this category in any country.
mycol <- c("#004949","#009292","#FF6DB6","#FFB677","#490092","#006DDB","#B66DFF","#6DB6FF","#B6DBFF","#920000","#924900","#DBD100","#24FF24","#FFFF6D","#000000") #Set up colorblind friendly vector.
for(activity in setdiff(unique(timeuse$Activity),c("Unspecified time use","Study")))
{
print(ggplot(timeuse %>% filter(Activity == activity),
aes(Country,Time.in.minutes,fill=Country)) +
geom_bar(stat="identity") +
facet_wrap(~Sex,scales="free_y") +
theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank()) +
scale_fill_manual(values = mycol) +
ggtitle(activity))
}
We definitely see some country-related differences in time spent on employment, with both males and females in Latvia and Lithuania (and, to a lesser extent, Estonia, more so for females) spending more time on this activity.
Belgium, Finland, Germany, and Norway seem to spend more time on leisure, with differences especially dramatic for females.
We also start to see an interaction between country and gender in these plots. For example, Italian females spend more time on household and family care than females in any other country, while Italian males spend less time on it than males in any other country.
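One way to put a number on this kind of country-by-gender pattern is to compute the female-minus-male difference per country and activity. The sketch below does this for the Slovenia rows printed earlier. The names `slovenia`, `gap`, and `Females.minus.males` are hypothetical, and applying the same steps to the full `timeuse` table is an assumption, not something done in this analysis.

```r
library(tidyr)

# Slovenia values copied from the filtered outputs earlier in this document.
slovenia <- data.frame(
  Sex = c("Females", "Females", "Males", "Males"),
  Country = "Slovenia",
  Activity = rep(c("Employment", "Household and family care"), times = 2),
  Time.in.minutes = c(162, 296, 233, 158),
  stringsAsFactors = FALSE
)

# One row per country and activity, with a column for each sex.
gap <- pivot_wider(slovenia, names_from = Sex, values_from = Time.in.minutes)

# Positive values mean females spend more time than males on that activity.
gap$Females.minus.males <- gap$Females - gap$Males
gap
```

For Slovenia this gives -71 minutes for employment and +138 minutes for household and family care, matching the direction of the gender differences seen in the plots.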