This data was uploaded by Nicholas Schettini in my Data 607 class in the CUNY Master’s of Data Science program.
I have also put the raw data on my GitHub, available here:
https://raw.githubusercontent.com/heathergeiger/Data607_project2/master/TimeUse.csv
Nicholas gave the following description for the data:
“I found this dataset on time use by gender and by country. Some of the variables include eating, sleeping, employment, travel, school, study, walking the dog, etc. It seems you could analyze how males vs. females spend their time, and how each countries males and females compare to each other. Maybe certain countries spend more time doing something more than another country; same goes for gender.”
Load libraries.
library(tidyr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Read in data.
timeuse <- read.csv("TimeUse.csv",header=TRUE,skipNul = TRUE,check.names=FALSE,stringsAsFactors=FALSE)
Take a look at the file.
There are a lot of columns, so we’ll display just the first 10, along with the column names for all of them.
dim(timeuse)
head(timeuse[,1:10])
colnames(timeuse)
## [1] 28 58
## SEX GEO/ACL00 Total Personal care Sleep
## 1 Males Belgium 24:00 10:45 8:15
## 2 Males Bulgaria 24:00 11:54 9:08
## 3 Males Germany (including former GDR from 1991) 24:00 10:40 8:08
## 4 Males Estonia 24:00 10:35 8:24
## 5 Males Spain 24:00 11:11 8:36
## 6 Males France 24:00 11:44 8:45
## Eating Other and/or unspecified personal care
## 1 1:49 0:42
## 2 2:07 0:39
## 3 1:43 0:49
## 4 1:19 0:52
## 5 1:47 0:48
## 6 2:18 0:41
## Employment, related activities and travel as part of/during main and second job
## 1 3:07
## 2 3:32
## 3 3:27
## 4 4:27
## 5 4:21
## 6 3:48
## Main and second job and related travel Activities related to employment and unspecified employment
## 1 3:05 0:02
## 2 3:27 0:04
## 3 3:21 0:06
## 4 4:20 0:07
## 5 4:17 0:03
## 6 3:46 0:02
## [1] "SEX"
## [2] "GEO/ACL00"
## [3] "Total"
## [4] "Personal care"
## [5] "Sleep"
## [6] "Eating"
## [7] "Other and/or unspecified personal care"
## [8] "Employment, related activities and travel as part of/during main and second job"
## [9] "Main and second job and related travel"
## [10] "Activities related to employment and unspecified employment"
## [11] "Study"
## [12] "School and university except homework"
## [13] "Homework"
## [14] "Free time study"
## [15] "Household and family care"
## [16] "Food management except dish washing"
## [17] "Dish washing"
## [18] "Cleaning dwelling"
## [19] "Household upkeep except cleaning dwelling"
## [20] "Laundry"
## [21] "Ironing"
## [22] "Handicraft and producing textiles and other care for textiles"
## [23] "Gardening; other pet care"
## [24] "Tending domestic animals"
## [25] "Caring for pets"
## [26] "Walking the dog"
## [27] "Construction and repairs"
## [28] "Shopping and services"
## [29] "Childcare, except teaching, reading and talking"
## [30] "Teaching, reading and talking with child"
## [31] "Household management and help family member"
## [32] "Leisure, social and associative life"
## [33] "Organisational work"
## [34] "Informal help to other households"
## [35] "Participatory activities"
## [36] "Visiting and feasts"
## [37] "Other social life"
## [38] "Entertainment and culture"
## [39] "Resting"
## [40] "Walking and hiking"
## [41] "Sports and outdoor activities except walking and hiking"
## [42] "Computer games"
## [43] "Computing"
## [44] "Hobbies and games except computing and computer games"
## [45] "Reading books"
## [46] "Reading, except books"
## [47] "TV and video"
## [48] "Radio and music"
## [49] "Unspecified leisure"
## [50] "Travel except travel related to jobs"
## [51] "Travel to/from work"
## [52] "Travel related to study"
## [53] "Travel related to shopping and services"
## [54] "Transporting a child"
## [55] "Travel related to other household purposes"
## [56] "Travel related to leisure, social and associative life"
## [57] "Unspecified travel"
## [58] "Unspecified time use"
What countries are included in this data set?
unique(timeuse[,"GEO/ACL00"])
## [1] "Belgium" "Bulgaria" "Germany (including former GDR from 1991)"
## [4] "Estonia" "Spain" "France"
## [7] "Italy" "Latvia" "Lithuania"
## [10] "Poland" "Slovenia" "Finland"
## [13] "United Kingdom" "Norway"
I’m assuming the “Total” column will be the same for all countries, but let’s check.
If so, remove this column.
Also rename “GEO/ACL00” to “Country” and “SEX” to “Sex”.
table(timeuse$Total)
##
## 24:00
## 28
timeuse <- timeuse[,setdiff(colnames(timeuse),"Total")]
colnames(timeuse)[1:2] <- c("Sex","Country")
Convert from wide to long format.
dim(timeuse)
## [1] 28 57
timeuse <- gather(timeuse,Activity,Time,-Sex,-Country)
dim(timeuse)
## [1] 1540 4
head(timeuse)
## Sex Country Activity Time
## 1 Males Belgium Personal care 10:45
## 2 Males Bulgaria Personal care 11:54
## 3 Males Germany (including former GDR from 1991) Personal care 10:40
## 4 Males Estonia Personal care 10:35
## 5 Males Spain Personal care 11:11
## 6 Males France Personal care 11:44
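For readers on a recent tidyr (1.0.0 or later), `pivot_longer()` is the newer interface for the same wide-to-long reshaping. The sketch below is not part of the original analysis: it uses a tiny hand-made data frame (`toy_wide`, an illustrative name) holding two rows copied from the output above, rather than the full file.

```r
library(tidyr)

# Toy wide table: two rows copied from the head() output above.
# The object names toy_wide / toy_long are illustrative only.
toy_wide <- data.frame(
  Sex = c("Males", "Males"),
  Country = c("Belgium", "Bulgaria"),
  `Personal care` = c("10:45", "11:54"),
  Sleep = c("8:15", "9:08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except Sex and Country becomes an Activity/Time pair
toy_long <- pivot_longer(toy_wide,
                         cols = -c(Sex, Country),
                         names_to = "Activity",
                         values_to = "Time")
toy_long
```

One difference to be aware of: `pivot_longer()` keeps the rows for each original row together, whereas `gather()` stacks one column after another, so the row order differs even though the contents are the same.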
Write a function to convert the HH:MM notation to number of minutes.
hours_and_minutes_to_minutes <- function(time){
time_split <- strsplit(time,":")[[1]]
hours <- as.numeric(time_split[1])
minutes <- as.numeric(time_split[2])
return((hours * 60) + minutes)
}
Test on a few possible options to make sure it works.
hours_and_minutes_to_minutes("13:52")
## [1] 832
hours_and_minutes_to_minutes("02:01")
## [1] 121
hours_and_minutes_to_minutes("10:00")
## [1] 600
hours_and_minutes_to_minutes("11:04")
## [1] 664
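The function above handles one string at a time, which is why it is wrapped in `lapply()` in the next step. As an alternative sketch (the name `hhmm_to_minutes` is made up here, and this is not the version used in the rest of the analysis), the same conversion can accept a whole character vector and return `NA` for anything that does not split into exactly two parts, such as a bare ":".

```r
# Hypothetical vectorized variant of the HH:MM conversion
hhmm_to_minutes <- function(time) {
  parts <- strsplit(time, ":", fixed = TRUE)
  vapply(parts, function(p) {
    # Entries like ":" or NA do not split into hours and minutes
    if (length(p) != 2) return(NA_real_)
    as.numeric(p[1]) * 60 + as.numeric(p[2])
  }, numeric(1))
}

hhmm_to_minutes(c("13:52", "0:04", ":"))
```

Using `vapply()` with `numeric(1)` guarantees a plain numeric vector of the same length as the input, so the result can be assigned straight into a data frame column.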
Run this function on Time column.
timeuse <- data.frame(timeuse,
Time.in.minutes = unlist(lapply(timeuse$Time,FUN=hours_and_minutes_to_minutes)),
stringsAsFactors=FALSE)
head(timeuse)
## Sex Country Activity Time Time.in.minutes
## 1 Males Belgium Personal care 10:45 645
## 2 Males Bulgaria Personal care 11:54 714
## 3 Males Germany (including former GDR from 1991) Personal care 10:40 640
## 4 Males Estonia Personal care 10:35 635
## 5 Males Spain Personal care 11:11 671
## 6 Males France Personal care 11:44 704
tail(timeuse)
## Sex Country Activity Time Time.in.minutes
## 1535 Females Lithuania Unspecified time use 0:04 4
## 1536 Females Poland Unspecified time use 0:05 5
## 1537 Females Slovenia Unspecified time use 0:02 2
## 1538 Females Finland Unspecified time use 0:12 12
## 1539 Females United Kingdom Unspecified time use 0:10 10
## 1540 Females Norway Unspecified time use 0:03 3
Are there any missing values in the data?
length(which(is.na(timeuse$Time.in.minutes) == TRUE))
## [1] 18
What activities have NA for time spent?
timeuse[is.na(timeuse$Time.in.minutes) == TRUE,]
## Sex Country Activity Time Time.in.minutes
## 294 Males Norway Free time study : NA
## 308 Females Norway Free time study : NA
## 572 Males Finland Tending domestic animals : NA
## 574 Males Norway Tending domestic animals : NA
## 586 Females Finland Tending domestic animals : NA
## 588 Females Norway Tending domestic animals : NA
## 622 Males France Walking the dog : NA
## 636 Females France Walking the dog : NA
## 1070 Males France Computer games : NA
## 1084 Females France Computer games : NA
## 1378 Males France Travel related to shopping and services : NA
## 1392 Females France Travel related to shopping and services : NA
## 1434 Males France Travel related to other household purposes : NA
## 1448 Females France Travel related to other household purposes : NA
## 1462 Males France Travel related to leisure, social and associative life : NA
## 1476 Females France Travel related to leisure, social and associative life : NA
## 1498 Males Norway Unspecified travel : NA
## 1512 Females Norway Unspecified travel : NA
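These are all very specific activities that respondents may not have reported, or that may not have been included in a given country’s version of the survey.
Let’s change these to “00:00” and 0 minutes, keeping in mind that this treats “not recorded” the same as “no time spent”.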
timeuse$Time[is.na(timeuse$Time.in.minutes) == TRUE] <- "00:00"
timeuse$Time.in.minutes[is.na(timeuse$Time.in.minutes) == TRUE] <- 0
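Replacing `NA` with 0 is a judgment call, and it matters mostly for averages. The small sketch below, using hypothetical values rather than anything from the file, shows how the two choices diverge: dropping missing entries averages over reported values only, while zero-filling pulls the average down.

```r
# Hypothetical minutes for one activity in three groups; NA stands for a ":" entry
minutes <- c(3, NA, 10)

# Option 1: keep NA and ignore it when summarizing
mean_reported <- mean(minutes, na.rm = TRUE)

# Option 2 (what the analysis here does): treat missing as zero time
mean_zero_filled <- mean(ifelse(is.na(minutes), 0, minutes))

c(reported = mean_reported, zero_filled = mean_zero_filled)
```

For sums within a country and sex, which is what the next check uses, the two options give the same answer, so the choice has no effect there.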
Let’s make sure time adds up to 24 hours for all countries and genders.
24*60
## [1] 1440
aggregate(Time.in.minutes ~ Country + Sex,FUN=sum,data=timeuse)
## Country Sex Time.in.minutes
## 1 Belgium Females 2881
## 2 Bulgaria Females 2880
## 3 Estonia Females 2874
## 4 Finland Females 2867
## 5 France Females 2876
## 6 Germany (including former GDR from 1991) Females 2872
## 7 Italy Females 2879
## 8 Latvia Females 2877
## 9 Lithuania Females 2878
## 10 Norway Females 2876
## 11 Poland Females 2874
## 12 Slovenia Females 2878
## 13 Spain Females 2875
## 14 United Kingdom Females 2871
## 15 Belgium Males 2880
## 16 Bulgaria Males 2881
## 17 Estonia Males 2874
## 18 Finland Males 2869
## 19 France Males 2879
## 20 Germany (including former GDR from 1991) Males 2874
## 21 Italy Males 2877
## 22 Latvia Males 2878
## 23 Lithuania Males 2878
## 24 Norway Males 2876
## 25 Poland Males 2872
## 26 Slovenia Males 2881
## 27 Spain Males 2874
## 28 United Kingdom Males 2871
Actually, the totals are all well over 24 hours: they run from 2867 to 2881 minutes, about twice the expected 1440.
Some categories must overlap.
Let’s pick a country and sex and show all lines.
timeuse[timeuse$Country == "Belgium" & timeuse$Sex == "Females",
c("Activity","Time","Time.in.minutes")]
## Activity Time Time.in.minutes
## 15 Personal care 11:11 671
## 43 Sleep 8:34 514
## 71 Eating 1:50 110
## 99 Other and/or unspecified personal care 0:47 47
## 127 Employment, related activities and travel as part of/during main and second job 1:53 113
## 155 Main and second job and related travel 1:52 112
## 183 Activities related to employment and unspecified employment 0:01 1
## 211 Study 0:16 16
## 239 School and university except homework 0:06 6
## 267 Homework 0:06 6
## 295 Free time study 0:04 4
## 323 Household and family care 4:10 250
## 351 Food management except dish washing 0:57 57
## 379 Dish washing 0:20 20
## 407 Cleaning dwelling 0:26 26
## 435 Household upkeep except cleaning dwelling 0:28 28
## 463 Laundry 0:09 9
## 491 Ironing 0:19 19
## 519 Handicraft and producing textiles and other care for textiles 0:06 6
## 547 Gardening; other pet care 0:10 10
## 575 Tending domestic animals 0:00 0
## 603 Caring for pets 0:03 3
## 631 Walking the dog 0:03 3
## 659 Construction and repairs 0:04 4
## 687 Shopping and services 0:33 33
## 715 Childcare, except teaching, reading and talking 0:16 16
## 743 Teaching, reading and talking with child 0:07 7
## 771 Household management and help family member 0:10 10
## 799 Leisure, social and associative life 5:06 306
## 827 Organisational work 0:03 3
## 855 Informal help to other households 0:00 0
## 883 Participatory activities 0:03 3
## 911 Visiting and feasts 0:37 37
## 939 Other social life 0:24 24
## 967 Entertainment and culture 0:11 11
## 995 Resting 0:31 31
## 1023 Walking and hiking 0:11 11
## 1051 Sports and outdoor activities except walking and hiking 0:07 7
## 1079 Computer games 0:02 2
## 1107 Computing 0:09 9
## 1135 Hobbies and games except computing and computer games 0:09 9
## 1163 Reading books 0:08 8
## 1191 Reading, except books 0:16 16
## 1219 TV and video 2:13 133
## 1247 Radio and music 0:03 3
## 1275 Unspecified leisure 0:01 1
## 1303 Travel except travel related to jobs 1:22 82
## 1331 Travel to/from work 0:15 15
## 1359 Travel related to study 0:02 2
## 1387 Travel related to shopping and services 0:18 18
## 1415 Transporting a child 0:04 4
## 1443 Travel related to other household purposes 0:00 0
## 1471 Travel related to leisure, social and associative life 0:16 16
## 1499 Unspecified travel 0:27 27
## 1527 Unspecified time use 0:02 2
Looks like, while the survey organizers tried their best to separate categories (e.g. “Childcare, except teaching, reading and talking” vs. “Teaching, reading and talking with child”), there is definitely some overlap.
For example, childcare could also fall under “Household and family care”. And the fact that this category has a lot more time spent suggests that it may be a total that already includes childcare: for Belgian females, the 16 rows from “Food management except dish washing” through “Household management and help family member” add up to 251 minutes, within a minute of the 250 listed for “Household and family care”.
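A quick arithmetic check on the Belgian females rows supports the idea that the broad rows are totals. The sketch below copies minutes from the table above. Which rows count as “broad” is my own reading of the column order, not something the data source states.

```r
# Minutes for Belgian females, copied from the table above.
# Treating these seven rows as the top-level categories is an assumption.
broad <- c(
  "Personal care" = 671,
  "Employment, related activities and travel as part of/during main and second job" = 113,
  "Study" = 16,
  "Household and family care" = 250,
  "Leisure, social and associative life" = 306,
  "Travel except travel related to jobs" = 82,
  "Unspecified time use" = 2
)
sum(broad)   # compare with 24 * 60

# The three rows listed directly after "Personal care"
personal_care_parts <- c(
  "Sleep" = 514,
  "Eating" = 110,
  "Other and/or unspecified personal care" = 47
)
sum(personal_care_parts)   # compare with broad["Personal care"]
```

If the same pattern holds for the other groups, it would explain why summing every row gives roughly 48 hours: each minute is counted once in a broad row and once in a specific row.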
I wonder if the “umbrella” categories like this are common between countries?
We can check by getting the top say 10 activities by country and sex, and seeing which ones are repeated most often.
timeuse <- timeuse %>% group_by(Country,Sex) %>% mutate(Activity.rank = dense_rank(-Time.in.minutes))
timeuse <- data.frame(timeuse,stringsAsFactors=FALSE)
num_country_sex_combinations_per_top10_activity <- data.frame(table(timeuse[timeuse$Activity.rank <= 10,"Activity"]))
num_country_sex_combinations_per_top10_activity$Var1 <- as.vector(num_country_sex_combinations_per_top10_activity$Var1)
num_country_sex_combinations_per_top10_activity <- num_country_sex_combinations_per_top10_activity %>% arrange(desc(Freq))
num_country_sex_combinations_per_top10_activity
## Var1 Freq
## 1 Eating 28
## 2 Employment, related activities and travel as part of/during main and second job 28
## 3 Household and family care 28
## 4 Leisure, social and associative life 28
## 5 Main and second job and related travel 28
## 6 Personal care 28
## 7 Sleep 28
## 8 TV and video 28
## 9 Travel except travel related to jobs 27
## 10 Other and/or unspecified personal care 13
## 11 Food management except dish washing 12
## 12 Cleaning dwelling 2
## 13 Other social life 2
## 14 Visiting and feasts 2
## 15 Travel to/from work 1
timeuse[timeuse$Activity == "Travel except travel related to jobs" & timeuse$Activity.rank > 10,]
## Sex Country Activity Time Time.in.minutes Activity.rank
## 1308 Females France Travel except travel related to jobs 0:54 54 11
8 activities are found in the top 10 for all countries and sexes.
Another activity (“Travel except travel related to jobs”) is found in the top 10 for all countries and sexes except French females, for whom this activity is ranked 11th.
So, with that one exception, the same 9 activities are in the top 10 for all countries and sexes.
Let’s now focus on how people spend their time doing these 9 activities for the remainder of the analysis.
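One detail about the ranking step: `dense_rank()` gives tied values the same rank and leaves no gaps, so filtering on a rank of 10 or less can return more than 10 rows per group. The `Freq` column above sums to 283 rather than 28 × 10 = 280, which is consistent with a few ties. A minimal sketch with hypothetical values, comparing it with `min_rank()`:

```r
library(dplyr)

# Hypothetical minutes with one tie
minutes <- c(120, 90, 90, 60)

dense_ranks <- dense_rank(-minutes)  # ties share a rank, next rank follows directly
min_ranks <- min_rank(-minutes)      # ties share a rank, next rank skips ahead

data.frame(minutes, dense_ranks, min_ranks)
```

With `dense_rank()` the value 60 is ranked 3rd; with `min_rank()` it is ranked 4th. Either is defensible here, as long as “top 10” is read as “top 10 distinct time values” in the dense case.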
One additional question though - what is the deal with “Employment, related activities and travel as part of/during main and second job” vs. “Main and second job and related travel”? These sort of sound like the same thing. Let’s check time spent on these by country and sex and see how they compare.
employment_or_job <- timeuse[timeuse$Activity == "Employment, related activities and travel as part of/during main and second job" |
timeuse$Activity == "Main and second job and related travel",]
employment_or_job <- employment_or_job %>% select(Sex,Country,Activity,Time.in.minutes) %>% spread(Activity,Time.in.minutes)
colnames(employment_or_job)[3:4] <- c("Employment","Job")
head(employment_or_job)
## Sex Country Employment Job
## 1 Females Belgium 113 112
## 2 Females Bulgaria 154 153
## 3 Females Estonia 185 182
## 4 Females Finland 153 152
## 5 Females France 137 136
## 6 Females Germany (including former GDR from 1991) 116 113
ggplot(employment_or_job,aes(Employment,Job)) +
geom_point() +
xlab("Employment, related activities and travel as part of/during main and second job") +
ylab("Main and second job and related travel") +
geom_abline(slope = 1, intercept = 0,linetype=2)
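These are nearly identical for all combinations. The first looks like an umbrella total that contains the second: for Belgian females, the 113 minutes under “Employment, related activities and travel as part of/during main and second job” equal the 112 minutes under “Main and second job and related travel” plus the 1 minute under “Activities related to employment and unspecified employment”. Either way, we should use one but not the other.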
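To see how close the two columns really are, the sketch below takes the six male rows printed near the top of this document, converts the HH:MM values to minutes by hand, and checks whether the gap between the two columns matches the third employment column. The data frame and column names are illustrative, and “Germany” is shortened for readability.

```r
# Minutes for males in the first six countries, converted from the
# HH:MM values in the head() output near the top (illustrative names)
males <- data.frame(
  Country = c("Belgium", "Bulgaria", "Germany", "Estonia", "Spain", "France"),
  Employment = c(187, 212, 207, 267, 261, 228),       # umbrella-looking column
  Job = c(185, 207, 201, 260, 257, 226),              # main and second job
  Other.employment = c(2, 4, 6, 7, 3, 2)              # activities related to employment
)

# If Employment = Job + Other.employment, the leftover should be about zero
males$Leftover <- males$Employment - males$Job - males$Other.employment
males
```

The leftover is 0 or 1 minute in each of these rows, which could plausibly be rounding in the published figures.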
Let’s use “Main and second job and related travel” but not “Employment, related activities and travel as part of/during main and second job”. That will give us 8 umbrella category activities.
These categories are very broad, but collectively should give where people spend most of their time.
top_activities <- c("Eating","Household and family care","Leisure, social and associative life","Main and second job and related travel","Personal care","Sleep","TV and video","Travel except travel related to jobs")
timeuse_top_activities <- timeuse[timeuse$Activity %in% top_activities,]
Now let’s check the sum of time spent on all of these activities in total.
total_hours_spent <- aggregate(Time.in.minutes ~ Country + Sex,FUN=function(x)sum(x)/60,data=timeuse_top_activities)
colnames(total_hours_spent)[3] <- "Total.hours"
head(total_hours_spent)
## Country Sex Total.hours
## 1 Belgium Females 36.30000
## 2 Bulgaria Females 37.11667
## 3 Estonia Females 35.48333
## 4 Finland Females 35.38333
## 5 France Females 36.71667
## 6 Germany (including former GDR from 1991) Females 35.33333
range(total_hours_spent$Total.hours)
## [1] 34.85000 37.78333
Looks like we are still way over 24 hours.
Also there is some variation in the number of hours these activities add up to. Let’s see which countries and sexes add up to more hours.
total_hours_spent %>% spread(Sex,Total.hours) %>% arrange(Females)
## Country Females Males
## 1 Norway 34.85000 35.16667
## 2 Slovenia 35.18333 35.70000
## 3 Germany (including former GDR from 1991) 35.33333 35.38333
## 4 Italy 35.36667 35.80000
## 5 Finland 35.38333 35.71667
## 6 Estonia 35.48333 35.90000
## 7 Spain 35.63333 36.00000
## 8 United Kingdom 35.68333 35.95000
## 9 Lithuania 35.73333 36.35000
## 10 Latvia 35.80000 36.16667
## 11 Poland 35.86667 36.08333
## 12 Belgium 36.30000 36.41667
## 13 France 36.71667 36.86667
## 14 Bulgaria 37.11667 37.78333
total_hours_spent %>% spread(Sex,Total.hours) %>% arrange(Males)
## Country Females Males
## 1 Norway 34.85000 35.16667
## 2 Germany (including former GDR from 1991) 35.33333 35.38333
## 3 Slovenia 35.18333 35.70000
## 4 Finland 35.38333 35.71667
## 5 Italy 35.36667 35.80000
## 6 Estonia 35.48333 35.90000
## 7 United Kingdom 35.68333 35.95000
## 8 Spain 35.63333 36.00000
## 9 Poland 35.86667 36.08333
## 10 Latvia 35.80000 36.16667
## 11 Lithuania 35.73333 36.35000
## 12 Belgium 36.30000 36.41667
## 13 France 36.71667 36.86667
## 14 Bulgaria 37.11667 37.78333
total_hours_spent %>% spread(Sex,Total.hours) %>% mutate(Total.minutes.difference = round((Males - Females)*60))
## Country Females Males Total.minutes.difference
## 1 Belgium 36.30000 36.41667 7
## 2 Bulgaria 37.11667 37.78333 40
## 3 Estonia 35.48333 35.90000 25
## 4 Finland 35.38333 35.71667 20
## 5 France 36.71667 36.86667 9
## 6 Germany (including former GDR from 1991) 35.33333 35.38333 3
## 7 Italy 35.36667 35.80000 26
## 8 Latvia 35.80000 36.16667 22
## 9 Lithuania 35.73333 36.35000 37
## 10 Norway 34.85000 35.16667 19
## 11 Poland 35.86667 36.08333 13
## 12 Slovenia 35.18333 35.70000 31
## 13 Spain 35.63333 36.00000 22
## 14 United Kingdom 35.68333 35.95000 16
Looks like France and Bulgaria are on the high side of total hours listed for these activities, for both males and females.
There are also some sex differences, which are more pronounced in some countries than others.
Let’s look at one combination of sex and country again.
timeuse_top_activities %>% filter(Sex == "Females" & Country == "Belgium") %>% arrange(Activity.rank)
## Sex Country Activity Time Time.in.minutes Activity.rank
## 1 Females Belgium Personal care 11:11 671 1
## 2 Females Belgium Sleep 8:34 514 2
## 3 Females Belgium Leisure, social and associative life 5:06 306 3
## 4 Females Belgium Household and family care 4:10 250 4
## 5 Females Belgium TV and video 2:13 133 5
## 6 Females Belgium Main and second job and related travel 1:52 112 7
## 7 Females Belgium Eating 1:50 110 8
## 8 Females Belgium Travel except travel related to jobs 1:22 82 9
Looking more closely, the whole data set is a bit strange.
Some of the extra hours over 24 can be explained by umbrella categories: “Personal care” appears to be a superset of “Sleep” and “Eating”, and “TV and video” is presumably a subset of “Leisure, social and associative life”.
However, it seems a bit odd that no group in any of these countries has work listed as taking more than 5 hours or so of their time.
Removing the three apparent subsets (“Sleep”, “Eating”, “TV and video”) from the Belgian females rows above leaves 671 + 306 + 250 + 112 + 82 = 1421 minutes, just under 24 hours, so double-counting seems to account for most of the excess, though I can’t be sure the remaining categories are cleanly separated.
I suppose we’re going to have to use the data as-is from here.
We could try to normalize by total hours, but I’m not sure how confident we are in those totals. So let’s just compare the actual time values, with the caveat that we need to treat these comparisons with caution.
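For reference, this is roughly what normalizing would look like. The sketch expresses each activity as a share of a set of categories that do not appear to overlap, using the Belgian females minutes from the table above. It assumes “Sleep”, “Eating” and “TV and video” are already contained in broader rows, which is my reading of the table and not something the source documents.

```r
# Belgian females, minutes copied from the table above.
# Dropping Sleep, Eating and TV and video as apparent subsets is an assumption.
non_overlapping <- c(
  "Personal care" = 671,
  "Leisure, social and associative life" = 306,
  "Household and family care" = 250,
  "Main and second job and related travel" = 112,
  "Travel except travel related to jobs" = 82
)

total_minutes <- sum(non_overlapping)
shares <- non_overlapping / total_minutes

# Percentage of the covered time spent on each activity
round(100 * shares, 1)
```

Shares computed this way are comparable across groups even when the covered totals differ slightly, but they inherit any uncertainty about which rows really overlap.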
We now understand the data pretty well, including the caveats we need to keep in mind when analyzing it.
Now let’s make some plots!
For each of the activities, plot sets of bars by country, putting male and female side-by-side.
Start with the four activities that tended to have lower time listed, then plot for the other four.
ggplot(timeuse_top_activities[timeuse_top_activities$Activity %in% c("Eating","Main and second job and related travel",
"Travel except travel related to jobs","TV and video"),],
aes(Country,Time.in.minutes,fill=Sex)) +
geom_bar(stat="identity",position = "dodge") +
facet_wrap(~Activity) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Name for Germany is way too long. Let’s switch to just “Germany”.
Also switch United Kingdom to UK.
timeuse_top_activities$Country <- plyr::mapvalues(timeuse_top_activities$Country,
from = c("Germany (including former GDR from 1991)","United Kingdom"),
to = c("Germany","UK"))
ggplot(timeuse_top_activities[timeuse_top_activities$Activity %in% c("Eating","Main and second job and related travel",
"Travel except travel related to jobs","TV and video"),],
aes(Country,Time.in.minutes,fill=Sex)) +
geom_bar(stat="identity",position = "dodge") +
facet_wrap(~Activity,scales="free_y") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(timeuse_top_activities[!(timeuse_top_activities$Activity %in% c("Eating","Main and second job and related travel",
"Travel except travel related to jobs","TV and video")),],
aes(Country,Time.in.minutes,fill=Sex)) +
geom_bar(stat="identity",position = "dodge") +
facet_wrap(~Activity,scales="free_y") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We find that women spend a lot more time on “household and family care” according to this survey. Men spend a lot more time on “main and second job and related travel”.
We also see men spending somewhat more time (though with less dramatic differences) on leisure, eating, TV and video, and non-work travel.
Amounts of time spent on personal care and sleep appear relatively similar.
Some proportion of the difference we see for men spending more time on various activities could be due to men reporting more total time across these activities. But the differences we see are definitely larger than the maximum 40-minute difference in totals between the sexes, so this cannot explain all of what we see.
I’m curious how these sex differences vary by country.
For household, get ratio of female to male. For job, get ratio of male to female. Then, let’s compare.
household_and_job <- timeuse_top_activities %>%
filter(Activity == "Household and family care" |
Activity == "Main and second job and related travel") %>%
select(Sex,Country,Activity,Time.in.minutes) %>%
spread(Sex,Time.in.minutes) %>%
mutate(Sex.time.ratio = ifelse(Males > Females,Males/Females,Females/Males))
ggplot(household_and_job,
aes(Country,Sex.time.ratio,fill=Activity)) +
geom_bar(stat="identity",position="dodge") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Ratio sex that spends more time/sex that spends less time")
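The `ifelse()` in the `mutate()` call above always divides the larger value by the smaller one, so the ratio is at least 1. As an alternative sketch, the same ratio can be written with `pmax()` and `pmin()`, with a separate column recording which sex is higher. The minutes below are “Main and second job and related travel” values for five countries taken from outputs earlier in this document. The data frame name and the `Higher.sex` column are illustrative additions.

```r
# "Main and second job and related travel" minutes from earlier outputs
# (males from the first head() output, females from the Employment/Job table)
job <- data.frame(
  Country = c("Belgium", "Bulgaria", "Germany", "Estonia", "France"),
  Males = c(185, 207, 201, 260, 226),
  Females = c(112, 153, 113, 182, 136)
)

# Larger value over smaller value, element by element
job$Sex.time.ratio <- pmax(job$Males, job$Females) / pmin(job$Males, job$Females)

# Record the direction, since the ratio alone no longer shows it
job$Higher.sex <- ifelse(job$Males > job$Females, "Males", "Females")
job
```

Keeping the direction in its own column makes it harder to misread a plot where household and job ratios point in opposite directions.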
Also get the time spent on these two activities separated by sex and country.
ggplot(household_and_job %>% select(Country,Activity,Males,Females) %>% gather(Sex,Time.in.minutes,-Country,-Activity),
aes(Country,Time.in.minutes,fill=Country)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
facet_grid(Sex ~ Activity,scales="free_y")
Looks like Italy and Spain have the most extreme sex differences.
Separating each activity by sex to compare between countries, we can start to see how much of each difference comes from women, or men, spending more or less time on an activity than their peers in other countries. For example, we find that Italian women are on the high end for time spent on household tasks compared to other countries, but on the low end for time spent related to a job.