Describing and cleaning the final project dataset
A focus on analyzing the data from “cities with the best work-life balance 2022”. The dataset was found on Kaggle: https://www.kaggle.com/datasets/prasertk/cities-with-the-best-worklife-balance-2022 Original data source: https://www.getkisi.com/work-life-balance-2022#table
To answer the following (or similar) questions:
How does city affordability affect the percent of the employed population holding multiple jobs?
Did cities with more remote jobs have better covid impact (public health mitigation response) and covid support (government spending to lessen economic stress)?
Which countries scored a better overall work-life-balance in 2022?
Which city ranked highest per country in total work-life balance score?
Which cities rank best in work intensity (more remote jobs, less overworked, most vacation time, fewer multiple jobholders)?
Do cities with a higher level of remaining monthly income after living expenses (affordability) have more taken vacation days?
Do cities with a greater unemployed population also have poor access to healthcare and mental healthcare?
Do the cities with a higher overworked population have a lower score of wellness and fitness?
[1] "C:/Users/steph/OneDrive/Documents/R"
wlb <- read_csv("work.life.balance.by.city.csv")
head(wlb)
# A tibble: 6 × 24
`2022` `2021` City Country `Remote Jobs` `Overworked Pop…`
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 2 Oslo Norway 41.72% 11.20%
2 2 - Bern Switzerland 44.86% 11.40%
3 3 1 Helsinki Finland 38.92% 12.70%
4 4 3 Zurich Switzerland 44.86% 11.90%
5 5 5 Copenhagen Denmark 41.42% 10.50%
6 6 - Geneva Switzerland 44.86% 11.90%
# … with 18 more variables: `Minimum Vacations Offered (Days)` <dbl>,
# `Vacations Taken (Days)` <chr>, Unemployment <dbl>,
# `Multiple Jobholders` <chr>, Inflation <chr>,
# `Paid Parental Leave (Days)` <dbl>, `Covid Impact` <dbl>,
# `Covid Support` <dbl>, Healthcare <dbl>,
# `Access to Mental Healthcare` <dbl>,
# `Inclusivity & Tolerance` <dbl>, Affordability <dbl>, …
The dataset “work.life.balance.by.city” describes data collected and calculated from international organisations, NGO reports, open access datasets, public surveys and crowdsourcing platforms. It considers more than 130 data points to highligh the most and least overworked cities around the world.
The city selection consists of 51 US cities, as well as 49 global cities that were shortlisted for analysis as global economic hubs. The study was divided into three categories: Work Intensity, Society and Institutions, and City Liveability.
1. Work Intensity: Remote Jobs (%), Overworked Population (%), Minimum Vacations Offered (Days), Vacations Taken (Days), Unemployment (Score), Multiple Jobholders (%), Inflation (%), Paid Parental Leave (Days).
2. Society and Institutions: Covid Impact (Score), Covid Support (Score), Healthcare (Score), Access to Mental Healthcare (Score), Inclusivity & Tolerance (Score).
3. City Livability: Affordability (Score), Happiness, Culture & Leisure (Score), City Safety (Score), Outdoor Spaces (Score), Air Quality (Score), Wellness and Fitness (Score).
Multiple indicators were used as components when scoring each factor. The underlying indicators were first standardized using a Z-Score [z = (x-μ)/σ; μ=indicator mean; σ=indicator standard deviation] normalization procedure.
The “2022” and “2021” columns contain qualitative, ordinal data that ranks the cities by overall work-life balance in those respective years.
The “City” and “Country” columns contain qualitative, nominal data that describes the location from which data was gathered.
Work Intensity
The “Remote Jobs” column contains quantitative, continuous data that describes percentage of jobs that can be performed remotely in each city, and the feasibility of working at home for all occupations. A higher percentage reflects a city with a higher number of remote job opportunities and a better infrastructure to support working from home.
The “Overworked Population” column contains quantitative, continuous data that describes the percentage of full-time employees working more than 48 hours per working week in each city. A higher percentage reflects a city with a greater amount of its population working overtime.
The “Minimum Vacations Offered (Days)” column contains quantitative, discrete data that describes the minimum number of compensated vacation days an employee is legally entitled to after at least one year of service. Data was taken at a national level for a full-time, five-day workweek (excluding public holidays). A higher score reflects a city with a higher amount of vacation days offered to employees.
The “Vacations Taken (Days)” column contains quantitative, discrete data that describes the average number of paid vacation days taken by full-time employees in a single year.
The “Unemployment” column contains quantitative, continuous data that describes a score that reflects the unemployment rate for the metropolitan area or region in the first quarter of 2021. A higher score reflects a city with a lower level of unemployment.
The “Multiple Jobholders” column contains quantitative, continuous data that describes the percentage of employed people in each city holding more than one job at a time.
The “Inflation” column contains quantitative, continuous data that describes the rate of price inflation for a range of consumer goods and services including food, beverages, clothing, housing, water, electricity, gas, furnishing, health, transport, communication, recreation, restaurants and hotels. The study compared the index values from the start of the pandemic (using an average of the months January and February 2020) with the index values from the most recent available data to show the percentage change. The study used state-level data for the US and country-level data for the rest of the world.
The “Paid Parental Leave (Days)” column contains quantitative, continuous data that describes the number of paid family leave days afforded to employees by law in each city.
Society and Institutions
The “Covid Impact” column contains quantitative, continuous data that describes a score that reflects the social and economic impact of a location’s Covid response, split across three areas: public health, economic, and social. The public health impact is quantified through cases and deaths relative to population; the economic impact through year-on-year GDP growth in 2020 and 2021; and the social impact through the severity of limiting measures put in place to contain the pandemic, and changes in mobility patterns as an indicator of the effect of these restrictions. A higher score reflects a more mitigated impact of the Covid-19 pandemic.
The “Covid Support” column contains quantitative, continuous data that describes a score that reflects the income support provided by governments to workers affected by the economic effects of Covid. The score takes into account government programmes to replace income lost due to Covid, duration of unemployment benefits, consumer confidence, household spending and general wage levels, as well as overall government spending to lessen the economic impact of Covid. In addition, the number of Covid cases and deaths were taken into account.
The “Healthcare” column contains quantitative, continuous data that describes score that rates a city’s healthcare system based on accessibility, quality of care and satisfaction. Country-level data was obtained from the Universal Health Coverage (UHC) index for access and quality indicators, while US cities data incorporates state-level data from the Health Access and Quality (HAQ) study. Additional data was taken from healthcare access indexes developed by the World Health Organisation and the European Commission. Satisfaction survey results were taken at a city level. A higher score reflects greater accessibility, quality of care and user rating for each city’s healthcare infrastructure. city level.
The “Access to Mental Healthcare” column contains quantitative, continuous data that describes a score that reflects the accessibility and effectiveness of governments’ implementation of mental health policies catering for individuals with mental health illnesses. This factor uses national data on governance, access to treatment and the environment necessary for treatment. It also incorporates suicide rates and city-level survey data on healthcare quality. A higher score reflects a more effective and accessible mental healthcare network.
The “Inclusivity & Tolerance” column contains quantitative, continuous data that describes a score that reflects the combined scores of the ’Gender Equality. A higher score reflects greater gender equality and a higher degree of LGBT+ equality.
City Liveability
The “Affordability” column contains quantitative, continuous data that describes a score that reflects monthly living costs as a proportion of the average household income after tax. Monthly costs include rent, basic utilities costs, groceries, internet connection, leisure activities, clothes and dining out. A higher score indicates a higher level of remaining monthly income after accounting for these deductions.
The “Happiness, Culture & Leisure” column contains quantitative, continuous data that describes the combined scores of both the ‘Happiness’ and ‘Culture & Leisure’ factors. The happiness score is calculated from survey responses evaluating the perceived happiness with one’s own life, as well as the degree of positive and negative effects a respondent experiences. A higher score reflects higher degrees of self-perceived happiness. The culture & leisure score indicates the vibrancy and variety of cultural and lifestyle offerings in a city. It combines cultural city rankings, the number of people employed in the cultural and creative industries, and the number of leisure facilities and activities available per capita, including sports stadiums, restaurants, parks, shops, entertainment and nightlife venues.
The “City Safety” column contains quantitative, continuous data that describes a score of the degree of a city’s safety in more than a dozen key areas, including environmental, social and infrastructural security. Indicators include statistics on injuries and fatalities, damage caused at an economic level, public opinion data, and data on the vulnerability of a location to particular hazards. A higher score reflects a safer city.
The “Outdoor Spaces” column contains quantitative, continuous data that describes the prevalence and accessibility of a city’s urban green infrastructure as a score, including its proximity to residents and the percentage of land allocated to green space. Data on weather and daylight conditions that could affect the use of public outdoor spaces was also incorporated. This includes average temperatures, the annual number of rainy days, annual sunshine hours, and cloudlessness. A higher score reflects a greater urban green infrastructure, as well as better environmental conditions for outdoor life.
The “Air Quality” column contains quantitative, continuous data that describes the annual median particulate matter (PM2.5/PM10) pollution for the year 2022, represented as a score. Daily average data was taken across all days of a single year, with the median pollution level representing the overall score. Data was taken at a city level. A higher score reflects greater air quality.
The “Wellness and Fitness” column contains quantitative, continuous data that describes the general state of a community’s physical fitness and health as represented by the population’s average life expectancy, as well as levels of inactivity, obesity, and the number of fitness studios and gyms per capita. National data was used for life expectancy at birth, while US cities use city-level data. Adult obesity rates and the prevalence of physical inactivity were taken at a national level, with US cities using state level data. Data on the number of gyms per capita is taken at a city level. A higher score reflects a better state of a community’s physical fitness and health.
The “TOTAL SCORE” column contains quantitative, continuous data that describes the calculated final score. The final score was computed as a weighted average of the component Z-Scores, and the resulting score normalised to a scale of 50 to 100 using min-max normalisation [(value - min)/(max-min)*50+50]. The floor of 50 for the scale was chosen to emphasise that the minimum score does not imply the absence of the infrastructures under analysis, as the position is relative to that of other cities in the ranking.
#Practice: filter by affordability and inclusivity
wlb %>%
select("City", "Country", "Affordability", "Inclusivity & Tolerance", "TOTAL SCORE") %>%
filter(`TOTAL SCORE` >80, `Affordability` >70, `Inclusivity & Tolerance` >90) %>%
arrange(desc(Affordability)) %>%
slice(1:10)
# A tibble: 10 × 5
City Country Affordability `Inclusivity &…` `TOTAL SCORE`
<chr> <chr> <dbl> <dbl> <dbl>
1 Seattle USA 88.6 94.7 88.4
2 Colorado Spri… USA 86.2 91.4 84.6
3 Chicago USA 84.1 90.6 82.9
4 Baltimore USA 84.1 92 82.8
5 Denver USA 83.2 92 85.6
6 San Diego USA 80.7 93 85.8
7 San Francisco USA 80.5 97.2 87.3
8 Calgary Canada 79.4 93.5 89.4
9 Stuttgart Germany 79.3 91.6 93.8
10 Dusseldorf Germany 79.2 93.7 90.9
# Clean - make all column names lowercase
wlb<-wlb %>%
rename_all(., .funs = tolower)
colnames(wlb)
[1] "2022"
[2] "2021"
[3] "city"
[4] "country"
[5] "remote jobs"
[6] "overworked population"
[7] "minimum vacations offered (days)"
[8] "vacations taken (days)"
[9] "unemployment"
[10] "multiple jobholders"
[11] "inflation"
[12] "paid parental leave (days)"
[13] "covid impact"
[14] "covid support"
[15] "healthcare"
[16] "access to mental healthcare"
[17] "inclusivity & tolerance"
[18] "affordability"
[19] "happiness, culture & leisure"
[20] "city safety"
[21] "outdoor spaces"
[22] "air quality"
[23] "wellness and fitness"
[24] "total score"
#Clean - rename columns
wlb <- rename(wlb, remote.jobs = `remote jobs`,
overworked = `overworked population`,
min.vaca.days = `minimum vacations offered (days)`,
vaca.days.taken = `vacations taken (days)`,
multiple.jobs = `multiple jobholders`,
parental.leave = `paid parental leave (days)`,
covid.impact = `covid impact`,
covid.support = `covid support`,
mental.health.access = `access to mental healthcare`,
inclusivity = `inclusivity & tolerance`,
happiness = `happiness, culture & leisure`,
city.safety = `city safety`,
outdoor.spaces = `outdoor spaces`,
air.quality = `air quality`,
wellness.fitness = `wellness and fitness`,
total.score = `total score`)
colnames(wlb)
[1] "2022" "2021"
[3] "city" "country"
[5] "remote.jobs" "overworked"
[7] "min.vaca.days" "vaca.days.taken"
[9] "unemployment" "multiple.jobs"
[11] "inflation" "parental.leave"
[13] "covid.impact" "covid.support"
[15] "healthcare" "mental.health.access"
[17] "inclusivity" "affordability"
[19] "happiness" "city.safety"
[21] "outdoor.spaces" "air.quality"
[23] "wellness.fitness" "total.score"
#How does city affordability affect the percent of the employed population holding multiple jobs?
wlb %>%
select(city, country, affordability, multiple.jobs) %>%
arrange(desc(affordability))
# A tibble: 100 × 4
city country affordability multiple.jobs
<chr> <chr> <dbl> <chr>
1 Houston USA 100 3.70%
2 Dallas USA 99 3.70%
3 Austin USA 95.2 3.70%
4 Raleigh USA 91.8 5.00%
5 Columbus USA 90.7 6.50%
6 Charlotte USA 90.3 5.00%
7 Salt Lake City USA 90.1 6.50%
8 Jacksonville USA 89.5 3.50%
9 San Antonio USA 89.1 3.70%
10 Seattle USA 88.6 5.40%
# … with 90 more rows
#Did cities with more remote jobs have better covid impact (public health mitigation response) and covid support (government spending to lessen economic stress)?
wlb %>%
select(city, country, remote.jobs, covid.impact, covid.support) %>%
arrange(desc(remote.jobs)) %>%
slice(1:10)
# A tibble: 10 × 5
city country remote.jobs covid.impact covid.support
<chr> <chr> <chr> <dbl> <dbl>
1 Singapore Singapore 52.06% 87.9 95.1
2 Washington USA 49.77% 81.3 83.4
3 Austin USA 45.51% 81.9 82.3
4 Bern Switzerland 44.86% 82.2 91.4
5 Zurich Switzerland 44.86% 82 91.4
6 Geneva Switzerland 44.86% 79.2 91.4
7 San Francisco USA 44.76% 83.3 93.9
8 Boston USA 44.35% 78.9 82
9 Stockholm Sweden 44.20% 76 98.9
10 Liverpool UK 43.50% 75.6 80.4
#Which countries scored a better overall work-life-balance in 2022?
wlb %>%
select(country, total.score)%>%
group_by(country)%>%
summarise_at(vars(total.score), list(average.country.score = mean))%>%
arrange(desc(average.country.score))
# A tibble: 31 × 2
country average.country.score
<chr> <dbl>
1 Norway 100
2 Finland 99.2
3 Switzerland 97.2
4 Denmark 96.2
5 Australia 93.4
6 Sweden 93.3
7 Netherlands 92.6
8 Japan 92.5
9 New Zealand 92.1
10 Canada 92.1
# … with 21 more rows
#Which city ranked highest per country in total work-life balance score?
wlb %>%
select(city, country, total.score)%>%
group_by(country) %>%
arrange(desc(total.score)) %>%
slice(1)%>%
ungroup()%>%
arrange(desc(total.score))%>%
slice(1:10)
# A tibble: 10 × 3
city country total.score
<chr> <chr> <dbl>
1 Oslo Norway 100
2 Bern Switzerland 99.5
3 Helsinki Finland 99.2
4 Copenhagen Denmark 96.2
5 Ottawa Canada 95.5
6 Sydney Australia 94.0
7 Stuttgart Germany 93.8
8 Stockholm Sweden 93.3
9 Amsterdam Netherlands 92.6
10 Tokyo Japan 92.5
#Do cities with a higher level of remaining monthly income after living expenses (affordability) have more taken vacation days?
wlb %>%
select(city, country, affordability, vaca.days.taken) %>%
arrange(desc(affordability))
# A tibble: 100 × 4
city country affordability vaca.days.taken
<chr> <chr> <dbl> <chr>
1 Houston USA 100 8.5
2 Dallas USA 99 8.5
3 Austin USA 95.2 8.5
4 Raleigh USA 91.8 9.1
5 Columbus USA 90.7 9.4
6 Charlotte USA 90.3 9.1
7 Salt Lake City USA 90.1 8.5
8 Jacksonville USA 89.5 9.1
9 San Antonio USA 89.1 8.5
10 Seattle USA 88.6 8.9
# … with 90 more rows
#Do cities with a greater unemployed population also have poor access to healthcare and mental healthcare?
wlb %>%
select(city, country, unemployment, healthcare, mental.health.access) %>%
arrange(unemployment)
# A tibble: 100 × 5
city country unemployment healthcare mental.health.ac…
<chr> <chr> <dbl> <dbl> <dbl>
1 Cape Town South Africa 50 50 65
2 Sao Paulo Brazil 79.6 62.5 50
3 Vienna Austria 81.3 93.8 77.3
4 Barcelona Spain 82.5 97 64.2
5 Madrid Spain 82.6 97.9 64.2
6 Bremen Germany 85.1 94.8 82
7 Berlin Germany 86.3 92.9 82
8 Buenos Aires Argentina 86.3 65.2 70.2
9 Cologne Germany 86.8 94.2 82
10 Montevideo Uruguay 87.4 74.4 64.6
# … with 90 more rows
#Note: A lower unemployment score reflects a city with a higher level of unemployment.
#Do the cities with a higher overworked population have a lower score of wellness and fitness?
wlb%>%
select(city, country, overworked, wellness.fitness) %>%
arrange(desc(overworked))
# A tibble: 100 × 4
city country overworked wellness.fitness
<chr> <chr> <chr> <dbl>
1 Sydney Australia 9.70% 70.5
2 Melbourne Australia 9.70% 74.7
3 Buenos Aires Argentina 8.80% 60.6
4 Amsterdam Netherlands 8.60% 67.7
5 Dubai UAE 23.40% 58.6
6 Hong Kong Hong Kong 17.90% 81.1
7 Kuala Lumpur Malaysia 17.10% 59.7
8 Singapore Singapore 16.90% 72.9
9 Montevideo Uruguay 16.30% 62.3
10 Tokyo Japan 15.40% 77.5
# … with 90 more rows