Homework 3

Describing and cleaning the final project dataset

Steph
2022-06-08

Final Project Goals

A focus on analyzing the data from “cities with the best work-life balance 2022”. The dataset was found on Kaggle: https://www.kaggle.com/datasets/prasertk/cities-with-the-best-worklife-balance-2022 Original data source: https://www.getkisi.com/work-life-balance-2022#table

To answer the following (or similar) questions:

  1. How does city affordability affect the percent of the employed population holding multiple jobs?

  2. Did cities with more remote jobs have better covid impact (public health mitigation response) and covid support (government spending to lessen economic stress)?

  3. Which countries scored a better overall work-life-balance in 2022?

  4. Which city ranked highest per country in total work-life balance score?

  5. Which cities rank best in work intensity (more remote jobs, less overworked, most vacation time, fewer multiple jobholders)?

  6. Do cities with a higher level of remaining monthly income after living expenses (affordability) have more taken vacation days?

  7. Do cities with a greater unemployed population also have poor access to healthcare and mental healthcare?

  8. Do the cities with a higher overworked population have a lower score of wellness and fitness?

What is the current working directory (WD)?

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
getwd()
[1] "C:/Users/steph/OneDrive/Documents/R"

Import and name dataset

wlb <- read_csv("work.life.balance.by.city.csv")
head(wlb)
# A tibble: 6 × 24
  `2022` `2021` City       Country     `Remote Jobs` `Overworked Pop…`
   <dbl> <chr>  <chr>      <chr>       <chr>         <chr>            
1      1 2      Oslo       Norway      41.72%        11.20%           
2      2 -      Bern       Switzerland 44.86%        11.40%           
3      3 1      Helsinki   Finland     38.92%        12.70%           
4      4 3      Zurich     Switzerland 44.86%        11.90%           
5      5 5      Copenhagen Denmark     41.42%        10.50%           
6      6 -      Geneva     Switzerland 44.86%        11.90%           
# … with 18 more variables: `Minimum Vacations Offered (Days)` <dbl>,
#   `Vacations Taken (Days)` <chr>, Unemployment <dbl>,
#   `Multiple Jobholders` <chr>, Inflation <chr>,
#   `Paid Parental Leave (Days)` <dbl>, `Covid Impact` <dbl>,
#   `Covid Support` <dbl>, Healthcare <dbl>,
#   `Access to Mental Healthcare` <dbl>,
#   `Inclusivity & Tolerance` <dbl>, Affordability <dbl>, …

Description of the dataset “wlb”

The dataset “work.life.balance.by.city” describes data collected and calculated from international organisations, NGO reports, open access datasets, public surveys and crowdsourcing platforms. It considers more than 130 data points to highligh the most and least overworked cities around the world.

The city selection consists of 51 US cities, as well as 49 global cities that were shortlisted for analysis as global economic hubs. The study was divided into three categories: Work Intensity, Society and Institutions, and City Liveability.

1. Work Intensity: Remote Jobs (%), Overworked Population (%), Minimum Vacations Offered (Days), Vacations Taken (Days), Unemployment (Score), Multiple Jobholders (%), Inflation (%), Paid Parental Leave (Days).

2. Society and Institutions: Covid Impact (Score), Covid Support (Score), Healthcare (Score), Access to Mental Healthcare (Score), Inclusivity & Tolerance (Score).

3. City Livability: Affordability (Score), Happiness, Culture & Leisure (Score), City Safety (Score), Outdoor Spaces (Score), Air Quality (Score), Wellness and Fitness (Score).

Multiple indicators were used as components when scoring each factor. The underlying indicators were first standardized using a Z-Score [z = (x-μ)/σ; μ=indicator mean; σ=indicator standard deviation] normalization procedure.

Explanation of variables

The “2022” and “2021” columns contain qualitative, ordinal data that ranks the cities by overall work-life balance in those respective years.

The “City” and “Country” columns contain qualitative, nominal data that describes the location from which data was gathered.

Work Intensity

The “Remote Jobs” column contains quantitative, continuous data that describes percentage of jobs that can be performed remotely in each city, and the feasibility of working at home for all occupations. A higher percentage reflects a city with a higher number of remote job opportunities and a better infrastructure to support working from home.

The “Overworked Population” column contains quantitative, continuous data that describes the percentage of full-time employees working more than 48 hours per working week in each city. A higher percentage reflects a city with a greater amount of its population working overtime.

The “Minimum Vacations Offered (Days)” column contains quantitative, discrete data that describes the minimum number of compensated vacation days an employee is legally entitled to after at least one year of service. Data was taken at a national level for a full-time, five-day workweek (excluding public holidays). A higher score reflects a city with a higher amount of vacation days offered to employees.

The “Vacations Taken (Days)” column contains quantitative, discrete data that describes the average number of paid vacation days taken by full-time employees in a single year.

The “Unemployment” column contains quantitative, continuous data that describes a score that reflects the unemployment rate for the metropolitan area or region in the first quarter of 2021. A higher score reflects a city with a lower level of unemployment.

The “Multiple Jobholders” column contains quantitative, continuous data that describes the percentage of employed people in each city holding more than one job at a time.

The “Inflation” column contains quantitative, continuous data that describes the rate of price inflation for a range of consumer goods and services including food, beverages, clothing, housing, water, electricity, gas, furnishing, health, transport, communication, recreation, restaurants and hotels. The study compared the index values from the start of the pandemic (using an average of the months January and February 2020) with the index values from the most recent available data to show the percentage change. The study used state-level data for the US and country-level data for the rest of the world.

The “Paid Parental Leave (Days)” column contains quantitative, continuous data that describes the number of paid family leave days afforded to employees by law in each city.

Society and Institutions

The “Covid Impact” column contains quantitative, continuous data that describes a score that reflects the social and economic impact of a location’s Covid response, split across three areas: public health, economic, and social. The public health impact is quantified through cases and deaths relative to population; the economic impact through year-on-year GDP growth in 2020 and 2021; and the social impact through the severity of limiting measures put in place to contain the pandemic, and changes in mobility patterns as an indicator of the effect of these restrictions. A higher score reflects a more mitigated impact of the Covid-19 pandemic.

The “Covid Support” column contains quantitative, continuous data that describes a score that reflects the income support provided by governments to workers affected by the economic effects of Covid. The score takes into account government programmes to replace income lost due to Covid, duration of unemployment benefits, consumer confidence, household spending and general wage levels, as well as overall government spending to lessen the economic impact of Covid. In addition, the number of Covid cases and deaths were taken into account.

The “Healthcare” column contains quantitative, continuous data that describes score that rates a city’s healthcare system based on accessibility, quality of care and satisfaction. Country-level data was obtained from the Universal Health Coverage (UHC) index for access and quality indicators, while US cities data incorporates state-level data from the Health Access and Quality (HAQ) study. Additional data was taken from healthcare access indexes developed by the World Health Organisation and the European Commission. Satisfaction survey results were taken at a city level. A higher score reflects greater accessibility, quality of care and user rating for each city’s healthcare infrastructure. city level.

The “Access to Mental Healthcare” column contains quantitative, continuous data that describes a score that reflects the accessibility and effectiveness of governments’ implementation of mental health policies catering for individuals with mental health illnesses. This factor uses national data on governance, access to treatment and the environment necessary for treatment. It also incorporates suicide rates and city-level survey data on healthcare quality. A higher score reflects a more effective and accessible mental healthcare network.

The “Inclusivity & Tolerance” column contains quantitative, continuous data that describes a score that reflects the combined scores of the ’Gender Equality. A higher score reflects greater gender equality and a higher degree of LGBT+ equality.

City Liveability

The “Affordability” column contains quantitative, continuous data that describes a score that reflects monthly living costs as a proportion of the average household income after tax. Monthly costs include rent, basic utilities costs, groceries, internet connection, leisure activities, clothes and dining out. A higher score indicates a higher level of remaining monthly income after accounting for these deductions.

The “Happiness, Culture & Leisure” column contains quantitative, continuous data that describes the combined scores of both the ‘Happiness’ and ‘Culture & Leisure’ factors. The happiness score is calculated from survey responses evaluating the perceived happiness with one’s own life, as well as the degree of positive and negative effects a respondent experiences. A higher score reflects higher degrees of self-perceived happiness. The culture & leisure score indicates the vibrancy and variety of cultural and lifestyle offerings in a city. It combines cultural city rankings, the number of people employed in the cultural and creative industries, and the number of leisure facilities and activities available per capita, including sports stadiums, restaurants, parks, shops, entertainment and nightlife venues.

The “City Safety” column contains quantitative, continuous data that describes a score of the degree of a city’s safety in more than a dozen key areas, including environmental, social and infrastructural security. Indicators include statistics on injuries and fatalities, damage caused at an economic level, public opinion data, and data on the vulnerability of a location to particular hazards. A higher score reflects a safer city.

The “Outdoor Spaces” column contains quantitative, continuous data that describes the prevalence and accessibility of a city’s urban green infrastructure as a score, including its proximity to residents and the percentage of land allocated to green space. Data on weather and daylight conditions that could affect the use of public outdoor spaces was also incorporated. This includes average temperatures, the annual number of rainy days, annual sunshine hours, and cloudlessness. A higher score reflects a greater urban green infrastructure, as well as better environmental conditions for outdoor life.

The “Air Quality” column contains quantitative, continuous data that describes the annual median particulate matter (PM2.5/PM10) pollution for the year 2022, represented as a score. Daily average data was taken across all days of a single year, with the median pollution level representing the overall score. Data was taken at a city level. A higher score reflects greater air quality.

The “Wellness and Fitness” column contains quantitative, continuous data that describes the general state of a community’s physical fitness and health as represented by the population’s average life expectancy, as well as levels of inactivity, obesity, and the number of fitness studios and gyms per capita. National data was used for life expectancy at birth, while US cities use city-level data. Adult obesity rates and the prevalence of physical inactivity were taken at a national level, with US cities using state level data. Data on the number of gyms per capita is taken at a city level. A higher score reflects a better state of a community’s physical fitness and health.

The “TOTAL SCORE” column contains quantitative, continuous data that describes the calculated final score. The final score was computed as a weighted average of the component Z-Scores, and the resulting score normalised to a scale of 50 to 100 using min-max normalisation [(value - min)/(max-min)*50+50]. The floor of 50 for the scale was chosen to emphasise that the minimum score does not imply the absence of the infrastructures under analysis, as the position is relative to that of other cities in the ranking.

Manipulate

#Practice: filter by affordability and inclusivity
wlb %>%
  select("City", "Country", "Affordability", "Inclusivity & Tolerance", "TOTAL SCORE") %>%
  filter(`TOTAL SCORE` >80, `Affordability` >70, `Inclusivity & Tolerance` >90) %>%
  arrange(desc(Affordability)) %>%
  slice(1:10)
# A tibble: 10 × 5
   City           Country Affordability `Inclusivity &…` `TOTAL SCORE`
   <chr>          <chr>           <dbl>            <dbl>         <dbl>
 1 Seattle        USA              88.6             94.7          88.4
 2 Colorado Spri… USA              86.2             91.4          84.6
 3 Chicago        USA              84.1             90.6          82.9
 4 Baltimore      USA              84.1             92            82.8
 5 Denver         USA              83.2             92            85.6
 6 San Diego      USA              80.7             93            85.8
 7 San Francisco  USA              80.5             97.2          87.3
 8 Calgary        Canada           79.4             93.5          89.4
 9 Stuttgart      Germany          79.3             91.6          93.8
10 Dusseldorf     Germany          79.2             93.7          90.9
# Clean - make all column names lowercase
wlb<-wlb %>%
  rename_all(., .funs = tolower)
colnames(wlb)
 [1] "2022"                            
 [2] "2021"                            
 [3] "city"                            
 [4] "country"                         
 [5] "remote jobs"                     
 [6] "overworked population"           
 [7] "minimum vacations offered (days)"
 [8] "vacations taken (days)"          
 [9] "unemployment"                    
[10] "multiple jobholders"             
[11] "inflation"                       
[12] "paid parental leave (days)"      
[13] "covid impact"                    
[14] "covid support"                   
[15] "healthcare"                      
[16] "access to mental healthcare"     
[17] "inclusivity & tolerance"         
[18] "affordability"                   
[19] "happiness, culture & leisure"    
[20] "city safety"                     
[21] "outdoor spaces"                  
[22] "air quality"                     
[23] "wellness and fitness"            
[24] "total score"                     
#Clean - rename columns
wlb <- rename(wlb, remote.jobs = `remote jobs`,
       overworked = `overworked population`,
       min.vaca.days = `minimum vacations offered (days)`,
       vaca.days.taken = `vacations taken (days)`,
       multiple.jobs = `multiple jobholders`,
       parental.leave = `paid parental leave (days)`,
       covid.impact = `covid impact`,
       covid.support = `covid support`,
       mental.health.access = `access to mental healthcare`,
       inclusivity = `inclusivity & tolerance`,
       happiness = `happiness, culture & leisure`,
       city.safety = `city safety`,
       outdoor.spaces = `outdoor spaces`,
       air.quality = `air quality`,
       wellness.fitness = `wellness and fitness`,
       total.score = `total score`)
colnames(wlb)
 [1] "2022"                 "2021"                
 [3] "city"                 "country"             
 [5] "remote.jobs"          "overworked"          
 [7] "min.vaca.days"        "vaca.days.taken"     
 [9] "unemployment"         "multiple.jobs"       
[11] "inflation"            "parental.leave"      
[13] "covid.impact"         "covid.support"       
[15] "healthcare"           "mental.health.access"
[17] "inclusivity"          "affordability"       
[19] "happiness"            "city.safety"         
[21] "outdoor.spaces"       "air.quality"         
[23] "wellness.fitness"     "total.score"         
#How does city affordability affect the percent of the employed population holding multiple jobs?
wlb %>%
  select(city, country, affordability, multiple.jobs) %>%
  arrange(desc(affordability))
# A tibble: 100 × 4
   city           country affordability multiple.jobs
   <chr>          <chr>           <dbl> <chr>        
 1 Houston        USA             100   3.70%        
 2 Dallas         USA              99   3.70%        
 3 Austin         USA              95.2 3.70%        
 4 Raleigh        USA              91.8 5.00%        
 5 Columbus       USA              90.7 6.50%        
 6 Charlotte      USA              90.3 5.00%        
 7 Salt Lake City USA              90.1 6.50%        
 8 Jacksonville   USA              89.5 3.50%        
 9 San Antonio    USA              89.1 3.70%        
10 Seattle        USA              88.6 5.40%        
# … with 90 more rows
#Did cities with more remote jobs have better covid impact (public health mitigation response) and covid support (government spending to lessen economic stress)?
wlb %>%
  select(city, country, remote.jobs, covid.impact, covid.support) %>%
  arrange(desc(remote.jobs)) %>%
  slice(1:10)
# A tibble: 10 × 5
   city          country     remote.jobs covid.impact covid.support
   <chr>         <chr>       <chr>              <dbl>         <dbl>
 1 Singapore     Singapore   52.06%              87.9          95.1
 2 Washington    USA         49.77%              81.3          83.4
 3 Austin        USA         45.51%              81.9          82.3
 4 Bern          Switzerland 44.86%              82.2          91.4
 5 Zurich        Switzerland 44.86%              82            91.4
 6 Geneva        Switzerland 44.86%              79.2          91.4
 7 San Francisco USA         44.76%              83.3          93.9
 8 Boston        USA         44.35%              78.9          82  
 9 Stockholm     Sweden      44.20%              76            98.9
10 Liverpool     UK          43.50%              75.6          80.4
#Which countries scored a better overall work-life-balance in 2022?
wlb %>%
  select(country, total.score)%>%
  group_by(country)%>%
  summarise_at(vars(total.score), list(average.country.score = mean))%>%
  arrange(desc(average.country.score))
# A tibble: 31 × 2
   country     average.country.score
   <chr>                       <dbl>
 1 Norway                      100  
 2 Finland                      99.2
 3 Switzerland                  97.2
 4 Denmark                      96.2
 5 Australia                    93.4
 6 Sweden                       93.3
 7 Netherlands                  92.6
 8 Japan                        92.5
 9 New Zealand                  92.1
10 Canada                       92.1
# … with 21 more rows
#Which city ranked highest per country in total work-life balance score?
wlb %>%
  select(city, country, total.score)%>%
  group_by(country) %>%
  arrange(desc(total.score)) %>%
  slice(1)%>%
  ungroup()%>%
  arrange(desc(total.score))%>%
  slice(1:10)
# A tibble: 10 × 3
   city       country     total.score
   <chr>      <chr>             <dbl>
 1 Oslo       Norway            100  
 2 Bern       Switzerland        99.5
 3 Helsinki   Finland            99.2
 4 Copenhagen Denmark            96.2
 5 Ottawa     Canada             95.5
 6 Sydney     Australia          94.0
 7 Stuttgart  Germany            93.8
 8 Stockholm  Sweden             93.3
 9 Amsterdam  Netherlands        92.6
10 Tokyo      Japan              92.5
#Do cities with a higher level of remaining monthly income after living expenses (affordability) have more taken vacation days?
wlb %>%
  select(city, country, affordability, vaca.days.taken) %>%
  arrange(desc(affordability)) 
# A tibble: 100 × 4
   city           country affordability vaca.days.taken
   <chr>          <chr>           <dbl> <chr>          
 1 Houston        USA             100   8.5            
 2 Dallas         USA              99   8.5            
 3 Austin         USA              95.2 8.5            
 4 Raleigh        USA              91.8 9.1            
 5 Columbus       USA              90.7 9.4            
 6 Charlotte      USA              90.3 9.1            
 7 Salt Lake City USA              90.1 8.5            
 8 Jacksonville   USA              89.5 9.1            
 9 San Antonio    USA              89.1 8.5            
10 Seattle        USA              88.6 8.9            
# … with 90 more rows
#Do cities with a greater unemployed population also have poor access to healthcare and mental healthcare?
wlb %>%
  select(city, country, unemployment, healthcare, mental.health.access) %>%
  arrange(unemployment) 
# A tibble: 100 × 5
   city         country      unemployment healthcare mental.health.ac…
   <chr>        <chr>               <dbl>      <dbl>             <dbl>
 1 Cape Town    South Africa         50         50                65  
 2 Sao Paulo    Brazil               79.6       62.5              50  
 3 Vienna       Austria              81.3       93.8              77.3
 4 Barcelona    Spain                82.5       97                64.2
 5 Madrid       Spain                82.6       97.9              64.2
 6 Bremen       Germany              85.1       94.8              82  
 7 Berlin       Germany              86.3       92.9              82  
 8 Buenos Aires Argentina            86.3       65.2              70.2
 9 Cologne      Germany              86.8       94.2              82  
10 Montevideo   Uruguay              87.4       74.4              64.6
# … with 90 more rows
#Note: A lower unemployment score reflects a city with a higher level of unemployment.
#Do the cities with a higher overworked population have a lower score of wellness and fitness?
wlb%>%
  select(city, country, overworked, wellness.fitness) %>%
  arrange(desc(overworked))
# A tibble: 100 × 4
   city         country     overworked wellness.fitness
   <chr>        <chr>       <chr>                 <dbl>
 1 Sydney       Australia   9.70%                  70.5
 2 Melbourne    Australia   9.70%                  74.7
 3 Buenos Aires Argentina   8.80%                  60.6
 4 Amsterdam    Netherlands 8.60%                  67.7
 5 Dubai        UAE         23.40%                 58.6
 6 Hong Kong    Hong Kong   17.90%                 81.1
 7 Kuala Lumpur Malaysia    17.10%                 59.7
 8 Singapore    Singapore   16.90%                 72.9
 9 Montevideo   Uruguay     16.30%                 62.3
10 Tokyo        Japan       15.40%                 77.5
# … with 90 more rows