Population and Unemployment in Cyprus: Preprocessing Messy Data

Executive Summary

This exercise joins two datasets: one on unemployment rates in European countries (Eurostat 2019), and one on the population of Cyprus from 1992 to 2018 (CYSTAT 2019). The aim is to combine the two datasets, scan them for missing and outlying values, and generate a new variable giving the approximate number of unemployed people in each age group per year. This preprocessing enables later analysis of differences in unemployment between men and women, and between age groups, over time. Both datasets were first imported with the read_csv function. They were then trimmed of irrelevant rows and columns picked up during import, and split into male and female tables for ease of manipulation. Both datasets were in wide format and needed to be gathered on the year variable. The male and female tables were then bound together, with a mutated variable indicating sex. The variables in both datasets were then converted to appropriate data types. The data was checked for special values and missing values; missing values were found only in the unemployment dataset. For Cyprus, these were imputed with the mean for the same sex and age group across the ten-year period. The two datasets were then merged with an inner join to create a new dataframe, and a new variable for the number of unemployed people was created by multiplying population by unemployment rate. This new dataframe was scanned for outliers. While outliers were found when considering the dataset as a whole, no true outliers could be detected within individual age brackets. Finally, the unemployment variable was checked for normality by examining histograms, and then transformed from a heavy right skew to a closer approximation of normal.

Dataset 1

The first dataset is drawn from the Eurostat database, which provides customisable datasets based on numerous political, socio-economic, environmental, and developmental metrics. The data used here covers 35 European countries over the 10 years from 2010 to 2019.

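# Packages used in this report
library(tidyverse)  # read_csv, %>%, gather, mutate, joins, ggplot
library(car)        # scatterplot
library(outliers)   # scores
library(forecast)   # BoxCox

unemployed <- read_csv("messyunemployment.csv")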

## Warning: Missing column names filled in: 'X2' [2], 'X3' [3], 'X4' [4], 'X5' [5],
## 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12],
## 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17], 'X18' [18],
## 'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22], 'X23' [23], 'X24' [24],
## 'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28], 'X29' [29], 'X30' [30],
## 'X31' [31], 'X32' [32], 'X33' [33], 'X34' [34], 'X35' [35], 'X36' [36],
## 'X37' [37], 'X38' [38], 'X39' [39], 'X40' [40], 'X41' [41], 'X42' [42]

## Parsed with column specification:
## cols(
##   .default = col_character()
## )

## See spec(...) for full column specifications.

head(unemployed)

The data initially does not conform to the tidy data principles: it violates the principle that each variable must have its own column. It does so in two ways. First, the years are used as column names rather than being values of a year variable. Second, male and female are kept in separate columns when they are values of a single variable, Sex. Converting this data from wide to long format first requires some subsetting.

Subsetting

# First, remove the redundant rows that do not contain data. The header rows (8 to 10) are kept for now, as they contain the information needed to identify the columns.
unemployed <- unemployed[c(8, 9, 10, 59:478), c(1, 2, seq(3, 41, 2))]

# For ease of processing, the male and female values are split into separate tables
unemployedM <- unemployed[, c(1, 2, seq(3, 21, 2))]
head(unemployedM)

unemployedF <- unemployed[, c(1, 2, seq(4, 22, 2))]
head(unemployedF)

# Rename the Columns to reflect the data
colnames(unemployedM) <- c("Country", "AgeGroup", 2010:2019)
colnames(unemployedF) <- c("Country", "AgeGroup", 2010:2019)

# Subset the two dataframes to remove the leftover header rows
unemployedF <- unemployedF[4:423, ]
unemployedM <- unemployedM[4:423, ]

head(unemployedF)

head(unemployedM)

Transforming the data

To transform the data from wide to long format, the gather function is used to turn the year columns into a single column. In doing this, an arbitrary value of 1 or 2 will be assigned in a new column to the female and male dataframes respectively.

# Transform the two datasets from wide to long formats, and assign a Variable for sex

unemployedF <- unemployedF %>% mutate(Sex = 1) %>%
  gather(`2010`: `2019`, key = "year", value = "Unemployment_Rate")

unemployedM <- unemployedM %>% mutate(Sex = 2) %>% 
  gather(`2010`:`2019`, key = "year", value = "Unemployment_Rate")

# Now use bind_rows to rejoin the two tables

unemployment_long <- bind_rows(unemployedF, unemployedM)

# Check the Structure of the data

str(unemployment_long)

## tibble [8,400 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Country          : chr [1:8400] "Belgium" "Belgium" "Belgium" "Belgium" ...
##  $ AgeGroup         : chr [1:8400] "From 15 to 19 years" "From 20 to 24 years" "From 25 to 29 years" "From 30 to 34 years" ...
##  $ Sex              : num [1:8400] 1 1 1 1 1 1 1 1 1 1 ...
##  $ year             : chr [1:8400] "2010" "2010" "2010" "2010" ...
##  $ Unemployment_Rate: chr [1:8400] "36.6" "20.4" "11.4" "8.5" ...
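
As an aside, newer versions of tidyr offer pivot_longer as an alternative to gather for the same wide-to-long reshaping. The sketch below is illustrative only: the toy table and its values are made up, not taken from the Eurostat file, and the object names wide and long are hypothetical.

```r
library(tidyr)
library(tibble)

# Toy wide table (made-up values): one column per year, as in the raw extract
wide <- tibble(
  Country  = c("Cyprus", "Cyprus"),
  AgeGroup = c("From 15 to 19 years", "From 20 to 24 years"),
  `2010`   = c("30.0", "20.0"),
  `2011`   = c("32.0", "21.0")
)

# Gather the year columns into a single year column, with the cell
# values moved into Unemployment_Rate
long <- pivot_longer(
  wide,
  cols      = `2010`:`2011`,
  names_to  = "year",
  values_to = "Unemployment_Rate"
)

long
```

Note that pivot_longer orders the result row by row of the input, whereas gather stacks the output one year column at a time, so the row order differs even though the content is the same.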

Variables

The table contains 5 variables: Country, AgeGroup, Sex, year, and Unemployment_Rate.

- Country is an unordered categorical variable giving the country where the data was gathered.
- AgeGroup is an ordered categorical variable of workers' age groups, in bins of 5 years.
- Sex is a factor variable with two levels, female and male.
- year is a numeric variable holding whole years from 2010 to 2019. It is not kept as a date variable, as a four-digit year on its own cannot be stored as a Date in R.
- Unemployment_Rate is a numeric variable giving the mean unemployment rate (in percent) for each age group bin each year.

# Convert the variables from character to their appropriate types (factor and numeric).

unemployment_long$Sex <- unemployment_long$Sex %>% factor(c(1, 2),
                                                          labels = c("female", "male"), 
                                                          ordered = FALSE)

unemployment_long$Country <- unemployment_long$Country %>% factor()

unemployment_long$AgeGroup <- unemployment_long$AgeGroup %>% factor(labels = c(
                                                                      "15-19",
                                                                    "20-24", 
                                                                    "25-29",
                                                                    "30-34",
                                                                    "35-39",
                                                                    "40-44",
                                                                    "45-49",
                                                                    "50-54",
                                                                    "55-59",
                                                                    "60-64",
                                                                    "65-69",
                                                                    "70-75"), 
                                                                    ordered = TRUE)

unemployment_long$year <- unemployment_long$year %>% as.numeric()

unemployment_long$Unemployment_Rate <- unemployment_long$Unemployment_Rate %>% as.numeric()

## Warning in function_list[[k]](value): NAs introduced by coercion

str(unemployment_long)

## tibble [8,400 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Country          : Factor w/ 35 levels "Austria","Belgium",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ AgeGroup         : Ord.factor w/ 12 levels "15-19"<"20-24"<..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sex              : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year             : num [1:8400] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Unemployment_Rate: num [1:8400] 36.6 20.4 11.4 8.5 7.9 6 5.3 5.9 5.7 NA ...

unemploymentL <- unemployment_long %>% arrange(Country, AgeGroup, year)

head(unemploymentL)

Searching Dataset 1 for missing values

Dataset 1 has been converted into a long dataframe with 8400 observations. The data is compiled from reporting by 35 different countries, all of which may use different metrics for measuring unemployment rates. For example, some countries may not count those under 18 as part of the workforce, or may consider those above the age of 65 as retired and supported by a pension scheme.

#Identify Missing Values within data
sum(is.na(unemploymentL))

## [1] 1745

colSums(is.na(unemploymentL))

##           Country          AgeGroup               Sex              year 
##                 0                 0                 0                 0 
## Unemployment_Rate 
##              1745

# All missing values are in the Unemployment_Rate column.
# Influencing factors could include the age at which countries stop counting unemployed people,
# due to the provision of a pension scheme or something similar.
# As is common throughout Europe, we stop counting the population once AgeGroup reaches 65 and over.

WorkingAge <- unemploymentL %>% subset(AgeGroup < "65-69")

#Now check new data to see how this has impacted upon the data 

colSums(is.na(WorkingAge))

##           Country          AgeGroup               Sex              year 
##                 0                 0                 0                 0 
## Unemployment_Rate 
##               508

# Significantly fewer NA values. This is a valid deletion, as most of these countries treat 65 as the end of working age.
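
One way to support a deletion like this is to count the missing values per age group before dropping anything. The following is a minimal sketch on made-up data (the tibble toy and the summary column names are hypothetical), not a rerun of the Eurostat figures.

```r
library(dplyr)

# Toy data (made-up values): missing rates concentrated in the oldest groups
toy <- tibble(
  AgeGroup          = c("15-19", "60-64", "65-69", "65-69", "70-74", "70-74"),
  Unemployment_Rate = c(30, 5, NA, 2, NA, NA)
)

# Number and share of missing rates within each age group
na_by_age <- toy %>%
  group_by(AgeGroup) %>%
  summarise(
    n_missing     = sum(is.na(Unemployment_Rate)),
    share_missing = mean(is.na(Unemployment_Rate))
  )

na_by_age
```

If the missing values cluster in the oldest groups, as in this toy example, removing those groups is easier to justify than imputing them.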

More imputation could be done on this dataframe as a whole, as there is still a high number of missing values. For this next section, attention will be focused specifically on Cyprus, as this is the country that is being joined with the other dataset.

Cypruswork <- WorkingAge %>% subset(Country == "Cyprus")

colSums(is.na(Cypruswork))

##           Country          AgeGroup               Sex              year 
##                 0                 0                 0                 0 
## Unemployment_Rate 
##                10

#In Cyprus there are 10 missing values that we need to impute 

which(is.na(Cypruswork$Unemployment_Rate))

##  [1]   2   4  14  17 181 183 185 195 197 199

# Impute the average for each age group and sex over the whole period.

# Males aged 15-19
Cypruswork$Unemployment_Rate[
  Cypruswork$AgeGroup == "15-19" & Cypruswork$Sex == "male" & is.na(Cypruswork$Unemployment_Rate)] <- 
  mean(Cypruswork$Unemployment_Rate[Cypruswork$AgeGroup == "15-19" & Cypruswork$Sex == "male"], na.rm = TRUE)

# Females aged 15-19
Cypruswork$Unemployment_Rate[
  Cypruswork$AgeGroup == "15-19" & Cypruswork$Sex == "female" & is.na(Cypruswork$Unemployment_Rate)] <- 
  mean(Cypruswork$Unemployment_Rate[Cypruswork$AgeGroup == "15-19" & Cypruswork$Sex == "female"], na.rm = TRUE)

# Males aged 60-64
Cypruswork$Unemployment_Rate[
  Cypruswork$AgeGroup == "60-64" & Cypruswork$Sex == "male" & is.na(Cypruswork$Unemployment_Rate)] <- 
  mean(Cypruswork$Unemployment_Rate[Cypruswork$AgeGroup == "60-64" & Cypruswork$Sex == "male"], na.rm = TRUE)

# Females aged 60-64
Cypruswork$Unemployment_Rate[
  Cypruswork$AgeGroup == "60-64" & Cypruswork$Sex == "female" & is.na(Cypruswork$Unemployment_Rate)] <- 
  mean(Cypruswork$Unemployment_Rate[Cypruswork$AgeGroup == "60-64" & Cypruswork$Sex == "female"], na.rm = TRUE)

colSums(is.na(Cypruswork))

##           Country          AgeGroup               Sex              year 
##                 0                 0                 0                 0 
## Unemployment_Rate 
##                 0

is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

sapply(Cypruswork, function(x) sum( is.special(x)))

##           Country          AgeGroup               Sex              year 
##                 0                 0                 0                 0 
## Unemployment_Rate 
##                 0

#No more missing values
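
The same group-mean imputation could also be written once for all groups with dplyr, rather than repeated per age group and sex. This is a sketch under stated assumptions: the data below is made up, and toy and imputed are hypothetical names, not objects from this report.

```r
library(dplyr)

# Toy data (made-up values) with one missing rate in each group
toy <- tibble(
  AgeGroup          = c("15-19", "15-19", "15-19", "60-64", "60-64", "60-64"),
  Sex               = c("male", "male", "male", "female", "female", "female"),
  Unemployment_Rate = c(40, NA, 50, 4, 6, NA)
)

# Replace each NA with the mean of the non-missing values in the same
# AgeGroup and Sex group; observed values are left untouched
imputed <- toy %>%
  group_by(AgeGroup, Sex) %>%
  mutate(Unemployment_Rate = ifelse(is.na(Unemployment_Rate),
                                    mean(Unemployment_Rate, na.rm = TRUE),
                                    Unemployment_Rate)) %>%
  ungroup()

imputed
```

Grouping first means that any age group and sex combination with missing values is handled, without having to know in advance which combinations those are.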

Dataset 2

This dataset is sourced from the Statistical Services of Cyprus, and is a dataset that tracks the number of people within the Republic of Cyprus of each age group in 5 year bins for each year from 1992-2018.

Cyprus <- read_csv("CyprusPop.csv")

## Warning: Missing column names filled in: 'X1' [1], 'X3' [3], 'X4' [4], 'X5' [5],
## 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X12' [12],
## 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17], 'X18' [18],
## 'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22], 'X23' [23], 'X24' [24],
## 'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28], 'X29' [29]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_logical(),
##   `B5b. END OF THE YEAR DE JURE POPULATION BY AGE AND SEX, 1992-2018` = col_character(),
##   X29 = col_character()
## )

## See spec(...) for full column specifications.

head(Cyprus)

As with the previous dataset, this does not conform to the tidy data principles, as the data is in a wide format and each year has its own column rather than being a value in a single year column. Unlike the previous data, the imported dataset is already separated into two tables, though these will need to be subsetted and combined in order to be usable.

Subsetting

The operations performed here will be much the same as above, as the raw data is formatted in a similar way.

# First subset the data into a male and female frame separately, adding column names

CyprusM <- Cyprus[c(2:20), c(2:29)]

colnames(CyprusM) <- c("AgeGroup", 1992:2018)

CyprusM <- CyprusM[3:19, ]

CyprusF <- Cyprus[27:43, 2:29]

colnames(CyprusF) <- c("AgeGroup", 1992:2018)

# Then convert the two tables into long format data.

CyprusM <- CyprusM %>% mutate(Sex = 2) %>%
  gather(`1992`:`2018`, key = "year", value = "Population (000's)")
CyprusF <- CyprusF %>% mutate(Sex = 1) %>%
  gather(`1992`:`2018`, key = "year", value = "Population (000's)")


# Bind these two dataframes together

CyprusPop <- rbind(CyprusF, CyprusM)

str(CyprusPop)

## tibble [918 × 4] (S3: tbl_df/tbl/data.frame)
##  $ AgeGroup          : chr [1:918] "0 - 4" "5 - 9" "10 - 14" "15 - 19" ...
##  $ Sex               : num [1:918] 1 1 1 1 1 1 1 1 1 1 ...
##  $ year              : chr [1:918] "1992" "1992" "1992" "1992" ...
##  $ Population (000's): chr [1:918] "26" "25.8" "24.2" "21.3" ...

Variables

The dataset contains 4 variables:

- AgeGroup, a categorical variable with age bins in 5-year groups up to 80, above which all ages are grouped together.
- Sex, an assigned categorical variable with 2 levels.
- year, a numeric variable holding whole years from 1992 to 2018.
- Population (000's), a numeric variable measuring population in thousands.

# Perform necessary factorizations
CyprusPop$Sex <- CyprusPop$Sex %>% factor(c(1, 2),
                                     labels = c("female", "male"), 
                                     ordered = FALSE)

# Pass the levels explicitly, in order of first appearance (youngest to oldest).
# Without `levels`, factor() sorts the raw strings alphabetically ("5 - 9" lands
# after "45 - 49"), so the labels below would be attached to the wrong age groups.
CyprusPop$AgeGroup <- CyprusPop$AgeGroup %>% factor(levels = unique(CyprusPop$AgeGroup),
                                                    labels = c("0-4", 
                                                               "5-9",
                                                               "10-14",
                                                               "15-19",
                                                               "20-24", 
                                                               "25-29",
                                                               "30-34",
                                                               "35-39",
                                                               "40-44",
                                                               "45-49",
                                                               "50-54",
                                                               "55-59",
                                                               "60-64",
                                                               "65-69",
                                                               "70-74", 
                                                               "75-79", 
                                                               "80+"), 
                                                                ordered = TRUE)
CyprusPop$`Population (000's)` <- CyprusPop$`Population (000's)` %>% as.numeric()

CyprusPop$year <- CyprusPop$year %>% as.numeric()

str(CyprusPop)

## tibble [918 × 4] (S3: tbl_df/tbl/data.frame)
##  $ AgeGroup          : Ord.factor w/ 17 levels "0-4"<"5-9"<"10-14"<..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Sex               : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ year              : num [1:918] 1992 1992 1992 1992 1992 ...
##  $ Population (000's): num [1:918] 26 25.8 24.2 21.3 22.3 23.6 24.9 22.4 21.2 18.8 ...
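
When relabelling age groups it is worth checking that each label lands on the intended group, because factor() sorts the unique character values as text when no levels are supplied. The sketch below uses a four-element toy vector (not the CYSTAT data) to show the difference; raw, new_labels, implicit and explicit are hypothetical names.

```r
# Raw age-group strings in their natural (youngest to oldest) order
raw        <- c("0 - 4", "5 - 9", "10 - 14", "15 - 19")
new_labels <- c("0-4", "5-9", "10-14", "15-19")

# Without `levels`, the unique values are sorted as text first, so
# "10 - 14" and "15 - 19" come before "5 - 9" and the labels are shifted
implicit <- factor(raw, labels = new_labels, ordered = TRUE)

# With explicit levels, each raw string is paired with its intended label
explicit <- factor(raw, levels = raw, labels = new_labels, ordered = TRUE)

# Compare the raw strings with the label each approach assigns
data.frame(raw,
           implicit = as.character(implicit),
           explicit = as.character(explicit))
```

Comparing the columns side by side makes any mismatch between the raw string and the assigned label immediately visible.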

Check for Missing Values or Special Characters

colSums(is.na(CyprusPop))

##           AgeGroup                Sex               year Population (000's) 
##                  0                  0                  0                  0

# Search for Special Characters
is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

sapply(CyprusPop, function(x) sum( is.special(x)))

##           AgeGroup                Sex               year Population (000's) 
##                  0                  0                  0                  0

There are no missing values within the dataset.

Merging the two datasets

The way the two datasets have been cleaned makes them closely resemble one another, and they match on 3 variables: AgeGroup, Sex, and year. Because these three variables together form a unique key for each observation, the datasets can be combined with a mutating join.

# Before joining the two datasets, convert the AgeGroup variable back to a character vector,
# as the two factors do not share the same set of levels.
CyprusPop$AgeGroup <- CyprusPop$AgeGroup %>% as.character()
Cypruswork$AgeGroup <- Cypruswork$AgeGroup %>% as.character()

There are a number of ways the datasets could be merged. One option is a left_join between Cypruswork and CyprusPop. Because the year ranges of the two datasets differ, this results in missing population values for the observations with year = 2019. This would provide an opportunity to impute the missing values and so get a more recent comparison of population and unemployment. However, imputing this data could be risky to the overall quality of the data, as it may not take into account factors such as large immigration numbers or a change in legislation.

leftCyprus <- left_join(Cypruswork, CyprusPop)

## Joining, by = c("AgeGroup", "Sex", "year")

head(leftCyprus)

which(is.na(leftCyprus))

##  [1] 1019 1020 1039 1040 1059 1060 1079 1080 1099 1100 1119 1120 1139 1140 1159
## [16] 1160 1179 1180 1199 1200

The most efficient approach is to use the inner_join function, because the year ranges of the two datasets differ significantly. With inner_join, the output contains only the years from 2010 to 2018, for which both the population and the unemployment rate are available.

innerCyprus <- inner_join(Cypruswork, CyprusPop)

## Joining, by = c("AgeGroup", "Sex", "year")

head(innerCyprus)

colSums(is.na(innerCyprus))

##            Country           AgeGroup                Sex               year 
##                  0                  0                  0                  0 
##  Unemployment_Rate Population (000's) 
##                  0                  0
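
The difference between the two joins can be seen on a small example. The sketch below uses made-up values and hypothetical object names (rates, pop, keys), and names the key columns explicitly with by rather than relying on the automatic matching reported in the messages above.

```r
library(dplyr)

# Toy tables (made-up values) with overlapping but different year ranges
rates <- tibble(AgeGroup = "15-19", Sex = "female",
                year = c(2018, 2019), unemployment = c(0.20, 0.18))
pop   <- tibble(AgeGroup = "15-19", Sex = "female",
                year = c(2017, 2018), population = c(25000, 24000))

keys <- c("AgeGroup", "Sex", "year")

# left_join keeps every row of rates; 2019 has no population match, so NA
left  <- left_join(rates, pop, by = keys)

# inner_join keeps only the rows whose key appears in both tables (2018)
inner <- inner_join(rates, pop, by = keys)

left
inner
```

Stating the keys explicitly also guards against an accidental match on any other column that happens to share a name across the two tables.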

Now that the data is joined, we can perform some operations on it to make it more logical and readable.

#Convert the AgeGroup variable back to a Factor
innerCyprus$AgeGroup <- innerCyprus$AgeGroup %>% factor(labels = c(
                                                                    "15-19",
                                                                    "20-24", 
                                                                    "25-29",
                                                                    "30-34",
                                                                    "35-39",
                                                                    "40-44",
                                                                    "45-49",
                                                                    "50-54",
                                                                    "55-59",
                                                                    "60-64"), 
                                                        ordered = TRUE)
                                                        

#Convert the population in thousands to a true figure
innerCyprus$`Population (000's)` <- innerCyprus$`Population (000's)` * 1000
colnames(innerCyprus)[6] <- "population"

#Convert the Unemployment_Rate to a true proportion 
innerCyprus$Unemployment_Rate <- innerCyprus$Unemployment_Rate / 100
colnames(innerCyprus)[5] <- "unemployment"

str(innerCyprus)

## tibble [180 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Country     : Factor w/ 35 levels "Austria","Belgium",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ AgeGroup    : Ord.factor w/ 10 levels "15-19"<"20-24"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex         : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ year        : num [1:180] 2010 2010 2011 2011 2012 ...
##  $ unemployment: num [1:180] 0.268 0.492 0.375 0.492 0.4 ...
##  $ population  : num [1:180] 34900 35500 35400 36500 34900 36200 33000 34500 32600 33700 ...

Mutate a New Variable

This data can now be used to find an approximate figure for how many people in each age group are unemployed. This is done with a mutate that multiplies the unemployment and population columns.

innerCyprus <- innerCyprus %>% mutate(number = unemployment * population)
head(innerCyprus)
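
As a rough check on the new variable, the calculation can be reproduced by hand for the first row shown in the structure output above (female, 15-19, 2010). This sketch uses the values as displayed there (0.268 and 34900), which may be rounded, so the result is approximate; the variable names are hypothetical.

```r
# Values as displayed in the str() output for the first row
unemployment_rate <- 0.268   # proportion unemployed
population        <- 34900   # people in the age group

# Approximate number of unemployed people = rate * population
number_unemployed <- unemployment_rate * population

round(number_unemployed)     # about 9353 people
```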

Data

Using this data, it is now possible to isolate particular observations and derive a range of statistics, such as summary statistics on the number of unemployed over time.

innerCyprus %>%
  filter(AgeGroup == "15-19") %>%
  group_by(year) %>%
  summarise(mean = mean(number),
            min = min(number),
            max = max(number))

## `summarise()` ungrouping output (override with `.groups` argument)

innerCyprus %>% filter(Sex == "female", AgeGroup == "15-19", year == 2010)

The structure of the data would also allow similar data to be imported for other countries in Europe, as the Country variable is a factor with 35 levels, corresponding to all the countries in the original European unemployment dataset.

levels(innerCyprus$Country)

##  [1] "Austria"                                         
##  [2] "Belgium"                                         
##  [3] "Bulgaria"                                        
##  [4] "Croatia"                                         
##  [5] "Cyprus"                                          
##  [6] "Czechia"                                         
##  [7] "Denmark"                                         
##  [8] "Estonia"                                         
##  [9] "Finland"                                         
## [10] "France"                                          
## [11] "Germany (until 1990 former territory of the FRG)"
## [12] "Greece"                                          
## [13] "Hungary"                                         
## [14] "Iceland"                                         
## [15] "Ireland"                                         
## [16] "Italy"                                           
## [17] "Latvia"                                          
## [18] "Lithuania"                                       
## [19] "Luxembourg"                                      
## [20] "Malta"                                           
## [21] "Montenegro"                                      
## [22] "Netherlands"                                     
## [23] "North Macedonia"                                 
## [24] "Norway"                                          
## [25] "Poland"                                          
## [26] "Portugal"                                        
## [27] "Romania"                                         
## [28] "Serbia"                                          
## [29] "Slovakia"                                        
## [30] "Slovenia"                                        
## [31] "Spain"                                           
## [32] "Sweden"                                          
## [33] "Switzerland"                                     
## [34] "Turkey"                                          
## [35] "United Kingdom"

Outliers

There are three numeric variables that need to be scanned for outliers: unemployment, population, and number (of unemployed).

To get a general idea of the distribution of the data, a scatter plot will be used.

scatterplot(unemployment ~ year | AgeGroup, data = innerCyprus)

## Warning in scatterplot.default(X[, 2], X[, 1], groups = X[, 3], xlab = xlab, : number of groups exceeds number of available colors
##   colors are recycled

There is variability over time within the data; however, a clear trend appears in which younger people tend to have higher rates of unemployment than older people.

A boxplot of unemployment rates against year suggests that there are a number of potential outliers with higher rates of unemployment. The dots indicate that, under Tukey’s method of outlier detection, a number of univariate outliers can be found within this dataset. Using Tukey’s method here is valid, as unemployment is not normally distributed, as will be discussed below.

ggplot(innerCyprus, aes(x=year, y=unemployment)) + geom_boxplot()

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

To make a general scan of unemployment, a boxplot faceted by AgeGroup is also useful. Here we can see the general trend of higher levels of unemployment amongst younger people, as well as greater variability. Further, the presence of dots in the age groups 15-19 and 60-64 suggests that there may be outliers that should be investigated.

ggplot(innerCyprus, aes(x=year, y=unemployment)) + geom_boxplot() + facet_wrap(~AgeGroup)

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
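
The dots on a boxplot follow, approximately, Tukey’s rule: a value is flagged when it lies more than 1.5 times the interquartile range beyond the first or third quartile. A minimal sketch of that rule on a made-up vector is given below; tukey_outliers is a hypothetical helper, not a function from any package used in this report, and the quartiles from quantile() can differ slightly from the hinges a boxplot uses.

```r
# Flag values outside the Tukey fences [Q1 - k * IQR, Q3 + k * IQR]
tukey_outliers <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Made-up values with one obviously extreme observation
x <- c(1, 2, 3, 4, 5, 100)

which(tukey_outliers(x))   # position of the flagged value
```

Because the rule relies on quartiles rather than the mean and standard deviation, it does not assume the data is normally distributed.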

If we take all observations across the whole time period together, the z-score method clearly finds univariate outliers.

z.scores <- innerCyprus$unemployment %>% scores(type = "z")
summary(z.scores)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.97066 -0.56825 -0.36310  0.00000  0.07087  3.94507

which(abs(z.scores) > 3)

## [1]  6  8 10 12

Examining these z-scores, it is clear that all suggested outliers are found amongst males aged 15-19. This may not be an appropriate measure, however: while these scores are outliers for the whole population, they may not be anomalous within their own age bracket. When z-scores are calculated within the specific age bracket, those same observations are no longer outliers. The same is true of the 60-64 age bracket when only observations of like age are considered.

C1<- innerCyprus %>% filter(AgeGroup == "15-19")

z.scores2 <- C1$unemployment %>% scores(type = "z")
summary(z.scores2)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.93103 -0.63527  0.01059  0.00000  0.27163  1.82569

which(abs(z.scores2) > 3)

## integer(0)

C2<- innerCyprus %>% filter(AgeGroup == "60-64")

z.scores3 <- C2$unemployment %>% scores(type = "z")
summary(z.scores3)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.39190 -0.44485 -0.20809  0.00000  0.08938  2.61483

which(abs(z.scores3) > 3)

## integer(0)

As such, when considering the unemployment variable as a whole, there appear to be numerous outliers; once age is taken into account, these observations fall within normal bounds for the dataset.
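
The per-bracket check can be done for every age group at once by computing z-scores within groups. This is a sketch on made-up data using the standard definition (x - mean) / sd; the tibble toy and the columns z_overall and z_group are hypothetical names.

```r
library(dplyr)

# Toy data (made-up values): a high-unemployment and a low-unemployment group
toy <- tibble(
  AgeGroup     = rep(c("15-19", "40-44"), each = 5),
  unemployment = c(0.30, 0.35, 0.40, 0.45, 0.50,
                   0.04, 0.05, 0.06, 0.07, 0.08)
)

z <- toy %>%
  # z-score relative to all observations
  mutate(z_overall = (unemployment - mean(unemployment)) / sd(unemployment)) %>%
  # z-score relative to observations in the same age group
  group_by(AgeGroup) %>%
  mutate(z_group = (unemployment - mean(unemployment)) / sd(unemployment)) %>%
  ungroup()

# Observations that are extreme within their own age group
filter(z, abs(z_group) > 3)
```

Keeping both columns side by side makes it easy to see which observations are extreme only because their whole age group sits far from the overall mean.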

Distribution of Variables

Here the unemployment variable will be assessed for normality.

First examine how unemployment is distributed across the whole population.

hist(innerCyprus$unemployment)

It is clear that this data is heavily right-skewed. This is consistent with the boxplot generated above. The variable could be brought closer to normal by a log transformation, a root transformation, or the Box-Cox transformation.

# Natural log transformation
ln_unemployment <- log(innerCyprus$unemployment)
hist(ln_unemployment)

# Fifth-root transformation
inv_unemployment <- (innerCyprus$unemployment)^(1/5)
hist(inv_unemployment)

# Box-Cox transformation with an automatically selected lambda
boxcoxUnemployment <- BoxCox(innerCyprus$unemployment, lambda = "auto")
hist(boxcoxUnemployment)

For this variable, the Box Cox transformation is the most successful at normalising the data.
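
Histograms give a visual impression only; a numeric summary such as sample skewness, or a Shapiro-Wilk test, can back up the comparison between transformations. The sketch below runs on simulated right-skewed data, not on the unemployment variable, and skewness is a hypothetical helper written out here rather than taken from a package.

```r
set.seed(1)

# Simulated right-skewed data standing in for a variable like unemployment
x <- rlnorm(500, meanlog = -2, sdlog = 1)

# Sample skewness: third central moment divided by variance^(3/2)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- skewness(x)        # strongly positive for right-skewed data
skew_log <- skewness(log(x))   # much closer to zero after the log

c(raw = skew_raw, log = skew_log)

# Shapiro-Wilk test of normality on the transformed values
shapiro.test(log(x))
```

A skewness close to zero and a Shapiro-Wilk p-value that is not small are both consistent with approximate normality, though neither replaces looking at the histogram.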

References

CYSTAT 2019, Demographic statistics 2018 (EN), Populations, Population and Social Conditions, data file, Republic of Cyprus, Ministry of Finance Statistical Service, Nicosia, viewed 2 November 2020, https://www.mof.gov.cy/mof/cystat/statistics.nsf/populationcondition_21main_en/populationcondition_21main_en?OpenForm&sub=1&sel=4#

EUROSTAT 2020, Employment and Unemployment, Labour Force Survey, data file, European Union, Eurostat, Luxembourg City, viewed 2 November 2020, https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=lfsa_urgan&lang=en

Data Wrangling Assignment 2

Dylan Carlyon Stewart s3872242

03/11/2020
