“The problem is that the mechanics of regression analysis are not the hard part; the hard part is determining which variables ought to be considered in the analysis and how can best be done,” says Charles Wheelan in Naked Statistics (2012). And “adding too many variables” is not a solution, but a problem.

So I started by reading which are the “variables” that affect the decision for EU young people to leave their parents’ home. Sociologists mainly point out cultural factors, the labour market, wages, unemployment rate, temporary employment.

Digging into Eurostat, I could find the parameters that translate some of those factors into quantifiable information: the unemployment rate, the involuntary part-time rate, and the temporary jobs rate. However, I spent a long time looking for datasets which could represent the “cultural factors” that experts refer to, as well as the unit to compare wages among EU countries.

I found the PPS (Purchase Power Standard), a measure created by the EU to compare euros within the EU. I also include the higher education rate, the marriage age and the minimum salary per country.

CLARIFICATION: This exercise is part of my final project of the MA in Data Journalism that I am studying at Birmingham City University. Conclusions are not decisive.

Preparing data

youth_age <- read.csv("youth_age.csv") #Age in which young people leave home
involuntary_part_time <- read.csv("involuntary_part_time.csv") #Percentage of young people doing involuntary part-time
unemployment <- read.csv("unemployment.csv") #Unemployment rate
youth_education <- read.csv("youth_education.csv") #Terciary education 
living_parents <- read.csv("share_living_parents.csv") #Share of people living with their parents (this is part of the age) 
temporary <- read.csv("temporary_work.csv") #Temporary jobs in young people
low_wages <- read.csv("low_wages.csv") #Share people less than 30 earning a salary less than two thirds or less of their national median gross hourly earnings
library(readxl)
marriage <- read_excel("marriage.xlsx", sheet = 1) #Young people aged 16-29 who were married or in a consensual union (with or without legal basis) in 2013
medianhourly <- read.csv("medianhourly.csv") #Median hourly salary in PPS for less than 30. 
minimumwages <- read.csv("minimumwages.csv") #Minimum wages in PPS

#Selecting one common year and the countries I want to analyse
#Half of the datasets include up to 2017, but many of them are incompleted. That is why I change to 2016, althougth the title says 17
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
youth_age17 <- youth_age %>% filter(TIME == 2016 & SEX == "Total" & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)")) 

part_time17 <- involuntary_part_time %>% filter(TIME == 2016 & SEX == "Total" & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)", "European Union (15 countries)", "Euro area (18 countries)", "Euro area (17 countries)") & AGE != "From 15 to 19 years")

living_parents17 <- living_parents %>% filter(TIME == 2016 & SEX == "Total" & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)", "Euro area (18 countries)", "European Union (EU6-1972, EU9-1980, EU10-1985, EU12-1994, EU15-2004, EU25-2006, EU27-2013, EU28)", "Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19)", "Serbia") & AGE != "From 15 to 19 years")

temporary17 <- temporary %>% filter(TIME == 2016 & SEX == "Total" & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)", "European Union (15 countries)", "Euro area (18 countries)", "Euro area (17 countries)") & AGE != "From 15 to 19 years" & C_BIRTH != "EU15 countries except reporting country")

unemployment17 <- unemployment %>% filter(TIME == 2016 & SEX == "Total" & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)", "European Union (15 countries)", "Euro area (18 countries)", "Euro area (17 countries)") & AGE != "From 15 to 19 years")

youth_education17 <- youth_education %>% filter(TIME == 2016 & SEX == "Total" & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)","Euro area (18 countries)") & AGE != "From 15 to 19 years")

low_wages14 <- low_wages %>% filter(TIME == 2014 & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)","Euro area (18 countries)", "Euro area (17 countries)", "Iceland", "Switzerland", "Norway", "Serbia"))

marriage13 <- na.omit(marriage)
marriage13 <- marriage13 %>% filter(!Countries %in% c("Iceland", "Switzerland", "Norway", "EU-28"))
marriage13 <- marriage13 %>% select(Countries, Married, `Consensual union with legal basis`, `Consensual union without legal basis`) %>% mutate(Consensual = `Consensual union with legal basis`+ `Consensual union without legal basis`)

medianhourly14 <- medianhourly %>% filter(TIME == 2014 & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)","Euro area (18 countries)", "Euro area (17 countries)", "Iceland", "Switzerland", "Norway", "Serbia") & AGE == "Less than 30 years")

minimumwages16 <- minimumwages %>% filter(TIME == "2016S2" & !GEO %in% c("European Union (current composition)", "European Union (before the accession of Croatia)", "Montenegro", "Former Yugoslav Republic of Macedonia, the", "Turkey", "Euro area (19 countries)","Euro area (18 countries)", "Euro area (17 countries)", "Iceland", "Switzerland", "Norway", "Serbia", "United States", "Albania"))
class(marriage13$Consensual)
## [1] "numeric"
class(medianhourly14$Value)
## [1] "factor"
medianhourly14$Value <- as.character(medianhourly14$Value)
medianhourly14$Value <- as.numeric(medianhourly14$Value)

class(medianhourly14$Value)
## [1] "numeric"

Eurostat uses “:” instead of NA to inform about missing data. There are several datasets with missing values with which I will deal later.

youth_age17$GEO <- gsub("(until 1990 former territory of the FRG)", "", youth_age17$GEO)

Who is young?

Visualisualising the youth_age dataset, I got that the most common age for leaving the parents’ home is between 25 and 30. There are, however, four countries whose age is over 30.

#Change the value from factor to number
youth_age17$Value <- as.numeric(as.character(youth_age17$Value))
hist(youth_age17$Value, xlim = c(15,35), breaks = 5, main = "Age young people leaving parents' home", xlab = "Age", ylab = "Number of EU countries")

summary(youth_age17$Value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.70   24.18   26.90   26.57   29.10   31.80
sd(youth_age17$Value)
## [1] 3.087565

As there is no extreme values (outliers, considered any data point that lies outside the 1.5*IQR) the mean and the median is quite similar, and the “variation” or “spreadability” of this data is low.

With a box-plot I got an overview of the distribution of the ages within the 28 countries.

boxplot(youth_age17$Value, main = "Age for leaving home", ylab = "Age")

library(ggplot2)
ggplot(youth_age17, aes(x = reorder(GEO, -Value), y = Value)) + geom_bar(stat = "identity", fill="darkblue", position = "dodge") + coord_flip() + geom_text(aes(label=Value), vjust=-7, size=2)+
theme_minimal() + labs(title="Age leaving home", x="Countries", y = "Age")

Relationship with one variable

I will compare the age with the unemployment rate to see if there is any relationship. I first adapt the unemployment dataset to have one observation per country.

Likewise the unemployment figures, the other datasets present the information broken-down by groups of age. I will select the 20-29 one, as the range of ages covered is more similar to the leaving-home-age. Moreover, taking into account the Bologna plan, the higher education lasts three years, until 22ish years old.

unemployment17_20_29 <- unemployment17 %>% filter(AGE == "From 20 to 29 years" & !GEO %in% c("Iceland", "Switzerland", "Norway")) 

#From factor to numeric
unemployment17_20_29$Value <- as.numeric(as.character(unemployment17_20_29$Value))
youth_age17$Value <- as.numeric(as.character(youth_age17$Value))

My dependent variable (Y) is age (the event expected to change when the other variable -independent- is manipulated) and the independent (X) in this case is the unemployment ratio. I would like to know if both are related and if changing the unemployment ratio would affect the emancipation age.

plot(unemployment17_20_29$Value, youth_age17$Value, main = "Age and Unemployment", ylab = "Age", xlab = "Unemployment rate", col = "red")
abline(lm(youth_age17$Value ~ unemployment17_20_29$Value), col = "blue")

I can easily see that there is no relationship. I can also calculate the correlation coefficient, a measure that suggests the dependence between two variables (between -1 and 1), and the linear model.

cor(youth_age17$Value, unemployment17_20_29$Value)
## [1] 0.3925254
#Linear models
lm(youth_age17$Value ~ unemployment17_20_29$Value)
## 
## Call:
## lm(formula = youth_age17$Value ~ unemployment17_20_29$Value)
## 
## Coefficients:
##                (Intercept)  unemployment17_20_29$Value  
##                    24.2361                      0.2374
summary(lm(youth_age17$Value ~ unemployment17_20_29$Value))
## 
## Call:
## lm(formula = youth_age17$Value ~ unemployment17_20_29$Value)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.506 -1.403 -0.316  1.907  6.448 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 24.2361     1.2059  20.098   <2e-16 ***
## unemployment17_20_29$Value   0.2374     0.1091   2.176   0.0388 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.894 on 26 degrees of freedom
## Multiple R-squared:  0.1541, Adjusted R-squared:  0.1215 
## F-statistic: 4.736 on 1 and 26 DF,  p-value: 0.03882

Before analysing the coefficients I will do a multiple regression with the variables from the other datasets.

Multiple regressions

I will first prepare the dataset.

part_time17_20_29 <- part_time17 %>% filter(AGE == "From 20 to 29 years" & !GEO %in% c("Iceland", "Switzerland", "Norway")) 

temporary17_20_29 <- temporary17 %>% filter(AGE == "From 20 to 29 years" & !GEO %in% c("Iceland", "Switzerland", "Norway")) 

youth_education17_20_29 <- youth_education17 %>% filter(AGE == "From 20 to 29 years" & !GEO %in% c("Iceland", "Switzerland", "Norway") & ISCED11 == "Tertiary education (levels 5-8)") 
class(part_time17_20_29$Value)
## [1] "factor"
class(temporary17_20_29$Value)
## [1] "factor"
class(youth_education17_20_29$Value)
## [1] "factor"
class(youth_age17$GEO)
## [1] "character"
class(low_wages14$Value)
## [1] "factor"
part_time17_20_29$Value <- as.numeric(as.character(part_time17_20_29$Value))
## Warning: NAs introducidos por coerci'on
temporary17_20_29$Value <- as.numeric(as.character(temporary17_20_29$Value))
youth_education17_20_29$Value <- as.numeric(as.character(youth_education17_20_29$Value))
low_wages14$Value <- as.numeric(as.character(low_wages14$Value))
minimumwages16$Value <- as.character(minimumwages16$Value)
minimumwages16$Value <- as.numeric(gsub(",", "",minimumwages16$Value))
## Warning: NAs introducidos por coerci'on
class(part_time17_20_29$Value)
## [1] "numeric"
class(temporary17_20_29$Value)
## [1] "numeric"
class(youth_education17_20_29$Value)
## [1] "numeric"
class(low_wages14$Value)
## [1] "numeric"
class(minimumwages16$Value)
## [1] "numeric"
youthMR <- youth_age17 %>% select(GEO, Value)
colnames(youthMR)[2]<- "Age_leaving"
unemploymentMR <- unemployment17_20_29 %>% select(GEO, Value)
colnames(unemploymentMR)[2]<- "Unemployment_rate"
partTimeMR <- part_time17_20_29 %>% select(GEO, Value)
colnames(partTimeMR)[2]<- "Involuntary_partTime"
temporaryMR <- temporary17_20_29 %>% select(GEO, Value)
colnames(temporaryMR)[2]<- "Temporary_job"
educationMR <- youth_education17_20_29 %>% select(GEO, Value) 
colnames(educationMR)[2] <- "Terciary_education"
lowwagesMR <- low_wages14 %>% select(GEO, Value)
colnames(lowwagesMR)[2] <- "Share_low_wages"
marriageMR <- marriage13 %>% select(Countries, Married, Consensual)
colnames(marriageMR)[1]<- "GEO"
medianhourlyMR <- medianhourly14 %>% select(GEO, Value)
colnames(medianhourlyMR)[2]<- "Median_hourly"
minimumwageMR <- minimumwages16 %>% select(GEO,Value) 
colnames(minimumwageMR)[2] <- "Minimum_wage"

Creating a single dataframe

newdata <- cbind(youthMR, unemploymentMR, partTimeMR, temporaryMR, educationMR, lowwagesMR, medianhourlyMR, minimumwageMR)
colnames(newdata)
##  [1] "GEO"                  "Age_leaving"          "GEO"                 
##  [4] "Unemployment_rate"    "GEO"                  "Involuntary_partTime"
##  [7] "GEO"                  "Temporary_job"        "GEO"                 
## [10] "Terciary_education"   "GEO"                  "Share_low_wages"     
## [13] "GEO"                  "Median_hourly"        "GEO"                 
## [16] "Minimum_wage"
newdata <- newdata[,c(1,2,4,6,8,10,12,14,16)]
newdata$GEO <- gsub("(until 1990 former territory of the FRG)", "", newdata$GEO)
newdata$GEO <- gsub(" ()","", newdata$GEO)
newdata$GEO
##  [1] "Belgium"       "Bulgaria"      "CzechRepublic" "Denmark"      
##  [5] "Germany()"     "Estonia"       "Ireland"       "Greece"       
##  [9] "Spain"         "France"        "Croatia"       "Italy"        
## [13] "Cyprus"        "Latvia"        "Lithuania"     "Luxembourg"   
## [17] "Hungary"       "Malta"         "Netherlands"   "Austria"      
## [21] "Poland"        "Portugal"      "Romania"       "Slovenia"     
## [25] "Slovakia"      "Finland"       "Sweden"        "UnitedKingdom"
marriageMR$GEO <- gsub("Germany","Germany()", marriageMR$GEO)
marriageMR$GEO <- gsub("Czech Republic","CzechRepublic", marriageMR$GEO)
marriageMR$GEO <- gsub("United Kingdom","UnitedKingdom", marriageMR$GEO)
marriageMR$GEO 
##  [1] "Finland"       "UnitedKingdom" "France"        "Estonia"      
##  [5] "Sweden"        "Denmark"       "Latvia"        "Poland"       
##  [9] "Belgium"       "Bulgaria"      "Luxembourg"    "Cyprus"       
## [13] "Romania"       "Netherlands"   "Germany()"     "Austria"      
## [17] "CzechRepublic" "Lithuania"     "Portugal"      "Malta"        
## [21] "Hungary"       "Ireland"       "Croatia"       "Spain"        
## [25] "Slovenia"      "Slovakia"      "Italy"         "Greece"
newdata <- merge(newdata, marriageMR, by = "GEO")

Once I have a single dataframe with all the information, I plot the data to see relationship between the variables.

newdata_plot <- newdata %>% select(Age_leaving, Unemployment_rate, Involuntary_partTime, Temporary_job, Terciary_education, Share_low_wages, Median_hourly, Married, Consensual)
plot(newdata_plot, pch=1, col = "orange", main = "Relationship between variables Youth leaving home")

The plot above shows the relationship among every variable. However, I am interested in how each of those variables affect the “Age_leaving_home.” Hence, I use the correlation coeffient.

newdata$Age_leaving <- as.numeric(as.character(newdata$Age_leaving))
cor(newdata$Age_leaving, newdata$Unemployment_rate)
## [1] 0.3925254
cor(newdata$Age_leaving, newdata$Involuntary_partTime, use = "complete.obs") # "use complete.obs" to avoid NA values
## [1] 0.3404984
cor(newdata$Age_leaving, newdata$Temporary_job)
## [1] 0.09024109
cor(newdata$Age_leaving, newdata$Terciary_education)
## [1] -0.1779237
cor(newdata$Age_leaving, newdata$Share_low_wages)
## [1] 0.1844291
cor(newdata$Age_leaving, newdata$Median_hourly)
## [1] -0.621371
cor(newdata$Age_leaving, newdata$Married)
## [1] 0.2511776
cor(newdata$Age_leaving, newdata$Consensual)
## [1] -0.7752898

It goes from -1 to 1. 0 means no correlation. But the figure that shows a “strong” correlation varies depending on the author. This tutorial from Statstutor considers a “strong correlagion” from 0.6, while the website Dummies sets that limit in 0.7, and Explorable in 0.5.

I can conclude that the median hourly and the marriage age are strong correlated with the age of leaving home.

I run now the multiple regression and use the summary function to print the output.

newdata <- newdata %>% filter(!GEO %in% c("Lithuania", "Estonia")) #Cleaning uncompleted countries to avoid distorsions. The regression analysis warns you that you have missing data. For the purpose of this exercise I exclude the two countries with incompleted information. 
summary(lm(Age_leaving ~ Involuntary_partTime + Unemployment_rate + Temporary_job + Terciary_education + Share_low_wages + Median_hourly + Married + Consensual, data = newdata, na.action = "na.omit"))
## 
## Call:
## lm(formula = Age_leaving ~ Involuntary_partTime + Unemployment_rate + 
##     Temporary_job + Terciary_education + Share_low_wages + Median_hourly + 
##     Married + Consensual, data = newdata, na.action = "na.omit")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4939 -0.7495 -0.1415  1.0577  2.5819 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          35.15040    2.09198  16.802 5.05e-12 ***
## Involuntary_partTime -0.02138    0.02084  -1.026  0.31931    
## Unemployment_rate     0.16718    0.10127   1.651  0.11713    
## Temporary_job        -0.01146    0.02190  -0.523  0.60739    
## Terciary_education    0.01595    0.04352   0.367  0.71841    
## Share_low_wages      -0.07913    0.03124  -2.533  0.02145 *  
## Median_hourly        -0.41411    0.12325  -3.360  0.00372 ** 
## Married              -0.05250    0.09263  -0.567  0.57826    
## Consensual           -0.29091    0.05044  -5.767 2.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.429 on 17 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.7929 
## F-statistic: 12.97 on 8 and 17 DF,  p-value: 7.405e-06
format(4.40e-05, scientific = FALSE)
## [1] "0.000044"
format(4.255e-05, scientific = FALSE)
## [1] "0.00004255"

Interpreting the coefficients

According to Naked statistics’ author, “for any regression coefficient, you will be generally interested in three things: sign, size, and significance.”

Sign: positive or negative. It tells you the direction of the relationship.

Size: The number that says how big the relationship is. It means that if the independent variable (all the one I have tested) changes one (the unit depends on the variable’s unit), the dependent one would change the number of the coefficient for this variable.

For example: if the unemployment rate increases one percentage point (because my variable is in percentage), the age at which people leave their parents’ home would increase by 0.168 years. If it will increase by two percentages points, the dependent variable (age) would rise two times 0.168.

The formula (written in my way) is:

DEPENDENT VARIABLE = INTERCEPT + COEFFICIENT_FIRST_INDEPENDENT_VARIABLECHANGE_IN_INDEPENDENT_VARIABLE + COEFFICIENT_SECOND_INDEPENDENT_VARIABLECHANGE_IN_INDEPENDENT_VARIABLE…

AGE = 36.4868125 + (-0.0244483)1 + 0.16767011 + …

Significance: The p-value. Each independent variable has a p-value (“Pr(>|t|”). If it is under 0.05 is considered “statistically significant” and we can “reject the null hypothesis,” or in other words, consider that there is a relationship between the independent variable and the dependent variable. In my case, I’ve got three variables that meet that requirement.

But there are other parameters to test my model and the results.

  • R square: How much of the variation of my dependent variable (age) is explained by those independent variables I have used. As it is a multiple regression, I used the adjusted number (80.21%).

  • F-Statistic: If the R square measure the “strength” of the relationship, the F-Statistic tells you whether that relationship is “statistically significant,” explained Shantanu Deu. The further from 1, the better. But there is another test to judge this number.

“You should not reject the null if your critical f value is smaller than your F Value, unless you also have a small p-value,” says in the blog Statistics How To. The critical f value is what R prints as F-Statistic. The f value can be calculated with this table, using the numbers that follow the F-Statistic in R. In my case 12.14 < 2.69. So I can reject the null hypothesis (my variables affect the age).

  • P-value: The one that is in the last line tells whether the group of variables is “statistically significant.”

  • Standard Error (SE): This acts as the standard deviation. So, following the 68-95-99 rule, in 95 out of 100 times, the coefficients would be within two SE. My 95% confidence interval for the marriage variable is -.33 and -.31. 95 times out of 100 (or 19 in 20) the true population parameter will be between those two values. Those are also no equal to 0, so I can reject the null hypothesis that there is no association between age for leaving home and the age of marriage.

Conclussion

As Charles Wheelan warned, running a regression is easy. The difficulties come with selecting the parameters, testing your models and understanding the results.

After reflecting on the parameters, the assessment of my model tells me that all the parameters together are related to the age at which young people leave home. Mainly, I look at the three “statistically significant.”

plot(newdata$Consensual, newdata$Age_leaving ,main = "Age and Consensual unions", xlab = "Percentage of consensual unions > 30", ylab = "Age for leaving home", col = "orange")
abline(lm(newdata$Age_leaving ~ newdata$Consensual), col = "blue")

plot(newdata$Median_hourly, newdata$Age_leaving ,main = "Age and Median hourly salary", xlab = "Median Hourly salary in PPS", ylab = "Age for leaving home", col = "orange")
abline(lm(newdata$Age_leaving ~ newdata$Median_hourly), col = "blue")

plot(newdata$Share_low_wages, newdata$Age_leaving ,main = "Age and low salaries", xlab = "Share of 20 - 29 earning less than 2/3 of median gross hourly earnings (PPS)", ylab = "Age for leaving home", col = "orange")
abline(lm(newdata$Age_leaving ~ newdata$Share_low_wages), col = "blue")

Some warnings:

Bibliography used: Wheelan, C. (2012) Naked statistics: stripping the dread from the data. New York: W. W. Norton Company

Rego, F. (2015) Quick guide: interpreting simple linear model output in R, Felipe Rego [blog] Oct 23. Available at: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R [Accessed June 25]

Statistics How To (2018) F Statistic / F Value: Simple Definition and Interpretation, Statistics How To, June 25 Available at: http://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/ [Accessed July 10]

Flowing Data (2008) How to Read and Use a Box-and-Whisker Plot. Available at: https://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/ [Accessed July 8]

Prabhakaran, S. (n.d.) Linear regression, R-statistics. Available at: http://r-statistics.co/Linear-Regression.html [Accessed July 8]