An analysis of data on employment rate in Scotland

Summary

In this report, we investigate data on demographics in Scotland, particularly relating to the employment rate and the no qualifications ratio, focusing especially on the cities of Edinburgh and Glasgow. The report uses exploratory methods such as summary statistics and graphical techniques. The results indicate that employment deprivation is higher in Glasgow than in Edinburgh, and that there are relationships between employment deprivation and health, education and geographic access to services.

Introduction

The data used in this report are imported from the Scottish Data CSV file. The aim is to understand the relationships between different variables within the data set and to highlight the differences in employment rates and no qualifications between Glasgow and Edinburgh. The findings may be useful for policymakers and researchers interested in understanding the factors that contribute to deprivation in Scotland. Previous studies suggest Glasgow is more deprived than Edinburgh.

The data include various variables for different “data zones” in Scotland. Data zones are small geographical areas typically containing a population of between 500 and 1000 people. There are 6,976 data zones in total in Scotland, but the analysis in this report focuses on a sample of 400 of them, of which 100 are from Glasgow City, 100 are from City of Edinburgh, and 200 are from elsewhere in Scotland. The data set contains many different variables, of which we will focus in particular on Attainment, Attendance, Alcohol, Broadband and no qualifications.

Methods

We use various methods of exploratory data analysis, including box plots, histograms and Q-Q plots, to show the distribution of the data. We then use the sample means and variances to perform hypothesis tests investigating whether the population means and variances are the same in different council areas. We use exploratory plots and correlation coefficients to determine the relationships between variables. Calculations are performed using the statistical software R.

Results

In this section, we focus on the available data for Glasgow and Edinburgh, for each of which the data set contains 100 observations (data zones).

For these data zones, the following boxplots summarise the distribution of the employment rate variable:

These boxplots show that the percentage of people who are employment deprived is much higher in Glasgow than in Edinburgh. The median employment rate is 0.155 (3dp) in Glasgow and 0.065 (3dp) in Edinburgh. There are six outliers (extreme values) in Edinburgh, whereas Glasgow has none. The range in Glasgow is much larger than in Edinburgh.

Figure 1: Box plot for Employment rate in different council areas.

To investigate this in more detail the following histograms summarise the distribution of the employment rate variable:

These histograms suggest that the Glasgow data roughly follow a normal distribution for employment rate. The Edinburgh data are skewed to the right, so the distribution is not symmetrical; this may reflect natural variation between the data zones of Edinburgh.

Figure 2: Histogram for Employment rate in different council areas.

The normal Q-Q plot for Edinburgh confirms that the data are skewed to the right, as the points curve upwards. For Glasgow, the data roughly follow a normal distribution, with the points in the Q-Q plot mostly lying on the line. There is some curvature in opposite directions at the two ends, suggesting there are some extreme values.

Figure 3: Q-Q plot for Employment rate in different council areas.

Summary statistics confirm that the employment rate (the percentage of people who are employment deprived) is higher in Glasgow than in Edinburgh.

Figure 4: Summary statistics for Employment rate in Edinburgh and Glasgow.
Council_area	Mean	Variance
Glasgow City	0.1649	0.0086131
City of Edinburgh	0.0907	0.0064874

Comparing employment rate between the cities of Glasgow and Edinburgh

As further numerical summaries of overall employment rate and variability in employment rate, we compute the sample means and variances respectively. The sample mean employment rates for Glasgow and Edinburgh are $\bar{x}_G =$ 0.1649 and $\bar{x}_E =$ 0.0907, and the sample variances are $s^2_G =$ 0.0086131 and $s^2_E =$ 0.0064874.

To investigate whether or not the variance of employment rate is the same in Glasgow and Edinburgh, we consider the null and alternative hypotheses $H_0: \sigma^2_G = \sigma^2_E$ vs $H_1: \sigma^2_G \neq \sigma^2_E$. As there are two independent populations, we can use a test based on the statistic $\frac{S_G^2}{S_E^2}$, where $S_G^2$ and $S_E^2$ are the sample variances of employment rate in Glasgow and Edinburgh, respectively. If $H_0$ is true, and supposing the data are normally distributed, then $\frac{S_G^2}{S_E^2}$ has an $F$ distribution with $n_G - 1 = 99$ and $n_E - 1 = 99$ degrees of freedom, where $n_G$ and $n_E$ are the sample sizes in Glasgow and Edinburgh, respectively. The value of this $F$ statistic that we compute from the data is $f = 0.0086131/0.0064874 = 1.328$. At the 5% significance level ($\alpha = 0.05$), the lower and upper critical values of the null distribution are $q_{\alpha/2} \approx 0.67$ and $q_{1 - \alpha/2} \approx 1.49$, respectively. Since $f$ falls inside this interval, we do not reject the null hypothesis: there is no significant evidence at the 5% level that the variance of employment rate differs between Glasgow and Edinburgh. This test relies on normality, which is doubtful for the skewed Edinburgh data.
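The comparison of $f$ with the critical values of the $F$ distribution can be checked directly from the summary statistics. The following is a minimal sketch, assuming only the sample variances and sample sizes reported above; the variable names (`s2_G`, `crit`, `reject`) are our own and are not part of the original analysis code.

```r
# Sample variances and sample sizes as reported in the summary table
s2_G <- 0.0086131
s2_E <- 0.0064874
n_G <- 100
n_E <- 100
alpha <- 0.05

# Observed F statistic: ratio of the two sample variances
f <- s2_G / s2_E

# Lower and upper critical values of the F distribution under H0
crit <- qf(c(alpha / 2, 1 - alpha / 2), df1 = n_G - 1, df2 = n_E - 1)

# Two-sided p-value: twice the smaller tail probability
lower_tail <- pf(f, df1 = n_G - 1, df2 = n_E - 1)
p_value <- 2 * min(lower_tail, 1 - lower_tail)

# H0 is rejected only if f lies outside the interval (crit[1], crit[2])
reject <- f < crit[1] || f > crit[2]

cat("f =", round(f, 3), "\n")
cat("critical values =", round(crit, 3), "\n")
cat("p-value =", round(p_value, 3), " reject H0:", reject, "\n")
```

When the raw data are available, the built-in function `var.test()` performs the same test without the intermediate steps.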

To investigate whether or not the mean of employment rate is the same in Glasgow and Edinburgh, we consider the null and alternative hypotheses $H_0: \mu_G = \mu_E$ vs $H_1: \mu_G \neq \mu_E$ and can use a test based on the $z$ statistic, as both sample sizes are large (greater than 30). We carry out the test at the 1% significance level.

With $n_G = 100$ and $n_E = 100$, both samples are large enough to use the $z$ statistic.

Let $\alpha = 0.01$.

$z = \frac{\bar{x}_G - \bar{x}_E}{\sqrt{s^2_G/n_G + s^2_E/n_E}} = \frac{0.1649 - 0.0907}{\sqrt{0.0086131/100 + 0.0064874/100}} = 6.038$

$z_{1-\alpha/2} = z_{0.995} = 2.58$

Since $6.038 > 2.58$, we reject $H_0$: there is significant evidence at the 1% level that the mean employment rates differ between Glasgow and Edinburgh.
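The two-sample $z$ calculation can be reproduced in R from the reported summary statistics. This is a minimal sketch under the large-sample assumption stated above; the variable names are our own.

```r
# Summary statistics for employment rate, as reported in the summary table
xbar_G <- 0.1649
xbar_E <- 0.0907
s2_G <- 0.0086131
s2_E <- 0.0064874
n_G <- 100
n_E <- 100
alpha <- 0.01

# Standard error of the difference between the two sample means
se <- sqrt(s2_G / n_G + s2_E / n_E)

# Observed z statistic and two-sided critical value
z <- (xbar_G - xbar_E) / se
z_crit <- qnorm(1 - alpha / 2)

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

cat("z =", round(z, 3), " critical value =", round(z_crit, 3), "\n")
cat("reject H0:", abs(z) > z_crit, "\n")
```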

Investigating the relationship between employment rate and other variables

In this section we consider together all of the observations in the data set, not distinguishing whether the observations were from Glasgow, Edinburgh or elsewhere in Scotland.

The following plot shows the relationships between the variables described in the Introduction section above.
Employment rate and Alcohol have the strongest positive correlation (0.75), and Employment rate and Attendance have the strongest negative correlation (-0.73). Employment rate and Attainment have a moderate negative correlation (-0.62). Employment rate and Broadband have only a weak negative correlation (-0.21).

Figure 5: Matrix of pairwise scatterplots and the corresponding correlation coefficients.

The corresponding numerical values of the Pearson sample correlation coefficients between these variables are as follows.
	Employment_rate	Attainment	Attendance	ALCOHOL	Broadband
Employment_rate	1.0000000	-0.6177307	-0.7283002	0.7549826	-0.2124862
Attainment	-0.6177307	1.0000000	0.6912380	-0.4936659	0.1593356
Attendance	-0.7283002	0.6912380	1.0000000	-0.5400995	0.2458537
ALCOHOL	0.7549826	-0.4936659	-0.5400995	1.0000000	-0.1874733
Broadband	-0.2124862	0.1593356	0.2458537	-0.1874733	1.0000000
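To compare the strength of these relationships, the correlations with Employment rate can be ranked by absolute value. The sketch below copies the first row of the table above by hand; the squared correlation is included because, in a simple linear regression, it equals the proportion of variance explained. The object names (`r`, `ranked`, `r_squared`) are our own.

```r
# Correlations of each variable with Employment_rate, copied from the table
r <- c(Attainment = -0.6177307,
       Attendance = -0.7283002,
       ALCOHOL    =  0.7549826,
       Broadband  = -0.2124862)

# Rank by strength of linear association, ignoring the sign
ranked <- r[order(abs(r), decreasing = TRUE)]

# Squared correlation: share of variance explained by a simple linear fit
r_squared <- round(ranked^2, 3)

print(ranked)
print(r_squared)
```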

We plot Employment rate against Attendance again, with an added line of best fit from a linear regression model. The plot shows a negative relationship, with a few outliers at an attendance of around 0.7.

Figure 6: Scatterplot for Employment rate and Attendance with a regression line.

The line of best fit is of the form Employment_rate = $\beta_0$ + $\beta_1 \times$ Attendance, where the least-squares estimates of $\beta_0$ and $\beta_1$ are 0.6323 and -0.6554, respectively.
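As an illustration of how the fitted line can be used, the sketch below evaluates it at a few attendance values. It assumes that Attendance is recorded as a proportion between 0 and 1, which is our reading of the scale of the estimates rather than something stated in the data description; the function name `predict_rate` is hypothetical.

```r
# Least-squares estimates reported above
beta0 <- 0.6323
beta1 <- -0.6554

# Fitted employment rate for a given attendance value
# (assumption: attendance is a proportion between 0 and 1)
predict_rate <- function(attendance) {
  beta0 + beta1 * attendance
}

attendance <- c(0.7, 0.8, 0.9)
predicted <- predict_rate(attendance)

print(data.frame(attendance = attendance, predicted = round(predicted, 4)))
```

Under this assumption, each increase of 0.1 in attendance lowers the fitted employment rate by about 0.066.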


Comparing no qualifications ratio between the cities of Glasgow and Edinburgh

In this section, we focus on the variable measuring working age people with no qualifications and compare it between Glasgow and Edinburgh, for each of which the data set contains 100 observations (data zones).

For these data zones, the following boxplots summarise the distribution of the no qualifications variable:

These boxplots show that the standardised ratio of people who have no qualifications is much higher in Glasgow than in Edinburgh. The median no qualifications ratio is 129 in Glasgow and 95 in Edinburgh. The range for Glasgow is much larger than for Edinburgh.

Figure 7: Boxplot for working age people with no qualifications in different council areas.

For these data zones, the following histograms summarise the distribution of the no qualifications variable:

These histograms suggest that the Glasgow data roughly follow a normal distribution for no qualifications. For Edinburgh, no conclusion can be drawn, as the histogram does not show a distinct distribution.

Figure 8: Histogram for working age people with no qualifications in different council areas.

The normal Q-Q plots for Glasgow and Edinburgh show that both data sets roughly follow a normal distribution, with the points mostly lying on the line. There is some curvature in opposite directions at the two ends, suggesting there are some extreme values.

Figure 9: Q-Q plot for working age people with no qualifications in different council areas.

Summary statistics confirm that the no qualifications ratio is higher in Glasgow than in Edinburgh. It should be noted, however, that the variance is very high in both cities: the data points are spread over a large range of values, so conclusions are harder to draw.

Figure 10: Summary statistics for no qualifications in Edinburgh and Glasgow
Council_area	Mean	Variance
Glasgow City	161.58542	3799.869
City of Edinburgh	89.73401	3189.923

Comparing No qualifications between the cities of Glasgow and Edinburgh

As further numerical summaries of the overall no qualifications ratio and its variability, we compute the sample means and variances, respectively. The sample mean no qualifications ratios for Glasgow and Edinburgh are $\bar{x}_G =$ 161.58542 and $\bar{x}_E =$ 89.73401, and the sample variances are $s^2_G =$ 3799.869 and $s^2_E =$ 3189.923.

To investigate whether or not the variance of no qualifications is the same in Glasgow and Edinburgh, we consider the null and alternative hypotheses $H_0: \sigma^2_G = \sigma^2_E$ vs $H_1: \sigma^2_G \neq \sigma^2_E$ and can use a test based on the $F$ (Fisher) distribution, as there are two independent populations.

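With $n_G = 100$, $n_E = 100$ and $\alpha = 0.05$, the observed statistic is $f = 3799.869/3189.923 = 1.191$ (3dp), which we compare with the $F$ distribution with 99 and 99 degrees of freedom.

The lower and upper critical values are approximately 0.67 and 1.49, respectively.

Since $0.67 < 1.191 < 1.49$, there is no significant evidence at the 5% level to reject $H_0$.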

To investigate whether or not the mean of no qualifications is the same in Glasgow and Edinburgh, we consider the null and alternative hypotheses $H_0: \mu_G = \mu_E$ vs $H_1: \mu_G \neq \mu_E$ and can use a test based on the $z$ statistic, as both sample sizes are large (greater than 30). We carry out the test at the 1% significance level.

With $n_G = 100$ and $n_E = 100$, both samples are large enough to use the $z$ statistic.

Let $\alpha = 0.01$.

$z = \frac{\bar{x}_G - \bar{x}_E}{\sqrt{s^2_G/n_G + s^2_E/n_E}} = \frac{161.58542 - 89.73401}{\sqrt{3799.869/100 + 3189.923/100}} = 8.594$

$z_{1-\alpha/2} = z_{0.995} = 2.58$

Since $8.594 > 2.58$, we reject $H_0$: there is significant evidence at the 1% level that the mean no qualifications ratios differ between Glasgow and Edinburgh.
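Both tests for the no qualifications variable can be recomputed from the reported summary statistics. The sketch below also adds a 99% confidence interval for the difference in means, which is not part of the original analysis but uses the same standard error as the $z$ test; all variable names are our own.

```r
# Summary statistics for no qualifications, as reported in the summary table
xbar_G <- 161.58542
xbar_E <- 89.73401
s2_G <- 3799.869
s2_E <- 3189.923
n_G <- 100
n_E <- 100

# F test for equal variances at the 5% level
f <- s2_G / s2_E
f_crit <- qf(c(0.025, 0.975), df1 = n_G - 1, df2 = n_E - 1)
reject_var <- f < f_crit[1] || f > f_crit[2]

# Large-sample z test for equal means at the 1% level
se <- sqrt(s2_G / n_G + s2_E / n_E)
z <- (xbar_G - xbar_E) / se
reject_mean <- abs(z) > qnorm(0.995)

# 99% confidence interval for the difference in means (Glasgow - Edinburgh)
ci_99 <- (xbar_G - xbar_E) + c(-1, 1) * qnorm(0.995) * se

cat("f =", round(f, 3), " reject equal variances:", reject_var, "\n")
cat("z =", round(z, 3), " reject equal means:", reject_mean, "\n")
cat("99% CI for difference:", round(ci_99, 2), "\n")
```

An interval that excludes zero corresponds to rejecting $H_0$ at the 1% level, so the two calculations should agree.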

Conclusions

We have investigated various characteristics of data available for geographical areas in Scotland called “data zones”, focusing in particular on Edinburgh and Glasgow. We conclude that employment deprivation is more prevalent in Glasgow than in Edinburgh. Glasgow also has a higher ratio of people with no qualifications than Edinburgh. These results are consistent with previous studies suggesting that Glasgow is more deprived than Edinburgh.

There are some limitations of the current study. One is that we assumed normality, which is questionable because the Edinburgh data are skewed for employment rate; this could introduce bias and make the hypothesis tests inaccurate. In addition, we have not yet investigated the relationship between employment rate and health outcomes, such as the comparative illness factor.

A future study could use stratified sampling to obtain a more representative sample, and could also increase n to produce more accurate and reliable results.
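As a rough illustration of stratified sampling with proportional allocation, the sketch below draws the same fraction of data zones from each council area. The sampling frame is entirely hypothetical (the stratum sizes, the column `zone_id` and the 10% sampling fraction are invented for illustration) and does not describe how the present sample was obtained.

```r
set.seed(1)

# Hypothetical sampling frame: one row per data zone, with its stratum
frame <- data.frame(
  zone_id = 1:1000,
  Council_area = rep(c("Glasgow City", "City of Edinburgh", "Elsewhere"),
                     times = c(150, 120, 730))
)

# Proportional allocation: sample the same fraction within every stratum
sampling_fraction <- 0.1
strata <- split(frame, frame$Council_area)
sampled <- do.call(rbind, lapply(strata, function(s) {
  k <- round(sampling_fraction * nrow(s))
  s[sample(nrow(s), size = k), , drop = FALSE]
}))

# Each stratum appears in the sample in proportion to its size in the frame
print(table(sampled$Council_area))
```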

Appendix

The R code to produce the analysis and plots in this report is as follows:

knitr::opts_chunk$set(echo = TRUE)
dat <- read.csv("scottishData (1).csv")

#Box plot for Edinburgh and Glasgow
dat_e <- dat[dat$Council_area %in% c("Glasgow City", "City of Edinburgh"),]

boxplot(Employment_rate ~ Council_area, data=dat_e,
        xlab= "Council area", ylab="Employment rate (%)",
        main="Employment rate between cities of Glasgow and Edinburgh")
# Histograms of employment rate for Edinburgh and Glasgow

par(mfrow = c(1, 2))

hist(dat_e$Employment_rate[dat_e$Council_area == "Glasgow City"],
     main = "Employment rate in Glasgow ", xlab = "Employment rate (%)", 
     ylab = "Frequency", col = "steelblue")

hist(dat_e$Employment_rate[dat_e$Council_area == "City of Edinburgh"],
     main = "Employment rate in Edinburgh", xlab = "Employment rate (%)", 
     ylab = "Frequency", col = "steelblue")


# Q-Q plots for employment rate in Edinburgh and Glasgow
par(mfrow = c(1, 2)) # Set up a 1x2 grid of plots

qqnorm(dat_e$Employment_rate[dat_e$Council_area == "Glasgow City"],
       main = "Employment rate in Glasgow")

qqline(dat_e$Employment_rate[dat_e$Council_area == "Glasgow City"])

qqnorm(dat_e$Employment_rate[dat_e$Council_area == "City of Edinburgh"],
       main = "Employment rate in Edinburgh")

qqline(dat_e$Employment_rate[dat_e$Council_area == "City of Edinburgh"])


# Creating a table with sample mean and variance for Edinburgh and Glasgow
dat_e <- subset(dat, Council_area %in% c("Glasgow City", "City of Edinburgh"))

mean_glasgow <- mean(dat_e$Employment_rate[dat_e$Council_area == "Glasgow City"])
var_glasgow <- var(dat_e$Employment_rate[dat_e$Council_area == "Glasgow City"])

mean_edinburgh <- mean(dat_e$Employment_rate[dat_e$Council_area == "City of Edinburgh"])
var_edinburgh <- var(dat_e$Employment_rate[dat_e$Council_area == "City of Edinburgh"])

library(knitr)
kable(data.frame(
  Council_area = c("Glasgow City", "City of Edinburgh"),
  Mean = c(mean_glasgow, mean_edinburgh),
  Variance = c(var_glasgow, var_edinburgh)
), caption = "Figure 4: Summary statistics for Employment rate in Edinburgh and Glasgow.")

# Subset the full data set (all 400 data zones) to the variables of interest
variables <- c("Employment_rate", "Attainment", "Attendance", "ALCOHOL", "Broadband")
data_subset <- dat[, variables]

# Create a matrix of pairwise scatterplots
pairs(data_subset)


# Calculate pairwise correlation coefficients
correlations <- cor(data_subset)
knitr::kable(correlations, caption = "Corresponding numerical values for the Pearson sample correlation coefficients between these variables.")


# Scatterplot of employment rate against attendance for all data zones,
# so that the points match the data used to fit the regression
plot(dat$Attendance, dat$Employment_rate,
     xlab = "Attendance (%)", ylab = "Employment Rate (%)",
     main = "Scatter plot of Employment Rate vs Attendance")

# Fit the linear regression once and reuse it for the line and the estimates
model <- lm(Employment_rate ~ Attendance, data = dat)
abline(model)
coefs <- coef(model)
cat("beta 0=", round(coefs[1], 4), "beta 1=", round(coefs[2], 4))
# Box plot of no qualifications for Edinburgh and Glasgow

boxplot(no_qualifications ~ Council_area, data=dat_e,
        xlab= "Council area", ylab="Working age people with no qualification",
        main="Working age people with no qualifications in different council areas")

# Histograms of no qualifications for Edinburgh and Glasgow

par(mfrow = c(1, 2))

hist(dat_e$no_qualifications[dat_e$Council_area == "Glasgow City"],
     main = "Glasgow City ", xlab = "Ratio of no qualifications", 
     ylab = "Frequency", col = "steelblue")

hist(dat_e$no_qualifications[dat_e$Council_area == "City of Edinburgh"],
     main = "City of Edinburgh", xlab = "Ratio of no qualifications", 
     ylab = "Frequency", col = "steelblue")


# Q-Q plots for no qualifications in Edinburgh and Glasgow
par(mfrow = c(1, 2)) # Set up a 1x2 grid of plots

qqnorm(dat_e$no_qualifications[dat_e$Council_area == "Glasgow City"],
       main = "Glasgow ")

qqline(dat_e$no_qualifications[dat_e$Council_area == "Glasgow City"])

qqnorm(dat_e$no_qualifications[dat_e$Council_area == "City of Edinburgh"],
       main = "Edinburgh")

qqline(dat_e$no_qualifications[dat_e$Council_area == "City of Edinburgh"])


# Creating a table with sample mean and variance for Edinburgh and Glasgow
dat_e <- subset(dat, Council_area %in% c("Glasgow City", "City of Edinburgh"))

mean_glasgow <- mean(dat_e$no_qualifications[dat_e$Council_area == "Glasgow City"])
var_glasgow <- var(dat_e$no_qualifications[dat_e$Council_area == "Glasgow City"])

mean_edinburgh <- mean(dat_e$no_qualifications[dat_e$Council_area == "City of Edinburgh"])
var_edinburgh <- var(dat_e$no_qualifications[dat_e$Council_area == "City of Edinburgh"])

library(knitr)
kable(data.frame(
  Council_area = c("Glasgow City", "City of Edinburgh"),
  Mean = c(mean_glasgow, mean_edinburgh),
  Variance = c(var_glasgow, var_edinburgh)
), caption = "Figure 10: Summary statistics for no qualifications in Edinburgh and Glasgow")