This document forms part of a project I am working on to analyse and present primary school admissions and performance data in Surrey and Hampshire in a meaningful way for parents looking to choose a school for their child. The full project, including all source data and scripts, can be found on GitHub.
In this analysis, I will explore data relating to primary school admissions in Surrey and Hampshire for academic years beginning September 2014 through to September 2021. The data was taken from the UK Government Explore Education Statistics service and contains public sector information licensed under the Open Government Licence v3.0. Full details of the data sources used for this analysis can be found in the README file within the GitHub repo.
The CSV dataset is located here.
The dataset includes the number of offers made to applicants for both primary and secondary school places to start in September 2021, the methods of application used, and the proportion of applicants who received preferred offers of various types. A time series is included going back to 2014, when primary data was collected for the first time following the introduction of the first primary national offer day.
While the dataset does include data for both primary and secondary applications, and for every local authority in England, my analysis will focus on primary applications in Surrey and Hampshire.
Variable names and descriptions for this file are provided below:
| Variable name | Variable description |
|---|---|
| admission_numbers | Number of admission places available |
| applications_received | Number of applications received |
| first_preference_offers | a. Number of first preference offers |
| first_preference_offers_percent | b. First preference rate |
| NC_year_admission | Year of admission |
| no_of_preferences | Number of preferences allowed |
| no_offer | m. Number made no offer |
| no_offer_percent | n. No offer rate |
| non_preferred_offer | k. Number of non preferred offers |
| non_preferred_offer_percent | l. Non preferred offer rate |
| offers_to_nonapplicants | Total number of offers to non applicants |
| one_of_the_three_preference_offers | g. Number of top three preference offers |
| one_of_the_three_preference_offers_percent | h. Top three preference rate |
| online_applications | Number of applications submitted online |
| online_apps_percent | Percentage of applications submitted online |
| preferred_school_offer | i. Number of preferred offers |
| preferred_school_offer_percent | j. Preferred offer rate |
| school_phase | School phase - collection phase |
| schools_in_another_la_offer | q. Number of offers in another LA |
| schools_in_another_la_offer_percent | r. Other LA offer rate |
| schools_in_la_offer | o. Number of offers in home LA |
| schools_in_la_offer_percent | p. Home LA offer rate |
| second_preference_offers | c. Number of second preference offers |
| second_preference_offers_percent | d. Second preference rate |
| third_preference_offers | e. Number of third preference offers |
| third_preference_offers_percent | f. Third preference rate |
I will make use of several libraries during this analysis:
library(tidyverse)
library(janitor)
I am primarily interested in answering the following questions with this data:
As I work through the data wrangling and exploration steps, I may uncover further questions that I wish to answer with this data.
First I need to read in the data from the CSV file so that I can explore, clean and analyse it further.
r_offers_la <- read_csv("data-in/appsandoffers_2021.csv")
I use the prefix r_ at the start of the variable name for the dataframe I just read in. This denotes the variable as being the ‘raw’ version of the data (i.e. exactly as it appeared at source). When I start to clean and wrangle this data, I will store the manipulated version as a new variable, without the r_ prefix. In this way, I always have a copy of the raw data to refer back to if needed.
The read_csv() call above produced several warnings relating to parsing failures (the warnings are not displayed here, but they can be retrieved with problems(df)). I will summarise the warnings in a friendlier format to see what is going on.
r_offers_la_problems <- problems(r_offers_la) %>%
group_by(col, expected, actual) %>%
summarise(nrows = n())
knitr::kable(r_offers_la_problems)
| col | expected | actual | nrows |
|---|---|---|---|
| no_offer | a double | c | 49 |
| no_offer_percent | a double | c | 49 |
| non_preferred_offer | a double | c | 51 |
| non_preferred_offer_percent | a double | c | 51 |
| offers_to_nonapplicants | a double | c | 11 |
| one_of_the_three_preference_offers | a double | c | 4 |
| one_of_the_three_preference_offers_percent | a double | c | 4 |
| online_applications | a double | c | 3 |
| online_apps_percent | a double | c | 3 |
| preferred_school_offer | a double | c | 4 |
| preferred_school_offer_percent | a double | c | 4 |
| schools_in_another_la_offer | a double | c | 4 |
| schools_in_another_la_offer_percent | a double | c | 4 |
| schools_in_la_offer | a double | c | 4 |
| schools_in_la_offer_percent | a double | c | 4 |
| second_preference_offers | a double | c | 8 |
| second_preference_offers_percent | a double | c | 8 |
| third_preference_offers | a double | c | 10 |
| third_preference_offers_percent | a double | c | 10 |
Here I can see that there are 19 columns that the read_csv() function determined should be numeric based on analysis of the first 1,000 rows of data. However, in some rows it found a string "c" in these columns (as shown by the value in the actual column above).
From the material that accompanied the data in its original source location I know that some values were suppressed, and replaced with a string denoting the suppression:
> Rounding
>
> Primary and secondary suppression was applied in the 2014 dataset. No suppression has been applied from 2015 onwards.
>
> Conventions
>
> The following convention is used throughout the underlying data.
>
> - ‘c’ Suppressed (2014 only)
> - ‘:’ Not applicable.
Because the string ‘c’ cannot be parsed as a number, the suppressed values have been read into the R dataframe as missing values (NA). As I work through the analysis I will need to decide if and how to impute these missing values.
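Rather than relying on parsing failures, an alternative would be to tell read_csv() up front which strings the publisher uses for suppressed values. This is just a sketch of that option, using the same file path as before; I have kept the parsing-failure route above so that problems() shows me exactly which values were affected.

# treat the suppression marker 'c' as missing when reading, so the affected
# columns parse cleanly as numeric and no warnings are raised; ':' is left
# alone here because in this file it only appears in columns read as text anyway
r_offers_la <- read_csv("data-in/appsandoffers_2021.csv",
                        na = c("", "NA", "c"))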
Now that I have read the data into R, I can spend some time assessing it to find out what the features look like and what might need to be cleaned.
I’ll start by looking at a summary of the numerical columns of the dataset.
r_offers_la %>%
select(where(is.numeric)) %>%
summary()
## time_period old_la_code admission_numbers applications_received
## Min. :201415 Min. :201.0 Min. : 0 Min. : 16
## 1st Qu.:201516 1st Qu.:334.0 1st Qu.: 2434 1st Qu.: 2185
## Median :201718 Median :807.0 Median : 3551 Median : 3162
## Mean :201768 Mean :614.5 Mean : 12364 Mean : 11000
## 3rd Qu.:201920 3rd Qu.:873.0 3rd Qu.: 6252 3rd Qu.: 5333
## Max. :202122 Max. :938.0 Max. :713090 Max. :641572
## NA's :176
## online_applications online_apps_percent first_preference_offers
## Min. : 1 Min. : 0.05692 Min. : 10
## 1st Qu.: 1886 1st Qu.: 86.02105 1st Qu.: 1851
## Median : 2845 Median : 93.56607 Median : 2687
## Mean : 9972 Mean : 89.15999 Mean : 9501
## 3rd Qu.: 4846 3rd Qu.: 97.18310 3rd Qu.: 4756
## Max. :580938 Max. :100.00000 Max. :566836
## NA's :3 NA's :3
## first_preference_offers_percent second_preference_offers
## Min. :51.40 Min. : 0.0
## 1st Qu.:81.87 1st Qu.: 131.0
## Median :88.43 Median : 218.5
## Mean :85.82 Mean : 771.5
## 3rd Qu.:92.27 3rd Qu.: 381.5
## Max. :99.72 Max. :54187.0
## NA's :8
## second_preference_offers_percent third_preference_offers
## Min. : 0.000 Min. : 0.0
## 1st Qu.: 4.533 1st Qu.: 29.0
## Median : 6.407 Median : 64.0
## Mean : 7.204 Mean : 250.2
## 3rd Qu.: 9.056 3rd Qu.: 132.0
## Max. :36.842 Max. :20186.0
## NA's :8 NA's :10
## third_preference_offers_percent one_of_the_three_preference_offers
## Min. : 0.0000 Min. : 13
## 1st Qu.: 0.9258 1st Qu.: 2082
## Median : 1.7637 Median : 3006
## Mean : 2.3479 Mean : 10534
## 3rd Qu.: 3.1664 3rd Qu.: 5176
## Max. :10.3448 Max. :617899
## NA's :10 NA's :4
## one_of_the_three_preference_offers_percent preferred_school_offer
## Min. : 74.77 Min. : 15
## 1st Qu.: 93.98 1st Qu.: 2111
## Median : 96.65 Median : 3048
## Mean : 95.36 Mean : 10651
## 3rd Qu.: 98.04 3rd Qu.: 5185
## Max. :100.00 Max. :621904
## NA's :4 NA's :4
## preferred_school_offer_percent non_preferred_offer non_preferred_offer_percent
## Min. : 83.80 Min. : 0.0 Min. : 0.000
## 1st Qu.: 95.52 1st Qu.: 43.0 1st Qu.: 1.532
## Median : 97.14 Median : 90.0 Median : 2.529
## Mean : 96.56 Mean : 330.6 Mean : 3.050
## 3rd Qu.: 98.21 3rd Qu.: 181.0 3rd Qu.: 4.102
## Max. :100.00 Max. :27893.0 Max. :16.199
## NA's :4 NA's :51 NA's :51
## no_offer no_offer_percent schools_in_la_offer
## Min. : 0.00 Min. : 0.00000 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 0.00000 1st Qu.: 2010
## Median : 1.00 Median : 0.03357 Median : 2948
## Mean : 39.12 Mean : 0.40254 Mean : 10398
## 3rd Qu.: 9.00 3rd Qu.: 0.22144 3rd Qu.: 5170
## Max. :3868.00 Max. :15.06708 Max. :622789
## NA's :49 NA's :49 NA's :4
## schools_in_la_offer_percent schools_in_another_la_offer
## Min. : 0.00 Min. : 0
## 1st Qu.: 91.67 1st Qu.: 71
## Median : 95.81 Median : 163
## Mean : 92.98 Mean : 601
## 3rd Qu.: 97.91 3rd Qu.: 324
## Max. :100.00 Max. :48916
## NA's :4 NA's :4
## schools_in_another_la_offer_percent offers_to_nonapplicants
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 2.091 1st Qu.: 0.00
## Median : 4.190 Median : 0.00
## Mean : 7.017 Mean : 23.37
## 3rd Qu.: 8.327 3rd Qu.: 2.00
## Max. :100.000 Max. :2494.00
## NA's :4 NA's :11
From the summary I observe the following:
- The time_period column denotes an academic year, but is in the format ‘202021’. I would like to mutate this numeric column so that it displays only the start year of the academic year (i.e. ‘202021’ becomes ‘2020’). I could also rename it to year_of_entry for clarity.
- The old_la_code column contains a categorical variable rather than a true numerical measure, and could benefit from being converted to a string.
- online_apps_percent has a large range (min: 0.06%, max: 100%). It might be interesting to see how this measure varies by local authority or over time; I sketch a quick look at this at the end of this section, once the data is clean.
- The columns ending in _percent hold whole-number percentages (97.6, 88.3, 3.2 etc.). I would like to convert these to decimals (i.e. divide each value by 100) to make them easier to work with later.

I can also view a summary of the character columns.
r_offers_la %>%
select(where(is.character)) %>%
lapply(table)
## $time_identifier
##
## Academic year
## 2588
##
## $geographic_level
##
## Local authority National Regional
## 2412 16 160
##
## $country_code
##
## E92000001
## 2588
##
## $country_name
##
## England
## 2588
##
## $region_code
##
## E12000001 E12000002 E12000003 E12000004 E12000005 E12000006 E12000008 E12000009
## 208 384 256 160 240 192 320 252
## E13000001 E13000002
## 240 320
##
## $region_name
##
## East Midlands East of England Inner London
## 160 192 240
## North East North West Outer London
## 208 384 320
## South East South West West Midlands
## 320 252 240
## Yorkshire and The Humber
## 256
##
## $new_la_code
##
## E06000001 E06000002 E06000003 E06000004 E06000005 E06000006 E06000007 E06000008
## 16 16 16 16 16 16 16 16
## E06000009 E06000010 E06000011 E06000012 E06000013 E06000014 E06000015 E06000016
## 16 16 16 16 16 16 16 16
## E06000017 E06000018 E06000019 E06000020 E06000021 E06000022 E06000023 E06000024
## 16 16 16 16 16 16 16 16
## E06000025 E06000026 E06000027 E06000028 E06000029 E06000030 E06000031 E06000032
## 16 16 16 12 12 16 16 16
## E06000033 E06000034 E06000035 E06000036 E06000037 E06000038 E06000039 E06000040
## 16 16 16 16 16 16 16 16
## E06000041 E06000042 E06000043 E06000044 E06000045 E06000046 E06000047 E06000048
## 16 16 16 16 16 16 16 16
## E06000049 E06000050 E06000051 E06000052 E06000054 E06000055 E06000056 E06000058
## 16 16 16 16 16 16 16 4
## E06000059 E08000001 E08000002 E08000003 E08000004 E08000005 E08000006 E08000007
## 4 16 16 16 16 16 16 16
## E08000008 E08000009 E08000010 E08000011 E08000012 E08000013 E08000014 E08000015
## 16 16 16 16 16 16 16 16
## E08000016 E08000017 E08000018 E08000019 E08000020 E08000021 E08000022 E08000023
## 16 16 16 16 16 16 16 16
## E08000024 E08000025 E08000026 E08000027 E08000028 E08000029 E08000030 E08000031
## 16 16 16 16 16 16 16 16
## E08000032 E08000033 E08000034 E08000035 E08000036 E09000001 E09000002 E09000003
## 16 16 16 16 16 16 16 16
## E09000004 E09000005 E09000006 E09000007 E09000008 E09000009 E09000010 E09000011
## 16 16 16 16 16 16 16 16
## E09000012 E09000013 E09000014 E09000015 E09000016 E09000017 E09000018 E09000019
## 16 16 16 16 16 16 16 16
## E09000020 E09000021 E09000022 E09000023 E09000024 E09000025 E09000026 E09000027
## 16 16 16 16 16 16 16 16
## E09000028 E09000029 E09000030 E09000031 E09000032 E09000033 E10000002 E10000003
## 16 16 16 16 16 16 16 16
## E10000006 E10000007 E10000008 E10000009 E10000011 E10000012 E10000013 E10000014
## 16 16 16 12 16 16 16 16
## E10000015 E10000016 E10000017 E10000018 E10000019 E10000020 E10000021 E10000023
## 16 16 16 16 16 16 16 16
## E10000024 E10000025 E10000027 E10000028 E10000029 E10000030 E10000031 E10000032
## 16 16 16 16 16 16 16 16
## E10000034
## 16
##
## $la_name
##
## Barking and Dagenham Barnet
## 16 16
## Barnsley Bath and North East Somerset
## 16 16
## Bedford Bexley
## 16 16
## Birmingham Blackburn with Darwen
## 16 16
## Blackpool Bolton
## 16 16
## Bournemouth Bournemouth Christchurch and Poole
## 12 4
## Bracknell Forest Bradford
## 16 16
## Brent Brighton and Hove
## 16 16
## Bristol, City of Bromley
## 16 16
## Buckinghamshire Bury
## 16 16
## Calderdale Cambridgeshire
## 16 16
## Camden Central Bedfordshire
## 16 16
## Cheshire East Cheshire West and Chester
## 16 16
## City of London Cornwall
## 16 16
## Coventry Croydon
## 16 16
## Cumbria Darlington
## 16 16
## Derby Derbyshire
## 16 16
## Devon Doncaster
## 16 16
## Dorset Dudley
## 16 16
## Durham Ealing
## 16 16
## East Riding of Yorkshire East Sussex
## 16 16
## Enfield Essex
## 16 16
## Gateshead Gloucestershire
## 16 16
## Greenwich Hackney
## 16 16
## Halton Hammersmith and Fulham
## 16 16
## Hampshire Haringey
## 16 16
## Harrow Hartlepool
## 16 16
## Havering Herefordshire
## 16 16
## Hertfordshire Hillingdon
## 16 16
## Hounslow Isle of Wight
## 16 16
## Islington Kensington and Chelsea
## 16 16
## Kent Kingston Upon Hull, City of
## 16 16
## Kingston upon Thames Kirklees
## 16 16
## Knowsley Lambeth
## 16 16
## Lancashire Leeds
## 16 16
## Leicester Leicestershire
## 16 16
## Lewisham Lincolnshire
## 16 16
## Liverpool Luton
## 16 16
## Manchester Medway
## 16 16
## Merton Middlesbrough
## 16 16
## Milton Keynes Newcastle upon Tyne
## 16 16
## Newham Norfolk
## 16 16
## North East Lincolnshire North Lincolnshire
## 16 16
## North Somerset North Tyneside
## 16 16
## North Yorkshire Northamptonshire
## 16 16
## Northumberland Nottingham
## 16 16
## Nottinghamshire Oldham
## 16 16
## Oxfordshire Peterborough
## 16 16
## Plymouth Poole
## 16 12
## Portsmouth Reading
## 16 16
## Redbridge Redcar and Cleveland
## 16 16
## Richmond upon Thames Rochdale
## 16 16
## Rotherham Rutland
## 16 16
## Salford Sandwell
## 16 16
## Sefton Sheffield
## 16 16
## Shropshire Slough
## 16 16
## Solihull Somerset
## 16 16
## South Gloucestershire South Tyneside
## 16 16
## Southampton Southend-on-Sea
## 16 16
## Southwark St. Helens
## 16 16
## Staffordshire Stockport
## 16 16
## Stockton-on-Tees Stoke-on-Trent
## 16 16
## Suffolk Sunderland
## 16 16
## Surrey Sutton
## 16 16
## Swindon Tameside
## 16 16
## Telford and Wrekin Thurrock
## 16 16
## Torbay Tower Hamlets
## 16 16
## Trafford Wakefield
## 16 16
## Walsall Waltham Forest
## 16 16
## Wandsworth Warrington
## 16 16
## Warwickshire West Berkshire
## 16 16
## West Sussex Westminster
## 16 16
## Wigan Wiltshire
## 16 16
## Windsor and Maidenhead Wirral
## 16 16
## Wokingham Wolverhampton
## 16 16
## Worcestershire York
## 16 16
##
## $school_phase
##
## Primary Secondary
## 1294 1294
##
## $NC_year_admission
##
## : 7 9 R
## 176 1190 16 1206
##
## $no_of_preferences
##
## : 3 4 5 6
## 176 1245 221 180 766
From this I can make some further observations:
- time_identifier, country_code and country_name can probably all be removed, since they contain no useful information.
- NC_year_admission and no_of_preferences both contain ‘not applicable’ values denoted by :. I believe these values are tied to the records holding the regional and national aggregated data, and will therefore disappear when I remove those rows.
- no_of_preferences would be better converted to a numeric column once the ‘not applicable’ records have been removed.

I’ll start by fixing the issues I uncovered during my initial assessment. I will save my changes into a new dataframe offers_la so that I have a copy of both the raw data and my edited version.
# convert the time_period column to year_of_entry
offers_la <- r_offers_la %>%
mutate(time_period = as.character(time_period),
time_period = str_sub(time_period, 1, 4),
time_period = as.numeric(time_period)) %>%
rename(year_of_entry = time_period)
head(offers_la$year_of_entry)
## [1] 2021 2021 2021 2021 2021 2021
# convert the old_la_code column to character
offers_la <- offers_la %>%
mutate(old_la_code = as.character(old_la_code))
tail(offers_la$old_la_code)
## [1] "931" "933" "935" "936" "937" "938"
# remove the aggregated data
offers_la <- offers_la %>%
filter(geographic_level == "Local authority")
# convert percentages to decimal
offers_la <- offers_la %>%
mutate(across(ends_with("_percent"), ~ .x / 100))
offers_la %>%
select(ends_with("_percent")) %>%
head()
## # A tibble: 6 x 10
## online_apps_perc… first_preference_of… second_preference_… third_preference_o…
## <dbl> <dbl> <dbl> <dbl>
## 1 1 0.893 0.0357 0
## 2 0.999 0.778 0.0974 0.0395
## 3 0.996 0.889 0.0664 0.0193
## 4 0.981 0.910 0.0438 0.0189
## 5 0.983 0.796 0.0878 0.0367
## 6 1 0.785 0.0871 0.0356
## # … with 6 more variables: one_of_the_three_preference_offers_percent <dbl>,
## # preferred_school_offer_percent <dbl>, non_preferred_offer_percent <dbl>,
## # no_offer_percent <dbl>, schools_in_la_offer_percent <dbl>,
## # schools_in_another_la_offer_percent <dbl>
# remove redundant columns
offers_la <- offers_la %>%
select(!c(time_identifier, country_code, country_name))
# convert no_of_preferences to numeric now that the ':' values have been removed
offers_la <- offers_la %>%
mutate(no_of_preferences = as.numeric(no_of_preferences))
I have several columns that contain missing values, and I need to decide if and how to impute these values. I’ll start by inspecting the clean dataframe to find out where the missing values are located.
knitr::kable(colSums(is.na(offers_la)),
col.names = c("count_na"))
|  | count_na |
|---|---|
| year_of_entry | 0 |
| geographic_level | 0 |
| region_code | 0 |
| region_name | 0 |
| new_la_code | 0 |
| old_la_code | 0 |
| la_name | 0 |
| school_phase | 0 |
| NC_year_admission | 0 |
| admission_numbers | 0 |
| applications_received | 0 |
| online_applications | 3 |
| online_apps_percent | 3 |
| no_of_preferences | 0 |
| first_preference_offers | 0 |
| first_preference_offers_percent | 0 |
| second_preference_offers | 8 |
| second_preference_offers_percent | 8 |
| third_preference_offers | 10 |
| third_preference_offers_percent | 10 |
| one_of_the_three_preference_offers | 4 |
| one_of_the_three_preference_offers_percent | 4 |
| preferred_school_offer | 4 |
| preferred_school_offer_percent | 4 |
| non_preferred_offer | 51 |
| non_preferred_offer_percent | 51 |
| no_offer | 49 |
| no_offer_percent | 49 |
| schools_in_la_offer | 4 |
| schools_in_la_offer_percent | 4 |
| schools_in_another_la_offer | 4 |
| schools_in_another_la_offer_percent | 4 |
| offers_to_nonapplicants | 11 |
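These counts line up with the parsing failures I saw when reading the file in, and the source notes state that suppression was only applied to the 2014 data. Before settling on an imputation approach, a quick check along the following lines (a sketch; output not shown here) could confirm that the missing values are indeed confined to that entry year:

# count rows with a suppressed (missing) no_offer value by year of entry;
# if the suppression note is right, only 2014 should appear here
offers_la %>%
  filter(is.na(no_offer)) %>%
  count(year_of_entry, school_phase)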
I will save my clean data to a new CSV file that I can use for visualisation in Tableau.
write_csv(offers_la, file="data-out/offers-la.csv", na="")
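Finally, as a taste of the kind of exploration the clean data now supports (and of the online_apps_percent question I noted earlier), a sketch of a plot comparing online application rates over time for primary places in Surrey and Hampshire might look like the following. This is illustrative only, not part of the cleaning pipeline.

# online application rate for primary places in Surrey and Hampshire, by year of entry
offers_la %>%
  filter(school_phase == "Primary",
         la_name %in% c("Surrey", "Hampshire")) %>%
  ggplot(aes(x = year_of_entry, y = online_apps_percent, colour = la_name)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year of entry",
       y = "Applications submitted online",
       colour = "Local authority")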