This script outlines the steps take to profile the data from the Food Hygiene Ratings dataset. This script will not make any edits to the data but will note down where cleaning actions should be taken so that this can be performed with an R script in the overall pipeline.
Load the packages to be used for this project.
library(here)
library(knitr)
library(sf)
library(mapview)
The data for profiling is contained in the fhr_data
object created in the 2-prep-data.R script.
| fhrsid | local_authority_business_id | business_name | business_type | business_type_id | rating_value | rating_key | rating_date | local_authority_code | local_authority_name | local_authority_web_site | local_authority_email_address | hygiene | structural | confidence_in_management | scheme_type | new_rating_pending | longitude | latitude | address_line1 | address_line2 | address_line3 | address_line4 | post_code | fhrs_establishment_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1199283 | PI/000134037 | 3D Spice | Other catering premises | 7841 | 5 | fhrs_5_en-GB | 2022-02-01 | 314 | Waverley | http://www.waverley.gov.uk | environmentalhealth@waverley.gov.uk | 0 | 0 | 0 | FHRS | FALSE | NA | NA | NA | NA | NA | NA | NA | EstablishmentDetail |
| 177971 | PI/000044599 | 40 Degreez | Other catering premises | 7841 | 5 | fhrs_5_en-GB | 2022-05-27 | 314 | Waverley | http://www.waverley.gov.uk | environmentalhealth@waverley.gov.uk | 5 | 5 | 5 | FHRS | FALSE | -0.793374 | 51.21678 | Farnham Youth Project | Dogflud Way | Farnham | Surrey | NA | EstablishmentDetail |
| 1495850 | PI/000180590 | 7AM Premier Farnham | Retailers - other | 4613 | 1 | fhrs_1_en-GB | 2022-12-08 | 314 | Waverley | http://www.waverley.gov.uk | environmentalhealth@waverley.gov.uk | 15 | 15 | 20 | FHRS | FALSE | -0.762408 | 51.23089 | 57-59 Badshot Lea Road | Badshot Lea | Farnham | Surrey | GU9 9LP | EstablishmentDetail |
| 1209035 | PI/000178110 | A & R Convenience Store | Retailers - other | 4613 | 5 | fhrs_5_en-GB | 2023-02-01 | 314 | Waverley | http://www.waverley.gov.uk | environmentalhealth@waverley.gov.uk | 5 | 5 | 0 | FHRS | FALSE | -0.606548 | 51.19057 | 7 Meadrow | Godalming | Surrey | NA | GU7 3HJ | EstablishmentDetail |
| 1531219 | PI/000161558 | A Bite to Wheat | Other catering premises | 7841 | 5 | fhrs_5_en-GB | 2022-06-09 | 314 | Waverley | http://www.waverley.gov.uk | environmentalhealth@waverley.gov.uk | 0 | 0 | 0 | FHRS | FALSE | NA | NA | NA | NA | NA | NA | NA | EstablishmentDetail |
| 1495839 | PI/000124651 | A Fancy Piece | Other catering premises | 7841 | 5 | fhrs_5_en-GB | 2022-03-15 | 314 | Waverley | http://www.waverley.gov.uk | environmentalhealth@waverley.gov.uk | 5 | 5 | 0 | FHRS | FALSE | NA | NA | NA | NA | NA | NA | NA | EstablishmentDetail |
The dataframe has 982 rows and 25 columns. A glimpse of
the data is shown below.
## Rows: 982
## Columns: 25
## $ fhrsid <dbl> 1199283, 177971, 1495850, 1209035, 15312…
## $ local_authority_business_id <chr> "PI/000134037", "PI/000044599", "PI/0001…
## $ business_name <chr> "3D Spice", "40 Degreez", "7AM Premier F…
## $ business_type <chr> "Other catering premises", "Other cateri…
## $ business_type_id <dbl> 7841, 7841, 4613, 4613, 7841, 7841, 7841…
## $ rating_value <chr> "5", "5", "1", "5", "5", "5", "5", "0", …
## $ rating_key <chr> "fhrs_5_en-GB", "fhrs_5_en-GB", "fhrs_1_…
## $ rating_date <date> 2022-02-01, 2022-05-27, 2022-12-08, 202…
## $ local_authority_code <dbl> 314, 314, 314, 314, 314, 314, 314, 314, …
## $ local_authority_name <chr> "Waverley", "Waverley", "Waverley", "Wav…
## $ local_authority_web_site <chr> "http://www.waverley.gov.uk", "http://ww…
## $ local_authority_email_address <chr> "environmentalhealth@waverley.gov.uk", "…
## $ hygiene <dbl> 0, 5, 15, 5, 0, 5, 5, 15, 5, 0, 5, 0, 0,…
## $ structural <dbl> 0, 5, 15, 5, 0, 5, 0, 20, 5, 0, 0, 5, 5,…
## $ confidence_in_management <dbl> 0, 5, 20, 0, 0, 0, 0, 20, 5, 0, 5, 10, 5…
## $ scheme_type <chr> "FHRS", "FHRS", "FHRS", "FHRS", "FHRS", …
## $ new_rating_pending <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ longitude <dbl> NA, -0.7933740, -0.7624080, -0.6065480, …
## $ latitude <dbl> NA, 51.21678, 51.23089, 51.19057, NA, NA…
## $ address_line1 <chr> NA, "Farnham Youth Project", "57-59 Bads…
## $ address_line2 <chr> NA, "Dogflud Way", "Badshot Lea", "Godal…
## $ address_line3 <chr> NA, "Farnham", "Farnham", "Surrey", NA, …
## $ address_line4 <chr> NA, "Surrey", "Surrey", NA, NA, NA, NA, …
## $ post_code <chr> NA, NA, "GU9 9LP", "GU7 3HJ", NA, NA, NA…
## $ fhrs_establishment_id <chr> "EstablishmentDetail", "EstablishmentDet…
Look at each column in turn and assess the number of unique values. Is any recoding required?
fhrsidbusiness_namebusiness_type| business_type | n | prop |
|---|---|---|
| Restaurant/Cafe/Canteen | 188 | 0.19 |
| Other catering premises | 179 | 0.18 |
| Retailers - other | 153 | 0.16 |
| Hospitals/Childcare/Caring Premises | 110 | 0.11 |
| Pub/bar/nightclub | 95 | 0.10 |
| School/college/university | 82 | 0.08 |
| Takeaway/sandwich shop | 52 | 0.05 |
| Mobile caterer | 50 | 0.05 |
| Manufacturers/packers | 28 | 0.03 |
| Hotel/bed & breakfast/guest house | 20 | 0.02 |
| Retailers - supermarkets/hypermarkets | 19 | 0.02 |
| Distributors/Transporters | 5 | 0.01 |
| Farmers/growers | 1 | 0.00 |
business_type_idThere are 13 unique values.
| business_type_id | n | prop |
|---|---|---|
| 1 | 188 | 0.19 |
| 7841 | 179 | 0.18 |
| 4613 | 153 | 0.16 |
| 5 | 110 | 0.11 |
| 7843 | 95 | 0.10 |
| 7845 | 82 | 0.08 |
| 7844 | 52 | 0.05 |
| 7846 | 50 | 0.05 |
| 7839 | 28 | 0.03 |
| 7842 | 20 | 0.02 |
| 7840 | 19 | 0.02 |
| 7 | 5 | 0.01 |
| 7838 | 1 | 0.00 |
rating_value| rating_value | n | prop |
|---|---|---|
| 5 | 729 | 0.74 |
| 4 | 116 | 0.12 |
| 3 | 58 | 0.06 |
| Exempt | 43 | 0.04 |
| 2 | 16 | 0.02 |
| 1 | 12 | 0.01 |
| AwaitingInspection | 7 | 0.01 |
| 0 | 1 | 0.00 |
rating_key| rating_key | n | prop |
|---|---|---|
| fhrs_5_en-GB | 729 | 0.74 |
| fhrs_4_en-GB | 116 | 0.12 |
| fhrs_3_en-GB | 58 | 0.06 |
| fhrs_exempt_en-GB | 43 | 0.04 |
| fhrs_2_en-GB | 16 | 0.02 |
| fhrs_1_en-GB | 12 | 0.01 |
| fhrs_awaitinginspection_en-GB | 7 | 0.01 |
| fhrs_0_en-GB | 1 | 0.00 |
rating_value column nicely.
We probably don’t need to add another column.rating_date## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2016-01-22" "2022-02-11" "2022-07-01" "2022-07-04" "2023-01-20" "2023-07-19"
## NA's
## "7"
hygiene| hygiene | n | prop |
|---|---|---|
| 0 | 587 | 0.60 |
| 5 | 241 | 0.25 |
| 10 | 94 | 0.10 |
| NA | 50 | 0.05 |
| 15 | 10 | 0.01 |
score tag in the original
xml file.structural| structural | n | prop |
|---|---|---|
| 0 | 458 | 0.47 |
| 5 | 358 | 0.36 |
| 10 | 95 | 0.10 |
| NA | 50 | 0.05 |
| 15 | 20 | 0.02 |
| 20 | 1 | 0.00 |
score tag in the original
xml file.confidence_in_management| confidence_in_management | n | prop |
|---|---|---|
| 0 | 506 | 0.52 |
| 5 | 311 | 0.32 |
| 10 | 102 | 0.10 |
| NA | 50 | 0.05 |
| 20 | 13 | 0.01 |
score tag in the original
xml file.scheme_typenew_rating_pending| new_rating_pending | n | prop |
|---|---|---|
| FALSE | 973 | 0.99 |
| TRUE | 9 | 0.01 |
longitude and latitudelongitude## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.8366 -0.7866 -0.7098 -0.6800 -0.6072 -0.4050 226
latitude## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 51.08 51.14 51.19 51.17 51.21 51.31 226
fhrsid 221504 and 182877.
Let’s check out the details of these two records.| fhrsid | business_name | business_type | longitude | latitude | address_line1 | address_line2 | address_line3 | address_line4 | post_code |
|---|---|---|---|---|---|---|---|---|---|
| 182877 | Abbeyfield Haslemere Society Ltd | Hospitals/Childcare/Caring Premises | -0.556417 | 51.30751 | Kemnal | Grayswood Road | Haslemere | NA | GU22 2EP |
| 221504 | Waitrose Ltd | Retailers - supermarkets/hypermarkets | -0.405000 | 51.27200 | Waitrose | The Hart | Farnham | Surrey | GU9 7HD |
post_code## # A tibble: 3 × 1
## post_code
## <chr>
## 1 <NA>
## 2 RH5 6NU
## 3 RH5 6NT
post_code is null. This is
ok. We’ll be excluding those records anyway.| fhrsid | business_name | longitude | latitude | address_line1 | address_line2 | address_line3 | address_line4 | post_code |
|---|---|---|---|---|---|---|---|---|
| 167610 | Hurtwood House School | -0.426199 | 51.17442 | Radnor Road | Holmbury | Dorking | Surrey | RH5 6NU |
| 539804 | Mullard Lab | -0.420718 | 51.17197 | Mullard Space Science Laboratory | Holmbury Hill Road | Holmbury St Mary | Dorking | RH5 6NT |
fhrs_establishment_idrating_value has a mix of character and numeric values.
Look at keeping a single type in the column (perhaps split data into 2
columns, one for numeric rating and one for rating code.)local_authority_code,
local_authority_name,
local_authority_web_site,
local_authority_email_address,
scheme_type._score to the following columns:
hygiene, structural,
confidence_in_management (abbreviated to
management).fhrsid 187877 - the correct postcode for this record
should be GU27 2BP. Lat/Long should be 51.0923509197681,
-0.7060200868816853 (source: Google Maps using address details).fhrsid 221504 - correct lat/long values are
51.21564332626035, -0.8021015970813663 (source: Google Maps using
postcode).