Data Cleaning for Skills for Care Data

Updated 2/19/2023 to add new variable age and save the combined data to prepare for further analysis

Setting up

library(tidyverse) # package for data cleaning and plotting
library(readxl) # package for reading excel file
library(broom) # extracting model summary as data frame
library(modelsummary) # deriving model tables
library(scales) # label percent
library(lubridate) # working with dates

# environment setup to run ordered logit properly
options(contrasts = rep("contr.treatment", 2))

Import data

#import the raw data file
cqc_skills <- read_excel("data/SfC_CQC.xlsx")
skills_raw <- read_csv("data/Skills_raw.csv")

Data Cleaning for CQC_skills (cqc data with SfC establishmentid)

#select the CQC data that has Skills for Care establishment id
CQC_sub <- cqc_skills %>% 
  filter(!is.na(establishmentid)) %>% 
#restructure to wide data (1 row for each establishment id)
  pivot_wider(names_from = domain, values_from = rating) %>% 
  mutate(form = ifelse(form == "NA", NA, form))

head(CQC_sub)

## # A tibble: 6 × 15
##   form  cic_type care_home primary_cat  region service_group publication_date   
##   <chr> <chr>    <chr>     <chr>        <chr>  <chr>         <dttm>             
## 1 FPO   NA       Y         Residential… East … Overall       2020-03-03 00:00:00
## 2 FPO   NA       Y         Residential… South… Overall       2017-10-31 00:00:00
## 3 FPO   NA       Y         Residential… East … Overall       2018-12-14 00:00:00
## 4 FPO   NA       Y         Residential… West … Overall       2019-09-17 00:00:00
## 5 FPO   NA       N         Community b… Yorks… Overall       2020-02-15 00:00:00
## 6 NPO   NA       Y         Residential… South… Overall       2018-11-13 00:00:00
## # ℹ 8 more variables: inherited <chr>, establishmentid <dbl>, Safe <chr>,
## #   Effective <chr>, Caring <chr>, Responsive <chr>, `Well-led` <chr>,
## #   Overall <chr>

CQC_sub %>% 
  summarize(all_rows = n(),
            unique_rows = n_distinct(establishmentid))

## # A tibble: 1 × 2
##   all_rows unique_rows
##      <int>       <int>
## 1     7258        6799

This results show that some facilities have more than 1 historical rating records.

Data Cleaning for Skills for Care data

Select Columns

#write variable name list
columns <- c("ParentSubSingle", "isparent", "estabcreateddate",  #basic info
             "Starters_Leavers_FILTER", "totalstarters", "totalleavers", #staff
             "totalstaff", "totalvacancies","ORGWRKS", "ORGWRKSGP", #staff
             "esttype",  "mainstid", "MAINSERGP1") #service descriptions

Notes for some variables:

ParentSubSingle: values: 0, 1, 2. Does not say which is parent/sub/single. It is observed 2 corresponds to `isparent = 1”, so should be parent.

Starters_Leavers_FILTER: Flag to define if a workplace has reliable starters or leavers information. 1 = Include, 0 = Exclude.

ORGWRKS: Sum of totalstaff across the organisation.

ORGWRKSGP: Grouping of ORGWRKS

Value Label -1 Not allocated 991 Micro (1 to 9) 992 Small (10 to 49) 993 Medium (50 to 249) 994 Large (250 or more)

esttype: sector of the workplace

mainstid: The main service provided by the workplace

totalstarters and totalleavers: Total number of permanent and temporary staff that started/left in the previous 12 months.

skills_selected <- skills_raw %>% 
  select(establishmentid, all_of(columns), starts_with("jr28"), starts_with("jr29"),
         matches("^st.*cap$"), matches("^st.*util$"))

“jr” codes

Code Label

28 Any job role. This is the sum of all job roles submitted by the establishment.

29 Any direct care role. This is the sum of all direct care roles submitted by the establishment. (Note: this variable may be more directly related to service quality)

30 Any manager/supervisor role. This is the sum of all manager/supervisor roles submitted by the establishment.

31 Any regulated profession role. This is the sum of all regulated profession roles submitted by the establishment.

32 Any other roles. This is the sum of all other roles, those not included in groups 29,30 and 31, submitted by the establishment.

1 Senior Management

2 Middle Management

3 First Line Manager

4 Registered Manager

5 Supervisor

6 Social Worker

7 Senior Care Worker

8 Care Worker

9 Community Support and Outreach Work

10 Employment Support

11 Advice Guidance and Advocacy

15 Occupational Therapist

16 Registered Nurse

17 Allied Health Professional

22 Technician

23 Other care-providing job role

24 Managers and staff in care-related but not care-providing roles

25 Administrative or office staff not care-providing

26 Ancillary staff not care-providing

27 Other non-care-providing job roles

34 Activities worker or co-ordinator

35 Safeguarding and reviewing officer

36 Occupational therapist assistant

37 Nursing Associate

38 Nursing Assistant

39 Assessment officer

40 Care co-ordinator

41 Care navigator

42 Any Childrens/young peoples job role

“capacity vs. utility” meaning

capacity meaning

Main groups and corresponding values

main groups values

Exploring the strucutre of Skills for Care data

Use this small dataset to check if each establishment ID has multiple rows

skills_raw %>% 
  select(establishmentid) %>% 
  summarise(count = n(),
            count_unqice = n_distinct(establishmentid))

## # A tibble: 1 × 2
##   count count_unqice
##   <int>        <int>
## 1 18956        18956

In the Skills for Care data set, each establishmentid only corresponds to 1 row data.

Merge data and derive new variables

coding section (jump to the next section for rationale)

# merge and recode data
merged_wide <- CQC_sub %>% 
  left_join(skills_selected, by = "establishmentid") %>% 
  mutate(main_service_type = factor(mainstid),
         main_group1 = factor(MAINSERGP1),
         
         main_service_group = case_when(
           main_group1 == "1" ~ "Adult residential",
           main_group1 == "3" ~ "Adult domicilary", 
           TRUE ~ as.character(NA)),
         
# derive truncated service types (where CICs exist) see table 1
         service_type_selected = case_when(
           main_service_type == "1"  ~ "care home w/ nursing",
           main_service_type == "2"  ~ "care home w/o nursing",
           main_service_type == "8"  ~ "Domicilary Care",
           main_service_type == "55" ~ "Supported Living",
           # dropped 61 -- community based for learning disability
           TRUE ~ as.character(NA)),

# recode size type to be more readable
         size_type = case_when(
           ORGWRKSGP == -1  ~ as.character(NA),
           ORGWRKSGP == 991 ~ "micro",
           ORGWRKSGP == 992 ~ "small",
           ORGWRKSGP == 993 ~ "medium",
           ORGWRKSGP == 994 ~ "large",
           TRUE ~ as.character(NA)),
# derive age of the organization
        start_date = dmy(estabcreateddate),
        age = 2023-year(start_date)
)

Initial exploration with tabulation

Tabulation of service types by legal forms

#the distribution of service by by forms, we can only keep the types that have CIC providers?
table1 <- datasummary(main_service_type ~ form,
            data = merged_wide, fmt = 0)

## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 3 > 1' in coercion to 'logical(1)'

table1

main_service_type	CIC	FPO	GOV	IND	NPO
1	5	1106	32	35	189
2	8	2179	167	210	832
5	0	0	0	2	0
6	0	0	0	0	1
8	30	1501	94	49	184
14	0	0	0	0	1
17	0	8	32	1	4
54	0	65	15	0	48
55	6	207	30	0	127
60	0	3	0	0	1
61	2	11	1	1	3
62	0	1	0	0	2
64	0	27	3	0	0
67	0	5	0	0	1
69	0	5	4	0	3
70	0	3	0	0	0
72	0	1	0	0	8
75	0	0	0	0	1

Tabulation of total

table_main_group1 <- datasummary(main_group1 ~ form,
            data = merged_wide, fmt = 0)
table_main_group1

main_group1	CIC	FPO	GOV	IND	NPO
1	13	3285	199	247	1021
2	0	0	0	0	1
3	36	1773	139	49	359
4	0	8	32	1	5
5	0	3	0	0	2
9	2	52	8	1	9
10	0	1	0	0	8

Tabulation of total staff (mean and sd) by form and size type

cross_table <- datasummary(main_service_group * service_type_selected ~ form,
            data = merged_wide, fmt = 0)
cross_table

main_service_group	service_type_selected	CIC	FPO	GOV	IND	NPO
Adult domicilary	care home w/ nursing	0	0	0	0	0
	care home w/o nursing	0	0	0	0	0
	Domicilary Care	30	1501	94	49	184
	Supported Living	6	207	30	0	127
Adult residential	care home w/ nursing	5	1106	32	35	189
	care home w/o nursing	8	2179	167	210	832
	Domicilary Care	0	0	0	0	0
	Supported Living	0	0	0	0	0

Check the distribution of service types over service group

table2 <- datasummary(totalstaff*form*(Mean + SD) ~ size_type,
            data = merged_wide, fmt = 0)
table2

	form		large	medium	micro	small
totalstaff	CIC	Mean	84	67	9	27
		SD	154	34		10
	FPO	Mean	59	65	6	28
		SD	65	41	2	11
	GOV	Mean	53	22		17
		SD	60	15
	IND	Mean		58	6	24
		SD		25	2	10
	NPO	Mean	46	60	7	27
		SD	65	49	2	11

Derive new variables

About turnover rate

Employee turnover rate is the percent of employees who leave a company within a specific time period. Turnover rate is commonly calculated by month, quarter, or year and includes both voluntary and involuntary losses. Monthly and quarterly turnover rates are commonly expressed as averages, while annual turnover rate is usually cumulative.

\[ \text{Employee turnover rate} = \frac{\text{Number of employees who left}}{\text{Average number of employees}} \times 100 \]

\[ \text{Average number of employees} = \frac{\text{(Headcount at the begining of the timeframe + headcount at the end of the timeframe)}}{2} \]

Strictly speaking, we do not have the exact begin/end staff number for the n_left employees in the last year

New variables could be derived

count of employees left in last 12 month/latest total staff by service type, total staff/ number of vacancy - not useful due to a lot of missing value

Any other ideas? Please specify exact variable rather than vague concept

# derive new variables
merged_recoded_wide <- merged_wide %>% 
# mutate NA only for jr28flag = 1 and jr29 = 1, respectively
  mutate(across(starts_with("jr28"), 
          ~ ifelse(jr28flag == 1, ifelse(is.na(.x), 0, .x), .x))) %>% 
  mutate(across(starts_with("jr29"), 
          ~ ifelse(jr29flag == 1, ifelse(is.na(.x), 0, .x), .x))) %>% 
  mutate(across(starts_with("st"),
         ~ ifelse(is.na(.x), 0, .x))) %>% 
  mutate(totalstaff = ifelse(is.na(totalstaff), 0, totalstaff),
         turnover = totalleavers/totalstaff*100,
         turnover = ifelse(turnover < 0, NA, turnover),
         turnover = ifelse(is.na(turnover) | turnover == Inf, NA, turnover)) %>% 
  rowwise() %>% 
  mutate(capacity = sum(across(matches("^st.*cap$"))),
         utility = sum(across(matches("^st.*util$")))) %>% 
  ungroup() %>% 
  mutate(staff_level = totalstaff/utility*100,
         staff_level = ifelse(staff_level == Inf, NA, staff_level))

For now, haven’t accounted for the vacancies. Because there seems to be quite some missing values. Need to check in more detail later.

table_staff <- datasummary(totalstaff * (Mean+Median)  ~ form,
            data = filter(merged_recoded_wide, !is.na(main_service_group)), fmt = 1, na.rm = T, output = "staff2.docx")
table_staff

## [1] "staff2.docx"

ggplot(filter(merged_recoded_wide, !is.na(main_service_group)), 
       aes(x = form, y = totalstaff)) +
  geom_boxplot() +
  coord_cartesian(ylim =c(0, 500)) +
  facet_wrap(vars(main_service_group))

table4 <- datasummary(turnover * service_type_selected *(Mean + SD)~ form,
            data = merged_recoded_wide, fmt = 1, na.rm = T)
table4

	service_type_selected		CIC	FPO	GOV	IND	NPO
turnover	care home w/ nursing	Mean	40.3	34.9	14.3	18.3	27.3
		SD	28.7	34.4	17.1	16.0	26.0
	care home w/o nursing	Mean	9.1	28.9	13.5	18.9	21.6
		SD	12.9	34.2	11.6	27.1	26.6
	Domicilary Care	Mean	30.7	37.7	12.3	28.9	26.2
		SD	29.7	59.2	9.0	35.6	33.3
	Supported Living	Mean	10.7	26.1	10.6		33.7
		SD		30.4	10.7		65.7

Run Toy Models

Prepare data for analysis

#pivot longer back
merged_long <- merged_recoded_wide %>% 
  pivot_longer(cols = c(Safe, Effective, Caring, Responsive, `Well-led`, Overall),
               names_to = "domain", values_to = "rating")

#select relevant columns, rename and relabel 
merged_long_ordered <- merged_long %>% 
# recode legal form types to be more readable / easier to present
  mutate(inherited = ifelse(inherited == "Y", TRUE, FALSE),
         rating = recode(rating, 
                         "Insufficient evidence to rate" = "NA",
                         "Requires improvement" = "Req improv")) %>% 
  # set the order of the values in the factors 
  mutate(form = ordered(form, levels = c("FPO", "NPO", "GOV", "CIC", "IND")),
         
  # assume the order of the ratings as follows but need to double check with the source 
         rating = ordered(rating, levels = c("Inadequate","Req improv", "Good", "Outstanding"))) %>%
  
  # adding the rating data coded as numerical
  mutate(rating_num = case_when(rating == "Inadequate" ~ 1,
                                rating == "Req improv" ~ 2,
                                rating == "Good" ~ 3,
                                rating == "Outstanding" ~ 4)) %>% 
  
  mutate(category = case_when(primary_cat == "Community based adult social care services" ~ "community",
                              primary_cat == "Residential social care" ~ "residential",
                              TRUE ~ as.character(NA)),

         # deriving year column and dummy variable for before_covid
         year = year(publication_date),
         after_covid = ifelse(year >= 2020, TRUE, FALSE),
         before_covid = ifelse(year <= 2019, TRUE, FALSE)) %>%

  # converting the ordinal variable to numerical 
  mutate(rating_num = case_when(rating == "Inadequate" ~ 1,
                                rating == "Req improv" ~ 2,
                                rating == "Good" ~ 3,
                                rating == "Outstanding" ~ 4)) %>% 
  # derive the rating dummy
  mutate(rating_higher = ifelse(rating_num > 2, 1, 0))

# show first several rows of the data set derived 
head(merged_long_ordered)

## # A tibble: 6 × 84
##   form  cic_type care_home primary_cat  region service_group publication_date   
##   <ord> <chr>    <chr>     <chr>        <chr>  <chr>         <dttm>             
## 1 FPO   NA       Y         Residential… East … Overall       2020-03-03 00:00:00
## 2 FPO   NA       Y         Residential… East … Overall       2020-03-03 00:00:00
## 3 FPO   NA       Y         Residential… East … Overall       2020-03-03 00:00:00
## 4 FPO   NA       Y         Residential… East … Overall       2020-03-03 00:00:00
## 5 FPO   NA       Y         Residential… East … Overall       2020-03-03 00:00:00
## 6 FPO   NA       Y         Residential… East … Overall       2020-03-03 00:00:00
## # ℹ 77 more variables: inherited <lgl>, establishmentid <dbl>,
## #   ParentSubSingle <dbl>, isparent <dbl>, estabcreateddate <chr>,
## #   Starters_Leavers_FILTER <dbl>, totalstarters <dbl>, totalleavers <dbl>,
## #   totalstaff <dbl>, totalvacancies <dbl>, ORGWRKS <dbl>, ORGWRKSGP <dbl>,
## #   esttype <dbl>, mainstid <dbl>, MAINSERGP1 <dbl>, jr28flag <dbl>,
## #   jr28perm <dbl>, jr28temp <dbl>, jr28pool <dbl>, jr28agcy <dbl>,
## #   jr28oth <dbl>, jr28emp <dbl>, jr28work <dbl>, jr28strt <dbl>, …

table3 <- datasummary(staff_level * main_service_group *(Mean + SD) ~ form,
            data = merged_long_ordered, fmt = 1, na.rm = T)
table3

	main_service_group		FPO	NPO	GOV	CIC	IND
staff_level	Adult domicilary	Mean	112.6	229.9	215.8	137.6	71.5
		SD	187.2	464.1	330.9	133.6	57.4
	Adult residential	Mean	166.7	204.7	255.1	352.7	152.6
		SD	160.6	118.6	168.4	210.9	313.2

Save data

write.csv(merged_long_ordered, file = "data/cqc_sfc_cleaned.csv")