Analysis of Data 1 - Technical Test

In order to analyze the information provided by the call and location datasets, we begin by visualizing and understanding the contents of both.

# install.packages("pacman")

library(pacman)
p_load(tidyverse, DataExplorer, readr, visdat, janitor, lubridate, pastecs, sf, leaflet, rnaturalearth, rnaturalearthdata, rgeos, maps, h2o)
options(scipen = 999999)

We first explore the structure of the call data frame.

call <- read_csv2("Call_Duration.csv")
## i Using ',' as decimal and '.' as grouping mark. Use `read_delim()` for more control.
## 
## -- Column specification --------------------------------------------------------
## cols(
##   ID = col_character(),
##   Start_Time = col_character(),
##   Call_Duration_Seconds = col_double()
## )
dim(call)
## [1] 293750      3
glimpse(call)
## Rows: 293,750
## Columns: 3
## $ ID                    <chr> "963f7add01ba11591797326527", "2570f1aaff7cf1...
## $ Start_Time            <chr> "10/06/2020 13:55:27", "10/06/2020 17:35:06",...
## $ Call_Duration_Seconds <dbl> 127, 89, 440, 71, 81, 0, 132, 339, 50, 0, 0, ...
str(call)
## tibble [293,750 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ID                   : chr [1:293750] "963f7add01ba11591797326527" "2570f1aaff7cf1591810505504" "41eb8a54bdf721591809435000" "41eb8a54bdf721591811591000" ...
##  $ Start_Time           : chr [1:293750] "10/06/2020 13:55:27" "10/06/2020 17:35:06" "10/06/2020 17:17:15" "10/06/2020 17:53:11" ...
##  $ Call_Duration_Seconds: num [1:293750] 127 89 440 71 81 0 132 339 50 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_character(),
##   ..   Start_Time = col_character(),
##   ..   Call_Duration_Seconds = col_double()
##   .. )
# DataExplorer::create_report(call)
head(call)
## # A tibble: 6 x 3
##   ID                         Start_Time          Call_Duration_Seconds
##   <chr>                      <chr>                               <dbl>
## 1 963f7add01ba11591797326527 10/06/2020 13:55:27                   127
## 2 2570f1aaff7cf1591810505504 10/06/2020 17:35:06                    89
## 3 41eb8a54bdf721591809435000 10/06/2020 17:17:15                   440
## 4 41eb8a54bdf721591811591000 10/06/2020 17:53:11                    71
## 5 41eb8a54bdf721591812606000 10/06/2020 18:10:06                    81
## 6 74007cb3f77161591806866000 10/06/2020 16:34:26                     0
summary(call)
##       ID             Start_Time        Call_Duration_Seconds
##  Length:293750      Length:293750      Min.   :    0.00     
##  Class :character   Class :character   1st Qu.:    0.00     
##  Mode  :character   Mode  :character   Median :   10.00     
##                                        Mean   :   95.45     
##                                        3rd Qu.:   98.00     
##                                        Max.   :43895.00
vis_miss(call)

# call_dups <- get_dupes(call)
IQR(call$Call_Duration_Seconds)
## [1] 98

The call dataset contains 293750 records and 3 columns: the ID, the start time of the call and its duration in seconds. There are no missing values and no duplicated entries in the ID column. However, there do seem to be outlier values in the call duration, which we can examine in more detail with a histogram and a boxplot.
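
Since the get_dupes() call was left commented out above, a minimal sketch of how the uniqueness claim on the ID column can be verified is:

# Count repeated identifiers in the ID column (0 means no duplicates)
sum(duplicated(call$ID))
# Equivalent check with janitor, already loaded via p_load:
# call %>% get_dupes(ID)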

ggplot() +
  geom_histogram(data = call, aes(x = Call_Duration_Seconds), fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Call Duration Histogram", x = "Seconds")

ggplot() +
  geom_boxplot(data = call, aes(x = "", y = Call_Duration_Seconds), fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Call Duration Boxplot", y = "Seconds", x = NULL)

We can create some new columns from the Start_Time variable in order to obtain more information. The date format assumed is the one that fits all cases: day, month, year, hours, minutes and seconds.

call <- call %>% 
  mutate(Start_Time_Date = lubridate::dmy_hms(Start_Time),
         start_year = lubridate::year(Start_Time_Date),
         start_month = lubridate::month(Start_Time_Date),
         start_weekday = lubridate::wday(Start_Time_Date, abbr = F, label = T,  week_start = 1),
         start_hour = lubridate::hour(Start_Time_Date))

summary(call)
##       ID             Start_Time        Call_Duration_Seconds
##  Length:293750      Length:293750      Min.   :    0.00     
##  Class :character   Class :character   1st Qu.:    0.00     
##  Mode  :character   Mode  :character   Median :   10.00     
##                                        Mean   :   95.45     
##                                        3rd Qu.:   98.00     
##                                        Max.   :43895.00     
##                                                             
##  Start_Time_Date                 start_year    start_month    
##  Min.   :2000-06-07 13:26:49   Min.   :2000   Min.   : 1.000  
##  1st Qu.:2020-06-10 09:52:36   1st Qu.:2020   1st Qu.: 6.000  
##  Median :2020-06-10 13:34:46   Median :2020   Median : 6.000  
##  Mean   :2020-06-06 10:49:38   Mean   :2020   Mean   : 6.002  
##  3rd Qu.:2020-06-10 16:07:18   3rd Qu.:2020   3rd Qu.: 6.000  
##  Max.   :2020-10-28 17:16:31   Max.   :2020   Max.   :11.000  
##                                                               
##    start_weekday      start_hour   
##  Monday   :    19   Min.   : 0.00  
##  Tuesday  :    70   1st Qu.: 9.00  
##  Wednesday:293598   Median :13.00  
##  Thursday :    24   Mean   :12.45  
##  Friday   :     8   3rd Qu.:16.00  
##  Saturday :     8   Max.   :23.00  
##  Sunday   :    23

We can explore the start time in more depth with these new columns.

ggplot(call) +
  aes(x = start_year) +
  geom_bar(fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Calls by year")

ggplot(call) +
  aes(x = start_month) +
  geom_bar(fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Calls by month")

ggplot(call) +
  aes(x = start_weekday) +
  geom_bar(fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Calls by weekday")

ggplot(call) +
  aes(x = start_hour) +
  geom_bar(fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Calls by hour")

At this point we can say that, according to the calls dataset, 293750 unique calls were made between 2000-06-07 and 2020-10-28, with an average duration of 95 seconds and a median of 10 seconds (which indicates the presence of strong outliers). The range goes from 0 to 43895 seconds, with a standard deviation of 664, a coefficient of variation of 7 and an interquartile range of 98. These values, together with the boxplot and histogram, show a strong concentration of short calls and a few very long ones. The vast majority of calls were made in 2020, in the month of June, on a Wednesday and between 6:00 and 17:00, with the mode at 17:00. The cases that do not belong to the year 2020, the month of June and a Wednesday should be further explored as outliers or possible data entry mistakes.
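
The standard deviation and coefficient of variation cited above are not computed in the chunks shown; since pastecs is already loaded, a minimal sketch of how they can be obtained is:

# Descriptive statistics for call duration, including std.dev and coef.var
pastecs::stat.desc(call$Call_Duration_Seconds)
# Or explicitly:
sd(call$Call_Duration_Seconds)
sd(call$Call_Duration_Seconds) / mean(call$Call_Duration_Seconds)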

We can also explore the Location dataset using a similar methodology.

location <- read_csv("Location.csv")

dim(location)
## [1] 2539592       6
glimpse(location)
## Rows: 2,539,592
## Columns: 6
## $ X1              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ ID              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ r_id            <chr> "963f7add01ba1", "963f7add01ba1", "963f7add01ba1", ...
## $ latitu          <dbl> 36.399, 36.399, 36.399, 36.399, 36.399, 36.399, 36....
## $ longitu         <dbl> -97.858, -97.858, -97.858, -97.858, -97.858, -97.85...
## $ call_start_time <dbl> 1591797326527, 1591797326527, 1591797326527, 159179...
head(location)
## # A tibble: 6 x 6
##      X1    ID r_id          latitu longitu call_start_time
##   <dbl> <dbl> <chr>          <dbl>   <dbl>           <dbl>
## 1     1     1 963f7add01ba1   36.4   -97.9   1591797326527
## 2     2     2 963f7add01ba1   36.4   -97.9   1591797326527
## 3     3     3 963f7add01ba1   36.4   -97.9   1591797326527
## 4     4     4 963f7add01ba1   36.4   -97.9   1591797326527
## 5     5     5 963f7add01ba1   36.4   -97.9   1591797326527
## 6     6     6 963f7add01ba1   36.4   -97.9   1591797326527
summary(location)
##        X1                ID              r_id               latitu      
##  Min.   :      1   Min.   :      1   Length:2539592     Min.   :-48.94  
##  1st Qu.: 634899   1st Qu.: 634899   Class :character   1st Qu.: 32.74  
##  Median :1269797   Median :1269797   Mode  :character   Median : 36.73  
##  Mean   :1269797   Mean   :1269797                      Mean   : 32.36  
##  3rd Qu.:1904694   3rd Qu.:1904694                      3rd Qu.: 40.76  
##  Max.   :2539592   Max.   :2539592                      Max.   : 71.30  
##                                                         NA's   :78322   
##     longitu        call_start_time        
##  Min.   :-171.71   Min.   : 960384409000  
##  1st Qu.:-105.00   1st Qu.:1591782884230  
##  Median : -86.92   Median :1591797677770  
##  Mean   : -92.86   Mean   :1591703397430  
##  3rd Qu.: -80.14   3rd Qu.:1591806107140  
##  Max.   : 158.18   Max.   :1603905391000  
##  NA's   :78322
sapply(location, function(x) sum(is.na(x)))
##              X1              ID            r_id          latitu         longitu 
##               0               0               0           78322           78322 
## call_start_time 
##               0
unique_locations <- unique(location$r_id)
# excel_issue <- head(janitor::excel_numeric_to_date(location$call_start_time, include_time = TRUE))
max(nchar(location$r_id))
## [1] 13
min(nchar(location$r_id))
## [1] 13

The fact that the format of the two datasets differs (for example, in the column separators used) can indicate that they belong to different sources or institutions, or are stored in two separate repositories, a fact that should be taken into account when combining the information they possess.

The location dataset has 2539592 rows and 6 variables, of which 78322 rows have missing values in the latitude and longitude attributes. From the exploration we can assume that the column r_id matches, in a way we will explore later, the ID column from the call data frame, while ID is an internal key variable. We should also take into account that several rows share the same r_id value.

From the location dataset we can conclude there were 2539592 location records belonging to 129723 unique ids.
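
The unique-id count quoted above is not printed in the chunks shown; it can be checked with the unique_locations vector created earlier:

# Number of distinct r_id values in the location table
length(unique_locations)
dplyr::n_distinct(location$r_id)   # equivalent check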

There seems to be an issue with the column containing the start date, which could be due to parsing from Excel or other software. Further inquiries to the source are suggested.
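
A quick sanity check, assuming (this is only a hypothesis) that call_start_time holds Unix epoch timestamps in milliseconds rather than corrupted Excel dates, is to convert the values and compare them with the parsed start times in the call table:

# If call_start_time is a Unix epoch in milliseconds, this should yield plausible 2020 dates
head(lubridate::as_datetime(location$call_start_time / 1000))
# Compare with the parsed start times in the call table
head(call$Start_Time_Date)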

Given that the dataset contains information about the location of the calls, we can use it to see their distribution across the territory, working with at least a sample of the records for performance reasons.

set.seed(2621)

names(location)
## [1] "X1"              "ID"              "r_id"            "latitu"         
## [5] "longitu"         "call_start_time"
location_s <- sample_n(location, 9000) %>% filter(!is.na(latitu))

leaflet(location_s) %>%
  addCircles(lng = ~longitu, lat = ~latitu, radius = 2) %>%
  addProviderTiles(providers$CartoDB.Positron)
world <- ne_countries(scale = "medium", returnclass = "sf")

location_sf <- st_as_sf(location %>% filter(!is.na(latitu)), coords = c("longitu", "latitu"), crs = 4326)


location_sf_s <- st_as_sf(location %>% filter(!is.na(latitu)) %>% sample_n(9000), coords = c("longitu", "latitu"), crs = 4326)



ggplot() +
  geom_sf(data = world, color = "white", fill = "aquamarine2", alpha = 0.8) +
  geom_sf(data = location_sf_s, size = 0.3, alpha = 0.4) +
  theme_void()

After examining the geolocation we can conclude that the majority of the records are from the US, although, even in samples, some points are registered in other continents; the relevance of those points should be analyzed to determine whether they are mistakes or have some other explanation.
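
One way to quantify those out-of-US points, sketched here on the sample and assuming the continent attribute of the rnaturalearth world layer, is a spatial join:

# Count sampled points per continent (points over the ocean will come back as NA)
location_sf_s %>%
  st_join(world["continent"]) %>%
  st_drop_geometry() %>%
  count(continent, sort = TRUE)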

We will join both data frames to get more value from their combination, given that they seem to share a common key variable. A right join will be made so that the result keeps the structure of the location table. The id columns seem to share a common pattern but have different lengths: the ID column in the call dataset appears to combine the location key with what could be a telephone number, so it will be parsed to obtain country and area information as well as a matching key.

call <- call %>% 
  separate(ID, into = c('r_id', 'number'), sep = 13) %>% 
  mutate(country = str_sub(number, 1, 3),
         area = str_sub(number, 4, 6))
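
The join itself does not appear in the chunks shown; a minimal sketch of the right join described above, assuming r_id is the shared key, would be:

# Keep every location record and attach the matching call attributes;
# note that call attributes are repeated for each location point of the same r_id
call_location <- call %>%
  right_join(location, by = "r_id")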

count_area <- call %>% 
  count(area, sort = TRUE) %>% 
  slice_head(n = 10 )


ggplot() +
  geom_col(data = count_area, aes(x = fct_reorder(area, -n), y = n), fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Phone number areas with most cases", x = "Prefix", y = "Records")

From this new variable we can see that the prefixes 180, 179, 177 and 178 have the largest number of records.

In order to add some more information, we can use publicly available data to obtain the polygons of the US states and see which state contains each point.

states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
head(states)
## Simple feature collection with 6 features and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.3834 ymin: 30.24071 xmax: -71.78015 ymax: 42.04937
## geographic CRS: WGS 84
##            ID                           geom
## 1     alabama MULTIPOLYGON (((-87.46201 3...
## 2     arizona MULTIPOLYGON (((-114.6374 3...
## 3    arkansas MULTIPOLYGON (((-94.05103 3...
## 4  california MULTIPOLYGON (((-120.006 42...
## 5    colorado MULTIPOLYGON (((-102.0552 4...
## 6 connecticut MULTIPOLYGON (((-73.49902 4...
ggplot() +
  geom_sf(data = states) +
  theme_void()

states <- cbind(states, st_coordinates(st_centroid(states)))

## [1] "r_id"  "state" "X"     "Y"
##  [1] "r_id"                  "number"                "Start_Time"           
##  [4] "Call_Duration_Seconds" "Start_Time_Date"       "start_year"           
##  [7] "start_month"           "start_weekday"         "start_hour"           
## [10] "country"               "area"
## Rows: 2,183
## Columns: 8
## Groups: state, area, start_year, start_month, start_hour [2,183]
## $ state         <chr> "alabama", "alabama", "alabama", "alabama", "alabama"...
## $ area          <chr> "053", "063", "075", "124", "126", "156", "163", "165...
## $ start_year    <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,...
## $ start_month   <dbl> 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
## $ start_hour    <int> 23, 3, 13, 5, 10, 20, 15, 21, 23, 13, 15, 19, 20, 3, ...
## $ start_weekday <chr> "Tuesday", "Thursday", "Friday", "Thursday", "Thursda...
## $ contacts      <int> 50, 37, 43, 42, 39, 43, 46, 39, 44, 34, 39, 38, 37, 9...
## $ total_call    <dbl> 700, 1073, 43, 588, 0, 76712, 0, 7449, 0, 0, 0, 0, 0,...
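
The chunks that assign each point to a state and build the summary glimpsed above are not echoed; a minimal sketch of that step, assuming a spatial join of the points to the state polygons followed by an aggregation of the joined call data (the names contacts and total_call are taken from the output above, but their exact definitions are assumptions), could be:

# Assign each location point to a US state polygon
# (heavy on the full table; a sample can be used instead)
location_states <- location_sf %>%
  select(r_id) %>%
  st_join(states["ID"]) %>%
  st_drop_geometry() %>%
  rename(state = ID) %>%
  distinct(r_id, state)

# Aggregate contacts and total call duration by state, area and start-time attributes;
# contacts is counted here as joined rows, total_call as summed duration (assumed definitions)
state_summary <- call %>%
  inner_join(location_states, by = "r_id") %>%
  group_by(state, area, start_year, start_month, start_hour, start_weekday) %>%
  summarise(contacts = n(), total_call = sum(Call_Duration_Seconds))

glimpse(state_summary)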

From the results above we can see that California, Texas and Florida are the top 3 states by number of contacts. Adding census data could be useful in further work, in order to see whether this is proportional to the relevant population.

With the data we have so far we can predict how many contacts will be made given the state, area, year, month, weekday and hour.

The h2o.automl() function identifies the most effective model and its corresponding parameters, while the variable importance functions (h2o.varimp(), h2o.varimp_plot()) show the most important variables as a plot. Predictions on data unseen by the model can be shown in a table.
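
The modelling code is not echoed above; a minimal sketch of this step, assuming a summary table like the state_summary sketched earlier with contacts as the target (the split ratio, runtime limit and exclusion of stacked ensembles are assumptions made here so that variable importance is available for the leader), could look like this:

h2o.init()

# Send the aggregated table to the h2o cluster, converting characters to factors
model_data <- as.h2o(ungroup(state_summary) %>% mutate(across(where(is.character), as.factor)))

# Train/test split for evaluating predictions on unseen data
splits <- h2o.splitFrame(model_data, ratios = 0.8, seed = 2621)
train  <- splits[[1]]
test   <- splits[[2]]

# AutoML over the predictors; the target is the number of contacts
aml <- h2o.automl(
  x = c("state", "area", "start_year", "start_month", "start_weekday", "start_hour"),
  y = "contacts",
  training_frame = train,
  max_runtime_secs = 300,
  exclude_algos = c("StackedEnsemble"),  # so the leader supports variable importance
  seed = 2621
)

h2o.varimp_plot(aml@leader)          # most important variables of the leading model
head(h2o.predict(aml@leader, test))  # predictions on unseen data

Swapping the target y to total_call (or modelling Call_Duration_Seconds at the record level) would correspond to the call-duration model mentioned below.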

The same can be done to predict the duration of the call, in case it correlates with the type or complexity of the call, the quality of service or the resources needed.

The errors are quite high given that there are few cases per combination in this approach, which is weaker in terms of prediction but stronger in terms of interpretability.

The call duration can also be predicted in the same manner as above.

Using a sample from the complete dataset can improve the results and provide information on the location of the predicted records.

Considerations:

  • Socioeconomic and demographic data per area could be added in order to make better predictions.
  • The nature or content of the call could be of vital importance to understand and improve the model.
  • A more powerful computer or infrastructure would make it possible to use all the points available in the data instead of a sample and obtain better results. In other words: a stronger infrastructure than a personal laptop is needed to properly process and analyze this data.
  • Face validity and domain-knowledge supervision are of great importance and value.

Insights:

  • 17:00 seems to be the peak hour for received calls.
  • California, Texas and Florida are the states with the highest number of calls.
  • The duration of the calls, using the median as the most robust measurement, is around 10 seconds, with only a few surpassing this value. Domain knowledge is needed to understand whether this amount of time is appropriate.
  • The dataset has several mistakes or outliers that need to be reviewed with the corresponding stakeholder.
  • Deep learning models and GBM ensembles seem to be the most effective algorithms with auto-tuned parameters.