To analyze the information provided by the calls and locations datasets, we begin by visualizing and understanding the contents of both.
# install.packages("pacman")
library(pacman)
p_load(tidyverse, DataExplorer, readr, visdat, janitor, lubridate, pastecs, sf, leaflet, rnaturalearth, rnaturalearthdata, rgeos, maps, h2o)
options(scipen = 999999)
We explore the structure of the data in the call data frame
call <- read_csv2("Call_Duration.csv")
## i Using ',' as decimal and '.' as grouping mark. Use `read_delim()` for more control.
##
## -- Column specification --------------------------------------------------------
## cols(
## ID = col_character(),
## Start_Time = col_character(),
## Call_Duration_Seconds = col_double()
## )
dim(call)
## [1] 293750 3
glimpse(call)
## Rows: 293,750
## Columns: 3
## $ ID <chr> "963f7add01ba11591797326527", "2570f1aaff7cf1...
## $ Start_Time <chr> "10/06/2020 13:55:27", "10/06/2020 17:35:06",...
## $ Call_Duration_Seconds <dbl> 127, 89, 440, 71, 81, 0, 132, 339, 50, 0, 0, ...
str(call)
## tibble [293,750 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : chr [1:293750] "963f7add01ba11591797326527" "2570f1aaff7cf1591810505504" "41eb8a54bdf721591809435000" "41eb8a54bdf721591811591000" ...
## $ Start_Time : chr [1:293750] "10/06/2020 13:55:27" "10/06/2020 17:35:06" "10/06/2020 17:17:15" "10/06/2020 17:53:11" ...
## $ Call_Duration_Seconds: num [1:293750] 127 89 440 71 81 0 132 339 50 0 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_character(),
## .. Start_Time = col_character(),
## .. Call_Duration_Seconds = col_double()
## .. )
# DataExplorer::create_report(call)
head(call)
## # A tibble: 6 x 3
## ID Start_Time Call_Duration_Seconds
## <chr> <chr> <dbl>
## 1 963f7add01ba11591797326527 10/06/2020 13:55:27 127
## 2 2570f1aaff7cf1591810505504 10/06/2020 17:35:06 89
## 3 41eb8a54bdf721591809435000 10/06/2020 17:17:15 440
## 4 41eb8a54bdf721591811591000 10/06/2020 17:53:11 71
## 5 41eb8a54bdf721591812606000 10/06/2020 18:10:06 81
## 6 74007cb3f77161591806866000 10/06/2020 16:34:26 0
summary(call)
## ID Start_Time Call_Duration_Seconds
## Length:293750 Length:293750 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Median : 10.00
## Mean : 95.45
## 3rd Qu.: 98.00
## Max. :43895.00
vis_miss(call)
# call_dups <- get_dupes(call)
IQR(call$Call_Duration_Seconds)
## [1] 98
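A quick sketch to back the uniqueness and completeness claims in the next paragraph (a lighter-weight alternative to the get_dupes() call kept commented out above):
sum(duplicated(call$ID))   # expected to be 0 (no duplicated IDs)
sum(is.na(call))           # expected to be 0 (no missing values)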
The call dataset contains 293750 records and 3 columns: the id, the start time of the call and the duration in seconds. There are no missing data and no duplicates in the ID column. However, there do seem to be outlier values in the call duration, which we can observe in more detail with a histogram and a boxplot.
ggplot() +
  geom_histogram(data = call, aes(x = Call_Duration_Seconds), fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Call Duration Histogram", x = "Seconds")
ggplot() +
  geom_boxplot(data = call, aes(x = "", y = Call_Duration_Seconds), fill = "aquamarine2", color = "black") +
  theme_minimal() +
  labs(title = "Call Duration Boxplot", x = "", y = "Seconds")
We can create some new columns from the Start_Time variable in order to obtain more information. The date format assumed is the one that fits all cases (day/month/year hours:minutes:seconds).
call <- call %>%
mutate(Start_Time_Date = lubridate::dmy_hms(Start_Time),
start_year = lubridate::year(Start_Time_Date),
start_month = lubridate::month(Start_Time_Date),
start_weekday = lubridate::wday(Start_Time_Date, abbr = F, label = T, week_start = 1),
start_hour = lubridate::hour(Start_Time_Date))
summary(call)
## ID Start_Time Call_Duration_Seconds
## Length:293750 Length:293750 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Median : 10.00
## Mean : 95.45
## 3rd Qu.: 98.00
## Max. :43895.00
##
## Start_Time_Date start_year start_month
## Min. :2000-06-07 13:26:49 Min. :2000 Min. : 1.000
## 1st Qu.:2020-06-10 09:52:36 1st Qu.:2020 1st Qu.: 6.000
## Median :2020-06-10 13:34:46 Median :2020 Median : 6.000
## Mean :2020-06-06 10:49:38 Mean :2020 Mean : 6.002
## 3rd Qu.:2020-06-10 16:07:18 3rd Qu.:2020 3rd Qu.: 6.000
## Max. :2020-10-28 17:16:31 Max. :2020 Max. :11.000
##
## start_weekday start_hour
## Monday : 19 Min. : 0.00
## Tuesday : 70 1st Qu.: 9.00
## Wednesday:293598 Median :13.00
## Thursday : 24 Mean :12.45
## Friday : 8 3rd Qu.:16.00
## Saturday : 8 Max. :23.00
## Sunday : 23
We can explore the start time in more depth with these new columns.
ggplot(call) +
aes(x = start_year) +
geom_bar(fill = "aquamarine2", color = "black") +
theme_minimal() +
labs(title = "Calls by year")
ggplot(call) +
aes(x = start_month) +
geom_bar(fill = "aquamarine2", color = "black") +
theme_minimal() +
labs(title = "Calls by month")
ggplot(call) +
aes(x = start_weekday) +
geom_bar(fill = "aquamarine2", color = "black") +
theme_minimal() +
labs(title = "Calls by weekday")
ggplot(call) +
aes(x = start_hour) +
geom_bar(fill = "aquamarine2", color = "black") +
theme_minimal() +
labs(title = "Calls by hour")
At this point we can say, according to the calls dataset, that 293750 unique calls were made between 2000-06-07 and 2020-10-28, with an average duration of 95 seconds and a median of 10 seconds (which indicates the presence of strong outliers). The range goes from 0 to 43895 seconds, with a standard deviation of 664, a coefficient of variation of 7, and an interquartile range of 98. These values, along with the boxplot and histogram, show a strong concentration of short calls and a few very long ones. The vast majority of calls were made in 2020, in the month of June, on a Wednesday, and between 6:00 and 17:00, with the mode at 17:00. The cases that do not fall in the year 2020, in the month of June, or on a Wednesday should be explored further as outliers or possible data entry mistakes.
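The dispersion figures quoted above (a standard deviation near 664 and a coefficient of variation near 7) can be reproduced with pastecs, which is already loaded through p_load; a minimal sketch:
pastecs::stat.desc(call$Call_Duration_Seconds)   # includes std.dev and coef.var

# Or computed directly
sd(call$Call_Duration_Seconds)
sd(call$Call_Duration_Seconds) / mean(call$Call_Duration_Seconds)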
We can also explore the Location dataset using a similar methodology
location <- read_csv("Location.csv")
dim(location)
## [1] 2539592 6
glimpse(location)
## Rows: 2,539,592
## Columns: 6
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ r_id <chr> "963f7add01ba1", "963f7add01ba1", "963f7add01ba1", ...
## $ latitu <dbl> 36.399, 36.399, 36.399, 36.399, 36.399, 36.399, 36....
## $ longitu <dbl> -97.858, -97.858, -97.858, -97.858, -97.858, -97.85...
## $ call_start_time <dbl> 1591797326527, 1591797326527, 1591797326527, 159179...
head(location)
## # A tibble: 6 x 6
## X1 ID r_id latitu longitu call_start_time
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 1 963f7add01ba1 36.4 -97.9 1591797326527
## 2 2 2 963f7add01ba1 36.4 -97.9 1591797326527
## 3 3 3 963f7add01ba1 36.4 -97.9 1591797326527
## 4 4 4 963f7add01ba1 36.4 -97.9 1591797326527
## 5 5 5 963f7add01ba1 36.4 -97.9 1591797326527
## 6 6 6 963f7add01ba1 36.4 -97.9 1591797326527
summary(location)
## X1 ID r_id latitu
## Min. : 1 Min. : 1 Length:2539592 Min. :-48.94
## 1st Qu.: 634899 1st Qu.: 634899 Class :character 1st Qu.: 32.74
## Median :1269797 Median :1269797 Mode :character Median : 36.73
## Mean :1269797 Mean :1269797 Mean : 32.36
## 3rd Qu.:1904694 3rd Qu.:1904694 3rd Qu.: 40.76
## Max. :2539592 Max. :2539592 Max. : 71.30
## NA's :78322
## longitu call_start_time
## Min. :-171.71 Min. : 960384409000
## 1st Qu.:-105.00 1st Qu.:1591782884230
## Median : -86.92 Median :1591797677770
## Mean : -92.86 Mean :1591703397430
## 3rd Qu.: -80.14 3rd Qu.:1591806107140
## Max. : 158.18 Max. :1603905391000
## NA's :78322
sapply(location, function(x) sum(is.na(x)))
## X1 ID r_id latitu longitu
## 0 0 0 78322 78322
## call_start_time
## 0
unique_locations <- unique(location$r_id)
# excel_issue <- head(janitor::excel_numeric_to_date(location$call_start_time, include_time = TRUE))
max(nchar(location$r_id))
## [1] 13
min(nchar(location$r_id))
## [1] 13
The fact that the format of the two datasets is different (in the column separators used, for example) may indicate that they come from different sources or institutions, or are stored in two separate repositories; this should be taken into account when combining the information that both possess.
The location dataset has 2539592 rows and 6 variables; 78322 rows have missing values in the latitude and longitude columns. From the exploration we can assume that the r_id column matches, in a way we will explore later, the ID column from the call data frame, while ID here is an internal key variable. We should also take into account that several rows share the same r_id value.
From the location dataset we can conclude that there are 2539592 location records belonging to 129723 unique ids.
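The unique-id count quoted above comes directly from the vector computed earlier:
length(unique_locations)   # number of distinct r_id values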
There seems to be an issue with the call start time column, possibly due to parsing from Excel or other software. Further inquiries to the data source are suggested.
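One reading worth checking, purely as an assumption on our side, is that call_start_time is a Unix epoch in milliseconds rather than an Excel serial date; a quick sketch:
# Assumption: interpret call_start_time as milliseconds since 1970-01-01 and
# compare the resulting range with the Start_Time values in the call dataset
location %>%
  mutate(start_parsed = lubridate::as_datetime(call_start_time / 1000)) %>%
  summarise(min_start = min(start_parsed), max_start = max(start_parsed))
If that range lines up with the call dataset's dates, the column probably only needs conversion rather than correction at the source.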
Given that the dataset contains location information for the calls, we can use it to see their distribution across the territory, working with a sample of the records for performance reasons.
set.seed(2621)
names(location)
## [1] "X1" "ID" "r_id" "latitu"
## [5] "longitu" "call_start_time"
location_s <- sample_n(location, 9000) %>% filter(!is.na(latitu))
leaflet(location_s) %>%
addCircles(lng = ~longitu, lat = ~latitu, radius = 2) %>%
addProviderTiles(providers$CartoDB.Positron)
world <- ne_countries(scale = "medium", returnclass = "sf")
location_sf <- st_as_sf(location %>% filter(!is.na(latitu)), coords = c("longitu", "latitu"), crs = 4326)
location_sf_s <- st_as_sf(location %>% filter(!is.na(latitu)) %>% sample_n(9000), coords = c("longitu", "latitu"), crs = 4326)
ggplot() +
geom_sf(data = world, color = "white", fill = "aquamarine2", alpha = 0.8) +
geom_sf(data = location_sf_s, size = 0.3, alpha = 0.4) +
theme_void()
After examining the geolocation we can conclude that the majority of the records come from the US, although there are, even in samples, some points registered on other continents; whether those points are mistakes or have another explanation should be analyzed further.
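A rough sketch to quantify how many of the sampled points fall outside an approximate bounding box of the contiguous US (the bounds are our own assumption and will also flag Alaska and Hawaii):
location_s %>%
  filter(latitu < 24 | latitu > 50 | longitu < -125 | longitu > -66) %>%
  nrow()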
We will join both data frames, given that they seem to share a common key variable, in order to get more value from their combination. A right join will be made to match the location structure, since a full join loses information. The id columns seem to share a common pattern but have different lengths. The ID column in the call dataset appears to append what could be suspected to be a telephone number, so it will be parsed to obtain a matching key as well as country and area information.
call <- call %>%
separate(ID, into = c('r_id', 'number'), sep = 13) %>%
mutate(country = str_sub(number, 1, 3),
area = str_sub(number, 4, 6))
count_area <- call %>%
count(area, sort = TRUE) %>%
slice_head(n = 10 )
ggplot() +
geom_col(data = count_area, aes(x = fct_reorder(area, -n), y = n), fill = "aquamarine2", color = "black") +
theme_minimal() +
labs(title = "Phone number areas with most cases", x = "Prefix", y = "Records")
From these new variables we can see that the prefixes 180, 179, 177 and 178 have the largest number of records.
To add more context, we can use publicly available data to obtain the polygons of the US states and determine which state contains each point.
states <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
head(states)
## Simple feature collection with 6 features and 1 field
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -124.3834 ymin: 30.24071 xmax: -71.78015 ymax: 42.04937
## geographic CRS: WGS 84
## ID geom
## 1 alabama MULTIPOLYGON (((-87.46201 3...
## 2 arizona MULTIPOLYGON (((-114.6374 3...
## 3 arkansas MULTIPOLYGON (((-94.05103 3...
## 4 california MULTIPOLYGON (((-120.006 42...
## 5 colorado MULTIPOLYGON (((-102.0552 4...
## 6 connecticut MULTIPOLYGON (((-73.49902 4...
ggplot() +
geom_sf(data = states) +
theme_void()
states <- cbind(states, st_coordinates(st_centroid(states)))
## [1] "r_id" "state" "X" "Y"
## [1] "r_id" "number" "Start_Time"
## [4] "Call_Duration_Seconds" "Start_Time_Date" "start_year"
## [7] "start_month" "start_weekday" "start_hour"
## [10] "country" "area"
## Rows: 2,183
## Columns: 8
## Groups: state, area, start_year, start_month, start_hour [2,183]
## $ state <chr> "alabama", "alabama", "alabama", "alabama", "alabama"...
## $ area <chr> "053", "063", "075", "124", "126", "156", "163", "165...
## $ start_year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,...
## $ start_month <dbl> 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
## $ start_hour <int> 23, 3, 13, 5, 10, 20, 15, 21, 23, 13, 15, 19, 20, 3, ...
## $ start_weekday <chr> "Tuesday", "Thursday", "Friday", "Thursday", "Thursda...
## $ contacts <int> 50, 37, 43, 42, 39, 43, 46, 39, 44, 34, 39, 38, 37, 9...
## $ total_call <dbl> 700, 1073, 43, 588, 0, 76712, 0, 7449, 0, 0, 0, 0, 0,...
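The chunk that produced the output above is not echoed; the following is a minimal sketch of how the point-in-state assignment and aggregation could look, under the assumption that the object and column names mirror the printed output (contacts_summary, contacts and total_call are our own guesses):
# Keep only the state name and polygon, avoiding a clash with location's ID column
states_poly <- states %>% select(state = ID)

# Assign each located call to a state via a spatial join; this is heavy on the
# full 2.5M points, so the sampled location_sf_s could be used instead
location_state <- st_join(location_sf, states_poly) %>%
  st_drop_geometry() %>%
  distinct(r_id, state)   # assume one state per call for simplicity

# Join to the call data on the shared r_id key and aggregate contacts and
# total call time per state / area / time slot
contacts_summary <- location_state %>%
  right_join(call, by = "r_id") %>%
  group_by(state, area, start_year, start_month, start_hour, start_weekday) %>%
  summarise(contacts = n(),
            total_call = sum(Call_Duration_Seconds),
            .groups = "drop_last")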
From the results above we can see that California, Texas and Florida are the top 3 states by number of contacts. Adding census data could be useful in further work to check whether this is proportional to the relevant population of each state.
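Under the same assumptions as the sketch above, the state ranking can be checked directly:
contacts_summary %>%
  group_by(state) %>%
  summarise(contacts = sum(contacts), .groups = "drop") %>%
  arrange(desc(contacts)) %>%
  slice_head(n = 10)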
With the data we have so far we can make predictions on how many contacts will be made given the state, area, year, month, weekday and hour.
The automl function shows which model and corresponding parameters are the most effective, while varimp() shows the most important variables (as a plot). Predictions on data unseen by the model can be shown in a table.
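A minimal sketch of that workflow with h2o (already loaded via p_load); the contacts_summary frame, the 80/20 split, the seed and the number of models are our own assumptions:
h2o.init()

# Convert the aggregated table to an H2OFrame; character columns become categorical
contacts_h2o <- as.h2o(ungroup(contacts_summary))

# Hold out part of the data so predictions can be checked on unseen rows
splits <- h2o.splitFrame(contacts_h2o, ratios = 0.8, seed = 2621)
train <- splits[[1]]
test  <- splits[[2]]

aml <- h2o.automl(y = "contacts",
                  x = c("state", "area", "start_year", "start_month",
                        "start_weekday", "start_hour"),
                  training_frame = train,
                  max_models = 10,
                  seed = 2621)

aml@leaderboard                      # which model and parameters performed best
h2o.varimp_plot(aml@leader)          # most important variables, if the leader supports it
head(as.data.frame(h2o.predict(aml@leader, test)))   # predictions on unseen data, as a table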
The same can be done to predict the duration of the call, in case it has any correlation with the type or complexity of the call, the quality of service, or the resources needed.
The errors are quite high given that there are few cases per combination; this approach is weaker in terms of prediction but stronger in terms of interpretability.
The call duration can also be predicted in the same manner as above.
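Under the same assumptions as the sketch above, only the response column changes for the duration model (total_call is our guess for the duration column, based on the printed output):
aml_duration <- h2o.automl(y = "total_call",
                           x = c("state", "area", "start_year", "start_month",
                                 "start_weekday", "start_hour"),
                           training_frame = train,
                           max_models = 10,
                           seed = 2621)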
Using a sample from the complete dataset can improve the results and provide information on the location of the predicted records