## Observations: 12,299
## Variables: 17
## $ TaskNumber <chr> "1735", "1742", "1971", "2134", "2251", "2283", …
## $ Summary <chr> "Flag RI on SCM Message Summary screen using met…
## $ Priority <dbl> 1, 1, 2, 5, 10, 1, 5, 5, 6, 5, 2, 1, 3, 1, 1, 8,…
## $ RaisedByID <chr> "58", "58", "7", "50", "46", "13", "13", "13", "…
## $ AssignedToID <chr> "58", "42", "58", "42", "13", "13", "13", "58", …
## $ AuthorisedByID <chr> "6", "6", "6", "6", "6", "58", "6", "6", "6", "5…
## $ StatusCode <chr> "FINISHED", "FINISHED", "FINISHED", "FINISHED", …
## $ ProjectCode <chr> "PC2", "PC2", "PC2", "PC2", "PC2", "PC9", "PC2",…
## $ ProjectBreakdownCode <chr> "PBC42", "PBC21", "PBC75", "PBC42", "PBC21", "PB…
## $ Category <chr> "Development", "Development", "Operational", "De…
## $ SubCategory <chr> "Enhancement", "Enhancement", "In House Support"…
## $ HoursEstimate <dbl> 14.00, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00,…
## $ HoursActual <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, …
## $ DeveloperID <chr> "58", "42", "58", "42", "13", "13", "43", "58", …
## $ DeveloperHoursActual <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, …
## $ TaskPerformance <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ DeveloperPerformance <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, NA, 0.00, 0…
| Name | Piped data |
| Number of rows | 12299 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ProjectCode | 0 | 1 | 3 | 4 | 0 | 20 | 0 |
| TaskNumber | 0 | 1 | 4 | 5 | 0 | 10266 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| HoursEstimate | 0 | 1 | 10.15 | 28.84 | 0.01 | 1 | 3 | 7.0 | 910.00 | ▇▁▁▁▁ |
| HoursActual | 0 | 1 | 13.18 | 68.72 | 0.01 | 1 | 3 | 8.5 | 2490.16 | ▇▁▁▁▁ |
We have 20 projects, with 12299 estimates_raw. There isn’t one estimation for each task, as there’s only 10266 distinct values of TaskNumber.
For our analysis, we’ll use one estimate for task. If there’s more than one we’ll use the mean of estimates_raw:
## Observations: 10,268
## Variables: 8
## $ ProjectCode <chr> "PC1", "PC1", "PC1", "PC1", "PC1", "PC1", "PC1",…
## $ TaskNumber <chr> "11276", "11286", "11294", "11306", "11313", "11…
## $ Category <chr> "Management", "Development", "Development", "Dev…
## $ Priority <dbl> 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Summary <chr> "Meeting with CCC - CCC Contract and Quotation",…
## $ HoursEstimate <dbl> 14.0, 35.0, 21.0, 14.0, 28.0, 14.0, 49.0, 100.0,…
## $ HoursActual <dbl> 10.00, 22.25, 44.00, 9.00, 7.00, 27.50, 2.00, 12…
## $ DeveloperPerformance <dbl> NA, 12.75, NA, 5.00, 21.00, -13.50, 47.00, -20.4…
actual hours and the estimated hours are densely clustered below 125 hours.actual hours are more dispersed. As the saying goes “No plan survives contact with the enemy.”actual hours we see two tasks that took more than 2000 hours, with one of them taking more than 2500 hours. On the other hand all estimates were under 1000 hours. The aforementioned tasks were grossly underestimated.actual hours sport the tasks with of longest duration.Management and Development it’s less clear whether there’s more dispersion in the actual hours, for Operational it’s clear there is.
Management the actual hours still shows more dispersion but estimated hours somewhat closed the gap.Management and Development.actual hours were longer than the estimated hoursactual hours and estimated hours took the same time.actual hours are more dispersed. We can see this in both the standard deviation and median absolute deviation.actual hours is far greater than for the estimated hours. The first is very close to 2500 hours while the second is closer to 900 hours.Operational category shows the most divergence between estimated hours and actual hours in terms of dispersion.
actual hours of this category.Let’s from now on also employ percentiles to have a perspective of the distribution of values at key points.
If we follow the percentiles as they increase we can see that:
For the Management category we see estimated hours greater than the actual hours all the way up to the 99th percentile. In other words, overstimating is what Management is all about.
The Operational category behaves similar to the data as a whole (categories ignored).
Overall Development has higher values than the other categories.
The most extreme examples (highest absolute value) are those where the actual hours were longer than the estimated hours (understimation).
The values are densely clustered close to 0, this could be expected as 99% of the values for both estimated and actual hours is under or equal to 112.
Developments show more dispersion than the rest, in particular for cases of overestimation.Up to the 30th percentile the difference is a value below 0. Therefore about 30% of the tasks were overestimated
Up to the 65th percentile upwards the difference is a value above 0. Therefore about 25% of the tasks were underestimated
overstimations are more frequent than understimations.
Management has the percentiles of highest value.
Management category showed the biggest divergence between estimated hours and actual hoursabove 0. In other words the Management category had the highest overestimation, this makes sense as overestimations are so common for the Management category.Development category has the 10th percentile and 5th percentile of lowest value.
Development category showed the biggest divergence between estimated hours and actual hoursbelow 0. The Development category had the biggest underestimations.DevelopmentLet’s keep only the values below or equal to 112 as they make up 99% of the values (around the 99th percentile) and take a closer look.
estimated hours deviate from the gaussian distribution due to a strong right skewness. This indicates that the estimated hours does not follow the gaussian distribution.
Across categories:
estimated hours deviate from the gaussian distribution due to a strong right skewness. This indicates that the estimated hours does not follow the gaussian distribution.
actual hours deviate from the gaussian distribution due to a strong right skewness. This indicates that actual hours does not follow the gaussian distribution.
Across categories:
actual hours deviate from the gaussian distribution due to a strong right skewness. This indicates that actual hours does not follow the gaussian distribution.
Across categories:
Management category which is more firmly moderate.actual hours does not follow the gaussian distribution.estimated hours does not follow the gaussian distribution.actual hours
In view of the aforementioned reasons Kendall’s tau is the result we’ll follow.
Kendall’s tau scored a moderate relation (close to a strong one) for the relation between the estimated hours and the actual hours
Kendall points to a moderate relation (close to a strong one), except for the Management category which is more firmly moderate.
The estimate in hours represents how many hours a task will actually take up to a point. However, there are certainly other factors such as how fast a team or whether the assignee is employing a familiar technology. The moderate correlation happens to represent these limitation in the relation.
It’s clear that the relation is positive (seen in all three methods). This is easy to grasp, “harder tasks” will usually be estimated higher and actually take higher up to a point. The more trivial tasks will be estimated with less hours and usually take less time to be finished. In other words, up to a point tasks estimated with longer hours will take longer hours to finish.
The dataset provided to us doesn’t come with the information about the team size explicitly, for this reason we’ll use the original code to retrieve the original information about team size for each task.
Sip_all=read.csv(here::here("data/Sip-task-info.csv"), as.is=TRUE)
Sip_date=read.csv(here::here("data/est-act-dates.csv"), as.is=TRUE)
Sip_date$EstimateOn=as.Date(Sip_date$EstimateOn, format="%d-%b-%y")
Sip_date$StartedOn=as.Date(Sip_date$StartedOn, format="%d-%b-%y")
Sip_date$CompletedOn=as.Date(Sip_date$CompletedOn, format="%d-%b-%y")
Sip_date=Sip_date[order(Sip_date$TaskNumber), ]
Sip_date$TaskNumber=NULL
Sip_all=cbind(Sip_all, Sip_date)
# Removed the 190 tasks that were cancelled before completion
Sip=subset(Sip_all, StatusCode != "CANCELLED")
# Single instance of this in the data
Sip=subset(Sip, StatusCode != "TEMPLATE")
Sip_stTN=subset(Sip, !duplicated(TaskNumber))
# Projects that were not completed at the time of the data snapshot
Sip=subset(Sip, StatusCode != "CHRONICLE")
Sip_uTN=subset(Sip, !duplicated(TaskNumber))
devs=plyr::count(Sip$DeveloperID)
# Column names for each developer
udevID=unique(Sip$DeveloperID)
udevID_str=paste0("ID", udevID)
Sip$Mapped_ID=plyr::mapvalues(Sip$DeveloperID, udevID, 1:length(udevID))
team_summary=function(df)
{
# Who was on the team (as 0/1 for all developers)
devs=matrix(0, nrow=1, ncol=length(udevID))
colnames(devs)=udevID_str
devs[1, df$Mapped_ID]=1
# Team size and who was on the team
return(data.frame(size=nrow(df),
devs,
h_est=df$HoursEstimate[1], h_act=df$HoursActual[1]))
}
team_info=plyr::ddply(Sip, plyr::.(TaskNumber), team_summary)
team_info %>%
glimpse()
## Observations: 8,252
## Variables: 26
## $ TaskNumber <int> 1735, 1742, 1971, 2134, 2251, 2283, 2400, 2451, 2475, 2499…
## $ size <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ ID58 <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ ID42 <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ ID13 <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0…
## $ ID43 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID26 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1…
## $ ID50 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ ID1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID64 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID37 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID22 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID65 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID24 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID27 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID54 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID45 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID53 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID35 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID33 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID25 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID23 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ h_est <dbl> 14.00, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, 1.40, 1.7…
## $ h_act <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, 1.40, 1.75…
Let’s now join our source data with the newfound data about team size:
## Observations: 8,252
## Variables: 10
## $ TaskNumber <chr> "1735", "1742", "1971", "2134", "2251", "2283", …
## $ team_size <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ProjectCode <chr> "PC2", "PC2", "PC2", "PC2", "PC2", "PC9", "PC2",…
## $ Category <chr> "Development", "Development", "Operational", "De…
## $ Priority <dbl> 1, 1, 2, 5, 10, 1, 5, 5, 6, 5, 2, 1, 3, 1, 1, 8,…
## $ Summary <chr> "Flag RI on SCM Message Summary screen using met…
## $ HoursEstimate <dbl> 14.00, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00,…
## $ HoursActual <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, …
## $ DeveloperPerformance <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, NA, 0.00, 0…
## $ HoursDiff <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
There’s more dispersion for teams of size 2 and 3 in terms of standard deviation (more extreme values). There’s more dispersion for teams of size 3 in terms of median absolute deviation.
To have a more representative perspective of the distributions by team size we shall look at the central section of the distributions. We shall observe the percentiles between the first and third quartiles (25th to 75th percentile).
Smaller teams (1 up to 3) people made the grossest underestimation / overestimation among all teams.
As the central section of the distribution is less sensitive to extreme values we shall employ it as the reference to answer our question.
As mentioned beforem the teams of one make the least underestimation / overestimation of all.
Teams with 3 members show a trend for more extreme underestimations.
Consistently, teams with 7 members made bigger underestimations. Also, there was a trend where the bigger the teams the bigger the error in the estimations.
Overall, teams with more developers produce estimations with more error