EDA SIP

What are the data

## Observations: 12,299
## Variables: 17
## $ TaskNumber           <chr> "1735", "1742", "1971", "2134", "2251", "2283", …
## $ Summary              <chr> "Flag RI on SCM Message Summary screen using met…
## $ Priority             <dbl> 1, 1, 2, 5, 10, 1, 5, 5, 6, 5, 2, 1, 3, 1, 1, 8,…
## $ RaisedByID           <chr> "58", "58", "7", "50", "46", "13", "13", "13", "…
## $ AssignedToID         <chr> "58", "42", "58", "42", "13", "13", "13", "58", …
## $ AuthorisedByID       <chr> "6", "6", "6", "6", "6", "58", "6", "6", "6", "5…
## $ StatusCode           <chr> "FINISHED", "FINISHED", "FINISHED", "FINISHED", …
## $ ProjectCode          <chr> "PC2", "PC2", "PC2", "PC2", "PC2", "PC9", "PC2",…
## $ ProjectBreakdownCode <chr> "PBC42", "PBC21", "PBC75", "PBC42", "PBC21", "PB…
## $ Category             <chr> "Development", "Development", "Operational", "De…
## $ SubCategory          <chr> "Enhancement", "Enhancement", "In House Support"…
## $ HoursEstimate        <dbl> 14.00, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00,…
## $ HoursActual          <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, …
## $ DeveloperID          <chr> "58", "42", "58", "42", "13", "13", "43", "58", …
## $ DeveloperHoursActual <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, …
## $ TaskPerformance      <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ DeveloperPerformance <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, NA, 0.00, 0…

Understanding the data

Data summary
Name	Piped data
Number of rows	12299
Number of columns	4
_______________________
Column type frequency:
character	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ProjectCode	0	1	3	4	0	20	0
TaskNumber	0	1	4	5	0	10266	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
HoursEstimate	0	1	10.15	28.84	0.01	1	3	7.0	910.00	▇▁▁▁▁
HoursActual	0	1	13.18	68.72	0.01	1	3	8.5	2490.16	▇▁▁▁▁

We have 20 projects, with 12299 estimates_raw. There isn’t one estimation for each task, as there’s only 10266 distinct values of TaskNumber.

1 estimate for task

For our analysis, we’ll use one estimate for task. If there’s more than one we’ll use the mean of estimates_raw:

What’s the relation between estimated hours and actual hours in the company as a whole and in different task categories?

## Observations: 10,268
## Variables: 8
## $ ProjectCode          <chr> "PC1", "PC1", "PC1", "PC1", "PC1", "PC1", "PC1",…
## $ TaskNumber           <chr> "11276", "11286", "11294", "11306", "11313", "11…
## $ Category             <chr> "Management", "Development", "Development", "Dev…
## $ Priority             <dbl> 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Summary              <chr> "Meeting with CCC - CCC Contract and Quotation",…
## $ HoursEstimate        <dbl> 14.0, 35.0, 21.0, 14.0, 28.0, 14.0, 49.0, 100.0,…
## $ HoursActual          <dbl> 10.00, 22.25, 44.00, 9.00, 7.00, 27.50, 2.00, 12…
## $ DeveloperPerformance <dbl> NA, 12.75, NA, 5.00, 21.00, -13.50, 47.00, -20.4…

Overall we see that both actual hours and the estimated hours are densely clustered below 125 hours.
The actual hours are more dispersed. As the saying goes “No plan survives contact with the enemy.”
On the group of actual hours we see two tasks that took more than 2000 hours, with one of them taking more than 2500 hours. On the other hand all estimates were under 1000 hours. The aforementioned tasks were grossly underestimated.

Across categories the actual hours sport the tasks with of longest duration.
For Management and Development it’s less clear whether there’s more dispersion in the actual hours, for Operational it’s clear there is.
- Visually, for Management the actual hours still shows more dispersion but estimated hours somewhat closed the gap.
The tasks who took over 2000 hours belong to the categories of Management and Development.

On average the actual hours were longer than the estimated hours
On median the actual hours and estimated hours took the same time.
As seen in the scatter plot the actual hours are more dispersed. We can see this in both the standard deviation and median absolute deviation.
The minimum value is the same for both, this might imply that there’s a mandatory minimum time length for a task.
As seen in the scatter plot the maximum value for the actual hours is far greater than for the estimated hours. The first is very close to 2500 hours while the second is closer to 900 hours.

The Operational category shows the most divergence between estimated hours and actual hours in terms of dispersion.
- It becomes obvious when we look at the mean value which is more susceptible to extreme values.
- The global maximum (longest task of all) resides in the actual hours of this category.

Let’s from now on also employ percentiles to have a perspective of the distribution of values at key points.

The 99th percentile for actual and estimated hours is less or equal to 112 hours.
- In other words, for 99% of the values both actual and estimated hours are below 112 hours.
The 85th percentile for actual and estimated hours is around 12 hours.
- In other words, 80% of the values of both actual and estimated hours is less than or close to 12 hours.

If we follow the percentiles as they increase we can see that:

Up to the 80th percentile the distribution of values for actual and estimated hours behave similarly.
From the 80th percentile up to around the 96th percentile the estimated hours shows higher values.
From the 96th percentile upwards the actual hours become become longer than the estimated hours. This matches the results of the scatter plot and the summary as the maximum value and the highest values for the actual hours were way bigger than those of the estimated hours.

For the Management category we see estimated hours greater than the actual hours all the way up to the 99th percentile. In other words, overstimating is what Management is all about.
The Operational category behaves similar to the data as a whole (categories ignored).
Overall Development has higher values than the other categories.

The most extreme examples (highest absolute value) are those where the actual hours were longer than the estimated hours (understimation).
The values are densely clustered close to 0, this could be expected as 99% of the values for both estimated and actual hours is under or equal to 112.

The Developments show more dispersion than the rest, in particular for cases of overestimation.
The values are densely clustered close to 0, similar to the data as a whole (categories ignored).

Up to the 30th percentile the difference is a value below 0. Therefore about 30% of the tasks were overestimated
From the 30th percentile up to the 65th percentile the difference is around 0. Therefore about 35% of the tasks were neither overestimated or underestimated.
- This large number of tied ranks makes the use of Kendall’s tau advisable over the use of Spearman’s rho.
Up to the 65th percentile upwards the difference is a value above 0. Therefore about 25% of the tasks were underestimated

overstimations are more frequent than understimations.

From the 85th percentile upwards the Management has the percentiles of highest value.
- The Management category showed the biggest divergence between estimated hours and actual hoursabove 0. In other words the Management category had the highest overestimation, this makes sense as overestimations are so common for the Management category.
The Development category has the 10th percentile and 5th percentile of lowest value.
- The Development category showed the biggest divergence between estimated hours and actual hoursbelow 0. The Development category had the biggest underestimations.

Visually the relation does not seem linear, this makes the use of Spearson inadvisable.
We have some extreme values (on the scale of 2000), in such a situation the use of Kendall’s tau advisable over the use of Spearman’s rho.
- Spearman’s rho is badly affected by small numbers of extreme outliers.

A linear relation isn’t clear for any of the categories
There’s more dispersion in the category Development

Let’s keep only the values below or equal to 112 as they make up 99% of the values (around the 99th percentile) and take a closer look.

Despite the closer inspection a linear relation isn’t clear, this makes the use of a Spearson inadvisable.
Visually the relation is messy and unclear.

Across categories:
- Despite the closer inspection a linear relation isn’t clear, this makes the use of a Spearson inadvisable.
- Visually the relation is messy and unclear.

The standard quantiles for the estimated hours deviate from the gaussian distribution due to a strong right skewness. This indicates that the estimated hours does not follow the gaussian distribution.
- This makes the use of the Spearson correlation inadvisable.

Across categories:

The standard quantiles for the estimated hours deviate from the gaussian distribution due to a strong right skewness. This indicates that the estimated hours does not follow the gaussian distribution.
- This makes the use of the Spearson correlation inadvisable.

The standard quantiles for the actual hours deviate from the gaussian distribution due to a strong right skewness. This indicates that actual hours does not follow the gaussian distribution.
- This makes the use of the Spearson correlation inadvisable.

Across categories:

The standard quantiles for the actual hours deviate from the gaussian distribution due to a strong right skewness. This indicates that actual hours does not follow the gaussian distribution.
- This makes the use of the Spearson correlation inadvisable.

Pearson points to something around a moderate to weak to relation.
- Expected, as the relation is not linear.
Spearman points to a strong relation
- This optimistic result might be a bad affected from our extreme outliers (known effect on Spearman’s rho).
Kendall points to a moderate relation (close to a strong one).

Across categories:

Pearson points to something around a moderate to weak to relation.
- Expected, as the relation is not linear.
Spearman points to a strong relation
- This optimistic result might be a bad affected from our extreme outliers (known effect on Spearman’s rho).
Kendall points to a moderate relation (close to a strong one), except for the Management category which is more firmly moderate.

Which correlation method is the most appropriate ?

Large number of tied ranks (30% of values),
actual hours does not follow the gaussian distribution.
estimated hours does not follow the gaussian distribution.
there are extreme values in particular for the actual hours
- the 99th percentile is around 112 and there are values above 1000.

In view of the aforementioned reasons Kendall’s tau is the result we’ll follow.

Conclusion / Interpretation

Overall

Kendall’s tau scored a moderate relation (close to a strong one) for the relation between the estimated hours and the actual hours

By category

Kendall points to a moderate relation (close to a strong one), except for the Management category which is more firmly moderate.

The estimate in hours represents how many hours a task will actually take up to a point. However, there are certainly other factors such as how fast a team or whether the assignee is employing a familiar technology. The moderate correlation happens to represent these limitation in the relation.

It’s clear that the relation is positive (seen in all three methods). This is easy to grasp, “harder tasks” will usually be estimated higher and actually take higher up to a point. The more trivial tasks will be estimated with less hours and usually take less time to be finished. In other words, up to a point tasks estimated with longer hours will take longer hours to finish.

Teams with more developers produce estimations with more or less error?

The dataset provided to us doesn’t come with the information about the team size explicitly, for this reason we’ll use the original code to retrieve the original information about team size for each task.

Sip_all=read.csv(here::here("data/Sip-task-info.csv"), as.is=TRUE)
Sip_date=read.csv(here::here("data/est-act-dates.csv"), as.is=TRUE)
Sip_date$EstimateOn=as.Date(Sip_date$EstimateOn, format="%d-%b-%y")
Sip_date$StartedOn=as.Date(Sip_date$StartedOn, format="%d-%b-%y")
Sip_date$CompletedOn=as.Date(Sip_date$CompletedOn, format="%d-%b-%y")

Sip_date=Sip_date[order(Sip_date$TaskNumber), ]
Sip_date$TaskNumber=NULL
Sip_all=cbind(Sip_all, Sip_date)

# Removed the 190 tasks that were cancelled before completion
Sip=subset(Sip_all, StatusCode != "CANCELLED")
# Single instance of this in the data
Sip=subset(Sip, StatusCode != "TEMPLATE")
Sip_stTN=subset(Sip, !duplicated(TaskNumber))
# Projects that were not completed at the time of the data snapshot
Sip=subset(Sip, StatusCode != "CHRONICLE")

Sip_uTN=subset(Sip, !duplicated(TaskNumber))

devs=plyr::count(Sip$DeveloperID)

# Column names for each developer
udevID=unique(Sip$DeveloperID)
udevID_str=paste0("ID", udevID)
Sip$Mapped_ID=plyr::mapvalues(Sip$DeveloperID, udevID, 1:length(udevID))


team_summary=function(df)
{
# Who was on the team (as 0/1 for all developers)
devs=matrix(0, nrow=1, ncol=length(udevID))
colnames(devs)=udevID_str
devs[1, df$Mapped_ID]=1

# Team size and who was on the team
return(data.frame(size=nrow(df),
            devs,
            h_est=df$HoursEstimate[1], h_act=df$HoursActual[1]))
}


team_info=plyr::ddply(Sip, plyr::.(TaskNumber), team_summary)

team_info %>%
  glimpse()

## Observations: 8,252
## Variables: 26
## $ TaskNumber <int> 1735, 1742, 1971, 2134, 2251, 2283, 2400, 2451, 2475, 2499…
## $ size       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ ID58       <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ ID42       <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ ID13       <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0…
## $ ID43       <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID26       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1…
## $ ID50       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ ID1        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID64       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID37       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID22       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID65       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID24       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID27       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID54       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID11       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID45       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID53       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID35       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID33       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID2        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID25       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ID23       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ h_est      <dbl> 14.00, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, 1.40, 1.7…
## $ h_act      <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, 1.40, 1.75…

Smaller teams are more frequent.
We only have one example of a team of 8 members.

Let’s now join our source data with the newfound data about team size:

## Observations: 8,252
## Variables: 10
## $ TaskNumber           <chr> "1735", "1742", "1971", "2134", "2251", "2283", …
## $ team_size            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ProjectCode          <chr> "PC2", "PC2", "PC2", "PC2", "PC2", "PC9", "PC2",…
## $ Category             <chr> "Development", "Development", "Operational", "De…
## $ Priority             <dbl> 1, 1, 2, 5, 10, 1, 5, 5, 6, 5, 2, 1, 3, 1, 1, 8,…
## $ Summary              <chr> "Flag RI on SCM Message Summary screen using met…
## $ HoursEstimate        <dbl> 14.00, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00,…
## $ HoursActual          <dbl> 1.75, 7.00, 0.70, 0.70, 3.50, 7.00, 7.00, 7.00, …
## $ DeveloperPerformance <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, NA, 0.00, 0…
## $ HoursDiff            <dbl> 12.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…

There’s more dispersion in the values for small teams.
- There’s bigger positive and negatives values as well, which means bigger underestimations and over-estimations for small teams. Visually, values closer to 0 seem more frequent.

Observing the mean we see that teams of 5 and 6 members have higher absolute values.
Observing the median we see that teams of 6 and 7 members have higher absolute values.
- In tearms of median (more robust than the mean) we see that the bigger teams made bigger underestimations (higher error).
There’s more dispersion for teams of size 2 and 3 in terms of standard deviation (more extreme values). There’s more dispersion for teams of size 3 in terms of median absolute deviation.
Smaller teams, 1 to 3 members, had by far the most extreme minimum and maximum values.
- In particular for teams of 3 members, there was an overestimation of around 392 hours (max) and an underestimation of around 560 hours (min).

As we either get closer to the maximum or minimum we can see how the smaller teams stand out from the rest in terms of extreme values. This conforms with what we’ve seen in the summary statistics.

To have a more representative perspective of the distributions by team size we shall look at the central section of the distributions. We shall observe the percentiles between the first and third quartiles (25th to 75th percentile).

We see that teams with only 1 person make the least underestimation / overestimation of all. This is intuitive as a single person has only to guess about herself or himself and one can expect to be less unknown in that

For the 25th and 30th percentile we see that teams of 3 members made the biggest underestimations.
- Teams with 3 members show a trend for more extreme underestimations, as we go closer to the minimum value (go for smaller percentiles) the error in the estimate increases (bigger underestimations).
From the 30th up to the 75th percentile teams with 7 members made bigger underestimations than everyone else.
- Starting with the 30th percentile we see a trend where the bigger the team the bigger the error in the estimate , in this case underestimation (teams of 7 had bigger error than teams of 6, whom in turn had bigger error than teams of 5 and so on…).

Conclusion

Looking at the the extremes

Smaller teams (1 up to 3) people made the grossest underestimation / overestimation among all teams.

Looking at the central section (Q1 to Q3)

As the central section of the distribution is less sensitive to extreme values we shall employ it as the reference to answer our question.

As mentioned beforem the teams of one make the least underestimation / overestimation of all.

Teams with 3 members show a trend for more extreme underestimations.

Consistently, teams with 7 members made bigger underestimations. Also, there was a trend where the bigger the teams the bigger the error in the estimations.

Overall, teams with more developers produce estimations with more error

EDA SIP

José Benardi de Souza Nunes

What are the data

Understanding the data

1 estimate for task

What’s the relation between estimated hours and actual hours in the company as a whole and in different task categories?

Which correlation method is the most appropriate ?

Conclusion / Interpretation

Overall

By category

Teams with more developers produce estimations with more or less error?

Conclusion

Looking at the the extremes

Looking at the central section (Q1 to Q3)