Executive Summary

Results of a survey of roughly 50 Infrastructure Masons on important education topics, both the subjects they took and, in hindsight, the subjects they missed but still consider important to their careers, reveal the breadth of cross-functional expertise necessary for individuals who build and operate digital infrastructure.

A curriculum of about ten classes would barely cover half of the important subjects for Infrastructure Masons, indicating the need for specialization, continuing education, and careful career mentorship.

The most important overall topics include Business Operations, Project and Change Management, Electrical Systems, and Sustainability, along with technology-related subjects.
The biggest “education gaps” include subjects like Availability Management and Sustainability, as well as emerging topics like Security and Privacy and Edge Computing.

Infrastructure Masonry spans multiple categories of skillsets which don't necessarily fit neatly into traditional roles. To understand skillset groupings, we employ an unsupervised machine learning model (k-means clustering) to tease out patterns in the data. While the technique is promising, the small size of the data set limits the strength of these conclusions.

Please shoot me a note on Twitter at @imasons_ed or @WinstonOnData if you have comments.

Introduction

Infrastructure Masons is a non-profit professional society established to provide data center infrastructure professionals an independent forum to connect, grow and give back. iMasons was founded by Dean Nelson, a respected Infrastructure executive and industry leader, in 2016.

The Infrastructure Masons Education Team involves iMasons and major educational institutions supporting the development and education needs for infrastructure careers. However, we have at least one big problem: where to start? Aside from personal and anecdotal stories, we don't have any hard data on what subjects iMasons find essential in their day-to-day jobs, or on what they feel was missing from their formal education.

To close this gap, while Mark Monroe, Maricel Cerruti, Dean, and I were at the recent iMasons' Meetup at the 2017 Technology Convergence Conference in San Jose, we asked 50+ iMasons about these essential skills. This is an analysis of the data.

Data and Data Collection

Participants at the 2017 Technology Convergence Conference Meetup were asked to fill out a simple form listing "the education subjects they took" that had been most helpful in preparing them for their careers ("taken"), as well as those subjects they "didn't take, but with perfect hindsight, needed" ("missed").

Collected data were transcribed to an Excel spreadsheet and exported as a .csv file. No data cleaning was done, though I manually added an additional "Job category" column based on each participant's job description.
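
A minimal loading sketch (the file name survey_responses.csv and the object name survey.data are illustrative, not the actual names used in this analysis):

## read the transcribed survey responses with readr (part of the tidyverse)
library(readr)
survey.data <- read_csv("survey_responses.csv")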


Here are all the subjects that people could choose from and how they were organized.

Facilities: PowerGrid, MechHVAC, Electrical, WaterResources, Cabling, Construction, BackupSystems, Sustainability
Business: Tax, Contracts, Labor, TCO, RealEstate, Regs, BizOps
Technology: Compute, Storage, Virtualization, NetworkDesign, EdgeTechnologies, WorkloadMgt, RnD
Operations: OpsMgt, Avail, SecurityPrivacy, CommunityRelations, ProjectAndChangeMgt


The plot below shows the raw data, as collected.

## load the libraries used throughout (dplyr, tidyr, and ggplot2 come with the tidyverse)
library(tidyverse)

## put data into long format
analyzed.data.plot.long <- 
    analyzed.data.plot %>%
    mutate(ident = 1:nrow(analyzed.data.plot)) %>%
    gather(1:27, key = "area", value = "import")

## heat map
p <- ggplot(analyzed.data.plot.long, aes(x = area, y = ident)) + 
    geom_tile(aes(fill=as.factor(import)), color="white") +
    scale_fill_manual(values = c("#693311DD","#DDDDDD","#344F34DD"),
                      guide = guide_legend(title = "important subject")) +
    xlab("subject") + 
    ylab("participant index") +
    labs(title = "Survey Responses", subtitle = "Tech Convergence Conference, Feb 2017") + 
    theme(axis.text.x = element_text(angle = 80, hjust = 0, vjust = 1), 
          legend.position="bottom") +
    scale_x_discrete(position = "top")

print(p)

The goal is to leverage these data for insights into curricula for Infrastructure Masons.

Some thoughts on Data Limitations

The "subject choices" given were pre-selected, so a possible limitation is list completeness. We included write-in spaces and, while a few people took advantage of them, there was essentially no overlap among the write-ins, so it's unlikely something major was missed.

The audience was “self-selected” from conference attendees so the data are not strictly from a fully random sample of Infrastructure Masons.

Subjects were not defined by more than a one-word description. This means there may have been different interpretations of the meaning of something like "Biz Ops" or "Electrical", which are fairly general.

High Importance Subjects

The highest importance subjects are computed by summing the columns of the "taken" and "missed" responses, normalizing to the total number of participants, and expressing the result as a percentage.
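
As a minimal sketch of that computation (assuming taken and missed are 0/1 participant-by-subject indicator data frames with one column per subject; these names are illustrative, not the actual objects in this analysis):

## importance = (taken votes + missed votes) per subject, as a percent of participants
importance <- 100 * (colSums(taken) + colSums(missed)) / nrow(taken)
## top 10 subjects by importance
sort(importance, decreasing = TRUE)[1:10]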

The “Top 10” items ranked by importance are:

rank subject
1 BizOps
2 ProjectAndChangeMgt
3 Electrical
4 Sustainability
5 Compute
6 TCO
7 OpsMgt
8 NetworkDesign
9 Contracts
10 MechHVAC

Computing “Curriculum Size”

How big should an Infrastructure Mason’s curriculum be? That is, “how many classes are needed to cover a majority of important topics?” Within the scope of this survey, it turns out we can compute that fairly easily.

    ## extract data
    sums.fit <- sums.df %>% 
            filter(index < 26) %>% 
            mutate(freq = sum.total/sum(sum.total))

    ## linear model of log
    a <- coef(lm(log(freq) ~ index, data = sums.fit))
    ## extract coefficients
    intercept.s <- a[1]
    slope.s <- a[2]
    ## plot
    inv.plot <- ggplot(sums.fit , aes(x = index, y = log(freq))) + 
            geom_point(pch = 22, fill = "#DD11AADD", color = "gray90", size = 3) + 
            geom_abline(intercept = intercept.s, slope = slope.s, color = "#BBBBBB") +
            labs(title = "Importance is distributed exponentially", subtitle = "normalized" ) +
            theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 1)) +
            ylab("log(normalized.importance)") + 
            xlab("n")

The observed subject importance (normalized so the sum adds to 100%) follows an exponential frequency distribution, to a very good approximation, with a rate of decrease in "importance" of about 6.9% per subject. This slow rate of decay accounts for the long tail of the distribution.

Based on this, 50% of the "importance" is reached only around n ~ 10, implying that a curriculum of more than 10 subjects is required to cover just 50% of what is "most important" according to this sample of Infrastructure Masons' input! That's huge!
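
As a back-of-the-envelope check (treating the importances as an ideal geometric sequence with per-subject ratio \(q = 1 - 0.069 = 0.931\) and approximating the tail as infinite, both assumptions made purely for illustration), the rank at which cumulative importance reaches half of the total follows directly:

\[
\sum_{k=1}^{n} q^{k} \;=\; \tfrac{1}{2}\sum_{k=1}^{\infty} q^{k}
\quad\Rightarrow\quad q^{n} = \tfrac{1}{2}
\quad\Rightarrow\quad n = \frac{\ln(1/2)}{\ln(0.931)} \approx 9.7,
\]

which is consistent with the \(n \sim 10\) read directly off the fitted data.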

The data were collected from a wide range of Infrastructure Masons, and not all job roles require all these skills. We'll revisit this below in greater detail.

Biggest Education Gaps

We have more granularity in the data than just "importance", since we also asked the audience what they got versus what was missing from their education.

If we split the data by “what was taken” and “what was missed” we start to get an interesting story. Here is the same plot as above, now sorted in order of how many marked the subjects as “missed.”

## COMPUTE ERROR BARS

sums.all.df <- 
    sums.all.df %>%
    mutate(sigma.missed = sqrt(sum.missed),   ## counting-statistics error on "missed" counts
           sigma.taken = sqrt(sum.taken),     ## counting-statistics error on "taken" counts
           ratio = sum.missed/sum.taken,      ## the "missed:taken" gap ratio
           sigma.ratio = ratio * sqrt(1/sum.missed + 1/sum.taken)  ## propagated error on the ratio
           )

In many cases the number of "missed" votes is large. The gaps become even more pronounced when the number of "missed" votes is compared to the number of "taken" votes.

Gap Significance

One way to look for significant gaps is to look for large "missed:taken" ratios in the above data. However, we also need to be careful: uncertainties inherent in very small counts may unduly influence our interpretation.

We can counter this by estimating an error for each ratio, using the simple "counting statistics" estimate of the standard deviation, \(\sigma \approx \sqrt{n}\), to compute a lower confidence bound \(y_{min} = y - \sigma\) on the "gap ratio" (the ratio of "missed:taken").
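
A minimal sketch of that ranking, reusing the ratio and sigma.ratio columns computed above:

## rank subjects by the lower confidence bound on the gap ratio
sums.all.df %>%
    mutate(ratio.lower = ratio - sigma.ratio) %>%
    arrange(desc(ratio.lower))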

The impact of the small sample sizes is clear.

We can read off the top few "highest confidence" gaps; most of these also appear in the "Top 10" importance list above.
1. Availability
2. Sustainability
3. Operations Management
4. Network Design
5. TCO (Total Cost of Ownership)

Some of these are not a surprise: Sustainability, for instance, is still a formative discipline. Others, such as Edge Technologies and Security and Privacy, are emerging in importance.

Some apparently "big" signals are swamped by uncertainty (for instance, Tax and Regs). These cases may warrant more data collection and analysis.

Unsupervised Machine Learning: Optimizing “Specialization”

We have addressed the "most important" broad skills, but what about specialization groupings? As part of the survey, we asked people for their job titles. It turns out that from about 50 iMasons we got about 50 unique job titles. So that was useless for grouping.

But what about organizing by the skills themselves? Infrastructure is multidisciplinary, "mixing" abilities from many domains, but it's not random. Can groupings be gleaned directly from the data?

Well, this is a pattern recognition exercise well suited to an artificial intelligence approach. So, as an experiment, let's use unsupervised machine learning with the k-means algorithm.

Note: k-means is a clustering algorithm that essentially defines clusters which minimize the "distance" between similar objects while maximizing the distance between clusters of dissimilar objects.

Skills Clusters?

Since we are looking at specialization topics, I first remove the "Top 6" most important subjects from the data, since these might serve as a kind of common "baseline". This choice is somewhat arbitrary but seems reasonable.

For this description I picked five clusters (though I ran experiments with up to 10 clusters), which gives a ratio of between-cluster to within-cluster sum of squares of about 0.6 (with ideal data and behavior this converges to 1.0) without overfitting the data.

    ## kmeans -----
    set.seed(8675309)
    cluster <- kmeans(analyzed.data.reduced, centers = 5, nstart = 30)
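
A quick way to read that ratio off the fitted model (assuming the figure quoted above refers to betweenss over tot.withinss from the kmeans object):

    ## ratio of between-cluster to within-cluster sum of squares (about 0.6 for this fit)
    cluster$betweenss / cluster$tot.withinss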



Here is a table version of the same data, coupled with the "broad" top subjects that were removed as the baseline.

MostImportant
BizOps
ProjectAndChangeMgt
Electrical
Sustainability
Compute
TCO


V1 V2 V3 V4 V5
NetworkDesign Avail Contracts Virtualization Construction
OpsMgt OpsMgt RealEstate Storage MechHVAC
RnD MechHVAC PowerGrid NetworkDesign Contracts
Virtualization BackupSystems WaterResources SecurityPrivacy NetworkDesign
EdgeTechnologies Construction RnD PowerGrid Cabling


V5 seems to cover areas of concern for facility construction, V4 covers much of the compute infrastructure, V3 reads like someone who puts together deals to build data centers, while V1 and V2 both cover the scope necessary to operate a data center facility once it is built.


To highlight the diversity of backgrounds in data centers, I assigned each person a "job role" based on their job title and cross-tabulated the roles against the model's clusters in the form of a "truth table". The way to read this is row by row. It's interesting that job roles classified as Ops fall mostly into category V1, while Managers fall into V1 and V5. Executives truly span the spectrum of subjects, while for Engineers it's really hard to tell given the small sample size.

field V1 V2 V3 V4 V5
Engineer 1 2 0 1 2
Executive 3 3 3 4 3
Manager 5 0 1 1 4
Ops 4 1 1 1 0
Solutions 3 1 4 1 1
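
A minimal sketch of how such a truth table can be built (job.role is an assumed vector holding the hand-labeled roles, aligned with the rows of analyzed.data.reduced):

## cross-tabulate hand-labeled job roles against k-means cluster assignments
table(job.role, cluster$cluster)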


Cluster Size Choice

The small sample size (~ 50 data points) implies that using too many clusters risks overfitting the data. For this experiment I chose k = 5 clusters, based on the observation that the within-cluster sum of squares shows a break around that point.

We can look at the trend of total and within sum of squares as a function of nclust using the following code:

# needs broom (tidy/glance) in addition to the tidyverse loaded above
library(broom)

# compute multiple clusters
kmeans.clusters <- 
    tibble(nclust = 1:10) %>%
    mutate(cluster.model = map(nclust, ~ kmeans(analyzed.data.reduced, centers = ., nstart = 20)))
# extract centers
kmeans.centers <- 
    kmeans.clusters %>%
    mutate(centers = map(cluster.model, tidy)) %>%
    unnest(centers)
# get model performance parameters into long format ready for plotting
model.performance <- 
    kmeans.clusters %>%
    mutate(fit.stats = map(cluster.model, glance)) %>%
    unnest(fit.stats) %>%
    select(nclust, tot.withinss, betweenss) %>%
    gather(tot.withinss, betweenss, key = statistic, value = sum.sq.error)
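
A minimal plotting sketch, using the model.performance frame built above, to show the elbow:

## plot within- and between-cluster sums of squares versus the number of clusters
elbow.plot <- ggplot(model.performance,
                     aes(x = nclust, y = sum.sq.error, color = statistic)) +
    geom_point() +
    geom_line() +
    labs(title = "k-means fit vs. number of clusters",
         x = "number of clusters (nclust)",
         y = "sum of squares")

print(elbow.plot)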

The model isn't great. Normally we would expect the within- and between-cluster sums of squares to be about equal (a ratio near 1.0, as noted above).

Conclusions

We have collected and analyzed data on the important subjects and gaps in the education of Infrastructure Masons. This may be the first data of its kind for infrastructure professionals.
Many important skills, such as Sustainability, Business Operations, and Total Cost Modeling, along with emerging technical areas like Security and Privacy and Edge Computing, are among the most prevalent gaps.

Applying unsupervised machine learning to the problem of "creating curriculum" for Infrastructure Masons addresses the need for specialization. Groupings fall into clear areas that can be associated with job roles in the data center, building confidence in the technique. The small size of the data set is a key limiter of these conclusions.

Comments welcome at @imasons_ed or @WinstonOnData.