Abstract

The use of service-based contracts is vital to the future profitability of the high-tech industry as it pivots away from an historical emphasis on consumption based commodities. This paper examines several behavioral economic factors for their potential influence on the renewals of 91,727 service contracts provided by a large US-based high-tech company. The goal of our research was to attempt to predict both the likelihood of a service contract being renewed and the anticipated size (as defined by the number of line items contained within a contract) of any renewed contracts. Predictive models were constructed that included proxies of various behavioral factors (e.g., pricing, contract-specific terms, and sales growth strategies) to gauge their impact on contract renewals. Our preferred binomial logistic regression model proved effective at predicting nearly 72% of all customer contract renewal decisions while also indicating that the use of behavioral-driven sales growth strategies such as “Land and Expand”, “Adoption Services”, “Adaptive Learning”, and channel sales each appeared to have a positive effect on the likelihood of a contract being renewed. By contrast, the use of behavioral-driven cost factors and contract features such as discounting and contract durations appeared to yield mixed results. Finally, a negative binomial regression model proved ineffective at predicting the line item counts of renewed contracts, thereby indicating that simply assuming that the line item count of the contract being renewed will be equivalent to that of the previous contract may be a better method for predicting that metric.

Key Words: Contract Renewals, Service Contracts, Predictive Modeling, Behavioral Economics

Introduction

The high-tech industry is currently experiencing a period of rapid transformation. While historically it was primarily focused on selling equipment such as computers, routers and other big box items, the focus has now shifted to the selling of services and software. While equipment sales are still a critical component of technology companies’ revenues, true profitablity is now being created as a result of high-tech companies embracing the concept of consumption-based sales management, wherein a technology supplier works closely with its customers to help them most effectively make use of and expand upon the hardware and services they have purchased. In a sense, the consumption-based sales management approach creates a value-driven partnership between a technology company and its customers, with the customers benefiting from the enhanced value they are able to realize from their technology purchases. The high-tech companies can benefit from such a partnership if it enables them to be more responsive to a customer’s needs while also enabling the sale of additional products or services to a customer.

This evolution of the high-tech industry has greatly elevated the importance of renewable service contracts as a source of both recurring revenues and account expansions for high-tech companies. High-tech companies have come to recognize that “high-touch” service practices can have a marked impact on both their profitability and their customer retention / expansion efforts. With this context in mind, our research attempts to answer two vital questions:

Implicit in these research questions is the goal of improving a company’s sales forecasting and sales management practices through the use of enhanced pro forma revenue forecasting techniques. For example, if a company can improve its abilities to predict which service contracts are unlikely to be renewed it would allow the company to more clearly evaluate the effectiveness of their sales and account management strategies, thereby enabling them to make improvements where needed.

To facilitate our research we received permission from a large US-based global high-tech company to make use of data related to 91,727 renewable service contracts for customers based in North and South America. The data include information regarding contract sizes, discount terms, the geographical location of customers, sales management strategies, and customer contract renewal decisions. This data trove serves as the basis for the research discussed herein.

Literature Review

Our literature review focused on identifying applicable strategies from the field of Behavioral Economics relative to the subject of customer contracts and contract renewals. Many such strategies appear to rely on mathematical models wherein key behavioral characteristics are modeled with respect to a relevant measurable parameter. Of those we reviewed, we chose several that appeared to be most closely related to our research and attempted to map their respective behavioral characteristics to variables contained within our data set. This approach allowed us to test some of the contract renewal theories described in the literature as part of our own investigative efforts. The specific contract-related strategies we researched are described below.

Pricing and Cost Factors

The most important class of behavioral characterstics influencing contractual relationships involves Pricing and Cost Factors. The most salient idea in Pricing that we investigated was the notion of “Reference Dependence”, which suggests that “when a point of reference against which outcomes are compared is changed (due to”framing“), the choices people make are sensitive to the change in frame” (Ho, Lim & Camerer, 2006, pp. 5-7). So if a data set includes a variable that constitutes a “reference point”" that could be adjusted during contract negotiations, we might expect that variable to be an influential factor in the negotiation of a contract.

By contrast, Hyperbolic Discounting represents the notion that the seller would try to create a psychological urgency to influence buying behavior in two ways: for “products that involve either immediate costs with delayed benefits… or immediate benefits with delayed costs” (Ho, et al, p. 26). For example, one way to create such urgency might be to offer customers discounted payment terms in return for a shorter payment period.

Finally, we investigated the propensity for customers to switch service providers, also known as “Churn”. Research on this behavior showed a that customers exhibit a greater likelihood for switching to lower cost contracts if such contracts offer reasonably similar benefits (Wirtz & Lovelock, 2016, p.474), (Ho, et al, p 26-29).

Contract Specific Factors

Incentive Theory (of which Discounting is a subset) tells us that asymmetric information between contract counterparties creates opportunities for the seller to induce sales (Brousseau, Glachant & Farrar, 2001, p.6). That theory can be tested by comparing the renewal likelihoods of contracts of different durations, or by comparing the renewal likelihoods of contracts containing varying quantities of service offerings.

Theoretical work has been done in the area of contract duration while using a quantile regression for counts analysis to determine whether meta-factors such as unemployment rates and expected inflation play a role in contract outcomes (Liu & Peng, 2010, p. 186). While the scope of our analysis was not on the duration of contracts in the macro-economics environment, the article motivated us to use an available contract duration metric as a predictive variable within our own research.

Sales Growth Factors

Sales Growth factors were the last category of strategies we investigated. The behavioral notion behind such factors is that alteration of the relationship between counterparties in a contract beyond pricing, duration and quantity of services can influence outcomes. “Land and Expand Selling” (Wood, Helwin & Lah, 2013, pp. 144-170) encapsulates the idea of creating differentiated service offerings with flexibility around both consumption and billing. For example, to improve the likelihood of contract renewal, a seller can apply the concept of “Adoption Services” (Wood, et al, pp. 171-190), which entails working closely with a customer “post-sales” to provide hands-on product training and deployment assistance. Additional theoretical work in this area has included the use of “Adaptive Learning” techniques, where “belief-based or reinforcement learning may have a significant impact on market organizations” (Hopkins, 2007, p.1-4). Finally, there is also the concept of pricing and marketing through “channels”, wherein a large company typically relies on resellers / partners as “pass-through” sales channels (Wood, et al, p. 7) while focusing its own marketing efforts on “retailer and brand loyalty (Wood, et al, p. 10). Such a concept was relevant for our purposes since the high-tech company that provided our data relies heavily on pass through sales channels for sales of its products and services.

In summary, our review of the literature highlighted the challenges we faced in attempting to predict both service contract renewals and the likely sizes of contracts that were predicted to be renewed. However, the literature did suggest the use of several potential predictor variables for any predictive models we might build. The table below summarizes the relationships we identified between the various theoretical contructs discussed in the literature and the corresponding variables contained within our data set.

Theoretical Construct Source Category Corresponding Variable
Land & Expand Wood, Hewlin & Lah Sales Growth Service_Approach
Adoption Services Wood, Hewlin & Lah Sales Growth Sales_Strategy
Adaptive Learning Hopkins Sales Growth Sales_Strategy
Hyperbolic Discounting Ho, Lin & Camerer Cost Factors Discount_Category
Pricing Ho, Lin & Camerer Cost Factors Cost
Churn Wirtz & Lovelock Cost Factors Contract_Value_Category
Incentive Theory(Discount) Brousseau & Glachant Cost Factors Discount_Category
Incentive Theory(Multiple) Brousseau & Glachant Contract Features Multiple_Services
Contract Duration Liu & Peng Contract Features Contract_Length
Channel Pass-Through Sudhir & Datta Sales Growth Tier

Use of the ‘Corresponding Variables’ listed above therefore provided us with a mechanism for linking our research efforts back to the various relevant theoretical constructs. While any predictive models we might choose to build would not be limited solely to the eight variables listed above, their ultimate inclusion in any particular model could serve as an indicator as to whether the corresponding theoretical construct was validated by our own work.

Methodology

As stated in the Introduction, our research was focused on attempting to answer the following two questions:

  1. Given a limited set of geographical and empirical data, how well could we predict the likelihood of a renewal of a fixed-length technology services contract?

  2. If a contract was predicted to be likely to be renewed, how well could we then predict the likely line item count for the renewed contract?

Our methodology for addressing these questions consisted of three separate processes: Data Exploration, Data Preparation, and Regression Modeling. The nature of each is described below:

Subsequent to the development and selection of task-appropriate regression models, the selected models were applied to the evaluation data set as follows:

Results of both types of regression models were then compared against actual contract renewal outcomes to judge their effectiveness at predicting their respective response variables.

Experimentation and Results

Data Exploration

The data set we were provided contained empirical and qualitative data related to service contract renewals. There were 91,727 contracts represented, with attributes such as contract size, geographic location of the customer, the types of sales management strategies applied to the customer, and contract terms having been provided for each. All proprietary information had been pre-removed by the high-tech company prior to our reception of the data set. No missing data values were found and the data itself was considered to be fully accurate as per the high-tech company’s comments.

For each contract we were provided with 12 attributes that could potentially be used as predictor variables. The two response variables we were to focus on were:

  • RENEWAL_FLAG: Indicated whether a contract was renewed
  • ITEM_COUNT: Indicated how many line items were included in a contract

The following is a brief description of each of the variables represented within the data set:

Variable Definition
INDEX Identification Variable (not a predictor)
RENEWAL_FLAG Contract Renewal Indicator
AREA Domestic or Foreign Customer Indicator
SERVICE_APPROACH Indicates whether a “Land and Expand Strategy” is used w/ customer
TIER Channel Sales Distribution Method Indicator
CONTRACT_EXPIRATION Expiration Date (not a predictor)
CONTRACT_LENGTH Annual or Multi-year Indicator
SALES_CATEGORY Categorical Indicator of Market Segment Assigned to Customer
SELLER_UNIQUE_ID Unique ID for Sales Channel Partner
CONTRACT_VALUE_CATEGORY Contract Value in Dollars (bracketed)
CONTRACT_LINE_CATEGORY Items of Contract (bracketed)
DISCOUNT_CATEGORY Size of Discount (bracketed)
MULTIPLE_SERVICES Indicator: Is more than one service Provided to customer?
ITEM_COUNT Count of Line Items in Contract
COST Sum of List Prices of Contract Line Items
SALES_STRATEGY Adoption Services Strategy Indicator

Six of the independent variables (‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘CONTRACT_LENGTH’, ‘MULTIPLE_SERVICES’, and ‘SALES_STRATEGY’) were binary categorical variables which needed to be treated as factors. ‘SALES_CATEGORY’, ‘CONTRACT_VALUE_CATEGORY’, CONTRACT_LINE_CATEGORY’, and ‘DISCOUNT_CATEGORY’ were multi-categorical variables that can assume any one of more than two potential predefined text string values as indicated in the table below (NOTE: the shared ’_CATEGORY’ suffix has been abbreviated to “_CAT" within the table for simplification purposes):

SALES_CAT CONTRACT_VALUE_CAT CONTRACT_LINE_CAT DISCOUNT_CAT
COM <10K VERY SMALL NO DISCOUNT
ENT 10K-25K SMALL SMALL
PS 25K-50K MEDIUM MEDIUM
OTHER 50K-100K LARGE LARGE
(-) 100K-250K VERY LARGE VERY LARGE
(-) >250K (-) (-)

Our initial data analysis uncovered a few key features of the data set:

  • Most of the contracts represented were of annual rather than multi-year duration
  • Most of the customers were located in the USA rather than in foreign countries
  • Most contracts had relatively small line item counts
  • Most customers were receiving “medium” size discounts on their service contracts
  • Most customers were contracted for only a single type of service.
  • Only 31.9% of the contracts represented in the data set had been renewed by customers.

Furthermore, there were only two continuous numeric variables within the data set (“ITEM_COUNT” and “COST”) and both were heavily right-skewed due to the fact that the vast majority of the 91,726 contracts contained relatively small numbers of line items and relatively low total list price amounts for those line items.

Barplots and boxplots of each independent variable relative to the binary response variable were investigated for potential skew issues and also allowed us to gain insight into the predictive aspects of the data. The plots showed that domestic customers, customers whose relationships were managed using the “Land and Expand” account management strategy, customers managed via a second tier channel sales distribution method, customers that were either enterprise or public sector entities, and customers whose relationships were managed using the ‘GCS’ sales strategy were all more likely to renew their contracts than were customers who lacked those attributes. Furthermore, annual contracts, higher-value contracts, contracts that included “VERY LARGE” discounts, and contracts that included more than one service were all more likely to be renewed than were contracts lacking those attributes.(NOTE: any plots or matrices discussed in this section can be found in the Data Exploration section of the Appendix).

Additional analysis revealed that the ‘CONTRACT_LINE_CATEGORY’ variable was simply a categorical bracketing of the continous numeric ‘ITEM_COUNT’ variable. As such, it was decided that in no circumstance would we include both variables within any of our regression models.

Finally, a correlation matrix showed that neither of the two continuous numeric variables we’d been provided (‘ITEM_COUNT’ and ‘COST’) was significantly correlated with the binary ‘RENEWAL_FLAG’ response variable, a fact that would need to be considered when constructing predictive models.

Data Preparation

As discussed earlier, the data set we received from the high-tech company had been “pre-cleansed” prior to our reception of it. As such, there were no missing data values. Our analysis concluded that the ‘INDEX’, ‘CONTRACT_EXPIRATION’, and ‘SELLER_UNIQUE_ID’ variables were not relevant to predicting whether or not a contract might renew and were therefore deemed to be removable from the data set for purposes of regression modeling. Furthermore, to simplify the model building process we also converted binomial categorical variable values that were text-based to numeric (0/1) values. The affected variables were ‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘RENEWAL_FLAG’, ‘MULTIPLE_SERVICES’ and ‘SALES_STRATEGY’.

Finally, in preparation for model building we divided the data into training and evaluation data sets by extracting a random subset of 20,000 items from the master data set to serve as an evaluation data set. Those same 20,000 items from were then removed the master data set to create a training data set. As was mentioned earlier, creation of separate training and evaluation data sets was deemed necessary to allow us to test our regression models on a relatively unbiased set of contract records.

The training data set can viewed here:

The evaluation data set can be viewed here:

Regression Modeling

Binomial Logistic Regression Model for Predicting Contract Renewals

Three distinct binary logistic regression models were constructed for purposes of predicting whether or not a service contract was likely to be renewed. The models’ performance metrics were subsequently compared against each other to allow us to select our “best” binary logistic regression model for purposes of making predictions of contract renewals for the Evaluation data set. Each of the three models used the data set’s RENEWAL_FLAG attribute as the dependent response variable, while various subsets of the potential predictor variables were used as independent variables. A detailed discussion of each of the three models can be found in the REGRESSION MODELING section of the Appendix.

Of the three models that were tested, one based solely on non-contract related attributes (e.g., AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘SALES_STRATEGY’, etc.) proved to be ineffective at predicting actual contract renewals. Of the two remaining models, one excluded both the ‘COST’ and ‘ITEM_COUNT’ variables due to lack of statistical significance, while the other included the log-transformed versions of both of those variables when they proved to be statistically significant. Both models had very similar performance statistics as can be seen in the summary table shown below.

Metric Model 1 Model 3
# Predictors 10 10
AIC 82589 82057
Accuracy 0.7134 0.7132
Class.Err.R. 0.2866 0.2868
Precision 0.6369 0.6301
Sensitivity 0.2443 0.2425
Specificity 0.9344 0.9302
F1 Score 0.3531 0.3606
AUC 0.5893 0.5914

Since there was no obvious basis for preferring one model over the other a likelihood ratio test was applied, and its result indicated that Model 3 was preferable. The logit coefficients for this model are shown below.

Coeff. Variable Coeff. Variable
- 2.948 Intercept + 0.416 CONTRACT_VALUE>250K
+ 0.901 AREA1 + 0.261 CONTRACT_VALUE100K-250K
+ 0.286 SERVICE_APPROACH1 + 0.101 CONTRACT_VALUE10K-25K
+ 0.143 TIER1 + 0.178 CONTRACT_VALUE25K-50K
- 1.066 CONTRACT_LENGTHMulti + 0.282 CONTRACT_VALUE50K-100K
- 0.295 SALES_CATEGORYENT - 0.263 DISCOUNT_CATEGORYMEDIUM
+ 0.097 SALES_CATEGORYOTHER + 0.547 DISCOUNT_CATEGORYNONE
- 0.127 SALES_CATEGORYPS + 1.099 DISCOUNT_CATEGORYVERYLARGE
+ 0.206 SALES_STRATEGY1 + 0.092 log(ITEM_COUNT + 1)
(-) (-) + 0.142 log(COST + 1)

Binary Logistic Regression Model Inferences:

We can infer the following from the model’s coefficients relative to the three categories of theoretical constructs desribed in the Literature Review section above (NOTE: the sole variable that does not fall within one of those three categories (‘AREA’) is discussed separately):

  1. Sales Growth: The high-tech company’s use of the “Land and Expand” sales management strategy (as characterized by the ‘SERVICE_APPROACH’ variable), use of a ‘two-tier’ indirect channel approach (as characterized by the ‘TIER’ variable), and use of a ‘GCS’ sales strategy (as characterized by the ‘SALES_STRATEGY’ variable) all appeared to increase the likelihood that a customer would renew their service contract. Furthermore, customers deemed to be either “Enterprise” or “Public Sector” appeared more likely to renew their contracts than did other types of customers.

  2. Cost Factors: In general, the larger the value of the contract, the more likely it was to be renewed. This is not surprising since the high-tech company places more emphasis on ensuring that “high-dollar” contracts end up being renewed. The effectiveness of offering discounts to customers was less clear, with medium sized discounts apparently decreasing the likelihood of contract renewal while offering either no discounts or very large discounts appearing to increase the likelihood of contract renewal.

  3. Contract Features: Contracts with larger quantities of line items were more likely to be renewed than were contracts with smaller quantities of line items. This again may have been the result of the high-tech company placing a great deal of emphasis on the renewal of larger contracts. Additionally, multi-year contracts appeared less likely to be renewed than annual contracts.

  4. Geography (the ‘AREA’ variable): Domestic customers were more likely to renew their contracts than were foreign customers. This was not surprising since the high-tech company is US-based and apparently has a longer history with many American clients. Additionally, economic uncertainties in Venezuela and once growing emerging economies like Brazil and Argentina may have had an impact in service contract renewals in those locations.

Negative Binomial Regression Model for Predicting Contract Item Counts

Our analysis of the ‘ITEM_COUNT’ variable revealed that it was not zero-inflated and its mean was not nearly equal to its variance, thereby allowing us to quickly rule out the use of either Poisson or zero-inflated count regression models for purposes of predicting likely renewed contract line item counts. Instead, a negative binomial count regression model was pursued.

An initial set of modeling iterations led to the removal of a few statistically insignificant predictors but yielded a model whose ‘ITEM_COUNT’ predictions were wildly inaccurate, with some predictions exceeding one trillion possible line items. Further investigation revealed that the ‘COST’ variable was the source of the problem: that variable’s very large variance and outliers were causing the negative binomial model to generate wildly inaccurate predictions.

Further discussions with representatives of the high-tech company revealed that removal of the ‘COST’ variable’s outliers (basically very large total list price amounts) was viable since the outliers comprised a relatively small percentage (~3%) of the contract data we’d been provided. It was therefore decided that all contracts whose ‘COST’ values exceeded $100,000 would be ignored for purposes of predicting the ‘ITEM_COUNT’ variable. Subsequent modeling iterations yielded a negative binomial model with the following performance characteristics:

Metric Value
Number of Predictors 7
AIC 146419
Dispersion 6.00

A jitter plot of the model’s rounded fitted values against the ‘ITEM_COUNT’ response variable showed that the model appeared to perform rather poorly at predicting the likely number of line items for a contract that was likely to be renewed:

Summary statistics for the relevant ‘ITEM_COUNT’ values and the Negative Binomial predictions for the corresponding observations are provided below: (NOTE: “n” represents the number of records within a data set).

Variable n Mean sd Med Min Max Range Skew Kurto. SE
ITEM_COUNT 22243 10.05 137.74 3 1 15176 15175 65.91 6730.87 0.92
Predicted 22243 47.06 292.12 7 3 9198 9195 15.63 325.55 1.96

The summary statistics are reflective of the jitter plot shown above. The predicted values for the item counts are much less variable than are the corresponding actual ‘ITEM_COUNT’ values. Furthermore, the skew and kurtosis of the actual ‘ITEM_COUNT’ values greatly exceeds that of the predicted values.

The coefficients for this negative binomial model are shown below:

Coeff. Variable Coeff. Variable
+ 2.234 Intercept + 0.864 CONTRACT_VALUE>250K
- 0.544 SERVICE_APPROACH1 + 0.456 CONTRACT_VALUE100K-250K
- 0.222 CONTRACT_LENGTHMulti + 0.680 CONTRACT_VALUE10K-25K
- 0.081 SALES_CATEGORYENT + 0.343 CONTRACT_VALUE25K-50K
- 0.313 SALES_CATEGORYOTHER + 0.394 CONTRACT_VALUE50K-100K
- 0.057 SALES_CATEGORYPS - 0.230 DISCOUNT_CATEGORYMEDIUM
+ 0.582 MULTIPLE_SERVICES1 + 0.020 DISCOUNT_CATEGORYNONE
+ 0.001 COST - 0.037 DISCOUNT_CATEGORYVERYLARGE

Model Inferences

We can infer the following from model’s coefficients relative to the three categories of theoretical constructs desribed in the Literature Review section above.

  1. Sales Growth: Customers whose relationships were managed using the “Land and Expand” account management strategy were less likely to have relatively higher contract item counts than were customers whose relationships were not managed using that strategy. This result is somewhat surprising since the alleged benefit of that strategy is enhanced revenues. Furthermore, customers that were neither public sector nor enterprise customers appeared less likely to have higher contract item counts than were customers of other sales categories.

  2. Cost Factors: Customers whose contract values exceed $250,000 in value were more likely to have higher contract item counts than were other customers. Contracts having higher total list prices were more likely to have relatively higher contract item counts than were contracts having lower total list prices. These results are not surprising.

  3. Contract Features: Once again multi-year contracts had a negative influence, as they appeared to predict smaller contract line item counts than did the annual contracts. However, customers contracting for more than a single service were more likely to have relatively higher contract item counts.

Applying the Regression Models to the Evaluation Data

With the construction of our two required regression models completed, we were finally prepared to assess the effectiveness of each relative to the 20,000 item evaluation data set. The two-stage prediction process was as follows:

  • Stage 1: Use the selected binary logistic regression model to predict whether or not a contract was likely to be renewed.

  • Stage 2: For those contracts that were deemed likely to be renewed, predict the likely ‘ITEM_COUNT’ amounts using the negative binomial regression model for all contracts having (COST <= 100,000).

Stage 1: Predict Contract Renewals

The results of the first stage indicated that the binary logistic regression model’s performance with the evaluation data was nearly identical to its performance with the training data:

Metric Training Evaluation
Accuracy 0.7132 0.7176
Class.Err.R. 0.2868 0.2824
Precision 0.6301 0.6251
Sensitivity 0.2425 0.2557
Specificity 0.9302 0.9296
F1 Score 0.3606 0.3630
AUC 0.5914 0.5927

A confusion matrix of the model’s results indicated that the model was very effective at identifying contracts that were unlikely to be renewed: only 965 false positives were predicted out of 13,708 contracts that actually failed to be renewed.

true pred pred
(-) 0 1
0 12743 965
1 4683 1609

However, the model did not perform as well at predicting actual contract renewals, with 4683 false negatives predicted vs. 1609 true positives. As such, the value of this model appeared to be in that it clearly identified which contracts were unlikely to be renewed, thereby allowing the high-tech company to choose whether or not to direct additional sales efforts toward such customers for purposes of securing the renewal of those contracts.

Stage 2: Predict Item Counts for Contracts Predicted to be Renewed

The results of the second stage indicated that, similar to the results we obtained when applying it to the training data, the negative binomial regression model’s performance was again problematic:

Variable n Mean sd Med Min Max Range Skew Kurto. SE
ITEM_COUNT 2281 49.63 209.77 8 1 4263 4262 11.51 171.81 4.39
Predicted 2281 133.98 560.73 11 4 9358 9354 8.96 98.84 11.74

As the table clearly shows, the predicted values for the item counts are much more variable than are the corresponding actual ‘ITEM_COUNT’ values. Furthermore, the skew and kurtosis of the actual ‘ITEM_COUNT’ values greatly exceeds that of the predicted values. These results clearly demonstrate the challenge inherent in attempting to accurately predict the expected line item counts for contracts that are predicted to be renewed based upon the data we had been provided. As such, perhaps a more effective method for predicting the expected line item counts could be based on simply assuming that the line item count for the renewed contract would be equivalent to that of the expiring contract.

Discussion and Conclusions

The purpose of our research was to attempt to predict both the likelihood of a service contract being renewed and the anticipated size (as defined by the number of line items contained within a contract) of any renewed contracts. Data related to 91,727 technology services contracts was used as the basis for constructing and evaluating predictive models for each of those metrics.

The need to predict the likelihood of a contract being renewed led to the development of binary logistic regression model that proved effective at predicting nearly 72% of all customer contract renewal decisions when applied against a dedicated 20,000 item evaluation data set that had been randomly extracted from the original data set. While 0.93 (12743/13708) of non-renewals were predicted correctly, only 0.26 (1609/6292) of actual contract renewals were accurately predicted. Thus the binomial logistic model proved effective but with the caveat that it under-predicted renewals at a higher rate then might prove acceptable for business purposes.

Prediction of contract line item counts led to the development of a negative binomial regression model. However, its results proved to be too inaccurate for practical use, thereby indicating that perhaps the best method for making such predictions might be to simply assume that the line item count of the renewing contract will be similar to that of the expiring contract.

The use of behavioral economic proxy variables within the binary logistic regression model allowed us to make a variety of inferences related to the effectiveness of the relevant strategies we identified during our review of related research literature. The table below provides a summary of those inferences, with the “Rank” column being indicative of which variable attributes should be preferred if maximizing the number of contract renewals is a strategic business goal. A further explanation of how each relates to the high-tech comany that provided our data is provided below.

Strategic Predictor Related Variable Category Effect Preferred Use
Land & Expand Service_Approach Sales Growth Positive Use ‘L&E’
Adoption Services Sales_Strategy Sales Growth Positive Use ‘GCS’
(ENT,COM,PS, OTH)
Adaptive Learning Sales_Strategy Sales Growth Positive Use ‘GCS’
(ENT,COM,PS, OTH)
Hyperbolic Discounting Discount_Category Cost Factors Mixed VLG,NONE,LG,MED
(MED,LG,VLG,NONE)
Reference Dependence Cost, Item Count Cost Factors Positive Sell More Items
(Continuous Num.)
Churn Contract_Value_Cat Cost Factors Positive > $, > Renewals
(Price Range)
Incentive Theory Discount_Category Cost Factors Mixed VLG,NONE,LG,MED
(MED,LG,VLG,NONE)
Contract Duration Contract_Length Contract Negative Use ‘Annual’
(Annual, Multi-Yr) Features
Channel Pass-Through Tier Sales Growth Positive Use ‘Channel’

Pricing and Cost Factor Conclusions:

  1. Churn: The more products and services the high tech company is able to sell to a customer, the more likely the customer is likely to be retained going forward.

  2. Discounting: The effect of discounting is inconclusive: offering either very large or no discounts to customers appears to increase the likelihood of customer retention. However, offering either large or medium size discounts appears to decrease the likelihood of contract renewal.

  3. Selling larger quantities of products and services to a customer appears to increase the likelihood of contract renewal.

Contract Feature Conclusions:

Sales Growth Factor Conclusions:

  1. Use of a “Land & Expand” approach increases the likelihood of contract renewal.

  2. Use of a “GCS” adoption services / adaptive learning customer management strategy increases the likelihood of contract renewal.

  3. Use of channel partners increases the likelihood of contract renewal.

It is important to note that these conclusions apply only to the high-tech company that provided our data set. Additional research would be required to determine whether or not these same conclusions might apply to other high-tech companies or companies in other industries, or to companies that have customers located outside of North and South America. Furthermore, the data we were provided was somewhat restricted in its detail due to the need of the provider to remove all proprietary information from it (e.g., exact contract durations, exact discount percentages, exact net pricing totals, etc.) prior to our use. Despite these limitations, this study demonstrated that clear relationships exist between a variety of behavioral economic concepts and the likelihood of a technology services contract being renewed. In doing so, our research provides valuable insight into the efficacy of different sales and customer management strategies currently in use at a major high-tech company.

References

Brousseau, E., Glachant, J., & Fares, M. (2001). The Economics of Contracts and the Renewal of Economics. Theories and Applications The Economics of Contracts, 3-42. doi:10.1017/cbo9780511613807.001

Ho, T. H., Lim, N., & Camerer, C. F. (2006). Modeling the Psychology of Consumer and Firm Behavior with Behavioral Economics. Journal of Marketing Research, 43(3), 307-331. doi:10.1509/jmkr.43.3.307

Hopkins, E. (2007). Adaptive Learning Models of Consumer Behavior. Journal of Economic Behavior & Organization, 64(3-4), 348-368. doi:10.1016/j.jebo.2006.02.010

Liu, C., & Peng, A. (2010). A Reinvestigation of Contract Duration Using Quantile Regression for Counts Analysis. Economics Letters, 106(3), 184-187. doi:10.1016/j.econlet.2009.11.015

Sudhir, K., & Datta, S. (2009). Pricing in Marketing Channels. Handbook of Pricing Research in Marketing, 319-354. doi:10.4337/9781848447448.00024

Wirtz, J., & Lovelock, C. (2016). Service Marketing Communications. People, Technology, Strategy Services Marketing, 236-287. doi:10.1142/9781944659028_0007

Wood, J. B., Hewlin, T., & Lah, T. E. (2013). B4B: How Technology and Big Data are Reinventing the Customer-Supplier Relationship. San Diego, CA: Point B.

————————————————————————————————————————–

\newpage

Appendix

This Appendix contains the research results, source R code, and associated relevant output from our final writeup and our model building efforts. Additional supplemental graphics and detailed explanations of our regression modeling efforts are also included.

————————————————————————————————————————–

Results of Contract Renewal Predictions

The results of applying our preferred binary logistic regression model to the Evaluation data set can be found at the following web link:

————————————————————————————————————————–

Results of Item Count Predictions

The results of applying our preferred negative binomial regression model to the Evaluation data set contracts that were predicted as likely to be renewed can be found at the following web link:

————————————————————————————————————————–

Data Exploration

This section of the Appendix contains supplemental bar plots, box plots, histograms, and R code used during our Data Exploration work.

Data Anomolies specifically for continuous numeric distributions were found with two variables, ‘COST’ and ‘ITEM_COUNT’ which as pointed out earlier had distributions that appeared to be concentrated on a single value. ‘COST’ and ‘ITEM_COUNT’ and ‘MULTI-SERVICE’ also showed very high skew and kurtosis which was corrected using a simple log transformation. The log transformation revealed the ‘COST’ variable with a standard normal distribution and the ‘ITEM_COUNT’ with a rightward skew count distribution as shown on the histograms below.

Descriptive Statistics Log Trx
n mean sd median min max range skew kurtosis se
ITEM_COUNT 91726 1.20 1.38 1.00 0 11.00 11.00 1.51 2.62 0.00
COST 91726 6.97 2.06 6.97 0 18.44 18.44 0.10 0.14 0.01

Histogram Transformations allowed us to more thoroughly examine whether the distribution of a variable was skewed as well as whether there may were high incidence of specific variable values throughout the data set. The following illustration shows the histograms for ‘ITEM_COUNT’ and count ‘COST’ post log-transformation.

Histograms Log Trx

A Correlation Matrix for the numeric data set is provided below. As can be seen in the matrix ‘ITEM_COUNT’ and ‘COST’ show very high correlation which isn’t surprising since they are two different measures of the size of a contract and therefore likely to be collinear. Notably, both ‘ITEM_COUNT’ and ‘COST’ are very weakly correlated with ‘RENEWAL_FLAG’ indicating these variables may not have much predictive value in the renewal of contracts.

R code Analysis of Predictor Variables

Descriptive Statistics
n mean sd median min max range skew kurtosis se
AREA* 91726 1.74 0.44 2.00 1 2.00 1.00 -1.10 -0.79 0.00
SERVICE_APPROACH* 91726 1.64 0.48 2.00 1 2.00 1.00 -0.59 -1.65 0.00
TIER* 91726 1.53 0.50 2.00 1 2.00 1.00 -0.12 -1.98 0.00
RENEWAL_FLAG 91726 0.32 0.47 0.00 0 1.00 1.00 0.78 -1.40 0.00
CONTRACT_LENGTH* 91726 1.17 0.38 1.00 1 2.00 1.00 1.74 1.03 0.00
SALES_CATEGORY* 91726 1.80 1.21 1.00 1 4.00 3.00 1.04 -0.69 0.00
CONTRACT_VALUE* 91726 1.41 1.13 1.00 1 6.00 5.00 2.65 5.67 0.00
DISCOUNT_CATEGORY* 91726 2.14 0.92 2.00 1 4.00 3.00 0.56 -0.43 0.00
MULTI-SERVICE* 91726 1.01 0.09 1.00 1 2.00 1.00 11.18 122.94 0.00
ITEM_COUNT 91726 1.20 1.38 1.00 0 11.00 11.00 1.51 2.62 0.00
COST 91726 6.97 2.06 6.97 0 18.44 18.44 0.10 0.14 0.01
SALES_STRATEGY* 91726 1.78 0.42 2.00 1 2.00 1.00 -1.32 -0.26 0.00
\newpage
Histograms

Correlation Matrix for Numerics
##              RENEWAL_FLAG ITEM_COUNT COST
## RENEWAL_FLAG         1.00       0.03 0.02
## ITEM_COUNT           0.03       1.00 0.82
## COST                 0.02       0.82 1.00

————————————————————————————————————————–

Data Preparation R Code

This section of the Appendix contains the R source code used during our data preparation efforts. Two separate modules were constructed and each are presented below.

R Code for Transforming Binary Text Strings to Numeric Values

library(alr3)
library(bestglm)
library(car)
library(ggplot2)
library(dplyr)
library(knitr)
library(lmtest)
library(caret)
library(tidyr)

fp.data <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/fin%20proj.csv", stringsAsFactors = FALSE)

# create a copy of the original data to use for transforming individual variables
fp.t <- fp.data

start our data transformation….

  1. Area - go for USA=1 and not USA = 0
# transform AREA: USA = 1; !USA = 2

fp.t$AREA[which(fp.data$AREA == 'USA')] <- 1
fp.t$AREA[which(fp.data$AREA != 'USA')] <- 0

# now convert column to numeric since it is char
fp.t$AREA <- as.numeric(fp.t$AREA)

# check results
summary(fp.t$AREA)
  1. SERVICE_APPROACH - go for Service Approach S=1 or not S = 0
# transform SERVICE_APPROACH: S = 1; !S = 0

fp.t$SERVICE_APPROACH[which(fp.data$SERVICE_APPROACH == 'S')] <- 1
fp.t$SERVICE_APPROACH[which(fp.data$SERVICE_APPROACH != 'S')] <- 0

# now convert column to numeric since it is char
fp.t$SERVICE_APPROACH <- as.numeric(fp.t$SERVICE_APPROACH)

# check results
summary(fp.t$SERVICE_APPROACH)
  1. CHANNEL_APPROACH - call it tier with second =1 and first =0
# transform CHANNEL: rename to TIER with second = 1; first = 0

fp.t$CHANNEL_APPROACH[which(fp.data$CHANNEL_APPROACH == 'SECOND')] <- 1
fp.t$CHANNEL_APPROACH[which(fp.data$CHANNEL_APPROACH == 'FIRST')] <- 0

# now convert column to numeric since it is char
fp.t$CHANNEL_APPROACH <- as.numeric(fp.t$CHANNEL_APPROACH)

# check results
summary(fp.t$CHANNEL_APPROACH)

# now rename column to TIER
colnames(fp.t)[colnames(fp.t)=="CHANNEL_APPROACH"] <- "TIER"

head(fp.t$TIER)
  1. RENEWAL_FLAG - renewed = 1, not renewed = 0
# transform RENEWAL_FLAG: RENEWED = 1; NOT RENEWED = 0

fp.t$RENEWAL_FLAG[which(fp.data$RENEWAL_FLAG == 'RENEWED')] <- 1
fp.t$RENEWAL_FLAG[which(fp.data$RENEWAL_FLAG == 'NOT RENEWED')] <- 0

# now convert column to numeric since it is char
fp.t$RENEWAL_FLAG <- as.numeric(fp.t$RENEWAL_FLAG)

# check results
summary(fp.t$RENEWAL_FLAG)
  1. Contract expiration date - ignore

NO ACTION TAKEN

  1. CONTRACT_LENGTH - could go with annual = 1 and multi-year = 3 (most common value for each)
# transform CONTRACT_LENGTH: ANNUAL = 1; MULTI-YEAR = 3

fp.t$CONTRACT_LENGTH[which(fp.data$CONTRACT_LENGTH == 'ANNUAL')] <- 1
fp.t$CONTRACT_LENGTH[which(fp.data$CONTRACT_LENGTH == 'MULTI-YEAR')] <- 3

# now convert column to numeric since it is char
fp.t$CONTRACT_LENGTH <- as.numeric(fp.t$CONTRACT_LENGTH)

# check results
summary(fp.t$CONTRACT_LENGTH)
  1. Sales category - would separate into 3 new columns - ENT = 1, not ENT = 0, PS = 1, not PS=0, COM = 1, not COM = 0

Switch SP and SMB to other as they are not significant

# transform SALES_CATEGORY - create 3 new columns COM, ENT, PS

# build COM column
# first check count of COM entries
fp.t$SALES_CATEGORY[which(fp.t$SALES_CATEGORY=='SP')] <-'OTHER'
fp.t$SALES_CATEGORY[which(fp.t$SALES_CATEGORY=='SMB')] <-'OTHER'

#sum(fp.data$SALES_CATEGORY == 'COM')

#fp.t$COM <- 0
#fp.t$COM[fp.data$SALES_CATEGORY == 'COM'] <- 1
# make sure new column matches original
#sum(fp.t$ENT == 1)

# ---------------------------------
# build ENT column
#sum(fp.data$SALES_CATEGORY == 'ENT')

#fp.t$ENT <- 0
#fp.t$ENT[fp.data$SALES_CATEGORY == 'ENT'] <- 1
#sum(fp.t$ENT == 1)

# ---------------------------------
# build PS column
#sum(fp.data$SALES_CATEGORY == 'PS')

#fp.t$PS <- 0
#fp.t$PS[fp.data$SALES_CATEGORY == 'PS'] <- 1
#sum(fp.t$PS == 1)
  1. Seller_unique ID - not relevant for binary conversation - too complex

NO ACTION TAKEN

  1. CONTRACT_VALUE_CATEGORY - going to bet it is collinear with cost and can be ignored - or could rank 1 to 5 lowest to highest buckets

IGNORE FOR NOW

# transform CONTRACT_VALUE_CATEGORY: <10k = 0, 10K-25K = 1, etc..

#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '<10K')] <- 0
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '10K-25K')] <- 1
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '25K-50K')] <- 2
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '50K-100K')] <- 3
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '100K-250K')] <- 4
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '>250K')] <- 5

# now convert column to numeric since it is char
#fp.t$CONTRACT_VALUE_CATEGORY <- as.numeric(fp.t$CONTRACT_VALUE_CATEGORY)

# check results
#summary(fp.t$CONTRACT_VALUE_CATEGORY)
  1. CONTRACT_LINE_CATEGORY- going to bet it is collinear with item count and can be ignored - or could rank 1 to 5 with lowest to highest category IGNORE FOR NOW
# transform CONTRACT_LINE_CATEGORY: VERY SMALL = 1, SMALL = 2, MEDIUM = 3, LARGE = 4, 
# VERY LARGE = 5

#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'VERY SMALL')] <- 1
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'SMALL')] <- 2
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'MEDIUM')] <- 3
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'LARGE')] <- 4
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'VERY LARGE')] <- 5

# now convert column to numeric since it is char
#fp.t$CONTRACT_LINE_CATEGORY <- as.numeric(fp.t$CONTRACT_LINE_CATEGORY)

# check results
#summary(fp.t$CONTRACT_LINE_CATEGORY)
  1. DISCOUNT_CATEGORY - could go with no or some or call no 0, small 10, medium 20, large 30, very large 50 IGNORE FOR NOW
# transform CONTRACT_LINE_CATEGORY: NO DISCOUNT = 0, SMALL = 10, MEDIUM = 20, LARGE = 30, 
# VERY LARGE = 50

#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'NO DISCOUNT')] <- 0
#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'MEDIUM')] <- 20
#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'LARGE')] <- 30
#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'VERY LARGE')] <- 50

# now convert column to numeric since it is char
#fp.t$DISCOUNT_CATEGORY <- as.numeric(fp.t$DISCOUNT_CATEGORY)

# check results
#summary(fp.t$DISCOUNT_CATEGORY)

11, MULTIPLE_SERVICES = Yes = 1 and No = 0

# transform MULTIPLE_SERVICES: YES = 1; NO = 0

fp.t$MULTIPLE_SERVICES[which(fp.data$MULTIPLE_SERVICES == 'YES')] <- 1
fp.t$MULTIPLE_SERVICES[which(fp.data$MULTIPLE_SERVICES == 'NO')] <- 0

# now convert column to numeric since it is char
fp.t$MULTIPLE_SERVICES <- as.numeric(fp.t$MULTIPLE_SERVICES)

# check results
summary(fp.t$MULTIPLE_SERVICES)
  1. SALES_STRATEGY - GCS = 1, not GCS = 0
# transform SALES_STRATEGY: GCS = 1; NOT GCS = 0

fp.t$SALES_STRATEGY[which(fp.data$SALES_STRATEGY == 'GCS')] <- 1
fp.t$SALES_STRATEGY[which(fp.data$SALES_STRATEGY == 'NOT GCS')] <- 0

# now convert column to numeric since it is char
fp.t$SALES_STRATEGY <- as.numeric(fp.t$SALES_STRATEGY)

# check results
summary(fp.t$SALES_STRATEGY)

fix discount issues:

fp.t$DISCOUNT_CATEGORY[which(fp.t$DISCOUNT_CATEGORY == '#DIV/0!')] <- 'NO DISCOUNT'

Now write transformed data set to a file

write.csv(fp.t, file = "c:/SQLData/621/fp_transformed_data.csv", row.names = FALSE)

Create Training and Evaluation data sets

The R code shown below was used to separate the master data set into dedicated Training and Evaluation data sets.

library(dplyr)

fp <- read.csv("https://raw.githubusercontent.com/jeffnieman11/621_final_data/master/final_data.csv", stringsAsFactors = FALSE)

# create EVALUATION data set: sample 20000 rows at random from master file
set.seed(123)
eval.d <- fp[sample(nrow(fp), 20000), ]

# sort results by INDEX
eval.d <- arrange(eval.d, INDEX)

# create TRAINING data set: remove all eval.d observations from master data set
training.d <- fp[-eval.d$INDEX, ]

# now write both files to disc
write.csv(eval.d, file = "C:/SQLData/621/FP-EVAL-DATASET.csv", row.names = FALSE)
write.csv(training.d, file = "C:/SQLData/621/FP-TRAINING-DATASET.csv", row.names = FALSE)

rm(list = ls())

————————————————————————————————————————–

Regression Modeling: The Details

This section of the Appendix presents an in-depth review of our regression modeling work for this project. Portions of this writeup were excerpted for inclusion within the body of the main document.

————————————————————————————————————————–

Binary Logistic Regression Model for Predicting Contract Renewals

Three distinct binary logistic regression models were constructed for purposes of predicting whether or not a service contract was likely to be renewed. The models’ performance metrics were subsequently compared against each other to allow us to select our “best” binary logistic regression model for purposes of making predictions of contract renewals for the Evaluation data set. Each of the three models used the data set’s RENEWAL_FLAG attribute as the dependent response variable, while various subsets of the potential predictor variables were used as independent variables. A description of each model is provided below.

Binary Model 1: Use Backward Selection to Build a Model

Approach:

Our first binary model applied simple backward selection based on p-values and F-tests in an attempt to identify a binary logistic regression model comprised of only statistically significant predictors relative to the data set’s RENEWAL_FLAG response variable.

The initial iteration of this approach excluded the ‘ITEM_COUNT’ and ‘COST’ variables so that the model would be comprised solely of categorical predictor variables. The results of the initial iteration showed that each of the included categorical predictor variables were, in fact, statistically significant predictors of the RENEWAL_FLAG response variable, producing a model comprised of 10 predictor variables with performance metrics as indicated in the table below:

Metric Score
Number of Predictors 10
AIC 82589
Accuracy 0.7134
Classification Error Rate 0.2866
Precision 0.6369
Sensitivity 0.2443
Specificity 0.9344
F1 Score 0.3531
AUC 0.5893

The logit coefficients for this model are shown below. Please note that coefficients for each category of the multi-category predictors are included.

Coeff. Variable Coeff. Variable
- 1.453 Intercept + 0.020 CONTRACT_LINE_CATMEDIUM
+ 0.932 AREA1 - 0.097 CONTRACT_LINE_CATSMALL
+ 0.325 SERVICE_APPROACH1 - 0.120 CONTRACT_LINE_CATVERY LARGE
+ 0.078 TIER1 - 0.460 CONTRACT_LINE_CATVERY SMALL
- 1.016 CONTRACT_LENGTHMult-YR - 0.310 DISCOUNT_CATEGORYMEDIUM
- 0.261 SALES_CATEGORYENT + 0.376 DISCOUNT_CATEGORYNO DISCOUNT
+ 0.089 SALES_CATEGORYOTHER + 1.158 DISCOUNT_CATEGORYVERY LARGE
- 0.088 SALES_CATEGORYPS + 1.363 MULTIPLE_SERVICES1
+ 1.195 CONTRACT_VAL_CAT>250K + 0.194 SALES_STRATEGY1
+ 0.795 CONTRACT_VAL_CAT100K-250K (-) (-)
+ 0.421 CONTRACT_VAL_CAT10K-25K (-) (-)
+ 0.569 CONTRACT_VAL_CAT25K-50K (-) (-)
+ 0.726 CONTRACT_VAL_CAT50K-100K (-) (-)

————————————————————————————————————————–

Binary Model 2: Use Only Predictors That Are Independent of Contract Terms

Approach:

For our second binary model we decided to include only those predictor variables that appeared to be independent of any contract terms / conditions. For example, we knew that several variables, including ‘COST’, ‘ITEM_COUNT’, ‘DISCOUNT_CATEGORY’, and ‘CONTRACT_LENGTH’ amongst others, could in fact be considered components of an actual service contract. On the other hand, the ‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘MULTIPLE_SERVICES’, and ‘SALES_STRATEGY’ categorical variables were either simply descriptors of the size or physical location of a customer or were reflective of how the technology company had decided to manage its relationship with a given customer. Therefore, we constructed a model comprised of only ‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘MULTIPLE_SERVICES’, and ‘SALES_STRATEGY’, and each of those variables was found to be a statistically significant predictor of the RENEWAL_FLAG response variable. Performance metrics for that model are indicated in the table below:

Metric Score
Number of Predictors 5
AIC 87554
Accuracy 0.6836
Classification Error Rate 0.3164
Precision 0.7426
Sensitivity 0.0184
Specificity 0.9970
F1 Score 0.0360
AUC 0.5077

When compared to the first binary logistic model discussed above, this second much simpler model did a poor job of accurately predicting actual contract renewals, with nearly all of the prospective contract renewals being predicted as less than 0.5 likely to occur.

The logit coefficients for this model are shown below. Please note that coefficients for each category of the multi-category predictors are included.

Coeff. Variable
- 1.609 Intercept
+ 0.797 AREA1
+ 0.308 SERVICE_APPROACH1
+ 0.207 TIER1
+ 1.861 MULTIPLE_SERVICES1
+ 0.174 SALES_STRATEGY1

————————————————————————————————————————–

Binary Regression Model Selection

Binary model 2 was eliminated from consideration for selection due to its relatively poor ability to predict contract renewals. Both of the remaining models had very similar performance statistics as can be seen in the summary table shown below.

Metric Model 1 Model 3
# Predictors 10 10
AIC 82589 82057
Accuracy 0.7134 0.7132
Class.Err.R. 0.2866 0.2868
Precision 0.6369 0.6301
Sensitivity 0.2443 0.2425
Specificity 0.9344 0.9302
F1 Score 0.3531 0.3606
AUC 0.5893 0.5914

As we can see in the table, both of the models are fairly similar in their performance metrics, with Model 1 having the highest accuracy, precision, sensitivity, and specificity, while Model 3 has the lowest AIC, highest F1 score, and highest AUC. Since there was no obvious reason to choose either model, a likelihood ratio test was applied, and its result indicated that Model 3 was the preferred model of the two. Therefore, we selected Binary Model 3 for purposes of predicting whether or not a service contract is likely to renew.

————————————————————————————————————————–

Testing the Selected Model on the Training Data

We checked the performance of the selected model prior to applying it to our evaluation data by applying it to the training data set. This allowed us to assess how well the model’s predictions matched up to the training data’s actual ‘RENEWAL_FLAG’ values.

Summary statistics for the relevant ‘RENEWAL_FLAG’ values and the binomial logistic regression predictions for the corresponding observations are provided below: (NOTE: “n” represents the number of records within a data set).

Variable n Mean sd Med. Min Max Range Skew Kurtosis SE
RENEWAL_FLAG 71726 0.32 0.47 0 0 1 1 0.77 -1.41 0
Predicted 71726 0.13 0.33 0 0 1 1 2.22 2.94 0

————————————————————————————————————————–

Binary Logistic Regression Model Inferences:

We can infer the following from the selected binary logistic regression model’s coefficients:

Coeff. Variable Coeff. Variable
- 2.948 Intercept + 0.416 CONTRACT_VALUE (>250K)
+ 0.901 AREA1 + 0.261 CONTRACT_VALUE (100K-250K)
+ 0.286 SERVICE_APPROACH1 + 0.101 CONTRACT_VALUE (10K-25K)
+ 0.143 TIER1 + 0.178 CONTRACT_VALUE (25K-50K)
- 1.066 CONTRACT_LENGTH (Mult-YR) + 0.282 CONTRACT_VALUE (50K-100K)
- 0.295 SALES_CATEGORY (ENT) - 0.263 DISCOUNT_CATEGORY (MEDIUM)
+ 0.097 SALES_CATEGORY (OTHER) + 0.547 DISCOUNT_CATEGORY (NO DISCOUNT)
- 0.127 SALES_CATEGORY (PS) + 1.099 DISCOUNT_CATEGORY (VERY LARGE)
+ 0.206 SALES_STRATEGY1 + 0.092 log(ITEM_COUNT + 1)
(-) (-) + 0.142 log(COST + 1)
  • AREA: Domestic customers are more likely to renew their service contracts than are foreign customers.

  • SERVICE_APPROACH: Customers whose relationships are managed using the “Land and Expand” account management strategy are more likely to renew their service contracts than are customers whose relationships are not managed using the “Land and Expand” strategy.

  • TIER: Customers managed via a second tier channel sales distribution method are more likely to renew than are customers managed via a first tier channel sales distribution method.

  • CONTRACT_LENGTH: Customers that have multi-year service contracts are less likely to renew their contracts than are customers with annual contracts.

  • SALES_CATEGORY: Enterprise and public sector customers are less likely to renew their service contracts than are other types of customers.

  • CONTRACT_VALUE_CATEGORY: Contracts valued at more than $250,000 are more likely to renew than are other contracts.

  • DISCOUNT_CATEGORY: Customers that receive very large discounts are more likely to renew their contracts than are other types of customers.

  • ITEM_COUNT: The greater the number of line items contained in a service contract, the more likely it is that the contract will be renewed.

  • COST: The larger the list price of the items listed in the contract is, the more likely it is that the contract will be renewed.

  • SALES_STRATEGY: Customers whose relationships are managed using the ‘GCS’ sales strategy are more likely to renew their contracts than are customers whose relationships are managed using other sales strategies.

The coefficients can also be exponentiated so as to allow for their interpretation as odds ratios:

Var OR 2.5 % 97.5 %
(Intercept) 0.2311089 0.1973334 0.2706535
AREA1 2.5367399 2.4125369 2.6680595
SERVICE_APPROACH1 1.3803906 1.3353151 1.4270608
TIER1 1.0792109 1.0452289 1.1143245
CONTRACT_LENGTHMULTI-YEAR 0.3660953 0.3458963 0.3873816
SALES_CATEGORYENT 0.7810605 0.7076640 0.8614062
SALES_CATEGORYOTHER 1.0726579 0.9947881 1.1562938
SALES_CATEGORYPS 0.9067025 0.8231195 0.9980575
CONTRACT_VALUE_CATEGORY>250K 3.3883859 2.7073767 4.2526766
CONTRACT_VALUE_CATEGORY100K-250K 2.3854469 2.0401162 2.7909619
CONTRACT_VALUE_CATEGORY10K-25K 1.5244816 1.4351413 1.6191997
CONTRACT_VALUE_CATEGORY25K-50K 1.7114976 1.5653377 1.8710840
CONTRACT_VALUE_CATEGORY50K-100K 1.9905038 1.7628608 2.2477106
CONTRACT_LINE_CATEGORYMEDIUM 1.0249487 0.9052904 1.1604823
CONTRACT_LINE_CATEGORYSMALL 0.9389611 0.8518029 1.0351946
CONTRACT_LINE_CATEGORYVERY LARGE 1.0184574 0.7675312 1.3531406
CONTRACT_LINE_CATEGORYVERY SMALL 0.6403009 0.5823288 0.7042031
DISCOUNT_CATEGORYMEDIUM 0.7283804 0.6988528 0.7591717
DISCOUNT_CATEGORYNO DISCOUNT 1.4498368 1.3689804 1.5355683
DISCOUNT_CATEGORYVERY LARGE 3.1614161 3.0016316 3.3299904
MULTIPLE_SERVICES1 3.9689639 3.3258403 4.7555328
SALES_STRATEGY1

This exponentiation of the coefficients allowed us to more clearly relate their apparent effects to the behavioral economic strategies described in our Literature Review. The table below summarizes our observations in that regard.

Model 1 - Influence on Contract Renewal
Strategic Predictor Data Column Found Category Effect Rank
Land & Expand Service_Approach Sales Growth Positive Use of ‘S’
(‘S’,‘Not S’)
Adoption Services Sales_Strategy Sales Growth Positive Use ‘GLS’
(ENT,COM,PS, OTH)
Adaptive Learning Sales_Strategy Sales Growth Positive Use ‘GLS’
(ENT,COM,PS, OTH)
Hyperbolic Discounting Discount_Category Cost Factors Mixed VLG,NONE,LG,MED
(MED,LG,VLG,NONE)
Reference Dependence Cost Cost Factors NA NA
(Continuous Numeric)
Churn Contract_Value_Category Cost Factors Positive > $, > Effect
(Price Range)
Incentive Theory Discount_Category Cost Factors Mixed VLG,NONE,LG,MED
(Discount) (MED,LG,VLG,NONE)
Incentive Theory Multiple_Services Contract NA NA
(1-Tier, Multi-Tier) Features
Contract Duration Contract_Length Contract Negative Use ‘Annual’
(Annual, Multi-Yr) Features
Channel Pass-Through Tier Sales Growth Positive Use ‘Channel’
(Single, Multi) Sales Growth

————————————————————————————————————————–

The Negative Binomial Model for Predicting Contract Item Counts

While the binary logistic regression model we selected can be useful for predicting whether or not a contract will renew, that model does not allow us to predict the likely dollar value of the renewed contract. The ‘ITEM_COUNT’ variable indicates the number of line items included in a contract, and as such can be used as a proxy for the overall dollar value of the contract since, in general, the larger the number of line items listed in a contract, the larger the dollar value of the contract will likely be.

As such, it was decided that a count regression model would be constructed for purposes of estimating the likely number of items that might be included in a contract if that contract was predicted to be renewed. Our data exploration efforts indicated that the variance of the ‘ITEM_COUNT’ variable was much larger than its mean, thereby ruling out the possibility of applying a Poisson count regression model and pointing us toward the use of a negative binomial count regression model instead.

Our negative binomial model excluded the ‘CONTRACT_LINE_CATEGORY’ so as to avoid potential collinearity issues with the response variable. An initial set of modeling iterations led to the removal of a few statistically insignificant predictors but yielded a model whose ‘ITEM_COUNT’ predictions were wildly inaccurate, with some predictions exceeding one trillion possible line items. Further investigation revealed that the ‘COST’ variable was the source of the problem: that variable’s very large variance and outliers were causing the negative binomial model to generate wildly inaccurate predictions.

Further discussions with representatives of the tech company revealed that removal of the ‘COST’ variable’s outliers (basically very large total list price amounts) was viable since the outliers comprised a relatively small percentage of the contract data we’d been provided. It was therefore decided that all contracts whose ‘COST’ values exceeded $100,000 would be ignored for purposes of predicting the ‘ITEM_COUNT’ variable. Such contracts comprised approximately 3% of the total contract’s we’d been provided.

The resulting multi-step ‘ITEM_COUNT’ prediction process was thereby as follows:

  • Use the selected binary logistic regression model to predict whether or not a contract is likely to renew.

  • For those contracts that are deemed likely to renew, predict the likely ‘ITEM_COUNT’ amounts using the negative binomial regression model for all contracts having (COST <= 100,000).

Subsequent modeling iterations yielded a negative binomial model with the following performance characteristics:

Metric Value
Number of Predictors 7
AIC 146419
Dispersion 6.00

A jitter plot of the model’s rounded fitted values against the ‘ITEM_COUNT’ response variable shows that the model appears to perform rather poorly at accurately predicting the likely number of line items for a contract that is likely to renew:

Summary statistics for the relevant ‘ITEM_COUNT’ values and the Negative Binomial predictions for the corresponding observations are provided below: (NOTE: “n” represents the number of records within a data set).

Variable n Mean sd Med. Min Max Range Skew Kurtosis SE
ITEM_COUNT 22243 10.05 137.74 3 1 15176 15175 65.91 6730.87 0.92
Predicted 22243 47.06 292.12 7 3 9198 9195 15.63 325.55 1.96

The summary statistics are reflective of the jitter plot shown above. The predicted values for the item counts are much less variable than are the corresponding actual ‘ITEM_COUNT’ values. Furthermore, the skew and kurtosis of the actual ‘ITEM_COUNT’ values greatly exceeds that of the predicted values.

The coefficients for this negative binomial model are shown below. Please note that coefficients for each category of the multi-category predictors are included.

Coeff. Variable Coeff. Variable
+ 2.234 Intercept + 0.864 CONTRACT_VAL_CAT>250K
- 0.544 SERVICE_APPROACH1 + 0.456 CONTRACT_VAL_CAT100K-250K
- 0.222 CONTRACT_LENGTHMult-YR + 0.680 CONTRACT_VAL_CAT10K-25K
- 0.081 SALES_CATEGORYENT + 0.343 CONTRACT_VAL_CAT25K-50K
- 0.313 SALES_CATEGORYOTHER + 0.394 CONTRACT_VAL_CAT50K-100K
- 0.057 SALES_CATEGORYPS - 0.230 DISCOUNT_CATEGORYMEDIUM
+ 0.582 MULTIPLE_SERVICES1 + 0.020 DISCOUNT_CATEGORYNO DISCOUNT
+ 0.001 COST - 0.037 DISCOUNT_CATEGORYVERY LARGE

Negative Binomial Regression Model Inferences:

We can infer the following from the model’s coefficients:

  • SERVICE_APPROACH: Customers whose relationships are managed using the “Land and Expand” account management strategy are less likely to have relatively higher contract item counts than are customers whose relationships are not managed using the “Land and Expand” strategy.

  • CONTRACT_LENGTH: Customers that have multi-year service contracts are less likely to have relatively higher contract item counts than are customers with annual contracts.

  • SALES_CATEGORY: Customers that are neither Public sector nor enterprise customers are less likely to have higher contract item counts than are customers of other sales categories.

  • CONTRACT_VALUE_CATEGORY: Customers whose contract values exceed $250,000 in value are more likely to have higher contract item counts than are other customers.

  • DISCOUNT_CATEGORY: Customers receiving no discounts are more likely to have relatively higher contract item counts than are other customers receiving discounts.

  • MULTIPLE_SERVICES: Customers contracting for more than a single service are more likely to have relatively higher contract item counts.

  • COST: Contracts having higher total list prices are more likely to have relatively higher contract item counts than are contracts having lower total list prices.

————————————————————————————————————————–

Regression Modeling - R Source Code for Binary Logit Models

This section of the Appendix contains the source R code used for constructing and evaluating various binary logistic regression models for purposes of predicting the likelihood of a service contract being renewed.

library(alr3)
library(car)
library(knitr)
library(pROC)
library(lmtest)

fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-TRAINING-DATASET.csv", stringsAsFactors = FALSE)

# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)

# NOTE: RENEWAL_FLAG is the predictor so ignore it
summary(fp[,-1])
f <- fp[,-c(1,6,9)]

m1 <- glm(data = f, RENEWAL_FLAG ~ . - ITEM_COUNT - COST, family=binomial(link="logit"))
summary(m1)


# no point in doing mmps due to all variables being categorical
#par(mfrow=c(1,1))
#mmps(final, layout=c(4,3), key=T)
# Load R functions for model statistics

accuracy <- function(actual, predicted){
  
  # Equation to be modeled: (TP + TN) / (TP + FP + TN + FN)
  
  # derive confusion matrix cell values
  c.mat <- data.frame(table(actual, predicted))
  
  # extract all four confusion matrix values from the data frame
  TN <- as.numeric(as.character(c.mat[1,3]))
  FN <- as.numeric(as.character(c.mat[2,3]))
  FP <- as.numeric(as.character(c.mat[3,3]))
  TP <- as.numeric(as.character(c.mat[4,3]))
  
  # now calculate the required metric
  return( (TP + TN) / (TP + FP + TN + FN) )
}
classif.err.rate <- function(actual, predicted) {
  
  # Equation to be modeled: (FP + FN) / (TP + FP + TN + FN)
  
  # derive confusion matrix cell values
  c.mat <- data.frame(table(actual, predicted))
  
  # extract all four confusion matrix values from the data frame
  TN <- as.numeric(as.character(c.mat[1,3]))
  FN <- as.numeric(as.character(c.mat[2,3]))
  FP <- as.numeric(as.character(c.mat[3,3]))
  TP <- as.numeric(as.character(c.mat[4,3]))
  
  # now calculate the required metric
  return( (FP + FN) / (TP + FP + TN + FN) )
}  
precision <- function(actual, predicted) {
  
  # Precision : the proportion of positive cases that were correctly identified.
  
  # Equation to be modeled: TP / (TP + FP)
  
  # derive confusion matrix cell values
  c.mat <- data.frame(table(actual, predicted))
  
  # extract all four confusion matrix values from the data frame
  TN <- as.numeric(as.character(c.mat[1,3]))
  FN <- as.numeric(as.character(c.mat[2,3]))
  FP <- as.numeric(as.character(c.mat[3,3]))
  TP <- as.numeric(as.character(c.mat[4,3]))
  
  # now calculate the required metric
  return( TP / (TP + FP) )
}  
sensitivity <- function(actual, predicted) {
  
  # Equation to be modeled: TP / (TP + FN)
  
  # derive confusion matrix cell values
  c.mat <- data.frame(table(actual, predicted))
  
  # extract all four confusion matrix values from the data frame
  TN <- as.numeric(as.character(c.mat[1,3]))
  FN <- as.numeric(as.character(c.mat[2,3]))
  FP <- as.numeric(as.character(c.mat[3,3]))
  TP <- as.numeric(as.character(c.mat[4,3]))
  
  # now calculate the required metric
  return( TP / (TP + FN) )
}  
specificity <- function(actual, predicted) {
  
  # Equation to be modeled: TN / (TN + FP)
  
  # derive confusion matrix cell values
  c.mat <- data.frame(table(actual, predicted))
  
  # extract all four confusion matrix values from the data frame
  TN <- as.numeric(as.character(c.mat[1,3]))
  FN <- as.numeric(as.character(c.mat[2,3]))
  FP <- as.numeric(as.character(c.mat[3,3]))
  TP <- as.numeric(as.character(c.mat[4,3]))
  
  # now calculate the required metric
  return( TN / (TN + FP) )
}  
F1.Score <- function(actual, predicted) {
  
  # Equation to be modeled: ( 2 * precision * sensitivity) / (precision + sensitivity)
  
  # now calculate the required metric
  return( ( 2 * precision(actual, predicted) * sensitivity(actual, predicted)) 
          / (precision(actual, predicted) + sensitivity(actual, predicted)) )
}  

Now run metrics

# Coefficient Interpretation

# Logit model average marginal effects - use it to generate interpretable versions of coefficients
LogitScalar <- mean(dlogis(predict(m1, type = "link")))
LogitScalar * coef(m1)

# Logit model predicted probabilities - yields likelihood that each eval item is '+'
# 
predprob.renew <- round(predict(m1, type="response"), 2)
summary(predprob.renew)

# Percent correctly predicted values
# NOTE: Need to create variable 'Y' for this to work - set it to response variable
Y <- f$RENEWAL_FLAG

pred.renew <- round(fitted(m1))

table(true = Y, pred = pred.renew) 


# t.r <- data.frame(table(true = Y, pred = pred.crime))
# t.r

# now use functions built in HW 2 to get required statistics
accuracy(Y, pred.renew)
classif.err.rate(Y, pred.renew)
precision(Y, pred.renew)
sensitivity(Y, pred.renew)
specificity(Y, pred.renew)
F1.Score(Y, pred.renew)

# get AUC
rocCurve <- roc(response= Y, predictor= pred.renew)
auc(rocCurve)
# ALTERNATIVE REDUCED MODEL EXCLUDING CONTRACT TERMS
m2 <- glm(data = f, RENEWAL_FLAG ~ AREA + SERVICE_APPROACH + TIER  + SALES_STRATEGY + MULTIPLE_SERVICES, family=binomial(link="logit"))

summary(m2)

Now run metrics

# Coefficient Interpretation

# Logit model average marginal effects - use it to generate interpretable versions of coefficients
LogitScalar <- mean(dlogis(predict(m2, type = "link")))
LogitScalar * coef(m2)

# Logit model predicted probabilities - yields likelihood that each eval item is '+'
# 
predprob.renew <- round(predict(m2, type="response"), 2)
summary(predprob.renew)

# Percent correctly predicted values
# NOTE: Need to create variable 'Y' for this to work - set it to response variable
Y <- f$RENEWAL_FLAG

pred.renew <- round(fitted(m2))

table(true = Y, pred = pred.renew) 


# now use functions built in HW 2 to get required statistics
accuracy(Y, pred.renew)
classif.err.rate(Y, pred.renew)
precision(Y, pred.renew)
sensitivity(Y, pred.renew)
specificity(Y, pred.renew)
F1.Score(Y, pred.renew)

# get AUC
rocCurve <- roc(response= Y, predictor= pred.renew)
auc(rocCurve)
# ALTERNATIVE REDUCED MODEL EXCLUDING CONTRACT TERMS
m3 <- glm(data = f, RENEWAL_FLAG ~ . - CONTRACT_LINE_CATEGORY, family=binomial(link="logit"))

summary(m3)

STOP - Model shows neither ITEM_COUNT or COST are statistically significant predictors

DO NOT run metrics since STOPPED

summary(powerTransform(ITEM_COUNT+1~RENEWAL_FLAG, f, family="bcPower"))
#use log

summary(powerTransform(COST+1~RENEWAL_FLAG, f, family="bcPower"))
#use log

m4 <- glm(data=f, RENEWAL_FLAG~AREA + SERVICE_APPROACH + TIER + CONTRACT_LENGTH + SALES_CATEGORY + CONTRACT_VALUE_CATEGORY + DISCOUNT_CATEGORY + log(ITEM_COUNT + 1) + log(COST + 1) + SALES_STRATEGY, family=binomial(link="logit"))
summary(m4)
## odds ratios only
exp(coef(m4))

## odds ratios and 95% CI
exp(cbind(OR = coef(m4), confint(m4)))
# Coefficient Interpretation

# Logit model average marginal effects - use it to generate interpretable versions of coefficients
LogitScalar <- mean(dlogis(predict(m4, type = "link")))
LogitScalar * coef(m4)

# Logit model predicted probabilities - yields likelihood that each eval item is '+'
# 
predprob.renew <- round(predict(m4, type="response"), 2)
summary(predprob.renew)

# Percent correctly predicted values
# NOTE: Need to create variable 'Y' for this to work - set it to response variable
Y <- f$RENEWAL_FLAG

pred.renew <- round(fitted(m4))

table(true = Y, pred = pred.renew) 


# t.r <- data.frame(table(true = Y, pred = pred.crime))
# t.r

# now use functions built in HW 2 to get required statistics
accuracy(Y, pred.renew)
classif.err.rate(Y, pred.renew)
precision(Y, pred.renew)
sensitivity(Y, pred.renew)
specificity(Y, pred.renew)
F1.Score(Y, pred.renew)

# get AUC
rocCurve <- roc(response= Y, predictor= pred.renew)
auc(rocCurve)

# now plot TARGET against fitted values
fit1 <- round(m4$fitted.values)

describe(fit1)
describe(f$RENEWAL_FLAG)

Try a lrtest on m1 vs m4

lrtest(m4, m1)

————————————————————————————————————————–

Regression Modeling - R Source Code for Negative Binomial Models

This section of the Appendix contains the source R code used for constructing and evaluating various negative binomial count regression models for purposes of predicting the likely number of contract line items for a contract that had been predicted as likely to be renewed.

library(alr3)
library(car)
library(knitr)
library(pROC)
library(psych)
library(MASS)
options(scipen=999)

# old file load - used full master data set
# fp <- read.csv("https://raw.githubusercontent.com/jeffnieman11/621_final_data/master/final_data.csv", stringsAsFactors = FALSE)

# new file load - uses only TRAINING data set
fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-TRAINING-DATASET.csv", stringsAsFactors = FALSE)


# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)


# NOTE: RENEWAL_FLAG is the predictor so ignore it
summary(fp[,-1])
f <- fp[,-c(1,6,9)]



# subset only contracts that renewed AND COST <= 100000 (to remove outliers)
renews <- subset(f, RENEWAL_FLAG == 1 & COST <= 100000)


# renumber rows
# rownames(renews) <- 1:nrow(renews)

# fit a negative binomial model
negb.1 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY, renews)
summary(negb.1)

# remove AREA and TIER
negb.2 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - AREA - TIER, renews)
summary(negb.2)

# calculate dispersion: overdispersed at 6.63
(dp <- sum(residuals(negb.2,type="pearson")^2)/negb.2$df.res)

# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.2,dispersion=dp)

# --------------------------------------------------------

# drop SALES_STRATEGY

negb.3 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - AREA - TIER - SALES_STRATEGY, renews)
summary(negb.3)

# calculate dispersion: overdispersed at 6.0
(dp <- sum(residuals(negb.3,type="pearson")^2)/negb.3$df.res)

# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.3,dispersion=dp)

# check goodness of fit
1 - pchisq(deviance(negb.3),df.residual(negb.3))

# anova goodness of fit: indicates good fit to all variables
anova(negb.3, test="Chisq")

# check diagnostic plots: shows one extreme but valid outlier
par(mfrow = c(2,2))
plot(negb.3)

# now plot TARGET against fitted values
fit1 <- round(negb.3$fitted.values)
summary(fit1) # max is 9198 so use 10000 for plot axis

# check summary for max ITEM_COUNT
summary(renews$ITEM_COUNT)

par(mfrow = c(1,1))

plot(jitter(fit1), renews$ITEM_COUNT, xlab="Fitted Values", xaxp  = c(0, 10000, 20))

describe(fit1)
describe(renews$ITEM_COUNT)
# drop CONTRACT_LENGTH and COST

negb.4 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST, renews)
summary(negb.4)

# Remove AREA
negb.5 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA, renews)
summary(negb.5)

# calculate dispersion: overdispersed at 6.0
(dp <- sum(residuals(negb.5,type="pearson")^2)/negb.5$df.res)

# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.5,dispersion=dp)

# remove SALES_STRATEGY and TIER
negb.6 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA - SALES_STRATEGY - TIER, renews)
summary(negb.6)

# calculate dispersion: overdispersed at 6.0
(dp <- sum(residuals(negb.6,type="pearson")^2)/negb.6$df.res)

# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.6,dispersion=dp)

# check goodness of fit
1 - pchisq(deviance(negb.6),df.residual(negb.6))

# anova goodness of fit: indicates good fit to all variables
anova(negb.6, test="Chisq")

# check diagnostic plots: shows one extreme but valid outlier
par(mfrow = c(2,2))
plot(negb.6)

# now plot TARGET against fitted values
fit1 <- round(negb.6$fitted.values)
summary(fit1) # max is 592 so use 600 for plot axis

# check summary for max ITEM_COUNT
summary(renews$ITEM_COUNT)

par(mfrow = c(1,1))

plot(jitter(fit1), renews$ITEM_COUNT, xlab="Fitted Values", xaxp  = c(0, 600, 20))

describe(fit1)
describe(renews$ITEM_COUNT)
renews <- subset(f, RENEWAL_FLAG == 1)


negb.c1 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST, renews)
summary(negb.c1)

# remove area
negb.c2 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA, renews)
summary(negb.c2)

# remove TIER
negb.c3 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA - TIER, renews)
summary(negb.c3)

# calculate dispersion: overdispersed at 6.89
(dp <- sum(residuals(negb.c3,type="pearson")^2)/negb.c3$df.res)

# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.c3,dispersion=dp)

# remove SALES_STRATEGY and CONTRACT_LENGTH
negb.c4 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA - TIER - SALES_STRATEGY - CONTRACT_LENGTH, renews)
summary(negb.c4)

# calculate dispersion: overdispersed at 6.89
(dp <- sum(residuals(negb.c4,type="pearson")^2)/negb.c4$df.res)

# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.c4,dispersion=dp)

# anova goodness of fit: indicates good fit to all variables
anova(negb.c4, test="Chisq")

# check diagnostic plots: shows one extreme but valid outlier
par(mfrow = c(2,2))
plot(negb.c4)

# now plot TARGET against fitted values
fit1 <- round(negb.c4$fitted.values)
summary(fit1) # max is 5091 so use 6000 for plot axis

# check summary for max ITEM_COUNT
summary(renews$ITEM_COUNT)

par(mfrow = c(1,1))

plot(jitter(fit1), renews$ITEM_COUNT, xlab="Fitted Values", xaxp  = c(0, 6000, 20))

describe(fit1)
describe(renews$ITEM_COUNT)

————————————————————————————————————————–

R Code for Evaluation Data Set Predictions

This section of the Appendix contains the source R code used for applying the preferred regression models to the Evaluation data set.

R code for the required 2-stage prediction process

First stage:

  1. Load training data
  2. Perform any necessary transforms on data
  3. build selected binary logit regression model
  4. Load eval data
  5. Perform any necessary transforms on eval data
  6. use predict function to get required probabilities
  7. Save both the probabilities and their rounded 0/1 values to the eval data set
library(MASS)
library(psych)
options(scipen=999)

# load training set so that binary model can be built

fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-TRAINING-DATASET.csv", stringsAsFactors = FALSE)

# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)

# remove unused columns from data set
f <- fp[,-c(1,6,9)]

# build binary logit model
m4 <- glm(data=f, RENEWAL_FLAG~AREA + SERVICE_APPROACH + TIER + CONTRACT_LENGTH + SALES_CATEGORY +
            CONTRACT_VALUE_CATEGORY + DISCOUNT_CATEGORY + log(ITEM_COUNT + 1) + log(COST + 1) + 
            SALES_STRATEGY, family=binomial(link="logit"))

# ---------------------------------------

# now that model is built, load eval data set

# load EVAL data set
fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-EVAL-DATASET.csv", stringsAsFactors = FALSE)

# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)

# remove unused columns from data set
f <- fp[,-c(1,6,9)]

e.out <- fp

# save a copy of RENEWAL_FLAG for stats at end
Renew.f <- fp$RENEWAL_FLAG

# set renewal flag in eval data set to zero
f$RENEWAL_FLAG <- 0

# now predict TARGET_FLAG using model
pred.renew <- predict(m4, newdata=f, type="response")

# Save predicted probability and rounded value to new variables in output data frame
e.out$RENEWAL_PROB <- round(pred.renew, 3)
e.out$PREDICTED_RENEWAL <- round(pred.renew)

# set RENEWAL_FLAG in eval data set according to predictions
f$RENEWAL_FLAG <- round(pred.renew)

# write results to a file

# create data frame with ONLY those columns needed for comparisons
b.out <- data.frame(cbind(e.out$INDEX, e.out$RENEWAL_FLAG, e.out$RENEWAL_PROB, 
                          e.out$PREDICTED_RENEWAL, e.out$COST, e.out$ITEM_COUNT))

# Give columns proper names
colnames(b.out) <- c("INDEX", "RENEWAL_FLAG", "RENEWAL_PROB", "PREDICTED_RENEWAL", "COST", "ITEM_COUNT")

# write results to a file
#write.csv(b.out, file = "C:/SQLData/621/FP-RENEWAL_PREDS.csv", row.names = FALSE)

# how many were correct predicted?
(pred.cor <- nrow(subset(b.out, RENEWAL_FLAG == 1 & PREDICTED_RENEWAL == 1)))
 # how many total RENEWAL_FLAG == 1 in eval data set?
(renews.t <- nrow(subset(b.out, RENEWAL_FLAG == 1)))

What percentage of actual contract renewals were predicted correctly?

  • True positives = 1609 (out of 6292 actual RENEWAL_FLAG == 1)

  • False postiives = 2574 - 1609 = 965

  • True Negatives = 20000 - (6292 + 965) = 12743 (out of 13708 possible actual RENEWAL_FLAG == 0)

  • False Negatives = 6292 - 1609 = 4683

————————————————————————————————————————–

How well did the binary logit model perform on the eval data set?

  • Accuracy = (TP + TN) / (TP + FP + TN + FN) = (1609 + 12743) / 20000 = 0.7176

  • Classification Error Rate = (FP + FN) / (TP + FP + TN + FN) = (965 + 4683) / 20000 = 0.2824

  • Precision = TP / (TP + FP) = 1609 / (1609 + 965) = 0.6251

  • Sensitivity = TP / (TP + FN) = 1609 / (1609 + 4683) = 0.2256

  • Specificity = TN / (TN + FP) = 12743 / (12743 + 965) = 0.9296

  • F1 Score = ( 2 * precision * sensitivity) / (precision + sensitivity) = (2 * .6251 * .2256) / (.6251 + .2256) = 0.3315