The use of service-based contracts is vital to the future profitability of the high-tech industry as it pivots away from an historical emphasis on consumption based commodities. This paper examines several behavioral economic factors for their potential influence on the renewals of 91,727 service contracts provided by a large US-based high-tech company. The goal of our research was to attempt to predict both the likelihood of a service contract being renewed and the anticipated size (as defined by the number of line items contained within a contract) of any renewed contracts. Predictive models were constructed that included proxies of various behavioral factors (e.g., pricing, contract-specific terms, and sales growth strategies) to gauge their impact on contract renewals. Our preferred binomial logistic regression model proved effective at predicting nearly 72% of all customer contract renewal decisions while also indicating that the use of behavioral-driven sales growth strategies such as “Land and Expand”, “Adoption Services”, “Adaptive Learning”, and channel sales each appeared to have a positive effect on the likelihood of a contract being renewed. By contrast, the use of behavioral-driven cost factors and contract features such as discounting and contract durations appeared to yield mixed results. Finally, a negative binomial regression model proved ineffective at predicting the line item counts of renewed contracts, thereby indicating that simply assuming that the line item count of the contract being renewed will be equivalent to that of the previous contract may be a better method for predicting that metric.
The high-tech industry is currently experiencing a period of rapid transformation. While historically it was primarily focused on selling equipment such as computers, routers and other big box items, the focus has now shifted to the selling of services and software. While equipment sales are still a critical component of technology companies’ revenues, true profitablity is now being created as a result of high-tech companies embracing the concept of consumption-based sales management, wherein a technology supplier works closely with its customers to help them most effectively make use of and expand upon the hardware and services they have purchased. In a sense, the consumption-based sales management approach creates a value-driven partnership between a technology company and its customers, with the customers benefiting from the enhanced value they are able to realize from their technology purchases. The high-tech companies can benefit from such a partnership if it enables them to be more responsive to a customer’s needs while also enabling the sale of additional products or services to a customer.
This evolution of the high-tech industry has greatly elevated the importance of renewable service contracts as a source of both recurring revenues and account expansions for high-tech companies. High-tech companies have come to recognize that “high-touch” service practices can have a marked impact on both their profitability and their customer retention / expansion efforts. With this context in mind, our research attempts to answer two vital questions:
Is it possible to predict whether a fixed-length technology services contract will be renewed by a customer?
For those service contracts that are predicted to be renewed, how well can we predict the magnitude of the resulting renewed contract?
Implicit in these research questions is the goal of improving a company’s sales forecasting and sales management practices through the use of enhanced pro forma revenue forecasting techniques. For example, if a company can improve its abilities to predict which service contracts are unlikely to be renewed it would allow the company to more clearly evaluate the effectiveness of their sales and account management strategies, thereby enabling them to make improvements where needed.
To facilitate our research we received permission from a large US-based global high-tech company to make use of data related to 91,727 renewable service contracts for customers based in North and South America. The data include information regarding contract sizes, discount terms, the geographical location of customers, sales management strategies, and customer contract renewal decisions. This data trove serves as the basis for the research discussed herein.
Our literature review focused on identifying applicable strategies from the field of Behavioral Economics relative to the subject of customer contracts and contract renewals. Many such strategies appear to rely on mathematical models wherein key behavioral characteristics are modeled with respect to a relevant measurable parameter. Of those we reviewed, we chose several that appeared to be most closely related to our research and attempted to map their respective behavioral characteristics to variables contained within our data set. This approach allowed us to test some of the contract renewal theories described in the literature as part of our own investigative efforts. The specific contract-related strategies we researched are described below.
The most important class of behavioral characterstics influencing contractual relationships involves Pricing and Cost Factors. The most salient idea in Pricing that we investigated was the notion of “Reference Dependence”, which suggests that “when a point of reference against which outcomes are compared is changed (due to”framing“), the choices people make are sensitive to the change in frame” (Ho, Lim & Camerer, 2006, pp. 5-7). So if a data set includes a variable that constitutes a “reference point”" that could be adjusted during contract negotiations, we might expect that variable to be an influential factor in the negotiation of a contract.
By contrast, Hyperbolic Discounting represents the notion that the seller would try to create a psychological urgency to influence buying behavior in two ways: for “products that involve either immediate costs with delayed benefits… or immediate benefits with delayed costs” (Ho, et al, p. 26). For example, one way to create such urgency might be to offer customers discounted payment terms in return for a shorter payment period.
Finally, we investigated the propensity for customers to switch service providers, also known as “Churn”. Research on this behavior showed a that customers exhibit a greater likelihood for switching to lower cost contracts if such contracts offer reasonably similar benefits (Wirtz & Lovelock, 2016, p.474), (Ho, et al, p 26-29).
Incentive Theory (of which Discounting is a subset) tells us that asymmetric information between contract counterparties creates opportunities for the seller to induce sales (Brousseau, Glachant & Farrar, 2001, p.6). That theory can be tested by comparing the renewal likelihoods of contracts of different durations, or by comparing the renewal likelihoods of contracts containing varying quantities of service offerings.
Theoretical work has been done in the area of contract duration while using a quantile regression for counts analysis to determine whether meta-factors such as unemployment rates and expected inflation play a role in contract outcomes (Liu & Peng, 2010, p. 186). While the scope of our analysis was not on the duration of contracts in the macro-economics environment, the article motivated us to use an available contract duration metric as a predictive variable within our own research.
Sales Growth factors were the last category of strategies we investigated. The behavioral notion behind such factors is that alteration of the relationship between counterparties in a contract beyond pricing, duration and quantity of services can influence outcomes. “Land and Expand Selling” (Wood, Helwin & Lah, 2013, pp. 144-170) encapsulates the idea of creating differentiated service offerings with flexibility around both consumption and billing. For example, to improve the likelihood of contract renewal, a seller can apply the concept of “Adoption Services” (Wood, et al, pp. 171-190), which entails working closely with a customer “post-sales” to provide hands-on product training and deployment assistance. Additional theoretical work in this area has included the use of “Adaptive Learning” techniques, where “belief-based or reinforcement learning may have a significant impact on market organizations” (Hopkins, 2007, p.1-4). Finally, there is also the concept of pricing and marketing through “channels”, wherein a large company typically relies on resellers / partners as “pass-through” sales channels (Wood, et al, p. 7) while focusing its own marketing efforts on “retailer and brand loyalty (Wood, et al, p. 10). Such a concept was relevant for our purposes since the high-tech company that provided our data relies heavily on pass through sales channels for sales of its products and services.
In summary, our review of the literature highlighted the challenges we faced in attempting to predict both service contract renewals and the likely sizes of contracts that were predicted to be renewed. However, the literature did suggest the use of several potential predictor variables for any predictive models we might build. The table below summarizes the relationships we identified between the various theoretical contructs discussed in the literature and the corresponding variables contained within our data set.
| Theoretical Construct | Source | Category | Corresponding Variable |
|---|---|---|---|
| Land & Expand | Wood, Hewlin & Lah | Sales Growth | Service_Approach |
| Adoption Services | Wood, Hewlin & Lah | Sales Growth | Sales_Strategy |
| Adaptive Learning | Hopkins | Sales Growth | Sales_Strategy |
| Hyperbolic Discounting | Ho, Lin & Camerer | Cost Factors | Discount_Category |
| Pricing | Ho, Lin & Camerer | Cost Factors | Cost |
| Churn | Wirtz & Lovelock | Cost Factors | Contract_Value_Category |
| Incentive Theory(Discount) | Brousseau & Glachant | Cost Factors | Discount_Category |
| Incentive Theory(Multiple) | Brousseau & Glachant | Contract Features | Multiple_Services |
| Contract Duration | Liu & Peng | Contract Features | Contract_Length |
| Channel Pass-Through | Sudhir & Datta | Sales Growth | Tier |
Use of the ‘Corresponding Variables’ listed above therefore provided us with a mechanism for linking our research efforts back to the various relevant theoretical constructs. While any predictive models we might choose to build would not be limited solely to the eight variables listed above, their ultimate inclusion in any particular model could serve as an indicator as to whether the corresponding theoretical construct was validated by our own work.
As stated in the Introduction, our research was focused on attempting to answer the following two questions:
Given a limited set of geographical and empirical data, how well could we predict the likelihood of a renewal of a fixed-length technology services contract?
If a contract was predicted to be likely to be renewed, how well could we then predict the likely line item count for the renewed contract?
Our methodology for addressing these questions consisted of three separate processes: Data Exploration, Data Preparation, and Regression Modeling. The nature of each is described below:
Data Exploration: Investigation of the characteristics of each individual variable contained within the data set that was provided by the technology company, including their data types, range of valid values, their distributions, their correlations with one another, and whether any variables had missing data values for any of the contracts represented within the data set.
Data Preparation: Development of strategies for handling missing or invalid data values and development of an approach to the separation of the master data set into dedicated regression modeling “Training” and “Evaluation” subsets. Creation of separate training and evaluation data sets was deemed necessary to allow us to test our regression models on a relatively unbiased set of contract records.
Regression Modeling: Identification, development, and testing of task-appropriate regression models. For our first set of models, since we were tasked with predicting the binary outcome of a contract renewal decision, various binary logistic regression models were to be evaluated in an attempt to identify that which was most effective at making such predictions. The “best” model was selected on the basis of performance metrics including AIC score, AUC, accuracy, classification error rates, precision, specificity, sensitivity, and F1 scores. On the other hand, predictions of the likely number of line items to be contained within a renewing contract appeared to require the use of a count regression model. As such, various types of count regression models (e.g., Poisson, Negative Binomial, zero-inflated) were to be evaluated relative to the structure of the contract data, and models were evaluated on the basis of AIC scores and whether or not they produced realistic predictions.
Subsequent to the development and selection of task-appropriate regression models, the selected models were applied to the evaluation data set as follows:
The selected binary logistic regression model was used to predict whether or not a contract contained within the evaluation data set was likely to renew.
For those contracts that were deemed likely to renew, the selected count regression model was then used to predict the likely contract line item counts.
Results of both types of regression models were then compared against actual contract renewal outcomes to judge their effectiveness at predicting their respective response variables.
The data set we were provided contained empirical and qualitative data related to service contract renewals. There were 91,727 contracts represented, with attributes such as contract size, geographic location of the customer, the types of sales management strategies applied to the customer, and contract terms having been provided for each. All proprietary information had been pre-removed by the high-tech company prior to our reception of the data set. No missing data values were found and the data itself was considered to be fully accurate as per the high-tech company’s comments.
For each contract we were provided with 12 attributes that could potentially be used as predictor variables. The two response variables we were to focus on were:
RENEWAL_FLAG: Indicated whether a contract was renewedITEM_COUNT: Indicated how many line items were included in a contractThe following is a brief description of each of the variables represented within the data set:
| Variable | Definition |
|---|---|
INDEX |
Identification Variable (not a predictor) |
RENEWAL_FLAG |
Contract Renewal Indicator |
AREA |
Domestic or Foreign Customer Indicator |
SERVICE_APPROACH |
Indicates whether a “Land and Expand Strategy” is used w/ customer |
TIER |
Channel Sales Distribution Method Indicator |
CONTRACT_EXPIRATION |
Expiration Date (not a predictor) |
CONTRACT_LENGTH |
Annual or Multi-year Indicator |
SALES_CATEGORY |
Categorical Indicator of Market Segment Assigned to Customer |
SELLER_UNIQUE_ID |
Unique ID for Sales Channel Partner |
CONTRACT_VALUE_CATEGORY |
Contract Value in Dollars (bracketed) |
CONTRACT_LINE_CATEGORY |
Items of Contract (bracketed) |
DISCOUNT_CATEGORY |
Size of Discount (bracketed) |
MULTIPLE_SERVICES |
Indicator: Is more than one service Provided to customer? |
ITEM_COUNT |
Count of Line Items in Contract |
COST |
Sum of List Prices of Contract Line Items |
SALES_STRATEGY |
Adoption Services Strategy Indicator |
Six of the independent variables (‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘CONTRACT_LENGTH’, ‘MULTIPLE_SERVICES’, and ‘SALES_STRATEGY’) were binary categorical variables which needed to be treated as factors. ‘SALES_CATEGORY’, ‘CONTRACT_VALUE_CATEGORY’, CONTRACT_LINE_CATEGORY’, and ‘DISCOUNT_CATEGORY’ were multi-categorical variables that can assume any one of more than two potential predefined text string values as indicated in the table below (NOTE: the shared ’_CATEGORY’ suffix has been abbreviated to “_CAT" within the table for simplification purposes):
| SALES_CAT | CONTRACT_VALUE_CAT | CONTRACT_LINE_CAT | DISCOUNT_CAT |
|---|---|---|---|
| COM | <10K | VERY SMALL | NO DISCOUNT |
| ENT | 10K-25K | SMALL | SMALL |
| PS | 25K-50K | MEDIUM | MEDIUM |
| OTHER | 50K-100K | LARGE | LARGE |
| (-) | 100K-250K | VERY LARGE | VERY LARGE |
| (-) | >250K | (-) | (-) |
Our initial data analysis uncovered a few key features of the data set:
Furthermore, there were only two continuous numeric variables within the data set (“ITEM_COUNT” and “COST”) and both were heavily right-skewed due to the fact that the vast majority of the 91,726 contracts contained relatively small numbers of line items and relatively low total list price amounts for those line items.
Barplots and boxplots of each independent variable relative to the binary response variable were investigated for potential skew issues and also allowed us to gain insight into the predictive aspects of the data. The plots showed that domestic customers, customers whose relationships were managed using the “Land and Expand” account management strategy, customers managed via a second tier channel sales distribution method, customers that were either enterprise or public sector entities, and customers whose relationships were managed using the ‘GCS’ sales strategy were all more likely to renew their contracts than were customers who lacked those attributes. Furthermore, annual contracts, higher-value contracts, contracts that included “VERY LARGE” discounts, and contracts that included more than one service were all more likely to be renewed than were contracts lacking those attributes.(NOTE: any plots or matrices discussed in this section can be found in the Data Exploration section of the Appendix).
Additional analysis revealed that the ‘CONTRACT_LINE_CATEGORY’ variable was simply a categorical bracketing of the continous numeric ‘ITEM_COUNT’ variable. As such, it was decided that in no circumstance would we include both variables within any of our regression models.
Finally, a correlation matrix showed that neither of the two continuous numeric variables we’d been provided (‘ITEM_COUNT’ and ‘COST’) was significantly correlated with the binary ‘RENEWAL_FLAG’ response variable, a fact that would need to be considered when constructing predictive models.
As discussed earlier, the data set we received from the high-tech company had been “pre-cleansed” prior to our reception of it. As such, there were no missing data values. Our analysis concluded that the ‘INDEX’, ‘CONTRACT_EXPIRATION’, and ‘SELLER_UNIQUE_ID’ variables were not relevant to predicting whether or not a contract might renew and were therefore deemed to be removable from the data set for purposes of regression modeling. Furthermore, to simplify the model building process we also converted binomial categorical variable values that were text-based to numeric (0/1) values. The affected variables were ‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘RENEWAL_FLAG’, ‘MULTIPLE_SERVICES’ and ‘SALES_STRATEGY’.
Finally, in preparation for model building we divided the data into training and evaluation data sets by extracting a random subset of 20,000 items from the master data set to serve as an evaluation data set. Those same 20,000 items from were then removed the master data set to create a training data set. As was mentioned earlier, creation of separate training and evaluation data sets was deemed necessary to allow us to test our regression models on a relatively unbiased set of contract records.
The training data set can viewed here:
The evaluation data set can be viewed here:
Three distinct binary logistic regression models were constructed for purposes of predicting whether or not a service contract was likely to be renewed. The models’ performance metrics were subsequently compared against each other to allow us to select our “best” binary logistic regression model for purposes of making predictions of contract renewals for the Evaluation data set. Each of the three models used the data set’s RENEWAL_FLAG attribute as the dependent response variable, while various subsets of the potential predictor variables were used as independent variables. A detailed discussion of each of the three models can be found in the REGRESSION MODELING section of the Appendix.
Of the three models that were tested, one based solely on non-contract related attributes (e.g., AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘SALES_STRATEGY’, etc.) proved to be ineffective at predicting actual contract renewals. Of the two remaining models, one excluded both the ‘COST’ and ‘ITEM_COUNT’ variables due to lack of statistical significance, while the other included the log-transformed versions of both of those variables when they proved to be statistically significant. Both models had very similar performance statistics as can be seen in the summary table shown below.
| Metric | Model 1 | Model 3 |
|---|---|---|
| # Predictors | 10 | 10 |
| AIC | 82589 | 82057 |
| Accuracy | 0.7134 | 0.7132 |
| Class.Err.R. | 0.2866 | 0.2868 |
| Precision | 0.6369 | 0.6301 |
| Sensitivity | 0.2443 | 0.2425 |
| Specificity | 0.9344 | 0.9302 |
| F1 Score | 0.3531 | 0.3606 |
| AUC | 0.5893 | 0.5914 |
Since there was no obvious basis for preferring one model over the other a likelihood ratio test was applied, and its result indicated that Model 3 was preferable. The logit coefficients for this model are shown below.
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| - 2.948 | Intercept | + 0.416 | CONTRACT_VALUE>250K |
| + 0.901 | AREA1 | + 0.261 | CONTRACT_VALUE100K-250K |
| + 0.286 | SERVICE_APPROACH1 | + 0.101 | CONTRACT_VALUE10K-25K |
| + 0.143 | TIER1 | + 0.178 | CONTRACT_VALUE25K-50K |
| - 1.066 | CONTRACT_LENGTHMulti | + 0.282 | CONTRACT_VALUE50K-100K |
| - 0.295 | SALES_CATEGORYENT | - 0.263 | DISCOUNT_CATEGORYMEDIUM |
| + 0.097 | SALES_CATEGORYOTHER | + 0.547 | DISCOUNT_CATEGORYNONE |
| - 0.127 | SALES_CATEGORYPS | + 1.099 | DISCOUNT_CATEGORYVERYLARGE |
| + 0.206 | SALES_STRATEGY1 | + 0.092 | log(ITEM_COUNT + 1) |
| (-) | (-) | + 0.142 | log(COST + 1) |
We can infer the following from the model’s coefficients relative to the three categories of theoretical constructs desribed in the Literature Review section above (NOTE: the sole variable that does not fall within one of those three categories (‘AREA’) is discussed separately):
Sales Growth: The high-tech company’s use of the “Land and Expand” sales management strategy (as characterized by the ‘SERVICE_APPROACH’ variable), use of a ‘two-tier’ indirect channel approach (as characterized by the ‘TIER’ variable), and use of a ‘GCS’ sales strategy (as characterized by the ‘SALES_STRATEGY’ variable) all appeared to increase the likelihood that a customer would renew their service contract. Furthermore, customers deemed to be either “Enterprise” or “Public Sector” appeared more likely to renew their contracts than did other types of customers.
Cost Factors: In general, the larger the value of the contract, the more likely it was to be renewed. This is not surprising since the high-tech company places more emphasis on ensuring that “high-dollar” contracts end up being renewed. The effectiveness of offering discounts to customers was less clear, with medium sized discounts apparently decreasing the likelihood of contract renewal while offering either no discounts or very large discounts appearing to increase the likelihood of contract renewal.
Contract Features: Contracts with larger quantities of line items were more likely to be renewed than were contracts with smaller quantities of line items. This again may have been the result of the high-tech company placing a great deal of emphasis on the renewal of larger contracts. Additionally, multi-year contracts appeared less likely to be renewed than annual contracts.
Geography (the ‘AREA’ variable): Domestic customers were more likely to renew their contracts than were foreign customers. This was not surprising since the high-tech company is US-based and apparently has a longer history with many American clients. Additionally, economic uncertainties in Venezuela and once growing emerging economies like Brazil and Argentina may have had an impact in service contract renewals in those locations.
Our analysis of the ‘ITEM_COUNT’ variable revealed that it was not zero-inflated and its mean was not nearly equal to its variance, thereby allowing us to quickly rule out the use of either Poisson or zero-inflated count regression models for purposes of predicting likely renewed contract line item counts. Instead, a negative binomial count regression model was pursued.
An initial set of modeling iterations led to the removal of a few statistically insignificant predictors but yielded a model whose ‘ITEM_COUNT’ predictions were wildly inaccurate, with some predictions exceeding one trillion possible line items. Further investigation revealed that the ‘COST’ variable was the source of the problem: that variable’s very large variance and outliers were causing the negative binomial model to generate wildly inaccurate predictions.
Further discussions with representatives of the high-tech company revealed that removal of the ‘COST’ variable’s outliers (basically very large total list price amounts) was viable since the outliers comprised a relatively small percentage (~3%) of the contract data we’d been provided. It was therefore decided that all contracts whose ‘COST’ values exceeded $100,000 would be ignored for purposes of predicting the ‘ITEM_COUNT’ variable. Subsequent modeling iterations yielded a negative binomial model with the following performance characteristics:
| Metric | Value |
|---|---|
| Number of Predictors | 7 |
| AIC | 146419 |
| Dispersion | 6.00 |
A jitter plot of the model’s rounded fitted values against the ‘ITEM_COUNT’ response variable showed that the model appeared to perform rather poorly at predicting the likely number of line items for a contract that was likely to be renewed:
Summary statistics for the relevant ‘ITEM_COUNT’ values and the Negative Binomial predictions for the corresponding observations are provided below: (NOTE: “n” represents the number of records within a data set).
| Variable | n | Mean | sd | Med | Min | Max | Range | Skew | Kurto. | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| ITEM_COUNT | 22243 | 10.05 | 137.74 | 3 | 1 | 15176 | 15175 | 65.91 | 6730.87 | 0.92 |
| Predicted | 22243 | 47.06 | 292.12 | 7 | 3 | 9198 | 9195 | 15.63 | 325.55 | 1.96 |
The summary statistics are reflective of the jitter plot shown above. The predicted values for the item counts are much less variable than are the corresponding actual ‘ITEM_COUNT’ values. Furthermore, the skew and kurtosis of the actual ‘ITEM_COUNT’ values greatly exceeds that of the predicted values.
The coefficients for this negative binomial model are shown below:
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| + 2.234 | Intercept | + 0.864 | CONTRACT_VALUE>250K |
| - 0.544 | SERVICE_APPROACH1 | + 0.456 | CONTRACT_VALUE100K-250K |
| - 0.222 | CONTRACT_LENGTHMulti | + 0.680 | CONTRACT_VALUE10K-25K |
| - 0.081 | SALES_CATEGORYENT | + 0.343 | CONTRACT_VALUE25K-50K |
| - 0.313 | SALES_CATEGORYOTHER | + 0.394 | CONTRACT_VALUE50K-100K |
| - 0.057 | SALES_CATEGORYPS | - 0.230 | DISCOUNT_CATEGORYMEDIUM |
| + 0.582 | MULTIPLE_SERVICES1 | + 0.020 | DISCOUNT_CATEGORYNONE |
| + 0.001 | COST | - 0.037 | DISCOUNT_CATEGORYVERYLARGE |
We can infer the following from model’s coefficients relative to the three categories of theoretical constructs desribed in the Literature Review section above.
Sales Growth: Customers whose relationships were managed using the “Land and Expand” account management strategy were less likely to have relatively higher contract item counts than were customers whose relationships were not managed using that strategy. This result is somewhat surprising since the alleged benefit of that strategy is enhanced revenues. Furthermore, customers that were neither public sector nor enterprise customers appeared less likely to have higher contract item counts than were customers of other sales categories.
Cost Factors: Customers whose contract values exceed $250,000 in value were more likely to have higher contract item counts than were other customers. Contracts having higher total list prices were more likely to have relatively higher contract item counts than were contracts having lower total list prices. These results are not surprising.
Contract Features: Once again multi-year contracts had a negative influence, as they appeared to predict smaller contract line item counts than did the annual contracts. However, customers contracting for more than a single service were more likely to have relatively higher contract item counts.
With the construction of our two required regression models completed, we were finally prepared to assess the effectiveness of each relative to the 20,000 item evaluation data set. The two-stage prediction process was as follows:
Stage 1: Use the selected binary logistic regression model to predict whether or not a contract was likely to be renewed.
Stage 2: For those contracts that were deemed likely to be renewed, predict the likely ‘ITEM_COUNT’ amounts using the negative binomial regression model for all contracts having (COST <= 100,000).
The results of the first stage indicated that the binary logistic regression model’s performance with the evaluation data was nearly identical to its performance with the training data:
| Metric | Training | Evaluation |
|---|---|---|
| Accuracy | 0.7132 | 0.7176 |
| Class.Err.R. | 0.2868 | 0.2824 |
| Precision | 0.6301 | 0.6251 |
| Sensitivity | 0.2425 | 0.2557 |
| Specificity | 0.9302 | 0.9296 |
| F1 Score | 0.3606 | 0.3630 |
| AUC | 0.5914 | 0.5927 |
A confusion matrix of the model’s results indicated that the model was very effective at identifying contracts that were unlikely to be renewed: only 965 false positives were predicted out of 13,708 contracts that actually failed to be renewed.
| true | pred | pred |
|---|---|---|
| (-) | 0 | 1 |
| 0 | 12743 | 965 |
| 1 | 4683 | 1609 |
However, the model did not perform as well at predicting actual contract renewals, with 4683 false negatives predicted vs. 1609 true positives. As such, the value of this model appeared to be in that it clearly identified which contracts were unlikely to be renewed, thereby allowing the high-tech company to choose whether or not to direct additional sales efforts toward such customers for purposes of securing the renewal of those contracts.
The results of the second stage indicated that, similar to the results we obtained when applying it to the training data, the negative binomial regression model’s performance was again problematic:
| Variable | n | Mean | sd | Med | Min | Max | Range | Skew | Kurto. | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| ITEM_COUNT | 2281 | 49.63 | 209.77 | 8 | 1 | 4263 | 4262 | 11.51 | 171.81 | 4.39 |
| Predicted | 2281 | 133.98 | 560.73 | 11 | 4 | 9358 | 9354 | 8.96 | 98.84 | 11.74 |
As the table clearly shows, the predicted values for the item counts are much more variable than are the corresponding actual ‘ITEM_COUNT’ values. Furthermore, the skew and kurtosis of the actual ‘ITEM_COUNT’ values greatly exceeds that of the predicted values. These results clearly demonstrate the challenge inherent in attempting to accurately predict the expected line item counts for contracts that are predicted to be renewed based upon the data we had been provided. As such, perhaps a more effective method for predicting the expected line item counts could be based on simply assuming that the line item count for the renewed contract would be equivalent to that of the expiring contract.
The purpose of our research was to attempt to predict both the likelihood of a service contract being renewed and the anticipated size (as defined by the number of line items contained within a contract) of any renewed contracts. Data related to 91,727 technology services contracts was used as the basis for constructing and evaluating predictive models for each of those metrics.
The need to predict the likelihood of a contract being renewed led to the development of binary logistic regression model that proved effective at predicting nearly 72% of all customer contract renewal decisions when applied against a dedicated 20,000 item evaluation data set that had been randomly extracted from the original data set. While 0.93 (12743/13708) of non-renewals were predicted correctly, only 0.26 (1609/6292) of actual contract renewals were accurately predicted. Thus the binomial logistic model proved effective but with the caveat that it under-predicted renewals at a higher rate then might prove acceptable for business purposes.
Prediction of contract line item counts led to the development of a negative binomial regression model. However, its results proved to be too inaccurate for practical use, thereby indicating that perhaps the best method for making such predictions might be to simply assume that the line item count of the renewing contract will be similar to that of the expiring contract.
The use of behavioral economic proxy variables within the binary logistic regression model allowed us to make a variety of inferences related to the effectiveness of the relevant strategies we identified during our review of related research literature. The table below provides a summary of those inferences, with the “Rank” column being indicative of which variable attributes should be preferred if maximizing the number of contract renewals is a strategic business goal. A further explanation of how each relates to the high-tech comany that provided our data is provided below.
| Strategic Predictor | Related Variable | Category | Effect | Preferred Use |
|---|---|---|---|---|
| Land & Expand | Service_Approach | Sales Growth | Positive | Use ‘L&E’ |
| Adoption Services | Sales_Strategy | Sales Growth | Positive | Use ‘GCS’ |
| (ENT,COM,PS, OTH) | ||||
| Adaptive Learning | Sales_Strategy | Sales Growth | Positive | Use ‘GCS’ |
| (ENT,COM,PS, OTH) | ||||
| Hyperbolic Discounting | Discount_Category | Cost Factors | Mixed | VLG,NONE,LG,MED |
| (MED,LG,VLG,NONE) | ||||
| Reference Dependence | Cost, Item Count | Cost Factors | Positive | Sell More Items |
| (Continuous Num.) | ||||
| Churn | Contract_Value_Cat | Cost Factors | Positive | > $, > Renewals |
| (Price Range) | ||||
| Incentive Theory | Discount_Category | Cost Factors | Mixed | VLG,NONE,LG,MED |
| (MED,LG,VLG,NONE) | ||||
| Contract Duration | Contract_Length | Contract | Negative | Use ‘Annual’ |
| (Annual, Multi-Yr) | Features | |||
| Channel Pass-Through | Tier | Sales Growth | Positive | Use ‘Channel’ |
Pricing and Cost Factor Conclusions:
Churn: The more products and services the high tech company is able to sell to a customer, the more likely the customer is likely to be retained going forward.
Discounting: The effect of discounting is inconclusive: offering either very large or no discounts to customers appears to increase the likelihood of customer retention. However, offering either large or medium size discounts appears to decrease the likelihood of contract renewal.
Selling larger quantities of products and services to a customer appears to increase the likelihood of contract renewal.
Contract Feature Conclusions:
Sales Growth Factor Conclusions:
Use of a “Land & Expand” approach increases the likelihood of contract renewal.
Use of a “GCS” adoption services / adaptive learning customer management strategy increases the likelihood of contract renewal.
Use of channel partners increases the likelihood of contract renewal.
It is important to note that these conclusions apply only to the high-tech company that provided our data set. Additional research would be required to determine whether or not these same conclusions might apply to other high-tech companies or companies in other industries, or to companies that have customers located outside of North and South America. Furthermore, the data we were provided was somewhat restricted in its detail due to the need of the provider to remove all proprietary information from it (e.g., exact contract durations, exact discount percentages, exact net pricing totals, etc.) prior to our use. Despite these limitations, this study demonstrated that clear relationships exist between a variety of behavioral economic concepts and the likelihood of a technology services contract being renewed. In doing so, our research provides valuable insight into the efficacy of different sales and customer management strategies currently in use at a major high-tech company.
Brousseau, E., Glachant, J., & Fares, M. (2001). The Economics of Contracts and the Renewal of Economics. Theories and Applications The Economics of Contracts, 3-42. doi:10.1017/cbo9780511613807.001
Ho, T. H., Lim, N., & Camerer, C. F. (2006). Modeling the Psychology of Consumer and Firm Behavior with Behavioral Economics. Journal of Marketing Research, 43(3), 307-331. doi:10.1509/jmkr.43.3.307
Hopkins, E. (2007). Adaptive Learning Models of Consumer Behavior. Journal of Economic Behavior & Organization, 64(3-4), 348-368. doi:10.1016/j.jebo.2006.02.010
Liu, C., & Peng, A. (2010). A Reinvestigation of Contract Duration Using Quantile Regression for Counts Analysis. Economics Letters, 106(3), 184-187. doi:10.1016/j.econlet.2009.11.015
Sudhir, K., & Datta, S. (2009). Pricing in Marketing Channels. Handbook of Pricing Research in Marketing, 319-354. doi:10.4337/9781848447448.00024
Wirtz, J., & Lovelock, C. (2016). Service Marketing Communications. People, Technology, Strategy Services Marketing, 236-287. doi:10.1142/9781944659028_0007
Wood, J. B., Hewlin, T., & Lah, T. E. (2013). B4B: How Technology and Big Data are Reinventing the Customer-Supplier Relationship. San Diego, CA: Point B.
————————————————————————————————————————–
\newpageThis Appendix contains the research results, source R code, and associated relevant output from our final writeup and our model building efforts. Additional supplemental graphics and detailed explanations of our regression modeling efforts are also included.
————————————————————————————————————————–
The results of applying our preferred binary logistic regression model to the Evaluation data set can be found at the following web link:
————————————————————————————————————————–
The results of applying our preferred negative binomial regression model to the Evaluation data set contracts that were predicted as likely to be renewed can be found at the following web link:
————————————————————————————————————————–
This section of the Appendix contains supplemental bar plots, box plots, histograms, and R code used during our Data Exploration work.
Data Anomolies specifically for continuous numeric distributions were found with two variables, ‘COST’ and ‘ITEM_COUNT’ which as pointed out earlier had distributions that appeared to be concentrated on a single value. ‘COST’ and ‘ITEM_COUNT’ and ‘MULTI-SERVICE’ also showed very high skew and kurtosis which was corrected using a simple log transformation. The log transformation revealed the ‘COST’ variable with a standard normal distribution and the ‘ITEM_COUNT’ with a rightward skew count distribution as shown on the histograms below.
| Descriptive Statistics Log Trx |
| n | mean | sd | median | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|
| ITEM_COUNT | 91726 | 1.20 | 1.38 | 1.00 | 0 | 11.00 | 11.00 | 1.51 | 2.62 | 0.00 |
| COST | 91726 | 6.97 | 2.06 | 6.97 | 0 | 18.44 | 18.44 | 0.10 | 0.14 | 0.01 |
Histogram Transformations allowed us to more thoroughly examine whether the distribution of a variable was skewed as well as whether there may were high incidence of specific variable values throughout the data set. The following illustration shows the histograms for ‘ITEM_COUNT’ and count ‘COST’ post log-transformation.
| Histograms Log Trx |
A Correlation Matrix for the numeric data set is provided below. As can be seen in the matrix ‘ITEM_COUNT’ and ‘COST’ show very high correlation which isn’t surprising since they are two different measures of the size of a contract and therefore likely to be collinear. Notably, both ‘ITEM_COUNT’ and ‘COST’ are very weakly correlated with ‘RENEWAL_FLAG’ indicating these variables may not have much predictive value in the renewal of contracts.
| Descriptive Statistics |
| n | mean | sd | median | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|
| AREA* | 91726 | 1.74 | 0.44 | 2.00 | 1 | 2.00 | 1.00 | -1.10 | -0.79 | 0.00 |
| SERVICE_APPROACH* | 91726 | 1.64 | 0.48 | 2.00 | 1 | 2.00 | 1.00 | -0.59 | -1.65 | 0.00 |
| TIER* | 91726 | 1.53 | 0.50 | 2.00 | 1 | 2.00 | 1.00 | -0.12 | -1.98 | 0.00 |
| RENEWAL_FLAG | 91726 | 0.32 | 0.47 | 0.00 | 0 | 1.00 | 1.00 | 0.78 | -1.40 | 0.00 |
| CONTRACT_LENGTH* | 91726 | 1.17 | 0.38 | 1.00 | 1 | 2.00 | 1.00 | 1.74 | 1.03 | 0.00 |
| SALES_CATEGORY* | 91726 | 1.80 | 1.21 | 1.00 | 1 | 4.00 | 3.00 | 1.04 | -0.69 | 0.00 |
| CONTRACT_VALUE* | 91726 | 1.41 | 1.13 | 1.00 | 1 | 6.00 | 5.00 | 2.65 | 5.67 | 0.00 |
| DISCOUNT_CATEGORY* | 91726 | 2.14 | 0.92 | 2.00 | 1 | 4.00 | 3.00 | 0.56 | -0.43 | 0.00 |
| MULTI-SERVICE* | 91726 | 1.01 | 0.09 | 1.00 | 1 | 2.00 | 1.00 | 11.18 | 122.94 | 0.00 |
| ITEM_COUNT | 91726 | 1.20 | 1.38 | 1.00 | 0 | 11.00 | 11.00 | 1.51 | 2.62 | 0.00 |
| COST | 91726 | 6.97 | 2.06 | 6.97 | 0 | 18.44 | 18.44 | 0.10 | 0.14 | 0.01 |
| SALES_STRATEGY* | 91726 | 1.78 | 0.42 | 2.00 | 1 | 2.00 | 1.00 | -1.32 | -0.26 | 0.00 |
| Correlation Matrix for Numerics |
## RENEWAL_FLAG ITEM_COUNT COST
## RENEWAL_FLAG 1.00 0.03 0.02
## ITEM_COUNT 0.03 1.00 0.82
## COST 0.02 0.82 1.00
————————————————————————————————————————–
This section of the Appendix contains the R source code used during our data preparation efforts. Two separate modules were constructed and each are presented below.
library(alr3)
library(bestglm)
library(car)
library(ggplot2)
library(dplyr)
library(knitr)
library(lmtest)
library(caret)
library(tidyr)
fp.data <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/fin%20proj.csv", stringsAsFactors = FALSE)
# create a copy of the original data to use for transforming individual variables
fp.t <- fp.data
start our data transformation….
# transform AREA: USA = 1; !USA = 2
fp.t$AREA[which(fp.data$AREA == 'USA')] <- 1
fp.t$AREA[which(fp.data$AREA != 'USA')] <- 0
# now convert column to numeric since it is char
fp.t$AREA <- as.numeric(fp.t$AREA)
# check results
summary(fp.t$AREA)
# transform SERVICE_APPROACH: S = 1; !S = 0
fp.t$SERVICE_APPROACH[which(fp.data$SERVICE_APPROACH == 'S')] <- 1
fp.t$SERVICE_APPROACH[which(fp.data$SERVICE_APPROACH != 'S')] <- 0
# now convert column to numeric since it is char
fp.t$SERVICE_APPROACH <- as.numeric(fp.t$SERVICE_APPROACH)
# check results
summary(fp.t$SERVICE_APPROACH)
# transform CHANNEL: rename to TIER with second = 1; first = 0
fp.t$CHANNEL_APPROACH[which(fp.data$CHANNEL_APPROACH == 'SECOND')] <- 1
fp.t$CHANNEL_APPROACH[which(fp.data$CHANNEL_APPROACH == 'FIRST')] <- 0
# now convert column to numeric since it is char
fp.t$CHANNEL_APPROACH <- as.numeric(fp.t$CHANNEL_APPROACH)
# check results
summary(fp.t$CHANNEL_APPROACH)
# now rename column to TIER
colnames(fp.t)[colnames(fp.t)=="CHANNEL_APPROACH"] <- "TIER"
head(fp.t$TIER)
# transform RENEWAL_FLAG: RENEWED = 1; NOT RENEWED = 0
fp.t$RENEWAL_FLAG[which(fp.data$RENEWAL_FLAG == 'RENEWED')] <- 1
fp.t$RENEWAL_FLAG[which(fp.data$RENEWAL_FLAG == 'NOT RENEWED')] <- 0
# now convert column to numeric since it is char
fp.t$RENEWAL_FLAG <- as.numeric(fp.t$RENEWAL_FLAG)
# check results
summary(fp.t$RENEWAL_FLAG)
NO ACTION TAKEN
# transform CONTRACT_LENGTH: ANNUAL = 1; MULTI-YEAR = 3
fp.t$CONTRACT_LENGTH[which(fp.data$CONTRACT_LENGTH == 'ANNUAL')] <- 1
fp.t$CONTRACT_LENGTH[which(fp.data$CONTRACT_LENGTH == 'MULTI-YEAR')] <- 3
# now convert column to numeric since it is char
fp.t$CONTRACT_LENGTH <- as.numeric(fp.t$CONTRACT_LENGTH)
# check results
summary(fp.t$CONTRACT_LENGTH)
Switch SP and SMB to other as they are not significant
# transform SALES_CATEGORY - create 3 new columns COM, ENT, PS
# build COM column
# first check count of COM entries
fp.t$SALES_CATEGORY[which(fp.t$SALES_CATEGORY=='SP')] <-'OTHER'
fp.t$SALES_CATEGORY[which(fp.t$SALES_CATEGORY=='SMB')] <-'OTHER'
#sum(fp.data$SALES_CATEGORY == 'COM')
#fp.t$COM <- 0
#fp.t$COM[fp.data$SALES_CATEGORY == 'COM'] <- 1
# make sure new column matches original
#sum(fp.t$ENT == 1)
# ---------------------------------
# build ENT column
#sum(fp.data$SALES_CATEGORY == 'ENT')
#fp.t$ENT <- 0
#fp.t$ENT[fp.data$SALES_CATEGORY == 'ENT'] <- 1
#sum(fp.t$ENT == 1)
# ---------------------------------
# build PS column
#sum(fp.data$SALES_CATEGORY == 'PS')
#fp.t$PS <- 0
#fp.t$PS[fp.data$SALES_CATEGORY == 'PS'] <- 1
#sum(fp.t$PS == 1)
NO ACTION TAKEN
IGNORE FOR NOW
# transform CONTRACT_VALUE_CATEGORY: <10k = 0, 10K-25K = 1, etc..
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '<10K')] <- 0
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '10K-25K')] <- 1
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '25K-50K')] <- 2
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '50K-100K')] <- 3
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '100K-250K')] <- 4
#fp.t$CONTRACT_VALUE_CATEGORY[which(fp.data$CONTRACT_VALUE_CATEGORY == '>250K')] <- 5
# now convert column to numeric since it is char
#fp.t$CONTRACT_VALUE_CATEGORY <- as.numeric(fp.t$CONTRACT_VALUE_CATEGORY)
# check results
#summary(fp.t$CONTRACT_VALUE_CATEGORY)
# transform CONTRACT_LINE_CATEGORY: VERY SMALL = 1, SMALL = 2, MEDIUM = 3, LARGE = 4,
# VERY LARGE = 5
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'VERY SMALL')] <- 1
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'SMALL')] <- 2
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'MEDIUM')] <- 3
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'LARGE')] <- 4
#fp.t$CONTRACT_LINE_CATEGORY[which(fp.data$CONTRACT_LINE_CATEGORY == 'VERY LARGE')] <- 5
# now convert column to numeric since it is char
#fp.t$CONTRACT_LINE_CATEGORY <- as.numeric(fp.t$CONTRACT_LINE_CATEGORY)
# check results
#summary(fp.t$CONTRACT_LINE_CATEGORY)
# transform CONTRACT_LINE_CATEGORY: NO DISCOUNT = 0, SMALL = 10, MEDIUM = 20, LARGE = 30,
# VERY LARGE = 50
#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'NO DISCOUNT')] <- 0
#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'MEDIUM')] <- 20
#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'LARGE')] <- 30
#fp.t$DISCOUNT_CATEGORY[which(fp.data$DISCOUNT_CATEGORY == 'VERY LARGE')] <- 50
# now convert column to numeric since it is char
#fp.t$DISCOUNT_CATEGORY <- as.numeric(fp.t$DISCOUNT_CATEGORY)
# check results
#summary(fp.t$DISCOUNT_CATEGORY)
11, MULTIPLE_SERVICES = Yes = 1 and No = 0
# transform MULTIPLE_SERVICES: YES = 1; NO = 0
fp.t$MULTIPLE_SERVICES[which(fp.data$MULTIPLE_SERVICES == 'YES')] <- 1
fp.t$MULTIPLE_SERVICES[which(fp.data$MULTIPLE_SERVICES == 'NO')] <- 0
# now convert column to numeric since it is char
fp.t$MULTIPLE_SERVICES <- as.numeric(fp.t$MULTIPLE_SERVICES)
# check results
summary(fp.t$MULTIPLE_SERVICES)
# transform SALES_STRATEGY: GCS = 1; NOT GCS = 0
fp.t$SALES_STRATEGY[which(fp.data$SALES_STRATEGY == 'GCS')] <- 1
fp.t$SALES_STRATEGY[which(fp.data$SALES_STRATEGY == 'NOT GCS')] <- 0
# now convert column to numeric since it is char
fp.t$SALES_STRATEGY <- as.numeric(fp.t$SALES_STRATEGY)
# check results
summary(fp.t$SALES_STRATEGY)
fix discount issues:
fp.t$DISCOUNT_CATEGORY[which(fp.t$DISCOUNT_CATEGORY == '#DIV/0!')] <- 'NO DISCOUNT'
Now write transformed data set to a file
write.csv(fp.t, file = "c:/SQLData/621/fp_transformed_data.csv", row.names = FALSE)
The R code shown below was used to separate the master data set into dedicated Training and Evaluation data sets.
library(dplyr)
fp <- read.csv("https://raw.githubusercontent.com/jeffnieman11/621_final_data/master/final_data.csv", stringsAsFactors = FALSE)
# create EVALUATION data set: sample 20000 rows at random from master file
set.seed(123)
eval.d <- fp[sample(nrow(fp), 20000), ]
# sort results by INDEX
eval.d <- arrange(eval.d, INDEX)
# create TRAINING data set: remove all eval.d observations from master data set
training.d <- fp[-eval.d$INDEX, ]
# now write both files to disc
write.csv(eval.d, file = "C:/SQLData/621/FP-EVAL-DATASET.csv", row.names = FALSE)
write.csv(training.d, file = "C:/SQLData/621/FP-TRAINING-DATASET.csv", row.names = FALSE)
rm(list = ls())
————————————————————————————————————————–
This section of the Appendix presents an in-depth review of our regression modeling work for this project. Portions of this writeup were excerpted for inclusion within the body of the main document.
————————————————————————————————————————–
Three distinct binary logistic regression models were constructed for purposes of predicting whether or not a service contract was likely to be renewed. The models’ performance metrics were subsequently compared against each other to allow us to select our “best” binary logistic regression model for purposes of making predictions of contract renewals for the Evaluation data set. Each of the three models used the data set’s RENEWAL_FLAG attribute as the dependent response variable, while various subsets of the potential predictor variables were used as independent variables. A description of each model is provided below.
Our first binary model applied simple backward selection based on p-values and F-tests in an attempt to identify a binary logistic regression model comprised of only statistically significant predictors relative to the data set’s RENEWAL_FLAG response variable.
The initial iteration of this approach excluded the ‘ITEM_COUNT’ and ‘COST’ variables so that the model would be comprised solely of categorical predictor variables. The results of the initial iteration showed that each of the included categorical predictor variables were, in fact, statistically significant predictors of the RENEWAL_FLAG response variable, producing a model comprised of 10 predictor variables with performance metrics as indicated in the table below:
| Metric | Score |
|---|---|
| Number of Predictors | 10 |
| AIC | 82589 |
| Accuracy | 0.7134 |
| Classification Error Rate | 0.2866 |
| Precision | 0.6369 |
| Sensitivity | 0.2443 |
| Specificity | 0.9344 |
| F1 Score | 0.3531 |
| AUC | 0.5893 |
The logit coefficients for this model are shown below. Please note that coefficients for each category of the multi-category predictors are included.
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| - 1.453 | Intercept | + 0.020 | CONTRACT_LINE_CATMEDIUM |
| + 0.932 | AREA1 | - 0.097 | CONTRACT_LINE_CATSMALL |
| + 0.325 | SERVICE_APPROACH1 | - 0.120 | CONTRACT_LINE_CATVERY LARGE |
| + 0.078 | TIER1 | - 0.460 | CONTRACT_LINE_CATVERY SMALL |
| - 1.016 | CONTRACT_LENGTHMult-YR | - 0.310 | DISCOUNT_CATEGORYMEDIUM |
| - 0.261 | SALES_CATEGORYENT | + 0.376 | DISCOUNT_CATEGORYNO DISCOUNT |
| + 0.089 | SALES_CATEGORYOTHER | + 1.158 | DISCOUNT_CATEGORYVERY LARGE |
| - 0.088 | SALES_CATEGORYPS | + 1.363 | MULTIPLE_SERVICES1 |
| + 1.195 | CONTRACT_VAL_CAT>250K | + 0.194 | SALES_STRATEGY1 |
| + 0.795 | CONTRACT_VAL_CAT100K-250K | (-) | (-) |
| + 0.421 | CONTRACT_VAL_CAT10K-25K | (-) | (-) |
| + 0.569 | CONTRACT_VAL_CAT25K-50K | (-) | (-) |
| + 0.726 | CONTRACT_VAL_CAT50K-100K | (-) | (-) |
————————————————————————————————————————–
For our second binary model we decided to include only those predictor variables that appeared to be independent of any contract terms / conditions. For example, we knew that several variables, including ‘COST’, ‘ITEM_COUNT’, ‘DISCOUNT_CATEGORY’, and ‘CONTRACT_LENGTH’ amongst others, could in fact be considered components of an actual service contract. On the other hand, the ‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘MULTIPLE_SERVICES’, and ‘SALES_STRATEGY’ categorical variables were either simply descriptors of the size or physical location of a customer or were reflective of how the technology company had decided to manage its relationship with a given customer. Therefore, we constructed a model comprised of only ‘AREA’, ‘SERVICE_APPROACH’, ‘TIER’, ‘MULTIPLE_SERVICES’, and ‘SALES_STRATEGY’, and each of those variables was found to be a statistically significant predictor of the RENEWAL_FLAG response variable. Performance metrics for that model are indicated in the table below:
| Metric | Score |
|---|---|
| Number of Predictors | 5 |
| AIC | 87554 |
| Accuracy | 0.6836 |
| Classification Error Rate | 0.3164 |
| Precision | 0.7426 |
| Sensitivity | 0.0184 |
| Specificity | 0.9970 |
| F1 Score | 0.0360 |
| AUC | 0.5077 |
When compared to the first binary logistic model discussed above, this second much simpler model did a poor job of accurately predicting actual contract renewals, with nearly all of the prospective contract renewals being predicted as less than 0.5 likely to occur.
The logit coefficients for this model are shown below. Please note that coefficients for each category of the multi-category predictors are included.
| Coeff. | Variable |
|---|---|
| - 1.609 | Intercept |
| + 0.797 | AREA1 |
| + 0.308 | SERVICE_APPROACH1 |
| + 0.207 | TIER1 |
| + 1.861 | MULTIPLE_SERVICES1 |
| + 0.174 | SALES_STRATEGY1 |
————————————————————————————————————————–
For our third binary model we decided to introduce both the ‘ITEM_COUNT’ and ‘COST’ variables as potential predictors. Inclusion of ‘ITEM_COUNT’ forced the removal of ‘CONTRACT_LINE_CATEGORY’ so as to avoid collinearity issues between those two variables. An initial iteration for this model indicated that neither ‘ITEM_COUNT’ or ‘COST’ were statistically significant predictors of the RENEWAL_FLAG response variable. However, our data exploration work had revealed the highly skewed nature of both variables’ distributions. As such, Box-Cox recommended transforms were applied to each and the model was refitted. The results of the refit indicated that log-transformed versions of both ‘ITEM_COUNT’ and ‘COST’ were, in fact, statistically significant predictors of the RENEWAL_FLAG response variable. Furthermore, the ‘MULTIPLE_SERVICES’ variable became statistically insignificant as a result.
Performance metrics for the revised model are indicated in the table below:
| Metric | Score |
|---|---|
| Number of Predictors | 10 |
| AIC | 82057 |
| Accuracy | 0.7132 |
| Classification Error Rate | 0.2868 |
| Precision | 0.6301 |
| Sensitivity | 0.2425 |
| Specificity | 0.9302 |
| F1 Score | 0.3606 |
| AUC | 0.5914 |
————————————————————————————————————————–
Binary model 2 was eliminated from consideration for selection due to its relatively poor ability to predict contract renewals. Both of the remaining models had very similar performance statistics as can be seen in the summary table shown below.
| Metric | Model 1 | Model 3 |
|---|---|---|
| # Predictors | 10 | 10 |
| AIC | 82589 | 82057 |
| Accuracy | 0.7134 | 0.7132 |
| Class.Err.R. | 0.2866 | 0.2868 |
| Precision | 0.6369 | 0.6301 |
| Sensitivity | 0.2443 | 0.2425 |
| Specificity | 0.9344 | 0.9302 |
| F1 Score | 0.3531 | 0.3606 |
| AUC | 0.5893 | 0.5914 |
As we can see in the table, both of the models are fairly similar in their performance metrics, with Model 1 having the highest accuracy, precision, sensitivity, and specificity, while Model 3 has the lowest AIC, highest F1 score, and highest AUC. Since there was no obvious reason to choose either model, a likelihood ratio test was applied, and its result indicated that Model 3 was the preferred model of the two. Therefore, we selected Binary Model 3 for purposes of predicting whether or not a service contract is likely to renew.
————————————————————————————————————————–
We checked the performance of the selected model prior to applying it to our evaluation data by applying it to the training data set. This allowed us to assess how well the model’s predictions matched up to the training data’s actual ‘RENEWAL_FLAG’ values.
Summary statistics for the relevant ‘RENEWAL_FLAG’ values and the binomial logistic regression predictions for the corresponding observations are provided below: (NOTE: “n” represents the number of records within a data set).
| Variable | n | Mean | sd | Med. | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| RENEWAL_FLAG | 71726 | 0.32 | 0.47 | 0 | 0 | 1 | 1 | 0.77 | -1.41 | 0 |
| Predicted | 71726 | 0.13 | 0.33 | 0 | 0 | 1 | 1 | 2.22 | 2.94 | 0 |
————————————————————————————————————————–
We can infer the following from the selected binary logistic regression model’s coefficients:
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| - 2.948 | Intercept | + 0.416 | CONTRACT_VALUE (>250K) |
| + 0.901 | AREA1 | + 0.261 | CONTRACT_VALUE (100K-250K) |
| + 0.286 | SERVICE_APPROACH1 | + 0.101 | CONTRACT_VALUE (10K-25K) |
| + 0.143 | TIER1 | + 0.178 | CONTRACT_VALUE (25K-50K) |
| - 1.066 | CONTRACT_LENGTH (Mult-YR) | + 0.282 | CONTRACT_VALUE (50K-100K) |
| - 0.295 | SALES_CATEGORY (ENT) | - 0.263 | DISCOUNT_CATEGORY (MEDIUM) |
| + 0.097 | SALES_CATEGORY (OTHER) | + 0.547 | DISCOUNT_CATEGORY (NO DISCOUNT) |
| - 0.127 | SALES_CATEGORY (PS) | + 1.099 | DISCOUNT_CATEGORY (VERY LARGE) |
| + 0.206 | SALES_STRATEGY1 | + 0.092 | log(ITEM_COUNT + 1) |
| (-) | (-) | + 0.142 | log(COST + 1) |
AREA: Domestic customers are more likely to renew their service contracts than are foreign customers.
SERVICE_APPROACH: Customers whose relationships are managed using the “Land and Expand” account management strategy are more likely to renew their service contracts than are customers whose relationships are not managed using the “Land and Expand” strategy.
TIER: Customers managed via a second tier channel sales distribution method are more likely to renew than are customers managed via a first tier channel sales distribution method.
CONTRACT_LENGTH: Customers that have multi-year service contracts are less likely to renew their contracts than are customers with annual contracts.
SALES_CATEGORY: Enterprise and public sector customers are less likely to renew their service contracts than are other types of customers.
CONTRACT_VALUE_CATEGORY: Contracts valued at more than $250,000 are more likely to renew than are other contracts.
DISCOUNT_CATEGORY: Customers that receive very large discounts are more likely to renew their contracts than are other types of customers.
ITEM_COUNT: The greater the number of line items contained in a service contract, the more likely it is that the contract will be renewed.
COST: The larger the list price of the items listed in the contract is, the more likely it is that the contract will be renewed.
SALES_STRATEGY: Customers whose relationships are managed using the ‘GCS’ sales strategy are more likely to renew their contracts than are customers whose relationships are managed using other sales strategies.
The coefficients can also be exponentiated so as to allow for their interpretation as odds ratios:
| Var | OR | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.2311089 | 0.1973334 | 0.2706535 |
| AREA1 | 2.5367399 | 2.4125369 | 2.6680595 |
| SERVICE_APPROACH1 | 1.3803906 | 1.3353151 | 1.4270608 |
| TIER1 | 1.0792109 | 1.0452289 | 1.1143245 |
| CONTRACT_LENGTHMULTI-YEAR | 0.3660953 | 0.3458963 | 0.3873816 |
| SALES_CATEGORYENT | 0.7810605 | 0.7076640 | 0.8614062 |
| SALES_CATEGORYOTHER | 1.0726579 | 0.9947881 | 1.1562938 |
| SALES_CATEGORYPS | 0.9067025 | 0.8231195 | 0.9980575 |
| CONTRACT_VALUE_CATEGORY>250K | 3.3883859 | 2.7073767 | 4.2526766 |
| CONTRACT_VALUE_CATEGORY100K-250K | 2.3854469 | 2.0401162 | 2.7909619 |
| CONTRACT_VALUE_CATEGORY10K-25K | 1.5244816 | 1.4351413 | 1.6191997 |
| CONTRACT_VALUE_CATEGORY25K-50K | 1.7114976 | 1.5653377 | 1.8710840 |
| CONTRACT_VALUE_CATEGORY50K-100K | 1.9905038 | 1.7628608 | 2.2477106 |
| CONTRACT_LINE_CATEGORYMEDIUM | 1.0249487 | 0.9052904 | 1.1604823 |
| CONTRACT_LINE_CATEGORYSMALL | 0.9389611 | 0.8518029 | 1.0351946 |
| CONTRACT_LINE_CATEGORYVERY LARGE | 1.0184574 | 0.7675312 | 1.3531406 |
| CONTRACT_LINE_CATEGORYVERY SMALL | 0.6403009 | 0.5823288 | 0.7042031 |
| DISCOUNT_CATEGORYMEDIUM | 0.7283804 | 0.6988528 | 0.7591717 |
| DISCOUNT_CATEGORYNO DISCOUNT | 1.4498368 | 1.3689804 | 1.5355683 |
| DISCOUNT_CATEGORYVERY LARGE | 3.1614161 | 3.0016316 | 3.3299904 |
| MULTIPLE_SERVICES1 | 3.9689639 | 3.3258403 | 4.7555328 |
| SALES_STRATEGY1 |
This exponentiation of the coefficients allowed us to more clearly relate their apparent effects to the behavioral economic strategies described in our Literature Review. The table below summarizes our observations in that regard.
| Model 1 - Influence on Contract Renewal |
| Strategic Predictor | Data Column Found | Category | Effect | Rank |
|---|---|---|---|---|
| Land & Expand | Service_Approach | Sales Growth | Positive | Use of ‘S’ |
| (‘S’,‘Not S’) | ||||
| Adoption Services | Sales_Strategy | Sales Growth | Positive | Use ‘GLS’ |
| (ENT,COM,PS, OTH) | ||||
| Adaptive Learning | Sales_Strategy | Sales Growth | Positive | Use ‘GLS’ |
| (ENT,COM,PS, OTH) | ||||
| Hyperbolic Discounting | Discount_Category | Cost Factors | Mixed | VLG,NONE,LG,MED |
| (MED,LG,VLG,NONE) | ||||
| Reference Dependence | Cost | Cost Factors | NA | NA |
| (Continuous Numeric) | ||||
| Churn | Contract_Value_Category | Cost Factors | Positive | > $, > Effect |
| (Price Range) | ||||
| Incentive Theory | Discount_Category | Cost Factors | Mixed | VLG,NONE,LG,MED |
| (Discount) | (MED,LG,VLG,NONE) | |||
| Incentive Theory | Multiple_Services | Contract | NA | NA |
| (1-Tier, Multi-Tier) | Features | |||
| Contract Duration | Contract_Length | Contract | Negative | Use ‘Annual’ |
| (Annual, Multi-Yr) | Features | |||
| Channel Pass-Through | Tier | Sales Growth | Positive | Use ‘Channel’ |
| (Single, Multi) | Sales Growth |
————————————————————————————————————————–
While the binary logistic regression model we selected can be useful for predicting whether or not a contract will renew, that model does not allow us to predict the likely dollar value of the renewed contract. The ‘ITEM_COUNT’ variable indicates the number of line items included in a contract, and as such can be used as a proxy for the overall dollar value of the contract since, in general, the larger the number of line items listed in a contract, the larger the dollar value of the contract will likely be.
As such, it was decided that a count regression model would be constructed for purposes of estimating the likely number of items that might be included in a contract if that contract was predicted to be renewed. Our data exploration efforts indicated that the variance of the ‘ITEM_COUNT’ variable was much larger than its mean, thereby ruling out the possibility of applying a Poisson count regression model and pointing us toward the use of a negative binomial count regression model instead.
Our negative binomial model excluded the ‘CONTRACT_LINE_CATEGORY’ so as to avoid potential collinearity issues with the response variable. An initial set of modeling iterations led to the removal of a few statistically insignificant predictors but yielded a model whose ‘ITEM_COUNT’ predictions were wildly inaccurate, with some predictions exceeding one trillion possible line items. Further investigation revealed that the ‘COST’ variable was the source of the problem: that variable’s very large variance and outliers were causing the negative binomial model to generate wildly inaccurate predictions.
Further discussions with representatives of the tech company revealed that removal of the ‘COST’ variable’s outliers (basically very large total list price amounts) was viable since the outliers comprised a relatively small percentage of the contract data we’d been provided. It was therefore decided that all contracts whose ‘COST’ values exceeded $100,000 would be ignored for purposes of predicting the ‘ITEM_COUNT’ variable. Such contracts comprised approximately 3% of the total contract’s we’d been provided.
The resulting multi-step ‘ITEM_COUNT’ prediction process was thereby as follows:
Use the selected binary logistic regression model to predict whether or not a contract is likely to renew.
For those contracts that are deemed likely to renew, predict the likely ‘ITEM_COUNT’ amounts using the negative binomial regression model for all contracts having (COST <= 100,000).
Subsequent modeling iterations yielded a negative binomial model with the following performance characteristics:
| Metric | Value |
|---|---|
| Number of Predictors | 7 |
| AIC | 146419 |
| Dispersion | 6.00 |
A jitter plot of the model’s rounded fitted values against the ‘ITEM_COUNT’ response variable shows that the model appears to perform rather poorly at accurately predicting the likely number of line items for a contract that is likely to renew:
Summary statistics for the relevant ‘ITEM_COUNT’ values and the Negative Binomial predictions for the corresponding observations are provided below: (NOTE: “n” represents the number of records within a data set).
| Variable | n | Mean | sd | Med. | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| ITEM_COUNT | 22243 | 10.05 | 137.74 | 3 | 1 | 15176 | 15175 | 65.91 | 6730.87 | 0.92 |
| Predicted | 22243 | 47.06 | 292.12 | 7 | 3 | 9198 | 9195 | 15.63 | 325.55 | 1.96 |
The summary statistics are reflective of the jitter plot shown above. The predicted values for the item counts are much less variable than are the corresponding actual ‘ITEM_COUNT’ values. Furthermore, the skew and kurtosis of the actual ‘ITEM_COUNT’ values greatly exceeds that of the predicted values.
The coefficients for this negative binomial model are shown below. Please note that coefficients for each category of the multi-category predictors are included.
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| + 2.234 | Intercept | + 0.864 | CONTRACT_VAL_CAT>250K |
| - 0.544 | SERVICE_APPROACH1 | + 0.456 | CONTRACT_VAL_CAT100K-250K |
| - 0.222 | CONTRACT_LENGTHMult-YR | + 0.680 | CONTRACT_VAL_CAT10K-25K |
| - 0.081 | SALES_CATEGORYENT | + 0.343 | CONTRACT_VAL_CAT25K-50K |
| - 0.313 | SALES_CATEGORYOTHER | + 0.394 | CONTRACT_VAL_CAT50K-100K |
| - 0.057 | SALES_CATEGORYPS | - 0.230 | DISCOUNT_CATEGORYMEDIUM |
| + 0.582 | MULTIPLE_SERVICES1 | + 0.020 | DISCOUNT_CATEGORYNO DISCOUNT |
| + 0.001 | COST | - 0.037 | DISCOUNT_CATEGORYVERY LARGE |
We can infer the following from the model’s coefficients:
SERVICE_APPROACH: Customers whose relationships are managed using the “Land and Expand” account management strategy are less likely to have relatively higher contract item counts than are customers whose relationships are not managed using the “Land and Expand” strategy.
CONTRACT_LENGTH: Customers that have multi-year service contracts are less likely to have relatively higher contract item counts than are customers with annual contracts.
SALES_CATEGORY: Customers that are neither Public sector nor enterprise customers are less likely to have higher contract item counts than are customers of other sales categories.
CONTRACT_VALUE_CATEGORY: Customers whose contract values exceed $250,000 in value are more likely to have higher contract item counts than are other customers.
DISCOUNT_CATEGORY: Customers receiving no discounts are more likely to have relatively higher contract item counts than are other customers receiving discounts.
MULTIPLE_SERVICES: Customers contracting for more than a single service are more likely to have relatively higher contract item counts.
COST: Contracts having higher total list prices are more likely to have relatively higher contract item counts than are contracts having lower total list prices.
————————————————————————————————————————–
This section of the Appendix contains the source R code used for constructing and evaluating various binary logistic regression models for purposes of predicting the likelihood of a service contract being renewed.
library(alr3)
library(car)
library(knitr)
library(pROC)
library(lmtest)
fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-TRAINING-DATASET.csv", stringsAsFactors = FALSE)
# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)
# NOTE: RENEWAL_FLAG is the predictor so ignore it
summary(fp[,-1])
f <- fp[,-c(1,6,9)]
m1 <- glm(data = f, RENEWAL_FLAG ~ . - ITEM_COUNT - COST, family=binomial(link="logit"))
summary(m1)
# no point in doing mmps due to all variables being categorical
#par(mfrow=c(1,1))
#mmps(final, layout=c(4,3), key=T)
# Load R functions for model statistics
accuracy <- function(actual, predicted){
# Equation to be modeled: (TP + TN) / (TP + FP + TN + FN)
# derive confusion matrix cell values
c.mat <- data.frame(table(actual, predicted))
# extract all four confusion matrix values from the data frame
TN <- as.numeric(as.character(c.mat[1,3]))
FN <- as.numeric(as.character(c.mat[2,3]))
FP <- as.numeric(as.character(c.mat[3,3]))
TP <- as.numeric(as.character(c.mat[4,3]))
# now calculate the required metric
return( (TP + TN) / (TP + FP + TN + FN) )
}
classif.err.rate <- function(actual, predicted) {
# Equation to be modeled: (FP + FN) / (TP + FP + TN + FN)
# derive confusion matrix cell values
c.mat <- data.frame(table(actual, predicted))
# extract all four confusion matrix values from the data frame
TN <- as.numeric(as.character(c.mat[1,3]))
FN <- as.numeric(as.character(c.mat[2,3]))
FP <- as.numeric(as.character(c.mat[3,3]))
TP <- as.numeric(as.character(c.mat[4,3]))
# now calculate the required metric
return( (FP + FN) / (TP + FP + TN + FN) )
}
precision <- function(actual, predicted) {
# Precision : the proportion of positive cases that were correctly identified.
# Equation to be modeled: TP / (TP + FP)
# derive confusion matrix cell values
c.mat <- data.frame(table(actual, predicted))
# extract all four confusion matrix values from the data frame
TN <- as.numeric(as.character(c.mat[1,3]))
FN <- as.numeric(as.character(c.mat[2,3]))
FP <- as.numeric(as.character(c.mat[3,3]))
TP <- as.numeric(as.character(c.mat[4,3]))
# now calculate the required metric
return( TP / (TP + FP) )
}
sensitivity <- function(actual, predicted) {
# Equation to be modeled: TP / (TP + FN)
# derive confusion matrix cell values
c.mat <- data.frame(table(actual, predicted))
# extract all four confusion matrix values from the data frame
TN <- as.numeric(as.character(c.mat[1,3]))
FN <- as.numeric(as.character(c.mat[2,3]))
FP <- as.numeric(as.character(c.mat[3,3]))
TP <- as.numeric(as.character(c.mat[4,3]))
# now calculate the required metric
return( TP / (TP + FN) )
}
specificity <- function(actual, predicted) {
# Equation to be modeled: TN / (TN + FP)
# derive confusion matrix cell values
c.mat <- data.frame(table(actual, predicted))
# extract all four confusion matrix values from the data frame
TN <- as.numeric(as.character(c.mat[1,3]))
FN <- as.numeric(as.character(c.mat[2,3]))
FP <- as.numeric(as.character(c.mat[3,3]))
TP <- as.numeric(as.character(c.mat[4,3]))
# now calculate the required metric
return( TN / (TN + FP) )
}
F1.Score <- function(actual, predicted) {
# Equation to be modeled: ( 2 * precision * sensitivity) / (precision + sensitivity)
# now calculate the required metric
return( ( 2 * precision(actual, predicted) * sensitivity(actual, predicted))
/ (precision(actual, predicted) + sensitivity(actual, predicted)) )
}
Now run metrics
# Coefficient Interpretation
# Logit model average marginal effects - use it to generate interpretable versions of coefficients
LogitScalar <- mean(dlogis(predict(m1, type = "link")))
LogitScalar * coef(m1)
# Logit model predicted probabilities - yields likelihood that each eval item is '+'
#
predprob.renew <- round(predict(m1, type="response"), 2)
summary(predprob.renew)
# Percent correctly predicted values
# NOTE: Need to create variable 'Y' for this to work - set it to response variable
Y <- f$RENEWAL_FLAG
pred.renew <- round(fitted(m1))
table(true = Y, pred = pred.renew)
# t.r <- data.frame(table(true = Y, pred = pred.crime))
# t.r
# now use functions built in HW 2 to get required statistics
accuracy(Y, pred.renew)
classif.err.rate(Y, pred.renew)
precision(Y, pred.renew)
sensitivity(Y, pred.renew)
specificity(Y, pred.renew)
F1.Score(Y, pred.renew)
# get AUC
rocCurve <- roc(response= Y, predictor= pred.renew)
auc(rocCurve)
# ALTERNATIVE REDUCED MODEL EXCLUDING CONTRACT TERMS
m2 <- glm(data = f, RENEWAL_FLAG ~ AREA + SERVICE_APPROACH + TIER + SALES_STRATEGY + MULTIPLE_SERVICES, family=binomial(link="logit"))
summary(m2)
Now run metrics
# Coefficient Interpretation
# Logit model average marginal effects - use it to generate interpretable versions of coefficients
LogitScalar <- mean(dlogis(predict(m2, type = "link")))
LogitScalar * coef(m2)
# Logit model predicted probabilities - yields likelihood that each eval item is '+'
#
predprob.renew <- round(predict(m2, type="response"), 2)
summary(predprob.renew)
# Percent correctly predicted values
# NOTE: Need to create variable 'Y' for this to work - set it to response variable
Y <- f$RENEWAL_FLAG
pred.renew <- round(fitted(m2))
table(true = Y, pred = pred.renew)
# now use functions built in HW 2 to get required statistics
accuracy(Y, pred.renew)
classif.err.rate(Y, pred.renew)
precision(Y, pred.renew)
sensitivity(Y, pred.renew)
specificity(Y, pred.renew)
F1.Score(Y, pred.renew)
# get AUC
rocCurve <- roc(response= Y, predictor= pred.renew)
auc(rocCurve)
# ALTERNATIVE REDUCED MODEL EXCLUDING CONTRACT TERMS
m3 <- glm(data = f, RENEWAL_FLAG ~ . - CONTRACT_LINE_CATEGORY, family=binomial(link="logit"))
summary(m3)
STOP - Model shows neither ITEM_COUNT or COST are statistically significant predictors
DO NOT run metrics since STOPPED
summary(powerTransform(ITEM_COUNT+1~RENEWAL_FLAG, f, family="bcPower"))
#use log
summary(powerTransform(COST+1~RENEWAL_FLAG, f, family="bcPower"))
#use log
m4 <- glm(data=f, RENEWAL_FLAG~AREA + SERVICE_APPROACH + TIER + CONTRACT_LENGTH + SALES_CATEGORY + CONTRACT_VALUE_CATEGORY + DISCOUNT_CATEGORY + log(ITEM_COUNT + 1) + log(COST + 1) + SALES_STRATEGY, family=binomial(link="logit"))
summary(m4)
## odds ratios only
exp(coef(m4))
## odds ratios and 95% CI
exp(cbind(OR = coef(m4), confint(m4)))
# Coefficient Interpretation
# Logit model average marginal effects - use it to generate interpretable versions of coefficients
LogitScalar <- mean(dlogis(predict(m4, type = "link")))
LogitScalar * coef(m4)
# Logit model predicted probabilities - yields likelihood that each eval item is '+'
#
predprob.renew <- round(predict(m4, type="response"), 2)
summary(predprob.renew)
# Percent correctly predicted values
# NOTE: Need to create variable 'Y' for this to work - set it to response variable
Y <- f$RENEWAL_FLAG
pred.renew <- round(fitted(m4))
table(true = Y, pred = pred.renew)
# t.r <- data.frame(table(true = Y, pred = pred.crime))
# t.r
# now use functions built in HW 2 to get required statistics
accuracy(Y, pred.renew)
classif.err.rate(Y, pred.renew)
precision(Y, pred.renew)
sensitivity(Y, pred.renew)
specificity(Y, pred.renew)
F1.Score(Y, pred.renew)
# get AUC
rocCurve <- roc(response= Y, predictor= pred.renew)
auc(rocCurve)
# now plot TARGET against fitted values
fit1 <- round(m4$fitted.values)
describe(fit1)
describe(f$RENEWAL_FLAG)
Try a lrtest on m1 vs m4
lrtest(m4, m1)
————————————————————————————————————————–
This section of the Appendix contains the source R code used for constructing and evaluating various negative binomial count regression models for purposes of predicting the likely number of contract line items for a contract that had been predicted as likely to be renewed.
library(alr3)
library(car)
library(knitr)
library(pROC)
library(psych)
library(MASS)
options(scipen=999)
# old file load - used full master data set
# fp <- read.csv("https://raw.githubusercontent.com/jeffnieman11/621_final_data/master/final_data.csv", stringsAsFactors = FALSE)
# new file load - uses only TRAINING data set
fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-TRAINING-DATASET.csv", stringsAsFactors = FALSE)
# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)
# NOTE: RENEWAL_FLAG is the predictor so ignore it
summary(fp[,-1])
f <- fp[,-c(1,6,9)]
# subset only contracts that renewed AND COST <= 100000 (to remove outliers)
renews <- subset(f, RENEWAL_FLAG == 1 & COST <= 100000)
# renumber rows
# rownames(renews) <- 1:nrow(renews)
# fit a negative binomial model
negb.1 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY, renews)
summary(negb.1)
# remove AREA and TIER
negb.2 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - AREA - TIER, renews)
summary(negb.2)
# calculate dispersion: overdispersed at 6.63
(dp <- sum(residuals(negb.2,type="pearson")^2)/negb.2$df.res)
# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.2,dispersion=dp)
# --------------------------------------------------------
# drop SALES_STRATEGY
negb.3 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - AREA - TIER - SALES_STRATEGY, renews)
summary(negb.3)
# calculate dispersion: overdispersed at 6.0
(dp <- sum(residuals(negb.3,type="pearson")^2)/negb.3$df.res)
# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.3,dispersion=dp)
# check goodness of fit
1 - pchisq(deviance(negb.3),df.residual(negb.3))
# anova goodness of fit: indicates good fit to all variables
anova(negb.3, test="Chisq")
# check diagnostic plots: shows one extreme but valid outlier
par(mfrow = c(2,2))
plot(negb.3)
# now plot TARGET against fitted values
fit1 <- round(negb.3$fitted.values)
summary(fit1) # max is 9198 so use 10000 for plot axis
# check summary for max ITEM_COUNT
summary(renews$ITEM_COUNT)
par(mfrow = c(1,1))
plot(jitter(fit1), renews$ITEM_COUNT, xlab="Fitted Values", xaxp = c(0, 10000, 20))
describe(fit1)
describe(renews$ITEM_COUNT)
# drop CONTRACT_LENGTH and COST
negb.4 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST, renews)
summary(negb.4)
# Remove AREA
negb.5 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA, renews)
summary(negb.5)
# calculate dispersion: overdispersed at 6.0
(dp <- sum(residuals(negb.5,type="pearson")^2)/negb.5$df.res)
# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.5,dispersion=dp)
# remove SALES_STRATEGY and TIER
negb.6 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA - SALES_STRATEGY - TIER, renews)
summary(negb.6)
# calculate dispersion: overdispersed at 6.0
(dp <- sum(residuals(negb.6,type="pearson")^2)/negb.6$df.res)
# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.6,dispersion=dp)
# check goodness of fit
1 - pchisq(deviance(negb.6),df.residual(negb.6))
# anova goodness of fit: indicates good fit to all variables
anova(negb.6, test="Chisq")
# check diagnostic plots: shows one extreme but valid outlier
par(mfrow = c(2,2))
plot(negb.6)
# now plot TARGET against fitted values
fit1 <- round(negb.6$fitted.values)
summary(fit1) # max is 592 so use 600 for plot axis
# check summary for max ITEM_COUNT
summary(renews$ITEM_COUNT)
par(mfrow = c(1,1))
plot(jitter(fit1), renews$ITEM_COUNT, xlab="Fitted Values", xaxp = c(0, 600, 20))
describe(fit1)
describe(renews$ITEM_COUNT)
renews <- subset(f, RENEWAL_FLAG == 1)
negb.c1 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST, renews)
summary(negb.c1)
# remove area
negb.c2 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA, renews)
summary(negb.c2)
# remove TIER
negb.c3 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA - TIER, renews)
summary(negb.c3)
# calculate dispersion: overdispersed at 6.89
(dp <- sum(residuals(negb.c3,type="pearson")^2)/negb.c3$df.res)
# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.c3,dispersion=dp)
# remove SALES_STRATEGY and CONTRACT_LENGTH
negb.c4 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - COST - AREA - TIER - SALES_STRATEGY - CONTRACT_LENGTH, renews)
summary(negb.c4)
# calculate dispersion: overdispersed at 6.89
(dp <- sum(residuals(negb.c4,type="pearson")^2)/negb.c4$df.res)
# check summary to see if dropping a predictor will improve things
# the p-values get adjusted by the dispersion so re-check them!!!
summary(negb.c4,dispersion=dp)
# anova goodness of fit: indicates good fit to all variables
anova(negb.c4, test="Chisq")
# check diagnostic plots: shows one extreme but valid outlier
par(mfrow = c(2,2))
plot(negb.c4)
# now plot TARGET against fitted values
fit1 <- round(negb.c4$fitted.values)
summary(fit1) # max is 5091 so use 6000 for plot axis
# check summary for max ITEM_COUNT
summary(renews$ITEM_COUNT)
par(mfrow = c(1,1))
plot(jitter(fit1), renews$ITEM_COUNT, xlab="Fitted Values", xaxp = c(0, 6000, 20))
describe(fit1)
describe(renews$ITEM_COUNT)
————————————————————————————————————————–
This section of the Appendix contains the source R code used for applying the preferred regression models to the Evaluation data set.
R code for the required 2-stage prediction process
First stage:
library(MASS)
library(psych)
options(scipen=999)
# load training set so that binary model can be built
fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-TRAINING-DATASET.csv", stringsAsFactors = FALSE)
# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)
# remove unused columns from data set
f <- fp[,-c(1,6,9)]
# build binary logit model
m4 <- glm(data=f, RENEWAL_FLAG~AREA + SERVICE_APPROACH + TIER + CONTRACT_LENGTH + SALES_CATEGORY +
CONTRACT_VALUE_CATEGORY + DISCOUNT_CATEGORY + log(ITEM_COUNT + 1) + log(COST + 1) +
SALES_STRATEGY, family=binomial(link="logit"))
# ---------------------------------------
# now that model is built, load eval data set
# load EVAL data set
fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-EVAL-DATASET.csv", stringsAsFactors = FALSE)
# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)
# remove unused columns from data set
f <- fp[,-c(1,6,9)]
e.out <- fp
# save a copy of RENEWAL_FLAG for stats at end
Renew.f <- fp$RENEWAL_FLAG
# set renewal flag in eval data set to zero
f$RENEWAL_FLAG <- 0
# now predict TARGET_FLAG using model
pred.renew <- predict(m4, newdata=f, type="response")
# Save predicted probability and rounded value to new variables in output data frame
e.out$RENEWAL_PROB <- round(pred.renew, 3)
e.out$PREDICTED_RENEWAL <- round(pred.renew)
# set RENEWAL_FLAG in eval data set according to predictions
f$RENEWAL_FLAG <- round(pred.renew)
# write results to a file
# create data frame with ONLY those columns needed for comparisons
b.out <- data.frame(cbind(e.out$INDEX, e.out$RENEWAL_FLAG, e.out$RENEWAL_PROB,
e.out$PREDICTED_RENEWAL, e.out$COST, e.out$ITEM_COUNT))
# Give columns proper names
colnames(b.out) <- c("INDEX", "RENEWAL_FLAG", "RENEWAL_PROB", "PREDICTED_RENEWAL", "COST", "ITEM_COUNT")
# write results to a file
#write.csv(b.out, file = "C:/SQLData/621/FP-RENEWAL_PREDS.csv", row.names = FALSE)
# how many were correct predicted?
(pred.cor <- nrow(subset(b.out, RENEWAL_FLAG == 1 & PREDICTED_RENEWAL == 1)))
# how many total RENEWAL_FLAG == 1 in eval data set?
(renews.t <- nrow(subset(b.out, RENEWAL_FLAG == 1)))
True positives = 1609 (out of 6292 actual RENEWAL_FLAG == 1)
False postiives = 2574 - 1609 = 965
True Negatives = 20000 - (6292 + 965) = 12743 (out of 13708 possible actual RENEWAL_FLAG == 0)
False Negatives = 6292 - 1609 = 4683
————————————————————————————————————————–
Accuracy = (TP + TN) / (TP + FP + TN + FN) = (1609 + 12743) / 20000 = 0.7176
Classification Error Rate = (FP + FN) / (TP + FP + TN + FN) = (965 + 4683) / 20000 = 0.2824
Precision = TP / (TP + FP) = 1609 / (1609 + 965) = 0.6251
Sensitivity = TP / (TP + FN) = 1609 / (1609 + 4683) = 0.2256
Specificity = TN / (TN + FP) = 12743 / (12743 + 965) = 0.9296
F1 Score = ( 2 * precision * sensitivity) / (precision + sensitivity) = (2 * .6251 * .2256) / (.6251 + .2256) = 0.3315
# ratio? Only 25% of renewals were predicted correctly
pred.cor / renews.t
1 - (pred.cor / renews.t)
Summary statistics
describe(b.out$PREDICTED_RENEWAL)
describe(Renew.f)
————————————————————————————————————————–
Second stage:
# load training set so that negative binomial model can be built
fp <- read.csv("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-621/master/Final%20Project/FP-TRAINING-DATASET.csv", stringsAsFactors = FALSE)
# convert binary categoricals to factors: 10 variables in total
fp$AREA <- factor(fp$AREA)
fp$SERVICE_APPROACH <- factor(fp$SERVICE_APPROACH)
fp$TIER <- factor(fp$TIER)
fp$CONTRACT_LENGTH <- factor(fp$CONTRACT_LENGTH)
fp$SALES_CATEGORY <- factor(fp$SALES_CATEGORY)
fp$CONTRACT_VALUE_CATEGORY <- factor(fp$CONTRACT_VALUE_CATEGORY)
fp$CONTRACT_LINE_CATEGORY <- factor(fp$CONTRACT_LINE_CATEGORY)
fp$DISCOUNT_CATEGORY <- factor(fp$DISCOUNT_CATEGORY)
fp$MULTIPLE_SERVICES <- factor(fp$MULTIPLE_SERVICES)
fp$SALES_STRATEGY <- factor(fp$SALES_STRATEGY)
# remove unused columns from data set
f.train <- fp[,-c(1,6,9)]
# subset only contracts that renewed AND COST <= 100000 (to remove outliers)
renews <- subset(f.train, RENEWAL_FLAG == 1 & COST <= 100000)
# now fit the model
negb.3 <- glm.nb(ITEM_COUNT ~ . -RENEWAL_FLAG - CONTRACT_LINE_CATEGORY - AREA - TIER - SALES_STRATEGY, renews)
# ---------------------------------
# extract items from EVAL set that were predicted to be RENEWAL_FLAG = 1
eval.nb <- f[which(f$RENEWAL_FLAG == 1 & f$COST <= 100000),]
# save a copy for output so that we have a copy of the original ITEM_COUNT
eval.IC <- b.out[which(b.out$PREDICTED_RENEWAL == 1 & b.out$COST <= 100000),]
# set ITEM_COUNT to zero
# eval.nb$ITEM_COUNT <- 0
# now predict ITEM_COUNT using model
pred.IC <- predict(negb.3, newdata=eval.nb, type="response")
# now add predicted item count to restricted eval data set
eval.IC$PRED_ITEM_COUNT <- round(pred.IC)
describe(eval.IC$ITEM_COUNT)
describe(eval.IC$PRED_ITEM_COUNT)
# write results of ITEM_COUNT predictions to a file
# write.csv(eval.IC, file = "C:/SQLData/621/FP_ITEM_COUNT_PREDS.csv", row.names = FALSE)