Source: Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2002).

1. The challenge

2. Learning methods

There are many different types of learning methods, each having its own characteristics and lending itself to certain learning problems. In this paper, we adopt the classification of Mitchell (1997a) and organize the major types of learning methods into the following groups:

* Concept learning (CL)
* Decision trees (DT)
* Artificial neural networks (NN)
* Bayesian learning (BL)
* Genetic algorithms (GA) and genetic programming (GP)
* Instance-based learning (IBL) and case-based reasoning (CBR)
* Inductive logic programming (ILP)
* Explanation-based learning (EBL)

3. Existing works

3.1. Prediction and estimation.

In this group, ML methods are used to predict or estimate: (1) software quality, (2) software size, (3) software development cost, (4) project or software effort, (5) maintenance task effort, (6) software resource, (7) correction cost, (8) software reliability, (9) software defect, (10) reusability, (11) software release timing, and (12) testability of program modules.

1. Software quality prediction.

GP is used in (Evett et al., 1998) to generate software quality models that take as input software metrics collected earlier in development, and predict, for each module, the number of faults that will be discovered later in development or during operations. These predictions then serve as the basis for ranking modules, enabling a manager to select as many modules from the top of the list as resources allow for reliability enhancement.
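The ranking-and-selection step this enables is simple to illustrate. The sketch below is not the authors' system; it assumes a fault-prediction model has already produced per-module estimates, and the module names, predictions, and costs are hypothetical.

```python
# Minimal sketch of the ranking step described above: given per-module
# fault predictions from any learned model, rank modules and pick as many
# from the top as the enhancement budget allows. All values hypothetical.

def select_modules_for_review(predicted_faults, review_cost, budget):
    """predicted_faults: {module: predicted fault count};
    review_cost: {module: cost of reliability enhancement};
    budget: total resources available."""
    ranked = sorted(predicted_faults, key=predicted_faults.get, reverse=True)
    selected, spent = [], 0.0
    for module in ranked:
        if spent + review_cost[module] <= budget:
            selected.append(module)
            spent += review_cost[module]
    return selected

predictions = {"parser.c": 12.4, "io.c": 3.1, "ui.c": 7.8}   # hypothetical
costs = {"parser.c": 5.0, "io.c": 2.0, "ui.c": 4.0}
print(select_modules_for_review(predictions, costs, budget=9.0))
```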

A comparative study is done in (Lanubile and Visaggio, 1997) to evaluate several modeling techniques for predicting quality of software components. Among them is the NN model. Another NN based software quality prediction work, as reported in (Hong and Wu, 1997), is language specific, where design metrics for SDL (Specification and Description Language) are first defined, and then used in building the prediction models for identifying fault prone components. In (Khoshgoftaar et al., 1995, 1997), NN based models are used to predict faults and software quality measures.

CBR is the learning method used in two separate software quality prediction efforts (Emam et al., 2001; Ganesan et al., 2000). The focus of Emam et al. (2001) is on comparing the performance of different CBR classifiers, resulting in a recommendation of a simple CBR classifier with Euclidean distance, z-score standardization, no weighting scheme, and selection of the single nearest neighbor for prediction. In (Ganesan et al., 2000), CBR is applied to software quality modeling of a family of full-scale industrial software systems, and its accuracy is considered better than that of a corresponding multiple linear regression model in predicting the number of design faults.
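The recommended configuration is concrete enough to sketch. The following is a minimal illustration, not the authors' implementation, using a small case base of hypothetical module metrics:

```python
import numpy as np

# Illustrative sketch of the CBR classifier configuration recommended in
# Emam et al. (2001): z-score standardization, unweighted Euclidean
# distance, and prediction from the single nearest neighbor.

def zscore_fit(X):
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return mu, sigma

def predict_1nn(case_base, labels, query):
    """case_base: (n, d) array of past modules' metrics;
    labels: n class labels (e.g., fault-prone or not);
    query: d metrics for a new module."""
    mu, sigma = zscore_fit(case_base)
    Xz = (case_base - mu) / sigma
    qz = (query - mu) / sigma
    dists = np.linalg.norm(Xz - qz, axis=1)   # Euclidean, no weighting
    return labels[np.argmin(dists)]

# Hypothetical case base: rows are modules, columns are software metrics.
X = np.array([[120, 4, 0.2], [300, 9, 0.7], [80, 2, 0.1]])
y = np.array(["not fault-prone", "fault-prone", "not fault-prone"])
print(predict_1nn(X, y, np.array([250, 8, 0.6])))
```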

In (Porter and Selby, 1990), a DT-based approach is used to generate measurement-based models of high-risk components. The proposed method relies on historical data (metrics from previous releases or projects) to identify fault-prone components. Another DT-based approach is used to build models for predicting high-risk Ada components (Briand et al., 1993). A further comparative study is reported in (Cohen and Devanbu, 1997) on using ILP methods for software fault prediction in C++ programs. Both natural and artificial data are used in evaluating the performance of two ILP methods, and some extensions are proposed to one of them.

Software quality prediction is formulated as a CL problem in (de Almeida and Matwin, 1999). The study notes that activities prior to the actual learning process (such as data acquisition, feature extraction, and example labeling) can affect the quality of the outcome. The proposed approach is applied to a set of COBOL programs.

2. Software size estimation.

NN and GP are used in (Dolado, 2000) to validate the component-based method for software size estimation. In addition to producing results that corroborate the component-based approach to software sizing, the study notes that NN works well with the data, recognizing some nonlinear relationships that the multiple linear regression method fails to detect. The equations evolved by GP provide similar or better values than those produced by the regression equations, and are intelligible, providing confidence in the results.

3. Software cost prediction.

A general approach, called optimized set reduction and based on DT, is described in (Briand et al., 1992) for analyzing software engineering data, and is demonstrated to be an effective technique for software cost estimation. A comparative study that includes a CBR technique for software cost prediction is reported in (Briand et al., 1999). The result reported in (Chulani et al., 1999) indicates that improved predictive performance of software cost models can be obtained through Bayesian analysis, which offers a framework in which both prior expert knowledge and sample data can be accommodated to obtain predictions.
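The appeal of the Bayesian framework is easy to see in miniature: for a normally distributed model parameter, the posterior estimate is a precision-weighted average of the expert prior and the data-driven estimate. The sketch below is a toy illustration of that idea, not the calibration in (Chulani et al., 1999); all numbers are hypothetical.

```python
# Hedged sketch of the core Bayesian idea in Chulani et al. (1999): for a
# normally distributed model parameter, the posterior mean is a
# precision-weighted average of the expert prior and the data estimate.
# All numbers below are hypothetical.

def posterior_mean(prior_mean, prior_var, data_mean, data_var):
    w_prior = 1.0 / prior_var        # precision of expert judgment
    w_data = 1.0 / data_var          # precision of the sample estimate
    return (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)

# Expert prior: a cost-driver coefficient near 1.2; noisy regression
# estimate from project data: 0.8. The posterior leans toward whichever
# source is more precise (smaller variance).
print(posterior_mean(prior_mean=1.2, prior_var=0.01,
                     data_mean=0.8, data_var=0.09))   # ~1.16
```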

4. Software (project) development effort prediction.

IBL techniques are used in (Shepperd and Schofield, 1997) for predicting the software project effort for new projects. The empirical results obtained (from nine different industrial data sets totaling 275 projects) indicate that CBR offers a viable complement to existing prediction and estimation techniques. Another CBR application in software effort estimation is reported in (Vicinanza et al., 1990). DT and NN are used in (Srinivasan and Fisher, 1995) to help predict software development effort. The results were competitive with conventional methods such as COCOMO and function points. The main advantage of DT- and NN-based estimation systems is that they are adaptable and nonparametric. Additional research on ML-based software effort prediction includes a genetically trained NN (GA + NN) predictor (Shukla, 2000) and a comparative study of NN- and CBR-based software effort estimation techniques (Finnie et al., 1997).
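As a minimal illustration of casting effort prediction as supervised learning, the sketch below fits a decision-tree regressor to a handful of hypothetical project records; it is not the setup of (Srinivasan and Fisher, 1995), and the features and values are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative sketch (not from Srinivasan and Fisher, 1995) of casting
# effort prediction as decision-tree regression: each row holds project
# features such as size in KLOC, team experience, and a complexity rating;
# the target is effort in person-months. All values are hypothetical.
X = np.array([[10, 3, 2], [50, 5, 4], [25, 2, 3], [80, 7, 5], [15, 4, 2]])
y = np.array([24.0, 130.0, 70.0, 210.0, 32.0])   # person-months

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X, y)
print(model.predict([[40, 4, 3]]))   # effort estimate for a new project
```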

5. Maintenance task effort prediction.

Models are generated in terms of NN, DT, and regression methods for software maintenance task effort prediction in (Jorgensen, 1995). The study measures and compares the prediction accuracy of each model, and concludes that DT-based and multiple-regression-based models have better accuracy. It is recommended that prediction models be used as instruments to support expert estimates and to analyze the impact of maintenance variables on the process and product of maintenance.

6. Software resource analysis.

In (Selby and Porter, 1988), DT is utilized in software resource data analysis to identify classes of software modules that have high development effort or faults ("high" is defined as the uppermost quartile relative to past data). Sixteen software systems are used in the study. The decision trees correctly identify 79.3 percent of the software modules that had high development effort or faults.
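The labeling scheme translates directly into code. The sketch below is an illustration on hypothetical data, not the authors' tool: modules whose past effort falls in the uppermost quartile are labeled "high", and a decision tree is trained to recognize them from their metrics.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Sketch of the labeling scheme in Selby and Porter (1988): a module is
# labeled "high" when its development effort falls in the uppermost
# quartile of past data, and a decision tree is then trained to recognize
# such modules from their metrics. Data below are hypothetical.
effort = np.array([12, 40, 7, 95, 22, 60, 15, 88])          # past modules
threshold = np.quantile(effort, 0.75)                        # top quartile
labels = (effort >= threshold).astype(int)                   # 1 = "high"

metrics = np.array([[3, 120], [9, 480], [2, 60], [14, 900],  # e.g. fan-out
                    [5, 200], [11, 650], [4, 150], [13, 820]])  # and lines
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(metrics, labels)
print(tree.predict([[12, 700]]))     # is this new module likely "high"?
```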

7. Correction cost estimation.

An empirical study is done in (de Almeida et al., 1998) where DT and ILP are used to generate models for estimating correction costs in software maintenance. The generated models prove valuable in helping to optimize resource allocation in corrective maintenance activities, and in deciding when to restructure or reengineer a component so as to make it more maintainable. A comparison shows that ILP-based results outperform DT-based results.

8. Software reliability prediction.

Software reliability growth models characterize how software reliability varies with time and other factors. They offer mechanisms for estimating current reliability measures and for predicting their future values. The work in (Karunanithi et al., 1992) reports the use of NN for software reliability growth prediction. An empirical comparison is conducted between NN-based models and five well-known software reliability growth models, using actual data sets from a number of different software projects. The results indicate that NN-based models adapt well across different data sets and have better prediction accuracy.
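In outline, such a model learns the mapping from elapsed testing time to the cumulative number of faults observed and then extrapolates it. The sketch below uses an ordinary multilayer perceptron on synthetic failure data; it is a hedged illustration, not the architecture of (Karunanithi et al., 1992).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hedged sketch of the NN reliability-growth setup: the network learns the
# mapping from elapsed testing time to the cumulative number of faults
# observed, then extrapolates it. The failure data below are synthetic,
# and this plain MLP is not the authors' architecture.
t = np.arange(1, 21).reshape(-1, 1) / 20.0        # normalized test time
faults = np.cumsum(np.random.RandomState(0).poisson(2, 20))

nn = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
nn.fit(t, faults)
print(nn.predict([[1.1]]))   # predicted cumulative faults just past the data
```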

9. Defect prediction.

BL is used in (Fenton and Neil, 1999) to predict software defects. Though the system reported is only a prototype, it shows the potential that Bayesian belief networks (BBN) have for incorporating multiple perspectives on defect prediction into a single, unified model. Variables in the prototype BBN system (Fenton and Neil, 1999) are chosen to represent the life-cycle processes of specification, design and implementation, and testing (Problem-Complexity, Design-Effort, Design-Size, Defects-Introduced, Testing-Effort, Defects-Detected, Defects-Density-At-Testing, Residual-Defect-Count, and Residual-Defect-Density). The proper causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with the BBN model in the following manner: given facts about Design-Effort and Design-Size as input, the tool uses Bayesian inference to derive the probability distributions for Defects-Introduced, Defects-Detected, and Defect-Density.
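The kind of inference involved can be illustrated with a toy network over three of the named variables, binarized to high/low. This is not the Fenton and Neil prototype; the probabilities are hypothetical, and the enumeration is a simplification of what a BBN tool performs.

```python
# Toy Bayesian-network inference by enumeration over three of the named
# variables: Design-Effort -> Defects-Introduced -> Defects-Detected.
# All variables are binarized and all probabilities are hypothetical.

P_design = {"high": 0.4, "low": 0.6}                 # P(Design-Effort)
P_intro = {("high",): {"high": 0.2, "low": 0.8},     # P(Intro | Design)
           ("low",):  {"high": 0.7, "low": 0.3}}
P_detect = {("high",): {"high": 0.9, "low": 0.1},    # P(Detect | Intro)
            ("low",):  {"high": 0.2, "low": 0.8}}

def p_detected_given_design(detect, design):
    """P(Defects-Detected = detect | Design-Effort = design), obtained by
    summing out the hidden Defects-Introduced node."""
    return sum(P_intro[(design,)][i] * P_detect[(i,)][detect]
               for i in ("high", "low"))

# Given low design effort, how likely is a high defect-detection count?
print(p_detected_given_design("high", design="low"))  # 0.7*0.9 + 0.3*0.2 = 0.69

# Marginal when Design-Effort is unobserved, weighting by its prior.
print(sum(P_design[d] * p_detected_given_design("high", d) for d in P_design))
```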

10. Reusability prediction.

Predictive models are built through DT in (Mao et al., 1998) to verify the impact of some internal properties of object-oriented applications on reusability. Effort is focused on establishing a correlation between component reusability and three software attributes (inheritance, coupling, and complexity). The experimental results show that some software metrics can be used to predict, with a high level of accuracy, potentially reusable classes.

11. Software release timing.

How to determine the software release schedule is an issue that affects the software developer, the user, and the market. A method based on NN is proposed in (Dohi et al., 1999) for estimating the optimal software release timing. The method adopts the cost minimization criterion and translates it into a time series forecasting problem. NN is then used to estimate future fault-detection times.
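The translation into a forecasting problem amounts to turning the observed fault-detection times into input/target windows. The sketch below shows that windowing step on hypothetical data; training the NN itself is omitted.

```python
import numpy as np

# Sketch of the problem translation in Dohi et al. (1999): the sequence of
# observed fault-detection times is recast as a time-series forecasting
# task, where a window of the last k inter-failure times is the input and
# the next one is the target (an NN would then be trained on these pairs).
# The detection times below are hypothetical.
detection_times = np.array([1.2, 2.9, 5.1, 8.0, 11.6, 16.0, 21.3])
intervals = np.diff(detection_times)       # inter-failure times

def make_windows(series, k=3):
    X = [series[i:i + k] for i in range(len(series) - k)]
    y = [series[i + k] for i in range(len(series) - k)]
    return np.array(X), np.array(y)

X, y = make_windows(intervals, k=3)
print(X, y)    # training pairs for the forecasting model
```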

12. Testability prediction.

The work reported in (Khoshgoftaar et al., 2000) describes a case study in which NN is used to predict the testability of software modules from static measurements of the source code. The objective of the study is to predict a quantity between zero and one whose distribution is highly skewed toward zero, which proves difficult for standard statistical techniques. The results echo the salient feature of the NN-based predictive models discussed so far: their ability to model nonlinear relationships.

3.2. Property and model discovery.

ML methods are used to identify or discover useful information about software entities. Work in (Bratko and Grobelnik, 1993) explores using ILP to discover loop invariants. The approach is based on collecting execution traces of a program to be proven correct and using them as learning examples for an ILP system. The states of the program variables at a given point in the execution represent positive examples for the condition associated with that point in the program; a controlled closed-world assumption is utilized to generate negative examples.

In (Abd-El-Hafiz, 2000), NN is used to identify objects in procedural programs in an effort to facilitate maintenance activities such as reuse and understanding. The approach is based on cluster analysis and is capable of identifying abstract data types and groups of routines that reference a common set of data.

A data analysis technique called process discovery, implemented in terms of NN, is proposed in (Cook and Wolf, 1998). The approach is based on first capturing data describing process events from an on-going process and then generating a formal model of the behavior of that process. Another application uses EBL to synthesize models of programming activities or software processes (Garg and Bhansali, 1992). It generates a process fragment (a group of primitive actions which achieves a certain goal given some preconditions) from a recorded process history.

3.3. Transformation.

The work in (Ryan and Ivan, 1999; Ryan, 2000) describes a GP system that can transform serial programs into functionally identical parallel programs. The functional equivalence of the transformation's input and output can be proven, which greatly enhances the system's prospects for use in commercial environments.

A module architecture assistant is developed in (Schwanke and Hanson, 1994) to assist software architects in improving the modularity of large programs. A model for modularization is established in terms of nearest-neighbor clustering and classification, and is used to make recommendations for rearranging module membership in order to improve modularity. The tool learns similarity judgments that match those of the software architect by performing back propagation on a specialized neural network.

GA is used in (Choi and Wu, 1998) in experimenting and evaluating a partitioning and allocation model for mapping object-oriented applications to heterogeneous distributed environments. By effectively distributing software components of an objectoriented system in a distributed environment, it is hoped to achieve performance goals such as load balancing, maximizing concurrency and minimizing communication costs.

3.4. Generation and synthesis.

In (Bergadano and Gunetti, 1996), a test case generation method is proposed that is based on ILP. An adequate test set is generated as a result of inductive learning of programs from finite sets of input-output examples. The method scales up well as the size or complexity of the program to be tested grows, but it stops being practical if the number of alternatives (or possible errors) becomes too large.

3.5. Reuse library construction and maintenance.

This area presents itself as a fertile ground for CBR applications. In (Ostertag et al., 1992), CBR is the cornerstone of a reuse library system. A component in the library is represented in terms of a set of feature/term pairs. Similarity between a target and a candidate is defined by a distance measure, which is computed through comparator functions based on the subsumption, closeness, and package relations.
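A minimal sketch of this matching scheme follows. The comparator here is a placeholder (exact term equality) rather than the original subsumption, closeness, and package relations, and the components are hypothetical.

```python
# Illustrative sketch of the reuse-library matching idea in Ostertag et
# al. (1992): components are sets of feature/term pairs, and the distance
# between a target and a candidate aggregates per-feature comparator
# scores. The comparator below is a simplification of the original.

def feature_distance(term_a, term_b):
    return 0.0 if term_a == term_b else 1.0   # placeholder comparator

def component_distance(target, candidate):
    """target, candidate: dicts of feature -> term."""
    shared = set(target) & set(candidate)
    if not shared:
        return float("inf")
    return sum(feature_distance(target[f], candidate[f]) for f in shared) / len(shared)

target = {"domain": "parsing", "language": "C", "structure": "tree"}
library = {"lex-gen": {"domain": "parsing", "language": "C", "structure": "table"},
           "sorter":  {"domain": "sorting", "language": "C", "structure": "array"}}
best = min(library, key=lambda name: component_distance(target, library[name]))
print(best)    # closest reusable component under this distance
```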

3.6. Requirement acquisition.

CL is used to support scenario-based requirement engineering in the work reported in (van Lamsweerde and Willemet, 1998). The paper describes a formal method for supporting the process of inferring specifications of system goals and requirements inductively from interaction scenarios provided by stakeholders. The method is based on a learning algorithm that takes scenarios as examples and counter-examples (positive and negative scenarios) and generates goal specifications as temporal rules.

3.7. Capturing development knowledge.

How to capture and manage software development knowledge is the theme of this application group, where both papers report work utilizing CBR as the tool. In (Henninger, 1997), a CBR-based infrastructure is proposed that supports evolving knowledge and domain analysis methods, capturing emerging knowledge and synthesizing it into generally applicable forms.

4. Steps of applying ML algorithms to SE tasks

1. Problem formulation.

The first step is to formulate a given problem such that it conforms to the framework of the particular learning method chosen for the task. Different learning methods have different inductive biases, adopt different search strategies based on various guiding factors, have different requirements regarding domain theory (presence or absence) and training data (valuation and properties), and are based on different justifications of reasoning (refer to Figure 2). All these issues must be taken into consideration during the problem formulation stage. This step is of pivotal importance to the applicability of the learning method. Strategies such as divide-and-conquer may be needed to decompose the original problem into a set of sub-problems more amenable to the chosen learning method. The best formulation of a problem is not always the one most intuitive to a machine learning researcher (Langley and Simon, 1995).

2. Problem representation.

The next step is to select an appropriate representation for both the training data and the knowledge to be learned. As can be seen in Figure 2, different learning methods have different representational formalisms. Thus, the representation of the attributes and features in the learning task is often problem-specific and formalism-dependent.

3. Data collection.

The third step is to collect data needed for the learning process. The quality and the quantity of the data needed are dependent on the selected learning method. Data may need to be preprocessed before they can be used in the learning process.

4. Domain theory preparation.

Certain learning methods (e.g., EBL) rely on the availability of a domain theory for the given problem. How to acquire and prepare a domain theory (or background knowledge), and how to assess its quality (correctness, completeness), therefore become important issues that affect the outcome of the learning process.

5. Performing the learning process.

Once the data and a domain theory (if needed) are ready, the learning process can be carried out. The data are divided into a training set and a test set. If a learning tool or environment is utilized, the training data and the test data may need to be organized according to the tool's requirements. Knowledge induced from the training set is validated on the test set. Because different splits between training and test sets are possible, the learning process itself is iterative.
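A common way to realize this iteration is k-fold cross-validation, in which each data point appears in both training and test sets across iterations. The sketch below is a generic illustration with hypothetical data and an arbitrary choice of learner.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Sketch of the iterative split-train-validate loop described above,
# using k-fold cross-validation so that each data point serves in both
# the training set and the test set across iterations. The data and the
# choice of a decision tree are hypothetical placeholders.
rng = np.random.RandomState(0)
X = rng.rand(40, 4)                      # e.g., software metrics
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(np.mean(scores))                   # average validation accuracy
```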

6. Analyzing and evaluating learned knowledge.

Analysis and evaluation of learned knowledge is an integral part of the learning process. The interestingness and the performance of the acquired knowledge are scrutinized during this step, often with help from human experts, which can lead to knowledge refinement.

If learned knowledge is deemed insignificant, uninteresting, irrelevant, or deviant, this may indicate the need for revisions at earlier stages such as problem formulation and representation. Many learning methods have known practical problems, such as overfitting, local minima, or the curse of dimensionality, that stem from data inadequacy, noise or irrelevant attributes in the data, the nature of a search strategy, or an incorrect domain theory.

7. Fielding the knowledge base.

What this step entails is that the learned knowledge be used (Langley and Simon, 1995). The knowledge could be embedded in a software development system or a software product, or used without embedding it in a computer system.

As observed in (Langley and Simon, 1995), the power of machine learning methods does not come from a particular induction method, but instead from proper formulation of the problems and from crafting the representation to make learning tractable.

5. Guidelines for selecting methods