3.1. Prediction and estimation
In this group, ML methods are used to predict or estimate: (1) software quality, (2) software size, (3) software development cost, (4) project or software effort, (5) maintenance task effort, (6) software resource, (7) correction cost, (8) software reliability, (9) software defect, (10) reusability, (11) software release timing, and (12) testability of program modules.
1. Software quality prediction.
GP is used in (Evett et al., 1998) to generate software quality models that take as input software metrics collected earlier in development, and predict for each module the number of faults that will be discovered later in development or during operations. These predictions then serve as the basis for ranking modules, enabling a manager to select as many modules from the top of the list as resources allow for reliability enhancement.
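The ranking step described above can be sketched in a few lines. This is only an illustration of selecting modules from the top of a ranked list, not the GP model of Evett et al.; the module names, the predicted fault counts, and the function name select_modules_for_enhancement are all hypothetical.

```python
from typing import Dict, List


def select_modules_for_enhancement(predicted_faults: Dict[str, float],
                                   budget: int) -> List[str]:
    """Rank modules by predicted fault count (highest first) and
    return as many from the top of the list as the budget allows."""
    ranked = sorted(predicted_faults.items(),
                    key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:budget]]


# Hypothetical predictions, as a quality model might produce them.
predictions = {"parser": 7.2, "scheduler": 1.4, "io": 3.9, "ui": 0.6}
print(select_modules_for_enhancement(predictions, budget=2))
```

Here the budget stands in for "as many modules as resources allow"; in practice it might be expressed in person-hours rather than a module count.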
A comparative study is done in (Lanubile and Visaggio, 1997) to evaluate several modeling techniques for predicting quality of software components. Among them is the NN model. Another NN based software quality prediction work, as reported in (Hong and Wu, 1997), is language specific, where design metrics for SDL (Specification and Description Language) are first defined, and then used in building the prediction models for identifying fault prone components. In (Khoshgoftaar et al., 1995, 1997), NN based models are used to predict faults and software quality measures.
CBR is the learning method used in two separate software quality prediction efforts (Emam et al., 2001; Ganesan et al., 2000). The focus of Emam et al. (2001) is on comparing the performance of different CBR classifiers, resulting in a recommendation of a simple CBR classifier with Euclidean distance, z-score standardization, no weighting scheme, and selection of the single nearest neighbor for prediction. In (Ganesan et al., 2000), CBR is applied to software quality modeling of a family of full-scale industrial software systems, and its accuracy is considered better than that of a corresponding multiple linear regression model in predicting the number of design faults.
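A classifier with the four properties recommended by Emam et al. (Euclidean distance, z-score standardization, no feature weighting, single nearest neighbor) might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation: the two metrics per case, the labels, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Case = Tuple[Sequence[float], str]  # (metric vector, class label)


def zscore_params(rows: List[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature mean and (population) standard deviation."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant feature: avoid dividing by 0
    return means, stds


def standardize(row: Sequence[float], means: List[float],
                stds: List[float]) -> List[float]:
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def predict_1nn(case_base: List[Case], query: Sequence[float]) -> str:
    """Return the label of the single nearest case (unweighted Euclidean
    distance on z-score standardized features)."""
    means, stds = zscore_params([features for features, _ in case_base])
    q = standardize(query, means, stds)
    nearest = min(case_base,
                  key=lambda case: math.dist(standardize(case[0], means, stds), q))
    return nearest[1]


# Hypothetical case base: (lines of code / 10, cyclomatic complexity).
cases: List[Case] = [
    ([100, 5], "fault-prone"),
    ([120, 6], "fault-prone"),
    ([20, 1], "not fault-prone"),
    ([30, 2], "not fault-prone"),
]
print(predict_1nn(cases, [110, 5]))
```

Standardizing before measuring distance keeps a metric with a large numeric range from dominating the others, which matters precisely because no weighting scheme is applied.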
In (Porter and Selby, 1990), a DT based approach is used to generate measurement-based models of high-risk components. The proposed method relies on historical data (metrics from previous releases or projects) for identifying components with fault-prone properties. Another DT based approach is used to build models for predicting high-risk Ada components (Briand et al., 1993). Another comparative study is reported in (Cohen and Devanbu, 1997) on using ILP methods for software fault prediction for C++ programs. Both natural and artificial data are used in evaluating the performance of two ILP methods, and some extensions are proposed to one of them.
Software quality prediction is formulated as a CL problem in (de Almeida and Matwin, 1999). It is noted in the study that there are activities (such as data acquisition, feature extraction and example labeling) prior to the actual learning process. These activities have an impact on the quality of the outcome. The proposed approach is applied to a set of COBOL programs.
2. Software size estimation.
NN and GP are used in (Dolado, 2000) to validate the component-based method for software size estimation. In addition to producing results that corroborate the component-based approach for software sizing, it is noticed in the study that NN works well with the data, recognizing some nonlinear relationships that the multiple linear regression method fails to detect. The equations evolved by GP provide similar or better values than those produced by the regression equations, and are intelligible, providing confidence in the results.
3. Software cost prediction.
A general approach, called optimized set reduction and based on DT, is described in (Briand et al., 1992) for analyzing software engineering data, and is demonstrated to be an effective technique for software cost estimation. A comparative study is done in (Briand et al., 1999) which includes a CBR technique for software cost prediction. The result reported in (Chulani et al., 1999) indicates that improved predictive performance of software cost models can be obtained through the use of Bayesian analysis, which offers a framework where both prior expert knowledge and sample data can be accommodated to obtain predictions.
4. Software (project) development effort prediction.
IBL techniques are used in (Shepperd and Schofield, 1997) for predicting the software project effort for new projects. The empirical results obtained (from nine different industrial data sets totaling 275 projects) indicate that CBR offers a viable complement to the existing prediction and estimation techniques. Another CBR application in software effort estimation is reported in (Vicinanza et al., 1990). DT and NN are used in (Srinivasan and Fisher, 1995) to help predict software development effort. The results were competitive with conventional methods such as COCOMO and function points. The main advantage of DT and NN based estimation systems is that they are adaptable and nonparametric. Additional research on ML based software effort prediction includes a genetically trained NN (GA + NN) predictor (Shukla, 2000) and a comparative study (Finnie et al., 1997) of software effort estimation techniques based on NN and CBR.
5. Maintenance task effort prediction.
Models are generated using NN, DT, and regression methods for software maintenance task effort prediction in (Jorgensen, 1995). The study measures and compares the prediction accuracy of each model, and concludes that DT-based and multiple regression-based models have better accuracy results. It is recommended that prediction models be used as instruments to support the expert estimates and to analyze the impact of the maintenance variables on the process and product of maintenance.
6. Software resource analysis.
In (Selby and Porter, 1988), DT is utilized in software resource data analysis to identify classes of software modules that have high development effort or faults (the concept of “high” is defined with regard to the uppermost quartile relative to past data). Sixteen software systems are used in the study. The decision trees correctly identify 79.3 percent of the software modules that had high development effort or faults.
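The quartile-based definition of "high" can be illustrated as follows. This sketch only shows how target labels could be derived from past data before a decision tree is trained; it does not reproduce the tree construction of Selby and Porter. The module names, effort values, and the function name high_effort_modules are hypothetical, and the third-quartile cut point comes from Python's statistics.quantiles with its default method.

```python
import statistics
from typing import Dict, Set


def high_effort_modules(past_effort: Dict[str, float]) -> Set[str]:
    """Label as 'high' every module whose development effort lies
    above the third quartile of the past data."""
    _q1, _q2, q3 = statistics.quantiles(past_effort.values(), n=4)
    return {name for name, effort in past_effort.items() if effort > q3}


# Hypothetical development effort per module, in person-hours.
effort = {"m1": 5, "m2": 10, "m3": 20, "m4": 30,
          "m5": 40, "m6": 50, "m7": 60, "m8": 70}
print(sorted(high_effort_modules(effort)))
```

The resulting labels would serve as the class attribute that the decision tree tries to predict from module metrics.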
7. Correction cost estimation.
An empirical study is done in (de Almeida et al., 1998) where DT and ILP are used to generate models for estimating correction costs in software maintenance. The generated models prove to be valuable in helping to optimize resource allocations in corrective maintenance activities, and to make decisions regarding when to restructure or reengineer a component so as to make it more maintainable. A comparison leads to the observation that ILP-based models perform better than DT-based models.
8. Software reliability prediction.
Software reliability growth models can be used to characterize how software reliability varies with time and other factors. The models offer mechanisms for estimating current reliability measures and for predicting their future values. The work in (Karunanithi et al., 1992) reports the use of NN for software reliability growth prediction. An empirical comparison is conducted between NN-based models and five well-known software reliability growth models using actual data sets from a number of different software projects. The results indicate that NN-based models adapt well across different data sets and have better prediction accuracy.
9. Defect prediction.
BL is used in (Fenton and Neil, 1999) to predict software defects. Though the system reported is only a prototype, it shows the potential Bayesian belief networks (BBN) have in incorporating multiple perspectives on defect prediction into a single, unified model. Variables in the prototype BBN system (Fenton and Neil, 1999) are chosen to represent the life-cycle processes of specification, design and implementation, and testing (Problem-Complexity, Design-Effort, Design-Size, Defects-Introduced, Testing-Effort, Defects-Detected, Defects-Density-At-Testing, Residual-Defect-Count, and Residual-Defect-Density). The causal relationships among those software life-cycle processes are then captured and reflected as arcs connecting the variables. A tool is then used with the BBN model in the following manner. Given facts about Design-Effort and Design-Size as input, the tool uses Bayesian inference to derive the probability distributions for Defects-Introduced, Defects-Detected and Defect-Density.
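To make the inference step concrete, the sketch below computes one posterior by enumeration over a drastically reduced network. It is an assumption-laden toy, not the Fenton and Neil prototype: only three of the variables are kept (Design-Effort, Testing-Effort, Defects-Introduced, feeding Defects-Detected), each is reduced to "low"/"high", and every probability in the tables is invented for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers invented).
# P(Defects-Introduced = high | Design-Effort)
P_INTRODUCED_HIGH = {"low": 0.7, "high": 0.2}
# Prior P(Testing-Effort = high)
P_TESTING_HIGH = 0.5
# P(Defects-Detected = high | Defects-Introduced, Testing-Effort)
P_DETECTED_HIGH = {
    ("high", "high"): 0.9,
    ("high", "low"): 0.4,
    ("low", "high"): 0.2,
    ("low", "low"): 0.05,
}


def p_detected_high(design_effort: str) -> float:
    """P(Defects-Detected = high | Design-Effort), obtained by summing
    out the unobserved variables Defects-Introduced and Testing-Effort."""
    p_high = P_INTRODUCED_HIGH[design_effort]
    total = 0.0
    for introduced, testing in product(("low", "high"), repeat=2):
        p_i = p_high if introduced == "high" else 1.0 - p_high
        p_t = P_TESTING_HIGH if testing == "high" else 1.0 - P_TESTING_HIGH
        total += p_i * p_t * P_DETECTED_HIGH[(introduced, testing)]
    return total


for level in ("low", "high"):
    print(level, round(p_detected_high(level), 4))
```

Enumeration is feasible only for small networks like this one; tools for full-size BBNs rely on more efficient propagation algorithms.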
10. Reusability prediction.
Predictive models are built through DT in (Mao et al., 1998) to verify the impact of some internal properties of object-oriented applications on reusability. Effort is focused on establishing a correlation between component reusability and three software attributes (inheritance, coupling and complexity). The experimental results show that some software metrics can be used to predict, with a high level of accuracy, the potentially reusable classes.
11. Software release timing.
How to determine the software release schedule is an issue that affects the software product developer, the user, and the market. A method, based on NN, is proposed in (Dohi et al., 1999) for estimating the optimal software release timing. The method adopts the cost minimization criterion and translates it into a time series forecasting problem. NN is then used to estimate the fault-detection time in the future.
12. Testability prediction.
The work reported in (Khoshgoftaar et al., 2000) describes a case study in which NN is used to predict the testability of software modules from static measurements of the source code. The objective in the study is to predict a quantity between zero and one whose distribution is highly skewed toward zero, which proves to be difficult for standard statistical techniques. The results echo the salient feature of the NN-based predictive models discussed so far: their ability to model nonlinear relationships.
3.2. Property and model discovery.
ML methods are used to identify or discover useful information about software entities. Work in (Bratko and Grobelnik, 1993) explores using ILP to discover loop invariants. The approach is based on collecting execution traces of a program to be proven correct and using them as learning examples of an ILP system. The states of the program variables at a given point in the execution represent positive examples for the condition associated with that point in the program. A controlled closed-world assumption is utilized to generate negative examples.
In (Abd-El-Hafiz, 2000), NN is used to identify objects in procedural programs as an effort to facilitate many maintenance activities (reuse, understanding). The approach is based on cluster analysis and is capable of identifying abstract data types and groups of routines that reference a common set of data.
A data analysis technique called process discovery is proposed in (Cook and Wolf, 1998) that is implemented in terms of NN. The approach is based on first capturing data describing process events from an on-going process and then generating a formal model of the behavior of that process. Another application involves the use of EBL to synthesize models of programming activities or software processes (Garg and Bhansali, 1992). It generates a process fragment (a group of primitive actions which achieves a certain goal given some preconditions) from a recorded process history.
3.5. Reuse library construction and maintenance.
This area presents itself as fertile ground for CBR applications. In (Ostertag et al., 1992), CBR is the cornerstone of a reuse library system. A component in the library is represented in terms of a set of feature/term pairs. Similarity between a target and a candidate is defined by the distance measure, which is computed through comparator functions based on the subsumption, closeness and package relations.