Long-Term Software Fault Prediction Model with Linear Regression and Data Transformation
Subject Areas : Machine learning
Momotaz Begum 1 , Jahid Hasan Rony 2 , Md. Rashedul Islam 3 , Jia Uddin 4 *
1 - Department of Computer Science and Engineering, Dhaka University of Engineering & Technology, Gazipur-1707, Dhaka, Bangladesh
2 - Department of Computer Science and Engineering, Dhaka University of Engineering & Technology, Gazipur-1707, Dhaka, Bangladesh
3 - Department of Computer Science and Engineering, International University of Business Agriculture and Technology
4 - AI and Big Data Department, Endicott College, Woosong University, Daejeon, South Korea
Keywords: Software Reliability, Software Faults, Forecasting, Long Term Prediction, Relative Error
Abstract :
Validating performance is essential for ensuring software reliability, since it determines the characteristics of an implemented software system. Ensuring reliability requires not only detecting and fixing faults that have already occurred but also predicting future faults, and this prediction is performed before the actual testing phase begins. As a result, many studies on software fault prediction have been carried out. In this paper, we present a software fault prediction model in which different data transformation methods are applied to Poisson fault count data. To pre-process the Poisson data into Gaussian data, the Box-Cox power transformation (Box-Cox_T), the Yeo-Johnson power transformation (Yeo-Johnson_T), and the Anscombe transformation (Anscombe_T) are used. Linear regression is then applied for long-term software fault prediction. Linear regression captures the linear relationship between the dependent and independent variables, here the relative error and the testing days, respectively. For the analysis, three real software fault count datasets are used, and we compare the proposed approach with the Naïve Gauss method, the exponential smoothing time series forecasting model, and conventional software reliability growth models (SRGMs), both with data transformation (With_T) and without data transformation (Non_T). The datasets contain (62, 133), (181, 225), and (114, 189) testing days and cumulative software faults, respectively. The Box-Cox power transformation with linear regression (L_Box-Cox_T) outperformed all other methods in terms of average relative error from the short term to the long term.
[1] J. Stilgoe, “Who Killed Elaine Herzberg?,” in Who’s Driving Innovation? New Technologies and the Collaborative State, J. Stilgoe, Ed. Cham: Springer International Publishing, 2020, pp. 1–6. doi: 10.1007/978-3-030-32320-2_1.
[2] B. P. Murthy, N. Krishna, T. Jones, A. Wolkin, R. N. Avchen, and S. J. Vagi, “Public Health Emergency Risk Communication and Social Media Reactions to an Errant Warning of a Ballistic Missile Threat — Hawaii, January 2018,” Morb. Mortal. Wkly. Rep., vol. 68, no. 7, pp. 174–176, Feb. 2019, doi: 10.15585/mmwr.mm6807a2.
[3] H. Pham, System Software Reliability. Springer Science & Business Media, 2007.
[4] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener, “Defect prediction from static code features: current results, limitations, new approaches,” Autom. Softw. Eng., vol. 17, no. 4, pp. 375–407, Dec. 2010, doi: 10.1007/s10515-010-0069-5.
[5] A. L. Goel, “Software Reliability Models: Assumptions, Limitations, and Applicability,” IEEE Trans. Softw. Eng., vol. SE-11, no. 12, pp. 1411–1423, Dec. 1985, doi: 10.1109/TSE.1985.232177.
[6] A. A. Abdel-Ghaly, P. Y. Chan, and B. Littlewood, “Evaluation of competing software reliability predictions,” IEEE Trans. Softw. Eng., vol. SE-12, no. 9, pp. 950–967, Sep. 1986, doi: 10.1109/TSE.1986.6313050.
[7] S. Santosa, R. A. Pramunendar, D. P. Prabowo, and Y. P. Santosa, “Wood Types Classification using Back-Propagation Neural Network based on Genetic Algorithm with Gray Level Co-occurrence Matrix for Features Extraction,” 2019.
[8] Y. Wang, D. Niu, and L. Ji, “Short-term power load forecasting based on IVL-BP neural network technology,” Syst. Eng. Procedia, vol. 4, pp. 168–174, Jan. 2012, doi: 10.1016/j.sepro.2011.11.062.
[9] “Long-term Software Fault Prediction with Robust Prediction Interval Analysi...: EBSCOhost.”
[10] M. Begum and T. Dohi, “Optimal Release Time Estimation of Software System using Box-Cox Transformation and Neural Network,” Int. J. Math. Eng. Manag. Sci., vol. 3, pp. 177–194, Jun. 2018, doi: 10.33889/IJMEMS.2018.3.2-014.
[11] M. Begum and T. Dohi, “Estimating prediction interval of cumulative number of software faults using back propagation algorithm,” May 2016.
[12] M. Begum and T. Dohi, Optimal Software Release Decision via Artificial Neural Network Approach with Bug Count Data. 2016.
[13] M. Begum and T. Dohi, “Prediction Interval of Cumulative Number of Software Faults Using Multilayer Perceptron,” vol. 619, pp. 43–58, Jan. 2016, doi: 10.1007/978-3-319-26396-0_4.
[14] M. Begum and T. Dohi, “A Neuro-Based Software Fault Prediction with Box-Cox Power Transformation,” J. Softw. Eng. Appl., vol. 10, no. 3, Art. no. 3, Mar. 2017, doi: 10.4236/jsea.2017.103017.
[15] M. Begum and T. Dohi, “Optimal stopping time of software system test via artificial neural network with fault count data,” J. Qual. Maint. Eng., vol. 24, pp. 00–00, Jan. 2018, doi: 10.1108/JQME-12-2016-0082.
[16] Y. Kamei and E. Shihab, “Defect Prediction: Accomplishments and Future Challenges,” in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Mar. 2016, vol. 5, pp. 33–45. doi: 10.1109/SANER.2016.56.
[17] V. R. Basili, “The experimental paradigm in software engineering,” in Experimental Software Engineering Issues: Critical Assessment and Future Directions, Berlin, Heidelberg, 1993, pp. 1–12. doi: 10.1007/3-540-57092-6_91.
[18] T. M. Khoshgoftaar et al., “Predicting fault-prone modules with case-based reasoning,” in Proceedings The Eighth International Symposium on Software Reliability Engineering, Nov. 1997, pp. 27–35. doi: 10.1109/ISSRE.1997.630845.
[19] C. Catal, “Software fault prediction: A literature review and current trends,” Expert Syst. Appl., vol. 38, no. 4, pp. 4626–4636, Apr. 2011, doi: 10.1016/j.eswa.2010.10.024.
[20] K. Thantirige, A. K. Rathore, S. K. Panda, S. Mukherjee, M. A. Zagrodnik, and A. K. Gupta, “An open-switch fault detection method for cascaded H-bridge multilevel inverter fed industrial drives,” in IECON 2016 - 42nd Annual Conference of the IEEE Industrial Electronics Society, Oct. 2016, pp. 2159–2165. doi: 10.1109/IECON.2016.7794032.
[21] M. Islam, M. Akhtar, and M. Begum, Long short-term memory (LSTM) networks based software fault prediction using data transformation methods. 2022, p. 6. doi: 10.1109/ICAEEE54957.2022.9836388.
[22] M. Islam, M. Begum and M. Akhtar, Recursive Approach for Multiple Step-Ahead Software Fault Prediction through Long Short-Term Memory (LSTM). p. 10.
[23] H. K. Dam et al., “Lessons Learned from Using a Deep Tree-Based Model for Software Defect Prediction in Practice,” in 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), May 2019, pp. 46–57. doi: 10.1109/MSR.2019.00017.
[24] D. Sharma and P. Chandra, “Linear regression with factor analysis in fault prediction of software,” J. Interdiscip. Math., vol. 23, pp. 11–19, Jan. 2020, doi: 10.1080/09720502.2020.1721641.
[25] D. J. Pedregal, “Time series analysis and forecasting with ECOTOOL,” PLOS ONE, vol. 14, no. 10, p. e0221238, Oct. 2019, doi: 10.1371/journal.pone.0221238.
[26] O. Nyarko-Boateng, A. F. Adekoya, and B. A. Weyori, “Predicting the actual location of faults in underground optical networks using linear regression,” Eng. Rep., vol. 3, no. 3, p. eng212304, 2021, doi: 10.1002/eng2.12304.
[27] G. E. P. Box and D. R. Cox, “An Analysis of Transformations,” J. R. Stat. Soc. Ser. B Methodol., vol. 26, no. 2, pp. 211–252, 1964.
[28] F. J. Anscombe, “The Transformation of Poisson, Binomial and Negative-Binomial Data,” Biometrika, vol. 35, no. 3/4, pp. 246–254, 1948, doi: 10.2307/2332343.
[29] S. Weisberg, “Yeo-Johnson Power Transformations.” 2001.
[30] E. S. Gardner, “Exponential smoothing: The state of the art—Part II,” Int. J. Forecast., vol. 22, no. 4, pp. 637–666, Oct. 2006, doi: 10.1016/j.ijforecast.2006.03.005.
[31] X. Su, X. Yan, and C.-L. Tsai, “Linear regression,” WIREs Comput. Stat., vol. 4, no. 3, pp. 275–294, 2012, doi: 10.1002/wics.1198.
[32] H. Okamura and T. Dohi, “SRATS: Software reliability assessment tool on spreadsheet (Experience report),” in 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE), Nov. 2013, pp. 100–107. doi: 10.1109/ISSRE.2013.6698909.
[33] M. R. Lyu, Ed., Handbook of Software Reliability Engineering. Los Alamitos, Calif.: New York: McGraw-Hill, 1996.
[34] A. Rasoolzadegan, “A new approach to the quantitative measurement of software reliability,” 2015.
Journal of Information Systems and Telecommunication
http://jist.acecr.org ISSN 2322-1437 / EISSN 2345-2773
1- Introduction
In the modern era, software is an essential part of our lives. It drives our digital devices and helps us maintain our lifestyle, manage businesses, and so on, to the point that it has become nearly impossible to get through a single day without using software. When software is responsible for a large-scale operation, even a minor fault can bring down the entire system. For example, in 2018, a fully autonomous Uber test car hit and killed a pedestrian [1]. Because of a fault in the object detection software, the system failed to detect the woman, who was crossing the road with her bike. The Hawaii false missile alarm is another example of the serious harm caused by a software failure [2]. Such incidents could have been avoided with reliable software supported by software failure prediction, a popular approach in software engineering.
Software engineering can be defined as the engineering approach to systematic application development. Test effort, optimal cost analysis, correction prediction, security, reusability, and quality-related prediction are a few of its vital issues. Research is still ongoing to find a versatile method of prediction analysis. At the same time, software reliability depends on dependable software fault prediction. Software reliability is the probability of failure-free software operation. Long-term software failure prediction can reveal the possibility of a software failure, so that its impact can be minimized by taking the necessary steps and precautions [3].
Software is expected to be perfect. However, it is impracticable to design and develop software with 100% accuracy and dependability. A previous study established that the proficiency of fault prediction is limited by the lack of proper performance evaluation criteria and by differing fault distributions within a fault dataset [4]. Nevertheless, software fault prediction has gained a lot of attention because it can provide the number of faults as well as their occurrence pattern for a given system. It also helps the quality assurance team, as it can reduce testing time and cost.
The purpose of software fault prediction is to identify faults before the software is sent to the testing phase, on the basis of its structural characteristics. In addition, to ensure software quality, professional stakeholders use prediction systems to optimize cost and effort during the operational phases. In this regard, we focus on long-term software fault prediction using a linear regression method. We compare the model with the Naïve Gauss method, the exponential smoothing time series forecasting model, and two existing software reliability growth models, namely log-extreme-minimum (SRGM_LEM) [5] and Pareto (SRGM_Pareto) [6]. We use software fault count data instead of fault detection time data because of its availability and usefulness. Furthermore, three real datasets are used in this study, and three of the most popular data transformation methods, Box-Cox_T, Yeo-Johnson_T, and Anscombe_T, are applied to transform the Poisson data into Gaussian data.
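As a rough illustration of the three transformations named above, the following Python sketch implements their standard textbook forms. The fault counts and the value λ = 0.5 are hypothetical and are not taken from the datasets used in this study, and the sketch does not cover how λ would be selected.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson power transform; unlike Box-Cox it accepts x <= 0."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -(((1.0 - x) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def anscombe(x):
    """Anscombe variance-stabilizing transform for Poisson counts."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)


# Hypothetical cumulative fault counts (illustration only).
faults = [2, 5, 9, 14, 18, 21, 23, 24]
lam = 0.5  # assumed transformation parameter

for f in faults:
    print(f, box_cox(f, lam), yeo_johnson(f, lam), anscombe(f))
```

Box-Cox is defined only for positive values, so a zero count would need an offset or one of the other two transformations.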
The rest of the paper is organized as follows. Section 2 reviews related work together with conventional NHPP-based SRGMs. Section 3 describes the system architecture, with figures, along with the data pre-processing techniques and forecasting models. Section 4 presents fault prediction with the proposed methodology. Section 5 presents the experiments, including the system setup, performance measures, and result analysis. Finally, Section 6 concludes the paper and outlines future directions.
2- Related Works
Much work has been done, and is still ongoing, on predicting software faults to ensure reliability. Software reliability growth models (SRGMs) are among the oldest approaches, but they have some limitations: maximum likelihood estimation requires high computational power, and with such a large number of SRGMs available, it is difficult for researchers to select a suitable model for each software dataset [3]. Another popular method is the artificial neural network (ANN) trained with the backpropagation (BP) learning algorithm, which is also used in software fault prediction [5], [6], [7], [8].
Recently, Begum et al. proposed a robust prediction interval method using a refined artificial intelligence approach, in which five data transformation methods are used for pre-processing and the results are compared with traditional SRGMs [9]. They constructed prediction intervals using the proposed method, evaluated performance by coverage rate and mean prediction interval width, and compared it with the existing delta method. However, the neural network architecture is complex, and as a result the computation time is very high. The same authors have related works [10]–[13] based on the multilayer perceptron that address optimal software release problems.
Furthermore, [14] presented a neuro-based software fault prediction method using the Box-Cox transformation. The authors also investigated the optimal value of the transformation parameter λ in terms of average relative error. They then compared their results with traditional SRGMs and showed that their method performed better in the early testing phase. In addition, a multilayer neural network architecture was used to identify the optimal software testing time [15]. The underlying software fault count data were pre-processed with a well-known data transformation technique, and the experiments were conducted on four actual software fault count datasets.
The study in [16] surveyed software fault prediction and its different components and parameters. It focused on accomplishments in this area as well as recent research trends, and discussed major future challenges of software fault prediction. However, the advantages and disadvantages of recent studies were not discussed. Different software fault prediction techniques [17]–[20] have been proposed previously, but none of them fully meets the criteria of long-term software fault prediction, nor do they provide enough quality assurance resources and logistics.
Recently, a semantic long short-term memory (LSTM) network was used to train a model that can identify faults in a self-directed manner. It was evaluated on real projects, where the proposed model outperformed state-of-the-art fault prediction approaches [21], [22]. For finding faults in practice, a deep learning-based fault prediction model was presented in [23]; it uses a tree-structured LSTM that matches the abstract syntax tree representation of the source code. The authors evaluated their model using Samsung data and a public PROMISE repository dataset. In [34], the author estimates the reliability improvement promised by the recently suggested SDAFlex&Rel software development methodology, which aims to create reliable but flexible software. By laying the groundwork for formal modelling, refinement, and verification, which in turn avoid and eliminate possible faults, that methodology increases the dependability of software.
On the other hand, Sharma and Chandra used object-oriented metrics to find the main factors in fault prediction [24]. Pedregal [25] presents several time series forecasting methods, such as regression, Naïve methods, ARIMA, transfer functions, VAR(X), exponential smoothing (ETS), and unobserved components (UC) models, which are used to develop a user-friendly MATLAB-based graphical interface tool for automatic outlier identification and detection. Regression techniques are also applied to prediction analysis, which provides a different approach to the software prediction process. For example, linear regression is used in optical network fault tracing to reduce the high cost and the fault detection time [26].
In this paper, we present different transformation methods as pre-processing for real datasets and apply linear regression, Naïve Gauss, and exponential smoothing time series forecasting methods to predict software faults. Finally, we compare the outcomes from various perspectives with existing popular SRGMs.
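The overall idea can be sketched as follows. The sketch assumes, for illustration only, that the transformed cumulative fault count is regressed on the testing day, that the Anscombe transformation is inverted algebraically, and that the relative error is |predicted − actual| / actual. The data and the helper names (fit_line, predict_faults) are hypothetical, and this is not necessarily the exact procedure evaluated in this paper.

```python
import math


def anscombe(x):
    """Anscombe transform for Poisson counts."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)


def inverse_anscombe(y):
    """Plain algebraic inverse of the Anscombe transform (an assumption here)."""
    return (y / 2.0) ** 2 - 3.0 / 8.0


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def relative_error(predicted, actual):
    """Relative error of one prediction against the observed count."""
    return abs(predicted - actual) / actual


def predict_faults(days, faults, future_days):
    """Fit on transformed counts, then map predictions back to counts."""
    slope, intercept = fit_line(days, [anscombe(f) for f in faults])
    return [inverse_anscombe(slope * d + intercept) for d in future_days]


# Hypothetical cumulative fault counts per testing day (illustration only).
days = list(range(1, 11))
faults = [3, 7, 12, 18, 25, 33, 42, 52, 63, 75]

train_n = 7  # observation point: fit on the first 7 days, predict the rest
preds = predict_faults(days[:train_n], faults[:train_n], days[train_n:])
errors = [relative_error(p, a) for p, a in zip(preds, faults[train_n:])]
average_relative_error = sum(errors) / len(errors)
print(preds, average_relative_error)
```

Moving the observation point (train_n) earlier gives a longer prediction horizon, which is one simple way to compare short-term and long-term behavior.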
2-1- Non-Homogeneous Poisson Process-based SRGM
Let G(t) denote the cumulative number of software faults detected by time t. Suppose that (i) G(0) = 0, (ii) G(t) has independent increments, (iii) Pr[G(t+q) − G(t) ≥ 2] = o(q), and (iv) Pr[G(t+q) − G(t) = 1] = λ(t; θ)q + o(q). Here, λ(t; θ) is the intensity function of the non-homogeneous Poisson process (NHPP), θ is the model parameter, and o(q) is the higher-order term of the infinitesimal time q. Then the probability that G(t) = y is given by,
$$\Pr[G(t) = y] = \frac{[M(t;\theta)]^{y}}{y!}\,\exp[-M(t;\theta)], \quad y = 0, 1, 2, \ldots \qquad (1)$$
where mean value function is,