Projects
Stable Prediction of Defect-Inducing Software Changes (SPDISC)
Principal investigator: Dr. Leandro L. Minku.
Keywords: software defect prediction, concept drift,
ensembles of learning machines.
Funding: Engineering
and Physical Sciences Research Council (EPSRC).
Context: software systems have become ever
larger and more complex. This inevitably leads to
software defects, whose debugging is estimated to cost
the global economy 312 billion USD annually. Reducing
the number of software defects is a challenging
problem, and is particularly important considering the
strong pressure towards rapid delivery. Such pressure
means that not all parts of the source code can
receive an equally large amount of inspection and
testing effort.
With that in mind, machine learning approaches have
been proposed for predicting defect-inducing changes
in the source code as soon as these changes finish
being implemented. Such approaches could enable
software engineers to target special testing and
inspection attention towards parts of the source code
most likely to induce defects, reducing the risk of
committing defective changes.
Problem: the predictive performance of existing
approaches is unstable, because the underlying defect
generating process being modelled may vary over time
(i.e., there may be concept drift). This means that
practitioners cannot be confident about the prediction
ability of existing approaches -- at any given point
in time, predictive models may be performing very well
or failing dramatically.
Aim and vision: SPDISC aims at creating more
stable models for predicting defect-inducing changes,
through the development of a novel machine learning
approach for automatically adapting to concept drift.
When integrated with software versioning systems, the
models will provide early, reliable and automated
defect-inducing change alerts throughout the lifetime
of software projects.
Impact: SPDISC will enable a transformation in
the way software developers review and commit their
changes. By creating stable models to make software
developers aware of defect-inducing changes as soon as
these are implemented, it will allow targeted
inspection and testing attention towards
defect-inducing code throughout the lifetime of
software projects. This will reduce the debugging cost
and ultimately lead to better software quality.
Proposed approach: an online learning
algorithm will be developed to process incoming data
as they become available, enabling fast reaction to
concept drift. Concept drift will be detected using
methods designed to cope with class imbalance, which
typically occurs in prediction of defect-inducing
software changes. Class imbalance refers to the issue
of having a much smaller number of defect-inducing
changes than the number of safe changes. The proposed
approach will also make use of data from different
projects (i.e., transfer learning between domains) to
speed up adaptation to concept drift.
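As a rough illustration of how class imbalance can be tracked on a stream, the sketch below keeps time-decayed class proportions and derives a training weight that emphasises whichever class is currently the minority. It is not the SPDISC algorithm: the `DecayedClassTracker` class, its `decay` parameter and the weighting rule are assumptions made only for this example.

```python
from typing import Dict


class DecayedClassTracker:
    """Hypothetical helper: tracks time-decayed class proportions on a stream."""

    def __init__(self, decay: float = 0.9) -> None:
        # decay close to 1 forgets slowly; smaller values react faster to change.
        self.decay = decay
        # Labels: 1 = defect-inducing change, 0 = safe change.
        self.proportion: Dict[int, float] = {0: 0.5, 1: 0.5}

    def update(self, label: int) -> None:
        """Fold one newly labelled change into the decayed proportions."""
        for cls in self.proportion:
            seen = 1.0 if cls == label else 0.0
            self.proportion[cls] = (
                self.decay * self.proportion[cls] + (1.0 - self.decay) * seen
            )

    def training_weight(self, label: int) -> float:
        """Weight > 1 for the class that is currently the minority, else 1."""
        other = 1 - label
        own = max(self.proportion[label], 1e-12)  # guard against division by zero
        return max(1.0, self.proportion[other] / own)
```

An online learner could then train on each incoming change with `training_weight(label)` as its importance. Because the proportions decay over time, the weights follow changes in the imbalance itself.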
Novelty: SPDISC is the first proposal to look
into the stability of predictive performance over time
in the context of defect-inducing software changes.
Most previous work ignored the fact that predictions
are required over time, and so overlooked the
instability of predictive performance in this problem.
To deal with instability, SPDISC will develop the
first online transfer learning approach for predicting
defect-inducing software changes.
Ambitiousness: online transfer learning
between domains with concept drift is not only a very
new area of research in software engineering, but also
in machine learning. Very few approaches exist for
that, and none of them can deal with class-imbalanced
problems. Therefore, SPDISC will not only advance
software engineering by enabling a transformation in
the way software developers review and commit their
changes, but also advance the area of machine learning
itself.
Timeliness: given the current size and
complexity of software systems, the increased number
of life-critical applications, and the high
competitiveness of the software industry, approaches
for improving software quality and reducing the cost
of producing and maintaining software are currently of
utmost importance.
Experience-based COmputation: Learning to optimisE (ECOLE)
Principal investigator: Prof. Xin Yao.
Funding: European Union’s Horizon 2020 research and innovation programme under grant agreement No 766186.
Website: https://ecole-itn.eu/
ECOLE is an Innovative Training Network (ITN) for early-stage researchers (ESRs), coordinated by the University of Birmingham. It is based on novel synergies between nature-inspired optimisation and machine learning. The training programme will be targeted at the automotive industry, but the skill set of the ESRs will be equally valuable to other fast-moving, innovative industries. This four-year programme will yield a new generation of high-achieving ESRs who will be provided with the transferable skills necessary for thriving careers in emerging and rapidly developing industrial areas.
Dynamic Adaptive Automated Software Engineering (DAASE)
Principal investigator: Prof. Xin Yao.
Keywords: software project estimation, project
scheduling problem, ensembles of learning machines,
online learning, concept drift, evolutionary
algorithms.
Funding: Engineering
and Physical Sciences Research Council (EPSRC).
My work on this project completed in August 2015.
DAASE aims to create a new approach to software
engineering which places computational search at the
heart of the process and products it creates and
embeds adaptivity into both. This new approach will
produce software that is dynamically adaptive, being
not only able to respond to and fix problems that
arise before deployment and during operation, but also
to continually optimise, re-configure and evolve to
adapt to new operating conditions, platforms and
environmental challenges. DAASE will create an array
of new processes, methods, techniques and tools for
this new kind of software engineering, radically
transforming both theory and practice of software
engineering. As part of it, DAASE will develop a
hyper-heuristic approach to adaptive automation. A
hyper-heuristic is a methodology for selecting or
generating heuristics. Most heuristic methods in the
literature operate on a search space of potential
solutions to a particular problem. However, a
hyper-heuristic operates on a search space of
heuristics.
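The distinction above can be made concrete with a small selection hyper-heuristic. This is only a sketch under stated assumptions, not DAASE's approach: the toy problem (moving an integer towards a fixed target), the four low-level heuristics and the score-update rule are all hypothetical. The point is that the search chooses among heuristics rather than among solutions directly.

```python
import random
from typing import Callable, Dict, Tuple

Heuristic = Callable[[int], int]


def cost(x: int, target: int = 37) -> int:
    """Toy objective (assumed for illustration): distance to a fixed target."""
    return abs(x - target)


# Low-level heuristics: each one modifies a candidate solution.
LOW_LEVEL_HEURISTICS: Dict[str, Heuristic] = {
    "plus_one": lambda x: x + 1,
    "minus_one": lambda x: x - 1,
    "plus_ten": lambda x: x + 10,
    "minus_ten": lambda x: x - 10,
}


def selection_hyper_heuristic(
    start: int, steps: int, seed: int = 0
) -> Tuple[int, Dict[str, float]]:
    """Pick a heuristic in proportion to its score and reward improvements."""
    rng = random.Random(seed)
    scores = {name: 1.0 for name in LOW_LEVEL_HEURISTICS}
    current = start
    for _ in range(steps):
        names = list(scores)
        name = rng.choices(names, weights=[scores[n] for n in names])[0]
        candidate = LOW_LEVEL_HEURISTICS[name](current)
        if cost(candidate) < cost(current):
            current = candidate  # accept improving moves only
            scores[name] += 1.0  # reinforce the successful heuristic
        else:
            scores[name] = max(0.1, scores[name] * 0.5)  # penalise, keep a floor
    return current, scores
```

Calling `selection_hyper_heuristic(0, 200)` returns the final solution together with the learnt heuristic scores. The scores, not the solution, are what the hyper-heuristic level adapts.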
Currently, I am researching adaptive software
prediction. Software prediction tasks are of strategic
importance for software developing companies. An
example of such a task is software effort estimation.
Overestimations may result in a company losing
contracts or wasting resources, whereas
underestimations may result in poor quality, delayed
or unfinished software systems. Most software
prediction research neglects the fact that software
prediction tasks operate in online changing
environments. Models are typically trained on a set of
projects and evaluated on another set of projects,
without considering whether the training projects were
really available before the testing projects. Besides
possibly leading to incorrect conclusions, this
results in inflexible prediction approaches that
become obsolete with time. I am currently
investigating the type of changes suffered by software
prediction tasks and proposing new approaches to
quickly adapt to these changes.
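The evaluation issue described above can be illustrated with a chronological procedure in which each project is predicted using only projects that finished before it. This is a minimal sketch, not the method used in the project: the `chronological_errors` function, the single completion timestamp per project and the mean-of-past-efforts predictor are assumptions chosen for brevity.

```python
from typing import List, Tuple

Project = Tuple[int, float]  # (completion time, actual effort)


def chronological_errors(projects: List[Project]) -> List[float]:
    """Absolute error per project, predicting from earlier projects only."""
    ordered = sorted(projects, key=lambda p: p[0])
    past_efforts: List[float] = []
    errors: List[float] = []
    for _, effort in ordered:
        if past_efforts:
            # Placeholder predictor: mean effort of projects seen so far.
            prediction = sum(past_efforts) / len(past_efforts)
            errors.append(abs(effort - prediction))
        # Only now does this project become available for training.
        past_efforts.append(effort)
    return errors
```

The first project yields no error because nothing is available to train on. Any real predictor could replace the mean, as long as it only sees `past_efforts`.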
Software Engineering by Automated Search (SEBASE)
Principal investigator: Prof. Xin Yao.
Keywords: software effort estimation, project
scheduling problem, ensembles of learning machines,
evolutionary algorithms, online learning, concept
drift.
Funding: Engineering
and Physical Sciences Research Council (EPSRC).
Project completed in December 2011.
Online Ensemble Learning in the Presence of Concept Drift
Supervisor: Prof. Xin Yao.
Keywords: concept drift, online learning, ensembles of
learning machines.
Funding: Overseas Research Students (ORS) Award and
School Research Studentship (School of Computer
Science, The University of Birmingham).
Completion: 2010.
Degree congregation: 2011.
Most machine learning algorithms operate in offline
mode. They first learn how to perform a certain task,
and then are used to perform this task. However, most
practical problems change with time, i.e., they suffer
concept drift. For example, the problem of predicting
users' preferences in information filtering systems
may involve changes in users' preferences; the problem
of classifying webpages may involve changes in the
most representative words of different webpage
categories; the problem of credit card approval may
involve changes in customers' reliability. Unlike
offline learning algorithms, online learning
algorithms can be used to adapt to concept drifts
based on newly incoming training examples. These
algorithms do not have a separate training and testing
phase, but learn throughout their lifetime as they are
used to perform a certain task, by processing each new
training example separately and then discarding it.
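The learn-as-you-go behaviour described above can be sketched as a test-then-train loop. This is an illustration only, not the thesis's method: `OnlinePerceptron` and `prequential_accuracy` are hypothetical names, and the perceptron is a stand-in for any online learner.

```python
from typing import Iterable, List, Tuple

Example = Tuple[List[float], int]  # (features, label in {0, 1})


class OnlinePerceptron:
    """Hypothetical stand-in for any online learner."""

    def __init__(self, n_features: int, learning_rate: float = 1.0) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> int:
        activation = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if activation > 0 else 0

    def learn_one(self, x: List[float], y: int) -> None:
        # Update from a single example; the example is not stored afterwards.
        error = y - self.predict(x)
        if error != 0:
            self.weights = [
                w + self.learning_rate * error * v
                for w, v in zip(self.weights, x)
            ]
            self.bias += self.learning_rate * error


def prequential_accuracy(
    model: OnlinePerceptron, stream: Iterable[Example]
) -> float:
    """Test on each incoming example first, then train on it."""
    correct = 0
    total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)
        model.learn_one(x, y)
        total += 1
    return correct / total if total else 0.0
```

If the labels in the stream are later swapped (a simple abrupt drift), the same model keeps updating and eventually follows the new concept without a separate retraining phase.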
Due to the practical need for adaptive learning
systems, there has been an increasing number of works
on online learning algorithms able to deal with
concept drift. In particular, online ensembles of
learning machines have been used. However, there has
been no deep study of why they can be helpful for
dealing with drifts and which of their features can
contribute to that. This thesis mainly investigates
not only how ensemble diversity affects accuracy in
online learning in the presence of concept drift, but
also how to use diversity in order to significantly
improve accuracy in changing environments. This is the
first diversity study in the presence of concept
drift. The main contributions of the thesis are:
- A better understanding of when, how and why ensembles of learning machines can help to handle concept drift in online learning. This study shows that one reason for ensembles to be helpful for dealing with concept drifts is the diversity among their base learners. Diversity is even more important in changing environments than in static environments. A proper level of diversity at each different environmental condition can significantly reduce the test errors of the learning machines as follows. Before a drift, ensembles with less diversity obtain lower test errors. On the other hand, it is a good strategy to maintain very highly diverse ensembles to obtain lower test errors shortly after a drift independent of the type of drift, even though high diversity is more important for more severe drifts. Longer after a drift, high diversity becomes less important. High diversity by itself can help to reduce the initial increase in error caused by a drift, but does not provide faster recovery from drifts in the long-term.
- Knowledge of how to use information learnt before a concept drift to aid the learning after a concept drift. Previous works have never attempted to use information learnt before a concept drift to aid the learning after a concept drift. However, a good learning machine for changing environments should not only avoid using outdated information, but also be able to use information previously learnt whenever it becomes useful. This thesis shows that ensembles trained before a concept drift with very high diversity can be used to transfer useful information learnt from the old concept to the new concept. Information learnt before a concept drift is shown to be helpful for the learning process after the drift when the drift is slow or does not cause too many changes. Very highly diverse ensembles perform well in comparison to other strategies after these concept drifts as long as low diversity is enforced after the concept drift.
- A new online ensemble learning approach called Diversity for Dealing with Drifts (DDD). Based on the deep diversity studies summarized above, a new approach called DDD was proposed. A good learning approach for changing environments should: (i) maximize performance when the concept is stable; (ii) minimize the drop in performance when there is concept drift; (iii) quickly recover from concept drifts; and (iv) efficiently use information previously learnt whenever it is beneficial. DDD was carefully designed to use ensemble diversity to meet these requirements. It maintains ensembles with different levels of diversity, which are automatically emphasized during the environmental states in which they are helpful. In this way, DDD is robust to different types of concept drift. A study based on both artificial data sets and real world data sets in the domains of credit card approval, computer network intrusion detection and electricity price trend prediction showed that DDD was able to outperform state-of-the-art approaches. In the experimental comparisons carried out, DDD performed at least as well as other drift handling approaches under various conditions, with very few exceptions. Furthermore, DDD was shown to be very robust to false positive drift detections, outperforming other drift handling approaches in terms of accuracy under these conditions.
- A new concept drift categorisation to allow principled studies of drifts. The existing literature presented very heterogeneous and ambiguous categorisations of concept drifts. In this thesis, a categorisation using mutually exclusive and unambiguous categories was proposed. Drifts are categorised according to different criteria in order to aid the development and evaluation of approaches to deal with concept drifts.
- An analysis of negative correlation in incremental learning. This thesis also presents
a study of the suitability of ensembles based on
negative correlation learning for incrementally
learning new chunks of training examples under stable
environments. It shows that even though it is possible
to use negative correlation learning for that,
chunk-based incremental learning approaches face a
difficult trade-off between avoiding catastrophic
forgetting under periods of stability and attaining
plasticity when adaptivity to changes is needed.
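One possible way to obtain online ensembles with different levels of diversity, as discussed in the contributions above, is to vary the Poisson parameter in online bagging, where each base learner trains on each incoming example k times with k drawn from Poisson(lambda). The sketch below assumes this mechanism purely for illustration and is not an implementation of DDD; `sample_poisson` and `presentation_counts` are hypothetical helper names.

```python
import math
import random
from typing import List


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's algorithm for drawing from Poisson(lam); fine for small lam."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def presentation_counts(
    n_learners: int, n_examples: int, lam: float, seed: int = 0
) -> List[List[int]]:
    """counts[i][t] = how many times learner i would train on example t."""
    rng = random.Random(seed)
    return [
        [sample_poisson(lam, rng) for _ in range(n_examples)]
        for _ in range(n_learners)
    ]
```

With a small lambda, each learner skips most examples and the learners see largely different subsets of the stream, which tends to make them disagree more. With lambda equal to 1 the learners see more similar data.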
EFuNN Parameters Optimisation and EFuNN Ensembles Construction
Supervisor: Prof. Teresa B.
Ludermir.
Keywords: online parameters optimisation, numeric
parameters optimisation, fuzzy neural networks,
ensembles of neural networks.
Funding: Brazilian Council for Scientific and
Technological Development (CNPq).
Degree congregation: 2006.
Evolving Connectionist Systems (ECoSs) are systems
composed of one or more neural networks whose
structures adapt according to the data in a continuous
interaction with the environment. Evolving Fuzzy
Neural Networks (EFuNNs) are ECoSs which combine the
functional characteristics of neural networks with the
power of fuzzy logic. Fuzzy systems have been shown
to be very efficient at representing and reasoning about
uncertain knowledge. This is very important because
human knowledge is often uncertain.
A key challenge in Artificial Intelligence is to
create systems that are able not only to represent
human knowledge and reason about it, but also to
evolve and adapt their structures in a changing
environment. This kind of system is able to model
processes that continually develop and change over
time, e.g., biological data processing, electricity
load forecasting and adaptive speech recognition. A
system with these characteristics needs to be able to
tune its parameters in an on-line manner, according to
the environment. EFuNNs have some adaptable parameters
and their structures can also adapt according to
incoming data. However, they still have many
parameters that are fixed before learning and have
great influence on their results. The problem with using a
fixed set of parameters is that a set that is optimal for a
particular state of the environment can be unsuitable
when the state of the environment changes.
In this work, two new techniques which use
evolutionary algorithms to evolve the EFuNN parameters
in an on-line manner were developed. These techniques
are able to create fuzzy systems that are completely
tunable, according to unpredictable and unknown
environments. The techniques were shown to achieve
better accuracy than existing techniques in the
literature for evolving EFuNN parameters in an on-line
manner.
Besides the necessity to create new techniques to
allow changing environments to be represented, it is
always important to develop approaches with increasing
generalization capabilities and lower execution time.
Ensembles of neural networks have been formally and
empirically shown to outperform systems composed of
only one neural network. Thus, this work also proposes
a new approach to create ensembles of neural networks,
e.g., ensembles of EFuNNs. The approach uses a
clustering method and co-evolutionary algorithms to
create the ensembles in an innovative way, explicitly
partitioning the input space, in order to allow the
networks that compose the ensemble to specialise in
different parts of it and work in a divide-and-conquer
manner. The approach was shown to improve on the
accuracy of single EFuNNs generated using evolutionary
algorithms similar to the co-evolutionary algorithms
used in the approach. Furthermore, the execution time
of the approach is lower than the execution time of
evolutionary algorithms to generate single EFuNNs.