A Genetic Algorithm for Empirical Variable Selection
(Text taken in part from S. Vinterbo and L. Ohno-Machado. A genetic algorithm to
select variables in logistic regression: example in the domain of
myocardial infarction. Journal of the American Medical Informatics
Association, 6(Suppl.):984-988, 1999.
[Available here])
Introduction
Predictive models can be compared
in terms of performance, robustness, explanatory
power, and cost. Performance is often measured
by discriminatory ability (e.g., area under the Receiver Operating
Characteristic, or ROC, curve)
and calibration (e.g., plots of expected versus observed
results). Robustness can be interpreted as
the ability of the model to generalize to other data
and to maintain good performance in the presence of
uncertainty and/or missing data items.
Explanatory power is the ability to explain dependencies in
the data and in the model's results. Cost
can be measured as an aggregate of the costs of obtaining
the required information.
A factor that contributes to performance, robustness, explanatory
power, and cost is the parsimony of
the model. A smaller model (in terms of the number of variables) is
likely to (i) avoid over-fitting
problems, thus performing and generalizing better, (ii) be less likely
to fail due to missing data,
(iii) be easier to explain, and (iv) cost less, both in
terms of data collection and computational effort.
Logistic Regression
Traditionally, variable selection for logistic regression models
has been addressed by stepwise forward,
backward, and composite selection methods; the SAS
statistical software system calls these methods
"forward", "backward", and "stepwise", respectively.
Although well understood and relatively easy
to compute, these methods consider the addition
or removal of one variable at a time, conditional
on the variables already selected. This sequential
approach severely restricts the number of models examined.
Another approach is to examine all possible models, but given u
variables to choose from,
the number of possible models is of the order of 2^u,
which renders the exhaustive approach infeasible for all but
small numbers of variables.
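The stepwise methods themselves are readily available in R; here is a
minimal sketch using the built-in step() function, assuming a data
frame d with a binary outcome y and (hypothetical) candidate
predictors x1, x2, x3:

# Fit the boundary models once; step() then searches between them.
full <- glm(y ~ x1 + x2 + x3, data = d, family = binomial)  # all variables
null <- glm(y ~ 1, data = d, family = binomial)             # intercept only

step(full, direction = "backward")                           # SAS "backward"
step(null, scope = y ~ x1 + x2 + x3, direction = "forward")  # SAS "forward"
step(null, scope = y ~ x1 + x2 + x3, direction = "both")     # SAS "stepwise"

Note that step() selects by AIC rather than by the significance tests
SAS uses, but the one-variable-at-a-time character of the search is
the same.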
A Genetic Algorithm Approach
A genetic algorithm can be used as a heuristic for function
optimization where the extrema of the function (i.e.,
minima or maxima) cannot be established analytically. A population of
potential solutions is refined iteratively, employing a strategy inspired
by Darwinian evolution and natural selection. Genetic algorithms
promote "survival of the fittest".
Given an initial population, often created randomly, the principal
steps of a genetic algorithm are:
- Select parents from the current population to
undergo genetic operations to form offspring.
This is done stochastically, with preference given to individuals
that yield higher function values (i.e., the "fittest" individuals).
- Apply genetic operations such as crossover,
mutation, and inversion to the selected parents to form offspring
(these operators are sketched in R after this list). The
operators are designed so that properties of the parents are
reproduced in the offspring.
- Recombine the offspring and the current population to form a new
population.
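To make the operators concrete, here is a minimal R sketch for binary
chromosomes; in the variable selection setting below, a chromosome is
a 0/1 inclusion vector over the u candidate variables. The operator
designs are common textbook choices assumed for illustration, not
necessarily those used in the programs below.

# One-point crossover: splice two parents at a random cut point.
crossover <- function(p1, p2) {
  cut <- sample(length(p1) - 1, 1)          # cut point in 1..(u-1)
  c(p1[1:cut], p2[(cut + 1):length(p2)])
}

# Bit-flip mutation: flip each gene independently with probability pm.
mutation <- function(p, pm = 0.05) {
  flip <- runif(length(p)) < pm
  ifelse(flip, 1 - p, p)
}

# Inversion: reverse the gene order between two random positions.
inversion <- function(p) {
  pts <- sort(sample(length(p), 2))
  p[pts[1]:pts[2]] <- rev(p[pts[1]:pts[2]])
  p
}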
These steps are repeated until some predefined
stopping criterion is met. This method of selection
from a population of potential solutions, with preference given to the
"fittest" individuals, has given these types of algorithms the name
"genetic", or sometimes "evolutionary", algorithms. The individuals
in a population are often called "chromosomes",
built out of "genes" that represent the properties
of the individual, and the function to optimize is
referred to as a "fitness" function. Each iteration
is called a "generation". A pseudo-code skeleton for
a genetic algorithm applying crossover, mutation,
and inversion as genetic operators is shown here:
P <- initializePopulation()                # random initial population
evaluate(P)                                # compute each individual's fitness
while (not stop(P)) do                     # until the stopping criterion is met
    Parents[1..3] <- selectParents(P)      # fitness-biased selection
    Offspring[1] <- Crossover(Parents[1])
    Offspring[2] <- Mutation(Parents[2])
    Offspring[3] <- Inversion(Parents[3])
    P <- recombine(P, Offspring[1..3], Parents[1..3])
    evaluate(P)
done
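To show how the skeleton fits together, here is a minimal, runnable R
sketch that reuses the operator functions above. The population size,
the fitness-proportional selection scheme, the fixed generation budget,
and the replace-the-worst recombination are all assumed choices for
illustration; this is not the ooga.r framework itself.

run_ga <- function(fitness, u, pop_size = 20, generations = 50) {
  # Random initial population: pop_size chromosomes of u binary genes.
  P   <- matrix(rbinom(pop_size * u, 1, 0.5), nrow = pop_size)
  fit <- apply(P, 1, fitness)
  for (g in seq_len(generations)) {   # fixed budget as the stopping criterion
    # Fitness-proportional ("roulette wheel") selection of parents.
    prob <- fit - min(fit) + 1e-9     # shift so all weights are positive
    idx  <- sample(pop_size, 4, replace = TRUE, prob = prob)
    off  <- rbind(crossover(P[idx[1], ], P[idx[2], ]),
                  mutation(P[idx[3], ]),
                  inversion(P[idx[4], ]))
    # Recombine: the offspring replace the three least fit individuals.
    worst      <- order(fit)[1:3]
    P[worst, ] <- off
    fit[worst] <- apply(off, 1, fitness)
  }
  P[which.max(fit), ]                 # best chromosome found
}

# Example: maximizing the number of 1-genes recovers the all-ones chromosome.
run_ga(sum, u = 10)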
A report of experiments with such an algorithm can be found in
S. Vinterbo and L. Ohno-Machado. A genetic algorithm to
select variables in logistic regression: example in the domain of
myocardial infarction. Journal of the American Medical Informatics
Association, 6(Suppl.):984-988, 1999.
[Available here]
Implementation in R
What follows is a description of an implementation of the above scheme in
the R language. R is an environment for exploratory data analysis,
very similar to S and S-Plus. It is Free Software and is available,
with documentation, from the
R-Project home page.
For which platform
Any platform capable of running the R system. This includes most Unix
flavours, Linux, and Windows.
Distribution: Under what conditions
The programs are free software distributed under the
GNU General Public
Licence version 2.
Important: they come with ABSOLUTELY NO WARRANTY; for details read the
licence.
The ooga.r, lrgasel.r and cindex.r programs are free software, and you are
welcome to redistribute them under certain conditions; read the licence for details.
Code
General GA Framework
A general framework for a GA, built using the object-orientation
facilities in R, can be found here.
Specialization to Variable Selection for LR
A specialization of the GA framework to LR can be found here.
The program depends on cindex.r,
which, among other things, contains an implementation of Hanley and
McNeil's algorithm for computing the c-index
(see
J. A. Hanley and B. J. McNeil. The meaning
and use of the area under a receiver operating characteristic (ROC)
curve. Radiology, 143:29-36, 1982).
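For illustration, here is a hedged sketch of how such a c-index
computation and a parsimony-rewarding fitness function might look in R.
The rank-sum identity below is the standard one, but the function names,
the penalty weight, and the interface are assumptions, not the actual
code in cindex.r or lrgasel.r.

# C-index (area under the ROC curve) via the rank-sum identity;
# pred holds model scores and y the 0/1 outcomes. Midranks handle ties.
cindex <- function(pred, y) {
  r  <- rank(pred)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# A possible fitness for variable selection: fit a logistic regression on
# the variables the chromosome switches on, score it by c-index, and
# subtract a small (assumed) penalty per variable to reward parsimony.
fitness_lr <- function(chrom, x, y, penalty = 0.01) {
  if (sum(chrom) == 0) return(0)                 # empty model: worst fitness
  dat <- data.frame(y = y, x[, chrom == 1, drop = FALSE])
  fit <- glm(y ~ ., data = dat, family = binomial)
  cindex(fitted(fit), y) - penalty * sum(chrom)
}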
Enjoy