Fine-grained Parallel Genetic Algorithms in Charm++

by Martin Pelikan, Prasanna Parthasarathy, and Arun Ramraj

Abstract

This paper describes our implementation of a fine-grained parallel genetic algorithm. The paper begins by introducing the basic mechanics of genetic algorithms and discussing different ways to parallelize algorithms. The fine-grained genetic algorithm implemented in Charm++, a message-driven parallel language based on C++, is described. Our implementation is fully asynchronous and distributed. Thus, it should scale well, even for a very large number of processors. We present performance results for up to 64 processors on an Origin2000 to verify our scaleability hypothesis. In most tested cases, super linear or linear speed enhancements have been obtained.

Introduction

Genetic algorithms [2,4] are a robust optimization tool that can be used to solve a wide range of difficult problems efficiently and accurately. Three basic operators drive the search in genetic algorithms. Selection biases the search towards promising solutions. Quality of each individual is given by the fitness function that is specific to the problem at hand. Crossover and mutation introduce variation by combining the current set of promising solutions.

Recent studies have shown that parallelization can significantly improve the performance of genetic algorithms [1, 2,3,7]. Many search operators are easily distributed and thus great speed-ups can be obtained for difficult problems. There are two general approaches to parallelizing genetic algorithms, master-slave and fine or coarse grain. A master-slave architecture is often employed when fitness evaluation (which determines quality of each solution) is the most expensive operator. Fitness evaluations are distributed among a number of slave processors and all remaining operators (mutation, crossover, and selection) are performed on the master processor. In fine-grained and coarse-grained parallel genetic algorithms, the population is divided into subpopulations organized in a network. All operators in each subpopulation are executed by a separate processing element containing the subpopulation. Different subpopulations can communicate via diverse network topologies and communication patterns.

The purpose of this paper is to describe our implementation of a versatile fine-grained parallel genetic algorithm and demonstrate its scaleability. Our primary goals included design and implemention of an easily extensible system to solve diverse optimization problems. The implementation allows solutions represented by binary strings and decision graphs with Boolean, real-valued, and integer attributes. Fitness functions include a simple linear problem for binary strings and classification of data sets which are dynamically loaded from a specified data file. Any of the components can easily be extended by deriving a new class for the representation and fitness functions. A configuration file allows a user to select a problem to solve, a representation, and other parameters. This paper does not analyze the efficiency or applicability of genetic algorithms as a general optimization tool, but proposes a way to parallelize one of its variants to achieve good scaleability.

Genetic Algorithms

A genetic algorithm [2,4] evolves a population of potential solutions to solve a given problem. The first population of solutions is generated at random. A measure of quality of solutions, usually expressed in the form of one or multiple functions, is used to select the better solutions from the current population. The selected solutions undergo the operators of mutation and crossover in order to create a population of new solutions (the offspring population). The process is repeated until the termination criteria (e.g., convergence to a singleton) given by the user are met.

The algorithms proceed by repeatedly biasing the search to regions of higher quality and combining promising pieces of solutions to form novel combinations. In this fashion, the problems that are decomposable into problems of bounded difficulty can be solved very efficiently (in sub-quadratic or quadratic time). Although many problems are not clearly decomposable, many interesting problems can indeed be solved in this manner. This holds even for many instances of NP-complete problems, although the algorithm would require exponential time in the worst case.

Genetic algorithms have been successfully applied in a number of areas, including engineering design, business, telecommunication, scheduling, and machine learning. Many difficult problems can be tackled in sub-quadratic time with respect to the number of variables. However, for very big problems even sub-quadratic time results in large computational requirements. In order to solve these difficult problems faster, a large research effort has resulted in the study of parallelizing genetic algorithms. The next section discusses some of these approaches.

Parallel Genetic Algorithms

Parallel genetic algorithms can be classified into two major classes [1, 2,3,7]. The first class, master-slave parallel genetic algorithms, distributes fitness evaluations among slave processors. Information is collected by the master processor which processes the population of candidate solutions applying selection, crossover, and mutation operators. It then sends the new solutions to its slaves for evaluation.

The second class is the fine-grained and coarse-grained parallel genetic algorithms, where the population is distributed among a number of processors. Each processor processes its subpopulation. Subpopulations are allowed to communicate and exchange the solutions at some intervals. Various connection topologies and communication patterns can be used. Fine-grained genetic algorithms divide the population into small subpopulations containing only one or a couple of solutions which are connected in a grid topology (usually, 2D grids are used). The individuals communicate only within their neighborhood. The original purpose of fine-grained genetic algorithms was to use a large number of small processors, where each processor would process only one individual.

In coarse-grained genetic algorithms the subpopulations are bigger and communication is usually reduced to exchange solutions only once in a while or after the algorithms have converged. Common topologies for communication include rings, grids, hypercubes, fully-connected graphs, and random graphs with a fixed number of links for each processing element.

Each of the parallelizations can yield a significant improvement for some problems. If the fitness evaluation takes the majority of the computational resources, master-slave parallel genetic algorithms usually perform well. However, when the fitness evaluation does not require significantly more time than other operators, coarse-grained and fine-grained genetic algorithms can further improve the performance.

As mentioned above, originally fine-grained genetic algorithms were designed to run on machines with a large number of small processors connected in a grid topology. As the optimization proceeds, one would expect high-quality partial solutions to spread from location to location on the grid in a process similar to diffusion. This behavior is interesting in itself, may eliminate problems of premature convergence, and may also improve the execution of the algorithm on a single processor machine. These reasons motivate our decision to implement a fine-grained genetic algorithm.

In our implementation of the fine-grained genetic algorithm, the population of the candidate solutions is mapped to a 2D grid where each position of the grid may either contain a particular solution or be empty. In each iteration of the algorithm, all solutions undergo a crossover operator in which the individual is combined with a randomly picked neighbor from its neighborhood. The individual then undergoes a mutation operator that performs a slight perturbation to the current solution. If the new individual is of higher quality, then it replaces the original individual. This introduces the selection pressure that biases the search toward high-quality solutions. A local search operator can next be applied to tune the particular solution. In this fashion, a genetic algorithm does a global search while the local search operator tries to improve the individual within some small neighborhood. The next section explains the details of our parallel implementation.

What to Parallelize?

To make the implementation portable to a number of different parallel computers, from supercomputers to clusters of PCs, many attributes of the implementation can be tuned dynamically by specifying different input parameters.

We have chosen to use the Charm++ programming language [5], an object-oriented, message-driven, parallel language based on C++. In message-driven parallel languages, the execution, method invocation, and synchronization are performed by sending messages between the different processes. There are several advantages to Charm++. Charm++ is based on an object-oriented language, allowing us to code in a flexible and comprehensible environment. Charm++ also supports automated load balancing, which can automatically migrate some objects depending on the current workload. Additionally, the execution in Charm++ is driven by messages and therefore allows the creation of a completely distributed and asynchronous application, which should be scalable to a large number of processors.

Our implementation uses a 2D grid, where the grid is divided into a number of rectangles. The division is done automatically, but the user can specify the average number of rectangles to be processed by each processor on average. The default value is 4, i.e., each processor should contain 4 rectangles on average. Sub-grids can migrate among different processors depending on the current load. Neighboring sub-grids must communicate since processing each sub-grid requires knowing the individuals on the border of our sub-grid.

However, the processes in genetic algorithms are statistical in nature and, consequently, it is not necessary to always require the information in neighboring processors to be consistent. This allows us to create a completely asynchronous system that can greatly improve the overall performance.

In addition to distributing the grid with individuals, we have distributed the computation of the fitness function. There are many alternatives that the user can specify on the fly. The simplest alternative is when all individuals are evaluated locally by the object responsible for the part of the grid (the rectangle) where the individual is located. This solution requires no communication, unless fitness evaluation takes a large or dynamically changing amount of time, making this an efficient solution. The second alternative is to have one special object for each processor and have each rectangle send its evaluations to the evaluator object on its local node. This can be useful for more complex fitness functions.

The last alternative is to use a hierarchical scheme where there is one fitness manager for a certain number of processors. Each rectangle sends its individuals to its associated manager. There is one evaluator per processor. Depending on the load on each evaluator, the manager decides the evaluator that should receive the work and sends it back. This load can be computed based on the history of computation, since the manager knows about all transactions between the rectangles and evaluators.

Local search is performed by the hill climbers. There is one hill climber for each sub-grid. At the beginning, each hill climber requests a solution from the corresponding sub-grid. It then tries to search its neighborhood and improve the solution locally for a number of iterations. Once it cannot improve the solution anymore, it sends the solution back to the rectangle, where the resulting solution replaces a randomly chosen individual. The rectangle responds with another individual that will be a subject to local search. Hill climbers migrate with the rectangles so that the communication across different processors is minimized.

The current state of the execution is obtained by extracting the best overall solution, the average fitness, and other information extracted from the current population of solutions. In order to prevent a bottleneck in the implementation, we used a spanning tree whose nodes correspond to the rectangles and migrate with them. The information is moved upwards and the top left-most object outputs the information to the screen at certain intervals. By using this scheme, no synchronization is required, thus improving performance. Further details of the implementation are explained shortly.

Implementation Details

This section provides detailed information about our implementation. The section starts by providing a list of classes and a short description of the role of each class in the system. Next, messages and communication are discussed. A general design of the implementation is illustrated in Figure 1.
The figure shows the details of the implementation. A detailed
description of all components is provided in the text of the
paper.
Figure 1: Program design.

Individual class and its children

An individual in a genetic algorithm represents a candidate solution (e.g., a vector, or a classifier). Depending on the problem, the solution may be represented in different encoding schemes. Thus, we have an abstract Individual class that encapsulates the essential ingredients of an individual namely, the fitness and pure virtual functions like initialization, crossover and, mutation. Apart from this, all subclasses of Individual need to define their own method for packing the representation into a character buffer and unpacking into a meaningful individual, given a buffer. These methods facilitate convenient communication of Individuals for fitness evaluation and near-neighbor communication purposes, and also during migration of the FineGrainedGA object to which they belong. Each individual instance contains pointers to instances of the FitnessFunction class and the Random class present in the FineGrainedGA object. The Individual class also includes a pure virtual function that, in the child classes, does a local search for one iteration and returns a value whether or not to continue doing a local search.

The child classes that we have implemented are the BinaryStringIndividual, which represents the conventional binary bit string representation of solutions and the DecisionTree, which is for the decision tree representation of solutions for classification problems. The user specifies which representation to use in the input parameter file.

FinaGrainedGA class

The most important objects in the program belong to the FinaGrainedGA class. This class actually runs the GA operations. The FineGrainedGA class is organized as a two-dimensional char array and each instance of the class has a two-dimensional array of Individuals. The Individual is an abstract class and any of the derived classes can be instantiated based on the problem being solved (specified through the parameters file). The main object instantiates the FineGrainedGA objects that initialize individuals in the grid to randomly generated Individuals and perform initial evaluation. Once this is finished, the start() method is called for all the FineGrainedGA objects by the main object. The start() function is an entry method in the FineGrainedGA class, which goes through the GA generations in a loop, yielding at the end of each generation, messages that the object might have received.

In each GA generation, the following process repeats. For each column of individuals, we first duplicate them. Then, a crossover partner is selected from an 8-neighborhood and crossover is performed on the duplicate copies. This is followed by mutation. Once these alterations are done, the new individuals are sent for evaluation, again, column by column. The evaluation may take place in one of three places: local, FitnessManager objects ,or FitnessEvaluator object. When the evaluations are sent to FitnessEvaluator objects, they may come back reordered. Thus, the process is potentially asynchronous. To take advantage of this, the start() method processes all columns that are up to date, instead of processing columns in serial order.

After processing all current columns, the method reinvokes itself (via its proxy), providing other entry methods the chance to proceed. When the method is finished processing all the columns in one generation, the StateInformationCollector object is invoked and generation statistics are updated. The StateInformationCollector object also sends the four boundary rows/columns to the corresponding neighbors and increments the generation counter. Then, the method reinvokes itself.

There is also another entry method called at the end of processing, namely, finito(). The purpose of this method is to signal other objects to finish their processing due to obtaining the nest individual from some other processor. This eliminates processors from wasting time calculating the same solutions.

FitnessFunction class

The FitnessFunction class is again an abstract class, which any number of child classes may be derived in order to solve different problems. A child class must define the function evaluate(), which takes in a pointer to an object of Individual class and returns a double value representing the fitness of the given individual. We have at the moment, a OneMaxFitness function for the one-max problem that counts the number of ones in a given bit string and a ClassifierFitnessFunction class that represents the classification problem.

FitnessManager class

The FitnessManager class is a 1-D char array that tries to balance the load on a set of slave objects of the FitnessEvaluator class. The implementation is quite straightforward. When a message is sent to an evaluator, it increments a counter corresponding to that object and when the results are returned, the counter of the sender is decremented (which is encoded in the message). Thus, whenever the FitnessEvaluator sends some work, it goes through this array of loads and selects the evaluator with the least load at that moment. When the number of evaluators per manager is set to one, the evaluations are done by the manager itself rather than sending a message to an evaluator on the same processor.

FitnessEvaluator class

FitnessEvaluator objects are the slaves of mangers that receive messages containing individuals, evaluating each of them, and sending the results back to the manager. Since there is no advantage in having more than one evaluator per manger and since we need at least one per processor to utilize all the computational power, the FitnessEvaluator class is derived from the char Group class, which results in exactly one evaluator for every processor. Again, the member functions are trivial and as explained before, there is one for receiving messages which also grinds through the evaluation and there is one for returning the results.

HillClimber class and StateInformationCollector class

The HillClimber object is an autonomous object that is practically attached to every FineGrainedGA object. By attached, we mean that these objects are allocated in the same way as FineGrainedGA objects, namely in a 2D char array and of the same dimension as the FineGrainedGA objects. This property holds for the StateInformationCollector objects also. Whenever a FineGrainedGA object migrates to another processor, it also forces the migration of the attached objects.

The HillClimber object queries its attached FineGrainedGA object for individuals. Upon receiving one, the HillClimber tries to improve the individual by invoking the climbTheHill() member function of the individual. Once the HillClimber is finished climbing, it sends the improved (or same, in case of no progress) individual back and, in turn, receives another individual to process.

The StateInformationCollector objects form a spanning tree with the FineGrainedGA objects attached to its nodes. At the end of each generation, the StateInformationCollector object receives a message from the corresponding FineGrainedGA object through its pass() method. This then compares all messages it has received from its children and attached FineGrainedGA and passes the updated information to its parent. The StateInformationCollector object at the root of the tree consolidates everything and prints out the statistics. Since there is no synchronization, the root object of the spanning tree may receive new information quite often. To avoid excessive output, the user can specify a minimum delay for printing out the global status of the run.

Messages

There are several messages, each with a unique purpose. In Charm++, the messages are the only means of communication between processes. The most important message is the GAParamsMessage, which is created after parsing the input parameters and sent to all the objects when they are created. The FineGrainedGA objects each retain a copy so new individuals can be created even after the initialization. IndividualsMessage is used to send work to managers and near neighbor communication. FitnessValuesMessage is used to send results back to the FineGrainedGA objects. The StateInformationMessage is used in the spanning tree to obtain the best individual fitness and the average fitness. The FinalInfoMessage is used to signify the end of processing of a sub-grid to the main object. This also carries information of how much time each sub-grid took for the entire processing.

Performance

We have performed experiments on two problems. The first problem is a simple linear function defined on binary strings of fixed length. The fitness function of each binary string is the sum of its bits. The second problem is a classification problem where the task is to learn a classifier for predicting solar flares. The data set contains 323 data points, each with 2 character and 10 integer attributes. Each data point is classified in one of the 7 classes. The goal is to learn a decision graph that is capable of predicting the class for a new point. The data set was obtained from the Machine Learning Repository [6], donated by Gary Bradshaw.

The purpose of experiments was not to evaluate the algorithm from the optimization point of view, but rather to find out whether the algorithm scales up for the two substantially different classes of problems. Similar speed-up can be expected for other problems of the same class since the time spent by all operators should be comparable. The only difference can appear in the fitness evaluation which strongly depends on the problem at hand. We ran the algorithm with local evaluation and evaluation on the fitness managers.

Tested configurations for the binary problem are provided in Table 1. Each FineGrainedGA was run for 30 iterations. For the classification problem, we have used a grid of size 128x128 and let the algorithm run for 300 iterations (due the increased problem complexity). The number of processors ranged from 1 to 64. All experiments were performed on a SGI Origin 2000.

Grid size 128x128 256x256 384x384 512x512 640x640
String length 100 200 300 400 500
Table 1: Parameters of the binary problems used in experiments.
The scale-up curve for binary problems is shown in Figure 2 for local evaluation (within each FineGrainedGA) and in Figure 3 for evaluation on fitness managers. The best results were obtained for the algorithm with local evaluation. This is natural, since in one-max the fitness is computed quickly and any communication introduces extra overhead. For local communication, super linear speed-ups have been obtained. However, even for evaluation in fitness managers, we see that as the problem size increases, the speed-up gets closer to the ideal speed-up and the overall results are very close to the ideal curve.

The scale-up curve for the classification problem is shown in Figure 4 for local evaluation and for evaluation in fitness managers. In this problem, we see that evaluation done in managers results in quite a significant improvement of the speedup. This verifies the intuitive argument that distributing the evaluations and letting the FineGrainedGA objects do their processing in the mean time, results in better exploitation of the parallel processing power. We believe that for bigger problems the use of hierarchical scheme would pay off.

A speed-up
curve for binary problems withlocal fitness evaluation. More
discussion provided in the text of the paper.
Figure 2: Speed-up for binary problems with local evaluation.
A speed-up
curve for binary problems withevaluation in fitness managers. More
discussion provided in the text of the paper.
Figure 3: Speed-up for binary problems with evaluation in managers.
A speed-up
curve for the classification problem with both local evaluation as
well as in fitness managers. More discussion provided in the text of
the paper.
Figure 4: Speed-up for classification problem with evaluation in FineGrainedGA and FitnessManager.

Future Research

The implementation can be used in two major ways. First, the implementation can be used to solve various optimization problems. Problems on binary strings and decision trees can be solved using the given implementation as is.

Another use for the implementation is for studying the behavior of the fine-grained parallel genetic algorithm. Various aspects of the behavior are very interesting. One example is the takeover time, which is the time it takes a superior individual to take over the entire population. Other examples include mixing time, population sizing, convergence time, genetic drift, density of individuals in the grid. Looking at these various aspects of the behavior, the algorithm could be related to the existing theory for genetic algorithm with a global population. To our knowledge, no such studies have been done in the past.

The presented work has not analyzed the speed-up behavior for the evaluation distributed among the managers. An interesting topic of future research would be to look at the efficiency of distributed evaluation for different optimization problems.

Conclusions

The fine-grained genetic algorithm is an interesting alternative to other evolutionary algorithms working with one global population. The fine-grained genetic algorithm is especially interesting algorithm due to its ability to scale up to a very large number of processors.

Our implementation is fully distributed and asynchronous, which allows it to scale up almost perfectly even for large number of processors, and use the computational resources on different network topologies very efficiently. Empirical results confirmed our intuition that the algorithm speeds up very well. We believe that the algorithm would continue to speed up for more processors, since all work is evenly distributed and there is no bottleneck. However, more experiments should be performed to investigate the behavior of the implementation on different parallel machines, such as clusters of PCs.

The choice of the evaluation scheme is left to the user. The user is able to specify whether to use local evaluation, a single level, or a hierarchical distributed scheme. Depending on the size and complexity of the problem, different schemes may outperform the others.

References

1
Cantú-Paz, E. A survey of parallel genetic algorithms (IlliGAL Report No. 97003). Urbana, IL, 1997.
2
Goldberg, D. E. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA, 1989.
3
Goldberg, D. E. The design of innovation: Lessons from genetic algorithms, lessons for the real world. Technological Forecasting and Social Change. In press. 2000.
4
Holland, J. H. Adaptation in natural and artificial systems. University of Michigan Press. Ann Arbor, MI, 1975.
5
Kale, L.V., Krishnan, S. CHARM++ : A Portable Concurrent Object Oriented System Based On C++. In: Proceedings of the Conference on Object Oriented Programming Systems, Languages and Applications, Sept-Oct 1993. ACM Sigplan Notes, Vol. 28, No. 10, pp. 91-108. (Also: Technical Report UIUCDCS-R-93-1796, March 1993, University of Illinois, Urbana, IL.)
6
Machine Learning Repository. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/solar-flare/, October 28, 2001.
7
Mühlenbein, H. Parallel genetic algorithms and combinatorial optimization. Symposium on Parallel Optimization. 1990.

Biography

Martin Pelikan is a Ph.D. student at the Department of Computer Science at the University of Illinois at Urbana-Champaign. He works on the design of efficient genetic and evolutionary algorithms since 1995. His major areas of interest are evolutionary computation, black-box optimization, artificial intelligence, and machine learning.

Prasanna V. Parthasarathy is a graduate student in the Department of General Engineering at the University of Illinois at Urbana-Champaign. Following his work on application of genetic algorithms to truss optimization during his undergraduate study in civil engineering, he now works on the application of hybrid genetic algorithms to engineering optimization problems. Other areas of academic interest include parallel computation, scientific computing and finite element analysis.

Arun Ramraj is graduate student at the Department of Computer Science at the University of Illinois at Urbana-Champaign.

Want more Crossroads articles about Parallel Computing? Get a listing or go to the next one or the previous one.

Last Modified:
Location: www.acm.org/crossroads/xrds8-3/fineGrained.html