Recent studies have shown that parallelization can significantly improve the performance of genetic algorithms [1, 2,3,7]. Many search operators are easily distributed and thus great speed-ups can be obtained for difficult problems. There are two general approaches to parallelizing genetic algorithms, master-slave and fine or coarse grain. A master-slave architecture is often employed when fitness evaluation (which determines quality of each solution) is the most expensive operator. Fitness evaluations are distributed among a number of slave processors and all remaining operators (mutation, crossover, and selection) are performed on the master processor. In fine-grained and coarse-grained parallel genetic algorithms, the population is divided into subpopulations organized in a network. All operators in each subpopulation are executed by a separate processing element containing the subpopulation. Different subpopulations can communicate via diverse network topologies and communication patterns.
The purpose of this paper is to describe our implementation of a versatile fine-grained parallel genetic algorithm and demonstrate its scaleability. Our primary goals included design and implemention of an easily extensible system to solve diverse optimization problems. The implementation allows solutions represented by binary strings and decision graphs with Boolean, real-valued, and integer attributes. Fitness functions include a simple linear problem for binary strings and classification of data sets which are dynamically loaded from a specified data file. Any of the components can easily be extended by deriving a new class for the representation and fitness functions. A configuration file allows a user to select a problem to solve, a representation, and other parameters. This paper does not analyze the efficiency or applicability of genetic algorithms as a general optimization tool, but proposes a way to parallelize one of its variants to achieve good scaleability.
The algorithms proceed by repeatedly biasing the search to regions of higher quality and combining promising pieces of solutions to form novel combinations. In this fashion, the problems that are decomposable into problems of bounded difficulty can be solved very efficiently (in sub-quadratic or quadratic time). Although many problems are not clearly decomposable, many interesting problems can indeed be solved in this manner. This holds even for many instances of NP-complete problems, although the algorithm would require exponential time in the worst case.
Genetic algorithms have been successfully applied in a number of areas, including engineering design, business, telecommunication, scheduling, and machine learning. Many difficult problems can be tackled in sub-quadratic time with respect to the number of variables. However, for very big problems even sub-quadratic time results in large computational requirements. In order to solve these difficult problems faster, a large research effort has resulted in the study of parallelizing genetic algorithms. The next section discusses some of these approaches.
The second class is the fine-grained and coarse-grained parallel genetic algorithms, where the population is distributed among a number of processors. Each processor processes its subpopulation. Subpopulations are allowed to communicate and exchange the solutions at some intervals. Various connection topologies and communication patterns can be used. Fine-grained genetic algorithms divide the population into small subpopulations containing only one or a couple of solutions which are connected in a grid topology (usually, 2D grids are used). The individuals communicate only within their neighborhood. The original purpose of fine-grained genetic algorithms was to use a large number of small processors, where each processor would process only one individual.
In coarse-grained genetic algorithms the subpopulations are bigger and communication is usually reduced to exchange solutions only once in a while or after the algorithms have converged. Common topologies for communication include rings, grids, hypercubes, fully-connected graphs, and random graphs with a fixed number of links for each processing element.
Each of the parallelizations can yield a significant improvement for some problems. If the fitness evaluation takes the majority of the computational resources, master-slave parallel genetic algorithms usually perform well. However, when the fitness evaluation does not require significantly more time than other operators, coarse-grained and fine-grained genetic algorithms can further improve the performance.
As mentioned above, originally fine-grained genetic algorithms were designed to run on machines with a large number of small processors connected in a grid topology. As the optimization proceeds, one would expect high-quality partial solutions to spread from location to location on the grid in a process similar to diffusion. This behavior is interesting in itself, may eliminate problems of premature convergence, and may also improve the execution of the algorithm on a single processor machine. These reasons motivate our decision to implement a fine-grained genetic algorithm.
In our implementation of the fine-grained genetic algorithm, the population of the candidate solutions is mapped to a 2D grid where each position of the grid may either contain a particular solution or be empty. In each iteration of the algorithm, all solutions undergo a crossover operator in which the individual is combined with a randomly picked neighbor from its neighborhood. The individual then undergoes a mutation operator that performs a slight perturbation to the current solution. If the new individual is of higher quality, then it replaces the original individual. This introduces the selection pressure that biases the search toward high-quality solutions. A local search operator can next be applied to tune the particular solution. In this fashion, a genetic algorithm does a global search while the local search operator tries to improve the individual within some small neighborhood. The next section explains the details of our parallel implementation.
We have chosen to use the Charm++ programming language [5], an object-oriented, message-driven, parallel language based on C++. In message-driven parallel languages, the execution, method invocation, and synchronization are performed by sending messages between the different processes. There are several advantages to Charm++. Charm++ is based on an object-oriented language, allowing us to code in a flexible and comprehensible environment. Charm++ also supports automated load balancing, which can automatically migrate some objects depending on the current workload. Additionally, the execution in Charm++ is driven by messages and therefore allows the creation of a completely distributed and asynchronous application, which should be scalable to a large number of processors.
Our implementation uses a 2D grid, where the grid is divided into a number of rectangles. The division is done automatically, but the user can specify the average number of rectangles to be processed by each processor on average. The default value is 4, i.e., each processor should contain 4 rectangles on average. Sub-grids can migrate among different processors depending on the current load. Neighboring sub-grids must communicate since processing each sub-grid requires knowing the individuals on the border of our sub-grid.
However, the processes in genetic algorithms are statistical in nature and, consequently, it is not necessary to always require the information in neighboring processors to be consistent. This allows us to create a completely asynchronous system that can greatly improve the overall performance.
In addition to distributing the grid with individuals, we have distributed the computation of the fitness function. There are many alternatives that the user can specify on the fly. The simplest alternative is when all individuals are evaluated locally by the object responsible for the part of the grid (the rectangle) where the individual is located. This solution requires no communication, unless fitness evaluation takes a large or dynamically changing amount of time, making this an efficient solution. The second alternative is to have one special object for each processor and have each rectangle send its evaluations to the evaluator object on its local node. This can be useful for more complex fitness functions.
The last alternative is to use a hierarchical scheme where there is one fitness manager for a certain number of processors. Each rectangle sends its individuals to its associated manager. There is one evaluator per processor. Depending on the load on each evaluator, the manager decides the evaluator that should receive the work and sends it back. This load can be computed based on the history of computation, since the manager knows about all transactions between the rectangles and evaluators.
Local search is performed by the hill climbers. There is one hill climber for each sub-grid. At the beginning, each hill climber requests a solution from the corresponding sub-grid. It then tries to search its neighborhood and improve the solution locally for a number of iterations. Once it cannot improve the solution anymore, it sends the solution back to the rectangle, where the resulting solution replaces a randomly chosen individual. The rectangle responds with another individual that will be a subject to local search. Hill climbers migrate with the rectangles so that the communication across different processors is minimized.
The current state of the execution is obtained by extracting the best overall solution, the average fitness, and other information extracted from the current population of solutions. In order to prevent a bottleneck in the implementation, we used a spanning tree whose nodes correspond to the rectangles and migrate with them. The information is moved upwards and the top left-most object outputs the information to the screen at certain intervals. By using this scheme, no synchronization is required, thus improving performance. Further details of the implementation are explained shortly.
![]() |
The child classes that we have implemented are the BinaryStringIndividual, which represents the conventional binary bit string representation of solutions and the DecisionTree, which is for the decision tree representation of solutions for classification problems. The user specifies which representation to use in the input parameter file.
In each GA generation, the following process repeats. For each column of individuals, we first duplicate them. Then, a crossover partner is selected from an 8-neighborhood and crossover is performed on the duplicate copies. This is followed by mutation. Once these alterations are done, the new individuals are sent for evaluation, again, column by column. The evaluation may take place in one of three places: local, FitnessManager objects ,or FitnessEvaluator object. When the evaluations are sent to FitnessEvaluator objects, they may come back reordered. Thus, the process is potentially asynchronous. To take advantage of this, the start() method processes all columns that are up to date, instead of processing columns in serial order.
After processing all current columns, the method reinvokes itself (via its proxy), providing other entry methods the chance to proceed. When the method is finished processing all the columns in one generation, the StateInformationCollector object is invoked and generation statistics are updated. The StateInformationCollector object also sends the four boundary rows/columns to the corresponding neighbors and increments the generation counter. Then, the method reinvokes itself.
There is also another entry method called at the end of processing, namely, finito(). The purpose of this method is to signal other objects to finish their processing due to obtaining the nest individual from some other processor. This eliminates processors from wasting time calculating the same solutions.
The HillClimber object queries its attached FineGrainedGA object for individuals. Upon receiving one, the HillClimber tries to improve the individual by invoking the climbTheHill() member function of the individual. Once the HillClimber is finished climbing, it sends the improved (or same, in case of no progress) individual back and, in turn, receives another individual to process.
The StateInformationCollector objects form a spanning tree with the FineGrainedGA objects attached to its nodes. At the end of each generation, the StateInformationCollector object receives a message from the corresponding FineGrainedGA object through its pass() method. This then compares all messages it has received from its children and attached FineGrainedGA and passes the updated information to its parent. The StateInformationCollector object at the root of the tree consolidates everything and prints out the statistics. Since there is no synchronization, the root object of the spanning tree may receive new information quite often. To avoid excessive output, the user can specify a minimum delay for printing out the global status of the run.
The purpose of experiments was not to evaluate the algorithm from the optimization point of view, but rather to find out whether the algorithm scales up for the two substantially different classes of problems. Similar speed-up can be expected for other problems of the same class since the time spent by all operators should be comparable. The only difference can appear in the fitness evaluation which strongly depends on the problem at hand. We ran the algorithm with local evaluation and evaluation on the fitness managers.
Tested configurations for the binary problem are provided in Table 1. Each FineGrainedGA was run for 30 iterations. For the classification problem, we have used a grid of size 128x128 and let the algorithm run for 300 iterations (due the increased problem complexity). The number of processors ranged from 1 to 64. All experiments were performed on a SGI Origin 2000.
|
The scale-up curve for the classification problem is shown in Figure 4 for local evaluation and for evaluation in fitness managers. In this problem, we see that evaluation done in managers results in quite a significant improvement of the speedup. This verifies the intuitive argument that distributing the evaluations and letting the FineGrainedGA objects do their processing in the mean time, results in better exploitation of the parallel processing power. We believe that for bigger problems the use of hierarchical scheme would pay off.
![]() |
![]() |
![]() |
Another use for the implementation is for studying the behavior of the fine-grained parallel genetic algorithm. Various aspects of the behavior are very interesting. One example is the takeover time, which is the time it takes a superior individual to take over the entire population. Other examples include mixing time, population sizing, convergence time, genetic drift, density of individuals in the grid. Looking at these various aspects of the behavior, the algorithm could be related to the existing theory for genetic algorithm with a global population. To our knowledge, no such studies have been done in the past.
The presented work has not analyzed the speed-up behavior for the evaluation distributed among the managers. An interesting topic of future research would be to look at the efficiency of distributed evaluation for different optimization problems.
Our implementation is fully distributed and asynchronous, which allows it to scale up almost perfectly even for large number of processors, and use the computational resources on different network topologies very efficiently. Empirical results confirmed our intuition that the algorithm speeds up very well. We believe that the algorithm would continue to speed up for more processors, since all work is evenly distributed and there is no bottleneck. However, more experiments should be performed to investigate the behavior of the implementation on different parallel machines, such as clusters of PCs.
The choice of the evaluation scheme is left to the user. The user is able to specify whether to use local evaluation, a single level, or a hierarchical distributed scheme. Depending on the size and complexity of the problem, different schemes may outperform the others.
Prasanna V. Parthasarathy is a graduate student in the Department of General Engineering at the University of Illinois at Urbana-Champaign. Following his work on application of genetic algorithms to truss optimization during his undergraduate study in civil engineering, he now works on the application of hybrid genetic algorithms to engineering optimization problems. Other areas of academic interest include parallel computation, scientific computing and finite element analysis.
Arun Ramraj is graduate student at the Department of Computer Science at the University of Illinois at Urbana-Champaign.
Want more Crossroads articles about Parallel Computing? Get a listing or go to the next one or the previous one.
Last Modified:
Location: www.acm.org/crossroads/xrds8-3/fineGrained.html