|
ABSTRACT
This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and full knowledge of the future. For best results, the cache should have a mechanism for excluding certain instructions designated by the compiler. This paper first presents a reduced form of the algorithm. This form is shown to produce an optimal miss rate for programs without conditionals and with a tree call graph, assuming basic blocks can be reordered at will. If conditionals are allowed, but there are no loops within conditionals, the algorithm does as well as an optimal cache for the worst case execution of the program consistent with the profile information. Next, the algorithm is extended with heuristics for general programs. The effectiveness of these heuristics are demonstrated with empirical results for a set of 10 programs for various cache sizes. The improvement depends on cache size. For a 512 word cache, miss rates for a direct-mapped instruction cache are halved. For an 8K word cache, miss rates fall by over 75%. Over a wide range of cache sizes the algorithm is as effective as increasing the cache size by a factor of 3 times. For 512 words, the algorithm generates only 32% more misses than an optimal cache. Optimized programs on a direct-mapped cache have lower miss rates than unoptimized programs on set-associative caches of the same size.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Horowitz, M., Chow, P., et. al., "MIPS-X: A 20 MIPS Peak, 32-Bit Microprocessor with On- Chip Cache", IEEE Journal of Solid-State Circuits, Vol. SC-22, No. 5, October, 1987, pp. 790-799.
|
| |
3
|
Agarwal, A., Chow, P., Horowitz, M., Acken, J., Salz, A., and Hennessy, J., "On-Chip Instruction Caches for High Performance Processors", Proc. of the Conference on Advanced Research in VLS{, Losleben, P.,ed., Stanford University, Stanford, Ca, March 1987.
|
| |
4
|
|
 |
5
|
S. Prybylski , M. Horowitz , J. Hennessy, Performance tradeoffs in cache design, Proceedings of the 15th Annual International Symposium on Computer architecture, p.290-298, May 30-June 02, 1988, Honolulu, Hawaii, United States
|
 |
6
|
|
| |
7
|
Hatfield, D.J., and Gerald, j., "Program Restructuring for Virtual Memory", IBM Systems J., Vol. 10, No. 3, 1971, pp. 168-192.
|
 |
8
|
|
| |
9
|
Ferrari, D., "The Improvement of Program Behavior", Computer, Vol. 9, No. 11, Nov., 1976, pp. 39-47.
|
| |
10
|
Baer, J.L., and Caughey, R., "Segmentation and Optimization of programs from Cyclic Structure Analysis ", Proc. AFIPS, 1972, pp. 23-36.
|
| |
11
|
|
| |
12
|
|
| |
13
|
Abu-Sufrah, W., "Identifying Program Localities at the Source Level", Technical Report UIUCDCS -R-82-1108, Dept. of Computer Science, Univ. of Illinois, October 1982.
|
| |
14
|
|
| |
15
|
Chow, F., "Private Communication".
|
| |
16
|
|
| |
17
|
Belady, L.A., "A Study of Replacement Algorithms for a Virtual-Storage Computer ", IBM Systems J., Vol. 5, No. 2, 1966, pp. 78-101.
|
CITED BY 87
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Nikolas Gloy , Trevor Blackwell , Michael D. Smith , Brad Calder, Procedure placement using temporal ordering information, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.303-313, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Derek Chiou , Prabhat Jain , Larry Rudolph , Srinivas Devadas, Application-specific memory management for embedded systems using software-controlled caches, Proceedings of the 37th conference on Design automation, p.416-419, June 05-09, 2000, Los Angeles, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jared Stark , Paul Racunas , Yale N. Patt, Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.34-43, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
|
|
|
|
|
|
|
|
|
W. J. Schmidt , R. R. Roediger , C. S. Mestad , B. Mendelson , I. Shavit-Lottem , V. Bortnikov-Sitnitsky, Profile-directed restructuring of operating system code, IBM Systems Journal, v.37 n.2, p.270-297, April 1998
|
|
|
|
|
|
|
Nikolaos Bellas Ibrahim Hajj , George Stamoulis , N. Bellas , C. Polychronopoulos, Architectural and compiler support for energy reduction in the memory hierarchy of high performance microprocessors, Proceedings of the 1998 international symposium on Low power electronics and design, p.70-75, August 10-12, 1998, Monterey, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Santosh G. Abraham , Rabin A. Sugumar , Daniel Windheiser , B. R. Rau , Rajiv Gupta, Predictability of load/store instruction latencies, Proceedings of the 26th annual international symposium on Microarchitecture, p.139-152, December 01-03, 1993, Austin, Texas, United States
|
|
Alex Ramirez , Luiz André Barroso , Kourosh Gharachorloo , Robert Cohn , Josep Larriba-Pey , P. Geoffrey Lowney , Mateo Valero, Code layout optimizations for transaction processing workloads, ACM SIGARCH Computer Architecture News, v.29 n.2, p.155-164, May 2001
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xianglong Huang , Stephen M. Blackburn , David Grove , Kathryn S. McKinley, Fast and efficient partial code reordering: taking advantage of dynamic recompilatior, Proceedings of the 2006 international symposium on Memory management, June 10-11, 2006, Ottawa, Ontario, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
P. R. Panda , F. Catthoor , N. D. Dutt , K. Danckaert , E. Brockmeyer , C. Kulkarni , A. Vandercappelle , P. G. Kjeldsberg, Data and memory optimization techniques for embedded systems, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.6 n.2, p.149-206, April 2001
|
Peer to Peer - Readers of this Article have also read:
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
Putting innovation to work: adoption strategies for multimedia communication systems
Communications of the ACM
34, 12
Ellen Francik
, Susan Ehrlich Rudman
, Donna Cooper
, Stephen Levine
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
-
An intelligent component database for behavioral synthesis
Proceedings of the 27th ACM/IEEE conference on Design automation
Gwo-Dong Chen
, Daniel D. Gajski
|