by Douglas Low
The Java language is compiled into a platform independent bytecode format. Much of the information contained in the original source code remains in the bytecode, thus decompilation is easy. We will examine how code obfuscation can help protect Java bytecodes.
Traditionally it has been difficult to reverse engineer [13] applications because they are large, monolithic and distributed as "stripped" object code. Stripping object code of its symbol table removes information like variable names and obscures references to library routines. For example, a call to the C language library routine printf in the source code might appear in the stripped object code as a procedure call to the memory address 35720.
Since the advent of Java [7], the threat of reverse engineering has become more serious. The language is compiled into a platform independent bytecode format. Because the bytecode is portable, there is little control over its distribution. Also, much of the information contained in the source code remains in the bytecode, facilitating decompilation [17, 19].
One possible way to prevent reverse engineering of source code is not to allow physical access to the program. Instead, users communicate with the program via an interface with a limited number of services. This is the client-server model [14]. Unfortunately, this imposes performance penalties because of limitations on network bandwidth and latency. A partial solution is to keep the parts of the program that need to be hidden on the server and have the user's machine run the rest locally.
Encryption of code is another possibility. However, unless the entire encryption/decryption process takes place in hardware, it is possible for the user to intercept and decrypt the code [8, 18]. Unfortunately, specialized hardware tends to limit the portability of programs.
Transmitting programs in a form less vulnerable to decompilation might seem like a good idea. Native object code can be supplied instead of Java bytecode. The task of decompilation is made more difficult, although not impossible [4]. However, native object code is not subject to bytecode verification, which gives Java a measure of protection against malicious programs such as viruses [3]. Digital signatures [15], which verify that the native code is actually from a trusted source and has not been tampered with, can help to alleviate this problem. The downside is that different versions of the program are required for different architectures, which increases the software maintenance effort. With Java, only one version is required, since it is executed by a virtual machine.
If we cannot make reverse engineering impossible, we can at least make the task costly in terms of time and effort. Code obfuscation transforms a program so that it is more difficult to understand, yet is functionally identical to the original [5, 6]. The program must still produce the same results, although it may execute more slowly or have additional side effects because of the added code. There is a trade-off between the security provided by code obfuscation and the execution time-space penalty imposed on the transformed program.
The current work on Java obfuscation has been in the form of freeware, shareware and commercial programs, rather than academic publications. Some of the code obfuscation techniques discussed below are based on traditional compiler optimizations. Examples include array and loop reordering, and procedure inlining [2].
We classify code obfuscations according to the kind of information they target and how they affect that target.
Information that is unnecessary to the execution of the program, such as identifier names and comments, is altered. There are many utilities (such as [10, 16]) that will change the identifiers in a program to less meaningful ones. Identifier scrambling is a common obfuscation that has been applied to other languages. The C Shroud system [9], a source code obfuscator available for the C language, is an example of such a tool.
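As a rough illustration of identifier scrambling, the sketch below shows one small class before and after renaming. The class, field and method names are hypothetical and are not taken from any of the tools cited above; the point is only that behaviour is preserved while the names stop carrying meaning.

```java
// Hypothetical class as the programmer wrote it: the names document intent.
final class Account {
    private int balanceInCents;

    void deposit(int amountInCents) {
        balanceInCents += amountInCents;
    }

    int getBalanceInCents() {
        return balanceInCents;
    }
}

// The same class as an identifier scrambler might emit it (names invented here).
// Java keeps fields and methods in separate namespaces, so both can be called "a".
final class A {
    private int a;

    void a(int b) {
        a += b;
    }

    int b() {
        return a;
    }
}
```

A reader of the second class can still recover what the code does, but must work it out from the operations rather than from the names.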
Data obfuscations affect the data structures used by a program.
Data storage obfuscation affects how data is stored in memory. For example, a local variable can be converted into a global one. Data encoding obfuscation affects how the stored data is interpreted. For example, an integer variable i can be replaced by 8 * i + 3. We can see the effect below:
Before | After
int i = 1; | int i = 11;
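The encoding of i as 8 * i + 3 has to be carried through every use of the variable, not just its initialization. The following is a minimal sketch of that idea; the class name, the array parameter and the loop bound of 1000 are assumptions chosen for illustration.

```java
final class EncodedCounter {
    // Original: sum the elements at indices 1..999 using a plain counter.
    static long sumPlain(int[] a) {
        long sum = 0;
        int i = 1;
        while (i < 1000) {
            sum += a[i];
            i++;
        }
        return sum;
    }

    // Obfuscated: the counter stores 8 * i + 3 instead of i.
    static long sumEncoded(int[] a) {
        long sum = 0;
        int i = 11;                 // encodes 1, since 8 * 1 + 3 = 11
        while (i < 8003) {          // encodes the bound 1000, since 8 * 1000 + 3 = 8003
            sum += a[(i - 3) / 8];  // decode before using the value as an index
            i += 8;                 // encodes i++
        }
        return sum;
    }
}
```

Both methods expect an array of at least 1000 elements and visit the same indices in the same order. The encoded version pays for the extra subtraction and division on each access, which is an instance of the time penalty mentioned earlier.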
Data aggregation obfuscation alters how data is grouped together. For example, a two-dimensional array can be converted into a one-dimensional array and vice-versa.
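One way to picture the array conversion is a row-major flattening, sketched below. The class name, the dimensions and the helper methods are hypothetical, and row-major order is only one of several possible mappings.

```java
final class FlattenedGrid {
    static final int ROWS = 3;
    static final int COLS = 4;

    // Original layout: a two-dimensional array with a recognisable shape.
    static int[][] makeGrid() {
        int[][] grid = new int[ROWS][COLS];
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                grid[r][c] = 10 * r + c;
            }
        }
        return grid;
    }

    // Obfuscated layout: the same values stored in one dimension, row by row.
    static int[] flatten(int[][] grid) {
        int[] flat = new int[ROWS * COLS];
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                flat[r * COLS + c] = grid[r][c];
            }
        }
        return flat;
    }

    // Every access grid[r][c] in the original becomes flat[r * COLS + c].
    static int get(int[] flat, int r, int c) {
        return flat[r * COLS + c];
    }
}
```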
Data ordering obfuscation changes how data is ordered. For example, an array used to store a list of integers usually has the ith element in the list at position i in the array. Instead, we could use a function f(i) to determine the position of the ith element in the list.
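A small sketch of such a position function follows. The particular choice f(i) = (3 * i + 5) mod 8 is an assumption made for illustration; it works because 3 and 8 share no common factor, so f maps the indices 0..7 onto 0..7 without collisions.

```java
final class ScrambledList {
    static final int SIZE = 8;
    private final int[] slots = new int[SIZE];

    // Hypothetical position function: a permutation of 0..7 because gcd(3, 8) = 1.
    static int f(int i) {
        return (3 * i + 5) % SIZE;
    }

    // The ith element of the list lives at position f(i), not at position i.
    void set(int i, int value) {
        slots[f(i)] = value;
    }

    int get(int i) {
        return slots[f(i)];
    }
}
```

Callers still see an ordinary indexed list, while the order of the values in memory no longer matches the logical order. This sketch assumes indices in the range 0..7.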
Control obfuscations disguise the real control flow in a program.
Control aggregation obfuscation changes the way in which program statements are grouped together. For example, procedures can be inlined: a procedure call is replaced with the statements of the called procedure itself.
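Inlining can be sketched with a deliberately tiny example. The method names below are hypothetical; a real obfuscator would apply the same idea to much larger procedures, where the lost abstraction matters more.

```java
final class InlineDemo {
    // Original: the helper method and its name reveal the abstraction.
    static int square(int x) {
        return x * x;
    }

    static int sumOfSquares(int a, int b) {
        return square(a) + square(b);
    }

    // After inlining: each call is replaced by the body of the called method,
    // so the helper no longer needs to appear in the obfuscated program.
    static int sumOfSquaresInlined(int a, int b) {
        return a * a + b * b;
    }
}
```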
Control ordering obfuscation alters the order in which statements are executed. For example, loops can be made to iterate backwards instead of forwards.
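Reversing a loop is only safe when no iteration depends on the result of another. The sketch below uses such a loop; the class and method names are assumptions for illustration.

```java
final class ReversedLoop {
    // Original: iterate forwards over the array.
    static void scaleForward(int[] a, int factor) {
        for (int i = 0; i < a.length; i++) {
            a[i] *= factor;
        }
    }

    // Obfuscated: iterate backwards. Each element is updated independently,
    // so the final contents of the array are the same.
    static void scaleBackward(int[] a, int factor) {
        for (int i = a.length - 1; i >= 0; i--) {
            a[i] *= factor;
        }
    }
}
```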
Control computation obfuscation affects the control flow in a program. These can be divided up further:
Before | After
while (i < 1000) { ... i++; } | int i = 1;
These transformations attempt to stop decompilers from operating by exploiting their weaknesses. HoseMocha [11] is a utility that appends extra instructions after a return instruction. The execution of the program is unaffected, but the obfuscation causes the Java decompiler Mocha [17] to crash.
The task of making reverse engineering costly is difficult. Client-server models of protection, while providing the best security, suffer from limitations on network capacity. Encryption requires the use of specialized hardware, in turn limiting the portability of programs. Using native object codes makes reverse engineering harder but increases the software support effort. Also, digital signatures are required to prevent tampering. Code obfuscation, while not providing absolute security, is portable, does not require specialized hardware and is transparent to the Java bytecode verifier. However, it does impose an execution time-space penalty on the program being protected.
Code obfuscation is a fruitful area for further research. There are many issues and implications, both theoretical and practical [5], that remain to be resolved.
My Master's thesis research into Java obfuscation has been performed jointly with my supervisors Dr. Christian Collberg and Professor Clark Thomborson.
Douglas Low is currently a graduate student at the University of Auckland, New Zealand, working on the application of code obfuscation to Java. His research interests include compiler implementation, computational combinatorics and constraint satisfaction (AI). Electronic versions of papers that he has co-authored can be found at http://www.cs.auckland.ac.nz/~collberg/Research/Students/DouglasLow .
Location: www.acm.org/crossroads/xrds4-3/codeob.html