![]() Invitation Archives |
|
Software Based Fault Tolerant ComputingGoutam Kumar SahaMember ACM Mail to: CA – 2 / 4 B, Baguiati, Deshbandhu Nagar, Kolkata 700059, India. sahagk@gmail.com gksaha@rediffmail.com This paper describes how to design a software based fault tolerant application using microprocessor (MP), in order to tolerate the burst errors in memory. This approach may be called a single – version scheme (SVS). The SVS relies on a single version application program which is enhanced with self-checking code redundancy to tolerate memory burst errors that are difficult to correct during the run-time of an application. Conventionally, the other software based approaches can detect a few bit errors (in memory) only towards fail-stop kind of fault tolerance against transient bit errors. Reed Solomon codes are mainly effective for burst errors in coding of audio Compact Disks at offline only. The proposed online technique does not need multiple versions of software and multiple machines. This approach employs only two copies of the application software running on one machine only. Two copies of the enhanced version of an application are used here for online error detection and tolerance thereof as well. This is an effective low-cost online tool for hardening a microprocessor-based industrial computing system or for on-chip DRAM applications using an affordable code and time redundancy against the burst errors in processor memory. The SVS aims to provide a non-fail-stop kind of fault tolerance against burst errors. This approach supplements the Error Correcting Codes (ECC) in memory system also, against both the transient and permanent bit errors in memory.Keywords: Burst errors, multiple byte-errors, fault tolerance, single-version software.
1. Introduction:
The objective of this proposed approach is to design a low-cost robust industrial computing system that can detect and tolerate burst errors ( 7 bytes onwards where on an average, a MP instruction is of about six bytes long). At the same time, it is also useful against transient bit errors as well as permanent bit faults (whose presence is assumed to be continuous in time) in a memory during the run-time of an application. The vast majority of hardware failures in modern microprocessors (MP), especially for memory faults ( for example, bursts of errors or multiple byte errors or random bit-errors), are because of the limited hardware detection in them [1]. Transient faults (whose presence is bounded in time) are random events. Transient bit errors can be tolerated by re-computing an application afresh. Conventionally, with the use of two copies, we can detect bit errors only. But, the SVS approach uses two copies of an application for detecting and tolerating burst errors in memory. Though, memory has Forward Error Correction (FEC) or Error Correcting Codes (ECCs) (e.g. Parity bits, Hamming Code, BCH, and Cyclic redundancy codes in which bits are interpreted as coefficients in a polynomial etc.) that are capable of detecting and correcting a few bit errors on using both code and high time redundancy. For example, BCH (63,45) can correct only 3 errors in a 45 information bits. CRC - 32 code detects any single - bit, all double - bit, any odd number of errors, and error bursts of 32 bits errors. In general, CRC can detect burst errors up to length < number of redundancy bits. However, CRC (polynomial codes) take high processing time to calculate some function y = f(m), where m is the message data, for coding and decoding. Again, in CRC, there is a chance to have false negative test for error. Though CRC is more complicated than parity or checksum (that is, computing the sum of all words in the application memory space before the application starts and re-compute the sum to validate with the earlier sum), it can be implemented in hardware. The Reed-Solomon Codes (based on Galois Fields on (2 8 ) with modulo-operation ) are useful for correcting limited number of burst errors, but they are used at offline only because of its slower decoding and encoding speed. Burst errors can also be controlled through a memory interleaving scheme. In order to make sure that a burst of errors does not affect consecutive bits, we need here to scramble the bits in interleaving. Here, bit failures are scattered, and there exists high probabilities that correction bits have survived. Here, a completely incorrect batch of bytes affects a large number of samples (to be played out), but only in a minor way. However, such scheme bears a lot of housekeeping code redundancy along with high time redundancy for assembling the scrambled bytes. Again, the Cross Interleave Reed Solomon Code or CIRC (that is, interleaving of data and a CRC like error correcting code) can recover a burst error of greater than 4000 consecutive bits - about 2.5 mm on a CD disc. But, CIRC is usable at offline only. Again, such conventional error correcting codes (ECCs) such as BCH or Reed Solomon codes are known to be inadequate for on-chip DRAM applications. The encoding and decoding circuits of these codes employ a linear feedback shift register (LFSR), which, if used in a DRAM chip, will introduce very high access delay. Owing to this constraint, Hamming or extended Hamming codes are normally used. The key to the Hamming code is the use of extra parity bits to allow the identification of a few bit errors. The Hamming code with the Hamming distance of 3, can correct only single error. Hamming codes are capable of correcting one error or detecting two errors. The conventional ECCs in memory system, are effective for few bit errors in memory only. ECCs are not effective for online burst error detection in memory. Thus, a software based online approach for tolerating burst errors is much needed. The proposed technique also aims to be useful for on-chip DRAM application .
In the proposed approach, we have partitioned an application code into a number of basic blocks (that is, branch-free parts of code). We insert an extra No-Operation (NOP) instruction after each useful instruction in an application program. Each block is preceded by an error checking code. Each block is checked by its error checking code. Before proceeding to execute a block of code, we check for the correctness of a block-code by checking the correctness of the extra No-Operation (NOP) code as inserted after each useful instruction, and, by comparing the computed checksum with the previously computed and saved checksum. If a block is detected as an erroneous one (that is, one or many NOP code or other adjacent useful words or an error checking code in the block get corrupted), we check the other similar block in the extra copy of an application and if it is correct, we proceed further to execute it and thus, we avoid re-computing an application afresh. It is obvious that if one or many NOP codes in a block, get corrupted, then adjacent useful words ( instructions or data) also get corrupted (with high probability) in the burst errors affected memory. For a microprocessor with an instruction with more or less N bytes long, this method can detect burst error of (N+1) bytes onwards. As, we execute a similar spare correct block of code ( when a block is detected erroneous), we tolerate transient faults also, without re-computing the same copy afresh. At the same time, if there is a permanent bit error at the memory space of an affected block of code (in a copy of an application), we tolerate such errors by executing the spare similar block of code that is, the spare block of code in the 2nd copy of an application. In other words, we skip the memory space that got affected by burst errors or transients or permanent bit errors, and then we execute the duplicate block of code after it has been detected (by the error checking code) as an error free one. It is obvious that if a NOP code does not get corrupted but, an adjacent useful word gets affected by a few bit flip errors, then an added checksum mechanism at each block's error checking code, detects a few bit errors. Checksum or such Error Correcting Codes (ECC) or Error Detection Mechanism (EDM) in the memory or in a MP, are useful for detecting and correcting a few bit errors only in memory. Conventional ECCs are not effective for online detection and correction of memory burst errors, but they are effective for single or few bit flips in memory. That is why, we augment our approach with an additional checksum mechanism [11] at each block- error checking (EC) code. Thus, bit errors in the useful instructions, data or in a block's error checking code itself, are also detected by this SVS approach, and we tolerate such errors by executing the spare similar block in the spare copy of an application. Thus, we avoid reloading and re-starting an application for re-computing it afresh. It is well known that re-loading and re-computing are normally useful for tolerating transient bit faults only. But, our aim is to online detect and tolerate burst errors only. This SVS approach does not aim to correct errors. Such error detection is also possible by comparing bytes of a copy with its duplicate [14]. But, we cannot tolerate fault here. We can also gain error tolerance through recovery by using triplication (massive redundancy). Thus, the proposed (SVS) approach supplements a MP's EDM or a memory-ECC mechanism for tolerating operational faults that may occur during service delivery of the use phase. Though transient errors affecting the memory are tolerated by the ECC of the modern memory, the proposed approach is useful mainly for detecting and tolerating burst errors, as well as for transient, permanent bit errors in memory. SVS is aimed to supplement ECC in tolerating few bit errors in memory. Because, the ECC of the memory can only detect and recover few bit errors in memory.
We should accept that, relying on software techniques for obtaining dependability means accepting some overhead in terms of increased size of code and reduced performance. Fault tolerance is the ability of a system to perform its function correctly even in the presence of internal faults. Transient faults occur once and then disappear. If the operation is repeated, the fault may go away. A permanent fault is one that continues to exist until the faulty component is repaired. Software Fault Tolerance is the reliance on “Design Redundancy” to mask residual design faults present in software program. Current fault tolerant techniques utilized in commercial systems such as IBM S/390 G5 [1,2,3] rely on redundancies. For example, error checking is implemented by duplicating chips and comparing results. These techniques need two times or more hardware overhead. In addition, the duplicate and compare is adequate for error detection only. Hence, low-cost fault tolerant technique is necessary for future microprocessor systems. This paper describes an economically very important method to harden a microprocessor-based system against burst errors, multiple bit-faults, permanent and transient bit errors by acting on software only.
1.1 The Proposed SVS Mechanism
The proposed SVS technique adopts the strategy of self-checking and self-controlling programming that relies on code and time redundancy. Unlike the triple modular redundancy schemes, this approach does not aim to eliminate software bugs. We assume that the code is correct, and the faulty behavior is only due to memory burst errors affecting the system. In this technique, we insert a one-byte “NO-Operation (NOP)” instruction code after each useful instruction in the basic processing logic of an application. We consider the fault model of at least seven or more contiguous byte-errors or multiple bit flip or a burst of modified bits. We know those memory locations or offsets where the extra NOP –code have been inserted. Execution of NOP code does not alter the status flag bits in a processor status word (PSW) and it provides a delay only. The correctness of these additional NOP codes and checksum are checked before the program control goes forward to execute the adjacent application – instructions in a basic block of code. As soon as, either a NOP code-corruption or checksum-mismatch error is detected by the proposed error checking code, the program control skips (on noting down the erroneous block’s identity on a memory variable namely, PREV_EC) the corrupted block of an image of an application and, then it goes forward to verify for errors and to execute a similar block at the other spare image of an application. We consider that at least two copies or images of an application reside on memory for tolerating such potential burst or bit-errors. However, for fault detection, the proposed SVS needs only one copy of an application. An application image consists one or more basic blocks of code. We keep one error checking code at each block of code in an image of an application. In other words, each block of code in an application image is preceded by an error-checking or error-processing code that detects burst errors by checking for the correctness of the extra NOP codes that are inserted at a block (for detecting burst errors), as well as, it verifies for the correctness of other code through checksum. The blocks are identified in a such manner that an inter-block branching does not arise. Each block of code can also be implemented by a procedure. For a procedure- call, we need to keep an error-checking code that precedes the instructions of a procedure. In procedure also, we need to insert a NOP instruction after each useful instruction in a procedure. As a result, each block-size may not be same. Similarly, each error-checking code, not necessarily, may cover the same number of useful instructions, because of the possible existence of different block -sizes that depend on the branching displacement. In other words, if a jump instruction (say, at location L) branches to an instruction (say, at location M), then we say the branch displacement as the absolute of (M-L) and in such case, we keep the instructions at locations L through M in the same block of code only, in order to prevent execution of unchecked block-code. Such SVS scheme does not allow execution of instructions in a block unless they are checked as error-free by that block’s error checking code. The aim of this proposed scheme is not to eliminate software bugs nor to correct errors; rather, it aims to mask or skip the erroneous block of code for designing a non-fail-stop kind of fault tolerant system against burst errors. In other words, unlike the other single-version, software implemented fault tolerance schemes that detect few bit-errors and, that rely on restarting or resetting a system (fail-stop kind of fault tolerance) against a fault model of a few transient bit errors in memory, the SVS approach can tolerate few faults to go ahead with the execution of an application further (non-fail-stop kind of fault tolerance) until the application crashes when no spare correct similar block of code exists.
2. Previous Works:
In order to achieve ultra reliability in computing , it is necessary to adopt the strategy of defensive programming based on redundancy ( i.e. fault - tolerant software), e.g., Recovery Blocks (RB) [4,9], N Version Programming (NVP) [5]. In RB, the acceptance test condition is expected to be met by the successful execution of either the primary module or the alternate modules. When an acceptance test detects a primary module failure, an alternate module executes. If all alternate modules are exhausted, the system crashes. In NVP, N number of variants or alternates run simultaneously on N different machines and at the end of program , the results are voted upon to find an answer in majority and it is considered as a correct result. If no consensus result is found, then the NVP system crashes. However, both RB and NVP need multiple versions of software to be developed independently using different languages, tools etc. In reality, designing one version of reliable software is itself a very costly and challenging task. Again, designing multiple versions of software is found to be very expensive and beyond reach for many low cost applications. However, in critical systems with real - time deadlines, voting at program’s end may not be acceptable. This scheme requires synchronization of the various software versions at comparison points. The RB scheme needs f+1 number of alternates to tolerate f sequential faults. The NVP scheme needs f+2 number of alternates to tolerate f sequential faults. The various single-version software implemented fault tolerance (SIFT) schemes, for example, Algorithm Based Fault Tolerance (ABFT) [6], Assertions [7, 17, 18], and Control Flow Checking [8] are meant for supplementing the intrinsic error detection mechanisms of a microprocessor system only for designing fail-stop kind of fault tolerance against the fault model of transient bit errors in memory. ABFT is suited for applications using regular structures. Its applicability is valid for a limited set of problems. Therefore, it lacks of generality. The use of logic statements or assertions at different points in the program that reflect invariant relationships between the variables of the program can lead to different problems. Because, assertions are not transparent to the programmer and their effectiveness largely depends on the nature of an application and on the ability of a programmer. Again, the success of Control Flow Checking largely depends on partitioning an application program in basic blocks (branch - free parts of code). For each block, a deterministic signature is computed and errors can be detected by comparing the run-time signature with a pre-computed one. In most of the control flow checking techniques, one of the main problems is to tune the test granularity that should be used. In procedure duplication (PD) [19], a programmer decides to duplicate critical procedures and to compare the obtained results for detection of transient bit - errors. Here, a programmer has to define a set of procedures to be duplicated and to introduce the proper checks on the results. So, PD approach is useful to detect a few bit errors only, towards fail-stop fault tolerance through re-starting an application. Such PD is not effective for burst errors because a procedure which is affected by bursts of errors, may not produce result because of hanging or wrong memory accesses or other exceptions etc., and because of no results, we cannot compare the results, and as a result, such PD cannot detect burst errors. These SIFT techniques that basically rely on a set of carefully chosen software detection techniques, aim towards detection of few bit - errors in memory towards fail-stop kind of fault tolerance through system reset and they lack of generality and applicability. Row-checksum based fault detection and tolerance has been discussed in the work [11]. Interested readers should refer to other important works on hardware or software implementations of time-constrained and reliable embedded systems [10, 12, 13] also. The proposed SVS approach is able to detect (by its error checking code) the burst errors (through checking the NOP codes) as well as bit-errors (through checksum) in a block of code that consists N number of useful instructions and N number of inserted NOP code, and it tolerates faults through execution of a similar, spare and correct block of code to go ahead with the application (unlike a fail-stop kind of fault tolerance in a SIFT) without resetting the system. However, the proposed SVS converges to a fail-stop approach when there is no more spare and correct block. Conventional SIFT techniques are useful for the fault model of few bit errors only. But, the proposed approach works well against the fault model of burst errors (even all bit errors in the NOP instructions that get detected by the error processing code), as well as, against a few bit errors in memory that affect a NOP code or useful instructions or error processing code, detected by the inclusion of checksum at an error processing code. Like any conventional SIFT ,or triple modular redundancy (TMR) based fault tolerance schemes, this SVS approach also cannot claim to be free from an overhead on code and execution time redundancy. Execution time redundancy as observed in SVS, is about 2.7 times the basic application code without any software fix for fault tolerance. If an enhanced application program is partitioned into three basic blocks, then we can tolerate three faults using more time (3.3 times the basic application code without fault tolerance) overhead.
3. Description of Works
The proposed SVS technique relies on two copies or images of an enhanced application program (partitioned into two basic blocks), in order to tolerate two faulty non-identical blocks. Here, both the blocks at either copy of an application can be faulty, or one faulty block is at one copy , and another faulty block is at a non-identical block of the duplicate one. The application is enhanced by inserting “No Operation” (NOP) instruction code after every basic - application – instruction code ( say, Ii ). The selection of NOP-code for insertion, is guided by the fact that it is one byte long and normally it incurs a delay of one machine clock only without changing the flag bits in the processor-status-word (PSW). During the run-time of the application, the self-checking error-processing-code verifies for the possible NOP – code – corruption (at those known locations where extra NOP codes have been inserted) in its block, and compares the computed checksum with the pre-computed checksum. If errors in extra NOP-codes in a block are detected, or if a checksum-mismatch is detected, then the program control branches to the error-processing code of the similar block of code at the other spare image, for executing that spare block after it is detected as free of error. Again, errors at extra NOP-codes ascertain the possible errors in their adjacent application-instructions also, if we consider the fault model of seven or more contiguous byte-errors or burst errors. Checksum is employed in order to detect transient or permanent bit errors at the useful instructions, data and at the error-processing code itself. The SVS technique, relying on two images, is explained in the following steps. The notation, Lij , in general, denotes the memory- location of the i-th NOP – code (i.e. inserted one) in the j-th image of the application. The notation ECi j denotes the error processing code for the i- th block in the j-th image of an application. The ECi j error-processing code is intended to check the code at the i-th block of the j-th copy or j-th image say, Bi j. The symbol "[ ]" denotes the byte-content. The images execute on global input, output variables. Another global variable say, PREV_EC is used to note the most recently (earlier to the present block’s error code) executed error-processing code that had detected errors positive in its block, just earlier to the present block’s error-processing code which is being executed. In other words, PREV_EC denotes the most recently skipped block because of detected errors in that block. For an example, if the execution flow skips the erroneous m-th block and enters the error checking code of the n-th block that is being detected as erroneous also, then PREV_EC holds the m –value only. We use the symbols "/* " and "*/ " to enclose general remarks. The following steps describe the proposed SVS scheme that relies on two similar images of an application.
Image1 /* For understanding the SVS scheme, we have taken two images of an application. For simplicity, we have considered that each image consists two basic blocks only. Each block is preceded by an error-processing code. */ /* EC11 , the error-processing code of the 1st block of the 1st image (beginning at location say, LE11 ), covers the useful application instructions I1 through I6 along with six extra NOP instructions in the B11 block. */ /* For simplicity, we may consider that a block of code contains six useful application -instructions and six extra NOP instructions. In other words, each error-processing code detects errors in total twelve instructions only. */ LE11 :
If ( [L11] == [L21] ==[L31]==[L41] == [L51] ==[L61] ¹ NOP-code) OR Checksum ¹ Pre-computed Checksum then: /* The EC11 checks for errors in the 1st block of the 1st image or B11 . If B11 is checked error positive, then note it at PREV_EC. */ Set PREV_EC = 11 /* i.e., the EC of the 1st block of the 1st image has detected errors in B11 */ Jump LE 12 /* Branch to the Error-processing code of the 1st block of the the 2nd image (i.e., EC12 ) that begins at location say, LE 12 in the next spare Image2 for attempting a similar block) */ /* Most recently, the B 11 block in Image1 was detected faulty and skipped, so, check for errors in the similar and spare block B 12 by executing its error processing code i.e., EC12 that begins from location say, LE12 , in an attempt to execute - the spare and similar instructions ( i.e., block B12 ) at the next spare image or, the 2nd image. */ End if /*1st block of instructions begins here; its errors are detected by the above Error processing code (EC11 ) at the 1st image */ /* 1st block of application code in the 1st image ( say, B11 ) begins here */ I1 /* 1st instruction-code of the 1st block of the 1st image of an application */ L11 NOP /* Extra inserted NOP code */ I2 /* 2nd instruction- code of 1st image of an application */ L21 NOP /* inserted NOP- code */ I3 L31 NOP I4 /* 1st application- instruction- code */ L41 NOP I5 L51 NOP /* inserted NOP- code */ I6 L61 NOP /* 1st block of the 1st image ends here */ /* Error-processing code of the 2nd block of the 1st image */ LE21 : If ( [L71] == [L81] ==[L91] ==[L101] == [L111] ==[L121] ¹ NOP-code) OR Checksum ¹ Pre-computed Checksum, then: Set PREV_EC = 21 /* i.e., the EC of the 2nd block of the 1st image or tests the B 21 as erroneous */ Jump LE 22 /* 2nd Block of Image1 is faulty, so try the 2nd block of the 2nd Image of an application */ /*Jump to the error-processing code of 2nd block of the 2nd image. EC22 starts at location say, LE 22 in Image2 */ End if /* 2nd block of instructions (B21 ) of the 1st image begins here; it is covered by the above Error processing code (EC21 ) in the 1st image */ I7 L71 NOP I8 L81 NOP I9 L91 NOP I10 L101 NOP I11 L111 NOP I12 L121 NOP /* 2nd block of instructions in the 1st image (say, B21 ) ends here */ /* An application can have more than two basic blocks also. */ END /* End of the Image1 */ /*Extra (on an average, the number of bytes in a block in an application) NOP instructions in between two images can be inserted to tolerate a wrong program flow out of the Image1, Image2 */ NOP /* Extra NOP codes at the end of an image in order to catch some of the faulty program flow */ NOP NOP Jump LE12 /* Jump to the beginning of Image2 . Because of faulty program flow, switch to next image */ /* Image2 begins here */ /* EC12 , the error-processing code of the 1st block of the 2nd image (beginning at location say, LE12 ), checks the block of instructions I1 through I6 along with the extra NOP s at the 1st block of 2nd image or B12 block */ LE12 : If ( [L12] == [L22] ==[L32]==[L42] == [L52] ==[L62] ¹ NOP-code) OR Checksum ¹ Pre-computed Checksum, then: If 3rd Image exists, then: Jump to the Error-processing code of the 1st block of the 3rd image. /* if next image exists, then branch to the corresponding Error Processing Code of the 3rd image i.e., the EC13 that begins at location say, LE 13 at the next Image3 , in order to execute - the corresponding similar instructions ( in block B13 ) at the 3rd image. */ Else if PREV_EC == 11, then: /* If no spare and correct image does not exist and, if the most recently executed error processing code (having tested error positive) was say, EC11 ( i.e., PREV_EC = 11) which was executed earlier to the present block’s error code i.e., EC12 ; then it indicates that 1st block of code of the 1st image was found to be erroneous and, its spare similar block (i.e., 1st block of code of the 2nd image) has also been found to be faulty, so application crashes. */ Restart /* Application crashes. Re-load and Re-compute the application. */ End if End if /*1st block of instructions begins here; it is covered by the above stated Error processing code (EC12 ) */ /* 1st block of application code in the 2nd image ( say, B12 ) begins here */ I1 /* 1st instruction-code of the 1st block of the 2nd image of an application */ L12 NOP /* Extra inserted NOP code */ I2 /* 2nd instruction- code of the 1st block of the 2nd image of an application */ L22 NOP /* inserted NOP- code */ I3 L32 NOP I4 L42 NOP I5 L52 NOP /* inserted NOP- code */ I6 L62 NOP /* 1st block of the 2nd image ends here */ /* Error-processing code (i.e., EC22 ) of the 2nd block of the 2nd image begins here */ LE22 : If ( [L72] == [L82] ==[L92] ==[L102] == [L112] ==[L122] ¹ NOP-code) OR Checksum ¹ Pre-computed Checksum, then: if 3rd image exists, then: /* with three images */ Jump to the 2nd error-processing code of 3rd image (i.e., EC23 starting at location say, LE 23 ) in Image3 Else If PREV_EC ¹ 21, then: /* with two images */ Jump LE2 1 /*Jump to execute the EC2 1 */ /*A typical execution flow shows how this SVS approach can tolerate two faulty blocks : Execute EC11 (errors in B11 are detected, so skip B11 ) à execute EC12 (no error is detected in B12 code, so similar block B12 is executed) à execute EC22 (errors in B22 are detected, so skip B22 ) à execute EC21 (no error is detected in similar and spare block B21 code, so B21 is executed). Thus, two faults or two erroneous blocks are detected and tolerated by using two copies of the SVS based application, and each application copy has two basic blocks. */ Else: Restart /* Application crashes. Restart the application for re-computing */ End if End if /* 2nd block of instructions (B22 ) of the 2nd image begins here. */ I7 L72 NOP I8 L82 NOP I9 L92 NOP I10 L102 NOP I11 L112 NOP I12 L122 NOP /* 2nd block of instructions in the 2nd spare image (say, B22 ) ends here */ /*Extra NOP codes in between two images are to tolerate a wrong program flow out of the Image2, Image3 */ NOP /* Extra NOP codes at the end of an image in order to tolerate faulty program flow */ NOP NOP : Jump LE12 /* Jump to the beginning of Image2 */ /* Because of faulty program flow, switch to adjacent image */ 3.1. Errors Effected Execution Flow for Tolerating faulty Blocks:
The following figure-1 shows the errors effected possible inter-block, intra-image and inter-image execution flow for tolerating various erroneous or faulty blocks in an application that is based on SVS scheme. The string “XC” denotes “execute the correct block”, whereas the string “SE” denotes “skip the erroneous block” . An error processing code (not shown in figure-1) at each basic block detects errors in that block.
Figure 1. Errors Effected Execution Flow for Tolerating Faults in SVS Scheme.3.2. An Example of SVS Based Application:
/*For understanding the approach we take the example of computing the factorial of N using pseudo code of Intel 8086 MASM. */ /* We incorporate an error processing code at the beginning of a basic block of code, and for a large application, we incorporate an error processing code at a location prior to the branch instruction or prior to the procedure code also. The following SVS based application uses two copies. Each copy has two basic blocks. Each basic block is preceded by an error processing code (EC) that checks for burst errors (through checking NOP codes’ correctness) in the block or, a few bit errors in the useful instructions or in the error processing code itself (through added checksum in an EC). The number of instructions in a block may vary from block to block. The more the number of instructions in a block is, the more delay (because of more checking time) is for switching to a spare block (image) for tolerating erroneous blocks. */
Image 1: /* The Error-processing code for the 1st block of the 1st image is denoted by EC11. The symbol Ú denotes logical OR ing. The symbol " denotes logical XOR ing. */ /* EC11 begins here */ LVL11: If ( [L11] " [L21] " [L31] " [L41] " [L51] " [L61] ¹ 0 ) Ú ( run-time-checksum " stored- checksum ¹ 0 ), then: MOV PREV_EC, 0Bh /* Set PREV_EC = 11 */ /* Most recently executed Error Processing Code is EC11 */ JMP LVL12 /* Jump to the beginning of similar (1st) block of code at the spare 2nd image i.e., Image2 */ /* 1st block of Image1 is faulty, so attempt the next similar image */ End if /* EC11 ends here */ /* B11 begins here */ MOV AH, 01 h ; Read the number N from keyboard L11 NOP INT 21h ;Request INT 21h service 1 L21 NOP SUB AL, "0" ; Input number in AL L31 NOP MOV DL, AL L41 NOP DEC DL L51 NOP MOV CX, DL ; Set the counter register for iteration L61 NOP /* B11 ends here */ /* EC21 begins here */ LVL21: If ([L71] " [L81] " [L91] " [L101] " [L111] " [L121] ¹ 0 ) Ú ( run-time-checksum " stored- checksum ¹ 0 ), then: If PREV_EC = 22 , then: /* If 2nd block in 2nd image is erroneous */ /* Application Crashes: no spare image of similar (2nd) block, because the 2nd blocks of both the images are erroneous */ Restart the application /* Reload and restart the application */ Else /* Try the spare 2nd block of the 2nd image, if it is not erroneous */ MOV PREV_EC, 15h /* Set PREV_EC = 21, because 2nd block in 1st image is erroneous */ JMP LVL22 /* Jump to the beginning of similar (2nd) block of code at Image2 */ /* Most recently executed (and detected error positive) Error Processing Code is the EC21 that detected errors in the 2nd block of the 1st image of an application */ /* 2nd Block of the Image1 is faulty, so try the similar spare 2nd block of the 2nd image */ Endif Endif /* EC21 ends here */ /* Block B21 begins here */ LVL1: MUL DL ; AX ß AL * DL L71 NOP DEC DL L81 NOP LOOP LVL1 ; repeat through the level LVL1, until the counter register CX is decreased to zero L91 NOP MOV DL, AL L101 NOP MOV AH, 02h L111 NOP INT 21 h ; Display factorial L121 NOP INT 20 h ; Stop END /* Block B12 ends here */ Image2 : /* The Error-processing code for the 1st block of the 2nd image is denoted by EC12. The symbol Ú denotes logical OR ing. The symbol " denotes logical XOR ing. */ /* EC12 begins here */ LVL12: If ( [L12] " [L22] " [L32] " [L42] " [L52] " [L62] ¹ 0 ) Ú ( run-time-checksum " stored-checksum ¹ 0 ), then: If PREV_EC = 11, then: /* 1st block of 1st image was erroneous and the similar spare block in the 2nd image is also found to be erroneous, i.e., application crashes. */ Restart the application /* Reload and restart the application */ Else MOV PREV_EC, 0Ch /* Set, PREV_EC = 12 */ JMP LVL11 /* Try the similar 1st block in 1st image if it is OK */ Endif Endif /* EC12 ends here */ /* Block B12 begins here */ MOV AH, 01 h ; Read the number N from keyboard L12 NOP INT 21h ;Request INT 21h service 1 L22 NOP SUB AL, "0" ; Input number in AL L32 NOP MOV DL, AL L42 NOP DEC DL L52 NOP MOV CX, DL ; Set the counter register for iteration L62 NOP /* Block B12 ends here */ /* Error checking code EC22 begins here */ LVL22: If ([L72] " [L82] " [L92] " [L102] " [L112] " [L122] ¹ 0 ) Ú ( run-time-checksum " stored- checksum ¹ 0 ), then: /* 2nd block of 2nd image is found to be erroneous */ If PREV_EC = 21 , then: /* 2nd block of 1st image is also erroneous */ /* Application Crashes: no spare image of similar (2nd) block, because the 2nd blocks in both the images are erroneous */ Restart the application /* Reload and restart the application */ Else /* Try the 2nd block of the 1st image, if it is not erroneous */ MOV PREV_EC, 16h /* Set PREV_EC= 22, because 2nd block in 2nd image is erroneous */ JMP LVL21 /* Jump to the beginning of similar (2nd) block of code at Image1 */ /* Most recently executed (with error positive) Error Processing Code is the 2nd EC in the 2nd image of a block, EC22 */ /* 2nd block of Image2 is faulty, so execute the spare image */ Endif Endif /* Error checking code EC22 ends here */ /* Block code B22 begins here */ LVL1: MUL DL ; AX ß AL * DL L72 NOP DEC DL L82 NOP LOOP LVL1 L92 NOP MOV DL, AL L102 NOP MOV AH, 02h L112 NOP INT 21 h ; Display factorial L122 NOP INT 20 h ; Stop END /* Block code B22 ends here */ 4. A New Approach to Reliability Analysis & Contribution:
Based on the basic steps as stated in the previous section and the reliability analysis through instructions flow (as proposed by the author), we get the following possible program flows for deriving the probability of uncertainty of this SVS scheme. We have taken the SVS scheme with two images of an application. Each image has two block of codes. For each block of code, we have used one error-processing code. In a typical SVS based application, each basic block consists of six (N) application-instructions and six (N) extra NOP instructions. We have avoided inter- block branching. The notation EC11 * indicates error processing code that has detected various erroneous NOP code (burst errors) or bit errors in the block B11 .
The all possible Execution Flows ( the SVS with two images only) :
i ) EC11 à B11 à EC21 à B21 , ( no error or fault has occurred in the 1st image i.e., one successful execution of an application ). ii) EC11 * (fault is detected in B11 , so skip the block B11 ) à EC12 à B12 à EC22 à B22 , ( fault in the B11 block is tolerated ). iii) EC11 à B11 à EC21 * ( fault in B21 ) à EC22 à B22 , ( fault in B21 block is tolerated ). iv) EC11 * ( fault in B11 ) à EC12 à B12 à EC22 * ( fault in B22 ) à EC21 à B21 , (total two faults i.e., one fault at B11 and another one at B22 are tolerated ). v) EC11 * ( fault in B11 ) à EC12 * ( fault in B12 ) à Restart, (application crashes because of no correct spare block and in such case, restart the application for fail-stop kind of fault tolerance). vi) EC11 à B11 à EC21 * ( fault in B21 ) à EC22 * ( fault in B22 ) à Restart, (application crashes because of no correct spare block and in such case, restart the application for fail-stop kind of fault tolerance).
Table 1 : Performance Analysis of SVS | ß TIME REDUNDANCY à|
5. Experimental Results
To evaluate the feasibility and effectiveness of this proposed SVS approach, we have applied this approach on a set of simple C programs (e.g., multiplication of two matrices of 8 x 8 integer values, an implementation of bubble sort algorithm on a vector of 12 integer elements, used as benchmarks) by manually modifying their source code according to the proposed SVS approach. Motorola 68040 processor and the compiler - Single Step 7.4 by SDS, Inc., have been used with disabling all compiler optimizations when compiling the SVS based application with two images. It is observed that on average, the size of the executable code is increased by 2.6 times and the source code grows by 2.9 times. An average slow-down of about 2.7 times is observed for tolerating two sequential faults. The fault injection environment as described in [15,16] , is built around an application board hosting a 25 MHz Motorola 68040 processor, 2 Mbytes of RAM memory, and some peripheral devices. Fault injection is performed exploiting an ad hoc hardware device which allows monitoring the program execution and triggering a fault injection procedure when a given point is reached. For experiments, the adopted fault model is the multiple-bit flip into memory locations. Bit errors are randomly generated. On injecting total 2000 randomly generated faults ( 500 in the memory area containing data, and 1500 in the memory area containing the code) in an application (based on SVS with two images) of Matrix multiplication of two matrices composed of 8x8 integer values , out of total 2000 errors the number of hardware detected (by Error Detection Mechanisms e.g., microprocessor exceptions) is 382 and the Software detected (by this SVS approach in case of disagreement among the NOP code or through checksum mismatch ) is 764 and the fail silent ( not producing any difference in the program behavior) is 841. The rest 13 are the fail silent violations (not detected by any error detection methods).
Again, on injecting total 2000 randomly generated faults ( 500 in the memory area containing data, and 1500 in the memory area containing the code) in an application (SVS based with two images) of solving a series of twenty quadratic equations, out of total 2000 errors, the number of hardware detected (by Error Detection Mechanisms e.g., microprocessor exceptions) is 423 and the Software detected (by this SVS approach in cases of disagreement among the extra NOP instructions or through checksum mismatch at our error processing code EC) is 772 and the fail silent ( not producing any difference in the program behavior) is 791. The rest 14 is for fail silent violations (not detected by any error detection methods).
On an average, out of total 2000 errors, the average number of hardware detected (by Error Detection Mechanisms e.g., microprocessor exceptions) is 402.5 and the average number of Software detected (by this SVS approach in such case of disagreement among the extra NOP instructions or through checksum mismatch at EC) is 768 and the average number of fail silent ( not producing any difference in the program behavior) is 816. The average number of fail silent violations (not detected by any error detection methods) is 13.5. This is observed that the SVS approach has worked well along with other ECC in memory.
This work aims not to tolerate software design bugs , rather it aims to design a robust and a low cost solution for tolerating memory burst errors or bit errors of multiple bit flips during the execution of an application on adopting the code redundancy along with self-checking code. It can tolerate two consecutive faults on using only two copies of an application. Wrong inter-block program flow inside an application is detected when a program control enters a block without executing its error processing code, and such events are caught by examining the PREV_EC variable.
Limitations :
The proposed SVS scheme’s effectiveness is limited over the memory burst errors or multiple bit flips in memory. However, a wrong program flow within a block of code in an application cannot be detected in this SVS approach. Processor register bit errors also remain undetected in SVS. Register bit errors can be detected by comparing the results on executing both the copies of an application with similar input. This SVS does not aim to detect register bit errors. Again, the detection of a wrong program flow out of the application image but in between the two copies of an application is limited by the number (on average, it is the number of bytes in a block of a copy of an application) of extra NOP code, inserted in between the two copies only. Again, an error that may occur in a block after the execution of its error-processing -code remains undetected until the re-usage of a block. In other words, the fault detection capability of this SVS scheme is limited over the errors in a block that might have occurred before the execution of its error-code only and such errors get detected only at the subsequent reference of the erroneous block. We rely that such error which occurs just after the execution of an error processing code, will get detected by a microprocessor's exception handling mechanism. Thus, this SVS along with processor's exceptions is an effective solution to attain non-fail-stop kind of fault tolerance through hardening a processor-based application at the cost of an affordable overhead on both the time and space.
6. Conclusion
The proposed SVS technique is a low cost and an effective solution towards designing a fault tolerant application system (against memory burst errors), because it does not bear the cost of designing multiple independent versions of an application software, and the cost of multiple machines, as required in N-modular system or NVP system. This SVS technique has no overhead with the synchronization tasks as demanded in NVP system. This SVS technique needs only two images of the enhanced application software running on one machine only, in order to tolerate two sequential faults for designing a non-fail-stop kind of fault tolerance for the burst errors failure model. Though the targeted fault model is assumed to be a burst of modified bits over a few consecutive bytes (or multiple bit flip), this approach aims to be a useful one for tolerating operational faults of other faults model also. Error correction is not incorporated in this technique because of very high time redundancy with multiple byte-error recovery codes. However, this SVS technique can be tailored to fit the exact requirement of a typical application system through proper study and at the cost of an additional time for designing such single-version fault-tolerant application system. The overhead of extra execution time and space redundancy can easily be afforded using an affordable and modern higher speed processing system, in order to gain such reliable computation. This SVS technique is also useful for preventing erroneous results caused by corrupted codes. This technique can be a useful tool for designing various computer controlled application systems and for hardening a processor-based application by acting on the software also. The SVS supplements the ECC of a memory system and detects and tolerate memory burst errors also.
References:
[1] L. Spainhower and T.A. Gregg, "IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective," IBM Journal of Research & , Vol. 43, No. 5/6, 1999. [2] T. Sato, "Analyzing Overhead of Reissued Instructions on Data Speculative Processors," Workshop on Performance Analysis and its Impact on Design, held in conjunction with 25th International Symposium on Computer Architecture, 1998. [3] Stephen B. Wicker “ Error Control Systems for Digital Communication and Storage”, Prentice Hall, NJ, USA, pp.72- 127, 1995. [4] B. Randell, "Design - Fault Tolerance," in The Evolution of Fault-Tolerant Computing, A. Avizienis, H. Kopertz, and J.-C. Laprie, eds., Springer-Verlag, Vienna, pp. 251-270 1987. [5] A. Avizienis, “The N-Version Approach to Fault – Tolerant Systems,” IEEE Transactions on Software Engineering , vol. SE -11, No. 12, Dec., 1985,pp.1491-1501. [6] K.H. Huang, J.A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Transactions on Computers, vol. 33, pp. 518-528, 1984. [7] M. Zenha Rela, H. Madeira, J.G. Silva, "Experimental Evaluation of the Fail-Silent Behaviour in Programs with Consistency Checks," Proceedings of the FTCS-26, pp.394-403, 1996. [8] S. Yau, F. Chen, "An Approach in Concurrent Control Flow Checking," IEEE Transactions on Software Engineering, vol. SE-6, no. 2, pp. 126-137, 1980. [9] K.H. Kim and H.O. Welch, “Distributed Execution of Recovery Blocks: An Approach for uniform Treatment of Hardware and Software Faults in Real- Time Applications,” IEEE Transactions on Computers, vol.38, No. 5, May 1989, pp. 626-636. [10] R.K. Gupta, C.N. Coelho, G. De. Micheli," Program Implementation Schemes for Hardware - Software Codesign", IEEE Computer, pp. 48-55, June 1994. [11] Goutam K. Saha, “Transient Fault Tolerant Processing in a RF Application,” International Journal – System Analysis Modelling Simulation, vol. 38, pp.81-93, 2000, Gordon and Breach, USA. [12] R.K. Gupta, C.N. Coelho Jr., G.De. Micheli, "Sysnthesis and Simulation of Digital Systems Containing Interacting Hardware and Software Components", Proc. Design Automation Conference, June 1992. [13] Yervant Zorian, Dimitris Gizopoulos, "Design for Yield and Reliability", IEEE Design & Test, May/June, 2004. [14] G.K. Saha, "Designing an EMI Immune Software for Microprocessor Based Application," Proceedings 11th IEEE International Symposium, EMC'95, Switzerland, pp. 401-404, March, 1995. [15] A. Benso, P.L. Civera, M. Rebaudengo, M. Sonza Reorda, "An Integrated HW and SW Fault Injection Environment for Real -Time Systems", Proc. IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 117-122, 1998. [16] M-C Hsueh, T.K. Tsai, and R.K. Iyer, "Fault Injection Techniques and Tools," IEEE Computer, pp. 75-82, April 1997. [17] Goutam Kumar Saha, "Fault Tolerant Computation for a Scientific Application," CSI Communications, Computer Society of India Press Mumbai, Vol. 20(5), May 1996. [18] Goutam Kumar Saha, "EMP - Fault Tolerant Computing: A New Approach," International Journal of Microelectronic Systems Integration, Vol. 5(3), pp. 183-193, Plenum Publishing Corp, USA, 1997. [19] Goutam Kumar Saha, " Software Implemented Fault Tolerance Through Data Error Recovery," ACM Ubiquity, Vol 6(35), ACM Press, 2005.
Author’s Biography: In his last seventeen years’ research and development experience, he has worked as a scientist in LRDE, Defence Research & Development Organisation, Bangalore, and at Electronics Research & Development Centre of India , Calcutta. At present, he is with Centre for Development of Advanced Computing, Kolkata, India, as a Scientist-F. He has also taught post-graduate level Computer Science courses for five years. He has authored around hundred research papers in various International Journals, Magazines and Conference Proceedings including SAMS Journal, ACM, C&EE J., IEEE, CSI etc. He is a senior member in IEEE, Computer Society of India, ACM and a fellow member in IETE, MSPI, IMS etc. He has received various awards, scholarships and grants from national and international organizations. He is a referee for AMSE journals (France), IJCSA, CSI Journals, IJCPOL and IEEE Potentials Magazine etc. His field of interest is on dependable, fault tolerant computing and NLE. He is an associate editor of the ACM Ubiquity. He can be reached via sahagk@gmail.com, gksaha@rediffmail.com.
Source: Ubiquity Volume 6, Issue 40 (November 2-8, 2005) www.acm.org/ubiquity Forum Printer Friendly Version[Home] [About Ubiquity] [The Editors]
|