A Software Fix Towards Fault-Tolerant Computing

Goutam Kumar Saha, Member, ACM
CA-2/4B, Party Office Road, Baguitai, Deshbandhu Nagar, Kolkata 700059, WB, India

ABSTRACT

INTRODUCTION
Traditionally, potential electrical fast transient (EFT) bursts are controlled by means of clamping devices, shielding, grounding, isolation, balancing, and so on. Traditional error-detection methods such as checksums and CRCs cannot detect and repair an arbitrary number of random errors, and EFT errors are highly random in nature. However, properly designed firmware and software fixes play an important role in achieving higher system electromagnetic compatibility (EMC) by controlling potential EFTs and EFT bursts without increasing size, weight, or hardware cost. Designing such a software fix is a very challenging task.

THE NEW SOFTWARE TECHNIQUE
A subroutine named "FAULTCHK" has been designed to detect any kind of fault in memory or program flow caused by a potential EFT, EFT burst, or electromagnetic pulse (EMP), and then to take the necessary recovery action. The proposed technique can detect errors or faults, if any exist, very quickly, before they have a chance to damage the system.
Suppose the code of the subroutine "FAULTCHK" starts at location, say, L1s, and one "NOP" instruction is inserted at location, say, Lns. Let Lnp be the locations in program memory, and Lnd the locations in data memory, where "NOP" instructions are inserted. The subroutine is invoked by a "CALL FAULTCHK" statement inside the main application program, which contains the basic processing logic.
The algorithm for the FAULTCHK subroutine is stated below. Algorithm: It shows how the subroutine "FAULTCHK" can detect whether any error or fault has occurred in the program memory, the data memory, or the program flow. It also shows how the subroutine can then take the necessary recovery actions if faults are present. It uses one memory variable, C. It checks the memory contents at the various locations where NOP instruction codes have been inserted intentionally.
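The kind of check described above can be illustrated with a minimal C sketch. It is not the author's listing: the names `faultchk`, `fault_t`, `sites_intact` and `burst_in_unused` are hypothetical, memory is modeled as plain byte arrays, the NOP opcode is assumed to be 0x00 (as on 8051-class processors), and the variable C is assumed to be a checkpoint counter that is compared with its expected value. The returned code stands in for the transfer of control to the corresponding error-handling routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOP_OPCODE 0x00u /* assumption: NOP encoding of an 8051-class CPU */

/* Hypothetical result codes; each stands in for a jump to an error routine. */
typedef enum {
    FAULT_NONE = 0,
    FAULT_PROGRAM_MEMORY, /* would transfer to Error_Program_Memory */
    FAULT_DATA_MEMORY,    /* would transfer to Error_Data_Memory */
    FAULT_BURST,          /* would transfer to Error_Burst */
    FAULT_FLOW            /* checkpoint variable C has an unexpected value */
} fault_t;

/* Returns 1 if every listed site still holds the planted NOP code. */
static int sites_intact(const uint8_t *mem, const size_t *sites, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (mem[sites[i]] != NOP_OPCODE) {
            return 0;
        }
    }
    return 1;
}

/* Returns 1 if two adjacent bytes of the NOP-filled unused region are both corrupted. */
static int burst_in_unused(const uint8_t *unused, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (unused[i] != NOP_OPCODE && unused[i + 1] != NOP_OPCODE) {
            return 1;
        }
    }
    return 0;
}

/* Checks program sites, data sites, the unused region, and the flow variable, in that order. */
fault_t faultchk(const uint8_t *prog, const size_t *prog_sites, size_t n_prog,
                 const uint8_t *data, const size_t *data_sites, size_t n_data,
                 const uint8_t *unused, size_t unused_len,
                 unsigned c, unsigned expected_c)
{
    assert(prog != NULL && data != NULL && unused != NULL);

    if (!sites_intact(prog, prog_sites, n_prog)) {
        return FAULT_PROGRAM_MEMORY;
    }
    if (!sites_intact(data, data_sites, n_data)) {
        return FAULT_DATA_MEMORY;
    }
    if (burst_in_unused(unused, unused_len)) {
        return FAULT_BURST;
    }
    if (c != expected_c) {
        return FAULT_FLOW;
    }
    return FAULT_NONE;
}
```

In this sketch the application would call `faultchk` periodically, passing the lists of planted locations (the Lnp and Lnd of the text), and would branch to a recovery routine on any result other than `FAULT_NONE`. The self-check of the subroutine's own code is left out for brevity.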
DISCUSSION
The steps of the subroutine "FAULTCHK" show that if errors are present in the program or data memory space, they are detected and program control is transferred to the error-handling routines "Error_Program_Memory" and "Error_Data_Memory", respectively, for recovery or for repair of the corresponding code. If adjacent bytes in the unused memory space (i.e., from L1u onward) are affected by a potential EFT burst, program control is automatically transferred to the error routine "Error_Burst". Similarly, the code of the "FAULTCHK" subroutine is itself verified while the routine is busy verifying the integrity of the main application program, with its basic processing logic, and of the data memory space. If errors or faults are detected in the "FAULTCHK" routine, program control is likewise transferred to an error-handling routine, "Error_Subroutine". The correctness of the program flow is also verified by checking the variable C (an internal register, if C is a register variable). If the Program Counter (PC) register is affected at that point of run time, this checkpoint is a very effective way to detect it. The inserted NOP instruction codes act as a test bed for catching burst errors due to potential EFTs. Thus this software fix provides highly reliable processing logic for any microprocessor-based application. It eliminates the possibility of system ambiguity or erroneous results during the run time of an application system.
From equation 1(a), it is evident that fault detection improves as more NOP instruction codes are inserted at various points inside the application. Executing a NOP instruction does not change the processor status word (PSW); it only introduces a delay of one machine cycle. The NOPs thus also keep the processor effectively idle, as in a sleep mode, while transients are present. Equation 1(b) shows that the time redundancy of the proposed technique is very low (less than 2) and easily affordable in comparison with traditional N-version programming, because the total number of machine cycles added by this software fix is less than that of a typical industrial application. It is negligible compared with that of an N-version software fix. Similarly, equation 1(c) shows that the space redundancy is well below 2 and only slightly above 1, whereas a three-version program has a space redundancy of more than 3. Thus, this software fix is a very cost-effective and economical tool for periodically detecting faults, if present, and then bringing the system to a known stable state through the various error-handling routines, with as little damage as possible. Interested readers may refer to other related works on fault recovery [3-8].

CONCLUSION