
EEC 276 – Fault-Tolerant Computer Systems: Design and Analysis

3 units – Winter Quarter; alternate years

Lecture: 3 hours

Prerequisite: EEC 170, EEC 180A

Grading: Letter; midterm (25%), final exam (35%), problem sets/projects (40%).

Catalog Description:

Introduces fault-tolerant digital system theory and practice. Covers recent and classic fault-tolerant techniques based on hardware redundancy, time redundancy, information redundancy, and software redundancy. Examines hardware and software reliability analysis and example fault-tolerant designs. Offered in alternate years.

Expanded Course Description:

  1. Fault-Tolerant Computing Overview
    1. Fundamental concepts
    2. Nomenclature
    3. Fault taxonomy, fault manifestation
  2. Fault Tolerance Techniques
    1. Hardware Redundancy
      1. Duplication, self-checking
      2. NMR
      3. Hybrid
    2. Information Redundancy
      1. Single-error-correcting, double-error-detecting (SEC-DED) codes
      2. Cyclic block codes
      3. Residue arithmetic
      4. Self-checking checkers
    3. Time Redundancy
      1. Duplication
      2. Recomputation methods
    4. Software Redundancy
      1. Design diversity/N-version programming
      2. Recovery blocks
    5. Hybrid Redundancy
      1. Algorithm-based fault tolerance
      2. Watchdog monitoring/signature monitoring
  3. Reliability Analysis
    1. Failure probability distributions
    2. System modeling
    3. Stochastic analysis, Markov chains
    4. Availability (mean-time-to-failure, mean-time-to-repair)
  4. Fault Tolerance in Commercial Systems
    1. Radiation-Hardened Processors (RH32, RH6000, RH3000)
    2. HAL SPARC64
    3. Tandem
    4. Airbus

Textbook/reading:

  1. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley (1989)
  2. Class notes and recent papers

Instructors: Redinbo, Wilken

THIS COURSE DOES NOT DUPLICATE ANY EXISTING COURSE.

Last modified: February 1997