Main Page


Revision as of 13:23, 2 June 2009 by Christian Engelmann (Talk | contribs)

Welcome to the HPC Resilience Consortium Wiki!

About HPC Resilience

Recent trends in high-performance computing (HPC) system architecture clearly indicate that future increases in performance, beyond those resulting from improvements in single-processor/node performance, will be achieved through corresponding increases in system scale, i.e., by using a significantly larger component count. As the raw computational performance of the world's fastest HPC systems increases to next-generation extreme-scale capability and beyond, the number of their computational, networking, and storage components will grow from today's 10,000 to several hundred thousand to tomorrow's 1,000,000 and beyond. This substantial growth in system scale poses a resilience challenge for HPC system and application software.

In addition to hard errors, such as component wear-out, soft errors are becoming a significant source of interruptions in large-scale HPC systems. Detectable unrecoverable errors (DUEs), such as error-correcting code (ECC) double-bit errors, normally occur at random in a single memory module once within a few million hours of operation. They are nevertheless an emerging threat to large-scale HPC systems because of the number of memory modules these systems employ. Vendors have also warned that silent data corruption (SDC), i.e., undetected bit flips in unprotected memory, latches, buses, and logic, is becoming a problem as well. The SDC issue is even more serious with field-programmable gate arrays (FPGAs) and graphics processing units (GPUs). In fact, their significantly higher soft error vulnerability is a main reason why they are currently not used in large-scale HPC systems.
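The effect of module count on DUE frequency can be illustrated with a back-of-the-envelope calculation. The following Python sketch is only an illustration: it assumes that modules fail independently at a constant rate, so the system-wide mean time between DUEs is the per-module figure divided by the module count. The per-module value of 2,000,000 hours and the module counts are hypothetical numbers chosen for the example, not measurements, and the function name is made up for this sketch.

```python
def system_mtbf_hours(module_mtbf_hours: float, module_count: int) -> float:
    """Estimate the system-wide mean time between DUEs in hours.

    Assumes independent modules with a constant failure rate, so the
    per-module failure rates simply add up across the system.
    """
    if module_mtbf_hours <= 0 or module_count <= 0:
        raise ValueError("both arguments must be positive")
    return module_mtbf_hours / module_count


# Hypothetical per-module figure: one DUE every 2,000,000 hours.
MODULE_MTBF_HOURS = 2_000_000

# Hypothetical memory module counts for systems of increasing scale.
for count in (10_000, 100_000, 1_000_000):
    hours = system_mtbf_hours(MODULE_MTBF_HOURS, count)
    print(f"{count:>9,} modules: one DUE every {hours:g} hours on average")
```

Under these assumptions, a system with 10,000 modules would see one DUE roughly every 200 hours, while a system with 1,000,000 modules would see one roughly every 2 hours. Real systems may deviate from this simple model, for example when failures are correlated.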

Next-generation extreme-scale HPC systems need to be able to deal with frequent failures in such a manner that their capability is not severely degraded and their correctness is maintained. The field of HPC resilience encompasses all research areas and development efforts that share the common goal of error-resilient HPC systems.

About the HPC Resilience Consortium

The HPC Resilience Consortium was formed in 2008 to bring together a broad spectrum of architects, researchers, developers, system administrators and system support personnel from government, academia and industry to address urgent HPC resilience requirements for future-generation extreme-scale systems. The main goal of the Consortium is to concentrate research and development efforts in HPC resilience toward the common goal of increasing the productivity of current- and next-generation HPC systems by enabling interdisciplinary collaboration across institutions.

In 2009, participants of the Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids decided to adopt this Wiki as a platform to collect information on active projects and existing solutions and to coordinate future research activities. You can also subscribe to a moderated mailing list. The objective of this list is to coordinate activities and disseminate information on fault tolerance in high-performance computing and related areas.

About this Wiki

The purpose of this Wiki is to disseminate information about HPC resilience activities and resources to the community and to operate as a platform for discussion/interaction within the community. This Wiki provides information about HPC resilience focus areas (Standardization, Models, Data, Solutions) and HPC resilience conferences.

  • Accessing the Wiki content:
    • Read access is open to everyone; no user registration is required.
    • Write access is limited to registered users.
    • User accounts can be requested and are approved by Dr. Box Leangsuksun.
    • In case of problems with the Wiki, please e-mail James Elliott.
  • Adding/Removing/Modifying Wiki content:
    • All Wiki content is provided under the Creative Commons License.
    • Do not post any inappropriate material, such as copyrighted work without permission or unrelated content.
    • Submitted content should contain at least an abstract/summary and a reference to further material, such as a publication or project Web page.
    • Be clear on the current status of a particular contribution, such as specification, design, early prototype, or production implementation.
    • Write out acronyms at first-time use, e.g., high-performance computing (HPC).
    • List literature and related projects in a separate References section on the same page where they are referenced. Use the Cite extension of the Wiki.
    • Use the Institute of Electrical and Electronics Engineers (IEEE) format for references to literature. CiteULike may be of help.