Architecting HPC Systems for Fault Tolerance and Reliability (Intro)

This blog, written by Blake Gonzales, discusses issues around designing, administering, running, and architecting clusters from the perspective of someone with a background in commercial HPC.

Architecting HPC Systems for Fault Tolerance and Reliability (Introduction to a multi-part series)

01/11/2010

The complex nature of HPC systems can at times undermine their ability to reliably complete the tasks at hand. At the same time, HPC systems are generally relied upon to perform many hundreds or thousands of independent jobs simultaneously. In many cases, the work they perform is critical in nature. Because of this, reliability and fault tolerance are of utmost concern in HPC.

Shared-memory multiprocessor (SMP) systems are generally prone to system-wide failures caused by a single error in memory, CPU, or disk. Preventing single errors from causing outages in SMP solutions has always been a struggle. With the widespread adoption of clustered HPC technology over the last decade, the risk of system-wide failure due to a single point of failure can be minimized. To deliver that increased reliability, however, these clustered solutions must be designed correctly.

There are many “moving parts,” so to speak, in clustered solutions, so it is important to design each subsystem with an eye to how it relates to the other subsystems. Here I would like to explore the key hardware and software components that are likely to cause system-wide failures, and suggest architectural design techniques to prevent such failures.

-- Blake Gonzales

See other posts from this Blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION




Latest page update: made by HPCBrad, Mar 29 2010, 4:21 PM EDT