Do You Really Need ECC Memory for CAD Workstation Computing?

January 31, 2013

I recently read an article by an Intel product manager on the need for ECC (error-correcting code) memory in CAD workstations. From the article: “Corrupted data can impact every aspect of your business, and worse yet you may not even realize your data has become corrupted. Error-correcting code (ECC) memory detects and corrects the more common kinds of internal data corruption.”

For some reason this triggered my memory of the sudden-acceleration Toyota Prius incident from 2010. The popular press latched on to the idea that cosmic rays were screwing with the electronics in the Prius. While this was theoretically possible, the probability was astronomically low. It did, however, make for a great story, and the FUD (fear, uncertainty and doubt) caused Prius prices to plummet temporarily and sales to slow to a crawl.

Back to ECC memory and CAD systems. Is there really a need for ECC memory in CAD, or is it just FUD marketing to upsell hardware and make products sound more valuable than they really are? I decided to do a little research.

Question:

Who needs ECC memory and what is its role in professional & CAD workstation computing?

Answer:

Naturally occurring cosmic rays can and do cause problems for computers down here on planet Earth. Certain types of subatomic particles (primarily neutrons) can pierce through buildings and computer components and physically alter the electrical state of electronic components. When one of these particles interacts with a block of system memory, GPU memory or other binary electronics inside your computer, it can cause a single bit to spontaneously flip to the opposite state. This can lead to an instantaneous error, incorrect application output and sometimes even a total system crash. However, a single-bit error caused by a cosmic ray strike on your PC or workstation’s memory is fairly rare: only about once every 9 years per 8GB of RAM, according to recent data.
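To make the effect of a single flipped bit concrete, here is a minimal Python sketch. It assumes the value in memory is a 64-bit IEEE 754 double, and the `flip_bit` helper and the 125.0 mm dimension are hypothetical, chosen only for illustration. It shows that the damage depends entirely on which bit is hit.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    raw ^= 1 << bit  # simulate the single-bit upset
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw))
    return flipped

dimension_mm = 125.0  # hypothetical part dimension stored in RAM

for bit in (0, 30, 51, 62, 63):
    print(f"bit {bit:2d} flipped: {flip_bit(dimension_mm, bit)!r}")
```

A flip in the lowest mantissa bit changes the value by far less than any manufacturing tolerance, while a flip in the top exponent bit or the sign bit turns it into nonsense.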

ECC technology, used both in system RAM and in devices such as high-end GPUs, can reliably detect and correct these errors, reducing the odds of memory corruption due to single-bit errors to about once every 45 years for 8GB of RAM. Of course, as with everything else in life, there are tradeoffs. ECC memory can be up to 10% slower and is significantly more expensive than standard non-ECC memory.
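For readers curious how detection and correction can work at all, the following is a toy sketch of a Hamming(7,4) code in Python. This is an illustration only: real ECC memory modules use wider codes (typically protecting 64 data bits with 8 check bits), and the function names here are made up for the example. The principle is the same, though. Extra parity bits are stored alongside the data, and on read-back they pinpoint which single bit, if any, has flipped.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position

stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a cosmic ray flipping one stored bit
recovered, where = hamming74_decode(stored)
print(recovered, "corrected bit at position", where)
```

This simple code corrects any one flipped bit per codeword; it cannot cope with two flips in the same word, which is one reason ECC lowers the corruption rate rather than eliminating it.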

Because the odds of a cosmic ray strike increase in direct proportion to the physical amount of memory (and related components) inside a computer, this is a real concern for large-scale, clustered supercomputing and other environments where computing tasks often include high-precision calculation sets that can take days or even weeks to complete. In supercomputer clusters, which often contain hundreds or even thousands of connected computer nodes and terabytes of memory, cosmic ray strikes on the system are much more likely, and much more costly. Restarting a week-long calculation on a supercomputer can cost a facility many tens of thousands of dollars in lost time, electricity and manpower, not to mention lost productivity.
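The scaling argument can be sketched as a bit of back-of-the-envelope arithmetic. The Python snippet below starts from the figure quoted earlier (about one error every 9 years per 8GB) and scales it linearly with memory size. The 32GB workstation and 8TB cluster sizes are hypothetical examples, and treating errors as independent random events (a Poisson process) is a simplifying assumption of mine, not something taken from the sources.

```python
import math

YEARS_PER_ERROR_PER_8GB = 9.0  # figure quoted in the article for non-ECC RAM

def expected_errors_per_year(memory_gb: float) -> float:
    """Expected single-bit errors per year, scaled linearly with memory size."""
    return (memory_gb / 8.0) / YEARS_PER_ERROR_PER_8GB

def prob_at_least_one_error(memory_gb: float, run_days: float) -> float:
    """Chance of one or more errors during a run, assuming Poisson arrivals."""
    expected = expected_errors_per_year(memory_gb) * run_days / 365.0
    return 1.0 - math.exp(-expected)

# Hypothetical configurations, for illustration only.
for label, memory_gb in [("32GB workstation", 32), ("8TB cluster", 8192)]:
    per_year = expected_errors_per_year(memory_gb)
    weekly = prob_at_least_one_error(memory_gb, run_days=7)
    print(f"{label}: {per_year:.2f} errors/year, "
          f"{weekly:.1%} chance of at least one error in a week-long run")
```

Under these assumptions the workstation sees well under one error per year and has less than a 1% chance of being hit during any given week, while the cluster expects more than a hundred errors per year and is very likely to see at least one during a week-long job.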

But even with a very beefy PC CAD workstation configuration with loads of RAM on board, you are probably not at imminent risk from problems caused by cosmic ray strikes and the resulting single-bit errors. Over the course of your work, you are much more likely to endure system crashes or application hangs due to failing components, power fluctuations and software bugs than due to cosmic ray strikes. Additionally, many applications in the desktop design and engineering space can actually endure a single-bit error without negatively impacting the computing process or product. For example, if the color or brightness of a single pixel on a display monitor is changed due to this type of memory corruption on the system’s GPU, nobody will ever see or notice it. There are many such examples of this type of error not really impacting one’s everyday work.

That said, many leading technology manufacturers are equipping their high-end products with ECC memory for compute-heavy (especially clustered supercomputing) applications where the benefits of error-correcting memory outweigh any comparative speed and cost drawbacks. AMD, for example, has engineered its new AMD FirePro W9000 and FirePro S9000 ultra-high-end GPU cards to include ECC memory, which the end user can selectively enable for advanced computing purposes where rock-solid stability and protection from space rays are crucial.

Sources:

  1. http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf
  2. http://en.wikipedia.org/wiki/ECC_memory
  3. http://www.smartm.com/files/salesLiterature/dram/smart_whitepaper_sbe.pdf

Author: Tony DeYoung
