posted by 은이종 2013. 5. 2. 17:06


Scenario:

Correctable Memory Error Threshold Exceeded (Slot x, Memory Module y).

Corrected Memory Error Threshold Passed (Slot x, Memory Module y).

Uncorrectable Memory Error detected by ROM-based memory validation.

Uncorrectable Memory Error (Slot x, Memory Module y).
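When these messages are collected from a log, it can help to separate correctable from uncorrectable events and pull out the slot and module numbers. The following is a minimal Python sketch that assumes the messages appear in exactly the forms listed above. The function name parse_memory_event and the sample slot/module values are made up for illustration and are not part of any HP tool.

```python
import re

# Matches the location part, e.g. "(Slot 1, Memory Module 4)".
LOCATION = re.compile(r"\(Slot\s+(\w+),\s*Memory Module\s+(\w+)\)")

def parse_memory_event(message):
    """Classify a message and extract slot/module when they are present.

    Returns None for text that is not one of the memory error messages.
    """
    text = message.strip().rstrip(".").strip()
    lowered = text.lower()
    if lowered.startswith("uncorrectable memory error"):
        kind = "uncorrectable"
    elif lowered.startswith(("correctable memory error",
                             "corrected memory error")):
        kind = "correctable"
    else:
        return None
    match = LOCATION.search(text)
    slot, module = (match.group(1), match.group(2)) if match else (None, None)
    return {"kind": kind, "slot": slot, "module": module}

# Hypothetical sample values for slot and module:
event = parse_memory_event(
    "Correctable Memory Error Threshold Exceeded (Slot 1, Memory Module 4)."
)
print(event)  # {'kind': 'correctable', 'slot': '1', 'module': '4'}
```

The ROM-based validation message carries no location, so slot and module come back as None for it.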

Solution

NOTE: Most correctable and uncorrectable memory errors can be resolved with a BIOS update. Refer to the server's BIOS release notes for fixes.

What is a correctable memory error?

Correctable errors can be detected and corrected if the chipset and DIMM support this functionality. They are generally single-bit errors. Most ProLiant servers can detect and correct single-bit errors. In addition, ProLiant servers with Advanced ECC support can detect and correct some multi-bit errors.
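To illustrate how a single-bit error can be detected and corrected, here is a toy Hamming(7,4) code in Python. It is only a sketch of the general idea behind ECC. Real server memory uses wider codes, and this is not the scheme used by ProLiant chipsets or Advanced ECC. The function names encode and correct are chosen for this example.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(word):
    """Return (data_bits, error_position); position 0 means no error seen."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of the bad bit
    if position:
        w[position - 1] ^= 1  # flip the bad bit back
    return [w[2], w[4], w[5], w[6]], position

stored = encode([1, 0, 1, 1])
received = list(stored)
received[4] ^= 1           # simulate a single-bit error at position 5
print(correct(received))   # ([1, 0, 1, 1], 5)
```

If two bits flip at once, this small code points at the wrong position and returns wrong data, which is why multi-bit errors need a stronger code or end up uncorrectable.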

Correctable errors can be classified as "hard" and "soft" errors.

  • A hard error typically indicates a problem with the DIMM.

  • Hard correctable memory errors are corrected by the system and do not result in system downtime or data corruption, but they still indicate a problem with the hardware.

  • A hard error will typically cause a DIMM to exceed HP’s correctable error threshold, at which point the user is warned about it.

  • Soft errors do not indicate any issue with the DIMM.

  • A soft error occurs when the data and/or ECC bits on the DIMM are incorrect, but the error does not recur once those bits have been corrected.

  • A soft error will not typically cause a DIMM to exceed HP’s correctable error threshold, and the user is not notified about soft errors, since they do not indicate any issue with the hardware.
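The way hard and soft errors relate to the threshold can be sketched as a per-DIMM counter. HP's actual threshold and counting method are not given here, so the threshold value, the time window, and the class name below are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque

class CorrectableErrorMonitor:
    """Counts corrected errors per DIMM inside a sliding time window."""

    def __init__(self, threshold=10, window_seconds=3600):
        # Placeholder values, not HP's real threshold or window.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)

    def record(self, dimm, timestamp):
        """Record one corrected error; return True if the DIMM is over threshold."""
        recent = self.events[dimm]
        recent.append(timestamp)
        # Drop events that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.threshold

monitor = CorrectableErrorMonitor(threshold=3, window_seconds=60)

# A one-off (soft) error stays below the threshold.
print(monitor.record("DIMM A", 0))                         # False

# A recurring (hard) error on the same DIMM crosses it.
print([monitor.record("DIMM B", t) for t in (0, 1, 2, 3)])  # [False, False, False, True]
```

In this sketch a DIMM that keeps producing corrected errors trips the warning, while isolated errors spread far apart in time do not.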

The user is warned about a DIMM exceeding the correctable error threshold in several ways:

  • DIMM LEDs on the front panel, the system board, or the memory board.

  • Integrated Management Log (IML) entries.

  • SNMP traps, if configured.

  • System Management Homepage and Systems Insight Manager.

What is an uncorrectable memory error?

While correctable errors do not affect the normal operation of the system, an uncorrectable memory error immediately results in a system crash or shutdown unless the system is configured for Mirroring or RAID AMP mode.

Uncorrectable errors are always multi-bit memory errors. The internal Health LED will indicate a critical condition, and on most systems the LEDs next to the failed DIMMs will be illuminated. In addition, the error will be logged if the Systems Management Driver is loaded. Uncorrectable memory errors can typically be isolated to a failed bank of DIMMs, rather than to an individual DIMM.


Possible solutions:

Most of the Correctable and Uncorrectable Memory Errors can be solved with a BIOS update. Refer to server’s BIOS release notes for fixes.

Run Insight Diagnostics and replace the faulty part.

If running diagnostics is not an option, swap in a known good memory module to confirm that the DIMM slot on the system board or memory board is good.

Once the swap test or diagnostics has identified the faulty part, replace it.
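The reasoning behind the swap test can be written down as a small Python sketch. This is one reading of the step above, not an HP procedure, and the function name and return strings are made up for illustration. It assumes the known good module was placed in the same slot that reported the error.

```python
def interpret_swap_test(errors_persist_with_known_good_dimm):
    """Interpret a swap test where a known good DIMM sits in the suspect slot.

    Assumption: the known good module really is good, so any error that
    remains must come from the slot, system board, or memory board.
    """
    if errors_persist_with_known_good_dimm:
        return "suspect the DIMM slot, system board, or memory board"
    return "suspect the original DIMM"

print(interpret_swap_test(False))  # suspect the original DIMM
print(interpret_swap_test(True))   # suspect the DIMM slot, system board, or memory board
```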
