A new white paper from Intel explores how Intel MCA Recovery + MFP has helped JD Cloud provide efficient and stable cloud computing services to their more than 2,500 partners.
JD Cloud, one of the largest cloud computing platforms in the world, was facing a challenge, according to Intel. “Memory errors in the data center of JD Cloud account(ed) for 37% of the total hardware failures.” Hardware failures, and the subsequent downtime to resolve them, are costly to service providers like JD Cloud, as the outages violate their Service Level Agreements (SLAs) with customers.
To reduce the impact of these memory errors, JD Cloud collaborated with Intel to develop a system that would have “real-time insight into the memory status of the cloud host, predict potential memory failures and effectively recover memory failures.” By improving the reliability and stability of JD Cloud services, the company hoped to “reduce the total cost of ownership of the data center.”
The paper first outlines the types of memory errors JD Cloud was seeing. These include both Corrected Errors (CE) and Uncorrected Errors (UE). “After repeated tests and trade-offs by technical experts and JD Cloud and Intel, JD Cloud finally chose Intel’s MCA Recovery and MFP technologies” to be the backbone of their failure recovery system.
“With the help of MCA Recovery, the failure recovery system will isolate the affected memory pages and prevent the pages from being reused by other applications / processes. If the kernel can successfully perform recovery, the system can stay online as long as there is not failure.” – Intel, “Intel MCA+MFP Helps JD Build Stable and Efficient Cloud Services”
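As a rough illustration of the isolation behavior the quote describes, the toy Python model below tracks a pool of memory pages and permanently retires any page that reports an uncorrected error, so it is never handed out again. This is only a sketch under stated assumptions: it is not Intel's or JD Cloud's implementation and touches no real kernel or hardware interface. The `PageAllocator` class and its method names are hypothetical, invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageAllocator:
    """Toy model of a host's page pool (hypothetical, not a kernel API)."""
    free_pages: set                                   # page numbers available for use
    in_use: dict = field(default_factory=dict)        # page number -> owning process
    isolated_pages: set = field(default_factory=set)  # pages retired after an error

    def allocate(self, owner: str) -> int:
        """Hand the lowest-numbered free page to `owner`."""
        page = min(self.free_pages)
        self.free_pages.remove(page)
        self.in_use[page] = owner
        return page

    def handle_uncorrected_error(self, page: int) -> Optional[str]:
        """Isolate `page` so it can never be allocated again.

        Returns the owner that was using the page, if any, so the caller
        can decide how to recover that one workload while the rest of
        the host stays online.
        """
        self.free_pages.discard(page)        # no longer eligible for reuse
        owner = self.in_use.pop(page, None)  # detach it from its current owner
        self.isolated_pages.add(page)
        return owner


# Usage: one page fails while in use; later allocations skip it.
pool = PageAllocator(free_pages={0, 1, 2})
bad_page = pool.allocate("vm-a")
affected = pool.handle_uncorrected_error(bad_page)
next_page = pool.allocate("vm-b")
```

The point of the model is that only the workload that owned the bad page is affected; every other allocation continues against the remaining healthy pages.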
The author briefly explains what Intel’s Memory Failure Prediction (MFP) and MCA Recovery technologies do and how they work. Through illustrations and schematics, they explain how “the deployment of MCA Recovery and MFP, in conjunction with JD Cloud’s fault recovery system, greatly reduces the system crash caused by the memory failures of JD Cloud’s host.” The end result has been a 40% reduction in downtime caused by memory failures.