Internal audit report finds NASA missions delayed by overloaded and outdated supercomputers

2024-03-18 23:26:00

NASA’s supercomputing infrastructure is largely outdated and falls short of modern norms and standards. That is the alarming conclusion of an audit report which says the agency’s outdated and overloaded supercomputers are creating major infrastructure bottlenecks, leading to significant mission delays. NASA’s supercomputers still rely primarily on CPUs: its flagship NAS systems together use some 18,000 CPUs but only 48 GPUs. The report warns that without investment and modernization, the situation risks limiting the priorities and objectives of future missions.

Released Thursday, the report is the work of NASA’s Office of Inspector General. Its findings are concerning, primarily because they relate to a space agency that has made some of the most important discoveries in human history and is supposedly at the cutting edge of technology. The Office of Inspector General says the agency’s supercomputing technologies must be completely overhauled if NASA is to compete with the space research programs of other major powers and maintain its leadership in space exploration.

Describing NASA’s high-end computing (HEC) resources as “overstretched” and “overloaded,” the report asserts that mission directorates are requesting more computing time than existing capacity can provide, often resulting in schedule delays. It says the agency needs renewed commitment and sustained attention from its leaders to reinvigorate its supercomputing efforts. Absent modernization, the agency’s high-performance computing resources will likely limit future mission priorities and objectives.

According to the report, the situation is so serious that several NASA teams use part of their allocated budgets to purchase their own HEC resources in order to meet mission deadlines. As an example, the report highlights that the Space Launch System team spends approximately $250,000 per year to purchase and manage its own supercomputing systems rather than waiting for existing HEC resources to become available. According to the report, almost every NASA center operates its own HEC systems, with the exception of Goddard Space Flight Center and Stennis Space Center.

NASA has five central HEC resources, located at the NASA Advanced Supercomputing (NAS) facility at Ames in California and the NASA Center for Climate Simulation (NCCS) at Goddard in Maryland. The list includes: Aitken (13.12 PFLOPS, designed to support the Artemis program, which aims to return humans to the Moon and establish a lasting presence there), Electra (8.32 PFLOPS), Discover (8.1 PFLOPS, used for climate and weather modeling), Pleiades (7.09 PFLOPS, used for climate simulations, astrophysical studies, and aerospace modeling), and finally Endeavour (154.8 TFLOPS).
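For scale, the peak figures quoted above can be added up in a few lines; the sketch below simply aggregates the PFLOPS values reported for the five central systems (the dictionary and its comments are illustrative, not from the audit report):

```python
# Illustrative sketch: combined peak performance of NASA's five central
# HEC systems, using the PFLOPS figures quoted in the article.
systems_pflops = {
    "Aitken": 13.12,      # NAS, supports the Artemis program
    "Electra": 8.32,      # NAS
    "Discover": 8.1,      # NCCS, climate and weather modeling
    "Pleiades": 7.09,     # NAS, climate, astrophysics, aerospace
    "Endeavour": 0.1548,  # 154.8 TFLOPS expressed in PFLOPS
}

total = sum(systems_pflops.values())
print(f"Combined peak: {total:.2f} PFLOPS")  # → Combined peak: 36.78 PFLOPS
```

Roughly 36.8 PFLOPS in total, a useful baseline when weighing the report's claim that this capacity cannot keep up with mission directorates' requests.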

These machines rely almost exclusively on aging CPU cores. For example, all NAS supercomputers combined use more than 18,000 CPUs and only 48 GPUs, and NCCS uses even fewer GPUs. “HEC officials raised several concerns regarding this observation, stating that the failure to modernize NASA systems can be attributed to various factors such as supply chain concerns, modern computing language requirements, and the scarcity of qualified personnel needed to implement new technologies,” the report said.

“Ultimately, this failure to modernize current HEC infrastructure will directly impact the agency’s ability to achieve its exploration, science, and research goals,” says NASA’s Office of Inspector General. The observations do not stop there. The audit also found that the agency’s HEC resources are not managed as a centralized strategic program or service, leading to inefficiencies and the lack of a coherent strategy for the use of on-premises IT resources compared to cloud computing resources.

According to the report, this uncertainty has led to hesitation in using cloud resources, owing to unfamiliar programming practices or assumed higher costs. Additionally, the audit found that security controls for the supercomputing infrastructure are often bypassed or not implemented, increasing the risk of cyberattack. NASA’s Office of Inspector General therefore makes ten recommendations, the first being that senior leaders reform the way supercomputing is governed and implemented at NASA. The remaining nine recommendations are specific corrective actions.

The auditor emphasizes, however, that these actions should be carried out by a “strike team” responsible for resolving known problems throughout NASA’s supercomputer fleet. Among the tasks that this team must tackle are the following:

  • identify critical technology gaps, such as GPU transition and code modernization, to meet current and future needs and strategic technology and science requirements;
  • develop a strategy to improve HEC asset allocation and establish usage priorities, including appropriate use of on-premises versus cloud-based resources;
  • assess cyber risks associated with HEC assets to determine monitoring and control requirements, establish risk appetite and address control gaps;
  • consider using NASA’s Splunk platform as a shared resource;
  • establish an agency-wide inventory of HEC assets and formalize hardware and software lifecycle management procedures.
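To make the last recommendation concrete, here is a minimal, purely hypothetical sketch of what an agency-wide HEC asset inventory record with lifecycle tracking could look like; all field names, dates, and the `end_of_support` milestone are illustrative assumptions, not taken from the audit report:

```python
# Hypothetical sketch of an HEC asset inventory record with a single
# lifecycle check. Field names and dates are illustrative assumptions.
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class HECAsset:
    name: str
    site: str                 # e.g. "NAS (Ames)" or "NCCS (Goddard)"
    peak_pflops: float
    commissioned: date
    end_of_support: Optional[date] = None  # lifecycle milestone to track

    def is_past_support(self, today: date) -> bool:
        """True if the asset has passed its declared end-of-support date."""
        return self.end_of_support is not None and today > self.end_of_support


# Illustrative entry only; the end-of-support date is invented.
inventory = [
    HECAsset("Pleiades", "NAS (Ames)", 7.09, date(2008, 1, 1), date(2022, 1, 1)),
]

overdue = [a.name for a in inventory if a.is_past_support(date(2024, 3, 18))]
print(overdue)  # → ['Pleiades']
```

Formalizing even this small amount of structure would let the strike team query, across centers, which systems are nearing or past their support windows.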

The report states: “While HEC technology can support some small AI projects, the agency’s current HEC ecosystem cannot support projects that require massive data flow.” As might be expected, the audit report drew considerable reaction on the web. Many commenters are particularly concerned about the auditor’s observations on NASA’s supercomputer fleet and the agency’s lax handling of security. Some say the report might open the eyes of hackers who had not previously considered NASA a potential target.

Notably, a passage in the report states that NASA security staff raised concerns about the lack of user activity monitoring capabilities and about their exclusion from the processes for approving and reviewing external users’ access to HEC systems and agency data sets. “Without a refocused effort to implement better cybersecurity safeguards, NASA’s HEC assets will continue to be high-value targets for adversaries,” the report said.

NASA reportedly accepted the recommendation to establish executive leadership for its HEC assets and partially agreed with the report’s other recommendations, noting that it was working to collaborate and strategize on the issues identified by the auditor.

Source: NASA Office of Inspector General Audit Report (PDF)

And you?

What is your opinion on the subject?
What do you think of the alarming observations from NASA’s internal audit?
How can the obsolescence of the HEC infrastructure of an agency like NASA be explained?
What do you think of the lax management of the security of the agency’s HEC resources?
What do you think of the auditor’s recommendations?

See also

Can Russia create high-performance computing clusters with local technologies? Russian scientists believe the country will not be able to build new supercomputers

NASA’s new moon landing simulation supercomputer is more powerful, more environmentally friendly, can run up to 3.69 petaflops and has 46,080 cores

Jupiter, Europe’s first exascale supercomputer, will run on ARM instead of x86. French SiPearl to supply Rhea processors as Europe seeks hardware independence
