A technical AI government agency plays a vital role in advancing AI innovation and trustworthiness (Mark MacCarthy)

Share This
The current AI Safety Institute is a vital new agency currently established on a temporary basis in the Department of Commerce’s National Institute for Standards and Technology (NIST). This agency is key to U.S. leadership in the development of secure and trustworthy artificial intelligence. The Trump administration should work with Congress to continue the scientific and technical work of the institute. The agency should be dedicated to evaluating the capabilities of the latest frontier AI models and to assessing them for how well they have mitigated specific, realistic risks relating to national security and public safety harms. Congress should permanently authorize the agency in statute and ensure that it is adequately funded and staffed by well-compensated, nonpartisan technical experts.

Last year Rep. Jay Obernolte (R-Calif.), chair of the U.S. House of Representatives AI Task Force, and other representatives sought to establish and fund the AI Safety Institute in NIST in an appropriations bill. At a House Science Committee in February 2025, Rep. Obernolte pledged to reintroduce legislation that would give the agency statutory authority. This proposed bill should be high on the agenda for full congressional consideration and rapid passage. 

Review of current AI initiatives 

On his third day back in office, President Trump signed an executive order titled “Removing Barriers to American Leadership in Artificial Intelligence.” It revoked the previous administration’s AI executive order and ordered a review within 180 days of “all policies, directives, regulations, orders, and other actions taken pursuant to” this now-revoked older order. Based upon that review, any action identified as a barrier to “America’s global AI dominance” will be targeted for suspension, revision, or rescission. On Feb. 6, the National Science Foundation issued a notice in the Federal Register seeking public comment on developing this AI Action Plan. Comments can be submitted through March 15. 

This sweeping revocation might seem unnecessary. Most of the measures in the previous administration’s AI executive order were telling agencies to get up to speed, instructions to ramp up their AI expertise to meet the challenges of the new technology. Some elements might have gone too far for the current administration. For instance, many have criticized its reliance on the Defense Production Act to mandate sharing of AI test results with the government. These elements could have been revised or rescinded.  

Repealing the entire previous AI executive order might have been a political necessity, given that candidate Trump had promised to do it, but it was still a blow to the bipartisanship that is an essential element in creating a stable regulatory environment for AI investment and growth.  

There is one factor in favor of a comprehensive policy review, however. In December 2024, the Chinese company DeepSeek released a technical report on its V3 model. It showed the model had capabilities comparable to OpenAI’s GPT-4o. But while GPT-4 had cost $100 million to train, DeepSeek had spent only $5.6 million on its final training run for V3. Then on Jan. 22, DeepSeek released another technical report on its family of R1 reasoning models derived from V3 that matched the capabilities of OpenAI’s reasoning o1 model, unveiled in September 2024. It seemed as if DeepSeek had duplicated the success of OpenAI’s most advanced frontier model in only a few months and at a fraction of the cost. 

The stock market reaction was harsh. On Jan. 27, the stock of chip company NVIDIA fell 17%, erasing nearly $600 billion in market value and leading to a widespread decline in U.S. tech stocks. Investors apparently thought U.S. companies no longer had a substantial lead over companies based in China and that the huge investment in compute that drove advances in U.S. company models were no longer needed. Tech stocks have somewhat recovered from this blow, but it is clear the U.S. AI industry is facing what tech entrepreneur Marc Andreessen called its “Sputnik moment.” 

Assessing how to maintain U.S. AI leadership in light of this DeepSeek shock will take some time. In that context, it might make some sense to assess U.S. AI policy afresh to determine which elements might be impeding successful competition with a near peer AI rival. 

The good news is that the Trump administration, after conducting its 180-day review of existing AI initiatives, can preserve those elements that contribute to AI innovation and trustworthiness. The AI Safety Institute should become a permanent part of the new administration’s AI Action Plan. 

Real AI risks 

The message coming out of the recent AI Summit held in Paris is the marginalization of existential risk—the speculative and distracting risk that out-of-control AI models will destroy humanity. But for years, AI researchers have warned these highly capable AI foundation models created serious new risks that had to be managed at the model level. These real risks include allowing a non-expert to design and synthesize new biological or chemical weapons, producing persuasive disinformation with minimal user instruction, and harnessing unprecedented offensive cyber capabilities. 

Perhaps to emphasize a turn from a concern about existential risk to a concern about these more mundane but urgent risks, the U.K. recently renamed its AI Safety Institute testing agency. The new name, the AI Security Institute, puts a clearer focus on “strengthening protections against the risks AI poses to national security and crime.”   

Under whatever name, there is an urgent need for a government role in assessing and mitigating genuine AI risks. In their rush to develop highly capable AI, companies can shortchange the need to protect against these real risks.  

There is some evidence that DeepSeek did this. While it seems to adhere well to China’s system of content control in refusing to answer questions about sensitive topics like Tiananmen Square, it does less well in other risk areas. A report by the security firm Enkrypt AI found that DeepSeek’s R1 model is more likely to produce biased content, harmful material, malicious software code, and content related to chemical, biological, and cybersecurity risks than comparable U.S. AI models.  

Revealing that DeepSeek has “considerable vulnerabilities in operational and security risk areas” was not an exercise in regulatory overkill or obsessing about science fiction fantasy risks, but rather a public service that warned companies and professionals to take extra precautions if they want to embed these models in their business systems or use them in their professional work. 

Governments have a responsibility to ensure that AI companies have the right financial incentives to develop secure and trustworthy AI systems, not systems that can expose their users or the public to significant harm. The best way to do this is to set up a voluntary testing institution that can provide them and the public with information about model capabilities and how well they control recognized and foreseeable risks. 

No model alignment can avoid all possible risks, but it can reduce substantial risks to manageable levels. For instance, without alignment, large language models (LLMs) would answer questions about how to build chemical and biological weapons. There is some evidence from a 2024 RAND study that the availability of today’s LLMs might not make it any easier for unskilled adversaries to build these weapons than would access to the internet alone. This means the incremental risks might not be that great right now, but that could change with more capable models. Besides, why make things any easier for bad actors by giving them another route to developing an operational biological threat? If no controls are imposed on LLMs, there is a significant risk that untrained, nonexpert adversaries could use this tool to make weapons. Imposing strong controls on model use that can only be avoided with substantial technical expertise makes it much more difficult to access this information using a LLM, reducing a substantial risk to a more manageable one. 

The key thing is to develop a common understanding in the government, industry, and academia of what risks need to be addressed and how to measure and evaluate risk reduction. A government institution can coordinate the identification and measurement of risk. And only a government agency has the convening power and the authority to bring all the relevant actors to the table to make sure the result represents the consensus of the best people and institutions working on AI.  

Activities of the US AI Safety Institute 

In its activities over the last year, the U.S. AI Safety Institute has begun to do this. On Nov. 1, 2023, the Commerce Department announced the formation of the AI Safety Institute, led by NIST, to “facilitate the development of standards for safety, security, and testing of AI models, develop standards for authenticating AI-generated content, and provide testing environments for researchers to evaluate emerging AI risks and address known impacts.” 

On Feb. 7, 2024, the department reaffirmed this mission to “conduct research, develop guidance, and conduct evaluations of AI models including advanced LLMs in order to identify and mitigate AI safety risks.”   

Recognizing this mission cannot be accomplished by one government agency acting alone, on Feb. 8, 2024, the department announced the formation of the U.S. AI Safety Institute Consortium. This initiative “brings together more than 280 organizations to develop science-based and empirically backed guidelines and standards for AI measurement and policy, laying the foundation for AI safety across the world.” The consortium consists of U.S. companies, universities, think tanks, and advocacy groups with expertise in AI. 

In August 2024, the institute signed agreements with OpenAI and Anthropic regarding AI safety testing. The agreements established formal cooperation arrangements between the agency and the companies and allow it “to receive access to major new models from each company prior to and following their public release.” 

On Nov. 19, 2024, the institute published a pre-deployment evaluation of Anthropic’s upgraded Claude 3.5 Sonnet model. It evaluated the model on four domains: biological capabilities, cyber capabilities, software and AI development, and safeguard efficacy. It compared the model to the prior version of Anthropic’s Sonnet 3.5, OpenAI’s o1-preview, and OpenAI’s GPT-4o. The evaluation found the model had improved capability in all areas, but researchers were able to use publicly available “jailbreak” methods to circumvent the model’s built-in safeguards, meaning that “the model provided answers that should have been prevented.”  

On Dec. 18, 2024, the Institute published a pre-deployment evaluation of OpenAI’s o1 reasoning model, the full version of which was released to the public on Dec. 5, 2024. It evaluated the model in three areas: cyber capabilities, biological capabilities, and software and AI development. The comparison models were OpenAI’s o1-preview, OpenAI’s GPT-4o, and both the upgraded and earlier version of Anthropic’s Claude 3.5 Sonnet. It found that in some areas, it was more capable than the reference models; and in others, less capable.   

On Nov. 20, 2024, in an attempt to knit together government agencies around the world working on the same set of AI trustworthiness problems, the U.S. launched the International Network of AI Safety Institutes at a meeting in San Francisco. 

On Jan. 15, 2025, NIST published a second draft of its guidelines addressing the risk of misuse of dual use foundation models, which is available for public comment until March 15. 

Preserving an AI testing institute 

In many ways, the name “safety institute” is a misnomer. As NIST noted in its well-regarded AI risk management framework, safety in the sense that an AI model does not “lead to a state in which human life, health, property, or the environment is endangered” is only one component of trustworthy AI. Policymakers also want AI to be valid and reliable, “secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed.” The phrase “AI safety” also conjures up speculative fears of existential risk from out-of-control AI systems endowed with consciousness, independent agency, and superhuman powers. These science fantasy fears only distract policymakers from real challenges and opportunities. 

The new label for the U.K.’s institute focuses attention on the real work of the institute. Regardless of its label, the institute performs a vital function in maintaining U.S. leadership in AI. Rushing untrustworthy AI models to market without adequate pre-deployment review is not a way to establish leadership—it is a way to scare away individual and institutional customers who are looking for reliable AI products to increase their workplace productivity and make their everyday lives easier and more comfortable.   

NIST has performed this testing function for countless other technologies, including for facial recognition, which is an application of AI technology. Since 2017, it has evaluated facial recognition algorithms submitted voluntarily by developers and ranked them by how accurately they identify people of varied sex, age, and racial background. This information is crucial for users of the technology to compare different brands and decide which is best for their purposes. An AI testing institution housed in NIST would perform a similar function for general purpose frontier AI models.  

Trustworthy AI companies already conduct pre-deployment testing and have committed themselves to continuing this best practice, but the science of assessing trustworthiness in AI is just beginning. The U.S. needs to work with companies to create accurate and objective measures of trustworthiness. It needs a center of excellence within the government where companies can come on a voluntary basis to work with recognized experts to see how successful they have been in meeting the challenges of providing models that are secure, reliable, robust, and trustworthy. 

Source

Leave a Comment