Microsoft unveils Phi-2, the next of its smaller, more nimble genAI models

Microsoft's newest small language model is purported to outperform large language models up to 25 times its size.

Microsoft has announced the latest in its suite of smaller, more nimble artificial intelligence (AI) models targeted at more specific use cases.

Earlier this year, Microsoft unveiled Phi-1, the first of what it calls small language models (SLMs); they have far fewer parameters than their large language model (LLM) counterparts. For example, the GPT-3 LLM, the foundation on which ChatGPT was built, has 175 billion parameters. GPT-4, OpenAI’s latest LLM, reportedly has about 1.7 trillion parameters, a figure OpenAI has not confirmed. Phi-1 was followed by Phi-1.5, which, by comparison, has 1.3 billion parameters.

Phi-2 is a 2.7 billion-parameter language model that the company claims can outperform LLMs up to 25 times larger.
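To put the "25 times larger" claim in concrete terms, the arithmetic can be sketched in a few lines of Python. The 2.7 billion figure and the 25x multiple come from the article; the 2-bytes-per-parameter figure is an assumption (16-bit weights), and the function names are illustrative only. The memory estimate covers the weights alone, not activations or other runtime overhead.

```python
# Back-of-the-envelope arithmetic for the model sizes quoted in the article.
PHI2_PARAMS = 2.7e9        # Phi-2 parameter count, from the article
SIZE_MULTIPLE = 25         # "up to 25 times larger", from the article
BYTES_PER_PARAM_FP16 = 2   # assumption: weights stored as 16-bit floats


def comparable_model_params(small_params: float, multiple: float) -> float:
    """Parameter count of a model `multiple` times larger than `small_params`."""
    return small_params * multiple


def weight_memory_gb(params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    larger = comparable_model_params(PHI2_PARAMS, SIZE_MULTIPLE)
    print(f"25x Phi-2: {larger / 1e9:.1f} billion parameters")
    print(f"Phi-2 weights at 16-bit: {weight_memory_gb(PHI2_PARAMS):.1f} GB")
    print(f"25x model weights at 16-bit: {weight_memory_gb(larger):.1f} GB")
```

Under these assumptions, a model 25 times the size of Phi-2 would have 67.5 billion parameters, and Phi-2's own weights would occupy roughly 5.4 GB.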

Microsoft is a major shareholder in and partner of OpenAI, the developer of ChatGPT, which was launched a little more than a year ago. Microsoft uses OpenAI's GPT models as the basis for its Copilot generative AI assistant.

LLMs used for generative AI (genAI) applications such as ChatGPT or Bard can consume vast amounts of processor cycles and, because of their size, be costly and time-consuming to train for specific use cases. Smaller, more industry- or business-focused models can often provide better results tailored to business needs.

“Sooner or later, scaling of GPU chips will fail to keep up with increases in model size,” said Avivah Litan, a distinguished vice president analyst at Gartner. “So, continuing to make models bigger and bigger is not a viable option.”

Currently, there’s a growing trend to shrink LLMs to make them more affordable and capable of being trained for domain-specific tasks, such as online chatbots for financial services clients or genAI applications that can summarize electronic healthcare records.

Smaller, more domain-specific language models trained on targeted data will eventually challenge the dominance of today's leading LLMs, including OpenAI's GPT-4, Meta AI’s Llama 2, and Google's PaLM 2.

Dan Diasio, Ernst & Young’s Global Artificial Intelligence Consulting Leader, noted that there’s currently a backlog of GPU orders. A chip shortage creates problems not only for tech firms making LLMs, but also for user companies seeking to tweak models or build their own proprietary LLMs.

“As a result, the costs of fine-tuning and building a specialized corporate LLM are quite high, thus driving the trend towards knowledge enhancement packs and building libraries of prompts that contain specialized knowledge,” Diasio said.

Citing its compact size, Microsoft is pitching Phi-2 as an “ideal playground for researchers,” including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. Phi-2 is available in the Azure AI Studio model catalog.

“If we want AI to be adopted by every business, not just the billion-pound multinationals, then it needs to be cost-effective,” said Victor Botev, former AI research engineer at Chalmers University and CTO and co-founder at start-up Iris.ai, which uses AI to accelerate scientific research.

The release of Microsoft's Phi-2 is significant, Botev said. “Microsoft has managed to challenge traditional scaling laws with a smaller-scale model that focuses on 'textbook-quality' data. It's a testament to the fact that there's more to AI than just increasing the size of the model,” he said.

“While it’s unclear what data and how the model was trained on it, there are a range of innovations that can allow models to do more with less.”

LLMs of all sizes can be steered through a process known as prompt engineering: crafting queries, often accompanied by examples of correct responses, so the model responds more accurately. This shapes a model's output without retraining it. Today, there are even marketplaces for lists of prompts, such as the 100 best prompts for ChatGPT.
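The idea of pairing queries with correct responses can be sketched as a "few-shot" prompt: worked examples are placed ahead of the real question so the model can imitate their pattern. The sketch below is a minimal illustration, not any vendor's API; the function name and the "Q:"/"A:" layout are assumptions, and no model is actually called. Note that a prompt of this kind guides a model's output at query time without changing its weights.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble example question/answer pairs, then the new question.

    The final "A:" is left open so a language model would complete it.
    """
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical examples for a financial-services chatbot.
    examples = [
        ("What is an APR?", "The yearly cost of borrowing, shown as a percentage."),
        ("What is a credit limit?", "The maximum balance allowed on an account."),
    ]
    print(build_few_shot_prompt(examples, "What is a grace period?"))
```

The resulting string would be sent to whichever model is in use; the examples are what tailor the response to the domain.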

But the more data ingested into LLMs, the greater the possibility of bad and inaccurate outputs. GenAI tools are basically next-word predictors, meaning flawed information fed into them can yield flawed results. (LLMs have already made some high-profile mistakes and can produce “hallucinations” where the next-word generation engines go off the rails and produce bizarre responses.)
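The "next-word predictor" point can be illustrated with a toy model that simply counts which word most often follows another in its training text. This is a deliberately simplified sketch (a bigram counter, not how an LLM is actually built), and the corpora and function names are invented for illustration. It shows how flawed training text leads directly to a flawed prediction.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Invented training data: one clean corpus, one polluted with a wrong fact.
    clean = ["the capital of france is paris"] * 3
    flawed = clean + ["the capital of france is lyon"] * 5

    print(predict_next(train_bigram(clean), "is"))   # follows the clean data
    print(predict_next(train_bigram(flawed), "is"))  # follows the majority error
```

Because the wrong sentence outnumbers the right one in the second corpus, the toy model repeats the error; it has no notion of truth, only of frequency.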

“If the data itself is well structured and promotes reasoning, there is less scope for any model to hallucinate,” Botev said. “Coding language can also be used as the training data, as it is more reason-based than text.

“We must use domain-specific, structured knowledge to make sure language models ingest, process, and reproduce information on a factual basis,” he continued. “Taking this further, knowledge graphs can assess and demonstrate the steps a language model takes to arrive at its outputs, essentially generating a possible chain of thoughts. The less room for interpretation in this training means models are more likely to be guided to factually accurate answers.

“Smaller models with high performance like Phi-2 represent the way forward.”