The Promises and Challenges of AI for Arabic
Written by Rebecca Anne Proctor
Illustrated by Mujahid Almalki via AI generator
How are artificial intelligence models making generative AI accessible to the Arabic-speaking world of the Middle East and beyond?
Mohammed Moneb Khaled, a researcher in artificial intelligence (AI) in the United Arab Emirates, believes in the power of AI to foster better communication for the Arabic-speaking world.
For his work at the University of Sharjah, Khaled relies on ChatGPT to translate reports from English to standard Arabic and vice versa. But the Arabic language has multiple varieties. As much as the tool is an asset, when he tries to speak a certain Arabic dialect to ChatGPT, the responses, he says, “are not accurate.”
Khaled says more needs to be done to incorporate the Arabic language, particularly diverse Arabic dialects, into existing AI models, and that’s what he and other researchers hope to achieve.
When OpenAI’s ChatGPT launched in 2022, it became a sensation worldwide for its ability to let users easily communicate with a machine on a natural, humanlike level.
GPT is a type of Large Language Model (LLM) trained on extensive data. It can understand Arabic inquiries and translate them using Modern Standard Arabic. But, says Khaled, it falls short in its responses. The answers often sound unnatural, and literal translations do not have the same meaning.
While conversational AI tools, such as ChatGPT or the new Google Gemini, can enhance efficiency, customer engagement and communication, many specialists state that AI is lacking when they need to converse in other languages with multiple dialects.
Arabic, spoken by more than 400 million people worldwide, serves as the official language of approximately 22 countries, mainly in the Middle East and Africa, according to the United Nations. It joins Chinese, English, French, Russian and Spanish on the UN’s list of official languages used in its work around the world.
There are three main versions of Arabic: Quranic or Classical, Modern Standard and Colloquial, which has two dozen or more dialects. While the actual number of dialects remains disputed, some are similar, and others are difficult to decipher, even for a speaker of Modern Standard Arabic. The most common groups include North African (Maghrabi), Levantine (Syria and Lebanon), Egyptian and the Gulf Arab dialects. Originally from Syria, Khaled converses in three variations of Arabic: Modern Standard for work, and Levantine or Gulf Arabic dialects for everyday conversations. But with an AI tool he runs into hurdles, and his research shows that’s a common problem.
“Many business owners throughout the Arabic-speaking world have told me they would prefer to have AI models available in Arabic dialect because they use these dialects more commonly than Modern Standard Arabic to conduct business with their customers,” explains Khaled. “Customers prefer to do business in their own dialect.”
One issue, says Khaled, is the dominance of English in the AI world, which has left Arabic and other languages rarely used. And that has created a problem.
“In the realm of AI, Arabic is not getting much attention from the researchers or big companies,” says Khaled.
Now, researchers and engineers in the Arab world are trying to change that.
Numerous public and private industries and sectors already require Arabic as a tool for public service.
“This is why AI technologies are now so important and crucial for advancing languages like Arabic,” says Ashraf Elnager, professor and vice dean of the College of Computing and Informatics at the University of Sharjah, who is also Khaled’s professor. Elnager is working with students and researchers such as Khaled to develop new models and expand their knowledge of Arabic usage in AI.
AI, he explains, “is extremely important for natural language processing in general, and for the Arabic language in particular; AI has the potential to bridge the gap between the Arabic language, the linguistic part of it, and the latest technology that has been emerging over the past four or five years.”
Developing advanced AI models today, Elnager notes, allows us to enhance language-processing tools, which can lead to better translation accessibility and integration of Arabic into the digital world.
Rupert Chesman, an AI consultant and filmmaker based in Sydney who has traveled the region extensively, believes that “machine learning and translation can start to understand the multifaceted nature of the Arabic language.”
Chesman says one way to embrace the complexity of the Arabic language would be to document all Arabic movies and television shows with an understanding of Arabic as a language with numerous distinct morphologies and accents. One method would be to use Google Gemini, a new AI model that not only understands text like other LLMs, but also videos and images.
Google demonstrated this capability on a Buster Keaton movie, “Sherlock Jr.,” Chesman notes, explaining how Gemini, using its long context window, analyzed the 44-minute movie in seconds, understanding the visuals, nuance and some humor.
“Imagine if Gemini was exposed to Egyptian movies, Saudi television or books in Morocco. It would be able to build a strong knowledge of the incredibly multifaceted nature of Arabic and Arab culture, to understand not only the linguistic nuances that occur but also the importance of cultural nuance at the same time,” he enthuses. “Maybe it would be an opportunity to understand that Arabic is not so Modern Standard after all.”
Innovation accessible to all
Researchers in the field of generative AI believe that it is vital that Large Language Models are developed for languages other than English to ensure that innovation is accessible to everyone.
“Making AI accessible to as many users around the world will level the playing field of an emergent tool to give everybody access from a language-barrier standpoint to one of the most revolutionary tools that humans have invented,” says Jeff Shupack, a specialist in digital transformation and expert in AI who is based in San Francisco.
“Making it accessible in local syntax, local dialects and local languages really levels that playing field,” he adds.
However, as Chesman states, LLMs are only as good as the data provided to them.
With LLMs, written Arabic will often lose much of its cultural subtext because the models do not include accents and are written in a more “standard” way for a “standard” audience, explains Chesman. To improve the accuracy and accessibility of AI in Arabic, it is vital, he says, “that diverse audio and video samples are used to ensure that the geographic and cultural location of the language being transcribed is clear to the model.”
One major advancement took place in Abu Dhabi, the capital of the UAE, in August 2023 with the launch of Jais, an open-source bilingual Arabic-English model developed by Inception, a unit of Abu Dhabi AI company G42, Mohammed bin Zayed University of Artificial Intelligence (MBZUAI) and Silicon Valley-based Cerebras Systems. Jais is now available for download on a machine-learning platform called Hugging Face.
Dubbed the world’s most significant and most accurate open Arabic LLM, Jais is designed to support and bring the Arabic language into the mainstream of this space. Today, according to Jais’ creators, Arabic accounts for just 1 percent of total global online content.
Developing an Arabic LLM has enabled Arabic speakers and organizations to use transformative services like those offered by ChatGPT and Gemini.
“Jais makes generative AI accessible to the Arabic-speaking world,” explains Preslav Nakov, department chair and professor of natural language processing (NLP) at MBZUAI. He emphasizes that through Jais the depth and heritage of Arabic, with all its intricacies and complexities, can find its voice within the rapidly expanding AI landscape.
“It helps bridge the gap between computers and their understanding of the complexities of Arabic,” Nakov adds.
Preservation of Arabic
In a world that increasingly relies on AI for all aspects of life, training AI models in specific languages is key not only to access a greater public and improve business and communication but also to reach a populace where English is not always spoken.
Jais, Nakov explains, has been designed to have a more accurate understanding of the culture and the context in the region, in contrast to most US-centric models.
“Jais could also help to increase the volume and the diversity of Arabic content available online, including educational resources on various topics, including technology, culture, science and lifestyle, and texts translated from other languages, including news articles, blog posts and subtitles,” he adds.
Models are continually being enhanced. An updated version of Jais, called Jais 30B, was launched in November 2023 and was completed in January 2024. It is the newest and most proficient version of the open-source Arabic LLM, featuring 30 billion parameters, offering a rich, nuanced, generative AI experience for Arabic speakers worldwide.
LLMs can help preserve languages, not just Arabic, in many ways: They can assist in understanding and translation, helping to bridge communication gaps between Arabic and other languages, and they can aid in the preservation and documentation of Arabic by analyzing and understanding historical texts, literature and cultural artifacts.
In Saudi Arabia, one of the main strategic objectives of the National Center for Artificial Intelligence, explains Yaser Al-Onaizan, the head of NCAI, is to develop and operationalize AI solutions to accelerate their adoption in Saudi Arabia.
NCAI is the innovation arm of the Saudi Data and Artificial Intelligence Authority. “One of its main focuses is nationally strategic Arabic Language AI products and services while investing heavily in building Arabic-focused reusable foundational pre-trained models for language and speech, namely ALLaM and SauTech,” explains Al-Onaizan. “Developing these solutions locally ensures the preservation of culture and identity, data sovereignty and real technology transfer.”
While much progress has been made in developing and enhancing Arabic AI models, challenges remain. One is the complexity of Arabic. “It is extremely challenging compared to other languages, simply due to its many dialects, variants, and rich morphology,” says Elnager. “The number of researchers in this area is minimal. One big obstacle is also computational resources.”
With LLMs becoming more widely used across most sectors, there is a risk that they could inadvertently accelerate the decline of smaller, underrepresented languages, notes Nakov.
“If not addressed, the situation has the potential to create a winner-takes-all scenario, with the most widely spoken languages being well served by LLMs while lesser-spoken languages are neglected,” Nakov adds. “The significance of models such as Jais could potentially extend beyond Arabic and help to preserve other languages in the region that are closely related to Arabic, such as Aramaic, Mehri, Shahri, Hobyot and Harsusi, and Amharic in Ethiopia.”
Another challenge is that Arabic is constantly changing.
“As the language evolves, the [grammatical] rules become more complex and sometimes outdated,” explains Al-Onaizan. “Therefore, it makes sense to develop the ability to learn these rules or patterns of language from the data directly.”
Khaled says he is working on research that will train AI to operate in diverse Arabic dialects and make responses more accurate.
“A CEO of a company in the United Arab Emirates asked me if ChatGPT can type in Emirati dialect and that it would be beneficial for his customers,” says Khaled. “Another Sudanese man said it would be so beneficial to have a type for the Sudanese dialect. Beyond ChatGPT, it would also be worthwhile for the Arabic-speaking world to have Alexa in Arabic.”
While AI in Arabic is still in its beginning phases, developments across the region, complemented by eager students like Khaled and professors like Elnager, are spurring this much-needed revolution.
About the Author
Mujahid Almalki
Mujahid Almalki, originally from Muscat, Oman, is both a photographer and an artist who is the founder of Sard. Sard is a visual art project that harnesses the power of artificial intelligence to tell stories about the Arabic world.
Rebecca Anne Proctor
Rebecca Anne Proctor is an independent journalist, editor and broadcaster based between Dubai and Rome. She is a former editor-in-chief of Harper’s Bazaar Art and Harper’s Bazaar Interiors.