Last week I attended the final review meeting for the SAMAR project (website in French). SAMAR (Station d'Analyse Multimedia en langue ARabe) is a multi-enterprise project with the objective of developing a platform to manage multimedia content in Arabic.
The goal of the project was to develop tools for the automated analysis of Arabic news content (text, audio and video) from Agence France Presse for smart multilingual searching with (among other things):
- speech to text transcription from the audio track of videos
- semantic text analysis and linking for Arabic text, with categorization by IPTC topics
- translation to French and English.
Today, processing platforms for the Arabic language still lack maturity, and semantic processing in Arabic is particularly difficult. The difficulty does not come from the characters themselves: Arabic uses a 28-letter alphabet, and its word composition is not very different from that of Latin languages. The main problem is that vowels are often not written, so the algorithm has to deduce them from the context. In addition, transliterations of English names into Arabic vary a lot, and Arabic has many dialects that share many language features but still differ considerably in both pronunciation and vocabulary.
Also, linguistically annotated corpora are relatively scarce for Arabic compared with English (for instance), which makes it more challenging and costly to build good models for speech transcription, translation and semantic analysis (topic categorization and named entity detection).
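To make the unwritten-vowel problem concrete, here is a minimal Python sketch. It is not part of the SAMAR code base; the function name `strip_harakat` and the example words are my own illustration. It removes the optional Arabic vowel marks (Unicode U+064B to U+0652) and shows that two different words, "kataba" (he wrote) and "kutub" (books), end up with the same spelling once the marks are dropped, which is how they usually appear in news text.

```python
# Arabic short-vowel and related marks (harakat): U+064B..U+0652.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_harakat(text: str) -> str:
    """Remove Arabic vowel marks, leaving only the written consonant skeleton."""
    return "".join(ch for ch in text if ch not in HARAKAT)


# Fully vocalised forms, written with escapes so the code points are explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"
SKELETON = "\u0643\u062A\u0628"                  # the usual unvocalised spelling

# Both words collapse to the same three letters once the marks are removed.
print(strip_harakat(KATABA) == strip_harakat(KUTUB) == SKELETON)  # prints True
```

Going in the other direction, from the skeleton back to the intended word, is the hard part: it cannot be done by a lookup like this and needs context, which is why it falls to the statistical models.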
With the SAMAR project, we have developed a working platform for semantic processing of Arabic-language multimedia content. The project work will be validated by conducting experiments on all Arabic dispatches already produced by the AFP (about one million dispatches, representing more than 150 million words), as well as radio and television data in Arabic.
The SAMAR project builds upon previous results from the IKS project, whose goal was to develop software components that give content management platforms such as Nuxeo semantic analysis capabilities. In particular, we used Apache Stanbol as a semantic analysis and indexing middleware server. It connects Nuxeo with the other services developed by the SAMAR partners to analyze AFP's news content, and cross-links the results with a local index of DBpedia managed by the Apache Stanbol EntityHub component.
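For readers who have not used Stanbol, the sketch below shows roughly what a client call to its enhancer looks like: plain text is POSTed to the enhancer endpoint and the detected entities come back as RDF. This is only an illustration, not the Nuxeo connector used in SAMAR. The URL assumes a default local Stanbol launcher on port 8080, and the helper name `build_enhancer_request` is hypothetical.

```python
import urllib.request

# Assumption: a Stanbol launcher running locally with its default settings.
STANBOL_ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text: str, url: str = STANBOL_ENHANCER_URL) -> urllib.request.Request:
    """Build a POST request that submits plain text to the Stanbol enhancer."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "text/turtle",  # ask for the enhancements as Turtle RDF
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_enhancer_request("Agence France Presse is based in Paris.")
    # Sending the request requires a running Stanbol server.
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))
```

The response describes the text annotations and the entities they were linked to; in the SAMAR setup those links point into the local DBpedia index held by the EntityHub.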
A demo is worth a thousand words, so view the video to see the proof of concept in action!