Meet Olivier Grisel, Nuxeo’s resident expert on semantic technologies, who heads up several major projects in this quickly evolving branch of computer science. With a background in Machine Learning and Artificial Intelligence, he could very well make your sci-fi dreams (or nightmares) come true.
For the last 2 ½ years Olivier Grisel has been heading up the R&D effort at Nuxeo to incorporate semantic technologies into content management systems. This is no small feat, but the secret lies within the technology itself as Olivier explains that semantic technologies are not just about interpreting data, but also enabling information to work for you.
We recently caught up with him between the coffee machine and the break room (wasn’t too hard considering our open office plan...) to get his take on this cutting edge technology and what the future holds.
Nuxeo: How did your research with semantic technologies begin here at Nuxeo?
Olivier: In the beginning it was just the SCRIBO project, part time. Now we’ve added a second project, IKS, which is a European project. Both projects take a lot of investment. FISE is the name of the first prototype we built as part of the IKS project, but it has since been moved into a new Apache project: Apache Stanbol.
Nuxeo: Prior to Nuxeo were you involved in semantic technologies?
Olivier: Prior to Nuxeo I did a Master’s in Artificial Intelligence and Machine Learning in London. These are not directly related to semantic technologies, but semantic technology is one way to apply this kind of research. There are two ways to apply the academic research I did before. The first is machine learning, which is used to train statistical models that try to understand human natural language. The second comes once you have extracted structured knowledge from the sentences in the text: you can try to perform some kind of reasoning on it.
For example, suppose you know that Paris is a city, that cities are places located in countries, and that countries are smaller places than continents. If you see that the name Paris appears in a document, then you can run queries like “give me all the documents that talk about events that happened in Europe.” If you have the hierarchical relationship between places, you can take it into account to do reasoning. This is very simple reasoning in this case, but that’s the basis of the logical representations we use in Artificial Intelligence.
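The place hierarchy Olivier describes can be sketched in a few lines of Python. This is only an illustration and not code from Stanbol or Nuxeo: the LOCATED_IN table, the sample documents, and the function names are all made up for the example.

```python
# Hypothetical toy data: each place points to the larger place that contains it.
LOCATED_IN = {
    "Paris": "France",
    "France": "Europe",
    "Tokyo": "Japan",
    "Japan": "Asia",
}

# Hypothetical documents, each annotated with the places detected in its text.
DOCUMENTS = {
    "doc-1": {"Paris"},
    "doc-2": {"Tokyo"},
    "doc-3": {"France", "Tokyo"},
}


def ancestors(place):
    """Return every larger place that transitively contains `place`."""
    found = []
    while place in LOCATED_IN:
        place = LOCATED_IN[place]
        found.append(place)
    return found


def documents_about(region):
    """Return the ids of documents mentioning `region` or any place inside it."""
    return sorted(
        doc_id
        for doc_id, places in DOCUMENTS.items()
        if any(p == region or region in ancestors(p) for p in places)
    )


print(documents_about("Europe"))
```

A document that only mentions Paris is returned for a query about Europe because the lookup walks up the hierarchy, which is the "very simple reasoning" mentioned above.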
Nuxeo: How does this translate to semantic technologies?
Olivier: Basically, if you describe facts and events using logic, you use some kind of formal vocabulary to describe observations that appear in unstructured text or pictures, and these vocabularies are the basis of semantic representations. They are structured according to standards such as RDF, which stands for the Resource Description Framework. This is the main standard for the semantic web.
Nuxeo: So, how would you explain semantic technologies to your grandmother?
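RDF expresses each fact as a subject, predicate, object triple. A minimal sketch of that shape, using plain Python tuples rather than a real RDF library, might look like the following. The "ex:" identifiers and the match helper are assumptions made up for this example; only rdf:type is an actual RDF term.

```python
# Facts as (subject, predicate, object) triples, the basic shape of RDF data.
# The "ex:" names are made-up identifiers for this example, not a real vocabulary.
TRIPLES = [
    ("ex:Paris", "rdf:type", "ex:City"),
    ("ex:Paris", "ex:locatedIn", "ex:France"),
    ("ex:France", "rdf:type", "ex:Country"),
    ("ex:doc-1", "ex:mentions", "ex:Paris"),
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything stated about Paris.
for triple in match(subject="ex:Paris"):
    print(triple)
```

Because every fact has the same three-part shape, facts extracted from text and facts imported from an external knowledge base can be queried together.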
Olivier: [laughs] Well, it would be to make what humans produce understandable by machines. What you can then do is automated processing on the output—on the production of human work—so yes, that’s the final goal.
Nuxeo: Did this subject matter, artificial intelligence, always interest you—are you a sci-fi guy?
Olivier: Ah, yes, the Terminators are coming soon, I guess, only a couple of years to wait…
Nuxeo: Have you seen all the Terminator movies?
Nuxeo: Tell us a little about the projects you’re working on and the goals related to these?
Olivier: So the IKS project goal is to develop an open source software stack that is meant to provide semantic services to traditional CMS developers. The goal of Nuxeo with this project is to provide use cases and show how this kind of technology can be integrated into a traditional CMS or ECM system.
And so I think both this project and the other project, SCRIBO, are research projects that let you explore what’s possible with today’s technology, but we can also already focus on customer applications. The first application is to help journalists, news editors, and press agencies enrich the text they produce with structured metadata, so they can automatically classify or route the news they produce to their content consumers. Their consumers get additional value: not just the raw text from the journalist but, for instance, the list of people mentioned in the article, information on those people, their biographies, the location of the event described in the text, and so on. So this is an easy way to get value out of semantic technologies: semantic publishing. But I think there are other applications that we could focus on in the medium term.
For instance, eDiscovery. Say you have a huge collection of unstructured content and you are setting up a new Nuxeo repository into which you import all the existing files you have on your intranet or in your shared folders. You have no structured way to organize them; they have just been accumulating over the years. When you switch to Nuxeo, you might want to structure them based on the project, the people working on the files, and so on. So it might be interesting to extract the names of the organizations and people occurring in the text, infer the projects that relate those entities, and propose a tree of folders that organizes those projects and their related documents in the best way, so that you can browse the existing content and discover the relationships within it.
Nuxeo: Would you still have to somehow tag this unstructured data first?
Olivier: Not necessarily. Take a sentence that contains a name, like “so-and-so declared something.” The fact that the word “declared” is there means that the words that come just before it are most likely a person’s name. So, based on the structure of the sentence, it can extract the name of a person even if they’re not known in any knowledge base.
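The cue Olivier mentions, a name followed by a verb such as "declared", can be imitated with a regular expression. This is a deliberately crude sketch and not how Stanbol works: the list of verbs and the candidate_names function are assumptions for illustration, and a trained statistical model would learn such cues from annotated text instead of using a fixed pattern.

```python
import re

# Hypothetical list of reporting verbs; a trained model would learn such cues
# from annotated text instead of relying on a fixed list.
REPORTING_VERBS = ("declared", "said", "announced")

# One or more capitalized words immediately followed by a reporting verb.
PATTERN = re.compile(
    r"\b((?:[A-Z][a-z]+\s)+)(?:" + "|".join(REPORTING_VERBS) + r")\b"
)


def candidate_names(text):
    """Return capitalized word sequences that directly precede a reporting verb."""
    return [m.group(1).strip() for m in PATTERN.finditer(text)]


sample = "In a statement, Jane Doe declared that the budget was final."
print(candidate_names(sample))
```

Note that no list of known people is consulted: the name is picked up purely from its position in the sentence. The pattern is easy to fool, for example by a capitalized word at the start of a sentence, which is one reason real systems rely on statistical models.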
Nuxeo: So a customer can take this semantic plugin and put it into their CMS and it would start creating the relationships?
Olivier: Yes, and it proposes new candidate structures or ways of organizing, and then the human user can validate whether the output looks good. But the eDiscovery part is not implemented yet. Right now, if you have a document, it annotates the document with additional information, such as the people who were detected in the document. It’s then able to link them to a Wikipedia article, so if you need their biography, summary, picture, nationality, or birth date, you can fetch that kind of structured, linked data. So this is what we’re doing now, and we are trying to improve the quality of this existing part.
Also, we’re working on making this kind of annotation possible even when you don’t have web access to an external database. Right now we are using DBpedia (there’s also Freebase by Google), but if your server’s link is down, or if DBpedia is down, then the plugin no longer works correctly because it loses the connection to the database. We’re working on having local indexes of remote knowledge bases: when the link is up you can pull in the changes, but when the link is down it still works. That’s very important for customers who want their data to stay confidential within their IT infrastructure.
The main problem with the current offering for semantic technologies is that most providers offer just a service that they host, which means you have to send your data to their servers, where it’s analyzed and enriched and then sent back to your server. The goal of our project is to develop this kind of technology as an open source project so you can run it on your own server and you’re not dependent on external providers, which could result in confidentiality issues and problems with availability.
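The idea of continuing to work while the link is down can be illustrated with a read-through local cache, which is a simpler stand-in for the synchronized local index described here. Everything in this sketch is hypothetical: the LocalEntityIndex class, the fetch_remote callable, and the fake remote used in the demo are assumptions, and no real DBpedia API is called.

```python
class LocalEntityIndex:
    """Serve entity lookups from a local copy; refresh it when the remote is reachable."""

    def __init__(self, fetch_remote):
        # fetch_remote is a hypothetical callable: entity name -> dict of facts.
        # It should raise ConnectionError when the remote service is unreachable.
        self._fetch_remote = fetch_remote
        self._local = {}

    def lookup(self, name):
        try:
            # Link is up: take fresh data and update the local copy.
            self._local[name] = self._fetch_remote(name)
        except ConnectionError:
            # Link is down: fall through to whatever was stored earlier.
            pass
        return self._local.get(name)


# Stand-in for a remote knowledge base; no network access is performed here.
state = {"up": True}


def fake_remote(name):
    if not state["up"]:
        raise ConnectionError("remote knowledge base unreachable")
    return {"name": name, "type": "Person"}


index = LocalEntityIndex(fake_remote)
first = index.lookup("Jane Doe")   # remote reachable, local copy is filled
state["up"] = False
second = index.lookup("Jane Doe")  # remote down, answer comes from the local copy
print(first == second)
```

A real local index would be populated in bulk ahead of time rather than one entity at a time, so that entities never looked up before are also available offline.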
Nuxeo: When will eDiscovery be released?
Olivier: This part is more prospective, so it’s not firmly scheduled, but I hope that by the end of the project, which is in 2 years, we will be able to do more, like continuous news feed aggregation and automatic structuring of news feeds. This is the kind of nice demo we want to provide for the final review of the project.
Nuxeo: Do you have any customers who are using/testing this technology already?
Olivier: Yes, actually for the first project I mentioned, SCRIBO—one of the partners is Agence France Presse and they are already using this kind of technology. They are our main use case provider and already have some components working in production and they are testing to see how it can be integrated in their infrastructure, some of which is based on Nuxeo technology.
Some new things are being tested and deployed. There is a component, not yet implemented in Stanbol, that is able to detect what people have said even if there are no direct quotes. For instance, if somebody has declared something, like “the Prime Minister declared that a particular law will pass in 2 months,” semantic technology is able to detect that this person made this statement. This means that you can then extract all the public declarations made in the news by a particular person. So, if this person goes on to be a candidate in an election later, you can make a statistical analysis of what they said 2 years ago.
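Collecting indirect statements per speaker can be sketched as follows. This is a toy pattern and not the component Olivier refers to: the regular expression, the statements_by_speaker function, and the sample news text are all assumptions made up for the example, and it only handles the single phrasing "X declared that ...".

```python
import re
from collections import defaultdict

# Optional article, then the speaker as capitalized words, then the reported clause.
PATTERN = re.compile(
    r"(?:[Tt]he\s)?((?:[A-Z][a-z]+\s)+)declared that ([^.]+)\."
)


def statements_by_speaker(text):
    """Group indirectly reported statements by the speaker who made them."""
    grouped = defaultdict(list)
    for speaker, statement in PATTERN.findall(text):
        grouped[speaker.strip()].append(statement.strip())
    return dict(grouped)


news = (
    "The Prime Minister declared that the law will pass in two months. "
    "Later, the Prime Minister declared that taxes will not rise."
)
print(statements_by_speaker(news))
```

Once statements are grouped by speaker, it becomes possible to go back later and review everything a given person is reported to have said.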
Nuxeo: How do you feel about working at Nuxeo? Do you enjoy it?
Olivier: Yeah, it’s great! I really appreciate being able to work on leading edge research and to try to apply it to real working software and not just in research papers. It’s a good combination between application of research and doing the research and development.
Nuxeo: What do you do in your free time?
Olivier: I’m working right now on some projects related to machine learning, specifically in the area of facial recognition.
Nuxeo: So this is your passion - you live and breathe your work?
Olivier: [laughs] Yes, definitely.