[OOo] New Thesaurus file format for OOo 2.0


Tue 08 March 2005 By nuxeo

The thesaurus file format will change between OOo versions 1.x and 2.x.

The engine, myThes, was developed by Kevin Hendricks (OOo
lingucomponent project lead). A standalone version is available at
http://lingucomponent.openoffice.org/thesaurus.html

The new format is based on WordNet from Princeton University.

The main changes introduced are:


  • data is now stored as plain text, no longer as binary

  • each entry can have multiple meanings and can be morphologically
    tagged
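
To make these two points concrete, here is a minimal Python sketch of reading
such a plain-text file. It assumes the layout used by myThes .dat files as I
understand it: a first line naming the encoding, then for each entry a header
line `word|count` followed by `count` meaning lines of the form
`(tag)|synonym|synonym|...`. The function name and the sample data are made up
for illustration and are not part of OOo.

```python
def parse_thesaurus(text):
    """Parse .dat-style text into (encoding, {word: [(tag, [synonyms]), ...]}).

    Assumed layout (not taken from the post itself):
      line 1            encoding name
      word|count        entry header
      (tag)|syn|syn...  one line per meaning, repeated `count` times
    """
    lines = text.splitlines()
    encoding = lines[0].strip()
    entries = {}
    i = 1
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between entries
            continue
        word, count = lines[i].rsplit("|", 1)
        count = int(count)
        meanings = []
        for line in lines[i + 1 : i + 1 + count]:
            tag, *synonyms = line.split("|")
            meanings.append((tag, synonyms))
        entries[word] = meanings
        i += 1 + count
    return encoding, entries


# Hypothetical sample: one entry with two meanings, each morphologically tagged.
SAMPLE = """UTF-8
simple|2
(adj)|easy|plain
(noun)|herb
"""

encoding, entries = parse_thesaurus(SAMPLE)
for tag, synonyms in entries["simple"]:
    print(tag, ", ".join(synonyms))
```

Each meaning keeps its own tag, so one word can carry an adjective sense and a
noun sense side by side.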


This new format is incompatible with the old one, so existing thesauruses will
not work in OOo 2.0.

I'm working on a small program that translates the old thesauruses to the new
format. It is an OOo macro that relies mainly on the
com.sun.star.linguistic2.Thesaurus service available in OOo 1.1.x and on
the old .idx file, which is plain text.
Once the data has been transformed (that is, the .dat file has been created),
the new .idx index file is generated using a Perl script Kevin wrote.
The program is almost finished and will be released under a free licence so that other
native-lang OOo projects can convert their own thesauruses if needed.
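
The index generation step can be sketched in Python as follows. This is not
Kevin's Perl script, only an illustration under assumptions: that the .dat file
starts with an encoding line followed by `word|count` headers and `count`
meaning lines each, and that the .idx file lists the encoding, the number of
entries, and then one sorted `word|offset` line per entry, where the offset is
the byte position of the entry header in the .dat file. The function name and
sample data are hypothetical.

```python
def build_index(dat_bytes):
    """Build .idx-style text from the raw bytes of a .dat-style file."""
    lines = dat_bytes.splitlines(keepends=True)
    encoding = lines[0].decode("ascii").strip()
    offset = len(lines[0])  # first entry starts right after the encoding line
    index = []
    i = 1
    while i < len(lines):
        header = lines[i].decode(encoding).rstrip("\r\n")
        if not header.strip():
            offset += len(lines[i])  # skip blank lines but keep counting bytes
            i += 1
            continue
        word, count = header.rsplit("|", 1)
        index.append((word, offset))
        # Advance past the header and its meaning lines, counting bytes.
        block = lines[i : i + 1 + int(count)]
        offset += sum(len(line) for line in block)
        i += len(block)
    index.sort()
    out = [encoding, str(len(index))]
    out += [f"{word}|{pos}" for word, pos in index]
    return "\n".join(out) + "\n"


# Hypothetical two-entry .dat content.
SAMPLE_DAT = (
    b"UTF-8\n"
    b"simple|1\n"
    b"(adj)|easy|plain\n"
    b"easy|1\n"
    b"(adj)|simple\n"
)

idx_text = build_index(SAMPLE_DAT)
print(idx_text, end="")
```

Working on bytes rather than decoded text matters here, because the offsets
must still be valid for a reader that seeks directly into the .dat file.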

Concerning morphological information (verb, noun, adjective ...),
which is currently missing for all entries, Myriam's work will be of great help in generating it.

(Post originally written by Laurent Godard on the old Nuxeo blogs.)


Category: Product & Development