I have finished the first stage of my work on LanguageTool: I have adapted the tool to French grammar checking.
Work on rules
As I explained
previously, at the beginning of the French grammar checker project,
Myriam Lechelt worked on An Gramadóir. She wrote many
disambiguation and correction rules. Since An Gramadóir was limited and
did not really suit French, it was abandoned.
During my work, I converted An Gramadóir's rules to LanguageTool.
Thanks to Marcin Miłkowski, who implemented a disambiguator according
to my instructions, I could import the disambiguation rules as well as the
correction rules. Moreover, I simplified them a lot and considerably
reduced their number thanks to the XML rule language.
Then I analysed a corpus of mistakes (V. Lucci and A. Millet, 1994,
L'orthographe de tous les jours : enquête sur les pratiques
orthographiques des Français, Éditions Champion) and
extracted new grammar rules from it.
LanguageTool can detect the following kinds of mistakes:
- phonetic proximity (confusion of homophones such as ont and on, or ça
and sa)
- mistakes in verb phrases (confusion between infinitive and past
participle, between a conjugated form and a past participle, etc.)
- subject-verb agreement (with a personal pronoun, or with a noun phrase
made of only a determiner and a noun)
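To make this concrete, here is a simplified sketch of a pattern-based homophone rule in the spirit of the first category above. It is not LanguageTool's actual engine or rule syntax: the function name and the tags (V-PPA for a past participle, PRO for a pronoun) are illustrative assumptions.

```python
# Simplified, hypothetical sketch of a pattern-based homophone rule
# (not LanguageTool's actual engine or syntax). Tags such as 'V-PPA'
# (past participle) are assumptions; words are (form, tag) pairs.

def check_on_ont(tagged_sentence):
    """Flag 'on' directly followed by a past participle, where the
    homophone 'ont' (verb avoir) was probably intended."""
    alarms = []
    for i, (form, tag) in enumerate(tagged_sentence[:-1]):
        next_form, next_tag = tagged_sentence[i + 1]
        if form.lower() == "on" and next_tag == "V-PPA":
            alarms.append((i, f"'{form} {next_form}': did you mean "
                              f"'ont {next_form}'?"))
    return alarms

# 'Ils on mangé' (wrong) triggers an alarm; 'Ils ont mangé' would not.
print(check_on_ont([("Ils", "PRO"), ("on", "PRO"), ("mangé", "V-PPA")]))
```

A real rule engine matches such patterns declaratively rather than in hand-written functions, but the principle — a fixed pattern over forms and tags — is the same.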
Limits of the formalism
While working on the rules, I ran tests that showed me the limits of
LanguageTool's formalism. Because of the rigid pattern matching on which it
is based, if the patterns described in the rules do not exactly match the
text, the rules become ineffective and some mistakes go
undetected. Moreover, it is necessary to foresee every wrong combination of
words in order to describe it in the rules. This leads to a combinatorial
explosion of the number of rules, especially for noun phrases.
The formalism also generates many false alarms, because of ambiguities
or wrong tags. Some mistakes are detected several times simultaneously, by
different rules. And when a word is wrong, it can trigger false alarms on
nearby words, since the rules are based on context.
I have developed a new formalism to improve French grammar checking in
LanguageTool. It is based on chunks and on the unification of feature
structures. I mix a contextual
syntactic theory (chunks, Abney) with a generative syntactic theory
(unification, Chomsky). This is not a typical combination, but it
makes it possible to go further in grammar checking by delimiting an area of
the sentence in which all words must agree. It is then no longer necessary to
describe all wrong combinations of words: instead of listing agreement
mistakes, inconsistencies are detected within phrases.
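The chunk-plus-unification idea can be sketched as follows. This is a minimal illustration, not my actual implementation: the flat feature structures and the feature names (num, gen) are simplifying assumptions. All the words of a chunk are unified with one another, so an agreement mistake surfaces as a unification failure, with no need to list the wrong combinations in advance.

```python
# Minimal sketch of agreement checking by unification of feature
# structures within a chunk. Feature names ('num', 'gen') are assumptions.

def unify(fs1, fs2):
    """Unify two flat feature structures; return None on conflict."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result and result[feat] != val:
            return None  # inconsistent features: agreement mistake
        result[feat] = val
    return result

def check_chunk(chunk):
    """Unify the feature structures of all words in a chunk."""
    fs = {}
    for word, word_fs in chunk:
        unified = unify(fs, word_fs)
        if unified is None:
            return f"agreement inconsistency at '{word}'"
        fs = unified
    return "chunk is consistent"

# 'les petit chat': plural determiner, singular adjective and noun.
np = [("les", {"num": "pl"}),
      ("petit", {"num": "sg", "gen": "m"}),
      ("chat", {"num": "sg", "gen": "m"})]
print(check_chunk(np))  # the inconsistency surfaces at 'petit'
```

Note that a single unification mechanism covers every agreement pattern inside the chunk, where the pattern-matching approach would need one rule per wrong combination.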
Thanks to my work for my MPhil, French grammar checking is available for
OpenOffice.org. But a lot of work remains: it is necessary to
create a tool compatible with the new formalism, and to build and analyse a
corpus of mistakes in order to write new grammar rules.
A new approach for grammar checking
To improve grammar checking, I am considering another method, which
consists in performing the morphosyntactic analysis and the grammar
checking at the same time, as the sentence is read. This "left-right"
method is based on the principle of latencies (Tesnière, 1959). By
declaring what is expected after a word or a phrase, inconsistencies can be
detected, instead of all possible mistakes having to be listed.
This approach will also solve the vicious-circle problem in grammar
checking. Indeed, for mistakes to be detected, the tagging must not be
wrong; but for the tagging to be correct, the text must not contain any
mistakes.
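The latency idea can be sketched roughly as follows, under heavy assumptions: the categories and the expectation table below are invented for illustration and are not a real grammar of French. Each word opens a set of latencies, and the next word is checked against them as the sentence is read left to right.

```python
# Hypothetical sketch of expectation-based ("latency") checking.
# The categories and the EXPECTS table are illustrative assumptions,
# not a real grammar of French.

EXPECTS = {
    "DET": {"ADJ", "NOUN"},   # a determiner expects an adjective or a noun
    "ADJ": {"ADJ", "NOUN"},   # an adjective expects an adjective or a noun
    "NOUN": {"VERB", "END"},  # a noun expects a verb or the sentence end
    "VERB": {"DET", "END"},   # a verb expects a new noun phrase or the end
}

def check_left_right(tagged_words):
    """Read the sentence left to right, flagging every word that
    satisfies none of the latencies opened by the previous word."""
    alarms = []
    expected = {"DET", "NOUN"}  # what may start a sentence (assumption)
    for word, tag in tagged_words:
        if tag not in expected:
            alarms.append(f"unexpected '{word}' ({tag})")
        expected = EXPECTS.get(tag, {"END"})
    return alarms

# 'le chat dort' satisfies every latency; 'le dort chat' does not.
print(check_left_right([("le", "DET"), ("chat", "NOUN"), ("dort", "VERB")]))
print(check_left_right([("le", "DET"), ("dort", "VERB"), ("chat", "NOUN")]))
```

The point of the sketch is that the grammar only states what is expected; nothing enumerates the possible mistakes, and the analysis and the checking happen in the same left-to-right pass.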
(Post originally written by Agnes Souque on the old Nuxeo blogs.)