Summary of the French Grammar Checker Project


Thu 19 July 2007 By nuxeo

I have recently started to work on the project of a free French grammar
checker that could be integrated into OpenOffice.org.
Myriam Lechelt initiated this project two years ago by adapting An Gramadóir, an Irish grammar
checker developed by Kevin Scannell. But this tool turned out not to be very
well suited to French grammar.
Myriam had also analyzed other grammar checkers, among which LanguageTool, a
rule-based style and grammar checker initially developed for English by
Daniel Naber and later extended to German, Polish, and Hungarian. Myriam
rejected it at the time, but it has progressed a lot since then, so we have
decided to base our French grammar checker on its new version.

In her work, Myriam
outlined several leads for creating a new grammar checker for French. For
example, she advises segmenting sentences into chunks, an intermediate unit
between the sentence and the word. She also suggests using grammatical
unification and feature structures to find grammar mistakes.

How do grammar checkers work?


First of all, a tokenizer segments the text into sentences and words. Then a
tagger assigns one or more tags to each token, carrying morphosyntactic
information such as gender, number, tense, and person.
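As a minimal sketch of these two steps (the words, lexicon entries, and tag
names below are invented for illustration, not LanguageTool's actual tagset):

```python
import re

# Toy lexicon: each word form maps to one or more morphosyntactic tags.
LEXICON = {
    "la": ["det:fem:sg", "pro:fem:sg"],         # ambiguous: determiner or pronoun
    "porte": ["noun:fem:sg", "verb:pres:3sg"],  # "door" or "carries"
    "est": ["verb:pres:3sg"],
    "ouverte": ["adj:fem:sg"],
}

def tokenize(text):
    """Split a text into sentences, then each sentence into word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+", s.lower(), re.UNICODE) for s in sentences]

def tag(tokens):
    """Attach every possible tag to each token; no disambiguation yet."""
    return [(t, LEXICON.get(t, ["unknown"])) for t in tokens]

sentences = tokenize("La porte est ouverte.")
tagged = tag(sentences[0])
# Both "la" and "porte" keep two tags here: they are ambiguous.
```

Note that the tagger deliberately keeps every candidate tag; choosing among
them is the job of the disambiguation step described next.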

Many words have several tags, so disambiguation is often necessary to
eliminate the tags that are inappropriate in a given context and keep only
the right one. Disambiguation methods can be either statistical or
rule-based. Statistical methods need a tagged training corpus, and the
grammar checking then depends heavily on that corpus. The rule-based method
requires a large number of hand-made rules describing the context in which a
word must have a certain tag. This second method is easier to control.
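A rule-based disambiguation step can be sketched as follows (the single rule
and the tag names are invented for illustration): after an unambiguous
determiner, a noun/verb-ambiguous word is forced to its noun reading.

```python
def disambiguate(tagged):
    """Apply hand-made context rules to a tagged sentence.
    tagged is a list of (word, [tags]) pairs, processed left to right."""
    result = []
    for word, tags in tagged:
        if result and len(tags) > 1:
            prev_tags = result[-1][1]
            # Rule: right after an unambiguous determiner, a word that can be
            # either a noun or a verb must be a noun ("la porte" -> noun).
            if all(t.startswith("det") for t in prev_tags):
                nouns = [t for t in tags if t.startswith("noun")]
                if nouns:
                    tags = nouns
        result.append((word, tags))
    return result

sentence = [("la", ["det:fem:sg"]),                       # already resolved
            ("porte", ["noun:fem:sg", "verb:pres:3sg"])]  # still ambiguous
disambiguated = disambiguate(sentence)
# "porte" now keeps only its noun tag.
```

A real disambiguator would need many such rules; the point is only that each
rule names a context and the tag to keep, which makes its behavior easy to
inspect and control.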

Finally, grammar checking itself matches the text against rules, which can be
either grammar rules or error rules. Grammar rules describe what is correct,
and everything that does not match them is flagged as an error. This can be
very annoying, since many errors are wrongly reported if the rules are not
exhaustive. Error rules, on the contrary, describe what is wrong, and
everything that matches them is flagged as an error. But however numerous the
rules, it is impossible to anticipate all mistakes, so some errors will
always go undetected. This, however, is preferable to false detections.
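An error rule can be sketched as a pattern over consecutive tokens' tags
(tags invented here): this one flags a singular noun directly followed by a
plural adjective.

```python
def matches_error(tagged, pattern):
    """Return the start positions where consecutive tokens match an error
    pattern, given as a list of predicates over a token's tag list."""
    hits = []
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(pred(tags) for pred, (_, tags) in zip(pattern, window)):
            hits.append(i)
    return hits

# Error rule: a singular noun directly followed by a plural adjective.
rule = [
    lambda tags: any(t.startswith("noun") and t.endswith("sg") for t in tags),
    lambda tags: any(t.startswith("adj") and t.endswith("pl") for t in tags),
]

sentence = [("la", ["det:fem:sg"]),
            ("porte", ["noun:fem:sg"]),
            ("ouvertes", ["adj:fem:pl"])]
errors = matches_error(sentence, rule)
# The mismatch "porte ouvertes" is reported at position 1.
```

Anything the rule set does not describe is silently accepted, which is
exactly the trade-off discussed above: missed errors rather than false ones.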

About LanguageTool


LanguageTool is a style and grammar checker developed by Daniel Naber. It
is composed of several Java components that successively perform tokenization
into sentences and words, tagging, and the detection of grammar mistakes.

There is no disambiguation after the tagging, so many words can keep more
than one tag. A disambiguator interface has, however, been implemented for
the languages where the lack of disambiguation is a problem.

The detection of errors is based on error rules formalized in XML. Each
rule has an identifier (id), a name, a pattern describing the context of the
mistake, a message explaining the mistake, and examples showing a correct
and an incorrect sentence for that mistake.
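To illustrate this structure, a rule could look roughly like the following.
This is only a sketch: the id, the POS tags, and the wording are invented,
and LanguageTool's actual XML schema may differ in its details.

```xml
<rule id="DET_NOUN_AGREEMENT" name="Determiner-noun agreement">
  <pattern>
    <!-- hypothetical POS tags: a masculine determiner before a feminine noun -->
    <token postag="det:masc:sg"/>
    <token postag="noun:fem:sg"/>
  </pattern>
  <message>The determiner does not agree in gender with the noun.</message>
  <example type="correct">La porte est ouverte.</example>
  <example type="incorrect">Le porte est ouverte.</example>
</rule>
```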

Tests with French and problems encountered


We have tested the few rules ported from An Gramadóir and written by
Myriam Lechelt. We immediately noticed that the absence of disambiguation
would be a major problem for checking French: the detection of mistakes
almost always failed because of ambiguous words.
We tried to get around this problem by modifying the rules to take
ambiguity into account, but we realized that it would be very tedious to
build every rule that way. The best solution is to implement a disambiguator
after the tagging, and that is what we will try to do.

We also became aware of the problem raised by the structure of the rules,
and more precisely by the patterns in the rules and the rigid
pattern-matching method. It requires describing every context in which a
mistake can occur, that is to say every possible combination of words, with a
rule for each one. But it is simply impossible to anticipate all of them. We
could only write a very large number of rules, which would never be
exhaustive and would be costly to process.

An alternative with chunks and unification


According to Abney, "The typical chunk consists of a single content word
surrounded by a constellation of function words, matching a fixed
template" (S. P. Abney, 1991, Parsing by Chunks).
The internal structure of a chunk is fixed, and the function words inside it
all depend on the lexical head and agree with it. Within the sentence, chunks
agree with each other, and they can easily permute, contrary to the words
inside a chunk.

Feature structures describe each element of a sentence with a list of
feature-value pairs. Unification consists in matching the feature structures
of different elements: the matching fails if a feature does not have the same
value in the feature structures of the elements being compared.
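Unification of feature structures can be sketched like this (the feature
names and values are illustrative, not a real tagset):

```python
def unify(fs1, fs2):
    """Unify two feature structures, given as feature -> value dicts.
    Return the merged structure, or None when a shared feature conflicts."""
    merged = dict(fs1)
    for feature, value in fs2.items():
        if feature in merged and merged[feature] != value:
            return None  # conflicting values: unification fails
        merged[feature] = value
    return merged

# "la porte": determiner and noun are both feminine singular ->
# unification succeeds, so no agreement error is reported.
ok = unify({"gender": "fem", "number": "sg"},
           {"gender": "fem", "number": "sg"})

# "le porte": masculine determiner, feminine noun -> unification fails,
# which signals an agreement mistake.
clash = unify({"gender": "masc", "number": "sg"},
              {"gender": "fem", "number": "sg"})
```

A failed unification is precisely what a checker can report as an agreement
error, without a dedicated rule for each word combination.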

The use of both chunks and unification is a very interesting alternative
that can make grammar checking much easier. First, by unifying only features,
and not grammatical categories, we considerably reduce the number of
necessary rules, since we no longer need to enumerate all possible
combinations of words.
Second, the relations between chunks will be very helpful for some checks,
such as the agreement between the subject and the verbal chunk, or more
generally for any agreement between distant words. Indeed, distant relations
cannot be checked by a system that only applies pattern matching to the
immediate context.
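Chunk-level checking could then compare the feature structures of chunk
heads directly, regardless of how far apart the chunks are. A sketch, with
invented feature names:

```python
def check_agreement(subject_features, verb_features):
    """Check person and number agreement between the head of a subject chunk
    and the head of a verbal chunk, using their feature structures."""
    for feature in ("person", "number"):
        if subject_features.get(feature) != verb_features.get(feature):
            return False
    return True

# "Les portes ... est ouverte": plural subject chunk, singular verbal chunk.
agrees = check_agreement({"person": 3, "number": "pl"},
                         {"person": 3, "number": "sg"})
# agrees is False: a subject-verb agreement error is detected, even if
# other words stand between the two chunks.
```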

Disambiguation


Myriam Lechelt has built many disambiguation rules for An Gramadóir. We
intended to port them to LanguageTool, so we analyzed the Java source files
to see how we could add them, and we thought about how to improve
disambiguation.

It would be logical to rewrite the rules in XML, since that is the formalism
LanguageTool uses for all its rules. Moreover, XML rules can be more easily
understood and maintained by linguists who are not necessarily computer
scientists.

In some cases it could be preferable not to disambiguate a word completely,
but only its grammatical category. With features left ambiguous, some
mistakes may go undetected; but with features badly disambiguated, false
mistakes can be reported, which is much more annoying for the user.

We have thought about disambiguation, how to improve it, and how to port the
rules to LanguageTool. In the end, however, we have decided not to implement
disambiguation now, since we lack time and it is more important for us to
improve the grammar checking itself. Instead, we will tag and disambiguate
sentences from a corpus of mistakes, and use these sentences for the next
step: the grammar checking.

(Post originally written by Agnes Souque on the old Nuxeo blogs.)


Category: Product & Development