[Q&A Friday] How Do We Search for Accented Characters in Note Content?


Fri 01 March 2013 By Laurent Doguin



Here's a question from patrek; you might have seen him a lot on Nuxeo Answers: How do we search for accented characters in Note content?

If you are not familiar with Nuxeo, you have to know that the Note document type is used to store text. This text can be plain text, XML, HTML or Markdown. When you store a note in HTML, the accented characters you type are converted to HTML entities. For instance, if you type 'é', it will be stored as '&eacute;'. And if it's stored as '&eacute;', it will be indexed as such. This means that to find it, you'd have to type '&eacute;' in the search field instead of 'é'. That's no good for most users; they want to type 'é' just like they did when writing the note.

So we need to unescape those HTML entities, but not all of them. The note was stored as HTML for a reason: you don't want to unescape '&lt;', '&gt;' or '&amp;', because they would turn into '<', '>' and '&' and change the markup. What we want here is basically to unescape only the ISO-8859-1 characters. A list is available on the W3C reference page. With older versions of HTML, it made sense to store ISO-8859-1 characters as HTML entities, as they were not part of the HTML spec. But now every browser supports those characters, so we can safely unescape them.
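To make the "only some entities" idea concrete, here is a minimal sketch in plain Java of a selective unescaper. It is not part of the sample discussed in this post: the class name `Latin1EntityUnescaper` is hypothetical, the table of named entities is deliberately partial (the full ISO-8859-1 list has 96 entries), and numeric references are only decoded when they fall in the 0xA0 to 0xFF range. Everything else, including `&lt;`, `&gt;` and `&amp;`, is left exactly as it was.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the original sample. */
final class Latin1EntityUnescaper {

    // Matches hex (&#xE9;), decimal (&#233;) and named (&eacute;) references.
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    // Assumption: a partial table with a few named ISO-8859-1 entities.
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("Agrave", '\u00C0');
        NAMED.put("Eacute", '\u00C9');
        NAMED.put("agrave", '\u00E0');
        NAMED.put("ccedil", '\u00E7');
        NAMED.put("egrave", '\u00E8');
        NAMED.put("eacute", '\u00E9');
        NAMED.put("ecirc", '\u00EA');
        NAMED.put("ocirc", '\u00F4');
        NAMED.put("ugrave", '\u00F9');
        NAMED.put("uuml", '\u00FC');
    }

    private Latin1EntityUnescaper() {
    }

    /** Replaces known Latin-1 references; any other reference is kept as is. */
    static String unescape(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown or structural: leave untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character, or null when the reference must be kept.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            Character c = NAMED.get(body);
            return c == null ? null : String.valueOf(c);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int code = hex ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
        // Only the upper half of ISO-8859-1; '<' (60), '>' (62), '&' (38) stay escaped.
        return (code >= 0xA0 && code <= 0xFF) ? String.valueOf((char) code) : null;
    }
}
```

With this sketch, `Latin1EntityUnescaper.unescape("caf&eacute; &lt;b&gt;")` returns `café &lt;b&gt;`: the accent is restored while the escaped markup stays escaped.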

You have to know that some HTML sanitizing is already done automatically on the HTML content of notes. Sanitizing is mostly there for security reasons, particularly to avoid XSS vulnerabilities. Take a look at the listener that does this. As you can see, it's a synchronous listener, triggered very early and only when you're about to create a document or modify it. What we have to do is unescape those HTML entities right after they're sanitized. Here's the appropriate listener contribution:

<extension target="org.nuxeo.ecm.core.event.EventServiceComponent"
point="listener">

<listener name="htmlunescaperlistener" async="false" postCommit="false"
class="org.nuxeo.sample.HTMLUnescaperListener" order="-5">
<event>aboutToCreate</event>
<event>beforeDocumentModification</event>
</listener>
</extension>
Now we need the code that does the actual unescaping. The complete sample is available on GitHub. Here I'm using the Apache Commons Lang StringEscapeUtils.unescapeHtml method. Be careful, as it might unescape too much :) It decodes every entity it knows, including '&lt;', '&gt;' and '&amp;'. Take a look at the unit test and complete it if you want to be sure.
/*
* (C) Copyright 2013 Nuxeo SA (http://nuxeo.com/) and others.
*
* All rights reserved. This program and the accompanying materials
* are made available under the terms of the GNU Lesser General Public License
* (LGPL) version 2.1 which accompanies this distribution, and is available at
* http://www.gnu.org/licenses/lgpl-2.1.html
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* Contributors:
* ldoguin
*/
package org.nuxeo.sample;

import org.apache.commons.lang.StringEscapeUtils;
import org.nuxeo.ecm.core.api.ClientException;
import org.nuxeo.ecm.core.api.DocumentModel;
import org.nuxeo.ecm.core.event.Event;
import org.nuxeo.ecm.core.event.EventContext;
import org.nuxeo.ecm.core.event.EventListener;
import org.nuxeo.ecm.core.event.impl.DocumentEventContext;
import org.nuxeo.ecm.core.schema.FacetNames;

/**
* @author ldoguin
*/
public class HTMLUnescaperListener implements EventListener {

    public void handleEvent(Event event) throws ClientException {

        EventContext context = event.getContext();
        if (!(context instanceof DocumentEventContext)) {
            return;
        }
        DocumentModel doc = ((DocumentEventContext) context).getSourceDocument();
        // Skip read-only documents and documents that have no note schema.
        if (doc.hasFacet(FacetNames.IMMUTABLE) || !doc.hasSchema("note")) {
            return;
        }

        String html = doc.getProperty("note:note").getValue(String.class);
        if (html == null) {
            return; // nothing to unescape
        }
        // Caution: unescapeHtml also decodes &lt;, &gt; and &amp;.
        String unescapedHtml = StringEscapeUtils.unescapeHtml(html);
        doc.setPropertyValue("note:note", unescapedHtml);
    }

}
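Since unescapeHtml decodes every entity it knows, one possible way to limit the damage is to shield the structural entities before calling it. The following is only a sketch under stated assumptions: the `MarkupEntityGuard` class is hypothetical, the unescaper is passed in as a function so the block stays free of dependencies, and the trick relies on the unescaper making a single pass over the text (so that `&amp;lt;` comes out as `&lt;`, not `<`). Numeric references such as `&#60;` are not covered here.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the original sample. */
final class MarkupEntityGuard {

    // Named entities that carry HTML structure and should stay escaped.
    private static final Pattern PROTECTED = Pattern.compile("&(lt|gt|amp|quot|apos);");

    private MarkupEntityGuard() {
    }

    /**
     * Escapes the ampersand of each protected entity, so that a single
     * unescaping pass restores "&lt;" instead of producing a raw "<".
     */
    static String protect(String html) {
        return PROTECTED.matcher(html).replaceAll("&amp;$1;");
    }

    /** Protects structural entities, then applies the given unescaper once. */
    static String unescapeSafely(String html, UnaryOperator<String> unescaper) {
        return html == null ? null : unescaper.apply(protect(html));
    }
}
```

In the listener, the call would then look something like `MarkupEntityGuard.unescapeSafely(html, StringEscapeUtils::unescapeHtml)`. Whether that is enough for your content is exactly what the unit test should tell you.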
Here's the unit test. Note that we are deploying the htmlsanitizer bundle to have a test closer to reality.
/*
* (C) Copyright 2013 Nuxeo SA (http://nuxeo.com/) and others.
*
* All rights reserved. This program and the accompanying materials
* are made available under the terms of the GNU Lesser General Public License
* (LGPL) version 2.1 which accompanies this distribution, and is available at
* http://www.gnu.org/licenses/lgpl-2.1.html
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* Contributors:
* ldoguin
*/
package org.nuxeo.sample.test;

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.nuxeo.ecm.core.api.CoreSession;
import org.nuxeo.ecm.core.api.DocumentModel;
import org.nuxeo.ecm.core.test.CoreFeature;
import org.nuxeo.runtime.test.runner.Deploy;
import org.nuxeo.runtime.test.runner.Features;
import org.nuxeo.runtime.test.runner.FeaturesRunner;

import com.google.inject.Inject;

@RunWith(FeaturesRunner.class)
@Features(CoreFeature.class)
@Deploy({ "org.nuxeo.ecm.platform.htmlsanitizer",
        "nuxeo-htmlunescaper-sample" })
public class HTMLUnescaperListenertest {

    public static final String BAD_HTML = "<b>&eacute;foo&agrave;<script>bar</script></b>";

    public static final String SANITIZED_UNESCAPED_HTML = "<b>éfooà</b>";

    @Inject
    CoreSession session;

    @Test
    public void sanitizeNoteHtml() throws Exception {
        DocumentModel doc = session.createDocumentModel("/", "n", "Note");
        doc.setPropertyValue("note", BAD_HTML);
        doc.setPropertyValue("mime_type", "text/html");
        doc = session.createDocument(doc);
        String note = (String) doc.getPropertyValue("note");
        assertEquals(SANITIZED_UNESCAPED_HTML, note);

        session.save();
        doc.setPropertyValue("note", BAD_HTML);
        doc = session.saveDocument(doc);
        note = (String) doc.getPropertyValue("note");
        assertEquals(SANITIZED_UNESCAPED_HTML, note);
    }
}

Category: Product & Development
Tagged: Q&A