I have lately been looking into increasing the performance of
CPSSharedCalendar when you have many events in the database (some tens of
thousands or more). There are a couple of things that quickly gets tricky
when doing this:


  1. No matter what you index on, the result set from each index will be
    very big.
  2. Indexing recurring events is slightly tricky.


Big result sets


The main thing to filter on for events is obviously start and end time. You
create one index for start time and one for end time. When searching you
then have one result set with anything that falls before one of these dates
and another result set with everything that falls after one of the other
dates. And then you need to make a union of these result sets.



Sound simple enough? Well, it is simple, but it is also very slow, because
the result sets will be huge, and making a union will take a long time.
Creating more indexes does not help, because the only thing you typically
search on is the dates and who goes to the meeting, which of course just
creates a third huge result set.



The way around this is to only use one index, for the data that is likely to
create the smallest result set. This is the index where you compare with
start time, as most searches will be on "today" or "this week" and there are
typically more events in the past than in the future.



The other data you want to compare with are kept in a sort of "meta data"
index, that is, instead of mapping from value to event, you map from event
to value. Then you loop through the result set, and fetch the end date value
from this "meta data index" and compare with that.



This is faster than having two index result sets and doing a union, because
both result sets will be so large that it is faster to get the data event by
event, as the amount of comparisons that is needed gets much smaller.



More hints on how to do efficient searching with large results sets are
welcome.

Indexing recurring events


The biggest problem of indexing recurring events is that if you set
something up to happen every week, forever, you have no end date to index.




I solved this by indexing events with no end date with the value "None" and
handling them separately. This causes some extra work. In principle, you
need to check every single event that recurs forver in every request. You
also need to check every occurence of every recurring event where the
time searched falls on the time span of the occurrences.



This can be handled in several ways.


  • You could index a recurring event with the start date of the first
    occurrence, and the end date of the last. This is the simplest and
    slowest.

  • Or, you could simply index every recurrence separately (except for
    never ending recurrences, of course).

  • Or you can mix it. You can index every recurrence for the next year
    separately, and then index the test as one very long occurrence. You
    could even index the occurences during the search, using the indexing
    as a sort of "occurence caching".




But I don't expect recurring events to be that common, so after briefly
looking into the faster and more complex ways of recurring this, I went with
the simplest and slowest,
according to the "You Ain't Gonna Need It" principle. If your calendar
becomes slow and this is because you have very many recurring events, then
these methods might be of interest.

The results


Well, I'm happy to say that the time to search out a months worth of data
amongst a 100.000 events has dropped from 3.5 second to 0.17 seconds thanks
to the performance improvements I made last week. I think that should be
enough for quite a while.

(Post originally written by Lennart Regebro on the old Nuxeo blogs.)