
The Great Debate: Indexing vs. Relational Databases for Email Archiving

Archive vendors have long debated the merits of using indexing technology vs. relational databases to archive ESI. As a result, I’m often asked what kind of database M+Archive uses. In fact, many clients even list a database as a mandatory requirement in their RFPs.

The confusion stems primarily from the fact that other archive vendors rely on an enterprise database such as Oracle, MySQL (which happens to be owned by Oracle), MS SQL Server, etc. Clients need to know in advance which database a vendor uses so they can anticipate the additional costs: licensing the database, hiring expert DBAs to manage a large enterprise database, handling support issues, etc.

Why relational databases and email archiving are a mismatch

Relational databases are great for storing certain structured data, like the expense line items we enter when we build a budget in Excel. The problem is that most ESI, such as email, is unstructured data. It just doesn't fit properly in a database. Add to that the fact that email archives can run into hundreds of millions of documents. Imagine the headache of managing a database of that size.

The good news is M+Archive doesn’t need a database. Messaging Architects' use of indexing for M+Archive is unique: no other vendor that I know of relies on indexing technology alone.

So how is search performed? That’s where indexing comes in. Archived items are stored as files, and all of these files are indexed by the M+Archive Indexing Server. Every search runs against the index, which is how results come back in less than a second. For a client, this means you don’t need to hire a fancy DBA, and there's no Oracle license to buy or renew. Eliminating the database dependency simplifies things greatly, and the performance is scorching fast. Indexing technology was built to handle billions of documents with unstructured content.
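
To make the idea of "searching against the index" concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not the M+Archive Indexing Server: the class name, the tokenizer and the sample message IDs are all assumptions made for this example. The point is that a query looks up each term directly instead of scanning every stored message.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: maps each term to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Record the document under every term that appears in it.
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return the documents that contain every term in the query.
        terms = tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result


# Hypothetical archived messages, identified by made-up IDs.
index = InvertedIndex()
index.add("msg-001", "Q3 budget review attached")
index.add("msg-002", "Please review the merger agreement before Friday")
index.add("msg-003", "Lunch on Friday?")

print(sorted(index.search("review")))         # ['msg-001', 'msg-002']
print(sorted(index.search("friday review")))  # ['msg-002']
```

A production indexing engine adds much more (ranking, phrase queries, on-disk index segments), but the lookup-by-term principle is the same.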

The best way to understand indexing is to think of Google. There is no way Google would exist if they had to use traditional databases. In fact, their search engine relies on a proprietary distributed storage system called Bigtable. The indexing technology M+Archive uses also powers another public search engine that has indexed more than 8 billion Web pages.

Indexing just makes sense, especially in the context of archiving hundreds of millions of records where a few terms hidden in one of these records can make or break a litigation case.

Ranjit Sarai


2 Comments

Ranjit Sarai (July 28, 2009)

Thanks Rob - and great question. One way to perform Single Instance Storage (SIS) is to use a SQL DB where you can have keys to identify unique records. Another way, which is how M+Archive does it, is to reference all items with a hash value (think of it as a key). This hash value is actually stored as part of the archive file system. For performance reasons it's a multi-level hash that allows M+Archive to quickly identify where a record is stored. By leveraging the archive itself, which is a flat file system, there's no need for a DB.
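
As a rough illustration of the general technique (not M+Archive's actual code or on-disk layout), here is a minimal Python sketch of hash-based single instance storage on a plain file system. The choice of SHA-256, the two directory levels and the function names are all assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def path_for(root, digest):
    # Assumed layout: two directory levels taken from the leading hex
    # characters of the hash, so no single directory grows too large.
    return Path(root) / digest[:2] / digest[2:4] / digest


def store(root, data):
    """Store data once, keyed by its content hash, and return the key."""
    digest = hashlib.sha256(data).hexdigest()
    target = path_for(root, digest)
    if not target.exists():
        # First time this content is seen: write it.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    # Identical content seen again: nothing is written, same key returned.
    return digest


def load(root, digest):
    """The hash alone is enough to locate the record; no lookup table needed."""
    return path_for(root, digest).read_bytes()


def count_files(root):
    return sum(1 for p in Path(root).rglob("*") if p.is_file())


root = tempfile.mkdtemp()
attachment = b"quarterly-report.pdf contents"
key_a = store(root, attachment)   # first copy is written
key_b = store(root, attachment)   # duplicate: same key, no new file
key_c = store(root, b"a different message")

print(key_a == key_b)     # True
print(count_files(root))  # 2
```

Because the key is derived from the content itself, duplicates collapse to one stored file and the storage location can be computed from the key without querying a database.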

-Ranjit

Greenstone, Rob (July 28, 2009)

Hi Ranjit. Very informative article. I am very interested in indexing technologies and database systems myself, and I work heavily with MySQL for web design. One question, though: how is Single Instance Storage achieved without a database back end?
