Thursday, September 30, 2010

my experiments with solr :)

Its a catchy title, but yes.. thats what I am going to talk about...

I came across hadoop, when I was looking for a new solution for one of our in-house projects. The need was quite clear, however, the solution had to be dramatically different.

The one statement we received from business was, "We need an exceptionally fast search interface". And for that fast interface to search upon they had more than a hundred million rows worth of data in a popular RDBMS.

So, when I sat about thinking, how to make a fast search application, the first thing that came to my mind was, Google. Actually, whenever we talk about speed or performance of web sites, Google is invariably the first name that comes across.

Further, Google has a plus point that there is always some activity at the back end to generate the page or results that we see, its never static content. And, then, another point, Google has a few trillion pieces of information to store/index/search whereas our system was going to have significantly lower volume of data to manage.   So, going with that, Google looked like a very good benchmark for this fast search application. 

Then I started to look for "How Google generates that kind of performance". There are quite a few pages on the web talking about just that.   But, probably none of them has the definitive/authoritative view on Google's technology or for that matter the insider's view on how it actually does what it does so fast.

Some pages pointed towards their storage technology, some talked about their indexing technology, some about their access to huge volumes of high performance hardware and what not...

For me, some of them turned out to be genuinely interesting, one of them was the indexing technology. There has to be a decent indexing mechanism to which the crawler's would feed and the search algorithms hit.  The storage efficiency is probably the next thing to come in the play. How fast can they access the corresponding item ?

Another of my observation is that, the search results (the page mentioning page titles and stuff) comes real fast, mostly less than 0.25 seconds, but the click on the links does take some time.  So, I think it has to be their indexing methodology that plays the bigger role.

With that in mind, I sat about finding what can do similar things and how much of Google's behaviour they can simulate/implement.

Then I found Hadoop project on apache ( which to a large extent reflects the way Google kind of system would work. It provides distributed computing(hadoop core), it provides a bigTable kind of database (hbase), provides map/reduce layer, and more.  Reading into it more, I figured out that this system is nice for a batch processing kind for mechanism, but not for our need of real time search.

Then I found solr(, a full text search engine under Apache Lucene.  It is a java written, xml indexing based genuinely fast search engine.  It provides many features that we normally wish for in more commercial applications, an being from apache, I would like to think of it as much more reliable and stable than compared to many others.

When we sat about doing a Proof of Concept with it, I figured out a few things –

•    It supports only one schema, as in, rdbms tables – only one. So, basically you would have to denormalize all your content to fit into this one flat structure.
•    It supports interactions with the server interface only through http methods be it the standard methods get/put etc or be it REST like interfaces.
•    It allows you loading data in varying formats, through xml documents, through delimited formats and through db interactions as well.
•    It has support for clustering as well. Either you can host it on top of something like hadoop or you can just configure it to do it within solr as well.
•    It supports things like expression and function based searches
•    It supports faceting
•    Extensive caching and “partitioning” features.

Besides other features, the kind of performance without any specific tuning efforts made me think of it as a viable solution.

In a nutshell, I loaded around 50 million rows on a “old” Pentium-D powered desktop box with 3 GB RAM running ubutnu 10.04 server edition (64 bit) with two local hard disks configured over a logical volume manager.

The loading performance was not quite great. Though its not that bad either. I was able to load a few million rows (in a file that was sized about 6 GB) in about 45 minutes when the file was on the same file system.

In return, it gave me query performances in the range of 2-4 seconds for the first query. For subsequent re-runs of the same query (within a span of an hour or so), it came back in approx 1-2 milliseconds.  I would like to think that its pretty great performance given the kind of hardware I was running upon, and the kind of tuning effort I put in (basically none – zero, I just ran the default configuration).

Given that, I wont say that I have found the equivalent or replacement of Google’s search for our system, but yeah, we should be doing pretty good with this.

Although there is more testing and experimentation that is required to be able to judge solr better, the initial tests look pretty good.. pretty much in line with the experiences of others who are using it.

Sunday, September 12, 2010

Business & Open Source - How both can benefit

I had the opportunity to scout for a new technology/solution for one of our in-house projects.  Quite a few of the options that I looked for were from open source arena.  And amazingly, the products were far more capable from our expectations, just that we'd have to pitch in with some effort to get it working for us.

I have always felt that for open source projects/products to become commercially viable for a business enterprise, the enterprise has to come up and spend some resources to it to get the actual value out of it.

In other words, if an organization wants to use an open source product, which has an equivalent competitive commercial product available in market, they should be open enough to have their own in-house people who can take ownership of the installation. The organization shouldn't completely rely on the support available from the community forums and such.

I have seen more than one manager complain about the lack of support on the open source products.  Had there been proper support system for each of the open source products, we'd see a lot of stories similar to mysql's model or pentaho model.

What I would like to see perhaps is that the organizations' becoming mature enough in their adaptation of the open source products. By that, I expect them to have a open vision, have people who understand and like and own the product, and at the same time tweak and tune the product to suit the organization's business needs.

In the process, the organization should contribute to the product's development cycle.  This could happen in many ways, bug fixes, contribution of new features, the employees could contribute on community forums and such.  Using the terminology from peer to peer sharing, only leechers dont help a torrent, people need to seed to it as well. Same way, unless organizations contribute to an open source product, they would stand to become only leechers.

Only after we have a decent balance of organizations using and contributing to the open source products, we'd see the ecosystem flourishing...

Thursday, September 9, 2010

Tips for brainstorming...

Interesting read, from both positive and negative viewpoints -

1. Use brainstorming to combine and extend ideas, not just to harvest ideas.

2. Don't bother if people live in fear.

3. Do individual brainstorming before and after group sessions.

4. Brainstorming sessions are worthless unless they are woven with other work practices.

5. Brainstorming requires skill and experience both to do and, especially, to facilitate.

6. A good brainstorming session is competitive—in the right way.

7. Use brainstorming sessions for more than just generating good ideas.

8. Follow the rules, or don't call it a brainstorm.

Read more here -

in reference to:

"8. Follow the rules, or don't call it a brainstorm."
- Eight Tips for Better Brainstorming (view on Google Sidewiki)

Wednesday, September 8, 2010

Big help...

I wanted to get my table sizes in infobright, and this page came to my help...

SELECT table_schema,table_name,engine, table_rows, avg_row_length,
(data_length+index_length)/1024/1024 as total_mb,(data_length)/1024/1024 as data_mb,
(index_length)/1024/1024 as index_mb, CURDATE() AS today
FROM information_schema.tables
WHERE table_schema='mySchemaName'

Thanks Ron...
in reference to: Calculating your database size | MySQL Expert | MySQL Performance | MySQL Consulting (view on Google Sidewiki)