Friday, October 14, 2011

Data Lineage... what is that?

It is one of those buzzwords that does the rounds every once in a while. Almost every enterprise wants this kind of analysis done, and it is almost always hard to find people with the knowledge or experience to do it.

For the unaware, data lineage is basically (in very short words) the study of data from its source to its eventual target. Much like building a family tree for ourselves, we trace the ancestry of the data we are dealing with.

Starting from its source, the data travels through different subsystems, sometimes going through transformations, and thus possibly changing shape along the way...

Informatica had a very interesting blog post on this topic (back in 2007), which is fairly informative.

Tuesday, September 27, 2011

Informatica & Hadoop... solutions for the future?

Distributed computing using Hadoop has taken the IT industry by storm in the last few years.  After being all but adopted by Yahoo!, Hadoop has progressed quite fast, and is now maturing slowly but steadily.

More and more enterprise solution providers are announcing their support for the Hadoop platform, hoping to get a piece of the big data business.  So it is fair to expect that Informatica, the leader in the data integration solutions space, has also announced a tie-up with Cloudera to port the Informatica platform to Hadoop.

Though the exact details are yet to come out, the possibilities are endless.  With Hadoop (and its inherent distributed computing based on the map/reduce model), Informatica can actually think of processing big data in sustainable time frames.

For one of my customers, I deal with about 200 million rows of data per day in one job.  Besides the issues with tuning the Oracle query, the Informatica component itself consumes time on the order of hours.  With map/reduce in place, I hope to get that down to minutes, Oracle issues notwithstanding.
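The map/reduce idea itself can be sketched with nothing more than Unix pipes, a toy stand-in for what Hadoop actually distributes across a cluster: a "map" step emits a key per record, a sort shuffles identical keys together, and a "reduce" step aggregates each group.

```shell
# Toy map/reduce with pipes (illustration only, not hadoop):
# "map"     - emit one key per input record
# "shuffle" - sort brings identical keys together
# "reduce"  - uniq -c aggregates each group into a count
printf 'order\nclick\norder\norder\nclick\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

Hadoop does essentially this, but with the sort and aggregation work partitioned across many nodes, which is where the hours-to-minutes hope for those 200 million rows comes from.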

Although word about Hadoop is spreading quite fast, its adoption (from buzzword to actual usage in the enterprise) is not as fast.  To aid the cause, Informatica and Cloudera have started an interesting series of webinars, termed "Hadoop Tuesdays".  It is free to join, and they get experts to talk about various issues around Hadoop, big data and Informatica.  It has been very useful and informative so far.

Monday, July 25, 2011

Switching defaults in Ubuntu

Ubuntu allows you to have multiple alternatives installed for many programs... for example, Java.
You can have the default OpenJDK installed, and then you can also have the Sun version installed.

To see what alternatives are installed on your system, try looking in /etc/alternatives. There you'd see many programs with alternatives listed.

With these alternatives installed, you need to point your system to use one of them as the default; this is especially important after installing a newer version of a program.

In such a case, to switch the alternatives, you need to use this

sudo update-alternatives

If you do a man on update-alternatives, you'll find a plethora of options to use.

For our example, to configure the default for java, use this

sudo update-alternatives --config java
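A quick way to see the whole mechanism end to end without touching your real system: update-alternatives accepts --altdir and --admindir, so you can register alternatives in a throwaway directory. Everything below (the "tool" command and its two versions) is made up purely for the demonstration, and no sudo is needed because nothing under /etc is touched.

```shell
# Sandbox demo of update-alternatives with a hypothetical "tool" command.
tmp=$(mktemp -d)
mkdir -p "$tmp/alt" "$tmp/admin" "$tmp/bin" "$tmp/link"
printf '#!/bin/sh\necho one\n' > "$tmp/bin/tool-1"
printf '#!/bin/sh\necho two\n' > "$tmp/bin/tool-2"
chmod +x "$tmp/bin/tool-1" "$tmp/bin/tool-2"

# Register two alternatives for "tool"; in automatic mode the
# higher priority (20) wins.
update-alternatives --altdir "$tmp/alt" --admindir "$tmp/admin" \
  --install "$tmp/link/tool" tool "$tmp/bin/tool-1" 10
update-alternatives --altdir "$tmp/alt" --admindir "$tmp/admin" \
  --install "$tmp/link/tool" tool "$tmp/bin/tool-2" 20

# Inspect the state, then run whichever alternative is current.
update-alternatives --altdir "$tmp/alt" --admindir "$tmp/admin" --display tool
"$tmp/link/tool"
```

The same --config flow works here too; without the two directory options, the commands operate on the real /etc/alternatives, which is when you need sudo.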

Wednesday, April 20, 2011


I attended the .WEB day of the 2011 edition of GIDS (the Great Indian Developer Summit).

Among the many talks, two focused on HTML5: one by Scott Davis and one by Venkat Subramaniam.  Scott's talk was more on the conceptual and capability side of HTML5.  Venkat focused more on implementation and on initiating newbies into HTML5 coding.

Before these sessions, I would not have been able to say much about the capabilities of HTML5. It was more of a buzzword to me; now it is another technology holding a lot of promise.   I think that should say a lot about the two speakers: within two sessions, they were able to lift my know-how of a cutting-edge technology from buzzword to daily use.

Both talks, put together, painted a rather complete picture: the benefits, the major improvements, and the new tags that bring so much functionality to native HTML without the need for third-party libraries, plugins, etc.

Of course it is cutting edge today, since not all browsers support all of the HTML5 specification; the specification is huge in itself anyway.  As one of the speakers put it, the HTML5 spec is a combination of HTML, plus all of CSS3, plus a lot of RIA functionality currently delivered through JavaScript libraries.  One could say HTML5 is rather heavy on the browser-engine side; however, it intends to provide all these features across browsers (eventually).   Since it is a huge spec, not all browsers implement it completely and uniformly.

There will come a time when all the browsers (at least the leading ones) implement it completely (or almost completely), but until then, developers will have to live with polyfilling (polymorphically backfilling) the HTML5 functionality for non-supporting browsers.   There is a JavaScript library that is a big help in implementing this transparently.

As Scott very aptly put it, "We'd program for the faster animal in the herd, and allow the rest of the slower ones to polyfill. As and when they catch up with the fastest one, need for polyfill will automatically go away".

From what I have seen of the HTML5 spec (whatever part I have come to know), it looks very interesting and powerful.  A lot of functionality that is implemented today with the help of third-party libraries and plugins is going to be available natively.

And let me not forget to mention the single most important innovation coming through with HTML5: the semantic web. It is not really a set of tags or anything similar, rather a concept.   There are tags available in the spec which actually indicate the semantics (meaning) of the content. For example, there is a tag called <footer>. This tag won't do much on its own, but when someone is reading the code, or for that matter when a parser program is going through it, the tag name already says that it is a footer.  The tag name actually means something.  This also paves the way for future improvements on the implementation side.

Perhaps a separate post later on HTML5 possibilities for mobile applications, a huge area in itself.


  3. A unique one: a complete book on HTML5, available free of cost, completely online.  One of the finest resources for HTML5.
  5. A JavaScript library for polyfills.


Sunday, March 13, 2011

Convert Informatica session logs to text/XML format for out-of-tool readability

There was a situation recently where the Infa repository could not point us to the old session logs.
However, the file system still had those files.

But session log files are in binary format by default: unless you ask for backward-compatible session log files, what you get on the file system is binary. This is done to allow better importability of the session log files for Informatica support folks.

So, I had a few log files in binary format and needed to analyze them.
Informatica provides an infacmd subcommand to achieve that conversion.

The convertLogFile subcommand takes three parameters. Syntax is as follows -

convertLogFile <-InputFile|-in> input_file_name
                 [<-Format|-fm> format_TEXT_XML]
                 [<-OutputFile|-lo> output_file_name]

So, when you launch this, you specify the input file to be converted, the format as TEXT or XML, and the output file you want as the result of the conversion.

An example call (assuming the server is on Unix) would look like this -

infacmd convertLogFile -in /path/to/binary/format/sesslog/file -fm TEXT -lo /my/home/text/format/sesslog/file
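If you have a whole directory of these binary logs to recover, a small wrapper loop saves typing. Note the assumptions here: the install path, the log locations and the .bin extension are all hypothetical (infacmd.sh typically sits under server/bin of the Informatica install), so adjust for your environment.

```shell
# Batch-convert every binary session log in a directory to TEXT.
# Paths are hypothetical; the [ -e ] guard skips an empty directory.
INFA_HOME=${INFA_HOME:-/opt/informatica/9.0}
logdir=/path/to/binary/sesslogs
outdir=/my/home/text/sesslogs

for f in "$logdir"/*.bin; do
  [ -e "$f" ] || continue
  # s_foo.log.bin -> s_foo.log.txt in the output directory
  out="$outdir/$(basename "${f%.bin}").txt"
  "$INFA_HOME/server/bin/infacmd.sh" convertLogFile \
      -in "$f" -fm TEXT -lo "$out"
done
```

The same loop with -fm XML gives you XML output instead, per the syntax above.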

Monday, February 21, 2011

Just found out about this amazing thing...

A research initiative at Stanford University: Data Wrangler. Wonderfully helpful for analysts.

Try a demo video here -

And read more about it on

Wrangler Demo Video from Stanford Visualization Group on Vimeo.

Saturday, February 19, 2011

Exadata - is it really worth the hype?

Well, I am not going to try to answer that; I'll stay more on the question side...
Recently one of my projects moved to an Exadata machine. There was so much talk around it: the queries and DB processes supposedly need not be looked into, Exadata would take care of them already.

However, two things happened. First, there was a technical talk on the device's configuration. The device turned out to be a mammoth piece of hardware. In a nutshell, it is an 8-node cluster, each node having 4 CPUs and each CPU with around 2-4 GB of RAM. Then there is high-speed secondary storage which can hold a lot of cache.

The nodes are interconnected using a special switch that transfers data faster than Gigabit networks.

With such a hardware configuration, any software could claim the kind of performance gains Exadata claims. Not to undermine those gains; I just want to say that the hype around the out-of-this-world performance is really the result of better hardware, not revolutionary software.

I personally was expecting something of this type from Oracle, since they lacked in this area. Except for Teradata, there is almost no player who delivers that kind of data warehouse architecture and performance, and I was hoping Oracle would bring out something there.

And the second thing: one of our processes tried to load data into an Exadata instance using Informatica. Initially we left things at their defaults, so that Exadata could tune everything itself and we would not force anything.  However, there too Exadata failed big time and couldn't deliver any performance gain. In the end, all the tuning had to be done by us.

So the other claim of Exadata, the intelligence to pick up processes and fix them on its own, also fell flat for us.

Though I agree that it is rather early in its evolution, I believe Oracle marketing should be doing a better job. :)