my technology playground
my experiences with technology space
Saturday 15 October 2011
Data Lineage.. what is that ?
Wednesday 28 September 2011
Informatica & hadoop... solutions for future ?
More and more enterprise solution providers are announcing support for the hadoop platform, hoping to get a piece of the big data pie. So it is fair to expect that Informatica, the leader in the data integration space, has also announced a tie-up with Cloudera to port the Informatica platform to hadoop.
Though the exact details are yet to come out, the possibilities are endless. With hadoop (and its inherent distributed computing based on map/reduce), Informatica can actually think of processing big data in reasonable time frames.
For one of my customers, I deal with about 200 million rows of data per day in a single job. Besides the issues with tuning the query in Oracle, the Informatica component itself takes hours to run. With map/reduce in place, I hope to get that down to minutes, Oracle issues notwithstanding.
Although word about hadoop is spreading quite fast, its adoption (from buzzword to actual usage in the enterprise) is not as fast. To aid their cause, Informatica and Cloudera have started an interesting series of webinars called "hadoop tuesdays". It's free to join, and they get experts to talk about various issues around hadoop, big data and Informatica. It has been very useful and informative so far.
Monday 25 July 2011
Switching defaults in Ubuntu
You can have more than one version of the same software installed side by side. For example, you can have the default OpenJDK installed and also the Sun version of Java.
To see what alternatives are installed for your software, look in /etc/alternatives. Here you'd see many pieces of software with alternatives listed.
With these installed, you need to point your system at one of them as the default. This is especially important after installing a newer version of the software.
To switch between the alternatives, use this
sudo update-alternatives
If you do a man on update-alternatives, you will find a plethora of options.
For our example, to configure the default for java, use this
sudo update-alternatives --config java
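Before switching, it can help to see what is actually registered. The sketch below wraps two read-only options, --list and --display, in a small helper; the function name show_alternatives is made up for this example, and the path in the commented --set line is hypothetical (use one that --list prints on your own machine).

```shell
# Print the registered candidates for an alternatives group (default: java).
# show_alternatives is a hypothetical helper name, not part of the tool.
show_alternatives() {
    local name="${1:-java}"
    if ! command -v update-alternatives >/dev/null 2>&1; then
        echo "update-alternatives not found (not a Debian/Ubuntu system?)" >&2
        return 1
    fi
    # --list prints one candidate path per line.
    update-alternatives --list "$name"
    # --display adds priorities and shows where the link currently points.
    update-alternatives --display "$name"
}

# Non-interactive switch: --set takes the group name and a full path.
# The path below is only an illustration.
# sudo update-alternatives --set java /usr/lib/jvm/java-6-sun/jre/bin/java
```

Calling show_alternatives java needs no sudo, since it only reads the current configuration; only --config and --set change anything.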
Thursday 12 May 2011
writing for cloudtimes.org these days
Wednesday 20 April 2011
HTML 5
Among many talks, there were two focusing on HTML5: one by Scott Davis (of http://www.thirstyhead.com/) and one by Venkat Subramaniam (of http://www.agiledeveloper.com). Scott's talk was more on the conceptual and capability side of HTML5. Venkat focused more on the implementation and on initiating newbies into HTML5 coding.
Before these discussions, I would not have been able to say much about the capabilities of HTML5. It was more of a buzzword then; now it's another technology holding a lot of promise. I think it says a lot for the two speakers that, within two sessions, they were able to lift the know-how around a cutting-edge technology from buzzword to daily use.
Put together, the two talks provided a rather complete picture: they enumerated the benefits, the major improvements, and the new tags that bring so much functionality to native HTML without the need for third-party libraries, plugins etc.
Of course it's cutting edge today, since not all browsers support all of the HTML5 specification. The specification is huge in itself anyway. As one of the speakers put it, the HTML5 spec is a combination of HTML plus all of CSS 3 plus a lot of RIA functionality that today comes from JavaScript libraries. One can say that HTML5 is rather heavy on the browser engine side; however, it intends to provide all the features across browsers (eventually). Since it's a huge spec, not all browsers implement it completely and uniformly.
There will be a time when all the browsers (at least the leading ones) implement it completely (or almost all of it), but till then developers will have to polyfill (backfill) the HTML5 functionality for non-supporting browsers. The Modernizr JavaScript library (www.modernizr.com) is a big help here: it detects which features a browser supports, so that polyfills are applied only where needed.
As Scott very aptly put it, "We'd program for the faster animal in the herd, and allow the rest of the slower ones to polyfill. As and when they catch up with the fastest one, need for polyfill will automatically go away".
From what I see in the HTML5 spec (whatever part of it I have come to know), it looks very interesting and powerful. A lot of functionality that is implemented today with the help of third-party libraries/plugins is going to be implemented natively.
And let me not forget to mention the single most important innovation that is coming through with HTML5: the semantic web. It's not really a set of tags or something similar, rather a concept. There are tags available in the spec which indicate the semantics (meaning) of the content. For example, there is a tag called
Sunday 13 March 2011
convert Informatica Session Logs to text/xml format for out of tool readability
However, the file system still had those files.
But the session log files are in binary format by default. If you didn't ask for backward compatible session log files, you'd get a binary session log file on the file system. This is done to make it easier for the Informatica support guys to import the session log files.
So, I had a few log files in binary format, and needed to analyze them.
Informatica provides an infacmd subcommand to do that conversion.
The convertLogFile subcommand takes three parameters. Syntax is as follows -
So, when you launch this, you specify the input file to be converted, the format (TEXT or XML) and the output file that you want as the result of the conversion.
An example call would look like this - (expecting server on unix)
infacmd.sh convertLogFile -in /path/to/binary/format/sesslog/file -fm TEXT -lo /my/home/text/format/sesslog/file
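If there are several binary logs to go through, the same call can be wrapped in a loop. This is only a sketch: the function name convert_session_logs, the INFACMD override variable and the assumption that the binary logs end in .bin are all mine, so adjust the glob to whatever your session logs are actually called. The -in, -fm and -lo options are the ones from the example call above.

```shell
# Convert every binary session log in a directory to text.
# Usage: convert_session_logs <log_dir> <out_dir>
# INFACMD (hypothetical variable) can point at infacmd.sh if it is not on PATH.
convert_session_logs() {
    local log_dir="$1" out_dir="$2" f
    mkdir -p "$out_dir" || return 1
    # Assumption: binary logs end in .bin; change the pattern if yours differ.
    for f in "$log_dir"/*.bin; do
        [ -e "$f" ] || continue   # the glob matched nothing
        "${INFACMD:-infacmd.sh}" convertLogFile \
            -in "$f" -fm TEXT -lo "$out_dir/$(basename "$f" .bin).txt"
    done
}

# Example (paths are placeholders):
# convert_session_logs /path/to/SessLogs /my/home/sesslogs_text
```

Each converted file keeps the original name with .bin swapped for .txt, so it is easy to match a text log back to its binary source.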
Tuesday 22 February 2011
Just found out about this amazing thing...
Try a demo video here -
And read more about it on http://vis.stanford.edu/wrangler
Wrangler Demo Video from Stanford Visualization Group on Vimeo.
Sunday 20 February 2011
Exadata - is it really worth the hype
Recently one of my projects moved to an Exadata machine. There was so much talk around it: the queries and db processes need not be looked into, Exadata will take care of them on its own.
However, two things happened. First, there was a technical talk on the device's configuration. The device turned out to be a mammoth piece of hardware. In a nutshell, it's an 8 node cluster, each node having 4 CPUs. Each CPU has around 2-4 GB of RAM. Then there is high speed secondary storage which can hold a lot of cache.
The nodes are interconnected using a special switch which can transfer data faster than Gigabit networks.
With such a hardware configuration, any software could claim the kind of performance gain they claim. Not to undermine the performance gains, I just want to say that the hype around the out-of-this-world performance gains is actually the result of better hardware, not really of revolutionary software.
I personally was expecting something revolutionary on the software side from Oracle, since they lack in that area. Except Teradata, there is almost no player who delivers that kind of data warehouse architecture and performance, and I was hoping that Oracle would bring out something there.
And the second thing: one of the processes tried to load data into an Exadata instance using Informatica. Initially we left everything at the defaults, so that Exadata could tune things itself and we would not force anything. However, there too Exadata failed big time and couldn't deliver any performance gain. In the end, all the tuning had to be done by us.
So, the other claim, that Exadata has the intelligence to pick up processes and fix them on its own, also went down for us.
Though I agree that it's still early in its evolution, I believe Oracle marketing should be doing a better job. :)
Thursday 2 December 2010
Rejecting records in Fact table loads - Informatica
Thursday 18 November 2010
Setting a useful command prompt in Unix
* \[ : begin a sequence of non-printing characters, which could be used to embed a terminal control sequence into the prompt
* \\ : a backslash
* \] : end a sequence of non-printing characters
* \a : an ASCII bell character (07)
* \@ : the current time in 12-hour am/pm format
* \d : the date in "Weekday Month Date" format (e.g., "Tue May 26")
* \D{format} : the format is passed to strftime(3) and the result is inserted into the prompt string; an empty format results in a locale-specific time representation. The braces are required
* \e : an ASCII escape character (033)
* \H : the hostname
* \h : the hostname up to the first '.'
* \j : the number of jobs currently managed by the shell
* \l : the basename of the shell’s terminal device name
* \n : newline
* \nnn : the character corresponding to the octal number nnn
* \r : carriage return
* \T : the current time in 12-hour HH:MM:SS format
* \t : the current time in 24-hour HH:MM:SS format
* \u : the username of the current user
* \s : the name of the shell, the basename of $0 (the portion following the final slash)
* \v : the version of bash (e.g., 2.00)
* \W : the basename of the current working directory
* \w : the current working directory
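As a small illustration of how these escapes combine, here is a sketch of two prompt definitions that could go into ~/.bashrc. The layout is just one possibility. Note that \$ is not in the list above: it is another standard bash escape that prints # for root and $ for everyone else.

```shell
# Plain prompt: [24-hour time] user@short-host:working-dir, then $ (or # for root).
PS1='[\t] \u@\h:\w\$ '

# Same prompt with user@host in green. The colour codes sit inside \[ ... \]
# so that bash does not count them towards the visible width of the prompt.
PS1='[\t] \[\e[32m\]\u@\h\[\e[0m\]:\w\$ '
```

The single quotes matter: they keep the backslashes intact so that bash expands the escapes every time it draws the prompt, rather than once when the line is read.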
Tuesday 16 November 2010
scripts and hash bang ( #! )
Monday 1 November 2010
A Note to New Consultants
- The consultant can function as a specialist or expert. In this role he must be more knowledgeable than the client. This implies a very narrow field of specialization, otherwise the client with his greater continuity of experience would be equally expert.
The consultant can function as a counselor or advisor on the process of decision making. This implies an expertise of a special kind, that of the psychotherapist. This is merely a particular kind of expertise in a particular field.
The most typical role for a consultant is that of auxiliary staff. This does not preclude any of the other roles mentioned before, but it does require a quite different emphasis.
All companies have staff capabilities of their own. Some of this staff is very good. Yet no company can afford to have standby staff adequate for any and all problems. This is why there is an opportunity for consultants. They fill the staff role that cannot be filled internally.
By definition this means that consultants are most useful on the unusual, the non-recurring, the unfamiliar problem. Outside consultants are also most useful where the problem is poorly defined and politically sensitive, but the correct decision is extremely important. Outside consultants get the tough, the important and the sensitive problems.
The natural function of a consultant is to reduce anxiety and uncertainty. Those are the conditions under which anxiety and uncertainty are greatest and where consultants are most likely to be hired.
Problem Definition - If this point of view is our starting point, then problem definition becomes extremely important.
- If the problem is incorrectly defined, then even its complete solution may not satisfy the client's perceived needs.
- If the problem is improperly defined, it may be beyond our ability to solve.
- Problem definition is a major test of professional ability. Outside consultants can frequently define problems in a more satisfactory fashion than internal staff, primarily because they are unencumbered with the historical perspective of the client and the resulting "house" definition.
A consultant's problem definition is the end of the assignment if the problem is not researchable. If the problem is not researchable, then the consultant is either a specialist-expert or a psychotherapist. Neither of these roles are suitable for the use of the resources of an organization such as The Boston Consulting Group.
A researchable problem is usually a problem that should be dealt with by a group approach. Data gathering and analysis requires differing skills and different levels of experience that can best be provided by a group. The insights into complex problems are usually best developed by verbal discussion and testing of alternate hypotheses.
Good research is far more than the application of intellect and common sense. It must start with a set of hypotheses to be explored. Otherwise, the mass of available data is chaotic and cannot be referenced to anything. Such starting hypotheses are often rejected and new ones substituted. This, however, does not change the process sequence of hypothesize / data gathering / analysis / validation / rehypothesize.
Great skill in interviewing and listening is required to do this. Our client starts his own analysis from some hypothesis or concept. We must understand this thoroughly and be able to play it back to him in detail or he does not feel that we understand the situation. Furthermore, we must be sure that we do not exclude any relevant data that may be volunteered. Yet we must formulate our own hypothesis.
Finally, we must be able to take our client through the steps required for him to translate his own perspective into the perspective we achieve as a result of our research. This requires a high order of personal empathy as well as developed teaching skills.
- The end result of a successful consulting assignment is not a single product. It is a new insight on the part of the client. It is also a commitment to take the required action to implement the new insights. Equally important, it is an acute awareness of the new problems and opportunities that are revealed by the new insights.
We fail if we do not get the client to act on his new insights. The client must implement the insights or we failed. It is our professional responsibility to see that there is implementation whether we do it or the client does it.
Much of the performance of a consultant depends upon the development of concepts that extend beyond the client's perception of the world. This is not expertise and specialization. It is the exact opposite. It is an appreciation of how a wide variety of interacting factors are related. This appreciation must be more than an awareness. It must be an ability to quantify the interaction sufficiently to predict the consequences of altering the relationships.
Consultants have a unique opportunity to develop concepts since they are exposed to a wide range of situations in which they deal with relationships instead of techniques. This mastery of concepts is probably the most essential characteristic for true professional excellence.
A successful consultant is first of all a perceptive and sensitive analyst. He must be in order to define a complex problem in the client's terms with inadequate data. This requires highly developed interpersonal intuitions even before the analysis begins.
His analytical thinking must be rigorous and logical, or he will commit himself to the undoable or the unuseful assignment. Whatever his other strengths, he must be the effective and respected organizer of group activities which are both complex and difficult to coordinate. Failure in this is to fall into the restricted role of the specialist.
[raghav] This is the first time I have read that a specialist role can be restrictive and, honestly, when you think about it again, it does come across as a correct statement, especially in the wider world of other opportunities, and especially for a management consultant.
In defining the problem, the effective consultant must have the courage and the initiative to state his convictions and press the client for acceptance and resolution of the problem as defined. The client expects the consultant to have the strength of his convictions if he is to be dependent upon him. Consultants who are unskilled at this are often liked and respected but employed only as counselors, not as true management consultants.
The successful professional inevitably must be both self-disciplined and rigorous in his data gathering as well as highly cooperative as a member of a case team.
The continuing client relationship requires a sustained and highly developed empathy with the client representative. Inability to do this is disqualifying for the more significant roles in management consulting.
- Identifies his client's significant problems;
- Persuades his client to act on the problems by researching them;
- Organizes a diversified task force of his own firm and coordinates its activity;
- Fully utilizes the insights and staff work available in his client's organization;
- Uses the full conceptual power of his own project team;
- Successfully transmits his findings to the client and sees that they are implemented;
- Identifies the succeeding problems and maintains the client relationship;
- Fully satisfies the client expectations that he raised;
- Does all these things within a framework of the time and cost constraints imposed by himself or the client.
Friday 1 October 2010
my experiments with solr :)
I came across hadoop, when I was looking for a new solution for one of our in-house projects. The need was quite clear, however, the solution had to be dramatically different.
The one statement we received from business was, "We need an exceptionally fast search interface". And for that fast interface to search upon they had more than a hundred million rows worth of data in a popular RDBMS.
So, when I set about thinking how to make a fast search application, the first thing that came to my mind was Google. Actually, whenever we talk about the speed or performance of web sites, Google is invariably the first name that comes up.
Further, Google has the plus point that there is always some activity at the back end to generate the page or results that we see; it's never static content. Another point: Google has a few trillion pieces of information to store/index/search, whereas our system was going to have a significantly lower volume of data to manage. So, going with that, Google looked like a very good benchmark for this fast search application.
Then I started to look for "How Google generates that kind of performance". There are quite a few pages on the web talking about just that. But, probably none of them has the definitive/authoritative view on Google's technology or for that matter the insider's view on how it actually does what it does so fast.
Some pages pointed towards their storage technology, some talked about their indexing technology, some about their access to huge volumes of high performance hardware and what not...
For me, some of them turned out to be genuinely interesting, and one of them was the indexing technology. There has to be a decent indexing mechanism which the crawlers feed and the search algorithms hit. Storage efficiency is probably the next thing to come into play: how fast can they access the corresponding item?
Another observation of mine is that the search results (the page listing page titles and so on) come back really fast, mostly in less than 0.25 seconds, but a click on one of the links does take some time. So, I think it has to be their indexing methodology that plays the bigger role.
With that in mind, I set about finding what can do similar things and how much of Google's behaviour it can simulate/implement.
Then I found the Hadoop project on Apache (http://hadoop.apache.org/), which to a large extent reflects the way a Google kind of system would work. It provides distributed computing (hadoop core), a BigTable kind of database (hbase), a map/reduce layer, and more. Reading into it more, I figured out that this system is nice for a batch processing kind of mechanism, but not for our need of real time search.
Then I found solr (http://lucene.apache.org/solr/), a full text search engine under Apache Lucene. It is a genuinely fast search engine, written in Java and fed with xml documents for indexing. It provides many features that we normally wish for in more commercial applications, and being from Apache, I would like to think of it as much more reliable and stable than many others.
When we set about doing a Proof of Concept with it, I figured out a few things:
• It supports only one schema; in rdbms terms, only one table. So, basically you have to denormalize all your content to fit into this one flat structure.
• It supports interaction with the server only over http, be it the standard methods (get/put etc) or REST-like interfaces.
• It allows loading data in varying formats: xml documents, delimited files and directly from a database.
• It has support for clustering as well. You can either host it on top of something like hadoop or configure it within solr itself.
• It supports things like expression and function based searches
• It supports faceting
• Extensive caching and “partitioning” features.
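To make the http point concrete, here is a minimal sketch of what a search with faceting could look like from the command line. The host, the field names (city, country) and the helper name solr_query_url are assumptions for illustration only; 8983 is the port used by the example server that ships with solr, and q, rows, wt, facet and facet.field are standard solr query parameters.

```shell
# Base URL of the solr instance. Assumed: the example server on localhost.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the URL for a select request.
# $1 = query in Lucene syntax (must already be URL-safe), $2 = field to facet on.
# solr_query_url is a hypothetical helper, not something solr provides.
solr_query_url() {
    printf '%s/select?q=%s&rows=10&wt=json&facet=true&facet.field=%s\n' \
        "$SOLR_URL" "$1" "$2"
}

# First 10 matches for city:london as JSON, plus match counts per country.
# The field names are made up; use the ones from your own schema.
# curl -s "$(solr_query_url 'city:london' 'country')"
```

Since everything is a plain http request, the same URL can be pasted into a browser while experimenting, which is handy during a proof of concept.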
Besides the other features, it was the kind of performance I got without any specific tuning effort that made me think of it as a viable solution.
In a nutshell, I loaded around 50 million rows on an "old" Pentium-D powered desktop box with 3 GB RAM running Ubuntu 10.04 server edition (64 bit), with two local hard disks configured over a logical volume manager.
The loading performance was not great, though it's not that bad either. I was able to load a few million rows (in a file of about 6 GB) in about 45 minutes when the file was on the same file system.
In return, it gave me query times in the range of 2-4 seconds for the first query. For subsequent re-runs of the same query (within a span of an hour or so), it came back in approx 1-2 milliseconds. I would like to think that's pretty great performance given the kind of hardware I was running on and the amount of tuning effort I put in (basically zero, I just ran the default configuration).
Given that, I won't say that I have found the equivalent of or a replacement for Google's search for our system, but yeah, we should be doing pretty well with this.
Although more testing and experimentation is required to be able to judge solr better, the initial tests look pretty good, pretty much in line with the experiences of others who are using it.
Monday 13 September 2010
Business & Open Source - How both can benefit
I have always felt that for open source projects/products to become commercially viable for a business enterprise, the enterprise has to step up and spend some resources on them to get the actual value out.
In other words, if an organization wants to use an open source product which has an equivalent, competing commercial product available in the market, it should be open enough to have its own in-house people who can take ownership of the installation. The organization shouldn't rely completely on the support available from community forums and such.
I have seen more than one manager complain about the lack of support for open source products. Had there been a proper support system for each open source product, we'd see a lot more stories similar to the mysql model or the pentaho model.
What I would like to see, perhaps, is organizations becoming mature enough in their adoption of open source products. By that, I expect them to have an open vision, to have people who understand, like and own the product, and at the same time to tweak and tune the product to suit the organization's business needs.
In the process, the organization should contribute to the product's development cycle. This could happen in many ways: bug fixes, new features, employees contributing on community forums and such. To use the terminology of peer to peer sharing, leechers alone don't help a torrent; people need to seed it as well. In the same way, organizations that don't contribute to an open source product are only leechers.
Only after we have a decent balance of organizations using and contributing to open source products will we see the ecosystem flourishing...
Thursday 9 September 2010
Tips for brainstorming...
Interesting read, from both positive and negative viewpoints -
1. Use brainstorming to combine and extend ideas, not just to harvest ideas.
2. Don't bother if people live in fear.
3. Do individual brainstorming before and after group sessions.
4. Brainstorming sessions are worthless unless they are woven with other work practices.
5. Brainstorming requires skill and experience both to do and, especially, to facilitate.
6. A good brainstorming session is competitive—in the right way.
7. Use brainstorming sessions for more than just generating good ideas.
8. Follow the rules, or don't call it a brainstorm.
Read more here - http://www.businessweek.com/
- Eight Tips for Better Brainstorming
Wednesday 8 September 2010
Big help...
SELECT table_schema, table_name,
(data_length+index_length)/1024/1024 AS total_mb,
(index_length)/1024/1024 AS index_mb, CURDATE() AS today
FROM information_schema.tables
WHERE table_schema='mySchemaName'
ORDER BY total_mb DESC
Thanks Ron...
in reference to: Calculating your database size | MySQL Expert | MySQL Performance | MySQL Consulting
Wednesday 25 August 2010
I also feel like saying, 1984...
1. an iphone that has been hacked to work outside the contract with which it was sold, read "jailbroken"
2. an iphone that is perhaps being used by someone other than the person who registered the first heartbeat or facial recognition info..
Apple intends to capture the phone's location using GPS/other tech and perhaps control the device remotely if they feel it's being used "unauthorized"..
I agree with the people who are reminded of 1984 on reading Apple's intentions... ha.. time does come back... George Orwell.. were you too right ??
in reference to: Apple to make iPhone theft-proof - Hardware - Infotech - The Economic Times
Monday 2 August 2010
Country General Mood using Tweets
http://www.iq.harvard.edu/
I quote - (with all credit where it's due, none to me...)
A group of researchers from Northeastern and Harvard universities have gathered enough data from Twitter to give us all a snapshot of how U.S. residents feel throughout a typical day or week.
Not only did they analyze the sentiments we collectively expressed in 300 million tweets over three years against a scholarly word list, these researchers also mashed up that data with information from the U.S. Census Bureau, the Google Maps API and more. What they ended up with was a fascinating visualization showing the pulse of our nation, our very moods as they fluctuate over time.
The researchers have put this information into density-preserving cartograms, maps that take the volume of tweets into account when representing land area. In other words, in areas where there are more tweets, those spots on the map will appear larger than they do in real life.
An apparently public domain result of the analysis is available here -
http://cdn.mashable.com/wp-
Wednesday 28 July 2010
Oracle Count(1) vs Count(*)
Well, it might have been an everlasting discussion about which one of these to use, count(1) or count(*).
I guess this article by Thomas Kyte clarified the situation long, long ago (well, for the IT industry 2005 is long ago anyway, especially given the speed at which we are moving).
Essentially, what askTom says is that count(*) is better than count(1), since count(1) translates to count(*) internally anyway. I wonder, then, why someone would want to use count(1) at all.
With count(1) there is at least one more step involved in getting to the actual result. And there is another possible cost: count(1) has to evaluate an expression as well, "count(1) where 1 is not null". Though it's a tautology, it has to be evaluated nonetheless.
Further, there was some misconception about how the result is returned, whether it's read from the data dictionary, some view or table or something like that. I don't think so. The result is calculated at run time, when the query is run, and it actually goes ahead and counts the records in the table.
Should set the record straight...
Saturday 24 July 2010
Developing a Rails application using an existing database
Initially we needed the basic CRUD screens for some tables. Being lazy (I'm really proud of that), I set out to find whether there is a solution that generates the forms (read views) for existing tables/models.
I had already managed to generate the models/schema.rb using another gem, called magic_model. Read more about that here
Then google helped me find another gem, called scaffold_form_generator, which generates the necessary views/forms for a given model. However, it needs some improvements (I think). Perhaps I will contribute something (if I find out enough about how to do that).
Well, for the moment, I am struggling with the special-meaning columns that are missing from the legacy tables. Will continue writing on this...