Wednesday, December 1, 2010

Rejecting records in Fact table loads - Informatica

In some development environments, you don't have all the required dimension data, and as a result your fact-loading mapping's test runs go for a toss. The mapping won't be able to load anything (it rejects everything), since for each row some foreign key or other would be missing.

In other words, only those records get loaded for which ALL the foreign key constraints are satisfied. In a Production environment, though, this would almost never happen; and even if it did, we'd actually want those rows to be rejected.

This can be a serious impediment to development/unit testing. It prevents the developer from seeing whether or not his mapping behaves appropriately for the happy-flow functionality.

One way of achieving this is to work with a Mapping Variable indicating the environment. The developer can run with a value like 'D', indicating an environment other than Production, with 'P' (Production) being the default value.

Now, in the mapping, just about where you decide to reject a record based on the different types of conditions, you could put an AND condition involving this mapping variable, e.g. ....AND $$MAPP_ENV = 'P'

The expression now returns true only in the Production environment, and therefore works as expected there. In Dev, though, the expression returns false and that row is not rejected.

Now, to be able to satisfy the DB constraints on the fact table so that the row is actually inserted, you'd need some placeholder convention. One approach is to use an outlier value as the foreign key value. For example, for customer ID, keep a -1 row in the dimension table, meaning "Undefined", and in all such dev cases send -1 to the fact.

This serves both purposes: it tells your fact table that there is something different about that row, and it still inserts the row, so that testing of the rest of the columns is not blocked just because one foreign key is missing.

Wednesday, November 17, 2010

Setting a useful command prompt in Unix

I just came across a Unix system at my workplace that had a static prompt set. Basically, the prompt was just the shell executable's name and version, something like

bash-3.2 $

Well, this kind of prompt has many drawbacks; I'll list some of them here -
1. You never know, at a glance, where you are in the file system. When you are dealing with multi-directory situations, you'd rather not keep typing pwd to figure out your current location.

2. You never know which user you are logged in as (again, just by looking at the prompt). You'd have to run whoami to figure that out.

3. More importantly, if you are dealing with multiple systems, this one is the real killer: you never know which system you are logged in to right now. You'd have to issue a hostname command.

There are surely many other consequences of having such a cryptic command prompt. My favourite, therefore, is a command prompt that always displays at least these three things, dynamically...

Something like, 

raghav@deskubuntu:/homes/raghav/rails $ 

wherein I am always aware of the three things mentioned earlier. This is very useful when you are dealing with multiple systems and you have multiple users configured to run different types of processes: for example, an oracle user who owns the Oracle processes, an informatica user who owns everything linked to Informatica, and a Connect:Direct user who owns the CD processes that receive files coming in from other systems.

With this kind of setup, and your own user id to log in with, you'd better be careful about which processes you are looking at / launching and as which user. It is really important.

And if you are dealing with a multiple-system scenario, like dev / test / acceptance / production, you'd be well advised to use a prompt like this.

The magic lies in setting appropriate escape sequences and text in an environment variable called PS1.

Just set PS1 in your .profile or .bashrc file (depending on your environment) and you are set.

The example prompt that I mentioned can be achieved by saying - 

export PS1='\u@\h:\w \$ '

There are many more escape sequences with special meanings for the PS1 variable. Here are some of them -

    * \$ : if the effective UID is 0, a #, otherwise a $
    * \[ : begin a sequence of non-printing characters, which could be used to embed a terminal control sequence into the prompt
    * \\ : a backslash
    * \] : end a sequence of non-printing characters
    * \a : an ASCII bell character (07)
    * \@ : the current time in 12-hour am/pm format
    * \A : the current time in 24-hour HH:MM format
    * \d : the date in "Weekday Month Date" format (e.g., "Tue May 26")
    * \D{format} : the format is passed to strftime(3) and the result is inserted into the prompt string; an empty format results in a locale-specific time representation. The braces are required
    * \e : an ASCII escape character (033)
    * \H : the hostname
    * \h : the hostname up to the first '.'
    * \j : the number of jobs currently managed by the shell
    * \l : the basename of the shell’s terminal device name
    * \n : newline
    * \nnn : the character corresponding to the octal number nnn
    * \r : carriage return
    * \T : the current time in 12-hour HH:MM:SS format
    * \t : the current time in 24-hour HH:MM:SS format
    * \u : the username of the current user
    * \s : the name of the shell, the basename of $0 (the portion following the final slash)
    * \V : the release of bash, version + patch level (e.g., 2.00.0)
    * \v : the version of bash (e.g., 2.00)
    * \W : the basename of the current working directory
    * \w : the current working directory
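Combining a few of these, a slightly richer prompt that also shows the current time could look like the sketch below (the escapes are from the list above; try it in a running bash session first, and persist it only once you are happy with it):

# try it out in the current shell first
PS1='[\A] \u@\h:\w \$ '      # e.g.  [14:05] raghav@deskubuntu:/homes/raghav/rails $

# once satisfied, add the same line (with export) to ~/.bashrc or ~/.profile
export PS1='[\A] \u@\h:\w \$ '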

Monday, November 15, 2010

scripts and hash bang ( #! )

More often than not, people have to tell the system where the interpreter for their Unix shell / Perl scripts or other programs lies, i.e. they write their command-line calls as

perl SomeScript.pl

or 

ruby ARubyProgram.rb

or 

sh SomeShellScript.sh

This is because the system may not be aware of the location of the executable interpreter of the exact type needed for the corresponding script. For this purpose, Windows has the file-extension association concept, but we are dealing with Unix-like systems, not Windows, so that option is not really available to us (besides, that convention has its own ill effects, but let's not go into that discussion).

So, to tell a Unix system where to find a script's interpreter, besides naming it on the command line along with the script, there is another way, and a rather elegant one at that.

Just put the exact path of your interpreter executable on the very first line of your script, preceded by two magic characters, a hash and an exclamation mark (#!), also called hash-bang or shebang. Now, once your script is marked as executable (see chmod), you are good to go; there is no need to invoke the interpreter explicitly to run your code.

Basically, your code should now look like this - 

#!/usr/local/bin/perl5
print "testing hashbang with raghav"

Save this short script as aa.pl (assuming that your system has a Perl 5 interpreter installed at the location I used). Make the script executable (chmod) and you can launch it directly, like ./aa.pl, instead of the earlier example perl aa.pl.
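For instance, the whole round trip looks roughly like this (the interpreter path to put after #! is whatever your own system reports):

which perl          # locate the interpreter; use this path on the #! line
chmod +x aa.pl      # mark the script executable
./aa.pl             # runs via the shebang, no explicit 'perl' call needed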

A word of caution though: these magic characters have to be the absolute first and second characters of the file, no exceptions. Otherwise the system cannot recognise their special meaning and the purpose is lost.

Pretty neat.. hunh...

Sunday, October 31, 2010

A Note to New Consultants

I am about to start a new role, that of a consultant at a new customer site. In order to prepare for that mentally, I was looking around for inspiration and advice. In the process, I stumbled across this gem from the founder of the Boston Consulting Group. I have picked up the text from this webpage (http://www.careers-in-business.com/consulting/hendnote.htm) and then tried to see it through my own eyes. All credit to the original webpage owners.

Written by Bruce Henderson, Founder of Boston Consulting Group in the 1970s

In a sense the consultant's role is a paradox. He gives advice to people of equal intelligence who have vastly superior and extensive experience and knowledge of the problem. Yet he is not necessarily an expert in anything. What is the justification for his value?

       Need of Consultant
  1. The consultant can function as a specialist or expert. In this role he must be more knowledgeable than the client. This implies a very narrow field of specialization, otherwise the client with his greater continuity of experience would be equally expert.

    The consultant can function as a counselor or advisor on the process of decision making. This implies an expertise of a special kind, that of the psychotherapist. This is merely a particular kind of expertise in a particular field.

    The most typical role for a consultant is that of auxiliary staff. This does not preclude any of the other roles mentioned before, but it does require a quite different emphasis.

    All companies have staff capabilities of their own. Some of this staff is very good. Yet no company can afford to have standby staff adequate for any and all problems. This is why there is an opportunity for consultants. They fill the staff role that cannot be filled internally.

    By definition this means that consultants are most useful on the unusual, the non-recurring, the unfamiliar problem. Outside consultants are also most useful where the problem is poorly defined and politically sensitive, but the correct decision is extremely important. Outside consultants get the tough, the important and the sensitive problems.

    The natural function of a consultant is to reduce anxiety and uncertainty. Those are the conditions under which anxiety and uncertainty are greatest and where consultants are most likely to be hired.

    Problem Definition
  2. If this point of view is our starting point, then problem definition becomes extremely important.
    • If the problem is incorrectly defined, then even its complete solution may not satisfy the client's perceived needs.
    • If the problem is improperly defined, it may be beyond our ability to solve.
  3. Problem definition is a major test of professional ability. Outside consultants can frequently define problems in a more satisfactory fashion than internal staff, primarily because they are unencumbered with the historical perspective of the client and the resulting "house" definition.

    A consultant's problem definition is the end of the assignment if the problem is not researchable. If the problem is not researchable, then the consultant is either a specialist-expert or a psychotherapist. Neither of these roles are suitable for the use of the resources of an organization such as The Boston Consulting Group.

    A researchable problem is usually a problem that should be dealt with by a group approach. Data gathering and analysis requires differing skills and different levels of experience that can best be provided by a group. The insights into complex problems are usually best developed by verbal discussion and testing of alternate hypotheses.

    Good research is far more than the application of intellect and common sense. It must start with a set of hypotheses to be explored. Otherwise, the mass of available data is chaotic and cannot be referenced to anything. Such starting hypotheses are often rejected and new ones substituted. This, however, does not change the process sequence of hypothesize / data gathering / analysis / validation / rehypothesize.

    Great skill in interviewing and listening is required to do this. Our client starts his own analysis from some hypothesis or concept. We must understand this thoroughly and be able to play it back to him in detail or he does not feel that we understand the situation. Furthermore, we must be sure that we do not exclude any relevant data that may be volunteered. Yet we must formulate our own hypothesis.

    Finally, we must be able to take our client through the steps required for him to translate his own perspective into the perspective we achieve as a result of our research. This requires a high order of personal empathy as well as developed teaching skills.

  4. The end result of a successful consulting assignment is not a single product. It is a new insight on the part of the client. It is also a commitment to take the required action to implement the new insights. Equally important, it is an acute awareness of the new problems and opportunities that are revealed by the new insights.

    We fail if we do not get the client to act on his new insights. The client must implement the insights or we failed. It is our professional responsibility to see that there is implementation whether we do it or the client does it.

    Much of the performance of a consultant depends upon the development of concepts that extend beyond the client's perception of the world. This is not expertise and specialization. It is the exact opposite. It is an appreciation of how a wide variety of interacting factors are related. This appreciation must be more than an awareness. It must be an ability to quantify the interaction sufficiently to predict the consequences of altering the relationships.

    Consultants have a unique opportunity to develop concepts since they are exposed to a wide range of situations in which they deal with relationships instead of techniques. This mastery of concepts is probably the most essential characteristic for true professional excellence.

    A successful consultant is first of all a perceptive and sensitive analyst. He must be in order to define a complex problem in the client's terms with inadequate data. This requires highly developed interpersonal intuitions even before the analysis begins.

    His analytical thinking must be rigorous and logical, or he will commit himself to the undoable or the unuseful assignment. Whatever his other strengths, he must be the effective and respected organizer of group activities which are both complex and difficult to coordinate. Failure in this is to fall into the restricted role of the specialist.

    [raghav] This is the first time I have read that a specialist role can be restrictive, and honestly, when you think about it again, it does come across as a correct statement, especially in the wider world of other opportunities, and especially for a management consultant.

    In defining the problem, the effective consultant must have the courage and the initiative to state his convictions and press the client for acceptance and resolution of the problem as defined. The client expects the consultant to have the strength of his convictions if he is to be dependent upon him. Consultants who are unskilled at this are often liked and respected but employed only as counselors, not as true management consultants.

    The successful professional inevitably must be both self-disciplined and rigorous in his data gathering as well as highly cooperative as a member of a case team.

    The continuing client relationship requires a sustained and highly developed empathy with the client representative. Inability to do this is disqualifying for the more significant roles in management consulting.
In other words, the successful consultant:
  • Identifies his client's significant problems;
  • Persuades his client to act on the problems by researching them;
  • Organizes a diversified task force of his own firm and coordinates its activity;
  • Fully utilizes the insights and staff work available in his client's organization;
  • Uses the full conceptual power of his own project team;
  • Successfully transmits his findings to the client and sees that they are implemented;
  • Identifies the succeeding problems and maintains the client relationship;
  • Fully satisfies the client expectations that he raised;
  • Does all these things within a framework of the time and cost constraints imposed by himself or the client.

Thursday, September 30, 2010

my experiments with solr :)

It's a catchy title, but yes.. that's what I am going to talk about...

I came across Hadoop when I was looking for a new solution for one of our in-house projects. The need was quite clear; however, the solution had to be dramatically different.

The one statement we received from business was, "We need an exceptionally fast search interface". And for that fast interface to search upon they had more than a hundred million rows worth of data in a popular RDBMS.

So, when I sat down to think about how to make a fast search application, the first thing that came to my mind was Google. Actually, whenever we talk about the speed or performance of web sites, Google is invariably the first name that comes up.


Further, Google has the plus point that there is always some activity at the back end to generate the page or results that we see; it's never static content. And then another point: Google has a few trillion pieces of information to store/index/search, whereas our system was going to have a significantly lower volume of data to manage. So, going by that, Google looked like a very good benchmark for this fast search application.

Then I started to look into how Google generates that kind of performance. There are quite a few pages on the web talking about just that, but probably none of them has the definitive/authoritative view on Google's technology, or for that matter an insider's view on how it actually does what it does so fast.

Some pages pointed towards their storage technology, some talked about their indexing technology, some about their access to huge volumes of high performance hardware and what not...

For me, some of them turned out to be genuinely interesting; one of them was the indexing technology. There has to be a decent indexing mechanism that the crawlers feed and the search algorithms hit. Storage efficiency is probably the next thing to come into play: how fast can they access the corresponding item?

Another of my observations is that the search results page (the one listing page titles and snippets) comes back really fast, mostly in less than 0.25 seconds, but clicking through the links does take some time. So I think it has to be their indexing methodology that plays the bigger role.

With that in mind, I set about finding what can do similar things and how much of Google's behaviour it can simulate/implement.

Then I found the Hadoop project at Apache (http://hadoop.apache.org/), which to a large extent reflects the way a Google kind of system would work. It provides distributed computing (Hadoop Core), a BigTable-like database (HBase), a map/reduce layer, and more. Reading into it further, I figured out that this system is nice for a batch-processing kind of mechanism, but not for our need of real-time search.

Then I found Solr (http://lucene.apache.org/solr/), a full-text search engine under the Apache Lucene umbrella. It is a Java-written, XML-indexing-based, genuinely fast search engine. It provides many features that we normally wish for in commercial applications, and being from Apache, I would like to think of it as more reliable and stable than many others.

When we set about doing a Proof of Concept with it, I figured out a few things –

•    It supports only one schema; in RDBMS terms, only one table. So basically you have to denormalize all your content to fit into this one flat structure.
•    It supports interaction with the server only through HTTP, be it the standard HTTP methods or REST-like interfaces.
•    It allows you to load data in varying formats: XML documents, delimited files, and DB interactions as well.
•    It has support for clustering as well. Either you can host it on top of something like Hadoop, or you can configure it within Solr itself.
•    It supports things like expression- and function-based searches.
•    It supports faceting.
•    Extensive caching and “partitioning” features.

Besides the other features, the kind of performance it gave without any specific tuning effort made me think of it as a viable solution.

In a nutshell, I loaded around 50 million rows on an “old” Pentium D-powered desktop box with 3 GB RAM, running Ubuntu 10.04 server edition (64-bit), with two local hard disks configured under a logical volume manager.

The loading performance was not all that great, though it's not that bad either. I was able to load a few million rows (from a file sized about 6 GB) in about 45 minutes when the file was on the same file system.
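For reference, loading a delimited file is itself just an HTTP call. A rough sketch, assuming a stock local Solr instance on port 8983 with the example CSV update handler enabled and a schema matching the file's header row:

# post a delimited file to the CSV update handler and commit it
curl 'http://localhost:8983/solr/update/csv?commit=true' \
     --data-binary @data.csv \
     -H 'Content-type: text/plain; charset=utf-8'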

In return, it gave me query times in the range of 2-4 seconds for the first run of a query. For subsequent re-runs of the same query (within a span of an hour or so), it came back in approximately 1-2 milliseconds. I would like to think that is pretty great performance given the kind of hardware I was running on, and the kind of tuning effort I put in (basically none – zero, I just ran the default configuration).
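The queries themselves were plain HTTP GETs of roughly this shape (the field names here are hypothetical placeholders, not our actual schema):

# simple field query, first 10 rows, with a facet on one column
curl 'http://localhost:8983/solr/select?q=customer_id:12345&rows=10&facet=true&facet.field=region'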

Given that, I won't say that I have found the equivalent of, or a replacement for, Google's search for our system, but yeah, we should be doing pretty well with this.

Although there is more testing and experimentation that is required to be able to judge solr better, the initial tests look pretty good.. pretty much in line with the experiences of others who are using it.

Sunday, September 12, 2010

Business & Open Source - How both can benefit

I had the opportunity to scout for a new technology/solution for one of our in-house projects. Quite a few of the options that I looked at were from the open source arena. And amazingly, the products were far more capable than we expected; we'd just have to pitch in with some effort to get them working for us.

I have always felt that for open source projects/products to become commercially viable for a business enterprise, the enterprise has to step up and commit some resources to get the actual value out of them.

In other words, if an organization wants to use an open source product, which has an equivalent competitive commercial product available in market, they should be open enough to have their own in-house people who can take ownership of the installation. The organization shouldn't completely rely on the support available from the community forums and such.

I have seen more than one manager complain about the lack of support for open source products. Had there been a proper support system for each of the open source products, we'd see a lot of stories similar to MySQL's or Pentaho's model.

What I would like to see, perhaps, is organizations becoming mature enough in their adoption of open source products. By that I mean having an open vision, having people who understand, like and own the product, and at the same time tweaking and tuning the product to suit the organization's business needs.

In the process, the organization should contribute to the product's development cycle. This could happen in many ways: bug fixes, contribution of new features, employees contributing on community forums and such. To use the terminology of peer-to-peer sharing, leechers alone don't help a torrent; people need to seed it as well. In the same way, unless organizations contribute to an open source product, they stand to become mere leechers.

Only after we have a decent balance of organizations using and contributing to open source products will we see the ecosystem flourish...

Thursday, September 9, 2010

Tips for brainstorming...

Interesting read, from both positive and negative viewpoints -

1. Use brainstorming to combine and extend ideas, not just to harvest ideas.

2. Don't bother if people live in fear.

3. Do individual brainstorming before and after group sessions.

4. Brainstorming sessions are worthless unless they are woven with other work practices.

5. Brainstorming requires skill and experience both to do and, especially, to facilitate.

6. A good brainstorming session is competitive—in the right way.

7. Use brainstorming sessions for more than just generating good ideas.

8. Follow the rules, or don't call it a brainstorm.

Read more here - http://www.businessweek.com/innovate/content/jul2006/id20060726_517774.htm?chan=innovation_innovation+++design_innovation+and+design+lead

in reference to: Eight Tips for Better Brainstorming

Wednesday, September 8, 2010

Big help...

I wanted to get my table sizes in Infobright, and this page came to my rescue...

SELECT table_schema,table_name,engine, table_rows, avg_row_length,
(data_length+index_length)/1024/1024 as total_mb,(data_length)/1024/1024 as data_mb,
(index_length)/1024/1024 as index_mb, CURDATE() AS today
FROM information_schema.tables
WHERE table_schema='mySchemaName'
ORDER BY 7 DESC
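As a usage note, the same query can also be fired non-interactively from the shell with the standard mysql command-line client (Infobright speaks the MySQL protocol); credentials and schema name below are placeholders, and add -h/-P if the server is not local:

mysql -u dbuser -p -e "
  SELECT table_schema, table_name,
         (data_length+index_length)/1024/1024 AS total_mb
  FROM information_schema.tables
  WHERE table_schema='mySchemaName'
  ORDER BY total_mb DESC;"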

Thanks Ron...
in reference to: Calculating your database size | MySQL Expert | MySQL Performance | MySQL Consulting

Wednesday, August 25, 2010

I also feel like saying, 1984...

This story appeared in the Economic Times, wherein Apple claims to have developed (or to be busy developing) the means to collect sensitive info about an iPhone user. Subsequently, Apple intends to halt usage of the iPhone device by an "unauthorized" user; "unauthorized" reportedly includes -

1. an iPhone that has been hacked to work outside the contract with which it was sold, read "jailbroken"

2. an iPhone that is perhaps being used by someone other than the person who registered the first heartbeat or facial-recognition info..

Apple intends to capture the phone's location using GPS/other tech, and perhaps control the device remotely if they feel it's being used in an "unauthorized" way..

I agree with the people who are reminded of 1984 after reading Apple's intentions... ha, time does come back around... George Orwell, were you right after all??
in reference to: Apple to make iPhone theft-proof - Hardware - Infotech - The Economic Times

Monday, August 2, 2010

Country General Mood using Tweets

Well, it sure is pretty fascinating to do that kind of study and come back with results as commonsensical as we see here...

http://www.iq.harvard.edu/blog/netgov/2010/07/mood_twitter_and_the_new_shape.html


I quote (with all credit where it's due, none to me...)


A group of researchers from Northeastern and Harvard universities have gathered enough data from Twitter to give us all a snapshot of how U.S. residents feel throughout a typical day or week.

Not only did they analyze the sentiments we collectively expressed in 300 million tweets over three years against a scholarly word list, these researchers also mashed up that data with information from the U.S. Census Bureau, the Google Maps API and more. What they ended up with was a fascinating visualization showing the pulse of our nation, our very moods as they fluctuate over time.

The researchers have put this information into density-preserving cartograms, maps that take the volume of tweets into account when representing land area. In other words, in areas where there are more tweets, those spots on the map will appear larger than they do in real life.


An apparently public-domain result of the analysis is available here -
http://cdn.mashable.com/wp-content/uploads/2010/07/twitter-moods.jpg

Wednesday, July 28, 2010

Oracle Count(1) vs Count(*)

Well, it may well be an everlasting discussion: which one of these to use, count(1) or count(*)?

I guess this article by Thomas Kyte already clarified the situation long ago (well, for the IT industry 2005 is long ago anyway, especially given the speed at which we are moving).

Essentially, what AskTom says is that count(*) is better than count(1), since count(1) is translated to count(*) internally anyway. I wonder, then, why someone would want to use count(1) at all.

With count(1) there is at least one extra step involved in getting to the actual result. And there is another subtlety: count(1) has to evaluate an expression as well, effectively "count(1) where 1 is not null". Though that condition is a tautology, it has to be evaluated nonetheless.

Further, there was some misconception about how the result is returned – whether it is read from the data dictionary, some view or table, or something like that. I don't think so. The result is calculated at run time, when the query is executed, and Oracle actually goes ahead and counts the records in the table.

Should set the record straight...

in reference to: Ask Tom "COUNT(*) Vs COUNT(1) on tables with CLOB..."

Friday, July 23, 2010

Developing a Rails application using an existing database

This is the latest challenge for me. A database exists with real data in there, and I have to develop a rails application around that.

Initially we needed basic CRUD screens for some tables. Being lazy (I'm really proud of that), I set out to find whether there is a solution that generates the forms (read: views) for the existing tables/models.

I have already managed to generate the models/schema.rb using another gem, called magic_model. Read more about that here

Then Google helped me find another gem, scaffold_form_generator, which generates the necessary views/forms for a given model. However, some improvements are needed there (I think); perhaps I'll contribute something (if I find out enough about how to do that).

Well, for the moment, I am struggling with the handling of the special-meaning columns that are missing from the legacy tables. Will continue writing on this...

Wednesday, July 21, 2010

Making rails talk to SQL Server Express Edition

I had zero exposure to SQL Server before today. With the help of a DBA and Google :) I finally got the communication working. Here's what I did to achieve this -

Make an ODBC connection pointing to the DB server; preferably create a SQL Server authentication user.

Install the following gems
  1. activerecord-odbc-adapter (2.0)
  2. activerecord-sqlserver-adapter (2.3.8)
  3. dbd-odbc (0.2.5)
  4. dbi (0.4.5)
  5. deprecated (2.0.1)
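Installing those comes down to the usual RubyGems commands, roughly as below (a sketch; the versions are the ones listed above, your gem source may serve newer ones):

gem install dbi -v 0.4.5
gem install dbd-odbc -v 0.2.5
gem install deprecated -v 2.0.1
gem install activerecord-odbc-adapter -v 2.0
gem install activerecord-sqlserver-adapter -v 2.3.8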

Further, I copied the odbc.so and odbc_utf8.so files from http://www.ch-werner.de/rubyodbc/i386-msvcrt-ruby-odbc.zip to .../ruby/1.8/i386-msvcrt, even though that exact directory was not found in the path. So the files were eventually copied to ...\Ruby187\lib\ruby\1.8\i386-mingw32

My database.yml file looks like this -

development:
  adapter: sqlserver
  mode: ODBC
  dsn:
  username: myUserName
  password: myPassword



Friday, July 16, 2010

winscp vs cuteftp

I recently had the experience of transferring relatively large files from a Windows box to a Linux one.

I had some files ranging from 2.1 GB to 4.x GB. The company traditionally uses CuteFTP to transfer (read: FTP) files across servers, so I started with that anyway.

However, all my transfer attempts were failing. After transferring 2 GB of data, CuteFTP would end the transfer and give some error message or other. I figured this was happening only with the larger files, and each transfer consumed an hour or more before it failed.

Google couldn't help a lot, so finally I got down to WinSCP. And it worked out really well. Not only did the transfers not fail, they finished within minutes instead of the hours and hours spent by CuteFTP.

Perhaps it's some setting somewhere in CuteFTP that I couldn't find, but the whole experience has left me more inclined towards the open source community.
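(As an aside: once an SSH login to the Linux box exists, the same copy can also be scripted from any machine with an OpenSSH client. A minimal sketch, with host, user and paths as placeholders:)

# -C turns on compression; helpful on slow links, pointless for already-compressed files
scp -C bigfile.dat user@linuxhost:/data/incoming/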

Thanks a ton guys .. :)

in reference to: Download WinSCP from SourceForge.net

Wednesday, July 14, 2010

Trying to become a Data warehousing architect... !!!




If someone wants to become a DW Architect, there is a lot of knowledge required. Below are a couple of different perspectives: one from Microsoft, not specifically about DW but about enterprise architecture, and another on DW specifically.

Although I find it strange that someone has to prepare for the kind of questions asked in such an interview, there are things posted along these lines. I believe such an interview is more of a discussion of thought process and approach than a question-and-answer session. Still, here is an excerpt outlining some of the interesting topics that might come up in a discussion for this kind of position.

  1. Describe advantages of the CIF architecture versus the bus architecture with conformed dimensions. Which would fit best in our environment given [some parameters they give you] and why
  2. Describe snow-flaking
  3. Describe fact-less fact tables.
  4. Draw a star schema of our business
  5. Describe common optimization techniques applied at the data model level
  6. How do you handle data rejects in a warehouse architecture?
  7. Describe common techniques for loading from the staging area to the warehouse when you only have a small window.
  8. How do you load type 1 dimensions
  9. How do you load type 2 dimensions, and how would you load it given our [insert business particularity]
  10. How would you model unbalanced hierarchies
  11. How would you model cyclic relations
  12. What major elements would you include in an audit model?
  13. How would you implement trace-ability ?

One of the Microsoft guys wrote about possible feathers in the cap of an Enterprise Architect -  http://blogs.msdn.com/b/nickmalik/archive/2006/09/09/ea-interviews.aspx



Thursday, May 13, 2010

Parallel Querying using Oracle

Perhaps a long awaited feature, this one is clearly the one that I like most.

Using pre-built packages it's now possible to break a single (relatively heavy) operation into multiple small operations which run in parallel. The benefits are obvious, and can't be ignored.

I am yet to ascertain the usability of such a feature with tools like Informatica or Pentaho, but I am quite sure a lot can be achieved, especially in the direction of updates to huge tables in data warehouses.

And looking at the implementation, it's genuinely simple and straightforward.

Would be nice to see more applications utilizing the benefits of such features..

Read more in Oracle Magazine, in the article written by Steven Feuerstein...

in reference to: PL/SQL Practices: On Working in Parallel

Thursday, May 6, 2010

Ubuntu 10.04 is here.... Upgrade today...

It's only been a few hours since I had the pleasure of upgrading to the latest version of Ubuntu.

And without any doubt, or any detailed analysis/investigation I can say this much for sure -

- The upgrade process was super easy. No hassles whatsoever: you just need a decent internet connection, a few times you need to confirm the suggested decision, and done...

- The new system looks cleaner, much cleaner. It's neat, to some extent beautiful. It's not about the background or themes or anything like that, but the overall look and feel is genuinely cool. Perhaps it has to do with one thing I have done on my part: my system font is "Lucida Grande" all across, and for some reason the shapes and looks of this font are far better than any other I have encountered.

- The boot time has come down a good notch.

Well, I guess this should be a good starting point for my review of the new version; let's see how it comes across further.

in reference to: my tech playground

Friday, April 30, 2010

Implementing Cartesian Product in Informatica Mapping

Unlike Pentaho, Informatica doesn't provide a ready-made transformation for implementing a Cartesian product in a mapping. Then again, most of us would agree that it's not often that we go for Cartesian product joins [the instinct generally is to do whatever it takes to avoid a Cartesian product, because it is in general a performance killer].

However, when your requirements do call for it, there is no direct way to do it in the Informatica Joiner transformation. One option is to do it on the DB side, by overriding your Source Qualifier SQL statement and building the join in there.


However, I have seen that some designers don't like to override SQL statements; in such cases you'd have to implement it inside the mapping itself. Here's a workaround for achieving that -

  1. Read both the sources using their own source qualifiers, normally.
  2. For both of them, put in an Expression Transformation after the source
  3. In both the expression transformations, create an output port with a constant value. For example, call it dummy1 for stream 1 and assign it the value -1. Similarly, create a port in the second pipeline, let's call it dummy2, and assign it the value -1.
  4. Now create a joiner transformation. Link ports [including the one that we created with a constant value] to the joiner from both the expressions.
  5. In the join condition, choose to compare the dummy columns. 
  6. The rest of the joiner configuration would have to be like any other joiner. Nothing specific.
You might want to keep the smaller source as the master in the joiner, since it would save on the caching.

Before implementing the above solution, be sure to go back and check whether your application actually requires a Cartesian product!

Thursday, April 29, 2010

Hard vs Soft Parse in Oracle...

Was reading on asktom, came across this response from Tom, that seemed so direct and clear...

So, shamelessly copying.. :)



[From AskTom]
Parsing

This is the first step in the processing of any statement in Oracle. Parsing is the act of breaking the submitted statement down into its component parts – determining what type of statement it is (query, DML, DDL) and performing various checks on it.

The parsing process performs two main functions:

o Syntax Check: is the statement a valid one? Does it make sense given the SQL grammar documented in the SQL Reference Manual. Does it follow all of the rules for SQL.

o Semantic Analysis: Going beyond the syntax – is the statement valid in light of the objects in the database (do the tables and columns referenced exist)? Do you have access to the objects – are the proper privileges in place? Are there ambiguities in the statement?

For example, if there are two tables T1 and T2 and both have a column X, the query "select X from T1, T2 where ..." is ambiguous – we don't know which table to get X from. And so on.

So, you can think of parsing as basically a two-step process: that of a syntax check to check the validity of the statement, and that of a semantic check – to ensure the statement can execute properly. The difference between the two types of checks is hard for you to see – Oracle does not come back and say "it failed the syntax check"; rather it returns the statement with an error code and message. So, for example, this statement fails with a syntax error:

SQL> select from where 2;
select from where 2
*
ERROR at line 1:
ORA-00936: missing expression

While this statement failed with a semantic error – if the table NOT_A_TABLE existed and we had permission to access it, this statement would have succeeded:

SQL> select * from not_a_table;
select * from not_a_table
*
ERROR at line 1:
ORA-00942: table or view does not exist

That is the only way to really tell the difference between a semantic and syntactic error – if the statement COULD have executed given the proper objects and privileges, you had a semantic error; otherwise, if the statement could not execute under any circumstances, you have a syntax error. Regardless – Oracle will not execute the statement for you!

The next step in the parse operation is to see if the statement we are currently parsing has already in fact been processed by some other session. If it has – we may be in luck here, we can skip the next two steps in the process, that of optimization and row source generation. If we can skip these next two steps in the process, we have done what is known as a Soft Parse – a shorter process to getting our query going. If we cannot, if we must do all of the steps, we are performing what is known as a Hard Parse – we must parse, optimize, generate the plan for the query. This distinction is very important. When developing our applications we want a very high percentage of our queries to be Soft Parsed – to be able to skip the optimize/generate phases – as they are very CPU intensive as well as a point of contention (serialization). If we have to Hard Parse a large percentage of our queries, our system will function slowly and in some cases – not at all.

The way this sharing of SQL in Oracle is accomplished is via the shared pool, a piece of memory in the SGA maintained by Oracle. After Oracle parses the query and it passes the syntax and semantic checks – it will look in the shared pool component of the SGA to see if that same exact query has already been processed by another session.

Since it has performed the semantic check it has already figured out:

o Exactly what tables are involved
o That we have access to the tables (the proper privileges are there)

And so on. Now, it can look at all of the queries in the shared pool that have already been parsed/optimized and generated to see if the work has already been done.


[Raghav]
As has been discussed above, soft parsing of queries makes our application faster, since Oracle doesn't have to do everything all over again (from a parsing point of view). Bind variables come to help in this case.

For example, a query like the following -

select a,b from tbl1 where c = 2333;

would be parsed and stored under a different identifier than

select a,b from tbl1 where c = 58763;

whereas essentially they are the same query, only the literal value in the where condition being different.
In this case, the second query goes for a hard parse instead of the soft parse we would expect. So, what can be done to tell Oracle that these are actually the same query and there is no need to do the hard parse again? Well, you can do that by using a bind variable – replacing the only differentiating part of the query with a runtime substitution.

Look at the following query -

select a,b from tbl1 where c = :bb;

Now this one query can be used for both the examples above; at run time (after parsing), the variable :bb is bound to the respective literal value, and you have your answer. This gains a lot of performance, since the second run goes for a soft parse instead of a hard parse; every subsequent run after the first one re-uses the parsing work done the first time.

So, a simple recommendation: use bind variables wherever you see a query being reused with different literal values, and let the literals be supplied at run time.

Wednesday, April 28, 2010

AWS Application Demo

  • Amazon started with S3 (Simple Storage Service) - purely storage service.
  • For Computing purposes, Amazon started EC2 
  • EC2 allows hosting your application on virtual servers operated by Amazon, widely known as AWS.
  • For calculating the availability as a resource, 1 compute unit at Amazon EC2 is roughly equivalent to a 1.2 GHz Xeon server.
  • All computing resources are virtualized, none is physical. No physical details are ever published. However, the user has to manage his own file system. Based on need/request, disk space is mounted and is made available to the server. Further, it has to be managed by the application.
  • There can be three types of instances, 
    • Small
    • Medium
    • Large instances

Behavior on instances
  • like a normal web server
  • has a public web address

  • Java Software available to convert command line instructions into SOAP wrapped API calls to AWS
  • Need to download API tools from AWS
  • Key pairs are tied to regions
    • The benefit of this is that the servers are replicated across data centers located physically separately


Autoscaling service
  • Allows scaling of servers based on need
  • Consumes about 60-90 seconds to scale up based on need (e.g. load on the server). If configured appropriately, the system will replicate the application code and launch new server instances to handle the extra load.
  • Handles scaling up as well as down. Automatically, the extra/unused instances are shut down and released from the application deployment when there is no load requiring their service.
  • Very useful in cases of spikes, especially high peaks. New feature launches on sites, or a sudden outbreak of news, etc., cause spikes in server load. At such times, AWS works perfectly to scale up the required computing power. The application owners are saved from buying extra servers for that one day of load... :)
  • It's possible to specify how many servers to add
  • Such scaling up/down can be configured. For example, it can be specified to request new instances when the current system's load exceeds 85%, or to scale down when usage goes below 20%



OSs supported
  • Linux
  • Windows
  • openSolaris

Buzzwords
  • AMI = Amazon Machine Image
  • EBS = Elastic Block Store
  • Elastic IP = static IPs assigned to instances
  • Amazon EC2 = Amazon Elastic Compute Cloud

Monday, April 26, 2010

Web Application Debugging Tools

A Talk by Matthew McCullogh


Some tools he talked about and demonstrated

1. tcpdump
      Captures TCP/IP packets off the network interface.
      Available as an open source project at http://www.tcpdump.org/
      Allows you to save the captured packet information into a file on disk, thereby allowing offline analysis.

tcpdump -i en1 -s0 -n -A

2. WireShark - visual interpretation of tcpdump captured data
       Once you capture packet information using tcpdump, you can get a better view of it, more readable using this tool. Again an open source project, available from http://www.wireshark.org/

3. netstat
      An all-time favourite across *nix flavours; every system admin's first choice. Shows port and connection information.
      Available by default with almost all ports of all OSs. Sometimes differences are found between implementations, pertaining to syntax or switch names/usage.

4. curl - Another all-time favorite.
      Allows calling/initiating HTTP requests from the command line; saves a lot of time, since no program or skeleton needs to be created just for testing HTTP calls (see the sketch after this list).

5. jMeter - designed to load test functional behavior and measure performance

6.  soapUI => mocking a server.
       Available from http://www.soapui.org/

7. JavaScript
      FireBug -  for firefox - Javascript functions/elements are easily visible and manageable. Debugging on the fly is allowed too.  Available as an addon for firefox.
      FireBug - Firebug implementation for Safari.

8. FireFinder - find items in the DOM. Easily locate an item in the DOM of a given web page.

More details coming on the following -

9.  Visual Event

10. FireFocus

11. JASH - CLI for testing JavaScript
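As a quick illustration of the curl item above, an ad-hoc test call from the shell could look like this (the URL and payload are made-up placeholders, not a real service):

# GET, printing the response headers along with the body
curl -i 'http://localhost:8080/app/api/items?id=42'

# POST a small JSON body and follow any redirects
curl -L -X POST -H 'Content-Type: application/json' \
     -d '{"name": "test"}' 'http://localhost:8080/app/api/items'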


Buzzwords
- BPF - Berkeley Packet Filter

Saturday, April 24, 2010

workshop day GIDS 2010 - 3 - Functional Programming

Well, for the third session of the day, my choice was clear: the cloud computing discussion with Simone from AWS. I had already had a taste of Google's cloud solution (App Engine), and now I wanted to see the Amazon flavour. Quite unfortunately, the internet link gave way and Simone's presentation plans went awry.

The guy was quite angry and it was very visible. He tried to talk people through it – plain talk, talk, talk – but it never settled in. I saw people literally dozing off, and I guess he realized that too, since on more than one occasion he asked for the hall lights to be kept switched on. A bit of a bad show on the part of Saltmarch/Airtel...

As a result, midway through the session I left, and thought of joining the Flex-PHP workshop by the Adobe guys. However, somehow, I ended up entering the hall where Venkat was talking about functional programming.

And did I regret that decision? No chance. I guess I have never enjoyed a programming language discussion so much, probably because I never got the chance to speak to such a learned speaker and so many keen and interested students (yeah, blame my college, I know, I do that myself.. all the time :)

It was a packed house, and a fantastic atmosphere. For one thing, Venkat knew what he was talking about, and for another, I found that the discussion took me back about 15 years to my college days. I had a strong feeling: why didn't I have an educational experience as nice as this? The teacher, the students, the atmosphere – the whole place was a wonderful experience. Though I had no clue what Scala was before I entered the place, it never felt like a problem.

The way he talked about issues and benefits, the arguments he presented, and I think more importantly, the way he engaged the audience in discussions – it was just wonderful. You might think that I am overreacting, but I don't care, I just loved this guy and this session.

He talked about programming paradigms: the way procedural languages were king one day, and the way they are looked at today as if ancient history were being discussed. Similarly, for today's generation, OOP is becoming more of a thing of the past, and new paradigms are coming over.

The way he propounded functional programming, the very idea of not allowing mutability, the way of having to think of a solution without it.. amazing...

At more than one point, people in the audience tried to contest the idea of/need for another language when we already have so many, and the need to learn yet another one. There were thoughts on the toughness of one programming language versus another, and the way he answered.. fantastic... His point was that it's like a person from Karnataka going to Punjab and saying, "These people are crazy, their language is so tough, I can't even read it". And it got a good laugh..

At more than one point he compared different programming languages, their features, and the pros and cons of each of them. The kind of knowledge he had, to be able to compare them, was amazing...

I have to say, without any doubt, that this session was my best experience at GIDS 2010. I loved it, probably due to my personal interest in the theory of computer science as such, but also because there probably wasn't any other speaker with that kind of personality and flair for talking.

Hats off to Saltmarch for inviting people like him too...

workshop day gids 2010 - 2 - paypal x

It went even better afterwards. PayPal has opened the doors of its payment methodology to developers; they call it PayPal X. They have published a complete SDK which allows developers to write independent programs that integrate PayPal as a payment gateway in their e-commerce applications.

Their idea is to popularize PayPal and use its simplicity and capability to spread its business reach. The whole PayPal team was present at the workshop and the atmosphere was pretty nice. Khurram Khan started off by talking a bit about the background of financial transactions; further on, Rangarajan took over and talked in detail.

Frankly, the session was interactive, perhaps because people are motivated by money, the chances of earning it, and the basic idea of linking real e-commerce to their websites with such low effort and simplicity. My personal interest in that area is purely out of curiosity, in the sense that I am more keen to know what's going on than to get down to the code details and start implementing it.

One thing that stood out for me is that PayPal proposes to review the developer's code before allowing it to move to production (live), but has no control thereafter. So, in essence, someone who intends to do something funny (read: wrong/illegal etc...) might pass the initial review by presenting a genuine case, and then, once he's live, go back and change the code. I feel that by allowing updates like this, PayPal is losing a bit of control over how people use its API.

I would probably have built some kind of dynamic filter or pair-matching mechanism (or a checksum for that matter) which would change the moment the developer changes his code on the production site. Every call to the PayPal API would check this checksum/authentication token and go through only if this check also holds good. Well, it's just a thought, and probably PayPal has its own reasons not to enforce any such check, but if I were Khurram, I would probably start like that and perhaps remove it later, or something along those lines.

When I posed this to Khurram, he said that the responsibility for doing anything illegal or wrong lies with the developer or the site owner anyway, so PayPal doesn't really want to get in their way; they would be apprehended sooner or later anyway. As much as I agree with his argument, I still think that PayPal could probably play a role and possibly stop "wrong" things from happening to some extent; after all, it is happening through their infrastructure, even though the ownership lies elsewhere.

Other than this particular point, I think the SDK is pretty nice and okay. They allow Java, no special downloads are required to start developing (besides the SDK, of course), and the API supports almost all kinds of operations.

I especially liked their idea of generalizing PayPal-like services when it comes to trust building. During one discussion a statement was made – I don't remember by whom, but a PayPal employee said, "I am sure that when it comes to trusting someone over the internet, you can trust PayPal a lot more with your financial information than a relatively unknown website that you are using for the first time." That I agree with, and perhaps the Indian jinx of not using the web for payments can be broken with a trusted guy on the net – PayPal.

You trust one party, PayPal, and the rest is handled by PayPal. As far as this statement goes, there actually is a greater risk: what if PayPal goes rogue? It would then have all my information, all of it... well, then, you have to trust someone.. right?? or not???

The workshop day @GIDS 2010 - I - cloud computing using Google App engine

Well, a developer day can't get better than that... And whatever issues I mentioned regarding the scheduling of the summit vanished into thin air.

I had the choice of attending the Cloud Computing/hosting workshop with Matthew McCullough. It was based on Java, which I don't care about anyway. Not that I have any kind of disrespect for the language, but I strongly think that we ought to think beyond languages.

He perfectly showcased how easy it is to build and host your own web application on the Google App Engine cloud platform. It's hard to imagine how easy it has become, with the Eclipse plugin for Google App Engine and the Web Toolkit, to develop an application for the cloud and deploy it. With those tools installed and configured (a surprisingly easy task anyway), it's a button click to deploy your application to the cloud.

I am not a big fan of Java, or of huge complex configuration files for that matter, but the way this App Engine thing comes pre-configured reminds me of Ruby on Rails, at least to start with. It's surprisingly easy to start developing with GWT (Google Web Toolkit) and Google App Engine.

I tried to replicate what he was doing/showing, and – remember, I am no big shot in Java; the last professional Java work I did was about 8 years back – it was surprisingly easy. Just a few clicks and it was done. I have to say, it was infectious; I felt like delving more into Java and going back to experimenting with it...

Thursday, April 22, 2010

Is that a workable idea ?

Actually, we are waiting for a new server to be installed and any new physical server installation takes some time, we all know that. The project would request it, there would be financials and then the order would be approved internally.
It would be placed with the vendor by purchasing, the vendor will ship it, and then a few days/weeks after, technicians will come over to install the stuff.

Then the local admins will get to work and configure it to be used by the respective teams. Another few days/weeks...

Just wondering: any given enterprise has hundreds if not thousands of computers, of all classes – laptops, desktop development PCs, server-class PCs and others...

How about taking off 10% of all these PCs' computing power to create a cloud computing environment (of course within the company environment only). An application can be installed which will make sure that a given percentage of computing power will go to only a particular resource.

A lot like how BOINC works: you install a client/manager app on your PC and then you receive pieces of computation matched to the computing power you contribute.

That kind of computing environment could easily match an average server's computing power.
The storage could easily be hooked up to a SAN.

I'd encourage a discussion on this... to see what others think of the idea...

in reference to: Distributed computing - Wikipedia, the free encyclopedia

Fantastic Effort by SaltMarch

Well, I have to say it like that. The one day I attended at GIDS 2010 was amazing. Probably I am reacting like this because it's my first such summit; nonetheless, the quality of the speakers and the content they delivered was top of the ladder.

Truly amazing.

Getting people of such caliber onto one platform is a job in itself, and Saltmarch has done it to perfection. I chose to be there for the .Web day, and the kind of knowledge flowing around, the technologies being talked about and the manner in which they were addressed, was simply great.

We know some things as simple buzzwords, but we ought to realize that the buzzwords actually have a lot more behind them. The way Simone (from AWS) put it, "Everybody in the conference hall would know what cloud computing is, but I am sure that if I ask you what it is, you'd all come back with different definitions, and perhaps they are all correct as well." That statement sort of sums up the whole experience: we all know, to some varying degree of perception, what something is about, but unless we hear it from the source, as they say, from the horse's mouth, it's always a bit removed from the real thing.

I liked the way the speakers and sessions were put together. Though a few of my favorite sessions were canceled (blame that volcano in Iceland for grounding all the flights), I still enjoyed a lot of it.

Personally, being interested in Ajax, I came a lot closer to experimenting with it, and perhaps to using it in a professional environment as well. Thanks to the people at the Adobe booth there, I have a chance to try out the Adobe Flex Builder IDE and check out its potential first hand.

Tomorrow, there is a workshop on PHP and Flex for delivering RIAs, and I can't wait to be there :)


However, as is the custom for people like me, I have to comment and find fault with something or other. After all, we are humans; we are bound to make mistakes, and other humans are bound to find them and report them publicly, like on this blog... :)

I believe the scheduling of some of the sessions could have been better. For example, I missed out on the PayPal X presentation because I wanted to attend Marty Hall's session on evaluating Ajax/JavaScript libraries too. Now I'll have to content myself with the video of that presentation, hoping it will be made available. :)

I like the scheduling of the workshop day a lot better. It gives people like me the option to attend sessions on different technologies...

Looking forward to another exciting day of fun-filled experiments that enhance know-how...

in reference to: Great Indian Developer Summit :: Conference, Expo, and Awards on Java, .NET, Rich Web :: Saltmarch Media Summits

Wednesday, April 21, 2010

Web changing the face of the world: Ramesh

Ramesh, a senior technologist/scientist from Adobe, talked about how Web 2.0 has changed the world, especially the way information is treated, used and processed...

Here are some of his points -

Real Time Web

  • kind of content that is coming around
  • no more old docs
  • social networking
  • live collaboration
experiencing the web - how the web is experienced, the way it's accessed, used etc.
  • Some tools are coming up in the VR world which enable this
  • augmented reality -> web apps capturing reality and using the inputs
proliferating devices
  • Already, more web access happens from non-PC devices than from PCs
  • More devices are coming along which access the web differently
democratization of data
  • Some governments have already started putting their data in the public domain.
  • Lots of application development opportunities based on that data.
  • Cloud computing also coming around to provide processing power and applications for the data.
Buzzwords
  • ActionScript
  • ECMAScript
  • Flex

Quite an interesting talk :) A learned person, Ramesh.

Marty Hall: Why Ajax: Using Ajax in your web apps

Why WebApps - Ajax
  • Asynchronous JavaScript and XML => the expansion doesn't hold good anymore
  • lets the browser stay responsive, using the time while a request goes off in the background to fetch results for the page
  • Add/modify content on the webpage at runtime
  • trends -> more jobs for Ajax than for any other technology - PHP, VB, ASP etc...
Ajaxifying an Application
  • libraries - jQuery, Ext JS, Dojo, YUI, Prototype, Google Closure, MooTools
  • Browsers allow calls only back to the server the original code came from (the same-origin policy)

  • Situation 2 - Hybrid
  • Libraries to use for new apps - JSF 2.0, Struts 2.0, Spring MVC 3.
  • JSF is better because
  • it is integrated in the main code - jQuery needs you to write JavaScript
  • there is no JavaScript to write - just a simple tag

  • Situation 3 - Hardcore Ajax
  • Google Web Toolkit (see the sketch after these notes)
  • Use Java everywhere
  • Write Java code for the front end -> it gets compiled to JavaScript
  • Write Java code for the back end -> communication happens in terms of Java objects
Buzzwords
  • JBoss Seam <-> JSF 2.0
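
To make the GWT point above concrete, here is a rough client-side sketch of my own (class, package and URL names are made up, not from Marty's session): everything below is Java, GWT compiles it to JavaScript, and the call to the server comes back asynchronously and updates the page in place.

// GWT client sketch (hypothetical names). GWT compiles this Java class to
// JavaScript; RequestBuilder fires the asynchronous (Ajax) call and the
// callback updates the page without a reload.
package com.example.client;

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.event.dom.client.ClickEvent;
import com.google.gwt.event.dom.client.ClickHandler;
import com.google.gwt.http.client.Request;
import com.google.gwt.http.client.RequestBuilder;
import com.google.gwt.http.client.RequestCallback;
import com.google.gwt.http.client.RequestException;
import com.google.gwt.http.client.Response;
import com.google.gwt.user.client.Window;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.Label;
import com.google.gwt.user.client.ui.RootPanel;

public class AjaxDemo implements EntryPoint {
    public void onModuleLoad() {
        final Label result = new Label("No data yet");
        Button fetch = new Button("Fetch");

        fetch.addClickHandler(new ClickHandler() {
            public void onClick(ClickEvent event) {
                // Same-origin policy: the URL must point back to the server
                // that served the original page.
                RequestBuilder rb = new RequestBuilder(RequestBuilder.GET, "/data");
                try {
                    rb.sendRequest(null, new RequestCallback() {
                        public void onResponseReceived(Request req, Response resp) {
                            result.setText(resp.getText()); // modify content at runtime
                        }
                        public void onError(Request req, Throwable e) {
                            Window.alert("Request failed: " + e.getMessage());
                        }
                    });
                } catch (RequestException e) {
                    Window.alert(e.getMessage());
                }
            }
        });

        RootPanel.get().add(fetch);
        RootPanel.get().add(result);
    }
}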

Tuesday, April 20, 2010

@GIDS 2010 Bangalore

I am going to be there at the "Great Indian Developer Summit" in Bangalore April 20th to 23rd 2010.

It's hosted at the lush green IISc campus, in the J N Tata Auditorium. Various pieces of info are available on the organizers' websites -

www.saltmarch.com

http://www.developermarch.com/developersummit/


It looks like an amazing place to be, full of tech discussions, and hopefully knowledge too...

Wednesday, March 31, 2010

Apple and India...

Well, this is sad, particularly for someone in India who avoids getting cracked iPhones...

Only yesterday, Apple and Airtel announced the availability of the iPhone 3GS in India, and today I read that Apple is releasing new, better iPhone hardware this coming summer.

So, basically, India as an iPhone market is going to lag behind the rest of the world by at least one model this time around as well.

Sometimes I don't understand Apple's marketing strategy, especially in the Indian context. As an economy, India is probably among the largest of the emerging markets, closely behind China. That alone should make it a "lucrative market" for any company aiming for growth.

This is especially interesting when we look at the number of iPhones already being used in India with cracked OSes.

It doesn't help a company's revenues if the product is consumed in a market where the company doesn't officially sell it.

Also, the number of iPhones already used in India should give Apple an indication of the marketability of their product here. Personally, I think that as a product from Apple's stable, the iPhone has a lot of glamour attached to it.
And therefore, huge possibilities in India for Apple.

The new iPhone's rumoured specs look stunning: a 5 MP camera, better display resolution, more processing power, possible multitasking, a video-calling camera and a few other things. After reading that, my plans to order an iPhone 3GS have taken a back seat. I want to see this product come to market, and then see when it comes to India.

Apple seems to be listening to customers... the rumoured improvements read almost exactly like a list of the things people call the iPhone's downsides...

Good luck Apple, and India, hoping for a better collaboration this time around...

in reference to: More on next-gen iPhone and Verizon iPhone dreams

Thursday, March 18, 2010

Explain plan of Infobright

We all know that Infobright is based on MySQL. However, when I started with Infobright, I was equally unaware of both.

Today I got to know about Infobright's explain plan collection technique, and realized how different it is from Oracle's.

A typical explain plan from Oracle describes the path the database follows to retrieve the data, and reading it makes common sense.
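
(Just as a reminder of what I mean, here is a small sketch of how you can pull such a plan out of Oracle over JDBC; the connect string, credentials, table and query below are placeholders, not from any real system.)

// Sketch: get a readable Oracle plan via EXPLAIN PLAN + DBMS_XPLAN over JDBC.
// Connect string, credentials, table and query are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OraclePlanDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.OracleDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
        Statement st = con.createStatement();

        // Populate PLAN_TABLE for the statement of interest...
        st.execute("EXPLAIN PLAN FOR "
                + "SELECT col2, AVG(col1) FROM myTableName GROUP BY col2");

        // ...then print the formatted, human-readable plan.
        ResultSet rs = st.executeQuery("SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY)");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        st.close();
        con.close();
    }
}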

On the other hand, an explain plan from Infobright looks nothing like Oracle's plan. It's a set of cryptic rows put together. An example of such a plan -

2010-03-18 01:31:01 [4] T:-1 = TABLE_ALIAS(T:0,"myTableName")
T:-2 = TMP_TABLE(T:-1)
VC:-2.0 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:20))
A:-1 = T:-2.ADD_COLUMN(VC:-2.0,AVG,"avg(col1)","ALL")
A:-2 = T:-2.ADD_COLUMN(VC:-2.0,MAX,"max(col1)","ALL")
A:-3 = T:-2.ADD_COLUMN(VC:-2.0,MIN,"min(col1)","ALL")
A:-4 = T:-2.ADD_COLUMN(,COUNT,"count(1)","ALL")
VC:-2.1 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:2))
A:-5 = T:-2.ADD_COLUMN(VC:-2.1,GROUP_BY,"col2","ALL")
VC:-2.2 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:5))
A:-6 = T:-2.ADD_COLUMN(VC:-2.2,GROUP_BY,"col3","ALL")
VC:-2.3 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:7))
A:-7 = T:-2.ADD_COLUMN(VC:-2.3,GROUP_BY,"col4","ALL")
VC:-2.4 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:11))
A:-8 = T:-2.ADD_COLUMN(VC:-2.4,GROUP_BY,"col5","ALL")
VC:-2.5 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:12))
A:-9 = T:-2.ADD_COLUMN(VC:-2.5,GROUP_BY,"col6","ALL")
VC:-2.6 = CREATE_VC(T:-2,PHYS_COL(T:-1,A:13))
A:-10 = T:-2.ADD_COLUMN(VC:-2.6,GROUP_BY,"col7","ALL")
VC:-2.7 = CREATE_VC(T:-2,PHYS_COL(T:-2,A:-1))
T:-2.ADD_ORDER(VC:-2.7,DESC)
RESULT(T:-2)


And guess what: to generate a plan, you have to change a setting in the initialization parameters file and then bounce the database. I wonder what happens when someone has to look into a production database's performance and therefore needs to see a plan.

Being as new a system as it is, I can't really complain; such marks of maturity will slowly creep in. The focus right now is probably elsewhere, on building the functionality, and probably on performance.

in reference to: Infobright.org Forums | Enable MySQL Logging

Friday, February 19, 2010

Discovered BOINC 5 years ago, joined today :)

It was George, a colleague at Axa Tech in Switzerland, who first introduced me to this idea. I was thrilled all right, but didn't join right away.

Today I did.

It's a nice feeling, a feeling of giving something back to science.

http://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing

http://boinc.berkeley.edu/

I won't go into the details of how it works and all that. However, I have to say: for whatever time your laptop/workstation remains idle, you can donate its processing power to a better cause. An absolutely fantastic idea, genuinely noble.

And then look at the stats: people like me are contributing processing power to the extent of 4 petaFLOPS, as against the fastest supercomputer system at a mere 1.79 petaFLOPS.

As I said, feels great. :)

I hope to inspire others to join in...

in reference to: BOINC

Tuesday, February 16, 2010

Connecting to an Oracle Server...which way is faster...

I had never come across this question as such, because it's sort of a given that you connect to a server whichever way the administrators' policy defines. The tuning perspective came in only afterwards.

However, in a recent situation, I had to connect to a remote server from my client system and was wondering whether I should connect using tnsnames.ora or use the basic connection mechanism with the complete connect string.

I thought that going through tnsnames.ora would add some latency to connections. On the other hand, connecting directly without any such file is like doing a telnet to the server on a given port.

Also, I have seen that if you connect using the complete connect descriptor copied from the tnsnames.ora file, that works too.
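
To make the two options concrete in code (the alias, host, service name and credentials below are made up), here is how the same connection looks over JDBC: the first URL makes the client resolve an alias through tnsnames.ora, the second carries the full descriptor and skips the lookup altogether.

// Sketch: one connection via a tnsnames.ora alias, one via the full connect
// descriptor. All names, hosts and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;

public class OracleConnectDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("oracle.jdbc.OracleDriver");

        // Option 1: TNS alias -- the client must find and parse tnsnames.ora.
        System.setProperty("oracle.net.tns_admin", "/oracle/network/admin");
        Connection viaAlias = DriverManager.getConnection(
                "jdbc:oracle:thin:@MYDB_ALIAS", "scott", "tiger");

        // Option 2: full connect descriptor -- no tnsnames.ora lookup at all.
        Connection viaDescriptor = DriverManager.getConnection(
                "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)"
                + "(HOST=dbhost)(PORT=1521))"
                + "(CONNECT_DATA=(SERVICE_NAME=MYSERVICE)))", "scott", "tiger");

        viaAlias.close();
        viaDescriptor.close();
    }
}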

Since I was not clear about it myself, I posted this question on the Oracle forums, and a gentleman, Billy Verreynne, took out some time to help explain the concept. Many thanks to him.

As it stands, the basic connection methodology is the same in both cases. However, if you happen to have a huge tnsnames.ora file, there is some overhead in parsing the file to find the connect string for the service name you are connecting to. That only causes noticeable degradation if your client opens connections by the hundreds; basically, if the client had to parse the huge tnsnames file a few hundred times per second, it might make a difference.

Billy also suggested that switching the server mode from dedicated to shared would help.

So, my conclusion: it's effectively the same whether you connect through tnsnames or through the full connect string.

Thursday, January 21, 2010

Availability of native drivers !!!

As I mentioned earlier, I was busy experimenting with a columnar database, InfoBright.

The experience so far is mixed: in some scenarios the data loading is very good, as much as 40k rows per second, whereas through other channels it's a poor 500 rows per second. When I go with their built-in loader, it's lightning fast, but when I load from Pentaho or Informatica, it's measly.

Apparently, the drivers and the compatibility of the third-party tools play a big role in the performance you can attain.

The fact that they don't have a native driver published is a huge bottleneck. Informatica doesn't even have a native connector for MySQL, the better-known cousin of InfoBright (being open source, the core engine of InfoBright is based on MySQL).

One thing came out of this experience for sure: my liking for open source grew stronger. While hunting for the reason behind the slow performance from Pentaho, I even managed to get hold of the source code of the plugin (transformation step) that Pentaho uses for Infobright. It's pure Java, and it was a very powerful, awakening feeling to see the code. I felt as if I had the power, the choice, to make a difference, to make it better. :)


I am currently exploring other options; one of them involves creating a dump file (read: CSV) of the data and then launching the command-line tool to load it into the target DB. I don't like it, but let's see if there is a better (read: faster) way around...
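
The rough shape of that option, for the record (table, file, host and credentials are made up, and I'm assuming the stock MySQL JDBC driver talks to Infobright, since its engine sits on MySQL): dump to CSV, then hand the whole file to the engine's loader via LOAD DATA INFILE instead of trickling rows in one by one.

// Sketch: bulk-load a CSV dump into Infobright through the MySQL JDBC driver.
// Host, port, schema, table and file names below are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InfobrightBulkLoad {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://ibhost:3306/mydb", "loader", "secret");
        Statement st = con.createStatement();

        // One statement hands the whole file to the engine's loader,
        // instead of the measly row-by-row inserts from the ETL tool.
        st.execute("LOAD DATA LOCAL INFILE '/tmp/fact_dump.csv' "
                + "INTO TABLE fact_sales "
                + "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
                + "LINES TERMINATED BY '\\n'");

        st.close();
        con.close();
    }
}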