Saturday 15 October 2011

Data Lineage.. what is that ?

It is one of those buzzwords that keeps doing the rounds every once in a while. Almost every enterprise wants to analyze its data lineage, and it is almost always hard to find people with the knowledge or experience to do this kind of analysis.

For the unaware, data lineage is (in very short words) a study of the data from its source to its eventual target. Much like tracing a family tree, we trace the generations of the data we are dealing with.

Starting from the source, the data travels through different subsystems, sometimes going through transformations and thus possibly changing shape too...

Informatica had a very interesting blog post on this (back in 2007), which is fairly informative.


Wednesday 28 September 2011

Informatica & hadoop... solutions for future ?

Distributed computing using Hadoop has taken the IT industry by storm in the last few years. After getting almost "adopted" by Yahoo, Hadoop has progressed quite fast, and is now maturing slowly but steadily.

More and more enterprise solution providers are announcing their support for the Hadoop platform, hoping to get a piece of the big data pie. So it is perhaps only fair to expect that Informatica, the leader in the data integration space, has also announced a tie-up with Cloudera to port the Informatica platform to Hadoop.

Though the exact details are yet to come out, the possibilities are endless. With Hadoop (and its inherent distributed computing based on map/reduce), Informatica can actually think of processing big data in reasonable time frames.

For one of my customers, I deal with about 200 million rows of data per day in one job. Besides the issues with tuning the query in Oracle, the Informatica component itself takes hours. With map/reduce in place, I hope to get that down to minutes, Oracle issues notwithstanding.

Although word about Hadoop is spreading quite fast, its adoption (from buzzword to actual usage in the enterprise) is not as fast. To aid their cause, Informatica and Cloudera have started an interesting series of webinars called "Hadoop Tuesdays". It's free to join, and they get experts to talk about various issues around Hadoop, big data and Informatica. It has been very useful and informative so far.

Monday 25 July 2011

Switching defaults in Ubuntu

Ubuntu allows you to have multiple alternatives installed for many pieces of software, for example Java.
You can have the default OpenJDK installed, and then also have the Sun version installed.

To see which alternatives are installed for your software, look in /etc/alternatives. There you will see many pieces of software with alternatives listed.

With several of these installed, you need to point your system at one of them as the default. This is especially important after installing a newer version of the software.

To switch between the alternatives, use this command:

sudo update-alternatives

The man page for update-alternatives lists a plethora of options.

For our example, to configure the default for Java, use this:

sudo update-alternatives --config java
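
Besides the interactive --config, the man page also documents --list and --set, which are handy in scripts. Here is a minimal sketch of how they could be wrapped. The JDK path is only an assumption for illustration (use whatever --list prints on your machine), and by default the sketch just prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: list the registered alternatives and pick one non-interactively.
# JAVA_PATH is a hypothetical path; replace it with one printed by --list.
JAVA_PATH="${JAVA_PATH:-/usr/lib/jvm/java-6-sun/jre/bin/java}"

# RUN=echo only prints the commands (the safe default in this sketch).
# Export RUN as an empty string to really execute them.
RUN="${RUN-echo}"

list_alternatives() {
    # $1 = name of the alternative group, e.g. java
    $RUN update-alternatives --list "$1"
}

set_alternative() {
    # $1 = group name, $2 = full path of the candidate to make the default
    $RUN sudo update-alternatives --set "$1" "$2"
}

list_alternatives java
set_alternative java "$JAVA_PATH"
```

The dry run default is deliberate: switching the system-wide java by accident is not fun, so it is worth looking at the printed commands first.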

Wednesday 20 April 2011

HTML 5

I attended the .WEB day of GIDS (The Great Indian Developer Summit) 2011 edition.


Among many talks, there were two focusing on HTML5: one by Scott Davis (of http://www.thirstyhead.com/) and one by Venkat Subramaniam (of http://www.agiledeveloper.com). Scott's talk was more on the conceptual and capability side of HTML5. Venkat focused more on the implementation and on initiating newbies into HTML5 coding.


Before these discussions, I would not have been able to say much about the capabilities of HTML5. It was more of a buzzword before; now it's another technology holding a lot of promise. I think it says a lot for the two speakers that, within two sessions, they were able to lift the level of know-how around a cutting-edge technology from buzzword to daily use.


Put together, these two talks provided a rather complete picture. They enumerated the benefits, the major improvements, and the new tags that bring so much functionality to native HTML without any need for third-party libraries, plugins etc.


Of course it's cutting edge today, since not all browsers support all of the HTML5 specification. The specification is huge in itself anyway. As one of the speakers put it, the HTML5 spec is a combination of HTML plus all of CSS 3 plus a lot of RIA functionality based on JavaScript libraries. One can say that HTML5 is rather heavy on the browser engine side; however, it intends to provide all the features across browsers (eventually). Since it's a huge spec, not all browsers implement it completely and uniformly.


There will come a time when all the browsers (at least the leading ones) implement it completely (or almost all of it), but till then developers will have to polyfill (polymorphically backfill) the HTML5 functionality for non-supporting browsers. Modernizr, a JavaScript library at www.modernizr.com, is a big help in implementing this transparently.


As Scott very aptly put it, "We'd program for the faster animal in the herd, and allow the rest of the slower ones to polyfill. As and when they catch up with the fastest one, need for polyfill will automatically go away".


From what I have seen of the HTML5 spec (whatever part I have come to know), it looks very interesting and powerful. A lot of functionality that is implemented today with the help of third-party libraries/plugins is going to be implemented natively.


And, let me not forget to mention the single most important innovation that is coming through with HTML5: the semantic web. It's not really a set of tags or something similar, rather a concept. There are tags in the spec that actually indicate the semantics (meaning) of the content. For example, there is a tag called <footer>. This tag won't do much on its own, but when someone is reading the code, or for that matter a parser is going through the code, the tag name already says that it's a footer. The tag name actually means something. This also paves the way for future improvements on the implementation side.


Perhaps a separate post for html5 possibilities for mobile applications, a huge area in itself.


Resources

  1. www.html5rocks.com
  2. www.html5doctor.com
  3. www.diveintohtml5.org -> this is a unique one, a complete book on html 5, which is available free of cost, completely online.  One of the finest resources for html5.
  4. www.html5demos.com
  5. www.modernizr.com  -> JavaScript library for polyfill




Sunday 13 March 2011

Convert Informatica session logs to text/XML format for out-of-tool readability

There was a situation recently when the Informatica repository was not able to point us to the old session logs.
However, the file system still had those files.

But session log files are in binary format by default. Unless you ask for backward compatible session log files, you get binary session log files on the file system. This is done to make it easier for the Informatica support guys to import the session log files.

So, I had a few log files in binary format, and needed to analyze them.
Informatica provides an infacmd subcommand to do that conversion.

The convertLogFile subcommand takes three parameters. The syntax is as follows -


convertLogFile <-InputFile|-in> input_file_name
                 [<-Format|-fm> format_TEXT_XML]
                 [<-OutputFile|-lo> output_file_name]


So, when you launch this, you specify the input file to be converted, the format (TEXT or XML) and the output file that you want as the result of the conversion.

An example call would look like this (assuming the server is on Unix) -

infacmd.sh convertLogFile -in /path/to/binary/format/sesslog/file -fm TEXT -lo /my/home/text/format/sesslog/file
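
When there is a whole directory of binary logs to go through, the same call can be wrapped in a small loop. The following is only a sketch: the directory names are placeholders, the .txt suffix on the output is my own choice, and it assumes infacmd.sh is on the PATH. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: convert every binary session log in a directory to text.
# LOG_DIR and OUT_DIR are hypothetical locations; adjust to your setup.
LOG_DIR="${LOG_DIR:-/path/to/binary/format/sesslogs}"
OUT_DIR="${OUT_DIR:-/my/home/text/format/sesslogs}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

convert_one() {
    # $1 = binary input file, $2 = text output file
    if [ "$DRY_RUN" = "1" ]; then
        echo "infacmd.sh convertLogFile -in $1 -fm TEXT -lo $2"
    else
        infacmd.sh convertLogFile -in "$1" -fm TEXT -lo "$2"
    fi
}

convert_all() {
    for f in "$LOG_DIR"/*; do
        [ -f "$f" ] || continue    # nothing matched, or not a regular file
        convert_one "$f" "$OUT_DIR/$(basename "$f").txt"
    done
}

convert_all
```

Once the printed commands look right, running it again with DRY_RUN=0 does the actual conversion.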



Tuesday 22 February 2011

Just found out about this amazing thing...

Data Wrangler, a research initiative at Stanford University. Wonderfully helpful for analysts.

There is a demo video (Wrangler Demo Video from Stanford Visualization Group, on Vimeo), and you can read more about it at http://vis.stanford.edu/wrangler

Sunday 20 February 2011

Exadata - is it really worth the hype

Well, I am not going to try to answer that; this post is more on the question side...
Recently one of my projects moved to an Exadata machine. There was so much talk that the queries and db processes need not be looked into any more, since Exadata would take care of them already.

However, two things happened. First, there was a technical talk on the device's configuration. The device turned out to be a mammoth piece of hardware. In a nutshell, it's an 8 node cluster, each node having 4 CPUs. Each CPU has around 2-4 GB of RAM. Then there is high speed secondary storage which can hold a lot of cache.

The nodes are interconnected using a special switch which can transfer data faster than Gigabit networks.

With such a hardware configuration, any software could claim that kind of performance gain. Not to undermine the gains, I just want to say that the hype around the out-of-this-world performance is actually the result of better hardware, not really of revolutionary software.

I was personally expecting something of this kind from Oracle, since they lack in that area. Except for Teradata, there is almost no player who delivers that kind of data warehouse architecture and performance, and I was hoping that Oracle would bring out something there.

And the second thing: one of our processes tried to load data into an Exadata instance using Informatica. Initially we left things at their defaults, so that Exadata could tune itself and we would not force anything. However, there too Exadata failed big time and couldn't deliver any performance gain. In the end, all the tuning had to be done by us.

So, the other claim about Exadata, the intelligence to pick up processes and fix them on its own, also fell flat for us.

Though I agree that it is still rather early in its evolution, I believe Oracle marketing should be doing a better job. :)

Thursday 2 December 2010

Rejecting records in Fact table loads - Informatica

In some development environments you don't have all the required dimension data, and as a result the test runs of your fact loading mapping go for a toss. The mapping won't be able to load anything (it rejects everything), since one foreign key or another would be missing for each row.

In other words, only those records are loaded for which ALL the foreign key constraints are satisfied. In a Production environment, missing keys would almost never happen. And even if they did, we'd actually want those rows to be rejected.

In development, though, this can be a serious impediment to unit testing. It prevents the developer from seeing whether or not his mapping is behaving appropriately for the happy flow functionality.

One way around this is to use a mapping variable indicating the environment. The developer can run with a value like 'D', indicating an environment other than Production, with 'P' (Production) being the default value.

Now in the mapping, just where you decide to reject a record based on the different conditions, you put an AND condition involving this mapping variable, e.g. ....AND $$MAPP_ENV = 'P'

The reject expression would then return true only in the Production environment, and therefore would work as expected there. In Dev, the expression would return false and the row would not be rejected.

Now, to satisfy the db constraints on the fact table so that the row is actually inserted, you need some placeholder convention. One approach is to use an outlier value as the foreign key value. For example, for customer id, keep a -1 row in the dimension table, meaning "Undefined". And in all such dev cases, send -1 to the fact.

This serves both purposes: it tells your fact table that there is something different about that row, and it still inserts the row, so that testing of the rest of the columns is not blocked by one missing foreign key.
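
To make the decision logic concrete, here it is as a tiny shell function. This is not Informatica expression syntax, just a sketch of the same rule in shell: the function name resolve_key and the REJECT marker are made up for illustration, and MAPP_ENV simply mirrors the $$MAPP_ENV mapping variable.

```shell
#!/bin/sh
# Sketch of the reject rule, NOT Informatica syntax.
# resolve_key and the "REJECT" marker are hypothetical names.
# $1 = foreign key found by the dimension lookup (empty if not found)
# $2 = environment flag, mirrors $$MAPP_ENV; defaults to P (Production)
resolve_key() {
    key="$1"
    MAPP_ENV="${2:-P}"
    if [ -n "$key" ]; then
        echo "$key"          # lookup succeeded, load the row as usual
    elif [ "$MAPP_ENV" = "P" ]; then
        echo "REJECT"        # Production: a missing key rejects the row
    else
        echo "-1"            # Dev: fall back to the "Undefined" dimension row
    fi
}

resolve_key "42" "P"
resolve_key "" "P"
resolve_key "" "D"
```

The three calls cover the three cases: a key that was found, a missing key in Production, and a missing key in Dev.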

Thursday 18 November 2010

Setting a useful command prompt in Unix

I just came across a Unix system at my workplace that had a static prompt set. Basically, the prompt was just the shell executable's name and version, like this:

bash-3.2 $

Well, this kind of prompt has many drawbacks. I'll list some of them here -
1. You never know (at a glance) where you are in the file system. When you are working across multiple directories, you don't want to keep typing pwd to figure out your current location.

2. You never know which user you are logged in as (again, just by looking at the prompt). You'd have to run whoami to figure that out.

3. Most importantly, if you are dealing with multiple systems, you never know which system you are logged in to right now. You'd have to issue a hostname command.

There are surely many other consequences of having such a cryptic command prompt. Therefore my favourite is a command prompt that always, dynamically, displays at least these three things...

Something like, 

raghav@deskubuntu:/homes/raghav/rails $ 

wherein I am always aware of the three things mentioned earlier. This is very useful when you are dealing with multiple systems and you have multiple users who are configured to run different types of processes. For example, an oracle user who is supposed to be the owner of the Oracle processes, an informatica user who is supposed to own everything linked to Informatica, and then a Connect Direct user who owns the CD processes, which receive files coming in from some other system.

With this kind of setup, and your own user id to log in to the system, you had better be careful about which processes you are looking at or launching, and as which user. It's really very important.

If you are dealing with a multiple system scenario, like dev / test / acceptance / production, you'd be well advised to use a prompt like this.

The magic lies in setting the appropriate flags and text in an environment variable called PS1.

Just set PS1 in your .profile or .bashrc file (depending on your environment) and you are all set.

The example prompt that I mentioned can be achieved by saying -

export PS1="\\u@\\h:\\w \\$ "

PS1 understands many more special sequences. Here are some of them -

    * \$ : if the effective UID is 0, a #, otherwise a $
    * \[ : begin a sequence of non-printing characters, which could be used to embed a terminal control sequence into the prompt
    * \\ : a backslash
    * \] : end a sequence of non-printing characters
    * \a : an ASCII bell character (07)
    * \@ : the current time in 12-hour am/pm format
    * \A : the current time in 24-hour HH:MM format
    * \d : the date in "Weekday Month Date" format (e.g., "Tue May 26")
    * \D{format} : the format is passed to strftime(3) and the result is inserted into the prompt string; an empty format results in a locale-specific time representation. The braces are required
    * \e : an ASCII escape character (033)
    * \H : the hostname
    * \h : the hostname up to the first '.'
    * \j : the number of jobs currently managed by the shell
    * \l : the basename of the shell’s terminal device name
    * \n : newline
    * \nnn : the character corresponding to the octal number nnn
    * \r : carriage return
    * \T : the current time in 12-hour HH:MM:SS format
    * \t : the current time in 24-hour HH:MM:SS format
    * \u : the username of the current user
    * \s : the name of the shell, the basename of $0 (the portion following the final slash)
    * \V : the release of bash, version + patch level (e.g., 2.00.0)
    * \v : the version of bash (e.g., 2.00)
    * \W : the basename of the current working directory
    * \w : the current working directory
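
Several of the sequences above can be combined. As a small sketch for a bash .bashrc, the prompt below adds the time and some colour to the user@host:directory prompt. The colour codes (32 for green, 34 for blue) are just my arbitrary pick. Note how each colour code sits between \[ and \], so that bash does not count it towards the length of the prompt.

```shell
# Sketch for ~/.bashrc (bash): time, user@host and working directory, in colour.
# The colour choices are arbitrary; \[ ... \] marks them as non-printing.
GREEN='\[\e[32m\]'
BLUE='\[\e[34m\]'
RESET='\[\e[0m\]'

# \t = 24-hour time, \u = user, \h = short hostname, \w = working directory
export PS1="\t ${GREEN}\u@\h${RESET}:${BLUE}\w${RESET} \\$ "
```

A handy extension of the same idea is a different colour per environment (say, red on production), so the prompt itself warns you where you are.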

Tuesday 16 November 2010

scripts and hash bang ( #! )

More often than not, people have to tell the system where the interpreter for their Unix shell / Perl scripts or other programs lies, e.g. they write their command line calls as

perl SomeScript.pl

or 

ruby ARubyProgram.rb

or 

sh SomeShellScript.sh

This is because the system may not be aware of the location of the exact interpreter executable that needs to be used for the corresponding script. For this purpose, Windows has the file extension association concept, but we are dealing with Unix-like systems, not Windows, so that option is not really available to us (besides, that convention has its ill effects too, but let's not go into that discussion).

So, to tell a Unix script where to find its interpreter, besides naming the interpreter along with the script on the command line, there is another way, and a rather beautiful one at that.

Just put the exact path of your interpreter executable on the very first line of your script, preceded by these two magic characters, a hash and an exclamation mark (#!), also called hash-bang or shebang. Now, once your script is marked as executable (see chmod), you are good to go, with no need for explicit calls to the interpreter to run your code.

Basically, your code should now look like this - 

#!/usr/local/bin/perl5
print "testing hashbang with raghav\n";

Save this short script as aa.pl (assuming that your system has the Perl 5 interpreter installed in the location I used). Make the script executable (chmod) and you can just launch the script as ./aa.pl instead of the earlier perl aa.pl

A word of caution though: these magic characters have to be absolutely the first and second characters of the file, no exceptions. Otherwise the system can't make out their special meaning and the purpose is lost.
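
The whole cycle (write the script, mark it executable, launch it directly) can be tried out in one go. Here is a minimal sketch that uses /bin/sh as the interpreter, so that it runs even where Perl is not installed. The file name hello.sh and the temporary directory are only for illustration.

```shell
#!/bin/sh
# Sketch: create a tiny script with a shebang line, make it executable, run it.
# hello.sh and the temp directory are hypothetical names for this demo.
workdir=$(mktemp -d)
script="$workdir/hello.sh"

# The shebang must be the very first line; "#!" are the first two bytes.
printf '%s\n' '#!/bin/sh' 'echo "testing hashbang"' > "$script"

chmod +x "$script"       # mark the script as executable
output=$("$script")      # launched directly, no explicit "sh hello.sh"
echo "$output"

# Peek at the first two bytes of the file to see the magic characters
first_two=$(head -c 2 "$script")
echo "$first_two"
```

If the exact location of the interpreter differs from machine to machine, a commonly used variant of the first line is #!/usr/bin/env perl, which looks perl up on the PATH instead of hard-coding its path.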

Pretty neat.. hunh...

Monday 1 November 2010

A Note to New Consultants

I am about to start a new role, that of a consultant at a new customer site. To prepare for that mentally, I was looking around for inspiration and advice. In the process, I stumbled across this gem from the founder of Boston Consulting Group. I have picked up the text from this webpage (http://www.careers-in-business.com/consulting/hendnote.htm), and then tried to see it through my own eyes. All credit goes to the original webpage owners.

Written by Bruce Henderson, Founder of Boston Consulting Group in the 1970s

In a sense the consultant's role is a paradox. He gives advice to people of equal intelligence who have vastly superior and extensive experience and knowledge of the problem. Yet he is not necessarily an expert in anything. What is the justification for his value?

       Need of Consultant
  1. The consultant can function as a specialist or expert. In this role he must be more knowledgeable than the client. This implies a very narrow field of specialization, otherwise the client with his greater continuity of experience would be equally expert.

    The consultant can function as a counselor or advisor on the process of decision making. This implies an expertise of a special kind, that of the psychotherapist. This is merely a particular kind of expertise in a particular field.

    The most typical role for a consultant is that of auxiliary staff. This does not preclude any of the other roles mentioned before, but it does require a quite different emphasis.

    All companies have staff capabilities of their own. Some of this staff is very good. Yet no company can afford to have standby staff adequate for any and all problems. This is why there is an opportunity for consultants. They fill the staff role that cannot be filled internally.

    By definition this means that consultants are most useful on the unusual, the non-recurring, the unfamiliar problem. Outside consultants are also most useful where the problem is poorly defined and politically sensitive, but the correct decision is extremely important. Outside consultants get the tough, the important and the sensitive problems.

    The natural function of a consultant is to reduce anxiety and uncertainty. Those are the conditions under which anxiety and uncertainty are greatest and where consultants are most likely to be hired.

    Problem Definition
  2. If this point of view is our starting point, then problem definition becomes extremely important.
    • If the problem is incorrectly defined, then even its complete solution may not satisfy the client's perceived needs.
    • If the problem is improperly defined, it may be beyond our ability to solve.
  3. Problem definition is a major test of professional ability. Outside consultants can frequently define problems in a more satisfactory fashion than internal staff, primarily because they are unencumbered with the historical perspective of the client and the resulting "house" definition.

    A consultant's problem definition is the end of the assignment if the problem is not researchable. If the problem is not researchable, then the consultant is either a specialist-expert or a psychotherapist. Neither of these roles are suitable for the use of the resources of an organization such as The Boston Consulting Group.

    A researchable problem is usually a problem that should be dealt with by a group approach. Data gathering and analysis requires differing skills and different levels of experience that can best be provided by a group. The insights into complex problems are usually best developed by verbal discussion and testing of alternate hypotheses.

    Good research is far more than the application of intellect and common sense. It must start with a set of hypotheses to be explored. Otherwise, the mass of available data is chaotic and cannot be referenced to anything. Such starting hypotheses are often rejected and new ones substituted. This, however, does not change the process sequence of hypothesize / data gathering / analysis / validation / rehypothesize.

    Great skill in interviewing and listening is required to do this. Our client starts his own analysis from some hypothesis or concept. We must understand this thoroughly and be able to play it back to him in detail or he does not feel that we understand the situation. Furthermore, we must be sure that we do not exclude any relevant data that may be volunteered. Yet we must formulate our own hypothesis.

    Finally, we must be able to take our client through the steps required for him to translate his own perspective into the perspective we achieve as a result of our research. This requires a high order of personal empathy as well as developed teaching skills.

  4. The end result of a successful consulting assignment is not a single product. It is a new insight on the part of the client. It is also a commitment to take the required action to implement the new insights. Equally important, it is an acute awareness of the new problems and opportunities that are revealed by the new insights.

    We fail if we do not get the client to act on his new insights. The client must implement the insights or we failed. It is our professional responsibility to see that there is implementation whether we do it or the client does it.

    Much of the performance of a consultant depends upon the development of concepts that extend beyond the client's perception of the world. This is not expertise and specialization. It is the exact opposite. It is an appreciation of how a wide variety of interacting factors are related. This appreciation must be more than an awareness. It must be an ability to quantify the interaction sufficiently to predict the consequences of altering the relationships.

    Consultants have a unique opportunity to develop concepts since they are exposed to a wide range of situations in which they deal with relationships instead of techniques. This mastery of concepts is probably the most essential characteristic for true professional excellence.

    A successful consultant is first of all a perceptive and sensitive analyst. He must be in order to define a complex problem in the client's terms with inadequate data. This requires highly developed interpersonal intuitions even before the analysis begins.

    His analytical thinking must be rigorous and logical, or he will commit himself to the undoable or the unuseful assignment. Whatever his other strengths, he must be the effective and respected organizer of group activities which are both complex and difficult to coordinate. Failure in this is to fall into the restricted role of the specialist.

    [raghav] This is the first time I have read that a specialist role can be restrictive, and honestly, when you think about it again, it does come across as a correct statement, especially in the wider world of other opportunities. Especially for a management consultant.

    In defining the problem, the effective consultant must have the courage and the initiative to state his convictions and press the client for acceptance and resolution of the problem as defined. The client expects the consultant to have the strength of his convictions if he is to be dependent upon him. Consultants who are unskilled at this are often liked and respected but employed only as counselors, not as true management consultants.

    The successful professional inevitably must be both self-disciplined and rigorous in his data gathering as well as highly cooperative as a member of a case team.

    The continuing client relationship requires a sustained and highly developed empathy with the client representative. Inability to do this is disqualifying for the more significant roles in management consulting.
In other words, the successful consultant:
  • Identifies his client's significant problems;
  • Persuades his client to act on the problems by researching them;
  • Organizes a diversified task force of his own firm and coordinates its activity;
  • Fully utilizes the insights and staff work available in his client's organization;
  • Uses the full conceptual power of his own project team;
  • Successfully transmits his findings to the client and sees that they are implemented;
  • Identifies the succeeding problems and maintains the client relationship;
  • Fully satisfies the client expectations that he raised;
  • Does all these things within a framework of the time and cost constraints imposed by himself or the client.

Friday 1 October 2010

my experiments with solr :)

It's a catchy title, but yes.. that's what I am going to talk about...

I came across Hadoop when I was looking for a new solution for one of our in-house projects. The need was quite clear; however, the solution had to be dramatically different.

The one statement we received from the business was, "We need an exceptionally fast search interface". And for that fast interface to search upon, they had more than a hundred million rows worth of data in a popular RDBMS.

So, when I set about thinking how to make a fast search application, the first thing that came to my mind was Google. Actually, whenever we talk about speed or performance of web sites, Google is invariably the first name that comes up.


Further, with Google there is always some activity at the back end to generate the page or results that we see; it's never static content. And then, another point: Google has a few trillion pieces of information to store/index/search, whereas our system was going to have a significantly lower volume of data to manage. So, going with that, Google looked like a very good benchmark for this fast search application.

Then I started to look for "how Google generates that kind of performance". There are quite a few pages on the web talking about just that. But probably none of them has the definitive/authoritative view of Google's technology, or for that matter an insider's view of how it actually does what it does so fast.

Some pages pointed towards their storage technology, some talked about their indexing technology, some about their access to huge volumes of high performance hardware and what not...

For me, some of them turned out to be genuinely interesting. One of them was the indexing technology. There has to be a decent indexing mechanism which the crawlers feed and the search algorithms hit. Storage efficiency is probably the next thing to come into play: how fast can they access the corresponding item?

Another observation of mine is that the search results (the page listing page titles and such) come back real fast, mostly in less than 0.25 seconds, but a click on one of the links does take some time. So, I think it has to be their indexing methodology that plays the bigger role.

With that in mind, I set about finding what can do similar things, and how much of Google's behaviour it can simulate/implement.

Then I found the Hadoop project at Apache (http://hadoop.apache.org/), which to a large extent reflects the way a Google kind of system would work. It provides distributed computing (Hadoop core), a BigTable kind of database (HBase), a map/reduce layer, and more. Reading into it more, I figured out that this system is nice for a batch processing kind of mechanism, but not for our need of real time search.

Then I found Solr (http://lucene.apache.org/solr/), a full text search engine under Apache Lucene. It is a genuinely fast search engine, written in Java, with XML-based indexing. It provides many features that we normally wish for in more commercial applications, and being from Apache, I would like to think of it as much more reliable and stable than many others.

When we set about doing a Proof of Concept with it, I figured out a few things -

•    It supports only one schema, which in RDBMS terms means only one table. So, basically you have to denormalize all your content to fit into this one flat structure.
•    It supports interaction with the server only through HTTP, be it the standard methods (GET/PUT etc.) or REST-like interfaces.
•    It allows you to load data in varying formats: through XML documents, through delimited files, and through db interactions as well.
•    It has support for clustering as well. Either you can host it on top of something like Hadoop, or you can configure clustering within Solr itself.
•    It supports things like expression and function based searches.
•    It supports faceting.
•    It has extensive caching and “partitioning” features.
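
Since everything goes over HTTP, a search is just a URL. The sketch below builds a select URL with faceting switched on. The host, port and the field names (name, city) are assumptions for illustration and not from our actual schema; q, rows, facet, facet.field and wt are standard Solr request parameters.

```shell
#!/bin/sh
# Sketch: build a Solr select URL with faceting.
# Host, port and the field names (name, city) are hypothetical.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/select}"

build_query() {
    # $1 = query (must already be URL-safe), $2 = field to facet on
    echo "${SOLR_URL}?q=$1&rows=10&facet=true&facet.field=$2&wt=json"
}

url=$(build_query "name:raghav" "city")
echo "$url"

# With a Solr instance running, the query can then be sent with curl:
# curl -s "$url"
```

The response carries both the matching documents and the facet counts per city, which is what makes Solr handy for drill-down style search screens.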

Besides the other features, the kind of performance it gave without any specific tuning effort made me think of it as a viable solution.

In a nutshell, I loaded around 50 million rows on an “old” Pentium-D powered desktop box with 3 GB RAM, running Ubuntu 10.04 server edition (64 bit), with two local hard disks configured under a logical volume manager.

The loading performance was not that great, though it's not that bad either. I was able to load a few million rows (in a file of about 6 GB) in about 45 minutes when the file was on the same file system.

In return, it gave me query times in the range of 2-4 seconds for the first query. For subsequent re-runs of the same query (within a span of an hour or so), it came back in approx 1-2 milliseconds. I would like to think that this is pretty great performance given the kind of hardware I was running on, and the kind of tuning effort I put in (basically none, zero; I just ran the default configuration).

Given that, I won't say that I have found the equivalent of or a replacement for Google’s search for our system, but yeah, we should be doing pretty well with this.

Although more testing and experimentation is required to be able to judge Solr better, the initial tests look pretty good.. pretty much in line with the experiences of others who are using it.

Monday 13 September 2010

Business & Open Source - How both can benefit

I had the opportunity to scout for a new technology/solution for one of our in-house projects.  Quite a few of the options that I looked at were from the open source arena.  And amazingly, the products were far more capable than we expected, just that we'd have to pitch in with some effort to get them working for us.

I have always felt that for open source projects/products to become commercially viable for a business enterprise, the enterprise has to step up and spend some resources on them to get the actual value out of them.

In other words, if an organization wants to use an open source product which has an equivalent competing commercial product available in the market, it should be open enough to have its own in-house people who can take ownership of the installation. The organization shouldn't completely rely on the support available from the community forums and such.

I have seen more than one manager complain about the lack of support for open source products.  Had there been a proper support system for each of the open source products, we'd see a lot of stories similar to mysql's model or pentaho's model.

What I would like to see perhaps is organizations becoming mature enough in their adoption of open source products. By that, I expect them to have an open vision, have people who understand and like and own the product, and at the same time tweak and tune the product to suit the organization's business needs.

In the process, the organization should contribute to the product's development cycle.  This could happen in many ways: bug fixes, contribution of new features, or employees contributing on community forums and such.  Using the terminology from peer to peer sharing, leechers alone don't help a torrent; people need to seed it as well. In the same way, unless organizations contribute to an open source product, they stand to become only leechers.

Only when we have a decent balance of organizations using and contributing to the open source products will we see the ecosystem flourishing...

Thursday 9 September 2010

Tips for brainstorming...

Interesting read, from both positive and negative viewpoints -

1. Use brainstorming to combine and extend ideas, not just to harvest ideas.

2. Don't bother if people live in fear.

3. Do individual brainstorming before and after group sessions.

4. Brainstorming sessions are worthless unless they are woven with other work practices.

5. Brainstorming requires skill and experience both to do and, especially, to facilitate.

6. A good brainstorming session is competitive—in the right way.

7. Use brainstorming sessions for more than just generating good ideas.

8. Follow the rules, or don't call it a brainstorm.

Read more here - http://www.businessweek.com/innovate/content/jul2006/id20060726_517774.htm?chan=innovation_innovation+++design_innovation+and+design+lead

in reference to: Eight Tips for Better Brainstorming

Wednesday 8 September 2010

Big help...

I wanted to get my table sizes in infobright, and this page came to my rescue...

-- Per-table sizes in MB for one schema, largest data size first
SELECT table_schema, table_name, engine, table_rows, avg_row_length,
       (data_length + index_length) / 1024 / 1024 AS total_mb,
       data_length / 1024 / 1024 AS data_mb,
       index_length / 1024 / 1024 AS index_mb,
       CURDATE() AS today
FROM information_schema.tables
WHERE table_schema = 'mySchemaName'
ORDER BY data_mb DESC;  -- data_mb is column 7 of the select list

Thanks Ron...
in reference to: Calculating your database size | MySQL Expert | MySQL Performance | MySQL Consulting

Wednesday 25 August 2010

I also feel like saying, 1984...

This story appeared in Economic Times, wherein Apple claims to have developed (or is busy developing) a way to capture sensitive info about an iphone user. With that, Apple intends to block usage of the iphone by an "unauthorized" user. This "unauthorized" reportedly includes -

1. an iphone that has been hacked to work outside the contract with which it was sold, read "jailbroken"

2. an iphone that is perhaps being used by someone other than the person who registered the first heartbeat or facial recognition info..

Apple intends to capture the phone location using GPS/other tech and perhaps control the device remotely if they feel it's being used "unauthorized"..

I agree with people who are reminded of 1984 after reading about Apple's intentions... ha.. time does come back... George Orwell.. were you too right ??
in reference to: Apple to make iPhone theft-proof - Hardware - Infotech - The Economic Times

Monday 2 August 2010

Country General Mood using Tweets

Well, it sure is pretty fascinating to do that kind of study and come back with results as commonsensical as we see here...

http://www.iq.harvard.edu/blog/netgov/2010/07/mood_twitter_and_the_new_shape.html


I quote - (with all credit where it's due, none to me...)


A group of researchers from Northeastern and Harvard universities have gathered enough data from Twitter to give us all a snapshot of how U.S. residents feel throughout a typical day or week.

Not only did they analyze the sentiments we collectively expressed in 300 million tweets over three years against a scholarly word list, these researchers also mashed up that data with information from the U.S. Census Bureau, the Google Maps API and more. What they ended up with was a fascinating visualization showing the pulse of our nation, our very moods as they fluctuate over time.

The researchers have put this information into density-preserving cartograms, maps that take the volume of tweets into account when representing land area. In other words, in areas where there are more tweets, those spots on the map will appear larger than they do in real life.


An apparently public domain result of the analysis is available here -
http://cdn.mashable.com/wp-content/uploads/2010/07/twitter-moods.jpg

Wednesday 28 July 2010

Oracle Count(1) vs Count(*)

Well, it might have been an everlasting discussion about which one of these to use, count(1) or count(*).

I guess this article by Thomas Kyte already clarified the situation long long ago (well, for the IT industry 2005 is long ago anyway, especially given the speed at which we are moving).

Essentially, what askTom says is that count(*) is better than count(1), since count(1) translates to count(*) internally anyway. I wonder, then, why someone would want to use count(1) at all.

With count(1), there is at least one more step involved in getting to the actual result. And there is another wrinkle: count(1) has to evaluate an expression as well, in effect "count the rows where 1 is not null". Though that is a tautology, it has to be evaluated nonetheless.

Further, there was some misconception about how the result is returned, whether it's read from the data dictionary, some view or table or something like that. I don't think so. The result is calculated at run time, when the query is run, and it actually goes ahead and counts the records in the table.
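A quick way to see the semantics for yourself is a small sketch like the one below. Note the assumptions: it uses Python with an in-memory SQLite database (not Oracle), and the table t and its columns are made up for illustration. So it says nothing about Oracle's internal rewrite of count(1); it only shows that both forms count every row, that counting a nullable column skips NULLs, and that the count is computed afresh when the query runs.

```python
import sqlite3

# In-memory SQLite stands in for a real database here (assumption: not Oracle).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, note TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, None), (3, "c")])


def count(expr):
    """Run SELECT COUNT(<expr>) against the hypothetical table t."""
    return conn.execute(f"SELECT COUNT({expr}) FROM t").fetchone()[0]


count_star = count("*")     # counts every row
count_one = count("1")      # the constant 1 is never null, so also every row
count_note = count("note")  # skips the row where note IS NULL
print(count_star, count_one, count_note)

# The count is computed when the query runs: add a row and it changes.
conn.execute("INSERT INTO t VALUES (4, 'd')")
count_after_insert = count("*")
print(count_after_insert)
```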

Should set the record straight...

in reference to: Ask Tom "COUNT(*) Vs COUNT(1) on tables with CLOB..."

Saturday 24 July 2010

Developing a Rails application using an existing database

This is the latest challenge for me. A database exists with real data in there, and I have to develop a rails application around that.

Initially we needed the basic CRUD screens for some tables.  Being lazy (I'm really proud of that), I set out to find whether there is a solution that generates the forms (read views) for the existing tables/models.

I have already managed to generate models/schema.rb using another gem, called magic_model.  Read more about that here

Then google helped me find another gem called scaffold_form_generator, which generates the necessary views/forms for a given model.  However, it needs some improvements (I think).  Perhaps I will contribute something (if I find out enough about how to do that).

Well, for the moment, I am struggling with handling the special-meaning columns that are missing from the legacy tables. Will continue writing on this...