Friday 4 January 2013

Finding Informatica domain name

Recently I came across a situation where the customer had given us the Informatica server hostname, but neither the domain name nor the domain port.

We lost quite some time figuring out how to get at the domain configuration. That's when I started thinking about alternative ways of finding the domain name from the system itself (assuming different access levels).

If you have access to the Informatica repository database, you can use the following query to get the domain name out.

select pos_name
from PO_DOMAINSERVICECONFIG

This SQL needs to be run in the schema where the domain repository has been created.

Alternatively, if you have no database access but can log in to the Informatica server, the domains.infa file in the $PM_HOME equivalent directory provides the domain name, port, etc.

About the port

Though the installation process allows customization of the domain port, many installations keep the default. In any case, a simple telnet to the host on the suspected port will confirm whether the port is open or not.

In my example situation, it turned out to be the default 6005.
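
If telnet is not installed on the box you are working from, the same check can be scripted. Here is a minimal sketch in Python; the function name port_is_open and the 3 second timeout are my own choices for illustration, and it only tells you that something accepts TCP connections on that port, not that it is actually the Informatica domain.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the same TCP handshake that telnet would.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hostnames.
        return False
```

Calling port_is_open with your own hostname and 6005 would then be the scripted equivalent of the telnet test above.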

Thursday 14 June 2012

hadoop/hive with Tableau


It was in 2010 that I had my first taste of hadoop/hive. Back then I was still using hadoop 0.20 and was doing a proof of concept for a customer who wanted to see if hadoop could be a solution for their problems.

Since then, I have been reading up and following the changes in the hadoop world, and tweaking things here and there on my home installation. Today, I tried to mount hive on hadoop (without hbase; with hbase will be the next experiment) to see how I could get it playing nicely.

The experience was awesome as usual, and it reinforces my belief that the hadoop ecosystem has a huge role to play in the computing industry of tomorrow.

The analytical potential of the volumes of data managed by hadoop-like systems is ever increasing, hence the interest from many instant BI players in providing access to the data behind hadoop.

One such player is the instant dashboard tool Tableau. They have announced that Tableau 7 will be able to read data directly from hive environments.

In real life it was a bit of a challenge, but what's the fun if there is no challenge? In a nutshell, it does work, no doubt. However, the configuration and administration required can be tricky.

1. You have to install the hive driver (available from their website - http://www.tableausoftware.com/support/drivers).

2. You have to launch hive in a particular way, as a service (hive --service hiveserver). Also, hive on a pseudo cluster only allows one user to be connected (since the default Derby metadata store is single user). As a result, if you are using the Tableau connection, nothing else can access hive, not even the command line interface.

3. Remember that each addition/change to the data set in the Tableau interface triggers a map-reduce job on the hive cluster/server, and that hadoop/hive are not really meant to be fast, responsive systems. Therefore, expect long delays in fulfilling your drag and drop requests.

4. There might/will be additional trouble in aggregating certain types of data, since the data types on hive might not be additive in the way the front end expects them to be.

All in all, it wins me over with the ease of use it provides for accessing the data behind the hadoop environment; however, faster ways already exist to achieve the same result.

Saturday 5 May 2012

Mounting an aws instance into Nautilus in ubuntu

For a particular project I needed to transfer files from my local desktop system (running Ubuntu) to my aws instance, also running Ubuntu. Shell based connections work just fine for me; however, I wanted a GUI solution so that I could quickly drag and drop files between the two machines.

Although Nautilus provides a "Connect to Server" option, which you can bookmark as well, the issue is that aws instances use private/public key authentication, not password based authentication.

For some reason the "Connect to Server" option doesn't let you specify a key file, and therefore that authentication method doesn't work there.

After quite a bit of research and googling, I put this question up on stackoverflow.com, where I got a prompt and sweet response.

The trick is to use the .ssh/config file and create an alias there, specifying the identity file. From then on, any ssh or sftp call to that alias just works, since ssh picks up the settings from the config file automatically.

Here are the sample contents of the .ssh/config file that worked for me -

Host {alias}
HostName {hostname or IP address}
User {username to log in with}
IdentityFile {path to abcd.pem}

With this in place, it's possible for me to have a bookmark, and clicking on it just opens the target location (like a mount point). Sweet little trick, and it makes me like Ubuntu a bit more...

Sunday 29 April 2012

Starting off with Rails 3.2.2 on Ubuntu 11

After a long time, I picked up building a web application for a friend. I haven't worked on web apps for about 6-8 months now, and was quite rusty.

Ruby on Rails was a natural choice. However, when I issued the "gem update", it turned out that rails is on 3.2.2 these days, which means there have been at least 4 releases since I last worked with rails.

I typically learn while working. Therefore, I just started building a new application and generated a default scaffold (I know, it's an old habit of testing an installation using scaffold)... scaffold generated, which means that backward compatibility is still alive. Server started.. and bang, launched chrome to see what localhost:3000 looks like...

It bombed... there was this "ExecJS::RuntimeError" staring me in the face...

It turned out that Ubuntu was the culprit: a JS runtime engine is required, Ubuntu doesn't provide one by default, and so nodejs needs to be installed.

After "sudo apt-get install nodejs", it's all smooth and shiny...

Will share further...

Thursday 12 April 2012

App vs web browser based access to websites from devices

Every other website these days launches its own app as soon as it finds a decent following among customers. However, there are some things we need to watch out for when using apps as against browsing the same content in a web browser.

A browser is a relatively safe sandbox when it comes to executing website content and rendering it. There really has to be a loophole in the browser engine for a website to exploit it and do weird things to your device, be it a phone, tablet or laptop/desktop.

On the other hand, when we install "apps" for websites, we grant them "permissions" to do things on our devices. This works on a foundation of trust: I trust the website, and therefore I trust their app to not do anything untoward to my device. This trust can be unfounded in some cases, and lead to unknown actions/behaviours from apps.

Among the benefits, apps do provide a better user experience (in most cases) since the rendering is specific to the device. Apps also send a lot more customised user interaction information back to their base websites, giving them more accurate and contextual intelligence about usage. On the flip side, there are reports of apps stealing private information from devices.

Most of the time, the reason an app gains access to information is the granularity of the access control used on the device. If it's too fine, the list of permissions to be granted is huge; if it's too coarse, you can grant too much access without intending to. This is where the maturity of the device operating system comes in.

With web browsers, the information sent back for analytics purposes is rather generic, since it comes from the browser sandbox.

I believe it's safe to say that using apps is a bit of a trade-off between user experience and the safety/privacy of the user. Let's be a bit more careful about which apps we download and use, and what permissions each app needs. Let's just be a bit more skeptical and end up being safer for it, hopefully.

PS - There are a lot of other comparisons that already exist; however, it's hard to say how many of them talk about security aspects. User experience is one major discussion point, for sure. Try Googling it.

Tuesday 3 April 2012

time zone conversion in Oracle

Oftentimes we need to see a given timestamp column in a particular time zone. Oracle allows a very simple way of doing this, without any casting -

SELECT <column-name> AT TIME ZONE <time zone> FROM <table-name>;

This method saves the expensive cast operations.

Informatica Questions from a friend - Part 2 - Schedulers

Need of Scheduling and Commonly used Schedulers
 
Every data warehousing environment needs some kind of scheduler setup so that jobs can run at periodic intervals without human intervention. Another important feature is the repeatability of the jobs set up this way. Without the help of a scheduler, things would become very ad hoc and thus prone to errors and mess-ups.

Oracle provides a built-in scheduling facility, accessible through its dbms_scheduler package. Unix provides a basic scheduling facility through cron. Similarly, Informatica also provides basic scheduling facilities in the Workflow Manager client.

The features provided by these scheduling tools are fairly limited, often to launching a job at a given time, providing basic dependency management etc.

However, in real-world data warehousing solutions, the required functionality is a lot more sophisticated than what's offered by these basic features. Hence the need for full-fledged scheduling tools, e.g. Tivoli Workload Scheduler, Redwood Cronacle, Control-M, Cisco Tidal etc.

Most of these tools provide sophisticated launch control and dependency management features, and therefore allow the data warehouse to be instrumented at finer levels.

Some of the tools, e.g. Tidal for Informatica and Redwood for Oracle, also support the corresponding tool's API, and therefore integrate even better with it.

Friday 30 March 2012

Informatica Questions from a friend - Part 1

HOW TO USE A COBOL FILE FOR TRANSFORMATION
 
Informatica allows reading data from COBOL copybook formatted data files. These files mostly come from mainframe based source systems. Given that many of the world's leading businesses, e.g. airlines, banks, insurance companies etc, still run on IBM mainframes, these systems act as a major source of information for data warehouses, and thus for our Informatica mappings.

To use a COBOL copybook structure as a source, you have to put that copybook in an empty skeleton COBOL program:

IDENTIFICATION DIVISION.
PROGRAM-ID. RAGHAV.

ENVIRONMENT DIVISION.
SELECT FILE-ONE ASSIGN TO "MYFILE". 

DATA DIVISION.
FILE SECTION.
FD FILE-ONE.

COPY "RAGHAV_COPYBOOK.CPY".

WORKING-STORAGE SECTION.

PROCEDURE DIVISION.

STOP RUN. 

The copybook file can be a plain record structure.
Read more about defining copybooks around here.

 
 
 

Thursday 29 March 2012

Counting columns in a tab delimited file


It sounds so simple; however, when you sit down to write this, especially as a single line expression, it can take a while.

In my experiments I found it rather easier to count columns with other delimiters than with the TAB character.
Here is the command for counting columns. It prints the number of columns for every line of the file.

cat <FILENAME> | awk -F'\t' '{print NF}'

cat can be slow at times, especially with larger files, therefore an alternative without it...

awk -F'\t' '{print NF}' <FILENAME>
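
For files where different lines may have different numbers of columns, a summary is often more useful than one number per line. Below is a rough Python sketch of the same idea; the function name column_counts is just something I made up for illustration. It treats an empty line as zero columns, which is what awk reports as well.

```python
from collections import Counter


def column_counts(lines):
    """Count how many lines have each number of tab-separated columns."""
    counts = Counter()
    for line in lines:
        # Strip only the line terminator so trailing empty columns still count.
        text = line.rstrip("\r\n")
        # awk reports NF=0 for an empty line, so mirror that here.
        n_fields = len(text.split("\t")) if text else 0
        counts[n_fields] += 1
    return counts
```

It accepts any iterable of lines, so an open file handle can be passed in directly. The result maps a column count to the number of lines having that many columns, which makes malformed rows easy to spot.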


 

Wednesday 14 March 2012

how to find sql id of a long running sql in oracle


SQL for finding the sql id and other details of a long running query. Often useful for sending kill/monitoring instructions to DBA friends..

select distinct t.sql_id, s.inst_id, s.sid, s.serial#, s.osuser, s.program, s.status, t.sql_text
from gv$session s, gv$sqlarea t
where s.username = '<USERNAME>'
and s.inst_id = t.inst_id
and s.sql_id = t.sql_id
and t.sql_text like '%<segment of the sql text to identify it>%'

Saturday 25 February 2012

Timezones in Oracle

It might have been written umpteen times here and there, but it always manages to confuse me. So, here it is another time on the internet...


Datatypes

TIMESTAMP WITH TIME ZONE
  What Oracle stores: Year, month, day, hour, minute, second, fractional second, and time zone displacement (HH:MI difference from GMT)
  What Oracle displays: Stored value

TIMESTAMP WITH LOCAL TIME ZONE
  What Oracle stores: Year, month, day, hour, minute, second, fractional second; does NOT store time zone information, but instead converts data to the database time zone and stores it without time zone information
  What Oracle displays: Converts the stored data to the session's time zone before displaying


With due respect to - http://toolkit.rdbms-insight.com/tz.php
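
Since this always confuses me, here is a small Python sketch that mimics the two behaviours described above. This is only an analogy and not how Oracle implements it: the function names are made up, and the database time zone (UTC) and session time zone (UTC+01:00) are assumptions picked purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical offsets chosen only for illustration.
DB_TZ = timezone.utc                       # stands in for the database time zone
SESSION_TZ = timezone(timedelta(hours=1))  # stands in for the session time zone


def store_with_time_zone(value: datetime) -> datetime:
    """TIMESTAMP WITH TIME ZONE: keep the value together with its offset."""
    return value


def store_with_local_time_zone(value: datetime) -> datetime:
    """TIMESTAMP WITH LOCAL TIME ZONE: convert to the database time zone
    and drop the offset before storing."""
    return value.astimezone(DB_TZ).replace(tzinfo=None)


def display_with_local_time_zone(stored: datetime) -> datetime:
    """Re-attach the database time zone, then convert to the session's."""
    return stored.replace(tzinfo=DB_TZ).astimezone(SESSION_TZ)


# A value entered as 10:00 at UTC+05:30.
entered = datetime(2012, 2, 25, 10, 0,
                   tzinfo=timezone(timedelta(hours=5, minutes=30)))

print(store_with_time_zone(entered))
print(store_with_local_time_zone(entered))
print(display_with_local_time_zone(store_with_local_time_zone(entered)))
```

The first value keeps its +05:30 offset, the second is stored without any offset after conversion to the assumed database time zone, and the third shows the same instant again in the assumed session time zone.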

Friday 3 February 2012

And now.. a Solar Powered ubuntu laptop...

Nick Rutledge has conceptualized a thin laptop that runs Ubuntu, is beautiful and hopes to run on solar power... isn't that a killer combination...

Check out his concept here...

http://nrutledge.blogspot.com/p/ubuntu-laptop-concept.html

Saturday 15 October 2011

Data Lineage.. what is that ?

It is one of those buzzwords that keep doing the rounds every once in a while. Almost every enterprise wants to do this kind of analysis, and it is almost always hard to find people with the knowledge/experience to do it.

For the unaware, Data Lineage is basically (in very few words) a study of the data from its source to its eventual target. Similar to what we'd do for our family tree, we trace the generations of the data we are dealing with.

Starting from its source, the data travels through different subsystems, sometimes going through transformations, and thus possibly changing shape too...

Informatica had a very interesting blog post around this (already in 2007), which can turn out to be fairly informative.


Wednesday 28 September 2011

Informatica & hadoop... solutions for future ?

Distributed computing using hadoop has taken the IT industry by storm in the last few years. After getting almost "adopted" by yahoo, hadoop has progressed quite fast, and is now maturing slowly but steadily.

More and more enterprise solution providers are announcing their support for the hadoop platform, hoping to get a piece of the big data business pie. So it is perhaps only fair to expect that Informatica, the leader in the data integration space, has also announced a tie-up with Cloudera for porting the Informatica platform to hadoop.

Though the exact details are yet to come out, the possibilities are endless. With hadoop (and its inherent distributed computing based on map/reduce), informatica can actually think of processing big data in reasonable time frames.

For one of my customers, I deal with about 200 million rows of data per day in one job. Besides the issues with tuning the query in oracle etc, the informatica component itself takes hours. With map reduce in place, I hope to get that down to minutes, oracle issues notwithstanding.

Although word about hadoop is spreading quite fast, its adoption (from buzzword to actual usage in the enterprise) is not as fast. To aid their cause, Informatica and Cloudera have started an interesting series of webinars, called "hadoop tuesdays". It's free to join, and they get experts to talk about various issues around hadoop, big data and informatica. It's been very useful and informative so far.

Monday 25 July 2011

Switching defaults in Ubuntu

Ubuntu allows you to have multiple alternatives installed for many pieces of software, for example java. You can have the default open jdk installed, and then also have the Sun version installed.

To see which alternatives are installed, try going to /etc/alternatives. Here you'd see many pieces of software with alternatives listed.

With several alternatives installed, you need to point your system to one of them as the default. This is especially important after installing a newer version of the software.

To switch between the alternatives, you need to use this

sudo update-alternatives

If you do a man on update-alternatives, there is a plethora of options to use.

For our example, to configure the default for java, use this

sudo update-alternatives --config java
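
To double check which alternative is actually active after switching, you can look at where the symlink in /etc/alternatives points. A small Python sketch for that follows; the function name current_alternative is hypothetical, and it only reads the symlink without changing anything.

```python
import os


def current_alternative(name: str, alternatives_dir: str = "/etc/alternatives") -> str:
    """Return the real path that the alternatives symlink `name` resolves to."""
    link = os.path.join(alternatives_dir, name)
    if not os.path.islink(link):
        raise FileNotFoundError(f"{link} is not an alternatives symlink")
    # realpath follows the whole symlink chain to the final target.
    return os.path.realpath(link)
```

For example, current_alternative("java") should return the path of the java binary that is currently the default on your system.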

Wednesday 20 April 2011

HTML 5

I attended the .WEB day of GIDS (The Great Indian Developer Summit) 2011 edition.


Among many talks, there were two focusing on HTML5, one by Scott Davis (of http://www.thirstyhead.com/) and one by Venkat Subramaniam (of http://www.agiledeveloper.com). Scott's talk was more on the conceptual and capability side of HTML 5. Venkat focused more on the implementation and on initiating newbies into HTML 5 coding.


Before these discussions, I would not have been able to say much about the capabilities of HTML5. It was more of a buzzword before; now it's another technology holding a lot of promise. I think it says a lot for the two speakers that, within two sessions, they were able to lift the level of know-how around a cutting edge technology from buzzword to daily use.


Both these talks, put together, provided a rather complete picture. They enumerated the benefits, the major improvements, and the new tags which bring so much functionality to native HTML without the need for any third party libraries, plugins etc.


Of course it's cutting edge today, since not all browsers support all of the HTML 5 specification. The specification is huge in itself anyway. As one of the speakers put it, the HTML 5 spec is a combination of HTML plus all of CSS 3 plus a lot of RIA functionalities based on JavaScript libraries. One can say that html5 is rather heavy from the browser engine side; however, it intends to provide all the features across the browsers (eventually). Since it's a huge spec, not all browsers implement it completely and uniformly.


There will come a time when all the browsers (at least the leading ones) implement it completely (or almost all of it), but till then, developers will have to polyfill (polymorphically backfill) the html5 functionality for non-supporting browsers. The javascript library at www.modernizr.com is a big help in implementing this transparently.


As Scott very aptly put it, "We'd program for the faster animal in the herd, and allow the rest of the slower ones to polyfill. As and when they catch up with the fastest one, need for polyfill will automatically go away".


From what I see of the html5 spec (whatever part I have come to know), it looks very, very interesting and powerful. Lots of functionality that is implemented today with the help of third party libraries/plugins is going to be implemented natively.


And, let me not forget to mention the single most important innovation that is coming through with html5: the semantic web. It's not really a set of tags or something similar, rather a concept. There are tags available in the spec which actually indicate the semantics (meaning) of the content. For example, there is a tag called <footer>. This tag won't do much on its own, but when someone is reading the code, or for that matter a parser program is going through the code, the tag name already says that it's a footer. The tag name actually means something. This also paves the way for future improvements on the implementation side.


Perhaps a separate post for html5 possibilities for mobile applications, a huge area in itself.


Resources

  1. www.html5rocks.com
  2. www.html5doctor.com
  3. www.diveintohtml5.org -> this is a unique one, a complete book on html 5, which is available free of cost, completely online.  One of the finest resources for html5.
  4. www.html5demos.com
  5. www.modernizr.com  -> javascript library for polyfill




Sunday 13 March 2011

convert Informatica Session Logs to text/xml format for out of tool readability

There was a situation recently when the infa repository was not able to point us to the old session logs. However, the file system still had those files.

But the session log files are in binary format by default. If you didn't ask for backward compatible session log files, you get a binary session log file on the file system. This is done to make it easier to import the session log files for the infa support guys.

So, I had a few log files in binary format, and needed to analyze them. Informatica provides an infacmd subcommand for that conversion.

The convertLogFile subcommand takes three parameters. The syntax is as follows -


convertLogFile <-InputFile|-in> input_file_name
                 [<-Format|-fm> format_TEXT_XML]
                 [<-OutputFile|-lo> output_file_name]


So, when you launch this, you specify the input file to be converted, the format as TEXT or XML, and the output file that you want as the result of the conversion.

An example call would look like this (assuming the server is on unix) -

infacmd.sh convertLogFile -in /path/to/binary/format/sesslog/file -fm TEXT -lo /my/home/text/format/sesslog/file
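
When there are many log files to convert, typing that call for each file gets old quickly. Here is a rough Python sketch that wraps the same call in a loop. It assumes infacmd.sh is on the PATH, uses only the three options shown above, and the function names and the .txt/.xml output suffixes are my own invention.

```python
import os
import subprocess


def build_convert_command(input_file, output_file, fmt="TEXT"):
    """Build the infacmd.sh convertLogFile argument list shown in the post."""
    if fmt not in ("TEXT", "XML"):
        raise ValueError("format must be TEXT or XML")
    return ["infacmd.sh", "convertLogFile",
            "-in", input_file, "-fm", fmt, "-lo", output_file]


def convert_all(log_files, output_dir, fmt="TEXT", run=subprocess.run):
    """Convert each binary session log and return the output file paths."""
    suffix = ".txt" if fmt == "TEXT" else ".xml"
    outputs = []
    for log_file in log_files:
        output_file = os.path.join(output_dir, os.path.basename(log_file) + suffix)
        # check=True raises CalledProcessError if infacmd.sh exits non-zero.
        run(build_convert_command(log_file, output_file, fmt), check=True)
        outputs.append(output_file)
    return outputs
```

The run parameter is there only so that the loop can be tried out without a real Informatica installation, by passing in a stand-in function instead of subprocess.run.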



Tuesday 22 February 2011

Just found out about this amazing thing...

Data Wrangler, a research initiative at Stanford University. Wonderfully helpful for analysts.

Try the demo video: Wrangler Demo Video from Stanford Visualization Group on Vimeo.

And read more about it at http://vis.stanford.edu/wrangler

Sunday 20 February 2011

Exadata - is it really worth the hype

Well, I am not going to try to answer that; this is rather more on the question side...
Recently one of my projects moved to an exadata device. There was so much talk around it, to the effect that the queries and db processes need not be looked into, since exadata would take care of them.

However, two things happened. First, there was a technical talk on the device's configuration. The device turned out to be a mammoth piece of hardware. In a nutshell, it's an 8 node cluster, each node having 4 CPUs. Each CPU has around 2-4 GB of RAM. Then there is this high speed secondary storage which can hold a lot of cache.

The nodes are interconnected using a special switch which can transfer data faster than Gigabit networks.

With such a hardware configuration, any software could claim the kind of performance gain they claim. Not to undermine the performance gains, I just want to say that the hype around the out-of-this-world performance gains is actually the result of better hardware, not really of revolutionary software.

Personally, I was expecting something of that type from Oracle, since they lack in that area. Except Teradata, there is almost no player who delivers that kind of data warehouse architecture and performance, and I was hoping that Oracle would bring out something there.

And the second thing: one of our processes tried to load data into an exadata instance using informatica. Initially we left things at default so that exadata could tune it itself, without us forcing anything. However, there too, exadata failed big time and couldn't deliver any perf gain. In the end, all the tuning had to be done by us.

So, the other claim of exadata, about having the intelligence to pick up processes and fix them on its own, also fell flat for us.

Though I agree that it's still early in its evolution, I believe oracle marketing should be doing a better job. :)