Sunday, February 9, 2014

Programmatic interaction with Facebook data

The last time I interacted with Facebook using a program (2011), it was fairly straightforward; perhaps because it was so simple, it was abused.  As a result, in my recent experiments I have come to find that they have made it so complicated (my feeling, no offence to Facebook here) that it's quite hard, if not impossible, to do genuine development work using the Facebook API, let alone achieve something sinister.

I am a self-anointed tech junkie, so no one should be surprised that I took up this challenge a couple of hours back: to build a brand new routine to get data in and out of Facebook using a popular, open-source data integration tool called Pentaho.

There were two triggers. One, I had not done any core, debugging-based research with something new for a while. Two, a friend of mine is building an ecommerce site, www.freemall.in, whose marketing efforts need a simpler interface to push details to the site, instead of a human going in and doing it all by hand.  So this programming interface becomes the first step towards a future plugin for his ecommerce venture.

As I already indicated, programming with Facebook's current API is a relatively hard task, at least for a couple of iterations, before you get the hang of it.  So I started with the tutorial at Pentaho's site - http://www.pentahoevalcenter.com/data-integration/advanced-data-integration/facebook/ - which turned out to be based on the old Facebook API and prompted me to look elsewhere for the right way of working.

Then comes the big thing - the OAuth authentication protocol employed by Facebook: get a token of this type, then of that type, go to this site, achieve this, copy this and take it there... It's all so much that I had to go back to Google and see if there was something simpler... and yes, god bless Google, there was someone like me who had gone through all the trouble and then documented it. God bless his soul for this -
http://stackoverflow.com/questions/12168452/long-lasting-fb-access-token-for-server-to-pull-fb-page-info
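
For my own future reference, the flow from that answer boils down to two Graph API calls. Here is a rough curl sketch; the app id, app secret and short-lived token are placeholders you get from your own app settings / the Graph API Explorer, and the endpoint details may have changed since this was written:

# exchange a short-lived user token for a long-lived one
curl "https://graph.facebook.com/oauth/access_token?grant_type=fb_exchange_token&client_id=YOUR_APP_ID&client_secret=YOUR_APP_SECRET&fb_exchange_token=SHORT_LIVED_USER_TOKEN"

# use the long-lived user token to list the pages you manage;
# the response carries a page access token for each page
curl "https://graph.facebook.com/me/accounts?access_token=LONG_LIVED_USER_TOKEN"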

After this, I went ahead with the Pentaho tutorial, which worked out nicely for me.

I was able to post a couple of messages to my test page using the application that I built for this test case.
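
For completeness, the eventual Graph API call behind that posting step is roughly the following; the page id, message and page token are placeholders:

curl -F "message=Hello from my test application" \
     -F "access_token=PAGE_ACCESS_TOKEN" \
     "https://graph.facebook.com/PAGE_ID/feed"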

I hope this collection of links in one place helps others.



Monday, February 3, 2014

Proud - Cloudera Certified Developer for Apache Hadoop

I have not been a believer in certifications; that should be clear from the fact that even though I have been working with Informatica since 2002, I never tried their certification till 2011. The same goes for Oracle, whose exam I have not yet attempted, planned, or even thought about. I used to think that the knowledge would prevail anyway, whether or not an authority put its stamp on it.

However, I believe I am changing, to a certain extent.  It seems I have come to accept the worth of certifications, and therefore, after the training provided by Cloudera, I took up the opportunity and went through the rigor of the examination.

Fortunately, I came through.  And, as much as I try not to showcase it, it's a great feeling.  Somehow, the knowledge is vindicated: yes, this guy knows something about hadoop and you'd better listen to him :) . Funny that one has to put a badge out there to be heard.

Well, all said and done, the certification is done, courtesy of my employer, who sponsored the training and, of course, the examination coupon that came along with it.  I would like to thank our trainer from Cloudera, Amandeep Khurana, for his depth and breadth around all things hadoop.  As much as I knew about hadoop before going into the training, those 2.5 days added precious layers to my knowledge.  Thanks Amandeep.

Thursday, January 30, 2014

Removing ports from an existing mapping - Dos and Don'ts

Recently, a colleague called in for an issue that he was facing with an Informatica mapping. Let me recreate that situation here -

There is a mapping that gets n ports from a source and loads m ports to a target. Standard stuff, nothing special or fishy there. Then comes a change request that says such and such x columns are not required in the target, and since there would be a sure performance penalty for carrying the extra data through (however small), the mapping should be changed to remove those ports from the pipeline.

What this gentleman did was to remove the connections for those ports from the source qualifier onwards. That saved him from changing and re-importing the source definition.

However, the mapping execution failed, complaining about a certain error situation. That's when I got the call :)

Whether I could discuss and fix it is another matter, but what was the reason for the error?

The reason is simply this: if you are bringing certain ports into the source qualifier, you HAVE to take them forward.  That's a rule from Informatica's side. Which means, if you are pulling n ports from the source, you HAVE to expose all those n ports going out of the source qualifier. Or, putting it differently, in whichever way you create a set of ports in your source qualifier, all of them have to be consumed by some transformation object. Not a single one can be left unconnected, on the input side or the output side. Simple.

That was the reason, and the solution was fairly simple. To reduce his effort, he could either remove the ports from the source qualifier as well, or just carry them forward to one transformation and then drop them from there onwards.

Having said that, the very reason this whole change was initiated, namely the performance gain from not carrying the extra ports forward, is realized only when you remove the unwanted ports at the source itself.

Friday, January 24, 2014

Moving an ecommerce site to Amazon Web Services

This is an interesting one.. There is an e-commerce startup run by a dear friend.  One fine day we were chatting and he mentioned the challenges with his business: performance, scaling and cost issues with his hosting services provider.

My instant reaction was, why don't you move to AWS.. and it clicked just like that.. We decided that we'd start with the free tier, with his MySQL server on Amazon RDS and the rest of the app server functionality on a micro instance, all of which is within the purview of the free tier.  Using CloudFront we'd localize the static content, to improve the performance of content delivery. Not now, but eventually, we'd start using ELB, the load balancer, for distributing load across the instances we would have.
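
Just to sketch the shape of that plan from the AWS CLI: all identifiers, credentials and the AMI id below are placeholders, and the free-tier instance classes change over time, so treat this as an illustration rather than a recipe.

# MySQL on RDS (db.t1.micro was the free-tier class at the time; check current free-tier terms)
aws rds create-db-instance \
    --db-instance-identifier freemall-db \
    --db-instance-class db.t1.micro \
    --engine mysql \
    --allocated-storage 20 \
    --master-username admin \
    --master-user-password 'CHANGE_ME'

# app server on a micro EC2 instance (AMI id is a placeholder)
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type t1.micro \
    --key-name my-keypair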

With something like AWS, planning all of that really is that straightforward.  You just have to be a bit more technically oriented to think in these terms, and you are set for good.

We are already thinking about configuring auto scaling for the services, so that running on a micro instance doesn't become a bottleneck for the customers. With that, we'd hope to achieve dynamic, automatic scaling of the infrastructure up and down.
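
A rough first cut of that with the AWS CLI, assuming a launch configuration built from an AMI of the app server (all names and the AMI id are placeholders):

aws autoscaling create-launch-configuration \
    --launch-configuration-name freemall-lc \
    --image-id ami-xxxxxxxx \
    --instance-type t1.micro \
    --key-name my-keypair

aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name freemall-asg \
    --launch-configuration-name freemall-lc \
    --min-size 1 --max-size 3 \
    --availability-zones us-east-1a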

We have already taken the first steps, and I am helping his venture migrate to AWS. It is one of my dream jobs: to be able to consult around migration/integration of traditional systems to cloud-based systems.

Saturday, January 12, 2013

Teradata TPump vs MultiLoad

Was doing some research around Teradata's load utilities and found some useful info on Teradata forums 

MultiLoad
* Loads data to Teradata from a mainframe or flat file
* Multiple tables can be loaded in the same MultiLoad.
* Up to 20 INSERTS, UPDATES, or DELETES on up to 5 tables.
* UPSERT is supported
* There can be NO - Unique Secondary Indexes (USI), Triggers, Referential Integrity or Join Indexes.
* Duplicate rows are allowed
* Each Import task can do multiple INSERT, UPDATE and DELETE functions.
* Some Secondary Indexes (NUSI) and RI are allowed
* Locks at the table level
* Block Level transferring of Data.

TPump

* Loads data to Teradata from a mainframe or flat file
* Processes INSERTS, UPDATES or DELETES
* Tables are usually populated.
* Can have Secondary Indexes and RI on tables.
* Does not support MULTI-SET tables.
* Locks at the row hash level
* It uses time-based checkpoints, not count-based. In an OLTP-style environment, TPump's trickle or continuous loads suit populated tables; it acts like a water faucet (tap), in that it can be turned up to load millions of rows, or at peak periods turned down to trickle-feed into the tables.

Generally, MultiLoad performs better for large bulk loads because of its 64K block loading of data, while TPump works better for low-volume changes.

General Writeup 
Multiload performs better in almost all cases.

The only time Tpump's performance approaches Multiload's is when you are updating a very small percentage of the rows. Tpump could probably beat Multiload if you had a very small number of rows, in that Multiload has to log on to one session per AMP, whereas Tpump sessions can be controlled. So, with very few rows, the overhead of Multiload may make it slower than Tpump.

Multiload performs better because it sends the data from the host to the DBMS more efficiently (in block mode, with no embedded commands).  Tpump sends the data as part of a statement (an exec macro statement). Tpump allows you to pack statements together to gain more efficiency, but it still doesn't send the data as efficiently as Multiload.

The second reason that Multiload is faster is that it then applies the updates in block mode. So, if you have multiple updates destined for the same data block, they will get applied with one physical write of the data block. Tpump will need to write the data block once for each update.

The advantage that Tpump has over Multiload is that it locks only the rows (actually row hashes) that it's updating whereas Multiload locks the entire table for write while it's updating the data.  Because of this, you can run multiple Tpumps against the same table at the same time, whereas you can only run one Multiload against a table at a time.

Since Multiload takes a write lock on the table it's updating during its APPLY phase (the phase where it actually updates the table), you can only access the table with an access lock (i.e. a dirty read). With Tpump, you can access individual rows with a regular read lock.  If you tried to run a query that required a read lock on the table during a Tpump, the read lock would end up blocking the Tpump updates until the query finished, so it's still not a great idea to run queries requiring a read lock on the table during a Tpump.

Another advantage that Tpump has over Multiload is that there is no Tpump code within the DBMS, so new features are automatically enabled with Tpump, whereas there are a number of features that you can't use with Multiload (USI's, referential integrity, join indices, etc.).

In summary, 

Multiload is better for bulk updating especially if done in traditional batch mode.
Tpump is usually better for continuous updating of a table.

Friday, January 4, 2013

Finding Informatica domain name

Recently I came across a situation where the customer had provided us with the Informatica server hostname, but neither the domain name nor the domain port.

In such a case, we lost quite some time figuring out how to get through the domain configuration. That's when I started thinking about the alternatives for finding the domain name information from the system (assuming different access levels).

If repository database access is available, i.e. you can access the Informatica repository database, you can use the following query to get the domain name out:

select pos_name
from PO_DOMAINSERVICECONFIG

This SQL needs to be run in the schema where the domain repository has been created.

As another approach, if database access is not available but access to the Informatica server is, the file domains.infa in the $PM_HOME equivalent directory will provide the domain name, port, etc.
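
For instance, something along these lines shows the values, assuming (and this is an assumption, since installations differ) that $PM_HOME points at the Informatica installation directory holding domains.infa:

grep -i domain "$PM_HOME/domains.infa"    # the domain name, gateway host and port show up in the XML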

About the port

Though the installation process allows customization of the domain port, many installations keep the default as is.  In any case, a simple telnet to the host on the suspected port will confirm whether the port is open or not.
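
For example (the hostname is a placeholder):

telnet infa-server.example.com 6005    # a successful connect means the port is open; Ctrl+] then 'quit' exits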

In my example situation, it turned out to be the default 6005.

Thursday, June 14, 2012

hadoop/hive with Tableau


It was in 2010 that I had my first taste of hadoop/hive.  Back then I was still using hadoop 0.20 and was doing a proof of concept for a customer who wanted to see if hadoop could be a solution to their problems.

Since then, I have been reading up on and following the changes in the hadoop world, and tweaking things here and there in my home installation.   Today, I tried to mount hive on hadoop (without hbase; with hbase will be the next experiment) and see how I could get it playing nicely.

The experience is awesome as usual, and it reinforces my belief that the hadoop ecosystem has a huge role to play in the computing industry of tomorrow.

The analytical potential of the volumes of data managed by hadoop-style systems is ever increasing, hence the interest from many instant-BI players in providing access to the data behind hadoop.

One such player is the instant-dashboard tool Tableau.  They have announced that Tableau 7 will be able to read data directly from hive environments.

In real life it was a bit of a challenge, but then what's the fun if there is no challenge? In a nutshell, it does work, no doubt.  However, the kind of configuration and administration that is required can be tricky.

1. You have to install the hive driver (available from their website - http://www.tableausoftware.com/support/drivers)

2. You have to launch hive in a particular way, as a service (hive --service hiveserver) - see the sketch after this list.  Also, hive on a pseudo-distributed cluster only allows one connected user (since the metadata store runs single-user on Derby); as a result, if you are using the Tableau connectivity, nothing else can access hive, not even a command-line interface.

3. Remember that each addition/change to the data set on the Tableau interface triggers a map-reduce job on the hive cluster/server.  And hadoop/hive are not really meant to be fast, responsive systems.  Therefore, expect noticeable delays in fulfilling your drag-and-drop requests.

4. There might/will be additional trouble in aggregating certain types of data, since the data types on hive might not be additive in the same way the front end expects them to be.
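
For point 2, here is the minimal way I bring hive up as a service on my pseudo-distributed setup (this is the old HiveServer1, which listens on port 10000 by default):

nohup hive --service hiveserver > /tmp/hiveserver.log 2>&1 &    # the Thrift service that the Tableau/ODBC driver connects to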

All in all, it wins me over with the ease of access it provides to the data behind the hadoop environment; however, faster ways already exist to achieve the same result.

Saturday, May 5, 2012

Mounting an AWS instance into Nautilus in Ubuntu

For a particular project I needed to transfer files from my local desktop system (running Ubuntu) to my AWS instance, also running Ubuntu.  Shell-based connections work just fine for me; however, I needed a GUI solution.  A GUI was favoured since I wanted a quick drag-and-drop way to move files easily between the two machines.

Although Nautilus provides a "Connect to Server" option, and you can bookmark it as well, the catch is that with AWS instances you are dealing with public/private key authentication, not password-based authentication.

For some reason the "Connect to Server" option doesn't allow you to specify a key file, and therefore that authentication method doesn't work in this case.

After quite a bit of research and googling, I had to put this question up on stackoverflow.com, where I got a prompt and sweet response.

The trick was to use the .ssh/config file and create an alias there, specifying the identity file; from then on, an ssh or sftp call poses no challenge, since ssh picks up the settings from the config file automatically.

Here are the sample contents of the .ssh/config file that worked for me -

Host {alias}
HostName {hostname or IP address}
User {username to log in with}
IdentityFile {path to abcd.pem}

With this in place, it's possible for me to have a bookmark, and clicking on it just opens the target location (like a mount point).  A sweet little trick, and it makes me like Ubuntu a bit more...
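
For completeness, once the config is in place the alias works anywhere ssh/sftp is used; the alias and remote path below are whatever you put in your own config:

ssh myaws                      # plain shell login via the alias
sftp myaws                     # file transfer, same alias
# and in Nautilus' "Connect to Server" dialog / location bar:
#   sftp://myaws/home/ubuntu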

Sunday, April 29, 2012

Starting off with Rails 3.2.2 on Ubuntu 11

After a long time, I picked up building a web application for a friend. I haven't worked on web apps for about 6-8 months now, and was quite rusty.

Ruby on Rails was a natural choice. However, when I issued a "gem update" it turned out that Rails is on 3.2.2 these days, which means there have been at least 4 releases since I last worked with it.

I typically work in a way that learning happens only while working.  Therefore, I just started building a new application and generated a default scaffold (I know, it's an old habit of testing an installation using a scaffold)... The scaffold generated, which means the backward compatibility is still alive. Server started.. and bang, I launched Chrome to see what localhost:3000 looks like...
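
For the record, the whole smoke test is just a handful of commands; the application and model names here are made up for illustration:

gem install rails                                    # 3.2.2 at the time of writing
rails new storefront && cd storefront
rails generate scaffold Product name:string price:decimal
rake db:migrate
rails server                                         # then browse to http://localhost:3000/products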

It bombed... there was this "ExecJS::RuntimeError" staring me in the face...

It turned out that Ubuntu was the culprit: a JS runtime engine is required, Ubuntu doesn't provide one by default, and nodejs needs to be installed to fill that role.

after "sudo apt-get install nodejs", its all smooth and shiney...

Will share further...

Thursday, April 12, 2012

App vs web browser based access to websites from devices

Every other website these days launches its own app as soon as it finds a decent following among customers.  However, there are some things we need to watch out for when using apps as opposed to browsing the same content in a web browser.

A browser is a relatively safe sandbox when it comes to executing website content and rendering it.  There really has to be a loophole in the browser engine for a website to exploit it and do weird things to your device, be it a phone, tablet or laptop/desktop.

On the other hand, when we install "apps" for websites, we grant them "permissions" to do things on our devices. This works nicely on a foundation of trust: I trust the website, and therefore I trust its app not to do anything untoward to my device.  This trust can be unfounded in some cases, and lead to unexpected actions/behaviours from apps.

Among the benefits of using apps, they do provide a better user experience (in most cases) since the rendering is specific to the device.  Also, apps send a lot more customised user-interaction information back to their parent websites, thereby providing more accurate and contextual intelligence about their usage. There are also reports of apps stealing private information from devices.

Most of the time, the reason an app gains access to information is the granularity of the access control used on the device. If it is too fine, the permissions to be granted become a huge list; if it is too coarse, you can grant too much access without intending to. That is where the maturity of the device's operating system comes in.

With web browsers, the information sent back for analytics purposes is rather generic, since it comes from the browser sandbox.

I believe it's safe to say that using apps is a bit of a trade-off between the user experience and the safety/privacy of the user.  Let's be a bit more careful about which apps we download and use, and what permissions each app needs. Let's just be a bit more sceptical, and hopefully end up safer for it.

PS - A lot of other comparisons already exist; however, it's hard to say how many of them talk about the security aspects. User experience is one major discussion point, for sure.  Try Googling it.

Tuesday, April 3, 2012

time zone conversion in Oracle

Oftentimes we need to see a given timestamp column in a particular time zone.   Even without casting, Oracle allows a very simple way -


SELECT <column-name> AT TIME ZONE <time zone> FROM <table-name>;

This method saves the more expensive cast operations.

Informatica Questions from a friend - Part 2 - Schedulers

Need of Scheduling and Commonly used Schedulers
 
Any and all data warehousing environments need some kind of scheduler setup to enable jobs to run at periodic intervals without human intervention.  Another important feature is the repeatability of the jobs set up this way.  Without the help of a scheduler, things would become very ad hoc and thus prone to errors and mess-ups.

Oracle provides a built-in scheduling facility, accessible through its dbms_scheduler package. Unix provides basic scheduling via the cron command. Similarly, Informatica also provides basic scheduling facilities in the Workflow Manager client.
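
As a tiny illustration of the basic end of that spectrum, a cron entry that kicks off an Informatica workflow through pmcmd might look like this; the paths, domain/service names and credentials are placeholders, not from any real environment:

# crontab entry: run the nightly load at 2 AM and keep a log
0 2 * * * /opt/informatica/scripts/nightly_load.sh >> /var/log/etl/nightly_load.log 2>&1

# inside the script, the workflow itself is launched via pmcmd
pmcmd startworkflow -sv INT_SVC -d DOMAIN_NAME -u etl_user -p etl_password -f DW_FOLDER wf_nightly_load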
 
The features provided by these built-in schedulers are fairly limited, often restricted to launching a job at a given time, basic dependency management, etc.

However, in real-world data warehousing solutions, the required functionality is a lot more sophisticated than what these basic features offer.  Hence the need for full-fledged scheduling tools, e.g. Tivoli Workload Scheduler, Redwood Cronacle, Control-M, Cisco Tidal, etc.
 
Most of these tools provide sophisticated launch control and dependency management features, and therefore allow the data warehouse to be instrumented at finer levels.

Some of the tools, e.g. Tidal for Informatica and Redwood for Oracle, support the corresponding tool's API as well, and therefore integrate even better with it.

Friday, March 30, 2012

Informatica Questions from a friend - Part 1

HOW TO USE A COBOL FILE FOR TRANSFORMATION
 
Informatica allows reading data from COBOL copybook-formatted data files. These files mostly come from mainframe-based source systems. Given that many of the world's leading business systems still run on IBM mainframes, e.g. airlines, banks, insurance companies, etc., these systems act as a major source of information for data warehouses, and thus for our Informatica mappings.

To use a COBOL copybook structure as a source, you have to put that copybook in an empty skeleton COBOL program:
IDENTIFICATION DIVISION.
PROGRAM-ID. RAGHAV.

ENVIRONMENT DIVISION.
SELECT FILE-ONE ASSIGN TO "MYFILE". 

DATA DIVISION.
FILE SECTION.
FD FILE-ONE.

COPY "RAGHAV_COPYBOOK.CPY".

WORKING-STORAGE SECTION.

PROCEDURE DIVISION.

STOP RUN. 

The copybook file can be a plain record structure.
Read more about defining copybooks around here.

 
 
 

Thursday, March 29, 2012

Counting columns in a tab delimited file


It sounds so simple; however, when you sit down to write it, especially as a single-line expression, it can take a while.

In my experiments I found it rather easier to count with other delimiters than with the TAB character.
Here is the command for counting columns

cat <FILENAME> | awk -F'\t' '{print NF}'

cat can be slow at times, especially with larger files, so here is an alternative without it...

awk -F'\t' '{print NF}' <FILENAME>
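
A small extension of the same idea that I find handy: checking whether every record in the file has the same number of columns (more than one distinct count in the output points at bad rows).

awk -F'\t' '{print NF}' <FILENAME> | sort -n | uniq -c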


 

Wednesday, March 14, 2012

how to find sql id of a long running sql in oracle


SQL for finding the sql_id and other details of a long-running query. Often useful for sending kill/monitoring requests to DBA friends..

select distinct t.sql_id, s.inst_id,s.sid, s.serial#,s.osuser, s.program, s.status, t.sql_text
from gv$session s, gv$sqlarea t
where s.username = '<USERNAME>'
and s.sql_id = t.sql_id
and t.sql_text like '<%provide a segment of sql to identify it%>'

Saturday, February 25, 2012

Timezones in Oracle

It might have been written umpteen times here and there, but it always manages to confuse me. So, here it is another time on the internet...


Datatypes

TIMESTAMP WITH TIME ZONE
* What Oracle stores: year, month, day, hour, minute, second, fractional second, and the time zone displacement (HH:MI difference from GMT).
* What Oracle displays: the stored value, as is.

TIMESTAMP WITH LOCAL TIME ZONE
* What Oracle stores: year, month, day, hour, minute, second, fractional second; it does NOT store time zone information, but instead converts the data to the database time zone and stores it without time zone information.
* What Oracle displays: the stored data converted to the session's time zone.


With due respect to - http://toolkit.rdbms-insight.com/tz.php

Friday, February 3, 2012

And now.. a Solar Powered ubuntu laptop...

Nick Rutledge has conceptualized a thin laptop that runs Ubuntu, is beautiful, and hopes to run on solar power... isn't that a killer combination...

check out his concept here...

http://nrutledge.blogspot.com/p/ubuntu-laptop-concept.html

Saturday, October 15, 2011

Data Lineage.. what is that ?

It is one of those buzzwords that keep doing the rounds every once in a while. Almost every enterprise wants to do this kind of analysis, and it is almost always hard to find people with the knowledge/experience to do it.

For the unaware, data lineage is basically (in very few words) a study of the data from its source to its eventual target; much like what we'd do with a family tree, we analyze the ancestry of the data we are dealing with.

Starting from the source of the data, it travels through different subsystems, sometimes going through transformations, and thus possibly changing shape too...

Informatica had a very interesting blog post around this (back in 2007), which can turn out to be fairly informative.


Wednesday, September 28, 2011

Informatica & hadoop... solutions for future ?

Distributed computing using hadoop has taken the IT industry by storm in the last few years.  After being all but "adopted" by Yahoo, hadoop has progressed quite fast, and is now maturing slowly but steadily.

More and more enterprise solution providers are announcing their support for the hadoop platform, hoping to get a piece of the big data business.  So it was only fair to expect that the leader in the data integration solutions space, Informatica, would also announce a tie-up with Cloudera for porting the Informatica platform to hadoop.

Though the exact details are yet to come out, the possibilities are endless.  With hadoop (and its inherent distributed computing based on map/reduce technology), Informatica can actually think of processing big data in sustainable time frames.

For one of my customers, I deal with about 200 million rows of data per day in one job.  Besides the issues with Oracle in tuning the query etc., the Informatica component itself consumes time in terms of hours.  With map-reduce in place, I hope to get that down to minutes, Oracle issues notwithstanding.

Although word about hadoop is spreading quite fast, its adoption (from buzzword to actual usage in the enterprise) is not as fast.  To aid the cause, Informatica and Cloudera have started an interesting series of webinars, termed "Hadoop Tuesdays".  It's free to join, and they get experts to talk about various issues around hadoop, big data and Informatica.  It's been very useful and informative so far.

Monday, July 25, 2011

Switching defaults in Ubuntu

Ubuntu allows you to have multiple alternatives installed for many pieces of software - for example, Java.
You can have the default OpenJDK installed, and then you can also have the Sun version installed alongside it.

For example, to see what alternatives are installed for your software, try going to /etc/alternatives. Here you'd see many pieces of software with alternatives listed.

With multiple alternatives installed, you need to point your system to use one of them as the default; this is especially important after installing a newer version of a piece of software.

In such a case, to switch the alternatives, you need to use this

sudo update-alternatives

If you do a man on update-alternatives, there is a plethora of options to use.

For our example, to configure the default for java, use this

sudo update-alternatives --config java
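
A couple of related invocations that are worth knowing; the JDK path in the last one is only an example of where a manually installed JDK might live:

update-alternatives --list java        # list all registered alternatives for java
update-alternatives --display java     # show the current selection and priorities
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.6.0_45/bin/java 100    # register a manually installed JDK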