my experiments with technology: 2016

Tuesday, December 20, 2016

Hadoop - Small Files vs Big Files

Credits: https://blogs.msdn.microsoft.com/cindygross/2015/05/04/hadoop-likes-big-files/

One of the frequently overlooked yet essential best practices for Hadoop is to prefer fewer, bigger files over more, smaller files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files "big enough" for Hadoop to process efficiently?

The Problem

One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition “too much”. Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.

A file is too small when it is smaller than the HDFS block size (chunk size); realistically, anything less than several times the chunk size counts as small. A very, very rough rule of thumb is that files should be at least 1GB each, with no more than around 10,000 files per table. These numbers, especially the maximum number of files per table, vary depending on many factors, but they give you a reference point. The 1GB figure is based on multiples of the chunk size, while the file-count figure is honestly a bit of a guess based on a typical small cluster.

Why Is It Important?

One reason for this recommendation is that Hadoop’s name node service keeps track of all the files and where the internal chunks of the individual files are. The more files it has to track, the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affect how memory is used on each node.

Let’s say your chunk size is 256MB. That’s the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I’m ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB) then every one of those files is a separate chunk and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes – the data isn’t physically chunked up when initially stored but rather chunked up at the time a job runs.

With the single 1GB file the name node has 5 things to keep track of – the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track 1000 logical files plus 1000 physical chunks and their physical locations, or 2000 things in total. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split out any Hadoop job into tasks across the many nodes. When we’re talking about systems that often have TBs or PBs of data the difference between small and large files can add up quickly.
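To make the counting concrete, here is a minimal Python sketch (not from the original post). It assumes a 256MB chunk size, ignores replication, and counts one tracked item per file plus one per chunk. The function name namenode_objects is made up for illustration, and a real name node keeps more metadata than this simple count suggests.

```python
import math


def namenode_objects(file_sizes_mb, block_size_mb=256):
    """Return (files, blocks, total) the name node would track, ignoring replication."""
    files = len(file_sizes_mb)
    # Every non-empty file occupies at least one block, however small the file is.
    blocks = sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)
    return files, blocks, files + blocks


one_big = namenode_objects([1024])         # a single 1GB file
many_small = namenode_objects([1] * 1000)  # 1000 files of 1MB each

print(one_big)     # (1, 4, 5)
print(many_small)  # (1000, 1000, 2000)
```

The same ~1GB of data costs 5 tracked items as one file and 2000 as a thousand small files, which is why the gap grows so quickly at TB or PB scale.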

The other problem comes at the time that the data is read by a Hadoop job. When the job runs on each node it loads the files the task tracker identified for it to work with into memory on that local node (in WASB the chunking is done at this point). When there are more files to be read for the same amount of data it results in more work and slower execution time for each task within each job. Sometimes you will see hard errors when operating system limits are hit related to the number of open files. There is also more internal work involved in reading the larger number of files and combining the data.

Stitching

There are several options for stitching files together.

Combine the files as they land using the code that moves the files. This is the most performant and efficient method in most cases.
INSERT into new Hive tables (directories) which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task.
Use a combiner in Pig to load the many small files into bigger splits.
Use the HDFS FileSystem Concat API http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#concat.
Write custom stitching code and make it a JAR.
Enable the Hadoop Archive (HAR). This is not very efficient for this scenario but I am including it for completeness.
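As a rough illustration of the first option (combining files as they land), the Python sketch below only plans which small files would be stitched into which bigger file. It is not taken from any of the tools listed here. The function name plan_batches, the sensor file names, and the 256MB target are all hypothetical, and the actual merge step depends on your file format and on the code that moves the files.

```python
def plan_batches(file_sizes, target_bytes):
    """Group (name, size) pairs, in order, into batches of at most about target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new batch when adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Hypothetical landing directory: 1000 small IoT files of 1MB each.
small_files = [(f"sensor-{i:04d}.json", 1 * MB) for i in range(1000)]
batches = plan_batches(small_files, target_bytes=256 * MB)

print(len(batches), len(batches[0]), len(batches[-1]))  # 4 256 232
```

A greedy, in-order grouping like this keeps files that arrived together in the same output file. A single file larger than the target simply ends up in a batch of its own.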

There are several writeups out there that address the details of each of these methods so I won’t repeat them.

Merging small files on HDInsight http://blogs.msdn.com/b/mostlytrue/archive/2014/04/10/merging-small-files-on-hdinsight.aspx which uses a Java MapReduce JAR https://github.com/mooso/smallfilesmerge.
Quick Tip for Compressing Many Small Text Files within HDFS via Pig http://dennyglee.com/2014/01/06/quick-tip-for-compressing-many-small-text-files-within-hdfs-via-pig/.
FileCrush https://github.com/edwardcapriolo/filecrush.
HDFS FileSystem Concat API
CombineFileInputFormat (splits)
- This may not work with really large numbers of files and has to be used EVERY time a job is run.
- http://www.ibm.com/developerworks/library/bd-hadoopcombine/index.html
- Process Small Files on Hadoop Using CombineFileInputFormat (1) http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
Dealing with Hadoop’s small files problem http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/ “aggregating with the small files first reduced total processing time from 2 hours 57 minutes to just 9 minutes – of which 3 minutes was the aggregation, and 4 minutes was running our actual Enrichment process. That’s a speedup of 1,867%.”
The Small Files problem in Hadoop http://piglog4j.blogspot.com/2013/06/the-small-files-problem-in-hadoop.html
Hadoop Archive: File Compaction for HDFS https://developer.yahoo.com/blogs/hadoop/hadoop-archive-file-compaction-hdfs-461.html
The Small Files Problem http://blog.cloudera.com/blog/2009/02/the-small-files-problem/ “Reading through files in a HAR is no more efficient than reading through files in HDFS, and in fact may be slower since each HAR file access requires two index file reads as well as the data file read (see diagram). And although HAR files can be used as input to MapReduce, there is no special magic that allows maps to operate over all the files in the HAR co-resident on a HDFS block.”

The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.

Tuesday, November 15, 2016

Eclipse - installing Scala plugin manually?

I have been playing around with Scala for some time, and was always using the Scala IDE (www.scala-ide.org) which is based on a relatively older version of Eclipse (Luna).

I recently discovered that you can also install the Scala plug-in on a regular Eclipse installation.

Just add the following URL as a new update site in your local Eclipse installation, and you will be able to install the Scala plugin from it:

http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site

Saturday, July 23, 2016

Links to free big-data-sets

Many people who are starting their journey with big data and analytics find it hard to get their hands on the right kind of data to play or experiment with.

Most of the time, people have enthusiasm, they are learning the skill too, but they just don't have the right kind of dataset to apply their newly acquired skills.

Democratising data has been at the forefront of discussions for many data pioneers. Through their efforts and with some re-alignment of technology priorities, some government bodies have opened up their datasets to the public.

As a result, here is a set of links (reproduced) to some of the free sources.

Data.gov http://data.gov The US Government pledged last year to make all government data available freely online. This site is the first stage and acts as a portal to all sorts of amazing information on everything from climate to crime.
US Census Bureau http://www.census.gov/data.html A wealth of information on the lives of US citizens covering population data, geographic data and education.
Socrata is another interesting place to explore government-related data, with some visualisation tools built-in.
European Union Open Data Portal http://open-data.europa.eu/en/data/ As the above, but based on data from European Union institutions.
Data.gov.uk http://data.gov.uk/ Data from the UK Government, including the British National Bibliography – metadata on all UK books and publications since 1950.
Canada Open Data is a pilot project with many government and geospatial datasets.
Datacatalogs.org offers open government data from US, EU, Canada, CKAN, and more.
The CIA World Factbook https://www.cia.gov/library/publications/the-world-factbook/ Information on history, population, economy, government, infrastructure and military of 267 countries.
Healthdata.gov https://www.healthdata.gov/ 125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics.
NHS Health and Social Care Information Centre http://www.hscic.gov.uk/home Health data sets from the UK National Health Service.
UNICEF offers statistics on the situation of women and children worldwide.
World Health Organization offers world hunger, health, and disease statistics.
Amazon Web Services public datasets http://aws.amazon.com/datasets Huge resource of public data, including the 1000 Genomes Project, an attempt to build the most comprehensive database of human genetic information, and NASA’s database of satellite imagery of Earth.
Facebook Graph https://developers.facebook.com/docs/graph-api Although much of the information on users’ Facebook profiles is private, a lot isn’t – Facebook provides the Graph API as a way of querying the huge amount of information that its users are happy to share with the world (or can’t hide because they haven’t worked out how the privacy settings work).
Face.com: A fascinating tool for facial recognition data.
UCLA makes some of the data from its courses public.
Data Market is a place to check out data related to economics, healthcare, food and agriculture, and the automotive industry.
Google Public data explorer includes data from world development indicators, OECD, and human development indicators, mostly related to economics data and the world.
Junar is a data scraping service that also includes data feeds.
Buzzdata is a social data sharing service that allows you to upload your own data and connect with others who are uploading their data.
Gapminder http://www.gapminder.org/data/ Compilation of data from sources including the World Health Organization and World Bank covering economic, medical and social statistics from around the world.
Google Trends http://www.google.com/trends/explore Statistics on search volume (as a proportion of total search) for any given term, since 2004.
Google Finance https://www.google.com/finance 40 years’ worth of stock market data, updated in real time.
Google Books Ngrams http://storage.googleapis.com/books/ngrams/books/datasetsv2.html Search and analyze the full text of any of the millions of books digitised as part of the Google Books project.
National Climatic Data Center http://www.ncdc.noaa.gov/data-access/quick-links#loc-clim Huge collection of environmental, meteorological and climate data sets from the US National Climatic Data Center. The world’s largest archive of weather data.
DBPedia http://wiki.dbpedia.org Wikipedia is comprised of millions of pieces of data, structured and unstructured on every subject under the sun. DBPedia is an ambitious project to catalogue and create a public, freely distributable database allowing anyone to analyze this data.
New York Times http://developer.nytimes.com/docs Searchable, indexed archive of news articles going back to 1851.
Freebase http://www.freebase.com/ A community-compiled database of structured data about people, places and things, with over 45 million entries.
Million Song Data Set http://aws.amazon.com/datasets/6468931156960467 Metadata on over a million songs and pieces of music. Part of Amazon Web Services.
UCI Machine Learning Repository is a collection of datasets specifically pre-processed for machine learning.
Financial Data Finder at OSU offers a large catalog of financial data sets.
Pew Research Center offers its raw data from its fascinating research into American life.
The BROAD Institute offers a number of cancer-related datasets.

Credit to Forbes article at

http://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#5b2a54cf6796