Recently we have had the opportunity to work on multiple Big Data projects from architecting an end to end solution to playing the role of strategic advisor to our clients on Hadoop and Big Data technologies. As the Hadoop ecosystem gains more momentum by the month, many of our conversations revolve around Hadoop’s role in an overall data architecture. We focus on what Hadoop is good at and where challenges exist. There is one theme seems to apply to many areas of the Hadoop infrastructure: Hadoop is great at doing few large things, but struggles when it’s asked to do many small things.
This concept applies to many areas important in a data infrastructure; everything from the number of users to the types of jobs run against the data. The most common instance of these, and the area that I’d like to focus on here, is data storage.
Large File Processing:
Hadoop does a great job at handling gigantic data sets. It can ingest these sources quickly and easily, store them cheaply, and process them efficiently. This is something Hadoop does very well due to the distributed storage and processing architecture inherent in the ecosystem.
Hadoop can trip up when it tried to handle many small files. Small files do not necessarily mean small data. One of the many use cases for Hadoop is the ability to do real-time or near-real-time ingestion and processing. Every time new data is written to Hadoop, it gets written as a new file. This presents problems when data is being streamed or written to Hadoop regularly. Take the example of a half-hour load into Hadoop. This will produce a new 48 new files a day, 336 a week and 17,472 files year. Double those numbers if you plan on processing every 15 minutes which is a very common and reasonable request.
Why are small files an issue?
When you need to process your data. Hadoop will look at each block of data associated with a file and produce a mapper to read that file. If there are many small files, then each file will get a mapper assigned to it. You can see this in Table A in the image below.
If you are running a process that queries two tables in Hive that were loaded at half-hour intervals, Hadoop will launch 34,944 mappers to process the job. If there were fewer and larger files, querying this data would be much more efficient.
What to do about small files:
Fortunately, the problem can be overcome with the right processing in place. A process will need to be implemented to find these small files and consolidate them together into fewer large files. There are a few open source utilities that can be used to handle this for you. These may not fit many use cases due to compression format choices and other setup issues specific to your environment.
My preference is to do this consolidation myself. This gives me greater freedom to meet my specific requirements and will allow me to pick my own storage format. This generally involves a quick process to move the small files into a temp directory, combining them and inserting the data back into the original directory. If done right, this process can handle just about any Hadoop storage format and can greatly increase the efficiency of storage and processing of data on Hadoop.