Dealing with Small Files Problem in Hadoop Distributed File System
HDFS offers practically unlimited, scalable storage, which is one of the main reasons for turning to Hadoop. But storing a large number of small files causes a serious problem. HDFS is designed to handle large files, gigabytes or terabytes in size, so Hadoop works better with a small number of large files than with a large number of small files. A large number of small files takes up a lot of memory on the NameNode, because the NameNode keeps the metadata for every file and block in memory. Each small file also generates its own map task, so a job ends up with too many map tasks, each with very little input. Storing and processing small files in HDFS therefore adds overhead to MapReduce programs and greatly degrades the performance of the NameNode.
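To get a feel for the scale of the problem, the following rough sketch compares the NameNode memory and map task count for many small files against a few large files holding a similar volume of data. It is only an estimate under stated assumptions: the figure of about 150 bytes of NameNode heap per namespace object (file or block) is a commonly quoted rule of thumb and not a number from this text, the 128 MB block size is the default in Hadoop 2.x and later, and one map task per block (at least one per file) is the behaviour assumed for a plain file-based input format. The function names are hypothetical.

```python
import math

# Assumption: rule-of-thumb heap cost per namespace object (file or block).
BYTES_PER_OBJECT = 150
# Assumption: default HDFS block size in Hadoop 2.x and later (128 MB).
BLOCK_SIZE = 128 * 1024 * 1024


def blocks_for(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks needed for one file of file_size bytes."""
    return math.ceil(file_size / block_size)


def estimate(num_files, file_size, block_size=BLOCK_SIZE):
    """Rough NameNode memory and map task count for num_files equal-sized files."""
    blocks_per_file = blocks_for(file_size, block_size)
    total_blocks = num_files * blocks_per_file
    # The NameNode tracks one object per file plus one per block.
    objects = num_files + total_blocks
    # Assumption: one map task per block, and at least one per file.
    map_tasks = num_files * max(1, blocks_per_file)
    return {
        "blocks": total_blocks,
        "namenode_bytes": objects * BYTES_PER_OBJECT,
        "map_tasks": map_tasks,
    }


if __name__ == "__main__":
    # About 1 TB stored as ten million 100 KB files ...
    small = estimate(10_000_000, 100 * 1024)
    # ... versus a similar volume stored as one thousand 1 GiB files.
    large = estimate(1_000, 1024 ** 3)
    print("small files:", small)
    print("large files:", large)
```

Under these assumptions, ten million 100 KB files cost roughly 3 GB of NameNode heap and ten million map tasks, while one thousand 1 GiB files cost about 1.35 MB of heap and 8,000 map tasks.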