Product: TIBCO Spotfire®
Solving the "small file problem"
Question: How do I specify the number of mappers?
I have hundreds or even thousands of small files generated by MapReduce jobs. I'd like to use these files for further analysis, but there are too many of them!
Is there any way to consume all of these small files as if they were a few larger files?
By default, MapReduce spawns as many mappers as there are input splits. We'd like to use fewer mappers. Please help!
Solution for Pig Operators
- Choose a dataset that contains many "part-" files.
My dataset has 112 part- files.
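To see how many part- files are in the dataset, you can count them from the shell; a minimal sketch, assuming the same /path/to/folder used below:
$ hadoop fs -ls /path/to/folder | grep -c 'part-'
112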
- Calculate the total size of your dataset with this command:
hadoop fs -du -h /path/to/folder
In the following example, my folder uses 1.8 MB, or approximately 1887436.8 bytes:
$ hadoop fs -du -h /path/to/folder
1.8 M  5.4 M  /path/to/folder
- Calculate the value for this parameter:
pig.maxCombinedSplitSize = total file size in bytes / (desired number of mappers)
pig.maxCombinedSplitSize = 1887436.8 bytes / 37 mappers = approx 51012 bytes; rounding down to 48 KiB gives 49152. Because each mapper reads at most this many bytes, the actual mapper count works out to ceil(1887436.8 / 49152) = 39.
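The same calculation can be scripted; a minimal shell sketch, assuming the two-column -du output shown above and 37 desired mappers:
$ TOTAL=$(hadoop fs -du -s /path/to/folder | awk '{print $1}')  # total size in bytes
$ echo $((TOTAL / 37))                                          # candidate split size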
- In the data source connection, specify this parameter:
pig.maxCombinedSplitSize = 49152
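Outside of the Spotfire data source dialog, the same property can be passed to Pig directly; a minimal sketch (the script name is hypothetical):
# Option A: pass the property on the pig command line
pig -Dpig.maxCombinedSplitSize=49152 myscript.pig
# Option B: set it at the top of the Pig script itself
#   SET pig.splitCombination true;       -- split combining is on by default
#   SET pig.maxCombinedSplitSize 49152;  -- max bytes per combined input split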
- Drag the dataset onto the canvas.
- Connect it to the Column Filter operator, which is a Pig operator.
- Check the ResourceManager (RM) for the Pig job (Column Filter) to determine the number of mappers being used. In this case: 39.
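If you prefer the command line to the ResourceManager web UI, the job status also reports the mapper count; a sketch, assuming Hadoop 2.x and a hypothetical job ID:
$ mapred job -list
$ mapred job -status job_1400000000000_0001 | grep -i 'Number of maps'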
Solution for MapReduce Operators
- For MapReduce operators, such as Alpine Forest, you can pass the dataset through a Column Filter operator as above, selecting all columns.
- Notice that the MapReduce operator (Alpine Forest) then uses 39 mappers; 38 part-* files and one metadata file will be generated.
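To verify the generated output, list the result directory; a sketch with a hypothetical output path:
$ hadoop fs -ls /path/to/output | grep -c 'part-'   # expect 38 part-* files
$ hadoop fs -ls /path/to/output                     # the metadata file is also listed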
Related operators:
- Scatter Plot Matrix
- Null Value Replacement
(*) Also available with MR implementation.