I have a folder in hdfs which has two subfolders each one has about 30 subfolders which,finally,each one contains xml files. I want to list all xml files giving only the main folder's path. Locally I can do this with apache commons-io's FileUtils.listFiles(). I have tried this
FileStatus status = fs.listStatus( new Path( args[ 0 ] ) );
but it only lists the two first subfolders and it doesn't go further. Is there any way to do this in hadoop?
You'll need to use the FileSystem object and perform some logic on the resultant FileStatus objects to manually recurse into the subdirectories.
You can also apply a PathFilter to only return the xml files using the listStatus(Path, PathFilter) method
The hadoop FsShell class has examples of this for the hadoop fs -lsr command, which is a recursive ls - see the source, around line 590 (the recursive step is triggered on line 635)