What is the best way to analyze Microsoft Office and PDF documents?


I'm developing a desktop search engine using VB9 (VS2008) and Lucene.NET. The indexer in Lucene.NET accepts only raw text, and raw text cannot be read directly from Microsoft Office (DOC, DOCX, PPT, PPTX) or PDF documents. What is the best way to extract raw text from such files?


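You can do what Windows Desktop Search does and use components that implement the IFilter interface. An IFilter is a COM component that knows how to pull the text out of one file format, so the same code can read DOC, DOCX, PPT, PPTX and PDF files as long as a filter for each format is installed on the machine.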
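
To give an idea of what that involves, here is a minimal sketch in VB. It assumes the standard Windows declarations for IFilter and LoadIFilter (query.dll), written out by hand because .NET has no built-in wrapper. The module name IFilterTextExtractor and the function ExtractText are made up for this example, and error handling is reduced to the bare minimum, so treat it as a starting point rather than a finished implementation.

```vb
Imports System
Imports System.Runtime.InteropServices
Imports System.Text

' Property identifier attached to each chunk (FULLPROPSPEC in the Windows SDK).
<StructLayout(LayoutKind.Sequential)> _
Public Structure FULLPROPSPEC
    Public guidPropSet As Guid
    Public ulKind As Integer
    Public propidOrName As IntPtr   ' union of PROPID and LPOLESTR
End Structure

' Description of one chunk returned by IFilter.GetChunk.
<StructLayout(LayoutKind.Sequential)> _
Public Structure STAT_CHUNK
    Public idChunk As Integer
    Public breakType As Integer
    Public flags As Integer         ' 1 = CHUNK_TEXT, 2 = CHUNK_VALUE
    Public locale As Integer
    Public attribute As FULLPROPSPEC
    Public idChunkSource As Integer
    Public cwcStartSource As Integer
    Public cwcLenSource As Integer
End Structure

<StructLayout(LayoutKind.Sequential)> _
Public Structure FILTERREGION
    Public idChunk As Integer
    Public cwcStart As Integer
    Public cwcExtent As Integer
End Structure

' COM declaration of IFilter. Method order must match the native vtable.
<ComImport(), Guid("89BCB740-6119-101A-BCB7-00DD010655AF"), _
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IFilter
    <PreserveSig()> Function Init(ByVal grfFlags As Integer, ByVal cAttributes As Integer, _
        ByVal aAttributes As IntPtr, ByRef pdwFlags As Integer) As Integer
    <PreserveSig()> Function GetChunk(ByRef pStat As STAT_CHUNK) As Integer
    <PreserveSig()> Function GetText(ByRef pcwcBuffer As Integer, _
        <Out(), MarshalAs(UnmanagedType.LPArray)> ByVal buffer() As Char) As Integer
    <PreserveSig()> Function GetValue(ByRef ppPropValue As IntPtr) As Integer
    ' Never called here; declared only to keep the vtable layout complete.
    <PreserveSig()> Function BindRegion(ByVal origPos As FILTERREGION, _
        ByRef riid As Guid, ByRef ppunk As IntPtr) As Integer
End Interface

' Hypothetical helper module for this example.
Public Module IFilterTextExtractor
    Private Const CHUNK_TEXT As Integer = 1
    Private Const FILTER_S_LAST_TEXT As Integer = &H41709
    ' CANON_PARAGRAPHS Or HARD_LINE_BREAKS Or CANON_HYPHENS Or CANON_SPACES
    Private Const INIT_FLAGS As Integer = 1 Or 2 Or 4 Or 8

    ' Asks Windows for the filter registered for the file's type.
    <DllImport("query.dll", CharSet:=CharSet.Unicode)> _
    Private Function LoadIFilter(ByVal pwcsPath As String, _
        <MarshalAs(UnmanagedType.IUnknown)> ByVal pUnkOuter As Object, _
        ByRef ppIUnk As IFilter) As Integer
    End Function

    ' Returns the text of all text chunks in the file as one string.
    Public Function ExtractText(ByVal filePath As String) As String
        Dim filter As IFilter = Nothing
        Dim hr As Integer = LoadIFilter(filePath, Nothing, filter)
        If hr <> 0 OrElse filter Is Nothing Then
            Throw New InvalidOperationException("No IFilter available for " & filePath)
        End If
        Try
            Dim filterFlags As Integer
            hr = filter.Init(INIT_FLAGS, 0, IntPtr.Zero, filterFlags)
            If hr < 0 Then Marshal.ThrowExceptionForHR(hr)

            Dim result As New StringBuilder()
            Dim chunk As New STAT_CHUNK()
            Dim buffer(4095) As Char
            ' Simplification: stop at the first non-S_OK result from GetChunk.
            ' A fuller version would skip recoverable errors and keep going.
            While filter.GetChunk(chunk) = 0
                If (chunk.flags And CHUNK_TEXT) <> 0 Then
                    Do
                        Dim count As Integer = buffer.Length
                        hr = filter.GetText(count, buffer)
                        If hr < 0 Then Exit Do   ' no more text in this chunk, or an error
                        result.Append(buffer, 0, count)
                    Loop Until hr = FILTER_S_LAST_TEXT
                    result.AppendLine()
                End If
            End While
            Return result.ToString()
        Finally
            Marshal.ReleaseComObject(filter)
        End Try
    End Function
End Module
```

The string returned by something like IFilterTextExtractor.ExtractText("C:\docs\report.docx") (an example path) can then be handed to the Lucene.NET indexer as an ordinary text field. Whether a given file type works depends on which filters are installed, and some filters are sensitive to the process bitness and threading model, so it is worth testing with each format you need.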