How to determine whether a PDF is tagged or not?


How can I tell whether a PDF is tagged? I'm developing a program that copies text from a PDF file and displays it in my app. As a test, I copied a table from a PDF file (ordinary copy and paste) and pasted it into MS Word. The result was plain text with no table structure. I have also heard that a table copied from a PDF file sometimes turns into an image when pasted into Word. Is that true?


How to determine if a PDF is tagged or not?

Depending on the library you use to process your files, you can try retrieving the MarkInfo entry from the Catalog dictionary and checking its Marked flag.

From the PDF Specification:

TABLE 3.25 Entries in the catalog dictionary
KEY: MarkInfo
TYPE: dictionary
VALUE: (Optional; PDF 1.4) A mark information dictionary containing information about the document’s usage of Tagged PDF conventions (see Section 10.6, “Logical Structure”).
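As a rough illustration, the lookup could be sketched in Python like this. The check itself only needs a dict-like Catalog; the load_catalog helper is hypothetical and assumes the pypdf library is installed, so adapt it to whatever library you actually use.

```python
def is_marked_as_tagged(catalog):
    """Return True if catalog["/MarkInfo"]["/Marked"] is present and true."""
    if "/MarkInfo" not in catalog:
        return False  # MarkInfo is optional in the Catalog
    mark_info = catalog["/MarkInfo"]
    if "/Marked" not in mark_info:
        return False  # Marked defaults to false when it is absent
    flag = mark_info["/Marked"]
    # Assumption: pypdf wraps booleans in an object exposing .value;
    # plain Python bools pass through unchanged.
    return getattr(flag, "value", flag) is True


def load_catalog(path):
    """Hypothetical helper: read the Catalog dictionary with pypdf."""
    from pypdf import PdfReader  # assumed to be installed

    return PdfReader(path).trailer["/Root"]
```

With a real file this would be called as is_marked_as_tagged(load_catalog("some.pdf")), where the file name is only a placeholder.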

However, even if the Marked flag in the MarkInfo dictionary is set to true, that does not guarantee the tags are actually there, and if they are, they might not be useful to you at all for extracting tables. You can still find PDF files with tables that use the tags only for marking paragraphs and pictures.
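To see which tags a given file really contains, one option is to walk the structure tree under the Catalog's StructTreeRoot entry and collect the structure types (the /S values). The following is only a sketch: it assumes a dict-like Catalog such as the one a library like pypdf returns, a well-formed tree, and it ignores custom types remapped through /RoleMap.

```python
def collect_structure_types(catalog):
    """Return the set of structure types (/S values) under /StructTreeRoot."""
    found = set()
    if "/StructTreeRoot" not in catalog:
        return found  # no logical structure at all
    stack = [catalog["/StructTreeRoot"]]
    while stack:
        node = stack.pop()
        # Resolve indirect references if the library exposes get_object().
        if hasattr(node, "get_object"):
            node = node.get_object()
        if not isinstance(node, dict):
            continue  # integer kids are marked-content IDs, not elements
        if "/S" in node:
            found.add(str(node["/S"]))
        if "/K" in node:
            kids = node["/K"]  # a single kid or an array of kids
            stack.extend(kids if isinstance(kids, list) else [kids])
    return found
```

If "/Table" is missing from the returned set, the file has no table tags to rely on, whatever the Marked flag says.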

Long story short: unless you generate the files your application is going to consume, and therefore know which tags to look for, it is not a good idea to rely on these tags for extracting tables from PDF files.