Sunday, December 11, 2011

#2 : PDF Parser

The Portable Document Format (PDF) is a widely used file format for exchanging data or information compiled in a single document. However, there are times when we need to extract its text content and save it to a database. That is basically the purpose of this application, which I call PDF Parser.

PDF Parser is a utility that reads text from PDF files. It uses PDFBox, an open-source Java class library, which you can download from http://sourceforge.net/projects/pdfbox/. It's a cool library: easy to use and very helpful for data mining.

You only need to add IKVM.GNU.Classpath and PDFBox-0.7.3 to your project references. Then add the following lines of code:

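                    // Requires: using org.pdfbox.pdmodel; using org.pdfbox.util;
                    // Verbatim string (@"...") so the backslash is not read as an escape.
                    PDDocument doc = PDDocument.load(@"E:\sample.pdf");
                    PDFTextStripper stripper = new PDFTextStripper();
                    string sOutputString = stripper.getText(doc);
                    doc.close(); // release the file once the text is extracted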
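
If you want to reuse this in more than one place, the same calls can be wrapped in a small helper that always closes the document, even when extraction fails. The sketch below is only an illustration, not part of the original application: the class name PdfTextReader and the method ExtractText are made up for this example, and it assumes the PDFBox 0.7.3 package layout (org.pdfbox.pdmodel, org.pdfbox.util) and that PDFTextStripper offers setStartPage and setEndPage with 1-based page numbers.

```csharp
using System;
using org.pdfbox.pdmodel;   // PDDocument (assumes the PDFBox 0.7.3 namespaces)
using org.pdfbox.util;      // PDFTextStripper

// Hypothetical helper; the class and method names are illustrative only.
public static class PdfTextReader
{
    // Returns the text of pages startPage..endPage (1-based, inclusive).
    public static string ExtractText(string path, int startPage, int endPage)
    {
        PDDocument doc = PDDocument.load(path);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage); // first page to read
            stripper.setEndPage(endPage);     // last page to read
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the file, even if getText throws.
            doc.close();
        }
    }
}
```

A call such as PdfTextReader.ExtractText(@"E:\sample.pdf", 1, 2) would then return the text of the first two pages as a single string, ready to be cleaned up or saved to the database.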

By the way, I'm using C#, but you can easily convert it to VB. Here's a screenshot of the sample PDF parser I built:

[Screenshot: sample PDF Parser application]

Check out my ASP.NET PDF Parser demo online at http://utility.aerinet.com/
