com.lowagie.text.pdf.parser
Class PdfTextExtractor

java.lang.Object
  extended by com.lowagie.text.pdf.parser.PdfTextExtractor

public class PdfTextExtractor
extends Object

Extracts text from a PDF file.

Since:
2.1.4

Field Summary
private  PdfReader reader
          The PdfReader that holds the PDF file.
private  TextProvidingRenderListener renderListener
          The TextProvidingRenderListener that will receive render notifications and provide the resultant text.
 
Constructor Summary
PdfTextExtractor(PdfReader reader)
          Creates a new Text Extractor object, using a SimpleTextExtractingPdfContentRenderListener as the render listener.
PdfTextExtractor(PdfReader reader, TextProvidingRenderListener renderListener)
          Creates a new Text Extractor object.
 
Method Summary
private  byte[] getContentBytesForPage(int pageNum)
          Gets the content bytes of a page.
private  byte[] getContentBytesFromContentObject(PdfObject contentObject)
          Gets the content bytes from a content object, which may be a reference, a stream, or an array.
 String getTextFromPage(int page)
          Gets the text from a page.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

private final PdfReader reader
The PdfReader that holds the PDF file.


renderListener

private final TextProvidingRenderListener renderListener
The TextProvidingRenderListener that will receive render notifications and provide the resultant text.

Constructor Detail

PdfTextExtractor

public PdfTextExtractor(PdfReader reader)
Creates a new Text Extractor object, using a SimpleTextExtractingPdfContentRenderListener as the render listener.

Parameters:
reader - the reader with the PDF

PdfTextExtractor

public PdfTextExtractor(PdfReader reader,
                        TextProvidingRenderListener renderListener)
Creates a new Text Extractor object.

Parameters:
reader - the reader with the PDF
renderListener - the render listener that will be used to analyze renderText operations and provide the resultant text

Method Detail

getContentBytesForPage

private byte[] getContentBytesForPage(int pageNum)
                               throws IOException
Gets the content bytes of a page.

Parameters:
pageNum - the number of the page whose content stream you want to get
Returns:
a byte array with the effective content stream of a page
Throws:
IOException

getContentBytesFromContentObject

private byte[] getContentBytesFromContentObject(PdfObject contentObject)
                                         throws IOException
Gets the content bytes from a content object, which may be a reference, a stream, or an array.

Parameters:
contentObject - the object to read bytes from
Returns:
the content bytes
Throws:
IOException

getTextFromPage

public String getTextFromPage(int page)
                       throws IOException
Gets the text from a page.

Parameters:
page - the number of the page to extract text from
Returns:
a String with the content as plain text (without PDF syntax)
Throws:
IOException
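
Example:
Because getTextFromPage works on one page at a time, extracting a whole document means calling it once per page. The following is a minimal sketch of that loop, not part of this class. PageTextCollector and PageTextSource are hypothetical names introduced here so the sketch runs without a PDF on hand; PageTextSource only mirrors the shape of getTextFromPage(int). The sketch assumes page numbers start at 1, as they do in PdfReader, and that the page count comes from PdfReader.getNumberOfPages().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of iText): collects the text of every page. */
class PageTextCollector {

    /** Hypothetical stand-in with the same shape as PdfTextExtractor.getTextFromPage(int). */
    interface PageTextSource {
        String getTextFromPage(int page) throws IOException;
    }

    /**
     * Requests pages 1..pageCount (assumed 1-based) and returns one string per page.
     * An IOException from the source is rethrown unchecked to keep call sites short.
     */
    static List<String> collect(int pageCount, PageTextSource source) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                pages.add(source.getTextFromPage(page));
            } catch (IOException e) {
                throw new UncheckedIOException("Could not read page " + page, e);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stub source so the sketch runs on its own. With iText on the classpath,
        // the wiring would presumably look like this ("input.pdf" is a placeholder):
        //   PdfReader reader = new PdfReader("input.pdf");
        //   PdfTextExtractor extractor = new PdfTextExtractor(reader);
        //   List<String> pages = collect(reader.getNumberOfPages(), extractor::getTextFromPage);
        //   reader.close();
        List<String> pages = collect(3, page -> "text of page " + page);
        for (String text : pages) {
            System.out.println(text);
        }
    }
}
```

Keeping the loop behind a small interface is only a convenience for the sketch; calling getTextFromPage directly inside a for loop over the reader's page count is equivalent.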
