If you have a binary file, such as a PDF or an Office document, you can send it with the dataload API and let the Search Appliance extract the text from it.
<?xml version="1.0" encoding="UTF-8"?>
<ThunderstoneReplication
xmlns:dt="urn:schemas-microsoft-com:datatypes"
>
<Item>
<Type>I</Type>
<Url>http://www.example.com/dataload.pdf</Url>
<RawData dt:dt="bin.base64">0M8R4KGxGu....</RawData>
</Item>
</ThunderstoneReplication>
The elements are:
<Type>
The action to take with this data. Text value may be one of:
I
Insert the data, overwriting any previous data for the URL.
<Url>
The URL of the document.
<RawData>
The base64 encoding of the raw document. It must include the dt:dt="bin.base64" attribute.
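As a rough illustration of the format above, the following Python sketch builds such a document from raw bytes. The function name build_dataload_xml and the sample bytes are hypothetical and exist only for this example; how the resulting XML is submitted to the dataload API is not shown here.

```python
import base64
from xml.sax.saxutils import escape


def build_dataload_xml(url: str, raw: bytes) -> str:
    """Return a ThunderstoneReplication document that inserts one binary item.

    Hypothetical helper for illustration; it only assembles the XML text.
    """
    # <RawData> carries the base64 encoding of the raw document bytes.
    encoded = base64.b64encode(raw).decode("ascii")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<ThunderstoneReplication\n'
        'xmlns:dt="urn:schemas-microsoft-com:datatypes"\n'
        '>\n'
        '<Item>\n'
        '<Type>I</Type>\n'
        # Escape &, < and > so the URL is valid XML character data.
        f'<Url>{escape(url)}</Url>\n'
        # The dt:dt="bin.base64" attribute is required on <RawData>.
        f'<RawData dt:dt="bin.base64">{encoded}</RawData>\n'
        '</Item>\n'
        '</ThunderstoneReplication>\n'
    )


# Sample bytes stand in for the contents of a real PDF or Office file.
payload = build_dataload_xml(
    "http://www.example.com/dataload.pdf",
    b"%PDF-1.4 sample bytes",
)
```

In practice the bytes would come from reading the file in binary mode, for example open(path, "rb").read().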