timport - General purpose Texis importer

SYNOPSIS

timport [-s schemafile] [-schema_option(s)] [options] [-file file(s)]
timport -dbf [-schema_option(s)] [options] [-file file(s)]
timport -csv [-schema_option(s)] [options] [-file file(s)]
timport -col [-schema_option(s)] [options] [-file file(s)]
timport -mail [-schema_option(s)] [options] [-file file(s)]

DESCRIPTION
Timport reads a schema file, which describes the input data and the target table, and imports data files into Texis tables.

-s schemafile

Is required, unless using one of the special known format options, and specifies the name of the file containing the data and table descriptions.

-d database

Specifies the name of the database to use.

NOTE: This was changed in version 2.12 (Feb. 25 1999). See the -D option.

-v

Turns on verbose mode. Extra information about the processing will be printed. More -v's will increase verbosity. Placing a number immediately after the v will increase verbosity by that much.

-c

Prints Texis API calls as they are made. This is useful to programmers to see the correct usage of the Texis API.

-t

Prints a tick mark (.) for each record imported. This provides a status display so you can see how far along the import is.

-D

Dumps parsed records to the screen instead of inserting into Texis. This is useful for working out the tags and expressions. When this is on no attempt is made to connect to the Texis server or database, so testing may be done without the server or fear of messing up a table.

NOTE: This was added in version 2.12 (Feb. 25 1999).

-g

Generates the schema file with all of the current settings from the specified schema file, command line, and guessed columns from csv and col formats. This is most useful when building a schema for a new dataset that is of format csv or col. When given just the -csv or -col command line options or a schema file with no fields defined, Timport will attempt to guess the column positions, types and names. You can generate a schema file based on its guess and adjust for any mistakes it might have made.

-h

Prints a short usage message.

-H

Prints a long usage message including information about the schema file.

-schema_option

This option allows you to specify anything that might be in a schemafile on the command line. Using this you can avoid writing a schema file for simple imports. It can also be used to override settings from the schema file. Specify an option just like it would be in the schema file. Make sure you quote things with backslash so the shell does not eat them.

e.g.: -database /tmp/testdb -csv "\x09"

-dbf

Import a dBase or FoxPro table.

-csv

Import "comma separated values" data. Guess at the field names.

-col

Import columnar data. Guess at the field positions and names.

-mail

Import data in Internet mail (RFC822) format. The From, Subject, and Date fields are stored in addition to the full text of the message.

input_file(s)

The data files to import into Texis. You may specify multiple files. You may also specify - to read from a pipe, or specify a file containing a list of file names by preceding its name with &.

Schema file format

Comment lines start with a # character. Blank lines are ignored. Each line has the syntax:

keyword value(s)

where any number of space(s) and/or tab(s) separate keywords and values.
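As a rough illustration of the line syntax described above, the following Python sketch splits schema text into keyword and value pairs. It is not how timport itself reads the file. The function name parse_schema and the choice to ignore leading whitespace before # are assumptions made for this example.

```python
def parse_schema(text):
    """Return a list of (keyword, value_string) pairs from schema text."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (assumed: leading blanks allowed).
        if not line or line.startswith("#"):
            continue
        # The keyword ends at the first run of spaces and/or tabs.
        parts = line.split(None, 1)
        keyword = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        entries.append((keyword, value))
    return entries


sample = """\
# name type tag default
database /tmp/testdb
table    load

field   Number  long  Number  0
col
"""

entries = parse_schema(sample)
for keyword, value in entries:
    print(keyword, "=>", repr(value))
```

The value is kept as one string because some keywords, such as the default value of field, take everything up to the end of the line.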

Ordering of keywords is not important except that fields must be listed in the order that they appeared in the create table statement and fields should be listed last (after all other keywords). In Texis version 6 and later, a maximum of 1000 fields may be listed (previous versions had a limit of 800).

Possible keywords: (a * indicates a required item)

   host     internet_address
   port     port_number
   user     texis_user
   group    texis_group
   pass     texis_password
   recdelim record_delimiting_rex
   recexpr  record_matching_rex
   readexpr record_delimiting_rex
   recsize  record_max_size
   datefmt  date_format_string
   dbf      optional_translation
   csv      optional_delimiter
   col
   mail
   oracle
   xml
   xmlns uri
   xmlns:prefix uri
   keepemptyrec
   stats
   multiple
   firstmatch
   allmatch  separator
   trimspace
   keepfirst
   csvquote
   csvescquote
   xmldatasetlevel value
   createtable boolean_value
   database  texis_database_name
   noid      texis_table_name
   droptable texis_table_name
   table     texis_table_name
   field     texis_field_name texis_sql_type tag_name_or_expr [default]

NOTE: field is not required if stats is used.

host, port, user, group, and pass are the settings used to log into the Texis server. If they are unspecified, timport will log into the Texis server on the same machine, on the default port, as PUBLIC with no password. NOTE: Versions prior to 2.12 (May 13 1998) logged in as user _SYSTEM.

recdelim is used for separating records out of an input file containing multiple records. It implies multiple. This will override readexpr.

recexpr is an expression that matches an entire record. field tags are then numbers indicating the subexpression range for the field. Good for records that are not well delimited (like columns).

readexpr is used as an input file delimiter for reading when using "multiple" but not "recdelim". It is needed when using multiple without recexpr while reading from a pipe or redirection. It specifies how to delimit reads. This expression should match the interval between records. This is overridden by recdelim.

recsize sets the size of the maximum readable record when using recdelim or readexpr. The default is 1 megabyte. Increase this value if you ever see the "no end delimiter found" warning message.

datefmt is the format to expect date fields in. The default is Texis style "yyyy-mm-dd[ HH[:MM[:SS]]]". The scanner will treat all punctuation and space as delimiters.

Specify:

y: for year digits
m: for month digits or month name
d: for day of month digits
j: for day of year digits (julian day of year; added in version 1.84, Oct 13 1997)
H: for hour digits
M: for minute digits
S: for second digits
p: for "am" or "pm" string
x: for junk

The date scanner will read up to the next delimiter or as many digits as you specify, whichever comes first. Any non-digit is a delimiter for the digit-only types. 'p' will only check for 'a' or 'p' and then skip all trailing alphabetics. 'x' will skip all alphabetics. 1900 will be added to 2 digit year specs greater than 69. 2000 will be added to 2 digit year specs less than 70.

Examples:

  FORMAT                   MATCHES                    MEANS
  yy-mm-dd HHMM            95-04-27 16:54             1995-04-27 16:54:00
  dd-mm-yyyy HH:MM:SS      27/04/1995 16:54:32        1995-04-27 16:54:32
  yyyymmdd HHMMSS p        19950427 045432 pm         1995-04-27 16:54:32
  x, dd mmm yyyy HH:MM:SS  Thu, 27 Apr 1995 16:55:56  1995-04-27 16:55:56
  yyyy-jjj                 1997-117                   1997-04-27 00:00:00
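The two-digit year rule and the day-of-year form can be checked with a short Python sketch. It only illustrates the arithmetic and is not timport's date scanner. The function name expand_two_digit_year is hypothetical, and the day-of-year lookup relies on Python's own %j directive.

```python
from datetime import datetime


def expand_two_digit_year(yy):
    """Apply the pivot: 70 through 99 become 19xx, 00 through 69 become 20xx."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return yy + 1900 if yy > 69 else yy + 2000


# Day 117 of 1997, resolved with Python's %j (day of year) directive.
day_of_year = datetime.strptime("1997-117", "%Y-%j")

print(expand_two_digit_year(95))
print(expand_two_digit_year(69))
print(day_of_year.strftime("%Y-%m-%d %H:%M:%S"))
```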

dbf, csv, col, mail and oracle allow you to specify one of several known file formats. Instead of having to specify rex expressions for the fields timport will automatically parse out the fields from the known format. Specify one of the following keywords:

dbf

Load dBase/FoxPro tables into Texis. Don't specify any fields. The DBF files specified on the command line will be imported into Texis table(s). The Texis table name will be that provided with the table keyword or the name of the original DBF file if a table name is not provided. The fields will have the same names in the Texis table as they did in the DBF table. Data types will be preserved. Memo fields will become varchar fields. If your DBF table has special characters in it you may wish to use the dostoiso option to translate characters from the DOS code page to the ISO latin character set. (e.g. dbf dostoiso)

col

Load fixed width columnar data into Texis. Many printed reports and program outputs come in this format. If no fields are specified timport will attempt to guess the column positions and types by sampling a number of rows from the first input file. The first row is assumed to contain the names of the fields.

You can specify the precise column names and positions with the field keyword. Place the character column positions in the 3rd value for field. Character columns are numbered starting at 1. Specify a range of character columns by placing a hyphen (-) between the first and last column numbers (e.g.: 5-9). To get all characters after a particular column include the hyphen, but leave off the second number (e.g.: 57-).
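To make the column numbering concrete, here is a minimal Python sketch that turns a range such as 5-9 or 57- into a slice of an input line. The helper name column_slice is hypothetical, and treating a bare number as a single column is an assumption of this example, not documented timport behavior.

```python
def column_slice(spec):
    """Convert a 1-based, inclusive column range into a Python slice."""
    first, hyphen, last = spec.partition("-")
    start = int(first) - 1          # columns are numbered from 1
    if not hyphen:
        return slice(start, start + 1)   # assumed: a bare number is one column
    if last == "":
        return slice(start, None)        # open range: through end of line
    return slice(start, int(last))       # inclusive end column


line = "ABCDEFGHIJKL"
print(line[column_slice("5-9")])
print(line[column_slice("10-")])
```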

By default the first row of each specified file will not be imported. Use the keepfirst keyword to import the first row.

csv

Load comma separated values into Texis. Many programs will export data in this format. The field delimiter is assumed to be a comma (,). Specify a different delimiter by placing it after the csv keyword. Everything up to the end of line will be taken as the field delimiter. You may encode special characters in hex notation by using \x followed by the 2 digit hex code for the character (e.g.: for tab delimiters use: csv \x09).
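The hex notation can be illustrated with a small Python sketch that decodes \x09 into a tab and then splits tab-delimited rows with Python's csv module. This shows the idea only. The function name decode_delimiter is hypothetical, and the parsing here is done by Python, not by timport.

```python
import csv
import re


def decode_delimiter(value):
    r"""Replace each \xNN escape with the character it encodes."""
    return re.sub(r"\\x([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), value)


delimiter = decode_delimiter(r"\x09")
rows = list(csv.reader(["id\tname", "1\tWidget"], delimiter=delimiter))
print(rows)
```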

If no fields are specified timport will attempt to guess the column names and types by sampling a number of rows from the first input file. The first row is assumed to contain the names of the fields.

You can specify the precise column names and types with the field keyword. Place the input field numbers in the 3rd value for field. Input fields are numbered starting at 1.

By default the first row of each specified file will not be imported. Use the keepfirst keyword to import the first row.

Normally double quotes (") are respected. If your data has quotes scattered through it and quotes are not used for field binding, you can turn off quote processing with the csvquote keyword.

If your data uses quotes around fields, but does not escape them within fields by doubling them, you can turn off embedded quote processing with the csvescquote keyword.
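The difference between respecting quotes and ignoring them can be seen with Python's csv module, used here only as an analogy. The quoting=csv.QUOTE_NONE setting belongs to Python and merely resembles the effect described for csvquote; it is not a timport option.

```python
import csv

line = 'a,"say ""hi""",c'

# Quotes respected: the doubled quotes inside the field collapse to one.
quoted = next(csv.reader([line]))

# Quote processing off: quote characters are kept as ordinary data.
unquoted = next(csv.reader([line], quoting=csv.QUOTE_NONE))

print(quoted)
print(unquoted)
```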

mail

Load internet style (RFC822) mail box data. The From, Subject, and Date fields will be imported as well as the full text of the mail message(s). To get other fields you may use the -g option to generate the schema file for this and edit it to your liking.

oracle

Load Oracle EXPORT format files. You must specify the fields that you want imported into Texis, as well as the datatypes you want to use in Texis. This is only known to work with Oracle V07.03.03 files, and not V08.

xml

Load an XML document. You must specify which fields you want to import with an XPath-like specifier. XML documents are expected to be in the following hierarchy (not necessarily these names):

<dataset>
    <record>
        <column1>abc</column1>
        <column2>def</column2>
        <column3>ghi</column3>
    </record>
    <record>
        <column1>jkl</column1>
        <column2>mno</column2>
    </record>
    ...
</dataset>

It must be formatted as a set of records wrapped by an outer tag (and possibly more outer tags - see xmldatasetlevel below).
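The dataset, record, column hierarchy can be pictured with the following Python sketch, which walks a small document using the standard xml.etree.ElementTree module. It only shows the shape of the data and is not how timport parses XML. The element names are the placeholder names from the hierarchy above.

```python
import xml.etree.ElementTree as ET

doc = """<dataset>
  <record>
    <column1>abc</column1>
    <column2>def</column2>
    <column3>ghi</column3>
  </record>
  <record>
    <column1>jkl</column1>
    <column2>mno</column2>
  </record>
</dataset>"""

root = ET.fromstring(doc)

# Each child of the dataset-level tag is one record; each grandchild is a column.
rows = [{column.tag: column.text for column in record} for record in root]

for row in rows:
    print(row)
```

The second record has no column3, which is the kind of case where a field's default value would apply.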

Note: Prior to July 2005, attributes on the dataset-level tag were not handled properly. In the following example:

<dataset randomattribute="value">
    <record>
        <column1>abc</column1>
        ...
</dataset>

Prior to July 2005, timport would see randomattribute as the first row: timport's first row would have randomattribute set to value, and all fields under record would be set to their default values. For subsequent rows, randomattribute would be set to its field's default value.

With a July 2005 or later version, randomattribute will not be seen as a separate row, and dataset@randomattribute will be properly set to value for all rows fetched.

xmlns defines a default XML namespace for the schema. All schema elements will reside in this namespace. Unlike XML, the only way to specify a default namespace is for the entire schema. If finer control of where namespaces apply is needed, please use multiple xmlns:prefix commands.

xmlns:prefix defines an XML namespace prefix to be used in the schema, where prefix is replaced by whatever prefix you wish to use. It is legal (and very common) to define multiple prefixes in a single schema. Please see XML Namespaces for more detail.

keepemptyrec will use a record filled with default values when a completely empty record is found (default behavior will discard a completely empty record).

stats will add fields "Fsize long" and "Ftime date" and fill them in with the file's info for each file. It will also add "File varind" if no fields have been defined.

multiple indicates that there may be more than one record per input file.

firstmatch indicates that the first match of a tag expression should be stored instead of the last. Sometimes a tag expression will match data in a following field. This flag will ensure that the first occurrence of a tag within a record will be used instead of any subsequent match within that record.

allmatch indicates that all matches of a tag expression should be combined and stored instead. Multiple occurrences are combined with the specified separator in between.
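The three behaviors (last match by default, firstmatch, and allmatch) can be compared with a small Python sketch. It uses Python regular expressions in place of REX, and the function name collect, its mode argument, and the Keyword tag are all hypothetical. It is an illustration of the selection rule, not timport's matching code.

```python
import re


def collect(pattern, record, mode="last", separator=""):
    """Pick the last, first, or all matches of a tag pattern in one record."""
    matches = re.findall(pattern, record, flags=re.MULTILINE)
    if not matches:
        return None                      # caller would fall back to a default
    if mode == "first":
        return matches[0]                # like firstmatch
    if mode == "all":
        return separator.join(matches)   # like allmatch with a separator
    return matches[-1]                   # default: last match is stored


record = "Keyword: alpha\nKeyword: beta\nKeyword: gamma\n"
pattern = r"^Keyword: (.*)$"

print(collect(pattern, record))
print(collect(pattern, record, mode="first"))
print(collect(pattern, record, mode="all", separator=", "))
```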

trimspace indicates that leading and trailing whitespace should be trimmed from character fields.

trimdollar indicates that leading whitespace and dollar signs should be trimmed from character fields.

keepfirst only applies to the special formats csv and col. It indicates that the first row from the input should be kept. By default it will be deleted because it usually contains titles.

csvquote only applies to the special format csv. It turns off special handling of quotes. Normally double quotes (") are respected. If your data has quotes scattered through it and quotes are not used for field binding, you will need this option.

csvescquote only applies to the special format csv. It turns off special handling of embedded quotes. Normally embedded quotes are expected to be escaped by doubling them. This will remove any attempt to handle embedded quotes.

xmldatasetlevel indicates how deep the dataset tag is in an XML document. If your data is buried a few levels deep in wrapper tags, you can use this command to specify what level to regard as the 'dataset' level (See examples below).

createtable indicates whether timport should attempt to make the table if it does not exist. To disable table creation set this to False.

droptable indicates that the table should be dropped before loading any new data into it.

noid will suppress the default "id counter" field for the specified table. Normally the field "id counter" is inserted at the beginning of all table definitions.

field expects 3 or 4 values.

1: The name of the field in the Texis table.
2: The type of the field in the Texis table.
3: The tag for the field or a '/' followed by a REX expression to match the tag. When using recexpr this is expected to be a range of subexpression numbers. When using csv or oracle this is expected to be an input field number. When using col this is expected to be a range of input column numbers.
4: A default value to insert if the field is not found. Everything up to the end of the line is used, including spaces.

These are documented in detail later in this manual and in the -H help.

Prerequisites

The Texis server must be running for client/server imports and the table(s) must match the schema before importing data. The importer will warn you if the table(s) don't match what you specified in the schema file. If the table does not exist it will be created. If the database does not exist and is supposed to be on the local machine it will be created.

EXAMPLE
Given the following schema file (timport.sch):

database /tmp/testdb
table    load
#       name    type            tag     default_val
field   Subject varchar         Subject
field   From    varchar         From
field   Number  long            Number  0
field   Date    date            Date
field   File    varind          -
field   Text    varchar         -

And the following input file (example.txt):

From: Thunderstone EPI Inc.
Subject: Test import
Number: 1
Date: 1995-04-19 11:31:00

This is my message; this is my file.
This is more message.
This is the last line of the message.

Use a command line like the following:

timport -s timport.sch example.txt