Multiple Records Per File

One of the most common types of record-oriented text has a few header lines preceding a portion of narrative text. This pattern repeats throughout the file, so that there are many records per file. You want to capture each header in its own field, and also capture the full text of the record in a field of its own. The sample file timport.sch provides an example of this.

The individual fields might be defined as separate expressions, or as subexpressions of one large expression defining the whole record. Where one expression is defined for an entire record, it is assigned to the keyword recexpr (record expression).

Where a recexpr is used, the individual fields can be defined with numbers indicating which subexpression, or range of subexpressions, of the overall expression captures the data for that field. Where recexpr is not used, each field has its own REX expression defined.
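The numbered-subexpression idea can be illustrated outside of Texis. The following is a minimal Python sketch, not timport itself: it uses Python's re module rather than REX, and the names RECORD_RE, FIELD_GROUPS, and parse_record are invented for this illustration. One expression matches the headers of a whole record, and each field is taken from a numbered group of that match.

```python
import re

# One expression covering the record headers (Python regex syntax, NOT REX).
RECORD_RE = re.compile(
    r"From: (.*)\n"
    r"Subject: (.*)\n"
    r"Number: (\d+)\n"
    r"Date: (.*)\n"
)

# Hypothetical mapping of field name -> group number within RECORD_RE.
FIELD_GROUPS = {"From": 1, "Subject": 2, "Number": 3, "Date": 4}


def parse_record(text):
    """Return a dict of fields taken from numbered groups, or None if no match."""
    m = RECORD_RE.search(text)
    if m is None:
        return None
    return {name: m.group(num) for name, num in FIELD_GROUPS.items()}


sample = (
    "From: multiple record file\n"
    "Subject: First multiple record\n"
    "Number: 1\n"
    "Date: 1995-04-19 11:31:00\n"
    "\n"
    "This is my message; this is my file.\n"
)
fields = parse_record(sample)
print(fields["Subject"])
```

The point of the design is that the field definitions carry no patterns of their own: changing the record expression changes what every numbered field captures.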

The expression for a field is referred to as its tag. You can use the default expressions or construct your own complete REX expression. In the example that follows, the fields are easily tagged as From, Subject, Number, and Date. The text of the whole record is stored in the field called Text.

The first portion of the file timport.sch is the schema. The last portion is sample text to import, which looks like this:

     From: multiple record file
     Subject: First multiple record
     Number: 1
     Date: 1995-04-19 11:31:00

     This is my message; this is my file.
     ^L
     From: multiple record file
     Subject: Second multiple record
     Number: 2
     Date: 1995-04-19 11:32:00

     This is another message.
     ^L
     From: multiple record file
     Subject: Third multiple record
     Number: 3
     Date: 1995-04-19 11:33:00

     This is getting tedious!
     I'm going to stop now.

Where multiple records occur in a single file, they are separated by some sort of repeating textual pattern. In this example, the form feed character \x0c, which appears as ^L, separates the three records. The keyword for this is recdelim (record delimiter). A recdelim defined in a schema file implies that there are multiple records.
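As a rough illustration of what a record delimiter accomplishes, the Python sketch below splits the sample text on the form feed and then pulls the four header fields out of each record, keeping the whole record as Text. It is an analogy under stated assumptions, not the timport implementation: the patterns are Python regular expressions rather than REX, and FIELD_TAGS, split_records, and import_record are hypothetical names.

```python
import re

FORM_FEED = "\x0c"  # the delimiter shown as ^L in the sample text

# Hypothetical per-field patterns (Python regex, NOT REX): one per header tag.
FIELD_TAGS = {
    name: re.compile(rf"^{name}: (.*)$", re.MULTILINE)
    for name in ("From", "Subject", "Number", "Date")
}


def split_records(data):
    """Split file contents on the delimiter, dropping empty pieces."""
    return [piece.strip("\n") for piece in data.split(FORM_FEED) if piece.strip()]


def import_record(record):
    """Build one row: the whole record as Text plus one value per header tag."""
    row = {"Text": record}
    for name, pattern in FIELD_TAGS.items():
        m = pattern.search(record)
        row[name] = m.group(1) if m else None
    return row


data = (
    "From: multiple record file\n"
    "Subject: First multiple record\n"
    "Number: 1\n"
    "Date: 1995-04-19 11:31:00\n"
    "\n"
    "This is my message; this is my file.\n"
    "\x0c\n"
    "From: multiple record file\n"
    "Subject: Second multiple record\n"
    "Number: 2\n"
    "Date: 1995-04-19 11:32:00\n"
    "\n"
    "This is another message.\n"
    "\x0c\n"
    "From: multiple record file\n"
    "Subject: Third multiple record\n"
    "Number: 3\n"
    "Date: 1995-04-19 11:33:00\n"
    "\n"
    "This is getting tedious!\n"
    "I'm going to stop now.\n"
)

rows = [import_record(r) for r in split_records(data)]
for row in rows:
    print(row["Number"], row["Subject"])
```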

Sometimes the definition of the fields within the records defines an overall pattern which does not require a separate record delimiter. In this case, use the keyword multiple instead. With a clear recdelim, as in this example, the keyword multiple is not required.
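To see how field patterns alone can imply record boundaries, consider the Python sketch below. It is only an analogy: how timport itself locates records under the multiple keyword is not described here, and the assumption that every record begins with a "From: " line, along with the names RECORD_START and find_records, is made up for this example.

```python
import re

# Assumption for this sketch: every record starts with a "From: " line.
RECORD_START = re.compile(r"^From: ", re.MULTILINE)


def find_records(data):
    """Cut the text at each record start; no separate delimiter is needed."""
    starts = [m.start() for m in RECORD_START.finditer(data)]
    ends = starts[1:] + [len(data)]
    return [data[s:e].rstrip("\n") for s, e in zip(starts, ends)]


# The same kind of text as the sample, but with no form feed between records.
data = (
    "From: multiple record file\n"
    "Subject: First multiple record\n"
    "\n"
    "This is my message; this is my file.\n"
    "From: multiple record file\n"
    "Subject: Second multiple record\n"
    "\n"
    "This is another message.\n"
)

records = find_records(data)
print(len(records))
```

A body line that happens to begin with "From: " would be mistaken for a new record in this sketch, which is one reason a clear delimiter is preferable when the data has one.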

Specifically, the schema rules are:

  • recdelim is used for separating records out of an input file containing multiple records. It implies "multiple".

  • multiple indicates that there may be more than one record per input file.

  • recexpr is an expression that matches an entire record. Field tags are then numbers indicating the subexpression range for the field. It is well suited to records that are not well delimited, such as columnar data.

Note that this schema file uses a recdelim, so it does not also need the keyword multiple. It defines individual fields rather than one expression for the entire record, so no recexpr is defined.


Copyright © Thunderstone Software     Last updated: Apr 15 2024