10.3.2 Delimiting an Entire Region

We want to do a text search against our table for a user's query, but limit the results to, say, a 100x100 mile box centered on a specific location, say the center of the user's city. Thus we can find results in adjacent cities, states or ZIP codes that aren't given in our query but are nonetheless physically nearby.

We modify our table schema and add the following columns:

  • Lat
    The latitude of the attraction, as an integer, in DDDMMSS format. Thus, 40 degrees 30 minutes 45 seconds north latitude is 403045.

  • Lon
    The longitude, in the same format.

  • GeoCode
    A long field which we'll use in a moment.

We also need a table, city , which lists cities and their latitude/longitudes, so we can look them up. (Thunderstone has such data already available, if you contact us.)

The slow/impossible way

At first, this would seem to be the best way to limit our search to a boxed region:


  <SQL MAX=1 "select lat, lon from city
              where Place\State like $city
              order by Pop desc">
    <SQL MAX=10 "select Name from tourist
                 where Desc likep $query
                   and Lat between $lat - 10000 and $lat + 10000
                   and Lon between $lon - 10000 and $lon + 10000">
      $Name <BR>
    </SQL>
  </SQL>

First, we look up the user's given $city in our city table. This gives us $lat and $lon for the center of town. (Note the order-by: we want larger cities first, eg. Cleveland, OH is probably preferred over Cleveland, TN if the state wasn't mentioned.)

2 degrees is roughly 120 miles, so we search within +/- 1 degree of $lat /$lon in our inner <SQL> . Our latitude/longitude data is in DDDMMSS format, as integers, so 10000 is 1 degree 0 minutes 0 seconds. Now we have a boxed regional search, centered on a specific location.

The problem

The problem is the post-processing required here. Any of the three clauses in the inner query, the $query clause, the latitude search, and the longitude search, could easily return a huge number of rows by themselves, no matter which we index and we put first. Even though the overall result may be a small number of rows, a lot of index rows would have to be ANDed to produce the results - slow.

The better way

Our actual geographic region is very small. If we could somehow tie the latitude and longitude together into one value, a search against it would return much fewer rows from such a (combined) index. But that requires a two-dimensional search against a one-dimensional index; how do we do that?

The <geo2code> function helps us out here. Available in version 2.1.904800000 (Sep. 2 1998) and later, it takes a latitude/longitude pair - in DDDMMSS format - and returns a single integer that we can store in our GeoCode column for each row. We then create a normal index on this column (and drop our lat/long indexes, by the way; we won't need them). The between operator has special knowledge of how to search such a column, when given a pair of encoded integers.

First, we need to encode all our locations in the table. It's faster to do this by copying to another table than updating in place. We run this code once to create our new tourist table:


  <SQL ROW "select * from oldtourist">
    <geo2code $Lat $Lon>
    <SQL MAX=1 NOVARS "insert into tourist
       values($Name, $Desc, $Lat, $Lon, $ret)">
    </SQL>
  </SQL>

For each row, <geo2code> returns a long containing both latitude and longitude, encoded in a special way. Note the ROW flag to avoid wasting memory. Then we drop oldtourist and make our indexes on our new tourist table: Metamorph compound inverted on Desc,GeoCode :

create metamorph inverted index xdesc on tourist(Desc,GeoCode);

Now we're ready to modify our search:


  <SQL MAX=1 "select lat, lon from city
              where Place\State like $city
              order by Pop desc">
    <geo2code $lat $lon 10000>
    <SQL "select Name
          from tourist
          where Desc like $query and GeoCode between " $ret>
      $Name
    </SQL>
  </SQL>

Here we added a third parameter to <geo2code> , a "radius". Given a radius (in DDDMMSS format), <geo2code> returns a string containing 2 numbers, instead of one. These two numbers represent the boxed region centered on $lat / $lon with sides of length 20000 (ie. a "radius" of 10000).

We pass this encoding to our SQL statement (but as part of the statement, not a parameter, because Texis must accept the double-value that way).

Advantages

Now we've got the ability to set an exact box over our desired region, regardless of state boundaries. And it's far faster than our top example, because lat/long are combined into one value.

Also, because of the compound index, we can resolve both the like and the latitude/longitude part of the query with one index.

Disadvantages

The enclosing region is a box, whereas we might want an exact circle for radius. It's possible to add the math to check the results and make the square region into a circle, if that's of concern. The main point to understand here is that with <geo2code> , we've reduced the enormous cost of regional searching combined with text searches to a manageable speed. The hard part is done; post-processing a small result set into a circle is much faster than trying to process the entire data set.

Back: Using ZIP Codes Next: Fast Value Lookup - xtree
Copyright © 2024 Thunderstone Software LLC. All rights reserved.