identifylanguage

Tries to identify the predominant language of a given string. Because it returns a probability along with the identified language, this function can also serve as a test of whether the given string is really natural-language text rather than binary or encoded data. Syntax:

identifylanguage(text[, language[, samplesize]])

The return value is a two-element strlst: a probability and a language code. The probability is a value from 0.000 to 1.000 indicating how likely it is that the text argument is written in the language named by the returned language code. The language code is a two-letter ISO 639-1 code.

If an ISO 639-1 code is given for the optional language argument, the probability for that particular language is returned, instead of the probability for the most likely of the known (built-in) languages (currently de, es, fr, ja, pl, tr, da, en, eu, it, ko, ru).

The optional third argument samplesize is the initial integer size in bytes of the text to sample when determining language; it defaults to 16384. The samplesize parameter was added in version 7.01.1382113000 20131018.

Note that since a strlst value is returned, the probability is returned as a strlst element, not a double value, and thus should be cast to double during comparisons. In Vortex with arrayconvert on (the default), the return value will be automatically split into a two-element Vortex varchar array.
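The cast described above can be sketched outside of Texis as well. The following Python example is only an illustration: it assumes the caller has already received the two-element result as a list of strings (for instance from a database client), and the helper names parse_identifylanguage_result and looks_like_language, as well as the 0.5 threshold, are hypothetical and not part of Texis or Vortex.

```python
from typing import Sequence, Tuple


def parse_identifylanguage_result(result: Sequence[str]) -> Tuple[float, str]:
    """Split a two-element [probability, language] result into typed values.

    Hypothetical helper: assumes `result` holds the two strlst elements
    as strings, e.g. ["0.873", "en"].
    """
    if len(result) != 2:
        raise ValueError("expected exactly two elements: probability, language")
    # The probability arrives as a string element, so convert it to a
    # floating-point number before any numeric comparison.
    probability = float(result[0])
    language = result[1].strip()
    return probability, language


def looks_like_language(result: Sequence[str], expected: str,
                        threshold: float = 0.5) -> bool:
    """Return True if the result names `expected` with enough probability.

    The default threshold of 0.5 is an arbitrary illustrative value.
    """
    probability, language = parse_identifylanguage_result(result)
    return language == expected and probability >= threshold
```

For example, looks_like_language(["0.873", "en"], "en") compares the numeric value 0.873 against the threshold, not the string "0.873". A suitable threshold depends on the data and would need to be chosen by experiment.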

The identifylanguage() function is experimental, and its behavior, syntax, name and/or existence are subject to change without notice. Added in version 7.01.1381362000 20131009.


Copyright © Thunderstone Software     Last updated: Apr 15 2024