SBT is pleased to announce the version 1.3 release of the LSA_Toolkit™. Release 1.3 adds the ability to process UTF-8 character strings as input to the parsing process, along with features for defining the character set that will be accepted for processing. Base character sets are defined in the lsa_toolkit_base.h include file. These include predefined sets for numeric characters (0-9), the English alphabetic characters, and the German alphabetic characters. Sets may be combined, and specific characters added to the working set, by using the character set methods provided in the Collection object. Characters that are not part of the accepted set (punctuation, etc.) are discarded as non-term characters during the parsing process.
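The effect of an accepted character set on parsing can be illustrated with a small standalone sketch. It does not use the toolkit's API: the names CharSet, makeRange, combine, digits, english, german, decodeOne, and parseTerms are hypothetical, as is the exact content of the German set, since the predefined sets in lsa_toolkit_base.h are not reproduced here. The sketch also assumes that a discarded character ends the current term, which the toolkit may or may not do.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a working character set (Unicode code points).
using CharSet = std::set<char32_t>;

// Build a set from an inclusive code point range.
CharSet makeRange(char32_t first, char32_t last) {
    assert(first <= last);
    CharSet s;
    for (char32_t c = first; c <= last; ++c) s.insert(c);
    return s;
}

// Union of two sets, mirroring the idea of combining base sets.
CharSet combine(const CharSet& a, const CharSet& b) {
    CharSet r = a;
    r.insert(b.begin(), b.end());
    return r;
}

CharSet digits()  { return makeRange(U'0', U'9'); }
CharSet english() { return combine(makeRange(U'a', U'z'), makeRange(U'A', U'Z')); }
// Assumed German additions: umlauts and sharp s, given as code points.
CharSet german() {
    return CharSet{0x00E4, 0x00F6, 0x00FC, 0x00C4, 0x00D6, 0x00DC, 0x00DF};
}

// Decode one UTF-8 sequence starting at pos. Returns false on malformed input.
// Simplified: overlong encodings and surrogates are not rejected.
bool decodeOne(const std::string& s, std::size_t pos, char32_t& cp, std::size_t& len) {
    const unsigned char b = static_cast<unsigned char>(s[pos]);
    if (b < 0x80)                { cp = b;        len = 1; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
    else return false;
    if (pos + len > s.size()) return false;
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[pos + k]);
        if ((c & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (c & 0x3F);
    }
    return true;
}

// Split UTF-8 text into terms made only of accepted characters.
// Any other character (or malformed byte) is dropped and ends the current term.
std::vector<std::string> parseTerms(const std::string& utf8, const CharSet& accepted) {
    std::vector<std::string> terms;
    std::string current;
    auto flush = [&]() {
        if (!current.empty()) { terms.push_back(current); current.clear(); }
    };
    std::size_t i = 0;
    while (i < utf8.size()) {
        char32_t cp = 0;
        std::size_t len = 0;
        if (!decodeOne(utf8, i, cp, len)) { flush(); ++i; continue; }
        if (accepted.count(cp)) current.append(utf8, i, len);  // keep original bytes
        else flush();
        i += len;
    }
    flush();
    return terms;
}
```

With the digit, English, and German sets combined, the UTF-8 input "Größe: 42 cm!" splits into three terms in this sketch, and the colon, spaces, and exclamation mark are dropped. Leaving out the German set would instead break the first word at the two non-ASCII letters.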
Additional access methods have been introduced to the Docset class to provide information about weighted term values, both globally for the Docset and locally for an individual document. Hinting routines that describe the expected makeup of the Docset are also available, allowing the Docset class to pre-optimize its storage configuration and improve processing performance for the dataset being processed.
In the ProjectionSpace class, new methods allow items to be added to and removed from the centroid set, along with a simplified method that creates an initial centroid item from a single projection item.
The specific methods added in this release are:
- LSA_Toolkit::ProjectionSpace::addCentroid(single item)