LSA_Toolkit™  Release 1.2
The Latent Semantic Analysis Toolkit
LSA_Toolkit Documentation

Introduction

The LSA_Toolkit is a robust, scalable, and efficient library that provides the capabilities necessary for performing Latent Semantic Analysis tasks on a collection of data. The toolkit is implemented as a library with a C++ API, allowing it to be easily integrated into your existing applications or front-ended with software customized to your specific needs. It is designed to be open-ended and configurable with respect to resource usage. We are working toward future release versions that will support generalizable parallel grid computation, running on 1 to n processors as requested.

The LSA_Toolkit provides an end-to-end implementation of the functionality necessary for performing Latent Semantic Analysis: taking a set of documents as input and parsing them to form a Document Set in the context of a Collection, which is then used to construct a Latent Semantic Analysis hyperspace (LSA Space). This LSA Space represents the semantic clustering of the information parsed in the corresponding Document Set. The LSA_Toolkit supports several analysis and query functions that can be applied using an LSA Space or an associated ProjectionSpace to understand the associations and semantic relationships within your data. (See Overview of Typical Use.)

A Brief Introduction to LSA

Latent Semantic Analysis (LSA), sometimes referred to as Latent Semantic Indexing (LSI), is an information analysis and search technology. It enables you to organize and examine large collections of information based on their meaning (semantics), not just on matching the words or items being used. It does this automatically and objectively, in a way that helps you understand and manage your information like never before.

Most existing technology falls short of meeting the needs of today’s information-rich culture. Other automated information analysis and search techniques are based primarily on the matching of common items such as terms or phrases. Even sophisticated text-mining models based on this sort of matching do not capture the underlying semantics, the meaning, of the information they work with.

LSA is different. Latent Semantic Analysis embodies both a computational model and an underlying theory of meaning. It creates a mapping of meaning acquired from the subject information itself, based completely on the semantic relationships between items of information contained in the collection. This is done without the need for any specific foreknowledge of the information in the collection. An LSA-based system is able to identify semantically similar items based solely on the collection of information being analyzed.

LSA can be applied to any sort of data items that have a collective meaning, even if they are not strictly text-based. LSA can be used for any language – even collections containing items in multiple languages – without the need for prior translation. It can be used to search, compare, evaluate, and understand the information in a collection. In fact, LSA provides a computational model that can perform many of the cognitive tasks humans perform with information, essentially as well as humans do them.

The power of LSA technology is based on mathematical properties and special computations that can identify the relationships of meaning between all of the items in a collection of information.

The computations behind LSA are fairly complex, but the basic concept looks something like this: given a collection of items of information, it is possible to represent the collection as a simple, but very large, table with a row for each item (article, document, etc.) and a column for each term (word, phrase, etc.) in the collection. The value in each cell of the table is simply the number of times the term for that column appears in the item for that row. This table is essentially a numeric matrix. We adjust the numbers in the matrix using various weighting schemes based on the numeric properties of the data in the matrix. The weighted matrix is then processed using a mathematical algorithm that reduces this large matrix to a multi-dimensional hyperspace in which each item and each term is represented by a vector projecting into this space. You can roughly picture this with a simple 3-dimensional representation where the vectors point out into the 3D space.
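The pipeline above (count matrix, weighting, dimensional reduction) can be sketched conceptually with NumPy. The tiny three-document collection and the log-entropy weighting scheme below are illustrative choices for the sketch, not the LSA_Toolkit API; the reduction step is the truncated singular value decomposition commonly used for LSA.

```python
# Conceptual sketch of the LSA pipeline: counts -> weighting -> reduced space.
# The sample data and weighting choice are illustrative, not the toolkit API.
import numpy as np

docs = [
    "trains run on rails",
    "locomotives pull trains",
    "cats chase mice",
]
terms = sorted({t for d in docs for t in d.split()})

# Raw count matrix: one row per document, one column per term.
A = np.array([[d.split().count(t) for t in terms] for d in docs], dtype=float)

# One common local/global weighting: log local weight times entropy global weight.
gf = A.sum(axis=0)                       # global frequency of each term
p = np.where(A > 0, A / gf, 0.0)
with np.errstate(divide="ignore", invalid="ignore"):
    ent = np.where(p > 0, p * np.log(p), 0.0)
g = 1.0 + ent.sum(axis=0) / np.log(len(docs))
W = np.log(A + 1.0) * g

# Reduce to a k-dimensional hyperspace with a truncated SVD.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
doc_vectors = U[:, :k] * s[:k]           # one k-dimensional vector per document
term_vectors = Vt[:k, :].T * s[:k]       # one k-dimensional vector per term
```

Real collections use far more dimensions than k = 2; the structure of the computation is the same.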

(Image: 3d-LSA2.png)
Simple 3D Example

This illustration is an extremely simplified representation using only 3 dimensions so you can visualize it. In practice we use a hyperspace, typically with anywhere from 300 to 500 dimensions or more.

When this processing is completed, information items are left clustered together based on the latent semantic relationships between them. The result of this clustering is that items which are similar in meaning are located close to each other in the space while dissimilar items are more distant from each other, regardless of whether or not they share common terms. In many ways, this is how a human brain organizes the information an individual accumulates over a lifetime.
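"Close to each other in the space" is typically made precise with the cosine of the angle between two item vectors. The sketch below uses hand-picked 3-dimensional vectors purely for illustration; the function name is not part of the toolkit API.

```python
# Minimal sketch: semantic closeness measured as the cosine of the angle
# between item vectors. Vectors here are hand-picked for illustration.
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two semantically similar items point in nearly the same direction;
# a dissimilar item points elsewhere in the space.
trains      = np.array([0.9, 0.4, 0.1])
locomotives = np.array([0.8, 0.5, 0.2])
cats        = np.array([0.1, 0.2, 0.9])

print(cosine(trains, locomotives))   # high similarity
print(cosine(trains, cats))          # much lower similarity
```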

In addition to simply retrieving data items, the entire data collection can be analyzed based on meaning in several different ways. This allows a large information collection to be digested and examined from a number of angles that are simply not possible using other techniques.

There are five primary operations you can perform with an LSA space:

  • Retrieval - retrieve items of information based on meaning, ranked according to semantic relevance
  • Clustering - identify clusters of meaning within a single semantic space
  • Comparison - compare items and clusters within a single semantic space, as well as across multiple spaces
  • Interpretation - identify where a new item maps in a given semantic space or in multiple spaces, as well as merge information from multiple semantic spaces
  • Completion - use a semantic space to suggest next-word or missing-word elements in an information item
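The first of these operations, Retrieval, can be sketched as follows: a query is treated as a pseudo-document, folded into the reduced space against the existing term basis, and ranked against every document vector by cosine similarity. The fold-in construction is standard for LSA; the random data and function names are illustrative, not the toolkit API.

```python
# Sketch of the Retrieval operation: fold a query into the reduced space,
# then rank documents by cosine similarity. Data and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 8))                   # weighted matrix: 6 documents, 8 terms
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 3
doc_vecs = U[:, :k] * s[:k]              # k-dimensional document vectors

def fold_in(q_counts):
    """Project a raw query term-count vector into the k-dimensional space."""
    return q_counts @ Vt[:k, :].T

def retrieve(q_counts):
    """Return document indices ranked by semantic relevance, best first."""
    q = fold_in(q_counts)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)
```

Because the comparison happens in the reduced semantic space, a query can rank a document highly even when the two share no literal terms.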

A Simple Illustration

Given these five primary operations, imagine some of the ways you could use Latent Semantic Analysis to evaluate the content of a library, for instance:

  • Searching - You want to find items in the library that discuss a certain topic but want to be sure items of similar meaning are retrieved even if they don’t contain the search keywords. (e.g., you search for “railroad trains” and get back relevant items discussing “locomotives” that don’t necessarily include the original search terms)
  • Indexing of multilingual collections - If the library consists of items in different languages, you would like to be able to search in one language but retrieve relevant items even if they are in a different language – but without having to translate all of the items into a single language beforehand.
  • Content analysis - You could evaluate the content of the library and determine what major subjects are covered by the items in it. Is there a specific concentration or clusters of subject matter in the library, or is the content widely dispersed? Are there particular items that don’t fit with the main body of the library collection (outliers)?
  • Evaluation of “fit” into an existing collection – If you are considering adding a new item to the library, does it fit in with the subject matter you already have or is it an outlier? Is it a near duplicate of something that is already in the library collection?
  • Comparison of multiple collections - Considering two library collections, do they overlap in content or are they complementary? If we consider representations of our library at different times, how has the content of our library changed over time?
  • Correction of scanned documents – When adding scanned documents to your library using Optical Character Recognition (OCR), errors are frequently introduced by the OCR processing. You would like to be able to recognize these errors and correct them by supplying the correct word, chosen based on the context of the data item, rather than introducing “dirty” data items into your collection.

Release History

Current Release - Version 1.2

Release date: May 4, 2015

Version 1.2 introduces the ProjectionSpace class. ProjectionSpace objects allow the use of an underlying LSASpace to provide a semantic mapping basis for evaluating items separately from the content of the initial document collection. This supports techniques for semantic evaluation and analysis that extend the application of LSA to a broad range of areas. The specific methods added in this release are:

  • Projection space settings
    • ProjectionSpace::setTrackMissingTerms
    • ProjectionSpace::getTrackMissingTerms
    • ProjectionSpace::setQueryDimensions
    • ProjectionSpace::getQueryDimensions
  • Projection creation
    • ProjectionSpace::addProjection
  • Centroid functions
    • ProjectionSpace::addCentroid
    • ProjectionSpace::getCentroidComponentHandles
  • Get information about the projection space
    • ProjectionSpace::isEmpty
    • ProjectionSpace::getNumProjections
  • Get information about individual projection items
    • ProjectionSpace::getProjectionUniqueTermCount
    • ProjectionSpace::getProjectionTotalTermCount
    • ProjectionSpace::getProjectionMissingTerms
  • Compare query to all projections
    • ProjectionSpace::queryPSpace
  • Compare specific projection to termspace or docspace from context
    • ProjectionSpace::compareItemToTermspace
    • ProjectionSpace::compareItemToDocspace
  • Individual comparison routines
    • ProjectionSpace::compareItemToItem
    • ProjectionSpace::compareItemToQuery
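Conceptually, a projection space maps items from outside the original collection into an existing LSA Space’s basis and lets you work with groups of such projections, for example via their centroid. The NumPy sketch below illustrates that idea only; it mirrors the roles of addProjection and addCentroid but is not the LSA_Toolkit C++ API, and the data is random placeholder content.

```python
# Conceptual sketch of a projection space: external items are mapped into a
# fixed semantic basis from an existing LSA space, and a group of projections
# is summarized by its centroid. This is NOT the LSA_Toolkit API.
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((5, 7))                   # weighted matrix of the original collection
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 3
basis = Vt[:k, :].T                      # fixed k-dimensional semantic basis

# Three new items (term-count vectors) that were not in the original collection.
new_items = rng.random((3, 7))
projections = new_items @ basis          # one k-dimensional vector per projected item

centroid = projections.mean(axis=0)      # summary vector for the whole group
```

Each new item gets a position in the existing space without changing that space, which is what allows evaluation separate from the initial document collection.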

Past Release Versions

Version 1.1

Release date: September 15, 2014

Release Version 1.1 adds several new methods to the Collection, Docset, and LSASpace classes to support the management and use of stop word lists, purging terms from a Docset, and specifying the number of dimensions to use in analysis operations. The specific methods added are:

  • Collection::getTermList
  • Collection::getDocList
  • Collection::getStopList
  • Collection::getStopWords
  • Collection::clearStopWords
  • Docset::removeTerm
  • Docset::purgeTermsBelowDA
  • Docset::purgeTermsBelowGF
  • LSASpace::setQueryDimensions
  • LSASpace::getQueryDimensions

In addition to these new features, Release Version 1.1 also deploys our new license key management system that provides more options for licensing the LSA_Toolkit library in redistributable products and parallel grid computing environments.

Version 1.0

Release date: March 10, 2014

This release makes major changes to the API and improves the overall ease of use for the library. With the introduction of Collections and Docsets and their associated parsing features, the library now has complete front-end functionality and better support for query and comparison operations after an LSASpace is created. Docsets make it easier to manage the addition of documents for processing, and Collections provide the dictionary context to ensure that term and document identification is uniform and controlled throughout all of the interactions with a Docset, Collection, and LSASpace group.

Version 0.4.1

Release date: November 15, 2013

This release package provides a new section of utility programs for analyzing LSA spaces, the ability to remove an applied weighting from a document collection, improved documentation, and several other minor changes in preparation for the upcoming 0.5 release.

Version 0.4

Release date: March 15, 2013

This release makes several improvements to the query processing and adds the ability to fold-in new documents to an already calculated LSA Space. Along with additional performance improvements, additional safety checks have been added to improve feedback to library users and prevent the creation of flawed LSA spaces.

Version 0.3

Release date: March 19, 2012

Version 0.3 adds several new features and performance enhancements, including additional functionality to complete the basic processing workflow for creating and working with an LSA Space:

  • Tools to parse input documents and query strings into term IDs for processing.
  • Local/Global weighting application.
  • Query processing for arbitrary query strings.

Version 0.2.1

Release date: September 9, 2011

This minor update release provides improved performance for larger datasets and implements some bug fixes.

Version 0.2

Release date: July 25, 2011

In addition to all of the features provided in version 0.1, version 0.2 provides:

  • Save and load data collection information quickly and efficiently using a binary storage file for the initial input matrix.
  • Save and load computed LSA Space information quickly and efficiently with binary storage files.
  • Perform basic analysis queries comparing information items across the data collection with a cosine comparison measure.
  • Optimize the constructed LSA Space for extensive query processing if desired.

Version 0.1

Release date: April 30, 2011

We have rebuilt Latent Semantic Analysis processing from the ground up. Going back to the core mathematics, we have designed a complete system architecture, new storage mechanisms, and thoroughly re-examined the computational accuracy to provide the best implementation possible. Our goals have been to create a system that is easy to use and that is robust, scalable, and performant.

While this initial release does not have a large number of visible features, there is a lot going on behind the scenes. This version marks a solid foundation for building forward. Future releases will expand the feature set to provide all the tools needed to deploy LSA technology into your applications.