LSA_Toolkit™ Release 1.1
The Latent Semantic Analysis Toolkit
How to use the LSA_Toolkit Library

The Moving Parts

The LSA_Toolkit consists of four main objects that are used to manage data and perform tasks related to Latent Semantic Analysis. The four objects are:

  • Environment
  • Collection
  • Docset
  • LSASpace

The Environment

The LSA_Toolkit::Environment provides the basic context for accessing the LSA_Toolkit functions. The settings in the environment are used to control the way the library operates. Environment functions also provide information about the current status of the library. An Environment object is passed as an argument to the constructor for any Collection objects that are instantiated. A single Environment can be used for multiple collections at the same time. The number of Environment objects that can be instantiated simultaneously is controlled by the LSA_Toolkit license key(s) that you possess.
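For example, a single Environment might back two independent Collections at once. A minimal sketch follows (the default Environment constructor and the variable names are assumptions; the Collection constructor taking an Environment is described above):

#include "environment.h"
#include "collection.h"

// One Environment shared by two independent Collections.
// Constructor signatures follow the examples later in this document.
LSA_Toolkit::Environment *EV = new LSA_Toolkit::Environment();       // default settings
LSA_Toolkit::Collection  *newsCL  = new LSA_Toolkit::Collection(EV);
LSA_Toolkit::Collection  *booksCL = new LSA_Toolkit::Collection(EV); // same Environment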

The Collection

The LSA_Toolkit::Collection is the foundation for a given set of information to be processed. It must be constructed within an Environment context. Multiple collections may be created simultaneously for an active Environment.

The Collection houses the term dictionary and document handle (docHandle) catalog, and ties them both to the Docset and the LSASpace. Terms and docHandles are created as items are added to the Docset. They are then referenced from the LSASpace for query and analysis functions, and additional docHandles may be created from the LSASpace if new documents are added (folded in) to an existing LSASpace.

Text parsing characteristics are stored in the Collection object so they are defined both for Docset processing and for query string processing in the LSASpace. The current version of the library has a basic definition of parsing characteristics: all input strings are processed uniformly by lowercasing the string, replacing each punctuation character with a space, and breaking the resulting string up on whitespace to form terms. Punctuation is simply defined as any non-alphanumeric character. Future releases will allow the user to adjust these parsing characteristics as needed by changing settings in the Collection.
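For illustration, the current parsing behavior is equivalent to the following standalone sketch. This is not the library's own code, just the transformation described above:

#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Equivalent of the current parsing rules: lowercase the input,
// replace punctuation (any non-alphanumeric character) with a space,
// then break the result on whitespace to form terms.
std::vector<std::string> parseTerms(const std::string &input)
{
    std::string cleaned;
    for (unsigned char c : input)
        cleaned += std::isalnum(c) ? static_cast<char>(std::tolower(c)) : ' ';

    std::vector<std::string> terms;
    std::istringstream stream(cleaned);
    std::string term;
    while (stream >> term)
        terms.push_back(term);
    return terms;
}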

A Collection may be assigned a user-defined name that can be used to identify the Collection (see setCollectionName). This name is saved with the Collection and can be referenced when the Collection is loaded again.

The Collection is saved as a distinct unit, separate from any associated Docset or LSASpace that relies on it. NOTE: When a Docset or an LSASpace is saved to a file for later use, the Collection should be saved as well. Since the Collection is saved as a separate item, it is possible to load a Collection together with a Docset or LSASpace that was not originally associated with it (i.e., the Docset or LSASpace was created using a different Collection). Linking a Collection other than the one used to create a Docset or LSASpace results in undefined behavior. In this version of the LSA_Toolkit there is no specific checking performed to detect this condition.

Currently, the Collection can be associated with a single Docset and a single LSASpace at a given time.

The Docset

An LSA_Toolkit::Docset represents the documents or items that form the Collection. Once instantiated, the Docset is built by adding documents using the addDoc method. The definition of what constitutes a document is entirely user defined: whatever is passed to the addDoc method as the input string will be considered a single "document". This could be a single sentence, a paragraph, multiple paragraphs, strings of other information, etc., depending upon the application. Be aware that the processing of the input document strings is governed by the parsing characteristics currently set in the Collection object to which the Docset is attached.

As each document is added, the addDoc method returns a document identifier (a docHandle) that references the added document within the Collection. This docHandle is guaranteed to be unique to the document in the context of the given Collection, and will not be reused or reassigned. Applications using the LSA_Toolkit should record this docHandle in whatever system is being used to curate the raw documents. Once processed, the raw document string is not recorded within any of the LSA_Toolkit objects; only the docHandle is used as a reference. Using the setParseCatalogFile method will cause the assigned docHandle and raw document string to be logged to a user-designated output file whenever a document is added to the Docset.
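For example, an application might keep its own map from each returned docHandle back to the source of the raw text. A minimal sketch (the std::map bookkeeping is application code, not part of the library, and the string docHandle type is an assumption based on the query examples later in this document):

#include <map>
#include <string>

// Application-side bookkeeping: remember where each raw document came from.
std::map<std::string, std::string> handleToSource;

// Inside the document-loading loop (DS is the Docset, as in the examples below):
std::string handle = DS->addDoc(documentText);
handleToSource[handle] = "corpus/article-0042.txt"; // hypothetical source identifier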

Once a Docset is assembled, it is generally desirable to apply a weighting function to the Docset before building an LSASpace based on it. This is done with the applyWeighting method. The desired weighting is indicated by passing a flag to the applyWeighting method. Currently available weighting schemes include:

  • no_weighting - no weighting is applied - raw term frequencies are used
  • log_entropy - the local log value multiplied by the global entropy value for each term (see the sketch following this list)
  • Others to be added in future releases
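As an illustration of the log_entropy option, the following sketch shows the standard textbook log-entropy computation over a term-document frequency matrix. This is the conventional formulation, not necessarily the library's exact arithmetic:

#include <cmath>
#include <cstddef>
#include <vector>

// Textbook log-entropy weighting: each raw frequency tf[i][j] becomes
// g_i * log(tf[i][j] + 1), where g_i = 1 + sum_j(p_ij * log(p_ij)) / log(nDocs)
// and p_ij = tf[i][j] / (global frequency of term i).
void logEntropyWeight(std::vector< std::vector<double> > &tf)
{
    const std::size_t nDocs = tf.empty() ? 0 : tf[0].size();
    if (nDocs < 2) return; // entropy is undefined for fewer than two documents
    for (std::vector<double> &row : tf)  // one row per term
    {
        double gf = 0.0;                 // global frequency of this term
        for (double f : row) gf += f;
        if (gf == 0.0) continue;
        double entropy = 0.0;
        for (double f : row)
            if (f > 0.0)
            {
                const double p = f / gf;
                entropy += p * std::log(p);
            }
        const double global = 1.0 + entropy / std::log(static_cast<double>(nDocs));
        for (double &f : row)
            f = global * std::log(f + 1.0); // local log times global entropy weight
    }
}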

The weighting may be removed explicitly by using the removeWeighting method, or by setting the weighting to the no_weighting value. It will also be removed automatically if additional documents are added to the Docset with the addDoc method, as adding documents would invalidate any weighting values that had been previously calculated. Applying a new weighting method to an already weighted Docset will remove the existing weighting and apply the new weighting that is requested.

The Docset information may be saved to a file and reloaded later as needed by using the save and load commands. NOTE: Currently the Docset is saved as a distinct unit, separate from the Collection it is associated with. Since the Docset is saved as a separate item, it is possible to load a Docset in the context of a Collection with which it was not originally associated (the Docset was created using a different Collection). Linking a Collection other than the one used to create a Docset results in undefined behavior. In this version of the LSA_Toolkit there is no specific checking performed to detect this condition.

The LSASpace

The LSA_Toolkit::LSASpace object is the workhorse for the library. Its content is built using information from the Collection context used to instantiate it and an associated Docset. Content for the LSASpace may also be loaded from a previously built LSASpace. The LSASpace provides the basis for all the query and analysis functions that make up Latent Semantic Analysis.

After instantiating the LSASpace object, the LSA hyperspace is formed by calling the buildSpace method. This method performs a series of steps that are applied to the content of the Docset. These steps consist of several mathematical transforms and are computationally intensive, placing a high demand on system memory and resources. On systems where memory is limited, it may be desirable to free up additional memory by calling the Collection::dropDictionaries method, which deletes the term and docHandle tables from memory. These tables are not used during the buildSpace process, but they are necessary for query and analysis processing and should be reloaded after the buildSpace method has completed. Be sure to use the Collection::reloadDictionaries method to do this, not the load method.

The LSASpace consists of two conceptual parts that are referred to as Docspace and Termspace. Docspace is a hyperspatial mapping of all the documents contained in the Docset. Termspace is a separate hyperspatial mapping of all the terms contained in the Docset. These two parts are used individually and in concert to make semantic comparisons between items based on the knowledge base formed from the Docset contents.

Comparison of items within the context of the LSASpace is performed by computing a similarity measure, or distance, between items mapped in the LSA hyperspace. General query functions, queryDocspace and queryTermspace, allow related documents and terms to be identified and retrieved based on their proximity to a search string. Additional comparison functions, such as compareAtoB and compareDocToQuery, can be used to compare items to specific items in the Collection, or to make comparisons between two items that are not in the Collection but are simply interpreted in the context of the LSASpace.

The LSASpace information may be saved to a file and reloaded later as needed by using the save and load commands. NOTE: Currently the LSASpace is saved as a distinct unit, separate from the Collection it is associated with. Since the LSASpace is saved as a separate item, it is possible to load an LSASpace in the context of a Collection with which it was not originally associated (the LSASpace was created using a different Collection). Linking a Collection other than the one used to create an LSASpace results in undefined behavior. In this version of the LSA_Toolkit there is no specific checking performed to detect this condition.

Overview of Typical Use

A simple, typical usage of the LSA_Toolkit is made up of the following actions:

  • Instantiate an Environment object
  • Instantiate a Collection within the Environment context
  • Instantiate a Docset attached to the Collection and add documents with the addDoc method
  • Apply a weighting function to the Docset with the applyWeighting method
  • Instantiate an LSASpace attached to the Collection and build it with the buildSpace method
  • Save the Collection, Docset, and LSASpace for later use

At this point the LSASpace is ready for performing query tasks, comparing individual items, and further analysis.
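In code, the sequence looks roughly like the following condensed sketch, assembled from the example calls discussed below (the Environment constructor and the variable names are assumptions):

// Set up the processing context
LSA_Toolkit::Environment *EV = new LSA_Toolkit::Environment();
LSA_Toolkit::Collection  *CL = new LSA_Toolkit::Collection(EV);
LSA_Toolkit::Docset      *DS = new LSA_Toolkit::Docset(CL);

// Add documents and weight the resulting term frequencies
DS->addDoc("First document text ...");          // returns a docHandle
DS->addDoc("Second document text ...");
DS->applyWeighting(LSA_Toolkit::log_entropy);

// Build the LSA hyperspace and save everything for later use
LSA_Toolkit::LSASpace *LS = new LSA_Toolkit::LSASpace(CL);
LS->buildSpace();
CL->save("testspace.cl");
DS->save("testspace.ds");
LS->save("testspace.lsa");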

See the sample programs in the examples directory of the LSA_Toolkit install package for starting points using the LSA_Toolkit.

Some Simple Examples

The example-parse.cpp program from the examples directory of the LSA_Toolkit install package illustrates the basic usage of the LSA_Toolkit functions to load documents into a Docset and save the Docset information. The example-buildspace.cpp program shows how to build an LSASpace from the Docset that was created and saved in the example-parse.cpp program, and the example-query.cpp program demonstrates the use of the LSASpace for some simple query tasks. This discussion will walk through the highlights of each of these examples.

Parsing Documents into a Docset

Setting Things Up

The example program begins by including the necessary header files for the required objects from the LSA_Toolkit:

#include "environment.h"
#include "collection.h"
#include "docset.h"

Inside the main program, object pointers are declared for the Environment, the Collection, and the Docset that will be constructed:

// Declare needed LSA_Toolkit variables
// (the pointer names below match the calls in this walkthrough; EV is assumed)
LSA_Toolkit::Environment *EV;
LSA_Toolkit::Collection  *CL;
LSA_Toolkit::Docset      *DS;

The Environment object contains operating parameters and other information that is used throughout the use of the LSA_Toolkit, and is always needed. Here the Environment object is instantiated with the default settings. Next is the instantiation of the Collection attached to the Environment, and the Docset attached to the Collection:

// Instantiate Environment object (default constructor assumed)
EV = new LSA_Toolkit::Environment();
// Instantiate Collection and Docset objects
CL = new LSA_Toolkit::Collection(EV);
DS = new LSA_Toolkit::Docset(CL);

Skipping ahead, the example includes some general housekeeping tasks, such as setting a name for the collection and designating the log file for capturing the parse catalog:

// Set collection name
CL->setCollectionName("TestName");
// Set to store the parsed document catalog - contains document handles and corresponding document text
DS->setParseCatalogFile("documentSet.cat");

Loading the Documents

The input documents may exist in any number of possible input forms. The specific application will determine how the documents need to be accessed for adding them into the Docset. In this example they are read from a simple text file containing all the documents. The format of the text file allows a document to consist of multiple lines of text, but each document must be separated by at least one blank line as a delimiter. Non-empty lines are accumulated until a blank line is encountered, and then the accumulated document is parsed and added to the Docset with a simple call to Docset::addDoc.

docFile.open(docFileName.c_str(), fstream::in);
if (docFile.is_open())
{
    document.clear();
    getline(docFile, docline);
    while (docFile.good())
    {
        // Accumulate non-empty lines into the current document
        if (!docline.empty())
        {
            document += docline;
            document += ' ';  // keep the last word of one line from fusing with the next
        }
        else
        {
            // A blank line ends the current document; add it to the Docset
            if (!document.empty())
            {
                std::cout << "Processed doc: " << docCount + 1
                          << " Handle: " << DS->addDoc(document) << '\r';
                document.clear();
                docCount++;
            }
        }
        getline(docFile, docline);
    }
    if (!document.empty()) // Check for a document terminated by end of file
    {
        std::cout << "Processed doc: " << docCount + 1 << " Handle: "
                  << DS->addDoc(document) << '\r';
        document.clear();
        docCount++;
    }
    docFile.close();
    std::cout << std::endl;
}

As each document is processed, the generated docHandle is written to standard output along with the processed document count. The application would need to store this docHandle in whatever system is being used to curate the documents (database, filesystem, etc.). In this example a log of the assigned docHandles, along with the raw text that was processed for each, is also written to the "documentSet.cat" file; this filename was given in the setParseCatalogFile call above.

Saving the Work

Once the Docset has been populated it is usually a good idea to save it for future reference. The save() method in both the Collection and the Docset provides a rapid way to save the information in compact binary files.

// Save document set
DS->save("testspace.ds");
// Save collection
CL->save("testspace.cl");

Constructing the LSASpace

The next step is the most computationally intensive one in the LSA process. The Docset is transformed into the multi-dimensional LSA hyperspace where each document and each term is semantically mapped. Construction of the LSASpace content is triggered by a call to the buildSpace method. But first there are a few things to do to set up the Docset for processing, as illustrated in the example-buildspace.cpp program.

Setting Things Up

Again, the example program begins by including the necessary header files for the objects that are required from the LSA_Toolkit:

#include "environment.h"
#include "collection.h"
#include "docset.h"
#include "lsaspace.h"

Inside the main program, some object pointers are declared for the Environment, the Collection, the Docset, and the LSASpace that will be constructed:

// Declare needed LSA_Toolkit variables
// (the pointer names below match the calls in this walkthrough; EV is assumed)
LSA_Toolkit::Environment *EV;
LSA_Toolkit::Collection  *CL;
LSA_Toolkit::Docset      *DS;
LSA_Toolkit::LSASpace    *LS;

Next, the Environment and Collection objects are instantiated and the Collection is loaded from file storage:

// Instantiate Environment object (default constructor assumed)
EV = new LSA_Toolkit::Environment();
// Instantiate Collection object and load the collection
CL = new LSA_Toolkit::Collection(EV);
CL->load("testspace.cl");

NOTE: The Docset object should be instantiated AFTER loading the collection from a file. Loading a Collection object will invalidate any attached Docset or LSASpace objects.

// Instantiate Docset and LSASpace objects
DS = new LSA_Toolkit::Docset(CL);
LS = new LSA_Toolkit::LSASpace(CL);

Skipping some output messages in the example program, the Docset content is loaded from a file with the load method:

// Load Docset from file and print some messages about the document set
DS->load("testspace.ds");

Now with the existing data all loaded, the program is ready to proceed with the real work.

Refining the Docset

At this point the Docset contains raw term frequencies describing the occurrence of each term in each document of the collection. The next processing step is typically to refine the Docset by applying a weighting function in order to increase or decrease the importance of a term within a document or across the entire document collection. This is generally done to dampen the effect of frequently used terms within the collection, and to allow infrequently used terms to contribute more accurately to the underlying meaning represented by the document collection. There are several possible weighting schemes that could be applied.

The application of a weighting scheme is as simple as a basic call to the applyWeighting method for the Docset. Here the typically preferred weighting is applied: log-entropy.

// Apply log-entropy weighting to document set if not already weighted
std::cout << "Document Set Check: DS weighted?: " << DS->isWeighted() << std::endl;
if (!DS->isWeighted())
{
    DS->applyWeighting(LSA_Toolkit::log_entropy);
}

Build the LSASpace Content

Now things are ready to build the LSASpace by calling the buildSpace method. Depending on the size of the document collection being processed, the parameters that have been set, and the available computational resources, this step can take from a few seconds to several hours. On systems with limited memory, it may be desirable to drop the term and docHandle dictionaries from the Collection before calling the buildSpace method. After the buildSpace work is completed, they must be reloaded if any query or analysis tasks are to be performed.

// Build LSA space
// Drop the dictionaries to free memory, then reload them after the space is built
CL->dropDictionaries();  // see Collection::dropDictionaries; argument-free call assumed
LS->buildSpace();
CL->reloadDictionaries("testspace.cl");

When this processing is completed, information items are left clustered together based on the latent semantic relationships between them. The result of this clustering is that items which are similar in meaning are clustered close to each other in the hyperspace and dissimilar items are distant from each other.

Saving the Work

Once the LSA Space has been constructed it is usually a good idea to save it for future use. The save method provides a rapid way to save the LSA hyperspace in a compact binary file that can easily be reloaded when desired.

// Save LSA space
LS->save("testspace.lsa");

Using the LSA Space

Once the LSA Space has been constructed, there are several interesting things that can be done with it. The example-query.cpp program demonstrates just a few simple operations.

Setting Things Up

To begin, the LSA_Toolkit objects are instantiated just as in the previous examples and the Collection is loaded from storage. Once the LSASpace content has been built, the Docset is not needed, so it is not loaded in this example as doing so would simply consume memory for no purpose.

// Declare needed LSA_Toolkit variables
// (the pointer names below match the calls in this walkthrough; EV is assumed)
LSA_Toolkit::Environment *EV;
LSA_Toolkit::Collection  *CL;
LSA_Toolkit::LSASpace    *LS;

NOTE: The LSASpace object should be instantiated AFTER loading the collection from a file. Loading a Collection object will invalidate any attached Docset or LSASpace objects.

// Instantiate Environment object (default constructor assumed)
EV = new LSA_Toolkit::Environment();
// Instantiate Collection object and load the collection
CL = new LSA_Toolkit::Collection(EV);
CL->load("testspace.cl");
// Instantiate LSASpace object
LS = new LSA_Toolkit::LSASpace(CL);

Next, the LSASpace is loaded from a file and then checked to see if it has been optimized for query processing. If the space is going to be used extensively for analysis and query processing, there are a number of intermediate values that can be pre-computed to speed processing of the individual queries and comparisons. Pre-computing these values consumes memory, so there is a trade-off. If the application is not doing a large number of analysis computations, then optimizing the LSASpace will not be worth the time and memory required.

// Load already computed LSA space (do not need document set)
LS->load("testspace.lsa");
// Optimize for query processing if the LSA space is not already optimized.
if (!LS->isQOptimized())
{
    std::cout << "Optimizing LSA space for query processing." << std::endl;
    LS->qOptimize();
}

Searching in Termspace and Docspace

This example demonstrates how to process an input query consisting of a short string and search for documents contained in the space that are close to it. The queryDocspace method projects the query into the Docspace and measures its proximity to all the documents contained in the collection. It then returns a list of results, sorted by the similarity measurement, which is placed in the queryResults vector.

The queryResults vector was declared earlier in the example-query.cpp program:

std::vector< std::pair<std::string, double> > queryResults;

The query results are returned to this vector of pairs, with each pair consisting of a docHandle and its corresponding similarity measurement to the query string.
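The example displays these results with a small helper, print_resultVector, defined in example-query.cpp. A minimal sketch of such a helper (the body here is an assumption modeled on the output shown below; the actual implementation ships with the examples):

#include <cstddef>
#include <iomanip>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Print the top maxResults entries of a ranked result vector.
void print_resultVector(const std::vector< std::pair<std::string, double> > &results,
                        std::size_t maxResults)
{
    std::cout << "results:" << std::endl;
    for (std::size_t i = 0; i < results.size() && i < maxResults; ++i)
        std::cout << results[i].first << " "
                  << std::setprecision(10) << results[i].second << std::endl;
}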

// Compare a query to documents
query = "The benefits of the development of technology";
std::cout << "Compare query: *" << query << "* to all documents in LSA space." << std::endl;
queryResults = LS->queryDocspace(query);
print_resultVector(queryResults, 20);

The top 20 results for this query:

Compare query: *The benefits of the development of technology* to all documents
in LSA space.
results:
h121b 0.7336346439
h11b8 0.6030183325
h10bf 0.5962997653
hf16 0.5843834815
h1290 0.5748053652
h120b 0.574444513
h2080 0.5682740598
hfbd 0.5650387889
h112e 0.5483287004
h3366 0.5425152654
hf41 0.5420996534
hf39 0.5405356019
h100c 0.5379394334
h1315 0.5362405546
h12a1 0.5336399171
h12e1 0.522310385
hfc4 0.5216654494
hf50 0.513641388
h1117 0.5133050409
heff 0.5060165201

Similarly, the next example demonstrates the comparison of an existing document identified by its docHandle to all of the terms contained in Termspace using the compareDocToTermspace method. The results are again stored to the queryResults vector of pairs, with the pairs consisting of a term and its corresponding similarity measurement to the document identified by the docHandle.

// Compare a document to all terms
docHandle = "h9bd";
std::cout << "Compare document *" << docHandle << "* to all terms in LSA space." << std::endl;
queryResults = LS->compareDocToTermspace(docHandle);
print_resultVector(queryResults, 20);

The top 20 results for this query:

Compare document *h9bd* to all terms in LSA space.
results:
canoes 0.5401498733
ships 0.5314741057
lighthouses 0.5083034322
vessels 0.4456527487
coast 0.4368307445
steamships 0.4341279897
maneuvering 0.4263558785
dearth 0.4217167297
gato 0.4128901371
bremerton 0.412068924
navy 0.4069709385
shipbuilding 0.4026408109
maritime 0.3936886446
refit 0.3886042243
shipwrecks 0.38721978
manila 0.3755671448
warships 0.3751927523
puget 0.3747575519
1769 0.3678573048
ports 0.366982475

Similar query functions returning ranked lists can be performed with individual terms, existing documents, or arbitrary query strings to retrieve information from either Termspace or Docspace as desired. See the LSASpace class documentation for a full list of available query methods.

Comparing Two Items

Another use of the LSASpace is to compare two items in the semantic context represented by the LSASpace. When this is done, the result is a single similarity measurement that represents the proximity of the two items in this particular space. The primary method for doing this is the compareAtoB function.

This first code snippet demonstrates how to compare two terms. This comparison will automatically be performed in Termspace.

// Compare two terms
term = "population";
term2 = "density";
cosResult = LS->compareAtoB(term, term2);
std::cout << "Cosine Result between terms *" << term
<< "* and *" << term2 << "* = " << std::setprecision(10)
<< cosResult << std::endl;

The results of this comparison:

Cosine Result between terms *population* and *density* = 0.7643037543

It is also possible to compare two arbitrary query strings of any length. This comparison between two multi-term query strings will automatically be mapped in the Docspace.

// Compare query to query
query = "constitutional rights";
query2 = "housing shortage";
cosResult = LS->compareAtoB(query, query2);
std::cout << "Cosine Result between query: *" << query << "* AND query: *" << query2 << "* = " << std::setprecision(10) << cosResult << std::endl;

The results of this comparison:

Cosine Result between query: *constitutional rights* AND query: *housing shortage* = 0.07164391723

The length of query strings given as input to the compareAtoB method, or to any of the query and comparison methods, is not limited (within the confines of system capacity). Entire documents may be used as input to these methods.
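For example, two complete document texts that are not in the Collection could be compared in exactly the same way as the short queries above (a sketch; the document strings are placeholders):

// Compare two entire documents in the context of the LSASpace
std::string docA = "...full text of the first document...";
std::string docB = "...full text of the second document...";
double docCos = LS->compareAtoB(docA, docB);
std::cout << "Similarity between the two documents = "
          << std::setprecision(10) << docCos << std::endl;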

Additional comparison functions are described in the class documentation for the LSASpace class. Several additional comparison methods are planned for future releases, as well as the use of alternate similarity measures.

These example programs should serve as an illustration of the basic features provided by the LSA_Toolkit, but there is much more that can be done with it. The feature set provided by the LSA_Toolkit will continue to grow.

Support

Small Bear Technologies is committed to helping you make the best use of LSA technology in your specific application. If you have questions about the operation of the library or about the results you are getting from your data, please contact us for advice.

Support is available from Small Bear Technologies, Inc.

Contact us at:
(865)309-4LSA
(865)309-4572

or by email at:
info@smallbeartechnologies.com

Support resources online at:
http://smallbeartechnologies.com
The Latent Semantic Analysis Experts