CPCUG Monitor
Creating dynamic websites with ColdFusion
Part 5: Verity Free Text Search
What is ColdFusion?
In this article we continue to look at what ColdFusion is
and how you can use it for dynamic website creation. We cover free text
searching of both multiple files and databases using the ColdFusion Verity text
search engine. Free text searching lets you look for words anywhere in a
directory structure or database.
In case you missed the previous articles that introduced
ColdFusion, let me explain what it is. ColdFusion is a programming language
based on standard HTML (HyperText Markup Language) that is used to write
dynamic webpages. It lets you create pages on the fly that differ depending on
user input, database lookups, time of day or whatever other criteria you dream
up! ColdFusion pages consist of standard HTML tags such as <FONT SIZE="+2">
together with CFML (ColdFusion Markup Language) tags such as <CFSEARCH>, <CFIF>
and <CFLOOP>.
ColdFusion was introduced by Allaire in 1995 and is currently on version 4.
Text Searching in ColdFusion
Free text searching is a very powerful programming tool
that lets you search thousands of files or database records for any text
anywhere within them. ColdFusion implements text searching with Verity
using the <CFSEARCH> and <CFINDEX> tags. The search language allows
for:
· Wildcards - regular expression style use of ?, *, [], -, ^
· Evidence operators - STEM, WILDCARD, WORD
· Proximity operators - NEAR, PARAGRAPH, PHRASE, SENTENCE
· Relational operators - CONTAINS, MATCHES, STARTS, ENDS, SUBSTRING
· Concept operators - AND, OR, ACCRUE
· Score operators - YESNO, PRODUCT, SUM, COMPLEMENT
You could write hundreds of lines of code to do these kinds
of searches yourself, but it would run orders of magnitude slower than using
the single <CFSEARCH> tag. This is because Verity creates a word lookup
index (or collection) of every piece of text in your files or records so that
it can go straight to the ones you are searching for. This is analogous to the
index in a book, which lists all the pages that a certain word appears on. If
you imagine how tedious it would be to search for words in a book without an
index, you will have an idea of the advantages Verity can give your ColdFusion
programs!
The Verity Engine
The free text indexing and searching functionality in
ColdFusion is based on Verity, Inc.’s SEARCH’97 product. Indexing data is
available both through the <CFINDEX> tag and the ColdFusion
Administrator, where you can create and manage collections. Searching is done
using the <CFSEARCH> tag. Output of search results to your pages is done
using the same <CFOUTPUT> tag that you would use with database queries.
The Verity engine performs searches against collections. Collections consist of an
index of all the words in all the files or records you want to search.
Collection information includes:
· Word indexes
· An internal documents table
· Logical pointers to actual document files
In your ColdFusion application, you can populate and search
multiple collections, each of which can be designed to focus on a specific
group of documents or queries, according to subject, document type, location,
or any other logical grouping. Searches can be performed against multiple
collections, giving you lots of flexibility in designing your search interface.
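As a minimal sketch of a multi-collection search, the COLLECTION attribute of
<CFSEARCH> accepts a comma-separated list of collection names. KnowledgeBase
and Messages are the collections built later in this article; the result name
AllHits and the form field Form.SearchText are assumptions made for
illustration.
<!--- Search two collections in one call (AllHits and Form.SearchText are hypothetical names) --->
<CFSEARCH
    COLLECTION="KnowledgeBase,Messages"
    NAME="AllHits"
    TYPE="SIMPLE"
    CRITERIA="#Form.SearchText#">
<!--- List the score and key of every hit, whichever collection it came from --->
<CFOUTPUT QUERY="AllHits">
#Score# #HTMLEditFormat(Key)#<BR>
</CFOUTPUT>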
The <CFINDEX> tag lets you manage the data in an
existing collection, including:
· Indexing text or binary data in specified directories, or indexing
ColdFusion queries.
· Purging a collection of data.
· Updating, refreshing, and optimizing a collection.
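The maintenance operations in this list map onto values of the <CFINDEX>
ACTION attribute. A short sketch, assuming a collection named KnowledgeBase
already exists:
<!--- Remove all indexed data but keep the (now empty) collection for re-population --->
<CFINDEX ACTION="PURGE" COLLECTION="KnowledgeBase">
<!--- Reorganize the collection's index files, for example after many updates --->
<CFINDEX ACTION="OPTIMIZE" COLLECTION="KnowledgeBase">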
Creating a Verity Collection
However, before you can perform any of these operations
using <CFINDEX>, you need to create the collection in the ColdFusion
Administrator. This is somewhat similar to how you have to create a datasource
for SQL queries in the Administrator. Here are the steps for creating a
collection:
1. Open the ColdFusion Administrator Verity page.
2. Enter a name for your collection. The Administrator fills in the
Collection Root path with a corresponding directory path.
3. Click Create. The new collection name and path appear in the Verity
Collections list.
Figure 1. Creating a Verity Collection in the ColdFusion Administrator.
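The Administrator is the route described in this article. As an alternative
sketch, ColdFusion also documents a <CFCOLLECTION> tag that can create a
collection from code; check that your version provides it before relying on
it. The PATH value below is a hypothetical directory, not one taken from our
server.
<!--- Assumption: <CFCOLLECTION> is available in your ColdFusion version --->
<!--- PATH is a hypothetical location for the collection files --->
<CFCOLLECTION
    ACTION="CREATE"
    COLLECTION="KnowledgeBase"
    PATH="C:\CFUSION\Verity\Collections\">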
Once your collection is created, you can use either the
Administrator or the <CFINDEX> tag to populate it with documents to
search. Generally I use the Administrator for static data and the
<CFINDEX> tag for data that changes and must be re-indexed frequently.
Indexing documents
ColdFusion allows you to index and search collections
populated with data from:
· ASCII text files.
· Binary Office documents (see below for details about document types).
· ColdFusion queries resulting from data returned by a <CFQUERY> operation.
You can index libraries of HTML and CFML documents and other
ASCII text files. Choose specific documents or an entire directory tree as the
target of your collection. Collections can be stored anywhere, so you have a
lot of flexibility in accessing indexed data. This adds enormous value to any
content-rich Web site.
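Indexing one specific document rather than a whole directory tree can be
sketched as follows. With TYPE="FILE" the KEY attribute holds the full path of
the file to index. The file name shown is a hypothetical example.
<!--- Add or update a single document in the collection (hypothetical file name) --->
<CFINDEX
    ACTION="UPDATE"
    COLLECTION="KnowledgeBase"
    TYPE="FILE"
    KEY="X:\knowledgebase\ColdFusion\Verity search tips.txt">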
For example, at TeraTech we are always coming across useful
emails, documents, code snippets, web pages and newsgroup references. We never
knew how to store these effectively for future reference. Paper printouts were
hard to search and share in a team, and our existing computer copies were not
much better! So we came up with a simple knowledge base by creating a
straightforward directory-based system that can be searched by Verity. (It also
has the added advantage of being very easy to save documents to.) If you make
it too hard to save documents for reference, there will be no documents to
search (it’s useless if no one uses it)! This is why we prefer saving the text
documents to a simple directory system, instead of trying to be sophisticated
and saving them in a database.
Whenever we find a document with some reference value, whether in email,
newsgroups or on the web, we save it to the knowledge base directory on our
shared X: drive. It is useful to give the file a long, descriptive name, since
this will basically be the title of the document when search results are
returned. We have found that Eudora conveniently saves email messages with a
file name based on the subject of the message!
The ColdFusion code to index our knowledge base documents into the
Verity collection is:
<CFINDEX
    ACTION="REFRESH"
    COLLECTION="KnowledgeBase"
    KEY="X:\knowledgebase"
    TYPE="PATH"
    EXTENSIONS=".htm, .cfm, .dbm, .txt, .htm*, .doc, .rtf, .pdf, *."
    RECURSE="Yes">
Here we are refreshing a collection named KnowledgeBase from the documents
stored in the directory X:\knowledgebase\. The RECURSE parameter tells
Verity to index all subdirectories too. The EXTENSIONS parameter lists the file
types to index.
Note: if X: is not a physical drive on the ColdFusion
server, you may have to refer to it by a UNC (Universal Naming Convention) such
as \\mswebserver\x-drive. This is because by default the ColdFusion process
runs without logging into the machine, and so it doesn’t see mapped drive
letters such as X:.
The knowledge base directory is broken down into common
developer’s areas of interest, such as JavaScript, ColdFusion, ASP, Access97,
VB, HTML, etc. New directories can be
added as needed. The directories are
not really necessary as far as Verity is concerned, but are useful to prevent
information scramble/overload (and in case we ever want to do any clean up of
the data).
For many documents the <CFINDEX> tag can take some
time to run (on our site it takes 45 seconds on average for 1000 documents). To
avoid user delays and still keep the collection up to date as new documents are
saved we use the ColdFusion scheduler to automatically run the refresh action
above at 6am every day. A <CFMAIL> tag emails me to confirm that the
command ran ok.
<!--- Remember when we started so we can report the elapsed time --->
<CFSET starttime = Now()>
<CFINDEX
    ACTION="REFRESH"
    COLLECTION="KnowledgeBase"
    KEY="\\mswebserver\x-drive\knowledgebase"
    TYPE="PATH"
    EXTENSIONS=".htm, .cfm, .dbm, .txt, .htm*, .doc, .rtf, .pdf, *."
    RECURSE="Yes">
<!--- Confirm by email that the refresh completed --->
<CFMAIL
    TO="[email protected]"
    FROM="[email protected]"
    SUBJECT="Knowledgebase refresh"
    SERVER="smtp.teratech.com">
Knowledge base successfully refreshed.
Time taken: #DateDiff('s', starttime, Now())# seconds.
</CFMAIL>
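The daily schedule itself can be set up on the Administrator's scheduler page
or, as sketched here, with the <CFSCHEDULE> tag. The task name, the start date
and the URL of the page containing the refresh code are all hypothetical
values chosen for illustration.
<!--- Create or update a task that requests the refresh page at 6am every day --->
<!--- TASK, URL and STARTDATE are hypothetical; point URL at the page holding the code above --->
<CFSCHEDULE
    ACTION="UPDATE"
    TASK="RefreshKnowledgeBase"
    OPERATION="HTTPRequest"
    URL="http://127.0.0.1/knowledgebase/refresh.cfm"
    STARTDATE="01/01/1999"
    STARTTIME="6:00 AM"
    INTERVAL="Daily">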
Indexing queries
In addition to indexing documents, Verity can index your
output from a <CFQUERY>. Of course you could do this in SQL using the
LIKE operator or the INSTR() function, but both of these methods use full table
scans and so are slow on any but the smallest databases. Another advantage is
that the search interface is simple both for the user and for you coding it, as
typically you have one input field that is searched through all fields in the
database.
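For comparison, the SQL approach mentioned above might look like the following
sketch. It uses the Messages table and TestDatasource data source from the
example in this section; the query name SlowSearch and the form field
Form.SearchText are assumptions. The leading % wildcard is what forces the
database to scan every row.
<!--- Full table scan: every Body value is compared against the pattern --->
<CFQUERY NAME="SlowSearch" DATASOURCE="TestDatasource">
    SELECT Message_ID, UserName
    FROM Messages
    WHERE Body LIKE '%#Form.SearchText#%'
</CFQUERY>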
To index a ColdFusion query:
1. Define a logical name and location for your collection using the
ColdFusion Administrator Verity page.
2. Execute a <CFQUERY> to retrieve data from the desired ODBC data source.
3. Generate the collection using the <CFINDEX> tag.
The query set is indexed using the <CFINDEX> tag in
which you specify a KEY, typically a unique value like the primary key, and the
column in which you want to conduct searches, the BODY. In our example we have
a database of email messages to query from.
<CFQUERY NAME="Messages" DATASOURCE="TestDatasource">
    SELECT Message_ID, Body, UserName
    FROM Messages
</CFQUERY>
<CFINDEX
    COLLECTION="Messages"
    ACTION="UPDATE"
    TYPE="CUSTOM"
    BODY="Body"
    KEY="Message_ID"
    TITLE="UserName"
    QUERY="Messages">
This <CFINDEX> statement specifies the Body column as
the core of the collection and names the KEY as the Message_ID column, the
table's primary key. Note that the TITLE attribute names the UserName column
from the Messages table. The TITLE attribute can be used to designate an output
parameter when you are displaying your Verity search results. In the results of
a <CFSEARCH> (assumed here to be named SearchOutput), the Message_ID value
comes back in the KEY column and the UserName value in the TITLE column:
<CFOUTPUT QUERY="SearchOutput">
Message number #KEY# was written by #TITLE#.<BR>
</CFOUTPUT>
We explain in detail how to search the collection below.
To index more than one column in a collection, enter a
comma-separated list of column names as the value of the BODY attribute, such as:
BODY="FirstName,LastName,Company"
As an alternative, you can use the concatenation operator of
your DBMS in a SELECT statement, such as:
SELECT FIRSTNAME + ' ' + LASTNAME AS WHOLENAME
A space is inserted between the concatenated values to
avoid running words together. You would then generate a collection from
WHOLENAME.
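Put together, indexing several columns might look like the sketch below. The
Contacts table, its Contact_ID primary key and the Contacts collection are
hypothetical; only the BODY column list comes from the text above.
<!--- Hypothetical Contacts table; TestDatasource is the data source used earlier --->
<CFQUERY NAME="Contacts" DATASOURCE="TestDatasource">
    SELECT Contact_ID, FirstName, LastName, Company
    FROM Contacts
</CFQUERY>
<!--- All three columns listed in BODY become searchable text --->
<CFINDEX
    COLLECTION="Contacts"
    ACTION="UPDATE"
    TYPE="CUSTOM"
    QUERY="Contacts"
    KEY="Contact_ID"
    TITLE="Company"
    BODY="FirstName,LastName,Company">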
Searching a Verity collection
The <CFSEARCH> tag lets you search one or more Verity
collections. Searches can be for single words, for multiple words, or for
complex expressions using proximity operators, such as words within three
words of each other or in the same sentence.
In our file-based knowledge base example:
<CFSEARCH
COLLECTION="KnowledgeBase"
NAME="Articles"
TYPE="SIMPLE"
CRITERIA="#URL.SearchText#">
Here we are searching the collection called KnowledgeBase
with a simple word search for words contained in the URL parameter SearchText.
This parameter has been passed on the URL string to our search results page.
The list of files matching the search is returned in the query named Articles.
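The same tag works against the query-based Messages collection indexed
earlier. In this sketch the KEY values returned by the search (the Message_ID
values) are fed back into a database query to fetch the full rows. It assumes
Message_ID is a numeric column; the query name FoundMessages is hypothetical.
<CFSEARCH
    COLLECTION="Messages"
    NAME="SearchOutput"
    TYPE="SIMPLE"
    CRITERIA="#URL.SearchText#">
<!--- Only query the database if Verity found something; an empty IN () list is invalid SQL --->
<CFIF SearchOutput.RecordCount GT 0>
    <CFQUERY NAME="FoundMessages" DATASOURCE="TestDatasource">
        SELECT Message_ID, UserName, Body
        FROM Messages
        WHERE Message_ID IN (#ValueList(SearchOutput.Key)#)
    </CFQUERY>
</CFIF>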
To display the search results a pageful at a time we use the
<CFOUTPUT> tag with the startrow and maxrows parameters. These would be
set using paging buttons on the results page, which to save space we have not
shown here. We use a table format to make the display easier to read.
<TABLE BORDER="0" CELLPADDING="2" CELLSPACING="2">
<TR>
    <TD><B>Score</B></TD><TD><B>Summary</B></TD>
</TR>
<CFOUTPUT QUERY="Articles" STARTROW="#StartAt#" MAXROWS="#stepsize#">
<TR>
    <TD WIDTH="30%" VALIGN="TOP">#score#</TD>
    <TD WIDTH="70%" VALIGN="TOP">
        <A HREF="/knowledgebase/#URLEncodedFormat(url)#/#Replace(url, ' ', '', 'ALL')#" TARGET="_new">
        <B>#Replace(key, "\\mswebserver\x-drive\knowledgebase\", '', 'ALL')#</B></A>
        <BR>#HTMLEditFormat(Summary)#
    </TD>
</TR>
</CFOUTPUT>
</TABLE>
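One possible way to drive StartAt and stepsize is sketched below. It is not
the paging code from our site: the page name results.cfm, the page size of 10
and the helper variables PrevStart and NextStart are assumptions. The first
part would go near the top of the results page, before the table; the links
would go after it.
<!--- Top of page: default to the first row and sanitize the URL value --->
<CFPARAM NAME="URL.StartAt" DEFAULT="1">
<CFSET stepsize = 10>
<CFSET StartAt = Val(URL.StartAt)>
<CFIF StartAt LT 1><CFSET StartAt = 1></CFIF>
<!--- Precompute the row numbers for the neighbouring pages --->
<CFSET PrevStart = Max(StartAt - stepsize, 1)>
<CFSET NextStart = StartAt + stepsize>

<!--- After the table: show only the links that lead somewhere --->
<CFOUTPUT>
<CFIF StartAt GT 1>
    <A HREF="results.cfm?SearchText=#URLEncodedFormat(URL.SearchText)#&StartAt=#PrevStart#">Previous</A>
</CFIF>
<CFIF NextStart LTE Articles.RecordCount>
    <A HREF="results.cfm?SearchText=#URLEncodedFormat(URL.SearchText)#&StartAt=#NextStart#">Next</A>
</CFIF>
</CFOUTPUT>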
In the output we use the standard <CFSEARCH> output
columns score, url, key and summary (see below). We also use the
URLEncodedFormat function in case the file name contains spaces, and we add the
file name to the end of the URL a second time with spaces stripped, so that if
the file is downloaded it will be saved with the stripped name. For example “My
Test.doc” would have URL My%20Test%2Edoc/MyTest.doc and if you clicked on the
link the file name would be MyTest.doc. The TARGET="_new" parameter of the HTML
<A HREF> tag tells the browser to use a new window when you click on the link.
We use the HTMLEditFormat function on the summary variable because if it
contains HTML it could break our display; the function converts the HTML codes
to displayable text.
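The link-building step can be tried in isolation with the file name from the
example above. The variable names FileName and Stripped are hypothetical;
according to the example, the HREF should end in My%20Test%2Edoc/MyTest.doc.
<CFSET FileName = "My Test.doc">
<!--- Second copy of the name with every space removed, used as the download name --->
<CFSET Stripped = Replace(FileName, " ", "", "ALL")>
<CFOUTPUT>
<A HREF="/knowledgebase/#URLEncodedFormat(FileName)#/#Stripped#" TARGET="_new">#HTMLEditFormat(FileName)#</A>
</CFOUTPUT>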
The full list of Verity result variables is:
· KEY - the value of the KEY attribute defined in the CFINDEX tag used to
populate the collection. In our case the filename and path.
· TITLE - returns the value of the TITLE attribute defined by the <TITLE>
HTML tag in any HTML or ColdFusion application page file that was indexed by
CFINDEX. If the collection was TYPE=CUSTOM, TITLE returns the value of the
TITLE attribute defined by the CFINDEX tag. If the collection was TYPE=FILE,
TITLE also returns the value of the TITLE attribute defined by the CFINDEX tag.
· SCORE - returns the relevancy score of the document based on the search
criteria, from 0 to 100.
· URL - returns the value of the URLPATH attribute defined in the CFINDEX tag
used to populate the collection.
· SUMMARY - the best three sentences or 500 characters of documents returned
by a search.
· CUSTOM1, CUSTOM2 - user-defined key fields.
· RECORDCOUNT - the total number of records returned by the query.
· CURRENTROW - the current row of the query being processed by CFOUTPUT.
· RECORDSSEARCHED - the total number of records in the index that were
searched.
Figure 2: Verity search results page
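The count variables are handy for a status line above the results. A small
sketch, assuming the Articles result set from the search above:
<CFOUTPUT>
<CFIF Articles.RecordCount EQ 0>
    No documents matched your search.
<CFELSE>
    Found #Articles.RecordCount# matching documents out of
    #Articles.RecordsSearched# searched.
</CFIF>
</CFOUTPUT>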
Verity Search Query Language
You can do more than just search for single words using the
<CFSEARCH> CRITERIA parameter. You can also enter comma-delimited strings
and use wildcard characters (regular expressions). By default, a simple query
searches for words, not strings. For example, entering the word "all"
will find documents containing the word "all" but not
"allegorical." You can use wildcards, however, to broaden the scope of
the search. "all*" will return documents containing both
"all" and "alliterate." Case is ignored, but only when (as
above) the search string is all lowercase or all uppercase. If the criteria is
mixed case ("All"), only the same case would match (only "All", not
"all" or "ALL").
You can enter multiple words separated by commas: software,
Microsoft, Oracle. The comma in a Simple query expression is treated like a
logical OR. If you omit the commas, the query expression is treated as a
phrase, so documents would be searched for the phrase "software Microsoft
Oracle."
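The wildcard, case and comma rules above can be sketched as CRITERIA values.
The result names are hypothetical. The second search lowercases the user's
text with LCase so that mixed-case input does not trigger the case-sensitive
matching described above; that is one possible design choice, not something
our knowledge base page necessarily does.
<!--- Wildcard: matches "all", "alliterate", "allegorical" and so on --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="WildHits" TYPE="SIMPLE"
    CRITERIA="all*">
<!--- Force case-insensitive matching by lowercasing the user's input --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="AnyCaseHits" TYPE="SIMPLE"
    CRITERIA="#LCase(URL.SearchText)#">
<!--- Commas act as OR: documents mentioning any of the three words --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="OrHits" TYPE="SIMPLE"
    CRITERIA="software, Microsoft, Oracle">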
You can use the AND, OR, and NOT operators in a simple
query: software AND (Microsoft OR Oracle). To include an operator in a search,
you surround it with double quotation marks: software "and" Microsoft.
This expression searches for the phrase "software and Microsoft."
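The two expressions above would be passed to <CFSEARCH> as follows. The result
names are hypothetical. Note that the second CRITERIA value is wrapped in
single quotes so that it can contain the double quotation marks around the
operator.
<!--- Boolean operators with grouping --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="BoolHits" TYPE="SIMPLE"
    CRITERIA="software AND (Microsoft OR Oracle)">
<!--- Quoted operator: "and" is searched for as an ordinary word in the phrase --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="PhraseHits" TYPE="SIMPLE"
    CRITERIA='software "and" Microsoft'>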
A simple query employs the STEM operator and the MANY
modifier. STEM searches for words that derive from those entered in the query
expression, so that entering "find" will return documents that
contain "find," "finding," "finds," etc. The MANY
modifier forces the documents returned in the search to be presented in a list
based on a relevancy score.
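As a hedged sketch of what the simple query does for you: with TYPE="EXPLICIT"
nothing is added automatically, and Verity operators and modifiers are written
in angle brackets. The first search below spells out roughly what a simple
search for find implies; the second uses a proximity operator from the list at
the start of this article. The result names are hypothetical, and the exact
explicit syntax should be checked against the Verity operator reference for
your ColdFusion version.
<!--- Explicit equivalent of the simple query "find" (assumed syntax) --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="StemHits" TYPE="EXPLICIT"
    CRITERIA="<MANY><STEM>find">
<!--- Proximity: documents where the two words occur close together (assumed syntax) --->
<CFSEARCH COLLECTION="KnowledgeBase" NAME="NearHits" TYPE="EXPLICIT"
    CRITERIA="index <NEAR> collection">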
For a full list of Verity operators see the on-line help
on our knowledge base page at http://www.teratech.com/knowledgebase/. You can
also try out our Verity knowledge base there!
Summary
In this article we learned how to index both documents and
large database queries for free text searches using Verity. We used the CFINDEX
and CFSEARCH tags together with CFOUTPUT to display the results.
To Learn More
You can download a free 30-day evaluation version of
ColdFusion from Allaire or request a free eval CD-ROM from
the Allaire website http://www.allaire.com/
Allaire Corporation
1 Alewife Center
Cambridge, MA 02140
Tel: 617.761.2000
Fax: 617.761.2001
Toll Free: 888.939.2545
Email: [email protected]
Web: www.allaire.com
ColdFusion Resources
Allaire also maintains an extensive knowledge base and
tech support forums on their website.
CPCUG and TeraTech ColdFusion Conference http://www.cfconf.org/
TeraTech maintains a collection of ColdFusion code cuttings called
ColdCuts at http://www.teratech.com/ColdCuts/.
This page also has links to about a dozen ColdFusion white papers in the CF
Info Center.
The Maryland ColdFusion User Group meets the second
Tuesday of each month at Backstreets Cafe, 12352 Wilkins Avenue, Rockville.
See http://www.cfug-md.org/ for details and directions.
The DC ColdFusion User Group meets the first Wednesday
of each month at Figleaf, 16th and P St NW, Washington DC. See the
DCCFUG page on http://www.figleaf.com/
for details and directions.
Bio
Michael Smith is president of TeraTech, a ten-year-old
Rockville, Maryland-based consulting company that specializes in ColdFusion,
database and Visual Basic development. You can reach Michael at [email protected] or 301-424-3903.