    Update 2011-11-16 - Revision 24438 - Version ZF 1.11.x

    Chapter 60. Zend_Search_Lucene

    Table of Contents

    60.1. Overview
    60.1.1. Introduction
    60.1.2. Document and Field Objects
    60.1.3. Understanding Field Types
    60.1.4. HTML documents
    60.1.5. Word 2007 documents
    60.1.6. Powerpoint 2007 documents
    60.1.7. Excel 2007 documents
    60.2. Building Indexes
    60.2.1. Creating a New Index
    60.2.2. Updating Index
    60.2.3. Updating Documents
    60.2.4. Retrieving Index Size
    60.2.5. Index optimization
    60.2.5.1. MaxBufferedDocs auto-optimization option
    60.2.5.2. MaxMergeDocs auto-optimization option
    60.2.5.3. MergeFactor auto-optimization option
    60.2.6. Permissions
    60.2.7. Limitations
    60.2.7.1. Index size
    60.2.7.2. Supported Filesystems
    60.3. Searching an Index
    60.3.1. Building Queries
    60.3.1.1. Query Parsing
    60.3.2. Search Results
    60.3.3. Limiting the Result Set
    60.3.4. Results Scoring
    60.3.5. Search Result Sorting
    60.3.6. Search Results Highlighting
    60.4. Query Language
    60.4.1. Terms
    60.4.2. Fields
    60.4.3. Wildcards
    60.4.4. Term Modifiers
    60.4.5. Range Searches
    60.4.6. Fuzzy Searches
    60.4.7. Matched terms limitation
    60.4.8. Proximity Searches
    60.4.9. Boosting a Term
    60.4.10. Boolean Operators
    60.4.10.1. AND
    60.4.10.2. OR
    60.4.10.3. NOT
    60.4.10.4. &&, ||, and ! operators
    60.4.10.5. +
    60.4.10.6. -
    60.4.10.7. No Operator
    60.4.11. Grouping
    60.4.12. Field Grouping
    60.4.13. Escaping Special Characters
    60.5. Query Construction API
    60.5.1. Query Parser Exceptions
    60.5.2. Term Query
    60.5.3. Multi-Term Query
    60.5.4. Boolean Query
    60.5.5. Wildcard Query
    60.5.6. Fuzzy Query
    60.5.7. Phrase Query
    60.5.8. Range Query
    60.6. Character Set
    60.6.1. UTF-8 and single-byte character set support
    60.6.2. Default text analyzer
    60.6.3. UTF-8 compatible text analyzers
    60.7. Extensibility
    60.7.1. Text Analysis
    60.7.2. Tokens Filtering
    60.7.3. Scoring Algorithms
    60.7.4. Storage Containers
    60.8. Interoperating with Java Lucene
    60.8.1. File Formats
    60.8.2. Index Directory
    60.8.3. Java Source Code
    60.9. Advanced
    60.9.1. Starting from 1.6, handling index format transformations
    60.9.2. Using the index as static property
    60.10. Best Practices
    60.10.1. Field names
    60.10.2. Indexing performance
    60.10.3. Index during Shut Down
    60.10.4. Retrieving documents by unique id
    60.10.5. Memory Usage
    60.10.6. Encoding
    60.10.7. Index maintenance

    60.1. Overview

    60.1.1. Introduction

    Zend_Search_Lucene is a general purpose text search engine written entirely in PHP 5. Since it stores its index on the filesystem and does not require a database server, it can add search capabilities to almost any PHP-driven website. Zend_Search_Lucene supports the following features:

    • Ranked searching - best results returned first

    • Many powerful query types: phrase queries, boolean queries, wildcard queries, proximity queries, range queries and many others.

    • Search by specific field (e.g., title, author, contents)

    Zend_Search_Lucene was derived from the Apache Lucene project. Starting from ZF 1.6, the supported Lucene index format versions are 1.4 - 2.3. For more information on Lucene, visit http://lucene.apache.org/java/docs/.

    [Note]

    Previous Zend_Search_Lucene implementations supported the Lucene 1.4 (1.9) - 2.1 index formats.

    Starting from Zend Framework 1.5, any index created using a pre-2.1 index format is automatically upgraded to the Lucene 2.1 format after the Zend_Search_Lucene update, and is then no longer compatible with the Zend_Search_Lucene implementations included in Zend Framework 1.0.x.

    60.1.2. Document and Field Objects

    Zend_Search_Lucene operates with documents as atomic objects for indexing. A document is divided into named fields, and fields have content that can be searched.

    A document is represented by the Zend_Search_Lucene_Document class, and objects of this class contain instances of Zend_Search_Lucene_Field that represent the fields of the document.

    It is important to note that any information can be added to the index. Application-specific information or metadata can be stored in the document fields, and later retrieved with the document during search.

    It is the responsibility of your application to control the indexer. This means that data can be indexed from any source that is accessible by your application. For example, this could be the filesystem, a database, an HTML form, etc.

    The Zend_Search_Lucene_Field class provides several static methods to create fields with different characteristics:

    $doc = new Zend_Search_Lucene_Document();

    // Field is not tokenized, but is indexed and stored within the index.
    // Stored fields can be retrieved from the index.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('doctype',
                                                     'autogenerated'));

    // Field is not tokenized nor indexed, but is stored in the index.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                       time()));

    // Binary String valued Field that is not tokenized nor indexed,
    // but is stored in the index.
    $doc->addField(Zend_Search_Lucene_Field::Binary('icon',
                                                    $iconData));

    // Field is tokenized and indexed, and is stored in the index.
    $doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                                  'Document annotation text'));

    // Field is tokenized and indexed, but is not stored in the index.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                      'My document content'));

    Each of these methods (excluding the Zend_Search_Lucene_Field::Binary() method) has an optional $encoding parameter for specifying input data encoding.

    Encoding may differ for different documents as well as for different fields within one document:

    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('title',
                                                  $title,
                                                  'iso-8859-1'));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                      $contents,
                                                      'utf-8'));

    If the encoding parameter is omitted, then the current locale is used at processing time. For example:

    setlocale(LC_ALL, 'de_DE.iso-8859-1');
    ...
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents));

    Fields are always stored and returned from the index in UTF-8 encoding. Any required conversion to UTF-8 happens automatically.

    Text analyzers (see below) may also convert text to other encodings. In fact, the default analyzer converts text to the 'ASCII//TRANSLIT' encoding. Be careful, however: this transliteration may depend on the current locale.
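
    The locale dependence mentioned here can be observed with PHP's iconv() function directly. The following is a minimal sketch rather than the analyzer's actual code path; the sample string and the two locale names are assumptions chosen for illustration, and the transliterated text differs between iconv implementations, so no output is shown.

```php
<?php
// Sample UTF-8 input (assumption, for illustration only): "Grüße aus Köln".
$text = "Gr\xC3\xBC\xC3\x9Fe aus K\xC3\xB6ln";

// Transliteration may differ per locale, so try more than one.
// setlocale() returns false when a locale is not installed.
$results = array();
foreach (array('C', 'de_DE.UTF-8') as $locale) {
    if (setlocale(LC_CTYPE, $locale) === false) {
        continue; // locale not available on this system
    }
    // @ suppresses the notice some iconv builds raise for unconvertible input;
    // in that case iconv() returns false.
    $results[$locale] = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
}

foreach ($results as $locale => $ascii) {
    echo $locale, ': ', var_export($ascii, true), PHP_EOL;
}
```

    If the lines printed for the two locales differ, index content built under one locale may not match queries analyzed under the other, which is why the locale should be kept consistent between indexing and searching.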

    Field names are defined at your discretion in the addField() method.

    Java Lucene uses the 'contents' field as a default field to search. Zend_Search_Lucene searches through all fields by default, but the behavior is configurable. See the "Default search field" chapter for details.
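
    As a brief sketch of that configuration, the static Zend_Search_Lucene::setDefaultSearchField() method can be used to mimic the Java Lucene behavior. The example assumes Zend Framework 1.x is available on the include path.

```php
<?php
// Assumes Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';

// Mimic Java Lucene: unqualified query terms are matched
// against the 'contents' field only.
Zend_Search_Lucene::setDefaultSearchField('contents');
$current = Zend_Search_Lucene::getDefaultSearchField();

// Passing null restores the default behavior of searching all fields.
Zend_Search_Lucene::setDefaultSearchField(null);
```

    The setting is static, so it applies to every index used during the request.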

    60.1.3. Understanding Field Types

    • Keyword fields are stored and indexed, meaning that they can be searched as well as displayed in search results. They are not split up into separate words by tokenization. Enumerated database fields usually translate well to Keyword fields in Zend_Search_Lucene.

    • UnIndexed fields are not searchable, but they are returned with search hits. Database timestamps, primary keys, file system paths, and other external identifiers are good candidates for UnIndexed fields.

    • Binary fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to store any data encoded as a binary string, such as an image icon.

    • Text fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like subjects and titles that need to be searchable as well as returned with search results.

    • UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier.

      Table 60.1. Zend_Search_Lucene_Field Types

      Field Type   Stored   Indexed   Tokenized   Binary
      Keyword      Yes      Yes       No          No
      UnIndexed    Yes      No        No          No
      Binary       Yes      No        No          Yes
      Text         Yes      Yes       Yes         No
      UnStored     No       Yes       Yes         No
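
    The table can be cross-checked in code. The sketch below creates one field of each type and prints the public isStored, isIndexed, isTokenized and isBinary properties of the resulting Zend_Search_Lucene_Field objects. It assumes Zend Framework 1.x is on the include path; the field names and values are placeholders.

```php
<?php
// Assumes Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Field.php';

// One field per factory method; names and values are placeholders.
$fields = array(
    'Keyword'   => Zend_Search_Lucene_Field::Keyword('doctype', 'autogenerated'),
    'UnIndexed' => Zend_Search_Lucene_Field::UnIndexed('created', (string) time()),
    'Binary'    => Zend_Search_Lucene_Field::Binary('icon', "\x89PNG"),
    'Text'      => Zend_Search_Lucene_Field::Text('annotation', 'Document annotation text'),
    'UnStored'  => Zend_Search_Lucene_Field::UnStored('contents', 'My document content'),
);

// Print the flags in the same column order as the table.
foreach ($fields as $type => $field) {
    printf(
        "%-9s stored=%-3s indexed=%-3s tokenized=%-3s binary=%s\n",
        $type,
        $field->isStored    ? 'Yes' : 'No',
        $field->isIndexed   ? 'Yes' : 'No',
        $field->isTokenized ? 'Yes' : 'No',
        $field->isBinary    ? 'Yes' : 'No'
    );
}
```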

    60.1.4. HTML documents

    Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:

    $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
    $index->addDocument($doc);
    ...
    $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
    $index->addDocument($doc);

    The Zend_Search_Lucene_Document_Html class uses the DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() methods to parse the source HTML, so the HTML does not need to be well formed or to be XHTML. On the other hand, it is sensitive to the encoding specified by the "meta http-equiv" header tag.
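
    The tolerance for malformed markup comes from DOMDocument itself and can be tried without Zend Framework. The following sketch uses only PHP's DOM extension; the HTML string is a deliberately broken sample made up for this illustration.

```php
<?php
// Deliberately malformed: unclosed <p> and <b>, no </body> or </html>.
$html = '<html><head><title>Broken page</title></head><body><p>Hello <b>world';

$dom = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$loaded = $dom->loadHTML($html);
libxml_clear_errors();

// The parser repairs the tree, so title and body are still reachable.
$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
$body  = $dom->getElementsByTagName('body')->item(0)->textContent;

echo 'title: ', $title, PHP_EOL;
echo 'body: ', trim($body), PHP_EOL;
```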

    The Zend_Search_Lucene_Document_Html class recognizes the document title, body and document header meta tags.

    The 'title' field is actually the /html/head/title value. It's stored within the index, tokenized and available for search.

    The 'body' field is the actual body content of the HTML file or string. It doesn't include scripts, comments or attributes.

    The loadHTML() and loadHTMLFile() methods of the Zend_Search_Lucene_Document_Html class also take an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

    The third parameter of the loadHTML() and loadHTMLFile() methods optionally specifies the source HTML document encoding. It is used if the encoding is not specified using the Content-type HTTP-EQUIV meta tag.

    Other document header meta tags produce additional document fields. The field name is taken from the 'name' attribute, and the 'content' attribute populates the field value. Both are tokenized, indexed and stored, so documents may be searched by their meta tags (for example, by keywords).
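
    A short sketch of how the title, body and meta tags surface as document fields follows. It assumes Zend Framework 1.x is on the include path; the HTML string is invented for the example, and values are trimmed before printing because surrounding whitespace is not guaranteed.

```php
<?php
// Assumes Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Sample page (assumption): one title and one named meta tag.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>Release notes</title>'
      . '<meta name="keywords" content="search, lucene">'
      . '</head><body><p>Body text.</p></body></html>';

// Second argument TRUE: store the body so it can be read back later.
$doc = Zend_Search_Lucene_Document_Html::loadHTML($html, true);

// List every field the parser produced, with its value.
$fieldNames = $doc->getFieldNames();
foreach ($fieldNames as $name) {
    echo $name, ': ', trim($doc->getFieldValue($name)), PHP_EOL;
}
```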

    Parsed documents may be augmented by the programmer with any other field:

    $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                       time()));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('updated',
                                                       time()));
    $doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                                  'Document annotation text'));
    $index->addDocument($doc);

    Document links are not included in the generated document, but may be retrieved with the Zend_Search_Lucene_Document_Html::getLinks() and Zend_Search_Lucene_Document_Html::getHeaderLinks() methods:

    $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
    $linksArray = $doc->getLinks();
    $headerLinksArray = $doc->getHeaderLinks();

    Starting from Zend Framework 1.6 it's also possible to exclude links with the rel attribute set to 'nofollow'. Use Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true) to turn on this option.

    The Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks() method returns the current state of the "Exclude nofollow links" flag.
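
    The following sketch shows the flag in use. It assumes Zend Framework 1.x is on the include path; the HTML string and its URLs are invented. The flag is set before the document is loaded, and the previous value is restored afterwards because the setting is static.

```php
<?php
// Assumes Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Sample page (assumption): one ordinary link and one nofollow link.
$html = '<html><head><title>Links</title></head><body>'
      . '<a href="/docs">Docs</a> '
      . '<a href="/ads" rel="nofollow">Ads</a>'
      . '</body></html>';

// Remember the current setting so it can be restored afterwards.
$previous = Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks();

// Turn the option on before parsing the document.
Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true);
$doc   = Zend_Search_Lucene_Document_Html::loadHTML($html);
$links = $doc->getLinks();

Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks($previous);

print_r($links);
```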

    60.1.5. Word 2007 documents

    Zend_Search_Lucene offers a Word 2007 parsing feature. Documents can be created directly from a Word 2007 file:

    $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
    $index->addDocument($doc);

    The Zend_Search_Lucene_Document_Docx class uses the ZipArchive class and SimpleXML functions to parse the source document. If the ZipArchive class (from module php_zip) is not available, Zend_Search_Lucene_Document_Docx will not be available for use with Zend Framework either.
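
    An application can test for these prerequisites before attempting to index Office 2007 files. The check below is a minimal sketch in plain PHP; officeLoadersAvailable() is a hypothetical helper name and not part of Zend Framework.

```php
<?php
// Hypothetical helper (not part of Zend Framework): reports whether the
// PHP features needed by the Office 2007 document loaders are present.
function officeLoadersAvailable()
{
    return class_exists('ZipArchive')
        && function_exists('simplexml_load_string');
}

if (officeLoadersAvailable()) {
    echo 'Zip and SimpleXML support found.', PHP_EOL;
} else {
    echo 'Zip or SimpleXML support is missing; skip Office 2007 files.', PHP_EOL;
}
```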

    Zend_Search_Lucene_Document_Docx class recognizes document meta data and document text. Meta data consists, depending on document contents, of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, created.

    The 'filename' field is the actual Word 2007 file name.

    The 'title' field is the actual document title.

    The 'subject' field is the actual document subject.

    The 'creator' field is the actual document creator.

    The 'keywords' field contains the actual document keywords.

    The 'description' field is the actual document description.

    The 'lastModifiedBy' field is the name of the user who last modified the document.

    The 'revision' field is the actual document revision number.

    The 'modified' field is the actual document last modified date / time.

    The 'created' field is the actual document creation date / time.

    The 'body' field is the actual body content of the Word 2007 document. It includes only normal text; comments and revisions are not included.

    The loadDocxFile() method of the Zend_Search_Lucene_Document_Docx class also takes an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

    Parsed documents may be augmented by the programmer with any other field:

    $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
        'indexTime',
        time())
    );
    $doc->addField(Zend_Search_Lucene_Field::Text(
        'annotation',
        'Document annotation text')
    );
    $index->addDocument($doc);

    60.1.6. Powerpoint 2007 documents

    Zend_Search_Lucene offers a Powerpoint 2007 parsing feature. Documents can be created directly from a Powerpoint 2007 file:

    $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
    $index->addDocument($doc);

    The Zend_Search_Lucene_Document_Pptx class uses the ZipArchive class and SimpleXML functions to parse the source document. If the ZipArchive class (from module php_zip) is not available, Zend_Search_Lucene_Document_Pptx will not be available for use with Zend Framework either.

    Zend_Search_Lucene_Document_Pptx class recognizes document meta data and document text. Meta data consists, depending on document contents, of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, created.

    The 'filename' field is the actual Powerpoint 2007 file name.

    The 'title' field is the actual document title.

    The 'subject' field is the actual document subject.

    The 'creator' field is the actual document creator.

    The 'keywords' field contains the actual document keywords.

    The 'description' field is the actual document description.

    The 'lastModifiedBy' field is the name of the user who last modified the document.

    The 'revision' field is the actual document revision number.

    The 'modified' field is the actual document last modified date / time.

    The 'created' field is the actual document creation date / time.

    The 'body' field is the actual content of all slides and slide notes in the Powerpoint 2007 document.

    The loadPptxFile() method of the Zend_Search_Lucene_Document_Pptx class also takes an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

    Parsed documents may be augmented by the programmer with any other field:

    $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
        'indexTime',
        time()));
    $doc->addField(Zend_Search_Lucene_Field::Text(
        'annotation',
        'Document annotation text'));
    $index->addDocument($doc);

    60.1.7. Excel 2007 documents

    Zend_Search_Lucene offers an Excel 2007 parsing feature. Documents can be created directly from an Excel 2007 file:

    $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
    $index->addDocument($doc);

    The Zend_Search_Lucene_Document_Xlsx class uses the ZipArchive class and SimpleXML functions to parse the source document. If the ZipArchive class (from module php_zip) is not available, Zend_Search_Lucene_Document_Xlsx will not be available for use with Zend Framework either.

    Zend_Search_Lucene_Document_Xlsx class recognizes document meta data and document text. Meta data consists, depending on document contents, of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, created.

    The 'filename' field is the actual Excel 2007 file name.

    The 'title' field is the actual document title.

    The 'subject' field is the actual document subject.

    The 'creator' field is the actual document creator.

    The 'keywords' field contains the actual document keywords.

    The 'description' field is the actual document description.

    The 'lastModifiedBy' field is the name of the user who last modified the document.

    The 'revision' field is the actual document revision number.

    The 'modified' field is the actual document last modified date / time.

    The 'created' field is the actual document creation date / time.

    The 'body' field is the actual content of all cells in all worksheets of the Excel 2007 document.

    The loadXlsxFile() method of the Zend_Search_Lucene_Document_Xlsx class also takes an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

    Parsed documents may be augmented by the programmer with any other field:

    $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed(
        'indexTime',
        time()));
    $doc->addField(Zend_Search_Lucene_Field::Text(
        'annotation',
        'Document annotation text'));
    $index->addDocument($doc);
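
    Because the Word, Powerpoint and Excel loaders share the same calling convention, an indexer can choose between them by file extension. The sketch below is one possible approach, not part of Zend Framework: officeLoaderFor() is a hypothetical helper and the file names are placeholders. It only builds the callback, so it runs without Zend Framework installed.

```php
<?php
// Hypothetical helper (not part of Zend Framework): maps a file extension
// to the matching Zend_Search_Lucene document loader callback.
function officeLoaderFor($filename)
{
    static $loaders = array(
        'docx' => array('Zend_Search_Lucene_Document_Docx', 'loadDocxFile'),
        'pptx' => array('Zend_Search_Lucene_Document_Pptx', 'loadPptxFile'),
        'xlsx' => array('Zend_Search_Lucene_Document_Xlsx', 'loadXlsxFile'),
    );

    // Compare case-insensitively so "REPORT.DOCX" is recognized too.
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    return isset($loaders[$extension]) ? $loaders[$extension] : null;
}

$loader = officeLoaderFor('report.DOCX');
if ($loader !== null) {
    echo 'Loader: ', implode('::', $loader), PHP_EOL;
    // With Zend Framework loaded, the document would then be created with:
    // $doc = call_user_func($loader, 'report.DOCX');
}
```

    Returning null for unknown extensions lets the caller skip unsupported files or fall back to another loader.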