    Translation 21.3%, updated 2011-11-16, revision 24356, version ZF 1.11.x

    58.6. Character Set

    58.6.1. UTF-8 and single-byte character set support

    Zend_Search_Lucene works with the UTF-8 charset internally. Index files store Unicode data in Java's "modified UTF-8" encoding. The Zend_Search_Lucene core fully supports this encoding, with one exception. [16]

    The actual encoding of the input data may be specified through the Zend_Search_Lucene API. The data is then converted to UTF-8 automatically.
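
    As a rough illustration of what such a conversion involves, the sketch below converts an ISO-8859-1 string to UTF-8 with PHP's iconv(). It is not Zend_Search_Lucene's own code; the sample string and variable names are assumptions made for the example.

```php
<?php
// "café" encoded as ISO-8859-1: the letter with the accent is the single byte 0xE9.
$latin1 = "caf\xE9";

// Convert to UTF-8, the charset Zend_Search_Lucene uses internally.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// The accented letter becomes the two-byte sequence 0xC3 0xA9.
echo bin2hex($utf8), "\n"; // 636166c3a9
```

    In Zend_Search_Lucene itself, field factory methods such as Zend_Search_Lucene_Field::Text() take an optional encoding argument for this purpose.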

    58.6.2. Default text analyzer

    However, the default text analyzer (which is also used by the query parser) uses ctype_alpha() to tokenize text and queries.

    ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to 'ASCII//TRANSLIT' encoding before indexing. The same processing is transparently performed during query parsing. [17]
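
    The limitation can be seen in a small sketch, assuming the 'C' locale for LC_CTYPE (pinned here only to keep the result reproducible). The sample strings are assumptions, and the transliterated output is deliberately not shown because it varies by platform.

```php
<?php
// Pin the ctype locale so the checks below do not depend on the environment.
setlocale(LC_CTYPE, 'C');

$ascii = 'cafe';
$utf8  = "caf\xC3\xA9"; // "café" in UTF-8

var_dump(ctype_alpha($ascii)); // bool(true)
var_dump(ctype_alpha($utf8));  // bool(false): the bytes 0xC3 0xA9 are not ASCII letters

// Transliterate to ASCII, the kind of conversion the default analyzer applies
// before tokenizing. The exact result depends on locale and OS.
$translit = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
var_dump($translit);
```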


    The default analyzer doesn't treat numbers as parts of terms. Use the corresponding 'Num' analyzer if you don't want words to be broken by numbers.
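
    The difference between the two tokenizing styles can be sketched with two regular expressions. The helper functions below are hypothetical and only mimic the behavior described; they are not the analyzers' actual implementation.

```php
<?php
// Hypothetical helper: letters only, so digits act as term separators.
function tokenizeLetters(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

// Hypothetical helper: letters and digits both belong to a term.
function tokenizeLettersAndDigits(string $text): array
{
    preg_match_all('/[a-zA-Z0-9]+/', $text, $matches);
    return $matches[0];
}

print_r(tokenizeLetters('mp3 player x86'));          // mp, player, x
print_r(tokenizeLettersAndDigits('mp3 player x86')); // mp3, player, x86
```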

    58.6.3. UTF-8 compatible text analyzers

    Zend_Search_Lucene also contains a set of UTF-8 compatible analyzers: Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive.

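    Any of these analyzers can be enabled with code like this:

    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());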

    The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier versions of the analyzers assumed that all non-ASCII characters are letters. The new implementation behaves more accurately.

    You may therefore need to rebuild the index so that data and search queries are tokenized in the same way; otherwise the search engine may return wrong result sets.
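
    To see why an index built with one tokenizing rule may not match queries parsed with another, compare the two rules on the same text. This is a minimal sketch with hypothetical helper names and patterns chosen for illustration, not the analyzers' real code; it assumes PCRE was built with UTF-8 support.

```php
<?php
// Hypothetical "old" rule: every non-ASCII byte counts as part of a word.
function tokenizeAssumingNonAsciiLetters(string $text): array
{
    preg_match_all('/[a-zA-Z\x80-\xFF]+/', $text, $matches);
    return $matches[0];
}

// Hypothetical "new" rule: only characters with the Unicode Letter property count.
function tokenizeUnicodeLetters(string $text): array
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

$text = 'über €5 café';

print_r(tokenizeAssumingNonAsciiLetters($text)); // über, €, café
print_r(tokenizeUnicodeLetters($text));          // über, café
```

    The currency sign becomes a term under the first rule but not under the second, so the two rules produce different term lists for the same text.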

    All of these analyzers need the PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support turned on. UTF-8 support is enabled in the PCRE sources bundled with the PHP source distribution, but if a shared library is used instead of the bundled one, the state of UTF-8 support may depend on your operating system.

    Use the following code to check whether PCRE UTF-8 support is enabled:

    if (@preg_match('/\pL/u', 'a') == 1) {
        echo "PCRE unicode support is turned on.\n";
    } else {
        echo "PCRE unicode support is turned off.\n";
    }

    The case-insensitive versions of the UTF-8 compatible analyzers also need the mbstring extension to be enabled.

    If you don't want to enable the mbstring extension but need case-insensitive search, you can use the following approach: normalize the source data before indexing and the query string before searching by converting both to lowercase:

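    // Indexing
    // ($index, $title and $query are assumed to be defined elsewhere;
    // the field names 'title' and '_title' are illustrative)
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

    $doc = new Zend_Search_Lucene_Document();

    // Title field for search through (indexed, unstored)
    $doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                      strtolower($title)));

    // Title field for retrieving (unindexed, stored)
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));

    // Searching
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

    $hits = $index->find(strtolower($query));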
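
    The key point of this approach is that the same normalization is applied on both the indexing and the searching side. A minimal standalone sketch follows; normalizeCase() is a hypothetical helper, not part of Zend Framework, and the sample strings are assumptions.

```php
<?php
// Hypothetical helper: lowercases with mbstring when it is available,
// otherwise falls back to strtolower(), which is only dependable for ASCII
// letters (multi-byte UTF-8 characters are left unchanged by it).
function normalizeCase(string $text): string
{
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($text, 'UTF-8');
    }
    return strtolower($text);
}

$title = 'Zend Framework GUIDE';
$query = 'FRAMEWORK guide';

// Apply the same function before indexing and before searching.
echo normalizeCase($title), "\n"; // zend framework guide
echo normalizeCase($query), "\n"; // framework guide
```

    Whichever lowercasing function is chosen, using a different one for indexing than for searching would reintroduce case mismatches.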

    [16] Zend_Search_Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support "supplementary characters" (characters whose code points are greater than 0xFFFF).

    Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Each surrogate is then encoded as an ordinary three-byte UTF-8 sequence, giving six bytes in total. The standard UTF-8 representation uses four bytes for supplementary characters.

    [17] Conversion to 'ASCII//TRANSLIT' may depend on the current locale and OS.
