Zend Framework の紹介

 Learning Zend Framework


 Zend Framework リファレンス

  • 第36章 Zend_Gdata
  • 第37章 Zend_Http
  • 第38章 Zend_InfoCard
  • 第39章 Zend_Json
  • 第40章 Zend_Layout
  • 第41章 Zend_Ldap
  • 第42章 Zend_Loader
  • 第43章 Zend_Locale
  • 第44章 Zend_Log
  • 第45章 Zend_Mail
  • 第46章 Zend_Markup
  • 第47章 Zend_Measure
  • 第48章 Zend_Memory
  • 第49章 Zend_Mime
  • 第50章 Zend_Navigation
  • 第51章 Zend_Oauth
  • 第52章 Zend_OpenId
  • 第53章 Zend_Paginator
  • 第54章 Zend_Pdf
  • 第55章 Zend_ProgressBar
  • 第56章 Zend_Queue
  • 第57章 Zend_Reflection
  • 第58章 Zend_Registry
  • 第59章 Zend_Rest

  • 第60章 Zend_Search_Lucene
  • 第61章 Zend_Serializer
  • 第62章 Zend_Server
  • 第63章 Zend_Service
  • 第64章 Zend_Session
  • 第65章 Zend_Soap
  • 第66章 Zend_Tag
  • 第67章 Zend_Test
  • 第68章 Zend_Text
  • 第69章 Zend_TimeSync
  • 第70章 Zend_Tool
  • 第71章 Zend_Tool_Framework
  • 第72章 Zend_Tool_Project
  • 第73章 Zend_Translate
  • 第74章 Zend_Uri
  • 第75章 Zend_Validate
  • 第76章 Zend_Version
  • 第77章 Zend_View
  • 第78章 Zend_Wildfire
  • 第79章 Zend_XmlRpc
  • ZendX_Console_Process_Unix
  • ZendX_JQuery
  • Translation 70.6% Update 2010-11-28 - Revision 23415

    60.6. Character Set

    60.6.1. UTF-8 and single-byte character set support

    Zend_Search_Lucene works with the UTF-8 charset internally. Index files store unicode data in Java's "modified UTF-8 encoding". Zend_Search_Lucene core completely supports this encoding with one exception. [15]

    Actual input data encoding may be specified through Zend_Search_Lucene API. Data will be automatically converted into UTF-8 encoding.

    60.6.2. Default text analyzer

    However, the default text analyzer (which is also used within query parser) uses ctype_alpha() for tokenizing text and queries.

    ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to 'ASCII//TRANSLIT' encoding before indexing. The same processing is transparently performed during query parsing. [16]


    Default analyzer doesn't treats numbers as parts of terms. Use corresponding 'Num' analyzer if you don't want words to be broken by numbers.

    60.6.3. UTF-8 compatible text analyzers

    Zend_Search_Lucene also contains a set of UTF-8 compatible analyzers: Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive, Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive.

    Any of this analyzers can be enabled with the code like this:


    UTF-8 compatible analyzers were improved in Zend Framework 1.5. Early versions of analyzers assumed all non-ascii characters are letters. New analyzers implementation has more accurate behavior.

    This may need you to re-build index to have data and search queries tokenized in the same way, otherwise search engine may return wrong result sets.

    All of these analyzers need PCRE (Perl-compatible regular expressions) library to be compiled with UTF-8 support turned on. PCRE UTF-8 support is turned on for the PCRE library sources bundled with PHP source code distribution, but if shared library is used instead of bundled with PHP sources, then UTF-8 support state may depend on you operating system.

    Use the following code to check, if PCRE UTF-8 support is enabled:

    if (@preg_match('/\pL/u''a') == 1) {
    "PCRE unicode support is turned on.\n";
    } else {
    "PCRE unicode support is turned off.\n";

    Case insensitive versions of UTF-8 compatible analyzers also need mbstring extension to be enabled.

    If you don't want mbstring extension to be turned on, but need case insensitive search, you may use the following approach: normalize source data before indexing and query string before searching by converting them to lowercase:

    // Indexing




    $doc = new Zend_Search_Lucene_Document();


    // Title field for search through (indexed, unstored)

    // Title field for retrieving (unindexed, stored)
    // Searching




    $hits $index->find(strtolower($query));

    [15] Zend_Search_Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support "supplementary characters" (characters whose code points are greater than 0xFFFF)

    Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Then they are encoded as usual UTF-8 characters in six bytes. Standard UTF-8 representation uses four bytes for supplementary characters.

    [16] Conversion to 'ASCII//TRANSLIT' may depend on current locale and OS.

    digg delicious meneame google twitter technorati facebook