[原创中文翻译]symfony askeet24:第二十一天,创建搜索引擎。

October 10, 2007 – 7:53 pm

[欢迎转载,转载请注名出处http://symfony.net.cn。本文英文版权归symfony官方网站所有]

How to build a search engine?

怎么创建一个搜索引擎呢?

The most popular suggestion about the 21st day addition proved to be a search engine.

第二十一世纪最流行的是互联网搜索引擎。

If the Zsearch extension (a PHP implementation of the Lucene search engine from Apache) had already been released by Zend, this would have been a piece of cake to implement. Unfortunately, Zend seems to take longer than expected to launch their PHP framework, so we need to find another solution.

如果Zend已经发布了Zsearch搜索引擎(apache下用PHP实现的一个搜索引擎),用PHP构建搜索引擎的蛋糕就等于少了一片。遗憾的是(我看作者是暗自高兴),Zend似乎还要花很长时间来研究他的PHP框架,所以我们使用其他方法来构建我们的搜索引擎。

Integrating a foreign library (like, for instance, mnoGoSearch) would probably take more than one hour, and lots of custom adaptations would be necessary to obtain a good result for the askeet specific content. Plus, foreign search libraries are often platform or database dependant, and not all of them are open-source, and that’s something we don’t want for askeet.

使用外部的类库(例如,mnoGoSearch)开发需要花费超过一小时时间,许多的自定义修改会变得很有必要,因为askeet的内容特别而且多。另外,外部search类库通常需要数据库和平台支持,他们并不全是开源免费的,这可不是我们的askeet开发想要的。

The MySQL database offers a full-text indexation and search for text content, but it is restricted to MyISAM tables. Once again, basing our search engine on a database-specific component would limit the possible uses of the askeet application, and we want to do everything to preserve the large compatibility it has so far.

mysql数据库提供了全文索引和文本搜索功能,但只限制在MyISAM格式的表里使用。进一步说,基于数据库的搜索功能会限制askeet的搜索功能实现,会碰到很多兼容性问题。

The only alternative left is to develop a full-text PHP search engine by ourselves. And we have less than one hour, so we’d better get started.

最后剩下的事就是我们自己开发全文搜索的PHP搜索引擎了。剩下的时间不超过一个小时了,我们最好快开始。

Word index

文字索引

The first step is to build a search index. The index can be seen as a table indexing all occurrences of a particular word. For example, if question #34 has the following characteristics:

第一步是建立搜索索引表。索引可以被看成一个数据表,存储了一些特定的字符。例如,下面的问题#34是这样的:

Title: What is the best Zodiac sign for my child?

Body: My husband doesn’t care about Zodiac signs for our next child, but we already have a Cancer girl and an Aries boy, and they get along with each other like hell. My mother-in-law didn’t express any preference, so I am completely free to choose the Zodiac sign of my next baby. What do you think?

Tags: zodiac, real life, family, children, sign, astrology, signs

An index has to be created to list the words of this question, so that a search engine can find it.

索引被创建来显示问题的文字,那么搜索引擎能通过索引表找到它。

Index table

索引表

The index should look like:

索引表看起来像是下面这种样子:

|id | word | count|
|34 |sign | 4 |
|34 |zodiac| 4 |
|34 |child | 2 |
|34 |hell | 1 |
|34 | … | … |

A new SearchIndex table is added to the askeet schema.xml before rebuilding the model:

搜索索引表被加到schema.xml里,重建模型:

<table name="ask_search_index" phpName="SearchIndex">
<column name="question_id" type="integer" />
<foreign-key foreignTable="ask_question" onDelete="cascade">
<reference local="question_id" foreign="id"/>
</foreign-key>
<column name="word" type="varchar" size="255" />
<index name="word_index">
<index-column name="word" />
</index>
<column name="weight" type="integer" />
</table>

The onDelete attribute ensures that the deletion of a question will lead to the deletion of all the records in the SearchIndex table related to this question, as explained yesterday.

onDelete属性确保删除问题将会删除搜索索引表上所有和此问题相关的记录,像昨天描述的那样。

Splitting phrases into words

分词

The input content that will be used to build the index is a set of sentences (question title and body) and tags. What is eventually needed is a list of words. This means that we need to split the sentences into words, ignoring all punctuation, numbers, and putting all words to lowercase. The str_word_count() PHP function will do the trick:

输入的内容被用来建索引,用一套句子(问题题目和内容)和标签。最后需要的是一列文字。这意味着我们需要把句子分解成词,忽略所有的标点符号,数字,把所有的大写字母转换成小写。str_word_count()PHP函数就能实现这一功能:

// split into words
$words = str_word_count(strtolower($phrase), 1);
...

Stop words

需要忽略的字(Stop Words)

Some words, like “a,” “of,” “the,” “I,”, “it”, “you,” and “and”, have to be ignored when indexing some text content. This is because they have no distinctive value, they appear in almost every text content, they slow down a text search and make it return a lot of poorly interesting results that have nothing to do with a user’s query. They are known as stop words. The stop words are specific to a given language.

一些字,像a,of,the,I,it,you,and,在分词时可以忽略。这是因为他们没有特殊的意义,他们几乎在所有的内容中出现,他们减慢了搜索速度并且带来大量低效无用的搜索结果。他们被认为是stop word。stop word是特殊用语。

For the askeet search engine, we will use a custom list of stop words. Add the following method to the askeet/lib/myTools.class.php class:

对askeet搜索引擎,我们用自定义stop word列表。给askeet/lib/myTools.class.php类加几个方法:

public static function removeStopWordsFromArray($words)
{
$stop_words = array(
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very',
);

return array_diff($words, $stop_words);
}

Stemming

Stemming(词干)

The first thing that you should notice in the example question given above is that words having the same radical should be seen as a single one. ‘Children’ should increase the weight of ‘child’, as should ’sign’ do for ’signs’. So before indexing words, they have to be reduced to their greatest common divisor, and in linguistics vocabulary, this is called a stem, or “the base part of the word including derivational affixes but not inflectional morphemes, i. e. the part of the word that remains unchanged through inflection”.

对于上面给出的问题里的同样的词应该被看成是一个。“children”比“child”的weight大,sign和signs一样。所以在对文字加索引以前,要找到他们的最大公约数,在语言学词表里,这叫词干(stem),或者”the base part of the word including derivational affixes but not inflectional morphemes, i. e. the part of the word that remains unchanged through inflection”。

There are lots of rules to transform a word into its stem, and these rules are all language-dependant. One of the best stemming techniques for the English language so far is called the Porter Stemming Algorithm and, as we are very lucky, it has been ported to PHP5 in an open-source script available from tartarus.org.

转换一个单词为它所对应的词干有很多规则,这些规则都依赖于各自的语言。对英文来讲,最好的stemming技术目前被叫做the Porter Stemming Algorithm,我们很幸运,它已经被封装到了PHP5里,来自tartarus.org。

The PorterStemmer class provides a ::stem($word) method that is perfect for our needs. So we can write a method, still in myTools.class.php, that turns a phrase into an array of stem words:

PorterStemmer类提供了::stem($word)方法来满足我们的需要。那么我们写个方法在myTools.class.php,转换短语为词干然后放到数组里:

public static function stemPhrase($phrase)
{
// split into words
$words = str_word_count(strtolower($phrase), 1);

// ignore stop words
$words = myTools::removeStopWordsFromArray($words);

// stem words
$stemmed_words = array();
foreach ($words as $word)
{
// ignore 1 and 2 letter words
if (strlen($word) <= 2)
{
continue;
}

$stemmed_words[] = PorterStemmer::stem($word, true);
}

return $stemmed_words;
}

Of course, you have to put the PorterStemmer.class.php in the same askeet/lib/ directory for this to work.

当然,我们要把PorterStemmer.class.php文件放到askeet/lib/下才能工作。

Giving weight to words

给单词加“重量”

The search results have to appear in order of pertinence. The questions that are more tightly related to the words entered by the user have to appear first. But how can we translate this idea of pertinence into an algorithm? Let’s write a few basic principles:

搜索结果显示有一定的针对性。用户输入的搜索用词和askeet的问题对比,文字最相近的那条问题应该被优先显示。但是我们如何把这个功能写成一个算法呢?来写几个基本原理吧:

If a searched word appears in the title of a question, this question should appear higher in a search result than another one where the word appears only in the body.

如果搜索的词显示在甲问题的题目里,那么甲问题显示级别就比搜索的词出现在问题内容里的乙问题高。

If a searched word appears twice in the content of a question, the search result should show this question before others where the word appears only once.

如果搜索的词在问题的内容里出现了两次,搜索结果就应该把这个问题的优先级提高,高于只显示一次的那些问题。

That’s why we need to give weight to words according to the part of the question they come from. As the weight factors have to be easily accessible, to make them vary if we want to fine tune our search engine algorithm, we will put them in the application configuration file (askeet/apps/frontend/config/app.yml):

这就是为什么我们需要根据问题里的单词出现情况来给对应单词加上重量了。重量因式很容易理解,如果我们想要好的搜索引擎算法,我们就要让它多样化,我们把这些写到程序配置文件里(askeet/apps/frontend/config/app.yml):

all:
...

search:
body_weight: 1
title_weight: 2
tag_weight: 3

In order to apply the weight to a word, we simply repeat the content of a string as many times as the weight factor of its origin:

为了应用单词的重量功能,我们按照单词的重量属性来简单重复字符串内容:

...
// question body
$raw_text = str_repeat(' '.strip_tags($question->getHtmlBody()), sfConfig::get('app_search_body_weight'));

// question title
$raw_text .= str_repeat(’ ‘.$question->getTitle(), sfConfig::get(’app_search_title_weight’));

The basic weight of the words will be given by their number of occurrences in the text. The array_count_values() PHP function will help us for that:

单词的基本重量取决于他们在文本里出现的次数。PHP函数——array_count_values()会帮我们实现这个:

...
// phrase stemming
$stemmed_words = myTools::stemPhrase($raw_text);

// unique words with weight
$words = array_count_values($stemmed_words);

Updating the index

更新索引

The index has to be updated each time a question, tag or answer is added. The MVC architecture makes it easy to do, and you have already seen how to override a save() method in a class of the Model with a transaction, for instance during day four. So the following should not surprise you. Open the askeet/lib/model/Question.php file and add in:

当新问题、标签和答案被提交时,索引就被更新。MVC模式下实现起来很简单,你已经看到了如何用事物处理重写模型里的save()方法(try…catch…),例如在第四天的教材里那样。那么接下来就没什么稀奇的了。打开askeet/lib/model/Question.php文件写上:

public function save($con = null)
{
$con = sfContext::getInstance()->getDatabaseConnection('propel');
try
{
$con->begin();

$ret = parent::save($con);
$this->updateSearchIndex();

$con->commit();

return $ret;
}
catch (Exception $e)
{
$con->rollback();
throw $e;
}
}

public function updateSearchIndex()
{
// delete existing SearchIndex entries about the current question
$c = new Criteria();
$c->add(SearchIndexPeer::QUESTION_ID, $this->getId());
SearchIndexPeer::doDelete($c);

// create a new entry for each of the words of the question
foreach ($this->getWords() as $word => $weight)
{
$index = new SearchIndex();
$index->setQuestionId($this->getId());
$index->setWord($word);
$index->setWeight($weight);
$index->save();
}
}

public function getWords()
{
// body
$raw_text = str_repeat(’ ‘.strip_tags($this->getHtmlBody()), sfConfig::get(’app_search_body_weight’));

// title
$raw_text .= str_repeat(’ ‘.$this->getTitle(), sfConfig::get(’app_search_title_weight’));

// title and body stemming
$stemmed_words = myTools::stemPhrase($raw_text);

// unique words with weight
$words = array_count_values($stemmed_words);

// add tags
$max = 0;
foreach ($this->getPopularTags(20) as $tag => $count)
{
if (!$max)
{
$max = $count;
}

$stemmed_tag = PorterStemmer::stem($tag);

if (!isset($words[$stemmed_tag]))
{
$words[$stemmed_tag] = 0;
}
$words[$stemmed_tag] += ceil(($count / $max) * sfConfig::get(’app_search_tag_weight’));
}

return $words;
}

We also have to update the question index each time a tag is added to it, so override the save() method of the Tag model object as well:

每次当新标签被加入后,就需要更新问题的索引,那么重写标签模型对象的save()方法如下:

public function save($con = null)
{
$con = sfContext::getInstance()->getDatabaseConnection('propel');
try
{
$con->begin();

$ret = parent::save($con);
$this->getQuestion()->updateSearchIndex();

$con->commit();

return $ret;
}
catch (Exception $e)
{
$con->rollback();
throw $e;
}
}

Test the index builder

测试建好的索引

The index is ready to be built. Initialize it by populating the database again:

建立索引准备就绪。先加测试数据到数据库:

$ php batch/load_data.php

You can inspect the SearchIndex table to check if the indexing went all well:

检查搜索索引表来看看索引是不是正常:

id word weight
10 blog 6
9 offer 4
8 girl 3
8 rel 3
8 activ 3
10 activ 3
9 present 3
9 reallif 3
11 test 3
12 test 3
13 test 3
8 shall 3
8 tonight 2
8 girlfriend 2
.. ….. ..

The search function

搜索功能

AND or OR?

与还是或?

We want the search function to manage both ‘AND’ and ‘OR’ searches. For instance, if a user enters ‘family zodiac’, he (she?) must be given the choice to look only for the questions where both the two terms appear (that’s an ‘AND’), or for all the questions where at least one of the term appears (that’s an ‘OR’). The trouble is that these two options lead to different queries:

我们想让搜索功能使用与搜索和或搜索。例如,如果用户输入family zodiac,得给他个选择搜索方式,搜索两个单词都一起出现(AND);或者只出现一个单词(OR)。麻烦在于这带来了不同的查询要求:

// OR query
SELECT DISTINCT question_id, COUNT(*) AS nb, SUM(weight) AS total_weight
FROM ask_search_index
WHERE (word = "family" OR word = "zodiac")
GROUP BY question_id
ORDER BY nb DESC, total_weight DESC

// AND query
SELECT DISTINCT question_id, COUNT(*) AS nb, SUM(weight) AS total_weight
FROM ask_search_index
WHERE (word = "family" OR word = "zodiac")
GROUP BY question_id
HAVING nb = 2
ORDER BY nb DESC, total_weight DESC

Thanks to the HAVING keyword (explained, for instance, at w3schools), the AND SQL query is only one line longer than the OR one. As the GROUP BY is on the id column, and because there is only one index occurrence for a given word in a question, if a question_id is returned twice, it is because the question matches both the ‘family’ and ‘zodiac’ term. Neat, isn’t it?

由于HAVING keyword(例如,见w3schools),AND查询只比OR多一行。GROUP BY取决于id,因为这是唯一一个question的索引,如果question_id返回两个,因为问题匹配family和zodiac。牛比吧?

The search method

搜索函数

For the search to work, we need to apply the same treatment to the search phrase as to the content, so that the words entered by the user are reduced to the same kind of stem that lies in the index. Since it returns a set of questions without any foreign constraint, we decide to implement it as a method of the QuestionPeer object.

搜索能正常使用,对于用户使用的搜索词也要采取同样的措施,把用户输入的词分解成索引表那样的stem。返回一列格式化好的问题文字,我们决定把它放到QuestionPeer对象中来执行。

The search results need to be paginated. As we use a complex request, the sfPropelPager object cannot be employed here, so we will do a pagination by hand, using an offset.

搜索结果可以翻页。综合考虑,这里不用sfPropelPager对象,我们手动加分页,用上偏移量。

There is one more thing to remember: askeet is made to work with universes (that was the subject of the eighteenth day tutorial). This means that a search function must only return the questions tagged with the current app_permanent_tag if the user is browsing askeet in a universe.

需要额外注意的是:askeet是工作在二级域名下的(第十八天教材的主题)。这意味着如果用户在二级域名下使用搜索功能则必须按照包含的标签显示搜索结果。

All these conditions make the SQL query slightly more difficult to read, but not much different from the ones described above:

这些让SQL查询变得有些难读,但和上面描述的还不一样:

public static function search($phrase, $exact = false, $offset = 0, $max = 10)
{
$words = array_values(myTools::stemPhrase($phrase));
$nb_words = count($words);

if (!$words)
{
return array();
}

$con = sfContext::getInstance()->getDatabaseConnection(’propel’);

// define the base query
$query = ‘
SELECT DISTINCT ‘.SearchIndexPeer::QUESTION_ID.’, COUNT(*) AS nb, SUM(’.SearchIndexPeer::WEIGHT.’) AS total_weight
FROM ‘.SearchIndexPeer::TABLE_NAME;

if (sfConfig::get(’app_permanent_tag’))
{
$query .= ‘
WHERE ‘;
}
else
{
$query .= ‘
LEFT JOIN ‘.QuestionTagPeer::TABLE_NAME.’ ON ‘.QuestionTagPeer::QUESTION_ID.’ = ‘.SearchIndexPeer::QUESTION_ID.’
WHERE ‘.QuestionTagPeer::NORMALIZED_TAG.’ = ? AND ‘;
}

$query .= ‘
(’.implode(’ OR ‘, array_fill(0, $nb_words, SearchIndexPeer::WORD.’ = ?’)).’)
GROUP BY ‘.SearchIndexPeer::QUESTION_ID;

// AND query?
if ($exact)
{
$query .= ‘
HAVING nb = ‘.$nb_words;
}

$query .= ‘
ORDER BY nb DESC, total_weight DESC’;

// prepare the statement
$stmt = $con->prepareStatement($query);
$stmt->setOffset($offset);
$stmt->setLimit($max);
$placeholder_offset = 1;
if (sfConfig::get(’app_permanent_tag’))
{
$stmt->setString(1, sfConfig::get(’app_permanent_tag’));
$placeholder_offset = 2;
}
for ($i = 0; $i < $nb_words; $i++)
{
$stmt->setString($i + $placeholder_offset, $words[$i]);
}
$rs = $stmt->executeQuery(ResultSet::FETCHMODE_NUM);

// Manage the results
$questions = array();
while ($rs->next())
{
$questions[] = self::retrieveByPK($rs->getInt(1));
}

return $questions;
}

The method returns a list of Question objects, ordered by pertinence.

方法返回一列Question对象,按照人气排序。

The search form

搜索表单

The search form has to be always available, so we choose to put it in the sidebar. As there are two distinct sidebars, they should include the same partial:

搜索表单应该从任何地方都可以访问,那么我们把它放到测边栏里。这里有两个明显的测边栏,应该包含同样的局部模板:

// add to defaultSuccess.php and questionSuccess.php in askeet/apps/frontend/modules/sidebar/templates/
<h2>find it</h2>
<?php include_partial('question/search') ?>

// create the following askeet/apps/frontend/modules/question/templates/_search.php fragment
<?php echo form_tag(‘@search_question’) ?>
<?php echo input_tag(’search’, htmlspecialchars($sf_params->get(’search’)), array(’style’ => ‘width: 150px’)) ?>
<?php echo submit_tag(’search it’, ‘class=small’) ?>
<?php echo checkbox_tag(’search_all’, 1, $sf_params->get(’search_all’)) ?> <label for=”search_all” class=”small”>search with all words</label>
</form>

wwwsymfony-projectcom__search_form.gif

The @search_question rule has to be defined in the routing.yml:

@search_question路由规则被定义在routing.yml:

search_question:
url: /search/*
param: { module: question, action: search }

Do you know what this question/search action does? Almost nothing, since most of the work is handled by the QuestionPeer::search() method described above:

你知道question/search动作都干了什么吗?几乎啥都没有,因为大部分工作都被上面提到的QuestionPeer::search()方法做了:

public function executeSearch ()
{
if ($this->getRequestParameter('search'))
{
$this->questions = QuestionPeer::search($this->getRequestParameter('search'), $this->getRequestParameter('search_all', false), ($this->getRequestParameter('page', 1) - 1) * sfConfig::get('app_search_results_max'), sfConfig::get('app_search_results_max'));
}
else
{
$this->redirect(‘@homepage’);
}
}

The action has to translate a page request parameter into an offset for the ::search() method. The app_search_results_max is the number of results per page, and as usual, it is an application parameter defined in the app.yml file:

动作翻译了页面请求参数为offset给::search()方法。app_search_result_max是每页的结果数量,和往常一样,在app.yml文件里定义了程序参数:

all:
search:
results_max: 10

Display the search result

显示搜索结果

The hardest part of the job is done, we just have to display the search result in a askeet/apps/frontend/modules/question/templates/searchSuccess.php. As we didn’t implement a real pagination to keep the query light, the template has no information about the total number of results. The pagination will just display a ‘more results’ link at the bottom of the result list if the number of results equals the maximum of results per page:

最难的工作做完了,我们现在需要做的是把搜索结果显示在askeet/apps/frontend/modules/question/templates/searchSuccess.php。我们没有运行分页来保持查询结果清晰明了,模板没有显示结果总数量。如果结果数量超过了页面最大允许数,就在结果列表页面里显示链接“more results”:

<?php use_helpers('Global') ?>

<h1>questions matching “<?php echo htmlspecialchars($sf_params->get(’search’)) ?>”</h1>

<?php foreach($questions as $question): ?>
<?php include_partial(’question/question_block’, array(’question’ => $question)) ?>
<?php endforeach ?>

<?php if ($sf_params->get(’page’) > 1 && !count($questions)): ?>
<div>There is no more result for your search.</div>
<?php elseif (!count($questions)): ?>
<div>Sorry, there is no question matching your search terms.</div>
<?php endif ?>

<?php if (count($questions) == sfConfig::get(’app_search_results_max’)): ?>
<div class=”right”>
<?php echo link_to(’more results »’, ‘@search_question?search=’.$sf_params->get(’search’).’&page=’.($sf_params->get(’page’, 1) + 1)) ?>
</div>
<?php endif ?>

Ah, yes, this is the final surprise. We refactored a little the question templates to create a _question_block.php question block, as the code was reused in more than one place. Have a look at this fragment in the source repository, there is nothing new in it. But it helps us to keep the code clean.

哈,好的,看看最后的结果。我们再改一下question模板,创建_question_block.php,这些代码可以在更多的地方重用。看看这些代码片段,这里没什么新东西。这样可以保持代码干净。

wwwsymfony-projectcom__search_results.gif

Post a Comment