Home » Server Options » Text & interMedia » Stem operator ($) problem
icon5.gif  Stem operator ($) problem [message #269471] Sat, 22 September 2007 12:46 Go to next message
amazing
Messages: 46
Registered: September 2007
Location: Venezuela
Member
Hi. I have a problem with Oracle Text stem operator($). I’m developing my own “Spanish” text search application by the Oracle® Text Reference 10g Release 2 (10.2) B14218-01 (June 2005) documentation. So, I have the following structure:

A base table to store de text with two fields: the key (NUMBER) and another field (VARCHAR) to store the text.

These are my own preferences:

ctx_ddl.create_preference('FNTC_Lexer', 'BASIC_LEXER');

ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter', 'yes');

ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter_type', 'GENERIC');

ctx_ddl.set_attribute('FNTC_Lexer', 'override_base_letter', 'false');

ctx_ddl.set_attribute('FNTC_Lexer', 'mixed_case', 'NO');

ctx_ddl.set_attribute('FNTC_Lexer', 'index_themes', 'NO');

ctx_ddl.set_attribute('FNTC_Lexer', 'index_stems', 'SPANISH');

ctx_ddl.set_attribute('FNTC_Lexer', 'index_text', 'YES');

-----------------------------------------------------------------

ctx_ddl.create_preference('FNTC_Word_List', 'BASIC_WORDLIST');

ctx_ddl.set_attribute('FNTC_Word_List', 'stemmer', 'SPANISH');

ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_match', 'SPANISH');

ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_score', '60');

ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_numresults', '100');

ctx_ddl.set_attribute('FNTC_Word_List', 'substring_index', 'true');

ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_index', 'true');

ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_min_length', '1');

ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_max_length', '50');

ctx_ddl.set_attribute('FNTC_Word_List', 'wildcard_maxterms', '50');

-----------------------------------------------------------------

ctx_ddl.create_stoplist('FNTC_Stop_List', 'BASIC_STOPLIST');

In this last case I use an empty stop list because hay I do not want to use any stop words.

This is the index structure:

create index IDX_DENOMINACION on dbusqueda(descripcion)
indextype is ctxsys.context
parameters('LEXER FNTC_Lexer WORDLIST FNTC_Word_List STOPLIST FNTC_Stop_List SYNC (ON COMMIT)');

I have this data in the base table:

ID DESCRIPCION
-----------------
1 PEPITO
2 PEPITA
3 PEPITE
4 PEPITIN
5 PEPITU
6 PEPE

So here is de problem. When I use the stem operator ($):

select score(1), t.*
from dbusqueda t
where contains(descripcion, '$pepito', 1) > 0 order by score(1) desc;

it should return all rows. Doesn’t it??

But only returns

ID DESCRIPTION
-----------------
1 PEPITO
6 PEPE

So I did the book’s stem example (chapter 3, Oracle Text CONTAINS Query Operators, page 3-36) to make me sure is problem of my solution. And I change my preferences attributes INDEX_STEMS, STEMMER and FUZZY_MATCH values to ENGLISH, change my base table data to values same as the example:

ID DESCRIPCION
-----------------
1 scream
2 screaming
3 screamed

then I rebuild the index

ALTER INDEX IDX_DENOMINACION REBUILD;

So, when I use the stem operator once again:

select score(1), t.*
from dbusqueda t
where contains(descripcion, '$scream', 1) > 0 order by score(1) desc;

it should return all rows. Doesn’t it??

But only returns

ID DESCRIPCION
-----------------
1 scream

When would have to be all the rows, just like the example.

So…what is wrong here…?










Re: Stem operator ($) problem [message #269488 is a reply to message #269471] Sat, 22 September 2007 19:34 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
The stem operator selects words that stem from the same root word, not just words that start with the same letters. I believe that they must match as to whether they are nouns or verbs and masculine or feminine as well. So, I believe you got the expeted results for the spanish words that you used. I added some additional words to the test below.

When you change the attributes of your preferences, it is not sufficient to rebuild the index. You need to drop the index and recreate it. When the index is created, it fetches the current attribute values and stores those values, not the preference names. So, if you only rebuild the index, it uses the old stored values. It only fetches the new values if you drop and recreate the index. I have demonstrated that below.

-- initial test environment:
SCOTT@10gXE> SELECT banner FROM v$version
  2  /

BANNER
----------------------------------------------------------------
Oracle Database 10g Express Edition Release 10.2.0.1.0 - Product
PL/SQL Release 10.2.0.1.0 - Production
CORE	10.2.0.1.0	Production
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

SCOTT@10gXE> CREATE TABLE dbusqueda
  2    (key	     NUMBER,
  3  	descripcion  VARCHAR2(30))
  4  /

Table created.

SCOTT@10gXE> INSERT ALL
  2  INTO dbusqueda VALUES (1, 'PEPITO')
  3  INTO dbusqueda VALUES (2, 'PEPITA')
  4  INTO dbusqueda VALUES (3, 'PEPITE')
  5  INTO dbusqueda VALUES (4, 'PEPITIN')
  6  INTO dbusqueda VALUES (5, 'PEPITU')
  7  INTO dbusqueda VALUES (6, 'PEPE')
  8  INTO dbusqueda VALUES (7, 'El chilla.') -- verbo
  9  INTO dbusqueda VALUES (8, 'El chilló.') -- verbo
 10  INTO dbusqueda VALUES (9, 'Oí un chillido.') -- nombre
 11  INTO dbusqueda VALUES (10, 'Oí algunos chillidos.') -- nombre
 12  INTO dbusqueda VALUES (11, 'scream')
 13  INTO dbusqueda VALUES (12, 'screaming')
 14  INTO dbusqueda VALUES (13, 'screamed')
 15  SELECT * FROM DUAL
 16  /

13 rows created.

SCOTT@10gXE> BEGIN
  2    ctx_ddl.create_preference('FNTC_Lexer', 'BASIC_LEXER');
  3    ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter', 'yes');
  4    ctx_ddl.set_attribute('FNTC_Lexer', 'base_letter_type', 'GENERIC');
  5    ctx_ddl.set_attribute('FNTC_Lexer', 'override_base_letter', 'false');
  6    ctx_ddl.set_attribute('FNTC_Lexer', 'mixed_case', 'NO');
  7    ctx_ddl.set_attribute('FNTC_Lexer', 'index_themes', 'NO');
  8    ctx_ddl.set_attribute('FNTC_Lexer', 'index_stems', 'SPANISH');
  9    ctx_ddl.set_attribute('FNTC_Lexer', 'index_text', 'YES');
 10    ctx_ddl.create_preference('FNTC_Word_List', 'BASIC_WORDLIST');
 11    ctx_ddl.set_attribute('FNTC_Word_List', 'stemmer', 'SPANISH');
 12    ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_match', 'SPANISH');
 13    ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_score', '60');
 14    ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_numresults', '100');
 15    ctx_ddl.set_attribute('FNTC_Word_List', 'substring_index', 'true');
 16    ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_index', 'true');
 17    ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_min_length', '1');
 18    ctx_ddl.set_attribute('FNTC_Word_List', 'prefix_max_length', '50');
 19    ctx_ddl.set_attribute('FNTC_Word_List', 'wildcard_maxterms', '50');
 20    ctx_ddl.create_stoplist('FNTC_Stop_List', 'BASIC_STOPLIST');
 21  END;
 22  /

PL/SQL procedure successfully completed.

SCOTT@10gXE> create index IDX_DENOMINACION on dbusqueda(descripcion)
  2  indextype is ctxsys.context
  3  parameters
  4    ('LEXER	   FNTC_Lexer
  5  	 WORDLIST  FNTC_Word_List
  6  	 STOPLIST  FNTC_Stop_List
  7  	 SYNC	   (ON COMMIT)')
  8  /

Index created.


-- tests of spanish words:
 
SCOTT@10gXE> select score(1), t.*
  2  from   dbusqueda t
  3  where  contains(descripcion, '$pepito', 1) > 0
  4  order  by score(1) desc
  5  /

  SCORE(1)        KEY DESCRIPCION
---------- ---------- ------------------------------
         5          6 PEPE
         5          1 PEPITO

SCOTT@10gXE> select score(1), t.*
  2  from   dbusqueda t
  3  where  contains(descripcion, '$chilla', 1) > 0 -- verbo
  4  order  by score(1) desc
  5  /

  SCORE(1)        KEY DESCRIPCION
---------- ---------- ------------------------------
         5          8 El chilló.
         5          7 El chilla.

SCOTT@10gXE> select score(1), t.*
  2  from   dbusqueda t
  3  where  contains(descripcion, '$chilló', 1) > 0 -- verbo
  4  order  by score(1) desc
  5  /

  SCORE(1)        KEY DESCRIPCION
---------- ---------- ------------------------------
         5          8 El chilló.
         5          7 El chilla.

SCOTT@10gXE> select score(1), t.*
  2  from   dbusqueda t
  3  where  contains(descripcion, '$chillido', 1) > 0 -- nombre
  4  order  by score(1) desc
  5  /

  SCORE(1)        KEY DESCRIPCION
---------- ---------- ------------------------------
         6         10 Oí algunos chillidos.
         6          9 Oí un chillido.

SCOTT@10gXE> select score(1), t.*
  2  from   dbusqueda t
  3  where  contains(descripcion, '$chillidos', 1) > 0 -- nombre
  4  order  by score(1) desc
  5  /

  SCORE(1)        KEY DESCRIPCION
---------- ---------- ------------------------------
         6         10 Oí algunos chillidos.
         6          9 Oí un chillido.

SCOTT@10gXE> 


-- change attributes:
SCOTT@10gXE> BEGIN
  2    ctx_ddl.set_attribute('FNTC_Lexer', 'index_stems', 'ENGLISH');
  3    ctx_ddl.set_attribute('FNTC_Word_List', 'stemmer', 'ENGLISH');
  4    ctx_ddl.set_attribute('FNTC_Word_List', 'fuzzy_match', 'ENGLISH');
  5  END;
  6  /

PL/SQL procedure successfully completed.


-- just rebuilding the index is not sufficient:
SCOTT@10gXE> ALTER INDEX idx_denominacion REBUILD
  2  /

Index altered.

SCOTT@10gXE> select score(1), t.*
  2  from   dbusqueda t
  3  where  contains(descripcion, '$scream', 1) > 0
  4  order  by score(1) desc
  5  /

  SCORE(1)        KEY DESCRIPCION
---------- ---------- ------------------------------
         6         11 scream



-- dropping and recreating the index works:
SCOTT@10gXE> DROP INDEX idx_denominacion
  2  /

Index dropped.

SCOTT@10gXE> create index IDX_DENOMINACION on dbusqueda(descripcion)
  2  indextype is ctxsys.context
  3  parameters
  4    ('LEXER	   FNTC_Lexer
  5  	 WORDLIST  FNTC_Word_List
  6  	 STOPLIST  FNTC_Stop_List')
  7  /

Index created.

SCOTT@10gXE> select score(1), t.*
  2  from   dbusqueda t
  3  where  contains(descripcion, '$scream', 1) > 0
  4  order  by score(1) desc
  5  /

  SCORE(1)        KEY DESCRIPCION
---------- ---------- ------------------------------
         5         11 scream
         5         13 screamed
         5         12 screaming

SCOTT@10gXE> 



icon12.gif  Re: Stem operator ($) problem [message #269819 is a reply to message #269488] Mon, 24 September 2007 14:30 Go to previous message
amazing
Messages: 46
Registered: September 2007
Location: Venezuela
Member
Thanks for the explantion...it's very usefull to me...
Previous Topic: Upgrade Oracle Text 8i to 10g
Next Topic: Disable indexing for specific records
Goto Forum:
  


Current Time: Thu Mar 28 04:47:15 CDT 2024