Home » Server Options » Text & interMedia » ctxhx process and Oracle Text indexes (DB 19.6)
ctxhx process and Oracle Text indexes [message #680343] Fri, 08 May 2020 04:51 Go to next message
John Watson
Messages: 8402
Registered: January 2010
Location: Global Village
Senior Member
I have a process $OH/ctx/bin/ctxhx that has been taking 100% of a core for several hours. It seems to be associated with this call (object names obfuscated),

CREATE INDEX XXX_docs_content ON XXX_docs(blob_content) INDEXTYPE IS CTXSYS.CONTEXT

There are about 45000 rows in the table, total file size is 39GB. I can't find much about this ctxhx process, but I think it is used to extract text from the blob docs which are mostly .PDF with some .DOCX too.

Is there some way to monitor what is going on? Or predict how long it might take? There is nothing in v$session_longops, there doesn't seem to be much happening in terms of segment space usage, no logs I can find.

Thank you for any insight.
Re: ctxhx process and Oracle Text indexes [message #680349 is a reply to message #680343] Fri, 08 May 2020 16:13 Go to previous messageGo to next message
Barbara Boehmer
Messages: 8922
Registered: November 2002
Location: California, USA
Senior Member
Ctxhx.exe is used to convert files to html, so that Oracle Text can tokenize them to index them. For example, if I run:

host ctxhx "Alpha List.pdf" test.txt

then I get an html file in test.txt. If you suspect a problem with a particular file or file type, then you can use this to test such a file.


To monitor the progress of indexing, you can:

EXEC CTX_OUTPUT.START_LOG ('test.log')
CREATE INDEX XXX_docs_content ON XXX_docs(blob_content) INDEXTYPE IS CTXSYS.CONTEXT
/
EXEC CTX_OUTPUT.END_LOG

While this is running, you can display the contents of test.log. On my system I can type it from a Windows command prompt. On my system it is in:

C:\app\baboehme\product\12.1.0\dbhome_2\ctx\log\test.log

It may be in a different location on your system, depending on operating system, Oracle version, and whether you have changed the default log_directory parameter.

There are also other parameters that can be set for the ctx_output besides just start_log and end_log.

I suspect your problem is with some unrecognized document format, such as something from a very new version of pdf of docx that the Oracle filter that uses ctxhx does not yet support.



Re: ctxhx process and Oracle Text indexes [message #680361 is a reply to message #680349] Sat, 09 May 2020 04:05 Go to previous messageGo to next message
John Watson
Messages: 8402
Registered: January 2010
Location: Global Village
Senior Member
THank you for this, it is a lot clearer now.
Re: ctxhx process and Oracle Text indexes [message #680516 is a reply to message #680361] Tue, 19 May 2020 10:45 Go to previous message
John Watson
Messages: 8402
Registered: January 2010
Location: Global Village
Senior Member
Just in case it is ever useful to anyone, the job took about 9 hours to index about 45000 blobs totalling about 35GB. That was on an AWS t3a.medium instance. During that time, the ctxhx process was running continuously (or rather, continually starting, running,stopping, re-starting) and occupied 100% of one vCPU. Very little activity in the DB. I'll upload the log, which is a trace file nowadays: you can give it a name, but that gets ignored.
Previous Topic: 19c Orace Text Index(CTXSYS.CONTEXT ) - String search anomaly
Next Topic: ORACLE SQL POSSIBE BUG
Goto Forum:
  


Current Time: Fri Oct 30 11:37:14 CDT 2020