0003288: Major overhaul of fulltext search support, the "Composr fast custom index"

ID	Project	Category	View Status	Date Submitted	Last Update

0003288	Composr	core_database_drivers	public	2017-05-16 22:47	2022-08-15 15:47

Reporter	Chris Graham	Assigned To	Chris Graham
Severity	Feature-request
Status	resolved	Resolution	fixed
Product Version
Fixed in Version	10.0.31

Summary	0003288: Major overhaul of fulltext search support, the "Composr fast custom index"
Description	We can implement support for fulltext search on all database drivers, regardless of whether the database backends support it. This is doable via an 'inverted index'. Essentially we just do our own word tokenisation and store it in our own table. An advantage to this is we can then do much more sophisticated metrics to stop slow searches. We could have fields in our inverted index table for all search qualifiers (post time, author, etc), so that we know we can easily only ever be looking at a subset of the index. Currently the problem with MySQL fulltext search is that the index can not be done against search qualifiers, so you get this huge trawl through non-qualified results as it gets to the occasional qualified ones. If we do this, I see no reason we can't just drop the regular fulltext search support we have. We can do better, so there's no need to do anything native to each database backend. EDIT: It is actually useful for non-search-engine querying, e.g. in the find_filedump_links function as an optimisation, or in Filtercode. It should stay. We don't need to support it on all DB backends though as it's an optimisation only. This depends on 0003958 - we don't want to be populating the index when the server is under high load. EDIT: Not really. It is an iterative process.
Additional Information	Alternatively we can also implement more fulltext search support for individual databases. DB2, Oracle, and Sqlite (with non-default extension), also support fulltext search but we don't implement it. If we do this we will need to totally overhaul our fulltext search code too because our code currently assumes it works via indexes, which is not a universally valid assumption. Particularly when indexes are deleted, it doesn't even go into the database driver API for that. I prefer to implement our own fulltext rather than make a complex multi-vendor API, with lots of testing. It'd be a better result for less investment.
Tags	ocProducts client-work (likely), Type: Performance

Time estimation (hours)	32
Sponsorship open

Chris Graham 2019-11-18 02:36 administrator ~0006156	I think each existing full text index would need to be replaced with its own new table - we couldn't just use one table. And each table would need to have additional properties beyond ID & word, to allow multi-dimensional indexing. Otherwise we're not going to get the speed boosts we need. For example, for the forum posts search it would need to have forum IDs in there, so that it can have word & forum ID in a single index and easily carve out the result sets needed.

Chris Graham 2019-11-18 02:45 administrator ~0006157 Last edited: 2020-05-09 01:17 View 4 revisions	Further explanation... 1) multi-dimensional indexing. Full text gets very slow because it hits the full text index for a set of matching post IDs but then it has to intersect these against other search parameters, such as forum ID. For a popular term such as 'comcode' this may mean MySQL skipping through 100s of posts that match the term for every 1 it finds in the right forum. With our own custom multi-dimensional index we have those intersections precomputed. 2) background indexing. We can populate the indexing in the background, rather than real-time. Meaning we can have our expansive indexing with no performance hit. 3) custom stop word lists and min-max word lengths. MySQL's list of stop words ('the', 'an', etc) is compiled in, as is the word lengths. With our own implementation we would have much more control. I'm sure there's words not included in searches that we want, and words so common that their presence in queries really bogs down index traversal. 4) searches no longer creating read locks. As searching right now has to do the aforementioned intersecting against post data held in the main f_posts table, it will create read locks (which block writes). We could massively reduce read locking if the main work is happening in isolated indexes rather than the main table. EDIT: Actually the locks are probably still needed as we do need to query against it, so things like validation checks can happen, so a join can fix a case of a missing record, and generally to avoid having to do complex double querying. However, as searches will be a lot faster the issue of a long read lock will be gone.

Date Modified	Username	Field	Change
2017-05-16 22:47	Chris Graham	New Issue
2017-05-16 22:47	Chris Graham	Tag Attached: Type: Performance
2017-05-16 23:16	Chris Graham	Additional Information Updated	View Revisions
2017-06-23 02:41	Chris Graham	Relationship added	related to 0001479
2017-07-05 13:13	Chris Graham	Tag Attached: ocProducts client-work (likely)
2018-02-07 20:29	Chris Graham	Tag Detached: ocProducts client-work (likely)
2019-07-19 01:45	Chris Graham	Relationship added	related to 0003848
2019-07-31 19:25	Chris Graham	Relationship added	related to 0001573
2019-11-14 22:02	Chris Graham	Tag Attached: Roadmap: v12
2019-11-14 22:02	Chris Graham	Tag Attached: ocProducts client-work (likely)
2019-11-18 02:36	Chris Graham	Note Added: 0006156
2019-11-18 02:45	Chris Graham	Note Added: 0006157
2019-11-18 02:45	Chris Graham	Note Edited: 0006157	View Revisions
2019-11-19 01:17	Chris Graham	Relationship added	related to 0003958
2019-11-19 01:17	Chris Graham	Description Updated	View Revisions
2019-11-26 22:11	Chris Graham	Tag Detached: Roadmap: v12
2020-01-09 15:58	Chris Graham	Tag Attached: Roadmap: v12
2020-05-09 01:09	Chris Graham	Description Updated	View Revisions
2020-05-09 01:12	Chris Graham	Note Edited: 0006157	View Revisions
2020-05-09 01:17	Chris Graham	Note Edited: 0006157	View Revisions
2020-05-19 01:01	Chris Graham	Summary	Major overhaul of fulltext search support => Major overhaul of fulltext search support, the "Composr fast custom index"
2020-05-19 01:01	Chris Graham	Assigned To	=> Chris Graham
2020-05-19 01:01	Chris Graham	Status	non-assigned => resolved
2020-05-19 01:01	Chris Graham	Resolution	open => fixed
2022-08-15 15:47	Chris Graham	Tag Detached: Roadmap: v12

View Issue Details

Relationships

Activities

Issue History

related to	0003958	non-assigned		Opportunistic scheduler
related to	0001479	closed	Chris Graham	Implement sphinx
related to	0003848	closed	Chris Graham	Metaphonic search indexing
related to	0001573	non-assigned		Database natural sorting