View Issue Details

ID: 0002646
Project: Composr
Category: core
View Status: public
Last Update: 2024-04-22 17:18
Reporter: Chris Graham
Assigned To:
Severity: Feature-request
Status: non-assigned
Resolution: open
Product Version:
Fixed in Version:
Summary: 0002646: Bayesian spam detection
Description: When content is deleted for spam, copy the text into a new 'spam' table. Use this, and normal content, for spam/ham detection using a Bayesian algorithm.

Every once in a while (or when the cleanup tool is run), the prior probabilities for keywords would be updated.

This can then be integrated as an extra signal for 0002384.
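
As an illustration of the idea above, here is a minimal sketch of a naive Bayes spam/ham scorer. It is not Composr code: the class name `NaiveBayesSpamFilter`, the `tokenize` helper, and the add-one smoothing are all assumptions made for the example, and the issue does not specify which Bayesian variant would be used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical multinomial naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one piece of content under the label 'spam' or 'ham'."""
        self.token_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_docs = self.doc_counts["spam"] + self.doc_counts["ham"]
        tokens = tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, then smoothed per-token likelihoods.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_tokens = sum(self.token_counts[label].values())
            denominator = total_tokens + len(vocab) + 1
            for token in tokens:
                log_p += math.log((self.token_counts[label][token] + 1) / denominator)
            log_scores[label] = log_p
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_scores["ham"] - log_scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

In this sketch the "prior probabilities for keywords" are simply the stored token counts, so updating them periodically amounts to calling `train` again on newly collected spam and ham.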
Tags: Roadmap: Over the horizon, Type: Spam
Time estimation (hours): 20
Sponsorship: open

Relationships

related to 0002384 resolved Chris Graham Anti-spam heuristics
related to 0002057 resolved Chris Graham Delete member content on punishment form

Activities

Guest

2016-06-09 02:18

viewer   ~0004019

Question. How would one mark content being deleted as being deleted for spam? Would there be a new checkbox or something for staff?

Chris Graham

2016-06-09 02:20

administrator   ~0004020

It would be done with 0002057: all of the punished member's content would be considered spam if a checkbox is ticked.

Patrick Schmalstig

2016-06-09 02:51

administrator   ~0004024

Hmm... what if only a few pieces of content are considered spam? That approach would wipe everything, including legitimate posts. I can see its use for sole spammers, but it may be a problem for the casual "I did it once and I learned from it" spammer.

Patrick Schmalstig

2024-01-06 04:08

administrator   ~0008153

Last edited: 2024-01-06 04:10


After some research, I think this could be a useful anti-spam feature for v11, as spam continues to rise.

For the algorithm to work effectively, it will need to be trained both on spam and on ham. However, we would need to know when to classify a piece of content as ham.

Perhaps an hourly scheduled task could train the algorithm, treating content that is X hours old or older as ham. X would be configurable, perhaps with a default of 72 hours, since we can reasonably assume that in most cases content which has not been moderated as spam within 3 days is ham.
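
A rough sketch of that age-based rule follows. The dictionary keys (`added`, `marked_spam`, `trained`) and the function name are hypothetical, chosen only for this example. They do not reflect any existing Composr schema.

```python
from datetime import datetime, timedelta, timezone

# Default suggested above; intended to be configurable.
DEFAULT_HAM_AGE_HOURS = 72


def select_ham_candidates(items, now, min_age_hours=DEFAULT_HAM_AGE_HOURS):
    """Return items old enough to be assumed ham and not yet used for training.

    Each item is a dict with hypothetical keys:
      'added'       - timezone-aware datetime the content was submitted
      'marked_spam' - True if staff moderated it as spam
      'trained'     - True if it was already fed to the algorithm
    """
    cutoff = now - timedelta(hours=min_age_hours)
    return [
        item for item in items
        if item["added"] <= cutoff
        and not item["marked_spam"]
        and not item["trained"]
    ]
```

An hourly task could call `select_ham_candidates(items, datetime.now(timezone.utc))`, train on the result as ham, and then flag those items as trained so they are not counted twice.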

Or, we could go a dynamic route:

* Have two tables... spam and spam_probabilities.
* "spam" is a collection of raw text which has been marked as spam.
* "spam_probabilities" is the training data for the Bayesian algorithm.
* Have a scheduled hook run every hour (but only when new content was recently added) to recalculate spam_probabilities. It looks at content which currently exists on the site and is newer than X days (say, a default of 30) and treats it as ham. All entries in "spam" no older than X days (again 30 by default, the same number as for ham) are trained as spam. We should also run this every time a new entry is added to spam. The hook should also clean old entries out of the spam table (or perhaps the privacy hooks should do that instead, in case an admin wants to increase the number of days to look back).
* When checking for spam, we run the requested content submission through the algorithm to determine if it is likely spam and apply a score if so.
* We should consider other fields as well, not just main body content... like title, SEO keywords, etc. Basically, any text-type field.
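
The dynamic route above could be sketched roughly as follows. Everything here is an assumption for illustration: the row shapes, the function names, and the per-token score (a smoothed spam-versus-ham ratio that assumes equal class priors) are not taken from Composr or from a decided design.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW_DAYS = 30  # default look-back suggested above, same for spam and ham


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def combine_fields(fields):
    """Join every non-empty text-type field (title, body, SEO keywords, ...)."""
    return " ".join(value for value in fields.values() if value)


def rebuild_probabilities(spam_rows, content_rows, now, window_days=WINDOW_DAYS):
    """Recompute a hypothetical 'spam_probabilities' table from scratch.

    spam_rows:    dicts with 'added' (datetime) and 'text' (raw spam text)
    content_rows: dicts with 'added' (datetime) and 'fields' (name -> text),
                  i.e. content still live on the site, which is treated as ham
    Returns a dict mapping token -> score in (0, 1); above 0.5 leans spam.
    """
    cutoff = now - timedelta(days=window_days)
    spam_counts, ham_counts = Counter(), Counter()
    for row in spam_rows:
        if row["added"] >= cutoff:
            spam_counts.update(tokenize(row["text"]))
    for row in content_rows:
        if row["added"] >= cutoff:
            ham_counts.update(tokenize(combine_fields(row["fields"])))

    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    table = {}
    for token in vocab:
        # Add-one smoothing so unseen tokens never produce a zero probability.
        p_spam = (spam_counts[token] + 1) / (spam_total + len(vocab))
        p_ham = (ham_counts[token] + 1) / (ham_total + len(vocab))
        table[token] = p_spam / (p_spam + p_ham)
    return table


def prune_spam(spam_rows, now, window_days=WINDOW_DAYS):
    """Drop spam entries older than the look-back window."""
    cutoff = now - timedelta(days=window_days)
    return [row for row in spam_rows if row["added"] >= cutoff]
```

Rebuilding the whole table on each run keeps the sketch simple. If pruning is left to the privacy hooks instead, `prune_spam` would not be called here and the window could later be widened without losing older entries.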

Patrick Schmalstig

2024-04-22 17:18

administrator   ~0008659

Due to the strict v11 release timeline, this has been put off to over the horizon (11.1 or later).

Issue History

Date Modified Username Field Change
2016-06-08 01:21 Chris Graham New Issue
2016-06-08 01:21 Chris Graham Tag Attached: Type: Spam
2016-06-08 01:24 Chris Graham Description Updated
2016-06-08 01:24 Chris Graham Relationship added child of 0002384
2016-06-08 01:25 Chris Graham Relationship added child of 0002057
2016-06-09 02:18 Guest Note Added: 0004019
2016-06-09 02:20 Chris Graham Note Added: 0004020
2016-06-09 02:51 Patrick Schmalstig Note Added: 0004024
2016-10-25 17:43 Chris Graham Relationship deleted child of 0002057
2016-10-25 17:43 Chris Graham Relationship deleted child of 0002384
2016-10-25 17:43 Chris Graham Relationship added related to 0002384
2016-10-25 17:43 Chris Graham Relationship added related to 0002057
2024-01-06 03:58 Patrick Schmalstig Tag Attached: Roadmap: v11
2024-01-06 04:08 Patrick Schmalstig Note Added: 0008153
2024-01-06 04:10 Patrick Schmalstig Note Edited: 0008153
2024-04-22 17:18 Patrick Schmalstig Tag Detached: Roadmap: v11
2024-04-22 17:18 Patrick Schmalstig Tag Attached: Roadmap: Over the horizon
2024-04-22 17:18 Patrick Schmalstig Note Added: 0008659