h a l f b a k e r y

Bayesian Categorization

Taming the "Category, pick one" menu
  (+3)

There are something like 1,500 idea categories on the Halfbakery, which makes it really hard to pick the right one from the menu on the "new idea" page.

I think that a little pattern-matching software could automatically pick categories for ideas, or at least make plausible suggestions.

The idea is to use what's called a "naive Bayesian classifier." This is a fairly simple bit of software that would extract features from ideas and, using probabilities gleaned from a training set, assign a probability that the idea belongs in each possible category. It could display the top ten (say) as hints on the "new idea" page.
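As a rough illustration of the classifier described above, here is a minimal Python sketch. It assumes plain words as features and add-one smoothing; the class name, the category names and the training texts are all made up for the example, and it scores categories by log-probability rather than a normalised probability.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word features."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Hypothetical minimal classifier; not the Halfbakery's code."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of ideas
        self.vocab = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def scores(self, text):
        """Return {category: log P(category) + sum of log P(word | category)}."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for word in words:
                # Add-one smoothing, so an unseen word cannot zero out a category.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            result[category] = score
        return result

    def suggest(self, text, top_n=10):
        """Return the top_n categories, most likely first."""
        scores = self.scores(text)
        return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up training data, purely for illustration.
classifier = NaiveBayes()
classifier.train("a spray bottle of wasabi for clearing the nose", "food: condiment")
classifier.train("mustard dispenser that fits in a pocket", "food: condiment")
classifier.train("a car horn that plays a polite cough", "vehicle: car: horn")
classifier.train("bumper stickers readable in a car mirror", "vehicle: car: horn")

suggestions = classifier.suggest("wasabi in a nasal spray", top_n=2)
print(suggestions)
```

In practice the training calls would loop over existing ideas and their categories instead of four hand-written lines.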

For a training set, we can just use the current Halfbakery database.

The main open question is what the feature set ought to be. Obvious candidates are words in the text of the idea, or better, their thesaurus categories. (Word pairs or triples might work even better.)
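To make the candidate feature sets concrete, here is a small Python sketch that extracts single words plus word pairs (or triples, with max_n=3). The function names are made up for the example. Thesaurus categories are left out, since they would need a word-to-category lookup table that this sketch does not assume.

```python
import re


def words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return runs of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Single words, then pairs, and so on up to runs of max_n words."""
    tokens = words(text)
    result = []
    for n in range(1, max_n + 1):
        result.extend(ngrams(tokens, n))
    return result


example_features = features("Wasabi nasal spray", max_n=2)
print(example_features)
```

Each extra n-gram length adds distinct features that have to be counted and stored per category.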

td, Dec 17 2003

Bayesian classifier in Python http://www.divmod.org/Reverend/
Sample code. [td, Oct 04 2004]

Classifying spam http://www.paulgraham.com/spam.html
This article, which is about statistically recognizing spam, has a good description of the naive Bayesian classifier buried in it. [td, Oct 04 2004]


       I find the best way to get a category is simply to search for closely-related ideas. This also has other benefits.
kropotkin, Dec 17 2003
  

       Searching for something closely-related works poorly for sufficiently weird ideas. This idea came to mind because searching wasn't helping me find an apposite category for Wasabi Nasal Spray.   

       You can think of this idea as a (fairly sophisticated) search feature that works on the text of the submitted idea.
td, Dec 17 2003
  

       Tom certainly does have a point though. Manually picking a category is nigh impossible these days.
waugsqueke, Dec 18 2003
  

       Not impossible, but tedious certainly. Fortunately I've got around the problem by not having any new ideas.
DrBob, Dec 18 2003
  

       I find picking the category to be half the fun, sometimes. Although often people disagree with my decision.
Loris, Dec 18 2003
  

       This smacks of Windows XP to me.   

       I really don't like it when computers try to be more intelligent than they are capable of being.
phundug, Dec 18 2003
  

       Regarding what feature set to use... to get the fastest and simplest, if not necessarily most accurate, classifier, just use the words themselves. Using their thesaurus categories *might* (or might not) increase accuracy (and would certainly reduce the database size), but it would decrease speed. Using word pairs or triples will probably increase accuracy, but it will both decrease speed *and* increase the database size.   

       I'd also like to make a suggestion on the user interface for it: Since the classifier would reside on the HB server, any new idea (that the author wants to pick a category for) needs to be submitted to it, in order to be analysed. I would suggest that this be done via a "Suggest a Category" button.   

       When pressed, the form data (including the summary and text of the idea) would go to the HB server and be analysed by the classifier, and then the new html page would have the *entire* category list sorted, with the most likely category at the top and the least likely at the bottom. Another button, "Sort Categories Alphabetically", would send it back to its normal order.   

       If you just display the top 10, the author would still have to scroll through the category list to find the item, and select it; if the sorted list is in the <select> box, then it's easy to just click one.
goldbb, Mar 17 2009
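
A possible sketch, in Python, of the sorted <select> box suggested above. It assumes the classifier returns one score per category, with higher meaning more likely. The function name, the category names and the scores are hypothetical, and ties are broken alphabetically.

```python
from html import escape


def sorted_options(scores):
    """Return <option> tags for every category, most likely first."""
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return [f'<option value="{escape(c)}">{escape(c)}</option>' for c in ranked]


# Hypothetical log-probability scores, as a classifier might produce them.
scores = {
    "vehicle: car": -21.4,
    "food: condiment": -15.2,
    "health: nose": -16.8,
}
options = sorted_options(scores)

print('<select name="category">')
for option in options:
    print("  " + option)
print("</select>")
```

The "Sort Categories Alphabetically" button would amount to the same rendering with a plain sorted(scores) as the ordering.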
  


 
