algorithm - Naive Bayesian for rating
Suppose I have a training set with the following data:
type    | size | price     | rating | suggestion
---------------------------------------------------
shirt   | m    | budget    | 0      | bad
trouser | l    | budget    | 4.2    |
shirt   | m    | expensive | 2.3    |
....etc....
Here I am taking "suggestion" as the class I need to predict when an input sample is provided. That is, when an input sample (different from the training dataset) is given, I need to figure out whether it is "good" or "bad".
I am able to understand the probability calculation based on an example I found on the internet:
Dataset: [screenshot]
Calculation for the input sample: [screenshot]
My doubt is that my dataset also has a column called "rating". For this column, is the probability calculated the same way as for the other columns (as in the screenshot above)? Or do I need to treat this one particular column's values some other way, for example using the mean and standard deviation?
Thank you.
The columns "size" and "price" represent categorical data (well, actually, ordinal, but that's not the point). While you can model "rating" as a categorical value too, that may be a bad idea, and it'd be better to model this data as numerical. Here's why.
The difference between treating data as categorical and as numerical lies in how different values relate to each other. Suppose you have 3 observations of x: x=12, x=13, and x=1344. The question then is: how much can the probabilities p(x=12), p(x=1344), and p(x=13) differ? The answer heavily depends on what kind of data these values represent.
For example, if x denotes a user id or anything else where ordering is irrelevant, these probabilities can differ arbitrarily. But if x denotes, say, a pay rate, there's not much difference between 12 and 13 compared to the third value, so p(x=12) and p(x=13) should be close.
Treating the value as numerical helps the model infer more knowledge from the data. For example, there might be no values of 4.9 in the dataset, but lots of 4.8 and 5.0. The model "interpolates" between these two, giving some probability to 4.9 even though it wasn't present in the dataset.
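A minimal sketch of this interpolation effect, using made-up ratings (they are not from the question's dataset). A frequency-based (categorical) estimate gives 4.9 a probability of exactly zero, while a Gaussian fitted to the same values gives it a non-zero density. The `ratings` list and the variable names are illustrative assumptions; `NormalDist` comes from Python's standard `statistics` module (Python 3.8+).

```python
from collections import Counter
from statistics import NormalDist

# Hypothetical ratings: plenty of 4.8 and 5.0, but no 4.9 at all.
ratings = [4.8, 4.8, 4.8, 5.0, 5.0, 5.0, 4.7, 5.0, 4.8]

# Categorical view: probability is the relative frequency of each exact value.
counts = Counter(ratings)
p_categorical = counts[4.9] / len(ratings)  # 4.9 was never seen

# Numerical view: fit a Gaussian (mean and standard deviation of the sample)
# and evaluate its density at 4.9.
gaussian = NormalDist.from_samples(ratings)
density_numerical = gaussian.pdf(4.9)

print("categorical estimate for 4.9:", p_categorical)
print("Gaussian density at 4.9:", density_numerical)
```

Note that the Gaussian value is a density rather than a probability, so it can exceed 1; in naive Bayes only its relative size across classes matters.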
So, yes, you should use a numerical distribution (Gaussian, for example) for the rating data. I'd also suggest a cleanup: apparently, 0 stands for "not rated" rather than "extremely bad", so you may want to tell the model that (for example, by replacing 0s with the average rating).
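Putting the two suggestions together, here is a rough sketch of a naive Bayes scorer that uses frequency counts for the categorical columns and a per-class Gaussian (mean and standard deviation) for the rating. The rows, the class labels, the `score` helper, and the add-one smoothing are all assumptions made for illustration; they are not taken from the question's actual dataset.

```python
from collections import Counter, defaultdict
from statistics import NormalDist, mean

# Hypothetical rows: (type, size, price, rating, suggestion).
# Labels are invented for illustration; rating 0 is assumed to mean "not rated".
rows = [
    ("shirt", "m", "budget", 0.0, "bad"),
    ("trouser", "l", "budget", 4.2, "good"),
    ("shirt", "m", "expensive", 2.3, "bad"),
    ("shirt", "l", "budget", 4.5, "good"),
    ("trouser", "m", "expensive", 1.9, "bad"),
    ("trouser", "l", "expensive", 3.9, "good"),
]

# Cleanup: replace "not rated" zeros with the average of the real ratings.
avg_rating = mean(row[3] for row in rows if row[3] != 0)
rows = [(t, s, p, avg_rating if r == 0 else r, c) for t, s, p, r, c in rows]

classes = Counter(row[-1] for row in rows)
cat_counts = defaultdict(Counter)    # (class, column index) -> value counts
ratings_by_class = defaultdict(list)
for row in rows:
    *cats, rating, label = row
    for i, value in enumerate(cats):
        cat_counts[(label, i)][value] += 1
    ratings_by_class[label].append(rating)

# One Gaussian per class, fitted from that class's ratings.
rating_dist = {c: NormalDist.from_samples(v) for c, v in ratings_by_class.items()}


def score(sample):
    """Unnormalised posterior per class for (type, size, price, rating)."""
    *cats, rating = sample
    scores = {}
    for label, n in classes.items():
        s = n / len(rows)  # prior P(class)
        for i, value in enumerate(cats):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            distinct = len({row[i] for row in rows})
            s *= (cat_counts[(label, i)][value] + 1) / (n + distinct)
        s *= rating_dist[label].pdf(rating)  # Gaussian likelihood of the rating
        scores[label] = s
    return scores


scores = score(("shirt", "m", "budget", 4.4))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The only difference from the purely categorical calculation is the single factor for the rating column: instead of a count ratio, it is the Gaussian density evaluated at the sample's rating.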