be the best possible classifier in any particular application, but it can usually be relied on to be robust and to do quite well. A general discussion of the naive Bayes method and its merits is given in [22, 33].
9.2 The basic principle
For convenience of exposition here, we will assume just two classes, labeled $i = 0, 1$. Our aim is to use the initial set of objects with known class memberships (the training set) to construct a score such that larger scores are associated with class 1 objects (say) and smaller scores with class 0 objects. Classification is then achieved by comparing this score with a threshold, $t$. If we define $P(i|\mathbf{x})$ to be the probability that an object with measurement vector $\mathbf{x} = (x_1, \ldots, x_p)$ belongs to class $i$, then any monotonic function of $P(i|\mathbf{x})$ would make a suitable score. In particular, the ratio $P(1|\mathbf{x})/P(0|\mathbf{x})$ would be suitable. Elementary probability tells us that we can decompose $P(i|\mathbf{x})$ as proportional to $f(\mathbf{x}|i)P(i)$, where $f(\mathbf{x}|i)$ is the conditional distribution of $\mathbf{x}$ for class $i$ objects, and $P(i)$ is the probability that an object will belong to class $i$ if we know nothing further about it (the 'prior' probability of class $i$). This means that the ratio becomes

$$\frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = \frac{f(\mathbf{x}|1)\,P(1)}{f(\mathbf{x}|0)\,P(0)}. \tag{20}$$
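For completeness, the elementary probability step used here is Bayes' theorem (a standard identity spelled out for the reader, not reproduced from the paper):

$$P(i|\mathbf{x}) = \frac{f(\mathbf{x}|i)\,P(i)}{f(\mathbf{x})}, \qquad i = 0, 1,$$

and since the marginal $f(\mathbf{x})$ is common to both classes, it cancels when the ratio is formed, giving (20).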
To use this to produce classifications, we need to estimate the $f(\mathbf{x}|i)$ and the $P(i)$. If the training set was a random sample from the overall population, the $P(i)$ can be estimated directly from the proportion of class $i$ objects in the training set. To estimate the $f(\mathbf{x}|i)$, the naive Bayes method assumes that the components of $\mathbf{x}$ are independent, $f(\mathbf{x}|i) = \prod_{j=1}^{p} f(x_j|i)$, and then estimates each of the univariate distributions $f(x_j|i)$, $j = 1, \ldots, p$; $i = 0, 1$, separately. Thus the $p$-dimensional multivariate problem has been reduced to $p$ univariate estimation problems. Univariate estimation is familiar, simple, and requires smaller training set sizes to obtain accurate estimates. This is one of the particular, indeed unique, attractions of the naive Bayes method: estimation is simple, very quick, and does not require complicated iterative estimation schemes.
If the marginal distributions $f(x_j|i)$ are discrete, with each $x_j$ taking only a few values, then the estimate $\hat{f}(x_j|i)$ is a multinomial histogram-type estimator (see below): one simply counts the proportion of class $i$ objects which fall into each cell. If the $f(x_j|i)$ are continuous, then a common strategy is to segment each of them into a small number of intervals and again use a multinomial estimator, but more elaborate versions based on continuous estimates (e.g. kernel estimates) are also used.
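To make the estimation step concrete, here is a minimal sketch in Python (our own illustration, not code from the paper; all function names are hypothetical). It estimates the priors $P(i)$ from class proportions and each $f(x_j|i)$ as a multinomial histogram, after segmenting continuous measurements into a few equal-width intervals:

```python
import numpy as np

def discretize(X, n_bins=5):
    """Segment each continuous column of X into n_bins equal-width
    intervals, returning integer cell labels 0, ..., n_bins - 1."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)
    return np.clip(((X - lo) / width).astype(int), 0, n_bins - 1)

def fit_naive_bayes(cells, y, n_bins=5):
    """Estimate P(i) and the univariate f(x_j | i) by counting.

    cells: (n, p) integer array of discretized measurements.
    y:     (n,) integer array of class labels in {0, 1}.
    Returns (priors, cond), where cond[i][j, r] estimates
    f(x_j = r | class i) as a cell proportion."""
    n, p = cells.shape
    priors = np.array([np.mean(y == 0), np.mean(y == 1)])
    cond = []
    for i in (0, 1):
        Xi = cells[y == i]
        f = np.empty((p, n_bins))
        for j in range(p):
            counts = np.bincount(Xi[:, j], minlength=n_bins)
            f[j] = counts / len(Xi)  # multinomial histogram estimate
        cond.append(f)
    return priors, cond
```

Note that a cell containing no class $i$ training objects gets probability zero here, which is one motivation for the shrinkage estimator discussed in Sect. 9.3.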
Given the independence assumption, the ratio in (20) becomes

$$\frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = \frac{\prod_{j=1}^{p} f(x_j|1)\,P(1)}{\prod_{j=1}^{p} f(x_j|0)\,P(0)} = \frac{P(1)}{P(0)} \prod_{j=1}^{p} \frac{f(x_j|1)}{f(x_j|0)}. \tag{21}$$
Now, recalling that our aim was merely to produce a score which was monotonically related to $P(i|\mathbf{x})$, we can take logs of (21), since log is a monotonic increasing function. This gives an alternative score

$$\ln \frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = \ln \frac{P(1)}{P(0)} + \sum_{j=1}^{p} \ln \frac{f(x_j|1)}{f(x_j|0)}. \tag{22}$$
If we define $w_j = \ln(f(x_j|1)/f(x_j|0))$ and a constant $k = \ln(P(1)/P(0))$, we see that (22) takes the form of a simple sum

$$\ln \frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = k + \sum_{j=1}^{p} w_j, \tag{23}$$
so that the classifier has a particularly simple structure.
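Continuing the earlier sketch (same hypothetical `priors` and `cond` output from `fit_naive_bayes`), the score (23) is then just a constant plus a sum of per-variable weights; a small `eps` guards against taking the log of an unsmoothed zero estimate:

```python
import numpy as np

def nb_log_score(x_cells, priors, cond, eps=1e-12):
    """Score of Eq. (23): ln P(1|x)/P(0|x) = k + sum_j w_j,
    with k = ln(P(1)/P(0)) and w_j = ln(f(x_j|1)/f(x_j|0))."""
    k = np.log(priors[1] / priors[0])
    w = [np.log((cond[1][j, r] + eps) / (cond[0][j, r] + eps))
         for j, r in enumerate(x_cells)]
    return k + sum(w)

# Classify by comparing the score with a threshold t; t = 0
# corresponds to assigning x to the class with higher P(i|x):
# label = int(nb_log_score(x_cells, priors, cond) > 0)
```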
The assumption of independence of the $x_j$ within each class implicit in the naive Bayes model might seem unduly restrictive. In fact, however, various factors may come into play which mean that the assumption is not as detrimental as it might seem. Firstly, a prior variable selection step has often taken place, in which highly correlated variables have been eliminated on the grounds that they are likely to contribute in a similar way to the separation between classes. This means that the relationships between the remaining variables might well be approximated by independence. Secondly, assuming the interactions to be zero provides an implicit regularization step, so reducing the variance of the model and leading to more accurate classifications. Thirdly, in some cases when the variables are correlated the optimal decision surface coincides with that produced under the independence assumption, so that making the assumption is not at all detrimental to performance. Fourthly, of course, the decision surface produced by the naive Bayes model can in fact have a complicated nonlinear shape: the surface is linear in the $w_j$ but highly nonlinear in the original variables $x_j$, so that it can fit quite elaborate surfaces.
9.3 Some extensions
Despite the above, a large number of authors have proposed modifications to the naive Bayes
method in an attempt to improve its predictive accuracy.
One early proposed modification was to shrink the simplistic multinomial estimate of the proportions of objects falling into each category of each discrete predictor variable. So, if the $j$th discrete predictor variable, $x_j$, has $c_j$ categories, and if $n_{jr}$ of the total of $n$ objects fall into the $r$th category of this variable, the usual multinomial estimator of the probability that a future object will fall into this category, $n_{jr}/n$, is replaced by $(n_{jr} + c_j^{-1})/(n + 1)$. Since the $n_{jr}$ sum to $n$ and the $c_j$ terms $c_j^{-1}$ sum to one, the shrunk estimates still sum to one. This shrinkage also has a direct Bayesian interpretation. It leads to estimates which have lower variance.
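A small sketch of this shrunk estimator (our own illustration; the counts are made up). Note that an empty category now receives a small positive probability rather than zero:

```python
import numpy as np

def shrunk_multinomial(counts):
    """Replace the raw estimates n_jr / n by (n_jr + 1/c_j) / (n + 1)
    for one discrete variable with c_j categories; the shrunk
    estimates still sum to one."""
    counts = np.asarray(counts, dtype=float)
    c_j, n = len(counts), counts.sum()
    return (counts + 1.0 / c_j) / (n + 1.0)

# e.g. counts (12, 5, 3, 0) over c_j = 4 categories, n = 20:
# raw estimates (0.60, 0.25, 0.15, 0.00) become approximately
# (0.583, 0.250, 0.155, 0.012).
```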
Perhaps the most obvious way of easing the independence assumption is by introducing extra terms in the models of the distributions of $\mathbf{x}$ in each class, to allow for interactions. This has been attempted in a large number of ways, but we must recognize that doing this necessarily introduces complications, and so sacrifices the basic simplicity and elegance of the naive Bayes model. Within either (or, more generally, any) class, the joint distribution of $\mathbf{x}$ is

$$f(\mathbf{x}) = f(x_1)\, f(x_2|x_1)\, f(x_3|x_1, x_2) \cdots f(x_p|x_1, x_2, \ldots, x_{p-1}), \tag{24}$$
and this can be approximated by simplifying the conditional probabilities. The extreme arises with $f(x_i|x_1, \ldots, x_{i-1}) = f(x_i)$ for all $i$, and this is the naive Bayes method. Obviously, however, models between these two extremes can be used. For example, one could use the Markov model

$$f(\mathbf{x}) = f(x_1)\, f(x_2|x_1)\, f(x_3|x_2) \cdots f(x_p|x_{p-1}). \tag{25}$$
This is equivalent to using a subset of two-way marginal distributions instead of the univariate marginal distributions in the naive Bayes model.
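To make (25) concrete, here is a minimal sketch of estimating the chain factorization for one class (again our own illustration, reusing the discretized cells from the earlier sketch): the univariate tables of naive Bayes are replaced by the marginal of $x_1$ and the two-way conditional tables $f(x_j|x_{j-1})$:

```python
import numpy as np

def fit_markov_model(cells, n_bins=5):
    """Estimate f(x) = f(x_1) f(x_2|x_1) ... f(x_p|x_{p-1}) from the
    (n, p) discretized rows of one class by counting two-way cells."""
    n, p = cells.shape
    f1 = np.bincount(cells[:, 0], minlength=n_bins) / n
    conditionals = []
    for j in range(1, p):
        table = np.zeros((n_bins, n_bins))
        for prev, cur in zip(cells[:, j - 1], cells[:, j]):
            table[prev, cur] += 1  # joint counts of (x_{j-1}, x_j)
        rows = table.sum(axis=1, keepdims=True)
        conditionals.append(table / np.where(rows > 0, rows, 1.0))
    return f1, conditionals  # f(x_1) and each f(x_j | x_{j-1})
```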
Another extension to the naive Bayes model was developed entirely independently of it.
This is the logistic regression model. In the above we obtained the decomposition (21) by