be the best possible classifier in any particular application, but it can usually be relied on to be robust and to do quite well. A general discussion of the naive Bayes method and its merits is given in [22, 33].
9.2 The basic principle
For convenience of exposition here, we will assume just two classes, labeled $i = 0, 1$. Our aim is to use the initial set of objects with known class memberships (the training set) to construct a score such that larger scores are associated with class 1 objects (say) and smaller scores with class 0 objects. Classification is then achieved by comparing this score with a threshold, $t$. If we define $P(i|\mathbf{x})$ to be the probability that an object with measurement vector $\mathbf{x} = (x_1, \ldots, x_p)$ belongs to class $i$, then any monotonic function of $P(i|\mathbf{x})$ would make a suitable score. In particular, the ratio $P(1|\mathbf{x})/P(0|\mathbf{x})$ would be suitable. Elementary probability tells us that we can decompose $P(i|\mathbf{x})$ as proportional to $f(\mathbf{x}|i)P(i)$, where $f(\mathbf{x}|i)$ is the conditional distribution of $\mathbf{x}$ for class $i$ objects, and $P(i)$ is the probability that an object will belong to class $i$ if we know nothing further about it (the 'prior' probability of class $i$). This means that the ratio becomes

$$\frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = \frac{f(\mathbf{x}|1)\,P(1)}{f(\mathbf{x}|0)\,P(0)}. \tag{20}$$
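For completeness, the elementary probability step used here is Bayes' theorem (a standard identity spelled out for the reader, not reproduced from the paper):

$$P(i|\mathbf{x}) = \frac{f(\mathbf{x}|i)\,P(i)}{f(\mathbf{x})}, \qquad i = 0, 1,$$

and since the marginal $f(\mathbf{x})$ is common to both classes, it cancels when the ratio is formed, giving (20).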
To use this to produce classifications, we need to estimate the $f(\mathbf{x}|i)$ and the $P(i)$. If the training set was a random sample from the overall population, the $P(i)$ can be estimated directly from the proportion of class $i$ objects in the training set. To estimate the $f(\mathbf{x}|i)$, the naive Bayes method assumes that the components of $\mathbf{x}$ are independent, $f(\mathbf{x}|i) = \prod_{j=1}^{p} f(x_j|i)$, and then estimates each of the univariate distributions $f(x_j|i)$, $j = 1, \ldots, p$; $i = 0, 1$, separately. Thus the $p$-dimensional multivariate problem has been reduced to $p$ univariate estimation problems. Univariate estimation is familiar, simple, and requires smaller training set sizes to obtain accurate estimates. This is one of the particular, indeed unique, attractions of the naive Bayes method: estimation is simple, very quick, and does not require complicated iterative estimation schemes.
If the marginal distributions $f(x_j|i)$ are discrete, with each $x_j$ taking only a few values, then the estimate $\hat{f}(x_j|i)$ is a multinomial histogram-type estimator (see below): one simply counts the proportion of class $i$ objects which fall into each cell. If the $f(x_j|i)$ are continuous, then a common strategy is to segment each of them into a small number of intervals and again use a multinomial estimator, but more elaborate versions based on continuous estimates (e.g. kernel estimates) are also used.
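To make the estimation step concrete, here is a minimal sketch in Python (our own illustration, not code from the paper; all function names are hypothetical). It estimates the priors $P(i)$ from class proportions and each $f(x_j|i)$ as a multinomial histogram, after segmenting continuous measurements into a few equal-width intervals:

```python
import numpy as np

def discretize(X, n_bins=5):
    """Segment each continuous column of X into n_bins equal-width
    intervals, returning integer cell labels 0, ..., n_bins - 1."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)
    return np.clip(((X - lo) / width).astype(int), 0, n_bins - 1)

def fit_naive_bayes(cells, y, n_bins=5):
    """Estimate P(i) and the univariate f(x_j | i) by counting.

    cells: (n, p) integer array of discretized measurements.
    y:     (n,) integer array of class labels in {0, 1}.
    Returns (priors, cond), where cond[i][j, r] estimates
    f(x_j = r | class i) as a cell proportion."""
    n, p = cells.shape
    priors = np.array([np.mean(y == 0), np.mean(y == 1)])
    cond = []
    for i in (0, 1):
        Xi = cells[y == i]
        f = np.empty((p, n_bins))
        for j in range(p):
            counts = np.bincount(Xi[:, j], minlength=n_bins)
            f[j] = counts / len(Xi)  # multinomial histogram estimate
        cond.append(f)
    return priors, cond
```

Note that a cell containing no class $i$ training objects gets probability zero here, which is one motivation for the shrinkage estimator discussed in Sect. 9.3.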
Given the independence assumption, the ratio in (20) becomes

$$\frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = \frac{\prod_{j=1}^{p} f(x_j|1)\,P(1)}{\prod_{j=1}^{p} f(x_j|0)\,P(0)} = \frac{P(1)}{P(0)} \prod_{j=1}^{p} \frac{f(x_j|1)}{f(x_j|0)}. \tag{21}$$
Now, recalling that our aim was merely to produce a score which was monotonically related to $P(i|\mathbf{x})$, we can take logs of (21), since log is a monotonic increasing function. This gives an alternative score

$$\ln \frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = \ln \frac{P(1)}{P(0)} + \sum_{j=1}^{p} \ln \frac{f(x_j|1)}{f(x_j|0)}. \tag{22}$$
If we define $w_j = \ln(f(x_j|1)/f(x_j|0))$ and a constant $k = \ln(P(1)/P(0))$, we see that (22) takes the form of a simple sum

$$\ln \frac{P(1|\mathbf{x})}{P(0|\mathbf{x})} = k + \sum_{j=1}^{p} w_j, \tag{23}$$
so that the classifier has a particularly simple structure.
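Continuing the earlier sketch (same hypothetical `priors` and `cond` output from `fit_naive_bayes`), the score (23) is then just a constant plus a sum of per-variable weights; a small `eps` guards against taking the log of an unsmoothed zero estimate:

```python
import numpy as np

def nb_log_score(x_cells, priors, cond, eps=1e-12):
    """Score of Eq. (23): ln P(1|x)/P(0|x) = k + sum_j w_j,
    with k = ln(P(1)/P(0)) and w_j = ln(f(x_j|1)/f(x_j|0))."""
    k = np.log(priors[1] / priors[0])
    w = [np.log((cond[1][j, r] + eps) / (cond[0][j, r] + eps))
         for j, r in enumerate(x_cells)]
    return k + sum(w)

# Classify by comparing the score with a threshold t; t = 0
# corresponds to assigning x to the class with higher P(i|x):
# label = int(nb_log_score(x_cells, priors, cond) > 0)
```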
The assumption of independence of the $x_j$ within each class implicit in the naive Bayes model might seem unduly restrictive. In fact, however, various factors may come into play which mean that the assumption is not as detrimental as it might seem. Firstly, a prior variable selection step has often taken place, in which highly correlated variables have been eliminated on the grounds that they are likely to contribute in a similar way to the separation between classes. This means that the relationships between the remaining variables might well be approximated by independence. Secondly, assuming the interactions to be zero provides an implicit regularization step, so reducing the variance of the model and leading to more accurate classifications. Thirdly, in some cases when the variables are correlated the optimal decision surface coincides with that produced under the independence assumption, so that making the assumption is not at all detrimental to performance. Fourthly, of course, the decision surface produced by the naive Bayes model can in fact have a complicated nonlinear shape: the surface is linear in the $w_j$ but highly nonlinear in the original variables $x_j$, so that it can fit quite elaborate surfaces.
9.3 Some extensions
Despite the above, a large number of authors have proposed modifications to the naive Bayes
method in an attempt to improve its predictive accuracy.
One early proposed modification was to shrink the simplistic multinomial estimate of the proportions of objects falling into each category of each discrete predictor variable. So, if the $j$th discrete predictor variable, $x_j$, has $c_j$ categories, and if $n_{jr}$ of the total of $n$ objects fall into the $r$th category of this variable, the usual multinomial estimator of the probability that a future object will fall into this category, $n_{jr}/n$, is replaced by $(n_{jr} + c_j^{-1})/(n + 1)$. Since the $n_{jr}$ sum to $n$ and the $c_j$ terms $c_j^{-1}$ sum to one, the shrunk estimates still sum to one. This shrinkage also has a direct Bayesian interpretation. It leads to estimates which have lower variance.
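A small sketch of this shrunk estimator (our own illustration; the counts are made up). Note that an empty category now receives a small positive probability rather than zero:

```python
import numpy as np

def shrunk_multinomial(counts):
    """Replace the raw estimates n_jr / n by (n_jr + 1/c_j) / (n + 1)
    for one discrete variable with c_j categories; the shrunk
    estimates still sum to one."""
    counts = np.asarray(counts, dtype=float)
    c_j, n = len(counts), counts.sum()
    return (counts + 1.0 / c_j) / (n + 1.0)

# e.g. counts (12, 5, 3, 0) over c_j = 4 categories, n = 20:
# raw estimates (0.60, 0.25, 0.15, 0.00) become approximately
# (0.583, 0.250, 0.155, 0.012).
```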
Perhaps the most obvious way of easing the independence assumption is by introducing extra terms in the models of the distributions of $\mathbf{x}$ in each class, to allow for interactions. This has been attempted in a large number of ways, but we must recognize that doing this necessarily introduces complications, and so sacrifices the basic simplicity and elegance of the naive Bayes model. Within either (or, more generally, any) class, the joint distribution of $\mathbf{x}$ is

$$f(\mathbf{x}) = f(x_1)\, f(x_2|x_1)\, f(x_3|x_1, x_2) \cdots f(x_p|x_1, x_2, \ldots, x_{p-1}), \tag{24}$$
and this can be approximated by simplifying the conditional probabilities. The extreme arises with $f(x_i|x_1, \ldots, x_{i-1}) = f(x_i)$ for all $i$, and this is the naive Bayes method. Obviously, however, models between these two extremes can be used. For example, one could use the Markov model

$$f(\mathbf{x}) = f(x_1)\, f(x_2|x_1)\, f(x_3|x_2) \cdots f(x_p|x_{p-1}). \tag{25}$$
This is equivalent to using a subset of two-way marginal distributions instead of the univariate marginal distributions in the naive Bayes model.
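To make (25) concrete, here is a minimal sketch of estimating the chain factorization for one class (again our own illustration, reusing the discretized cells from the earlier sketch): the univariate tables of naive Bayes are replaced by the marginal of $x_1$ and the two-way conditional tables $f(x_j|x_{j-1})$:

```python
import numpy as np

def fit_markov_model(cells, n_bins=5):
    """Estimate f(x) = f(x_1) f(x_2|x_1) ... f(x_p|x_{p-1}) from the
    (n, p) discretized rows of one class by counting two-way cells."""
    n, p = cells.shape
    f1 = np.bincount(cells[:, 0], minlength=n_bins) / n
    conditionals = []
    for j in range(1, p):
        table = np.zeros((n_bins, n_bins))
        for prev, cur in zip(cells[:, j - 1], cells[:, j]):
            table[prev, cur] += 1  # joint counts of (x_{j-1}, x_j)
        rows = table.sum(axis=1, keepdims=True)
        conditionals.append(table / np.where(rows > 0, rows, 1.0))
    return f1, conditionals  # f(x_1) and each f(x_j | x_{j-1})
```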
Another extension to the naive Bayes model was developed entirely independently of it.
This is the logistic regression model. In the above we obtained the decomposition (21) by