Chapter 12: Text Mining
In the main process window, ensure that both the exa and wor ports on the Process Documents
operator are connected to res ports as shown in Figure 12-16.
Figure 12-16. The Federalist Papers text mining model.
The exa port will generate a tab in results perspective showing the words (tokens) from our
documents as attributes, with the attributes’ relative strength in each of the four documents
indicated by a decimal coefficient. The wor port will create a tab in results perspective that shows
the words as tokens with the total number of occurrences, and the number of documents each
token appeared in. Although we will do a bit more modeling in this chapter’s example, at this
point we will go ahead and proceed to…
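The difference between the two outputs can be sketched in a few lines of Python. The short strings below are hypothetical stand-ins for the four essay texts, not the actual papers; the point is only to show how total occurrences and document occurrences differ:

```python
from collections import Counter

# Four toy "documents" standing in for the essays (hypothetical text).
docs = [
    "the union of the states",
    "the new union",
    "energy of government",
    "the union and government",
]

tokens_per_doc = [d.split() for d in docs]

# wor-style output: total occurrences counts every appearance of a token;
# document occurrences counts how many documents contain it at least once.
total = Counter(t for toks in tokens_per_doc for t in toks)
doc_occ = Counter(t for toks in tokens_per_doc for t in set(toks))

print(total["the"], doc_occ["the"])  # "the" appears 4 times across 3 documents
```

A token's coefficient on the exa side is derived from these same counts, one column per token and one row per document.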
EVALUATION
Let’s run our model again. We can see the WordList tab in results perspective, showing our tokens
and their frequencies in the input documents.
Data Mining for the Masses
Figure 12-17. Tokens generated from Federalist Papers 5, 14, 17 and 18, with frequencies.
There are many tokens, which is not surprising considering the length of each essay we have fed
into the model. In Figure 12-17, we can see that some of our tokens appear in multiple
documents. Consider the word (or token) ‘acquainted’. This term shows up one time each in
three of the four documents. How can we tell? The Total Occurrences for this token shows as 3,
and the Document Occurrences also shows as 3, so it must appear exactly once in each of those three documents.
(Note that even a cursory review of these tokens reveals some stemming opportunities—for
example ‘accomplish’ and ‘accomplished’ or ‘according’ and ‘accordingly’.) Click on the Total
Occurrences column twice to bring the most common terms to the top.
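To see why stemming would help, consider a deliberately crude suffix-stripping sketch. This is not a real stemming algorithm (a production model would use something like Porter stemming); it only illustrates how collapsing word endings merges tokens such as those noted above:

```python
def crude_stem(word):
    """Very rough stemming sketch: strip a few common English suffixes.
    Not a real stemmer -- shown only to illustrate the idea of merging
    tokens like 'according' and 'accordingly' into one."""
    for suffix in ("ingly", "edly", "ing", "ed", "ly", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

print(crude_stem("accomplished"))  # -> accomplish
print(crude_stem("according"), crude_stem("accordingly"))  # both -> accord
```

With a pass like this, 'government' and 'governments' would be counted as a single token, concentrating their frequencies instead of splitting them.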
Figure 12-18. Our tokens re-sorted from highest to lowest total occurrences.
Here we see powerful words that all of the authors have relied upon extensively. The Federalist
Papers were written to argue in favor of the adoption of a new constitution, and these tokens
reflect that agenda. Not only were these terms frequently used across all four documents, the
vocabulary reflects the objective of writing and publishing the essays in the first place. Note again
here that there is an opportunity to benefit from stemming (‘government’, ‘governments’). Also,
some n-grams would be interesting and informative. The term ‘great’ is both widespread across the
documents and frequent within them, but in what context? Could it be that an n-gram operator might yield the term
‘great_nation’, which bears much more meaning than just the word ‘great’? Feel free to
experiment by re-modeling and re-evaluating.
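The mechanics of an n-gram operator are simple to sketch: slide a window of n adjacent tokens across the document and join each window with an underscore. The token list here is an invented fragment, used only to show the shape of the output:

```python
def ngrams(tokens, n=2):
    """Join every run of n adjacent tokens with '_',
    in the style of a bigram/n-gram term generator."""
    return ["_".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["a", "great", "nation", "indeed"]))
# -> ['a_great', 'great_nation', 'nation_indeed']
```

Terms like 'great_nation' would then be counted as tokens in their own right, preserving context that the single word 'great' loses.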
These results in and of themselves are interesting, but we haven’t gotten to the heart of Gillian’s
question, which was: Is it likely that Federalist Paper 18 was indeed a collaboration between
Hamilton and Madison? Think back through this book and about what you have learned thus far.
We have seen many data mining methodologies that help us to check for affinity or group
classifications. Let’s attempt to apply one of these to our text mining model to see if it will reveal
more about the authors of these papers. Complete the following steps:
1) Switch back to design perspective. Locate the k-Means operator and drop it into your stream between the exa port on Process Documents and the res port (Figure 12-19).
Figure 12-19. Clustering our documents using their token frequencies as means.
2) For this model we will accept the default k of 2, since we want to group Hamilton’s and Madison’s writings together and keep Jay’s separate. We would hope to get one Hamilton/Madison cluster containing paper 18, and a Jay cluster containing only his paper. Run the model and then click on the Cluster Model tab.
Figure 12-20. Cluster results for our four text documents.
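The mechanics behind the k-Means operator can be sketched in pure Python. The token-count vectors below are invented purely to show the algorithm at work; they are not the real frequencies from the four papers, and the real model's grouping is discussed in the steps that follow:

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: naive initialization, fixed iteration count."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        centroids = [  # recompute each centroid as its cluster's mean
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]

# Rows: papers 5, 14, 17, 18; columns: hypothetical counts of three shared tokens.
vectors = [
    [1.0, 0.0, 2.0],  # paper 5  (Jay)
    [4.0, 3.0, 0.0],  # paper 14 (Madison)
    [5.0, 2.0, 0.0],  # paper 17 (Hamilton)
    [4.0, 2.0, 1.0],  # paper 18 (disputed)
]
labels = kmeans(vectors, k=2)
print(labels)
```

Each document becomes a point in token-frequency space, and the operator simply looks for two centers that split those points; nothing about authorship enters the calculation directly.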
3) Unfortunately, it looks like at least one of our four documents ended up associated with John Jay’s paper (no. 5). This probably happened for two reasons: (1) we are using the k-Means methodology, and means in general tend to find a middle with roughly equal parts on both sides; and (2) Jay was writing on the same topic as Hamilton and Madison, so there is a great deal of similarity across the essays. That shared topic alone creates enough similarity that paper 18 could be grouped with Jay’s, especially when the operator we have chosen is trying to find an equal balance, even if Jay didn’t contribute to paper 18. We can see how the four papers have been clustered by clicking on the Folder View radio button and expanding both of the folder menu trees.
Figure 12-21. Examining the document clusters.
4) We can see that the first two papers and the last two papers were grouped together. This can be a bit confusing because RapidMiner has renumbered the documents from 1 to 4, in the order that we added them to our model. In the book’s example, we added them in numerical order: 5, 14, 17, and then 18. So paper 5 corresponds to document 1, paper 14 corresponds to document 2, and so forth. If we can’t remember the order in which we added the papers to the model, we can click on the little white page icon to the left of the document number to view the document’s details: