Chapter 12: Text Mining
In the main process window, ensure that both the exa and wor ports on the Process Documents
operator are connected to res ports as shown in Figure 12-16.
Figure 12-16. The Federalist Papers text mining model.
The exa port will generate a tab in results perspective showing the words (tokens) from our
documents as attributes, with the attributes’ relative strength in each of the four documents
indicated by a decimal coefficient. The wor port will create a tab in results perspective that shows
the words as tokens with the total number of occurrences, and the number of documents each
token appeared in. Although we will do a bit more modeling in this chapter’s example, at this
point we will go ahead and proceed to…
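The difference between the two outputs can be sketched in a few lines of Python. The short strings below are hypothetical stand-ins for the four essay texts, not the actual papers; the point is only to show how total occurrences and document occurrences differ:

```python
from collections import Counter

# Four toy "documents" standing in for the essays (hypothetical text).
docs = [
    "the union of the states",
    "the new union",
    "energy of government",
    "the union and government",
]

tokens_per_doc = [d.split() for d in docs]

# wor-style output: total occurrences counts every appearance of a token;
# document occurrences counts how many documents contain it at least once.
total = Counter(t for toks in tokens_per_doc for t in toks)
doc_occ = Counter(t for toks in tokens_per_doc for t in set(toks))

print(total["the"], doc_occ["the"])  # "the" appears 4 times across 3 documents
```

A token's coefficient on the exa side is derived from these same counts, one column per token and one row per document.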
EVALUATION
Let’s run our model again. We can see the WordList tab in results perspective, showing our tokens
and their frequencies in the input documents.
Data Mining for the Masses
Figure 12-17. Tokens generated from Federalist Papers 5, 14, 17 and 18, with frequencies.
There are many tokens, which is not surprising considering the length of each essay we have fed
into the model. In Figure 12-17, we can see that some of our tokens appear in multiple
documents. Consider the word (or token) ‘acquainted’. This term shows up one time each in
three of the four documents. How can we tell? The Total Occurrences for this token shows as 3,
and the Document Occurrences also shows as 3, so it must appear exactly once in each of those three documents.
(Note that even a cursory review of these tokens reveals some stemming opportunities—for
example ‘accomplish’ and ‘accomplished’ or ‘according’ and ‘accordingly’.) Click on the Total
Occurrences column twice to bring the most common terms to the top.
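To see why stemming would help, consider a deliberately crude suffix-stripping sketch. This is not a real stemming algorithm (a production model would use something like Porter stemming); it only illustrates how collapsing word endings merges tokens such as those noted above:

```python
def crude_stem(word):
    """Very rough stemming sketch: strip a few common English suffixes.
    Not a real stemmer -- shown only to illustrate the idea of merging
    tokens like 'according' and 'accordingly' into one."""
    for suffix in ("ingly", "edly", "ing", "ed", "ly", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

print(crude_stem("accomplished"))  # -> accomplish
print(crude_stem("according"), crude_stem("accordingly"))  # both -> accord
```

With a pass like this, 'government' and 'governments' would be counted as a single token, concentrating their frequencies instead of splitting them.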
Figure 12-18. Our tokens re-sorted from highest to lowest total occurrences.
Here we see powerful words that all of the authors have relied upon extensively. The Federalist
Papers were written to argue in favor of the adoption of a new constitution, and these tokens
reflect that agenda. Not only were these terms frequently used across all four documents, the
vocabulary reflects the objective of writing and publishing the essays in the first place. Note again
here that there is an opportunity to benefit from stemming (‘government’, ‘governments’). Also,
some n-grams would be interesting and informative. The term ‘great’ is both widespread across the
documents and frequent within them, but in what context? Could it be that an n-gram operator might yield the term
‘great_nation’, which bears much more meaning than just the word ‘great’? Feel free to
experiment by re-modeling and re-evaluating.
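The mechanics of an n-gram operator are simple to sketch: slide a window of n adjacent tokens across the document and join each window with an underscore. The token list here is an invented fragment, used only to show the shape of the output:

```python
def ngrams(tokens, n=2):
    """Join every run of n adjacent tokens with '_',
    in the style of a bigram/n-gram term generator."""
    return ["_".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["a", "great", "nation", "indeed"]))
# -> ['a_great', 'great_nation', 'nation_indeed']
```

Terms like 'great_nation' would then be counted as tokens in their own right, preserving context that the single word 'great' loses.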
These results in and of themselves are interesting, but we haven’t gotten to the heart of Gillian’s
question, which was: Is it likely that Federalist Paper 18 was indeed a collaboration between
Hamilton and Madison? Think back through this book and about what you have learned thus far.
We have seen many data mining methodologies that help us to check for affinity or group
classifications. Let’s attempt to apply one of these to our text mining model to see if it will reveal
more about the authors of these papers. Complete the following steps:
1) Switch back to design perspective. Locate the k-Means operator and drop it into your stream between the exa port on Process Documents and the res port (Figure 12-19).
Figure 12-19. Clustering our documents using their token frequencies as means.
2) For this model we will accept the default k of 2, since we want to group Hamilton’s and Madison’s writings together and keep Jay’s separate. We would hope to get one Hamilton/Madison cluster containing paper 18, and a Jay cluster containing only his paper. Run the model and then click on the Cluster Model tab.
Figure 12-20. Cluster results for our four text documents.
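The mechanics behind the k-Means operator can be sketched in pure Python. The token-count vectors below are invented purely to show the algorithm at work; they are not the real frequencies from the four papers, and the real model's grouping is discussed in the steps that follow:

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: naive initialization, fixed iteration count."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        centroids = [  # recompute each centroid as its cluster's mean
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]

# Rows: papers 5, 14, 17, 18; columns: hypothetical counts of three shared tokens.
vectors = [
    [1.0, 0.0, 2.0],  # paper 5  (Jay)
    [4.0, 3.0, 0.0],  # paper 14 (Madison)
    [5.0, 2.0, 0.0],  # paper 17 (Hamilton)
    [4.0, 2.0, 1.0],  # paper 18 (disputed)
]
labels = kmeans(vectors, k=2)
print(labels)
```

Each document becomes a point in token-frequency space, and the operator simply looks for two centers that split those points; nothing about authorship enters the calculation directly.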
3) Unfortunately, it looks like at least one of our four documents ended up associated with John Jay’s paper (no. 5). This probably happened for two reasons: (1) we are using the k-Means methodology, and means in general tend to find a middle with roughly equal parts on both sides; and (2) Jay was writing on the same topic as Hamilton and Madison, so there is a great deal of similarity across the essays. That shared topic alone creates enough similarity that paper 18 could be grouped with Jay’s, especially when the operator we have chosen is trying to find an equal balance, even if Jay didn’t contribute to paper 18. We can see how the four papers have been clustered by clicking on the Folder View radio button and expanding both of the folder menu trees.
Figure 12-21. Examining the document clusters.
4) We can see that the first two papers and the last two papers were grouped together. This can be a bit confusing because RapidMiner has renumbered the documents from 1 to 4, in the order that we added them to our model. In the book’s example, we added them in numerical order: 5, 14, 17, and then 18. So paper 5 corresponds to document 1, paper 14 corresponds to document 2, and so forth. If we can’t remember the order in which we added the papers to the model, we can click on the little white page icon to the left of the document number to view the document’s details: