Data Mining
for the Masses
Figure 12-22. Details of document 1.0 in RapidMiner.
5) Click on the Value column heading twice. This will bring the file path for the document
toward the top, as shown in Figure 12-23.
Figure 12-23. Document 1’s values in reverse sort order.
Chapter 12:
Text Mining
6) Looking at the first several attributes, we can see that for document ID 1 the file is
Chapter12_Federalist05_Jay.txt. Thus, if we can’t remember that we added paper 5 first,
resulting in RapidMiner labeling it document 1, we can check it in the document details.
This little trick works when you have used the Read Document operator, because the path of
the file being read becomes the value of the metadata_file attribute. It does not work with
some other operators, such as the Create Document operator, as you will see momentarily.
Since we added our papers in numerical order in this chapter’s example, we do not need to
view and sort the details for each of the documents, but you may if you wish. Knowing that
documents 1 and 2 are Jay (no. 5) and Madison (no. 14), and documents 3 and 4 are Hamilton
(no. 17) and the suspected collaboration (no. 18), we can be encouraged by what we see in
this model. It appears that Hamilton does have something to do with Federalist Paper 18,
but we don’t know about Madison yet because Madison was grouped with Jay, probably as a
result of the previously discussed mean balancing to which k-means clustering is prone.
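The bookkeeping in step 6 can also be sketched outside RapidMiner. The short Python example below is only an illustration, and it assumes that document IDs are handed out in the order the papers were added. Only the first file name appears in this chapter; the other three are hypothetical placeholders that follow the same naming pattern, and the helper name paper_number is likewise made up for this sketch.

```python
# File names in the order the papers were added to the process.
# Only the first name appears in the chapter; the other three are
# hypothetical placeholders that follow the same naming pattern.
files_in_order = [
    "Chapter12_Federalist05_Jay.txt",
    "Chapter12_Federalist14_Madison.txt",
    "Chapter12_Federalist17_Hamilton.txt",
    "Chapter12_Federalist18_Collaboration.txt",
]

# Assumption: document IDs start at 1 and follow the order of addition.
doc_id_to_file = {
    doc_id: name for doc_id, name in enumerate(files_in_order, start=1)
}


def paper_number(file_name):
    """Pull the two-digit paper number out of a name like '..._Federalist05_...'."""
    marker = "Federalist"
    start = file_name.index(marker) + len(marker)
    return int(file_name[start:start + 2])


for doc_id, name in doc_id_to_file.items():
    print(f"document {doc_id}: paper {paper_number(name)} ({name})")
```

Keeping a lookup like this is useful mainly when papers are not added in numerical order, since the document ID alone then says nothing about which paper it refers to.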
7) Perhaps we can address this by better training our model to recognize Jay’s writing. Using
your favorite search engine, search the Internet for the text of Federalist Paper No. 3.
Gillian knows that this paper’s authorship has been connected to John Jay. We will use the
text to train our model to better recognize Jay’s writing. If paper 18 was written by Jay,
or even contributed to by him, perhaps we will find that it gets clustered with Jay’s
papers 3 and 5 when we add paper 3 to the model. In this case, Hamilton and Madison should
get clustered together. If, on the other hand, paper 18 was not written or contributed to
by Jay, paper 18 should gravitate toward Hamilton (no. 17) and/or Madison (no. 14), so long
as Jay was consistent in his writing between papers 3 and 5. Copy the text of paper 3 by
highlighting it on whichever web site you found it (it is available on a number of sites).
Then, in design perspective in RapidMiner, locate the Create Document operator and drag it
into your process (Figure 12-23).
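The reasoning in this step rests on the idea that texts by the same author should show similar patterns of word use. As a rough illustration only (this is not a description of what RapidMiner does internally), the Python sketch below builds simple word-count vectors and compares them with cosine similarity. The three short strings are invented stand-ins, not quotations from the Federalist Papers, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 means no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented stand-in texts, NOT quotations from the papers.
known_author_a = "the union must provide safety for the people against foreign danger"
known_author_b = "a confederacy of states may fail when the members resist the head"
unknown = "the union must guard the people against danger from foreign nations"

sim_a = cosine_similarity(word_counts(unknown), word_counts(known_author_a))
sim_b = cosine_similarity(word_counts(unknown), word_counts(known_author_b))
print(f"similarity to author A: {sim_a:.2f}")
print(f"similarity to author B: {sim_b:.2f}")
```

In this toy case the unknown text scores higher against author A, which is the kind of pull we hope to see when a second Jay paper is added. With real text, very common words such as "the" dominate raw counts, so a sketch like this would need further preprocessing before its scores meant much.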
10) On the Cluster Model tab in results perspective, with the cluster menu trees expanded, we
now see that documents 2 and 4 (papers 14 (Madison) and 18 (collaboration)) are grouped
together, while Jay’s two papers (documents 1 (paper 5) and 5 (paper 3)) are grouped
with Hamilton’s paper (document 3; paper 17). This is very encouraging because the
suspected collaboration paper (no. 18) has now been associated with both Madison’s and
Hamilton’s writing, but not with Jay’s. Let’s give our model one more of Jay’s papers to
further train it in Jay’s writing style, and see if we can find further evidence that paper 18 is
most strongly connected to Madison and Hamilton. Repeat steps 7 through 9, only this
time, find the text of Federalist Paper 4 (also written by John Jay) and paste it into a new
Create Document operator.
Figure 12-26. The addition of another Create Document operator containing the
text of Federalist Paper 4 by John Jay.
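The reading of the cluster tree in step 10 can be written down as a small check. The Python sketch below simply hard-codes the grouping reported in that step. The labels cluster_0 and cluster_1 are assumptions made for illustration, since the labels shown in the results may differ, and the helper name cluster_mates is hypothetical.

```python
# Document ID -> (paper number, attributed author), as described in the steps above.
documents = {
    1: (5, "Jay"),
    2: (14, "Madison"),
    3: (17, "Hamilton"),
    4: (18, "suspected collaboration"),
    5: (3, "Jay"),
}

# Grouping reported in step 10. The cluster labels themselves are assumed.
clusters = {
    "cluster_0": [2, 4],
    "cluster_1": [1, 3, 5],
}


def cluster_mates(doc_id):
    """Return the authors of the other documents that share doc_id's cluster."""
    for members in clusters.values():
        if doc_id in members:
            return sorted(documents[d][1] for d in members if d != doc_id)
    return []


mates_of_paper_18 = cluster_mates(4)
print("Paper 18 is clustered with:", mates_of_paper_18)
print("Clustered with Jay?", "Jay" in mates_of_paper_18)
```

In this run, paper 18 shares a cluster only with Madison. The link to Hamilton comes from the earlier run, where documents 3 and 4 were grouped together, so the two runs have to be read side by side.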
11) Be sure to rename the second Create Document operator descriptively, as we have done in
Figure 12-26. When you have used the Edit Text button to paste the text for Federalist
Paper 4 into your model and have ensured that your ports are all connected correctly, run
the model one last time and we will proceed to…