Data Mining for the Masses

Figure 12-22. Details of document 1.0 in RapidMiner.

5)

Click on the Value column heading twice to sort the values in reverse order. This brings
the file path for the document toward the top, as shown in Figure 12-23.

Figure 12-23. Document 1’s values in reverse sort order.

Chapter 12: Text Mining

6)

We can see by looking at the first several attributes that for document ID 1, the file is
Chapter12_Federalist05_Jay.txt. Thus, if we can’t remember that we added paper 5 first,
resulting in RapidMiner labeling it document 1, we can check it in the document details.
This little trick works when you have used the Read Document operator, because the file
being read becomes the value of the metadata_file attribute. It does not work with some
other operators, such as the Create Document operator, as you will see momentarily. Since
we added our papers in numerical order in this chapter’s example, we do not necessarily
need to view and sort the details for each of the documents, but you may if you wish.
Knowing that documents 1 and 2 are Jay (no. 5) and Madison (no. 14), and documents 3 and 4
are Hamilton (no. 17) and the suspected collaboration (no. 18), we can be encouraged by
what we see in this model. It appears that Hamilton does have something to do with
Federalist Paper 18, but we don’t know about Madison yet because Madison was grouped with
Jay, probably as a result of the previously discussed mean balancing that k-means
clustering is prone to do.

7)

Perhaps we can address this by better training our model to recognize Jay’s writing. Using
your favorite search engine, search the Internet for the text of Federalist Paper No. 3.
Gillian knows that this paper’s authorship has been connected to John Jay. We will use the
text to train our model to better recognize Jay’s writing. If paper 18 was written by, or
even contributed to by, Jay, perhaps we will find that it gets clustered with Jay’s papers 3
and 5 when we add paper 3 to the model. In this case, Hamilton and Madison should get
clustered together. If, on the other hand, paper 18 was not written or contributed to by
Jay, paper 18 should gravitate toward Hamilton (no. 17) and/or Madison (no. 14), so long as
Jay was consistent in his writing between papers 3 and 5. Copy the text of paper 3 by
highlighting it on whichever web site you found it (it is available on a number of sites).
Then, in design perspective in RapidMiner, locate the Create Document operator and drag it
into your process (Figure 12-23).

Figure 12-23. Adding a Create Document operator to our text mining model.

8)

Be sure the Create Document operator’s out port is connected to one of the Process
Documents operator’s doc ports. It will likely connect itself to a res port, so you’ll have
to reconnect it to the Process Documents operator. Let’s rename this operator ‘Paper 3
(Jay)’. Then click on the Edit Text button in the Parameters area on the right-hand side of
the screen. You will see a window like Figure 12-24.

Figure 12-24. Adding a text document through a Create Document operator.

9)

Paste the text of Federalist Paper 3 into the Edit Parameter Text window and then click
OK. We now have five documents to be processed and run through our k-Means model.
RapidMiner will assign document ID 5 to this new document, since it was the fifth one we
added to our main process. Let’s run the model to see how our documents are grouped
now.

Figure 12-25. New clusters identified by RapidMiner with the addition of another of Jay’s papers.
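
The clustering step that RapidMiner performs here can be approximated outside the tool. The
following is a minimal sketch in Python, not RapidMiner’s implementation: it turns each text
into a vector of relative word frequencies and groups the vectors with a basic k-means loop.
The function names, the deterministic choice of starting centroids, and the four short
placeholder texts are all assumptions made for illustration. The placeholders are not
Federalist text, and RapidMiner’s own word vectors and starting points may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(texts):
    """Return (vocabulary, {doc_id: relative term-frequency vector})."""
    counts = {doc_id: Counter(tokenize(t)) for doc_id, t in texts.items()}
    vocab = sorted(set().union(*counts.values()))
    vectors = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        vectors[doc_id] = [c[w] / total for w in vocab]
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k=2, max_iter=20):
    """Basic k-means; returns {doc_id: cluster index}."""
    ids = sorted(vectors)
    # Assumption: deterministic start (first document, then the document
    # farthest from the centroids chosen so far) instead of a random start.
    centroids = [vectors[ids[0]]]
    while len(centroids) < k:
        far = max(ids, key=lambda i: min(sq_dist(vectors[i], c) for c in centroids))
        centroids.append(vectors[far])
    assignment = {}
    for _ in range(max_iter):
        # Assign every document to its nearest centroid.
        new = {i: min(range(k), key=lambda j: sq_dist(vectors[i], centroids[j]))
               for i in ids}
        if new == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [vectors[i] for i in ids if assignment[i] == j]
            if members:
                centroids[j] = mean(members)
    return assignment


# Placeholder texts (hypothetical), keyed by document ID in the order added.
toy_texts = {
    1: "treaty war nations treaty peace",
    2: "republic union states republic",
    3: "treaty nations war peace war",
    4: "union republic states union",
}

_, toy_vectors = vectorize(toy_texts)
toy_clusters = kmeans(toy_vectors, k=2)
print(toy_clusters)
```

Documents that share vocabulary end up with the same cluster index. Adding another document
to `toy_texts` and rerunning mirrors what we do in RapidMiner when we add another paper and
run the model again.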

10)

On the Cluster Model tab in results perspective, with the cluster menu trees expanded, we
now see that documents 2 and 4 (paper 14 by Madison and paper 18, the suspected
collaboration) are grouped together, while Jay’s two papers (document 1, paper 5, and
document 5, paper 3) are grouped with Hamilton’s paper (document 3, paper 17). This is very
encouraging because the suspected collaboration paper (no. 18) has now been associated with
both Hamilton’s writing (in the previous run) and Madison’s writing (in this run), but not
with Jay’s. Let’s give our model one more of Jay’s papers to further train it in Jay’s
writing style, and see if we can find further evidence that paper 18 is most strongly
connected to Madison and Hamilton. Repeat steps 7 through 9, only this time, find the text
of Federalist Paper 4 (also written by John Jay) and paste it into a new Create Document
operator.

Figure 12-26. The addition of another Create Document operator containing the
text of Federalist Paper 4 by John Jay.
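
Reading the Cluster Model tab amounts to grouping document IDs by cluster and translating
each ID back to the paper it represents. The sketch below does that bookkeeping in Python
for the five-document result described in step 10. The document-to-paper mapping and the
groupings come from the text above; the cluster labels 0 and 1 and the helper names are
assumptions, since the labels RapidMiner assigns are arbitrary.

```python
from collections import defaultdict

# Document ID -> paper, in the order the documents were added to the process.
papers = {
    1: "No. 5 (Jay)",
    2: "No. 14 (Madison)",
    3: "No. 17 (Hamilton)",
    4: "No. 18 (suspected collaboration)",
    5: "No. 3 (Jay)",
}

# Cluster membership as described in step 10.
# Assumption: the labels 0 and 1 are placeholders for RapidMiner's cluster names.
assignment = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0}


def group_by_cluster(assignment):
    """Return {cluster label: sorted list of document IDs}."""
    groups = defaultdict(list)
    for doc_id in sorted(assignment):
        groups[assignment[doc_id]].append(doc_id)
    return dict(groups)


def cluster_mates(doc_id, assignment):
    """Return the other document IDs that share a cluster with doc_id."""
    return [d for d in sorted(assignment)
            if d != doc_id and assignment[d] == assignment[doc_id]]


for cluster, doc_ids in sorted(group_by_cluster(assignment).items()):
    print(f"cluster {cluster}: " + ", ".join(f"doc {d} = {papers[d]}" for d in doc_ids))

# Which papers share a cluster with the suspected collaboration (document 4)?
print([papers[d] for d in cluster_mates(4, assignment)])
```

Keeping a lookup like `papers` is especially useful for documents added with Create
Document, since those cannot be identified through the metadata_file attribute.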

11)

Be sure to rename the second Create Document operator descriptively, as we have done in
Figure 12-26.  When you have used the Edit Text button to paste the text for Federalist
Paper 4 into your model and have ensured that your ports are all connected correctly, run
the model one last time and we will proceed to…
