Data Mining for the Masses



Chapter 12: Text Mining

Figure 12-9. A view inside the sub-process of our Process Documents operator.

10) Note that the blue up arrow in the process toolbar is now illuminated, where previously
it was grayed out. This will allow us to return to our main process once we have
constructed our sub-process. Within the sub-process, though, there are a few things we
need to do, and a couple we can choose to do, in order to mine our text. Use the search
field in the Operators tab to locate an operator called Tokenize. It is under the Text
Processing menu in the Tokenization folder. When mining text, the text must first be split
into individual words (tokens) so that the words can be grouped together and counted.
Without some numeric structure, the computer cannot assess the meaning of the words.
The Tokenize operator performs this splitting for us. Drag it into the sub-process window
(labeled ‘Vector Creation’ in the upper left hand corner). The doc ports from the left hand
side of the screen to the operator, and from the operator to the right hand side of the
screen, should all be connected by splines, as illustrated in Figure 12-10.


Figure 12-10. Adding tokenization to the text mining model’s sub-process.
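Outside of RapidMiner, the idea behind tokenization can be sketched in a few lines of Python. This is only an illustration, not the Tokenize operator itself: it assumes that a token is any unbroken run of letters, and the sample sentence is made up for the example. How RapidMiner actually splits text depends on how the operator is configured.

```python
import re

def tokenize(text):
    """Split text into tokens at every run of non-letter characters.

    Assumption: a token is an unbroken run of ASCII letters. This is a
    simplification chosen for the sketch, not RapidMiner's own rule.
    """
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

# Made-up sample text, used only for illustration.
print(tokenize("Data mining, for the masses!"))
# ['Data', 'mining', 'for', 'the', 'masses']
```

Punctuation and spaces are discarded, and each remaining word becomes one token.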

11) Run the model and briefly review the output. You will see that each word from our four
input documents is now an attribute in our data set. We also have a few new special
attributes, created by RapidMiner.

Figure 12-11. A view of the words from our input documents as tokens (attributes).
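To see what it means for each word to become an attribute, the following Python sketch builds a small table with one row per document and one column per distinct word. It is a simplified stand-in, not RapidMiner's implementation: it assumes raw word counts as the cell values (RapidMiner may weight the values differently depending on its settings), and the two short documents are invented for the example.

```python
import re
from collections import Counter

def word_vectors(documents):
    """Return (vocabulary, rows) with one row of word counts per document."""
    # Count the tokens (runs of letters) in each document separately.
    counts = [Counter(re.findall(r"[A-Za-z]+", doc)) for doc in documents]
    # Every distinct word across all documents becomes one attribute (column).
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

# Invented sample documents, used only for illustration.
vocabulary, rows = word_vectors(["the people and the union", "the union"])
print(vocabulary)  # ['and', 'people', 'the', 'union']
print(rows)        # [[1, 1, 2, 1], [0, 0, 1, 1]]
```

A word that does not occur in a document simply gets a value of 0 in that document's row.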

12) Switch back to design perspective. You will see that we return to the sub-process from
where we ran the model. We’ve put the words from our documents into attributes through
tokenization, but further processing is needed to make sense of the value of the words in
relation to one another. For one thing, there are some words in our data set that really
don’t mean much. These are the conjunctions and articles that are necessary to make the
text readable in English, but that won’t tell us much about meaning or authorship. These
types of words are called stopwords, and RapidMiner has built-in dictionaries in several
languages to find and filter them out. We should remove these words. In the Operators
search field, look for the word ‘Stop’, then add the Filter Stopwords (English) operator to
the sub-process stream.

Figure 12-12. Removing stopwords such as ‘and’, ‘or’, ‘the’, etc. from our model.
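The effect of stopword filtering can be sketched in Python as follows. The stopword set here is a tiny hypothetical list written for this example; RapidMiner's built-in English dictionary is much larger, and its exact contents are not reproduced here.

```python
# Hypothetical, deliberately short stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "or", "the", "of", "to", "in"}

def filter_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    # Compare in lowercase so that 'The' is filtered just like 'the'.
    return [tok for tok in tokens if tok.lower() not in stopwords]

print(filter_stopwords(["The", "people", "and", "the", "union"]))
# ['people', 'union']
```

The remaining tokens are the ones more likely to tell us something about meaning or authorship.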

13) In some instances, letters that are uppercase will not match the same letters in
lowercase. When text mining, this could be a problem because ‘Data’ might be interpreted
differently from ‘data’. This is known as Case Sensitivity. We can address this matter by
adding a Transform Cases operator to our sub-process stream. Search for this operator
in the Operators tab and drag it into your stream, as shown in Figure 12-13.


Figure 12-13. Setting all tokens (word attributes) from our text to be lowercase.
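A short Python sketch shows why case matters when counting words. The function name mirrors the operator only for readability; it is an illustration of the idea, not RapidMiner's code, and the token list is invented.

```python
from collections import Counter

def transform_cases(tokens):
    """Convert every token to lowercase so that case variants match."""
    return [tok.lower() for tok in tokens]

# Invented tokens, used only for illustration.
tokens = ["Data", "data", "DATA", "mining"]

before = Counter(tokens)                   # case variants counted separately
after = Counter(transform_cases(tokens))   # case variants merged

print(len(before), len(after))  # 4 2
print(after["data"])            # 3
```

Before the transformation the three spellings of ‘data’ would be three separate attributes; afterwards they are one.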

At this point, we have a model that is capable of mining and displaying to us the words that are
most frequent in our text documents. This will be interesting for us to review, but there are a few
more operators that you should know about in addition to the ones we are using here. These are
highlighted by black arrows in Figure 12-14, and discussed below.


Figure 12-14. Additional text mining operators of interest.



Stemming: In text mining, stemming means finding terms that share a common root and
combining them so that they are treated as the same term. For example, ‘America’,
‘American’, and ‘Americans’ are all like terms and effectively refer to the same thing. By
stemming (you can see there are a number of stemming operators using different algorithms
for you to choose from), RapidMiner can reduce all instances of these word variations to a
common form, such as ‘Americ’, or perhaps ‘America’, and have all instances represented in
a single attribute.
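As a rough illustration of the idea, the Python sketch below strips a few suffixes so that the three example terms collapse into one. This is a toy, not any of the stemming algorithms offered in RapidMiner: the suffix list and the minimum stem length are assumptions chosen only to make this one example work, and a real stemmer applies far more careful rules.

```python
from collections import Counter

def crude_stem(token, suffixes=("ns", "n", "s")):
    """Strip the first matching suffix, keeping at least three letters.

    Toy rule set (hypothetical): it would mangle many ordinary words,
    for instance turning 'nation' into 'natio'.
    """
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["America", "American", "Americans"]
stems = [crude_stem(tok) for tok in tokens]

print(stems)           # ['America', 'America', 'America']
print(Counter(stems))  # Counter({'America': 3})
```

All three variations now fall under a single stem, which is what allows them to be represented in a single attribute.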



Generate n-Grams: In text mining, an n-gram is a phrase or combination of words that
may take on meaning that is different from, or greater than, the meaning of each word
individually. When creating n-grams, the n is simply the maximum number of terms you
want RapidMiner to consider grouping together. Take for example the token ‘death’. This
word by itself is strong, evoking strong emotion. But now consider the meaning, strength
and emotion if you were to add a Generate n-Grams operator to your model with a size of
2 (this is set in the parameters area of the n-gram operator). Depending on your input text,
you might find the token ‘death_penalty’. This certainly has a more specific meaning and
evokes different and even stronger emotions than just the token ‘death’. What if we
increased the n-gram size to 3? We might find a token ‘death_penalty_execution’. Again,
more specific meaning and perhaps stronger emotion is attached. Understand that these
example gram tokens would only be created by RapidMiner if the two or three words in
each of them were found directly next to one another in the input text. Generating n-grams
can be an excellent way to bring a more granular analysis to your text mining activities.
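The way n-gram tokens are built from neighboring words can be sketched in Python. This is an illustration under assumptions, not the operator's actual code: the function and parameter names are hypothetical, and the underscore separator simply follows the ‘death_penalty’ example above.

```python
def generate_ngrams(tokens, max_length=2):
    """Return the original tokens plus every run of 2..max_length
    consecutive tokens, joined with an underscore."""
    result = list(tokens)
    for n in range(2, max_length + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            result.append("_".join(tokens[i:i + n]))
    return result

# Invented token sequence, used only for illustration.
tokens = ["death", "penalty", "execution"]

print(generate_ngrams(tokens, max_length=2))
# ['death', 'penalty', 'execution', 'death_penalty', 'penalty_execution']

print(generate_ngrams(tokens, max_length=3)[-1])
# death_penalty_execution
```

Because the window only ever covers neighboring tokens, a gram such as ‘death_penalty’ appears only when those two words sit side by side in the token list.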



Replace Tokens: This is similar to replacing missing or inconsistent values in more
structured data. This operator can come in handy once you’ve tokenized your text input.
Suppose for example that you had the tokens ‘nation’, ‘country’, and ‘homeland’ in your
data set but you wanted to treat all of them as one token. You could use this operator to
change both ‘country’ and ‘homeland’ to ‘nation’, and all instances of any of the three
terms (or their stems if you also use stemming) would subsequently be combined into a
single token.
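The ‘nation’ example can be sketched in Python with a simple lookup table. This assumes exact, whole-token matches through a dictionary, which is a simplification made for the example; the way replacements are specified in RapidMiner's own operator may differ.

```python
from collections import Counter

# Hypothetical replacement table for the example in the text.
REPLACEMENTS = {"country": "nation", "homeland": "nation"}

def replace_tokens(tokens, replacements=REPLACEMENTS):
    """Swap each token for its replacement, leaving other tokens unchanged."""
    return [replacements.get(tok, tok) for tok in tokens]

# Invented tokens, used only for illustration.
tokens = ["nation", "country", "homeland", "people"]
replaced = replace_tokens(tokens)

print(replaced)           # ['nation', 'nation', 'nation', 'people']
print(Counter(replaced))  # Counter({'nation': 3, 'people': 1})
```

After replacement, all three terms are counted together under the single token ‘nation’.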

These are just a few of the other operators in the Text Processing area that can be nice additions
to a text mining model.  There are many others, and you may experiment with these at your leisure.
For now though, we will proceed to…

MODELING

Click the blue up arrow to move from your sub-process back to your main process window.

Figure 12-15. The ‘Return to Parent Operator’ arrow (indicated by the black arrow).
