Chapter 12: Text Mining
Figure 12-9. A view inside the sub-process of our Process Documents operator.
10) Note that the blue up arrow in the process toolbar is now illuminated, where previously it
had been grayed out. This will allow us to return to our main process once we have
constructed our sub-process. Within the sub-process though, there are a few things we
need to do, and a couple we can choose to do, in order to mine our text. Use the search
field in the Operators tab to locate an operator called Tokenize. It is under the Text
Processing menu in the Tokenization folder. When mining text, the words in the text must
be grouped together and counted. Without some numeric structure, the computer cannot
assess the meaning of the words. The Tokenize operator performs this function for us.
Drag it into the sub-process window (labeled ‘Vector Creation’ in the upper left hand
corner). The doc ports from the left hand side of the screen to the operator, and from the
operator to the right hand side of the screen, should all be connected by splines, as
illustrated in Figure 12-10.
Data Mining for the Masses
Figure 12-10. Adding tokenization to the text mining model’s sub-process.
11) Run the model and briefly review the output. You will see that each word from our four
input documents is now an attribute in our data set. We also have a few new special
attributes, created by RapidMiner.
Figure 12-11. A view of the words from our input documents as tokens (attributes).
12) Switch back to design perspective. You will see that we return to the sub-process from
where we ran the model. We’ve put the words from our documents into attributes through
tokenization, but further processing is needed to make sense of the value of the words in
relation to one another. For one thing, there are some words in our data set that really
don’t mean much. These are necessary conjunctions and articles that make the text
readable in English, but that won’t tell us much about meaning or authorship. We should
remove these words. In the Operators search field, look for the word ‘Stop’. These types
of words are called stopwords, and RapidMiner has built-in dictionaries in several
languages to find and filter these out. Add the Filter Stopwords (English) operator to the
sub-process stream.
Figure 12-12. Removing stopwords such as ‘and’, ‘or’, ‘the’, etc. from our model.
13) In some instances, letters that are uppercase will not match with the same letters in
lowercase. When text mining, this could be a problem because ‘Data’ might be interpreted
differently from ‘data’. This is known as case sensitivity. We can address this matter by
adding a Transform Cases operator to our sub-process stream. Search for this operator
in the Operators tab and drag it into your stream, as shown in Figure 12-13.
Figure 12-13. Setting all tokens (word attributes) from our text to be lowercase.
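For readers who like to see the logic behind the operators, steps 10 through 13 can be approximated in a few lines of Python. This is only a rough sketch and not RapidMiner's own implementation. The two sample documents, the three-word stopword list, the split on non-letter characters, and the use of raw counts as attribute values are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the input documents (not the chapter's data).
documents = {
    "doc1": "The Data and the model.",
    "doc2": "Data mining, or text mining?",
}

# Tiny illustrative stopword list; a real dictionary is far larger.
STOPWORDS = {"and", "or", "the"}


def tokenize(text):
    """Split on runs of non-letter characters and drop empty pieces."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]


def filter_stopwords(tokens):
    """Drop stopwords, comparing case-insensitively."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


def transform_cases(tokens):
    """Lowercase every token so 'Data' and 'data' become one attribute."""
    return [tok.lower() for tok in tokens]


def process_document(text):
    """Apply the three steps in the same order as the sub-process stream."""
    return transform_cases(filter_stopwords(tokenize(text)))


# One Counter per document, then one column (attribute) per distinct token.
counts = {name: Counter(process_document(text)) for name, text in documents.items()}
vocabulary = sorted({tok for c in counts.values() for tok in c})
table = {name: [c[tok] for tok in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in table.items():
    print(name, row)
```

Each row of `table` corresponds to one document and each position in the row to one token, which mirrors the way every word becomes an attribute in the results view. RapidMiner's vector creation setting may weight these values differently from the plain counts used here.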
At this point, we have a model that is capable of mining and displaying to us the words that are
most frequent in our text documents. This will be interesting for us to review, but there are a few
more operators that you should know about in addition to the ones we are using here. These are
highlighted by black arrows in Figure 12-14, and discussed below.
Figure 12-14. Additional text mining operators of interest.
Stemming: In text mining, stemming means finding terms that share a common root and
combining them to mean essentially the same thing. For example, ‘America’, ‘American’, and
‘Americans’ are all like terms and effectively refer to the same thing. By stemming (you
can see there are a number of stemming operators using different algorithms for you to
choose from), RapidMiner can reduce all instances of these word variations to a common
form, such as ‘Americ’, or perhaps ‘America’, and have all instances represented in a single
attribute.
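To make the idea concrete, here is a deliberately naive stemming sketch in Python. It is not one of RapidMiner's stemming algorithms. The suffix list and the `toy_stem` function are hypothetical, chosen only so that the three example words collapse to one form. Real stemmers apply far more careful rules.

```python
# Hypothetical suffix list, checked longest first. Illustration only.
SUFFIXES = ("ans", "an", "a", "s")


def toy_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ["America", "American", "Americans"]:
    print(word, "->", toy_stem(word))
```

All three words reduce to ‘americ’, so they would be counted under a single attribute. A rule set this crude will also mangle unrelated words, which is one reason several different stemming algorithms exist.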
Generate n-Grams: In text mining, an n-gram is a phrase or combination of words that
may take on meaning that is different from, or greater than the meaning of each word
individually. When creating n-grams, the n is simply the maximum number of terms you
want RapidMiner to consider grouping together. Take for example the token ‘death’. This
word by itself is strong, evoking strong emotion. But now consider the meaning, strength
and emotion if you were to add a Generate n-Grams operator to your model with a size of
2 (this is set in the parameters area of the n-gram operator). Depending on your input text,
you might find the token ‘death_penalty’. This certainly has a more specific meaning and
evokes different and even stronger emotions than just the token ‘death’. What if we
increased the n-gram size to 3? We might find a token ‘death_penalty_execution’. Again,
more specific meaning and perhaps stronger emotion is attached. Understand that these
example gram tokens would only be created by RapidMiner if the two or three words in
each of them appeared consecutively, in that order, among the tokens of the input text.
Generating grams can be an excellent way to bring a more granular analysis to your text
mining activities.
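The grouping described above can be sketched in Python as follows. This is an illustration under stated assumptions and not RapidMiner's implementation: the function name `generate_ngrams`, its `max_length` parameter, and the sample token list are all hypothetical, and the sketch assumes that grams of every length up to the maximum are kept alongside the single tokens.

```python
def generate_ngrams(tokens, max_length=2):
    """Return the original tokens plus every run of 2..max_length
    consecutive tokens, joined with underscores."""
    grams = list(tokens)
    for n in range(2, max_length + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams


tokens = ["death", "penalty", "execution"]  # hypothetical token stream
print(generate_ngrams(tokens, max_length=2))
print(generate_ngrams(tokens, max_length=3))
```

With a maximum length of 2 the output includes ‘death_penalty’, and raising it to 3 adds ‘death_penalty_execution’. Only tokens that sit side by side in the list are ever joined.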
Replace Tokens: This is similar to replacing missing or inconsistent values in more
structured data. This operator can come in handy once you’ve tokenized your text input.
Suppose for example that you had the tokens ‘nation’, ‘country’, and ‘homeland’ in your
data set but you wanted to treat all of them as one token. You could use this operator to
change both ‘country’ and ‘homeland’ to ‘nation’, and all instances of any of the three
terms (or their stems if you also use stemming) would subsequently be combined into a
single token.
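A minimal Python sketch of this kind of replacement is shown below. It assumes a simple exact-match lookup table, which may differ from how the Replace Tokens operator is configured. The `REPLACEMENTS` mapping, the `replace_tokens` function, and the sample tokens are hypothetical.

```python
from collections import Counter

# Hypothetical mapping: each key is rewritten to its value.
REPLACEMENTS = {"country": "nation", "homeland": "nation"}


def replace_tokens(tokens, mapping=REPLACEMENTS):
    """Swap any token found in the mapping; leave all others unchanged."""
    return [mapping.get(tok, tok) for tok in tokens]


tokens = ["nation", "country", "homeland", "people"]  # illustrative input
merged = Counter(replace_tokens(tokens))
print(merged)
```

After the replacement, the three synonyms are counted together under ‘nation’ and ‘people’ is left as it was.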
These are just a few of the other operators in the Text Processing area that can be nice additions
to a text mining model. There are many others, and you may experiment with these at your leisure.
For now though, we will proceed to…
MODELING
Click the blue up arrow to move from your sub-process back to your main process window.
Figure 12-15. The ‘Return to Parent Operator’ arrow (indicated by the black arrow).